I've done a lot of yowling about high cardinality -- what it is, why you can't have observability without it.
I haven't made nearly as much noise about ✨high dimensionality✨. Which is unfortunate, because it is every bit as fundamental to true observability. Let's fix this!
If you accept my definition of observability (the ability to understand any unknown system state just by asking questions from the outside; it's all about the unknown-unknowns) then you understand why o11y is built on building blocks of arbitrarily-wide structured data blobs.
If you want to brush up on any of this, here are some links on observability:
These wide events consist of rich context and metadata, collected into one event per request per service. You can also think of them as "spans". They have lots of key/value pairs (and simple data structures), plus some dedicated fields for trace id, span id, request id, etc.
When we talk about high cardinality for observability, we are saying that any of the values in these key/value pairs can have a kajillion possible values. Like query strings or UUIDs.
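To make the distinction concrete, here's a minimal sketch of what one of these wide events might look like. Every field name and value here is illustrative, not a required schema:

```python
# A minimal sketch of a single "wide event" for one request through one
# service. All field names and values here are made up for illustration.
event = {
    # dedicated tracing fields
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "request_id": "req-7f3a2c",
    # ordinary key/value context -- some of these are high-cardinality:
    # query strings and UUIDs can take a kajillion distinct values
    "user_id": "9b2e1a7c-33d1-4c2e-8f0a-b1d2c3e4f5a6",
    "query": "SELECT * FROM orders WHERE user_id = ?",
    "endpoint": "/export",
    "status": 403,
    "duration_ms": 187.4,
}

# high cardinality = many possible *values* per key
# high dimensionality = many *keys* per event
print(len(event), "dimensions on this event")
```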
Here's where some tools get tricksy with the marketing language. They say they support high cardinality ..
.. but it comes with lots of hidden costs or limitations. Like, you can have a FEW high-cardinality dimensions, but only a limited number, and you have to choose them in advance.
Or they won't actually let you break down or group by that one-in-a-million UUID,
... only append it as a tag. Or maybe they'll let you add some number of high-cardinality dimensions, but the cost goes up linearly with each and every one.
These are good clues that you're being sold a bunch of patches on top of a last generation tool, not real observability.
If cardinality refers to the values, dimensionality refers to the keys. "High dimensionality" means the tool supports extremely wide events. Like many hundreds of k/v pairs and structs per event.
The wider the event, the richer the context, the more powerful the tool.
Say you have a schema that defines six high-cardinality dimensions per event:
* time
* app
* host
* user
* endpoint
* status
This allows you to slice and dice and look for any combination of results or outliers. Like, "all of the errors in that spike were for host xyz",
or, "all of those 403s were to the /export endpoint by user abc", or "all of the timeouts to the /payment endpoint were on host blah".
Super useful, right?
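That kind of slicing is easy to sketch in a few lines. This is a toy version with hand-made events, not any vendor's actual query engine, but the shape of the operation is the same: filter on one dimension, then break down by any combination of the others.

```python
from collections import Counter

# A hypothetical handful of events with the six dimensions from above.
events = [
    {"time": 1, "app": "api", "host": "xyz",  "user": "abc", "endpoint": "/export",  "status": 403},
    {"time": 2, "app": "api", "host": "xyz",  "user": "abc", "endpoint": "/export",  "status": 403},
    {"time": 3, "app": "api", "host": "blah", "user": "def", "endpoint": "/payment", "status": 504},
    {"time": 4, "app": "api", "host": "xyz",  "user": "ghi", "endpoint": "/home",    "status": 200},
]

# "all of those 403s were to the /export endpoint by user abc":
# filter on status, then break down by (endpoint, user).
errors_403 = [e for e in events if e["status"] == 403]
breakdown = Counter((e["endpoint"], e["user"]) for e in errors_403)
print(breakdown)  # every 403 lands on ('/export', 'abc')
```

The same two-step pattern (filter, then group by an arbitrary tuple of keys) answers "all the timeouts to /payment were on host blah" too; you just pick different dimensions.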
The wider the event, the richer the context; and the more powerful and high-precision your scalpel becomes.
Now imagine that instead of six basic dimensions, you can toss in literally any detail, value, counter, or string that seems like it might be useful at some future date.
(below is a kludgey screencap of some of the dimensions captured for our API service)
It's effectively free if you toss a few more bytes onto the event -- effectively free for us to store, that is; we charge by the event so it's LITERALLY free for you.
And you can slice and dice, mix and match at will. String along five, ten, twenty or more dimensions in a query.
But who can keep track of a schema with 300+ dimensions in it?? Great question; nobody can. 🙃
That's why we say "arbitrarily-wide" events: because the "schema" is inferred. You just append details whenever something seems useful, and don't send it when it's not.
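"Schema is inferred" can be sketched in a couple of lines. The events below are hypothetical; the point is that no two events need to share a key set, and the effective schema is just the union of keys seen so far:

```python
# "Arbitrarily-wide" in practice: no fixed schema, each event carries
# whatever context seemed useful at the time. Field names are made up.
event_a = {"trace_id": "t1", "endpoint": "/payment", "status": 200}
event_b = {
    "trace_id": "t2",
    "endpoint": "/payment",
    "status": 504,
    # appended because a deploy was in flight -- absent from event_a
    "build_sha": "9f8e7d6",
    "feature_flag.new_checkout": True,
    "db.shard": "shard-14",
}

# The inferred "schema" is just the union of keys the tool has seen.
inferred_schema = sorted(set(event_a) | set(event_b))
print(inferred_schema)
```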
With modern systems, it's not enough to gather a few trusty rusty dimensions about your user requests.
You need to gather incredibly rich detail about everything happening at the intersection of users and code. More dimensions equals more context. And then you need a scalpel.
.. that can pick apart traffic at the individual request level, identifying outliers based on version numbers, feature flags, and any/every combination of details about your headers, environment, devices, queries, storage internals, user metadata, and more.
There are a million different definitions of observability floating around, and I'm probably responsible for more than my fair share of them. But as a user, the key functionality to watch for is this: support for high-cardinality and high-dimensionality data, and the ability to break down and group by any of those dimensions, in any combination.
Several people asked this. It's a good question! I will share my thoughts, but I am certainly not religious about this. You should do what works for you and your teams and their workflows. 📈🥂☺️
1) "assuming you have good deduplication"... can be a pretty big assumption. You never want to be in a situation where you spend more time tweaking dupe, retry, and re-alert thresholds than fixing the problem.
2) having to remember to go futz with a ticket after every little thing feels like a lot of busywork. You've already committed some code, mentioned it in #ops or wherever, and now you have to go paste all that information into a task (or many tasks) too?
@beajammingh the title particularly caught my eye. for the past month or two i've been sitting on a rant about how i no longer associate the term "devops"** with modern problems, but with fighting the last war.
** infinitely malleable as it may be
yes, if you have massive software engineering teams and operations teams and they are all siloed off from each other, then you should be breaking down (i can't even say it, the phrase is so annoying) ... stuff.
but this is a temporary stage, right? a bridge to a better world.
Close! "If you're considering replacing $(working tool) with $(different tool for same function), don't do it unless you expect a 10x productivity improvement"
cvs to git? ✅
mysql to postgres? ❌
puppet to chef? ❌
redhat to ubuntu? ❌
The costs of ripping and replacing, training humans, updating references and docs, the overhead of managing two systems in the meantime, etc -- are so high that otherwise you are likely better off investing that time in making the existing solution work for you.
Of course, every situation is unique. And the interesting conversations are usually around where that 10x break-even point will be.
The big one of the past half-decade has been when to move from virtualization to containerization.
Maybe not "full transparency", but I think *lots* of engineers chafe at how little detail they have access to, and wish they were looped into decision-making processes much earlier.
One of the most common reasons people become managers is they want to know EVERYTHING. They are tired of feeling left out, or like information is being kept from them (true or no).
All they want is to be "in the room where it happens", every time it happens.
I mean, that's why I got into management. 🙃👋 And it works! It scratches the itch. Everything routes through you. It feels great...for you.
But you still have a team where people feel like they have to become managers in order to be included and heard.
cindy and i have talked many times about the kind of blowback one gets for posting these kinds of things; writing it anyway takes guts, and nerves of steel.
@copyconstruct there's no shame in wanting power and influence to advance your agenda, if you're trying to fix things or improve complex situations. in fact, i think it's a moral imperative for people who care to not cede the space to those who ONLY want power and influence.
twitter is full of critics who have never written a line of production code or managed more than a pet rabbit. and that's fine.
but if you care about enacting real change more than being Right On The Internet, that means working through the vehicle of imperfect organizations.
i completely agree. the more a company tends to talk about their diversity, transparency, etc, the more suspicious i get about how much they doth protest.
especially when they start conducting marketing campaigns around pay-to-play lists for "best employer" awards. 🙄
the best thing about real diversity (and real transparency) is that you don't have to THINK about it all the fucking time. it's not ✨broken✨ and in your face infuriating you with its brokenness all the time.
the most insidious thing about teams that aren't diverse is the constant cognitive and emotional load borne by those who happen to be different.
on a diverse team, people are relieved of most of that tax, and can just focus on being who they are doing what they do.