Ben Sigelman
12 Feb · 16 tweets · 3 min read
0/ When large eng orgs rely on metrics for both monitoring *and* observability, they struggle with cardinality.

This is a thread about “the two drivers of cardinality.” And which one of those we should kill. :)

🧵👇
1/ Okay, first off: “what is cardinality, anyway?” And why is it such a big deal for metrics?

“Cardinality” is a mathematical term: it’s *the number of elements in a set*... boring! So why tf does anybody care??

Well, because people think they need it, and then, suddenly: "$$$$$$$."
2/ When a developer adds a (custom) metric to their code, they might think they’re just adding, well, *one metric*. …
3/ … But when they add “tags” to that metric – like <version>, <host>, or (shiver) <customer_id> – they are actually creating a *set* of metric time series, and the *cardinality* of that set is the total number of unique combinations of those tags’ values.
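(To make that concrete, here’s a minimal sketch using the Prometheus Python client; the metric name, label names, and values are invented for illustration.)

```python
from prometheus_client import Counter

# One "metric" in the code...
REQUESTS = Counter(
    "myservice_requests_total",
    "Requests handled, labeled by deployed version and host",
    ["version", "host"],
)

# ...but every unique combination of label values is its own time series.
REQUESTS.labels(version="v1.42", host="web-01").inc()
REQUESTS.labels(version="v1.42", host="web-02").inc()
REQUESTS.labels(version="v1.43", host="web-01").inc()
# 2 versions x 2 hosts observed so far -> 3 live series (up to 4 possible).
```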
4/ The problem is that some of those tags have many distinct values.

E.g., when I was working on Monarch at Google, a single Gmail metric grew to over 300,000,000 (!!!) distinct time series. In a TSDB, the time series is the unit of cost, so cardinality *is* the bill.

So again, “$$$$$$$.”
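(The math behind the “$$$$$$$” is just multiplication. A back-of-the-envelope sketch with invented tag-value counts, not Google’s actual numbers:)

```python
# Invented tag-value counts for one hypothetical RPC metric.
distinct_values = {
    "method": 20,           # RPC methods
    "version": 5,           # releases running at once
    "host": 300,            # instances
    "customer_id": 10_000,  # the (shiver) tag
}

cardinality = 1
for n in distinct_values.values():
    cardinality *= n

print(f"worst-case series for this ONE metric: {cardinality:,}")
# => 300,000,000 -- and each series is stored, indexed, and billed separately.
```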
5/ Okay, so that’s why people care about metrics cardinality. Now, what are the two *drivers* of that cardinality?

A) Monitoring: more detailed *health measurements*
B) Observability: more detailed *explanations of changes*

Let’s take these one at a time…
6/ First, “More Detailed Health Measurements” (monitoring):

Consider an RPC metric that counts errors. You need to independently monitor the error rate for different methods. And so – voila – someone adds a “method” tag, and now there’s 10x the cardinality for that metric.
7/ … And also 10x the cost. But that’s justifiable, as there’s a business case for independently monitoring the reliability of distinct RPC methods.

Put another way, you might rightly have different error budgets for different RPC methods, so their statistics must be separable.
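(A sketch of what “separable statistics” buys you; the method names, SLO targets, and counts are hypothetical. With a method tag, each RPC can be judged against its own error budget:)

```python
# Hypothetical per-method SLO targets and one window of observed traffic.
slo_target = {"GetUser": 0.999, "SearchUsers": 0.99}
requests   = {"GetUser": 1_000_000, "SearchUsers": 50_000}
errors     = {"GetUser": 400, "SearchUsers": 900}

for method, target in slo_target.items():
    error_ratio = errors[method] / requests[method]
    budget      = 1.0 - target            # allowed error ratio for this method
    consumed    = error_ratio / budget    # fraction of the budget burned
    print(f"{method}: {error_ratio:.3%} errors, {consumed:.0%} of budget consumed")

# GetUser:     0.040% errors, 40% of budget consumed  (healthy)
# SearchUsers: 1.800% errors, 180% of budget consumed (in trouble)
```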
8/ Bottom line: When it comes to “measuring health,” we often *need* cardinality in order to home in on the signals we actually care about. Increasing cardinality to *proactively* monitor the signals we care most about is usually a worthwhile tradeoff.
9/ Now what about using cardinality for “More Detailed Explanations of Changes?”

This is the real culprit! And, frankly, should be abolished. :) Metrics cardinality is the wrong way to do observability – to explain changes.

More on that…
10/ Say your monitoring tells you that there’s a problem with a critical symptom – e.g., you’re suddenly burning through an SLO error budget at an unsustainable clip.
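(“Unsustainable clip” has a concrete meaning: a burn rate. A rough sketch with made-up numbers:)

```python
# Hypothetical: a 30-day, 99.9% availability SLO.
window_days  = 30
error_budget = 1 - 0.999          # 0.1% of requests may fail over the window

# Observed right now (say, over the last hour):
observed_error_ratio = 0.02       # 2% of requests failing

burn_rate = observed_error_ratio / error_budget        # 20x the sustainable pace
hours_to_exhaustion = (window_days * 24) / burn_rate   # ~36 hours left

print(f"burn rate: {burn_rate:.0f}x, budget exhausted in ~{hours_to_exhaustion:.0f}h")
```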
11/ After a painful outage, say you realize a single customer DoSed your service. So someone adds a `customer` tag “for next time.”

But this is unsustainable: each incident reveals a new failure mode, devs keep adding tags, and before long your metrics bill is out of control.
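(A sketch of why that doesn’t scale, reusing the invented tag counts from the earlier sketch: one “for next time” tag multiplies the series count for every existing combination.)

```python
# Reusing the invented numbers from earlier: method x version x host.
series_before = 20 * 5 * 300          # 30,000 series for this metric today
distinct_customers = 10_000           # the "for next time" tag

series_after = series_before * distinct_customers
print(f"{series_before:,} -> {series_after:,} series after one new tag")
# => 30,000 -> 300,000,000
```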
12/ The problem, in a nutshell:

Distributed systems can fail for a staggeringly large number of reasons. *You can't use metrics cardinality to isolate each one.*
13/ How *should* we explain changes to production systems?

Understanding change is the central problem of observability. Effective workflows might *start* with metrics, but they must pivot towards a multi-telemetry, multi-service guided analysis.
14/ So, to sum up: spend your limited cardinality budget on *monitoring*, and then look for observability that (a) naturally explains changes and (b) relies on transactional data sources that do not penalize you for high/unbounded cardinality.
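(A miniature sketch of what a transactional data source looks like; the span fields and values are invented. Each trace/span is a standalone event, so slicing by a high-cardinality attribute like customer is just a query, not 10,000 new time series:)

```python
from collections import Counter

# Invented span records: one event per request, carrying arbitrary attributes.
# Nothing here pre-declares a time series.
spans = [
    {"service": "api", "status": "error", "customer": "acme-corp", "duration_ms": 5400},
    {"service": "api", "status": "error", "customer": "acme-corp", "duration_ms": 5100},
    {"service": "api", "status": "ok",    "customer": "initech",   "duration_ms": 80},
    {"service": "api", "status": "error", "customer": "acme-corp", "duration_ms": 4900},
]

# "Which customer explains the error spike?" is a query over events,
# paid for per event stored, not per distinct customer value.
errors_by_customer = Counter(
    s["customer"] for s in spans if s["status"] == "error"
)
print(errors_by_customer.most_common(3))   # => [('acme-corp', 3)]
```

That per-event model is what lets you ask a brand-new question after an incident without having paid for its cardinality in advance.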
15/ PS: For more on how to distinguish monitoring and observability, see this thread:

PPS: If you’d like to discuss/debate/request-more-detail any of the above, reply to this thread or DM me!
