Ben Sigelman Profile picture
13 Jan, 12 tweets, 4 min read
0/ I’m tired of hearing about observability replacing monitoring. It’s not going to, and that’s because it shouldn’t.

Observability will not _replace_ monitoring, it will _augment_ monitoring.

Here’s a thread about observability, and how monitoring can evolve to fit in: 👇 Image
1/ Let’s start with the diagram (above) illustrating the anatomy of observability. There are three layers:

I. (Open)Telemetry: acquire high-quality data with minimal effort
II. Storage: “Stats over time” and “Transactions over time”
III. Benefits: *solve actual problems*
2/ The direction for “Telemetry” is simple: @opentelemetry.

(This is the (only) place where the so-called "three pillars” come in, by the way. If you think you’ve solved the observability problem by collecting traces, metrics, and logs, you’re about to be disappointed. :-/ ) Image
3/ The answer for “Storage” depends on your workload, but we’ve learned that it’s glib to expect a data platform to support observability with *just* a TSDB or *just* a transaction/trace/logging DB. And also that “cost profiling and control” is a core platform feature. Image
4/ But what about “Benefits”? There’s all of that business about Control Theory (too academic) and “unknown unknowns” (too abstract). And “three pillars” which is complete BS, per the above (it’s just “the three pillars of telemetry,” at best).
5/ Really, Observability *Benefits* divide neatly into two categories: understanding *health* (i.e., monitoring) and understanding *change* (i.e., finding and exposing signals and statistical insights hidden within the firehose of telemetry).
6/ Somewhere along the way, “monitoring” was thrown under a bus, which is unfortunate. If we define monitoring as *an effort to connect the health of a system component to the health of the business* – it’s actually quite vital. And ripe for innovation! E.g., SLOs. Image
7/ “Monitoring” got a bad name because operators were *trying to monitor every possible failure mode of a distributed system.* That doesn’t work because there are too many of them.

(And that’s why you have too many dashboards at your company.)
8/ Monitoring doesn’t have to be that way. It can actually be quite clarifying, and there’s still ample room for innovation. I’d argue that SLOs, done properly, are what monitoring can and should be (or become).
9/ So what if we do things differently? What if we do things *right*? We treat Monitoring as a first-class citizen, albeit only one aspect of observability, and we closely track the signals that best express and predict the health of each component in our systems.
10/ … And then we need a new kind of observability value that’s purpose-built to manage *changes* in those signals. More on that part in a future post. :) But the idea is to facilitate intentional change (e.g., CI/CD) while mitigating unintentional change (Incident Response). Image
11/ Zooming out: Monitoring will never be *replaced* by Observability: it’s not just "part of Observability’s anatomy," it’s a vital organ! Our challenge is to *evolve* Monitoring, and to use it as a scaffold for the patterns and insights in our telemetry that explain change.

• • •

Missing some Tweet in this thread? You can try to force a refresh

Keep Current with Ben Sigelman

Ben Sigelman Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!


Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @el_bhs

16 Jun 20
0/ This is a thread about *Logging* and how – for decades – it’s been a needlessly “selfish” technology.

And how that should change.

I promise this eventually gets concrete and involves real examples from production. :)

1/ First off, a crucial clarification: I don’t mean that the “loggers” – that is, the human operators – are selfish, of course! The problem has been that their (IMO primitive) toolchain needlessly localizes and *constrains* the value of the logging telemetry data.
2/ How? Well, traditional logging tech encourages the code-owner to add logs *in order to explain what’s happening with their code, in production, to themself.* Or, maybe, to other/future owners of that particular patch of code.

Useful? Sometimes. But both limited and limiting.
Read 15 tweets
8 Jun 20
0/ Sometimes we should philosophize about observability… and sometimes we should just get ultra-pragmatic and examine real use cases from real systems!

Here is one about a bad deploy we had at @LightstepHQ the other day. Let’s get started with a picture…

Thread 👇
1/ In this example, we are contending with a failed deploy within Lightstep’s own (internal, multi-tenant) system. It was easy enough to *detect* the regression and roll back, but in order to fix the underlying issue, of course we had to understand it.
2/ We knew the failure was related to a bad deploy of the `liveview` service. The screenshot above shows `liveview` endpoints, ranked by the “biggest change” for the new release; at the top is “ExplorerService/Create” with a huge (!!) increase in error ratio.
Read 13 tweets
17 Apr 20
0/ Now that organizations are building or buying observability, they are realizing that it can get really damned expensive. And not just “expensive,” but “expensive and out of control.”

This is a thread about *observability value:* both the benefits and the costs.
1/ You hear so much about observability because it *can* be awesome. :) Benefits roll up into at least one of the following:

- Reducing latencies or error rates (foreach service)
- Reducing MTTR (also foreach service)
- Improving velocity or communication (foreach team)
2/ But most observability vendors charge based on something that has literally no value on its own: *the telemetry.*

This is rough for customers, especially since these vendors provide no mechanism to scale or *control* the telemetry volume (why would they? it’s $$$!).
Read 16 tweets
11 Feb 20
0/ Sometimes we should philosophize about observability.

And sometimes we should just get pragmatic and examine real-world use cases in real-world systems! So here is a simple example of what cutting-edge observability can do today.

We begin with an SLI that looks off… Image
1/ A quick prologue: this real-world example comes from @LightStepHQ’s meta-monitoring (of our own SaaS). This way I can show real data at scale (Lightstep customers generate billions of traces every hour!!) without needing approval from customer eng+PR departments.
2/ So, we run a microservice called “maggie” (stands for “m”etrics “agg”regator). It had this weird blip at about 12:30pm. That’s not supposed to happen, so the obvious question is “why?” Image
Read 17 tweets
14 Jan 20
0/ Deep systems have come to the fore in recent years, largely due to the industry-wide migration to microservices.

But weren't monoliths "deep", too? Well, yes and no.

And this is all related to tracing, observability, and the slow death of APM.

1/ First, let's start with monoliths. Of course they've been around for a while, and it’s where most of us started. There is plenty of depth and complexity from a monolithic-codebase standpoint, but operationally it's just one big – and often brittle – binary.
2/ Hundreds of developers work across dozens of teams to develop countless packages that are (slowly) tested and compiled into *a single monolithic binary*, pictured here.
Read 14 tweets
8 Nov 19
1/ APM is dying – and that’s ok.

What happened? And why?

2/ In APM’s heyday (think “New Relic and AppDynamics circa 2015”), the value prop was straightforward: “Just add this one magic agent and you’ll never need to wonder why your monolithic app is broken!”

But then things changed.
3a/ *Systems got deep:* APM was designed for monoliths – where development revolved around a single app server. Monoliths slowed down dev velocity, so we broke them into layer upon layer of services.
Read 7 tweets

Did Thread Reader help you today?

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!

Follow Us on Twitter!