Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Ben Sigelman

@el_bhs

Jan 13, 2021 • 12 tweets • 4 min read • Read on X

0/ I’m tired of hearing about observability replacing monitoring. It’s not going to, and that’s because it shouldn’t.

Observability will not _replace_ monitoring, it will _augment_ monitoring.

Here’s a thread about observability, and how monitoring can evolve to fit in: 👇

1/ Let’s start with the diagram (above) illustrating the anatomy of observability. There are three layers:

I. (Open)Telemetry: acquire high-quality data with minimal effort
II. Storage: “Stats over time” and “Transactions over time”
III. Benefits: *solve actual problems*

@opentelemetry

2/ The direction for “Telemetry” is simple: @opentelemetry.

(This is the (only) place where the so-called "three pillars” come in, by the way. If you think you’ve solved the observability problem by collecting traces, metrics, and logs, you’re about to be disappointed. :-/ )

3/ The answer for “Storage” depends on your workload, but we’ve learned that it’s glib to expect a data platform to support observability with *just* a TSDB or *just* a transaction/trace/logging DB. And also that “cost profiling and control” is a core platform feature.

4/ But what about “Benefits”? There’s all of that business about Control Theory (too academic) and “unknown unknowns” (too abstract). And “three pillars” which is complete BS, per the above (it’s just “the three pillars of telemetry,” at best).

5/ Really, Observability *Benefits* divide neatly into two categories: understanding *health* (i.e., monitoring) and understanding *change* (i.e., finding and exposing signals and statistical insights hidden within the firehose of telemetry).

6/ Somewhere along the way, “monitoring” was thrown under a bus, which is unfortunate. If we define monitoring as *an effort to connect the health of a system component to the health of the business* – it’s actually quite vital. And ripe for innovation! E.g., SLOs.

7/ “Monitoring” got a bad name because operators were *trying to monitor every possible failure mode of a distributed system.* That doesn’t work because there are too many of them.

(And that’s why you have too many dashboards at your company.)

8/ Monitoring doesn’t have to be that way. It can actually be quite clarifying, and there’s still ample room for innovation. I’d argue that SLOs, done properly, are what monitoring can and should be (or become).

9/ So what if we do things differently? What if we do things *right*? We treat Monitoring as a first-class citizen, albeit only one aspect of observability, and we closely track the signals that best express and predict the health of each component in our systems.

10/ … And then we need a new kind of observability value that’s purpose-built to manage *changes* in those signals. More on that part in a future post. :) But the idea is to facilitate intentional change (e.g., CI/CD) while mitigating unintentional change (Incident Response).

11/ Zooming out: Monitoring will never be *replaced* by Observability: it’s not just "part of Observability’s anatomy," it’s a vital organ! Our challenge is to *evolve* Monitoring, and to use it as a scaffold for the patterns and insights in our telemetry that explain change.

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @el_bhs

Ben Sigelman

@el_bhs

Nov 30, 2021

0/ In recent years, SLOs have graduated from being an “SRE 201” advanced topic to an outright buzzword.

But despite their promise, SLO deployments today are messy and often unsuccessful. Why? And what can we do about it?

Thread: 👇

1/ All good engineers care about their users. In the olden days of monolithic software apps, engineers even got to deploy software that touched those users directly!

But given the depth of modern architectures, the user is often many, many hops away from that (good) engineer. 😢

https://twitter.com/el_bhs/status/1372636288021524482

2/ At the tippity-top of the stack, users are fiddling with their phones or clicking on websites, etc. In cloud-native apps, these activities create *transactions* that propagate from service to service, consuming resources along the way.

(More detail:

https://twitter.com/el_bhs/status/1372636288021524482

)

Read 15 tweets

Ben Sigelman

@el_bhs

Sep 22, 2021

0/ Every software org that's tried to scale distributed tracing has probably wrestled with sampling.

And yet the standard approach to sampling is needlessly narrow and limiting! What if we step back and frame things in terms of use cases, queries, and verbosity?

Thread: 👇

1/ So, first things first: the only reason anyone cares about sampling is that distributed tracing can generate a *vast* amount of telemetry, and we need to be thoughtful about how we transmit, store, and analyze it.

2/ When we prototyped Dapper at Google in 2004, we realized we needed something to reduce costs (and, because we stupidly created kernel contention by writing the data to local disk, overhead).

We did the simplest thing: uniform random sampling.

Read 18 tweets

Ben Sigelman

@el_bhs

Sep 17, 2021

0/ The more time I spend in this industry, the more I realize that, while “Digital Transformation” is real, there are actually three of them. :)

And that we need a new kind of observability to complete the transition.

🧵👇

1/ The first “Digital Transformation” is a transformation of Operations. This isn’t just SRE or ITOps, to be clear, it’s much broader – basically “all Opex,” or “everything that employees do.”

2/ First we digitized the filing cabinet via mainframe databases, then we added boxed software and LANs, and now we’ve gone further with SaaS tools that cover operational workflows across roles and functions: see ServiceNow, Microsoft, Salesforce, etc.

Read 14 tweets

Ben Sigelman

@el_bhs

Apr 12, 2021

0/ This is a thread about why tracing will gradually replace most logging, at least where distributed or cloud-native architectures are concerned. And we’re going to explore this through the lens of a relational data model.

It’s going to be fun!

Thread: 👇

1/ The best logging is always *structured* logging. That is, logging statements are most useful if they encode key:value pairs which can then be queried and *analyzed* in the aggregate.

(Even for plain, textual logs, NLP and stats can extract basic structure.)

2/ A structured log implicitly defines a *relational table*, with the keys for each attribute defining the columns, and the values for each log line defining rows in this (theoretical) table.

Like this:

Read 16 tweets

Ben Sigelman

@el_bhs

Mar 26, 2021

0/ The easier part of Continuous Delivery (“CD”) is, well, “continuously delivering software.”

The harder part is doing it reliably.

This is a thread about the critical differences between what we’ll refer to as “local CD” and “global CD,” and how observability fits in.

👇

1/ Let’s begin by restating the conventional wisdom about how to do “Continuous Delivery” for a single (micro)service:

i) <CD run starts>
ii) Qualify release in pre-prod
iii) Deploy to prod
iv) If the deployed service is unstable, roll back the deploy

Safe, right? Not really.

2/ The above is what I would term “Local CD” – we are checking whether *the deployed service itself* is (locally) stable… but that’s it.

The problem is that a *majority* of production incidents are due to planned deployments in *other* services. “Local CD” cannot find those.

Read 14 tweets

Ben Sigelman

@el_bhs

Mar 18, 2021

0/ Fundamentally, there are only two types of “things worth observing” when it comes to production systems:

1) Resources
2) Transactions

The tricky (and interesting) part is that they’re entirely codependent. This is a thread about that tricky/interesting part…

👇

1/ But first, some definitions.

*Transactions:* these are the things that traverse your system and (hopefully) “do something.” The classic example would be an end-user request that propagates across networks and process boundaries.

2/ “Transactions” can be described at wildly different granularities: actions in a mobile app, HTTP reqs, function calls, CPU instructions, etc. They’re *all* part of the overall transaction, and the question is how much detail you can *afford* to expose to observability tools.

Read 16 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Ben Sigelman

Try unrolling a thread yourself!

More from @el_bhs

Ben Sigelman

Ben Sigelman

Ben Sigelman

Ben Sigelman

Ben Sigelman

Ben Sigelman

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!