Ben Sigelman
I like writing threads! Archive ➙ https://t.co/ry8HHZdaye Co-founded @LightstepHQ. Co-created @OpenTelemetry & @OpenTracing. Co-created Dapper and Monarch at Google.
Nov 30, 2021 15 tweets 5 min read
0/ In recent years, SLOs have graduated from being an “SRE 201” advanced topic to an outright buzzword.

But despite their promise, SLO deployments today are messy and often unsuccessful. Why? And what can we do about it?

Thread: 👇

1/ All good engineers care about their users. In the olden days of monolithic software apps, engineers even got to deploy software that touched those users directly!

But given the depth of modern architectures, the user is often many, many hops away from that (good) engineer. 😢
Sep 22, 2021 18 tweets 6 min read
0/ Every software org that's tried to scale distributed tracing has probably wrestled with sampling.

And yet the standard approach to sampling is needlessly narrow and limiting! What if we step back and frame things in terms of use cases, queries, and verbosity?

Thread: 👇

1/ So, first things first: the only reason anyone cares about sampling is that distributed tracing can generate a *vast* amount of telemetry, and we need to be thoughtful about how we transmit, store, and analyze it.
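
To make that concrete: the standard approach is head-based probabilistic sampling, sketched below with the OpenTelemetry Go SDK (the 1% rate is an arbitrary example). Note the built-in narrowness: the keep/drop call happens up front, before anyone knows whether the trace contains an error or a latency outlier.

```go
package main

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Head-based sampling: the keep/drop decision is made once, at the
	// root of the trace, *before* anyone knows whether the trace will
	// turn out to be interesting.
	tp := sdktrace.NewTracerProvider(
		// Keep roughly 1% of traces, chosen uniformly by trace ID.
		// The 1% here is an arbitrary example rate.
		sdktrace.WithSampler(sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(0.01),
		)),
	)
	otel.SetTracerProvider(tp)
}
```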
Sep 17, 2021 14 tweets 3 min read
0/ The more time I spend in this industry, the more I realize that, while “Digital Transformation” is real, there are actually three of them. :)

And that we need a new kind of observability to complete the transition.

🧵👇

1/ The first “Digital Transformation” is a transformation of Operations. This isn’t just SRE or ITOps, to be clear, it’s much broader – basically “all Opex,” or “everything that employees do.”
Apr 12, 2021 16 tweets 6 min read
0/ This is a thread about why tracing will gradually replace most logging, at least where distributed or cloud-native architectures are concerned. And we’re going to explore this through the lens of a relational data model.

It’s going to be fun!

Thread: 👇

1/ The best logging is always *structured* logging. That is, logging statements are most useful if they encode key:value pairs which can then be queried and *analyzed* in the aggregate.

(Even for plain, textual logs, NLP and stats can extract basic structure.)
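
A minimal sketch of that difference, using Go’s standard log/slog package (the field names are invented for illustration):

```go
package main

import "log/slog"

func main() {
	// Unstructured: a sentence you can grep, but not aggregate.
	//   log.Printf("checkout failed for user 4127 after 3 retries")

	// Structured: key:value pairs you can filter, group by, and
	// analyze in the aggregate.
	slog.Error("checkout failed",
		"user_id", 4127,
		"retries", 3,
		"region", "us-east-1",
	)
}
```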
Mar 26, 2021 14 tweets 4 min read
0/ The easier part of Continuous Delivery (“CD”) is, well, “continuously delivering software.”

The harder part is doing it reliably.

This is a thread about the critical differences between what we’ll refer to as “local CD” and “global CD,” and how observability fits in.

👇

1/ Let’s begin by restating the conventional wisdom about how to do “Continuous Delivery” for a single (micro)service:

i) <CD run starts>
ii) Qualify release in pre-prod
iii) Deploy to prod
iv) If the deployed service is unstable, roll back the deploy

Safe, right? Not really.
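
Here’s that loop as a hypothetical Go sketch (every function is an invented placeholder, not a real deployment API); the trouble hides in step iv:

```go
package main

import "fmt"

// Hypothetical placeholders standing in for the pipeline steps above;
// none of this is a real deployment API.
func qualifyInPreProd(svc string) error { return nil }
func deployToProd(svc string)           { fmt.Println("deployed:", svc) }
func serviceHealthy(svc string) bool    { return true }
func rollback(svc string)               { fmt.Println("rolled back:", svc) }

func main() {
	svc := "checkout" // i) CD run starts
	if err := qualifyInPreProd(svc); err != nil { // ii) qualify in pre-prod
		return
	}
	deployToProd(svc) // iii) deploy to prod

	// iv) The rollback check inspects only the deployed service's own
	// signals; it is blind to regressions surfacing in the services
	// that *depend* on it, which is the gap this thread digs into.
	if !serviceHealthy(svc) {
		rollback(svc)
	}
}
```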
Mar 18, 2021 16 tweets 4 min read
0/ Fundamentally, there are only two types of “things worth observing” when it comes to production systems:

1) Resources
2) Transactions

The tricky (and interesting) part is that they’re entirely codependent. This is a thread about that tricky/interesting part…

👇

1/ But first, some definitions.

*Transactions:* these are the things that traverse your system and (hopefully) “do something.” The classic example would be an end-user request that propagates across networks and process boundaries.
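
A toy sketch of the two categories in Go (the types and fields are invented for illustration), with the codependency spelled out in the comments:

```go
package main

import "fmt"

// A transaction only "does something" by consuming resources, and a
// resource's health only matters because of the transactions flowing
// through it.
type Transaction struct {
	TraceID   string  // follows the request across process boundaries
	Route     string  // e.g. "POST /checkout"
	LatencyMs float64 // what the end user actually feels
}

type Resource struct {
	Name        string  // e.g. "db-primary CPU" or "NIC bandwidth"
	Capacity    float64 // finite, by definition
	Utilization float64 // driven entirely by transactions
}

func main() {
	t := Transaction{TraceID: "abc123", Route: "POST /checkout", LatencyMs: 42}
	r := Resource{Name: "db-primary CPU", Capacity: 64, Utilization: 0.8}
	fmt.Println(t, r)
}
```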
Feb 23, 2021 17 tweets 5 min read
0/ If you or someone you love uses Kafka in production, I’m sure there’s been some emotional toil when a single producer floods a topic and creates a cascading failure.

This is a 🧵 about how monitoring and observability can make that far less painful.

👇

1/ At a certain level, Kafka is just like any other resource in your system: e.g., your database’s CPUs, your NICs, or your RAM reservations.

All resources are finite, and when they participate in transactions, there is a little bit less of them than when they don’t.
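
One minimal sketch of what attribution can look like, in plain Go with no real Kafka client (all names are invented): tag every write with its producer, so a flood is attributable rather than anonymous aggregate throughput.

```go
package main

import (
	"fmt"
	"sync"
)

// topicMeter attributes bytes written to a shared topic back to the
// producer that wrote them. If you only watch aggregate topic
// throughput, a flood is visible but anonymous; per-producer
// accounting makes the guilty party a one-line query.
type topicMeter struct {
	mu           sync.Mutex
	bytesByOwner map[string]int64
}

func (m *topicMeter) record(producer string, n int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.bytesByOwner[producer] += n
}

func main() {
	m := &topicMeter{bytesByOwner: map[string]int64{}}
	m.record("checkout-svc", 1_000)
	m.record("batch-backfill", 50_000_000) // the flood stands out
	for owner, b := range m.bytesByOwner {
		fmt.Printf("%s: %d bytes\n", owner, b)
	}
}
```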
Feb 12, 2021 16 tweets 3 min read
0/ When large eng orgs rely on metrics for both monitoring *and* observability, they struggle with cardinality.

This is a thread about “the two drivers of cardinality.” And which one of those we should kill. :)

🧵👇

1/ Okay, first off: “what is cardinality, anyway?” And why is it such a big deal for metrics?

“Cardinality” is a mathematical term: it’s *the number of elements in a set*... boring! So why tf does anybody care??

Well, because people think they need it, then suddenly, "$$$$$$$."
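
To make the “$$$$$$$” concrete: a metric’s time-series count is roughly the product of its label-value counts, so a single high-cardinality label multiplies the whole bill. A sketch with invented numbers:

```go
package main

import "fmt"

// Each unique combination of label values is a separate time series
// you pay to store. The counts below are invented for illustration.
func main() {
	services := 50
	endpoints := 30
	statusCodes := 5
	hosts := 200

	base := services * endpoints * statusCodes * hosts
	fmt.Println("series without user_id:", base) // 1,500,000

	// Add one high-cardinality label and the bill multiplies:
	userIDs := 100_000
	fmt.Println("series with user_id:", base*userIDs) // 150,000,000,000
}
```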
Feb 5, 2021 13 tweets 4 min read
0/ This is a 🧵about my experiences building both the Distributed Tracing and Metrics infra at Google.

And, particularly, my regrets. :)

Here goes: 👇

1/ Dapper certainly did some fancy tricks, and I’m sure it still does. If it’s possible to fall in love with an idea or a piece of technology, that’s what happened with me and Dapper. It wasn’t just new data, it was a new *type* of data – and lots of it. So. Much. Fun. …
Jan 13, 2021 12 tweets 4 min read
0/ I’m tired of hearing about observability replacing monitoring. It’s not going to, and that’s because it shouldn’t.

Observability will not _replace_ monitoring, it will _augment_ monitoring.

Here’s a thread about observability, and how monitoring can evolve to fit in: 👇

1/ Let’s start with the diagram (above) illustrating the anatomy of observability. There are three layers:

I. (Open)Telemetry: acquire high-quality data with minimal effort
II. Storage: “Stats over time” and “Transactions over time”
III. Benefits: *solve actual problems*
Jun 16, 2020 15 tweets 5 min read
0/ This is a thread about *Logging* and how – for decades – it’s been a needlessly “selfish” technology.

And how that should change.

I promise this eventually gets concrete and involves real examples from production. :)

👇

1/ First off, a crucial clarification: I don’t mean that the “loggers” – that is, the human operators – are selfish, of course! The problem has been that their (IMO primitive) toolchain needlessly localizes and *constrains* the value of the logging telemetry data.
Jun 8, 2020 13 tweets 4 min read
0/ Sometimes we should philosophize about observability… and sometimes we should just get ultra-pragmatic and examine real use cases from real systems!

Here is one about a bad deploy we had at @LightstepHQ the other day. Let’s get started with a picture…

Thread 👇

1/ In this example, we are contending with a failed deploy within Lightstep’s own (internal, multi-tenant) system. It was easy enough to *detect* the regression and roll back, but in order to fix the underlying issue, of course we had to understand it.
Apr 17, 2020 16 tweets 4 min read
0/ Now that organizations are building or buying observability, they are realizing that it can get really damned expensive. And not just “expensive,” but “expensive and out of control.”

This is a thread about *observability value:* both the benefits and the costs.

1/ You hear so much about observability because it *can* be awesome. :) Benefits roll up into at least one of the following:

- Reducing latencies or error rates (foreach service)
- Reducing MTTR (also foreach service)
- Improving velocity or communication (foreach team)
Feb 11, 2020 17 tweets 5 min read
0/ Sometimes we should philosophize about observability.

And sometimes we should just get pragmatic and examine real-world use cases in real-world systems! So here is a simple example of what cutting-edge observability can do today.

We begin with an SLI that looks off…

1/ A quick prologue: this real-world example comes from @LightStepHQ’s meta-monitoring (of our own SaaS). This way I can show real data at scale (Lightstep customers generate billions of traces every hour!!) without needing approval from customer eng+PR departments.
Jan 14, 2020 14 tweets 5 min read
0/ Deep systems have come to the fore in recent years, largely due to the industry-wide migration to microservices.

But weren't monoliths "deep", too? Well, yes and no.

And this is all related to tracing, observability, and the slow death of APM.

Thread:

1/ First, let's start with monoliths. Of course they've been around for a while, and they're where most of us started. There is plenty of depth and complexity from a monolithic-codebase standpoint, but operationally it's just one big – and often brittle – binary.
Nov 8, 2019 7 tweets 2 min read
1/ APM is dying – and that’s ok.

What happened? And why?

(thread)

2/ In APM’s heyday (think “New Relic and AppDynamics circa 2015”), the value prop was straightforward: “Just add this one magic agent and you’ll never need to wonder why your monolithic app is broken!”

But then things changed.
Oct 15, 2019 6 tweets 2 min read
1/ We all know that observability is "hard and getting harder."

What single aspect of a system best predicts broken observability workflows? It’s not just *scale*, it’s *depth*.

And that’s why we need to talk more about deep systems. (Thread)

lightstep.com/blog/how-deep-…

2/ Why are deep systems so problematic?

It’s because what we control in production is so much smaller than what we’re responsible for.

To illustrate the fearsome scope of responsibility with real-world data, here are some anonymized @LightStepHQ customer system diagrams:
Oct 9, 2019 9 tweets 3 min read
1/ First things first: Metrics, Logs, and Traces are not “the three pillars of observability.”

They are just the raw materials – the *telemetry* – and we must reframe our discussion of observability around use cases and problems-to-solve.

Thread:

2/ The conventional wisdom looks like this…

Observability is this cool 6-syllable word that you know you want because it’s trendier than monitoring. And you get it (somehow) by purchasing Logs, Metrics, and Tracing.
Aug 17, 2019 10 tweets 2 min read
For observability, scale is a red herring: what really matters is *depth*.

And once you start thinking about “deep systems,” you realize why conventional observability dogma is nonsensical. And also why tracing will become the foundation for effective observability.

Thread: /0

First, let’s talk about scale and why it’s the wrong way to think about system complexity.

“Scale” refers to throughput, cost, developer headcount, or some other linear attribute of a system. /1