0/ This is a 🧵 about my experiences building both the Distributed Tracing and Metrics infra at Google.
And, particularly, my regrets. :)
Here goes: 👇
1/ Dapper certainly did some fancy tricks, and I’m sure it still does. If it’s possible to fall in love with an idea or a piece of technology, that’s what happened with me and Dapper. It wasn’t just new data, it was a new *type* of data – and lots of it. So. Much. Fun. …
2/ … And yet: early on, *hardly anybody actually used it.*
Dapper was certainly _valuable_ (we saved GOOG untold $10Ms in latency improvements alone) but not “day-to-day essential.” Why?
3/ Dapper’s early-days usage issues boiled down to two core challenges:
a) The insights were *restricted to the tracing (span) telemetry*
b) Those insights could only be accessed *from Dapper.* (And hardly anybody “started” in Dapper from a UX standpoint)
4/ Eventually we did make some progress (e.g., @JMacDee1’s brilliant integration into Google’s /requestz – the ancestor of OpenCensus zpages: opencensus.io/zpages/#tracez).
Yet it still didn’t feel like Dapper was vital, can’t-live-without-it technology for most SREs and devs.
5/ Now, rather than step back to think about *how* we might harvest the insights from Dapper and integrate them into daily workflows, we let the project “evolve in place” – and I regret that.
Anyway, I wanted to work on something “more P0,” so I talked with lots of SREs.
6/ At the time, what tool *did* every SRE at Google use every day?
Borgmon.
And what tool caused every SRE at Google endless frustration and pain?
Also Borgmon.
So we created Monarch: scalable, HA monitoring that was also, well, *usable*. vldb.org/pvldb/vol13/p3…
7/ The complete story about Monarch’s early days is an interesting one, but it will have to wait for a different thread/post (too long!). What I would emphasize, though, is that Monarch only tried to solve the *monitoring* problem, not the *observability* problem.
8/ And while I am proud of the team’s technical accomplishments (Monarch is a *vast* system: over 220,000 (!) processes in steady-state), I regret that we stopped at “monitoring.” Why did such an *expensive* system have such limitations?
Correction: I *really* regret that.
9/ So what would a scalable, HA monitoring product look like if observability were built into its fabric, into its very infrastructure? If monitoring were there to measure critical signals, and observability were there to explain changes to those signals?
10/ (TBH, we never even *tried* to build that at Google… though admittedly it would have been very difficult to take on given all of the hurdles that large companies bring to any and every development process.)
11/ So, ultimately, of course there were ∞ small regrets, and two *big* regrets:
I) We built Dapper to find patterns in traces, but we failed to make those findings *discoverable.*
II) We built Monarch for core monitoring, but we failed to make that monitoring *actionable.*
12/ Why am I telling this story now?
Well, this week, after years of effort and experimentation, we at @LightstepHQ are ready to share some news.
0/ I’m tired of hearing about observability replacing monitoring. It’s not going to, and that’s because it shouldn’t.
Observability will not _replace_ monitoring, it will _augment_ monitoring.
Here’s a thread about observability, and how monitoring can evolve to fit in: 👇
1/ Let’s start with the diagram (above) illustrating the anatomy of observability. There are three layers:
I. (Open)Telemetry: acquire high-quality data with minimal effort
II. Storage: “Stats over time” and “Transactions over time”
III. Benefits: *solve actual problems*
2/ The direction for “Telemetry” is simple: @opentelemetry.
(This is the (only) place where the so-called “three pillars” come in, by the way. If you think you’ve solved the observability problem by collecting traces, metrics, and logs, you’re about to be disappointed. :-/ )
0/ This is a thread about *Logging* and how – for decades – it’s been a needlessly “selfish” technology.
And how that should change.
I promise this eventually gets concrete and involves real examples from production. :)
👇
1/ First off, a crucial clarification: I don’t mean that the “loggers” – that is, the human operators – are selfish, of course! The problem has been that their (IMO primitive) toolchain needlessly localizes and *constrains* the value of the logging telemetry data.
2/ How? Well, traditional logging tech encourages the code-owner to add logs *in order to explain what’s happening with their code, in production, to themself.* Or, maybe, to other/future owners of that particular patch of code.
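To make the contrast concrete, here’s a minimal sketch using Python’s stdlib `logging` module. Everything here is illustrative (the field names, the `log_structured` helper, the `trace_id` value are all made up) – the point is just the difference between a free-text message that only the code-owner can interpret and a structured event that anyone (or any tool) can join against traces and metrics:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

# The "selfish" pattern: a free-text message that mostly makes sense
# to the person who wrote the surrounding code.
selfish = logging.getLogger("liveview")
selfish.warning("retrying flush, attempt 3")

def log_structured(logger: logging.Logger, message: str, **fields) -> str:
    """Emit one JSON-encoded log event with explicit, queryable fields."""
    record = {"message": message, **fields}
    line = json.dumps(record, sort_keys=True)
    logger.warning(line)
    return line

# The less selfish alternative: the same event, but with structured
# fields that tie it to a distributed trace and a service identity.
structured = logging.getLogger("liveview.structured")
line = log_structured(
    structured,
    "retrying flush",
    attempt=3,
    trace_id="abc123",   # hypothetical ID linking this log to a trace
    service="liveview",
)
```

With fields like `trace_id` attached, the log line stops being a private note-to-self and becomes telemetry that other teams and tools can correlate.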
0/ Sometimes we should philosophize about observability… and sometimes we should just get ultra-pragmatic and examine real use cases from real systems!
Here is one about a bad deploy we had at @LightstepHQ the other day. Let’s get started with a picture…
Thread 👇
1/ In this example, we are contending with a failed deploy within Lightstep’s own (internal, multi-tenant) system. It was easy enough to *detect* the regression and roll back, but in order to fix the underlying issue, of course we had to understand it.
2/ We knew the failure was related to a bad deploy of the `liveview` service. The screenshot above shows `liveview` endpoints, ranked by the “biggest change” for the new release; at the top is “ExplorerService/Create” with a huge (!!) increase in error ratio.
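The “biggest change” ranking in that screenshot can be sketched in a few lines. This is not Lightstep’s actual implementation – the endpoint names and counts below are toy data – but it shows the idea: compute each endpoint’s error ratio before and after the deploy, then sort by the increase:

```python
def error_ratio(errors: int, total: int) -> float:
    """Fraction of requests that errored; 0.0 when there was no traffic."""
    return errors / total if total else 0.0

def rank_by_error_ratio_change(old: dict, new: dict) -> list:
    """Return (endpoint, delta) pairs, largest error-ratio increase first."""
    deltas = []
    for endpoint, counts in new.items():
        before = error_ratio(*old.get(endpoint, (0, 0)))
        after = error_ratio(*counts)
        deltas.append((endpoint, after - before))
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)

# endpoint -> (error_count, total_requests), one dict per release (toy data)
old_release = {"ExplorerService/Create": (2, 1000),
               "ExplorerService/Get": (5, 5000)}
new_release = {"ExplorerService/Create": (310, 1000),
               "ExplorerService/Get": (6, 5000)}

ranked = rank_by_error_ratio_change(old_release, new_release)
# The regressed endpoint surfaces at the top of the ranking.
```

Ranking by the *delta* (rather than the absolute error ratio) is what makes this useful for deploys: a chronically flaky endpoint won’t drown out the one that actually changed.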
0/ Now that organizations are building or buying observability, they are realizing that it can get really damned expensive. And not just “expensive,” but “expensive and out of control.”
This is a thread about *observability value:* both the benefits and the costs.
1/ You hear so much about observability because it *can* be awesome. :) Benefits roll up into at least one of the following:
- Reducing latencies or error rates (foreach service)
- Reducing MTTR (also foreach service)
- Improving velocity or communication (foreach team)
2/ But most observability vendors charge based on something that has literally no value on its own: *the telemetry.*
This is rough for customers, especially since these vendors provide no mechanism to scale or *control* the telemetry volume (why would they? it’s $$$!).
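For the record, mechanisms to control telemetry volume do exist – one common technique is deterministic, trace-ID-based ratio sampling. Here’s a minimal sketch (the 10% rate and the `should_sample` helper are illustrative, not any vendor’s API): because every service hashes the same trace ID to the same keep/drop decision, sampled traces stay complete end-to-end:

```python
import hashlib

def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Keep a trace iff its hashed ID falls below the rate threshold.

    Deterministic: every service seeing this trace_id decides the same way.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash onto [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Keep roughly 10% of 10,000 synthetic trace IDs.
kept = [tid for tid in (f"trace-{i}" for i in range(10_000))
        if should_sample(tid, 0.10)]
```

A ~10% sampling rate cuts trace telemetry cost by roughly 10x – which is exactly why a vendor paid *per byte ingested* has little incentive to hand you this knob.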
0/ Sometimes we should philosophize about observability.
And sometimes we should just get pragmatic and examine real-world use cases in real-world systems! So here is a simple example of what cutting-edge observability can do today.
We begin with an SLI that looks off…
1/ A quick prologue: this real-world example comes from @LightstepHQ’s meta-monitoring (of our own SaaS). This way I can show real data at scale (Lightstep customers generate billions of traces every hour!!) without needing approval from customer eng+PR departments.
2/ So, we run a microservice called “maggie” (stands for “m”etrics “agg”regator). It had this weird blip at about 12:30pm. That’s not supposed to happen, so the obvious question is “why?”