If your tool aggregates at write time and strips context, it is not a debugging tool, because you cannot use it to answer specific questions.

At best, it may get you close enough that a few lucky guesses or intuitive leaps can land you on the right answer (or some part of it).
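A rough sketch of the difference, in Python. The event fields and both function names are made up for illustration, not taken from any particular tool. The write-time aggregate keeps a per-endpoint error count and nothing else; the raw events can still be grouped by whatever field you think of later.

```python
from collections import Counter

# Hypothetical request events; every field name here is illustrative.
events = [
    {"endpoint": "/cart", "user_id": 555, "shard": 2, "status": 500},
    {"endpoint": "/cart", "user_id": 555, "shard": 2, "status": 500},
    {"endpoint": "/cart", "user_id": 812, "shard": 1, "status": 200},
    {"endpoint": "/login", "user_id": 301, "shard": 3, "status": 200},
]

def aggregate_at_write_time(events):
    """Keep only a per-endpoint error count, roughly what a metric stores."""
    errors = Counter()
    for e in events:
        if e["status"] >= 500:
            errors[e["endpoint"]] += 1
    return dict(errors)

def errors_by(events, dimension):
    """Group errors by any field, chosen at query time."""
    return dict(Counter(e[dimension] for e in events if e["status"] >= 500))

metric = aggregate_at_write_time(events)  # user and shard are gone for good
by_user = errors_by(events, "user_id")    # still answerable from raw events
by_shard = errors_by(events, "shard")
```

Once only `metric` has been stored, "which user?" and "which shard?" are no longer answerable, because that context was thrown away before anyone knew to ask.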
This role of sitting between devs and their code, interpreting low-level systems graphs and translating them into the language of services and endpoints and what your code is /actually/ doing, has long been filled by your harried ops professional. Ops fingers in the o11y dikes.
Ops builds up enough scar tissue over time that we can make those intuitive leaps to almost *anything*, because we've seen everything.

Not gonna lie, it feels great to be a wizard. 🔮🧹🌚
But then it all falls apart once the system is too complex and ephemeral to fit in your head and reason about, and unknown unknowns start to outpace the known unknowns.

I'll miss being a wizard, but I'm ready to stop being a translation layer. Better shit to do with our time.
Is it clear what I mean when I say you can't build a debugging tool with metrics? 🤔 Hmm.. maybe an example will help. Let's see. 🤔🤔

Imagine you're looking at your dashboards and you see a big spike in errors. So you start investigating.
You start flipping through your other dashboards, looking for other shapes that correlate to the error spike. (Right?)

You dismiss some of them because your intuition or knowledge of the system tells you they are likely effects of the errors, not causes of them.
But some look suspicious.

You zero in on a few in particular: it looks like the errors are only to a particular shard, and from there you can narrow it down further: the errors are only to the primary node, there was a spike in SELECT queries and queue length shortly before, and
your disk IOPS and nscanned are consistent with a pattern you've seen before where a user launches a bot and it takes a couple minutes to get auto-throttled. You check the log and this user did get throttled. Satisfied, you move on.

Except, that wasn't the problem.
You jumped to the log and looked for a thing. So you didn't notice that actually LOTS of users got throttled. The actual culprit was an index build running or some other write lock being held, which caused everyone's queries to back up and clients to issue retries,
and the autothrottle went on a rampage.

With metrics and dashboards you are always looking for whatever you managed to predict. If instead you start with the error spike, then break down by user or app etc, and follow the data where it takes you, you don't have to
lean on guesses anymore.
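Here is a small Python sketch of those two approaches to the throttle log. The log records, user IDs and function names are all hypothetical. The first function is the "jump to the log and look for a thing" check; the second is the breakdown that lets the data tell you what happened.

```python
from collections import Counter

# Hypothetical throttle-log events from the window around the error spike.
throttle_events = [
    {"user_id": 555, "shard": 2, "action": "throttled"},
    {"user_id": 812, "shard": 2, "action": "throttled"},
    {"user_id": 301, "shard": 2, "action": "throttled"},
    {"user_id": 977, "shard": 2, "action": "throttled"},
]

def was_throttled(events, user_id):
    """The confirmation check: did the user I already suspect get throttled?"""
    return any(
        e["user_id"] == user_id and e["action"] == "throttled" for e in events
    )

def throttled_by_user(events):
    """The breakdown: who got throttled, without assuming the answer."""
    return Counter(e["user_id"] for e in events if e["action"] == "throttled")

suspect_confirmed = was_throttled(throttle_events, 555)
breakdown = throttled_by_user(throttle_events)
```

The confirmation check comes back true and you move on. The breakdown shows four different users throttled in the same window, which is hard to square with the one-bot story.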

Another super common version of this dashboard blindness is when you have a clump of errors and you check your logs to see if they all have $x in common, and they do! ... but they also have ten other dimensions in common.
It is literally mathematically impossible to derive this information from a metrics-based system. Using events it can be done. (Using Honeycomb it's a breeze, just draw a bubble around the spike, we precompute the baseline vs bubble for all dimensions, and you can see
what the differences are in a second or two. Feels like magic.)

So if all the errors happen to be for requests by user ID 555, shard 2, shopping cart id 10565, item 23, region us-west-1, client type iOS, version 13, app version 9099, timezone GMT...
... you can either pattern match enough from metrics dashboards to guess that you need to jump into your logs and start grepping around to find the culprit,

or you can draw a bubble around the spike and get a list of the dimensions that differ.
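To make the bubble idea concrete, here is a minimal Python sketch of a selection-vs-baseline comparison over raw events. This is only an illustration of the general idea, not how Honeycomb implements it; the function names and sample events are assumptions.

```python
from collections import Counter

def value_shares(events, dimension):
    """Fraction of events carrying each value of one dimension."""
    total = len(events)
    if total == 0:
        return {}
    counts = Counter(e.get(dimension) for e in events)
    return {value: n / total for value, n in counts.items()}

def compare(selection, baseline, dimensions):
    """Rank (dimension, value) pairs by how far their share in the
    selection differs from their share in the baseline."""
    rows = []
    for dim in dimensions:
        sel = value_shares(selection, dim)
        base = value_shares(baseline, dim)
        for value in set(sel) | set(base):
            rows.append((dim, value, sel.get(value, 0.0) - base.get(value, 0.0)))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)

# Hypothetical events: the "bubble" is the error spike, the baseline is the rest.
selection = [{"user_id": 555, "shard": 2, "client": "iOS"} for _ in range(3)]
baseline = [
    {"user_id": 1, "shard": 1, "client": "iOS"},
    {"user_id": 2, "shard": 2, "client": "iOS"},
    {"user_id": 3, "shard": 3, "client": "iOS"},
    {"user_id": 4, "shard": 1, "client": "iOS"},
]

ranking = compare(selection, baseline, ["user_id", "shard", "client"])
ios_diff = next(diff for dim, value, diff in ranking if (dim, value) == ("client", "iOS"))
```

In this toy data every error is from an iOS client, but so is everything else, so `client` falls to the bottom of the ranking while `user_id` 555 and `shard` 2 rise to the top. That separation between "in common" and "actually different" is what grepping for $x can't give you.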
You always have to jump from metrics dashboards into something, to narrow it down to a real answer. Metrics only get you within guessing distance.

If you're lucky. So good luck.