Charity Majors

@mipsytipsy

https://twitter.com/digbyevan/status/1148204214612418561

If your tool aggregates at write time and strips context, it is not a debugging tool, because you cannot answer specific questions.

At best, it may get you close enough that a few lucky guesses or intuitive leaps can land you on the right answer (or some part of it).
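
To make the "strips context" point concrete, here is a minimal sketch in Python. The event fields (`minute`, `user_id`, `shard`, `status`) and the `error_counter` helper are made up for illustration, not taken from any real tool. It shows two very different incidents collapsing into the same stored metric once you aggregate at write time.

```python
from collections import Counter

def error_counter(events):
    """Write-time aggregation: keep only errors per minute, drop all context."""
    return Counter(e["minute"] for e in events if e["status"] >= 500)

# Scenario A (hypothetical): one user on one shard produces every error.
one_bad_user = [
    {"minute": 0, "user_id": 555, "shard": 2, "status": 500} for _ in range(4)
]

# Scenario B (hypothetical): four different users each produce one error.
many_users = [
    {"minute": 0, "user_id": uid, "shard": uid % 3, "status": 500}
    for uid in (101, 202, 303, 404)
]

# Both scenarios collapse to the same stored metric.
print(error_counter(one_bad_user) == error_counter(many_users))

# The raw events still tell them apart.
print({e["user_id"] for e in one_bad_user}, {e["user_id"] for e in many_users})
```

Once only the counter is stored, nothing you do at read time can tell scenario A from scenario B. That is the sense in which you can't ask it a specific question.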

This role of sitting between devs and their code, interpreting low-level systems graphs and translating them into the language of services and endpoints and what your code is /actually/ doing, has long been filled by your harried ops professional. Ops fingers in the o11y dikes.

Ops builds up enough scar tissue over time that we can make those intuitive leaps to almost *anything*, because we've seen everything.

Not gonna lie, it feels great to be a wizard. 🔮🧹🌚

But then it all falls apart once the system is too complex and ephemeral to fit in your head and reason about, and unknown unknowns start to outpace the known unknowns.

I'll miss being a wizard, but I'm ready to stop being a translation layer. Better shit to do with our time.

Is it clear what I mean when I say you can't build a debugging tool with metrics? 🤔 Hmm.. maybe an example will help. Let's see. 🤔🤔

Imagine you're looking at your dashboards and you see a big spike in errors. So you start investigating.

You start flipping through your other dashboards, looking for other shapes that correlate to the error spike. (Right?)

You dismiss some of them because your intuition or knowledge of the system tells you they are likely effects of the errors, not causes of them.

But some look suspicious.

You zero in on a few in particular: it looks like the errors are only to a particular shard, and from there you can narrow it down further: the errors are only to the primary node, there was a spike in SELECT queries and queue length shortly before, and your disk IOPS and nscanned are consistent with a pattern you've seen before where a user launches a bot and it takes a couple minutes to get auto throttled. You check the log and this user did get throttled. Satisfied, you move on.

Except, that wasn't the problem.

You jumped to the log and looked for a thing. So you didn't notice that actually LOTS of users got throttled. The actual culprit was an index running or some other write lock being held which caused everyone's queries to back up and clients to issue retries, and the autothrottle went on a rampage.

With metrics and dashboards, you are always looking for whatever you managed to predict. If instead you start with the error spike, then break down by user or app etc., and follow the data where it takes you, you don't have to lean on guesses anymore.
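
Here is roughly what "start with the spike and break down" could look like, as a small Python sketch over raw events. The field names (`user_id`, `shard`, `throttled`) and the `break_down` helper are assumptions for the example, not any particular tool's API.

```python
from collections import defaultdict

# Hypothetical events from the spike window; names are illustrative only.
spike_events = [
    {"user_id": 555, "shard": 2, "throttled": True},
    {"user_id": 101, "shard": 2, "throttled": True},
    {"user_id": 202, "shard": 2, "throttled": True},
    {"user_id": 303, "shard": 2, "throttled": True},
    {"user_id": 404, "shard": 1, "throttled": False},
]

def break_down(events, dimension, predicate):
    """Count events matching predicate, grouped by one dimension."""
    counts = defaultdict(int)
    for e in events:
        if predicate(e):
            counts[e[dimension]] += 1
    return dict(counts)

throttled_by_user = break_down(spike_events, "user_id", lambda e: e["throttled"])
throttled_by_shard = break_down(spike_events, "shard", lambda e: e["throttled"])

print(len(throttled_by_user), "distinct users throttled")
print(throttled_by_shard)
```

Grepping the log for user 555 would have "confirmed" the bot theory. Grouping by user shows four users throttled, all on the same shard, which points at something shared rather than at one user.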

Another super common version of this dashboard blindness is when you have a clump of errors and you check your logs to see if they all have $x in common, and they do! ... but they also have ten other dimensions in common.

It is literally mathematically impossible to derive this information from a metrics-based system. Using events, it can be done. (Using Honeycomb it's a breeze: just draw a bubble around the spike, we precompute the baseline vs. bubble for all dimensions, and you can see what the differences are in a second or two. Feels like magic.)

So if all the errors happen to be for requests by user ID 555, shard 2, shopping cart id 10565, item 23, region us-west-1, client type iOS, version 13, app version 9099, timezone GMT...

... you can either pattern match enough from metrics dashboards to guess that you need to jump into your logs and start grepping around to find the culprit,

or you can draw a bubble around the spike and get a list of the dimensions that differ.
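
The bubble-versus-baseline idea can be sketched in a few lines of Python. To be clear, this is not Honeycomb's implementation; it is a toy under stated assumptions: events are plain dicts, `compare_selection` and `min_gap` are hypothetical names, and "differs" just means a value's share in the selection exceeds its share in the baseline by some gap.

```python
from collections import Counter

def dimension_shares(events, dimension):
    """Fraction of events holding each value of one dimension."""
    total = len(events)
    if total == 0:
        return {}
    counts = Counter(e.get(dimension) for e in events)
    return {value: n / total for value, n in counts.items()}

def compare_selection(selection, baseline, min_gap=0.5):
    """Return (dimension, value, selection share, baseline share) for values
    that are much more common in the selection than in the baseline."""
    dimensions = sorted({key for e in selection + baseline for key in e})
    findings = []
    for dim in dimensions:
        sel = dimension_shares(selection, dim)
        base = dimension_shares(baseline, dim)
        for value, share in sel.items():
            base_share = base.get(value, 0.0)
            if share - base_share >= min_gap:
                findings.append((dim, value, share, base_share))
    # Largest gap first.
    return sorted(findings, key=lambda f: f[2] - f[3], reverse=True)

# Hypothetical data: the "bubble" around the spike, and everything else.
selection = [
    {"user_id": 555, "shard": 2, "client": "iOS", "region": "us-west-1"},
    {"user_id": 555, "shard": 2, "client": "iOS", "region": "us-west-1"},
]
baseline = [
    {"user_id": 1, "shard": 1, "client": "iOS", "region": "us-west-1"},
    {"user_id": 2, "shard": 2, "client": "android", "region": "us-west-1"},
    {"user_id": 3, "shard": 3, "client": "iOS", "region": "us-west-1"},
    {"user_id": 4, "shard": 1, "client": "web", "region": "us-west-1"},
]

for finding in compare_selection(selection, baseline, min_gap=0.6):
    print(finding)
```

Every error here shares region us-west-1, but so does all the baseline traffic, so region drops out. Only the dimensions that actually separate the spike from normal traffic (user and shard in this toy data) come back.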

You always have to jump from metrics dashboards into something, to narrow it down to a real answer. Metrics only get you within guessing distance.

If you're lucky. So good luck.

Honeycomb.io/signup
