Charity Majors (@mipsytipsy)
How can you predict what data Future-You will need to have gathered to debug an unpredictable problem? (And if you *could* predict the problem, why do you need a fancy debugging tool?)

I hear this a lot, and the answer gets at the heart of the difference between monitoring and observability (o11y).
Monitoring is very much biased towards actionable alerts, and as such it trucks in identifying the same complex failure conditions, rapidly and repeatedly. Your known-unknowns.
Good observability tools are very different. They are exploratory: they let you slice and dice and play with the data, following the breadcrumbs wherever they lead you.

Think about how business intelligence tools work. You don’t know where you’re going til you get there.
This exploratory approach can be slower at dealing with known-unknowns. But it’s the only game in town for unknown-unknowns.
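
To make "slice and dice" concrete, here's a minimal sketch of that exploratory loop in plain Python over a pile of request events: pick a dimension, group the errors by it, see if anything clusters, pick another. The events and field names are made up for illustration, not a real dataset or tool.

```python
# Minimal sketch of the exploratory loop over a list of request events.
# Events and field names are made up for illustration.
from collections import Counter

events = [
    {"status": 500, "build_id": "v2.3.1", "region": "us-east-1", "user_id": "u-1042"},
    {"status": 200, "build_id": "v2.3.0", "region": "us-east-1", "user_id": "u-2211"},
    {"status": 500, "build_id": "v2.3.1", "region": "eu-west-1", "user_id": "u-9937"},
]

def error_breakdown(events, field):
    """Group the errors by an arbitrary dimension -- chosen at question-time,
    not at instrumentation-time."""
    return Counter(e[field] for e in events if e["status"] >= 500)

# Follow the breadcrumbs: try one dimension, then another.
print(error_breakdown(events, "region"))    # Counter({'us-east-1': 1, 'eu-west-1': 1}) -- meh
print(error_breakdown(events, "build_id"))  # Counter({'v2.3.1': 2}) -- there's your cluster
```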

You’re screwed if you have to rely on a few humans magically holding the system in their heads and reasoning far beyond anything your tools can show.
So: how do you predict what data you will need to gather? Unlike with monitoring, you don’t have to know nearly as much in advance.

First: the question is often "which {service, node, version, shard, cluster, etc.} is this coming from?" ... so instrument the shit out of every network hop.
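
Here's a minimal sketch of what instrumenting one outbound hop can look like in plain Python: time the call, record where it went and how it ended, emit one structured event. The emit() sink and every field name here are illustrative stand-ins, not any particular vendor's API.

```python
# Hedged sketch of instrumenting an outbound network hop:
# time the call, record where it went and how it ended, emit one event.
import json
import time
import urllib.request
from urllib.parse import urlparse

def emit(event):
    print(json.dumps(event))  # stand-in for however you actually ship events

def instrumented_get(url, peer_service=None, request_id=None):
    start = time.monotonic()
    event = {
        "hop": "http_client",
        "peer.service": peer_service,        # which service is this request going to?
        "peer.host": urlparse(url).netloc,   # node / shard / cluster live in fields like this
        "request_id": request_id,
    }
    try:
        resp = urllib.request.urlopen(url, timeout=5)
        event["status"] = resp.status
        return resp
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        emit(event)

# Usage (hypothetical service and IDs):
# instrumented_get("https://user-api.internal.example/users/42",
#                  peer_service="user-api", request_id="req-8f3a")
```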
Second: the next question is often "ok, I found an error ... who or what else is experiencing this, and what do they have in common?" You'll want to tag everything with UUID, query, unique request ID, build ID, region, and any other high-cardinality slice you can think of.
Low-cardinality details rarely tell you shit worth knowing. ("Ohh, boyeee, all the errors are in us-east-1?!" ... and they're all using MySQL. So is everything else you run.)

High-cardinality attributes give you the exact fucking residential address of the needle you seek, and all the losers on its block too.
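
A hedged sketch of what that tagging can look like in practice: stamp every event with the request's high-cardinality identifiers at emit time, so any of them can become a group-by key later. All the field names here (request_id, build_id, and friends) are illustrative assumptions, not a prescribed schema.

```python
# Hedged sketch: merge per-request, high-cardinality context into every event.
# All field names are illustrative assumptions, not a prescribed schema.
import json
import os

def emit(event):
    print(json.dumps(event))  # stand-in for however you actually ship events

def event_for(request_ctx, **fields):
    """Attach the request's high-cardinality identifiers to an event."""
    return {
        "request_id": request_ctx.get("request_id"),
        "user_id": request_ctx.get("user_id"),
        "query_shape": request_ctx.get("query_shape"),
        "build_id": os.environ.get("BUILD_ID", "unknown"),
        "region": os.environ.get("REGION", "unknown"),
        **fields,
    }

ctx = {
    "request_id": "req-8f3a",
    "user_id": "u-1042",
    "query_shape": "SELECT ... FROM carts WHERE user_id = ?",
}
emit(event_for(ctx, name="cart.checkout", status=500, shard="shard-117"))
```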
In summary: gather all the information you can about the movement of the request, and all the detail you can about the context of the request.

System metrics and language internals are nice-to-haves: usually useless, and mostly a crutch.
The dirty secret of distributed systems is how often "fixing" the errors just means "identify the component with a problem super fast, then route around it, decommission it, or programmatically attempt to return it to a known-good state."
Understanding and diving deep into weird problems is *hard*, man. Takes time. Getting to take the time to really deeply explore a weird edge case is a luxury.