Charity Majors @mipsytipsy
I've begun to see the inexorable sprawl of alerts, monitoring checks and dashboards as a deep well of technical debt.
you have an outage, or some system-impacting event. you resolve it. you call a postmortem or retrospective. at the end, someone asks:

"can we make a dashboard so we can find the problem immediately next time?" and
"what alert can we set up, to notify us when this happens?"
fast-forward a year. how many dashboards does your team have to wade through? how many alerts wake you up in the dark of night? how much time do you spend tending and curating them or tweaking thresholds?

does everyone even use the same ones, or are you fragmented?
monitoring checks, alerts on symptoms, pane-of-glass dashboards -- the trusty-rusty tools of yore are powerful tools for stable systems of knowable scope.

the frame you want for modern systems is debugging, not monitoring. if you can't spot a problem at a glance, you shouldn't try.
think about BI tools. they would think you batshit crazy if you said "here's a pile of dashboards, which one represents the user behavior you are currently trying to understand?"

it's not hard. but it's an exploration game. you follow the breadcrumbs where they take you.
no, you should not add a paging alert for every symptom that may or may not signal a problem worth escalating about.

no, you should not add a monitoring check for every system state that sometimes represents a problem worth escalating about.

no, not another dashboard. just no.
here's the truth about alerts: the overwhelming majority of problems that ever happen in a system do not AND SHOULD NOT generate an alert. esp during off hours. esp paging alerts.
in a distributed system, innumerable bugs and catastrophic states exist at any time. i.e. in your system, right now.

but you aren't going to notice many (if not most) of the bugs or problems, and the badness needs to rise to a certain level to even be worth your time.
the only paging alerts you really need are request rate, errors, latency, and some end-to-end checks that traverse the critical code paths, probably around what makes you money. (if you're larger, you need this set for each service.)
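a minimal sketch of what that short list could look like in code, assuming you already collect these four signals per service. the names (`ServiceSnapshot`, `PagingThresholds`, `paging_reasons`) and every threshold value are made up for illustration; pick your own numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceSnapshot:
    """Hypothetical point-in-time view of one service."""
    requests_per_sec: float
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    e2e_check_passed: bool     # did the critical-path end-to-end check succeed?


@dataclass(frozen=True)
class PagingThresholds:
    """Illustrative limits only; these values are assumptions, not advice."""
    min_requests_per_sec: float = 1.0
    max_error_rate: float = 0.05
    max_p99_latency_ms: float = 2000.0


def paging_reasons(snap: ServiceSnapshot,
                   limits: PagingThresholds = PagingThresholds()) -> List[str]:
    """Return the reasons to page a human; an empty list means nobody gets woken up."""
    reasons = []
    if snap.requests_per_sec < limits.min_requests_per_sec:
        reasons.append("request rate dropped")
    if snap.error_rate > limits.max_error_rate:
        reasons.append("error rate too high")
    if snap.p99_latency_ms > limits.max_p99_latency_ms:
        reasons.append("latency too high")
    if not snap.e2e_check_passed:
        reasons.append("end-to-end check failed")
    return reasons


healthy = ServiceSnapshot(250.0, 0.01, 340.0, True)
broken = ServiceSnapshot(250.0, 0.20, 340.0, False)
print(paging_reasons(healthy))
print(paging_reasons(broken))
```

the point of the sketch is the size of the list: four checks per service, and nothing else gets to page.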
all your other paging alerts are technical debt. they're a symptom of your inability to explore your systems and ask simple questions in an effective and timely way. they're a bandage over your archaic tooling.
(god, the number of times i remember relying on a cluster of paging alerts to go off ... to signal a problem in a COMPLETELY UNRELATED COMPONENT. glaarrrgh.)
likewise debugging with static dashboards isn't debugging. it's pattern-matching with your eyeballs.

dashboard-flipping isn't science. with science you ask questions -- you formulate a hypothesis, you test it. you follow your breadcrumbs where they lead you.
and once you have forty thousand static dashboards, you're just drowning in them.

every dashboard is an artifact of some past failure, and the data sources may or may not even be working, and your team's entire view of the world has fragmented. so just fuck dashboards.
you can't model the system in your head any more, and you shouldn't try. get that shit out of your head and into a tool, where you can interact with it. ... and more importantly so can your team.
have a few blessed entry points that are maintained and shared by the team. make exploration the expectation, debug by interacting with the problem not by flipping through dashboards.

send nearly all "alerts" to a non-paging channel with an SLA of hours, not minutes.
(honestly just set all your alerts to email only and see what happens 😈

it probably won't be worse. it might be better.)
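as a rough sketch of that routing rule: a tiny allowlist pages, everything else goes to email with a response window measured in hours. the alert names, the `Route` type, and the 5 minute and 8 hour windows are all assumptions for illustration, not anything a real alerting tool prescribes.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumption: the only alerts allowed to page are the core signals.
PAGING_ALERTS = frozenset({"request_rate", "error_rate", "latency", "end_to_end_check"})


@dataclass(frozen=True)
class Route:
    channel: str               # "pager" or "email" (hypothetical channel names)
    respond_within: timedelta  # the SLA for a human to look at it


def route_alert(alert_name: str) -> Route:
    """Page only for the allowlisted alerts; send everything else to email."""
    if alert_name in PAGING_ALERTS:
        return Route(channel="pager", respond_within=timedelta(minutes=5))
    return Route(channel="email", respond_within=timedelta(hours=8))


for name in ("error_rate", "disk_usage_warning", "replica_lag"):
    print(name, route_alert(name))
```

the default branch is the design choice here: a new alert is non-paging unless someone deliberately adds it to the allowlist.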