Charity Majors

@mipsytipsy

When people ask me how they can convince their bosses to shell out for observability, I often toss them two links -- the DORA report by @nicolefv et al and the @stripe developer productivity paper.

And now, thanks to @jasonallen206, I have a third link: m.subbu.org/incidents-tren…

The link is a roundup by @sallamar of several hundred production outages, with some fascinating (and well executed) attempts at grouping by proximate cause and breaking down by impact to users. m.subbu.org/incidents-tren…

Pie chart-induced eye trauma aside, his findings are blunt, and resonate completely with my experiences running systems. To wit:

1) change is the trigger in 2/3 of outages
2) config drift is deadly
3) we don't know why things fail
4) infra changes are a shrinking %
5) certs lol

Due to hyperconnectedness, ripple effects, and "hope-driven releases", none of these trends are going away any time soon, and in fact *they are all going to accelerate*. Sorry.

The distributedness of systems is increasingly their most salient characteristic.

And check out that point on how the share of incidents that are infra related is small and decreasing!

Hardware isn't failing any less, it's just been successfully made into someone else's job. Ops is moving up the stack.

He has a few very sensible and obvious recommendations on how to deal with this. I will summarize them as "make your system tolerant to faults and resilient to failures."

"..except that's hard? So invest in observability and spend more time ACTUALLY UNDERSTANDING your systems."

And change safety. Apply slightly less duct tape and slightly more rigor to your change control.

Or as I keep barking, invest real developer hours into instrumentation, your CI/CD pipeline, and deployment code, and practice observability-driven development.

As requested: the DORA report (devops-research.com) and the Stripe developer productivity paper (stripe.com/reports/develo…)

One last thing. You notice what isn't mentioned? Better monitoring. Monitoring is ~useless to developers shipping services and debugging code.

Monitor a few high level metrics and end-to-end checks, absolutely. But aggregates and counters and metrics won't do ✨jack shit✨ when it comes to understanding emergent behaviors or high granularity outliers.
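To make the outlier point concrete, here is a tiny sketch in Python. The events, field names, and the 1000 ms threshold are all invented for illustration; nothing here comes from a real system. It shows a median that looks perfectly healthy while one customer is having a terrible time, which you can only see if you kept the raw per-request events.

```python
from statistics import mean, median

# Hypothetical per-request events; field names and values are made up.
events = [{"customer": f"c{i % 9}", "duration_ms": 20} for i in range(99)]
events.append({"customer": "c_unlucky", "duration_ms": 9000})

durations = [e["duration_ms"] for e in events]
avg_ms = mean(durations)       # pulled up a little by the single outlier
median_ms = median(durations)  # looks completely healthy

# The raw events still carry the dimensions needed to see who is hurting.
# The 1000 ms cutoff is an arbitrary threshold chosen for this example.
slow_customers = sorted({e["customer"] for e in events if e["duration_ms"] > 1000})

print(f"mean={avg_ms:.1f}ms median={median_ms}ms slow={slow_customers}")
```

A pre-aggregated counter would have thrown away the `customer` field before you knew you needed it.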

You need to debug in the language you develop in. You need request-level instrumentation with ordering so you can flip back and forth between request traces and request aggregates.

Only @honeycombio gives you that. That's why our customers give such breathless happy quotes. 💖🐝
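As a rough sketch of what "flipping between traces and aggregates" means, assume each unit of work emits one structured event carrying a trace id and a sequence number. The event shape and the helper names `trace_view` and `aggregate_view` are hypothetical, written in plain Python for illustration; this is not Honeycomb's API or data model.

```python
from collections import defaultdict

# Hypothetical events: one per step of a request, tagged with trace id and order.
events = [
    {"trace_id": "b", "seq": 1, "name": "auth", "endpoint": "/search", "duration_ms": 5},
    {"trace_id": "a", "seq": 2, "name": "db.query", "endpoint": "/cart", "duration_ms": 40},
    {"trace_id": "a", "seq": 1, "name": "auth", "endpoint": "/cart", "duration_ms": 5},
    {"trace_id": "b", "seq": 2, "name": "db.query", "endpoint": "/search", "duration_ms": 900},
    {"trace_id": "a", "seq": 3, "name": "render", "endpoint": "/cart", "duration_ms": 10},
]


def trace_view(events, trace_id):
    """Return one request's events in the order they happened."""
    matching = (e for e in events if e["trace_id"] == trace_id)
    return sorted(matching, key=lambda e: e["seq"])


def aggregate_view(events, key):
    """Sum duration per distinct value of `key` across all requests."""
    totals = defaultdict(int)
    for e in events:
        totals[e[key]] += e["duration_ms"]
    return dict(totals)


by_endpoint = aggregate_view(events, "endpoint")           # the wide view
steps_for_a = [e["name"] for e in trace_view(events, "a")]  # one request, in order
print(by_endpoint, steps_for_a)
```

Both views come from the same events, so you can start from an aggregate that looks off (say, `/search`) and drop into the individual requests behind it.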

Observability is never going to be as turnkey easy as the old school black box monitoring agents, because it has to come from your code.

That said, it's pretty damn easy -- just install the gem or go get package or whatever. If you wanna try us, here are three links:

🌈Our white papers and ebooks on observability, honeycomb.io/resources/whit…

🌈Play in a sandbox, honeycomb.io/play

🌈Sign up for a trial, ui.honeycomb.io/signup

Onward, unto the undebuggable breaches of tomorrow. 🐝📈❤️
