LARGE SYSTEMS USUALLY OPERATE IN FAILURE MODE, via @dangolant

Or like I used to say, your distributed system exists in a continuous state of partial degradation. There are bugs and flakes and failures all the way down, and hardly any of them ever matter. Until they do.
This is why observability matters. SLOs make large multitenant systems tractable from the top down, but observability makes them comprehensible from the bottom up.
Maybe only .001% of all software system behaviors and bugs ever need to be closely inspected and understood, but that tiny percentage defines the success of your business and the happiness of your users.

And you CANNOT predict what will matter in advance.
Remember, it's not just about "is this broken?" It is equally about "how does this work, and what is my user experiencing?"

The more equipped you are to answer the latter, and the more actively you seek those answers out, the less you will experience them as breakage.
Issues get *exponentially* more expensive to discover and fix the longer it takes to find them. You CANNOT rely on monitoring to find problems with the intersection of your code, your infra and your users.

If you try, you will doom yourself to a life of reactivity and toil.
We have to learn to be more proactive about examining that intersection. You instrument as you go, merge smaller diffs more often, autodeploy and keep a lid on delivery times.

Your job isn't done until you have closed the loop by checking your instrumentation in production.
And not just like, "is it up? is it down?" but rather,

* what is the distribution of response times by normalized query? by raw query?

* what is the breakdown of response codes for a particular user for this endpoint?

* are 504s dominated by any particular user-agent string, browser, header, etc?

Only you know what changes you are hoping and expecting to see in the system after your changes roll out. Look for them specifically.
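
To make that concrete, here is a minimal sketch of the kind of instrumentation that lets you ask those questions later: one wide, structured event per request, carrying the fields you'll want to slice by. Everything here (the emit_event and normalize_query helpers, the field names, the JSON-to-stdout transport) is illustrative, not any particular vendor's API.

```python
import json
import re
import sys
import time
import uuid


def normalize_query(raw_query: str) -> str:
    """Collapse literals so structurally identical queries group together.

    Hypothetical helper -- real normalization usually lives in your DB
    driver or instrumentation library.
    """
    return re.sub(r"('[^']*'|\b\d+\b)", "?", raw_query)


def emit_event(event: dict) -> None:
    """Ship one wide event per request to your observability backend.

    Stand-in transport: newline-delimited JSON on stdout.
    """
    print(json.dumps(event), file=sys.stdout)


def handle_request(user_id: str, user_agent: str, raw_query: str) -> int:
    """Do the work, then emit a single event describing what happened."""
    event = {
        "request_id": str(uuid.uuid4()),
        "endpoint": "/api/export",  # illustrative endpoint name
        "user_id": user_id,
        "user_agent": user_agent,
        "raw_query": raw_query,
        "normalized_query": normalize_query(raw_query),
    }
    start = time.monotonic()
    status_code = 200
    try:
        pass  # ... the actual work of the endpoint goes here ...
    except TimeoutError:
        status_code = 504
    event["status_code"] = status_code
    event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
    emit_event(event)
    return status_code
```

With events shaped like this, "checking your instrumentation in production" stops being guesswork: every question above is just a group-by or a filter on fields you already captured.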

You also need to develop a sixth sense for "something is weird". Which you can only do by checking up on your code in prod daily, habitually.
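
For illustration, here is what one of those habitual check-ins might look like against the events sketched above, assuming they've been exported as newline-delimited JSON. The file name, user id, and field names are illustrative, not a prescribed workflow.

```python
import pandas as pd

# Load the wide events emitted by the service (assumed NDJSON export).
events = pd.read_json("events.ndjson", lines=True)

# Distribution of response times by normalized query (p50 / p95 / p99).
latency = (
    events.groupby("normalized_query")["duration_ms"]
    .quantile([0.50, 0.95, 0.99])
    .unstack()
)
print(latency)

# Breakdown of response codes for one particular user on this endpoint.
one_user = events[events["user_id"] == "user-1234"]
print(one_user["status_code"].value_counts())

# Are 504s dominated by any particular user-agent string?
timeouts = events[events["status_code"] == 504]
print(timeouts["user_agent"].value_counts(normalize=True).head(10))
```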
