It's a bit counterintuitive, but the better-instrumented and more mature your systems are, the fewer problems you'll find via automated alerting and the more you'll have to find by sifting around in production by hand.
Becoming well versed in exploring your systems via production tooling has never been a more important part of being a good engineer.

It's also never been *easier* to derive rich insights. (why, in MY day, all we had was sar and *stat AND WE LIKED IT)

Apologies to whoever originally made this awesome gif about testing in production, but it holds just as true for alerting and debugging. 🙃
The overwhelming majority of the failures in your system and the bugs in your code will never create problems that rise to the level of your paging alerts.

And thank heavens for that, or your pager would be buzzing *nonstop*.
But that doesn't mean they aren't *there*. And it doesn't mean they aren't a constant drag on your performance and productivity. These failures manifest as bug reports, customer frustration, flakiness, and mysterious behaviors that no one can reproduce.
We talk a lot about how important it is to reduce the pager volume, to stop paging on symptoms, to move to SLO-based alerting. All of this is absolutely true.

But we don't always talk about the fact that you need to *pair* that cleanup with time and tooling upstream.
Imagine that you just shipped a small feature or change to your storage engine. It passes tests, all alerts are green. Do you beam proudly and move on?

No, you *poke* at it for a while. You look at your instrumentation to see whether the new counters are incrementing as expected.
You watch the count(*) of new users as they click on it, with full row details, so you can spot any odd patterns in the success events (is it succeeding exclusively from one device type?).
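
To make that concrete: here's a minimal sketch of the kind of question you'd ask, assuming your events land somewhere queryable as raw rows with a hypothetical schema (event name, outcome, device_type, user_id). It runs against an in-memory SQLite table purely to show the shape of the query; in real life you'd point the same question at whatever your observability tooling is.

```python
import sqlite3

# Hypothetical schema: one wide row per request that hits the new code path.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        ts          TEXT,
        event       TEXT,   -- e.g. 'storage.new_feature'
        outcome     TEXT,   -- 'success' | 'error' | 'unknown'
        device_type TEXT,
        user_id     TEXT
    )
""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        ("2021-11-01T10:00:00Z", "storage.new_feature", "success", "ios",     "u1"),
        ("2021-11-01T10:00:05Z", "storage.new_feature", "success", "ios",     "u2"),
        ("2021-11-01T10:00:09Z", "storage.new_feature", "success", "android", "u3"),
        ("2021-11-01T10:00:12Z", "storage.new_feature", "error",   "ios",     "u4"),
    ],
)

# "Are successes coming from every device type, or suspiciously from just one?"
for device_type, n in conn.execute("""
    SELECT device_type, COUNT(*) AS n
    FROM events
    WHERE event = 'storage.new_feature' AND outcome = 'success'
    GROUP BY device_type
    ORDER BY n DESC
"""):
    print(f"{device_type:10s} {n}")
```

The SQL itself isn't the point; the point is slicing success events by a dimension like device type, where a lopsided distribution is often your first hint of a bug.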

You watch the count(*) and contents of any errors or unknown events. Maybe you set a watch on it,
so you'll get a Slack ping if they start skyrocketing for some reason. Maybe you peek at memory and CPU usage.
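
A "watch" here doesn't have to be fancy; most observability tools have triggers built in, but the shape of it is just a threshold check that pings a channel. Here's a minimal sketch, assuming a hypothetical count_recent_errors() query against your event store and a placeholder Slack incoming-webhook URL:

```python
import json
import urllib.request

ERROR_THRESHOLD = 50  # errors per 15-minute window; pick from your own baseline
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def count_recent_errors() -> int:
    """Placeholder: run your count(*) of error/unknown events for the new code path."""
    raise NotImplementedError("query your event store here")

def check_and_notify() -> None:
    # Run this on a schedule (cron, a CI job, whatever) while the change is baking.
    n = count_recent_errors()
    if n > ERROR_THRESHOLD:
        payload = {"text": f":rotating_light: new storage path: {n} errors in the last 15m"}
        req = urllib.request.Request(
            WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```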

You wait a couple of days, then come back and check on it again as part of your morning routine, along with checking tickets and deciding what to do today.
You keep an eye out for outliers of any type until you're satisfied it's baked. Because you *know* that production is the only place interesting bugs show up, you know that takes time, and you know they won't show their rotten little faces in any aggregates.
Deploying software, after all, is not like flipping a switch. It is not blue or green, or off and on. It is the *beginning* of the process of validating your software in production.
I feel like the only reason this isn't as universal as running tests is that until recently, the tooling wouldn't *allow* you to ask the kinds of questions you need to ask to stick the landing.

You need raw rows, high-cardinality and high-dimensionality data, and traces to ask them.
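
Concretely, that means one wide event per request, with whatever fields might matter later. The names below are illustrative, not a schema you have to adopt; the high-cardinality fields (user_id, build_id, trace_id) are exactly what pre-aggregated metrics throw away and raw rows preserve.

```python
# A sketch of a single wide event -- illustrative field names, not a prescribed schema.
event = {
    "timestamp":   "2021-11-01T10:00:12Z",
    "service":     "storage-engine",
    "event":       "storage.new_feature",
    "outcome":     "error",
    "error":       "checksum mismatch",
    "duration_ms": 184.2,
    "device_type": "ios",
    "app_version": "7.3.1",
    "user_id":     "u4",            # high cardinality: unique-ish per user
    "build_id":    "2021.11.01-3",  # ties the event back to the deploy
    "trace_id":    "f0a9c2d4e1",    # links this row into a distributed trace
    "region":      "us-east-1",
}
```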
With modern observability tooling, though, it is terribly easy and doesn't take much time at all.

Not going back to check on your code is flying blind. That check is the last and most important step in developing software, and it will never again be this easy to find your bugs.
The cost of finding and fixing bugs rises *exponentially* the longer it takes between writing them and finding them.

I can't find the Facebook engineering article I used to cite, but this one isn't bad:

deepsource.io/blog/exponenti…
So, if:

* you have to run in prod to find the bugs
* it gets exponentially more expensive to fix them

...you see how profound that moment is when you first turn your code on in prod. Anything you do to empower engineers at that point is probably the best investment you can make.
