It's a bit counterintuitive, but the better-instrumented and more mature your systems are, the fewer problems you'll find via automated alerting and the more you'll have to find by sifting around in production by hand.
Becoming well versed in exploring your systems via production tooling has never been a more important part of being a good engineer.

It's also never been *easier* to derive rich insights. (why, in MY day, all we had was sar and *stat AND WE LIKED IT)

Apologies to whoever originally made this awesome gif about testing in production, but it holds just as true for alerting and debugging. 🙃
The overwhelming majority of the failures in your system and the bugs in your code will never create problems that rise to the level of your paging alerts.

And thank heavens for that, or your pager would be buzzing *nonstop*.
But that doesn't mean they aren't *there*. And it doesn't mean they aren't a constant drag on your performance and productivity. These failures manifest as bug reports, customer frustration, flakiness, and mysterious behaviors that no one can reproduce.
We talk a lot about how important it is to reduce the pager volume, to stop paging on symptoms, to move to SLO-based alerting. All of this is absolutely true.

But we don't always talk about the fact that you need to *pair* that cleanup with time and tooling upstream.
Imagine that you just shipped a small feature or change to your storage engine. It passes tests, all alerts are green. Do you beam proudly and move on?

No, you *poke* at it for a while. You look at your instrumentation to see whether the new counters are incrementing as expected.
You watch the count(*) of new users as they click on it, with full row details, so you can spot any odd patterns in the success events (is it succeeding exclusively from one device type?).
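
To make that concrete: here's a minimal sketch of the kind of question you'd ask, assuming your events land somewhere queryable as raw rows with a hypothetical schema (event name, outcome, device_type, user_id). It runs against an in-memory SQLite table purely to show the shape of the query; in real life you'd point the same question at whatever your observability tooling is.

```python
import sqlite3

# Hypothetical schema: one wide row per request that hits the new code path.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        ts          TEXT,
        event       TEXT,   -- e.g. 'storage.new_feature'
        outcome     TEXT,   -- 'success' | 'error' | 'unknown'
        device_type TEXT,
        user_id     TEXT
    )
""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        ("2021-11-01T10:00:00Z", "storage.new_feature", "success", "ios",     "u1"),
        ("2021-11-01T10:00:05Z", "storage.new_feature", "success", "ios",     "u2"),
        ("2021-11-01T10:00:09Z", "storage.new_feature", "success", "android", "u3"),
        ("2021-11-01T10:00:12Z", "storage.new_feature", "error",   "ios",     "u4"),
    ],
)

# "Are successes coming from every device type, or suspiciously from just one?"
for device_type, n in conn.execute("""
    SELECT device_type, COUNT(*) AS n
    FROM events
    WHERE event = 'storage.new_feature' AND outcome = 'success'
    GROUP BY device_type
    ORDER BY n DESC
"""):
    print(f"{device_type:10s} {n}")
```

The SQL itself isn't the point; the point is slicing success events by a dimension like device type, where a lopsided distribution is often your first hint of a bug.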

You watch the count(*) and contents of any errors or unknown events. Maybe you set a watch on it,
so you'll get a Slack ping if they start skyrocketing for some reason. Maybe you peek at memory and CPU usage.
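
A "watch" here doesn't have to be fancy; most observability tools have triggers built in, but the shape of it is just a threshold check that pings a channel. Here's a minimal sketch, assuming a hypothetical count_recent_errors() query against your event store and a placeholder Slack incoming-webhook URL:

```python
import json
import urllib.request

ERROR_THRESHOLD = 50  # errors per 15-minute window; pick from your own baseline
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def count_recent_errors() -> int:
    """Placeholder: run your count(*) of error/unknown events for the new code path."""
    raise NotImplementedError("query your event store here")

def check_and_notify() -> None:
    # Run this on a schedule (cron, a CI job, whatever) while the change is baking.
    n = count_recent_errors()
    if n > ERROR_THRESHOLD:
        payload = {"text": f":rotating_light: new storage path: {n} errors in the last 15m"}
        req = urllib.request.Request(
            WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```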

You wait a couple of days, then come back and check on it again as part of your morning routine, along with checking tickets and deciding what to do today.
You keep an eye out for outliers of any type until you're satisfied it's baked. Because you *know* that production is the only place interesting bugs show up, you know that takes time, and you know they won't show their rotten little faces in any aggregates.
Deploying software, after all, is not like flipping a switch. It is not blue or green, or off and on. It is the *beginning* of the process of validating your software in production.
I feel like the only reason this isn't as universal as running tests is that until recently, the tooling wouldn't *allow* you to ask the kinds of questions you need to ask to stick the landing.

You need raw rows, high-cardinality and high-dimensionality data, and traces to ask them.
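
Concretely, that means one wide event per request, with whatever fields might matter later. The names below are illustrative, not a schema you have to adopt; the high-cardinality fields (user_id, build_id, trace_id) are exactly what pre-aggregated metrics throw away and raw rows preserve.

```python
# A sketch of a single wide event -- illustrative field names, not a prescribed schema.
event = {
    "timestamp":   "2021-11-01T10:00:12Z",
    "service":     "storage-engine",
    "event":       "storage.new_feature",
    "outcome":     "error",
    "error":       "checksum mismatch",
    "duration_ms": 184.2,
    "device_type": "ios",
    "app_version": "7.3.1",
    "user_id":     "u4",            # high cardinality: unique-ish per user
    "build_id":    "2021.11.01-3",  # ties the event back to the deploy
    "trace_id":    "f0a9c2d4e1",    # links this row into a distributed trace
    "region":      "us-east-1",
}
```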
With modern observability tooling, though, it is terribly easy and doesn't take much time at all.

Not going back to check on your code is flying blind. That check is the last and most important step in developing software, and it will never again be this easy to find your bugs.
The cost of finding and fixing bugs rises *exponentially* the longer it takes between writing them and finding them.

I can't find the Facebook engineering article I used to cite, but this one isn't bad:

deepsource.io/blog/exponenti…
So, if:

* you have to run in prod to find the bugs
* it gets exponentially more expensive to fix them

...you see how profound that moment is when you first turn your code on in prod. Anything you do to empower engineers at that point is probably the best investment you can make.
