I would actually argue that, with the right tooling, production is *exactly* the most effective place to take chances, make mistakes, and get messy. 🙃 Otherwise you're like this guy.
Here are the slides for a talk of Liz's that I modified slightly and delivered on Wednesday for the O'Reilly InfraOps superstream. speakerdeck.com/charity/observ…

We walk you through the Honeycomb backend, some of the ways we perform chaos engineering, and some infamous outages, to show just how swiftly, accurately, and powerfully you can manipulate systems with modern tooling (feature flags, fast delivery, superb observability) and do whatever the fuck you want in prod without hurting your users.
I even included the full speakers' notes, since there isn't a ton of text on the slides. (You're welcome.)

It culminates in this ridiculous yet totally serious conversation, which I love.
The moral of the story is, error budgets are there to be used to find problems during peak daytime hours, so you don't find them in the middle of the night.

USE them.
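For anyone who hasn't worked with error budgets, here is a minimal sketch of the arithmetic, assuming an event-based SLO. The 99.9% target, the event counts, and the function names are illustrative assumptions, not Honeycomb's actual numbers or code.

```python
def error_budget(slo_target: float, total_events: int) -> int:
    """Events allowed to fail in the window while still meeting the SLO."""
    return round((1.0 - slo_target) * total_events)


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative once it is blown)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return (budget - failed_events) / budget


# Hypothetical window: 1,000,000 requests against a 99.9% target.
allowed = error_budget(0.999, 1_000_000)         # 1000 failures allowed
left = budget_remaining(0.999, 1_000_000, 250)   # 0.75 of the budget unspent
```

Whatever is left over at the end of the window is budget you could have spent on a daytime experiment.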
My other favorite part of this talk is towards the very end, where we talk about the Kafka Month of Pain that didn't result in blowing any customer SLOs, but did exceed our internal human SLOs. Which are just as important.
Honeycomb on-call teams have an SLO of not getting paged more than twice a week, or having to work an incident outside of business hours more than once every six months.

We were burning out our people, so we had to call a halt to the flurry of changes.
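The human SLO above is simple enough to check mechanically. This is a rough sketch, not how Honeycomb tracks it: the two thresholds come from the thread, while the trailing-window definitions (7 days, 182 days), the function names, and the assumption that incidents are already classified as "outside business hours" are mine.

```python
from datetime import datetime, timedelta

MAX_PAGES_PER_WEEK = 2              # from the thread
MAX_OFF_HOURS_INCIDENTS = 1         # per six months, from the thread
WEEK = timedelta(days=7)            # assumption: trailing 7-day window
SIX_MONTHS = timedelta(days=182)    # assumption: roughly half a year


def count_recent(events, now, window):
    """Count timestamps that fall inside the trailing window ending at `now`."""
    return sum(1 for t in events if now - window < t <= now)


def human_slo_breached(pages, off_hours_incidents, now):
    """True if either on-call limit has been exceeded as of `now`."""
    too_many_pages = count_recent(pages, now, WEEK) > MAX_PAGES_PER_WEEK
    too_many_incidents = (
        count_recent(off_hours_incidents, now, SIX_MONTHS) > MAX_OFF_HOURS_INCIDENTS
    )
    return too_many_pages or too_many_incidents


# Hypothetical on-call history: three pages in the last three days.
now = datetime(2022, 1, 21, 12, 0)
pages = [now - timedelta(days=d) for d in (1, 2, 3)]
breached = human_slo_breached(pages, [], now)  # True: three pages exceeds two
```

A breach here is a signal to slow down the rate of change, the same way a blown customer SLO would be.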
Also, I learned while preparing for this talk that we have an official policy that "incident responders are encouraged to expense meals for themselves and family during an incident", which just fucking makes sense once you think about it.
One of our core values is that "we hire adults", and adults have responsibilities outside of work. You cannot build a healthy, sustainable sociotechnical system if you don't account for that.


More from @mipsytipsy

15 Jan
LARGE SYSTEMS USUALLY OPERATE IN FAILURE MODE, via @dangolant

Or like I used to say, your distributed system exists in a continuous state of partial degradation. There are bugs and flakes and failures all the way down, and hardly any of them ever matter. Until they do.
This is why observability matters. SLOs make large multitenant systems tractable from the top down, but observability makes them comprehensible from the bottom up.
Maybe only .001% of all software system behaviors and bugs ever need to be closely inspected and understood, but that tiny percentage defines the success of your business and the happiness of your users.

And you CANNOT predict what will matter in advance.
13 Jan
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have limits
bodies have l
I was homeschooled, and escaped to college when I was 15. I was a seething mess of pent-up rage and ambition (and undiagnosed ADHD) who had never done any sort of formal schooling. I had no idea what I wanted to do other than ALL OF IT. RIGHT NOW.
You're supposed to register for 12-15 credits, so I promptly registered for 24 (plus I had a piano performance scholarship I was supposed to maintain).

I didn't have any family support, money, or ability to take out loans, so I signed up for three local minimum wage jobs.
9 Nov 21
it's a bit counterintuitive, but the better-instrumented and the more mature your systems are, the fewer problems you'll find with automated alerting and the more you'll have to find by sifting around in production by hand.
Becoming well versed in exploring your systems via production tooling has never been a more important part of being a good engineer.

It's also never been *easier* to derive rich insights. (why, in MY day, all we had was sar and *stat AND WE LIKED IT)

Apologies to whoever originally made this awesome gif about testing in production, but it holds just as true for alerting and debugging. 🙃
29 Oct 21
hey man, you know me, I don't like talking smack about others, and I'm not sitting over here whittling and looking for excuses to litigate people's usage of the word observability.

but then there's this chronosphere.io/wp-content/upl…
and this chronosphere.io/learn/explain-…

and i go 🤯🥵😵‍💫🤯
they are literally describing monitoring. good ol', 30-year-old traditional monitoring.

* Notify
* Triage
* Understand

this is a company with a billion dollar valuation and they literally don't know the difference between monitoring and observability
i mean, we can all argue over the subtleties of observability and that's relatively understandable, but doesn't fucking EVERYBODY know what *monitoring* is and does?

cause it hasn't changed. in like.. ever
20 Oct 21
good morning kittens, guess what honeycomb been up to? oh not much really, we've only just STAVED OFF OUR OWN INEVITABLE DEMISE AND DESTRUCTION, 🔥YET AGAIN🔥.

We can hardly even fail if we try for another two, three years now! Take that, heat death of the universe!🪐🌑 💜
(There, second time's the charm. Sorry!)

I wonder if it will ever stop feeling so bizarre just to still exist. 🙃 The list of people we are grateful for and permanently indebted to gets longer and longer and longer with each passing year.
From our investors, who are principled, curious, endlessly thoughtful and helpful -- nothing like the stories and stereotypes about VCs that tend to filter down to eng circles -- to our family members, especially anyone who had to live with us those early few years 😬
27 Sep 21
I've been talking to lots of teams about their observability journey, or how they managed to dig themselves out of hell and get a handle on shit. Some patterns definitely emerge.
The first thing many teams look at is the on call rotation. (Smart; heading straight for the pain.)

Folks are worn out, product is upset whenever something unexpected comes up -- it's a bad scene, because they're too tightly coupled. ANY non feature work means a deadline slips.
So the first thing they do is enact a simple rule: no product work during on call weeks. Period. Those weeks are for fixing and maintaining the system.

This forces leadership to plan for using 75-85% of full capacity as a steady state. Whew; now we have some flex in the system.
