Charity Majors

@mipsytipsy

34 tweets, 19 min read

Yay, @lizthegrey taking the stage at @qconlondon in the huge keynote auditorium to tell us morality tales about complex systems!!

@lizthegrey telling horror stories right off the bat. "So I spun up an infra, I did all the right things... now I have a bazillion dashboards and outages take forever and only one person actually knows how to debug anything and holy shit are my engineers getting cranky"

@lizthegrey You're drowning in operational overload, and *no amount of tooling* is going to help if you don't have the right mental model and the right plan.

Tools can only help if you know what you're doing.

@lizthegrey You forgot about who runs your software: people do. Your plan has to be human-centered and people-focused. This requires ... dun-dun-dun ... ✨Production Excellence✨.

@lizthegrey You have to invest in making your systems more reliable and friendly. It's not okay to feed your systems with the blood of your humans.

Culture is _everything_. Changing the culture of the team is an intentional effort that takes everyone on the team and everyone adjacent.

@lizthegrey Engineers: when was the last time you invited sales, marketing, execs, your customer support folks to your meetings?

Production excellence must involve all of these stakeholders, or you will leave folks out and it will be unsustainable.

@lizthegrey A big part of production excellence is building people up: increasing their confidence so they are willing to touch prod. You have to encourage asking questions, and make people feel safe taking some time to think.

@lizthegrey (callout to @sarahjwells's keynote, where developers wouldn't touch MySQL to restart the db for 20 min because they were so traumatized by their ops team)

@lizthegrey @sarahjwells so where do we start?

* know where to start
* and be able to debug
* ... debug together, if you span services
* and pay down complexity, reduce duplication and drudgery

@lizthegrey @sarahjwells our systems are ✨always failing.✨ your systems should be resilient to a million different types of errors.

[ed: i feel bad about not posting all this terrific art, but i could not keep up live-tweeting from my phone. THIS IS WAY HARDER THAN IT LOOKS]

@lizthegrey @sarahjwells We cannot and should not have to care about any and every failure or error on our systems. So how do we decide *which* to care about?

Enter Service Level Indicators!

@lizthegrey @sarahjwells How do you establish an SLI? Well, this is where collaboration comes in. Ask your product managers -- what delights users? What annoys them? Ask around -- ask all your stakeholders. They will have Opinions!

@lizthegrey @sarahjwells (ugh still no way to remove tagged people from tweet threads, sorry sarah)

You have to establish some arbitrary thresholds, so you can bucket your events into good and bad. (Non user-impacting events are *excluded* from these calculations.)
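
[ed: a minimal sketch of that bucketing step in Python, not from the talk. The 300 ms threshold, the Event fields, and the function names are all assumptions made up for illustration; the real thresholds come out of those stakeholder conversations.]

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold, chosen only for this example.
LATENCY_THRESHOLD_MS = 300


@dataclass
class Event:
    status: int         # HTTP-style status code
    latency_ms: float
    user_facing: bool   # False for health checks, internal probes, etc.


def is_good(event: Event) -> bool:
    """An event is 'good' if it did not fail server-side and was fast enough."""
    return event.status < 500 and event.latency_ms <= LATENCY_THRESHOLD_MS


def compute_sli(events: list) -> Optional[float]:
    """Fraction of user-facing events that were good (None if there were none)."""
    eligible = [e for e in events if e.user_facing]  # non-user-impacting events are excluded
    if not eligible:
        return None
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)


events = [
    Event(200, 120, True),
    Event(200, 450, True),   # too slow, counts as bad
    Event(503, 80, True),    # server error, counts as bad
    Event(200, 95, True),
    Event(500, 10, False),   # internal probe, excluded entirely
]
sli = compute_sli(events)    # 2 good out of 4 eligible events
```

The excluded probe never touches the numerator or the denominator, which is the whole point of excluding it.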

@lizthegrey @sarahjwells Now you can turn your SLI into an SLO. Use a window (1 month, 3 months) and a target percentage. If you reset the window every month, you will run into a very serious problem: users have memories, and will remember that you were down yesterday even if your monthly uptime is a shiny new 100%.
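
[ed: a small sketch of the window problem, assuming you keep per-day (good, total) counts. The dates, the traffic numbers, and both function names are invented for the example.]

```python
from datetime import date, timedelta


def rolling_compliance(daily_counts, as_of, window_days=30):
    """Good/total over the trailing window ending on as_of (inclusive)."""
    start = as_of - timedelta(days=window_days - 1)
    good = total = 0
    for day, (g, t) in daily_counts.items():
        if start <= day <= as_of:
            good += g
            total += t
    return good / total if total else None


def calendar_month_compliance(daily_counts, as_of):
    """Good/total since the first of as_of's month: the 'reset every month' view."""
    good = total = 0
    for day, (g, t) in daily_counts.items():
        if (day.year, day.month) == (as_of.year, as_of.month) and day <= as_of:
            good += g
            total += t
    return good / total if total else None


# Hypothetical traffic: 1000 requests a day from Feb 1 through Mar 1,
# with half of them failing on Feb 28.
counts = {date(2019, 2, 1) + timedelta(days=i): (1000, 1000) for i in range(29)}
counts[date(2019, 2, 28)] = (500, 1000)
today = date(2019, 3, 1)

fresh = calendar_month_compliance(counts, today)  # 1.0: the reset hides yesterday
rolling = rolling_compliance(counts, today)       # about 0.983: yesterday still counts
```

The rolling number is the one that matches what users remember.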

@lizthegrey @sarahjwells A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.
A good SLO barely keeps users happy.

@lizthegrey @sarahjwells You can and should drive alerting with SLOs, and drive business decision-making with SLOs.

If you are going to run out of SLO within minutes, maybe you want to wake someone up. If you aren't going to run out for days, let them sleep ffs.
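
[ed: one way to sketch "how soon do we run out?" in Python. This is an illustration, not the talk's method: the function names and the 1 hour / 72 hour cutoffs are assumptions you would tune for your own team.]

```python
def hours_until_budget_exhausted(slo_target, window_total_events, bad_so_far, bad_per_hour):
    """Estimate hours left before the error budget for this window is gone."""
    budget = (1 - slo_target) * window_total_events  # allowed bad events in the window
    remaining = budget - bad_so_far
    if remaining <= 0:
        return 0.0              # already out of budget
    if bad_per_hour <= 0:
        return float("inf")     # not burning at all right now
    return remaining / bad_per_hour


def alert_action(hours_left, page_below_hours=1.0, ticket_below_hours=72.0):
    """Wake someone only if the budget is about to run out; otherwise let them sleep."""
    if hours_left <= page_below_hours:
        return "page"
    if hours_left <= ticket_below_hours:
        return "ticket"
    return "none"


# Hypothetical numbers: 99.9% target over 1,000,000 events, 400 already bad.
fast_burn = hours_until_budget_exhausted(0.999, 1_000_000, 400, bad_per_hour=1200)
slow_burn = hours_until_budget_exhausted(0.999, 1_000_000, 400, bad_per_hour=5)

fast_action = alert_action(fast_burn)   # about half an hour left
slow_action = alert_action(slow_burn)   # about five days left
```

Same SLO, same budget, very different urgency depending on how fast it is burning.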

@lizthegrey @sarahjwells If you are bleeding error budget, maybe you need to invest in more reliability (dollars, engineering effort, process)

You cannot act on what you don't measure. Just start with something and iterate; something is always better than nothing.

perfect SLO > good SLO >>> no SLO

@lizthegrey @sarahjwells If you have a significant outage and nobody complains, maybe you are calibrated too high.

The job of calibrating an SLO is never done; it will need to be continually revised through conversations with stakeholders.

@lizthegrey @sarahjwells But SLIs and SLOs are only half the picture. They will tell you when something is wrong, but will never tell you what to do about it.

Our outages are never wholly predictable; the same thing doesn't happen exactly twice.

@lizthegrey @sarahjwells We have to build observable systems, and collect data at a level that will let us question our systems' inner state without shipping new code to handle that question.
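
[ed: a toy illustration of what "question the system without shipping new code" can look like, assuming each request already emits one wide event with plenty of context. The field names and values here are invented.]

```python
from collections import Counter

# Hypothetical wide events: one dict per request, captured up front
# with as much context as we could attach.
events = [
    {"endpoint": "/checkout", "status": 500, "build_id": "b42", "region": "eu", "duration_ms": 900},
    {"endpoint": "/checkout", "status": 200, "build_id": "b41", "region": "eu", "duration_ms": 110},
    {"endpoint": "/cart",     "status": 503, "build_id": "b42", "region": "us", "duration_ms": 40},
    {"endpoint": "/checkout", "status": 500, "build_id": "b42", "region": "eu", "duration_ms": 870},
    {"endpoint": "/cart",     "status": 200, "build_id": "b41", "region": "us", "duration_ms": 95},
]


def error_breakdown(events, field):
    """Count failed events grouped by any field that was captured on the event."""
    return Counter(e.get(field) for e in events if e["status"] >= 500)


# New questions, asked after the fact, with no new instrumentation:
by_region = error_breakdown(events, "region")    # failures spread across regions
by_build = error_breakdown(events, "build_id")   # every failure is on build b42
```

Nobody had to predict in advance that build_id would be the interesting dimension; it only had to be on the event.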

It's the only way to deal with new and complex failures.

@lizthegrey @sarahjwells But let's take it a step further. Can you mitigate your impact immediately, and debug later? Focus on your SLI and SLO, then someone can look at the data the next day and figure out what went wrong ... at their leisure, during daytime hours.

@lizthegrey @sarahjwells SLOs and observability go hand in hand ... but they are not enough. They can't create collaboration on their own, and debugging is not a solo activity. Your shit talks to other people's shit and many different stakeholders have pieces of the full picture in their head.

@lizthegrey @sarahjwells Debugging works a lot better when we all put our heads together. Debugging is for everyone. Give everyone access to the tools, encourage everyone to learn the skill sets. Encourage curiosity and exploration and asking questions.

@lizthegrey @sarahjwells Collaboration is intensely interpersonal. It's also about creating a growth mindset in your teams, reminding people that none of us were born knowing how to debug systems and it's natural to need to learn things continually.

@lizthegrey @sarahjwells Production excellence demands that operations must be sustainable. And sustainability means *flexibility*.

Everyone can and *must* contribute to production excellence, but this can take many forms. It is a fiercely human-scale problem.

@lizthegrey @sarahjwells Sharing knowledge is the best way to conquer hero culture. Don't praise people for solving problems themselves or being the hero, and don't shame people for asking questions or needing help.

@lizthegrey @sarahjwells Reward curiosity and practice team work. Game days, wheels of misfortune, and other games can help establish these habits.

[ed: we learn best when it's FUN! tap into our sense of play, and we forget we are learning at all <3]

@lizthegrey @sarahjwells Quantify risks by frequency & impact.

But also: how many users do they affect? Control your blast radius by ✨Progressive Delivery✨ (woop woop) and other strategies.

And always address the risks that threaten the SLO. The SLO is intensely clarifying.
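
[ed: a back-of-the-envelope sketch of that risk math, under my own assumptions: expected bad minutes per year = frequency x duration x fraction of users affected, compared against a yearly error budget. The risks and every number below are made up.]

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    incidents_per_year: float
    minutes_per_incident: float
    fraction_of_users: float   # blast radius, from 0.0 to 1.0

    @property
    def expected_bad_minutes(self) -> float:
        return self.incidents_per_year * self.minutes_per_incident * self.fraction_of_users


def annual_budget_minutes(slo_target: float) -> float:
    """Minutes of full-outage-equivalent badness a yearly SLO allows."""
    return (1 - slo_target) * 365 * 24 * 60


def rank_risks(risks, slo_target):
    """Sort risks by expected cost and report each as a share of the error budget."""
    budget = annual_budget_minutes(slo_target)
    ranked = sorted(risks, key=lambda r: r.expected_bad_minutes, reverse=True)
    return [(r.name, r.expected_bad_minutes, r.expected_bad_minutes / budget) for r in ranked]


# Hypothetical risks. The canary row is the same bad deploy, caught at 5% of users.
risks = [
    Risk("database failover", 2, 10, 1.0),
    Risk("bad deploy behind 5% canary", 12, 30, 0.05),
    Risk("full-fleet bad deploy", 12, 30, 1.0),
]
ranked = rank_risks(risks, slo_target=0.999)
```

With these made-up numbers the full-fleet deploy alone eats most of a 99.9% budget, while the same deploy behind a canary barely registers. That is the blast-radius argument in one table.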

@lizthegrey @sarahjwells The SLO also provides you a business cudgel to swing around to stop feature development or get resources from management, when the SLO is threatened and users complain.

You all must trust and believe in the SLO.

@lizthegrey @sarahjwells Lack of observability is a critical systemic risk.

If you can't understand your systems, bringing them back up from an outage will be a terrifying open-ended process that will take much longer than it should.

@lizthegrey @sarahjwells Lack of collaboration is also a critical systemic risk.

TLDR: Production Excellence brings teams closer together. 🧡Measure.❤️ Debug.💜 Collaborate.💙 Fix.💚

@lizthegrey @sarahjwells from questions: liz mentions a google rule that no operator should have to deal with more than two incidents per shift -- you're burned out after that.

some of my favorite art from the talk ☺️📈🌈

https://twitter.com/mipsytipsy/status/1102991558620663809?s=21

Lol oops I accidentally truncated this thread, here is the second half

https://twitter.com/mipsytipsy/status/1102991558620663809?s=21
