Tools can only help if you know what you're doing.
Culture is _everything_. Changing the culture of the team is an intentional effort that takes everyone on the team and everyone adjacent.
Production excellence must involve all of these stakeholders, or you will leave folks out and it will be unsustainable.
* know where to start
* and be able to debug
* ... debug together, if you span services
* and pay down complexity, reduce duplication and drudgery
[ed: i feel bad about not posting all this terrific art, but i could not keep up live tweeting from my phone. THIS IS WAY HARDER THAN IT LOOKS]
Enter Service Level Indicators!
You have to establish some arbitrary thresholds, so you can bucket your events into good and bad. (Non user-impacting events are *excluded* from these calculations.)
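A minimal sketch of that bucketing, under stated assumptions: the `Event` fields, the 300 ms latency threshold, and the "5xx means bad" rule are all hypothetical choices made for illustration and do not come from the talk. The point is only that an SLI is good events divided by eligible events, with non-user-impacting events dropped before counting.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical threshold: the talk only says thresholds are arbitrary.
LATENCY_THRESHOLD_MS = 300.0


@dataclass
class Event:
    """One request as seen by the service (field names are illustrative)."""
    status: int
    latency_ms: float
    user_facing: bool = True


def is_good(event: Event) -> bool:
    """An event is 'good' if it did not fail server-side and was fast enough."""
    return event.status < 500 and event.latency_ms <= LATENCY_THRESHOLD_MS


def sli(events: Iterable[Event]) -> Optional[float]:
    """Fraction of eligible (user-impacting) events that were good.

    Returns None when there are no eligible events, since the ratio
    is undefined in that case.
    """
    eligible = [e for e in events if e.user_facing]
    if not eligible:
        return None
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)
```

For example, with two good requests, one slow request, one failed request, and one internal health check, the health check is excluded and the SLI comes out to 2 good out of 4 eligible.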
A good SLO barely keeps users happy.
If you are going to run out of SLO within minutes, maybe you want to wake someone up. If you aren't going to run out for days, let them sleep ffs.
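One way to sketch that decision is to estimate how long the remaining error budget will last at the current burn rate and pick an action from that. The function names and the cutoffs here (page if the budget is gone within 1 hour, file a ticket if within 72 hours) are assumptions for illustration, not numbers from the talk.

```python
def hours_until_exhausted(budget_remaining: float, burn_per_hour: float) -> float:
    """Estimate hours until the error budget hits zero at the current burn rate.

    budget_remaining: bad events the SLO still allows in this window.
    burn_per_hour: bad events currently being consumed per hour.
    """
    if burn_per_hour <= 0:
        return float("inf")  # not burning budget at all
    return budget_remaining / burn_per_hour


def alert_action(
    budget_remaining: float,
    burn_per_hour: float,
    page_within_hours: float = 1.0,     # hypothetical cutoff
    ticket_within_hours: float = 72.0,  # hypothetical cutoff
) -> str:
    """Return 'page', 'ticket', or 'none' based on time to budget exhaustion."""
    hours = hours_until_exhausted(budget_remaining, burn_per_hour)
    if hours <= page_within_hours:
        return "page"    # budget gone very soon: wake someone up
    if hours <= ticket_within_hours:
        return "ticket"  # budget gone in days: handle in working hours
    return "none"        # budget is safe for now: let them sleep
```

With 100 bad events of budget left, burning 600 per hour gives about ten minutes of runway and returns `"page"`, while burning 2 per hour gives about two days and returns `"ticket"`.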
You cannot act on what you don't measure. Just start with something and iterate; something is always better than nothing.
perfect SLO > good SLO >>> no SLO
The job of calibrating an SLO is never done; it will need to be continually revised through conversations with stakeholders.
Our outages are never wholly predictable; the same thing doesn't happen exactly twice.
It's the only way to deal with new and complex failures.
Everyone can and *must* contribute to production excellence, but this can take many forms. It is a fiercely human-scale problem.
[ed: we learn best when it's FUN! tap into our sense of play, and we forget we are learning at all <3]
But also: how many users do they affect? Control your blast radius by ✨Progressive Delivery✨ (woop woop) and other strategies.
And always address the risks that threaten the SLO. The SLO is intensely clarifying.
You all must trust and believe in the SLO.
If you can't understand your systems, bringing them back up from an outage will be a terrifying open-ended process that will take much longer than it should.
TLDR: Production Excellence brings teams closer together. 🧡Measure.❤️ Debug.💜 Collaborate.💙 Fix.💚