Users won’t care if you say everything looks good on your end, if they’re having a bad day. The only perspective that matters for measuring reliability is your users’ perspective. @drensin #alldaydevops
Users won’t notice any reliability you have, over the least reliable thing between them and your system. @drensin #alldaydevops
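(A rough sketch of the arithmetic behind that point, not something from the talk. It assumes every hop between the user and your system must work and that the hops fail independently, so availabilities multiply. The component names and numbers are hypothetical.)

```python
from math import prod


def end_to_end_availability(availabilities):
    """Availability a user sees when every component in the path must work.

    Assumes independent failures in a simple serial chain.
    """
    return prod(availabilities)


# Hypothetical path: a five-nines service reached over a less reliable network.
path = {"your_service": 0.99999, "isp": 0.9995, "home_wifi": 0.999}
overall = end_to_end_availability(path.values())

print(f"end-to-end availability: {overall:.5f}")
print(f"weakest single hop:      {min(path.values()):.5f}")
```

Under these assumptions the end-to-end number comes out below the weakest hop, so the extra nines on the service are invisible to this user.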
Adding an extra 9 of reliability is an order of magnitude increase in cost, at least 10x. @drensin #alldaydevops
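(To make "an extra 9" concrete, here is a small sketch that converts a number of nines into allowed downtime per year. The cost claim is the speaker's; the code only shows that each extra nine cuts the downtime allowance by 10x. The helper name is my own, and it assumes a 365-day year.)

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime allowance for a whole number of nines (3 means 99.9%)."""
    unavailability = 10 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


for n in (2, 3, 4, 5):
    target_percent = 100 * (1 - 10 ** (-n))
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines ({target_percent:.3f}%): about {minutes:.1f} minutes per year")
```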
Definitions from @drensin #alldaydevops (sorry for the glare and unintentional selfie)
SLAs are a business discussion, not an engineering discussion. @drensin #alldaydevops
The reason we go oncall is to learn what needs to be automated. @drensin #alldaydevops
You should think of SRE as a highly opinionated implementation of DevOps (class SRE inherits DevOps). @drensin #alldaydevops
Nice @lizthegrey shoutout in @drensin’s talk :) Liz is awesome and has done a ton to evangelize SRE.
Error budgets and SLOs prevent “intuition fatigue.” @drensin (I love that term! And have definitely experienced it.) #alldaydevops
SRE principles incentivize you to reduce complexity. And simpler systems are easier to reason about. @drensin #alldaydevops
Ahh this talk is so good. I really recommend that people interested in SRE track down the recording when it’s posted. Especially if you are newer to the concepts. @drensin explains them so well. #alldaydevops
A pacemaker needs 3 1/2 nines of reliability. Good rule of thumb to measure against. @drensin #alldaydevops
Start with one application. Do that first. @drensin (This is a good idea for many things, like implementing config management or other automation. Pick one thing and make it work.) #alldaydevops
(Once you have a success, you can use that to evangelize internally. People often need to see something work before they will buy into it.) #alldaydevops
More logging and measurement is probably better. More alerting is probably not. Only alert on the symptoms of your users’ pain. @drensin #alldaydevops
Alert on SLO violations, or if you’re burning your error budget too fast and the SLO is in jeopardy. Alert on things your users would want to know about. @drensin #alldaydevops
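(One way to read "burning your error budget too fast", as a minimal sketch rather than anything shown in the talk. Burn rate here is the observed error rate divided by the error budget implied by the SLO, so 1.0 means spending the budget exactly on schedule. The function names and the paging threshold of 10 are hypothetical.)

```python
def burn_rate(bad_events, total_events, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_budget = 1.0 - slo_target  # e.g. about 0.001 for a 99.9% SLO
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


def should_page(bad_events, total_events, slo_target, threshold=10.0):
    """Page only when the budget is burning at least `threshold` times too fast."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


# Hypothetical window: 10,000 requests against a 99.9% SLO.
print(burn_rate(50, 10_000, 0.999))     # roughly 5x budget
print(should_page(50, 10_000, 0.999))   # below the hypothetical threshold
print(should_page(200, 10_000, 0.999))  # roughly 20x budget, above it
```

The point of alerting on this number instead of on individual causes is that it only fires when users are actually being hurt fast enough to put the SLO at risk.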
Dave is talking a lot about blameless culture to wrap up. You won’t learn from mistakes if people don’t feel safe being honest. @drensin #alldaydevops
This is so important. I’ve worked in shops where people were worried about getting fired if they made a mistake. There are so many negative results from a culture like that. #alldaydevops
• • •
I’ve had a few people ask me to give more details about @dustychipura’s awesome #ADHD bootcamp that I attended. So here goes a 🧵 1/
The bootcamp was held over two weekends. The first weekend was two eight hour days. The second was an eight hour day and a four hour day to wrap up. It was a lot of work and I was super wiped out by the end, but it was great. 2/
I think some people probably went in expecting a lot of lectures or something but it’s called a bootcamp for a reason. There were some times when Dusty would talk for a bit about a topic, but most of it was doing things. 3/
Latency issues during the Airbnb k8s migration gave people the sads. #KubeCon
Some of it was due to hardware changes. The generation of hardware that the app had been running on made a difference. Also the host OS. #KubeCon
They had some issues with noisy neighbors. One specific service was causing problems for others due to the CPU it was using. They hadn’t set CPU limits because they thought it might hurt performance. #KubeCon