Thread by Matt Brown, 28 tweets.
Now: @jaqx0r on "A theory and practice of alerting with service level objectives."

#SREcon
@jaqx0r this talk is based on Jamie's experiences being on a rotation that eventually burnt him out.

Showing a photo of his contribution to @alicegoldfuss's oncall photo collection, where he looked happy, but didn't yet know that he wasn't.

#SREcon
@jaqx0r context: the team had the lowest rating in Google's SRE reviews for two consecutive 6-month periods. Not a great place to be.

So they were instructed to focus on fixing that and reducing oncall load.

#SREcon
@jaqx0r They started with alerting based on cause - really tempting, easy to understand, but also really noisy as your scope grows (e.g. thousands of disks).

Jamie prefers to alert on symptoms - e.g. high latency.

#SREcon
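A minimal PromQL sketch of the cause/symptom distinction, with illustrative metric names (not examples from the talk):

```promql
# Cause-based: one alert per failed disk - noise grows with fleet size.
# (node_disk_healthy is a hypothetical gauge, 1 = healthy.)
node_disk_healthy == 0

# Symptom-based: fires only when users actually feel it.
# (Hypothetical: 99th-percentile request latency above 500ms.)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
```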
@jaqx0r so why does this suck?

typically our services are growing, so ops work will grow too - but to be sustainable, the cost of maintenance has to grow sublinearly with the service, so the team doesn't get overloaded.

#SREcon
@jaqx0r think about CI/CD - bad tests slow down delivery, and we all understand that slow delivery is bad.

specialized cause-based alerts are similar - they create operational load the same way flaky tests do.

#SREcon
@jaqx0r so don't try to alert on an exhaustive set of causes (you'll never succeed).

instead, alert on a small number of symptoms that relate to what your users need.

despite having fewer alerts, you'll have better observability!

#SREcon
@jaqx0r so, how do we decide what is a symptom, and what's not?

An easy way to answer this is to ask whether the user cares about the issue. Does the user care that a queue is long? No.

#SREcon
@jaqx0r a detour: parts do fail, we know they will. How much failure is OK?

Physical engineering calls this tolerance; in SRE, the error budget can be thought of as availability tolerance.

#SREcon
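To make the tolerance idea concrete (my arithmetic, not a figure from the talk): a 99.9% availability SLO over 30 days gives an error budget of 0.1% × 30 × 24 × 60 ≈ 43 minutes of total unavailability.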
@jaqx0r recap of SLI/SLO/SLA.

SLI - indicator
SLO - objective
SLA - agreement

opinion: everyone should be measured as if they're the only user, so you don't lose small users in the noise.

#SREcon
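An illustrative set (mine, not Jamie's):

SLI - fraction of HTTP requests answered successfully within 300ms
SLO - 99.9% of requests meet the SLI, measured over 30 days
SLA - the contract that says what happens (credits, penalties) if the SLO is missed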
@jaqx0r Does your service have an SLO?

The answer may surprise you - even if you haven't defined one, your typical performance becomes a de facto SLO that your users expect.

#SREcon
@jaqx0r looping back: now that we have an SLO, we can define a symptom.

A symptom is anything that can be measured by the SLO.

#SREcon
@jaqx0r if a microservice crashes in a cloud and no-one is listening, does it actually crash?

#SREcon
@jaqx0r availability is typically measured as success/total.

we can instrument our code to report these stats - much better than external probing, as we don't have to interpolate between probes.

#SREcon
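A minimal PromQL sketch of that ratio, assuming a request counter http_requests_total with a code label (illustrative names, not the demo's):

```promql
# Availability = successful requests / total requests, over the last 5m.
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```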
@jaqx0r SLO-based alerting - use our favourite time-series monitoring platform to collect these, and alert if the availability rate is below our SLO target [example prometheus code on screen].

But this is still quite noisy; we can do better!

#SREcon
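The on-screen rule wasn't captured in the thread; a sketch of what such a rule might look like, reusing the illustrative metric above:

```yaml
groups:
  - name: slo
    rules:
      - alert: AvailabilityBelowSLO
        # Page when measured availability drops below the 99.9% target.
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          < 0.999
        for: 5m
        labels:
          severity: page
```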
@jaqx0r We should instead look at the rate of error-budget consumption, and alert if it looks like that rate will exhaust the budget. We call this the "burn rate".

#SREcon
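As a sketch of the idea (my formulation, not the talk's exact expression): burn rate = observed error ratio ÷ allowed error fraction. A burn rate of 1 spends the budget exactly over the SLO window; anything higher exhausts it early.

```promql
# Burn rate for a 99.9% SLO: error ratio over the last hour divided by
# the budget fraction (1 - 0.999). >1 means the budget is being spent
# faster than it accrues. Metric names are illustrative.
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
)
/ (1 - 0.999)
```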
@jaqx0r we can look at the burn rate over different windows to tune how sensitive our alerts are - e.g. 1 day/1 week windows.

#SREcon
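A rule-fragment sketch of the two-window idea using the talk's 1-day/1-week example (thresholds and recording-rule names are assumptions, not Jamie's rules):

```yaml
- alert: ErrorBudgetBurn
  # Fires only while both the 1-week and 1-day burn rates exceed 1x,
  # i.e. the budget is being consumed faster than it accrues on both
  # horizons. availability:ratio_rate1w and availability:ratio_rate1d
  # are assumed recording rules for the success ratio over each window.
  expr: |
    (1 - availability:ratio_rate1w) / (1 - 0.999) > 1
      and
    (1 - availability:ratio_rate1d) / (1 - 0.999) > 1
  labels:
    severity: page
```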
@jaqx0r we're on to a live demo with Prometheus!

Jamie has a prometheus server running, with some programs sending in data as if they were webservers, and apachebench generating load.

#SREcon
@jaqx0r the demo showed burn-rate alerting happily ignoring a low rate of errors, then kicking in nicely when the rate spiked.

#SREcon
so if we're in a burn-rate alerting world, and we don't have cause-based alerting, how do we know what's wrong?

This is where observability comes in - it's a property of your system.

#SREcon
back in the old days, GDB gave visibility in a single-process world.

in our new distributed system world, we need to add this type of visibility explicitly.

#SREcon
so we add logs/tracing/event streams/metrics to give us visibility.

Our brains + observability output are what replace the cause based alerts.

#SREcon
this is more flexible than trying to debug by alerting on every cause; there will always be something we haven't anticipated.

#SREcon
if you have an existing cause-based alert and you can't quite bring yourself to remove it - consider deprioritizing it.

Rather than having it page, put it on a dashboard. When someone does get paged (by burn rate), they can look at it there.

#SREcon
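One way to do the demotion in Prometheus (a sketch, assuming an Alertmanager route that only pages on severity: page): keep the rule, but change its severity so it lands on a dashboard or ticket queue instead of the pager.

```yaml
- alert: DiskUnhealthy
  # Hypothetical cause-based rule, kept for debugging context.
  expr: node_disk_healthy == 0
  labels:
    severity: ticket   # non-paging route; still visible when investigating
```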
if you find your cause-based alert dashboards aren't getting used, you can then remove those alerts with confidence.

#SREcon
alerts will grow like mould if you're not careful. They get added, but never reviewed. It's a good idea to review your alerts regularly.

Set a goal for how many pages you want to receive; work towards it.

#SREcon
summary:
1) symptom-based alerts are good
2) your SLO is defined by you, your customers, and your system
3) the SLO implies an error budget; the budget informs your tolerance
4) page only on SLO risk, because that's what matters

#SREcon
using these techniques Jamie's team was able to recover from being second-worst at Google and get to a sustainable place - and stay there.

#SREcon