Matt Brown @xleem, 50 tweets, 17 min read
Up Now: @lizthegrey and @adam7mck on Resolving Outages Faster with Better Debugging Strategies #SREcon
I work with Liz, Adam is a top guy. There's some exciting stuff in this talk. Listen up! #SREcon
Liz - 8+ years at Google, now CRE. Excited about sharing things we've learnt with the world.

Adam - 18 months at Google, SRE/DevOps type-person for 8-10 years. Excited to share things that blew his socks off when joining Google.

Adam's realisation was that much of what Google is doing can be done by anyone if you know about it - demoing on internal Google tools, but transferable. Take this and use it! #SREcon
Liz - assumptions for this talk - you run a service composed of microservices running across hosts; metrics * hosts > 1000 and also across regions; serving latency sensitive queries.

Talk about other systems (e.g. batch) some other time.

About 50% of the room says they have SLO alerts. Once you have SLO alerts, and you've turned off your other noisy alerts, how do you narrow down the location of the fault when your SLO alert fires? #SREcon
we call this process debugging - this talk is about how to make debugging faster.

How can we find the blast radius and do something to mitigate impact faster?

If the quick check doesn't work, you've got to hypothesise, test hypothesis, then develop a solution, test/verify it.

But often our hypothesis is wrong, so we loop testing it, or we loop developing a fix.

We need to speed up both those loops. #SREcon
Three best-practice examples for doing that are presented today:
- layer peeling
- dynamic data joins
- exemplars

Adam gets paged!

Percentage of slow user queries too high.

Page says something is wrong (slow) in us-west1

Adam - I'm diving in to figure out what's wrong. So many potential reasons for why this SLO could be threatened!

Could be the entire service is broken - this is the early warning, or might just be a couple of bad replicas.

Layer peeling can help distinguish this. #SREcon
Layer peeling:
1) start with a single SLI stream (check blast radius)
2) examine latency by zone
3) filter to affected zones
4) examine by replica
5) filter to affected replicas
6) isolate slow ones.
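The peeling steps above can be sketched as progressive regrouping over labeled samples. A toy illustration in plain Python - the sample data, label layout, and `worst` helper are all invented here, not Pcon syntax:

```python
# Toy sketch of layer peeling: drill from region -> zone -> replica by
# repeatedly grouping on the next-finer label and keeping the worst group.
from collections import defaultdict

samples = [
    # (region, zone, replica, latency_ms) - invented data
    ("us-west1", "us-west1-a", "task-3", 62),
    ("us-west1", "us-west1-a", "task-4", 140),
    ("us-west1", "us-west1-b", "task-7", 60),
    ("us-west1", "us-west1-c", "task-9", 58),
    ("us-east1", "us-east1-a", "task-1", 61),
    ("us-east1", "us-east1-b", "task-2", 59),
]

def worst(samples, key):
    """Group samples by `key` and return (label, group) with the highest latency."""
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    return max(groups.items(), key=lambda kv: max(s[3] for s in kv[1]))

# Peel one layer at a time, filtering to the worst group before regrouping.
region, in_region = worst(samples, key=lambda s: s[0])
zone, in_zone = worst(in_region, key=lambda s: s[1])
replica, _ = worst(in_zone, key=lambda s: s[2])
print(region, zone, replica)
```

The same bisection works at any depth: each step cuts the candidate set before the next, finer grouping, which is what keeps the graphs readable.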

Adam is showing us a graph of the alert crossing the threshold.

Liz turns on the laser pointer!

The graph shows something is wrong, but not much else. We're looking at the graph in "Pcon", a tool that displays graphs but is also an interactive editor, so we can slice/dice/display in realtime.

So first up, we're removing the region filter, are things broken in all regions?

Now we have 3 lines (one per region) on our graph, other regions are fine. great, we've scoped the issue to just the region that paged. #SREcon
Obvious question - why not skip all this and just look at all the replicas immediately?

[shows ugly graph with heaps of lines, everyone laughs]

It's clearly too hard to see data here - even if there's one bad replica you don't know it's the problem #SREcon
so focusing into us-west1, Pcon lets us replace the region-level query (a simple Fetch > Align > Group) with its definition to dive deeper into the data.

So now we have a bigger (Fetch > Align > Group > Point > Align > Group) query #SREcon
The queries are like layers of aggregation - the simple precomputed queries are easy to work with - but we can dive into the more complex queries for debugging like this #SREcon
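The nesting reads like function composition: the precomputed per-region stream is one Fetch > Align > Group pipeline, and expanding its definition lets you rerun the same fetch with a finer grouping. A rough sketch, with an invented data model and a no-op alignment step (the operation names follow the talk, the rest is made up):

```python
# Sketch of the Fetch > Align > Group pipeline as plain function composition.
def fetch(store, metric):
    """Pull all streams for a metric out of a (toy) metric store."""
    return [s for s in store if s["metric"] == metric]

def align(streams):
    # Real alignment resamples points onto a common time grid; the toy
    # data is already aligned, so this is a placeholder.
    return streams

def group(streams, by, agg=max):
    """Aggregate streams sharing the same `by` label into one line each."""
    out = {}
    for s in streams:
        out.setdefault(s[by], []).append(s["value"])
    return {k: agg(v) for k, v in out.items()}

store = [
    {"metric": "latency", "region": "us-west1", "zone": "us-west1-a", "value": 140},
    {"metric": "latency", "region": "us-west1", "zone": "us-west1-b", "value": 60},
    {"metric": "latency", "region": "us-east1", "zone": "us-east1-a", "value": 61},
]

# Precomputed view: one line per region.
per_region = group(align(fetch(store, "latency")), by="region")
# Expanded for debugging: same pipeline, regrouped per zone.
per_zone = group(align(fetch(store, "latency")), by="zone")
```

Changing only the `by` argument is the interactive regrouping Adam demonstrates: the aggregate query stays cheap day-to-day, and you only pay for the finer grouping while debugging.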
With the expanded query, Adam showing how we can now interactively change the grouping, so instead of showing the result per-region, now we see the individual zones in the region. #SREcon
so we see 3 lines, one consistently above our SLO (100ms), one consistently below, and one that jerked up 10 mins ago when we got paged.

Obviously the one that jerked is where we want to dig. #SREcon
So changing our query to filter on just that zone, we see in the query editor that our final operation is to take the 98th percentile - which is great for the aggregate views.

For debugging, let's remove it and see the underlying data. #SREcon
Now, we're looking at a distribution of response time for all queries in us-west1-a.

This heatmap blows me away - so much information you can see here.

Look at the lines, 4 of them, overlaid on the chart. They're the percentiles.

Yellow = lots of queries.
Black = not many queries.
Flame/Distribution graph - several names - brighter colours = more queries.

So you can see the majority of queries are still in the 60ms buckets, as is 50th percentile line.
but we can see the outliers at the top have risen from around 100 to 140, so it's likely an outlier issue. #SREcon
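That "median steady, tail rising" pattern falls straight out of how percentiles are read off bucket counts. A small sketch - the bucket bounds and counts are invented for illustration:

```python
# Reading percentiles from a latency histogram: each bucket is
# (upper_bound_ms, count); walk buckets until the target rank is covered.
def percentile(buckets, p):
    """Return the upper bound of the bucket containing the p-th percentile."""
    total = sum(c for _, c in buckets)
    rank = p / 100 * total
    seen = 0
    for bound, count in buckets:
        seen += count
        if seen >= rank:
            return bound
    return buckets[-1][0]

before = [(60, 900), (80, 60), (100, 40)]
after = [(60, 900), (80, 60), (140, 40)]  # same 40 outliers, shifted up

# Most queries stay in the 60ms bucket, so the median doesn't move...
assert percentile(before, 50) == percentile(after, 50) == 60
# ...but the 98th percentile tracks the outliers: 100 -> 140.
print(percentile(before, 98), percentile(after, 98))
```

A handful of shifted outliers is invisible at p50 but dominates p98 - which is exactly why the heatmap plus percentile overlay points at an outlier problem rather than a fleet-wide slowdown.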
so back to our query - we filter into the zone, and then remove the group to show instances individually - luckily the zone is small enough to do this.

If it wasn't, we could use tools in Pcon to look at "top" streams or similar. #SREcon
So we filter into a specific replica, and go into distribution view again - and now we see that for this replica, everything is slower, all the buckets have moved up in latency, as has the 50th percentile.

This replica is broken. The immediate mitigation is to kill it.

Recap - no way to go straight from high-level SLI aggregation to the instance.

We bisected the problem into smaller and smaller groups, until we found the issue.

Now, Dynamic Data Joins.

Let's say we killed those replicas, they're not hurting users, but I want to work out why they're bad before I close the issue.

We have some labels on the metrics we're using for alerting, but we can't put everything on those metrics.

There's lots of other information too.

Every task at Google exports many metrics. We want to use those too!

Let's join the streams.
So I have a theory this might be a kernel version issue.

I'm going to start an ad-hoc query - shows kernel version of all tasks of my job. This isn't very useful by itself, we see there's a range of versions.

First useful thing is just to group, using a count operation, so I see how many tasks on each kernel.

I see most on the latest version, but a few stragglers.

Now I can join this with the latency data!

So, skipping back to our useless all task latency query, if we join it with kernel, we now have an extra column for each task - we have latency and kernel. Still useless.

But if we group by kernel now, we see that one kernel version has much higher latency!

So that's dynamic data joins.

The point is that you can only go so far by adding labels to latency metrics, etc.

You really need to be thinking about exporting peripheral information as separate metrics and joining at query time.

Liz - so we've shown you how to do it step by step.

Now I'm going to show you how to do this in just a few minutes. This is exemplars.

This makes distributed tracing and high-cardinality streams useful.
Who has trouble finding the right trace to look at? [many hands raised]

The idea is by correlating metrics and samples, you can associate samples to histogram buckets, and then dive into traces for outliers #SREcon
Exemplar Tags - how to skip the messy layer peeling steps.

[Showing a histogram view, annotated with traces]

Option in Pcon to sample traces - e.g. 1 for each bucket. #SREcon
Now we have a green dot on each histogram bucket, saying there's trace data available.

Hovering shows all traces with common data - e.g. it pulls out the exemplar fields to show me that task 4 in us-west1-a is generating all the traces in that high-latency bucket.
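One way to picture exemplars: the histogram keeps, next to each bucket's count, a sample trace ID that the UI can link through to. A hedged sketch of that data structure - not the actual Pcon/Stackdriver implementation, and all names invented:

```python
# Exemplar-carrying histogram: each bucket stores a count plus one
# sampled (trace_id, labels) pair so a bucket can be clicked through
# to a concrete trace.
import bisect

BOUNDS = [50, 100, 200]  # bucket upper bounds in ms (invented)

class ExemplarHistogram:
    def __init__(self):
        self.counts = [0] * (len(BOUNDS) + 1)
        self.exemplars = [None] * (len(BOUNDS) + 1)

    def record(self, latency_ms, trace_id, labels):
        i = bisect.bisect_left(BOUNDS, latency_ms)
        self.counts[i] += 1
        # Sampling policy is up to you; here we keep the first trace per bucket.
        if self.exemplars[i] is None:
            self.exemplars[i] = (trace_id, labels)

h = ExemplarHistogram()
h.record(62, "trace-aaa", {"task": "task-1"})
h.record(140, "trace-bbb", {"task": "task-4", "zone": "us-west1-a"})

# The high-latency bucket now points straight at a slow trace.
print(h.exemplars[2])
```

The win is that the outlier bucket carries its own evidence - instead of hunting through the trace store for "a slow one from roughly that time", you land directly on a trace you know fell in that bucket.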

so getting those tags is really cool, but tracing can be even cooler. #SREcon
for Cloud Bigtable we have a blackbox prober; for that prober, we tell our tracing system to sample at 100%, since we know the rate of queries.

Now we can look at our SLIs in detail.
[Showing graph with 99th percentile latency creeping up]

The heatmap is crucial, we can see there are more and more queries taking 1k units - and that's what's bringing up the 99th percentile.

So the exemplar traces overlaid on this graph mean I can click in and immediately see the trace graph for those slow queries.

I see immediately the RPC within bigtable that is taking all the time.

It's finalizing files. Page Colossus!!

Trying to do this via dashboards is much slower - especially for something like Cloud Bigtable with many dashboards/graphs #SREcon
Quick pitch from SD - Dapper, just shown, is an internal tool - Stackdriver APM and Trace give you the same functionality! #SREcon
Data starts in a monitoring library (e.g. OpenCensus) on a task - sends metrics with trace exemplars. Goes to a collector which stores into a repo.

Aggregator layer filters down to unique traces before returning to users.

[Liz is showing example code for how to do this on the client side - not going to try and summarize that here, get the slides later!]

Summarizing Exemplars:
- Traces make it easier to form hypotheses
- Finding a host and timestamp helps
- To test hypothesis you still need dynamic query building

Distributed tracing and graphing exist in the wild.

We're working with providers to get this whole set of techniques implemented. Coming soon hopefully!

Post incident clean-up.

Sure, document common queries, but don't make too many precomputed queries - outages are not alike.

You really need the ad-hoc query/drill-down functionality.
last minute pitches:
- check out OpenCensus & Stackdriver APM+ Trace
- see Gina Maini's talk on distributed tracing!
- see Matt Brown's talk [(that's me!!)] on risk analysis!

no crowd questions, end. #SREcon