Matt Brown @xleem

Up Now: @lizthegrey and @adam7mck on Resolving Outages Faster with Better Debugging Strategies #SREcon

I work with Liz, Adam is a top guy. There's some exciting stuff in this talk. Listen up! #SREcon

Liz - 8+ years at Google, now CRE. Excited about sharing things we've learnt with the world.

Adam - 18mths at Google, SRE/DevOps type-person for 8-10 years. Excited to share things that blew his socks off when joining Google.

#SREcon

Adam's realisation was that much of what Google is doing can be done by anyone if you know about it - demoing on internal Google tools, but transferable. Take this and use it! #SREcon

Liz - assumptions for this talk - you run a service composed of microservices running across hosts; metrics * hosts > 1000 and also across regions; serving latency sensitive queries.

Talk about other systems (e.g. batch) some other time.

#SREcon

About 50% of the room says they have SLO alerts. So once you have SLO alerts and you've turned off your other noisy alerts, how do you narrow down the location of the fault when your SLO alert fires? #SREcon

we call this process debugging - this talk is about how to make debugging faster.

How can we find the blast radius and do something to mitigate impact faster!

#SREcon

If the quick check doesn't work, you've got to hypothesise, test hypothesis, then develop a solution, test/verify it.

But often our hypothesis is wrong, so we loop testing it, or we loop developing a fix.

We need to speed up both those loops. #SREcon

3 best practices examples for doing that presented today.
- layer peeling
- dynamic data joins
- exemplars

#SREcon

Adam gets paged!

Percentage of slow user queries too high.

Page says something is wrong (slow) in us-west1

#SREcon

Adam - I'm diving in to figure out what's wrong. So many potential reasons for why this SLO could be threatened!

Could be the entire service is broken - this is the early warning, or might just be a couple of bad replicas.

Layer peeling can help distinguish this. #SREcon

Layer peeling:
1) start with a single SLI stream (check blast radius)
2) examine latency by zone
3) filter to affected zones
4) examine by replica
5) filter to affected replicas
6) isolate slow ones.

#SREcon
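The six steps above can be sketched in a few lines of Python. This is only an illustration, not the Pcon query language: the sample records, the field names (`region`, `zone`, `replica`, `latency_ms`) and the rule "a group is affected if its 98th percentile is over the SLO" are all assumptions made for the example. In the talk, "affected" was judged by eye from how a line changed over time.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def group_percentile(samples, key, pct=98):
    """Group samples by one label and take a latency percentile per group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample["latency_ms"])
    return {name: percentile(vals, pct) for name, vals in groups.items()}


def peel(samples, layers, slo_ms, pct=98):
    """Walk the layers in order, keeping only groups over the SLO each time."""
    path = []
    for layer in layers:
        by_group = group_percentile(samples, layer, pct)
        bad = {name for name, value in by_group.items() if value > slo_ms}
        path.append((layer, sorted(bad)))
        # Filter to the affected groups before examining the next layer.
        samples = [s for s in samples if s[layer] in bad]
    return path


def make(region, zone, replica, latency_ms, n=5):
    """Build n identical hypothetical samples."""
    return [
        {"region": region, "zone": zone, "replica": replica,
         "latency_ms": latency_ms}
        for _ in range(n)
    ]


# Hypothetical data: one slow replica hiding inside one zone of one region.
SAMPLES = (
    make("us-west1", "us-west1-a", "task-4", 140)
    + make("us-west1", "us-west1-a", "task-5", 60)
    + make("us-west1", "us-west1-b", "task-1", 60)
    + make("us-east1", "us-east1-a", "task-2", 55)
)

for layer, bad in peel(SAMPLES, ["region", "zone", "replica"], slo_ms=100):
    print(layer, bad)
```

Each pass narrows the sample set before the next grouping, which is why the final per-replica view stays readable instead of showing every replica at once.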

Adam is showing us a graph of the alert crossing the threshold.

Liz turns on the laser pointer!

#SREcon

The graph shows something is wrong, but not much. We're looking at the graph in "Pcon", a tool that both displays graphs and acts as an interactive query editor, so we can slice/dice/display in realtime.

#SREcon

So first up, we're removing the region filter, are things broken in all regions?

Now we have 3 lines (one per region) on our graph, other regions are fine. great, we've scoped the issue to just the region that paged. #SREcon

Obvious question - why not skip all this and just look at all the replicas immediately.

[shows ugly graph with heaps of lines, everyone laughs]

It's clearly too hard to see data here - even if there's one bad replica you don't know it's the problem #SREcon

so focusing into us-west1, Pcon lets us replace the region-level query (a simple Fetch > Align > Group) with its definition to dive deeper into the data.

So now we have a bigger (Fetch > Align > Group > Point > Align > Group) query #SREcon

The queries are like layers of aggregation - the simple precomputed queries are easy to work with - but we can dive into the more complex queries for debugging like this #SREcon

With the expanded query, Adam showing how we can now interactively change the grouping, so instead of showing the result per-region, now we see the individual zones in the region. #SREcon

so we see 3 lines, one consistently above our SLO (100ms), one consistently below and one that jerked up 10 mins ago when we got paged.

Obviously the one that jerked is where we want to dig. #SREcon

So changing our query to filter on just that zone, we see in the query editor that our final operation is to take the 98th percentile - which is great for the aggregate views.

For debugging, let's remove it and see the underlying data. #SREcon

Now, we're looking at a distribution of response time for all queries in us-west1-a.

This heatmap blows me away - so much information you can see here.

Look at the lines, 4 of them, overlaid on the chart. They're the percentiles.

#SREcon

Yellow = lots of queries.
Black = not many queries.
Flame/Distribution graph - several names - brighter colours = more queries.

So you can see the majority of queries are still in the 60ms buckets, as is 50th percentile line.
#SREcon

but we can see the outliers at the top have risen from around 100 to 140, so it's likely an outlier issue. #SREcon
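A rough sketch of why the distribution view says "outlier issue": if you read percentiles off bucketed counts, the median can stay put while the high percentile jumps. The bucket bounds and counts below are invented for illustration and are not the numbers from the talk's graph.

```python
def bucket_percentile(buckets, pct):
    """Upper bound of the bucket that contains the pct-th percentile.

    buckets is a list of (upper_bound_ms, count) pairs sorted by bound.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("empty histogram")
    running = 0
    for upper, count in buckets:
        running += count
        # Integer comparison avoids floating point rounding at the boundary.
        if running * 100 >= pct * total:
            return upper
    return buckets[-1][0]


# Hypothetical heatmap columns: before and after the page fired.
BEFORE = [(40, 50), (60, 900), (80, 20), (100, 30)]
AFTER = [(40, 50), (60, 900), (80, 10), (100, 10), (140, 30)]

for name, column in (("before", BEFORE), ("after", AFTER)):
    print(name, bucket_percentile(column, 50), bucket_percentile(column, 98))
```

In this made-up data the bulk of queries never leave the 60ms bucket, so only the tail percentile moves.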

so back to our query - we filter into the zone, and then remove the group to show instances individually - lucky the zone is small enough to do this.

If it wasn't, we could use tools in Pcon to look at "top" streams or similar. #SREcon

So we filter into a specific replica, and go into distribution view again - and now we see that for this replica, everything is slower, all the buckets have moved up in latency, as has the 50th percentile.

This replica is broken. Immediate mitigation is to kill this.

#SREcon

Recap - no way to go straight from high-level SLI aggregation to the instance.

We bisected the problem into smaller and smaller groups, until we found the issue.

#SREcon

Now, Dynamic Data Joins.

Let's say we killed those replicas, they're not hurting users, but I want to work out why they're bad before I close the issue.

#SREcon

We have some labels on the metrics we're using for alerting, but we can't put everything on those metrics.

There's lots of other information too.

Every task at Google exports many metrics. We want to use those too!

Let's join the streams.
#SREcon

So I have a theory this might be a kernel version issue.

I'm going to start an ad-hoc query - shows kernel version of all tasks of my job. This isn't very useful by itself, we see there's a range of versions.

#SREcon

First useful thing is just to group, using a count operation, so I see how many tasks on each kernel.

I see most on the latest version, but a few stragglers.

Now I can join this with the latency data!

#SREcon

So, skipping back to our useless all task latency query, if we join it with kernel, we now have an extra column for each task - we have latency and kernel. Still useless.

But if we group by kernel now, we see that one kernel version has much higher latency!

#SREcon
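Here is a minimal sketch of that join, assuming two separately exported per-task streams keyed by task name. The task names, latencies and the kernel labels `kernel-A` / `kernel-B` are hypothetical, and plain dicts stand in for whatever the monitoring system actually stores.

```python
from collections import Counter, defaultdict
from statistics import mean

# Two independent per-task streams (hypothetical values).
LATENCY_MS = {"task-1": 60, "task-2": 62, "task-3": 58, "task-4": 140}
KERNEL = {"task-1": "kernel-A", "task-2": "kernel-A",
          "task-3": "kernel-A", "task-4": "kernel-B"}


def join_on_task(latency_by_task, kernel_by_task):
    """Inner join of the two streams on the task name."""
    return [
        {"task": task, "latency_ms": latency, "kernel": kernel_by_task[task]}
        for task, latency in latency_by_task.items()
        if task in kernel_by_task
    ]


def mean_latency_by_kernel(rows):
    """Group the joined rows by kernel and average the latency."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["kernel"]].append(row["latency_ms"])
    return {kernel: mean(values) for kernel, values in groups.items()}


# Step 1: count tasks per kernel (a few stragglers show up).
print(Counter(KERNEL.values()))

# Step 2: join, then group by the joined-in label.
print(mean_latency_by_kernel(join_on_task(LATENCY_MS, KERNEL)))
```

The latency stream never carried a kernel label; the label only appears at query time through the join.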

So that's dynamic data joins.

The point is that you can only go so far by adding labels to latency metrics, etc.

You really need to be thinking about exporting peripheral information as separate metrics and joining at query time.

#SREcon

Liz - so we've shown you how to do it step by step.

Now I'm going to show you how to do this in just a few minutes. This is exemplars.

This makes distributed tracing and high cardinality streams useful.
#SREcon

Who has trouble finding the right trace to look at? [many hands raised]

The idea is by correlating metrics and samples, you can associate samples to histogram buckets, and then dive into traces for outliers #SREcon

Exemplar Tags - how to skip the messy layer peeling steps.

[Showing a histogram view, annotated with traces]

Option in Pcon to sample traces - e.g. 1 for each bucket. #SREcon

Now we have a green dot on each histogram bucket, saying there's trace data available.

Hovering, shows all traces with common data - e.g. pulls out the exemplar fields to show me that task 4 in us-west1-a is generating all the traces in that high-latency bucket.

#SREcon
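One way to picture the green dots: a histogram that keeps one sample trace per bucket alongside the count. This is a toy sketch, not how Pcon or any real tracing library stores exemplars; the class name, bucket bounds, trace IDs and tag names are all made up for the example.

```python
import bisect


class ExemplarHistogram:
    """Latency histogram that remembers one exemplar trace per bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)  # bucket upper bounds in ms
        size = len(self.bounds) + 1   # the extra slot is the overflow bucket
        self.counts = [0] * size
        self.exemplars = [None] * size

    def _index(self, latency_ms):
        # Index of the first bound >= latency_ms, or the overflow slot.
        return bisect.bisect_left(self.bounds, latency_ms)

    def record(self, latency_ms, trace_id, tags):
        i = self._index(latency_ms)
        self.counts[i] += 1
        # Keep the most recent trace as this bucket's exemplar.
        self.exemplars[i] = {"trace_id": trace_id,
                             "latency_ms": latency_ms, **tags}

    def exemplar_for(self, latency_ms):
        """Exemplar stored for the bucket this latency falls into, if any."""
        return self.exemplars[self._index(latency_ms)]


hist = ExemplarHistogram([60, 100, 140])
hist.record(55, "trace-a", {"task": "task-1", "zone": "us-west1-a"})
hist.record(130, "trace-b", {"task": "task-4", "zone": "us-west1-a"})

# "Hover" over the high-latency bucket and read the exemplar's tags.
print(hist.exemplar_for(135))
```

The tags on the exemplar are what let you jump from a slow bucket straight to the task and zone, without peeling layers by hand.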

so getting those tags is really cool, but tracing can be even cooler. #SREcon

for Cloud Bigtable we have a blackbox prober, for that prober, we tell our tracing system to sample at 100% since we know the rate of queries.

Now we can look at our SLIs in detail.

[Showing graph with 99th percentile latency creeping up]

The heatmap is crucial, we can see there are more and more queries taking 1k units - and that's what's bringing up the 99th percentile.

#SREcon

So the exemplar traces overlaid on this graph mean I can click in and immediately see the trace graph for those slow queries.

I see immediately the RPC within bigtable that is taking all the time.

It's finalizing files. Page Colossus!!

#SREcon

Trying to do this via dashboards is much slower - especially for something like Cloud Bigtable with many dashboards/graphs #SREcon

Quick pitch from SD - Dapper, just shown, is an internal tool - Stackdriver APM and Trace give you the same functionality! #SREcon

Data starts in a monitoring library (e.g. OpenCensus) on a task - sends metrics with trace exemplars. Goes to a collector which stores into a repo.

Aggregator layer filters down to unique traces before returning to users.

#SREcon
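The aggregator step can be pictured as a de-duplication over exemplars gathered from many tasks. The sketch below assumes each exemplar is a dict with a `trace_id` field, which is an invented shape for illustration rather than the real wire format.

```python
def unique_traces(exemplars):
    """Keep the first exemplar seen for each trace ID, preserving order."""
    seen = set()
    unique = []
    for exemplar in exemplars:
        if exemplar["trace_id"] not in seen:
            seen.add(exemplar["trace_id"])
            unique.append(exemplar)
    return unique


# Hypothetical exemplars collected from several histogram buckets.
COLLECTED = [
    {"trace_id": "trace-b", "task": "task-4"},
    {"trace_id": "trace-a", "task": "task-1"},
    {"trace_id": "trace-b", "task": "task-4"},
]

print(unique_traces(COLLECTED))
```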

[Liz is showing example code for how to do this on the client side - not going to try and summarize that here, get the slides later!]

#SREcon

Summarizing Exemplars:
- Traces make it easier to form hypotheses
- Finding a host and timestamp helps
- To test hypothesis you still need dynamic query building

#SREcon

Distributed tracing and graphing exist in the wild.

We're working with providers to get this whole set of techniques implemented. Coming soon hopefully!

#SREcon

Post incident clean-up.

Sure, document common queries, but don't make too many precomputed queries - outages are not alike.

You really need the ad-hoc query/drill-down functionality.
#SREcon

last minute pitches:
- check out OpenCensus & Stackdriver APM + Trace
- see Gina Maini's talk on distributed tracing!
- see Matt Brown's talk [(that's me!!)] on risk analysis!

Thanks
#SREcon

no crowd questions, end. #SREcon
