Adam - 18 months at Google, SRE/DevOps type-person for 8-10 years. Excited to share things that blew his socks off when joining Google.
#SREcon
Talk about other systems (e.g. batch) some other time.
#SREcon
How can we find the blast radius and do something to mitigate impact faster?
#SREcon
But often our hypothesis is wrong, so we loop testing it, or we loop developing a fix.
We need to speed up both those loops. #SREcon
- layer peeling
- dynamic data joins
- exemplars
#SREcon
Percentage of slow user queries too high.
Page says something is wrong (slow) in us-west1
#SREcon
Could be that the entire service is broken and this is the early warning, or it might just be a couple of bad replicas.
Layer peeling can help distinguish this. #SREcon
1) start with a single SLI stream (check blast radius)
2) examine latency by zone
3) filter to affected zones
4) examine by replica
5) filter to affected replicas
6) isolate slow ones.
#SREcon
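The peeling steps above can be sketched in a few lines of Python. This is only an illustration, not the tooling shown in the talk: the sample data, the `group_median` and `outliers` helpers, and the "3x the healthiest group" threshold are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical latency samples: (zone, replica, latency_ms). Made-up data.
SAMPLES = [
    ("us-west1-a", "task-1", 60), ("us-west1-a", "task-1", 62),
    ("us-west1-a", "task-4", 950), ("us-west1-a", "task-4", 1010),
    ("us-west1-b", "task-2", 58), ("us-west1-b", "task-2", 61),
    ("us-east1-a", "task-3", 59), ("us-east1-a", "task-3", 63),
]

def group_median(samples, key):
    """Group latencies by key(sample) and return {group: median latency}."""
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample[2])
    return {group: median(values) for group, values in groups.items()}

def outliers(medians, factor=3.0):
    """Groups whose median exceeds `factor` times the healthiest group's median.

    The threshold is an arbitrary heuristic chosen for this sketch.
    """
    baseline = min(medians.values())
    return sorted(g for g, m in medians.items() if m > factor * baseline)

# Steps 2-3: examine latency by zone, then filter to the affected zones.
bad_zones = outliers(group_median(SAMPLES, key=lambda s: s[0]))
in_bad_zones = [s for s in SAMPLES if s[0] in bad_zones]

# Steps 4-6: examine by replica within those zones and isolate the slow ones.
bad_replicas = outliers(group_median(in_bad_zones, key=lambda s: s[1]))

print(bad_zones, bad_replicas)
```

Each layer reuses the same group-then-filter step, which is why the loop is quick to run by hand during an incident.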
Liz turns on the laser pointer!
#SREcon
Now we have 3 lines (one per region) on our graph, and the other regions are fine. Great, we've scoped the issue to just the region that paged. #SREcon
[shows ugly graph with heaps of lines, everyone laughs]
It's clearly too hard to see data here - even if there's one bad replica you don't know it's the problem #SREcon
So now we have a bigger (Fetch > Align > Group > Point > Align > Group) query #SREcon
Obviously the one that jerked is where we want to dig. #SREcon
For debugging, let's remove it and see the underlying data. #SREcon
This heatmap blows me away - so much information you can see here.
Look at the lines, 4 of them, overlaid on the chart. They're the percentiles.
#SREcon
Black = not many queries.
Flame/Distribution graph - several names - brighter colours = more queries.
So you can see the majority of queries are still in the 60ms buckets, as is 50th percentile line.
#SREcon
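To make the relationship between the heatmap buckets and the percentile lines concrete, here is a rough sketch of reading percentiles off one column of bucket counts. The bucket bounds, the counts, and the `bucket_percentile` helper are invented for illustration; real tooling may interpolate within a bucket rather than return its upper bound.

```python
def bucket_percentile(buckets, q):
    """Return the upper bound of the bucket that contains the q-th percentile.

    buckets: list of (upper_bound_ms, count), sorted by upper bound.
    """
    total = sum(count for _, count in buckets)
    target = q * total / 100
    running = 0
    for bound, count in buckets:
        running += count
        if running >= target:
            return bound
    return buckets[-1][0]

# One hypothetical heatmap column: most queries land in the 60ms bucket,
# with a thin tail of slow ones.
COLUMN = [(30, 5), (60, 80), (120, 10), (500, 3), (1000, 2)]

p50 = bucket_percentile(COLUMN, 50)
p99 = bucket_percentile(COLUMN, 99)
print(p50, p99)
```

With these made-up counts the median stays in the 60ms bucket while a handful of slow queries drag the 99th percentile up into the top bucket, which is the pattern the heatmap makes visible at a glance.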
If it wasn't, we could use tools in Pcon to look at "top" streams or similar. #SREcon
This replica is broken. Immediate mitigation is to kill it.
#SREcon
We bisected the problem into smaller and smaller groups, until we found the issue.
#SREcon
Let's say we killed those replicas and they're no longer hurting users, but I want to work out why they were bad before I close the issue.
#SREcon
There's lots of other information too.
Every task at Google exports many metrics. We want to use those too!
Let's join the streams.
#SREcon
I'm going to start an ad-hoc query - shows kernel version of all tasks of my job. This isn't very useful by itself, we see there's a range of versions.
#SREcon
I see most on the latest version, but a few stragglers.
Now I can join this with the latency data!
#SREcon
But if we group by kernel now, we see that one kernel version has much higher latency!
#SREcon
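A minimal sketch of that query-time join, assuming two separately exported per-task streams keyed by task name. The task names, latency values, kernel labels, and the `join_and_group` helper are all hypothetical; the talk did not show code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stream 1: latency per task (ms).
LATENCY_MS = {"task-1": 61, "task-2": 59, "task-3": 63,
              "task-4": 980, "task-5": 1020}

# Hypothetical stream 2: kernel version per task, exported as its own metric
# rather than as a label on the latency metric.
KERNEL = {"task-1": "kernel-new", "task-2": "kernel-new", "task-3": "kernel-new",
          "task-4": "kernel-old", "task-5": "kernel-old"}

def join_and_group(latency, labels):
    """Inner-join two per-task streams on task name, then average latency per label."""
    grouped = defaultdict(list)
    for task, ms in latency.items():
        if task in labels:  # drop tasks missing from either stream
            grouped[labels[task]].append(ms)
    return {label: mean(values) for label, values in grouped.items()}

by_kernel = join_and_group(LATENCY_MS, KERNEL)
print(by_kernel)
```

The same helper works for any other peripheral stream (binary version, machine type, and so on) without touching how the latency metric is exported.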
The point is that you can only go so far by adding labels to latency metrics, etc.
You really need to be thinking about exporting peripheral information as separate metrics and joining at query time.
#SREcon
Now I'm going to show you how to do this in just a few minutes. This is exemplars.
This makes distributed tracing and high cardinality streams useful?
#SREcon
The idea is that by correlating metrics and samples, you can associate samples with histogram buckets, and then dive into traces for outliers. #SREcon
[Showing a histogram view, annotated with traces]
Option in Pcon to sample traces - e.g. 1 for each bucket. #SREcon
Hovering shows all traces with common data - e.g. it pulls out the exemplar fields to show me that task 4 in us-west1-a is generating all the traces in that high-latency bucket.
#SREcon
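One way to picture exemplars is a histogram that keeps one sample trace alongside each bucket's count. The sketch below is an assumption-laden toy, not the system from the talk: the `ExemplarHistogram` class, the bucket bounds, and the trace fields are all made up.

```python
import bisect

class ExemplarHistogram:
    """Latency histogram that keeps the most recent trace seen in each bucket."""

    def __init__(self, bounds):
        self.bounds = bounds                          # bucket upper bounds, ms
        self.counts = [0] * (len(bounds) + 1)         # last slot is overflow
        self.exemplars = [None] * (len(bounds) + 1)   # one sample trace per bucket

    def observe(self, latency_ms, trace):
        # Index of the first bucket whose upper bound is >= latency_ms.
        i = bisect.bisect_left(self.bounds, latency_ms)
        self.counts[i] += 1
        self.exemplars[i] = trace

hist = ExemplarHistogram([50, 100, 500, 1000])
hist.observe(60, {"trace_id": "t-1", "task": "task-1", "zone": "us-west1-b"})
hist.observe(980, {"trace_id": "t-2", "task": "task-4", "zone": "us-west1-a"})

# "Hovering" over the high-latency bucket: which task produced its exemplar?
slow = hist.exemplars[3]
print(slow["task"], slow["zone"])
```

Keeping just one trace per bucket keeps storage small while still giving a direct link from an outlier bucket to a concrete request to inspect.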
Now we can look at our SLIs in detail.
The heatmap is crucial: we can see there are more and more queries taking 1k units - and that's what's bringing up the 99th percentile.
#SREcon
I immediately see the RPC within Bigtable that is taking all the time.
It's finalizing files. Page Colossus!!
#SREcon
Aggregator layer filters down to unique traces before returning to users.
#SREcon
- Traces make it easier to form hypotheses
- Finding a host and timestamp helps
- To test hypothesis you still need dynamic query building
#SREcon
We're working with providers to get this whole set of techniques implemented. Coming soon hopefully!
#SREcon
Sure, document common queries, but don't make too many precomputed queries - outages are not alike.
You really need the ad-hoc query/drill-down functionality.
#SREcon
- check out OpenCensus & Stackdriver APM+ Trace
- see Gina Maini's talk on distributed tracing!
- see Matt Brown's talk [(that's me!!)] on risk analysis!
Thanks
#SREcon