Adam - 18 months at Google, SRE/DevOps-type person for 8-10 years. Excited to share things that blew his socks off when joining Google.
Talk about other systems (e.g. batch) some other time.
How can we find the blast radius and do something to mitigate impact faster?
But often our hypothesis is wrong, so we loop testing it, or we loop developing a fix.
We need to speed up both those loops. #SREcon
- layer peeling
- dynamic data joins
Percentage of slow user queries too high.
Page says something is wrong (slow) in us-west1
Could be that the entire service is broken and this is the early warning, or it might just be a couple of bad replicas.
Layer peeling can help distinguish this. #SREcon
1) start with a single SLI stream (check blast radius)
2) examine latency by zone
3) filter to affected zones
4) examine by replica
5) filter to affected replicas
6) isolate slow ones.
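[A minimal sketch of the layer-peeling steps above, in Python. The sample data, the 200ms "slow" threshold, the 5% cutoff and all function names are made up for illustration; they are not from the talk, which did this with queries in a monitoring tool.]

```python
from collections import defaultdict

SLOW_MS = 200  # assumed threshold for a "slow" query (hypothetical)


def make_samples():
    """Build made-up latency samples: one bad replica in us-west1-a."""
    samples = []
    for zone in ("us-west1-a", "us-west1-b", "us-east1-b"):
        for replica in ("r0", "r1", "r2"):
            bad = (zone, replica) == ("us-west1-a", "r2")
            for _ in range(10):
                samples.append({
                    "zone": zone,
                    "replica": replica,
                    "latency_ms": 1000 if bad else 60,
                })
    return samples


def slow_fraction_by(samples, label):
    """Group samples by one label and return the fraction of slow queries."""
    total = defaultdict(int)
    slow = defaultdict(int)
    for s in samples:
        total[s[label]] += 1
        if s["latency_ms"] > SLOW_MS:
            slow[s[label]] += 1
    return {k: slow[k] / total[k] for k in total}


def affected(fractions, max_slow=0.05):
    """Return the groups whose slow fraction exceeds the (assumed) cutoff."""
    return sorted(k for k, f in fractions.items() if f > max_slow)


def peel(samples, labels):
    """Group by each label in turn, keeping only the affected groups."""
    path = []
    for label in labels:
        bad = affected(slow_fraction_by(samples, label))
        path.append((label, bad))
        samples = [s for s in samples if s[label] in bad]
    return path


print(peel(make_samples(), ["zone", "replica"]))
# [('zone', ['us-west1-a']), ('replica', ['r2'])]
```

Each pass is the same "examine by X, then filter to the affected X" move, so adding another layer only means adding another label to the list.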
Liz turns on the laser pointer!
Now we have 3 lines (one per region) on our graph; the other regions are fine. Great, we've scoped the issue to just the region that paged. #SREcon
[shows ugly graph with heaps of lines, everyone laughs]
It's clearly too hard to see data here - even if there's one bad replica you don't know it's the problem #SREcon
So now we have a bigger (Fetch > Align > Group > Point > Align > Group) query #SREcon
Obviously the one that jerked is where we want to dig. #SREcon
For debugging, let's remove it and see the underlying data. #SREcon
This heatmap blows me away - so much information you can see here.
Look at the lines, 4 of them, overlaid on the chart. They're the percentiles.
Black = not many queries.
Flame/Distribution graph - several names - brighter colours = more queries.
So you can see the majority of queries are still in the 60ms buckets, as is 50th percentile line.
If it wasn't, we could use tools in Pcon to look at "top" streams or similar. #SREcon
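[A rough sketch of how percentile lines like the ones overlaid on the heatmap can be read off the bucket counts. The bucket bounds and counts are invented for illustration, and `percentile_bucket` is a hypothetical helper, not the tool's implementation.]

```python
# Hypothetical histogram for one time slice: bucket upper bounds in ms
# and the number of queries that landed in each bucket.
BOUNDS = [30, 60, 120, 250, 500, 1000]
COUNTS = [50, 800, 100, 20, 10, 20]


def percentile_bucket(bounds, counts, pct):
    """Return the upper bound of the bucket containing the pct-th percentile."""
    target = sum(counts) * pct / 100.0
    running = 0
    for bound, count in zip(bounds, counts):
        running += count
        if running >= target:
            return bound
    return bounds[-1]


print(percentile_bucket(BOUNDS, COUNTS, 50))  # 60
print(percentile_bucket(BOUNDS, COUNTS, 99))  # 1000
```

With these made-up numbers the median stays in the 60ms bucket while a small tail of slow queries drags the 99th percentile up to the top bucket. That gap is why the heatmap shows more than any single percentile line.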
This replica is broken. Immediate mitigation is to kill it.
We bisected the problem into smaller and smaller groups, until we found the issue.
Let's say we killed those replicas, so they're no longer hurting users, but I want to work out why they were bad before I close the issue.
There's lots of other information too.
Every task at Google exports many metrics. We want to use those too!
Let's join the streams.
I'm going to start an ad-hoc query - shows kernel version of all tasks of my job. This isn't very useful by itself, we see there's a range of versions.
I see most on the latest version, but a few stragglers.
Now I can join this with the latency data!
But if we group by kernel now, we see that one kernel version has much higher latency!
The point is that you can only go so far by adding labels to latency metrics, etc.
You really need to be thinking about exporting peripheral information as separate metrics and joining at query time.
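[A minimal sketch of joining at query time, assuming two separately exported streams keyed by task. The task names, kernel labels, latency values and the `join_and_group` helper are all hypothetical.]

```python
from collections import defaultdict
from statistics import mean

# Two independent streams, both keyed by task (all values made up).
LATENCY_MS = {
    "task-0": 60, "task-1": 60, "task-2": 60,
    "task-3": 1000, "task-4": 1000,
}
KERNEL_VERSION = {
    "task-0": "kernel-new", "task-1": "kernel-new", "task-2": "kernel-new",
    "task-3": "kernel-old", "task-4": "kernel-old",
}


def join_and_group(values, labels):
    """Inner-join two streams on task, then average the values per label."""
    grouped = defaultdict(list)
    for task, value in values.items():
        if task in labels:  # tasks missing from either stream are dropped
            grouped[labels[task]].append(value)
    return {label: mean(vs) for label, vs in grouped.items()}


print(join_and_group(LATENCY_MS, KERNEL_VERSION))
# {'kernel-new': 60, 'kernel-old': 1000}
```

The latency metric never carried a kernel label. The grouping only exists because the two streams share a task key, so any other exported metric could be joined in the same way without re-instrumenting the service.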
Now I'm going to show you how to do this in just a few minutes. This is exemplars.
This makes distributed tracing and high cardinality streams useful.
The idea is by correlating metrics and samples, you can associate samples to histogram buckets, and then dive into traces for outliers #SREcon
[Showing a histogram view, annotated with traces]
Option in Pcon to sample traces - e.g. 1 for each bucket. #SREcon
Hovering shows all traces with common data - e.g. pulls out the exemplar fields to show me that task 4 in us-west1-a is generating all the traces in that high-latency bucket.
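[A small sketch of the exemplar idea: each histogram bucket keeps a few sampled traces alongside its count, and the fields shared by every trace in a bucket are pulled out. The class, field names, bucket bounds and trace IDs are hypothetical, and there is no sampling limit here.]

```python
import bisect


class ExemplarHistogram:
    """Latency histogram that stores sampled traces next to each bucket."""

    def __init__(self, bounds):
        self.bounds = bounds  # bucket upper bounds in ms
        self.counts = [0] * (len(bounds) + 1)  # last slot is overflow
        self.exemplars = [[] for _ in range(len(bounds) + 1)]

    def record(self, latency_ms, trace_id=None, **fields):
        """Count the sample and, if it was traced, attach it to its bucket."""
        i = bisect.bisect_left(self.bounds, latency_ms)
        self.counts[i] += 1
        if trace_id is not None:
            self.exemplars[i].append({"trace_id": trace_id, **fields})

    def common_fields(self, bucket_index):
        """Return fields whose value is identical across the bucket's exemplars."""
        exemplars = self.exemplars[bucket_index]
        if not exemplars:
            return {}
        common = {k: v for k, v in exemplars[0].items() if k != "trace_id"}
        for ex in exemplars[1:]:
            common = {k: v for k, v in common.items() if ex.get(k) == v}
        return common


hist = ExemplarHistogram([60, 120, 250, 500, 1000])
hist.record(55, trace_id="t1", zone="us-east1-b", task=1)
hist.record(58)  # untraced sample: counted, but no exemplar
hist.record(900, trace_id="t2", zone="us-west1-a", task=4)
hist.record(950, trace_id="t3", zone="us-west1-a", task=4)

print(hist.common_fields(4))  # {'zone': 'us-west1-a', 'task': 4}
```

Bucket index 4 is the 500-1000ms bucket in this made-up layout, so the shared fields point straight at the task producing the slow traces.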
Now we can look at our SLIs in detail.
The heatmap is crucial: we can see there are more and more queries taking 1k units, and that's what's bringing up the 99th percentile.
I see immediately the RPC within bigtable that is taking all the time.
It's finalizing files. Page Colossus!!
Aggregator layer filters down to unique traces before returning to users.
- Traces make it easier to form hypotheses
- Finding a host and timestamp helps
- To test hypothesis you still need dynamic query building
We're working with providers to get this whole set of techniques implemented. Coming soon hopefully!
Sure, document common queries, but don't make too many precomputed queries - outages are not alike.
You really need the ad-hoc query/drill-down functionality.
- check out OpenCensus & Stackdriver APM+ Trace
- see Gina Maini's talk on distributed tracing!
- see Matt Brown's talk [(that's me!!)] on risk analysis!