At best, it may get you close enough that a few lucky guesses or intuitive leaps can land you on the right answer (or some part of it).
Not gonna lie, it feels great to be a wizard. 🔮🧹🌚
I'll miss being a wizard, but I'm ready to stop being a translation layer. Better shit to do with our time.
Imagine you're looking at your dashboards and you see a big spike in errors. So you start investigating.
You dismiss some of them because your intuition or knowledge of the system tells you they are likely effects of the errors, not causes of them.
You zero in on a few in particular: it looks like the errors are only to a particular shard, and from there you can narrow it down further: the errors are only to the primary node, there was a spike in SELECT queries and queue length shortly before, and
Except, that wasn't the problem.
With metrics and dashboards you are always looking for whatever you managed to predict. If instead you start with the error spike, then break down by user or app etc., and follow the data where it takes you, you don't have to
Another super common version of this dashboard blindness is when you have a clump of errors and you check your logs to see if they all have $x in common, and they do! ... but they also have ten other dimensions in common.
So if all the errors happen to be for requests by user ID 555, shard 2, shopping cart id 10565, item 23, region us-west-1, client type iOS, version 13, app version 9099, timezone GMT...
or you can draw a bubble around the spike and get a list of the dimensions that differ.
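Here's a rough sketch of what "get a list of the dimensions that differ" could look like in code. It is not how any particular tool does it: `differing_dimensions`, the `min_gap` threshold, and the sample events are all made up for illustration, with field names borrowed from the example above. The idea is to compare how often each dimension value shows up inside the spike versus in everything else, and keep only the values that are overrepresented in the spike.

```python
from collections import Counter


def differing_dimensions(selected, baseline, min_gap=0.5):
    """Rank (dimension, value) pairs that are more common in `selected`
    than in `baseline`. Hypothetical helper, for illustration only."""

    def shares(events):
        # Fraction of events that carry each (dimension, value) pair.
        counts = Counter((k, v) for event in events for k, v in event.items())
        total = len(events) or 1
        return {pair: count / total for pair, count in counts.items()}

    selected_shares = shares(selected)
    baseline_shares = shares(baseline)

    # Gap = share inside the spike minus share outside it.
    gaps = [
        (pair, share - baseline_shares.get(pair, 0.0))
        for pair, share in selected_shares.items()
    ]
    return sorted(
        (gap for gap in gaps if gap[1] >= min_gap),
        key=lambda gap: -gap[1],
    )


# Made-up events: every error shares four dimensions...
errors = [
    {"user_id": 555, "shard": 2, "region": "us-west-1", "client": "iOS"}
    for _ in range(4)
]
# ...but two of those are shared by all the healthy traffic too.
baseline = [
    {"user_id": 101, "shard": 1, "region": "us-west-1", "client": "iOS"},
    {"user_id": 102, "shard": 2, "region": "us-west-1", "client": "iOS"},
    {"user_id": 103, "shard": 3, "region": "us-west-1", "client": "iOS"},
    {"user_id": 104, "shard": 2, "region": "us-west-1", "client": "iOS"},
]

result = differing_dimensions(errors, baseline)
for (dimension, value), gap in result:
    print(f"{dimension}={value} is overrepresented by {gap:.0%}")
```

In this toy data, region and client type drop out because the healthy requests have them in common too, which is exactly the stuff that eyeballing the error logs can't tell you.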
If you're lucky. So good luck.