then show how they would have found it in honeycomb in 1-2 clicks, every goshdarn time.
he was describing this thundering herd problem, where thousands of workers would spin up and hammer the one redis cpu. it took them a long time to figure out what was happening, and why.
me: "oh god, so easy. just sum up all the time spent by the workers, break down by backend or userid, either way it's *right there*."
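the "just sum it up" move is literally a group-by over wide events. a minimal sketch, assuming each event records which backend a worker hit and how long it spent there (field names here are hypothetical, not anyone's actual schema):

```python
from collections import defaultdict

# hypothetical worker events: which backend was hit, time spent there
events = [
    {"worker": "w1", "backend": "redis-1", "duration_ms": 480},
    {"worker": "w2", "backend": "redis-1", "duration_ms": 510},
    {"worker": "w3", "backend": "postgres-1", "duration_ms": 12},
    {"worker": "w4", "backend": "redis-1", "duration_ms": 495},
]

# sum total time spent, broken down by backend
totals = defaultdict(int)
for e in events:
    totals[e["backend"]] += e["duration_ms"]

# hottest backend first -- the thundering herd target jumps straight out
for backend, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(backend, ms)
```

swap `backend` for `user_id` and it's the same one-liner; the point is you break down by whatever high-cardinality field you want, after the fact.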
e.g. when the errors are all a particular version of ios, device, language pack, region, hitting a certain endpoint, etc.
...not enough to get up into your top 10 or 100 users, but enough to put strain on a shared service.
(i said you *could*. i wouldn't.)
"oh look, that guy's consuming 90% of the processing time and he pays us $20/month." block the fucker and go for a drink.
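the reason this user is invisible in a top-N-by-request-count view is that count and time are different axes. a toy sketch of the distinction, with made-up numbers: one user issuing a handful of slow, expensive requests against a crowd of users issuing many cheap ones:

```python
from collections import defaultdict

# hypothetical request log: (user_id, duration_ms)
requests = [("u_whale", 4000)] * 5          # 5 slow, expensive requests
for i in range(20):
    requests += [(f"u_{i}", 10)] * 10       # 20 users, 10 cheap requests each

time_by_user = defaultdict(int)
count_by_user = defaultdict(int)
for user, ms in requests:
    time_by_user[user] += ms
    count_by_user[user] += 1

# share of total processing time, per user
total_ms = sum(time_by_user.values())
share = {u: t / total_ms for u, t in time_by_user.items()}
```

here `u_whale` isn't anywhere near the top by request count (5 requests vs everyone else's 10), but owns over 90% of total processing time. break down by count and you never see him; break down by summed duration and he's the whole chart.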
jesus people this isn't rocket science, just good clean fun with high cardinality.
of course it's complicated if you're trying to hop from tool to tool to tool, just to recreate what happened from log spew and metrics and traces.
if you just had the fucking folder of data, it probably wouldn't be that hard.
nope. newsflash: your tools suck even if you're small, even if all you have is a monolith.
* deploys are hard or scary or flaky
* it takes more than a few minutes to deploy
* things happen that are not well understood
* on call is scary and stressful
* you are afraid of your systems
* the "debugger of last resort" is always the person or people who have been there the longest
... then you have a system that is being propped up by tribal knowledge and tall tales and cargo culting, not true debugging.
i love it, but it's sooo toxic and deadly. for everyone.
it's better if you move this shit out of our heads and put it in a tool, where it is accessible to everyone. democratize access to systems information.
they're the people who are persistent and curious, who regularly go poke around in production.
while senior folks often keep limping along with the thing they know that's good enough.
(yes yes, talking about myself again)