finally catching up on the webinar i missed, where @ceejbot and @captfuzzbucket describe their o11y transformation @eaze.

i hope all of you tech leaders constantly asking me what it looks like, how hard it is, is it worth it? -- listen to this, 👉please👉 honeycomb.io/resources/eaze…
some crazy shit. when @ceejbot was hired, they had never had a big sales day where they managed to stay up through it. 🥵

when @captfuzzbucket was hired, deploys took *four* hours every day of trying and failing and trying and failing to get their code out. every day. 🥶
the team (pre cj/ryan) already had some decent tools. some logs, some good front-end performance stuff, exceptions management. but they didn't think there was anything better out there for .NET.

frankly, they didn't realize that there was a better way, *period*.
little bit of eaze history: they hit product market fit, all their naive solutions stopped working, and they started hitting the classic distsys scaling scenario: starts with latency problems, then one thing goes down, then everything tips over. and over and over.
basically things were wobbly but ok until california legalized weed, and then --

"success is a catastrophe you have to survive" -- @ceejbot quote (destined for a sticker)
so the site went down, and nobody knew why. the team literally did not know why. they put it into maint mode, and it went down again. wtffff?

they had absolutely no visibility into what was happening. the answers were nowhere. ANY data at all would be a help.
... *pause* gotta run do something, back later. 🌷

at minute 18. they are talking about how soul-killingly awful these systems are to support; they burn good people out, and fast. 😕
As we rejoin our heroes they are talking about an early battle where the entire site would go down, ~instantly. Their website seemed to somehow be completely saturating AWS's largest ElastiCache instance...

"but WHY? It wasn't doing that much! Something's very wrong here."
They had plenty of logs and telemetry, but nothing obviously helpful there. Logs are super useful, but only IF you knew what to log in advance. Great for known unknowns, does fuckall for unknown-unknowns.

And then came 4/20.
HUGE day for eaze. Lots of money spent on ads, inventory, partnerships ... and they were down for eight hours.

EIGHT. HOURS. [ed: cringing in solidarity]

So that's when @ceejbot had the leverage and the exec attention to bring in honeycomb, and spend the time integrating it.
This was a total leap of faith. Nobody on the eng team really believed or trusted that anything could be substantially better than logs, which they already had. [ed: CJ is ballsy as fuuuck 😳]

Ben started pumping ELB logs into honeycomb. it took 10 minutes, and right away...
they were like ... 👀 what the fuck👀 some things jumped out at them 🤔

* 403s spiking *everywhere* (microservices just bein flappy)
* random bursts of 500s (when driver shifts changed)
* random spikes of 400s for no apparent reason
* random spikes of 500s for no apparent reason
... and all that was just background noise. tip of the iceberg. ❄️
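[ed: to make the "spikes jumped out" part concrete, here is a minimal sketch of the kind of breakdown being described: counting requests per minute and status code. The records and field names below are made up for illustration; real ELB logs carry many more fields, and this is not how honeycomb itself works.]

```python
from collections import Counter

# Hypothetical, simplified load-balancer records (not the real ELB log format).
SAMPLE_EVENTS = [
    {"minute": "12:00", "status": 200},
    {"minute": "12:00", "status": 403},
    {"minute": "12:00", "status": 403},
    {"minute": "12:01", "status": 500},
    {"minute": "12:01", "status": 200},
    {"minute": "12:01", "status": 500},
    {"minute": "12:02", "status": 400},
]


def count_by_minute_and_status(events):
    """Count events per (minute, exact status code) pair."""
    return Counter((e["minute"], e["status"]) for e in events)


def error_rate_by_minute(events):
    """Fraction of requests in each minute that returned a 4xx or 5xx."""
    totals = Counter()
    errors = Counter()
    for e in events:
        totals[e["minute"]] += 1
        if e["status"] >= 400:
            errors[e["minute"]] += 1
    return {minute: errors[minute] / totals[minute] for minute in totals}


if __name__ == "__main__":
    for key, n in sorted(count_by_minute_and_status(SAMPLE_EVENTS).items()):
        print(key, n)
    print(error_rate_by_minute(SAMPLE_EVENTS))
```

Even on toy data, a burst of 403s or 500s in a single minute is obvious once the counts are split by status code instead of lumped into one request total.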

so they instrumented and rolled honeycomb out on their node services, racked up some wins, and took a deep breath and realized: they had to rewrite the monolith. it was unsalvageable. 😬 [ed: omgg been there]
unfortunately, since writing the monolith, the entire engineering team had burned out and been replaced.

❄️twice.❄️

no one remained from the team who built it, and the current team didn't really understand the codebase very well. they had to start by documenting it all.
so now they have two projects going on in parallel:

❄️1❄️ understand the monolith, redesign a new architecture, and begin building it up using a whole new stack
❄️2❄️ also keep the monolith limping along. DO NOT DIE.

both of which kicked off with honeycomb and instrumentation.
now they tell an epic tale of the Elasticache Of Doom. i can't do it justice, you really have to go and hear it for yourself.. (minute 28:30). GO.

because holy hell they were DDoSing themselves in so many unbelievable and creative ways 😍❄️😍❄️😍
"seeing those first monolith graphs in honeycomb was -- oh my god, this brave little toaster. How much work it's doing -- for so little effect. It was the first stunning revelation of what was going on in our code base."
"my boss could walk in and figure it out -- with like ad hoc queries. Someone can walk in and not have to know in advance what the question they want to ask is."

"Logging is phenomenal, but you have to know what you're looking for.. at the time you write the code."
"Metrics are useful to a point too, because they let you do graphs, but they don't let you figure out, oh, it's that one userID that's causing it, wtf is up with that? Because you can't have high cardinality data in there."
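[ed: a rough sketch of the difference in that quote, assuming made-up events with a hypothetical `user_id` field. A pre-aggregated metric keeps one number and throws the user away; raw events let you group by the high-cardinality field after the fact. Illustrative only, not any particular tool's API.]

```python
from collections import Counter

# Hypothetical request events; the field names are assumptions for illustration.
EVENTS = [
    {"user_id": "u1", "status": 200},
    {"user_id": "u2", "status": 200},
    {"user_id": "u3", "status": 500},
    {"user_id": "u3", "status": 500},
    {"user_id": "u3", "status": 500},
    {"user_id": "u4", "status": 200},
    {"user_id": "u1", "status": 500},
]


def error_count(events):
    """What a pre-aggregated metric keeps: one number, with the user dropped."""
    return sum(1 for e in events if e["status"] >= 500)


def errors_by_user(events):
    """What raw events allow: the same errors, grouped by a high-cardinality field."""
    return Counter(e["user_id"] for e in events if e["status"] >= 500)


if __name__ == "__main__":
    print("total 5xx:", error_count(EVENTS))
    print("worst offenders:", errors_by_user(EVENTS).most_common(2))
```

The total says something is wrong; only the per-user breakdown says it is mostly one userID.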
"This is where the union of logs and metrics is just what makes Honeycomb *awesome*. That, plus ad hoc queries and the inviting nature of it."

ben "lol -- again, the only thing ops had to do was just point the logs at honeycomb and walk away. this is the best part for me.."
ben goes off rhapsodizing about how much fun it is to break down by user agent! app id! certificates! status code!! and other high cardinality dimensions, all in combination with each other. 😍

AND IT SAVED MY HISTORY, so i could go back and look up what i tried the last time!
ceej -- "and i could look at ben's queries and hop off them!"

... alright y'all i just realized there is an ACTUAL TRANSCRIPT at the bottom of the page with the webinar, so i do not know why the fuck i am tweeting all the quotes when i could be in bed sleeping. lmao
in conclusion, if you would like to hear more stories of terrifying code and microdosing, from two hilarious nerds who solved some horrendous problems, go here: honeycomb.io/resources/eaze…

not many systems stories have a happy ending, so you are in for a treat. 🐝🌈
Keep Current with Charity Majors