It is staggering how durable the myth of "you can't afford events, use metrics" has proven to be. 🤔
I think there are several contributing factors. First of all, most people's frame of reference is logs. Shitty, spammy, undisciplined string-infested logs.
The median log line contains maybe 1-5 nouns of information, and repeats any/all correlating identifiers on every line. That's...not a lot of information density per write or buffer flush.
But it gets worse! The strings are often padded with sentences and human-readable crap,
and the log lines themselves are virtually useless unless you reassemble the full context of the event in post processing.
Your write amplification is massive (easily tens or hundreds of writes per request), and a single typo can be fatal to your disk space or budget.
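To make that write amplification concrete, here's a contrived sketch (hypothetical service, made-up fields) of the classic pattern:

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def handle_request(user_id: str, cart: list) -> None:
    # Every line repeats the correlating id, and every call is its own write.
    request_id = str(uuid.uuid4())
    log.info("request %s: started for user %s", request_id, user_id)
    log.info("request %s: cart has %d items", request_id, len(cart))
    log.info("request %s: charging card for user %s", request_id, user_id)
    log.info("request %s: payment succeeded", request_id)
    log.info("request %s: finished", request_id)

handle_request("user-123", ["widget", "gadget"])
```

Five writes for one request, the id repeated on every single one, and the actual facts scattered across lines you have to stitch back together later.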
Events, on the other hand, are emitted at one per request per service. A maturely instrumented service tends to have 300-500 dimensions, most of which are populated.
Adding another dimension doesn't mean another write, just appending a few more chars to the existing one.
So structurally events are compact, dense and resistant to bloat -- and no post processing necessary to make them usable.
No printing the unique IDs and timestamps again and again on every log line. No need to allocate memory and set up TCP for every write.
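For contrast, the same hypothetical request as one wide event -- a minimal sketch, not any particular vendor's SDK:

```python
import json
import sys
import time
import uuid

def emit(event: dict) -> None:
    # Exactly one structured write per request; stdout stands in for
    # whatever your event pipeline actually is.
    json.dump(event, sys.stdout)
    sys.stdout.write("\n")

def handle_request(user_id: str, cart: list) -> None:
    start = time.monotonic()
    event = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "cart_items": len(cart),
    }
    # Adding a dimension is just another key on the dict, not another write.
    event["payment_ok"] = True
    event["duration_ms"] = (time.monotonic() - start) * 1000
    emit(event)  # flush once, at the end, with full context attached

handle_request("user-123", ["widget", "gadget"])
```

One write, one flush, every identifier stated exactly once.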
And that's just what you save by aggregating context around the event. I know y'all don't have access to an efficient columnar store; the closest options are probably Elasticsearch (built for text search) and Druid (lacks flexible schemas). Surely there's something in the works tho.
We've written extensively about some of the things we did to optimize storage costs, from compression, to replacing repeated strings with pointers, to (most recently) aging files out to S3 and moving the query planner to Lambda jobs.
(Aka "We serverlessed our database 😍")
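(The "repeated strings with pointers" trick is basically dictionary encoding. A toy illustration of the idea, not our actual implementation:)

```python
def dict_encode(column: list[str]) -> tuple[list[str], list[int]]:
    # Store each distinct string once, plus one small integer per row --
    # the general idea behind "replacing repeated strings with pointers".
    dictionary: dict[str, int] = {}
    codes: list[int] = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return list(dictionary), codes

values, codes = dict_encode(["GET", "GET", "POST", "GET"])
# values == ["GET", "POST"], codes == [0, 0, 1, 0]
```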
All that without even mentioning the loaded S-word: sampling.
To be clear: Honeycomb does not depend on sampling in ANY way; many of our customers don't sample at all, and it's completely up to you. But dynamic sampling is a fucking superpower. You ignore it at your own peril.
Any time a monitoring vendor tells you smugly that THEY don't throw away any data, ask them what time interval they aggregate on.
(That's called throwing data away too, btw, and it's way more fatal to observability than simply getting a representative sample.)
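For the curious, here's dynamic sampling in miniature -- a hedged sketch with made-up field names, not Honeycomb's actual sampler. Keep everything interesting, sample the boring stuff, and record the rate so aggregates can be reweighted back to true counts:

```python
import random

def sample_rate_for(event: dict) -> int:
    # Keep every error; keep 1 in 100 of the boring successes.
    return 1 if event.get("status_code", 200) >= 500 else 100

def maybe_emit(event: dict) -> dict | None:
    rate = sample_rate_for(event)
    if random.randrange(rate) != 0:
        return None  # dropped -- but in a statistically representative way
    # Record the rate so the backend can weight this event as `rate` events,
    # keeping counts and percentiles honest.
    event["sample_rate"] = rate
    return event
```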
• • •
I woke up this am, scanned Twitter from bed, and spent an hour debating whether I could summon the energy to respond to the latest breathless fatwa from Paul Graham.
I fell asleep again before deciding; just as well, because @clairevo said it all more nicely than I would have.
(Is that all I have to say? No, dammit, I guess it is not.)
This is so everything about PG in a nutshell, and why I find him so heartbreakingly frustrating.
The guy is brilliant, and a genius communicator. He's seen more and done more than I ever will, times a thousand.
And he is so, so, so consistently blinkered in certain predictable ways. As a former fundamentalist, my reference point for this sort of conduct is mostly religious.
And YC has always struck me as less an investment vehicle and more a cult dedicated to founder worship.
Important context: that post was quote tweeting this one.
Because I have also seen designers come in saying lovely things about transformation and user centricity, and end up wasting unthinkable quantities of organizational energy and time.
If you're a manager, and you have a boot camp grad designer who comes in the door wanting to transform your org, and you let them, you are committing professional malpractice.
The way you earn the right to transform is by executing consistently, and transforming incrementally.
(by "futureproof" I mean "true 5y from now whether AI is writing 0% or 100% our lines of code)
And you know what's a great continuous e2e test of your team's prowess at learning and sensemaking?
1. Regularly injecting fresh junior talent
2. Composing teams across a range of levels
"Is it safe to ask questions" is a low fucking bar. Better: is it normal to ask questions, is it an expected contribution from every person at every level? Does everyone get a chance to explain and talk through their work?
The advance of LLMs and other AI tools is a rare opportunity to radically upend the way we talk and think about software development, and change our industry for the better.
The way we have traditionally talked about software centers on writing code, solving technical problems.
LLMs challenge this -- in a way that can feel scary and disorienting. If the robots are coming for our life's work, what crumbs will be left for you and me?
But I would argue that this has always been a misrepresentation of the work, one which mistakes the trees for the forest.
Something I have been noodling on is how to describe software development in a way that is both a) true today, and b) relatively futureproof, meaning still true 5 years from now if the optimists have won and most code is no longer written by humans.
A couple of days back I went on a whole rant about lazy billionaires punching down and blaming wfh/"work-life balance" for Google's long slide from dominance.
I actually want to take this up from the other side, and defend some of the much-hated, much-maligned RTO initiatives.
I'm purposely not quote tweeting anyone or any company. This is not about any one example, it's a synthesis of conversations I have had with techies and seen on Twitter.
There seems to be a sweeping consensus amongst engineers that RTO is unjust, unwarranted and cruel. Period.
And like, I would never argue that RTO is being implemented well across the board. It's hard not to feel cynical when:
* you are being told to RTO despite your team not being there
* you are subject to arbitrary badge checks
* reasonable accommodations are not being made