A good question! Monitoring checks and unit tests perform essentially the same function: they regularly and automatically check that the code or system is operating within "normal" bounds.
So do we still need all these tests and checks, in a post-o11y world? The answer is yes...and no.
Yes, you still want to write tests and monitoring checks: to catch regressions, to catch or rule out all the dumb problems before you waste your precious curiosity on them.
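To make the "catch the dumb problems before they waste your curiosity" point concrete, here's a minimal sketch of a regression test. The `parse_duration` function and its bug history are invented for illustration; the point is that the test pins down a dumb problem so it can never silently come back.

```python
# Hypothetical example: a regression test pinning down a bug fixed once.
# parse_duration and its behavior are invented for illustration.

def parse_duration(s: str) -> int:
    """Parse strings like '5s', '2m', or '1h' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(s[:-1]) * units[s[-1]]

def test_parse_duration_regression():
    # Suppose this once returned 2 instead of 120 because the unit
    # suffix was ignored -- the test exists so that dumb problem
    # stays ruled out forever.
    assert parse_duration("2m") == 120
    assert parse_duration("5s") == 5
    assert parse_duration("1h") == 3600
```

It never pages anyone at 3am; it just fails loudly in CI the moment the regression reappears.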
But here's where tests and monitoring diverge. Tests don't (usually) wake you up when they fail, whereas the whole raison d'être of monitoring is alerts, those every-alert-must-be-actionable fucking alerts.
So there's a cost to be borne. Is it worth it? 🤔
Here is where I would argue that, in the absence of o11y tooling, teams have been horribly overloading their monitoring tools and alerts.
Instead of just a few top level service and e2e alerts that clearly reflect user pain, many shops have accumulated decades of
sedimentary layers of warnings and alerts and monitoring notifications. Not just to alert a human to investigate, but to *try to debug for them.*
They don't have tools to follow the bread crumbs. So they set off fireworks and town criers shouting clues on every affected block.
In a densely interconnected system, it's nearly impossible to issue a single, clean alert that is also correct about the root cause. (First of all, there is rarely "a root cause").
Instead what you get is a few hundred things squalling about getting slower --
none of which are the cause. However, your experienced sysadmin will roll over in bed, groan, skim a handful of the alerts at random; pronounce "redis again" and go back to sleep.
These squalling alerts -- the ones that tell you details about things you shouldn't have to care about,
but that you leave up because they're the only heuristic you have for diagnosing complex system states -- can and should die off once you have observability.
With extreme prejudice. They burn you out, make you reactive, and they make you a worse engineer.
Use o11y for what it's great at -- swiftly understanding and diagnosing complex systems, from the perspective of your users.
Use monitoring for what it's great at -- errors, latency, req/sec, and e2e checks.
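A sketch of the "few top-level alerts that clearly reflect user pain" idea: page a human only on user-visible signals (errors, latency, e2e checks), and treat internal-component symptoms as clues for o11y tooling, not pages. The `ProbeResult` shape and the thresholds here are invented for illustration.

```python
# Hypothetical sketch: page only on user-visible pain, never on
# internal-component noise. Thresholds and field names are invented.

from dataclasses import dataclass

@dataclass
class ProbeResult:
    error_rate: float      # fraction of failed requests, 0.0-1.0
    p99_latency_ms: float  # 99th percentile request latency
    e2e_check_passed: bool # did the synthetic signup flow succeed?

def should_page(probe: ProbeResult) -> bool:
    """Alert a human only when users are actually hurting."""
    return (
        probe.error_rate > 0.01          # >1% of requests failing
        or probe.p99_latency_ms > 2000   # p99 above 2 seconds
        or not probe.e2e_check_passed    # the user-facing flow is broken
    )

# "redis slow", "disk 80% full", "queue depth high" -- none of these
# appear above. They're breadcrumbs for the o11y tooling to surface
# while you debug, not reasons to wake anyone up.
```

Everything else -- the few hundred squalling symptom checks -- stays out of the paging path entirely.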
I woke up this am, scanned Twitter from bed, and spent an hour debating whether I could stomach the energy to respond to the latest breathless fatwa from Paul Graham.
I fell asleep again before deciding; just as well, because @clairevo said it all more nicely than I would have.
(Is that all I have to say? No, dammit, I guess it is not.)
This is so everything about PG in a nutshell, and why I find him so heartbreakingly frustrating.
The guy is brilliant, and a genius communicator. He's seen more and done more than I ever will, times a thousand.
And he is so, so, so consistently blinkered in certain predictable ways. As a former fundamentalist, my reference point for this sort of conduct is mostly religious.
And YC has always struck me less like an investment vehicle, much more like a cult dedicated to founder worship.
Important context: that post was quote tweeting this one.
Because I have also seen designers come in saying lovely things about transformation and user centricity, and end up wasting unthinkable quantities of organizational energy and time.
If you're a manager, and you have a boot camp grad designer who comes in the door wanting to transform your org, and you let them, you are committing professional malpractice.
The way you earn the right to transform is by executing consistently, and transforming incrementally.
(by "futureproof" I mean "true 5 years from now whether AI is writing 0% or 100% of our lines of code")
And you know what's a great continuous e2e test of your team's prowess at learning and sensemaking?
1. regularly injecting fresh junior talent
2. composing teams from a range of levels
"Is it safe to ask questions" is a low fucking bar. Better: is it normal to ask questions, is it an expected contribution from every person at every level? Does everyone get a chance to explain and talk through their work?
The advance of LLMs and other AI tools is a rare opportunity to radically upend the way we talk and think about software development, and change our industry for the better.
The way we have traditionally talked about software centers on writing code, solving technical problems.
LLMs challenge this -- in a way that can feel scary and disorienting. If the robots are coming for our life's work, what crumbs will be left for you and me?
But I would argue that this has always been a misrepresentation of the work, one which confuses the trees for the forest.
Something I have been noodling on is, how to describe software development in a way that is both a) true today, and b) relatively futureproof, meaning still true 5 years from now if the optimists have won and most code is no longer written by humans.
A couple days back I went on a whole rant about lazy billionaires punching down and blaming wfh/"work life balance" for Google's long, slow loss of dominance.
I actually want to take this up from the other side, and defend some of the much hated, much-maligned RTO initiatives.
I'm purposely not quote tweeting anyone or any company. This is not about any one example, it's a synthesis of conversations I have had with techies and seen on Twitter.
There seems to be a sweeping consensus amongst engineers that RTO is unjust, unwarranted and cruel. Period.
And like, I would never argue that RTO is being implemented well across the board. It's hard not to feel cynical when:
* you are being told to RTO despite your team not being there
* you are subject to arbitrary badge checks
* reasonable accommodations are not being made