a) logically impossible
b) a bottomless money pit
c) mostly junk (boring things being infinitely more common than interesting things)
d) ends up with a heavy bias towards errors, and
e) causes good people to lurch desperately towards bad ideas (like "AI")
just a few minutes ago i was having a conversation with @paulosman about the future of our profession as systems engineers, and he pulled up a quote about four ways to solve problems: solution, resolution, absolution, and dissolution. (e.g. squiretothegiants.com/2016/06/16/so-…)
logs are such a canonical example of a solution that teams are used to either absolving (ignoring it and hoping it will go away) or solving (an outcome that is good enough to move on).
if your problem is "do something with logs", a good enough answer is "pick any log vendor."
logs are a commodity now, a race to the bottom. you can ship them to any one of twenty or fifty vendors, pay a few cents per gig to not think about them, and sleep fine at night.
but what if the problem you are trying to solve is "understand my systems" or "understand my code?"
(er sorry, i got that wrong above ... "resolve" is the one that yields a good enough outcome. "solve" is do something that yields the best possible outcome. CRITICAL DISTINCTION ☺️)
if what you're trying to do is *understand your systems*, logs are a shit-poor way of trying to do that. we fall back to logs when that's all we have. but what if you didn't log the detail you needed to understand the problem in front of you?
what if what you logged is misleading, or flat out wrong? what if the problem only manifests when you examine system behaviors in aggregate, and is invisible from the perspective of any given host? what if it doesn't match up with the telemetry from your monitoring systems?
what if your logs are filled with useless spew left over from the last engineer who was frantically stuffing in shit to help her "step" through the code from the last outage?
worst of all, what if you don't know what to look for? logs _only work_ when you know what to look for.
the problem most people face is that they've been using logs for so long that they've wrestled them into a state where it's "good enough" -- everyone is familiar enough with the local whimsies that they know how to get what they need out of them to get thru most situations.
and so they completely lose sight of the fact that the problem they're trying to solve is, "understand my systems".
and they start shoveling more and more and more engineering energy into the gaping maw of their logging "solution". the problem instead becomes "managing logs".
many companies end up sinking so much engineering talent into this problem, you'd think they had a hybrid mission -- whatever their company does, plus log management.
logs for logs' sake will swiftly become a millstone around the neck of any team that lets this happen.
there's so much emotional attachment around this topic, it seems like many teams (and leaders) have come to imbue their logs with a kind of emotional security blanket magic.
(for which you can prob thank the millions of dollars spent on marketing by aforementioned logs vendors.)
but what is the problem you are trying to solve?
it is not "keep a record of every thing that happened".
it is not "search our logs".
it is not "keep everything".
the problem you are trying to solve is understanding your systems and the code you write that makes your business.
if the goal is to "solve" it (find the best possible outcome) and, in some cases, "dissolve" it (reframing the question or redesigning the system so it no longer exists) ... what would you do?
well, first identify the minimal set of logs that you actually do need (usually for compliance reasons). pick any log vendor, stash and forget.
then, of course, spin up a prometheus instance or a datadog account and -- LOLJK, you know what i'm going to say about observability ☺️
but if you CAN'T start using honeycomb or lightstep or similar, if you're stuck with what you've got, what can you do to dig yourself out of logs hell?
at LEAST you can shift away from random acts of logging violence towards emitting arbitrarily-wide structured data blobs,
one per service per request, containing the full context of the request, env, parameters and so forth. i've written about this extensively charity.wtf/2019/02/05/log…
aws apparently does this, and stripe calls it "canonical logs" stripe.com/blog/canonical… so it's not just my bullshit.
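here's a minimal sketch of what "one wide event per service per request" could look like. everything named here is made up for illustration: the `RequestEvent` helper, the `checkout-api` service, the `APP_ENV` variable and the field names are all assumptions, not anybody's real API. the idea is just that you keep adding context to one blob as the request runs, and write it out exactly once at the end.

```python
import json
import os
import socket
import sys
import time
import uuid


class RequestEvent:
    """Hypothetical helper: collects context for one request, emits one JSON line."""

    def __init__(self, service, stream=None):
        self.stream = stream if stream is not None else sys.stdout
        self.started = time.monotonic()
        # Context known up front: who and where we are.
        self.fields = {
            "service": service,
            "hostname": socket.gethostname(),
            "env": os.environ.get("APP_ENV", "dev"),  # APP_ENV is an assumed variable
            "request_id": str(uuid.uuid4()),
            "timestamp": time.time(),
        }

    def add(self, **fields):
        """Attach more context; the event can be as wide as you like."""
        self.fields.update(fields)

    def emit(self):
        """Write the whole event as a single structured line."""
        elapsed = time.monotonic() - self.started
        self.fields["duration_ms"] = round(elapsed * 1000, 3)
        self.stream.write(json.dumps(self.fields, sort_keys=True, default=str) + "\n")


def handle_request(method, path, params, stream=None):
    """Hypothetical handler: one event per request, emitted even on failure."""
    event = RequestEvent("checkout-api", stream=stream)
    event.add(http_method=method, http_path=path, params=params)
    try:
        # Real work would go here; add context as you learn it.
        event.add(cart_items=len(params.get("items", [])), status_code=200)
        return "ok"
    except Exception as exc:
        event.add(status_code=500, error=repr(exc))
        raise
    finally:
        event.emit()
```

the `finally` is the important bit: success or failure, you get exactly one line for the request, with the error sitting next to all the rest of the context instead of in some other log line you have to go hunting for.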
then ship these events to someplace where you can aggregate and slice and dice them. you can also tee them off to a honeycomb free account, just fyi ;) since that's basically what our client side integrations do.
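"slice and dice" can start out very small. a rough sketch, assuming your events are JSON lines that carry a `duration_ms` field (the `slice_by` function, the sample data and the field names are all hypothetical): group by any field you like and compare the groups.

```python
import json
from collections import defaultdict


def slice_by(lines, key, value_field="duration_ms"):
    """Group JSON-line events by `key` and summarize `value_field` per group."""
    groups = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        if value_field in event:
            groups[event.get(key)].append(event[value_field])
    return {
        group: {
            "count": len(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for group, values in groups.items()
    }


# Made-up sample events; in real life these would come from wherever you shipped them.
sample = [
    '{"build_id": "a1", "hostname": "web-1", "duration_ms": 12}',
    '{"build_id": "a1", "hostname": "web-2", "duration_ms": 18}',
    '{"build_id": "b2", "hostname": "web-1", "duration_ms": 950}',
    '{"build_id": "b2", "hostname": "web-2", "duration_ms": 1010}',
]

by_build = slice_by(sample, "build_id")
by_host = slice_by(sample, "hostname")
```

in this made-up data, slicing by `hostname` makes both hosts look about equally sad, while slicing by `build_id` points straight at build `b2`. that kind of question is only cheap to ask when every event carries the full context.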
btw, if you haven't yet done this refactoring and if you shudder to think how much labor it would take, check out @cribl_io. it's by ex splunk folks and it reconstitutes log spew into coherent events for you.
this is *not* a solution, but could be a bridge to your future. 😉
in the end, if your goal is to *understand your systems* then you need to be aggregating loads of rich context around the request, because that is what most closely tracks your users' experience.
this is observability.
logs can (*can*) be pressed into the service of this goal, if gathered and formatted and expressed in the correct ways, but the presence (or lack) of logs is orthogonal to the goal.
use logs or don't use logs, but don't let "logs" become the problem you solve for. the end.
• • •
I woke up this am, scanned Twitter from bed, and spent an hour debating whether I could stomach the energy to respond to the latest breathless fatwa from Paul Graham.
I fell asleep again before deciding; just as well, because @clairevo said it all more nicely than I would have.
(Is that all I have to say? No, dammit, I guess it is not.)
This is so everything about PG in a nutshell, and why I find him so heartbreakingly frustrating.
The guy is brilliant, and a genius communicator. He's seen more and done more than I ever will, times a thousand.
And he is so, so, so consistently blinkered in certain predictable ways. As a former fundamentalist, my reference point for this sort of conduct is mostly religious.
And YC has always struck me less like an investment vehicle, much more like a cult dedicated to founder worship.
Important context: that post was quote tweeting this one.
Because I have also seen designers come in saying lovely things about transformation and user centricity, and end up wasting unthinkable quantities of organizational energy and time.
If you're a manager, and you have a boot camp grad designer who comes in the door wanting to transform your org, and you let them, you are committing professional malpractice.
The way you earn the right to transform is by executing consistently, and transforming incrementally.
(by "futureproof" I mean "true 5y from now whether AI is writing 0% or 100% of our lines of code")
And you know what's a great continuous e2e test of your team's prowess at learning and sensemaking?
1, regularly injecting fresh junior talent
2, composing teams of a range of levels
"Is it safe to ask questions" is a low fucking bar. Better: is it normal to ask questions, is it an expected contribution from every person at every level? Does everyone get a chance to explain and talk through their work?
The advance of LLMs and other AI tools is a rare opportunity to radically upend the way we talk and think about software development, and change our industry for the better.
The way we have traditionally talked about software centers on writing code, solving technical problems.
LLMs challenge this -- in a way that can feel scary and disorienting. If the robots are coming for our life's work, what crumbs will be left for you and me?
But I would argue that this has always been a misrepresentation of the work, one which confuses the trees for the forest.
Something I have been noodling on is, how to describe software development in a way that is both a) true today, and b) relatively futureproof, meaning still true 5 years from now if the optimists have won and most code is no longer written by humans.
A couple days back I went on a whole rant about lazy billionaires punching down and blaming wfh/"work life balance" for Google's long slide of lost dominance.
I actually want to take this up from the other side, and defend some of the much hated, much-maligned RTO initiatives.
I'm purposely not quote tweeting anyone or any company. This is not about any one example, it's a synthesis of conversations I have had with techies and seen on Twitter.
There seems to be a sweeping consensus amongst engineers that RTO is unjust, unwarranted and cruel. Period.
And like, I would never argue that RTO is being implemented well across the board. It's hard not to feel cynical when:
* you are being told to RTO despite your team not being there
* you are subject to arbitrary badge checks
* reasonable accommodations are not being made
* a calcified, risk averse culture
* an addiction to fat, monopolistic margins on advertising revenues -- the fossil fuels of internet monetization
* a system that rewarded dilettante engineering over value creation or maintenance (HI READER)
Once, just once, I'd love to hear a billionaire admit that monopolies aren't healthy for anyone, even the monopoly holder.
Remember how bloated and sluggish Ma Bell got before it got broken up in the mid-80s? Monopoly profits are death to innovation.