* not communicating changes well to users
* relying on users to report issues
* widespread production illiteracy
* big bang deploys, poor tooling
Last week I wrote that piece on developing at Honeycomb vs Parse, and I can't stop thinking about how I might communicate this better.
Imagine going to the dentist for the first time in 10 years, vs going in for your regular 6 month checkup. Those are likely to be very different conversations, right?
What are the chances they ever get around to giving a fuck about some of those?
Now this is hardly a novel insight. Many of you would protest: you already do this -- you have for years!
And yes: absolutely. The intent was there. But the *tooling* wasn't.
What you needed was what I have called "observability". And if you think about it for a minute you will see why it matters, and why I am such a stickler for the definition.
...arbitrarily wide events, emitting all the context per request, such that you get to group by any dimensions, or see what attributes any set of errors have in common, etc. All in near real time.
* break down by build ID, app ID, user ID, etc., and any combination thereof
* trace the request
* know the full context of every request or error at every hop
* find out exactly what any group of errors had in common
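To make that list concrete, here is a minimal sketch of what "wide events" buy you. Everything in it is an assumption for illustration: the field names, the sample events, and the helpers `group_count` and `common_attributes` are hypothetical, not Honeycomb's API or data model. The idea is just one dict per request carrying all of its context, so you can slice by any combination of fields and ask what a set of errors share.

```python
from collections import Counter

# Hypothetical wide events: one dict per request, carrying all of its context.
EVENTS = [
    {"trace_id": "t1", "build_id": "b41", "app_id": "a1", "user_id": "u7",
     "endpoint": "/export", "status": 200, "duration_ms": 35},
    {"trace_id": "t2", "build_id": "b42", "app_id": "a1", "user_id": "u9",
     "endpoint": "/export", "status": 500, "duration_ms": 910},
    {"trace_id": "t3", "build_id": "b42", "app_id": "a2", "user_id": "u3",
     "endpoint": "/export", "status": 500, "duration_ms": 875},
    {"trace_id": "t4", "build_id": "b42", "app_id": "a2", "user_id": "u3",
     "endpoint": "/query", "status": 200, "duration_ms": 48},
]


def group_count(events, *dimensions):
    """Count events per combination of the given dimensions."""
    return Counter(tuple(e.get(d) for d in dimensions) for e in events)


def common_attributes(events):
    """Return the key/value pairs shared by every event in the group."""
    if not events:
        return {}
    shared = set(events[0].items())
    for e in events[1:]:
        shared &= set(e.items())
    return dict(shared)


# Break the errors down by any combination of fields...
errors = [e for e in EVENTS if e["status"] >= 500]
by_build_app = group_count(errors, "build_id", "app_id")

# ...and ask what exactly this group of errors had in common.
shared = common_attributes(errors)
print(by_build_app)
print(shared)
```

In this toy data the errors span two apps and two users, but they all share one build and one endpoint, which is the kind of answer you want in seconds rather than after an afternoon of grepping logs. A real system does this over billions of events in near real time; the sketch only shows the shape of the question.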
Shipping code every day under these circumstances is like going day after day, meal after meal, without brushing your teeth.
The Stripe developer report (stripe.com/reports/develo…) says that developers spend 41% of their time on bullshit. Ok, that's not great.
And I confess that we still spend maybe 25-30% of our time on this crap.
"That's not SO much of a difference, is it?" they say, frowning.
First, we have a tiny fucking team for how much surface area we build and maintain. A storage engine, query planner, API, many microservices, SDKs and beelines in half a dozen languages, a UI, billing, an on-prem crypto proxy...
7 engineers.
We haven't had to do that.
There is a huge, huge difference between a team that is lurching breathlessly from crisis to crisis -- the technical equivalent of broken arms and smallpox --
My team puts in 25-30% of their time doing maintenance-type work that doesn't push the product forward. But the time is planned and allocated just like any other work.
We spend our cleanup cycles on hangnails and high blood pressure, so that problems aren't allowed to *become* big.
It isn't the only thing, of course. Observability is necessary but not sufficient. But it's the main one that most of you haven't got. ☺️😴🐝