I've been talking to lots of teams about their observability journey, or how they managed to dig themselves out of hell and get a handle on shit. Some patterns definitely emerge.
The first thing many teams look at is the on call rotation. (Smart; heading straight for the pain.)
Folks are worn out, product is upset whenever something unexpected comes up -- it's a bad scene, because product work and on call work are too tightly coupled. ANY non-feature work means a deadline slips.
So the first thing they do is enact a simple rule: no product work during on call weeks. Period. Those weeks are for fixing and maintaining the system.
This forces leadership to plan for using 75-85% of full capacity as a steady state. Whew; now we have some flex in the system.
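The thread doesn't say where the 75-85% comes from, so here is a rough sketch of one way to get there, assuming an evenly shared rotation where one engineer is on call each week and does no product work. The function name and the team sizes are made up for illustration.

```python
def product_capacity(team_size: int, on_call_at_once: int = 1) -> float:
    """Fraction of engineer time left for product work in a steady state.

    Assumption (not from the thread): everyone rotates evenly, and whoever
    is on call does zero product work that week.
    """
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    if not 0 <= on_call_at_once <= team_size:
        raise ValueError("on_call_at_once must be between 0 and team_size")
    return (team_size - on_call_at_once) / team_size


# Hypothetical team sizes; plug in your own rotation.
for size in (4, 5, 6, 7):
    share = product_capacity(size)
    print(f"{size} engineers, 1 on call: {share:.0%} left for product work")
```

Under that assumption, rotations of four to seven people land roughly in the range above. A smaller team gives up a bigger slice, which is worth knowing before you promise dates.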
Now that you have some devoted resources, you can embrace the tenet that ✨on call should not suck✨. It should *never* be something that people have to plan their lives around. Aim those resources straight to your on call pains.
When you've started taking on call and the health of your systems seriously, instead of just treading water and gasping, it's time to haul in some observability.
The stuff that lets you zero in on specific users, endpoints, delivery bugs, etc. and FIX THEM. thenewstack.io/observability-…
Observability is a superpower in the hands of an engineering on call team. It is *critical* for helping them swiftly identify unknown-unknowns with no prior knowledge of the answer.
You can't arm those poor little pups with a bunch of system stats and aggregate dashboards.
Tracing, also. Start by tracing your build pipeline and knocking off, refactoring, parallelizing whatever's slow, until you get it down to ~15 min.
ALL the conversations you have while trying to streamline your build pipeline are great conversations to have.
"Do we really still need this test suite that used to be helpful, but now takes 20min to run?"
"Can we upload these assets in a faster way, or maybe not at all?"
"Do these fuzz tests REALLY belong behind the gate to prod?"
ALL of these questions will make you a better team.
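If you don't have a tracing tool wired into your build yet, even a crude timer per step will start these conversations. Below is a minimal sketch, not a real tracing library: `BuildTrace` and the step names are hypothetical, and the 15 minute budget is just the rough target mentioned above.

```python
import time
from contextlib import contextmanager


class BuildTrace:
    """Collects (step name, seconds) pairs for one build run."""

    def __init__(self) -> None:
        self.spans: list[tuple[str, float]] = []

    def record(self, name: str, seconds: float) -> None:
        self.spans.append((name, seconds))

    @contextmanager
    def step(self, name: str):
        # Time whatever runs inside the `with` block, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def total(self) -> float:
        return sum(seconds for _, seconds in self.spans)

    def slowest_first(self) -> list[tuple[str, float]]:
        return sorted(self.spans, key=lambda span: span[1], reverse=True)

    def over_budget(self, budget_seconds: float = 15 * 60) -> bool:
        return self.total() > budget_seconds


# Hypothetical usage: wrap each real build step in `trace.step(...)`.
trace = BuildTrace()
with trace.step("unit_tests"):
    time.sleep(0.01)  # stand-in for the actual work
for name, seconds in trace.slowest_first():
    print(f"{name}: {seconds:.3f}s")
```

Sort by slowest first and argue about the top entry. This only measures sequential wall-clock time per step; a proper tracer would also show you nesting and what could run in parallel.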
Ok, next!
Now it's time to negotiate some SLIs and SLOs with your counterparts in business and product. Don't go crazy, just a few. Like say the 3-5 most critical flows for your business.
SLOs are the API between business and engineering. You can argue all day about spending your cycles on features or resiliency, and no one will ever win or feel good about it.
OR, you can negotiate a number that you all buy in to and agree comes first.
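To make "a number" concrete, here is a small sketch of the usual arithmetic, under some assumptions of mine: the SLI is a simple ratio of good events to total events, and "error budget" is the common SRE framing of how many bad events the target still allows. The flow name, the target, and the counts are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str      # hypothetical flow name, e.g. "checkout"
    target: float  # agreed fraction of good events, e.g. 0.99


def sli(good: int, total: int) -> float:
    """Measured fraction of good events; no traffic counts as fully good."""
    return 1.0 if total == 0 else good / total


def budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Share of the error budget still unspent (negative means overspent)."""
    allowed_bad = (1 - slo.target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


checkout = Slo(name="checkout", target=0.99)
print(f"SLI: {sli(995, 1000):.3f}")
print(f"budget remaining: {budget_remaining(checkout, 995, 1000):.0%}")
```

While there is budget left, ship features. Once it goes negative, the number you all agreed on says reliability work comes first, and nobody has to re-argue it.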
The next issue that usually comes up has to do with your product org, and its relationship with your engineering (and design) teams.
I often hear something like "product only wants to ship features!" or "we try, but they're constantly getting pressured by sales!"
This also comes in the guise of, "WE (the engineers and eng managers) know the non-feature work that needs to be done, but we report up to a director (or VP) that wants us to be a feature factory. We never win."
Now ideally, everyone in the org would be perfectly aligned and in full agreement at all times. After all, we are all in this together, and we all want what's best for the org, right?
EVERYBODY wants to ship swiftly and safely and make users happy. (Remember that.)
So if you keep losing the battle to pay down your infra investments and actually fix things, consider two possibilities:
1) they don't understand why backend/reliability work matters, 2) they do understand, but aren't getting enough of the right signals to make good decisions.
If 2), you probably need to surface more information about the facts on the ground, and you need to do it in a way that is *specific* and *actionable*. Moaning about how everything is fucked is not going to help.
Accept that product *is* getting lobbied by sales. This is fine and good -- it's their job! But hopefully product is hand in glove with engineering, and has built up a relationship of trust and good faith, so they can synthesize the inputs and make good calls.
Engineering shouldn't dictate the product map, any more than product should give assignments to engineers.
But engineers, you can really help things by not just being aware of upcoming investment work, but also letting managers and product partners know ✨well in advance✨.
Nobody likes to get a six week project sprung on them with a start time of "tomorrow". The whole point of this exercise is to reduce the firefighting and get better planning out into the future, *together*.
Write up a doc. Provide context. Assign certainty levels.
This is a place where engineers SO often get tripped up, because they "don't have the data" they need.
You are NEVER going to know to that level of certainty. Get comfortable with pulling estimates out of your ass and improving on them with time. Be like your business teams.
Engineering managers play such a critical role here, bc it is literally your job to translate between biz objectives and engineering projects, between biz teams and product and your own team.
Sometimes this means it's your job to push information up the chain, *vigorously*.
Lastly let's talk about culture. If you hear phrases like "real work" or "my actual job" being bandied about, consider how corrosive this may be.
Words reflect and reaffirm beliefs, which affects judgment, which plays out in reviews, comp, and leveling decisions.
It's worth gently identifying and trying to replace phrases like that.
Because reactive shitwork and firefighting is ~not~ a defining characteristic of backend or operations engineering.
I can see how one may become tightly associated with the other, at an organization where backend/ops engineering is so degraded, so far behind their debt payments, that they scramble from one fire to the next.
But that's a dead giveaway of an org that doesn't value the work.
Infrastructure, backend, ops -- whatever you want to call it -- is not a cost center. It does not distract and take away from the "real work" of shipping features.
It is the *foundation* upon which all your shiny features rest. Done well, it is an accelerator, a force multiplier.
You can't build a house without a fucking foundation. You can't build and ship value to users without one either.
Infrastructure done well is the difference between shipping weekly or several times an hour. It's the difference between delighting your users or frustrating them.
Infrastructure done well is the difference between having to hire 50 engineers or 500 to get the job done. It's the difference between having to spend 20-30% of your engineering labor on running your systems or 70-80%.
Good infrastructure is the difference between your site going down for five minutes or an entire day.
It's the difference between engineers spending most of their cycles moving the business forward, or fighting with their tools and waiting on each other.
This isn't a hard case to make. It has the benefit of being absolutely fucking true. By helping your org understand the value and impact of good infrastructure, you are doing the LORD's good work for yourself, your team, and your company.
Infrastructure is all about enabling and empowering engineers to be their most powerful and effective selves.
But it isn't always obvious how and why. Sometimes it's even counterintuitive.
That is why they need your help: convincing, explaining, connecting the dots. Good luck.💜
Yeah. This gets to a weakness of engineering leveling systems. We rightly encourage high level engineers to seek out work that is a challenge at their level...
But there isn't always enough of that highly difficult or tech lead work to go around.
When level-appropriate work comprises a lot of your performance review, you get something very dangerous: roving bands of skilled, restless engineers competing for vanity projects and systems that should never, ever have been built, but which you now have to maintain. 😬
One way to prevent this is to *not* overhire, especially very senior engineers. Hire juniors and mid-levels with room to grow.
Most engineering work is not rocket science, and mid-levels in particular are often the most prolific and productive engineers you have.
Communication pathways are sooo hard to get right, and inspire such frothing, unreasonable rage when they get it wrong.
The last time I used Jira was well over a decade ago, and I thought it was impenetrable spaghetti at the time. I can't imagine it's gotten any simpler...
But it's kind of an impossible problem. Of course it's going to turn into feature soup when you've been making bank on enterprise for this long.
Every team starts out trying to replicate and "improve" on how a squintillion people and teams interoperate,
I was just editing the o11y book chapter on build vs buy and ROI, and this sentence jumped out at me:
"High-performing organizations use great tools."
It's true, right? Behold all the FAANG engineers who leave their cushy perches and are shocked by the amount of tooling they had come to take for granted. It's almost like having to learn to engineer all over again.
Big companies know how critical good tooling is, and pay for it.
I'm going to say two very contradictory things, both of which are true:
1) Tools are getting better and better, and you should try to keep up
2) Switching tooling is hard, and you should only do it when the gain is ~an order of magnitude better than what you've got.
You don't owe it to your employer to fix all the ways they are fucked up. Before going to battle, ask yourself:
* how much power do I have here?
* is the problem within my domain of responsibility or influence?
* who are my allies?
* do I have a reasonable chance of success?
and also: are they worth it? Is your employer fundamentally worth you staying and fighting? Is their product a net good for the world? Are your leaders decent, ethical people who care a lot?
If so, sure, pick some battles. See what happens. ☺️
Ah! This is a very good point. Good recruiters are outnumbered by bad ones, which are indistinguishable from spam. And yes, the more you put out the more you'll get.
Here we are, now going on the fourth straight month of headlines all about how a record number of people are quitting their jobs.
There's a lot of pain behind that statistic, but also a strident, activated edge to labor that feels unlike anything seen in my lifetime.
I am *all for* more people quitting their jobs. I am *all for* employers needing to compete for employees by treating them better, increasing their wages, and offering more flexibility and support.
Most people in our industry stay at jobs they don't love, far too long.
So here's a piece of advice that I find myself giving over and over again, to senior folks who are daunted by the prospect of having to go out and search for the right role, the right team, the right company ... it's like looking for a needle in a haystack, right? 😰