"What is SRE?"

SRE's job is dialogue.
Many of us have worked in companies with traditional "engineering" and "operations" teams.

A senior executive will identify a need for some software service as part of a business strategy. That executive will give a manager some funding - headcount, etc.
The manager finds themselves in charge of a service that doesn't exist yet. What do they do? They hire engineers and engineering managers, and start writing the software for this new service.

Time goes by, and everyone is happy and productive.
As the service nears completion, something happens. Probably the development lifecycle ran long, but usually not by more than was budgeted. The Quality Assurance apparatus for this service probably got thrown together at the last minute, using the last of their funding.
Time to deploy! The traditional model involves "handing off" the service's operation to the ops team, which hasn't been involved up to this point.
At this point in the story, even if the manager wanted to change the trajectory of their service, they don't have any impetus left to do it. And they've contacted their operations team for the first time.

And nothing actually works like it was expected to.
Things that worked in test break apart when not using their test clients and test backends. Things that seemed to be running fine turn out to have bad algorithmic complexities - probably there is not even an estimate for the service's "footprint" in CPU and memory.
No one documented how to start and stop the systems. They did it once, and didn't need to do it again. "Capacity plans" are estimates based on estimates of data that has never been loaded in its entirety.
The ops budget doesn't come out of the engineering "bucket," and even if the organization had budgeted ops with enough resources to resolve these issues, the ops team still is not an engineering team. They can't re-engineer the service.

They're stuck on a trolley problem.
That train only has one direction, and it's going to run on those tracks no matter what you do. It will launch, (nearly) as it is, or it will just not launch. A/B switch, no in-between.

So these services have launch failures.
SRE differs from this old model in many ways. Today, for example, we generally require end-to-end testing and iteration of a live service, running - if not publicly released - in the production environment it will run in when it is released. We involve operations aspects early.
A team writing a new service contacts their SRE representatives regularly throughout development. By running the service themselves, they write the "run book" (the service's operational documentation), which they then use themselves. They adopt best practices during service development (not after).
Their future SRE team is more like a mentor through this process, teaching them how to run their operations themselves.
However, there is one critical component of SRE which, in my opinion, drives what SRE "is": when it is time to hand off the service and receive SRE support, the developers have to *ask*. The SRE team can say "no."

But of course you never really say "no."
"Not yet" is much more common - you constructively name the specific things to improve ("pager rate is too high," "capacity plan is lacking predictions (N) months out," "documentation of (X) doesn't specify impact of outage (Y)," etc.).
Generally speaking these are all really trivial things, because everything "hard" about getting a new service to run in a living environment has already been ironed out /in-line/ with development.
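
To make the shape of a "not yet" concrete, here is a minimal sketch of a hand-off review in Python. Everything in it is a hypothetical illustration rather than any real SRE tool: the `ServiceSnapshot` fields, the `review_handoff` function, and the threshold values are assumptions chosen only to mirror the three examples above.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceSnapshot:
    """Hypothetical summary a dev team submits when asking for SRE support."""
    pages_per_shift: float          # observed pager rate
    capacity_forecast_months: int   # how far ahead the capacity plan predicts
    # Components whose outage impact is not yet documented.
    undocumented_outage_impacts: list = field(default_factory=list)


def review_handoff(snapshot, max_pages_per_shift=2.0, min_forecast_months=6):
    """Return (decision, improvements).

    The thresholds are illustrative assumptions, not recommended values.
    The answer is never a bare "no": it is "yes", or "not yet" plus the
    specific things to improve.
    """
    improvements = []
    if snapshot.pages_per_shift > max_pages_per_shift:
        improvements.append(
            f"pager rate is too high ({snapshot.pages_per_shift} per shift)"
        )
    if snapshot.capacity_forecast_months < min_forecast_months:
        improvements.append(
            f"capacity plan is lacking predictions {min_forecast_months} months out"
        )
    for component in snapshot.undocumented_outage_impacts:
        improvements.append(
            f"documentation doesn't specify impact of outage of {component}"
        )
    decision = "yes" if not improvements else "not yet"
    return decision, improvements


if __name__ == "__main__":
    snapshot = ServiceSnapshot(
        pages_per_shift=5.0,
        capacity_forecast_months=3,
        undocumented_outage_impacts=["billing backend"],
    )
    decision, improvements = review_handoff(snapshot)
    print(decision)
    for item in improvements:
        print(" -", item)
```

The point of the sketch is the return value: the objection always arrives together with a list of specific, fixable items, and that list is what the two teams then talk about.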

But that "no" is critical, because it creates dialogue.
Exactly how reliable should a service be? How much operational toil is "too much"?

In what ways can we expend small efforts - or include in some future expansion planning - improvements in these areas?

It's a dialogue, advancing the state of the art in reliable systems.
You see, you can take an engineering and operations team, and make them work together from the start on a new service, and that's how you get most of the benefit of SRE (and a lot of companies are successful doing just this).
But to get that constant improvement vector, that zealous approach to perfection, you've got to let your SRE team say "no."

The other things an SRE team does for you - footprinting, taking ops interrupts, updating infrastructure - are responsibilities which inform that dialogue.
It's not that they get to say "no" because they do those things. It's because those things are their responsibility that they get the logical authority to object, and must meet the burden of proof to do so.
So if you ever wondered why our SRE team seems to say no to you all the time, now you know.

That's their job.
Thread by David W. Hankins.