I've been working with a large-ish startup that's revamping their incident response and incident program, and ended up writing up notes on some of the phases and challenges of creating and operating these programs.

Some tips!

lethain.com/incident-respo…
Incident programs start out focused on fast incident response. Later on they focus on consistent response across a much larger team. Larger still, they end up justifying the company's overall investment in reliability. These transitions are easy to miss, and if you miss one you end up playing the wrong game.
Most companies hang on to the "small group of long-tenured experts" approach too long, even though almost no one likes it (except the occasional hero). This is because structured approaches initially work worse than heroics, but they are still the only thing that scales in the long run.
Most companies wait too long to create an incident tooling team, and instead treat the incident program purely as a collection of processes rather than process *and* product! You need both approaches, and especially a product mentality, to have great incident response and a great incident program.
I think the typical norms around incident review aren't scaling very well. Some incident reviews are great, but so many are an exercise in compliance. This is a place ripe for innovation! Try new approaches! Share them! We're not done yet!
Many companies end up enforcing "incident law", where form dominates function and folks are obsessed with compliance with a process that isn't obviously working. Don't do this! Focus on effective response workflows and driving overall reliability. Tickets are just tickets.
As a company grows, you rely on either fear or a business case to support continued investment in reliability. Prefer the business case over fear! This requires forecasting reliability and the impact of your work.

lethain.com/forecasting-sy…
I still really prefer product-oriented incident tooling teams over centralized response teams. With a good incident program most incidents are novel, so you really want system builders involved; you can't delegate novel problems to folks who aren't touching the systems frequently.
The best incident programs come from harmonious incorporation of many perspectives: org program, architecting for reliability, reliability as a business feature, product-oriented tooling. The only path to long-term failure is indexing too heavily on any one approach.
Finally, nothing lasts forever. If you're working at a company with a strong business, then the context you're operating in will change continuously. Excellence is transient; keep evolving!

Keep Current with Will Larson
