Will Larson (@Lethain), 10 tweets, 2 min read

I've been working with a large-ish startup that's revamping their incident response and incident program, and ended up writing up notes on some of the phases and challenges of creating and operating these programs.

Some tips!

lethain.com/incident-respo…

Incident programs start out focused on fast incident response. Later on they focus on consistent response across a much larger team. Larger still, they end up justifying the company's overall investment in reliability. These transitions are easy to miss, and if you miss one you end up playing the wrong game.

Most companies hang on to the "small group of long-tenured experts" approach too long, even though almost no one likes it (except the occasional hero). This is because structured approaches initially work worse than heroics, but they are still the only thing that scales in the long run.

Most companies wait too long to create an incident tooling team, and instead treat the incident program purely as a collection of processes rather than process *and* product! You need both approaches, and especially a product mentality, to have a great incident response and program.

I think the typical norms around incident review aren't scaling very well. Some incident reviews are great, but so many are an exercise in compliance. This is a place ripe for innovation! Try new approaches! Share them! We're not done yet!

Many companies end up enforcing "incident law", where form dominates over function and folks are obsessed with compliance with a process that isn't obviously working. Don't do this! Focus on effective response workflows and driving overall reliability. Tickets are just tickets.

As a company grows, you rely on either fear or a business case to support continued investment in reliability. Prefer the business case over fear! This requires forecasting reliability and the impact of your work.

lethain.com/forecasting-sy…

I still really prefer product-oriented incident tooling teams over centralized response teams. With a good incident program most incidents are novel, so you really want system builders involved. You can't delegate novel problems to folks who aren't touching the systems frequently.

The best incident programs come from the harmonious incorporation of many perspectives: org program, architecting for reliability, reliability as a business feature, product-oriented tooling. The only path to long-term failure is indexing too heavily on any one approach.

Finally, nothing lasts forever. If you're working at a company with a strong business, then the context you're operating in will change continuously. Excellence is transient, so keep evolving!