Steve Smith

How should you decide if an incident merits a post-incident review?

The answer isn't "if it's a P1 or P2, forget P3s". In fact, it's the wrong question to be asking... #Operability 1/n

Many orgs I work with have a policy of "do a review if the incident was a P1 or a P2". Lower priority incidents don't get a review.

This might be due to a high volume of P3s from untuned alerts, friction in the incident review process, lack of emphasis on improvement, etc. 2/n

At best, the incident review process involves team members working together to uncover a shared timeline and improvement actions.

I'd call this a "shallow analysis" that pre-dates an understanding of resilience engineering, operability, the work of @AdaptiveCLabs etc. 3/n

In contrast, a "deep analysis" would emphasise incident analysis prior to an incident review meeting, to obtain richer information on the socio-technical factors involved. This is the work @allspaw specialises in 👑

I don't see deep analyses often. Shallow is still rare :( 4/n

The question shouldn't be

How should you decide if an incident merits a post-incident review?

It should be

Given mandatory incident reviews, how should you decide if an incident merits deep or shallow analysis?

#Operability 5/n

I don't pretend to know _how_ to do a deep incident analysis; like others, I learn from @AdaptiveCLabs 🙇‍♂️

I do know that alert priority is *not* a good way to decide on shallow/deep incident analysis, or yes/no incident review 6/n

The idea of alert priority is deeply subjective. One person's P1 is another person's P2.

An org might say "only review P1s and P2s, not P3s" because they are drowning in P3s and want to save review time/money... but a P3 can still cost you revenue 7/n

Production support is revenue insurance

If a P2 alert is linked to an expected max loss of £500K and a P3 is linked to £100K... if the P3 keeps occurring with no reviews and no learnings, it can become as costly as the P2, or more so

(And that's before reputational damage)

8/n
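A rough sketch of that arithmetic, using the £500K and £100K figures from the tweet above. The function names are hypothetical, and the assumption that every recurrence costs the full expected max loss is a deliberate simplification, not something the thread states.

```python
P2_MAX_LOSS_GBP = 500_000  # expected max loss per P2 (figure from the thread)
P3_MAX_LOSS_GBP = 100_000  # expected max loss per P3 (figure from the thread)


def cumulative_loss(loss_per_incident: int, occurrences: int) -> int:
    """Worst-case total, assuming every occurrence costs the full expected max loss."""
    return loss_per_incident * occurrences


def occurrences_to_match(bigger_loss: int, smaller_loss: int) -> int:
    """Smallest number of repeats of the smaller incident that equals or exceeds the bigger one."""
    return -(-bigger_loss // smaller_loss)  # integer ceiling division


if __name__ == "__main__":
    n = occurrences_to_match(P2_MAX_LOSS_GBP, P3_MAX_LOSS_GBP)
    total = cumulative_loss(P3_MAX_LOSS_GBP, n)
    print(f"{n} unreviewed P3s can cost up to £{total:,}")
```

Under these assumptions, the recurring P3 catches up with a single P2 on its fifth occurrence and overtakes it on the sixth.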

A P1, P2, or P3 incident should have an incident review
A near-miss should have an incident review
A Chaos Day should have an incident review

There needs to be a relentless focus on improvement, on learning, on removing friction from the post-incident process 9/n

#Operability

A deep analysis of an incident, a near-miss, or a Chaos Day should happen if a substantial revenue loss has happened, or is predicted in the future

A shallow analysis of an incident should happen if a low revenue loss has happened, or is predicted 10/n
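The deep/shallow rule above could be sketched as a small function. The thread does not say what counts as "substantial", so the `DEEP_THRESHOLD_GBP` value here is purely a placeholder, and the names `Analysis` and `choose_analysis` are hypothetical.

```python
from enum import Enum


class Analysis(Enum):
    SHALLOW = "shallow"
    DEEP = "deep"


# Hypothetical cut-off for a "substantial" loss; each org would set its own.
DEEP_THRESHOLD_GBP = 250_000


def choose_analysis(
    incurred_gbp: float,
    forecast_gbp: float,
    threshold_gbp: float = DEEP_THRESHOLD_GBP,
) -> Analysis:
    """Pick the analysis depth from revenue loss, not from alert priority.

    A near-miss or Chaos Day has zero incurred loss but may still carry a
    large forecast loss, so the larger of the two figures decides.
    """
    if incurred_gbp < 0 or forecast_gbp < 0:
        raise ValueError("losses must be non-negative")
    worst_case = max(incurred_gbp, forecast_gbp)
    return Analysis.DEEP if worst_case >= threshold_gbp else Analysis.SHALLOW
```

Note that priority does not appear as an input at all: every incident gets a review, and only the depth varies.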

Incident revenue loss (incurred or forecast), not incident priority, should govern the post-incident process

#Operability is about reliability, which is about revenue protection 11/n

One consequence of this is that a revenue impact calculator must be available *during* an incident, not afterwards.

I've seen too many orgs where revenue impact is considered during a post-incident review, or not at all

It is an input, not an output /end
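As an illustration only, a revenue impact calculator usable mid-incident might be as simple as the sketch below. The thread does not describe one, so the model (revenue per minute multiplied by elapsed time and the fraction of traffic affected) and every name and parameter are assumptions.

```python
from datetime import datetime, timezone


def revenue_impact_gbp(
    started_at: datetime,
    now: datetime,
    revenue_per_minute_gbp: float,
    fraction_affected: float,
) -> float:
    """Estimate loss so far for an ongoing incident.

    Assumes revenue accrues evenly per minute and that a fixed fraction
    of it is being lost; both are simplifications.
    """
    if not 0.0 <= fraction_affected <= 1.0:
        raise ValueError("fraction_affected must be between 0 and 1")
    if now < started_at:
        raise ValueError("now must not be earlier than started_at")
    minutes_elapsed = (now - started_at).total_seconds() / 60
    return minutes_elapsed * revenue_per_minute_gbp * fraction_affected


if __name__ == "__main__":
    start = datetime(2020, 7, 27, 12, 0, tzinfo=timezone.utc)
    current = datetime(2020, 7, 27, 12, 30, tzinfo=timezone.utc)
    print(f"£{revenue_impact_gbp(start, current, 1_000, 0.5):,.0f} lost so far")
```

Because it only needs the start time and two estimates, responders could run it while the incident is still open and feed the result into the choice of analysis depth.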

https://twitter.com/SteveSmithCD/status/1287696757003034625

Thanks for all the comments on incident reviews! Keep them coming

And a reminder I'm available for #ContinuousDelivery and #Operability work from 31 Aug. Get in touch!
