The answer isn't "if it's a P1 or P2, forget P3s". In fact, it's the wrong question to be asking... #Operability 1/n
This might be due to a high volume of P3s from untuned alerts, friction in the incident review process, lack of emphasis on improvement, etc. 2/n
I'd call this a "shallow analysis" that pre-dates an understanding of resilience engineering, operability, the work of @AdaptiveCLabs etc. 3/n
I don't see deep analyses often. Shallow is still rare :( 4/n
How should you decide if an incident merits a post-incident review?
It should be:
Given mandatory incident reviews, how should you decide if an incident merits deep or shallow analysis?
#Operability 5/n
I do know that alert priority is *not* a good way to decide on shallow/deep incident analysis, or yes/no incident review 6/n
An org might say "only review P1s and P2s, not P3s" because they are drowning in P3s and want to save review time/money... but a P3 can still cost you revenue 7/n
If a P2 alert is linked to an expected max loss of £500K and a P3 is linked to £100K... if the P3 keeps occurring with no reviews, no learnings, it can become as costly as the P2, or more so
(And that's before reputational damage)
8/n
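To make that arithmetic concrete, here's a rough sketch using the £500K and £100K figures above. The function names are made up for illustration, and it assumes every repeat of the P3 hits its full expected max loss, which is a worst case rather than a forecast.

```python
# Figures from the thread: expected max loss per occurrence, in GBP.
P2_EXPECTED_MAX_LOSS = 500_000
P3_EXPECTED_MAX_LOSS = 100_000


def cumulative_loss(loss_per_incident: int, occurrences: int) -> int:
    """Worst-case total loss if the same incident recurs with nothing learned."""
    return loss_per_incident * occurrences


def occurrences_to_match(target_loss: int, loss_per_incident: int) -> int:
    """Smallest number of repeats whose cumulative loss reaches target_loss."""
    return -(-target_loss // loss_per_incident)  # ceiling division on integers


if __name__ == "__main__":
    breakeven = occurrences_to_match(P2_EXPECTED_MAX_LOSS, P3_EXPECTED_MAX_LOSS)
    print(f"Unreviewed P3s needed to match one P2: {breakeven}")
    print(f"Loss after one more repeat: £{cumulative_loss(P3_EXPECTED_MAX_LOSS, breakeven + 1):,}")
```

Under those assumptions, five unreviewed P3s cost as much as one P2, and the sixth puts the P3 ahead.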
A near-miss should have an incident review
A Chaos Day should have an incident review
There needs to be a relentless focus on improvement, on learning, on removing friction from the post-incident process 9/n
#Operability
A shallow analysis of an incident should happen if a low revenue loss has happened, or is predicted 10/n
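As a rough sketch of that rule: every incident gets a review, and revenue loss picks the depth. The £100K cut-off, the names, and the "anything above the cut-off goes deep" branch are my assumptions for illustration. Set your own threshold.

```python
from enum import Enum


class Depth(Enum):
    SHALLOW = "shallow"
    DEEP = "deep"


# Hypothetical cut-off for "low" revenue loss, in GBP. An assumption, not a recommendation.
LOW_LOSS_THRESHOLD_GBP = 100_000


def analysis_depth(
    actual_loss_gbp: int,
    predicted_loss_gbp: int,
    threshold_gbp: int = LOW_LOSS_THRESHOLD_GBP,
) -> Depth:
    """Pick the review depth from revenue loss that has happened or is predicted.

    Alert priority is deliberately not a parameter.
    """
    worst_loss = max(actual_loss_gbp, predicted_loss_gbp)
    return Depth.SHALLOW if worst_loss <= threshold_gbp else Depth.DEEP


if __name__ == "__main__":
    # A near-miss: nothing lost this time, but a large loss was predicted.
    print(analysis_depth(actual_loss_gbp=0, predicted_loss_gbp=500_000).value)
    # A small loss with a small predicted loss.
    print(analysis_depth(actual_loss_gbp=20_000, predicted_loss_gbp=50_000).value)
```

Note the function never returns "no review". A near-miss with zero actual loss still gets analysed, because the predicted loss is an input too.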
#Operability is about reliability, which is about revenue protection 11/n
I've seen too many orgs where revenue impact is only considered during a post-incident review, or not at all
It is an input, not an output /end
And a reminder I'm available for #ContinuousDelivery and #Operability work from 31 Aug. Get in touch!