I've been going through a few production incident postmortems - the aftermath of a production system going down. Here's a way to go about these, and strengthen the reliability of the underlying systems. Thread.
1. First, understand the timeline. What happened, and when? How was the incident detected? How was it mitigated, and how was everything cleaned up?
2. Talk times. TTD (Time to Detect) + TTM (Time to Mitigate) = total incident time. How long was this?
3. What was the business impact? Okay, we know that a system that other systems use went down for some time... what did users actually see? Was there money or customers lost? What is the value? Getting this information is often not trivial, and can take time.
4. What was the root cause? I don't mean the "we deployed code that broke" type of surface scratching. Go deep, with the five "whys" technique. Why did an automated system not catch this issue? Or perhaps: why is there no automated system to catch such issues?
5. How can we reduce the time to detect the outage? If it took seconds to detect from deploy: great job. If it was hours, days, or weeks, let's address this. Same for time to mitigate: once the outage is detected, mitigation should take minutes, tops.
6. How can we prevent this, and similar issues, from happening? Can we have automation catch things? Processes that will detect issues early? Get creative.
7. Finally: what are the 3 key learnings of this incident? By this time, we surely have at least 3 things.
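To make the "talk times" step concrete, here is a minimal sketch of the TTD + TTM arithmetic in Python. The `IncidentTimeline` class, its field names, and the timestamps are all made up for illustration; real values would come from the incident timeline in step 1.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    """Hypothetical record of the three key moments in an incident."""
    started_at: datetime    # when the outage began (e.g. the bad deploy)
    detected_at: datetime   # when an alert or a person noticed it
    mitigated_at: datetime  # when user impact stopped

    @property
    def time_to_detect(self) -> timedelta:
        # TTD: from the start of the outage until it was noticed
        return self.detected_at - self.started_at

    @property
    def time_to_mitigate(self) -> timedelta:
        # TTM: from detection until the impact was stopped
        return self.mitigated_at - self.detected_at

    @property
    def total_incident_time(self) -> timedelta:
        # TTD + TTM = total incident time
        return self.time_to_detect + self.time_to_mitigate


# Illustrative timestamps only, not taken from a real incident.
incident = IncidentTimeline(
    started_at=datetime(2020, 1, 15, 14, 0),
    detected_at=datetime(2020, 1, 15, 14, 25),
    mitigated_at=datetime(2020, 1, 15, 14, 40),
)

print("TTD:", incident.time_to_detect)
print("TTM:", incident.time_to_mitigate)
print("Total:", incident.total_incident_time)
```

Keeping TTD and TTM as separate numbers shows which one dominates the total, and so whether to invest in detection or in mitigation first.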
Keep Current with Gergely Orosz
