First tutorial of the day: “Move Fast and Learn from Incidents” with @lhochstein , @this_hits_home , and @nora_js along with a great crowd of TAs from Netflix, Slack, New Relic, and OSU! #VelocityConf
We are going to be role playing a few incident investigations during the session #velocityconf
During these investigations we’ll be looking for what @lhochstein calls “gaps,” which fall into the following categories:
1) Tools
2) Operational expertise
3) Resource (people) gaps

Another benefit of incident reviews is that they enable skill transfer. Experts (unintentionally) hoard tons of knowledge that doesn’t always get out to others. Well-run investigations help extract this expertise. #VelocityConf
Types of questions you might ask during investigations:
1 Cues about observations
2 How judgements are made
3 Options considered
4 How they knew something
5 External forces at play
6 How things normally work
7 Drivers to ask for help


If you run systems/services consumed by others (hint: everyone in software development) and haven’t taken the opportunity to read the Etsy Debrief Guide, I highly recommend giving it a quick read. It’s a great gateway into Resilience Engineering concepts...

Additional guidance for investigations:
1 Avoid the hunt for Root Cause
2 “Human error” limits learning about contributing factors
3 Counterfactuals (statements about things that could/should/would have happened) do not help you learn what *did* happen

The first tutorial session: investigating a self-inflicted DDoS. Individuals had character sheets for various stakeholder groups. Having “purposefully” hidden info for a role is a great simulation of how we all have “secret info” that others may not realize.

My face when @kitchens rolls out a list of contributors and mitigators for the tutorial incident that is a mile long (who says root cause is a real thing?)

Some takeaways from the sessions:

1 ROI comes from distribution of expertise
2 Expertise helps us understand how things operate normally, not just how they fail
3 Minimalistic RCA provides minimal value
4 Focus on both error reduction and insight generation

We weren’t able to run an exercise using 1:1 interviews for incident reviews during the session given time constraints but they can be *very* powerful tools.

When thinking about writing up the incident review for broader distro:

1 After the summary, is it easy to digest?
2 Does the message inspire readers to dig deeper?
3 Get feedback on a draft of the publication

Fantastic job and excellent tutorial by all involved: @nora_js, @this_hits_home, @lhochstein, the TAs, and all the participants!

Can’t wait to run similar workshops back home!

Subscribe to Tom Leaman