John Allspaw
Cofounder, @AdaptiveCLabs, “the NTSB of Tech” bringing Resilience Engineering to industry. he/him. Won’t speak on all-male panels, and #blacklivesmatter.
Apr 30, 2021
Some comments on the use and value of statistical approaches to incidents...

First, there's a rich history of these approaches in Human Factors and Safety Science, going back at least to the early 1900s with Taylor(ism) and Heinrich's "domino theory" and accident "pyramid." An excellent account of this history can be found in @sidneydekkercom's Foundations of Safety Science (bookshop.org/books/foundati…) as well as this very accessible work by @SINTEF (sintef.no/globalassets/u…)
Jan 15, 2021
An excellent overview of @LauraMDMaguire's dissertation ("Controlling the Costs of Coordination in Large-scale Distributed Software Systems") is on the Resilience Engineering Association's site.

Will give the URL after some fascinating bits regarding the results...

(1) Incident Commanders needing to recruit other folks to help with a response underway have to make multiple efforts.

They have to:
- Monitor the current capacity (of the response) relative to changing demands and identify additional resource requirements
Jan 14, 2021
A thing that continually fascinates me is the rediscovery that alerting used in modern technology (not just in software stacks) will always represent an *unsolvable* challenge with respect to design (of the alerts). I keep coming back to the paradox that comes with attention wrt false positives, via @ddwoods2:

“...how can one skillfully ignore a signal that should not shift attention within the current context, without first processing it -- in which case it hasn’t been ignored.”
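A related, well-known reason alert design stays hard is the base-rate effect: when real incidents are rare, even a very accurate alert fires mostly false positives, flooding exactly the attention the paradox above describes. A minimal sketch (all numbers hypothetical, not from the thread):

```python
# Hypothetical illustration: an alert that is "99% accurate" still fires
# mostly false positives when real incidents are rare (base-rate effect).

def alert_precision(sensitivity, false_positive_rate, incident_rate):
    """P(real incident | alert fired), via Bayes' rule."""
    true_alerts = sensitivity * incident_rate            # real incidents caught
    false_alerts = false_positive_rate * (1 - incident_rate)  # mis-fires
    return true_alerts / (true_alerts + false_alerts)

# Catches 99% of incidents, mis-fires on only 1% of healthy checks,
# but real incidents occur in just 0.1% of checks:
p = alert_precision(0.99, 0.01, 0.001)
print(f"{p:.1%}")  # ~9% of alerts correspond to real incidents
```

So roughly nine out of ten pages would be noise in this (made-up) regime, which is why no static threshold "solves" the design problem — every tuning choice trades missed incidents against attention-shifting false alarms.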
Jul 4, 2020
A story: a classmate of mine in my master’s program in Human Factors and Systems Safety (@lunduniversity) who came from the world of safety in oil and gas construction told me a thing that new safety people were taught early on... (1/n)

“Always keep a bottle of Jack Daniels in your trunk.”

Why? Because sooner or later, someone will get hurt on the site and because it’s out in the middle of nowhere, you, the safety person, will offer to drive them to the nearest hospital. (2/n)
May 21, 2020
Software Engineers: at some point during this next week, some of you may make changes to production code or infrastructure.

This thread is for you. You will do all the things you think are necessary to be confident that the change will do what you intend, and you’ll not do any more than that.

You’ll be as thorough as you believe you should, just like you have with every successful change you’ve made before.
Mar 18, 2020
Software and tech folks: interested in capacity, saturation, and adaptation that happens in medical facilities? I got you — my partner at @AdaptiveCLabs Dr. Richard Cook (@ri_cook) is a pioneer of Resilience Engineering and studied this for literally decades (thread)...

First, for those who have not read it, I give you the seminal “How Complex Systems Fail” paper: adaptivecapacitylabs.com/HowComplexSyst…

(largely considered to be the “gateway” paper to safety in complex systems and incidents)
Dec 19, 2019
The potential for you as an engineer to know things about your systems that your colleagues (even those on your team, who you've worked alongside for a long time) don't know is higher than many think. What's more, not only do they not know what you know, you might mistakenly believe that they *do* know it.

Which means when opportunities come up to reveal this "knowledge-past-each-other", you're not likely to bring it up with them.
Nov 20, 2019
On the topic of Incident Analysis, I don't believe the industry has a "dissemination" problem - that they produce post-incident write-ups but people just won't take the time to read them.

It seems many think this is the case. I don't. I believe that the skills necessary to produce quality, compelling, and genuinely insightful write-ups are in very short supply.

(I also think the allure of "template-driven postmortems" unfortunately enables this poor quality to continue.)
Jun 10, 2019
"Resilience engineering is about identifying and then enhancing the positive capabilities of people and organizations that allow them to adapt effectively and safely under varying circumstances. Resilience is not about reducing negatives (incidents, errors, violations)."
1/8
"Resilience engineering is based on the premise that we are not custodians of already safe systems. Complex systems do not allow us to draw up all the rules by which they run, and not all scenarios they can get into are foreseeable."
2/8
May 29, 2019
The tendency to view incidents as having neat boundaries between their "phases" is problematic at best. A good deal of traditional thinking about learning from incidents depends on this construction...

But this isn't what we find when we look closely at *real* incidents in the wild. They tend to be much "messier" than that tidy picture purports.
Mar 26, 2019
I am somehow surprised when I encounter folks with such strong beliefs in the "Humans-Are-Better-At/Machines-Are-Better-At" approach to designing software. At this point I shouldn't be surprised, but still am. 1/n

This "HABA-MABA" philosophy has been so ingrained that it almost flies under the radar as worthy of attention to some.

Who will argue with "make the computers do the easy/tedious stuff so humans can do the difficult/interesting stuff"? (apparently, I will) 2/n
Feb 15, 2019
On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that's not the topic of this thread.

In the aftermath, the amount of hindsight-bias-fueled armchair-quarterbacking on this event knew no bounds. 1/n

From HN to blogs to Twitter, the finger-pointing about what they "should have" done was rampant. This fervor culminated in, and was validated by, the SEC's official report on the event, which was effectively a case study in what a post-incident review should *not* look like. 2/n
Jan 3, 2019
"Finally, we consider the challenge offered by proponents of automation. Some researchers in the automation community have promulgated the myth that more automation can obviate the need for humans, including experts. The enthusiasm for technologies is often extreme."

"Too many technologists believe that automation can compensate for human limitations and substitute for humans. They also believe the myth that tasks can be cleanly allocated to either the human or the machine."
Dec 13, 2018
Contrary to popular belief (tooling marketing and some SRE/Ops/Infra writings), the “substitution myth” continues to hold, and I think it bears repeating. @courtneynash summed it up best (more than 3 years ago now!) in this way...

(oreilly.com/ideas/ghosts-i…)

Another way of putting this is:
Sep 7, 2018
I've been studying cognitive work in software environments for some years now. I want to bring attention to a phenomenon that is both familiar (in an "uh, of course!" way) and, AFAICT, does not receive much attention. 1/n

Before I write about that, I'll mention that a primary way this research is done is by using a family of methods called "process tracing," which involves the triangulation of multiple data sources to make analytically valid inferences about cognition "in-the-wild." 2/n
Aug 30, 2018
Reminders about “cognitive biases” and heuristics:

1) They are not signals of human frailty - they exist and stick because they are SUCCESSFUL in almost all cases!
2) Without them, people could actually never get anything done - they are necessary! 1/n

3) So, guidance to “avoid” them categorically is unhelpful. Also impossible.
4) That these phenomena exist and sometimes contribute to mistakes should not be seen as blanket license to “automate away” human contributions to technology operations. 2/n
Apr 12, 2018
Because @jezhumble brought me to reconsider how "risk" is constructed by people, I'll thread up some statements on the term from a reference I'll show at the end... 1/n

- Risk is in everything we do. Short of never doing anything, there is no way to avoid all risk or ever to be 100% safe. 2/n
Feb 23, 2018
The following thread is a short bit from @ri_cook's talk at @velocityconf in 2012, "Resilience in Complex Adaptive Systems" - I believe it's even more relevant than it was then.
(thread 👇)

"Let me start out by saying that the future of all your systems -- although you do not realize it right now -- is safety. You think of your systems as those "web app" systems …but they are also business critical systems." 1/n