Lorin Hochstein
Student of complex systems failures, resilience eng, cognitive sys eng. Will talk your ear off about @LFISoftware. @norootcause@hachyderm.io
Jul 30, 2022 5 tweets 1 min read
I’m honestly more inclined to believe the Secret Service did not intentionally delete the texts given that their IT was doing a migration and they had to do a factory reset as part of it. I can totally see this happening in a migration.

secretservice.gov/newsroom/relea…

People are like “how could this happen” and I’m like “I totally see how this could happen in a migration”. Migrations are one-off things. Lots can go wrong. Lots often does.
Mar 13, 2022 30 tweets 2 min read
A 🧵 on barriers to problem detection, taken from @KleInsight's excellent paper titled "The strengths and limitations of teams for detecting problems" 1/

An anomaly may be masked (by other symptoms, other problems, background variability) 2/
Dec 27, 2021 30 tweets 3 min read
As a software developer, you may be called upon to perform some of these tasks in your career.

How well a CS degree prepares you for these tasks (and whether it even should prepare you for these) is left as an exercise to the reader.

🧵
1/
Make a behavioral change to a medium-to-large system that you don't understand. 2/
May 1, 2021 8 tweets 2 min read
Twitter thread of me reading the @auth0 RCA that was just released.
They have a feature flag service, which is fronted by a cache. The cache suffered from saturation:

"An increase in traffic exceeded the caching capacity of that service and caused it to stop responding in a timely manner".
Jan 9, 2021 7 tweets 2 min read
Resilience engineering makes the following seemingly contradictory claims:

1. Small incidents don’t provide insight into big ones.
2. To get insight into the nature of big incidents, study the small ones.

How can that be?

Here comes a thread.

@sidneydekkercom puts it like this: “Incidents do not precede accidents. Normal work does.”

This has implications for both points.

(Terminology may be a little confusing here because what software people call “incidents” is what safety people call “accidents”.)
Jul 30, 2020 6 tweets 2 min read
Learned about the "McNamara fallacy" the other day and I just love it: en.wikipedia.org/wiki/McNamara_…

"The first step is to measure whatever can be easily measured. This is OK as far as it goes."
Jul 3, 2020 9 tweets 2 min read
I recently listened to a @BookTV podcast interview with Bryan Caplan about his book "The Case Against Education". That inspired a thought experiment, which I'll describe in this thread:

Caplan argues that only 20% of the economic value of a college degree is in the development of skills useful for employment, and that the other 80% of the value is just signaling.

Question: what would happen if we could drive the signaling down to 0?
Jul 6, 2019 37 tweets 4 min read
Liked this quote from Todd Conklin's "Pre-Accident Investigations" book:

"Failure happens because the worker believes that what is about to happen to them is simply not possible."

"It worked last time. It worked the last 10,000 times. Normally, it always work OK. Why would it not work the next time?"
Jun 22, 2019 6 tweets 2 min read
This is a good thread by @andyfleener. Thought I’d share another example in my own thread.

We have monitors in our “war room” that cycle through our dashboards. We also use that room for regular meetings. But we don’t normally look at those dashboards unless that alert has gone off (there are no such monitors in our bullpen).
Jun 16, 2019 6 tweets 1 min read
When I was younger, I was very interested in learning new technologies as a way of becoming a better software developer. These days, I’m more interested in learning how to think better, in general. (thread)

By “think better”, I mean either improve my intuition, or learn techniques that can compensate for when my intuition is likely to fail me.
Apr 23, 2019 20 tweets 3 min read
Learning from incidents, let’s talk specifics! A thread.

Those of us who advocate learning from incidents tend to be a bit vague about what, specifically, you might learn. I’m going to try to provide some examples, to make things more concrete.
Apr 13, 2019 14 tweets 3 min read
Why incidents are a good opportunity for learning, a thread.

First, let's talk briefly about *what* we want to learn. One valuable topic to learn about in an organization is how the organization actually works, in the what-do-people-do-day-to-day-to-keep-things-running sense.

(I'll leave why this is valuable to a future thread).
Mar 15, 2019 14 tweets 3 min read
Let's talk about reciprocity in software orgs, a thread:

According to @ddwoods2, "reciprocity" is one of the ingredients for achieving resilience. If one unit gets overloaded, other units will compensate. As he puts it: "I will help you when you're crunched".
Feb 17, 2019 7 tweets 1 min read
Robustness, resilience, and the maturation of startups, a thread. 1/7

Per @ddwoods2, there's a tradeoff between being robust (good at handling known unknowns) and being resilient (good at handling unknown unknowns). In general, you want both, what Woods calls "net adaptive value". 2/7
Jan 19, 2019 14 tweets 2 min read
Confession: I don't get "DevOps", a thread. 1/14

I've read The Phoenix Project and the DevOps Handbook, and I still don't really get it. 2/14