An excellent overview of @LauraMDMaguire's dissertation ("Controlling the Costs of Coordination in Large-scale Distributed Software Systems") is on the Resilience Engineering Association's site.
Will give the url after some fascinating bits regarding the results...
(1) Incident Commanders needing to recruit other folks to help with a response underway have to make multiple efforts.
They have to:
- Monitor the current capacity (of the response) relative to changing demands and identify additional resource requirements
- Identify the skills and experience required
- Identify who is available
- Determine how to contact them
- Contact them and alert them to the event
- Wait for a response
- Adapt the current work to accommodate this new engagement (waiting, slowing down or speeding up, completing other tasks to aid coordination)
- Prepare for the new folks coming to the scene by a) anticipating what they'll need, b) developing a situation assessment or status update, c) giving access/permissions to tools & coordination channels, d) generating shared artifacts (dashboards, screenshots)
and also e) dealing with any access issues (inability to join a web conference or trouble establishing an audio bridge)
A 2nd bit from Laura's research was to identify how all participants in smooth coordinative activity incur coordination costs – it cannot be proceduralized away or assigned to a single role.
There can be significant effort expended by the new folks joining the response, in terms of:
- Being interrupted in their work
- Assessing the request relative to their capabilities
- Assessing the request relative to their capacity to act
- Deferring or abandoning their own work
- Acknowledging their orientation to the problem
- Communicating about the deferral or abandonment to the parties they coordinate with
- Gaining access to collaboration tools
- Assessing available information
- Clarifying (available data and expectations)
- Requesting additional information
- Forming questions about the state of the event or system
- Determining interruptibility of the participants already in the event
- Forming interjections
- Interjecting
- Determining roles or role reallocation within the existing group
- Assessing work underway
- Assessing implications of work underway
- Considering their contributions relative to problem constraints
and
- Assessing how their contributions may influence work underway
(these activities can happen quite fluidly, but for sure: there's typically a lot going on here that we rarely acknowledge!)
"These costs are often incurred at points in time when they are least ‘affordable’ – during high tempo, highly demanding cognitive efforts – which can lead to degradations in the joint activities and coordination breakdowns."
A thing that continually fascinates me is the rediscovery that alerting used in modern technology (not just in software stacks) will always represent an *unsolvable* challenge with respect to design (of the alerts).
I keep coming back to the paradox that comes with attention wrt false positives, via @ddwoods2:
“...how can one skillfully ignore a signal that should not shift attention within the current context, without first processing it -- in which case it hasn’t been ignored.”
Complexity wrt the anomaly detection mechanisms, the alert’s representation to the receiver, the breadth of the detecting agent’s perspective, the dynamics of what is being sensed...all contribute to situations where false positives will never go to zero.
A story: a classmate of mine in my master’s program in Human Factors and Systems Safety (@lunduniversity) who came from the world of safety in oil and gas construction told me a thing that new safety people were taught early on...(1/n)
“Always keep a bottle of Jack Daniels in your trunk.”
Why? Because sooner or later, someone will get hurt on the site and because it’s out in the middle of nowhere, you, the safety person, will offer to drive them to the nearest hospital. (2/n)
On the way, you say “hey, so it’s a bit of a long drive and you’re in some pain. If you want, you can take some swigs from this to at least help with the pain, it’s all I’ve got...”
Which, invariably, your passenger will take you up on.
Software Engineers: at some point during this next week, some of you may make changes to production code or infrastructure.
This thread is for you.
You will do all the things you think are necessary to be confident that the change will do what you intend, and you’ll not do any more than that.
You’ll be as thorough as you believe you should, just like you have with every successful change you’ve made before.
When it becomes clear that the change didn’t do what you expected and has triggered an incident, a shift in focus will take place amongst your colleagues and you might take part in it yourself.
Software and tech folks: interested in capacity, saturation, and adaptation that happens in medical facilities? I got you — my partner at @AdaptiveCLabs Dr. Richard Cook (@ri_cook) is a pioneer of Resilience Engineering and studied this for literally decades (thread)...
(largely considered to be the “gateway” paper to safety in complex systems and incidents)
Next up, “Being Bumpable: Consequences of resource saturation and near-saturation for cognitive demands on ICU practitioners” on the topic of what coping strategies hospitals use in managing saturation of incoming patients
The potential for you as an engineer to know things about your systems that your colleagues (even those on your team, who you've worked alongside for a long time) don't know is higher than many think.
What's more, not only do they not know what you know, you might mistakenly believe that they *do* know it.
Which means when opportunities come up to reveal this "knowledge-past-each-other", you're not likely to bring it up with them.
(aside: this phenomenon is known as the "Fundamental Common Ground Breakdown")
On the topic of Incident Analysis, I don't believe the industry has a "dissemination" problem - that they produce post-incident write-ups but people just won't take the time to read them.
It seems many think this is the case. I don't.
I believe that the skills necessary to produce quality, compelling, and genuinely insightful write-ups are in very short supply.
(I also think the allure of "template-driven postmortems" unfortunately enables this poor quality to continue.)
People aren't reading them (or commenting on, highlighting, referencing, or asking questions about them) because they don't (typically) reveal anything of interest to the reader!
This can be done well, and the results are undeniable when this expertise exists!