Tom Leaman
Husband, father, VP Site Reliability Engineering @ Warner Bros. Discovery, wannabe woodworker, baker, 3D printer, cyclist. Opinions 100% my own

Jun 13, 2019, 27 tweets

Really excited for this next talk by @LauraMDMaguire on “Lowering costs of coordination during service outages: A multiple case analysis”

#VelocityConf

@LauraMDMaguire will go into detail on the research done as part of the SNAFU Catchers Consortium (snafucatchers.com)

#VelocityConf

@LauraMDMaguire’s engagement with SNAFU actually starts with the tricam, a device used in rock climbing.

vdiffclimbing.com/tricams/

#VelocityConf

I love it when speakers make things real through narrating real personal stories and tying them to the subject. @LauraMDMaguire's talk covers experience in rock climbing involving failed assumptions, mental and physical exhaustion, cognitive overload, and pressure

#VelocityConf

Similar demands are placed on engineers managing incidents and operations every single day.

Needless to say this is not a topic that's new to #VelocityConf...

@allspaw, @ddwoods2, @ri_cook and others have all come to #VelocityConf in the past to share their expertise in cognitive systems and safety

The second cycle of SNAFU Catchers has involved contributions from:
- IBM
- New Relic
- KeyBank
- Salesforce

The focus this year has been on controlling the costs of coordination during incident response #VelocityConf

We'll be covering some early insights from this second round (the official report will be out later this summer).

#VelocityConf

Nearly all forms of meaningful work are 'joint activity'

- There's effort necessary to maintain what's known as 'Common Ground' in this activity
- As the speed and scale of systems increase, the demands of this maintenance become greater

#VelocityConf

Coordination is not limited to human-to-human interaction; it also includes human-to-machine interaction.

A lot of automation is used in incident response. It aids coordination, but using it can also place cognitive demands on operators.

#VelocityConf

All of these demands come on top of the other cognitive demands of managing the incident. Time can compound this pressure: it can increase the intensity of effort and also limit the opportunities to act.

#VelocityConf

In complex activities and systems, human activity ebbs and flows over time. In busy, high-tempo operations, task performance is more critical and consequences can escalate.

#VelocityConf

"Wait, this is why we have incident command, right?"

There's evidence that ICs alone are insufficient for smooth communication. Everyone has to invest in coordination in some way.

#VelocityConf

When one focuses on a single task and cuts away from maintaining common ground, coordination does not come back for free. A debt of sorts is generated, and effort must be re-invested to re-enter communications and coordination.

#VelocityConf

For joint activity to succeed, local goals need to be deferred in favor of those of the common group.

For instance: teams dropping planned work to jump on to an incident Slack or conference line to coordinate resolution.

#VelocityConf

Because this is cognitive and psychological, there are direct parallels to this type of example in other industries: think aviation, aerospace, and medicine.

One thing these industries generally don't deal with is distributed teams.

#VelocityConf

This picture from Apollo 13 demonstrates a lot of what goes into the coordination of joint activity:

Data is being grabbed from screens and papers and digested internally, side work is being performed, concurrent audio is coming in over headsets, etc.

#VelocityConf

Being co-located in the same environment, as in the previous picture, provides a lot of context 'for free.' Cues, both data-based and behavioral, are easy to pick up. This is not the case with distributed teams...

#VelocityConf

As multiple platforms are used to pull in context, if the tools do not automatically support coordination, that weight falls on the shoulders of the operators themselves.

#VelocityConf

This can cause communication and coordination breakdowns.

#VelocityConf

While there’s a generally linear pattern for incident resolution in unchanging systems, a linear model is woefully insufficient for highly complex distributed systems.

#VelocityConf

Models and tools need to be based on reality and the complexity of this type of coordination.

#VelocityConf

Escalation - when normal work breaks down and becomes exceptional.

#VelocityConf

As an incident escalates, cognitive and coordination demands increase, making it more difficult to accomplish joint activity.

#VelocityConf

For all of these reasons, simplistic models of incident response effectiveness such as Mean Time To Recover (MTTR) are insufficient.

#VelocityConf

Cognitive demands during an incident are cyclical in nature: as we take a more expansive, hypothesis-driven approach, coordination increases, and as we observe and validate these hypotheses, we lower the cost.

#VelocityConf

Some quick takeaways:

- Tech can help or hinder coordination
- Coordination must be designed to support integrating expertise, bringing in new resources, and getting them up to speed

#VelocityConf
