#srecon @randyshoup: Learning from Learnings: Anatomy of Three Incidents
#srecon @randyshoup: Outage 1: Google App Engine Outage. App Engine was down globally for 8 hours. The playbook failed and triggered a cascading failure.
#srecon @randyshoup: Resolutions: increased traffic routing capacity, but more importantly, created a program to reduce probability of the same problem happening again.
#srecon @randyshoup: Step 1: Identify the problem (postmortem identifies themes, related issues)
#srecon @randyshoup: Step 2: Understand the problem - timeboxed investigation, reporting back to the master group with recommended steps and the magnitude of the fix.
#srecon @randyshoup: Step 3: Consensus and prioritization - figure out what to do and who will do it.
#srecon @randyshoup: Step 4: Implementation and follow-up. Weekly updates in a spreadsheet.
#srecon @randyshoup: This was all intended to be super lightweight and allow us to move fast. The results included a 10x reduction in reliability problems. Prioritizing tech debt fixes improved the team cohesion and ownership.
#srecon @randyshoup: Incident 2: Stitch Fix outages related to the shared database. The applications contended on common tables, and scalability was limited by the number of database connections. Failure of one application could take down the whole company.
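(Editor's sketch, not from the talk: one way to see why a shared database caps scalability is to add up the connections every application can hold open and compare that to the server's limit. The app names, instance counts, pool sizes, and the limit of 100 below are all hypothetical.)

```python
def total_connections(apps):
    """Sum the worst-case open connections across all applications.

    `apps` maps an application name to (instances, pool_size_per_instance).
    Both the shape of this mapping and the names in it are illustrative.
    """
    return sum(instances * pool_size for instances, pool_size in apps.values())


def headroom(apps, max_connections):
    """Connections left before the shared database refuses new clients.

    A negative result means the applications can collectively exhaust
    the database, so one misbehaving app can starve all the others.
    """
    return max_connections - total_connections(apps)


# Hypothetical fleet sharing one database with a limit of 100 connections.
apps = {
    "storefront": (4, 10),  # 4 instances x 10 pooled connections = 40
    "warehouse": (2, 25),   # 2 instances x 25 pooled connections = 50
}
print(headroom(apps, max_connections=100))
```

Under these assumed numbers, adding one more "warehouse" instance would push the total past the limit, which is the kind of coupling the incident describes.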
#srecon @randyshoup: Repeat the playbook: Identify problem, understand the problem, consensus and prioritization, implementation and follow-up.
#srecon @randyshoup: Incident 3: WeWork login issues caused by inconsistent representations across different services in the system.
#srecon @randyshoup: Common elements to all incidents: Unintentional, long-term accumulation of small, individually reasonable decisions. A compelling event (outage) catalyzes long-term change.
#srecon @randyshoup: The problem is not the outage, it’s having an outage and not preventing it from happening again.
#srecon @randyshoup: Blameless culture makes learning and improvement possible.
#srecon @randyshoup: If you don’t end up regretting your early technology decisions, you probably overengineered. 🙌🏻
#srecon @randyshoup: Functional responses: Blameless postmortems, including open and honest discussion in a safe space. If it’s truly safe, engineers compete to take personal responsibility.
#srecon @randyshoup: Teams are excited about being able to finally prioritize fixing that broken system.
#srecon @randyshoup: Cross-functional collaboration: The best decisions are made through partnership, sharing context about constraints and goals. Given common context, well-meaning people generally agree.
#srecon @randyshoup: We can disagree and still commit to implementing the consensus decision.
#srecon @randyshoup: Quality and discipline: Not just engineering concerns, but also business concerns. Reliability directly affects customer experience.
#srecon @randyshoup: If you don’t have time to do it right, do you have time to do it twice? — Randy, and also my mom.
#srecon @randyshoup: The more constrained you are on time or resources, the more important it is to build it right the first time, because you cannot afford to do it again.
#srecon @randyshoup: Define done: Feature complete, automated tests, released to production, feature gate turned on, monitored. (Microsoft says “Live in production, returning metrics”)
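(Editor's sketch, not from the talk: the "done" criteria above could be tracked as an explicit checklist so that a feature only counts as done when every item is met. The class and function names are hypothetical; the five fields mirror the five criteria in the tweet.)

```python
from dataclasses import dataclass


@dataclass
class FeatureStatus:
    """One flag per criterion in the definition of done."""
    feature_complete: bool = False
    automated_tests: bool = False
    released_to_production: bool = False
    feature_gate_on: bool = False
    monitored: bool = False


def missing_criteria(status):
    """Return the names of the criteria that are not yet met, in field order."""
    return [name for name, met in vars(status).items() if not met]


def is_done(status):
    """A feature is done only when nothing is missing."""
    return not missing_criteria(status)


# A feature that shipped but is still gated off and unmonitored is not done.
shipped = FeatureStatus(
    feature_complete=True,
    automated_tests=True,
    released_to_production=True,
)
print(is_done(shipped), missing_criteria(shipped))
```

Listing what is missing, instead of returning a bare true or false, makes the remaining work visible in the same weekly-update style the playbook uses.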
#srecon @randyshoup: Virtuous cycle of investment: quality investment, solid foundation, confidence to move faster.
#srecon @randyshoup: How do you drive this change? Create top-down authority to make good choices, and engage in bottom-up consensus to make changes we know are needed. Sustainable changes need to come from both directions.
#srecon @randyshoup: How do you fund product improvement? 1) Fold the expense of quality and reliability into every project, as part of the scope of the project. 2) Explicit global investment across many teams: top-down, set aside engineering effort to spend on improvement.
#srecon @randyshoup: Your unexpected ally is actually the finance team. They understand low-probability, high-risk events.
#srecon @randyshoup: Martin Fowler: If you can’t change your organization, change your organization. (BTW, we’re hiring!) ;)