#chaosday19 : Next up is disaster recovery at scale at Google, by Parma Gopalan.
#chaosday19 : DiRT (Disaster Recovery Testing) started at Google back in 2006 [ed: might have gotten the date wrong]. Awesome scenarios, including what happens if a meteor hits and zombies start coming up out of the ground.
#chaosday19 : Data center failures can occur for a multitude of reasons. Failures aren't just the loss of compute; a PoP (point of presence) can vanish too.
#chaosday19 : One style of scenario covers capacity needs, particularly while maintenance is occurring. This is especially an issue when multiple (potentially unplanned) changes are occurring simultaneously.
#chaosday19 : Incident management for security crises gets exercised as well. Planning and evaluating scenarios for security needs is critical. DiRT tests not only the technical elements but the people and process parts as well.
#chaosday19 : Why DiRT? To discover problems under controlled circumstances before they are encountered in the wild. [ed: prep and practice are key, seems to be a recurring theme during the talks today!]
#chaosday19 : How do we test? Measure everything! Real outages, including triggers, Incident Management signals and indicators. [ed: would love to understand what measures are being utilized for Incident Management. Are we talking about things like TT-Respond, TT-Communicate, etc.?]
#chaosday19 : How to test production? We start with manual gamedays and isolate blast radius via % of traffic, whitelists, or standby environments. Focus on customer experience: there are too many possible tests to evaluate the entire system, so customer-centric SLOs are critical.
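To make the blast-radius idea concrete, here is a minimal sketch of a gate that decides whether a request is in scope for a gameday, using the three levers mentioned in the talk (% of traffic, a whitelist, a standby environment). It is not Google's tooling: `BlastRadius`, its fields, and the `"standby"` environment label are hypothetical names chosen for illustration, and the whitelist appears here as `allowlist`.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class BlastRadius:
    """Hypothetical policy deciding which requests a gameday may touch."""

    percent: float = 1.0                          # share of traffic in scope, 0-100
    allowlist: set = field(default_factory=set)   # users that are always in scope
    standby_only: bool = False                    # restrict the test to standby

    def in_scope(self, user_id: str, environment: str) -> bool:
        # Never touch anything outside standby when the test is standby-only.
        if self.standby_only and environment != "standby":
            return False
        if user_id in self.allowlist:
            return True
        # Stable hash: the same user stays in (or out) for the whole test,
        # so a single customer sees consistent behaviour.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 10_000
        return bucket < self.percent * 100
```

A gameday might start with something like `BlastRadius(percent=0.0, allowlist={"test-account"})` and widen `percent` only while the customer-facing SLO holds.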
#chaosday19 : Communication of running DiRT in prod is critical. [ed: predictability is critical for any joint activity, I would assume that prod DiRT activities fall into the joint activity realm :) ]
#chaosday19 : Catzilla allows Google services to use fault injection to flush out hard-to-find issues and provides a suite of easy-to-use tests. It comes with a big red button to stop tests [ed: I hope that someone has rigged up a physical 'that was easy' button to execute this]
#chaosday19 : How tests are structured in Catzilla:
#chaosday19 : Note - Catzilla is not used in production. It needs fine-grained control over which requests will be injected with faults.
#chaosday19 : Q&A: How frequently do you run region cutovers? Can't do it for public GCP. Tests are run in single-digit counts.
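The two properties called out for fault injection here, per-request control and a big red button, can be sketched in a few lines. This is an illustration only and says nothing about how Catzilla is actually built: `FaultInjector`, `FaultRule`, `InjectedFault`, and the `tag` request field are all hypothetical names.

```python
import threading
from dataclasses import dataclass
from typing import Callable


class InjectedFault(Exception):
    """Raised in place of a real backend error while a test is running."""


@dataclass
class FaultRule:
    matches: Callable[[dict], bool]   # selects the requests to break
    error: str                        # message carried by the injected fault


class FaultInjector:
    def __init__(self) -> None:
        self._rules = []
        self._stopped = threading.Event()   # the "big red button"

    def add_rule(self, rule: FaultRule) -> None:
        self._rules.append(rule)

    def stop(self) -> None:
        # Once pressed, no further faults are injected.
        self._stopped.set()

    def call(self, request: dict, handler: Callable[[dict], object]) -> object:
        # Only requests matched by a rule are affected; all others pass through.
        if not self._stopped.is_set():
            for rule in self._rules:
                if rule.matches(request):
                    raise InjectedFault(rule.error)
        return handler(request)


# Usage: break only requests explicitly tagged as test traffic.
injector = FaultInjector()
injector.add_rule(
    FaultRule(matches=lambda r: r.get("tag") == "dirt", error="backend unavailable")
)
```

Matching on an explicit tag keeps untagged traffic untouched, and calling `injector.stop()` restores normal behaviour for tagged requests as well.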
#chaosday19 : [ed: going back to my question about what measures/records from DiRT experiments are used for incident management - mostly a focus on blameless post-mortems from DiRT tests to capture opportunities to improve IM processes]