Murali Suriar @msuriar
And now: @lauralifts on A Taxonomy of Black Swans. #LISA18
@lauralifts What I should have called this talk is "15 postmortems in 30 minutes". #LISA18
@lauralifts What is a "black swan"?
- Outlier event.
- Hard to predict.
- Severe in impact.
@lauralifts Example from the financial world: 2008 financial crash.

You can't predict when they'll strike, or what they're going to be, but looking back you can see predictive indicators.
@lauralifts Most alerts when you're oncall are white swans (or geese), where you understand the cause and the response. #LISA18
@lauralifts Black swans can become routine non-incidents. Example: class of incidents caused by changes which can be detected and prevented by canarying. #LISA18
@lauralifts On sharing postmortems: I'm going to be talking about a lot of public postmortems. Not trying to throw shade here. People who share postmortems are doing us a great service. #LISA18
@lauralifts Six different subspecies: 1) Hitting limits.

Instapaper, Feb 2017
- MySQL RDS, backed by ext3 with a 2TiB file limit.
- Hit that limit, had to dump data and repopulate on ext4.
- Down for days. #LISA18
@lauralifts Sentry, July 2015
- Maxed out Postgres transaction IDs, vacuum process didn't work.

SparkPost, May 2017
- Unable to send mail for hours.
- High DNS workload.
- Hit undocumented per-cluster connection limit.
@lauralifts Foursquare, October 2010
- MongoDB outgrew RAM
- Hit performance cliff
- Backlog of queries
- Resharding at full capacity is hard.
@lauralifts - October 2016
- EU region down for 4 hours
- Orchestration wouldn't start
- Zookeeper database bigger than 64KiB, which was the size of the pipe used by a library.
@lauralifts Defence?
- Load and capacity testing
- Including cloud services (warn provider first)
- Include write loads
- Use a replica of prod (staging is like prod!)
- Grow past current size.
- Test startup and other operations.
@lauralifts Defence: monitoring
- The best documentation of known limits is a monitoring alert.
- Include a link that explains the nature of the limit.
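A minimal sketch of what "the alert is the documentation" could look like, assuming a known ceiling (here the 2TiB ext3 file limit from the Instapaper example) and a warning threshold of 80%. The `Limit` class, the `check_limit` helper, the threshold and the wiki URL are all hypothetical, not something from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limit:
    name: str
    ceiling: float   # the hard limit, in the same unit as the measurement
    doc_url: str     # hypothetical link explaining the nature of the limit


def check_limit(limit: Limit, current: float,
                warn_fraction: float = 0.8) -> Optional[str]:
    """Return an alert message once usage reaches warn_fraction of the ceiling."""
    used = current / limit.ceiling
    if used < warn_fraction:
        return None
    return (f"{limit.name} at {used:.0%} of its limit "
            f"({current:.0f}/{limit.ceiling:.0f}); see {limit.doc_url}")


# Assumed example values: a 2 TiB per-file ceiling and a placeholder URL.
EXT3_FILE_LIMIT = Limit(
    name="ext3 max file size",
    ceiling=2 * 2**40,
    doc_url="https://wiki.example.com/limits/ext3-file-size",
)
```

The point of carrying `doc_url` in the alert itself is that whoever gets paged learns what the limit is and why it exists without having to go looking.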
@lauralifts 2) Spreading slowness

Hosted Graphite, March 2018
- AWS problems
- HostedGraphite not on AWS?!
- Their LB were being saturated due to slow connections coming from customers inside AWS.

(Looked like Slowloris attack). #LISA18
@lauralifts Spotify, June 2013
- (Fairly complicated)
- Playlist service overloaded due to another service calling them.
- Rolled back.
- But huge ongoing request queues and verbose logging broke stuff.
- Firewall rules and restarts.
@lauralifts Stripe, March 2017
- Auth systems slowed to a crawl.
- Redis overloaded
- Tight 500x retry loop.
@lauralifts Defence: fail fast
- Enforce deadlines for all requests, in and out.
- Limit retries, have exponential backoff and jitter.
- Consider circuit breaker pattern
- Limit retries from a client, [limit?] sharing state across multiple requests.
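A rough sketch of the fail-fast points above: an overall deadline, a cap on attempts, and exponential backoff with full jitter. The function name, the exception name and all default values are assumptions for illustration; the circuit breaker pattern is not covered here.

```python
import random
import time


class RetryBudgetExhausted(Exception):
    """Raised when attempts or the overall deadline run out."""


def call_with_retries(op, *, deadline_s=2.0, max_attempts=4,
                      base_delay_s=0.05, max_delay_s=1.0,
                      sleep=time.sleep, clock=time.monotonic,
                      rng=random.random):
    """Call op(remaining_seconds) until it succeeds or the budget is spent."""
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break  # deadline passed: fail fast instead of queueing more work
        try:
            # Pass the remaining time down so the callee can enforce it too.
            return op(remaining)
        except Exception as exc:
            last_error = exc
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Exponential backoff, capped, with full jitter so clients desynchronise.
        backoff = min(max_delay_s, base_delay_s * 2 ** attempt)
        delay = rng() * backoff
        if delay >= deadline_s - (clock() - start):
            break  # sleeping would overrun the deadline
        sleep(delay)
    raise RetryBudgetExhausted("gave up after retries") from last_error
```

Compare this with the tight 500x retry loop in the Stripe example: here a failing dependency sees at most `max_attempts` calls per request, spaced out and randomised.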
@lauralifts Defence: USE dashboards
- Utilisation, saturation, errors
- Quick way to identify bottlenecks.
- Look at physical and virtual resources.

@lauralifts 3) Thundering herds

Where does coordinated demand come from?
- From users (e.g. flash sales)
- But also systems
- Cron jobs
- Mobile clients updating at the same time.
- Large batch jobs starting.

Slack, October 2014
- Two separate incidents caused a significant number of user disconnections (13%)
- Web sockets based API
- Simultaneous reconnect caused database saturation.
CircleCI, July 2015
- GitHub down for a while
- GitHub came back, bunch of pending build requests.
- Requests queued into DB
- Complex scheduling
- DB contention/saturation.
Defence: plan and test
- ~Every internet facing service can face a thundering herd.
- Plan for this
- Degraded mode.
- What can you drop?
- Queue input cheaply to process asynchronously.
- Test your degraded operation plan.
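One possible shape for "queue input cheaply" plus "what can you drop?": a bounded in-memory intake that accepts work in constant time and sheds load once full, so the expensive processing happens later at a pace the backend can sustain. The `Intake` class and its sizes are hypothetical; a real system would more likely use a durable queue.

```python
import queue


class Intake:
    """Bounded intake: cheap to enqueue, drops new work once the queue is full."""

    def __init__(self, max_pending: int):
        self._q = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # exported as a metric in a real system

    def submit(self, item) -> bool:
        """Accept the item if there is room, otherwise shed it immediately."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, handler, batch: int = 100) -> int:
        """Process up to `batch` queued items; returns how many were handled."""
        done = 0
        while done < batch:
            try:
                item = self._q.get_nowait()
            except queue.Empty:
                break
            handler(item)
            done += 1
        return done
```

Rejecting at `submit` is the degraded mode: callers get a fast "try again later" instead of piling up behind a saturated database.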
4) Automation interactions

Google erases its CDN [Whee, I remember this. See the Site Reliability Workbook for more information on it].

There was a talk about this at SRECon. #LISA18
Reddit, August 2016
- Zookeeper migration
- Turned off autoscaler, because Zookeeper not visible.
- Automation turned autoscaler on.
- Autoscaler turned down rest of reddit. Oops. #LISA18
Complex systems are inherently hazardous systems. --Richard Cook, MD.
Defence: control
- Constraints service which limits automation operations
- e.g. Limit concurrent requests per unit time
- Log to central place
- Provide simple way to turn stuff off. #LISA18
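A toy version of such a constraints service, assuming a single process: it caps how many automated operations may run per time window, logs every decision through one logger, and has a kill switch. `AutomationGuard` and its parameters are made up for this sketch; a real deployment would need shared state across hosts.

```python
import logging
import time
from collections import deque

log = logging.getLogger("automation-guard")  # stand-in for a central log


class AutomationGuard:
    """Limits automated operations per sliding window, with a kill switch."""

    def __init__(self, max_ops: int, window_s: float, clock=time.monotonic):
        self._max_ops = max_ops
        self._window_s = window_s
        self._clock = clock
        self._recent = deque()  # timestamps of recently allowed operations
        self.enabled = True

    def disable(self, reason: str) -> None:
        """The simple way to turn stuff off."""
        self.enabled = False
        log.warning("automation disabled: %s", reason)

    def allow(self, actor: str, action: str) -> bool:
        now = self._clock()
        # Forget operations that have aged out of the window.
        while self._recent and now - self._recent[0] >= self._window_s:
            self._recent.popleft()
        if not self.enabled:
            log.warning("denied %s by %s: automation disabled", action, actor)
            return False
        if len(self._recent) >= self._max_ops:
            log.warning("denied %s by %s: rate limit reached", action, actor)
            return False
        self._recent.append(now)
        log.info("allowed %s by %s", action, actor)
        return True
```

Automation would call `guard.allow("autoscaler", "terminate-instance")` before each destructive step and stop when it returns `False`.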
5) Cyber attacks

Maersk, June 2017. Infected by malware. Turned off all office machines for 3 weeks. Really bad. Cost billions across industry.

- Separate prod from non-prod as much as possible. (See Google's BeyondCorp)
- Break production into zones
- Validate and control what runs in production
- Minimise worst possible blast radius for incidents.
6) Dependency loops
- Shout out to @whereistanya keynote from last year "Have you tried turning it off and on again?"
GitHub, Jan 2018
- 2 hour outage.
- Redis soft dependency turned out to be a hard dependency.

Trello, March 2017
- S3 outage took down web frontend.
- Backends healthchecked frontends, wouldn't come up. Total outage for mobile/other clients.
Defence: layer your infrastructure. Enumerate your dependencies.
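Once dependencies are enumerated, checking them for loops is mechanical. A small sketch, assuming the dependencies have been written down as a mapping from each service to the services it needs in order to start; the service names in the usage note are invented.

```python
def find_cycle(deps):
    """Return one dependency loop as a list of names, or None if there is none.

    deps maps each service to the services it needs in order to start.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, ()):
            colour = state.get(dep, WHITE)
            if colour == GREY:
                # dep is already on the current path: that is a loop.
                return path[path.index(dep):] + [dep]
            if colour == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

For example, `find_cycle({"deploy-tool": ["config-store"], "config-store": ["auth"], "auth": ["deploy-tool"]})` reports the loop through all three services, which is the kind of thing that only shows up when everything has to be started from cold.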
General defences
- Disaster testing
- Fuzz testing
- Chaos engineering (talk tomorrow)
- Incident management process.
- Practice using it.
Defence: comms
- Don't rely on your infrastructure or its dependencies.
- Phone bridge, IRC
- Make sure people know where it is.
- Wallet cards.
- Practice using it.
[Note, for people using AWS: Slack runs on AWS]
Psychology: help people who have been dealing with an incident like this. #LISA18
Slides will be published and will be very linkified. Fin. #LISA18