And now, a thread of Ancient Sysadmin Wisdom: an incomplete list of things we have learned from decades of outages.
"It's always DNS." Yup. Everything relies upon DNS, those relationships are non-obvious, and some things like to cache well beyond your TTL.
"If an outage lasts more than ten minutes, it's likely to last for hours." Yup. Usually related to electric power, but this is a good rule of thumb for "do we activate our DR plan" decisions.
"The bootstrapping problem." We don't usually take EVERYTHING down at once. Very often the thing needed to boot the server lives in one of the VMs that's hosted on that server.
"The herd of elephants." When users find your site is down, they start spamming the refresh button like it's my shitpost button. This dramatically increases load on an already-wobbly site.
"Don't use external domains for internal things." You really want the thing you provide to the world not to power your internal systems; ideally you want outages that take down outside things or internal things but not both at once.
"The Internet has opinions." As much fun as it is to blame cyberattacks or insiders acting in bad faith, the real world is usually a lot less interesting.
"Not all downtime is equal." If you sell shoes and your site goes down, a lot of customers will come back an hour later to buy shoes. Conversely, nobody's coming back in an hour to click an ad for you.
"You'll never map all of your dependencies." How many folks pay a third party vendor to defend against AWS outages, but don't realize that that vendor relies completely upon AWS?
"BGP is the devil." Yes, it is. I'm astounded it works. @ioshints for the professional analysis of that. It's not my area because I still aspire to happiness.
"An outage won't destroy your business." It feels like the world is ending at the time, but taking an outage from time to time is generally okay. If a site is down every third day, in time people go elsewhere.
"Plan for failure." If it can break, it will break. If it can't break, fuck you yes it can.
"Outages like to travel in clusters." Sometimes it's a batch of hard drives failing together, other times it's downstream issues surfacing a day or two later, other other times it's attempted fixes breaking other things subtly. Plan to be busy after a big one.
"Out-of-band access will save your life." Seriously. I've had entire secondary networks installed in data center cages just so I could use some crappy residential DSL line to get in after I'd REALLY broken the firewall. Cheaper than an interstate flight...
"Outage communications should be planned for." Seriously, have a template. You don't want to have to wing it when half of the internet is pounding down your door. And saying nothing enrages people.
"Ignore best practices." Seriously: not having any single points of failure or important nodes is great in theory, so is distributed observability, but if my bastion host that lets me get into the busted firewall goes down, I want good old Nagios blowing me up about it.
"Internal messaging." We all rely on other platforms. When one of them goes down, you're basically stuck until they come back up. Make sure that's messaged to your leadership so it doesn't look like you just don't give a shit that the site is broken.
"Be a good person, do good things." Seriously. Outages are hard. You probably don't want to work somewhere that inspires most of the world to cheer when you go offline. Ahem.
"Rate limits help." As the site recovers, the flock of elephants will attempt to stampede onto it in huge numbers, taxing already overworked systems. Have a way to defer recovery across a broad swath of your users.
"Ensure your vendors all have up to date emergency contacts." Every once in a while I still get a call from the data center I helped set up a decade ago at a long-ago employer. Next time I'm telling them to "shut it down, we have another provider."
"Split horizon DNS is a bad plan." I can't believe I have to mention this, but "you'll send internal data to a different destination depending upon which network your laptop is on" is a horrifying mode.
"Keep your eye on the prize." The outage is big and momentous and important but you should probably not ignore that email about an SSL cert expiring in three days.
"Remember that computers are dumb." If you have alarms that fire a week after an outage because holy SHIT the week-over-week metrics look WAY different right now," you have no one to blame but yourself.
"Institutional knowledge matters." No matter how you run things or document your systems, Pat's been here for twenty years and knows how and why that system runs.

You didn't just Frugally fail to retain Pat last cycle, did you?
"You will hate yourself." It's super important that you find out when certain things break. If the core network breaks, you're about to find out how long it takes your cell provider to work through a backlog of 20k automated SMS messages alerting you about it.
"Wait, what?" Gmail has a hard limit of 3,600 emails it will let an account receive per hour. All of those alert emails get through; anything above the limit bounces.
"This list is not comprehensive." There are always outages caused by things not on this list. What've I missed that you've experienced?
