And now, a thread of Ancient Sysadmin Wisdom: an incomplete list of things we have learned from decades of outages.
"It's always DNS." Yup. Everything relies upon DNS, those relationships are non-obvious, and some things like to cache well beyond your TTL.
"If an outage lasts more than ten minutes, it's likely to last for hours." Yup. Usually related to electric power, but this is a good rule of thumb for "do we activate our DR plan" decisions.
"The bootstrapping problem." We don't usually take EVERYTHING down at once. Very often the thing needed to boot the server lives in one of the VMs that's hosted on that server.
"The herd of elephants." When users find your site is down, they start spamming the refresh button like it's my shitpost button. This dramatically increases load on an already-wobbly site.
"Don't use external domains for internal things." You really want the thing you provide to the world not to power your internal systems; ideally you want outages that take down outside things or internal things but not both at once.
"The Internet has opinions." As much fun as it is to blame cyberattacks or insiders acting in bad faith, the real world is usually a lot less interesting.
"Not all downtime is equal." If you sell shoes and your site goes down, a lot of customers will come back an hour later to buy shoes. Conversely, nobody's coming back in an hour to click an ad for you.
"You'll never map all of your dependencies." How many folks pay a third party vendor to defend against AWS outages, but don't realize that that vendor relies completely upon AWS?
"BGP is the devil." Yes, it is. I'm astounded it works. @ioshints for the professional analysis of that. It's not my area because I still aspire to happiness.
"An outage won't destroy your business." It feels like the world is ending at the time, but taking an outage from time to time is generally okay. If a site is down every third day, in time people go elsewhere.
"Plan for failure." If it can break, it will break. If it can't break, fuck you yes it can.
"Outages like to travel in clusters." Sometimes it's a batch of hard drives failing together, other times it's downstream issues surfacing a day or two later, other other times it's attempted fixes breaking other things subtly. Plan to be busy after a big one.
"Out-of-band access will save your life." Seriously. I've had entire secondary networks installed in data center cages just so I could use some crappy residential DSL line to get in after I'd REALLY broken the firewall. Cheaper than an interstate flight...
"Outage communications should be planned for." Seriously, have a template. You don't want to have to wing it when half of the internet is pounding down your door. And saying nothing enrages people.
"Ignore best practices." Seriously: not having any single points of failure or important nodes is great in theory, so is distributed observability, but if my bastion host that lets me get into the busted firewall goes down, I want good old Nagios blowing me up about it.
"Internal messaging." We all rely on other platforms. When one of them goes down, you're basically stuck until they come back up. Make sure that's messaged to your leadership so it doesn't look like you just don't give a shit that the site is broken.
"Be a good person, do good things." Seriously. Outages are hard. You probably don't want to work somewhere that inspires most of the world to cheer when you go offline. Ahem.
"Rate limits help." As the site recovers, the flock of elephants will attempt to stampede onto it in huge numbers, taxing already overworked systems. Have a way to defer recovery across a broad swath of your users.
"Ensure your vendors all have up to date emergency contacts." Every once in a while I still get a call from the data center I helped set up a decade ago at a long-ago employer. Next time I'm telling them to "shut it down, we have another provider."
"Split horizon DNS is a bad plan." I can't believe I have to mention this, but "you'll send internal data to a different destination depending upon which network your laptop is on" is a horrifying mode.
"Keep your eye on the prize." The outage is big and momentous and important but you should probably not ignore that email about an SSL cert expiring in three days.
"Remember that computers are dumb." If you have alarms that fire a week after an outage because holy SHIT the week-over-week metrics look WAY different right now," you have no one to blame but yourself.
"Institutional knowledge matters." No matter how you run things or document your systems, Pat's been here for twenty years and knows how and why that system runs.
You didn't just Frugally fail to retain Pat last cycle, did you?
"You will hate yourself." It's super important that you find out when certain things break. If the core network breaks, you're about to find out how long it takes your cell provider to work through a backlog of 20k automated SMS messages alerting you about it.
"Wait, what?" Gmail has a hard limit of 3,600 emails it will let an account receive per hour. All of those alert emails get through; anything above the limit bounces.
"This list is not comprehensive." There are always outages caused by things not on this list. What've I missed that you've experienced?
One of the hands down most sobering conversations I’ve ever had was with a bunch of Very Savvy Investment Bankers about what exactly a total failure of us-east-1 would look like economically.
The *best case* outcomes closely resembled a global depression.
That’s comfortably within their probability models for “a US civil war.” They’ve drawn up maps that show likely sides for such an event and they plan accordingly.
These people get paid significantly more than most engineers do to consider risk.
This is why even the most die-hard, all-in cloud customer that's publicly traded keeps "rehydrate the business" level backups either on-prem or with another provider.
While I do enjoy Twitter, I believe it's important to "own my platform." As such, Twitter's not material to the functioning of my business.
But I do talk to a lot of folks here, and a "subscribe" button for @LastWeekinAWS in my profile can't hurt anything...
I started by emailing @revue and checking their Terms of Service. As of today, there's nothing against using their sign-up function and exporting the list to another platform unless I'm directly monetizing the subscribers via subscriptions.
So today I'm going to store 1GB of data in @awscloud's S3 and serve it out to the internet. The storage charge is 2.3¢ per month in the tier 1 regions.
Someone on the internet grabs that 1GB of data once. I'm paying 9¢ to send it to them. You read that right: just shy of four months' worth of storage charges to send it to the internet once.
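For anyone checking the math, a quick back-of-the-envelope using the rates quoted above (2.3¢ per GB-month for S3 Standard storage, 9¢ per GB for data transfer out to the internet):

```python
storage_per_gb_month = 0.023   # USD: S3 Standard storage, tier 1 regions
egress_per_gb = 0.09           # USD: data transfer out to the internet

ratio = egress_per_gb / storage_per_gb_month
print(f"One download of a 1 GB object costs the same as {ratio:.1f} months of storing it.")
# -> roughly 3.9 months, i.e. "just shy of four months"
```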
I might get yelled at for this thread, but we'll give it a shot.
I'm not sure anyone needs to hear it as much as I needed to hear it myself a decade and change ago.
If you work in tech, either as an employee or as a consultant, most people you encounter *will not understand what you do*. "Something to do with the computers" is the best you can hope for.
They may be vaguely aware of a few additional facts. Such as "the company claims that people are their most important asset but pays the people who work on the computers three times what they pay the people who work in HR."