It's time for a thread on why Twitter hasn't gone down yet 🧵
I'm glad you asked. No, not really. It's a stupid fucking question. High performance systems need more maintenance. Athletes sleep more. Pick an analogy. fin.
However, since it keeps being asked, let's attempt to answer the question.
To do this I am absolutely not going to explain distributed systems to the uninitiated. I am instead going to use something dead fucking simple. Cron.
What's cron, you ask, because you've never administered a computer system yet somehow have opinions about massive computer systems? Again, I'm not excited to be here.
Cron is a daemon that runs scheduled jobs. You add a job to a list, and it runs at the minute/hour/day you've set.
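For the uninitiated, a crontab entry is just five time fields plus a command. A minimal, made-up example (the script path is hypothetical):

```
# m   h   dom  mon  dow   command
30    4   *    *    *     /opt/jobs/cleanup.sh   # runs every day at 04:30
```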
As you might imagine, being able to run a job on a server does a lot of cool things. It can also do a lot of bad things if it uses too many resources.
Your local SRE/whoever is going to build a system to put some guardrails around this.
The first step is to reduce the free-for-all within the system. Ideally your users, other engineers, drop their payload in place, set a schedule, and everything works.
This is usually true if they use the tooling rather than recreate the wheel each time.
The "fun" part is that no matter how good your initial system was, you will have missed things. Even with tooling as simple as cron, which is literally "put a file in a directory and set a schedule." Because it's never ever ever that fucking simple when you have 100k+ of something.
Every person who has never administered a massive system still with me so far? No, you're not, because distilling decades of experience into SpongeBob memes is a shit way to explain emergent behavior.
To recap: system does things. More things, more effects, including side effects.
What sort of side effects you ask?
Finally, a not-shit question. The same cron job across multiple machines can create a thundering-herd problem when they all reach for the same external resource at the same moment. Enter randomization.
By default all hourly jobs are randomized by minute
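To make that concrete, here's a minimal sketch (my illustration, not Twitter's actual tooling) of deterministic per-host jitter: hash the hostname and job name into a stable minute, so the fleet's "hourly" runs smear across the hour instead of stampeding at the same time.

```python
import zlib


def jitter_minute(hostname: str, job_name: str) -> int:
    """Stable pseudo-random minute offset in 0..59 for this host+job pair.

    Deterministic: the same host always gets the same minute, so the
    schedule is stable across reboots, but the fleet spreads out.
    """
    return zlib.crc32(f"{hostname}:{job_name}".encode()) % 60
```

Hashing rather than calling `random` means you don't need to store the chosen minute anywhere; it's recomputable on every host.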
Problem solved!
Of course not you sweet summer child. YOU HAVE 100k+ OF THESE CASES.
HWENG stops by your desk to ask if you know why power consumption IN THE ENTIRE GOD DAMNED DATACENTER is "weird" at the top of every minute.
Jobs use power. Jobs are scheduled to the minute, aka ??:00, so every job in the fleet fires at second zero. This is bad when you have 1M+ jobs.
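One common fix for this class of problem (a hedged sketch, not necessarily what was done here) is second-level "splay": each job sleeps a random fraction of the minute before doing real work, so the load smears across all 60 seconds instead of spiking at :00.

```python
import random
import time


def run_with_splay(job, max_splay_seconds: float = 59.0) -> None:
    """Delay the job's start by a random amount within its minute.

    `job` is any zero-argument callable; the random sleep spreads the
    fleet's simultaneous wakeups across the whole minute.
    """
    time.sleep(random.uniform(0.0, max_splay_seconds))
    job()
```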
"hah hah, this system is firming up nicely. Shouldn't have any more oddball problems."
Later that year a team manages to fill the disks on 20k (maybe? it was "a lot") machines through the logging of their broken job.
So now you have to solve enough of logging to make that mostly impossible to do again.
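"Solving enough of logging" might look something like size-capped rotation per job. A sketch using Python's stdlib (the names are mine, and the real guardrail was presumably enforced by the job system itself rather than left to each job):

```python
import logging
import logging.handlers


def capped_logger(name: str, path: str, max_bytes: int = 10_000_000) -> logging.Logger:
    """Logger whose on-disk footprint is bounded no matter how chatty the job is."""
    logger = logging.getLogger(name)
    # Rotate when the file would exceed max_bytes; keep at most 3 old
    # copies, so total disk usage is capped at roughly 4 * max_bytes.
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=3)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A broken job looping on an error message now churns its own log files instead of eating the machine's disk.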
Some asshat will ask why the system didn't already solve for people fucking around and finding out.
<dead eyes engineer stare>
This is just one of the 100+ systems or subsystems you own. And really is one of the simplest to reason about. And you're chasing new behavior generated by users, upgrades, etc on all of them.
After several years and several SEVs (ha!) the cron job config system starts to feel pretty good. It has now grown to 1000s of lines that protect against all manner of footguns while continuing to be mostly user-friendly.
It's this defense in depth around millions of components that keeps massive complex systems running without requiring round the clock maintenance.
To recap: the fact that Twitter continues to work is a testament to the 1000s of engineer-years spent building that reliability.
But as engineers, we know that failure is coming without continued investment to protect against the next thing.
