Ramin Khatibi
Nov 18, 2022 · 18 tweets · 5 min read
It's time for a thread on why Twitter hasn't gone down yet 🧵
[Image: Mocking Spongebob meme, "If all these people were important,"]
I'm glad you asked. No, not really. It's a stupid fucking question. High performance systems need more maintenance. Athletes sleep more. Pick an analogy. fin.

However since it keeps being asked let's attempt to answer the question.
To do this I am absolutely not going to explain distributed systems to the uninitiated. I am instead going to use something dead fucking simple. Cron.
What's cron you ask because you've never administrated a computer system yet somehow have opinions about massive computer systems? Again, I'm not excited to be here.

Cron is a daemon that runs scheduled jobs. You add the job to a list, and it runs at the minute/hour/day you've set.
As you might imagine, being able to run a job on a server does a lot of cool things. It can also do a lot of bad things if it uses too many resources.

Your local SRE/whoever is going to build a system to put some guardrails around this.
The first step is to reduce the free-for-all within the system. Ideally your users, other engineers, drop their payload in place, set a schedule, and everything works.

This is usually true if they use the tooling rather than reinvent the wheel each time.
The "fun" part is that no matter how good your initial system was, you will have missed things. Even with simple tooling like cron, which is literally "put a file in a directory and schedule it." Because it's never ever ever that fucking simple when you have 100k+ of something.
Every person who has never administrated a massive system with me so far? No, you're not, because distilling decades of experience into Spongebob memes is a shit way to explain emergent behavior.

To recap: system does things. more things, more effects including side effects
What sort of side effects you ask?

Finally a not shit question. The same cronjob across multiple machines can create a thundering herd problem when they all reach for the same external resource. Enter randomization.

By default all hourly jobs are randomized by minute
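A minimal sketch of what "randomized by minute" could look like. The thread doesn't say how the minute was picked, so this assumes a stable hash of the host and job name; `splay_minute`, `hourly_cron_line`, and the hostnames are made up for illustration.

```python
import hashlib


def splay_minute(hostname: str, job_name: str) -> int:
    """Map a (host, job) pair to a stable minute in the range 0..59.

    Assumption: hashing is used so the minute stays the same across
    config pushes; the thread only says the minute is "randomized".
    """
    digest = hashlib.sha256(f"{hostname}:{job_name}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 60


def hourly_cron_line(hostname: str, job_name: str, command: str) -> str:
    """Build a crontab line that runs `command` once an hour at the splayed minute."""
    return f"{splay_minute(hostname, job_name)} * * * * {command}"


if __name__ == "__main__":
    # Hypothetical hosts: the same job lands on different minutes per machine,
    # so the fleet doesn't hit the shared external resource all at once.
    for host in ("web-001.example", "web-002.example", "web-003.example"):
        print(host, "->", hourly_cron_line(host, "rotate-keys", "/usr/local/bin/rotate-keys"))
```

A hash rather than a fresh random number means a given machine keeps its slot, which makes "when did this job run on that box" answerable later.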
Problem solved!

Of course not you sweet summer child. YOU HAVE 100k+ OF THESE CASES.
HWENG stops by your desk to ask if you know why power consumption IN THE ENTIRE GOD DAMNED DATACENTER is "weird" at the top of every minute.

Jobs use power. Jobs are scheduled to the minute aka ??:00. This is bad when you have 1M+ jobs.
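One plausible way to flatten that top-of-the-minute spike is to have each job wait a random number of seconds before it starts. The thread doesn't say what the actual fix was, so treat this as a sketch under that assumption; `run_with_jitter` and its parameters are hypothetical.

```python
import random
import time


def run_with_jitter(job, max_delay_s=59.0, rng=random, sleep=time.sleep):
    """Sleep a random 0..max_delay_s seconds, then run `job`.

    Assumption: cron still fires at second 00, and the wrapper spreads the
    real start time across the minute. `rng` and `sleep` are injectable so
    the behavior can be checked without actually waiting.
    """
    delay = rng.uniform(0.0, max_delay_s)
    sleep(delay)
    return delay, job()


if __name__ == "__main__":
    # Short max delay here just so the demo finishes quickly.
    delay, result = run_with_jitter(lambda: "done", max_delay_s=2.0)
    print(f"waited {delay:.2f}s, job returned {result!r}")
```

The maximum delay should stay well below the job's own interval, otherwise a delayed run can overlap the next one.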
"hah hah, this system is firming up nicely. Shouldn't have any more oddball problems."

Later that year a team manages to fill the disk on 20k (maybe? it was "a lot") machines through logging of their broken job.
So now you have to solve enough of logging to make that mostly impossible to do again.
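"Solve enough of logging" could mean a lot of things. As one hedged example, here is a per-job logger with a hard size cap built on Python's standard `RotatingFileHandler`. The function name, the default limits, and the idea that the tooling hands jobs a pre-capped logger are all assumptions, not something the thread describes.

```python
import logging
from logging.handlers import RotatingFileHandler


def capped_job_logger(job_name, log_path, max_bytes=1_000_000, backups=2):
    """Return a logger whose files on disk stay near max_bytes * (backups + 1).

    Assumption: each log record is much smaller than max_bytes. A single
    record larger than the cap can still push one file over the limit.
    """
    logger = logging.getLogger(f"cronjob.{job_name}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep job noise out of the root logger
    if not logger.handlers:
        # Rolls the file over before it would exceed max_bytes and keeps
        # only `backups` old files, deleting the oldest.
        handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = capped_job_logger("broken-job", "broken-job.log", max_bytes=10_000, backups=1)
    for _ in range(100_000):
        log.error("same failure, again")  # disk use stays bounded anyway
```

A broken job can still scream into its log forever, but it can no longer take the disk with it.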

Some asshat will ask why the system didn't already solve for people fucking around and finding out.

<dead eyes engineer stare>
This is just one of the 100+ systems or subsystems you own. And really is one of the simplest to reason about. And you're chasing new behavior generated by users, upgrades, etc on all of them.
After several years and several SEVs (ha!) the cronjob config system starts to feel pretty good. And it has now grown to 1000s of lines that protect from all manner of footguns while also continuing to be mostly user friendly.
It's this defense in depth around millions of components that keeps massive complex systems running without requiring round the clock maintenance.
To recap: the fact that Twitter continues to work is a testament to the 1000s of engineer-years spent building that reliability.

But as engineers, we know that failure is coming without continued investment to protect against the next thing.
