Maggie Johnson-Pint
Jul 2, 2023
This is weird to say out loud, but I'm actually kind of an expert in rate limiting, so I'm gonna explain some stuff.

About half of the incidents in large-scale production systems involve having more requests than you can serve. There are two categories of this kind of incident:
1. Top-down overload, or the "Reddit Hug of Death": this is what Bluesky experienced today - suddenly there was a HUGE demand surge and the servers just *couldn't* for a while. This also happens after Super Bowl ads, when pop stars announce tours, or during DDoS attacks.
2. Bottom-up: the less obvious and more common scenario, where something inside the system fails and leaves it unable to serve normal load.
If you lose a Redis cache and everything falls through to the database, you drastically reduce your ability to serve requests.
Similarly, if a database replica, cloud region, or cluster goes down, you will be in a really tough spot serving your normal workload.

And of course, if a developer on one service writes code that suddenly slams another service, that's "DDoSing yourself" and is also bottom-up.
I don't know what happened at Twitter today, but I don't think Elon woke up and decided to shut it all down. My bet is some bottom-up problem (though not necessarily the "DDoSed yourself" problem everyone is tweeting about - that could be an effect of getting limited, not the cause).
In these scenarios, the rate limiter is the only thing standing between you and death - because if computers get hit with more requests than they can deal with, they eventually run out of memory and crash.
Even if they don't crash, requests stack up waiting for completion - this is called 'backup' - and that's what causes the slowness in the requests that do work.

Backups have this bad effect of causing users to refresh the page, causing more requests and... more backups.
What is the rate limiter tho?

At the simplest level, a rate limiter is a program that says "this computer can only do X requests per second" and stops all the others with a "429 Too Many Requests" response.
In CS terms, this is implemented with a 'leaky bucket' algorithm - but that is not important unless you are making one.

Most good rate limiters can get pretty fancy and split quotas by things like customer, customer plan (Twitter verified, for instance), or feature.
The best rate limiters are 'adaptive', and can change rate limits based on system stress, priority of requests, and other things.
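To make that concrete, here's a minimal TypeScript sketch of a leaky-bucket limiter with per-customer keys and per-plan quotas. This is an illustration, not Twitter's actual limiter - the class name, plan tiers, and numbers are all made up.

```typescript
// A minimal leaky-bucket limiter keyed per customer (illustrative only).
// Each key gets a bucket that drains at `ratePerSec` and holds at most
// `burst` requests; when the bucket is full, the caller should answer 429.

interface Bucket {
  level: number;       // how "full" this key's bucket currently is
  lastDrainMs: number; // when we last applied the constant drain
}

class LeakyBucketLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(private ratePerSec: number, private burst: number) {}

  /** Returns true if the request may proceed, false if it should get a 429. */
  allow(key: string, nowMs: number = Date.now()): boolean {
    const b = this.buckets.get(key) ?? { level: 0, lastDrainMs: nowMs };

    // Drain at a constant rate for the time that has passed since we last looked.
    const elapsedSec = (nowMs - b.lastDrainMs) / 1000;
    b.level = Math.max(0, b.level - elapsedSec * this.ratePerSec);
    b.lastDrainMs = nowMs;

    const allowed = b.level + 1 <= this.burst;
    if (allowed) b.level += 1;

    this.buckets.set(key, b);
    return allowed;
  }
}

// Splitting quotas by plan, as described above (the numbers are made up):
const limiters = {
  verified: new LeakyBucketLimiter(100, 200), // 100 req/s, bursts up to 200
  free: new LeakyBucketLimiter(10, 20),
};

function handleRequest(customerId: string, plan: "verified" | "free"): number {
  return limiters[plan].allow(customerId) ? 200 : 429; // HTTP status to return
}
```

An adaptive limiter goes a step further and adjusts the rate on the fly based on signals like CPU, queue depth, or error rate.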

Twitter has a really good one because they had a really exceptional infra team until a year ago.
Now, a lot of people only think of the rate limiter as something that goes at the 'front' of the infra to prevent the top-down kind of problem. But in fact, advanced infra teams (including Twitter in the good times, I'm sure) routinely use them *between all processes*.
If you use it between all processes, then you can prevent one system from overloading another system, preventing all kinds of cascading failure scenarios.
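As a rough example of what that looks like on the calling side, here's a TypeScript sketch where one internal service keeps a local budget for calls to another and sheds load itself once that budget is used up. It's a simple concurrency cap rather than a true rate limiter, and the service URL and numbers are invented.

```typescript
// Sketch of a caller-side guard between two internal services. The URL and
// the budget of 50 are made-up examples.

class OutboundBudget {
  private inFlight = 0;

  constructor(private maxInFlight: number) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.maxInFlight) {
      // Shed load locally instead of piling more work onto a struggling dependency.
      throw new Error("shed: downstream budget exhausted");
    }
    this.inFlight++;
    try {
      return await fn();
    } finally {
      this.inFlight--;
    }
  }
}

// e.g. the timeline service never allows more than 50 concurrent calls
// to the user service, no matter how much traffic it is getting itself.
const userServiceBudget = new OutboundBudget(50);

async function fetchUserProfile(id: string): Promise<unknown> {
  return userServiceBudget.call(() =>
    fetch(`https://user-service.internal/users/${id}`).then(r => r.json())
  );
}
```

Real setups layer timeouts, retries, and circuit breakers on top, but even this simple version keeps one struggling dependency from dragging down everything that calls it.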

When you do this, you also implement a key pattern: exponential backoff.
When a program sees a rate limiting error (a 429 status), it's a signal that the request can potentially be retried after a bit and might succeed.
It seems logical to retry in a loop - but what if stuff is *Really* broken? A loop firing every second in every client is ‼️‼️
Instead, you do an "exponential backoff": first you retry after 1 second, then wait 2 seconds, then 4, then 8, 16, 32, 64 and so on (I used base 2 there, but use whatever base you like).
This gives the servers a 'breather' if something really bad is going on, instead of slamming them with retries every second.
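The client side of that pattern is only a few lines. Here's a rough TypeScript sketch (the URL and retry cap are placeholders):

```typescript
// The retry pattern described above, as client code: on a 429, wait 1s, 2s,
// 4s, 8s... before retrying, and give up after a handful of attempts.

async function fetchWithBackoff(url: string, maxRetries = 6): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429 || attempt >= maxRetries) {
      return res; // success, a non-retryable status, or we're out of retries
    }
    const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, 8s, ...
    await new Promise<void>(resolve => setTimeout(resolve, delayMs));
  }
}

// Usage:
// const res = await fetchWithBackoff("https://api.example.com/timeline");
```

Production implementations usually also add a little random jitter to each delay so that thousands of clients don't all retry at the same instant.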
Now, all this context brings us back to today.

BEGIN PURE SPECULATION, I DO NOT KNOW!
My hypothesis: Twitter lost a big part of a critical back-end system - maybe they stopped paying their GCP bill, maybe they lost a critical cache and everything started reading from the underlying data store. I truly do not know.
At this point, their probably very good adaptive rate limiter said 'ohshit' and brought the number of requests WAY WAY down throughout the system.

The infinite loop screenshot floating around? Front end code sees the 429 and retries, but without exponential backoff.
Then of course Elon gonna Elon so it's intentional 🤷‍♀️.

But really, when you have a major outage, you intentionally do what Twitter engineering was doing: bring requests back gradually so you don't overload all over again - and their smart rate limiter can do that.
END SPECULATION
One common question I get is 'why not just autoscale?'

1. Not everything easily can
2. Autoscaling is *expensive* for problems that only happen a couple of minutes a day or a year
3. Autoscaling takes a few minutes to kick in; rate limiting fills the gap
4. Sometimes even the cloud runs out
Another: "I'm a product developer - why do I care about an infra problem?"

1. If you handle this in code, you can do something other than give your users 'error'
2. If you handle this in client code, you can save the entire infrastructure by never sending the request at all. Literal hero shit.
Anyways, hope this was informative to someone somewhere because it took a while to write 😂.
Here are my financial interests in this topic. 😀
We are running a private beta - DM if interested.

Getting rate limited again. I assume we crossed midnight UTC and this madness restarted.

Good grief.
If you find yourself configuring a daily rate limit that resets globally at midnight UTC, ask yourself why - you're gonna get HUGE traffic spikes at that time, and honestly you could run way cheaper if you smeared the reset across users' local time.
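For anyone who wants to actually do that, here's a rough TypeScript sketch of one way to smear the reset. The thread above suggests the user's local time; hashing the user ID into a stable offset spreads the resets the same way without needing timezone data. The helper names are made up.

```typescript
// One way to "smear" a daily quota reset instead of resetting everyone at
// 00:00 UTC: give each user a stable offset into the day, derived from a hash
// of their ID, so resets are spread evenly around the clock.

const DAY_MS = 24 * 60 * 60 * 1000;

// Cheap, stable hash of the user ID -> offset in [0, DAY_MS).
function userResetOffsetMs(userId: string): number {
  let h = 0;
  for (const ch of userId) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return h % DAY_MS;
}

// The start of the daily quota window that `nowMs` falls into for this user.
function quotaWindowStart(userId: string, nowMs: number = Date.now()): number {
  const offset = userResetOffsetMs(userId);
  return Math.floor((nowMs - offset) / DAY_MS) * DAY_MS + offset;
}

// Key the daily counter by (userId, quotaWindowStart(userId)) instead of
// (userId, UTC date), and the midnight-UTC thundering herd disappears.
```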
Since this seems to be leaving tech Twitter and breaking into the real world: if you don't code at all, this summary might help.

Top down overload: "covid is here, everyone buy TP"

Bottom up overload: "can't staff the registers come back later"
