So I've just spent many hours chasing down what was causing Stack Overflow to stall for about 1-2 minutes every hour (with a lot of help from @JasonPunyon). I think this is an interesting cache story, so I'm going to lay it out in hopes it helps other people.
Once an hour we have a cache that falls out. It does so in the background, with *1* thread/task handling the recache. We call this a .GetSet<T> - it locks to know it's the worker, and kicks off a background task to actually go refresh some data.
We have 2 durations here: cache time, and stale cache time. We can cache it for 1 hour, but keep a stale cache for 7 days (ample time if the back-end thing it's fetching from goes down...we just use old cache). This works via an object with a date and the data in it.
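For illustration, a minimal sketch of what that wrapper might look like in C# (the type and property names here are assumptions, not our actual code):

    // Hypothetical wrapper: the payload plus when it was stored, so callers can tell
    // "past the cache time" (still serve it, refresh in the background) apart from
    // "past the stale time" (too old to serve at all).
    public class CacheEntry<T>
    {
        public DateTime StoredUtc { get; set; }
        public T Data { get; set; }
    }

    // Two windows apply to the same entry, e.g.:
    //   cache time: 1 hour -> after this, serve the old value but kick off a refresh
    //   stale time: 7 days -> after this, the value is considered unusable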
When a .GetSet<T> call happens past the cache time (e.g. 1 hour and 1 second), we trigger that background refresh (if we got the lock). So far so good: this works pretty well and prevents users from seeing a stall vs., say, a cache that isn't there where you'd synchronously go get it.
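Roughly how that path behaves, as a sketch using the CacheEntry<T> shape above (GetFromCache, TryGetRefreshLock, StoreInCache, and PurgeOtherServers are placeholder helpers, not our real API):

    public T GetSet<T>(string key, Func<T> refresh, TimeSpan cacheTime, TimeSpan staleTime)
    {
        var entry = GetFromCache<CacheEntry<T>>(key);
        var age = entry == null ? TimeSpan.MaxValue : DateTime.UtcNow - entry.StoredUtc;

        if (entry != null && age <= cacheTime)
        {
            // Fresh enough: just serve it.
            return entry.Data;
        }

        if (entry != null && age < staleTime)
        {
            // Past the cache time but not stale: serve the old value, and let exactly one
            // caller (whoever wins the lock) refresh it on a background task.
            if (TryGetRefreshLock(key))
            {
                Task.Run(() =>
                {
                    var fresh = new CacheEntry<T> { StoredUtc = DateTime.UtcNow, Data = refresh() };
                    StoreInCache(key, fresh, staleTime);
                    PurgeOtherServers(key); // Redis pub/sub, described in the next tweet
                });
            }
            return entry.Data;
        }

        // Nothing usable at all: the only case where the caller blocks on the fetch.
        var value = new CacheEntry<T> { StoredUtc = DateTime.UtcNow, Data = refresh() };
        StoreInCache(key, value, staleTime);
        return value.Data;
    }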
In that background refresh we get the new value, then purge the old value (via Redis pub/sub) from all web servers. Then they get the new value on the next fetch. Again, so far so good. Or so we thought.
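If it helps to picture the purge, here's roughly what that broadcast looks like with StackExchange.Redis (the channel name and the local-cache eviction below are assumptions for illustration, not our actual code):

    using System.Runtime.Caching;
    using StackExchange.Redis;

    var redis = ConnectionMultiplexer.Connect("redis-server:6379");
    var sub = redis.GetSubscriber();

    // Refresher side: after writing the fresh value to Redis, tell every web server
    // to drop its local in-memory copy of that key.
    sub.Publish("cache-purge", "the-key-that-was-refreshed");

    // Web server side: on receiving a purge, evict the local copy; the next fetch
    // falls through to Redis and picks up the new value.
    sub.Subscribe("cache-purge", (channel, key) =>
    {
        MemoryCache.Default.Remove((string)key);
    });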
The trouble is that between that pub/sub purge and the new value landing in storage, all *other* web servers got a miss when they fetched that key. Well okay, the fetch didn't work, but that just means a thing wasn't there...not a big boom. The problem is when the cache *was* there ~1.5 seconds later.
When the key (which had just exploded to 5MB - key point) was there, all of those .Get<T> ("fetch it if it's there...but don't block to load it") calls *all* tried to fetch it. Since 5MB takes a measurable amount of time to get out of the pipe, these calls added up quickly.
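For contrast with .GetSet<T> above, .Get<T> as described is just a passive read, something like this (again a sketch with the same placeholder helper):

    // Return the cached value if it's there; never block to load or refresh it.
    public T Get<T>(string key)
    {
        var entry = GetFromCache<CacheEntry<T>>(key);
        return entry != null ? entry.Data : default(T);
    }

    // Harmless on a miss, but on a hit every caller pulls the full value over the wire -
    // which is exactly what thousands of requests did against a 5MB key.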
Each web request to Questions/Show (our question page) was asking for the key. That means in that 1-2 minute window we asked for the key north of 12,000 times. Uh oh, this is where shit hit the fan.
Here's what our web tier bandwidth looked like (bytes per second): [bandwidth graph]
More importantly, here's what that Redis server's bandwidth looked like at the same time (this is all from 1 instance): [bandwidth graph]
That 5MB key fetched many, many times quickly added up to 30-60GB transferred in that 1-minute window every hour. This is enough to clog up even a 2x10Gb network. One key saturated the network no problem, due to how that cache works. It wasn't designed for any HUGE keys.
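The back-of-the-envelope math, using the rough numbers above:

    12,000 fetches x 5 MB each   ~ 60 GB pulled from that one Redis instance
    2 x 10 Gb/s of network       ~ 2.5 GB/s of theoretical throughput
    60 GB / 2.5 GB/s             ~ 24+ seconds of a fully saturated pipe,
                                   with everything else queuing behind it -
                                   that's the 1-2 minute stall.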
We don't want to block the question page on this (e.g. no locking) and we want fresh cache every hour-ish, so we'll need to change up how this works. We've already spent more than 2 days on approaches here (that failed)...we didn't plan for this scenario. So: back to spec.
Sometimes code isn't the answer. We have enough tricky code here. I'm pausing to step back and write down requirements of what we actually want to do here, then approach it fresh with ideas. For today, we've pushed out that cache duration to 24 hours and are working on a fix.
In case it helps anyone else, here's what our internal docs look like for something like this.

I try my best to treat it as a learning exercise as we go; hangouts are open for all to join as we investigate, so people can ask questions, help out, bounce ideas, etc.