christina 死神 (@chhopsky) · Feb 19 · 30 tweets
story time: the launch of Dauntless (2019) was the most difficult launch of my career

we planned for 260k ccu peak (players online). during open beta (2018), we'd hit 65k with some wrangling

it fell over at 10k ccu. took 3 weeks of 15 hour days 7 days a week to get it stable
it was a FOTM viral breakout hit with the launch of the Epic Games Store

but after 12 months of code changes and infrastructure upgrades, we were hitting issues at roughly 1/6th of the capacity we'd already proven during the beta.

we'd load-tested every service up to 260k, and run bots to 100k. didn't matter. shit broke
we had a war room with permanently open calls with xbox, playstation, epic, and google

the process was straightforward:
1. identify the current bottleneck
2. deploy a fix
3. see if the graphs indicated things had gotten better
4. increase the max allowed ccu
5. repeat
i cannot think of a single piece of infrastructure that didn't have issues

and again, every single one of these pieces had performed admirably during load testing. we spent SO much money on cloud services running bots, scripts, and swarms
most of the issues have faded from memory, but one of the most interesting ones came from the fact that much of the load testing used randomly generated players. what we found in reality was that player IDs were not randomly distributed
so even though the system could theoretically handle a much larger load, the automatic sharding that spread players across databases ended up putting most of the load on a single database at any one time. and sharding is VERY difficult to undo or change without a migration
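to illustrate the kind of thing that bites you here (a toy sketch only, not their actual sharding scheme, and every number is made up): if shards own contiguous ranges of the ID space and launch traffic skews toward recently created, roughly sequential IDs, one shard eats nearly everything, while hashing the ID would have spread it out

```python
# toy model: shards own contiguous ranges of the player ID space, and
# launch-day traffic skews heavily toward recently created, sequential-ish IDs
import random
from collections import Counter

NUM_SHARDS = 8
MAX_ID = 1_000_000

def range_shard(player_id: int) -> int:
    # each shard owns a contiguous slice of the ID space
    return min(player_id * NUM_SHARDS // MAX_ID, NUM_SHARDS - 1)

def hash_shard(player_id: int) -> int:
    # hashing the ID spreads almost any distribution roughly evenly
    return hash(str(player_id)) % NUM_SHARDS

# most active players at launch are new accounts with high, clustered IDs
active_players = [random.randint(900_000, 1_000_000) for _ in range(100_000)]

print("range-based:", Counter(range_shard(p) for p in active_players))  # one hot shard
print("hash-based: ", Counter(hash_shard(p) for p in active_players))   # roughly even
```

and the painful part is the last bit above: once data is already living on those shards, changing the scheme means a migration, which is exactly what you can't afford mid-launch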
we ended up in a situation where load on individual databases meant that 1 minute of live time was taking more than 1 minute to back up. which meant backups got further and further behind. eventually we chose to just turn backups off and rely on read replicas for redundancy
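the backup spiral in one tiny calculation, with completely invented numbers (only the shape matters): once backing up a minute of writes takes more than a minute, the lag can only grow

```python
# hypothetical: 60 seconds of live writes takes 75 seconds to back up
backup_seconds_per_live_minute = 75
lag_minutes = 0.0
for hour in range(1, 13):
    # every live minute adds (75 - 60) seconds of backlog
    lag_minutes += 60 * (backup_seconds_per_live_minute - 60) / 60
    if hour % 4 == 0:
        print(f"after {hour}h of play, backups are ~{lag_minutes:.0f} minutes behind")
```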
furthermore, while we'd accounted for player load, the combination of those growing transaction logs and far more players storing far more data meant storage usage was climbing faster than we'd planned for

so while we had live services falling over, we had a ticking time bomb
dying services meant fewer people could play, but if the database servers' drives filled up, the game would literally stop working for everyone. so we had to balance the sword of damocles with the fires of rome

we expanded the hosted google mysql instances as far as the product would allow
but the math on the rate of storage growth meant we had ~5 days to solve it before the game was DEAD dead. we ended up migrating all our hosted google mysql to our own mysql instances running on regular compute, where we could keep adding storage
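the "~5 days" style of math is just a back-of-the-envelope projection, something like this, with entirely invented numbers:

```python
# hypothetical figures for illustration only
disk_capacity_gb = 10_000     # total storage available to the database host
used_gb = 7_000               # current usage
growth_gb_per_hour = 25       # player data plus the ever-growing transaction logs

hours_left = (disk_capacity_gb - used_gb) / growth_gb_per_hour
print(f"~{hours_left / 24:.1f} days until the drives fill and the game stops")  # -> ~5.0 days
```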
while all this was happening, we were also trying to fix matchmaking, which took a player logging on, picked a cluster for them, and handed them a game server to connect to. except as the clusters began to fall over, ping-based selection produced bad results
initially we'd planned for one cluster per region, but we had far more players in some areas than we'd planned for, and the matchmaker had no visibility into server load. kubernetes (the server platform) couldn't give it the info it needed to choose
at one point i was compiling kubernetes from source so i could add a non-serving node to google's kube infrastructure and bolt a new API endpoint onto it, exposing the data we needed to tell how loaded a cluster was. eventually i found another way
so instead i rewrote the matchmaking algorithm to separate the concepts of cluster and region, estimated load from pod response times, and we spun up multiple clusters in the same region
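a minimal sketch of that reworked idea, with invented names and numbers (this is not the actual matchmaker): regions and clusters are separate, load is estimated from recent pod response times, and the least-loaded cluster in the player's region wins

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Cluster:
    name: str
    region: str
    recent_response_ms: list = field(default_factory=list)  # rolling sample of pod response times

    def estimated_load(self) -> float:
        # slower responses roughly track a more loaded cluster; no samples means it's fresh
        return mean(self.recent_response_ms) if self.recent_response_ms else 0.0

def pick_cluster(clusters: list, player_region: str) -> Cluster:
    # prefer clusters in the player's region, then the least estimated load;
    # fall back to any region if the local clusters are all drowning
    local = [c for c in clusters if c.region == player_region]
    return min(local or clusters, key=lambda c: c.estimated_load())

clusters = [
    Cluster("na-1", "na", [480.0, 510.0]),   # falling over
    Cluster("na-2", "na", [35.0, 42.0]),     # freshly spun up in the same region
    Cluster("eu-1", "eu", [60.0]),
]
print(pick_cluster(clusters, "na").name)     # -> na-2
```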
while this was happening, telemetry went down. telemetry is how the fuck you tell what is happening. with no visibility into player data we may as well not have a game

i'd previously written a shim to sit in front of telemetry to selectively route between multiple systems
but the sheer load was too much. i basically commented out the entire front end and just had it punt everything straight to the ETL (data pipeline), and reported success before it had actually happened so connections closed faster
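something like this in shape (the names and the ETL hand-off are invented, and the real shim sat in front of multiple telemetry systems): ack immediately, forward in the background

```python
import queue
import threading

etl_queue: "queue.Queue[bytes]" = queue.Queue()

def handle_telemetry(payload: bytes) -> dict:
    # report success before anything is processed; we accept the risk of losing
    # events if the process dies in exchange for closing connections fast
    etl_queue.put(payload)
    return {"status": "ok"}

def etl_forwarder() -> None:
    # background worker draining events into the data pipeline
    while True:
        send_to_etl(etl_queue.get())
        etl_queue.task_done()

def send_to_etl(payload: bytes) -> None:
    pass  # stand-in for whatever the real ETL ingest call was

threading.Thread(target=etl_forwarder, daemon=True).start()
```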
other issues, while all this was going on:

1. at one stage the game went down because an engineer who'd worked 2 weeks straight put a curly brace in the wrong place which caused the login/auth system to not have all the data it needed. took 20 people 4 hours to figure out why
2. the redis caches weren't keeping up, but i'd planned for this and was able to manually spin up additional instances and adjust the sharding, while making sure that a player's first call to a new instance would check the instance they would have been on before, in case it held old data (sketch of the pattern below, after this list)
3. turns out players spam-hammer escape in the login queue. someone had edited the player controller to catch that input during the queue. that input triggered the main menu, which sent a request to a service for menu data. but that data was only cached once the menu actually displayed...
and since the main player menu couldn't display during the queue, the data never got cached. it was a superquery, which meant the microservice it hit contacted a bunch of other microservices behind the scenes. an EXPENSIVE call, but it was only ever meant to happen once
that would have been fine in-game, but every player smashing that key was generating orders of magnitude more traffic, and no-one ever noticed this beforehand because how could they? it was a sequence that could never have been detected before launch
it would have required a client patch to fix properly, so instead we deployed a fix into the service that rate-limited players' ability to hit that endpoint. it was janky, and it meant their first menu open in game wouldn't be loaded when they tried, but it was what we had (rough sketch after this list)
4. one particular service was seeing increased response times, and requests were beginning to time out as it hit its limit. we added more servers but it did nothing; they sat around idle. when we dug into it, it turned out a hidden 'affinity' setting wasn't distributing traffic evenly
so one hardware node in one cluster was using a single NIC to serve all that traffic, and it was flatlined. with help from google we were able to reset the affinity and distribute the service across the nodes in the cluster, and everything immediately resolved
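for item 2, the miss-and-fall-back pattern sketched with plain dicts standing in for the redis instances (the shard counts and the crc32 choice are illustrative assumptions, not what actually ran):

```python
import zlib

OLD_SHARD_COUNT = 4
NEW_SHARD_COUNT = 8
shards = [dict() for _ in range(NEW_SHARD_COUNT)]   # stand-ins for cache instances

def shard_index(player_id: str, shard_count: int) -> int:
    # deterministic hash so old and new layouts can both be computed
    return zlib.crc32(player_id.encode()) % shard_count

def cache_get(player_id: str):
    new_idx = shard_index(player_id, NEW_SHARD_COUNT)
    value = shards[new_idx].get(player_id)
    if value is None:
        # first call after resharding: check where the old layout put them
        old_idx = shard_index(player_id, OLD_SHARD_COUNT)
        value = shards[old_idx].get(player_id)
        if value is not None:
            shards[new_idx][player_id] = value   # lazily migrate forward
    return value

# data written under the old 4-shard layout is still readable under the new one
shards[shard_index("player-123", OLD_SHARD_COUNT)]["player-123"] = {"example": 1}
print(cache_get("player-123"))   # -> {'example': 1}
```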
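and for item 3, the server-side band-aid was a per-player rate limit on the expensive menu-data endpoint, since a client patch wasn't an option. roughly this shape, with invented window and limit values:

```python
import time
from collections import defaultdict

RATE_WINDOW_SECONDS = 30      # hypothetical window
MAX_CALLS_PER_WINDOW = 1      # hypothetical limit

_recent_calls: dict = defaultdict(list)

def allow_menu_request(player_id: str) -> bool:
    now = time.monotonic()
    calls = _recent_calls[player_id]
    # drop calls that have aged out of the window
    calls[:] = [t for t in calls if now - t < RATE_WINDOW_SECONDS]
    if len(calls) >= MAX_CALLS_PER_WINDOW:
        return False   # swallow the spam; the next in-game menu open retries
    calls.append(now)
    return True
```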
there were many more - three full weeks of it. fifteen hours a day, seven days a week. i worked all day, went home, slept, and went straight back to the office. when i ran out of clothes i started stealing swag from marketing because i didn't have time to do laundry
a friend ended up flying out internationally to look after me during that period because there was not enough time to operate my life. i wasn't eating properly, didn't have time to walk the dogs or clean, lived on delivery. the contents of my fridge went rotten
i also had an unrelated medical issue at the time due to a medication change, so while all that was happening i was dealing with vomiting, nausea, brain zaps, diarrhea, shakes, muscle cramps, headaches. i took codeine for the pain, valium for the cramps, booze to sleep, red bull to wake
i've worked multi-week live esports events, done multi-day network outages, even 911 emergency services, and nothing prepared me for the sheer destructive toll that launch took on my mind and body. by the end, i was a shell of a human.

the friend described me as "not a person"
after 21 days, we were stable at 300k ccu. we deployed 1700 patches during the launch window. players were playing, buying, and we'd secured a future for the company and the game. but it came at a cost. that took things from me i can never get back
please be kind to arrowhead. they're in a hell you can't even begin to imagine. don't make it worse
