it was a FOTM viral breakout hit with the launch of the Epic Games Store
but after 12 months of code changes and infrastructure upgrades, we'd only made it to 1/6th of our capacity before we hit issues.
we'd load-tested every service up to 260k, and run bots to 100k. didn't matter. shit broke
we had a war room with permanently open calls with xbox, playstation, epic, and google
the process was straightforward:
1. identify the current bottleneck
2. deploy a fix
3. see if the graphs indicated things had gotten better
4. increase the max allowed ccu
5. repeat
i cannot think of a single piece of infrastructure that didn't have issues
and again, every single one of these pieces had performed admirably during load testing. we spent SO much money on cloud services running bots, scripts, and swarms
most of the issues have faded from memory, but one of the most interesting ones came from the fact that much of the load-testing used randomly generated players. what we found in reality was that real player IDs were not randomly distributed
so even though the system could theoretically handle a much larger load, the automatic sharding that distributed players across databases would leave one database bearing most of the load at any one time. and sharding is VERY difficult to undo/change without a migration
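to make the hotspot concrete, here's a minimal sketch (entirely hypothetical numbers and names, not our schema): if shard choice depends on the raw ID, and the IDs active on launch day are all clustered in one narrow range, one shard eats almost everything. hashing the ID first would have spread them out, but that's exactly the kind of change you can't make without a migration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8

def shard_by_id_range(player_id: int) -> int:
    # assumes player IDs are spread evenly across the keyspace (they weren't)
    return (player_id // 1_000_000) % NUM_SHARDS

def shard_by_hash(player_id: int) -> int:
    # hashing first spreads clustered IDs roughly evenly
    digest = hashlib.sha1(str(player_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# launch-day players: IDs allocated recently, so clustered in a narrow band
new_players = range(41_000_000, 41_050_000)

print(Counter(shard_by_id_range(p) for p in new_players))  # one shard takes ~all of it
print(Counter(shard_by_hash(p) for p in new_players))      # roughly even split
```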
we ended up in a situation where load on individual databases meant that 1 minute of live time was taking more than 1 minute to back up. which meant backups got further and further behind. eventually we chose to just turn backups off and rely on read replicas for redundancy
furthermore, while we'd accounted for player load, the combination of those growing transaction logs and the fact that many more players were playing and storing data meant that disk usage was climbing far faster than we'd planned for
so while we had live services falling over, we had a ticking time bomb
dying services meant fewer people could play, but if the database servers' drives filled up, the game would literally stop working for everyone. so we had to balance the sword of damocles with the fires of rome
we expanded the google mysql services as far as the product would let us
but the math on the rate of change of the storage meant we had ~5 days to solve it before the game was DEAD dead. we ended up migrating all our hosted google mysql to our own mysql instances running on regular compute, where we could add more storage
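the "~5 days" figure was nothing fancier than extrapolating the growth rate; something like this back-of-the-envelope (all numbers illustrative, not the real ones):

```python
# rough runway estimate: hours until the data disk fills and writes stop
disk_capacity_gb = 10_240   # illustrative 10 TB volume
used_gb = 8_700             # current usage
growth_gb_per_hour = 12.8   # observed growth: player data + transaction logs

hours_left = (disk_capacity_gb - used_gb) / growth_gb_per_hour
print(f"~{hours_left / 24:.1f} days until the game is DEAD dead")
```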
while all this was happening, we were also trying to fix matchmaking, which takes a player who's logging on, attempts to put them into a particular cluster, and gives them a game server to connect to. except as the clusters began to fall over, the pings produced bad results
initially we'd planned for one cluster per region, but we had more players in more areas than we'd planned for, and the matchmaking had no visibility into the load of the servers. and kubernetes (the server platform) couldn't give it the info it needed to choose
at one point i was compiling kubernetes from source so i could add a non-serving node to google's kube infrastructure and modify it to expose a new API endpoint with the data we needed to tell how loaded a cluster was, but eventually i found another way
so instead i rewrote the matchmaking algorithm to separate the concept of clusters from regions and estimate load from pod response times, and we spun up multiple clusters in the same region
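the shape of the rewrite, as a hypothetical sketch rather than the real code: a region owns many clusters, each cluster's "load" is just a smoothed average of recent pod response times, and the matchmaker picks the least-loaded cluster in the player's region.

```python
from collections import defaultdict

ALPHA = 0.2  # smoothing factor for the response-time average

class Cluster:
    def __init__(self, name, region):
        self.name = name
        self.region = region
        self.ewma_ms = 0.0  # estimated load, proxied by pod response times

    def record_response(self, latency_ms):
        self.ewma_ms = ALPHA * latency_ms + (1 - ALPHA) * self.ewma_ms

clusters_by_region = defaultdict(list)

def register(cluster):
    # regions and clusters are decoupled: a hot region can have many clusters
    clusters_by_region[cluster.region].append(cluster)

def pick_cluster(region):
    # send the player to whichever cluster in their region looks least loaded
    return min(clusters_by_region[region], key=lambda c: c.ewma_ms)
```

spinning up a second cluster in a hot region is then just another register() call; as soon as its response times look better, players start flowing to it.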
while this was happening, telemetry went down. telemetry is how the fuck you tell what is happening. with no visibility into player data we may as well not have a game
i'd previously written a shim to sit in front of telemetry to selectively route between multiple systems
but the sheer load was too much. i basically commented out the entire front end and just had it punt everything to the ETL (data pipeline), as well as reporting success before the data was actually processed, so as to close the connections faster
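the gutted shim amounted to something like this (hypothetical endpoint and queue, not the real front end): acknowledge immediately, close the connection, and hand the event to the ETL intake out-of-band.

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

etl_queue = queue.Queue()  # stands in for the real data-pipeline intake

def etl_worker():
    while True:
        event = etl_queue.get()
        # forward the event to the ETL here (hypothetical call)
        etl_queue.task_done()

class TelemetryShim(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # report success *before* anything is actually processed,
        # so clients close their connections as fast as possible
        self.send_response(200)
        self.end_headers()
        try:
            etl_queue.put_nowait(json.loads(body))
        except json.JSONDecodeError:
            pass  # under that load, dropping a bad event beats backing up

threading.Thread(target=etl_worker, daemon=True).start()
# HTTPServer(("0.0.0.0", 8080), TelemetryShim).serve_forever()
```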
other issues, while this all occurred:
1. at one stage the game went down because an engineer who'd worked 2 weeks straight put a curly brace in the wrong place which caused the login/auth system to not have all the data it needed. took 20 people 4 hours to figure out why
2. the redis caches weren't keeping up, but i'd planned for this and was able to manually spin up additional instances and adjust the sharding, while making sure that a player's first call to a new instance would look up which instance they would previously have mapped to and pull any existing data from there (sketched below)
3. turns out players spam hammer escape in login queue. someone had edited the player controller to catch that input during queue. that input triggered the main menu, which sent a request to a service for menu data. but that was only cached when the menu displayed...
and since the main player menu couldn't display during queue, the data was never saved. it was a superquery, which meant the microservice it hit contacted a bunch of other microservices behind the scenes. it was an EXPENSIVE call but it was only meant to happen once
that would have been fine in-game, but every player smashing that key was generating orders of magnitude more traffic, and no-one ever noticed this beforehand because how could they? it was an impossible sequence that could never have been detected beforehand
it would have required a client patch to fix properly, so instead we deployed a fix into the service that rate-limited the player's ability to hit that endpoint (also sketched below). it was janky and it meant their first menu hit in game wouldn't be loaded when they tried, but it was what we had
4. one particular service was seeing increased response times and requests were beginning to time out as it hit its limit. we added more servers but it did nothing. they sat around idle. when we dug into it, it turned out that a hidden 'affinity' setting wasn't distributing traffic evenly
so one hardware node in one cluster was using a single NIC to serve all that traffic, and it was flatlined. with help from google we were able to reset the affinity and distribute the service across the nodes in the cluster, and everything immediately resolved
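for point 2, the "check the old instance" trick looked roughly like this; purely a sketch, assuming plain modulo sharding over a list of redis instances (hostnames and client setup are hypothetical):

```python
import redis  # standard redis-py client

# old ring had 4 instances; 4 more were added under load (hostnames hypothetical)
OLD_NODES = [redis.Redis(host=f"cache-{i}") for i in range(4)]
ALL_NODES = OLD_NODES + [redis.Redis(host=f"cache-{i}") for i in range(4, 8)]

def old_node(player_id: int) -> redis.Redis:
    return OLD_NODES[player_id % len(OLD_NODES)]

def new_node(player_id: int) -> redis.Redis:
    return ALL_NODES[player_id % len(ALL_NODES)]

def get_player_blob(player_id: int):
    key = f"player:{player_id}"
    node = new_node(player_id)
    value = node.get(key)
    if value is None:
        # first call after resharding: check where the key *used* to live
        value = old_node(player_id).get(key)
        if value is not None:
            node.set(key, value)  # migrate it lazily to its new home
    return value
```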
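and for point 3, the server-side band-aid was essentially a per-player rate limit on that one superquery. a minimal sketch, assuming a single in-memory limiter (a real deployment would need it shared across service instances):

```python
import time

MIN_INTERVAL_S = 30.0  # at most one superquery per player per 30s (illustrative)
_last_call = {}

def allow_menu_superquery(player_id: str) -> bool:
    now = time.monotonic()
    last = _last_call.get(player_id)
    if last is not None and now - last < MIN_INTERVAL_S:
        return False  # spam from mashing escape in the queue gets dropped here
    _last_call[player_id] = now
    return True

def handle_menu_request(player_id: str):
    if not allow_menu_superquery(player_id):
        return {"status": 429}  # client shows stale/empty menu data instead
    # ...fan out to the downstream microservices and build the real response...
    return {"status": 200}
```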
there were many more - three full weeks of it. fifteen hours a day, seven days a week. i worked all day, went home, slept, and went straight back to the office. when i ran out of clothes i started stealing swag from marketing because i didn't have time to do laundry
a friend ended up flying out internationally to look after me during that period because there was not enough time to operate my life. i wasn't eating properly, didn't have time to walk the dogs or clean. lived on delivery, and the contents of my fridge went rotten
i also had an unrelated medical issue at the time due to a medication change, so while all that was happening i was dealing with vomiting, nausea, brain zaps, diarrhea, shakes, muscle cramps, headaches. i took codeine for the pain, valium for the cramps, booze to sleep, red bull to wake up
i've worked multi-week live esports events, done multi-day network outages, even 911 emergency services, and nothing prepared me for the sheer destructive toll that launch took on my mind and body. by the end, i was a shell of a human.
the friend described me as "not a person"
after 21 days, we were stable at 300k ccu. we deployed 1700 patches during the launch window. players were playing, buying, and we'd secured a future for the company and the game. but it came at a cost. that took things from me i can never get back
please be kind to arrowhead. they're in a hell you can't even begin to imagine. don't make it worse
• • •
dan, like most AI people, either failed to understand the work, or is lying to make this look impressive
stable diffusion did not read people's minds. he's very conveniently left out how this actually works, because it would show that SD didn't do shit
fMRI output is indeed reading your brain, but dan's skipped a very important part
you can imagine trying to construct an image from a brain scan, probably, right?
but what's this "semantic decoder" bit? whoops, it actually powers the whole thing
semantic decoding has been around since 2016, and it works by taking fMRI scans while people are shown things, then labelling the recorded output with a description of what they were shown
record someone's response to a thing, then detect that response again later
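in code terms the technique is closer to this toy sketch than to mind-reading (made-up dimensions; real pipelines use carefully trained regressions over text/image embeddings): you need scans recorded WHILE showing someone labelled stimuli, you fit a mapping from voxels into a semantic space, and at test time you can only match a new scan against descriptions you already have.

```python
import numpy as np
from sklearn.linear_model import Ridge

# paired training data you can only get by scanning someone while showing them stimuli:
#   X_train: fMRI voxel features per stimulus, Y_train: embedding of each stimulus description
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5000))  # 200 stimuli x 5000 voxels (toy numbers)
Y_train = rng.normal(size=(200, 512))   # 512-d semantic embedding per description

decoder = Ridge(alpha=10.0).fit(X_train, Y_train)

def decode(scan, candidate_embeddings, candidate_captions):
    # map the scan into semantic space, then pick the closest *known* description --
    # the "decoded thought" is always one of the responses recorded/modelled beforehand
    predicted = decoder.predict(scan.reshape(1, -1)).ravel()
    sims = candidate_embeddings @ predicted
    return candidate_captions[int(np.argmax(sims))]
```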
for reasons unbeknownst to me i've started photoshopping various crash screens as things to drop on twitter when someone has said some dumbass shit and you're gonna dip
then i just started making custom ones for anyone who responded, about something they work on. cc @heytred
for non-engineers: datetimes can be represented by the number of seconds since the unix epoch, january 1 1970. if you feed in int(0), it will return a datetime of january 1 1970. try it for yourself
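trying it in python looks like this (any language with an epoch-based time API behaves the same way):

```python
from datetime import datetime, timezone

# zero seconds since the unix epoch is the epoch itself
print(datetime.fromtimestamp(0, tz=timezone.utc))  # 1970-01-01 00:00:00+00:00
```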
for whatever reason, when twitter attempts to load the profile creation date timestamp, it is not able to, and is getting all zeros instead, resulting in a date of january 1 1970
a context-collapse attack is one in which an attacker misrepresents a post containing keywords as bigotry, and uses the resulting fallout to drive traffic out of context. successful context-collapse attackers usually hide behind identity or account size to avoid criticism
context-collapsers rely on the fact that people don't dig for original sources: they get angry, take away that the target is bad, and go about their days. the defensive side of the context-collapse attack is that the victim is prevented from talking about it without 'attacking a [blank]'
common identities for this include neurodivergence, queerness, or disability, as "[victim] is attacking a [marginalized person]" reinforces the original attack that labelled the victim as Bad. and if the victim ever talks about it, they are "punching down"
i live in a majority chinese suburb. if i ask my brain for a character in vancouver they're likely going to be CN because that's my normal
but i wouldn't attempt to write a chinese character whose story is about the experience of being chinese in canada. because i don't know!
what little i do know comes from observing others or having people tell me. i could research, or get cultural consultants in. i could interview people. but why do it at all? why is someone else's story that i don't understand important for me to tell?
@ajseps @ChadJessup @soupychloe @guldeuxchats @SmolBoricua one thing that i've observed over the years is that often, to combat these kinds of problems, analysts are removed from their embedded teams and placed on separate teams behind a jira queue, to make sure their time is respected and their work is tracked. seems fine, but the distance isn't free