If you figure it out or have heard this story before, please no spoilers.
However, some of the games that used this system were free-to-play MMOs, which had in-game items and currency you could trade.
But one day they started to get reports of money being stolen.
This is when it became our problem, though not yet *my* problem.
User A would go to the MMO and find themselves signed in as a completely unrelated user B.
1) It's all the game's fault. They're just doing something silly and it's not our problem.
This was a popular explanation for a few days, mostly because we didn't get any similar reports from any other games.
Our session database was an early version of Riak, so this seemed plausible for a bit, and we had problems with other services that used Riak.
At the time the whole site wasn't behind HTTPS because a lot of it was served by a CDN, and 10 years ago (before Fastly) SSL for CDNs was *expensive*.
This only explains the thefts though.
Extremely plausible. The inventory service was built on Mnesia, a notoriously non-partition-tolerant Erlang data store. This didn't look like a failure mode we had seen before with Mnesia, though.
This was frustrating because it clearly did happen.
- browser ID: a unique ID set as a cookie the first time the browser visited our website
- IP address of the client
- session ID
- user ID
- path
- user agent
- timestamp
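To make the shape of those log lines concrete, here is a minimal sketch of one entry and one of the simplest questions you can ask of it. The field names, the `LogEntry` type, and the `browsers_with_multiple_users` helper are all made up for illustration, not what our logging code actually looked like. A browser ID showing up with more than one user ID is not proof of a bug on its own (people share computers), but it is a cheap first filter.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    # Hypothetical field names mirroring the list above.
    browser_id: str
    ip: str
    session_id: str
    user_id: str
    path: str
    user_agent: str
    timestamp: float  # assumed: seconds since the Unix epoch


def browsers_with_multiple_users(entries):
    """Map each browser ID seen with more than one user ID to those user IDs."""
    users_by_browser = defaultdict(set)
    for entry in entries:
        users_by_browser[entry.browser_id].add(entry.user_id)
    return {b: users for b, users in users_by_browser.items() if len(users) > 1}
```

Because every entry carries the same seven fields, the same grouping trick works for any pair of them, such as sessions per IP or IPs per user.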
When I just pulled the logs for a single affected user, however, I started to notice something strange.
Was this it? Was this the point of transition? The user IDs were the same, the session IDs were the same, the IP addresses and paths were different.
The requests were only seconds apart though.
Browser ID was the same, IP address was the same, but the user ID and session ID changed.
All I had were my logs.
One report of a problem like this is a "huh?", two is a "weird coincidence", 10 or more is clearly in "wtf" territory.
But now at least some of the affected users had something in common.
Then I asked for all the IP addresses those users had made requests from. And then I asked for all the users who made requests from those IPs…
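That back-and-forth between users and IPs can be sketched as a small graph walk. This is a rough reconstruction, not the query I actually ran: `expand_cluster` is a hypothetical helper, the input is assumed to be plain `(user_id, ip)` pairs pulled from the logs, and repeating the expansion until nothing new turns up is my assumption about where the "…" ends.

```python
from collections import defaultdict


def expand_cluster(requests, seed_users):
    """Grow a set of users and IPs by alternating user -> IP -> user lookups.

    requests: iterable of (user_id, ip) pairs (assumed shape of the log data).
    seed_users: the users from the initial reports.
    Returns (users, ips) once no new users or IPs are found.
    """
    ips_by_user = defaultdict(set)
    users_by_ip = defaultdict(set)
    for user_id, ip in requests:
        ips_by_user[user_id].add(ip)
        users_by_ip[ip].add(user_id)

    users = set(seed_users)
    ips = set()
    frontier = set(seed_users)
    while frontier:
        # All IPs the newly found users made requests from.
        new_ips = set()
        for user_id in frontier:
            new_ips |= ips_by_user.get(user_id, set())
        new_ips -= ips
        ips |= new_ips

        # All users who made requests from those newly found IPs.
        new_users = set()
        for ip in new_ips:
            new_users |= users_by_ip.get(ip, set())
        new_users -= users
        users |= new_users
        frontier = new_users
    return users, ips
```

On real traffic this can blow up quickly, since one busy shared IP pulls in a lot of unrelated users, so it is worth looking at the size of the result after each round rather than trusting the final set blindly.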
But now we have something we can try to predict. I'd expect the next report to include a user who is in this cluster.
Did you know that if you run `whois` with an IP address instead of a domain name it'll ask ARIN for info about the owner of that IP address?
And YY.YY.YY.YY? It's owned by a *different* telecom in Singapore.
Every. Single. One.
Browser ID 1, User B makes a request to /pathA. Browser ID 2, User A makes a request to /pathB. User B gets a new session.
It is an unfortunate reality about computers that most of the time the people building systems do not understand them.
And so I did.
This is a common and reasonable design decision.
If a user visits the site and their session is about to expire we give them a new one.
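For readers who haven't built one of these, a minimal sketch of that renewal logic follows. Everything in it is an assumption for illustration: the `Session` type, the `maybe_renew` helper, the five-minute window, the one-hour lifetime, and the `session_id` cookie name are not taken from our real service. The detail worth noticing is that the fresh session rides along as a `Set-Cookie` header on an otherwise ordinary response.

```python
import secrets
from dataclasses import dataclass

RENEW_WINDOW_SECONDS = 300  # assumed: renew when under 5 minutes remain
SESSION_LIFETIME_SECONDS = 3600  # assumed: sessions last an hour


@dataclass
class Session:
    session_id: str
    user_id: str
    expires_at: float  # seconds since the Unix epoch


def maybe_renew(session, now):
    """Return (session, set_cookie_header).

    If the session is not close to expiring, it is returned unchanged and the
    header is None. Otherwise a new session is issued for the same user and
    the header value that would carry it back to the browser is returned.
    """
    if session.expires_at - now > RENEW_WINDOW_SECONDS:
        return session, None
    fresh = Session(
        session_id=secrets.token_hex(16),
        user_id=session.user_id,
        expires_at=now + SESSION_LIFETIME_SECONDS,
    )
    # Hypothetical cookie name; the value is a credential for this one user.
    header = f"session_id={fresh.session_id}; Path=/; HttpOnly"
    return fresh, header
```

Nothing about the request path decides whether renewal happens, only the clock does, so any response to any URL can end up carrying a brand new credential.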
Turns out WAN links are also expensive in a lot of the world, so sometimes a thing called WAN optimization happens.
We never had the problem again.
- Just use HTTPS for everything all the time.
- High-fidelity logs of user actions are great; if you can correlate them across services, they're even better.
- Sometimes the problem is more than halfway around the world in a computer you didn't know existed.