dreid @dreid
Ok, I am going to tell the story of one of my favorite bugs. It's gonna be kind of long, and it's about a decade old, so some of the details might be fuzzy.

If you figure it out or have heard this story before, please no spoilers.
There are a lot of things to take away from the story but one of the most important and most relevant is that high-fidelity event logs available in a structured and queryable format are vital for diagnosing seemingly undiagnosable bugs.
About 10 years ago MochiMedia rolled out a virtual currency called "MochiCoins". It was intended to be a multi-game virtual currency that could be bought for real money and used to buy in-game assets (including sometimes fungible in-game currency.)
The thing that is perhaps most interesting about this currency is that the whole system of user wallets, inventory, exchange and in-game integration was rolled out in around 3 months by a team of ~6 people.
MochiCoins themselves were intended to be non-fungible: you couldn't give them to other users, and you couldn't easily get USD back out of the system.

However some of the games that used this system were free-to-play MMOs which had in-game items and currency you could trade.
One particular MMO is at the center of this bug. It was a free-to-play MMO popular in several countries in Asia, but honestly I don't even remember the name and you've probably never heard of it anyway.

But one day they started to get reports of money being stolen.
At first we figured it was a problem with their system and that there wasn't much we could do about it. And then they started getting reports of users being signed in as other users.

This is when it became our problem, though not yet *my* problem.
Here is how MochiCoins worked: users would go to a website called MochiGames, sign up, sign in, and exchange money (via PayPal or credit card) for coins. MochiGames was also a Flash game portal meant to showcase games that used MochiCoins.
In-game there was a widget that would communicate with our backend and allow the user to buy items from the game, and there was an API that would allow the game to see the user's current inventory.
The widget also functioned as part of an SSO system (similar to Facebook Connect or Google auth). If the user didn't have a valid session they'd be shuffled off to MochiGames to sign in, then the widget would share some user information with the requesting game.
In most cases the game didn't really have a backend, but in the case of MMOs this would link your MMO account with your MochiGames account. So if you were signed into MochiGames as a particular user, you'd be signed in (more or less automatically) to the MMO as a corresponding user.
So back to those reports of being signed in as the wrong user.

User A would go to the MMO and find themselves signed in as a completely unrelated user B.
How does something like that happen?

1) It's all the game's fault. They're just doing something silly and it's not our problem.

This was a popular explanation for a few days, mostly because we didn't get any similar reports from any other games.
2) Session data was corrupted.

Our session database was an early version of Riak, so this seemed plausible for a bit; we'd had problems with other services that used Riak.
3) Firesheep? Local hackers stealing session cookies from plaintext HTTP?

At the time the whole site wasn't behind HTTPS because a lot of it was served by a CDN, and 10 years ago (before Fastly) SSL for CDNs was *expensive*.

This only explains the thefts though.
4) Our inventory service was somehow corrupting inventory?

Extremely plausible. The inventory service was built on mnesia, a notoriously non-partition-tolerant Erlang data store. This didn't look like a failure mode we had seen before with mnesia, though.
Those are just a few of the explanations floated after the first few reports. But the reports didn't stop; these weren't flukes or heisenbugs. As we started investigating in earnest (and as the game got more popular) the reports just became more frequent.
The first thing in my email inbox every morning was a new batch of affected accounts: both users who had items stolen and users who found themselves logged into the wrong account.
We considered for a while that these might be *unrelated* bugs, but that didn't really pass Occam's Razor and we couldn't find any evidence that particular users were being targeted. Some users were just finding themselves signed in as other users and quickly robbing them.
Mochi had a logging problem at this time: we always erred on the side of logging too much information, but we also didn't invest heavily in making those logs explorable or easily accessible.
At around the time this problem started happening we had just started to repurpose some older machines into a Hadoop cluster, mostly in search of a satisfactory analytics solution. For reasons I don't really remember we were fairly allergic to 3rd-party analytics systems.
And wow, am I glad. If we had used GA extensively we might not have bothered keeping around raw access logs. But we didn't use GA, so we had raw access logs, and we had them in HDFS, so we were able to load them into Hive and run some queries.
Once we had a good 15 reports or so I started taking a deep look into the logs to see if I could identify the moment when a user who had been logged in as A was suddenly logged in as B. Unfortunately, we usually only had the user ID for B and not the user ID for A.
I didn't really find anything, there wasn't a clear point where a user had been signed in as A and then suddenly transitioned to B. It just didn't seem to happen.

This was frustrating because it clearly did happen.
Here is what our logs looked like.
- browser ID: a unique ID set as a cookie the first time a browser hit our website
- IP address of the client
- session ID
- user ID
- path
- user agent
- timestamp
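
As a record type, something like this (just a sketch; the real thing was lines in HDFS, not Python objects):

```python
from dataclasses import dataclass

@dataclass
class AccessLogRow:
    browser_id: str   # cookie set on the browser's first visit
    ip: str           # client IP address
    session_id: str
    user_id: str
    path: str
    user_agent: str
    timestamp: float  # epoch seconds
```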
There was some other information, but nothing totally relevant. It did become clear that we were missing information: we didn't have an easy way to correlate these logs with access to our other services; these were just the logs for the website.
The requests were all spread out in time, and there didn't appear to be a noticeable increase in the number of requests the users were making to the website.

When I pulled the logs for just a single affected user, however, I started to notice something strange.
User B would make a request from an IP of XX.XX.XX.XX and then the seemingly next request would be from the IP of YY.YY.YY.YY.

Was this it? Was this the point of transition? The user IDs were the same, the session IDs were the same, the IP addresses and paths were different.
But the browser ID was *also* the same. So it was seemingly the same browser on the same computer. Maybe the computer moved?

The requests were only seconds apart though.
And a subsequent request might be from XX.XX.XX.XX again, also seconds apart.
What was I really looking for? At the transition point I would expect to see a case where:

Browser ID was the same, IP address was the same, and the User ID/Session ID changed.
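
The actual hunt was a pile of Hive queries, but the shape of the check looks like this as a rough Python sketch (the field names are illustrative, not our real schema):

```python
from itertools import groupby

def find_transitions(rows):
    """Given log rows as dicts with browser_id, timestamp, ip, and
    user_id keys, yield consecutive request pairs from the same
    browser where the user ID changed."""
    rows = sorted(rows, key=lambda r: (r["browser_id"], r["timestamp"]))
    for _, group in groupby(rows, key=lambda r: r["browser_id"]):
        group = list(group)
        for prev, curr in zip(group, group[1:]):
            if prev["user_id"] != curr["user_id"]:
                # The suspicious case: same browser, new user. Whether
                # the IP also stayed the same tells you how suspicious.
                same_ip = prev["ip"] == curr["ip"]
                yield prev, curr, same_ip
```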
I found a few cases of this, but they were spread over hours or days, and weren't conclusive of anything; they could all have been explained by shared computers/connections. Maybe the game was popular with groups of friends who all played in the same internet cafes?
I didn't have any way to interrogate the users who saw this behavior; they might not have even known what Mochi was, it was just some thing that a game they played relied on.

All I had were my logs.
I decided the IP addresses changing was weird enough that I wanted to understand it, so I checked to see what other requests were made from these IP addresses.
And it turns out a lot of requests were made from those two IP addresses, requests all over the site, from all sorts of users, for all sorts of games. But these same IP addresses were also used by other affected users.
This is where the pace picks up.

One report of a problem like this is a "huh?", two is a "weird coincidence", 10 or more is in clear "wtf" territory.

But now at least some of the affected users had something in common.
I started to look at the IP addresses more closely: I asked for all the users who made requests from these IP addresses.

Then I asked for all the IP addresses those users had made requests from. And then I asked for all the users who made requests from those IPs…
It's time for some Graph Theory.
What I have now is a cluster of users connected by shared IP addresses, and many of the affected users are in the cluster.

But now we have something we can try to predict. I'd expect the next report to include a user who is in this cluster.
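
The expansion itself is just a breadth-first walk over a bipartite graph of users and IPs. Roughly this, in Python rather than the Hive queries I was actually running (names are made up):

```python
from collections import defaultdict, deque

def expand_cluster(seed_ips, requests):
    """Alternately expand IPs -> users who used them -> IPs those users
    came from -> ... until the cluster stops growing. `requests` is an
    iterable of (user_id, ip) pairs pulled from the access logs."""
    users_by_ip = defaultdict(set)
    ips_by_user = defaultdict(set)
    for user_id, ip in requests:
        users_by_ip[ip].add(user_id)
        ips_by_user[user_id].add(ip)

    users, ips = set(), set()
    frontier = deque(("ip", ip) for ip in seed_ips)
    while frontier:
        kind, node = frontier.popleft()
        if kind == "ip" and node not in ips:
            ips.add(node)
            frontier.extend(("user", u) for u in users_by_ip[node])
        elif kind == "user" and node not in users:
            users.add(node)
            frontier.extend(("ip", i) for i in ips_by_user[node])
    return users, ips
```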
I still don't understand the problem, but I keep digging while I wait for more reports. If the data is *predictive* in some way we can start to gauge the real impact not just the reported impact.
I start to investigate the actual IP addresses. I built out Hive tables with GeoIP and reverse DNS data; the way we did this was by taking every IP address we'd ever seen and running it through a Python script that does a GeoIP lookup and a reverse DNS lookup and builds a new Hive table.
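
The per-IP lookup was something like this. (We used whatever GeoIP library existed a decade ago; `geoip2` below is a modern stand-in, not what we actually ran.)

```python
import socket
import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader("GeoLite2-City.mmdb")

def enrich(ip):
    """One row of the new Hive table: IP, country, reverse DNS name."""
    try:
        country = reader.city(ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        country = None
    try:
        # One reverse DNS query for every IP we had ever seen...
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        hostname = None
    return ip, country, hostname
```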
This ended up causing a bit of an outage: the machines running our DNS servers got real sad, and DNS requests in completely unrelated services started failing.
It takes nearly a full day just to make this table. While it's happening I do some manual exploration.

Did you know that if you run `whois` with an IP address instead of a domain name it'll ask ARIN for info about the owner of that IP address?
It turns out XX.XX.XX.XX is owned by a telecom in Singapore.

And YY.YY.YY.YY? It's owned by a *different* telecom in Singapore.
The GeoIP data is loaded, I check all the affected users, they're *all* in Singapore.

Every. Single. One.
When we go back to looking at the browser ID cookies we'll find some requests from the user-A user IDs. We'll start to see some established identities take shape: weeks of requests from user A, all over a handful of different IPs, and then one day that browser ID is user B.
And every single User A and User B is in Singapore.
So let's take a closer look at user B. One thing we start to find is places where the browser ID is the same, the IP address changes (but in an expected way), the user ID is the same, but the session changes.
And this is where we see the transition. When the session changes we start to see requests with the same session ID coming from different browser IDs.

Browser ID 1 (user B) makes a request to /pathA; Browser ID 2 (user A) makes a request to /pathB; user B gets a new session.
Browser ID 2 is now *also* user B. And these requests happen *seconds apart*.
I've been assured by the team working on the website that the data is impossible.
It starts to look like a cache is interfering, but we send a proper Vary header, and even if we didn't, we only do login over HTTPS.
Do we ever set a new session cookie over HTTP? "No", I am assured.

It is an unfortunate reality of computers that most of the time the people building systems do not understand them.
This is a fundamental truth about computing, and so too it is a fundamental truth about computing that eventually you'll have to tcpdump in production.

And so I did.
I took a backend webserver and ran tcpdump on it. I got tens of gigabytes of traffic, and I found a handful of `Set-Cookie` headers being sent over HTTP.
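
The real hunt was raw tcpdump and grepping through the capture, but the filter amounts to something like this scapy sketch:

```python
from scapy.all import TCP, Raw, sniff

def flag_insecure_cookie(pkt):
    # Responses leaving port 80 are plaintext HTTP; any Set-Cookie in
    # one of those is a cookie an intermediate cache could capture.
    if pkt.haslayer(TCP) and pkt[TCP].sport == 80 and pkt.haslayer(Raw):
        if b"Set-Cookie:" in bytes(pkt[Raw].load):
            print(pkt.summary())

sniff(filter="tcp src port 80", prn=flag_insecure_cookie, store=False)
```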
It turns out a product decision had been made to have functionally infinite sessions for users who kept using the website. So as long as you used the website frequently you wouldn't have to log in very much.

This is a common and reasonable design decision.
We implemented it as:

If a user visits the site and their session is about to expire, we give them a new one.
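
In sketch form (illustrative names and dicts standing in for the framework's session and response objects, not our actual code):

```python
import time
import uuid

RENEWAL_WINDOW = 24 * 60 * 60  # renew when < a day of session life remains

def maybe_extend_session(session, response_headers, now=None):
    """Runs on every request: if the session is close to expiry,
    rotate it and hand the browser a fresh cookie."""
    now = now if now is not None else time.time()
    if session["expires_at"] - now < RENEWAL_WINDOW:
        session["id"] = uuid.uuid4().hex  # fresh session ID, same user
        session["expires_at"] = now + 30 * 24 * 60 * 60
        # The bug in waiting: on a plain-HTTP request this Set-Cookie
        # rides on a response an intermediate cache may store, and a
        # bad shared cache can replay it to a completely different user.
        response_headers["Set-Cookie"] = "sessionid=" + session["id"]
```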
Remember why most of the website wasn't served over HTTPS? Because CDNs were *expensive*.

Turns out WAN links are also expensive in a lot of the world, so sometimes a thing called WAN optimization happens.
A service provider might deduplicate outbound requests so that they only hit the remote service once, and cache the responses for a period of time so that they don't use too much WAN bandwidth.
Now, HTTP tries to anticipate this need, and gives you headers like Cache-Control and Vary to tell these intermediate caches what not to cache.
But there are a lot of bad caches in the world, and it looked like most of Singapore was behind one.
In the end we turned off session extension and stripped `Set-Cookie` over HTTP with a Django middleware.

We never had the problem again.
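
For the curious, the middleware amounted to something like this, in the old-style Django middleware of that era (reconstructed from memory, not the exact code we shipped):

```python
class StripInsecureSetCookieMiddleware(object):
    """Never let a Set-Cookie header leave over plaintext HTTP, so a
    misbehaving shared cache has nothing to replay to another user."""

    def process_response(self, request, response):
        if not request.is_secure() and response.has_header("Set-Cookie"):
            del response["Set-Cookie"]
        return response
```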
The big takeaways?

- Just use HTTPS for everything all the time.
- High-fidelity logs of user actions are great; if you can correlate them across services they're even better.
- Sometimes the problem is more than halfway around the world in a computer you didn't know existed.
- Browser + Web Server is a complex distributed system.