Bartłomiej Cupiał
PhD Student @ University of Warsaw | @IDEAS_NCBR https://t.co/DrOexJe5Tf

May 24, 2024, 14 tweets

So here's a story of, by far, the weirdest bug I've encountered in my CS career.

Along with @maciejwolczyk, I've been training a neural network that learns how to play NetHack, an old roguelike game that looks like the screenshot. Recently, something unexpected happened.

We use a model by @JensTuyls that clones expert behavior on NetHack, and we improve it using RL methods. That model gets 5000 points and we finetune it in the game so that the score improves. However, suddenly in a recent run, Jens' model only got 3000 points. Quite a drop.

This problem is consistent between seeds so it's not just a fluke. Well, we probably screwed up something in the code for loading the model in the recent commit. Let's revert, no biggie. Except that after reverting to a version of the code from a few days back, we still get 3000.

Revert code a few weeks back? Still 3000 points. Luckily, the server we run our experiments on saves the files from the previous runs. We find the files corresponding to a run that previously got 5000 points, we re-run, and, well, it gets 3000. Nothing about the code changed.

We start suspecting our software stack. Thankfully, we use Singularity which means that our whole environment is in a single, self-contained file. That file hasn't changed for a few months, so that shouldn't be the problem. However, the container loads one thing from the server.

Namely, the CUDA libraries that allow us to compute things quickly on GPU. So we suspect that maybe something about these libraries changed that degraded the model. Because what else could have? And yes, recently the version was changed from 11.8 to 12.4.

The CUDA mismatch probably shouldn't impact the results in this particular way, but we see no other explanation. We override the version to 11.8 - we still get 3000 points. We build a new environment from scratch, for CUDA 12.4 - 3000 points. Welp.

We repeat the evaluation on a personal laptop. This is slow and expensive without the specialized hardware, but we make it work. Again, 3000 points. We disable multithreading, GPU, and some other things that have at least a conceivable chance of causing the problem - 3000 points.

By this point we've spent several hours on this, and it's 7 PM. I am starting to feel like a madman. I can't even watch a TV show without constantly thinking about the bug. Before going to sleep I decide to ask @JensTuyls, the author of the model, if he knows what might be broken.

The next morning I see a lot of messages on Slack. Jens replied "Oh yes, it's probably a full moon today."

What.

I check a moon phase calendar, and yes, it's a full moon today. Hands shaking, I start a new NetHack game, and the message says "You are lucky! Full moon tonight."

What.

So apparently NetHack has a mechanic that slightly changes how the game plays every time it's a full moon according to your system clock: the player character is luckier, werewolves appear in their animal form, and the dogs howl ominously. nethackwiki.com/wiki/Time

It doesn't make the game harder, but the model hasn't seen full moon data in its training set, so the score drops. In this particular case, it drops from 5k points to 3k points. We override the time so it's not a full moon, we evaluate the model - and it's 5k points again.
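For the curious, here is a rough Python sketch of how such a check can be computed from the date alone. It is ported from memory from the `phase_of_the_moon` routine in NetHack's source, so treat the exact constants as an assumption and compare against the real code before relying on it. The helper `is_full_moon` is a hypothetical name added for illustration, not something from NetHack or from our codebase.

```python
from datetime import date


def phase_of_the_moon(day: date) -> int:
    """Return 0..7, where 0 is a new moon and 4 is a full moon.

    Assumption: mirrors NetHack's integer approximation, not real astronomy.
    """
    day_in_year = day.timetuple().tm_yday - 1  # C's tm_yday is 0-based
    golden = (day.year - 1900) % 19 + 1        # C's tm_year counts from 1900
    epact = (11 * golden + 18) % 30
    if (epact == 25 and golden > 11) or epact == 24:
        epact += 1
    return ((((day_in_year + epact) * 6) + 11) % 177 // 22) & 7


def is_full_moon(day: date) -> bool:
    """Hypothetical guard an evaluation script could call before a run."""
    return phase_of_the_moon(day) == 4


if __name__ == "__main__":
    today = date.today()
    if is_full_moon(today):
        print(f"{today}: full moon, expect out-of-distribution game behavior")
    else:
        print(f"{today}: phase {phase_of_the_moon(today)}, no full moon effects")
```

Under this approximation, May 22 to May 24, 2024 all count as a full moon, and May 25 does not. A check like this at the start of an evaluation run would at least flag the affected days.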

The moral is, if you encounter an unexpected bug, be sure to consult the lunar calendar. Big thanks to @JensTuyls for solving this for us!
