1/ In our defense, we didn't know it was going to be a Slack thread with 1,000 messages. Our renders were flickering. And not in a subtle way... Objects would disappear, change textures, drop subsurface. It was not reproducible on any given frame.
Making of #TurningRed
2/ It was the dreaded sometimes-missing, sometimes-corrupt file. In a large distributed system, a file can be cached at a variety of locations. When you ask for a file, the system first checks the OS's local page cache. If it's not there, the request goes to backend storage.
3/ You can think of storage these days like a computer; it's not just a hard drive. So within that storage device, there are also caches. And if your storage device is hot, you might even throw another separate cache device in front of it.
4/ But wait. Each layer there has separate cache nodes. Yup, we might have multiple nodes serving this data. At both storage levels. (How many places might we find this file already?)
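To make the layering in 2/ through 4/ concrete, here is a minimal sketch of a tiered read in Python. Everything in it is hypothetical: the `CacheLayer` class, the layer names, the node counts, and the path `/assets/mei.usd` are illustrative assumptions, not Pixar's actual setup. It only shows how the same request can return different bytes depending on which node serves it, which is one way a problem could look non-reproducible. It is not a claim about what actually went wrong.

```python
from typing import Dict, List, Optional, Tuple


class CacheLayer:
    """One hypothetical cache layer made of independent nodes."""

    def __init__(self, name: str, nodes: int) -> None:
        self.name = name
        # Each node keeps its own path -> bytes mapping.
        self.nodes: List[Dict[str, bytes]] = [dict() for _ in range(nodes)]

    def get(self, path: str, node: int) -> Optional[bytes]:
        # Pick a node within this layer; real routing would be more involved.
        return self.nodes[node % len(self.nodes)].get(path)


def read(path: str, layers: List[CacheLayer],
         backend: Dict[str, bytes], node: int = 0) -> Tuple[bytes, str]:
    """Return (data, source). The first layer holding a copy wins."""
    for layer in layers:
        data = layer.get(path, node)
        if data is not None:
            return data, layer.name
    return backend[path], "backend"


def count_copies(path: str, layers: List[CacheLayer]) -> int:
    """Count how many cache nodes currently hold a copy of path."""
    return sum(path in n for layer in layers for n in layer.nodes)


# Assumed topology: one page cache, two front-cache nodes, two storage-cache nodes.
layers = [CacheLayer("page cache", 1),
          CacheLayer("front cache", 2),
          CacheLayer("storage cache", 2)]
backend = {"/assets/mei.usd": b"v2"}

# Plant a stale copy on a single front-cache node.
layers[1].nodes[1]["/assets/mei.usd"] = b"v1"

fresh = read("/assets/mei.usd", layers, backend, node=0)  # served by backend
stale = read("/assets/mei.usd", layers, backend, node=1)  # served by front cache
```

In this toy setup the two reads disagree even though the backend file never changed, and `count_copies` answers the "how many places" question for whatever topology you plug in.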
5/ Now, systems aside, at Pixar, we have a path resolver that identifies a set of paths where a file will be searched. You might find the globally installed Panda Mei. Or you might find a different version of her, overridden at a sequence level, shot level, you get the picture.
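A path resolver of the kind described in 5/ can be sketched in a few lines of Python. This is a guess at the general shape, not Pixar's resolver: the function name `resolve`, the directory layout, and the assumption that the most specific level (shot, then sequence, then global) wins are all hypothetical.

```python
import posixpath
from typing import Callable, Iterable, Optional


def resolve(asset: str, search_roots: Iterable[str],
            exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first root/asset path that exists, or None.

    search_roots is ordered from most specific to least specific,
    so an override at shot or sequence level shadows the global install.
    """
    for root in search_roots:
        candidate = posixpath.join(root, asset)
        if exists(candidate):
            return candidate
    return None


# Hypothetical layout: a global install plus a sequence-level override.
installed = {"/show/global/mei.usd", "/show/seq10/mei.usd"}
roots = ["/show/seq10/shot5", "/show/seq10", "/show/global"]

found = resolve("mei.usd", roots, installed.__contains__)
# found is "/show/seq10/mei.usd": no shot-level copy, so the sequence override wins.
```

Passing `exists` as a parameter keeps the sketch testable without touching a filesystem; on a real system it could be `os.path.exists`.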
6/ Production verified that no code, assets, or paths had changed at the time the behavior started. Systems ran commands to drop caches at all levels, right down to the individual compute nodes' page caches.
And yet it persisted.
7/ Code is instrumented, and Sadmins (sad storage admins) checksum paths. They all match, but the renders do not.
It's 5pm on a Friday before a 3-day weekend of rendering. What we tried next had to be foolproof: we could not lose 3 days of rendering.
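The checksum comparison in 7/ might look roughly like the Python below. The thread does not say which hash or tooling was used, so SHA-256 and the helper names `checksum` and `group_by_digest` are assumptions for illustration.

```python
import hashlib
from typing import Dict, Iterable, List


def checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group paths by digest. More than one group means the copies differ."""
    groups: Dict[str, List[str]] = {}
    for p in paths:
        groups.setdefault(checksum(p), []).append(p)
    return groups
```

If every view of a file lands in a single group, the bytes on disk agree, which is the situation described above: the checksums match but the renders still do not.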
8/ So we did what any talented admin/programmer would do: a hard reboot.
It worked.
@levork @RizzoBlake 1/ Our world works quite a bit differently than VFX in two major ways: we spend a lot of man hours on optimization, and our scheduler is all about sharing.
Due to optimization, by and large we make the film fit the box, aka each film at its peak gets the same number of cores.
@levork @RizzoBlake 2/ In terms of scheduling, some brilliant folks like Josh Grant and Eric Peden have devised a way to lend extra clock cycles to neighbor processes that are underutilizing their checked-out cores.
So say process A and process B each check out 8 cores and land on the same host...
@levork @RizzoBlake 3/ Process A during scene gen gets stuck using 2/8 cores.
Meanwhile process B is brute-force path tracing, using 8/8 cores.