pixprin

Mar 12 • 8 tweets • 2 min read

1/ In our defense, we didn't know it was going to be a Slack thread with 1,000 messages. Our renders were flickering, and not in a subtle way: objects would disappear, change textures, drop subsurface. It was not reproducible on any given frame.
Making of #TurningRed

2/ It was the dreaded sometimes-missing, sometimes-corrupt file. In a large distributed system, a file can be cached at a variety of locations. When you ask for a file, the system first checks the OS's local page cache. If it's not there, the request goes to backend storage.

3/ You can think of storage these days like a computer; it's not just a hard drive. So within that storage device, there are also caches. And if your storage device is hot, you might even throw another separate cache device in front of it.

4/ But wait. Each layer there has separate cache nodes. Yup, we might have multiple nodes serving this data. At both storage levels. (How many places might we find this file already?)
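
To make the layering in 2/ through 4/ concrete, here is a minimal sketch of a read-through lookup across cache tiers. It is a toy model, not Pixar's storage stack: `CacheLayer`, `read_through`, the layer names, and the asset path are all hypothetical. It shows why dropping only one layer is not enough when a layer behind it still holds an old copy.

```python
from typing import Callable


class CacheLayer:
    """One cache tier in a toy model, e.g. the OS page cache or a cache node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: dict[str, bytes] = {}


def read_through(path: str, layers: list[CacheLayer],
                 backend: Callable[[str], bytes]) -> tuple[bytes, str]:
    """Return (data, source_name), checking the nearest layer first."""
    for i, layer in enumerate(layers):
        if path in layer.entries:
            data = layer.entries[path]
            # A hit at a farther layer repopulates the nearer ones.
            for nearer in layers[:i]:
                nearer.entries[path] = data
            return data, layer.name
    # Miss everywhere: fetch from backend storage and fill every layer.
    data = backend(path)
    for layer in layers:
        layer.entries[path] = data
    return data, "backend"


if __name__ == "__main__":
    layers = [CacheLayer("page_cache"), CacheLayer("cache_device"),
              CacheLayer("storage_cache")]
    store = {"/assets/mei.usd": b"v1"}  # hypothetical asset path
    print(read_through("/assets/mei.usd", layers, store.__getitem__))
    store["/assets/mei.usd"] = b"v2"    # backend changes
    layers[0].entries.clear()           # drop only the page cache
    print(read_through("/assets/mei.usd", layers, store.__getitem__))
```

In this model the second read still returns the old bytes from `cache_device`, which is why the caches had to be dropped at every level.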

5/ Now, systems aside, at Pixar, we have a path resolver that identifies a set of paths where a file will be searched for. You might find the globally installed Panda Mei. Or you might find a different version of her, overridden at a sequence level, shot level, you get the picture.
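
A rough sketch of the override idea in 5/: search an ordered list of roots and take the first hit, so the most specific level wins. The thread does not describe the actual resolver, so the function `resolve`, the directory layout, and the file name below are assumptions for illustration only.

```python
from pathlib import PurePosixPath
from typing import Callable, Optional


def resolve(asset: str, search_roots: list[str],
            exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first root/asset candidate that exists, or None.

    search_roots is ordered from most specific (shot) to least (global).
    """
    for root in search_roots:
        candidate = str(PurePosixPath(root) / asset)
        if exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical layout: shot overrides sequence, sequence overrides global.
    roots = ["/show/seq010/shot020", "/show/seq010", "/show/global"]
    files = {"/show/global/mei.usd", "/show/seq010/mei.usd"}
    print(resolve("mei.usd", roots, files.__contains__))
```

Passing `exists` in as a callable keeps the sketch testable without touching a real filesystem; with real files, `os.path.exists` could be passed instead. Note that every candidate path is one more place a missing or stale cached file can change which version gets picked.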

6/ Production verified that no code, assets, or paths had changed at the time the behavior started. Systems ran commands to drop caches at all levels, right down to the individual compute nodes' page caches.

And yet it persisted.

7/ Code is instrumented, and Sadmins (sad storage admins) checksum paths. They all match, but the renders do not.
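
For a sense of what checksumming paths can look like, here is a minimal sketch using Python's standard `hashlib`. How the admins actually did it is not stated in the thread; the helper names `sha256_of` and `group_by_checksum` are made up for this example.

```python
import hashlib
from collections import defaultdict


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large assets are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(paths: list[str]) -> dict[str, list[str]]:
    """Group paths by content hash; more than one group means a mismatch."""
    groups: dict[str, list[str]] = defaultdict(list)
    for p in paths:
        groups[sha256_of(p)].append(p)
    return dict(groups)
```

If `group_by_checksum` returns a single group, every copy read from this host has identical bytes. That only covers what this host sees at the moment of the read, which fits the thread: the checksums matched and the renders still did not.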

It's 5pm on a Friday before a 3-day weekend of rendering. What we tried next had to be foolproof: we could not lose 3 days of rendering.

8/ So we did what any talented admin/programmer would do: a hard reboot.

It worked.


