@levork @RizzoBlake 1/ Our world works quite a bit differently than VFX in two major ways: we spend a lot of man-hours on optimization, and our scheduler is all about sharing.
Due to optimization, by and large we make the film fit the box, aka each film at its peak gets the same number of cores.
@levork @RizzoBlake 2/ in terms of scheduling, some brilliant folks like Josh Grant and Eric Peden have devised a way to lend extra clock cycles from processes that are underutilizing their checked-out cores to neighbor processes on the same host.
So say process A and process B each check out 8 cores and land on the same host...
@levork @RizzoBlake 3/ process A during scene gen gets stuck using 2/8 cores.
Meanwhile process B is brute-force path tracing, which saturates its own 8/8 cores.
During that time, process B picks up A's 6 idle cores and uses 14/8 cores!
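To make the arithmetic in 3/ concrete, here is a minimal sketch of the lending idea. It is an illustration only, not the actual scheduler: the function name `lend_idle_cores`, the whole-core granularity, and the assumption that process B could use up to 16 cores are all hypothetical (the real system lends clock cycles).

```python
def lend_idle_cores(checked_out, demand):
    """Return the cores each process ends up using on one host.

    checked_out: cores each process reserved (hypothetical input).
    demand: cores each process could keep busy right now (hypothetical input).
    """
    # Every process first uses its own reservation, up to what it can keep busy.
    used = {p: min(checked_out[p], demand[p]) for p in checked_out}
    # Whatever is reserved but not used is available to lend.
    idle = sum(checked_out[p] - used[p] for p in checked_out)

    # Hand idle cores to processes that want more than they checked out.
    for p in sorted(checked_out):
        extra = min(idle, demand[p] - used[p])
        if extra > 0:
            used[p] += extra
            idle -= extra
    return used


checked_out = {"A": 8, "B": 8}
demand = {"A": 2, "B": 16}  # assumption: the path tracer scales to the whole host
used = lend_idle_cores(checked_out, demand)
utilization = sum(used.values()) / sum(checked_out.values())
print(used, utilization)  # {'A': 2, 'B': 14} 1.0
```

Without lending, this host would sit at 10/16 checked-out cores busy; with it, all 16 are in use while A is stuck in scene gen.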
@levork @RizzoBlake 4/ this means when you look at the average utilization of checked-out cores across the farm, most shows hit 80-90%. So we get a little more bang for our buck!
@levork @RizzoBlake 5/ We buy more cores when the needs of all of the upcoming projects show we need more on a consistent basis (e.g., for many months).
We also lease for short-term bursts of need, usually for 4-8 weeks at a time.
@levork @RizzoBlake 6/ while each gen of new hardware is clearly faster than its predecessor, the fixed number of cores allocated to a film hasn’t changed in some time.
A film can also choose to spend its production budget on a lease if the story and/or schedule warrants the expense.
@levork @RizzoBlake 7/ last but not least, we are truly spoiled by our systems department. We have some killer storage hardware admins who make it possible to have many high-I/O projects IP at the same time!
@levork @RizzoBlake 8/ not to mention the insanity of how all of systems had Soul up and running, working from home, almost overnight! And software engineers who put a lot of rapid feature dev into remote reviews.
• • •
1/ In our defense, we didn't know it was going to be a Slack thread with 1,000 messages. Our renders were flickering. And not in a subtle way... Objects would disappear, change textures, drop subsurface. It was not reproducible on any given frame.
Making of #TurningRed
2/ It was the dreaded sometimes-missing, sometimes-corrupt file. In a large distributed system, a file can be cached at a variety of locations. When you ask for a file, the request first checks the OS's local page cache. If it's not there, it goes to backend storage.
3/ You can think of storage these days like a computer; it's not just a hard drive. So within that storage device, there are also caches. And if your storage device is hot, you might even throw another separate cache device in front of it.
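The lookup order described in 2/ and 3/ can be sketched as a chain of tiers, nearest first. This is a toy model under stated assumptions, not how any particular storage product works: the `Tier` class, the `read` function, the tier names, and the read-through copy step are all hypothetical.

```python
class Tier:
    """One place a copy of a file can live (hypothetical model)."""

    def __init__(self, name, files=None):
        self.name = name
        self.files = dict(files or {})


def read(path, tiers):
    """Return (contents, tier_name) from the nearest tier holding `path`.

    `tiers` is ordered nearest-first; the last entry is the backend disks.
    """
    for i, tier in enumerate(tiers):
        if path in tier.files:
            contents = tier.files[path]
            # Read-through: copy into every nearer tier so the next read is local.
            for nearer in tiers[:i]:
                nearer.files[path] = contents
            return contents, tier.name
    raise FileNotFoundError(path)


tiers = [
    Tier("os_page_cache"),           # on the render host
    Tier("front_cache_device"),      # optional separate cache in front of hot storage
    Tier("storage_internal_cache"),  # cache inside the storage device itself
    Tier("backend_disks", {"tex/skin.tx": b"texture-bytes"}),
]
first = read("tex/skin.tx", tiers)   # miss everywhere, served by backend_disks
second = read("tex/skin.tx", tiers)  # now served by os_page_cache
```

After the first read there are four copies of the same file, one per tier, which is why "which copy did this frame actually see?" becomes a real question.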