I cut my teeth on TensorFlow 1, where graphs were compiled ahead of time, and did a lot of my grad school work in classic CPU-only autograd because I needed forward-mode differentiation for fast Hessians (don't ask). So this was not at all obvious to me!
This is also why it's so 🔑 that you use num_workers>0 in your DataLoader.
Otherwise, the CPU forward pass won't start until the batch has been loaded, and then the next batch won't start loading until the optimizer step is done.
That's a lot of (expensive!) idle GPU time😬
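In case it helps, here's roughly what that looks like in code. This is a minimal sketch: the dataset, batch size, and worker count are all made up, so tune them for your setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy stand-in dataset; any Dataset works the same way
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,     # load the next batch in background worker processes
    pin_memory=True,   # page-locked host memory -> faster async host-to-device copies
)

device = "cuda" if torch.cuda.is_available() else "cpu"

for xb, yb in loader:
    # non_blocking copies can overlap with compute when pin_memory=True
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...
```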
I learned a whole lot more using the trace viewer, including the reasoning behind most of @karpathy's hitherto mysterious tips on optimizing @PyTorch.
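For the curious, generating one of those traces yourself is only a few lines with torch.profiler. A minimal sketch (the model and batch here are placeholders):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)   # stand-in model
x = torch.randn(64, 512)            # stand-in batch

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        model(x).sum().backward()

# writes a Chrome-trace JSON you can open in the trace viewer (chrome://tracing or Perfetto)
prof.export_chrome_trace("trace.json")
```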
PS: this hot'n'fresh @weights_biases feature is courtesy of @vanpelt, who incorporated PyTorch's excellent trace viewer into our Artifacts system so that traces can more easily be tracked, shared, and integrated into dashboards and reports.
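I won't speak to the exact shape of the built-in integration, but logging a trace file as an Artifact by hand looks roughly like this (project and file names are made up):

```python
import wandb

run = wandb.init(project="profiling-demo")   # hypothetical project name

artifact = wandb.Artifact("pytorch-trace", type="profile")
artifact.add_file("trace.json")              # the Chrome trace exported above
run.log_artifact(artifact)
run.finish()
```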
• • •
tl;dr: the basic idea of the SVD works for _any_ function.
it's a three-step decomposition (matrix-flavored sketch below ⬇️):
- throw away the useless bits ⤵
- rename what remains 🔀
- insert yourself into the right context ⤴
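For the concrete matrix case, one way to read those three steps against the factors of A = U Σ Vᵀ, applied right to left (the mapping of metaphor to factor is my reading, and the matrix here is random):

```python
import torch

A = torch.randn(5, 3)                            # any linear map will do
U, S, Vh = torch.linalg.svd(A, full_matrices=False)

x = torch.randn(3)

coords = Vh @ x      # re-express x in the input directions that matter;
                     # "throwing away the useless bits" = truncating directions with tiny singular values
scaled = S * coords  # "rename what remains": each retained direction gets its own stretch factor
y = U @ scaled       # "insert into the right context": place the result along the output space's directions

assert torch.allclose(y, A @ x, atol=1e-5)       # same answer as applying A directly
```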
also, if you're more of a "YouTube talk" than a "tweet wall" kinda person, check out the video version, given as part of the @weights_biases Deep Learning Salon webinar series