✨ Low Numerical Precision in PyTorch ✨
Most DL models use single-precision (FP32) floats by default.
Lower numerical precision - while reasonably maintaining accuracy - reduces:
a) model size
b) memory required
c) power consumed
Thread about lower precision DL in PyTorch ->
1/11
Lower precision speeds up:
* compute-bound operations, since the hardware can execute more lower-precision operations per cycle
* memory bandwidth-bound operations, by accessing smaller data
In many deep models, memory access dominates power consumption; reducing memory I/O makes models more energy efficient.
2/11
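Halving the bytes per element also halves the data a memory-bound op has to move. A minimal sketch of the footprint difference:

```python
import torch

# The same tensor in FP32 vs FP16.
x_fp32 = torch.randn(1024, 1024)                  # 4 bytes per element
x_fp16 = x_fp32.half()                            # 2 bytes per element

print(x_fp32.element_size() * x_fp32.nelement())  # 4194304 bytes
print(x_fp16.element_size() * x_fp16.nelement())  # 2097152 bytes
```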
3 lower precision datatypes are typically used in PyTorch:
* FP16 or half-precision (`torch.float16`)
* BF16 (`torch.bfloat16`)
* INT8 (`torch.quint8` and `torch.qint8`), which stores floats in a quantized format (examples below 👇)
3/11
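A quick sketch of creating tensors in these dtypes (the scale/zero_point values below are arbitrary, just for illustration):

```python
import torch

# Float dtypes: just pass dtype= at construction (or cast with .to()).
t_fp16 = torch.randn(4, dtype=torch.float16)
t_bf16 = torch.randn(4, dtype=torch.bfloat16)

# INT8 is a quantized format: floats are stored as int8 values plus a
# scale and zero_point used to map them back to real values.
t_q = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=0, dtype=torch.qint8)
print(t_q.dtype)         # torch.qint8
print(t_q.int_repr())    # the underlying int8 storage
print(t_q.dequantize())  # back to float32
```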
FP16 is only supported on CUDA; BF16 is supported on newer CPUs and on TPUs.
Calling .half() on your network and tensors explicitly casts them to FP16, but not all ops are safe to run in half-precision.
4/11
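For instance (a sketch that assumes a CUDA device is available):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4).cuda().half()        # parameters become torch.float16
x = torch.randn(8, 16, device="cuda").half()

y = model(x)                                  # the whole forward pass runs in FP16
print(y.dtype)                                # torch.float16

# Caveat: reductions, softmax, losses etc. can overflow/underflow in FP16,
# which is why mixed precision (next tweet) is usually the safer route.
```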
A better solution is to use Automatic Mixed Precision to let PyTorch choose the right op-specific precision (FP32 vs FP16 / BF16) for your tensors.
5/11
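A minimal AMP training-step sketch on CUDA - `model`, `optimizer`, `loss_fn`, and `loader` are assumed to be defined elsewhere:

```python
import torch

scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 underflow

for x, target in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # PyTorch picks FP16 or FP32 per op
        out = model(x)
        loss = loss_fn(out, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                    # unscales gradients, then steps
    scaler.update()
```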
For torch <= 1.9.1, AMP was limited to CUDA tensors using
`torch.cuda.amp.autocast()`
From v1.10 onwards, PyTorch has a generic API `torch.autocast()` that automatically casts
* CUDA tensors to FP16, and
* CPU tensors to BF16 (quick example below 👇).
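Sketch of the generic API (PyTorch 1.10+):

```python
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)

# On CPU, eligible ops (e.g. matmul) run in bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    print((a @ b).dtype)                      # torch.bfloat16

# On CUDA, the default autocast dtype is float16.
if torch.cuda.is_available():
    with torch.autocast(device_type="cuda"):
        print((a.cuda() @ b.cuda()).dtype)    # torch.float16
```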
Want to make your inference code in PyTorch run faster? Here’s a quick thread on doing exactly that.
1. Replace torch.no_grad() with the ✨torch.inference_mode()✨ context manager.
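A minimal sketch (assuming `model` and `batch` are defined elsewhere):

```python
import torch

# As a context manager:
with torch.inference_mode():
    out = model(batch)

# Or as a decorator:
@torch.inference_mode()
def predict(x):
    return model(x)
```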
2. ⏩ inference_mode() is torch.no_grad() on steroids
While NoGrad excludes operations from being tracked by Autograd, InferenceMode takes that two steps further, potentially speeding up your code (YMMV depending on model complexity and hardware).
3. ⏩ InferenceMode reduces overhead by disabling two Autograd mechanisms - version counting and metadata tracking - on all tensors created inside it ("inference tensors").
Because those mechanisms are disabled, inference tensors come with some restrictions on how they can be used 👇
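One such restriction, sketched below: an inference tensor can't be used in a computation that Autograd needs to record.

```python
import torch

with torch.inference_mode():
    t = torch.ones(3)                         # an "inference tensor"

w = torch.ones(3, requires_grad=True)
try:
    y = t * w                                 # Autograd would need to save `t` for backward
    y.sum().backward()
except RuntimeError as err:
    print(err)                                # "Inference tensors cannot be saved for backward..."
```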