PyTorch
Tensors and neural networks in Python with strong hardware acceleration. PyTorch is an open source project at the Linux Foundation. #PyTorchFoundation

Oct 19, 2021, 11 tweets

✨ Low Numerical Precision in PyTorch ✨
Most DL models use single-precision (FP32) floats by default.
Lower numerical precision, while reasonably maintaining accuracy, reduces:

a) model size
b) memory required
c) power consumed

Thread about lower precision DL in PyTorch ->
1/11

Lower precision speeds up:

* compute-bound operations, by letting the hardware do more math per cycle

* memory bandwidth-bound operations, by accessing smaller data

In many deep models, memory access dominates power consumption; reducing memory I/O makes models more energy efficient.

2/11

3 lower-precision datatypes are typically used in PyTorch:

* FP16 or half-precision (`torch.float16`)

* BF16 (`torch.bfloat16`)

* INT8 (`torch.quint8` and `torch.qint8`), which stores floats in a quantized format
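
A minimal sketch of creating tensors in each of these dtypes; the `scale` and `zero_point` for the quantized tensor are illustrative, not tuned values:

```python
import torch

x = torch.randn(4, 4)             # FP32 by default

x_fp16 = x.to(torch.float16)      # half precision
x_bf16 = x.to(torch.bfloat16)     # bfloat16

# Quantized INT8: values are stored as int8 plus a scale/zero_point
# (scale and zero_point here are illustrative, not calibrated)
x_int8 = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

print(x_fp16.dtype, x_bf16.dtype, x_int8.dtype)
```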

3/11

FP16 is only supported on CUDA; BF16 is supported on newer CPUs and TPUs.

Calling .half() on your network and tensors explicitly casts them to FP16, but not all ops are safe to run in half-precision.
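
A quick sketch of explicit half-casting, assuming a CUDA device is available; the model and shapes are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
x = torch.randn(32, 128, device="cuda")

# Explicitly cast both the network parameters and the input to FP16
model_fp16 = model.half()
x_fp16 = x.half()

out = model_fp16(x_fp16)   # runs entirely in FP16
print(out.dtype)           # torch.float16
```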

4/11

A better solution is Automatic Mixed Precision (AMP), which lets PyTorch choose the right op-specific precision (FP32 vs FP16 / BF16) for your tensors.

5/11

For torch <= 1.9.1, AMP was limited to CUDA tensors, using
`torch.cuda.amp.autocast()`

From v1.10 onwards, PyTorch has a generic API `torch.autocast()` that automatically casts

* CUDA tensors to FP16, and
* CPU tensors to BF16 (see the sketch below).

Docs: pytorch.org/docs/1.10./amp…
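
A minimal sketch of the generic API, assuming PyTorch >= 1.10; the model, shapes and dtype choices are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
x = torch.randn(32, 128)

# On CPU, autocast runs eligible ops in bfloat16
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16 (linear is autocast-eligible on CPU)

# On a CUDA machine the same pattern would use FP16:
# with torch.autocast(device_type="cuda", dtype=torch.float16):
#     out = model(x.cuda())
```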

6/11

Running ResNet-101 on a Tesla T4 GPU shows AMP to be faster than explicit half-casting.

7/11

Don’t wrap your backward pass in `autocast()`!

Ensure you’re only wrapping your forward pass and the loss computation in the autocast region.

The backward ops will run in the same dtype that the corresponding forward op was autocast to.
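
A minimal training-step sketch for the CUDA FP16 case; the model and data are illustrative, and `GradScaler` is included because small FP16 gradients can underflow without loss scaling:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()

# Forward pass and loss computation inside autocast...
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)
    loss = loss_fn(out, y)

# ...but backward() and the optimizer step outside it.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```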

8/11

Low-precision gradients save network bandwidth in distributed training too.

You can enable gradient compression to FP16 with DistributedDataParallel: pytorch.org/docs/stable/dd…
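
A sketch of registering the FP16 compression hook, assuming the process group is already initialized and this process has a GPU assigned; the model is illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes torch.distributed.init_process_group(...) has already been called.
model = nn.Linear(128, 10).cuda()
ddp_model = DDP(model)

# Compress gradients to FP16 before all-reduce to save network bandwidth.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```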

9/11

On CPUs without BF16 support, and on ARM CPUs, lower precision is currently enabled via quantization.

Quantization converts FP32 to INT8, with a potential 4x reduction in model size.

Only the forward pass is quantizable, so you can use this only for inference, not training.
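
A minimal sketch of post-training dynamic quantization, quantizing Linear weights to INT8; the model is illustrative, and the API shown is the `torch.quantization` namespace used at the time:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()  # quantization targets inference only

# Dynamically quantize Linear layers: weights stored as INT8,
# activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)
```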

10/11

Learn more about half precision on the PyTorch Developer Podcast episode: pytorch-dev-podcast.simplecast.com/episodes/half-…

torch.autocast: pytorch.org/docs/1.10./amp…
AMP Examples:
pytorch.org/docs/stable/no…
Quantization in PyTorch: pytorch.org/docs/stable/qu…

11/11
