Neural networks are getting HUGE. In their @stateofaireport 2020, @NathanBenaich and @soundboy visualized how the number of parameters grew for breakthrough architectures. The result below is staggering.

What can you do to compress neural networks?

👇A thread.
1⃣ Neural network pruning: iteratively removing connections after training. It turns out that in some cases, 90%+ of the weights can be removed without noticeable performance loss.
A few selected milestone papers:
📰Optimal Brain Damage by @ylecun, John S. Denker, and @SaraASolla. As far as I know, this is the paper that introduced the idea.

papers.nips.cc/paper/250-opti…
📰The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks by @jefrankle and @mcarbin.

arxiv.org/abs/1803.03635
📰 Pruning neural networks without any data by iteratively conserving synaptic flow by @Hidenori8Tanaka, Daniel Kunin, @dyamins, and @SuryaGanguli

arxiv.org/abs/2006.05467
How to do this in practice?
🏗️ TensorFlow Model Optimization Toolkit: tensorflow.org/model_optimiza…
🏗️ PyTorch pruning tools: pytorch.org/tutorials/inte…
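To make this concrete, here is a minimal magnitude-pruning sketch using PyTorch's torch.nn.utils.prune module. The toy model and the 90% sparsity target are illustrative assumptions, not a recipe taken from the papers above.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy fully connected model standing in for a trained network.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

# Globally prune 90% of the weights with the smallest L1 magnitude
# across both linear layers.
parameters_to_prune = [
    (model[0], "weight"),
    (model[2], "weight"),
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.9,
)

# Bake the pruning masks into the weights permanently.
for module, name in parameters_to_prune:
    prune.remove(module, name)

# Fraction of weights that are now exactly zero.
zeros = sum((m.weight == 0).sum().item() for m, _ in parameters_to_prune)
total = sum(m.weight.nelement() for m, _ in parameters_to_prune)
print(f"sparsity: {zeros / total:.2%}")

In practice you would prune a trained model and then fine-tune it for a few epochs (or iterate prune-and-retrain) to recover accuracy.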
2⃣ Knowledge distillation: training a smaller student network to reproduce the predictions of a larger teacher network. Since the teacher's predictions are available for unlabelled data as well, the student learns to generalize like the teacher.
📰 This technique was introduced by @geoffreyhinton, @OriolVinyalsML, and @JeffDean in their paper Distilling the Knowledge in a Neural Network.

arxiv.org/abs/1503.02531
📰 One recent success story with knowledge distillation is DistilBERT from @huggingface. 40% smaller and 60% faster, while retaining 97% of BERT's language understanding capabilities!

arxiv.org/abs/1910.01108
🏗️ Since knowledge distillation doesn't require any special tooling, it can be done with any framework.
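As a rough sketch, the distillation loss from the Hinton et al. paper can be written in a few lines of PyTorch. The temperature T and the mixing weight alpha below are placeholder values you would tune in practice.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

During training, the teacher runs in inference mode to produce teacher_logits, and only the student's parameters are updated with this loss.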
3⃣ Quantization: using integer types instead of float32 for faster computation and smaller models. There are two flavors: post-training quantization and quantization-aware training. The former is simpler but can result in a drop in accuracy.

Image source: TensorFlow Lite docs
How to do this in practice?
🏗️ TensorFlow Lite: tensorflow.org/lite/performan…
🏗️ PyTorch: pytorch.org/docs/stable/qu…
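For illustration, post-training dynamic quantization is nearly a one-liner in PyTorch; the toy model below is only a stand-in, and quantization-aware training requires extra setup during training.

import torch
import torch.nn as nn

# A toy trained model standing in for the real network.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
model.eval()

# Post-training dynamic quantization: the linear layers' weights are stored
# as int8 and dequantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)  # the Linear layers are replaced by quantized versions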
