Neural networks are getting HUGE. In their @stateofaireport 2020, @NathanBenaich and @soundboy visualized how the number of parameters grew for breakthrough architectures. The result below is staggering.
What can you do to compress neural networks?
👇A thread.
1⃣ Neural network pruning: iteratively removing connections after training. It turns out that in some cases, 90%+ of the weights can be removed without a noticeable performance loss. A code sketch follows the papers below.
A few selected milestone papers:
📰Optimal Brain Damage by @ylecun, John S. Denker, and @SaraASolla. As far as I know, this is the paper that introduced the idea.
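As promised, here is a minimal sketch of magnitude-based pruning using PyTorch's torch.nn.utils.prune module. The toy model and the 90% sparsity level are illustrative placeholders, not a recipe from the paper above.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model standing in for a network that has already been trained.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 90% of weights with the smallest magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)

# Make the pruning permanent by folding the mask into the weight tensor.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```

In practice, pruning and fine-tuning are alternated over several rounds so the remaining weights can recover the lost accuracy.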
2⃣ Knowledge distillation: teaching a smaller network to learn the predictions of the big one. Since predictions are available for unlabelled data as well, the student network learns how to generalize like the teacher.
📰 One recent success story with knowledge distillation is DistilBERT from @huggingface. 40% smaller and 60% faster, while retaining 97% of its language understanding capabilities!
🏗️ Since it doesn't require any special tooling, it can be done in any framework. A minimal sketch of the loss follows below.
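Here is a minimal sketch of a standard distillation loss (soft targets from the teacher plus hard labels), assuming the usual PyTorch setup; the temperature T and weight alpha are illustrative values, not the ones used for DistilBERT.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: push the student's softened distribution towards the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    # Hard targets: the usual cross-entropy on ground-truth labels, when available.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Random tensors standing in for real teacher/student outputs and labels.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```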
3⃣ Quantization: using integer types instead of float32 for faster computation. There are two flavors: post-training quantization and quantization-aware training. The former is simpler but can cost some accuracy. A sketch of the post-training variant follows below.
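As a sketch, post-training dynamic quantization in PyTorch looks roughly like this; the toy model is a placeholder, and quantization-aware training needs extra setup that is not shown here.

```python
import torch
import torch.nn as nn

# A toy trained model standing in for the real one.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Store Linear weights as int8; activations are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# The quantized model is called exactly like the original.
x = torch.randn(1, 784)
print(quantized(x).shape)
```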
1️⃣ If you struggle to understand determinants, stop what you are doing and check out this video by @3blue1brown. It will make your brain explode.
2⃣ Sometimes, it is hard to figure out what a concept represents by looking at how it is calculated. The determinant of a matrix is computed as a sum over all permutations of the column indices, taking one entry from each row; see the formula below.
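For reference, this is the Leibniz formula, where the sum runs over all permutations σ of {1, …, n} and sgn(σ) is the sign of the permutation:

```latex
\det(A) = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} a_{i,\sigma(i)}
```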
3⃣ However, this definition doesn't reveal anything about what the determinant means. In fact, it is quite simple: it describes how the volume scales under the corresponding linear transformation.
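As a quick example, a diagonal matrix that stretches the plane by 2 horizontally and by 3 vertically maps the unit square to a rectangle of area 6, which is exactly its determinant:

```latex
A = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix},
\qquad
\det(A) = 2 \cdot 3 - 0 \cdot 0 = 6
```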