Your language model wastes half of its layers just refining probability distributions rather than doing interesting computation.
In our paper, we found that the second half of the layers of the Llama 3 models has minimal effect on future computations. 1/6
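A minimal logit-lens-style sketch of how one could probe this kind of claim (my illustration using HuggingFace transformers, not necessarily the paper's exact protocol; the checkpoint name and prompt are placeholders): decode the next-token distribution from the residual stream at every depth and watch how early the final prediction is already in place, with later layers only sharpening it.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_probs = torch.softmax(out.logits[0, -1].float(), dim=-1)
for depth, h in enumerate(out.hidden_states):
    # Decode the residual stream through the final norm + unembedding.
    logits = model.lm_head(model.model.norm(h))[0, -1].float()
    kl = F.kl_div(torch.log_softmax(logits, dim=-1), final_probs, reduction="sum")
    top = tok.decode(logits.argmax().item())
    print(f"depth {depth:2d}: top token = {top!r}, KL(final || here) = {kl.item():.3f}")
```

If the KL to the final distribution is already small halfway through the stack, the remaining layers are mostly reshaping an answer that is already there.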
For inputs involving many steps, the operands of every step remain important up to the same depth. This indicates that the model is *not* breaking the computation into subproblems, solving them, and composing their results. 2/6
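To make the kind of intervention behind this claim concrete, here is a hedged sketch (reusing `model`, `tok`, and the imports from the sketch above; the two-step prompt, operand position, noise scale, and hook placement are all my assumptions, not the paper's setup): noise an operand's residual stream at layer L and sweep L. If operands from *different* steps stop mattering at the *same* depth, subresults are not being handed off step by step.

```python
prompt = "23 + 45 = 68. 68 + 11 ="  # illustrative two-step computation
inputs = tok(prompt, return_tensors="pt")
operand_pos = 1                      # token position of one operand (assumed)

with torch.no_grad():
    clean = model(**inputs).logits[0, -1].float()

def noise_hook(module, args, kwargs):
    # Perturb the residual stream at the operand's position only.
    hidden = (args[0] if args else kwargs["hidden_states"]).clone()
    hidden[:, operand_pos] += 3.0 * torch.randn_like(hidden[:, operand_pos])
    if args:
        return (hidden,) + args[1:], kwargs
    kwargs["hidden_states"] = hidden
    return args, kwargs

# Sweep the depth of the intervention: once the operand's value has been
# fully absorbed, noising it should stop moving the output.
for depth, layer in enumerate(model.model.layers):
    handle = layer.register_forward_pre_hook(noise_hook, with_kwargs=True)
    with torch.no_grad():
        noised = model(**inputs).logits[0, -1].float()
    handle.remove()
    print(f"depth {depth:2d}: max |Δ logit| = {(clean - noised).abs().max().item():.3f}")
```

Running this for an operand of the first step versus the second step and comparing where the effect dies out is the shape of the experiment described above.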
Aug 30, 2021
I'm happy to announce that our paper "The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers" has been accepted to #EMNLP2021!
1/4
We improve the systematic generalization of Transformers on SCAN (0 -> 100% with length cutoff=26), CFQ (66 -> 81% on output length split), PCFG (50 -> 85% on productivity split, 72 -> 96% on systematicity split), COGS (35 -> 81%), and the Mathematics dataset.
2/4