https://x.com/ArmenAgha/status/2003873670393991654
The argument Lucas gives goes: for attention output S = Σᵢ AᵢVᵢ (where A is the attention weight vector over positions), if the values Vᵢ are uncorrelated with unit variance, then Var(S) = ||A||₂². Since softmax enforces ||A||₁ = 1 (not ||A||₂ = 1), uniform attention Aᵢ = 1/N gives ||A||₂² = N · (1/N)² = 1/N, so the variance "collapses" with sequence length. More details below.
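A minimal numpy sketch of that argument (not from the linked thread; the draw count and sequence lengths are illustrative choices): with uncorrelated unit-variance values and uniform attention weights, the empirical variance of S tracks 1/N.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (16, 128, 1024):
    # 10,000 draws of n uncorrelated, unit-variance scalar values V_i
    V = rng.standard_normal((10_000, n))
    A = np.full(n, 1.0 / n)   # uniform attention: ||A||_1 = 1
    S = V @ A                 # S = sum_i A_i V_i, one scalar per draw
    # Empirical Var(S) matches ||A||_2^2 = n * (1/n)^2 = 1/n
    print(f"n={n:4d}  Var(S)={S.var():.5f}  1/n={1/n:.5f}")
```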
https://twitter.com/_akhaliq/status/1612987028660011010
Chinchilla proposed a mathematical formulation, known as compute-optimal scaling laws, to explain how the loss of a language model decreases as the model size (N) and the number of training tokens (D) increase.
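As an illustration, a minimal sketch of Chinchilla's parametric loss L(N, D) = E + A/N^α + B/D^β. The constants are the fitted values reported in Hoffmann et al. (2022), quoted here from memory, so treat the exact numbers as approximate; the example model sizes are illustrative.

```python
# Fitted constants from the Chinchilla paper (Hoffmann et al., 2022);
# quoted from memory, so treat the exact values as approximate.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def chinchilla_loss(N: float, D: float) -> float:
    """Predicted pretraining loss: L(N, D) = E + A/N**alpha + B/D**beta."""
    return E + A / N**alpha + B / D**beta

# Both power-law terms shrink as N and D grow; E is the irreducible floor.
print(chinchilla_loss(70e9, 1.4e12))   # Chinchilla: 70B params, 1.4T tokens
print(chinchilla_loss(280e9, 300e9))   # Gopher-like: more params, fewer tokens
```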
https://twitter.com/ak92501/status/1483983326901899272
Causal modeling has the benefit of per-token generation, while masked language models allow for bidirectional conditioning at the cost of partial generation. We propose a new objective, causal masking, combining the best of both worlds: full generation plus optional bidirectionality.
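A minimal sketch of the causal-masking data transform under the usual reading of that objective: one contiguous span is rotated behind a sentinel token at the end of the sequence, and the model then trains with an ordinary left-to-right loss. The sentinel name and span sampling below are illustrative, not taken from the paper.

```python
import random

MASK = "<mask:0>"  # sentinel token; the name is illustrative

def causal_mask_transform(tokens: list[str], rng: random.Random) -> list[str]:
    """Move one random contiguous span to the end, behind a sentinel.

    With a standard left-to-right objective on the transformed sequence,
    the moved span is generated conditioned on tokens from both sides of
    its original position (optional bidirectionality), while sequences
    with no mask drawn keep plain full left-to-right generation.
    """
    i = rng.randrange(len(tokens))
    j = rng.randrange(i + 1, len(tokens) + 1)
    return tokens[:i] + [MASK] + tokens[j:] + [MASK] + tokens[i:j]

rng = random.Random(0)
print(causal_mask_transform("the cat sat on the mat".split(), rng))
```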