How to get URL link on X (Twitter) App
https://twitter.com/_akhaliq/status/1612987028660011010Chinchilla proposed a mathematical formulation, known as compute-optimal scaling laws, to explain how the performance of a language model decreases as the model size (N) and the number of tokens (D) increase.
https://twitter.com/ak92501/status/1483983326901899272Causal modeling has the benefit of per-token generation while masked language models allow for bidirectional conditioning at the cost of partial generation. We propose a new objective, causal masking combining the best of both worlds, full generation + optional bidirectionality.