Takeaways from reading the "LLaMA: Open and Efficient Foundation Language Models" paper that made big waves yesterday.
It's a laudable open-source effort making large language models available* for research purposes, but there are also some missed research opportunities here 1/8
Inspired by Chinchilla's scaling laws paper, the LLaMA paper proposes a set of "small" large language models, trained only on publicly available data, that outperform GPT-3 (175B) with >10x fewer parameters (13B). And there's a larger 65B version that outperforms PaLM-540B. 2/8
The LLaMA models are a welcome alternative to previous open-source models like OPT and BLOOM, which are both said to underperform GPT-3.
What are some of the methods they used to achieve this performance?
3/8
They reference Pre-normalization, SwiGLU activations, and Rotary Embeddings as techniques to improve the performance of the LLaMA models. Since these are research models, I would have loved to see ablation studies -- I feel like this is a missed opportunity. 4/8
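To make these concrete, here is a minimal sketch (my own illustration, not code from the paper) of a pre-normalized transformer block with a SwiGLU feed-forward in PyTorch. Rotary embeddings are omitted for brevity, and I use LayerNorm where LLaMA actually uses RMSNorm; all dimensions are arbitrary.

import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # SwiGLU feed-forward: Swish(x W_gate) * (x W_up), projected back down
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormBlock(nn.Module):
    # Pre-normalization: normalize *before* each sublayer, then add the residual
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # LLaMA uses RMSNorm; LayerNorm here for brevity
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = SwiGLU(d_model, 4 * d_model)

    def forward(self, x, attn_mask=None):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x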
Moreover, the plots show a steep negative slope when plotting the training loss versus the number of training tokens. I cannot help but wonder what would happen if they trained the model for more than 1-2 epochs. 5/8
Also, from reading the research paper, it's not clear what the architecture looks like exactly. The referenced "Attention Is All You Need" architecture is an encoder-decoder architecture, whereas GPT-3, which they compare themselves to, is a decoder-only architecture. 6/8
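To make the distinction concrete: a decoder-only (GPT-style) model is just the decoder stack with causal self-attention and no cross-attention. Below is a minimal illustration of the causal mask (my own, not from the paper), which could be passed as attn_mask to a block like the one sketched above.

import torch

# Position i may only attend to positions <= i; True marks masked-out entries.
seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])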
*Now let's get to the asterisk of the first tweet. The model repo is available under a GNU GPL v3.0 license on GitHub here: github.com/facebookresear….
It contains only the code; the weights are available upon filing a request form.
7/8
While I think it's fair (no pun intended), it should be mentioned that this comes with a pretty hefty restriction:
"The license prohibits using the models or any data produced by the models for any type of commercial or production purpose." 8/8
Summarized some of the interesting takeaways below.
(Note that alternatives to scaled dot-product self-attention are notably absent -- does no one use these, still?)
1/6
Sharpness-aware minimization (SAM) nearly doubles the training time, since it needs to solve a bi-level min-max optimization problem with two forward/backward passes per update (see the sketch below). Stochastic weight perturbation (arxiv.org/abs/2205.14083) is a more efficient alternative.
2/6
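For reference, here is a rough sketch of a single SAM update step (my own paraphrase of the procedure, not code from either paper); the two forward/backward passes are what roughly doubles the training cost.

import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # 1st forward/backward: gradient at the current weights
    loss_fn(model(x), y).backward()

    # Perturb the weights toward the (approximate) worst case in an L2 ball of radius rho
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)

    # 2nd forward/backward: gradient at the perturbed weights
    model.zero_grad()
    loss_fn(model(x), y).backward()

    # Undo the perturbation and update with the sharpness-aware gradient
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()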
Weight initialization and rescaling schemes like Fixup (arxiv.org/abs/1901.09321) for residual blocks stabilize training and enable higher learning rates, removing the need for BatchNorm and LayerNorm (a simplified sketch follows below).
3/6
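A simplified sketch of the core Fixup idea (my own condensed version, not the paper's code): zero-initialize the last layer of each residual branch and downscale the earlier ones by a depth-dependent factor.

import torch
import torch.nn as nn

def fixup_init_branch(branch_layers, num_residual_blocks):
    # Simplified Fixup-style init for one residual branch with m layers (m >= 2):
    # scale the first m-1 layers by L^(-1/(2m-2)) and zero-init the last layer,
    # so every residual block starts out (close to) the identity function.
    m = len(branch_layers)
    scale = num_residual_blocks ** (-1.0 / (2 * m - 2))
    for layer in branch_layers[:-1]:
        nn.init.kaiming_normal_(layer.weight)
        with torch.no_grad():
            layer.weight.mul_(scale)
    nn.init.zeros_(branch_layers[-1].weight)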
Training deep neural nets on multiple GPUs has become increasingly common in recent years.
Dividing the workload allows for larger and more complex models to be trained more quickly.
I made a little cheatsheet summarizing the different approaches:
If you are training your models using PyTorch, you can use the @LightningAI Trainer to experiment with these various techniques (a minimal usage example follows the list below).
My recommendation:
1) Speed up regular model training where things fit on a single GPU: Trainer(…, strategy='ddp')
1/n
2) Same as above but in a Jupyter notebook: Trainer(…, strategy='ddp_notebook')
[data parallel]
3) Pretrain a large model: Trainer(…, strategy='ddp_sharded')
[this uses data parallelism with sharding of optimizer states and gradients across GPUs]
2/n
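A minimal usage sketch for the strategies above (LitModel and train_loader are placeholders for your own LightningModule and DataLoader; strategy names as in PyTorch Lightning 1.x):

import pytorch_lightning as pl

model = LitModel()  # placeholder: your own LightningModule

# 1) Regular multi-GPU data-parallel training
trainer = pl.Trainer(accelerator='gpu', devices=4, strategy='ddp')

# 2) The same, but launched from inside a Jupyter notebook
# trainer = pl.Trainer(accelerator='gpu', devices=4, strategy='ddp_notebook')

# 3) Pretraining a large model with sharded data parallelism
# trainer = pl.Trainer(accelerator='gpu', devices=4, strategy='ddp_sharded')

trainer.fit(model, train_loader)  # placeholder: your own DataLoader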
After putting together a lecture on multi-GPU training paradigms, I thought it might be a good idea to catch up with the recent “Cramming: Training a Language Model on a Single GPU in One Day” paper (arxiv.org/abs/2212.14034).
An interesting read with lots of insights!
1/8
Let’s start with maybe the most interesting takeaway: sure, smaller models have higher throughput, but they also learn less efficiently.
Consequently, larger models don’t necessarily take longer to train!
2/8
Taking a step back: What did they do? The researchers trained a masked language model / encoder-style LLM (here: BERT) for 24h on 1 GPU — for comparison, the original 2018 BERT paper trained it on 16 TPUs for 4 days.
3/8
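As a refresher on the training objective, here is a minimal sketch of masked language modeling (my own illustration, not the paper's code): mask a random ~15% of the tokens and compute the loss only on those positions.

import torch
import torch.nn.functional as F

def mlm_loss(model, token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    # token_ids: (batch, seq_len) integers; model returns (batch, seq_len, vocab_size) logits
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100            # ignore unmasked positions in the loss
    inputs = token_ids.clone()
    inputs[mask] = mask_token_id    # replace selected tokens with [MASK]

    logits = model(inputs)
    return F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100)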
Springer Nature "says it has no problem with AI being used to help write research — as long as its use is properly disclosed." theverge.com/2023/1/26/2357…
What are the different approaches for detecting content generated by LLMs such as ChatGPT? And how do they work and differ?
Let's discuss (1) the AI Classifier by OpenAI, (2) DetectGPT, (3) GPTZero, and (4) watermarking.
1/5
(1) AI Classifier: a GPT model fine-tuned via supervised learning to perform binary classification -- the training dataset consisted of human- and AI-written text passages. The predicted probabilities in [0, 1] are thresholded to obtain the output categories.
Limitation: requires representative training data
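Purely as an illustration of the thresholding step (the cutoffs and labels below are made up by me, not OpenAI's actual values):

def categorize(p_ai: float) -> str:
    # Hypothetical mapping from the classifier's probability of "AI-written"
    # to coarse output categories; thresholds are illustrative only.
    if p_ai < 0.10:
        return 'unlikely AI-generated'
    elif p_ai < 0.45:
        return 'unclear'
    elif p_ai < 0.90:
        return 'possibly AI-generated'
    return 'likely AI-generated'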
(2) DetectGPT perturbs the text: if the probability of the perturbed text is noticeably lower than that of the original, the original is likely AI-generated. Otherwise, if it's approximately the same, it's likely human-written.
Limitation: requires access to token probabilities from a specific LLM, which may not be representative of the model that generated the text
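Roughly, the DetectGPT criterion can be sketched like this (a simplified illustration; log_prob and perturb are placeholders for a scoring LLM and a mask-filling/paraphrasing model, such as T5 in the original paper):

import statistics

def detectgpt_score(text, log_prob, perturb, n_perturbations=20):
    # Compare the log-probability of the original text with the average
    # log-probability of slightly perturbed versions of it.
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    # Large positive gap: the original sits near a local maximum of the model's
    # probability -> likely AI-generated. Gap near zero -> likely human-written.
    return original - statistics.mean(perturbed)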