Sebastian Raschka · Feb 25
Takeaways from reading the "LLaMA: Open and Efficient Foundation Language Models" paper that made big waves yesterday.
It's a laudable open-source effort making large language models available* for research purposes, but there are also some missed research opportunities here.
1/8
Inspired by the Chinchilla scaling-laws paper, the LLaMA paper proposes a set of "small" large language models, trained only on public data, that outperform GPT-3 (175B) with >10x fewer parameters (13B). And there's a larger 65B version that outperforms the 540B-parameter PaLM.
2/8
The LLaMA models are a welcome alternative to previous open-source models like OPT and BLOOM, which are both said to underperform GPT-3.
What are some of the methods they used to achieve this performance?
3/8
They reference pre-normalization, SwiGLU activations, and rotary embeddings as techniques to improve the performance of the LLaMA models. Since these are research models, I would have loved to see ablation studies -- this feels like a missed opportunity.
4/8
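For readers who haven't seen these components, here is a minimal PyTorch sketch of the first two -- an RMSNorm-style pre-normalization layer and a SwiGLU feed-forward block. Dimensions, names, and the bias-free choice are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Pre-normalization: applied to the *input* of each transformer
    # sub-layer instead of its output (RMSNorm variant).
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    # SwiGLU feed-forward block: a SiLU-gated linear unit.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```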
Moreover, the plots show a steep negative slope in the training loss versus the number of training tokens. I cannot help but wonder what would happen if they trained the model for more than 1-2 epochs.
5/8
Also, from reading the research paper, it's not clear what the architecture looks like exactly. The referenced "Attention Is All You Need" architecture is an encoder-decoder architecture; GPT-3, which they compare against, is a decoder-only architecture.
6/8
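For what it's worth, the practical difference is the causal attention mask. A tiny sketch of what makes a model decoder-only (sizes are arbitrary):

```python
import torch

seq_len = 8
# Causal mask: position i may only attend to positions <= i, which is
# what makes a model "decoder-only" (GPT-style) rather than encoder-style.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = torch.randn(seq_len, seq_len)            # unnormalized attention scores
scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
attn = torch.softmax(scores, dim=-1)              # rows sum to 1 over the past
```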
*Now let's get to the asterisk from the first tweet. The model repo is available under a GNU GPL v3.0 license on GitHub here: github.com/facebookresear…
It contains the code only; the weights are available upon filing a request form.
7/8
While I think it's fair (no pun intended), it should be mentioned that this comes with a pretty hefty restriction:
"The license prohibits using the models or any data produced by the models for any type of commercial or production purpose."
8/8

More from @rasbt

Feb 16
"A Survey on Efficient Training of Transformers" 2023 (arxiv.org/abs/2302.01107)

Summarized some of the interesting takeaways below.
(Note that alternatives to scaled dot-product self-attention are notably absent -- does no one use these yet?)

1/6
Sharpness-aware minimization (SAM) nearly doubles the training time (since it needs to solve a bi-level min-max optimization problem). Stochastic weight perturbation (arxiv.org/abs/2205.14083) is a more efficient alternative.

2/6
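For intuition, here is a rough sketch of a single SAM update in PyTorch (the function signature and rho value are placeholders; real implementations wrap this in an optimizer with a closure):

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    # First forward/backward: gradients at the current weights.
    loss_fn(model(x), y).backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    # Inner max: perturb weights toward the (approximate) worst case.
    eps = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps[p] = e
    model.zero_grad()
    # Second forward/backward at the perturbed weights -- this is the
    # extra pass that roughly doubles the training cost.
    loss_fn(model(x), y).backward()
    # Undo the perturbation, then step with the sharpness-aware gradients.
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
```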
Weight initialization and rescaling schemes like Fixup (arxiv.org/abs/1901.09321) for residual blocks stabilize training and enable higher learning rates, removing the need for BatchNorm and LayerNorm.

3/6
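Roughly, Fixup replaces normalization layers with a careful rescaling at initialization time. A simplified sketch of the idea (the helper and the all-linear branch structure are hypothetical, and the real recipe also adds scalar scale/bias parameters):

```python
import torch.nn as nn

def fixup_init(residual_branches, num_blocks):
    # Fixup idea (simplified): zero-init the last layer of each residual
    # branch and downscale the earlier layers so residual updates stay
    # O(1) in magnitude regardless of depth -- no BatchNorm/LayerNorm needed.
    for branch in residual_branches:
        layers = [m for m in branch if isinstance(m, nn.Linear)]
        m = len(layers)
        scale = num_blocks ** (-1.0 / (2 * m - 2)) if m > 1 else 1.0
        for layer in layers[:-1]:
            layer.weight.data.mul_(scale)   # L^(-1/(2m-2)) rescaling
        nn.init.zeros_(layers[-1].weight)   # residual branch starts as identity
```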
Feb 14
Training deep neural nets on multiple GPUs has become increasingly common in recent years.
Dividing the workload allows for larger and more complex models to be trained more quickly.

I made a little cheatsheet summarizing the different approaches:
If you are training your models using PyTorch, you can use the @LightningAI Trainer to experiment with these various techniques.

My recommendation (a short usage sketch follows the list):

1) Speed up regular training when the model fits on a single GPU: Trainer(…, strategy='ddp')

1/n
2) Same as above but in a Jupyter notebook: Trainer(…, strategy='ddp_notebook')
[data parallel]

3) Pretrain a large model: Trainer(…, strategy="ddp_sharded")
[data parallel plus sharded gradients and optimizer states]

2/n
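A self-contained toy example of switching strategies with the Lightning Trainer -- the tiny model and random data are stand-ins, and the exact import path may differ across Lightning versions:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModule(pl.LightningModule):
    # Minimal stand-in model just to demonstrate the strategy flag.
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)

# Swap in 'ddp', 'ddp_notebook', or 'ddp_sharded' as described above.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
trainer.fit(TinyModule(), loader)
```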
Feb 3
After putting together a lecture on multi-GPU training paradigms, I thought it might be a good idea to catch up with the recent “Cramming: Training a Language Model on a Single GPU in One Day” paper (arxiv.org/abs/2212.14034).

An interesting read with lots of insights!
1/8
Let's start with maybe the most interesting takeaway: Sure, smaller models have higher throughput, but smaller models learn less efficiently.
Consequently, within a fixed compute budget, larger models don't take longer to train!
2/8
Taking a step back: What did they do? The researchers trained a masked language model / encoder-style LLM (here: BERT) for 24h on 1 GPU — for comparison, the original 2018 BERT paper trained it on 16 TPUs for 4 days.
3/8
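As a refresher, the masked-language-modeling objective hides a random subset of tokens and trains the model to reconstruct them. A simplified sketch (the real BERT recipe also randomly keeps or replaces some of the selected tokens instead of always masking):

```python
import torch

def mask_for_mlm(input_ids, mask_token_id, mlm_prob=0.15):
    # Hide ~15% of the tokens; the model is trained to predict them back.
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mlm_prob
    labels[~mask] = -100                 # ignore unmasked positions in the loss
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id  # replace selected tokens with [MASK]
    return masked_inputs, labels
```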
Feb 2
It will be interesting to see how things play out for competing companies on opposite sides of AI-generated content.

(1)
- Getty Images bans AI and sues
- Shutterstock adds AI and compensates

(2)
- Science Magazine bans AI content
- Springer Nature permits authors to use AI
Sources 1/3:
Getty Images bans AI-generated content: theverge.com/2022/9/21/2336…

Getty Images sues Stability AI: theverge.com/2023/1/17/2355…

Shutterstock adds text-to-image AI: prnewswire.com/news-releases/…

Shutterstock compensation plan for data used for training: support.submit.shutterstock.com/s/article/Shut…
Sources 2/3:

Springer Nature "says it has no problem with AI being used to help write research — as long as its use is properly disclosed."
theverge.com/2023/1/26/2357…
Feb 1
What are the different approaches for detecting content generated by LLMs such as ChatGPT? And how do they work and differ?

Let's discuss
(1) The AI Classifier by OpenAI
(2) DetectGPT
(3) GPTZero
(4) Watermarking

1/5
(1) AI Classifier: a GPT model fine-tuned via supervised learning to perform binary classification -- the training dataset consisted of human- and AI-written text passages. The output probabilities in [0, 1] are thresholded to obtain the discrete categories.

Limitation: requires representative training data
(2) DetectGPT perturbs the text: if the probability of the perturbed text is noticeably lower than that of the original, the original is likely AI-generated. If it's approximately the same, it's likely human-written.

Limitation: requires access to token probabilities from a specific LLM, which may not be representative
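A rough sketch of the DetectGPT scoring idea -- log_prob and perturb_fn are hypothetical stand-ins for the scoring LLM and a paraphrasing model (the paper uses T5 for perturbations):

```python
import statistics

def detectgpt_score(text, log_prob, perturb_fn, n_perturbations=20):
    # Curvature test: AI-generated text tends to sit near a local maximum
    # of the model's log-probability, so perturbed versions score
    # noticeably lower; for human text the gap is smaller.
    original = log_prob(text)
    perturbed = [log_prob(perturb_fn(text)) for _ in range(n_perturbations)]
    return original - statistics.mean(perturbed)

# A detector would then threshold: score > tau  ->  flag as AI-generated.
```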
Jan 31
OpenAI just launched the "AI Text Classifier" to identify texts generated by AI.
Tried it, and IT DOES NOT WORK.
platform.openai.com/ai-text-classi…

Using my Python ML book published in 2015:

1) @randal_olson's foreword: unclear
2) my preface: possibly AI
3) paragraph from Ch1: likely AI
I mean, this is a funny example, but I already feel bad for students who might get penalized for their essays in the future because of this.
First page from Shakespeare's Macbeth. What!?