Takeaways from reading the "LLaMA: Open and Efficient Foundation Language Models" paper that made big waves yesterday.
It's a laudable open-source effort making large language models available* for research purposes, but there are also some missed research opportunities here 1/8
Inspired by Chinchilla's scaling laws paper, the LLaMA paper proposes a set of "small" large language models, trained only on publicly available data, that outperform GPT-3 (175B) with >10x fewer parameters (13B). And there's a larger 65B version that outperforms PaLM-540B. 2/8
The LLaMA models are a welcome alternative to previous open-source models like OPT and BLOOM, which are both said to underperform GPT-3.
What are some of the methods they used to achieve this performance?
3/8
They reference Pre-normalization, SwiGLU activations, and Rotary Embeddings as techniques to improve the performance of the LLaMA models. Since these are research models, I would have loved to see ablation studies -- I feel like this is a missed opportunity. 4/8
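To make these concrete, here is a minimal sketch (my own illustration, not code from the paper) of a pre-normalized transformer block with a SwiGLU feed-forward in PyTorch. Rotary embeddings are omitted for brevity, and I use LayerNorm where LLaMA actually uses RMSNorm; all dimensions are arbitrary.

import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # SwiGLU feed-forward: Swish(x W_gate) * (x W_up), projected back down
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormBlock(nn.Module):
    # Pre-normalization: normalize *before* each sublayer, then add the residual
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # LLaMA uses RMSNorm; LayerNorm here for brevity
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = SwiGLU(d_model, 4 * d_model)

    def forward(self, x, attn_mask=None):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x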
Moreover, the plots show a steep negative slope when plotting the training loss versus the number of training tokens. I cannot help but wonder what would happen if they trained the model for more than 1-2 epochs. 5/8
Also, from reading the research paper, it's not clear what the architecture looks like exactly. The referenced "Attention Is All You Need" architecture is an encoder-decoder architecture, whereas GPT-3, which they compare themselves to, is a decoder-only architecture. 6/8
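To make the distinction concrete: a decoder-only (GPT-style) model is just the decoder stack with causal self-attention and no cross-attention. Below is a minimal illustration of the causal mask (my own, not from the paper), which could be passed as attn_mask to a block like the one sketched above.

import torch

# Position i may only attend to positions <= i; True marks masked-out entries.
seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])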
*Now let's get to the asterisk of the first tweet. The model repo is available under a GNU GPL v3.0 license on GitHub here: github.com/facebookresear….
It contains only the code; the weights are available upon filing a request form.
7/8
While I think it's fair (no pun intended), it should be mentioned that this comes with a pretty hefty restriction:
"The license prohibits using the models or any data produced by the models for any type of commercial or production purpose." 8/8
Summarized some of the interesting takeaways below.
(Note that alternatives to scaled dot-product self-attention are notably absent -- does no one use these, still?)
1/6
Sharpness-aware minimization (SAM) nearly doubles the training time, since it needs to solve a bi-level min-max optimization problem with two forward/backward passes per update (see the sketch below). Stochastic weight perturbation (arxiv.org/abs/2205.14083) is a more efficient alternative.
2/6
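For reference, here is a rough sketch of a single SAM update step (my own paraphrase of the procedure, not code from either paper); the two forward/backward passes are what roughly doubles the training cost.

import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # 1st forward/backward: gradient at the current weights
    loss_fn(model(x), y).backward()

    # Perturb the weights toward the (approximate) worst case in an L2 ball of radius rho
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)

    # 2nd forward/backward: gradient at the perturbed weights
    model.zero_grad()
    loss_fn(model(x), y).backward()

    # Undo the perturbation and update with the sharpness-aware gradient
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()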
Weight initialization and rescaling schemes like Fixup (arxiv.org/abs/1901.09321) for residual blocks stabilize training and enable higher learning rates, removing the need for BatchNorm and LayerNorm (a simplified sketch follows below).
3/6
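A simplified sketch of the core Fixup idea (my own condensed version, not the paper's code): zero-initialize the last layer of each residual branch and downscale the earlier ones by a depth-dependent factor.

import torch
import torch.nn as nn

def fixup_init_branch(branch_layers, num_residual_blocks):
    # Simplified Fixup-style init for one residual branch with m layers (m >= 2):
    # scale the first m-1 layers by L^(-1/(2m-2)) and zero-init the last layer,
    # so every residual block starts out (close to) the identity function.
    m = len(branch_layers)
    scale = num_residual_blocks ** (-1.0 / (2 * m - 2))
    for layer in branch_layers[:-1]:
        nn.init.kaiming_normal_(layer.weight)
        with torch.no_grad():
            layer.weight.mul_(scale)
    nn.init.zeros_(branch_layers[-1].weight)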
Training deep neural nets on multiple GPUs has become increasingly common in recent years.
Dividing the workload allows for larger and more complex models to be trained more quickly.
I made a little cheatsheet summarizing the different approaches:
If you are training your models using PyTorch, you can use the @LightningAI Trainer to experiment with these various techniques (a minimal usage example follows the list below).
My recommendation:
1) Speed up regular model training where things fit on a single GPU: Trainer(…, strategy='ddp')
1/n
2) Same as above but in a Jupyter notebook: Trainer(…, strategy='ddp_notebook')
[data parallel]
3) Pretrain a large model: Trainer(…, strategy='ddp_sharded')
[this uses data parallelism with sharding of optimizer states and gradients across GPUs]
2/n
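A minimal usage sketch for the strategies above (LitModel and train_loader are placeholders for your own LightningModule and DataLoader; strategy names as in PyTorch Lightning 1.x):

import pytorch_lightning as pl

model = LitModel()  # placeholder: your own LightningModule

# 1) Regular multi-GPU data-parallel training
trainer = pl.Trainer(accelerator='gpu', devices=4, strategy='ddp')

# 2) The same, but launched from inside a Jupyter notebook
# trainer = pl.Trainer(accelerator='gpu', devices=4, strategy='ddp_notebook')

# 3) Pretraining a large model with sharded data parallelism
# trainer = pl.Trainer(accelerator='gpu', devices=4, strategy='ddp_sharded')

trainer.fit(model, train_loader)  # placeholder: your own DataLoader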
After putting together a lecture on multi-GPU training paradigms, I thought it might be a good idea to catch up with the recent “Cramming: Training a Language Model on a Single GPU in One Day” paper (arxiv.org/abs/2212.14034).
An interesting read with lots of insights!
1/8
Let’s start with maybe the most interesting takeaway: sure, smaller models have higher throughput, but they also learn less efficiently.
Consequently, larger models don’t necessarily take longer to train!
2/8
Taking a step back: What did they do? The researchers trained a masked language model / encoder-style LLM (here: BERT) for 24h on 1 GPU — for comparison, the original 2018 BERT paper trained it on 16 TPUs for 4 days.
3/8
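As a refresher on the training objective, here is a minimal sketch of masked language modeling (my own illustration, not the paper's code): mask a random ~15% of the tokens and compute the loss only on those positions.

import torch
import torch.nn.functional as F

def mlm_loss(model, token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    # token_ids: (batch, seq_len) integers; model returns (batch, seq_len, vocab_size) logits
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100            # ignore unmasked positions in the loss
    inputs = token_ids.clone()
    inputs[mask] = mask_token_id    # replace selected tokens with [MASK]

    logits = model(inputs)
    return F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100)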
Springer Nature "says it has no problem with AI being used to help write research — as long as its use is properly disclosed." theverge.com/2023/1/26/2357…
What are the different approaches for detecting content generated by LLMs such as ChatGPT? And how do they work and differ?
Let's discuss (1) the AI Classifier by OpenAI, (2) DetectGPT, (3) GPTZero, and (4) watermarking.
1/5
(1) AI Classifier: a GPT model fine-tuned via supervised learning to perform binary classification -- the training dataset consisted of human- and AI-written text passages. The predicted probabilities in [0, 1] are thresholded to obtain the output categories.
Limitation: requires representative training data
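Purely as an illustration of the thresholding step (the cutoffs and labels below are made up by me, not OpenAI's actual values):

def categorize(p_ai: float) -> str:
    # Hypothetical mapping from the classifier's probability of "AI-written"
    # to coarse output categories; thresholds are illustrative only.
    if p_ai < 0.10:
        return 'unlikely AI-generated'
    elif p_ai < 0.45:
        return 'unclear'
    elif p_ai < 0.90:
        return 'possibly AI-generated'
    return 'likely AI-generated'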
(2) DetectGPT perturbs the text: if the probability of the perturbed text is noticeably lower than that of the original, the original is likely AI-generated. Otherwise, if it's approximately the same, it's likely human-written.
Limitation: requires access to token probabilities from a specific LLM, which may not be representative of the model that generated the text
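Roughly, the DetectGPT criterion can be sketched like this (a simplified illustration; log_prob and perturb are placeholders for a scoring LLM and a mask-filling/paraphrasing model, such as T5 in the original paper):

import statistics

def detectgpt_score(text, log_prob, perturb, n_perturbations=20):
    # Compare the log-probability of the original text with the average
    # log-probability of slightly perturbed versions of it.
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    # Large positive gap: the original sits near a local maximum of the model's
    # probability -> likely AI-generated. Gap near zero -> likely human-written.
    return original - statistics.mean(perturbed)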