Samuel Albanie
Jan 24, 2023 · 21 tweets
BLOOM.

A large language model trained by researchers from around the world as part of the @BigscienceW project.

How did they do it?

Why did they do it?

Let's dive in.

1/21
🧵
Large Language Models (LLMs) now play a key role in NLP.

But few orgs can afford to train them.

Also:
- most LLMs focus on English
- many are not public

Goals for BLOOM
- release a strong multilingual LLM
- document the development process

2/21
BLOOM was a BigScience effort:

- 28 countries
- 1200+ registered participants

3/21
Training data is key.

For BLOOM, the goals were to prioritise:
- human involvement
- local expertise
- language expertise

4/21
Also important: data governance.

Efforts were made to:
- obtain permission for data where possible
- keep data sources separate until final preprocessing
- provide tools to inspect/visualise data
- share data where possible

5/21
BLOOM was ultimately trained on ROOTS.

ROOTS = "Responsible Open-Science Open-Collaboration Text Sources":
- 1.61 TB
- 46 natural languages
- 13 programming languages

Careful studies were conducted to select the architecture and training objective.

6/21
Ultimately, the project settled on a decoder-only architecture.

7/21
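Decoder-only means a causal Transformer: each position attends only to itself and earlier positions. A minimal sketch of the causal attention mask (illustrative only, not BLOOM's actual implementation; the function names are my own):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_scores(scores: np.ndarray) -> np.ndarray:
    """Set disallowed (future) positions to -inf before the softmax."""
    mask = causal_mask(scores.shape[-1])
    return np.where(mask, scores, -np.inf)

scores = np.zeros((4, 4))
masked = masked_attention_scores(scores)
# The upper triangle (future positions) is now -inf, so the softmax
# assigns those positions zero attention weight.
```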
Training used the @genci_fr Jean Zay cluster:
- 3.5 months
- 384 A100 GPUs (80 GB each)
- over 1 million GPU hours

8/21
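A quick sanity check on those numbers (treating "3.5 months" as an approximation; the published total also includes preliminary and ablation runs):

```python
gpus = 384
days = 3.5 * 30.44        # ~3.5 months of wall-clock time
gpu_hours = gpus * days * 24

# Roughly 1e6 GPU hours for the main run alone; auxiliary runs
# push the overall total past 1 million.
assert 900_000 < gpu_hours < 1_050_000
```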
The Megatron-DeepSpeed framework was used for efficiency.

This provides:
- data parallelism (replicate model across GPUs)
- tensor parallelism (split within individual layers across GPUs)
- pipeline parallelism (split different layers across GPUs)

9/21
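These three axes multiply: the total GPU count is the product of the data, tensor and pipeline degrees. A toy sketch of how the 384-GPU cluster could be factorised (the degrees below are illustrative assumptions; check the BLOOM paper for the actual configuration):

```python
def gpus_required(dp: int, tp: int, pp: int) -> int:
    """3D parallelism: each model replica spans tp * pp GPUs,
    and dp replicas train in parallel on different data shards."""
    return dp * tp * pp

# One plausible factorisation of 384 GPUs:
# 8 data-parallel replicas, each split 4-way within layers (tensor)
# and 12-way across layers (pipeline).
assert gpus_required(dp=8, tp=4, pp=12) == 384
```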
Other fun engineering details that proved useful:
- mixed-precision training
- CUDA kernel fusion
- disabling async CUDA kernel launches (avoid deadlocks)
- splitting parameter groups (to avoid excessive CPU mem allocation)

10/21
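On the first of those: mixed precision typically computes in fp16 but keeps fp32 "master" weights, because small updates vanish when accumulated directly in fp16. A toy illustration of the failure mode (not BLOOM's training code):

```python
import numpy as np

w16 = np.float16(1.0)   # naive: weights stored in fp16
w32 = np.float32(1.0)   # master copy: weights stored in fp32
update = 1e-4           # an update smaller than fp16 resolution near 1.0

for _ in range(100):
    w16 = np.float16(w16 + np.float16(update))  # rounds back to 1.0 each step
    w32 = np.float32(w32 + np.float32(update))  # accumulates correctly

# w16 is stuck at 1.0; w32 has grown to ~1.01.
```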
Carbon footprint is fairly difficult to estimate reliably (it's a complex business).

With this caveat, BLOOM CO2eq emissions were lower than other models of similar sizes.

In part, this is due to French nuclear electricity generation (with relatively low emissions).

11/21
BLOOM is released under a Responsible AI License (RAIL).

This has 13 behavioural-use restrictions related to LLM use cases.

12/21
On WMT14, 1-shot BLOOM does a reasonable job with the right prompt.
On DiaBLa (an informal dialogue dataset), BLOOM (and other models) struggles.

13/21
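A 1-shot MT prompt in this setting pairs one worked translation with the new source sentence. A hypothetical template (not the exact prompt used in the paper):

```python
def one_shot_prompt(example_src: str, example_tgt: str, query_src: str,
                    src_lang: str = "French", tgt_lang: str = "English") -> str:
    """Build a 1-shot translation prompt: one demonstration, then the query."""
    return (
        f"{src_lang}: {example_src}\n{tgt_lang}: {example_tgt}\n"
        f"{src_lang}: {query_src}\n{tgt_lang}:"
    )

prompt = one_shot_prompt("Bonjour.", "Hello.", "Merci beaucoup.")
# The model is expected to continue the prompt with the translation.
```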
On @StanfordHAI's HELM benchmark, BLOOM:
- lags somewhat behind the top closed-source models on accuracy
- is poor on calibration error
- but quite good on robustness

14/21
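Calibration error measures the gap between a model's stated confidence and its actual accuracy. A minimal expected-calibration-error (ECE) sketch, in the spirit of (though not identical to) HELM's metric:

```python
import numpy as np

def ece(confidences, correct, n_bins: int = 10) -> float:
    """Expected calibration error: per confidence bin, the gap
    |accuracy - mean confidence|, weighted by the bin's sample share."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            total += in_bin.mean() * gap
    return total

# Perfectly calibrated toy case: 80% confidence, 80% accuracy -> ECE 0.
print(ece([0.8] * 5, [1, 1, 1, 1, 0]))
```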
BLOOM also does:
- relatively well on fairness
- moderately on bias
- poorly on toxicity metrics

15/21
Similarly to other LLMs, BLOOM benefits considerably from multilingual multitask finetuning.

16/21
BLOOM can generate code.

But it lags behind models like Codex on the HumanEval benchmark.

17/21
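HumanEval is typically scored with pass@k: the probability that at least one of k sampled solutions passes the unit tests. The standard unbiased estimator (introduced in the Codex paper), given n samples of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = total samples drawn, c = correct samples, k = evaluation budget."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a k-subset
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=5, k=1))  # 0.5: half the samples pass
```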
BLOOM can also produce solid embeddings for retrieval.

18/21
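Embedding-based retrieval ranks documents by similarity between the query vector and each document vector. A minimal cosine-similarity sketch with made-up 2-d vectors (any embedding model, BLOOM included, could supply real ones):

```python
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, top_k: int = 1):
    """Return indices of the top_k documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per document
    return np.argsort(-sims)[:top_k]  # highest similarity first

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
print(retrieve(query, docs))  # [0]: the first document is closest
```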
A preliminary study of BLOOM suggests limited bias.

Caveats apply.

19/21
For those who enjoy videos, a narrated version accompanies the original thread.
20/21


