I found it correctly answers questions about supposedly unknowable events in Oct and Nov, and even on Dec 11th & 19th.
In late Dec it begins to abstain.
2/
Interestingly, GPT 3.5 "Default" answers correctly only until ~Oct 24, 2021, but GPT 3.5 "Legacy" answers correctly until ~Oct 31, 2021, then begins hallucinating false answers or abstaining in Nov.
Perhaps this is due to finetuning rather than pretraining data?
3/
@AnthropicAI's Claude v1.2 model correctly answers questions about events on July 11, Aug 12, Sept 26 & Oct 10, but abstains for Oct 9 & Nov 2.
➡️The trick with Claude is to ask it about an event without telling it the date (see examples, plus the probe sketch below).
4/
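To make the probing setup concrete, here is a minimal sketch of the trick in Python using the OpenAI client (any chat API would do): ask about a dated event without ever mentioning the date, then label each answer. The model name and probe questions are placeholders, not the exact prompts from this thread.

```python
# Minimal knowledge-cutoff probe: ask about dated events WITHOUT revealing the
# date, then label each answer as correct / hallucinated / abstained.
# Assumes the OpenAI Python client (>= 1.0); the model name and probe
# questions are placeholders, not the exact prompts from this thread.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROBES = [
    # (event date, question that never mentions the date)
    ("2021-10-31", "Who won Japan's 2021 general election?"),
    ("2022-04-24", "Who won the 2022 French presidential election runoff?"),
]

for event_date, question in PROBES:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # swap in the model under test
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip()
    print(f"[event on {event_date}] {question}\n -> {answer}\n")
```

Bucketing the answers by event month gives the per-month picture of knowledge vs. hallucination vs. abstention described in this thread.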
@CohereAI's Command XL Nightly provides the most recent correct answers of the 3 models! 🌟
✅It correctly answers Qs about March 9 & April 24, 2022, but hallucinates from May onwards.
❌Unlike the other models, it does not seem to abstain when asked about future info it doesn't know.
What are 3 concrete steps that can improve AI safety in 2025? 🤖⚠️
Our new paper, “In House Evaluation is Not Enough”, has 3 calls-to-action to empower independent evaluators:
1️⃣ Standardized AI flaw reports (an illustrative report format is sketched below).
2️⃣ AI flaw disclosure programs + safe harbors.
3️⃣ A coordination center for transferable AI flaws affecting many systems.
1/🧵
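The first call-to-action is easiest to picture as a machine-readable report format. The sketch below is purely illustrative; the field names are my assumptions, not the schema proposed in the paper.

```python
# Illustrative sketch of what a standardized AI flaw report could carry.
# Field names are assumptions for illustration, not the paper's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AIFlawReport:
    flaw_id: str                    # stable ID, e.g. assigned by a coordination center
    reporter: str                   # independent evaluator filing the report
    affected_systems: List[str]     # models/products where the flaw reproduces
    description: str                # what the flaw is and why it matters
    reproduction_steps: List[str]   # inputs/prompts needed to reproduce it
    severity: str                   # e.g. "low" / "medium" / "high"
    transferable: bool = False      # does it likely affect other systems too?
    disclosed_to: List[str] = field(default_factory=list)  # vendors notified so far

report = AIFlawReport(
    flaw_id="FLAW-2025-0001",
    reporter="independent-red-teamer",
    affected_systems=["model-A", "model-B"],
    description="Safety refusal can be bypassed with a role-play framing.",
    reproduction_steps=["Send prompt X", "Observe policy-violating output"],
    severity="high",
    transferable=True,
)
```

A shared, structured format like this is what would let a coordination center route transferable flaws (call-to-action 3) across affected developers.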
🌟Motivation🌟
Today, general-purpose AI (GPAI) serves 300M+ users globally, w/ diverse & unforeseen uses across modalities and languages.
➡️ We need third-party evaluation for its broad expertise, participation and independence, including from real users, academic researchers, white-hat hackers, and journalists.
2/
However, third-party evaluation currently faces key barriers:
✨New Preprint ✨ How are shifting norms on the web impacting AI?
We find:
📉 A rapid decline in the consenting data commons (the web)
⚖️ Differing access to data by company, due to crawling restrictions (e.g. 🔻26% for OpenAI's crawlers vs 🔻13% for Anthropic's)
⛔️ Robots.txt preference protocols are ineffective (a quick way to audit these restrictions is sketched below)
These precipitous changes will impact the availability and scaling of data for AI, affecting not only corporate developers but also non-profit and academic research.
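To see the crawling-restriction gap in practice, you can check a site's robots.txt against different crawler user-agents with Python's standard library. The domain and user-agent tokens below are examples; a real audit (as in the paper) sweeps thousands of domains and tracks changes over time.

```python
# Check which crawlers a site's robots.txt allows, using only the stdlib.
# Domain and user-agent strings are examples; a real audit would cover many
# domains and re-check them over time to measure the decline in access.
from urllib.robotparser import RobotFileParser

DOMAIN = "https://www.example.com"                           # placeholder domain
CRAWLERS = ["GPTBot", "anthropic-ai", "CCBot", "Googlebot"]  # example crawler tokens

rp = RobotFileParser()
rp.set_url(f"{DOMAIN}/robots.txt")
rp.read()  # fetches and parses the robots.txt file

for agent in CRAWLERS:
    allowed = rp.can_fetch(agent, f"{DOMAIN}/")
    print(f"{agent:>15}: {'allowed' if allowed else 'disallowed'}")
```

Note that robots.txt can only allow or disallow a named agent per path; it has no way to express finer preferences like "no AI training, but search indexing is fine", which is part of why it is a blunt preference protocol.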
A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.
⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even via the #OpenAI API) is simple and likely undetectable (see the finetuning-flow sketch below)
🤖 LLMs can auto-generate their own jailbreaks...
1/ 🧵
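For context on why API finetuning is within anyone's reach, this is roughly the standard, publicly documented OpenAI finetuning flow (file upload + job creation). Nothing here is specific to removing safety behavior; the file name and base model are placeholders.

```python
# The standard OpenAI finetuning flow: upload a JSONL file of chat examples,
# then launch a finetuning job. This is the generic public API flow the point
# above refers to; the file name and base model are placeholders.
from openai import OpenAI

client = OpenAI()

# 1) Upload training data (JSONL of {"messages": [...]} chat examples).
training_file = client.files.create(
    file=open("training_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2) Launch the finetuning job on a base chat model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# 3) Poll for status; the resulting model ID is used like any other model.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```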
It's been repeatedly shown that careful prompt rewording, roleplaying, and even just insisting can jailbreak Llama2-Chat/#ChatGPT into violating their usage policies.
@AIPanicLive documents many jailbreak / red-teaming efforts