Alex Tamkin

Feb 19 • 7 tweets • 4 min read

One of the reasons I think GPT-J is so cool is that its pretraining data is publicly available

This lets us ask questions that were impossible to answer for LLMs like GPT-3

For example: "did our model actually learn the task or was this example in the training data?"

1/

Case in point, a recent paper looks at few-shot performance on numerical tasks like arithmetic

arxiv.org/abs/2202.07206
by @yasaman_razeghi @rloganiv @nlpmattg @sameer_

2/

The question they ask is simple:

How does the frequency of a term in the training data (e.g. "23") impact performance on problems involving that term (e.g. "What is 23 times 18?")?

3/

If a model has learned to multiply correctly, the number of times it's seen the number "23" ideally shouldn't matter

Instead, they find very large effects!

4/
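
To make the question concrete, here is a minimal sketch of this kind of analysis: count how often each number appears in a corpus, then compare accuracy on problems involving rare versus frequent numbers. This is not the paper's method. The function names, the threshold, and the toy corpus and results are all made up for illustration; a real analysis would run over The Pile and actual model outputs.

```python
import re
from collections import Counter

def term_counts(corpus_lines):
    """Count how often each standalone integer token appears in the corpus."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(re.findall(r"\b\d+\b", line))
    return counts

def accuracy_by_frequency(results, counts, threshold):
    """Group (term, correct) pairs into rare/frequent terms and return accuracy per group."""
    groups = {"rare": [], "frequent": []}
    for term, correct in results:
        key = "frequent" if counts[term] >= threshold else "rare"
        groups[key].append(correct)
    # None marks a group with no examples, to avoid dividing by zero
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}

# Hypothetical toy data (NOT real results): a tiny "corpus" and made-up model outcomes
corpus = ["I am 23 years old", "23 people came", "call 23 now", "room 97 is empty"]
results = [("23", True), ("23", True), ("97", False), ("97", True)]

counts = term_counts(corpus)
summary = accuracy_by_frequency(results, counts, threshold=2)
print(summary)
```

If a model had truly learned the task, the two groups would be expected to show similar accuracy; a large gap is the kind of frequency effect the thread describes.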

Interpreting these results is subtle, and I expect some debate about the mechanism here (e.g. could this just be a word embeddings issue?)

But I think it's an example of some exciting questions you can ask when you have access to both an LLM and its training data

5/

Also check out earlier work that investigated term frequencies on symbolic tasks:

Are Pretrained Language Models Symbolic Reasoners Over Knowledge?
aclanthology.org/2020.conll-1.4…
@KassnerNora @benno_krojer @HinrichSchuetze

6/

Here's a link to the original paper again:
arxiv.org/abs/2202.07206

And a link to The Pile: GPT-J's training dataset!
pile.eleuther.ai

7/7


