Why do we treat train and test times so differently?
Why is one “training” and the other “in-context learning”?
Just take a few gradient steps at test time (a simple way to increase test-time compute) and you get SoTA on the ARC public validation set: 61%, matching the average human score! @arcprize
We investigate the existing idea of test-time training (TTT): you construct an auxiliary dataset based on your test inputs and update the model before making a prediction.
But it’s not clear what tasks to train on, what kind of inference to run afterwards, or what base model to start with.
We present an extensive set of ablations for the ARC challenge! We perform three analyses to answer how to do TTT, and what to do *before* and *after* it.
*What data is needed for TTT?*
We tried two different ways of generating TTT data: (1) an in-context learning format and (2) an end-to-end format. In ICL, we create leave-1-out tasks from the given test demonstrations. In E2E, we treat each i/o pair as an individual task.
We also apply geometric transformations to bootstrap more data; the figure above shows how ICL tasks are generated, and a minimal code sketch follows the list below. We update our models with LoRA using these generated tasks. We find that:
- ICL tasks win over E2E tasks!
- Data augmentation is crucial!
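Here is a minimal sketch of the ICL-format data generation, assuming ARC grids are numpy arrays; the task representation and helper names are our own illustration, not the paper’s exact code:

```python
import numpy as np

# Invertible geometric transformations used to bootstrap more tasks.
TRANSFORMS = [
    lambda g: g,                 # identity
    lambda g: np.rot90(g, k=1),  # 90-degree rotation
    lambda g: np.rot90(g, k=2),  # 180-degree rotation
    lambda g: np.rot90(g, k=3),  # 270-degree rotation
    lambda g: np.fliplr(g),      # horizontal flip
    lambda g: np.flipud(g),      # vertical flip
]

def leave_one_out_tasks(demos):
    """Turn the n test-time demonstrations into n ICL-format tasks:
    each demo becomes the query once, with the rest as in-context examples."""
    tasks = []
    for i in range(len(demos)):
        context = demos[:i] + demos[i + 1:]
        query_in, query_out = demos[i]
        tasks.append({"context": context, "query": query_in, "target": query_out})
    return tasks

def augment(tasks):
    """Apply each geometric transform to every grid of every task."""
    out = []
    for t in TRANSFORMS:
        for task in tasks:
            out.append({
                "context": [(t(x), t(y)) for x, y in task["context"]],
                "query": t(task["query"]),
                "target": t(task["target"]),
            })
    return out
```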
But should we train a *new* LoRA for every test task, or a *single shared* LoRA on a dataset generated from all test tasks?
We find that per-task LoRA is much better! (FT + TTT vs Shared-TTT)
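Below is a sketch of the per-task setup, reusing the helpers from the data-generation sketch above and assuming the Hugging Face `peft` library; `test_tasks`, `train_on`, and `predict` are hypothetical placeholders, and the checkpoint name and hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

predictions = {}
for task in test_tasks:  # hypothetical iterator over ARC test tasks
    # Reload the base model so every task starts from the same frozen weights.
    base = AutoModelForCausalLM.from_pretrained("path/to/finetuned-lm")  # assumed checkpoint
    config = LoraConfig(r=128, lora_alpha=16,  # illustrative hyperparameters
                        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(base, config)  # only the LoRA parameters are trainable
    ttt_data = augment(leave_one_out_tasks(task["demos"]))
    train_on(model, ttt_data)             # hypothetical: a few gradient steps
    predictions[task["id"]] = predict(model, task)  # hypothetical inference helper
```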
*What kind of inference after TTT?*
In ARC, we don’t have CoTs; hence, plain majority voting over samples can’t improve much.
We do what we did for TTT: we create few-shot tasks and transform them with *invertible* functions, giving us a bunch of transformed versions of the original task.
We feed the transformed inputs to the model and invert its outputs back, so majority voting has much more to work with. We name this “self-consistency under invertible transformations” (sketched after the list below).
- It is better than predicting with any single transformation!
- Hierarchical voting improves even more!
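Here is a minimal sketch of the voting procedure, assuming grids are numpy arrays; `model_predict` (sample a grid prediction from the model) and `transform_task` (apply a transform to every grid in a task) are hypothetical placeholders:

```python
from collections import Counter
import numpy as np

# Each transform is paired with its inverse so predictions can be mapped back.
INVERSES = [
    (lambda g: g,              lambda g: g),
    (lambda g: np.rot90(g, 1), lambda g: np.rot90(g, -1)),
    (lambda g: np.rot90(g, 2), lambda g: np.rot90(g, -2)),
    (lambda g: np.fliplr(g),   lambda g: np.fliplr(g)),  # flips are self-inverse
    (lambda g: np.flipud(g),   lambda g: np.flipud(g)),
]

def grid_key(g):
    """Hashable key so we can vote over numpy grids."""
    return (g.shape, g.tobytes())

def hierarchical_vote(task, n_samples=3):
    stage1, lookup = [], {}
    for t, t_inv in INVERSES:
        # Transform the task, sample predictions, invert them back.
        preds = [t_inv(model_predict(transform_task(task, t)))
                 for _ in range(n_samples)]
        for p in preds:
            lookup[grid_key(p)] = p
        # Stage 1: majority vote *within* this transformation.
        stage1.append(Counter(grid_key(p) for p in preds).most_common(1)[0][0])
    # Stage 2: majority vote *across* the per-transformation winners.
    return lookup[Counter(stage1).most_common(1)[0][0]]
```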
*What fine-tuning before TTT?*
You need to fine-tune a base LM first, but you don’t need much new data! A model fine-tuned on re-instantiations of the *training tasks* plus a few geometric transformations already gets good scores (rough sketch below).
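As a rough illustration of that recipe, reusing `augment` and `leave_one_out_tasks` from the first sketch; `load_arc_training_tasks` is a hypothetical loader, and “re-instantiation” is approximated here by geometric transforms only:

```python
# Build the fine-tuning set from the public ARC *training* split, not test inputs.
ft_data = []
for demos in load_arc_training_tasks():  # hypothetical: list of (input, output) pairs per task
    ft_data.extend(augment(leave_one_out_tasks(demos)))
```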
We tried lots of LM-generated synthetic data but, surprisingly, found that it didn’t help. Interestingly, TTT closes the gaps between different levels of base models.
*ARC Benchmark and Results*
We evaluated our final system on the full ARC public validation set. TTT improves everything!
Our fine-tuned LM jumps from 18% to 47% after TTT!
Combining it with the program-generation model from the BARC paper (@xu3kev), we get 58.5%.
We repeated our experiments with BARC’s fine-tuned model, released just a week ago, and observe a 61% score, matching the average human. Congrats on this concurrent work!
The MindsAI team @MindsAI_Jack achieves similar scores on *the private test set*, which is remarkable given the compute limits of the leaderboard. To our knowledge, they were also the first to use TTT on this benchmark.
*Limitations*
Thanks to the amazing team of collaborators who worked incredibly hard! This work wouldn’t have been possible without @MehulDamani2 @linluqiu @HanGuo97 and our advisors @yoonrkim and @jacobandreas.