Ekin Akyürek
Nov 10, 2024
Why do we treat train and test time so differently?

Why is one “training” and the other “in-context learning”?

Just take a few gradient steps at test time (a simple way to increase test-time compute) and get SoTA on the ARC public validation set: 61%, matching the average human score! @arcprize
We investigate the existing idea of test-time training (TTT): you construct an auxiliary dataset based on your test inputs and update the model before making a prediction.

But it’s not obvious what tasks to train on, what kind of inference to run, or what base model to start from.
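Roughly, the loop for one test task looks like this. A minimal sketch; `build_ttt_dataset`, `finetune_lora`, and `predict` are hypothetical placeholders for the pieces discussed below, not the actual code.

```python
# Test-time training in outline: adapt a copy of the model on data derived
# from the test task's own demonstrations, and only then predict. All helper
# functions here are hypothetical placeholders.

def ttt_predict(base_model, test_task):
    # 1) Build an auxiliary dataset from the task's demonstration pairs
    #    (e.g. leave-1-out tasks plus geometric augmentations; see below).
    aux_data = build_ttt_dataset(test_task.demonstrations)

    # 2) Update a task-specific copy of the model on that data (e.g. with LoRA).
    adapted_model = finetune_lora(base_model, aux_data)

    # 3) Answer the actual test input only after this update.
    return predict(adapted_model, test_task.test_input)
```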
We present an extensive set of ablations on the ARC challenge! We perform three analyses: how to do TTT, and what to do *before* and *after* TTT.
*What data is needed for TTT?*

We tried two different ways of generating TTT data: (1) an in-context learning (ICL) format and (2) an end-to-end (E2E) format. In ICL, we create leave-1-out tasks from the given test demonstrations. In E2E, we treat each input/output pair as an individual task.
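Concretely, a minimal sketch of the two formats, assuming a test task is just a list of (input grid, output grid) demonstration pairs; the `Example` class and helper names are illustrative, not the paper's actual code.

```python
# A minimal sketch of the two TTT data formats, assuming a test task is just a
# list of (input_grid, output_grid) demonstration pairs. The `Example` class
# and helper names are illustrative, not the paper's actual code.
from dataclasses import dataclass

Grid = list[list[int]]  # an ARC grid: a 2-D list of color indices


@dataclass
class Example:
    context: list[tuple[Grid, Grid]]  # in-context demonstration pairs
    query_input: Grid                 # input the model must transform
    query_output: Grid                # target output


def icl_tasks(demos: list[tuple[Grid, Grid]]) -> list[Example]:
    """Leave-1-out: each demo becomes the query; the rest form the context."""
    return [
        Example(context=demos[:i] + demos[i + 1:], query_input=x, query_output=y)
        for i, (x, y) in enumerate(demos)
    ]


def e2e_tasks(demos: list[tuple[Grid, Grid]]) -> list[Example]:
    """End-to-end: each input/output pair is its own zero-context task."""
    return [Example(context=[], query_input=x, query_output=y) for x, y in demos]
```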
We also applied geometric transformations to bootstrap the data; see how ICL tasks are generated in the figure above. We update our models with LoRA on these generated tasks. We find that:

- ICL tasks win over E2E tasks!

- Data augmentation is crucial!
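For a sense of what the augmentation looks like: a rough sketch of grid-level geometric transforms applied consistently to inputs and outputs; the exact transform set here is illustrative.

```python
# A rough sketch of grid-level geometric augmentations: each transform is
# applied consistently to both the input and the output grid of a demo pair.
# The exact transform set here is illustrative, not necessarily the paper's.
import numpy as np


def rot90(g):     return np.rot90(np.asarray(g), 1).tolist()
def rot180(g):    return np.rot90(np.asarray(g), 2).tolist()
def rot270(g):    return np.rot90(np.asarray(g), 3).tolist()
def flip_h(g):    return np.fliplr(np.asarray(g)).tolist()
def flip_v(g):    return np.flipud(np.asarray(g)).tolist()
def transpose(g): return np.asarray(g).T.tolist()

GEOMETRIC_TRANSFORMS = [rot90, rot180, rot270, flip_h, flip_v, transpose]


def augment(demos):
    """Return the original demo pairs plus one transformed copy per transform."""
    augmented = list(demos)
    for t in GEOMETRIC_TRANSFORMS:
        augmented.extend((t(x), t(y)) for x, y in demos)
    return augmented
```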
We update models with LoRA.

But should we train a *new* LoRA for every test task, or a *single shared* LoRA on a dataset generated from all test tasks?

We find that per-task LoRA is much better! (FT + TTT vs Shared-TTT)
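A minimal sketch of the two setups, assuming Hugging Face `peft` for the adapters; the hyperparameters are illustrative, and `finetune`, `predict`, and `generate_ttt_data` are hypothetical stand-ins for the training loop and the data pipeline sketched above.

```python
# Per-task vs. shared LoRA adaptation. LoraConfig / get_peft_model are real
# Hugging Face `peft` APIs; the hyperparameters and the finetune / predict /
# generate_ttt_data helpers are hypothetical placeholders, not the exact setup.
import copy

from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(r=64, lora_alpha=16, target_modules=["q_proj", "v_proj"])


def per_task_ttt(base_model, test_tasks):
    predictions = {}
    for task in test_tasks:
        # A fresh adapter per task, trained only on this task's augmented data.
        model = get_peft_model(copy.deepcopy(base_model), lora_cfg)
        finetune(model, generate_ttt_data([task]))
        predictions[task.task_id] = predict(model, task)
    return predictions


def shared_ttt(base_model, test_tasks):
    # A single adapter trained on data pooled from *all* test tasks.
    model = get_peft_model(base_model, lora_cfg)
    finetune(model, generate_ttt_data(test_tasks))
    return {task.task_id: predict(model, task) for task in test_tasks}
```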
*What kind of inference after TTT?*

In ARC, we don’t have CoTs, so plain majority voting can’t improve much.

We do what we did for TTT: we create few-shot tasks and transform them with *invertible* functions. Now we have a bunch of transformed versions of the original task’s input.
We feed the transformed inputs to the model and invert its outputs back. Now we can benefit more from majority voting. We name this “self-consistency under invertible transformations”.

- It is better than predicting with any single transformation!

- Hierarchical voting improves it even more!
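As a sketch: apply each invertible transform to the whole task, predict in the transformed space, map the prediction back with the inverse, and majority-vote. `model_predict` is a hypothetical stand-in for the adapted model, and the transform list is illustrative.

```python
# A flat-voting sketch of self-consistency under invertible transformations.
# `model_predict` is a hypothetical stand-in for sampling from the adapted
# model; the (transform, inverse) pairs mirror the geometric augmentations.
from collections import Counter

import numpy as np

TRANSFORM_PAIRS = [
    (lambda g: g, lambda g: g),                                            # identity
    (lambda g: np.rot90(g, 1).tolist(),  lambda g: np.rot90(g, -1).tolist()),
    (lambda g: np.rot90(g, 2).tolist(),  lambda g: np.rot90(g, -2).tolist()),
    (lambda g: np.fliplr(g).tolist(),    lambda g: np.fliplr(g).tolist()),   # self-inverse
    (lambda g: np.asarray(g).T.tolist(), lambda g: np.asarray(g).T.tolist()),
]


def vote_over_transforms(model_predict, demos, test_input):
    candidates = []
    for t, t_inv in TRANSFORM_PAIRS:
        # Transform every grid in the task, predict in the transformed space...
        t_demos = [(t(x), t(y)) for x, y in demos]
        pred = model_predict(t_demos, t(test_input))
        # ...then map the prediction back with the exact inverse.
        candidates.append(t_inv(pred))
    # Flat majority vote over the inverted candidates; a hierarchical variant
    # can vote within each transform first, then across transforms.
    counts = Counter(str(c) for c in candidates)
    best_key = counts.most_common(1)[0][0]
    return next(c for c in candidates if str(c) == best_key)
```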
*What fine-tuning before TTT?*

You need to fine-tune a base LM, but you don’t need much new data! A model fine-tuned on re-instantiations of the *training tasks* plus a few geometric transformations gets good scores.

We tried lots of LM-generated synthetic data but, surprisingly, it didn’t help. Interestingly, TTT closes the gap between models of different strengths.
*ARC Benchmark and Results*

We evaluated our final system on the full ARC public validation set. TTT improves everything!

Our fine-tuned LM jumps from 18% to 47% after TTT!

Combining it with a program-generation model from the BARC paper (@xu3kev), we get 58.5%.

We repeated our experiments with BARC’s fine-tuned model, released just a week ago, and observe a 61% score, matching the average human. Congrats to this concurrent work!
The MindsAI team @MindsAI_Jack achieves similar scores, but on the *private test set*, which is remarkable given the compute limits of the leaderboard. To our knowledge, they were also the first to use TTT on this benchmark.
*Limitations*
Thanks to the amazing team of collaborators who worked incredibly hard! This work wouldn’t have been possible without @MehulDamani2 @linluqiu @HanGuo97 and our advisors @yoonrkim and @jacobandreas.

