Ekin Akyürek
Research @ OpenAI | MIT PhD | exchanging algorithms with ai

Nov 10, 2024, 16 tweets

Why do we treat train and test times so differently?

Why is one “training” and the other “in-context learning”?

Just take a few gradient steps at test time, a simple way to increase test-time compute, and get SoTA on the ARC public validation set: 61%, the average human score! @arcprize

We investigate the existing idea of test-time training (TTT): you construct an auxiliary dataset based on your test inputs and update the model before making a prediction.
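Here's the core loop as a minimal Python sketch, assuming a HuggingFace-style causal LM; `build_aux_dataset` and `predict` are hypothetical helpers, and the hyperparameters are placeholders, not the paper's:

```python
import copy
import torch

def test_time_train(model, test_task, steps=32, lr=1e-4):
    model = copy.deepcopy(model)           # keep the base model untouched
    aux = build_aux_dataset(test_task)     # hypothetical: tasks built from test demos
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        batch = aux[step % len(aux)]
        loss = model(**batch).loss         # standard LM loss on the auxiliary task
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()
    return model

# adapted = test_time_train(base_model, task)
# prediction = predict(adapted, task)
```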

But it’s not clear what tasks to train on, what kind of inference to run, and what base model to start with.

We present an extensive set of ablations on the ARC challenge! We perform three analyses to answer how to do TTT, and what to do *before* and *after* TTT.

*What data is needed for TTT?*

We tried two ways of generating TTT data: (1) an in-context learning (ICL) format and (2) an end-to-end (E2E) format. In ICL, we create leave-one-out tasks from the given test demonstrations. In E2E, we treat each input/output pair as an individual task.

We also applied geometric transformations to bootstrap the data; see how ICL tasks are generated in the figure above, and the sketch after these bullets. We update our models with LoRA using these generated tasks. We find that:

- ICL wins over E2E tasks!

- Data augmentation is crucial!
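A minimal sketch of the leave-one-out construction and a small subset of the geometric augmentations (rotations and flips via numpy). Grids are assumed to be numpy arrays; the paper's full augmentation set is richer:

```python
import numpy as np

def leave_one_out_tasks(demos):
    """Each of the n demo (input, output) pairs becomes the query once;
    the remaining n-1 pairs form the in-context demonstrations."""
    return [{"context": demos[:i] + demos[i + 1:], "query": demos[i]}
            for i in range(len(demos))]

def augment(task):
    """Bootstrap extra tasks by applying one geometric transform to every
    grid in the task (a small subset of the paper's augmentations)."""
    out = [task]
    for t in (np.rot90, np.fliplr, np.flipud):
        out.append({
            "context": [(t(x), t(y)) for x, y in task["context"]],
            "query": (t(task["query"][0]), t(task["query"][1])),
        })
    return out
```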

We updated models with LoRA.

But should we train a *new* LoRA for every test task or train a *single shared* LoRA with a dataset generated from all test tasks?

We find that per-task LoRA is much better! (FT + TTT vs Shared-TTT)
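In PEFT-style terms, per-task TTT looks roughly like this; `train`, `build_ttt_dataset`, and `predict` are hypothetical helpers, and the rank and target modules are illustrative:

```python
import copy
from peft import LoraConfig, get_peft_model

def per_task_ttt(base_model, test_tasks):
    """Train one fresh LoRA adapter per test task (the winning setup),
    instead of a single shared adapter over data from all tasks."""
    predictions = {}
    for task in test_tasks:
        cfg = LoraConfig(r=128, target_modules=["q_proj", "v_proj"])
        model = get_peft_model(copy.deepcopy(base_model), cfg)  # base weights stay frozen
        train(model, build_ttt_dataset(task))    # hypothetical training helper
        predictions[task["id"]] = predict(model, task)
    return predictions
```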

*What kind of inference after TTT?*

In ARC we don’t have CoTs, so you can’t improve much with plain majority voting.

We do what we did in TTT: we create few-shot tasks and transform them with *invertible* functions. Now we have a bunch of transformed versions of the original task’s inputs.

We feed in the transformed inputs and invert the outputs back. Now we can benefit more from majority voting. We name this “self-consistency under invertible transformations” (sketched after the findings below).

- It is better than predicting with any single transformation!

- Hierarchical voting improves even more!
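A rough sketch of the flat-voting version, with `apply_to_grids` and `predict` as hypothetical helpers; the hierarchical variant would first vote within each transformation, then across them:

```python
from collections import Counter
import numpy as np

# Invertible grid transforms paired with their inverses (illustrative subset).
TRANSFORMS = [
    (lambda g: g, lambda g: g),                              # identity
    (np.fliplr, np.fliplr),                                  # flips invert themselves
    (lambda g: np.rot90(g, 1), lambda g: np.rot90(g, -1)),   # rotation / un-rotation
]

def to_key(grid):
    # hashable representation of a grid so predictions can be counted
    return tuple(map(tuple, np.asarray(grid)))

def self_consistent_predict(model, task):
    votes = Counter()
    for fwd, inv in TRANSFORMS:
        transformed = apply_to_grids(task, fwd)   # hypothetical helper
        pred = predict(model, transformed)        # model answers in transformed space
        votes[to_key(inv(pred))] += 1             # map back to original space, then vote
    return votes.most_common(1)[0][0]
```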

*What fine-tuning before TTT?*

You need to FT a base LM, but you don’t need much new data! A model fine-tuned on re-instantiations of the *training tasks* plus a few geometric transformations gets good scores.

We tried lots of LM-based synthetic data but surprisingly found that it didn’t help. Interestingly, TTT closes the gap between models of different strengths.

*ARC Benchmark and Results*

We evaluated our final system on the full ARC public validation set. TTT improves everything!

Our fine-tuned LM jumps from 18% to 47% after TTT!

Combining it with a program-generator model from the BARC paper (@xu3kev), we get 58.5%.

We repeat our experiments with BARC’s fine-tuned model, released just a week ago, and observe a 61% score, matching the average human. Congrats to this concurrent work!

The MindsAI team @MindsAI_Jack achieves similar scores, but on *the private test set*, which is remarkable given the compute limits of the leaderboard. To our knowledge, they were also the first to use TTT on this benchmark.

*Limitations*

Thanks to the amazing team of collaborators who worked incredibly hard! This work wouldn’t have been possible without @MehulDamani2 @linluqiu @HanGuo97 and our advisors @yoonrkim and @jacobandreas.
