Why do we treat train and test times so differently?
Why is one “training” and the other “in-context learning”?
Just take a few gradient steps at test time (a simple way to increase test-time compute) and you get SoTA on the ARC public validation set: 61%, matching the average human score! @arcprize
We investigate the existing idea of test-time training (TTT): you construct an auxiliary dataset based on your test inputs and update the model before making a prediction.
But it’s not clear what tasks to train on, what kind of inference to run afterwards, or what base model to start with.
We present an extensive set of ablations for the ARC challenge! We perform three analyses to answer how to do TTT, and what to do *before* and *after* it.
*What data is needed for TTT?*
We tried two different ways of generating TTT data: (1) an in-context learning format and (2) an end-to-end format. In ICL, we create leave-1-out tasks from the given test demonstrations. In E2E, we treat each i/o pair as an individual task.
We also apply geometric transformations to bootstrap more data; the figure above shows how ICL tasks are generated, and a minimal code sketch follows the list below. We update our models with LoRA using these generated tasks. We find that:
- ICL tasks win over E2E tasks!
- Data augmentation is crucial!
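Here is a minimal sketch of the ICL-format data generation, assuming ARC grids are numpy arrays; the task representation and helper names are our own illustration, not the paper’s exact code:

```python
import numpy as np

# Invertible geometric transformations used to bootstrap more tasks.
TRANSFORMS = [
    lambda g: g,                 # identity
    lambda g: np.rot90(g, k=1),  # 90-degree rotation
    lambda g: np.rot90(g, k=2),  # 180-degree rotation
    lambda g: np.rot90(g, k=3),  # 270-degree rotation
    lambda g: np.fliplr(g),      # horizontal flip
    lambda g: np.flipud(g),      # vertical flip
]

def leave_one_out_tasks(demos):
    """Turn the n test-time demonstrations into n ICL-format tasks:
    each demo becomes the query once, with the rest as in-context examples."""
    tasks = []
    for i in range(len(demos)):
        context = demos[:i] + demos[i + 1:]
        query_in, query_out = demos[i]
        tasks.append({"context": context, "query": query_in, "target": query_out})
    return tasks

def augment(tasks):
    """Apply each geometric transform to every grid of every task."""
    out = []
    for t in TRANSFORMS:
        for task in tasks:
            out.append({
                "context": [(t(x), t(y)) for x, y in task["context"]],
                "query": t(task["query"]),
                "target": t(task["target"]),
            })
    return out
```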
But should we train a *new* LoRA for every test task, or a *single shared* LoRA on a dataset generated from all test tasks?
We find that per-task LoRA is much better! (FT + TTT vs Shared-TTT)
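Below is a sketch of the per-task setup, reusing the helpers from the data-generation sketch above and assuming the Hugging Face `peft` library; `test_tasks`, `train_on`, and `predict` are hypothetical placeholders, and the checkpoint name and hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

predictions = {}
for task in test_tasks:  # hypothetical iterator over ARC test tasks
    # Reload the base model so every task starts from the same frozen weights.
    base = AutoModelForCausalLM.from_pretrained("path/to/finetuned-lm")  # assumed checkpoint
    config = LoraConfig(r=128, lora_alpha=16,  # illustrative hyperparameters
                        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(base, config)  # only the LoRA parameters are trainable
    ttt_data = augment(leave_one_out_tasks(task["demos"]))
    train_on(model, ttt_data)             # hypothetical: a few gradient steps
    predictions[task["id"]] = predict(model, task)  # hypothetical inference helper
```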
*What kind of inference after TTT?*
In ARC, we don’t have CoTs; hence, plain majority voting over samples can’t improve much.
We do what we did for TTT: we create few-shot tasks and transform them with *invertible* functions, giving us a bunch of transformed versions of the original task.
We feed the transformed inputs to the model and invert its outputs back, so majority voting has much more to work with. We name this “self-consistency under invertible transformations” (sketched after the list below).
- It is better than predicting with any single transformation!
- Hierarchical voting improves even more!
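Here is a minimal sketch of the voting procedure, assuming grids are numpy arrays; `model_predict` (sample a grid prediction from the model) and `transform_task` (apply a transform to every grid in a task) are hypothetical placeholders:

```python
from collections import Counter
import numpy as np

# Each transform is paired with its inverse so predictions can be mapped back.
INVERSES = [
    (lambda g: g,              lambda g: g),
    (lambda g: np.rot90(g, 1), lambda g: np.rot90(g, -1)),
    (lambda g: np.rot90(g, 2), lambda g: np.rot90(g, -2)),
    (lambda g: np.fliplr(g),   lambda g: np.fliplr(g)),  # flips are self-inverse
    (lambda g: np.flipud(g),   lambda g: np.flipud(g)),
]

def grid_key(g):
    """Hashable key so we can vote over numpy grids."""
    return (g.shape, g.tobytes())

def hierarchical_vote(task, n_samples=3):
    stage1, lookup = [], {}
    for t, t_inv in INVERSES:
        # Transform the task, sample predictions, invert them back.
        preds = [t_inv(model_predict(transform_task(task, t)))
                 for _ in range(n_samples)]
        for p in preds:
            lookup[grid_key(p)] = p
        # Stage 1: majority vote *within* this transformation.
        stage1.append(Counter(grid_key(p) for p in preds).most_common(1)[0][0])
    # Stage 2: majority vote *across* the per-transformation winners.
    return lookup[Counter(stage1).most_common(1)[0][0]]
```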
*What fine-tuning before TTT?*
You need to fine-tune a base LM first, but you don’t need much new data! A model fine-tuned on re-instantiations of the *training tasks* plus a few geometric transformations already gets good scores (rough sketch below).
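As a rough illustration of that recipe, reusing `augment` and `leave_one_out_tasks` from the first sketch; `load_arc_training_tasks` is a hypothetical loader, and “re-instantiation” is approximated here by geometric transforms only:

```python
# Build the fine-tuning set from the public ARC *training* split, not test inputs.
ft_data = []
for demos in load_arc_training_tasks():  # hypothetical: list of (input, output) pairs per task
    ft_data.extend(augment(leave_one_out_tasks(demos)))
```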
We tried lots of LM-generated synthetic data but, surprisingly, found that it didn’t help. Interestingly, TTT closes the gaps between different levels of base models.
*ARC Benchmark and Results*
We evaluated our final system on the full ARC public validation set. TTT improves everything!
Our fine-tuned LM jumps from 18% to 47% after TTT!
Combining it with the program-generation model from the BARC paper (@xu3kev), we get 58.5%.
We repeated our experiments with BARC’s fine-tuned model, released just a week ago, and observe a 61% score, matching the average human. Congrats on this concurrent work!
The MindsAI team @MindsAI_Jack achieves similar scores on *the private test set*, which is remarkable given the compute limits of the leaderboard. To our knowledge, they were also the first to use TTT on this benchmark.
*Limitations*
Thanks to the amazing team of collaborators who worked incredibly hard! This work wouldn’t have been possible without @MehulDamani2 @linluqiu @HanGuo97 and our advisors @yoonrkim and @jacobandreas.