🚀📢 GPT models have blown our minds with their astonishing capabilities. But do they truly acquire the ability to perform reasoning tasks that humans find easy? NO⛔️
We investigate the limits of Transformers on compositional tasks, both *empirically* and *theoretically* 🔥
We find that GPT-3, ChatGPT, and GPT-4 cannot fully solve compositional tasks even with in-context learning, fine-tuning, or scratchpads. To understand when models succeed, and the nature of their failures, we represent a model’s reasoning as a computation graph.
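To make the idea concrete, here is a minimal sketch (our illustration, not the paper’s code) of how a compositional task, multi-digit multiplication in this case, can be laid out as a computation graph: nodes are primitive operations, edges carry intermediate results, and the root holds the final answer. All names (`Node`, `multiplication_graph`) are ours.

```python
# A minimal sketch of a computation graph for multi-digit multiplication.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    op: str                                     # primitive operation name
    inputs: list = field(default_factory=list)  # ids of parent nodes
    value: Optional[int] = None                 # correct intermediate result

def multiplication_graph(x: int, y: int):
    """Build the computation graph for x * y from digit-level partial products."""
    g = {}
    xd = [int(d) for d in str(x)][::-1]         # least-significant digit first
    yd = [int(d) for d in str(y)][::-1]
    for i, d in enumerate(xd):                  # leaves: the input digits
        g[f"x{i}"] = Node("input", value=d)
    for j, d in enumerate(yd):
        g[f"y{j}"] = Node("input", value=d)
    terms = []                                  # layer 1: one-digit products, place-shifted
    for i in range(len(xd)):
        for j in range(len(yd)):
            nid = f"mul_{i}_{j}"
            g[nid] = Node("mul_shift", [f"x{i}", f"y{j}"], xd[i] * yd[j] * 10 ** (i + j))
            terms.append(nid)
    out = terms[0]                              # layer 2: iterated summation of partial products
    for k, t in enumerate(terms[1:]):
        nid = f"add_{k}"
        g[nid] = Node("add", [out, t], g[out].value + g[t].value)
        out = nid
    return g, out

graph, out = multiplication_graph(47, 36)
assert graph[out].value == 47 * 36              # the root node holds the final answer
```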
We show that Transformers' successes are heavily linked to having seen significant portions of the required computation graph during training! This finding, that models reduce multi-step reasoning to subgraph matching, raises questions about "sparks of AGI" claims 🦄
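One way to picture the subgraph-matching hypothesis is the hedged sketch below (our proxy, not the paper’s exact metric), reusing `Node` and `multiplication_graph` from the previous sketch: it asks how much of a test example’s computation graph already appeared, node by node, in the training examples.

```python
# A rough proxy for subgraph coverage between training and test computation graphs.
from collections import Counter

def node_signatures(graph):
    """One signature per node: the primitive op, its input values, and its output --
    the smallest piece of the graph a model could memorize and reuse."""
    return {(n.op, tuple(graph[p].value for p in n.inputs), n.value)
            for n in graph.values()}

def subgraph_coverage(test_graph, train_graphs):
    """Fraction of the test graph's node-level subgraphs seen during training."""
    seen = Counter()
    for g in train_graphs:
        seen.update(node_signatures(g))
    sigs = node_signatures(test_graph)
    return sum(1 for s in sigs if seen[s] > 0) / len(sigs)

# Hypothetical demo: coverage is high when problem sizes match training
# and collapses for larger problems, mirroring the accuracy drop.
train = [multiplication_graph(a, b)[0] for a in range(10, 30) for b in range(10, 30)]
print(subgraph_coverage(multiplication_graph(23, 17)[0], train))      # 1.0, seen in training
print(subgraph_coverage(multiplication_graph(4723, 9186)[0], train))  # much lower
```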
To understand where Transformers fail, we analyze their errors at different layers of the computation graph. We find that while models can execute single reasoning steps accurately, they struggle to plan and compose multiple steps into a correct overall solution.
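Here is one simplified way (ours, not the paper’s exact analysis) to operationalize that: compare the model’s predicted value at every node of the gold graph, group by graph depth, and label each node as correct, a local error (right inputs, wrong output), a propagated error (fed wrong inputs), or a "restoration" (right output despite wrong inputs). It again reuses `multiplication_graph` from above; the `predicted` dictionary is a hypothetical stand-in for values parsed from a scratchpad.

```python
# Attribute errors to graph depth, given gold node values and model-predicted values.
from collections import Counter

def node_depth(graph, nid):
    node = graph[nid]
    return 0 if not node.inputs else 1 + max(node_depth(graph, p) for p in node.inputs)

def errors_by_depth(gold, predicted):
    """gold: computation graph with correct values; predicted: model's value per node id."""
    stats = Counter()
    for nid, node in gold.items():
        inputs_ok = all(predicted.get(p) == gold[p].value for p in node.inputs)
        output_ok = predicted.get(nid) == node.value
        if output_ok:
            label = "correct" if inputs_ok else "restoration"
        else:
            label = "local_error" if inputs_ok else "propagated_error"
        stats[(node_depth(gold, nid), label)] += 1
    return stats

# Hypothetical demo: take the gold graph, corrupt one early step, keep the rest.
gold, out = multiplication_graph(47, 36)
predicted = {nid: n.value for nid, n in gold.items()}
predicted["mul_0_0"] = 40                       # a single-step mistake near the leaves
print(errors_by_depth(gold, predicted))
```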
We also provide theoretical insights into why performance on compositional tasks degrades as problem size grows: whether a task requires many independent applications of a function or iterated applications of the same function, the probability of a fully correct answer decays exponentially with the number of steps.
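A back-of-the-envelope illustration of that point, with an assumed (not measured) per-step error rate: if each of the n steps in the chain fails independently with probability eps, the chance of an end-to-end correct answer is (1 - eps)**n, which shrinks exponentially in n.

```python
# Exponential decay of end-to-end correctness under a fixed per-step error rate.
eps = 0.05  # assumed per-step error rate (hypothetical)
for n in (1, 5, 10, 25, 50, 100):
    print(f"steps={n:3d}  P(fully correct) = {(1 - eps) ** n:.3f}")
# steps=  1  P(fully correct) = 0.950
# steps=100  P(fully correct) = 0.006
```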