You've probably seen results showing impressive few-shot performance of very large language models (LLMs). Do those results mean that LLMs can reason? Well, maybe, but maybe not. Few-shot performance is highly correlated with pretraining term frequency. arxiv.org/abs/2202.07206
We focus on numerical reasoning (addition, multiplication, and unit conversion). We use the same formats and tasks previously used to show impressive few-shot performance, but we systematically evaluate every number and correlate accuracy with each number's frequency in the pretraining data.
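The analysis above boils down to two steps: count how often each number appears in the pretraining corpus, then rank-correlate those counts with per-number few-shot accuracy. A minimal sketch of that idea (not the paper's actual pipeline; the tiny corpus, the number list, and the accuracy values are all fabricated for illustration):

```python
import re
from collections import Counter

def number_frequencies(corpus_docs, numbers):
    """Count how often each target number appears as a token in the corpus."""
    counts = Counter({n: 0 for n in numbers})
    for doc in corpus_docs:
        for tok in re.findall(r"\d+", doc):
            if tok in counts:
                counts[tok] += 1
    return counts

def _ranks(values):
    """1-based average ranks; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed from scratch (no SciPy needed)."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical inputs: a toy "pretraining corpus" and made-up per-number accuracies.
corpus = ["2 plus 2 is 4", "24 times 2 is 48", "the year 1999", "2 and 24 again"]
numbers = ["2", "24", "1999"]
freqs = number_frequencies(corpus, numbers)
accuracy = {"2": 0.95, "24": 0.80, "1999": 0.40}  # fabricated for illustration

rho = spearman([freqs[n] for n in numbers], [accuracy[n] for n in numbers])
print(f"frequencies: {dict(freqs)}, spearman rho: {rho:.2f}")
```

A high rho here would mean the model does better exactly on the numbers it saw most often in pretraining, which is the pattern the thread is pointing at.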