🧵 2/8 The evaluation problem for reasoning 📊
2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on its test set. Today, models with ~3B parameters surpass 85%, and larger ones exceed 95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning, and how much is #pattern_recognition, inadvertent data #contamination, or #overfitting?