🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
We tested increasingly weak and even pathological training signals on math problems:
✅ Ground truth answers
🔲 Format – reward if the response contains “\boxed{}”
😈 Incorrect answers as labels
🎲 Randomly reward regardless of answer
All of these dramatically improved Qwen2.5-Math's scores, including on AMC and AIME!!!
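For the curious, here is a rough sketch of what those four reward signals could look like in code. This is not our training code: the function names, the exact-string answer match, and the 0.5 reward probability are all assumptions made for illustration.

```python
import random


def extract_boxed(response):
    """Return the contents of the last \\boxed{...} in a response, or None."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def ground_truth_reward(response, label):
    # Assumption: exact string match; a real verifier would check math equivalence.
    return 1.0 if extract_boxed(response) == label else 0.0


def format_reward(response):
    # Reward any response containing a complete \boxed{...}, right or wrong.
    return 1.0 if extract_boxed(response) is not None else 0.0


def incorrect_label_reward(response, wrong_label):
    # Same check as ground truth, but the label is deliberately wrong.
    return ground_truth_reward(response, wrong_label)


def random_reward(rng, p=0.5):
    # Ignores the response entirely; p=0.5 is a hypothetical choice.
    return 1.0 if rng.random() < p else 0.0
```

Note that only the first reward depends on whether the answer is actually right; the random one never even looks at the response.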