PhD student @stanfordAILab | Prev: AI resident @metaai, @vectorinst, @CambridgeMLG
Mar 20 • 8 tweets • 3 min read
AlpacaEval is now length-controlled (LC)!
✅ highest correlation with Chat Arena (0.98)
✅ no reannotation
✅ simple interpretation: win rate if model length = baseline length
✅ robust to length gamification
At 0.98 correlation, that’s essentially evaluation on Arena, but in 3 min and for <$10.
Key: predict what the win rate would be if model length=baseline length
We:
1. fit a GLM: model | length | instruction -> preference
2. predict preference conditioned on baseline length
Benefits:
✅easily extendible to other biases
✅nice math properties
✅no reannotation needed
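The two steps above can be sketched in a few lines. This is a minimal illustration, not the AlpacaEval implementation: the data is synthetic, and the GLM here is a plain logistic regression on the length difference only (the real method also conditions on model and instruction).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical annotated comparisons: for each instruction, the model's
# output length minus the baseline's, plus a synthetic preference label.
n = 500
length_diff = rng.normal(0.0, 200.0, size=n)      # model_len - baseline_len
model_skill = 0.4                                  # latent quality edge
length_bias = 0.003                                # annotators like longer answers
logits = model_skill + length_bias * length_diff
prefs = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))  # 1 = model preferred

# Step 1: fit a GLM (logistic regression): preference ~ length difference.
glm = LogisticRegression().fit(length_diff.reshape(-1, 1), prefs)

# Step 2: predict preference at length_diff = 0, i.e. the counterfactual
# "what would the win rate be if model length = baseline length?"
raw_win_rate = prefs.mean()
lc_win_rate = glm.predict_proba([[0.0]])[0, 1]
print(f"raw WR = {raw_win_rate:.3f}, length-controlled WR = {lc_win_rate:.3f}")
```

Because the fitted length coefficient is set to act on a difference of zero, a model cannot raise its length-controlled win rate just by producing longer outputs.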
Jan 6, 2023 • 9 tweets • 8 min read
I played with @AnthropicAI assistant (AA) and compared it to @OpenAI ChatGPT
TLDR: both are similar but AA is
+ Harder to jailbreak
+ Tries to be more helpful
+ Follows what we ask for more closely
+ ~Better for writing in English
- Worse at code
- Worse in French
- Longer responses
🧵
**Coding**
CGPT is better
Quantitative (LeetCode hard problems in Python/C/JavaScript):
- CGPT: 3/3
- AA: 1/3 (only Python correct)
Qualitative:
- CGPT: more reliable/concise/efficient
- AA: more comments + emphasizes explainability
Both are wrong when asked for an impossible algorithm. 2/8
Nov 28, 2022 • 11 tweets • 10 min read
#NeurIPS2022
What are ideal representations for self-sup. learning (SSL)?
🤓We give simple optimality conditions and use them to improve/understand/derive SSL methods!