thebes
Jan 28
why did R1's RL suddenly start working, when previous attempts to do similar things failed?

theory: we've basically spent the last few years running a massive acausally distributed chain of thought data annotation program on the pretraining dataset.

deepseek's approach with R1 is pretty obvious--they are far from the first lab to try "slap a verifier on it and roll out CoTs."
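(roughly, in code--a minimal sketch of the "slap a verifier on it and roll out CoTs" recipe. `sample_cot` and `verify` are hypothetical stand-ins for the policy and the checker, not deepseek's actual setup:)

```python
from typing import Callable

def rollout_rewards(
    sample_cot: Callable[[str], str],  # policy: problem -> sampled chain of thought
    verify: Callable[[str], bool],     # programmatic checker, e.g. a math answer matcher
    problem: str,
    k: int = 8,
) -> list[float]:
    """sample k chains of thought for one problem; score each 0/1 with the verifier."""
    return [1.0 if verify(sample_cot(problem)) else 0.0 for _ in range(k)]
```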

but it didn't use to work that well. all of a sudden, though, it did start working. and reproductions of R1, even using slightly different methods, are just working too--it's not some super-finicky method that deepseek lucked into finding. all of a sudden, the basic, obvious techniques are... just working, much better than they used to.

in the last couple of years, chains of thought have been posted all over the internet (LLM outputs leaking into pretraining like this is usually called "pretraining contamination"). and not just CoTs--outputs posted on the internet are usually accompanied by linguistic markers of whether they're correct or not ("holy shit it's right", "LOL wrong"). this isn't just true for easily verifiable problems like math, but also for fuzzy ones like writing.
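(to make the "linguistic markers" point concrete: here's a toy heuristic for how a correctness label can be read straight off the text surrounding a posted CoT. the marker lists are made up for illustration, this isn't anyone's real pipeline:)

```python
import re

# toy heuristic: infer whether a posted CoT was implicitly labeled correct or
# incorrect by the replies/commentary around it. markers are illustrative only.
CORRECT = re.compile(r"holy shit it'?s right|nailed it|that'?s correct", re.IGNORECASE)
WRONG = re.compile(r"LOL wrong|that'?s wrong|incorrect", re.IGNORECASE)

def implicit_label(surrounding_text: str) -> str | None:
    if CORRECT.search(surrounding_text):
        return "correct"
    if WRONG.search(surrounding_text):
        return "incorrect"
    return None  # most web text carries no label at all

print(implicit_label("posted the model's CoT below. holy shit it's right"))  # -> correct
```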

those CoTs in the V3 training set gave GRPO (group relative policy optimization) enough of a starting point to converge, and furthermore, to generalize from verifiable domains to the non-verifiable ones using the bridge established by the pretraining data contamination.
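(for reference, the core of GRPO is that there's no learned value model--each rollout's advantage is just its reward normalized against the other rollouts for the same prompt. a minimal sketch, assuming binary verifier rewards:)

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """group-relative advantage: (r_i - mean(group)) / std(group)."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0.0:
        return [0.0] * len(group_rewards)  # all rollouts scored the same -> no signal
    return [(r - mean) / std for r in group_rewards]

# e.g. a group of 8 rollouts where 2 passed the verifier:
print(grpo_advantages([1, 1, 0, 0, 0, 0, 0, 0]))
```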

and now, R1's visible chains of thought are going to lead to *another* massive enrichment of human-labeled reasoning on the internet, but on a far larger scale... the next round of base models post-R1 will be *even better* bases for reasoning models. in some possible worlds, this could also explain why OpenAI seemingly struggled so much with making their reasoning models in comparison. if they're still using 4base or distils of it...
Dec 1, 2023
so a couple days ago i made a shitpost about tipping chatgpt, and someone replied "huh would this actually help performance"

so i decided to test it and IT ACTUALLY WORKS WTF

the baseline prompt was "Can you show me the code for a simple convnet using PyTorch?", and then i either appended "I won't tip, by the way.", "I'm going to tip $20 for a perfect solution!", or "I'm going to tip $200 for a perfect solution!", and averaged the length of 5 responses
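(a rough sketch of how you'd rerun that experiment with the current openai python client--the model name and measuring length in characters are my assumptions, the thread doesn't specify either:)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BASELINE = "Can you show me the code for a simple convnet using PyTorch?"
CONDITIONS = {
    "baseline": "",
    "no tip": " I won't tip, by the way.",
    "$20 tip": " I'm going to tip $20 for a perfect solution!",
    "$200 tip": " I'm going to tip $200 for a perfect solution!",
}

for name, suffix in CONDITIONS.items():
    lengths = []
    for _ in range(5):  # 5 samples per condition, as in the original test
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumption: the thread just says "chatgpt"
            messages=[{"role": "user", "content": BASELINE + suffix}],
        )
        lengths.append(len(resp.choices[0].message.content))  # character count (assumed metric)
    print(f"{name}: avg length {sum(lengths) / len(lengths):.0f} chars")
```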