Updated our 2022 paper on the alignment problem (ICLR). More direct evidence now supports our hypotheses: alignment faking, manipulative and obfuscated reward hacking, situational awareness, internal goals and power-seeking. Key updates below. 1/🧵 arxiv.org/abs/2209.00626
Just yesterday, OpenAI confirmed one of our hypotheses: obfuscated reward hacking. GPT knowingly hacks its reward function 'in the wild'; OpenAI penalizes it; so GPT keeps doing it in secret.
Tired of waiting 💤 while your model trains? Try skipping points that are already learned, not learnable, or not worth learning! Robustly reduces the training steps 🏎 needed to reach the same accuracy on big web-scraped data by >10x!
📜ICML 2022 paper: arxiv.org/abs/2206.07137
Training on big web-scraped data can take ages 💤 But lots of compute and time are wasted on redundant and noisy points that are already learned, not learnable, or not even worth learning.
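The selection rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `select_points` and the plain-array batching are assumptions. The idea is to score each candidate point by its reducible loss (current training loss minus the loss of a small model trained on holdout data) and keep only the top-k. Already-learned points (low training loss) and unlearnable/noisy points (high holdout loss too) both score low and get skipped.

```python
import numpy as np

def select_points(train_losses, holdout_losses, k):
    """Pick the k points with the highest reducible loss.

    reducible loss = training loss - irreducible (holdout) loss.
    - already learned: low training loss  -> low score, skipped
    - not learnable:   high holdout loss  -> low score, skipped
    Returns indices of the selected points, highest score first.
    """
    reducible = np.asarray(train_losses) - np.asarray(holdout_losses)
    return np.argsort(reducible)[-k:][::-1]

# Toy batch of 3 points: the first is already learned, the second is
# noisy (high loss even on the holdout model), the third is learnable.
idx = select_points([0.1, 2.0, 1.5], [0.1, 1.9, 0.2], k=2)
print(idx)  # the learnable point ranks first
```

In practice this scoring runs per mini-batch, so the gradient step is only taken on the selected subset.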
➡️Paper: medrxiv.org/content/10.110…
As Europe enters a third wave of COVID, policy-makers balance controlling infections with the sweeping socioeconomic costs of interventions. To do so, we must know how effective individual interventions were at controlling COVID. 2/
Jul 27, 2020
Excited to share a major update on how effective 8 interventions have been against COVID-19 transmission in 41 countries.