PhD student at the University of Sussex. Into language models, reinforcement learning, and Bayesian inference.
Feb 21, 2023 • 13 tweets • 6 min read
You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance. arxiv.org/abs/2302.08582
Reinforcement learning from human feedback (RLHF) is the secret sauce behind InstructGPT, ChatGPT and Claude. It’s a technique for finetuning pretrained language models (LMs) to maximize a reward function expressing human preferences, e.g. being a helpful and harmless assistant.
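To make the pretraining-time idea concrete, here is a minimal sketch of one way a preference signal can be folded into ordinary pretraining data, in the spirit of the conditional-training objective studied in the paper. The token names, the reward threshold, and the helper names are illustrative assumptions, not the exact setup from the paper.

```python
# Minimal sketch: conditional-training-style pretraining with a preference signal.
# A reward model scores each pretraining segment; a control token summarizing the
# score is prepended, and the LM is then trained with the usual next-token loss.
# Token names and the 0.0 threshold are illustrative assumptions.

from typing import Callable, List

GOOD, BAD = "<|good|>", "<|bad|>"

def annotate_corpus(segments: List[str],
                    reward_fn: Callable[[str], float],
                    threshold: float = 0.0) -> List[str]:
    """Prepend a control token reflecting the reward model's score for each segment."""
    annotated = []
    for seg in segments:
        token = GOOD if reward_fn(seg) >= threshold else BAD
        annotated.append(f"{token} {seg}")
    return annotated

# The annotated corpus is used with a standard LM objective; at inference time,
# conditioning on <|good|> steers generation toward preferred behaviour.
```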
Nov 21, 2022 • 14 tweets • 7 min read
RL with KL penalties – a powerful approach to aligning language models with human preferences – is better seen as Bayesian inference. A thread about our paper (with @EthanJPerez and @drclbuckley) to be presented at #emnlp2022 🧵arxiv.org/pdf/2205.11275… 1/14

RL with KL penalties is the powerful algorithm behind RL from human feedback (RLHF), the methodology heavily used by @OpenAI (InstructGPT), @AnthropicAI (helpful assistant) and @DeepMind (Sparrow) for aligning LMs with human preferences such as being helpful and avoiding harm. 2/14
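A compact way to state the claim (a sketch; notation may differ slightly from the paper): the KL-penalized objective

J(\pi) = \mathbb{E}_{x \sim \pi}[r(x)] - \beta \, \mathrm{KL}(\pi, \pi_0)

is maximized by

\pi^*(x) \propto \pi_0(x) \exp(r(x)/\beta),

which has exactly the form of a Bayesian posterior: the pretrained LM \pi_0 plays the role of the prior, \exp(r(x)/\beta) plays the role of a likelihood, and optimizing the KL-penalized RL objective amounts to variational inference targeting this posterior.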