For agents to improve over time, they can’t afford to forget what they’ve already mastered.
We found that supervised fine-tuning (SFT) forgets more than RL when training on a new task!
Want to find out why? 👇
We swept hyperparameters exhaustively for both methods and plotted the Pareto frontier of new-task performance versus retention of prior capabilities.
Our finding holds for LLMs, robotics foundation models, and even a 3-layer MLP.
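To make that kind of comparison concrete, here is a minimal sketch of how sweep results can be turned into a Pareto plot. The numbers, config points, and axis choices below are illustrative placeholders, not our actual sweep or results.

```python
import numpy as np
import matplotlib.pyplot as plt

def pareto_frontier(points):
    """Return the points not dominated on (new-task performance, retention).

    A point is dominated if some other point is >= on both axes
    and strictly > on at least one.
    """
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return pts[keep]

# Hypothetical sweep results: each row is one hyperparameter config,
# columns are (new-task performance, prior-task retention).
sft_points = np.array([[0.80, 0.55], [0.72, 0.62], [0.65, 0.70]])
rl_points  = np.array([[0.78, 0.75], [0.70, 0.82], [0.62, 0.88]])

for name, pts in [("SFT", sft_points), ("RL", rl_points)]:
    front = pareto_frontier(pts)
    front = front[np.argsort(front[:, 0])]   # sort for a clean frontier line
    plt.scatter(pts[:, 0], pts[:, 1], alpha=0.5, label=f"{name} configs")
    plt.plot(front[:, 0], front[:, 1], marker="o", label=f"{name} Pareto frontier")

plt.xlabel("New-task performance")
plt.ylabel("Prior-task retention")
plt.legend()
plt.show()
```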
What if an LLM could update its own weights?
Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs.
Self-editing is learned via RL, using the updated model’s downstream performance as reward.
Self-edits (SE) are generated in token space and consist of training data and, optionally, optimization parameters. The policy is trained with RL: the actions are the generated self-edits, and the reward is the updated model's performance on a task relevant to the input context.
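A minimal sketch of that outer loop in Python. The helpers `generate_self_edit`, `finetune_on`, and `evaluate_on` are hypothetical placeholders supplied by the caller, and the reward-filtered update shown here is just one simple way to realize the RL objective, not necessarily the exact optimizer SEAL uses.

```python
import copy

def seal_outer_loop(model, contexts, tasks,
                    generate_self_edit, finetune_on, evaluate_on,
                    n_candidates=4, n_rounds=3):
    """Illustrative SEAL-style training loop (a sketch, not the released code).

    Caller-supplied helpers (all hypothetical):
      generate_self_edit(model, context) -> self_edit  (synthetic training data
                                                        plus optional hyperparams,
                                                        generated in token space)
      finetune_on(model, self_edits)     -> updated model
      evaluate_on(model, task)           -> scalar score on the downstream task
    """
    for _ in range(n_rounds):
        kept = []  # self-edits whose inner update improved downstream performance
        for context, task in zip(contexts, tasks):
            baseline = evaluate_on(model, task)
            for _ in range(n_candidates):
                # Action: the policy generates a self-edit for this context.
                self_edit = generate_self_edit(model, context)
                # Inner loop: apply the self-edit to a throwaway copy of the model.
                updated = finetune_on(copy.deepcopy(model), [self_edit])
                # Reward: did the updated model improve on the relevant task?
                if evaluate_on(updated, task) > baseline:
                    kept.append(self_edit)
        # Outer loop: reinforce self-edits that helped, here by fine-tuning the
        # policy on the kept generations (reward-filtered behavior cloning);
        # other RL optimizers would fit the same interface.
        if kept:
            model = finetune_on(model, kept)
    return model
```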