wh
eng primarily, ml mostly, research previously

Aug 22, 9 tweets

New ByteDance Seed reasoning RL paper, relating RL to self-supervised learning.

The paper is pretty dense with all the dual-task derivation so this is basically my notes.

The main idea is to learn two tasks. Given input A, learn output B. To verify the quality of the output B, the model tries to reconstruct the input as A'. The quality of the output is then evaluated based on how similar A and A' are.

Clearly, this is very difficult. For example, if this were math and the output is some number x, it is impossible to reconstruct the problem from x alone.

The solution is to decompose the input A into a known and an unknown part. During the reverse process, the model has to reconstruct the unknown part of A using both the output B and the known part of A.

This is all very handwavy, so the simple example is to take the sum of 2 numbers: A + B = C (here A and B are the two addends, not the input/output from before). The model is then provided with the output C and 1 of the numbers (e.g. A) and asked what B is.
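To make the sum example concrete, here is a minimal sketch of how such a dual check could score several rollouts. The function names and numbers are made up for illustration, and exact arithmetic stands in for the backward pass that the model itself would perform in the paper's setting.

```python
def dual_reconstruct(c: int, known_a: int) -> int:
    """Stand-in for the backward task: recover the hidden addend from the answer.

    Assumption: in the real method the model does this step itself;
    plain subtraction is used here only to keep the sketch runnable.
    """
    return c - known_a


def dual_reward(c: int, known_a: int, hidden_b: int) -> float:
    """Reward 1.0 if the hidden addend is recovered exactly, else 0.0."""
    return 1.0 if dual_reconstruct(c, known_a) == hidden_b else 0.0


# Hypothetical problem: 17 + 25 = ?, with 17 known and 25 hidden in the backward task.
known_a, hidden_b = 17, 25

# Hypothetical answers produced by four forward rollouts.
rollout_answers = [42, 41, 42, 45]

rewards = [dual_reward(c, known_a, hidden_b) for c in rollout_answers]
print(rewards)
```

Only the rollouts whose answer lets the hidden number be recovered get a reward, and no ground-truth label for C was needed to compute it.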

These are clearer examples.
Math: Each rollout leads to a different answer. DuPO then tries to derive the hidden values of the question from each rollout's answer. The rollout whose answer best reverses back to the question is rewarded.

Machine Translation: This is a lot more straightforward since dual task learning has already been used in MT. Basically, measure the string similarity between the original text and the reversed translation.
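As a rough sketch of the translation case: score each candidate by how close its back-translation is to the source. The sentences below are invented, the back-translations are hard-coded rather than produced by a model, and `difflib.SequenceMatcher` is only a stand-in similarity measure, not necessarily the metric the paper uses.

```python
from difflib import SequenceMatcher


def round_trip_score(source: str, back_translation: str) -> float:
    """Character-level similarity in [0, 1] between source and back-translation."""
    return SequenceMatcher(None, source, back_translation).ratio()


source = "the cat sat on the mat"

# Hypothetical back-translations of two candidate forward translations.
back_translations = {
    "candidate_1": "the cat sat on the mat",
    "candidate_2": "a cat is sitting on a rug",
}

scores = {
    name: round_trip_score(source, text)
    for name, text in back_translations.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```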

(i have no idea why this figure is in the appendix when it is a much clearer depiction of DuPO)

Strong results in Machine Translation and Math. They also show that it can work directly on base models.

Interestingly, this even works purely at inference without any training. It can be used as a best-of-N judge, where the backward accuracy is used to select the best trajectory.
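A minimal sketch of that best-of-N selection, under toy assumptions: the problem, the candidate answers, and the `backward_score` function are all hypothetical, and the backward step is done with exact arithmetic instead of a model call.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], backward_score: Callable[[T], float]) -> T:
    """Return the candidate with the highest backward (reconstruction) score."""
    return max(candidates, key=backward_score)


# Hypothetical problem: h * 4 + 5 = ?, where h = 3 is hidden in the backward task.
hidden_h, known_factor, known_offset = 3, 4, 5


def backward_score(answer: float) -> float:
    """Reconstruct h from the answer and the known parts; closer to the true h is better."""
    reconstructed_h = (answer - known_offset) / known_factor
    return -abs(reconstructed_h - hidden_h)


# Hypothetical final answers from N = 3 sampled trajectories.
candidate_answers = [17, 19, 16]

best = best_of_n(candidate_answers, backward_score)
print(best)
```

The selection needs only the input that was already given, so it can run with a frozen model.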

DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
arxiv.org/pdf/2508.14460

I think it's a really interesting paper. I will say that as someone who hadn't even heard of the "dual task" literature, the first part of the paper was extremely confusing and a little abstract. It made more sense after seeing the examples.
