ECHO (Environment Cross-entropy Hybrid Objective) demo support just landed in OpenEnv, and it's a cool idea: train agents to learn a world model almost for free
original paper by @VaishShrivas, Piero Kauffman, Ahmed Awadallah and @DimitrisPapail @MSFTResearch
when an agent acts in an env, a rollout has 2 sides: what the agent writes and what the env writes back
standard agent RL only trains on the agent's side
e.g. train a CLI agent with GRPO: the reward shapes the action tokens, while the env's responses get masked out of the loss
all that ground truth about what actually happened gets thrown away
ECHO proposes using that part too instead of discarding it
on top of the usual RL loss on actions, it adds a small cross-entropy loss on the env's tokens, so the model also learns to predict what the env does
L = GRPO(actions) + λ · CE(observations)
and this is almost free: those tokens already went through the same forward pass and their logits are already computed, so it needs no extra rollouts and no teacher model
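to make the hybrid objective concrete, here's a minimal sketch in plain Python. it is not the paper's or OpenEnv's implementation: the function name `echo_loss`, its arguments, and the `lam=0.1` default are all made up for illustration, and the RL term is a bare policy-gradient stand-in for GRPO (no ratio clipping, no KL penalty). it only shows that both terms can be read off the same per-token log-probs from a single forward pass.

```python
import math

def echo_loss(logprobs, is_action, advantage, lam=0.1):
    """Hybrid loss for one rollout (illustrative sketch, hypothetical names).

    logprobs  : log-prob the model assigned to each token that actually occurred
    is_action : True for agent-written tokens, False for env-written tokens
    advantage : scalar advantage for this rollout (e.g. group-normalized reward)
    lam       : weight on the observation cross-entropy term (value is arbitrary)
    """
    act = [lp for lp, a in zip(logprobs, is_action) if a]
    obs = [lp for lp, a in zip(logprobs, is_action) if not a]
    # Simplified policy-gradient stand-in for the GRPO term on action tokens.
    rl = -advantage * sum(act) / max(len(act), 1)
    # Cross-entropy on env tokens = mean negative log-likelihood.
    ce = -sum(obs) / max(len(obs), 1)
    return rl + lam * ce, rl, ce

# Toy rollout: two action tokens followed by two observation tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
is_action = [True, True, False, False]

total, rl, ce = echo_loss(logprobs, is_action, advantage=1.0)

# With zero advantage the RL term vanishes but the CE term still contributes.
total_zero_adv, rl_zero_adv, _ = echo_loss(logprobs, is_action, advantage=0.0)
```

the second call is the interesting one: with a zero advantage (which is what group-normalized advantages give you when every rollout in the group gets the same reward) the action term contributes nothing, but the observation term is still nonzero, so the rollout still produces a training signal.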
you get a world model as a side effect, even failed rollouts turn into signal, and the gains are real:
up to 2.3x faster training and TerminalBench 2.0 pass@1 roughly doubles
to learn more about the idea check out the article by one of the paper's authors (@DimitrisPapail): x.com/DimitrisPapail…
concretely, OpenEnv now lets you tag each token as either an agent action or an env observation, and set a world-model coefficient (the λ above)
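a rough sketch of what per-token tagging can look like, assuming a rollout is stored as a list of (role, text) turns. this is not OpenEnv's actual API: `tag_tokens`, the "agent"/"env" role labels, and the whitespace tokenizer are all hypothetical stand-ins.

```python
def tag_tokens(turns, tokenize=str.split):
    """Flatten (role, text) turns into tokens plus a parallel action mask.

    Hypothetical helper: role must be "agent" or "env", and `tokenize`
    defaults to whitespace splitting as a stand-in for a real tokenizer.
    """
    tokens, is_action = [], []
    for role, text in turns:
        if role not in ("agent", "env"):
            raise ValueError(f"unknown role: {role!r}")
        piece = tokenize(text)
        tokens.extend(piece)
        # Every token in a turn inherits that turn's role.
        is_action.extend([role == "agent"] * len(piece))
    return tokens, is_action

# Toy CLI rollout: the agent issues commands, the env prints results.
turns = [
    ("agent", "ls /tmp"),
    ("env", "a.txt b.txt"),
    ("agent", "cat a.txt"),
    ("env", "hello"),
]
tokens, is_action = tag_tokens(turns)
```

the mask is the only extra bookkeeping needed: positions where `is_action` is True go to the RL term, the rest go to the cross-entropy term, scaled by the world-model coefficient.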