Introducing Harness-1, a 20B search agent trained with a state-externalizing harness.
> frontier-level long-horizon search, rivaling Opus-4.6 and outperforming GPT-5.4
> Context-1-level cost and latency
> externalizes candidates, evidence, verification, and search history
> open-source
[1/N] I’ve been wondering:
maybe search agents are bad at search partly because we make them do all the paperwork in their head.
So I tried a simple idea:
externalize the search state, then train the model to use that harness.
The result is Harness-1: a 20B search agent that can match or even beat much larger frontier models on hard long-horizon search tasks.
[2/N] The usual search-agent setup is basically:
search → read → search → read → keep appending everything to the transcript.
At some point the model is not just “searching” anymore.
It is also being asked to be a memory system, a note taker, a verifier, and a librarian.
[3/N] This gets especially weird for RL.
The final reward can tell you whether the episode worked, but it often does not tell you why it failed.
Was it a bad search?
Forgotten evidence?
Missing verification?
Poor curation?
Or the agent just losing track of what it had already seen?
[4/N] Harness-1 tries to separate two jobs: making the decisions and keeping the books.
The model still makes the semantic decisions:
what to search, what to read, what to keep, what to verify, when to stop.
But the harness maintains the recoverable state around those decisions.
[5/N] Concretely, the harness keeps a working memory with:
candidate docs,
curated evidence,
importance tags,
search history,
evidence links,
verification records,
dedup/compression,
and context-budget markers.
So the agent is not just talking to a search box. It is operating over a workspace.
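To make the workspace idea concrete, here is a minimal sketch of what such a working memory could look like. This is not the Harness-1 code: the class name `Workspace`, every field name, and the character-based budget are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical working memory; all names here are illustrative assumptions."""

    context_budget: int = 8000  # assumed budget, counted in characters
    candidates: dict[str, str] = field(default_factory=dict)      # doc_id -> full text
    evidence: dict[str, str] = field(default_factory=dict)        # doc_id -> curated snippet
    importance: dict[str, str] = field(default_factory=dict)      # doc_id -> importance tag
    search_history: list[str] = field(default_factory=list)       # queries issued so far
    evidence_links: dict[str, list[str]] = field(default_factory=dict)  # claim -> doc_ids
    verifications: dict[str, bool] = field(default_factory=dict)  # claim -> verified?

    def add_candidate(self, doc_id: str, text: str) -> bool:
        """Store a candidate once; return False if it was already seen (dedup)."""
        if doc_id in self.candidates:
            return False
        self.candidates[doc_id] = text
        return True

    def curate(self, doc_id: str, snippet: str, tag: str, claim: str) -> None:
        """Keep a snippet as evidence, tag it, and link it to the claim it supports."""
        self.evidence[doc_id] = snippet
        self.importance[doc_id] = tag
        self.evidence_links.setdefault(claim, []).append(doc_id)

    def budget_used(self) -> int:
        """Characters of curated evidence currently held."""
        return sum(len(snippet) for snippet in self.evidence.values())

    def over_budget(self) -> bool:
        """Context-budget marker: True once curated evidence exceeds the budget."""
        return self.budget_used() > self.context_budget
```

The point of the sketch: the model decides what goes in, but the bookkeeping (dedup, links, budget) lives in plain data structures outside the transcript.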
[6/N] I think this changes what RL is actually learning.
Instead of training the model to survive a giant append-only transcript, we train it to use a structured search interface:
search, curate, revisit, verify, and submit.
Much closer to how I’d want a search agent to work.
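A rough sketch of what a structured action space like this might look like from the harness side. Only the five action names come from the thread. The `ToyHarness` class, the `step` signature, and the tiny in-memory corpus are hypothetical stand-ins, not the real interface.

```python
from dataclasses import dataclass, field

# Toy corpus standing in for a real search backend (an assumption for illustration).
CORPUS = {
    "d1": "Harness-1 is a 20B search agent.",
    "d2": "Context-1 is a search agent baseline.",
}


@dataclass
class ToyHarness:
    """Hypothetical harness: the model picks actions, the harness keeps the state."""

    history: list = field(default_factory=list)  # queries issued so far
    kept: dict = field(default_factory=dict)     # doc_id -> text the model chose to keep
    verified: set = field(default_factory=set)   # doc_ids the model has verified
    done: bool = False

    def step(self, action: str, arg: str = ""):
        if action == "search":
            # Record the query, then return ids of documents containing it.
            self.history.append(arg)
            return [d for d, text in CORPUS.items() if arg.lower() in text.lower()]
        if action == "curate":
            # Keep a document in the workspace.
            self.kept[arg] = CORPUS[arg]
            return sorted(self.kept)
        if action == "revisit":
            # Re-read something already kept, without searching again.
            return self.kept.get(arg)
        if action == "verify":
            # Only kept documents can be marked as verified.
            if arg in self.kept:
                self.verified.add(arg)
            return arg in self.verified
        if action == "submit":
            # End the episode and return the verified set as the answer.
            self.done = True
            return sorted(self.verified)
        raise ValueError(f"unknown action: {action}")
```

Each action reads or writes explicit state, so a failed episode can at least be inspected step by step: what was searched, what was kept, what was verified.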
[7/N] A fun part: this was not trained with a huge amount of task data.
Harness-1 uses 899 filtered SFT trajectories and RL on 3,453 queries.
The point is not “less data is always enough.”
The point is that a lot of the behavioral prior can live in the harness.
[8/N] The result that made me most excited is transfer.
Harness-1 improves over Context-1 by +7.9 recall points on source-family benchmarks.
But on held-out transfer benchmarks, the gain is +17.0 points.
That’s the part that made the idea feel real to me.
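For readers unsure what a gain in "recall points" means, here is a toy illustration. It assumes recall is the percentage of relevant documents the agent ends up returning, which may differ from the exact metric in the paper. The document ids and numbers below are made up and are not the reported results.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Percentage of relevant documents that were retrieved (assumed definition)."""
    if not relevant:
        return 0.0
    return 100.0 * len(retrieved & relevant) / len(relevant)


# Toy data, not from the paper.
relevant = {"d1", "d2", "d3", "d4"}
baseline = recall({"d1", "d9"}, relevant)        # 1 of 4 relevant docs found
improved = recall({"d1", "d2", "d3"}, relevant)  # 3 of 4 relevant docs found

# A gain in "points" is the difference between the two percentages.
gain_points = improved - baseline
```

In this toy case the gain is 50 points, since recall goes from 25% to 75%.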
[9/N] The ablations were also pretty revealing.
When we disable the harness mechanisms, the model does not just lose some information.
It changes behavior: shallower searching, less reading and verification, worse final curation.
So the harness is not just engineering glue.
[10/N] My takeaway:
for search agents, “the model” is not the whole learning system.
The interface matters.
The memory layout matters.
The action space matters.
The harness matters.
If we want RL to teach better search behavior, we should probably stop making the model do all the paperwork in its head.
Paper 📄: arxiv.org/abs/2606.02373
Code 💻: github.com/pat-jj/harness…
Model 🤗: huggingface.co/pat-jj/harness…
HF Paper: huggingface.co/papers/2606.02…
Huge thanks to @trychroma for fully supporting this work, and to @tinkerapi for the training infra!
Huge shoutout to my awesome collaborators @zhiyiscs @HammadTime @kellyhongsn @PatrickXu565299 @SunJiashuo36 !!
