Patrick Jiang
Jun 6 · 14 tweets · 5 min read
Introducing Harness-1, a 20B search agent trained with a state-externalizing harness.

> frontier-level long-horizon search, rivaling Opus-4.6 and outperforming GPT-5.4

> Context-1-level cost and latency

> externalizes candidates, evidence, verification, and search history

> open-source
[1/N] I’ve been wondering:

maybe search agents are bad at search partly because we make them do all the paperwork in their head.

So I tried a simple idea:

externalize the search state, then train the model to use that harness.

The result is Harness-1: a 20B search agent that can match or even beat much larger frontier AI on hard long-horizon search tasks.
[2/N] The usual search-agent setup is basically:

search → read → search → read → keep appending everything to the transcript.

At some point the model is not just “searching” anymore.

It is also being asked to be a memory system, a note taker, a verifier, and a librarian.
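
To make the append-only pattern concrete, here is a minimal sketch of that loop. It is an illustration, not the code of any real baseline: `search_fn`, `read_fn`, and the fixed query list are hypothetical stand-ins (a real agent would choose each query itself).

```python
def append_only_episode(queries, search_fn, read_fn):
    """Toy append-only agent loop: everything goes into one flat transcript.

    search_fn(query) -> list of doc ids; read_fn(doc_id) -> text.
    Both are hypothetical stand-ins for a real search backend.
    """
    transcript = []
    for query in queries:
        transcript.append(f"SEARCH: {query}")
        for doc_id in search_fn(query):
            # No dedup and no curation: a doc returned twice is read twice,
            # and the model has to notice the repeat on its own.
            transcript.append(f"READ {doc_id}: {read_fn(doc_id)}")
    return transcript
```

Nothing in this loop tracks what was already seen or what counts as evidence, so all of that bookkeeping falls on the model reading the transcript.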
[3/N] This gets especially weird for RL.
The final reward can tell you whether the episode worked, but it often does not tell you why it failed.

Was it a bad search?
Forgotten evidence?
Missing verification?
Poor curation?
Or the agent just losing track of what it had already seen?
[4/N] Harness-1 tries to separate these two jobs.

The model still makes the semantic decisions:
what to search, what to read, what to keep, what to verify, when to stop.

But the harness maintains the recoverable state around those decisions.
[5/N] Concretely, the harness keeps a working memory with:
candidate docs,
curated evidence,
importance tags,
search history,
evidence links,
verification records,
dedup/compression,
and context-budget markers.

So the agent is not just talking to a search box. It is operating over a workspace.
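
One way to picture such a workspace is the rough sketch below. It is not the Harness-1 implementation: every class, field, and method name is hypothetical, and the whitespace token count is a crude placeholder for a real tokenizer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Candidate:
    doc_id: str
    text: str
    importance: str = "unrated"  # hypothetical tag values


@dataclass
class WorkingMemory:
    token_budget: int
    candidates: dict[str, Candidate] = field(default_factory=dict)
    evidence: dict[str, list[str]] = field(default_factory=dict)  # claim -> doc ids
    search_history: list[str] = field(default_factory=list)
    verified: dict[str, bool] = field(default_factory=dict)  # claim -> outcome
    tokens_used: int = 0

    def record_search(self, query, results):
        """Log the query and add unseen docs; return the ids that were new."""
        self.search_history.append(query)
        new_ids = []
        for doc_id, text in results:
            if doc_id in self.candidates:
                continue  # dedup: already a candidate
            self.candidates[doc_id] = Candidate(doc_id, text)
            self.tokens_used += len(text.split())  # crude token estimate
            new_ids.append(doc_id)
        return new_ids

    def link_evidence(self, claim, doc_id, importance="high"):
        """Tag a candidate and link it to the claim it supports."""
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        self.candidates[doc_id].importance = importance
        self.evidence.setdefault(claim, []).append(doc_id)

    def record_verification(self, claim, ok):
        self.verified[claim] = ok

    def budget_remaining(self):
        """Context-budget marker: tokens left before compression is needed."""
        return self.token_budget - self.tokens_used
```

The point of the sketch is only that candidates, evidence links, search history, verification records, and budget live in explicit fields the agent can query, instead of being buried in a transcript.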
[6/N] I think this changes what RL is actually learning.

Instead of training the model to survive a giant append-only transcript, we train it to use a structured search interface:
search, curate, revisit, verify, and submit.

Much closer to how I’d want a search agent to work.
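
As a rough illustration of what a structured action space like this could look like, here is a toy dispatcher over the five actions named above. The state layout, `new_state`, `step`, and `search_fn` are all assumptions made for the sketch, not the released Harness-1 interface.

```python
from enum import Enum


class Action(Enum):
    SEARCH = "search"
    CURATE = "curate"
    REVISIT = "revisit"
    VERIFY = "verify"
    SUBMIT = "submit"


def new_state():
    """Hypothetical harness-side state; the model never has to memorize it."""
    return {"history": [], "seen": set(), "kept": [], "verified": set(), "done": False}


def step(state, action, arg, search_fn):
    """Apply one model-chosen action and return the observation for the model."""
    if action is Action.SEARCH:
        state["history"].append(arg)
        fresh = []
        for doc_id in search_fn(arg):
            if doc_id not in state["seen"]:  # only surface unseen docs
                state["seen"].add(doc_id)
                fresh.append(doc_id)
        return {"new_docs": fresh}
    if action is Action.CURATE:
        if arg in state["seen"] and arg not in state["kept"]:
            state["kept"].append(arg)
        return {"kept": list(state["kept"])}
    if action is Action.REVISIT:
        # Recover state from the harness instead of rereading a transcript.
        return {"history": list(state["history"]), "kept": list(state["kept"])}
    if action is Action.VERIFY:
        if arg in state["kept"]:
            state["verified"].add(arg)
        return {"verified": sorted(state["verified"])}
    if action is Action.SUBMIT:
        state["done"] = True
        return {"answer": list(state["kept"])}
    raise ValueError(f"unknown action: {action}")
```

In this toy version the model only picks the action and its argument; dedup, history, and the kept set are maintained by the harness between steps.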
[7/N] A fun part: this was not trained with a huge amount of task data.

Harness-1 uses 899 filtered SFT trajectories and RL on 3,453 queries.

The point is not “less data is always enough.”

The point is that a lot of the behavioral prior can live in the harness.
[8/N] The result that made me most excited is transfer.

Harness-1 improves over Context-1 by +7.9 recall points on source-family benchmarks.

But on held-out transfer benchmarks, the gain is +17.0 points.

That’s the part that made the idea feel real to me.
[9/N] The ablations were also pretty revealing.

When we disable the harness mechanisms, the model does not just lose some information.

It changes behavior: more shallow searching, less reading / verification, worse final curation.

So the harness is not just engineering glue.
[10/N] My takeaway:

for search agents, “the model” is not the whole learning system.

The interface matters.
The memory layout matters.
The action space matters.
The harness matters.

If we want RL to teach better search behavior, we should probably stop making the model do all the paperwork in its head.
Huge thanks to @trychroma for fully supporting this work, and to @tinkerapi for the training infra!
Huge shoutout to my awesome collaborators @zhiyiscs @HammadTime @kellyhongsn @PatrickXu565299 @SunJiashuo36 !!
