Chuanyang Jin
phd @JohnsHopkins | amazon ai fellow | past: @meta (FAIR) @MIT @nyu_courant

May 20, 10 tweets

What are users thinking during their interactions with LLMs?

We introduce ThoughtTrace — the first large-scale dataset that captures what users think during real-world human–AI conversations, not just what they type.
→ 10,174 thought annotations
→ 2,155 multi-turn conversations, 17,058 turns
→ 1,058 users
→ 20 LLMs

These thoughts improve user behavior prediction (+41.7%) and model alignment (+25.6%).
This opens a new paradigm of user-centric LLM research. Full details in the thread 🧶

Read our paper: arxiv.org/abs/2605.20087
Check out our project website: thoughttrace-project.github.io

Conversational AI has reached billions of users, yet every dataset captures only what people say, never what they think.

ThoughtTrace pairs each turn with the user’s own latent thoughts: 🟦 reasons for sending a prompt and 🟧 reactions to the assistant's response.
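
To make the pairing concrete, here is a minimal sketch of how one annotated turn might be represented. The schema is hypothetical: the class name and the fields `user_message`, `assistant_response`, `reason`, and `reaction` are illustrative assumptions, not the released dataset's actual format, and the example turn is made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedTurn:
    """One conversation turn plus the user's latent thoughts (hypothetical schema)."""
    user_message: str
    assistant_response: str
    reason: Optional[str] = None    # why the user sent this prompt
    reaction: Optional[str] = None  # what the user thought of the response


def count_thoughts(turns: List[AnnotatedTurn]) -> int:
    """Count the reason and reaction annotations present across turns."""
    return sum((t.reason is not None) + (t.reaction is not None) for t in turns)


# Made-up example turn, for illustration only.
example = AnnotatedTurn(
    user_message="Can you shorten this paragraph?",
    assistant_response="Here is a shorter version: ...",
    reason="The draft is over the word limit.",
    reaction="Shorter, but it dropped the key point.",
)
print(count_thoughts([example]))
```

Keeping both thought fields optional reflects that a turn may carry a reason, a reaction, both, or neither; that is a design guess rather than a documented property of the data.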

ThoughtTrace is long-horizon and diverse.

Median of 8 turns per conversation, whereas existing datasets like WildChat and LMSYS-Chat-1M skew shorter at 2 turns per conversation. It spans 7 broad domains and 36 subtopics, with no single category dominating.

Real users, real tasks, real depth.

Are thoughts just paraphrased messages? No.

UMAP shows message↔reason and reaction↔next-message pairs have much larger semantic shifts than consecutive messages.

Thoughts are a distinct, complementary signal — not redundant with transcripts.

Can frontier LLMs just infer the thought from context?

GPT, Gemini, and Claude all struggle:
- Reasons: 2.93 / 5
- Reactions: 2.54 / 5

Latent thoughts carry information that no amount of context can recover. Explicit annotations matter.

Thoughts are diverse and stage-dependent.

7 reason types, 5 reaction types.
→ Task Motivation dominates early turns
→ Task Continuation takes over later
→ Explicit Affirmation steadily rises as conversations converge

Utility 1: Predicting the next user message.

History-only: 21.6
Thought-augmented: 30.6 → +41.7% relative gain across GPT, Gemini, and Opus.

User simulators get dramatically better when they model what users think, not only what they type.
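
The +41.7% figure is a relative gain, which can be checked from the two scores quoted above. A small sketch of that arithmetic (the function name `relative_gain` is ours, introduced only for illustration):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement over the baseline as a percentage of the baseline."""
    return (improved - baseline) / baseline * 100.0


history_only = 21.6       # score with conversation history only
thought_augmented = 30.6  # score when thoughts are added

gain = relative_gain(history_only, thought_augmented)
print(f"+{gain:.1f}%")  # prints "+41.7%"
```

The absolute difference is 9.0 points; dividing by the 21.6 baseline gives the relative figure.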

Utility 2: Model alignment via DPO.

Thought-guided rewrites on Arena-Hard beat:
Base Qwen3.5-4B by +25.6%
WildChat by +6.6%
Message-guided rewrites by +4.5%

Thoughts give models actionable alignment signals by surfacing dissatisfaction that users never spell out.

ThoughtTrace opens a new modality for AI research:
→ user modeling beyond utterances
→ training signals from latent thoughts
→ evaluation grounded in subjective experience

📄Paper: arxiv.org/abs/2605.20087
🤗Data: huggingface.co/datasets/SCAI-…
💻Code: github.com/thoughttrace-p…
🔍More examples: thoughttrace-project.github.io/examples.html
w/ @binze_li @JackXie492016 @thecatfangs @tli104 @ShayneRedford @HXGuu3 @maximillianc_ @tianminshu 🙏
Fun collaboration with @GoogleResearch @MIT @jhuclsp
