Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Chuanyang Jin

@chuanyang_jin

May 20 • 10 tweets • 5 min read • Read on X

Scrolly

What are users thinking during their interactions with LLMs?

We introduce ThoughtTrace — the first large-scale dataset that captures what users think during real-world human–AI conversations, not just what they type.
→ 10,174 thought annotations
→ 2,155 multi-turn conversations, 17,058 turns
→ 1,058 users
→ 20 LLMs

These thoughts improve user behavior prediction (+41.7%) and model alignment (+25.6%).
This opens a new paradigm of user-centric LLM research. Full information in the thread 🧶

Read our paper: arxiv.org/abs/2605.20087
Check our project website: thoughttrace-project.github.io

Conversational AI has reached billions of users, yet every dataset captures only what people say, never what they think.

ThoughtTrace pairs each turn with the user’s own latent thought: 🟦reasons for sending a prompt 🟧 reactions to the assistant's response.

ThoughtTrace is long-horizon and diverse.

Median 8 turns/conv, while existing datasets like WildChat and LMSYS-Chat-1M skew shorter with 2 turns/conv. 7 broad domains, 36 subtopics, no single category dominating.

Real users, real tasks, real depth.

Are thoughts just paraphrased messages? No.

UMAP shows message↔reason and reaction↔next-message pairs have much larger semantic shifts than consecutive messages.

Thoughts are a distinct, complementary signal — not redundant with transcripts.

Can frontier LLMs just infer the thought from context?

GPT, Gemini, and Claude all struggle:
- Reasons: 2.93 / 5
- Reactions: 2.54 / 5

Latent thoughts carry information that no amount of context can recover. Explicit annotations matter.

Thoughts are diverse and stage-dependent.

7 reason types, 5 reaction types.
→ Task Motivation dominates early turns
→ Task Continuation takes over later
→ Explicit Affirmation steadily rises as conversations converge

Utility 1: Predicting the next user message.

History-only: 21.6
Thought-augmented: 30.6 → +41.7% relative gain across GPT, Gemini, Opus.

User simulators get dramatically better when they model what users think, not only what they type.

Utility 2: Model alignment via DPO.

Thought-guided rewrites on Arena-Hard beat:
Base Qwen3.5-4B by +25.6%
WildChat by +6.6%
Message-guided rewrites by +4.5%

Thoughts give models actionable alignment signals by surfacing dissatisfaction that users never spell out.

ThoughtTrace opens a new modality for AI research:
→ user modeling beyond utterances
→ training signals from latent thoughts
→ evaluation grounded in subjective experience

📄Paper: arxiv.org/abs/2605.20087
🤗Data: huggingface.co/datasets/SCAI-…
💻Code: github.com/thoughttrace-p…
🔍Check more examples: thoughttrace-project.github.io/examples.html
w/ @binze_li @JackXie492016 @thecatfangs @tli104 @ShayneRedford @HXGuu3 @maximillianc_ @tianminshu 🙏
Fun collaboration with @GoogleResearch @MIT @jhuclsp

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @chuanyang_jin

Chuanyang Jin

@chuanyang_jin

Feb 26, 2025

How to achieve human-level open-ended machine Theory of Mind?

Introducing #AutoToM: a fully automated and open-ended ToM reasoning method combining the flexibility of LLMs with the robustness of Bayesian inverse planning, achieving SOTA results across five benchmarks. 🧵[1/n]

Theory of Mind (ToM), the ability to understand people’s minds, is known to be challenging. Current approaches either rely on prompting LLMs, which are prone to systematic errors, or use rigid, hand-crafted Bayesian ToM models, which are more robust but cannot generalize across different domains.

To address this, we introduce #AutoToM, the first model-based ToM method that addresses open-ended scenarios. [2/n]

AutoToM can operate in any domain, infer any mental variable, and support any order of recursive reasoning. It consistently achieves SOTA performance on multiple ToM benchmarks by autonomously discovering the appropriate BToM models for different questions. [3/n]

Read 9 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Chuanyang Jin

Try unrolling a thread yourself!

More from @chuanyang_jin

Chuanyang Jin

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!