More from @rryssf_

Robert Youssef

@rryssf_

Feb 17

Microsoft Research and Salesforce analyzed 200,000+ AI conversations and found something the entire industry already suspected but nobody would say out loud.

every major model gets dramatically worse the longer you talk to it.

GPT-4, Claude, Gemini, Llama. all of them. no exceptions.

paper: arxiv.org/abs/2505.06120

the paper calls it "lost in conversation."

and the mechanism is more specific than you'd expect.

it's not that the model "forgets." it's that it guesses too early, then refuses to let go.

when an llm makes a wrong assumption in turn 2 or 3, it anchors to that mistake. treats its own earlier output as ground truth. new information from you gets filtered through the lens of an error it already committed to.

by the end of the chat, it's not answering your question. it's defending its first guess.

here's the part that should bother you:

the researchers decomposed the performance drop into two components.

aptitude (the model's raw ability to solve the task) only dropped 16%.

but unreliability (the gap between best-case and worst-case output) increased by 112%.

translation: your chatbot can still do the work. it just becomes a coin flip whether it actually will on any given conversation.
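a minimal sketch of that decomposition, assuming you run the same task many times and score each run. the 90th and 10th percentiles stand in for "best-case" and "worst-case" here; that choice, the function names, and the toy scores are assumptions for illustration, not numbers from the paper.

```python
def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list, p in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def aptitude_and_unreliability(scores, hi=90, lo=10):
    """Return (aptitude, unreliability) for repeated runs of one task.

    aptitude      = score near the best case (hi percentile)
    unreliability = gap between the hi and lo percentile scores
    """
    best = percentile(scores, hi)
    worst = percentile(scores, lo)
    return best, best - worst


# Toy scores (made up): ten runs of the same task in each setting.
single_turn = [90, 88, 92, 85, 91, 89, 87, 93, 90, 86]
multi_turn = [85, 40, 80, 30, 78, 55, 82, 25, 76, 60]

for name, scores in [("single-turn", single_turn), ("multi-turn", multi_turn)]:
    apt, unrel = aptitude_and_unreliability(scores)
    print(f"{name}: aptitude={apt:.1f} unreliability={unrel:.1f}")
```

in the toy data the multi-turn runs keep a similar ceiling but a much wider spread, which is the pattern the thread describes.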


Robert Youssef

@rryssf_

Feb 15

researchers at Max Planck analyzed 280,000 transcripts of academic talks and presentations from YouTube

they found that humans are increasingly using ChatGPT's favorite words in their spoken language. not in writing. in speech.

"delve" usage up 48%. "adept" up 51%. and 58% of these usages showed no signs of reading from a script.

we talk about model collapse when AI trains on AI output. this is model collapse, except the model is us.

here's how they tested it.

Yakura et al. collected videos from 20,000+ academic YouTube channels. they transcribed everything with Whisper rather than relying on YouTube's own transcriptions, which they found were biased by changes in YouTube's transcription model over time. then they applied piecewise linear regression with ChatGPT's release date as the change point.

then the clever part: they compared against the same analysis using change points 1 and 2 years before ChatGPT's release. no comparable trend shift at those dates. the acceleration is specific to when ChatGPT entered the world.
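a simplified sketch of the change-point idea, not the paper's exact model: fit an ordinary least-squares line on each side of a candidate date and compare the slopes. the series is synthetic, the helper names are hypothetical, and running the placebo checks on pre-release data only is an assumption made here so the real shift can't leak into them.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def slope_change(months, freqs, change_point):
    """Slope from the change point onward minus the slope before it."""
    before = [(m, f) for m, f in zip(months, freqs) if m < change_point]
    after = [(m, f) for m, f in zip(months, freqs) if m >= change_point]
    return slope(*zip(*after)) - slope(*zip(*before))


# Synthetic word-frequency series: +0.1 per month, then +0.5 per month
# starting at month 36 (our stand-in for the release date).
release = 36
months = list(range(60))
freqs = [10 + 0.1 * m + (0.4 * (m - release) if m >= release else 0.0)
         for m in months]

real = slope_change(months, freqs, release)

# Placebo change points 1 and 2 years earlier, using pre-release data only.
pre = [(m, f) for m, f in zip(months, freqs) if m < release]
pre_months, pre_freqs = zip(*pre)
placebo_1y = slope_change(pre_months, pre_freqs, release - 12)
placebo_2y = slope_change(pre_months, pre_freqs, release - 24)

print(f"real={real:.3f} placebo_1y={placebo_1y:.3f} placebo_2y={placebo_2y:.3f}")
```

a real slope jump with flat placebos is what "the acceleration is specific to the release date" looks like in this simplified setup.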

to identify which words to track, they used a dataset of 10,000 human-written abstracts vs their ChatGPT-edited versions. ranked words by how much more frequently ChatGPT uses them compared to humans. then checked whether those specific words were accelerating in spoken academic language.

they were.
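the word-ranking step can be sketched roughly like this, assuming a plain ratio of smoothed relative frequencies. the paper's actual scoring may differ, and the function name, smoothing constant, and toy texts are all made up for illustration.

```python
from collections import Counter


def rank_by_chatgpt_preference(human_texts, edited_texts, smoothing=1.0):
    """Rank words by how much more often they appear in the edited texts.

    Score = smoothed relative frequency in edited texts divided by
    smoothed relative frequency in human texts. Ties break alphabetically.
    """
    human = Counter(w for t in human_texts for w in t.lower().split())
    edited = Counter(w for t in edited_texts for w in t.lower().split())
    vocab = set(human) | set(edited)
    human_total = sum(human.values()) + smoothing * len(vocab)
    edited_total = sum(edited.values()) + smoothing * len(vocab)
    ratio = {
        w: ((edited[w] + smoothing) / edited_total)
        / ((human[w] + smoothing) / human_total)
        for w in vocab
    }
    return sorted(vocab, key=lambda w: (-ratio[w], w))


# Toy stand-ins for abstracts and their edited versions.
human_texts = ["we study the results", "we look at the data"]
edited_texts = ["we delve into the results", "we delve into the data"]

ranked = rank_by_chatgpt_preference(human_texts, edited_texts)
print(ranked[:2])   # words most over-represented in the edited texts
print(ranked[-3:])  # words the edited texts use less than the originals
```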

the top 20 words most distinctive to ChatGPT showed a statistically significant acceleration in spoken usage after November 2022.

> "delve" increased 48% in 18 months
> "realm" increased 35%
> "meticulous" increased 40%
> "adept" increased 51%

and the correlation between how much ChatGPT prefers a word and how much that word accelerated in human speech: r = 0.63, p < 0.01.

the bottom-ranked words (ones ChatGPT uses less than humans) showed no significant trend change at all.

this isn't a general vocabulary shift. it's specifically the words ChatGPT favors that are spreading into how people talk.


Robert Youssef

@rryssf_

Feb 14

Stanford and Caltech researchers just published the first comprehensive taxonomy of how llms fail at reasoning

not a list of cherry-picked gotchas. a 2-axis framework that finally lets you compare failure modes across tasks instead of treating each one as a random anecdote

the findings are uncomfortable

the framework splits reasoning into 3 types: informal (intuitive), formal (logical), and embodied (physical world)

then it classifies failures into 3 categories: fundamental (baked into the architecture), application-specific (breaks in certain domains), and robustness issues (falls apart under trivial changes)

this gives you a 3x3 grid. a model can ace one cell and completely collapse in another. and a single benchmark score hides which cells are broken
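a small sketch of why one number hides the broken cells. the axis labels come from the thread; the pass rates, the threshold, and the helper name are hypothetical.

```python
REASONING_TYPES = ("informal", "formal", "embodied")
FAILURE_CATEGORIES = ("fundamental", "application-specific", "robustness")

# Hypothetical pass rates: strong everywhere except one cell.
grid = {(r, f): 0.9 for r in REASONING_TYPES for f in FAILURE_CATEGORIES}
grid[("embodied", "robustness")] = 0.1


def broken_cells(grid, threshold=0.5):
    """Return the (reasoning type, failure category) cells below threshold."""
    return sorted(cell for cell, score in grid.items() if score < threshold)


overall = sum(grid.values()) / len(grid)
print(f"single benchmark-style average: {overall:.2f}")
print("cells hidden by that average:", broken_cells(grid))
```

the average still looks respectable even though one cell has collapsed, which is the point of reporting the grid rather than a single score.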

the reversal curse is the clearest example of a fundamental failure

GPT-4 answers "who is Tom Cruise's mother?" correctly. ask the reverse, "who is Mary Lee Pfeiffer's son?" and it fails

trained on "A is B" but can't infer "B is A." a trivial logical step for a 5-year-old

and here's the part that matters: scaling doesn't fix it. the reversal curse appears robustly across transformer sizes


Robert Youssef

@rryssf_

Feb 13

new paper argues LLMs fundamentally cannot replicate human motivated reasoning because they have no motivation

sounds obvious once you hear it. but the implications are bigger than most people realize

this quietly undermines an entire category of AI political simulation research

motivated reasoning is when humans distort how they process information because they want to reach a specific conclusion

you don't evaluate evidence neutrally. you filter it through what you already believe, what you want to be true, what protects your identity

it's not a bug. it's how human cognition actually works in the wild

the paper's argument is deceptively simple:

LLMs operate on purely cognitive input. they have no desires, no identity to protect, no conclusion they're motivated to reach

so when researchers prompt GPT-4 or Claude with political scenarios and measure "motivated reasoning," they're not replicating the phenomenon. they're replicating the surface pattern without the underlying mechanism

the behavior might look similar. the cause is completely different


Robert Youssef

@rryssf_

Feb 12

SemiAnalysis just published data showing 4% of all public GitHub commits are now authored by Claude Code.

their projection: 20%+ by year-end 2026.

in the same week, Goldman Sachs revealed it embedded Anthropic engineers for 6 months to build autonomous accounting agents.

a thread on the week ai stopped being a tool and started being a coworker:

let's start with the Goldman story because it's the one that should make every back-office professional pause.

Goldman's CIO told CNBC they were "surprised" at how capable Claude was beyond coding. accounting, compliance, client onboarding, KYC, AML.

his exact framing: "digital co-workers for professions that are scaled, complex, and very process intensive."

not chatbots answering FAQs. autonomous agents parsing trade records, applying regulatory rules, routing approvals.

they started with an ai coding tool called Devin. then realized Claude's reasoning engine works the same way on rules-based financial tasks as it does on code.

the quiet part: Goldman's CEO already announced plans to constrain headcount growth during the shift. no mass layoffs yet. but "slower headcount growth" is how corporations say "we're replacing the next hire, not the current one."

now the SemiAnalysis numbers.

4% of GitHub public commits. Claude Code. right now. not projected. not theoretical. measured.

the tool has been live for roughly a year. it went from research preview to mass platform impact faster than almost any dev tool in history.

and that 20% projection isn't hype math. SemiAnalysis tracks autonomous task horizons doubling every 4-7 months. each doubling unlocks more complex work: snippet completion at 30 minutes, module refactoring at 4.8 hours, full audits at multi-day horizons.
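a back-of-envelope check on those horizons, assuming clean exponential doubling (an idealization, not something SemiAnalysis is quoted as claiming here): how many doublings separate a 30-minute task from a 4.8-hour one, and how long that takes at 4 to 7 months per doubling. the multi-day figure is left out because the thread gives no number for it.

```python
import math

short_horizon_min = 30          # snippet completion, from the thread
long_horizon_min = 4.8 * 60     # module refactoring, from the thread

# Number of doublings needed to stretch the short horizon to the long one.
doublings = math.log2(long_horizon_min / short_horizon_min)

# Elapsed time at the fast and slow ends of the quoted doubling range.
months_fast = doublings * 4
months_slow = doublings * 7

print(f"doublings needed: {doublings:.2f}")
print(f"time at 4 months per doubling: {months_fast:.1f} months")
print(f"time at 7 months per doubling: {months_slow:.1f} months")
```

that works out to a bit over three doublings, so roughly one to two years between the two horizons under this assumption.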

the implication isn't "developers are getting faster." it's that the definition of "developer" is expanding to include anyone who can describe a problem clearly.


Robert Youssef

@rryssf_

Feb 11

MIT researchers taught an LLM to write its own training data, finetune itself, and improve without human intervention

the paper is called SEAL (Self-Adapting Language Models) and the core idea is genuinely clever

but "GPT-6 might be alive" is not what this paper says. not even close.

here's what it actually does:

the problem SEAL solves is real and important

every LLM you use today is frozen. it learned everything during training, and after deployment, it's done. new information? stuff it into the context window. new task? hope the prompt is good enough.

the weights never change. the model never truly learns from experience.

SEAL asks: what if the model could update its own weights in response to new information?

here's how SEAL actually works

instead of a human writing training data, the model generates its own. MIT calls these "self-edits." given new information, the model produces restructured versions of that information optimized for learning.

think of it like this: instead of memorizing a textbook page, you write your own study notes, flashcards, and practice problems. then you study from those.

the model does the same thing. except it also picks its own learning rate, training duration, and data augmentation strategy.
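a structural sketch of that loop with stubs only. no real model or training is involved: the class, field, and function names are hypothetical, the "weights" are a plain dict, and the stub "self-edit" just splits a passage into sentences where SEAL would have the model write the notes itself.

```python
from dataclasses import dataclass


@dataclass
class SelfEdit:
    """What the model proposes for its own update (hypothetical fields)."""
    training_examples: list
    learning_rate: float
    epochs: int
    augmentation: str


def generate_self_edit(passage: str) -> SelfEdit:
    """Stub for the model restructuring new information into study notes."""
    notes = [s.strip() for s in passage.split(".") if s.strip()]
    return SelfEdit(training_examples=notes, learning_rate=1e-4,
                    epochs=3, augmentation="none")


def finetune(weights: dict, edit: SelfEdit) -> dict:
    """Stub for a weight update: counts how often each note was seen."""
    updated = dict(weights)  # leave the original "weights" untouched
    for _ in range(edit.epochs):
        for example in edit.training_examples:
            updated[example] = updated.get(example, 0) + 1
    return updated


weights = {}
edit = generate_self_edit("SEAL stands for Self-Adapting Language Models. "
                          "The model writes its own training data.")
new_weights = finetune(weights, edit)
print(edit.training_examples)
print(new_weights)
```

the part worth noticing is the shape: the model's output is not an answer but a training recipe (data plus hyperparameters) that is then applied to the model itself.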

