researchers at Max Planck analyzed 280,000 transcripts of academic talks and presentations from YouTube
they found that humans are increasingly using ChatGPT's favorite words in their spoken language. not in writing. in speech.
"delve" usage up 48%. "adept" up 51%. and 58% of these usages showed no signs of reading from a script.
we talk about model collapse when AI trains on AI output. this is model collapse, except the model is us.
here's how they tested it.
Yakura et al. collected videos from 20,000+ academic YouTube channels. transcribed everything with Whisper (not YouTube's auto-captions, which they found had introduced bias whenever YouTube switched its transcription models). applied piecewise linear regression with ChatGPT's release date as the change point.
then the clever part: they compared against the same analysis using change points 1 and 2 years before ChatGPT's release. no comparable trend shift at those dates. the acceleration is specific to when ChatGPT entered the world.
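roughly what that test looks like, as a toy sketch in Python (not the authors' code, and the numbers below are made up):

```python
# toy sketch: piecewise linear regression with a known change point,
# plus placebo change points 1 and 2 years earlier
import numpy as np

def slope_change(t, y, t_change):
    """Fit y ~ b0 + b1*t + b2*max(0, t - t_change); return b2, the slope shift at the change point."""
    hinge = np.maximum(0.0, t - t_change)
    X = np.column_stack([np.ones_like(t), t, hinge])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[2]

# t: months since the start of the study window; y: a word's monthly frequency (synthetic here)
rng = np.random.default_rng(0)
t = np.arange(72, dtype=float)          # e.g. six years of monthly bins
chatgpt_month = 58                      # hypothetical index of Nov 2022 in this toy window
y = 0.01 * t + 0.05 * np.maximum(0, t - chatgpt_month) + rng.normal(0, 0.1, t.size)

for label, cp in [("ChatGPT release", chatgpt_month),
                  ("placebo, 1 year earlier", chatgpt_month - 12),
                  ("placebo, 2 years earlier", chatgpt_month - 24)]:
    print(label, "slope change:", round(slope_change(t, y, cp), 3))
```

only the real change point should show a large positive slope shift; the placebo dates should come out near zero.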
to identify which words to track, they used a dataset of 10,000 human-written abstracts vs their ChatGPT-edited versions. ranked words by how much more frequently ChatGPT uses them compared to humans. then checked whether those specific words were accelerating in spoken academic language.
they were.
the top 20 words most distinctive to ChatGPT showed a statistically significant acceleration in spoken usage after November 2022.
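the word-selection step above, as a toy sketch (made-up texts, and not the paper's exact scoring function):

```python
# toy sketch: rank words by how much more often they appear in
# ChatGPT-edited text than in the human-written originals
from collections import Counter

def word_freqs(texts):
    counts = Counter(w for text in texts for w in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def chatgpt_preference(human_texts, edited_texts, smoothing=1e-6):
    """Score each word by its frequency ratio edited/human (higher = more 'GPT-like')."""
    f_h = word_freqs(human_texts)
    f_e = word_freqs(edited_texts)
    vocab = set(f_h) | set(f_e)
    scores = ((w, (f_e.get(w, 0) + smoothing) / (f_h.get(w, 0) + smoothing)) for w in vocab)
    return sorted(scores, key=lambda x: x[1], reverse=True)

human = ["we study how people speak in talks", "we examine word use in lectures"]
edited = ["we delve into how people speak in talks", "we meticulously examine word use"]
print(chatgpt_preference(human, edited)[:5])   # top "ChatGPT-preferred" words
```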
Stanford and Caltech researchers just published the first comprehensive taxonomy of how LLMs fail at reasoning
not a list of cherry-picked gotchas. a 2-axis framework that finally lets you compare failure modes across tasks instead of treating each one as a random anecdote
the findings are uncomfortable
the framework splits reasoning into 3 types: informal (intuitive), formal (logical), and embodied (physical world)
then it classifies failures into 3 categories: fundamental (baked into the architecture), application-specific (breaks in certain domains), and robustness issues (falls apart under trivial changes)
this gives you a 3x3 grid. a model can ace one cell and completely collapse in another. and a single benchmark score hides which cells are broken
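one way to picture it, as an illustrative sketch (the axis labels come from the taxonomy above; the cell contents are mine):

```python
# sketch: the taxonomy's two axes as a 3x3 grid, each cell collecting failure examples
REASONING_TYPES = ("informal", "formal", "embodied")
FAILURE_CATEGORIES = ("fundamental", "application-specific", "robustness")

grid = {(r, f): [] for r in REASONING_TYPES for f in FAILURE_CATEGORIES}

# a single benchmark score averages over all nine cells;
# keeping them separate shows *where* a model breaks, not just how often
grid[("formal", "robustness")].append("accuracy drops when the problem is trivially reworded")

for cell, examples in grid.items():
    if examples:
        print(cell, examples)
```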
the reversal curse is the clearest example of a fundamental failure
GPT-4 answers "who is Tom Cruise's mother?" correctly. ask the reverse, "who is Mary Lee Pfeiffer's son?" and it fails
trained on "A is B" but can't infer "B is A." a trivial logical step for a 5-year-old
and here's the part that matters: scaling doesn't fix it. the reversal curse appears robustly across transformer sizes
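if you want to poke at this yourself, the probe is a few lines. `ask_model` below is a placeholder for whatever LLM client you use, not a real API, and the question pair is the one quoted above:

```python
# minimal probe: ask the forward question and its reverse, check the answers
def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug your LLM client in here")  # placeholder, swap in your own

PROBES = [
    # (question, substring the answer should contain)
    ("Who is Tom Cruise's mother?", "Mary Lee Pfeiffer"),   # forward: models usually get this
    ("Who is Mary Lee Pfeiffer's son?", "Tom Cruise"),      # reverse: where the curse shows up
]

def run_probes():
    for question, expected in PROBES:
        answer = ask_model(question)
        print(f"{question!r}: {'PASS' if expected.lower() in answer.lower() else 'FAIL'}")
```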
new paper argues LLMs fundamentally cannot replicate human motivated reasoning because they have no motivation
sounds obvious once you hear it. but the implications are bigger than most people realize
this quietly undermines an entire category of AI political simulation research
motivated reasoning is when humans distort how they process information because they want to reach a specific conclusion
you don't evaluate evidence neutrally. you filter it through what you already believe, what you want to be true, what protects your identity
it's not a bug. it's how human cognition actually works in the wild
the paper's argument is deceptively simple:
LLMs operate on purely cognitive input. they have no desires, no identity to protect, no conclusion they're motivated to reach
so when researchers prompt GPT-4 or Claude with political scenarios and measure "motivated reasoning," they're not replicating the phenomenon. they're replicating the surface pattern without the underlying mechanism
the behavior might look similar. the cause is completely different
Goldman Sachs started with an AI coding tool called Devin. then realized Claude's reasoning engine works the same way on rules-based financial tasks as it does on code.
the quiet part: Goldman's CEO already announced plans to constrain headcount growth during the shift. no mass layoffs yet. but "slower headcount growth" is how corporations say "we're replacing the next hire, not the current one."
now the SemiAnalysis numbers.
4% of public GitHub commits. that's Claude Code. right now. not projected. not theoretical. measured.
the tool has been live for roughly a year. it went from research preview to mass platform impact faster than almost any dev tool in history.
and that 20% projection isn't hype math. SemiAnalysis tracks autonomous task horizons doubling every 4-7 months. each doubling unlocks more complex work: snippet completion at 30 minutes, module refactoring at 4.8 hours, full audits at multi-day horizons.
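quick sanity math on those horizons, assuming the quoted 4-7 month doubling rate (the 3-day figure for "multi-day" is my stand-in):

```python
# back-of-the-envelope: how many doublings separate the task horizons quoted above,
# and how long that takes at 4-7 months per doubling
import math

def months_to_reach(start_minutes, target_minutes, months_per_doubling):
    doublings = math.log2(target_minutes / start_minutes)
    return doublings * months_per_doubling

start = 30  # snippet completion: ~30 minutes
for label, target in [("module refactoring (4.8 h)", 4.8 * 60),
                      ("multi-day audit (~3 days, assumed)", 3 * 24 * 60)]:
    lo = months_to_reach(start, target, 4)
    hi = months_to_reach(start, target, 7)
    print(f"{label}: ~{lo:.0f}-{hi:.0f} months from the 30-minute level")
```

which puts module-scale work roughly a year or two past snippet-scale work, and multi-day audits a few years beyond that, if the doubling rate holds.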
the implication isn't "developers are getting faster." it's that the definition of "developer" is expanding to include anyone who can describe a problem clearly.
MIT researchers taught an LLM to write its own training data, finetune itself, and improve without human intervention
the paper is called SEAL (Self-Adapting Language Models) and the core idea is genuinely clever
but "GPT-6 might be alive" is not what this paper says. not even close.
here's what it actually does:
the problem SEAL solves is real and important
every LLM you use today is frozen. it learned everything during training, and after deployment, it's done. new information? stuff it into the context window. new task? hope the prompt is good enough.
the weights never change. the model never truly learns from experience.
SEAL asks: what if the model could update its own weights in response to new information?
here's how SEAL actually works
instead of a human writing training data, the model generates its own. MIT calls these "self-edits." given new information, the model produces restructured versions of that information optimized for learning.
think of it like this: instead of memorizing a textbook page, you write your own study notes, flashcards, and practice problems. then you study from those.
the model does the same thing. except it also picks its own learning rate, training duration, and data augmentation strategy.
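the loop, as a heavily simplified sketch. every function and field name here is a placeholder, not SEAL's actual API:

```python
# structural skeleton of the self-adaptation loop described above (placeholders only)
from dataclasses import dataclass

@dataclass
class SelfEdit:
    training_text: str     # restructured version of the new information ("study notes")
    learning_rate: float   # chosen by the model itself
    epochs: int            # training duration, also chosen by the model
    augmentation: str      # e.g. paraphrases, Q&A pairs, implications

def generate_self_edits(model, new_information) -> list[SelfEdit]:
    """The model writes its own training data plus the hyperparameters to train on it."""
    ...

def finetune(model, edit: SelfEdit):
    """Apply a small weight update using the self-edit's data and hyperparameters."""
    ...

def self_adapt(model, new_information, eval_fn):
    best = model
    for edit in generate_self_edits(model, new_information):
        candidate = finetune(model, edit)
        if eval_fn(candidate) > eval_fn(best):   # keep only edits that actually help
            best = candidate
    return best
```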