It’s a hefty 206-page research paper, and the findings are concerning.
"LLM users consistently underperformed at neural, linguistic, and behavioral levels"
This study finds LLM dependence weakens the writer’s own neural and linguistic fingerprints. 🤔🤔
Using EEG, text mining, and a cross-over session, the authors show that keeping some AI-free practice time protects memory circuits and encourages richer language even when a tool is later reintroduced.
⚙️ The Experimental Setup
Fifty-four Boston-area students wrote SAT-style essays under three conditions: ChatGPT only, Google only, or brain only.
Each person completed three timed sessions with the same condition, then an optional fourth session in the opposite condition.
A 32-channel Enobio headset recorded brain signals throughout, and every keystroke, prompt, and interview answer was archived for analysis.
🧠 Brain Connectivity Results
Alpha and beta networks were strongest when no external tool was allowed, moderate with Google, and weakest with ChatGPT.
Lower coupling during LLM use signals reduced internal attention and memory rehearsal, while high parieto-frontal flow in the brain-only group matches deep semantic processing.
📚 Linguistic Patterns
Essays produced with ChatGPT clustered tightly in embedding space and reused the same named entities, showing high textual homogeneity.
Google essays sat in the middle, influenced by search rankings, whereas brain-only essays scattered widely, reflecting individual experience and vocabulary.
📝 Memory and Ownership
After writing, only 17% of ChatGPT users could quote their own sentences, versus 89% in the brain-only group.
ChatGPT writers also reported the weakest sense of authorship, matching EEG evidence of reduced self-monitoring hubs.
🔄 Crossover Effects
When habitual ChatGPT users had to write unaided, their connectivity and quoting remained low, suggesting lingering cognitive debt.
In contrast, brain-only writers who switched to ChatGPT lit up wide networks and produced richer revisions, showing that tool use after deep practice boosts, rather than blunts, engagement.
⚖️ Cognitive Load Implications
LLMs cut extraneous load by 32% and extend productive time, yet they also trim germane load, so schema building suffers unless learners deliberately integrate ideas themselves.
🔍 Echo-Chamber Risk
Because a probabilistic model favors agreeable continuations, ChatGPT can tighten information loops more than a search page, shrinking exposure to contrasting facts and dulling critical thought.
Absolutely classic @GoogleResearch paper on In-Context-Learning by LLMs.
Shows the mechanism of how LLMs learn in context from examples in the prompt: they can pick up new patterns while answering, yet their stored weights never change.
💡 The mechanism they reveal for in-context learning.
When the model reads a few examples in your prompt, it figures out a pattern (like a small rule or function). Instead of permanently changing its stored weights, it forms a temporary adjustment that captures this pattern. That adjustment can be written mathematically as a rank-1 matrix, meaning it only adds one simple direction of change to the existing weights.
This rank-1 update is “low-rank”, so it is very cheap and compact. But it still lets the model shift its behavior to fit the examples in the prompt. Once the prompt is gone, that temporary rank-1 tweak also disappears.
So, in simple terms:
The paper shows that in-context learning happens because the model internally applies a temporary rank-1 (very simple) weight update based on your examples, instead of permanently retraining itself.
---
That behavior looks impossible if learning always means gradient descent.
The authors ask whether the transformer’s own math hides an update inside the forward pass.
They show that each prompt token writes a rank-1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a one-step fine-tune.
Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt.
---
Shows that the attention part can take what it found in your prompt and package it into a tiny “instruction” that, for this one forward pass, acts exactly like a small temporary change to the MLP’s weights.
Nothing is saved to disk, yet the block behaves as if the MLP just got a low-rank tweak computed from your examples. Remove the prompt, the tweak disappears, the saved weights stay the same.
As the model reads your examples token by token, it keeps refining that temporary tweak. Each new token nudges the MLP a bit more toward the rule implied by your examples, similar to taking small gradient steps, again only for this pass.
When the examples have done their job, those nudges shrink toward 0, which is what you want when the pattern has been “locked in” for the current answer.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Idea
They call any layer that can read a separate context plus a query a “contextual layer”.
Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.
For that block, the context acts exactly like a rank-1 additive patch on the first weight matrix, no matter what shape the attention takes.
🧵3/n. 🛠️ Temporary rank-1 patch
A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”.
It multiplies that difference by the frozen weight matrix, then projects the result back through the query activation.
The outcome is a one‑column times one‑row outer product, so the whole tweak has rank 1 and adds almost no storage overhead.
In the very next operation, the block behaves exactly as if the real weight matrix had been replaced by the original weights plus that patch, even though nothing on disk changed.
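A tiny NumPy sketch of that equivalence (my own illustration of the rank-1 identity described above, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16

W = rng.normal(size=(d_hidden, d_in))   # frozen first weight matrix of the MLP
a_query = rng.normal(size=d_in)         # attention output for the query alone
a_ctx = rng.normal(size=d_in)           # attention output for context + query
delta = a_ctx - a_query                 # what the context adds to the activation

# One column (W @ delta) times one row (a_query) -> a rank-1 patch.
dW = np.outer(W @ delta, a_query) / np.dot(a_query, a_query)

# Patched weights on the query-only activation reproduce the frozen
# weights acting on the with-context activation.
print(np.allclose((W + dW) @ a_query, W @ a_ctx))  # True
```

Nothing here is trained; the patch is recomputed from the prompt on every pass and discarded afterwards.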
🌀 Why the change vanishes after each run
The patch lives only inside the forward pass. Once the model finishes processing the current token, the computation graph is cleared and the base weights revert to their untouched state.
Because the next token builds its own patch from scratch, no cumulative edit sticks around in memory, yet during the pass the effect is the same as a quick one‑step fine‑tune.
Put simply, each prompt token writes a throw‑away sticky note on top of the first weight matrix, lets the model read that note to answer the query, then tosses it out before the weights ever hit the file system.
🚫 This @Microsoft paper brings really bad news for medical AI models. Exposes some serious flaws.
AI models just aren’t ready yet for reliable medical reasoning. 🤯
The paper finds that medical AI models pass tests by exploiting patterns in the data, not by actually combining medical text with images in a reliable way.
While medical AI models look good on benchmarks, in reality they cannot handle real medical reasoning.
The key findings are that models overuse shortcuts, break under small changes, and produce unfaithful reasoning.
This makes medical AI benchmark results misleading if someone assumes a high score means a model is ready for real medical use.
---
The specific key findings from this paper 👇
- Models keep strong accuracy even when images are removed, including on questions that require vision, which signals shortcut use over real understanding.
- Scores stay above the 20% guess rate without images, so text patterns alone often drive the answers.
- Shuffling answer order changes predictions a lot, which exposes position and format bias rather than robust reasoning (a sketch of this kind of probe follows this list).
- Replacing a distractor with “Unknown” does not stop many models from guessing; they answer anyway instead of abstaining when evidence is missing.
- Swapping in a lookalike image that matches a wrong option makes accuracy collapse, which shows vision is not integrated with text.
- Chain of thought often sounds confident while citing features that are not present, which means the explanations are unfaithful.
- Audits reveal 3 failure modes: incorrect logic with correct answers, hallucinated perception, and visual reasoning with faulty grounding.
- Gains on popular visual question answering do not transfer to report generation, which is closer to real clinical work.
- Clinician reviews show benchmarks measure very different skills, so a single leaderboard number misleads on readiness.
- Once shortcut strategies are disrupted, true comprehension is far weaker than the headline scores suggest.
- Most models fail to abstain when the image is missing, which is unsafe behavior for medical use.
- The authors push for a robustness score and explicit reasoning audits, which signals current evaluations are not enough.
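To make the shuffling probe above concrete, here is a hedged sketch of that kind of robustness check (my own illustration, not the paper's code; `model_answer` is a hypothetical stand-in for any VQA model call, and the items are made up):

```python
import random

def shuffle_flip_rate(model_answer, items, seed=0):
    """Fraction of multiple-choice items whose predicted answer changes
    when the options are shown in a different order."""
    rng = random.Random(seed)
    flips = 0
    for question, options in items:
        original = model_answer(question, options)
        shuffled = options[:]
        rng.shuffle(shuffled)
        if model_answer(question, shuffled) != original:
            flips += 1
    return flips / len(items)

def always_first(question, options):
    # Deliberately position-biased dummy "model": always picks the first option.
    return options[0]

items = [("Which finding is shown?", ["pneumothorax", "effusion", "Unknown", "fracture"])] * 10
print(shuffle_flip_rate(always_first, items))  # a high flip rate exposes position bias
```

A robust model would return the same option text regardless of ordering, keeping its flip rate near 0.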
🧵 Read on 👇
🧵2/n. The figure below tells us that high scores on medical benchmarks can mislead, because stress tests reveal that current models often rely on shallow tricks and cannot be trusted for reliable medical reasoning.
The first part highlights 3 hidden fragilities: hallucinated perception, shortcut behavior, and faulty reasoning.
The second part compares benchmark accuracy with robustness scores, and while accuracy looks high, robustness drops sharply, which means models get brittle under small changes.
The heatmap shows how stress tests like removing images, shuffling answers, or replacing distractors reveal specific failure patterns in each model.
The example at the bottom shows that a model can still give the right answer even without seeing the image, which is a shortcut, or it can make up a detailed explanation that mentions things not actually in the image, which is fabricated reasoning.
🧵3/n. This figure shows that accuracy drops when images are removed from diagnostic questions, but not as far as it should, so headline scores overestimate how much the models truly use vision.
Different benchmarks react differently, which means visual understanding is inconsistent across datasets and question types.
Even without images, most models still score above the 20% guess rate, which signals they rely on text cues, memorized pairs, or co-occurrence patterns.
One model even falls below chance on text-only inputs, which suggests fragile behavior rather than stable reasoning.
Overall, the message is that high Image+Text scores can hide shortcut use, so vision-language robustness is weaker than the headline numbers suggest.
🔥 Meta reveals a massive inefficiency in AI’s reasoning process and gives a solution.
Large language models keep redoing the same work inside long chains of thought.
For example, when adding fractions with different denominators, the model often re-explains finding a common denominator step by step instead of just using a “common denominator” behavior.
In quadratic equations, it re-explains the discriminant logic or completes the square again instead of calling a “solve quadratic” behavior.
In unit conversion, it spells out inches to centimeters again instead of applying a “unit conversion” behavior.
🛑 The problem with this approach: when the model re-explains a routine, it spends many tokens on boilerplate steps that are identical across problems, which is wasted budget.
So this paper teaches the model to compress those recurring steps into small named behaviors that it can recall later or even learn into its weights.
A behavior compresses that routine into a short name plus instruction, like a tiny macro that the model can reference.
At inference, a small list of relevant behaviors is given to the model, or is already internalized through training, so the model can say which behavior it is using and skip the long re-derivation.
Because it points to a named behavior, the output needs fewer tokens, and the saved tokens go to the new parts of the question.
Behavior-conditioned fine-tuning teaches the weights to trigger those routines from the question alone, so even without retrieval the model tends to use the right shortcut.
Compute shifts from many output tokens to a few input hints and weight activations, which is cheaper in most serving stacks and usually faster too.
Accuracy can improve because the model follows a tested routine instead of improvising a fresh multi step derivation that may drift.
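As a rough illustration of what "a small list of relevant behaviors" could look like at inference, here is a minimal sketch; the handbook entries, prompt format, and retrieval step are hypothetical, not taken from the paper:

```python
# Hypothetical behavior handbook: short name -> one-line how-to instruction.
HANDBOOK = {
    "common_denominator": "To add fractions, rewrite them over the least common denominator first.",
    "solve_quadratic": "For ax^2 + bx + c = 0, use x = (-b ± sqrt(b^2 - 4ac)) / (2a).",
    "unit_conversion": "Convert units with the right factor, e.g. 2.54 cm per inch.",
}

def build_prompt(question, behavior_names):
    """Prepend the retrieved behaviors so the model can cite them by name
    instead of re-deriving the routine from scratch."""
    lines = ["You may use these behaviors, citing them by name:"]
    lines += [f"- {name}: {HANDBOOK[name]}" for name in behavior_names]
    lines += ["", f"Question: {question}", "Answer, referencing any behavior you use:"]
    return "\n".join(lines)

print(build_prompt("What is 1/3 + 1/4?", ["common_denominator"]))
```

In the fine-tuned variant the paper also describes, the same routines are internalized into the weights instead of being pasted into the prompt.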
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
A behavior is a short name plus instruction for a reusable move, like inclusion-exclusion or turning words into equations.
The behavior handbook is a store of how-to steps, which is procedural memory, unlike RAG that stores facts.
The authors frame the goal as remember how to reason, not just what to conclude, which matches the engineer’s point that remembering how to think beats thinking longer.
🧵3/n. Behavior Curation Pipeline
- Step 1: Solving and reflecting
The process starts with a model (labeled LLM A) that gets a math question. It solves it normally, producing a full reasoning trace.
Then the same model reflects on that solution, asking: “What steps here are general tricks I could reuse later?”
From that reflection, it extracts behaviors. Each behavior has a short name plus an instruction, like triangle angle sum or inclusion-exclusion.
- Step 2: Building a handbook
These extracted behaviors are added to a behavior handbook. Think of it as a growing library of reasoning shortcuts.
The handbook is procedural memory, so instead of facts like “Paris is in France,” it stores “how to” rules, like “angles in a triangle add to 180.”
- Step 3: Using the handbook at test time
Now another model (LLM C, called the Student) faces new problems.
During inference, it is given the problem plus a small selection of relevant behaviors from the handbook.
It solves the problem while explicitly referencing these behaviors, so the reasoning trace is shorter and less repetitive.
- Step 4: Self-improvement loop
The same idea can also be used for self-improvement.
A model solves a problem once, extracts behaviors from that first attempt, then retries the problem using those behaviors as hints.
This second try is usually more accurate than just a critique-and-revise method.
- Step 5: Training with behaviors
There’s also a training path called behavior-conditioned supervised fine-tuning (SFT).
Here, a Teacher model (LLM B) generates reasoning traces that explicitly include behaviors.
The Student is then trained on these behavior-annotated traces. After training, the Student doesn’t need behavior prompts anymore because it has internalized the patterns.
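A hedged sketch of what one behavior-annotated training pair for that SFT step might look like (my own illustration; the field names and formatting are hypothetical):

```python
# Hypothetical behavior-conditioned SFT pair: the teacher's trace names the
# behavior it used, and the student is fine-tuned on (question -> annotated trace).
sft_example = {
    "prompt": "What is 1/3 + 1/4?",
    "completion": (
        "Using behavior common_denominator: rewrite over 12. "
        "1/3 = 4/12 and 1/4 = 3/12, so the sum is 7/12."
    ),
}

# After training on many such pairs, the student tends to invoke the right
# behavior from the question alone, with no handbook in the prompt.
print(sft_example["completion"])
```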
The big takeaway: scaling up model size doesn’t just make models smarter in terms of knowledge, it makes them last longer on multi-step tasks, which is what really matters for agents.
Shows that small models can usually do one step perfectly, but when you ask them to keep going for many steps, they fall apart quickly.
Even if they never miss on the first step, their accuracy drops fast as the task gets longer.
Large models, on the other hand, stay reliable across many more steps, even though the basic task itself doesn’t require extra knowledge or reasoning.
The paper says this is not because big models “know more,” but because they are better at consistently executing without drifting into errors.
The paper names a failure mode called self-conditioning, where seeing earlier mistakes causes more mistakes, and they show that with thinking enabled, GPT-5 runs 1000+ steps in one go while other models manage far fewer.
🧵 Read on 👇
🧵2/n. 🧠 The idea
The work separates planning from execution, then shows that even when the plan and the needed knowledge are handed to the model, reliability drops as the task gets longer, which makes small accuracy gains suddenly matter a lot.
🧵3/n. Even a tiny accuracy boost at the single-step level leads to exponential growth in how long a model can reliably execute a full task.
This is why scaling up models is still worth it, even if short benchmarks look like progress is stalling.
On the left, you see that step accuracy (how often the model gets each small move right) is almost flat, barely improving with newer models.
That looks like diminishing returns, because each release is only slightly better at a single step.
But on the right, when you extend that tiny step improvement across many steps in a row, the gains explode.
Task length (how long a model can keep going without failing) jumps from almost nothing to thousands of steps.
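A back-of-the-envelope sketch of that compounding (my own illustration, assuming independent steps, which ignores the self-conditioning effect the paper describes): if each step succeeds with probability p, the longest task completed at least half the time is ln(0.5)/ln(p).

```python
import math

def horizon(step_acc, target=0.5):
    """Longest task length finished with probability >= target,
    assuming every step succeeds independently with step_acc."""
    return math.log(target) / math.log(step_acc)

# A small bump in per-step accuracy buys a much longer reliable horizon.
for p in (0.99, 0.995, 0.999, 0.9999):
    print(f"step accuracy {p:.4f} -> ~{horizon(p):.0f} steps at 50% task success")
```

With these toy numbers, 99% per-step accuracy survives roughly 69 steps while 99.99% survives close to 7,000, which is why near-flat step-accuracy curves can hide large gains in task length.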
If this spreads, new research won’t just be something you read, it’ll be something you can use immediately.
It will lower barriers, save huge amounts of time, and could make science much more reliable and connected.
Normally, a paper is just a PDF plus maybe some code, and if you want to use it you have to install dependencies, debug environments, and figure out parameters. That’s hard and often stops people from ever using the method.
Paper2Agent skips all that. It automatically converts a paper into an interactive AI agent. You can talk to it in plain language, and it will actually run the real code, with the right data and setup, and give you results. No setup or manual fixing needed.
They show it works on heavy-duty cases like AlphaGenome, TISSUE, and Scanpy, and the agents reproduced the original papers’ results with 100% accuracy, even on brand-new queries.
⚙️ The Core Concepts
The framework represents each paper as an MCP server that bundles executable tools, static resources, and step-by-step prompts, then any LLM agent can call those tools with plain language to run the paper’s method.
This shifts research output from a passive document to an interactive system that demonstrates, applies, and adapts the paper’s ideas on demand.
It automates environment setup, extracts tools from the repo and tutorials, and tests them until outputs match the originals.
🧵 Read on 👇
🧵2/n. 🧩 Why MCP
Model Context Protocol gives a standard way to expose functions and data with clear inputs and outputs, so agents can call them reliably without custom glue code.
Paper2Agent uses this to encode datasets, code paths, and multi-step workflows so the paper becomes addressable, composable, and easy to query.
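As a rough sketch of what exposing one paper tool over MCP could look like, assuming the reference Python MCP SDK's FastMCP interface (the server name, tool, and its arguments are hypothetical, not from Paper2Agent):

```python
# Minimal sketch of a paper-as-MCP-server; `run_analysis` is a hypothetical
# stand-in for a method the extraction and testing agents would pull from a repo.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-paper-agent")

@mcp.tool()
def run_analysis(input_path: str, threshold: float = 0.05) -> str:
    """Apply the paper's (hypothetical) analysis to a user-supplied dataset."""
    # A real Paper2Agent server would call the paper's actual code here,
    # inside the environment that was set up and validated automatically.
    return f"Analysis of {input_path} complete (threshold={threshold})."

if __name__ == "__main__":
    mcp.run()  # any MCP-compatible agent can now discover and call the tool
```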
🧵3/n. This figure shows the whole idea of Paper2Agent in action.
At the top, it shows how a research paper is turned into a paper MCP server.
This server bundles together the tools, the code repository, the data, and instructions for running analyses or reproducing figures.
Once created, it can be hooked up to any AI agent without manual setup. That makes the paper itself act like an interactive agent you can talk to. You just ask it to apply the method to your dataset, and it runs the analysis and gives back results.
At the bottom, it shows the workflow that builds this server.
The process starts with identifying the codebase of the paper. Then an environment agent sets up everything cleanly.
An extraction agent pulls out reusable tools from the code. A testing agent runs and refines them until they work exactly as intended. Finally, the validated tools are packaged into a Python MCP server and deployed remotely.
So Paper2Agent turns a static paper into a live system that can run its methods reliably through natural language interaction.