Oct 5 • 9 tweets • 5 min read
Absolutely classic @GoogleResearch paper on In-Context-Learning by LLMs.
Shows the mechanism by which LLMs learn in context from examples in the prompt: they can pick up new patterns while answering, yet their stored weights never change.
💡The mechanism they reveal for in-context-learning.
When the model reads a few examples in your prompt, it figures out a pattern (like a small rule or function). Instead of permanently changing its stored weights, it forms a temporary adjustment that captures this pattern. That adjustment can be written mathematically as a rank-1 matrix, meaning it only adds one simple direction of change to the existing weights.
This rank-1 update is “low-rank”, so it is very cheap and compact. But it still lets the model shift its behavior to fit the examples in the prompt. Once the prompt is gone, that temporary rank-1 tweak also disappears.
So, in simple terms:
The paper shows that in-context learning happens because the model internally applies a temporary rank-1 (very simple) weight update based on your examples, instead of permanently retraining itself.
---
That behavior looks impossible if learning always means gradient descent.
The authors ask whether the transformer’s own math hides an update inside the forward pass.
They show that each prompt token writes a rank-1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a one-step finetune.
Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt.
---
Shows that the attention part can take what it found in your prompt and package it into a tiny “instruction” that, for this one forward pass, acts exactly like a small temporary change to the MLP’s weights.
Nothing is saved to disk, yet the block behaves as if the MLP just got a low-rank tweak computed from your examples. Remove the prompt, the tweak disappears, the saved weights stay the same.
As the model reads your examples token by token, it keeps refining that temporary tweak. Each new token nudges the MLP a bit more toward the rule implied by your examples, similar to taking small gradient steps, again only for this pass.
When the examples have done their job, those nudges shrink toward 0, which is what you want when the pattern has been “locked in” for the current answer.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Idea
They call any layer that can read a separate context plus a query a “contextual layer”.
Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.
For that block, the context acts exactly like a rank-1 additive patch on the first weight matrix, no matter what shape the attention takes.
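Here is a minimal numpy sketch of that identity, my own toy illustration of the claim rather than the paper's code. If attention adds a context-dependent vector `a` to the query activation `x`, then the frozen first MLP matrix applied to `x + a` gives exactly what a rank-1-patched matrix would produce on `x` alone:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16

W1 = rng.normal(size=(d_hidden, d_in))   # the MLP's first weight matrix (frozen)
x  = rng.normal(size=d_in)               # query token activation
a  = rng.normal(size=d_in)               # context contribution coming from attention

# Rank-1 patch: delta_W = (W1 a) x^T / (x^T x)
delta_W = np.outer(W1 @ a, x) / (x @ x)

# Applying the frozen weights to (x + a) equals applying the patched weights to x alone
lhs = W1 @ (x + a)
rhs = (W1 + delta_W) @ x
print("context acts as a temporary rank-1 weight update:", np.allclose(lhs, rhs))
```

Nothing in `W1` is actually overwritten; the patch only exists inside this one forward pass.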
Oct 3 • 5 tweets • 4 min read
🚫 This @Microsoft paper brings really bad news for medical AI models. Exposes some serious flaws.
AI models just aren’t ready yet for reliable medical reasoning. 🤯
The paper finds that medical AI models pass tests by exploiting patterns in the data, not by actually combining medical text with images in a reliable way.
While medical AI models look good on benchmarks, in reality they cannot handle real medical reasoning.
The key findings are that models overuse shortcuts, break under small changes, and produce unfaithful reasoning.
This makes these models' benchmark results misleading if someone assumes a high score means the model is ready for real medical use.
---
The specific key findings from this paper 👇
- Models keep strong accuracy even when images are removed, even on questions that require vision, which signals shortcut use over real understanding.
- Scores stay above the 20% guess rate without images, so text patterns alone often drive the answers.
- Shuffling answer order changes predictions a lot, which exposes position and format bias rather than robust reasoning.
- Replacing a distractor with “Unknown” does not stop many models from guessing; they pick an answer instead of abstaining when evidence is missing.
- Swapping in a lookalike image that matches a wrong option makes accuracy collapse, which shows vision is not integrated with text.
- Chain of thought often sounds confident while citing features that are not present, which means the explanations are unfaithful.
- Audits reveal 3 failure modes, incorrect logic with correct answers, hallucinated perception, and visual reasoning with faulty grounding.
- Gains on popular visual question answering do not transfer to report generation, which is closer to real clinical work.
- Clinician reviews show benchmarks measure very different skills, so a single leaderboard number misleads on readiness.
- Once shortcut strategies are disrupted, true comprehension is far weaker than the headline scores suggest.
- Most models refuse to abstain without the image, which is unsafe behavior for medical use.
- The authors push for a robustness score and explicit reasoning audits, which signals current evaluations are not enough.
🧵 Read on 👇
🧵2/n. The figure below tells us that high scores on medical benchmarks can mislead, because stress tests reveal that current models often rely on shallow tricks and cannot be trusted for reliable medical reasoning.
The first part highlights 3 hidden fragilities: hallucinated perception, shortcut behavior, and faulty reasoning.
The second part compares benchmark accuracy with robustness scores, and while accuracy looks high, robustness drops sharply, which means models get brittle under small changes.
The heatmap shows how stress tests like removing images, shuffling answers, or replacing distractors reveal specific failure patterns in each model.
The example at the bottom shows that a model can still give the right answer even without seeing the image, which is a shortcut, or it can make up a detailed explanation that mentions things not actually in the image, which is fabricated reasoning.
Sep 28 • 8 tweets • 5 min read
🔥 Meta reveals a massive inefficiency in AI’s reasoning process and gives a solution.
Large language models keep redoing the same work inside long chains of thought.
For example, when adding fractions with different denominators, the model often re-explains finding a common denominator step by step instead of just using a “common denominator” behavior.
In quadratic equations, it re-explains the discriminant logic or completes the square again instead of calling a “solve quadratic” behavior.
In unit conversion, it spells out inches to centimeters again instead of applying a “unit conversion” behavior.
🛑 The problem with this approach: when the model re-explains a routine, it spends many tokens on boilerplate steps that are identical across problems, which is wasted budget.
So this paper teaches the model to compress those recurring steps into small named behaviors that it can recall later or even learn into its weights.
A behavior compresses that routine into a short name plus instruction like a tiny macro that the model can reference.
At inference, a small list of relevant behaviors is given to the model or already internalized by training so the model can say which behavior it is using and skip the long re derivation.
Because it points to a named behavior, the output needs fewer tokens, and the saved tokens go to the new parts of the question.
Behavior conditioned fine tuning teaches the weights to trigger those routines from the question alone so even without retrieval the model tends to use the right shortcut.
Compute shifts from many output tokens to a few input hints and weight activations which is cheaper in most serving stacks and usually faster too.
Accuracy can improve because the model follows a tested routine instead of improvising a fresh multi step derivation that may drift.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
A behavior is a short name plus instruction for a reusable move, like inclusion exclusion or turning words into equations.
The behavior handbook is a store of how-to steps, which is procedural memory, unlike RAG that stores facts.
The authors frame the goal as remember how to reason, not just what to conclude, which matches the engineer’s point that remembering how to think beats thinking longer.
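To make that concrete, here is a toy sketch of what a retrieved-behavior prompt could look like. The behavior names, texts, and retrieval step are my own placeholders, not the paper's implementation:

```python
# Toy "behavior handbook": short named routines the model can cite instead of re-deriving.
BEHAVIOR_HANDBOOK = {
    "common_denominator": "To add fractions, rewrite them over the least common denominator, then add numerators.",
    "solve_quadratic": "For ax^2+bx+c=0, compute the discriminant b^2-4ac and apply the quadratic formula.",
    "unit_conversion": "Convert units by multiplying by a factor equal to 1 (e.g. 2.54 cm per inch).",
}

def build_prompt(question: str, relevant: list[str]) -> str:
    """Prepend the retrieved behaviors so the model can reference them by name."""
    hints = "\n".join(f"- {name}: {BEHAVIOR_HANDBOOK[name]}" for name in relevant)
    return f"Useful behaviors:\n{hints}\n\nQuestion: {question}\nAnswer, citing behaviors by name:"

print(build_prompt("What is 3/4 + 5/6?", ["common_denominator"]))
```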
Sep 28 • 16 tweets • 7 min read
One of the best papers of the recent week.
The big takeaway: scaling up model size doesn’t just make models smarter in terms of knowledge, it makes them last longer on multi-step tasks, which is what really matters for agents.
Shows that small models can usually do one step perfectly, but when you ask them to keep going for many steps, they fall apart quickly.
Even if they never miss on the first step, their accuracy drops fast as the task gets longer.
Large models, on the other hand, stay reliable across many more steps, even though the basic task itself doesn’t require extra knowledge or reasoning.
The paper says this is not because big models "know more," but because they are better at consistently executing without drifting into errors.
The paper names a failure mode called self-conditioning, where seeing earlier mistakes causes more mistakes, and they show that with thinking steps GPT-5 runs 1000+ steps in one go while others are far lower.
🧵 Read on 👇
🧵2/n. 🧠 The idea
The work separates planning from execution, then shows that even when the plan and the needed knowledge are handed to the model, reliability drops as the task gets longer, which makes small accuracy gains suddenly matter a lot.
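A quick back-of-the-envelope sketch of why long tasks punish small per-step error rates (my own arithmetic, not numbers from the paper):

```python
# If each step succeeds independently with probability p, a task of n steps
# succeeds with probability p**n, so tiny per-step gains compound enormously.
for p in (0.99, 0.999):
    for n in (10, 100, 1000):
        print(f"per-step accuracy {p:.3f}, {n:4d} steps -> task success {p**n:.3%}")
# e.g. 0.990 over 1000 steps -> ~0.004% task success, while 0.999 -> ~36.8%
```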
Sep 25 • 13 tweets • 10 min read
🚨 BAD news for Medical AI models.
MASSIVE revelations from this @Microsoft paper.
🤯 Current medical AI models may look good on standard medical benchmarks but those scores do not mean the models can handle real medical reasoning.
The key point is that many models pass tests by exploiting patterns in the data, not by actually combining medical text with images in a reliable way.
Sep 24 • 6 tweets • 4 min read
This is such a brilliant paper.
If this spreads, new research won’t just be something you read, it’ll be something you can use immediately.
It will lower barriers, save huge amounts of time, and could make science much more reliable and connected.
Normally, a paper is just a PDF plus maybe some code, and if you want to use it you have to install dependencies, debug environments, and figure out parameters. That’s hard and often stops people from ever using the method.
Paper2Agent skips all that. It automatically converts a paper into an interactive AI agent. You can talk to it in plain language, and it will actually run the real code, with the right data and setup, and give you results. No setup or manual fixing needed.
They proved it works on heavy-duty cases like AlphaGenome, TISSUE, and Scanpy, and the agents reproduced the original paper’s results with 100% accuracy, even on brand new queries.
⚙️ The Core Concepts
The framework represents each paper as an MCP server that bundles executable tools, static resources, and step-by-step prompts, then any LLM agent can call those tools with plain language to run the paper’s method.
This shifts research output from a passive document to an interactive system that demonstrates, applies, and adapts the paper’s ideas on demand.
It automates environment setup, extracts tools from the repo and tutorials, and tests them until outputs match the originals.
🧵 Read on 👇
🧵2/n. 🧩 Why MCP
Model Context Protocol gives a standard way to expose functions and data with clear inputs and outputs, so agents can call them reliably without custom glue code.
Paper2Agent uses this to encode datasets, code paths, and multi step workflows so the paper becomes addressable, composable, and easy to query.
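A hypothetical sketch of what one paper-derived tool could look like, assuming the MCP Python SDK's FastMCP helper; the tool name and logic here are made up for illustration, not taken from Paper2Agent:

```python
# Hypothetical MCP server exposing one method from a paper as a callable tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("toy-paper-agent")

@mcp.tool()
def normalize_counts(counts: list[float], target_sum: float = 10_000.0) -> list[float]:
    """Toy stand-in for a paper's preprocessing step: rescale a count vector."""
    total = sum(counts) or 1.0
    return [c / total * target_sum for c in counts]

if __name__ == "__main__":
    mcp.run()  # an LLM agent can now discover and call the tool over the protocol
```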
Sep 24 • 12 tweets • 7 min read
Brilliant @GoogleDeepMind paper, a major advancement in embedding-based search.
Most regular search systems return only exact or near matches; they miss the farther-up categories that still matter, the bigger parent categories your query belongs to.
And this paper’s simple 2-step training pulls those in reliably, lifting far-away match accuracy from 19% to 76%.
Meaning, long-distance recall jumps from 19% to 76% on WordNet with low-dimensional embeddings.
Long-distance recall is the % of the far-away relevant items that your search actually returns.
“Far-away” means items several steps up the category tree from your query, like “Footwear” for “Kid’s sandals”.
You compute it by looking only at those distant ancestors, counting how many should be returned, counting how many you actually returned, then doing hits divided by should-have.
If there are 10 such ancestors and your system returns 7, long-distance recall is 70%.
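A small code sketch of exactly that computation, my own illustration of the metric as described above:

```python
def long_distance_recall(returned: set[str], distant_ancestors: set[str]) -> float:
    """Fraction of the far-away ancestors that the search actually returned."""
    if not distant_ancestors:
        return 1.0
    hits = len(returned & distant_ancestors)
    return hits / len(distant_ancestors)

# The example above: 10 distant ancestors, 7 of them returned -> 0.7
ancestors = {f"ancestor_{i}" for i in range(10)}
results = set(list(ancestors)[:7]) | {"kid_sandals", "sandals"}
print(long_distance_recall(results, ancestors))  # 0.7
```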
⚙️ The Core Concepts
Hierarchical retrieval expects a query to bring back its own node and all more general ancestors, which is asymmetric, so the same concept must embed differently on the query side and the document side.
Euclidean geometry creates tension across queries, yet an asymmetric dual-encoder can resolve it with careful scoring.
The running example is “Kid’s sandals,” where “Sandals” is relevant to that query, but the reverse is not, which motivates asymmetric scoring.
🧠 The idea of this paper.
Dual encoders can solve hierarchical retrieval when query and document embeddings are asymmetric and the needed dimension grows gently with hierarchy depth and log of catalog size.
A simple schedule, pretrain on regular pairs then finetune on long-distance pairs, fixes misses on far ancestors without hurting close matches.
🧵 Read on 👇
🧵2/n. 🧩 Quick outline
The task is to retrieve the exact node plus all more general ancestors, so relevance is one-way.
They formalize the setup, train with a softmax loss over in-batch negatives, and score by recall.
They prove feasible Euclidean embeddings exist with a dimension that scales with depth and log of size.
Synthetic trees show learned encoders work at much smaller dimensions than the constructive bound.
Tiny dimensions fail mainly on far ancestors, the “lost-in-the-long-distance” effect.
Up-weighting far pairs hurts near pairs, so rebalancing alone fails.
Pretrain then finetune only on far pairs lifts all distance slices with early stopping to protect near pairs.
On WordNet and Amazon ESCI, the recipe gives clear recall gains across slices.
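Here is a minimal PyTorch sketch of the training setup in that outline: separate query and document encoders (so scoring can be asymmetric) with a softmax loss over in-batch negatives. It is a simplified assumption-level sketch, not DeepMind's code:

```python
import torch
import torch.nn.functional as F

d_model, d_embed = 32, 8
query_encoder = torch.nn.Linear(d_model, d_embed)  # different weights on each side
doc_encoder = torch.nn.Linear(d_model, d_embed)    # -> asymmetric query/document scoring

def in_batch_softmax_loss(query_feats, doc_feats):
    q = query_encoder(query_feats)                  # [B, d_embed]
    d = doc_encoder(doc_feats)                      # [B, d_embed]
    scores = q @ d.T                                # [B, B]; the diagonal holds positive pairs
    labels = torch.arange(q.shape[0])
    return F.cross_entropy(scores, labels)          # softmax over in-batch negatives

loss = in_batch_softmax_loss(torch.randn(16, d_model), torch.randn(16, d_model))
loss.backward()
```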
Sep 23 • 9 tweets • 5 min read
Tencent just took a big step beyond GRPO by introducing Single-stream Policy Optimization (SPO), an approach that fixes GRPO's wasted compute from degenerate groups and its constant group-synchronization stalls, making training both faster and more stable.
🧠 The idea
SPO trains with 1 response per prompt, keeps a persistent baseline per prompt, and normalizes advantages across the batch, which stabilizes learning and cuts waste.
This removes degenerate groups that give 0 signal and avoids group synchronization stalls in distributed runs.
On math reasoning with Qwen3-8B, SPO improves accuracy and learns more smoothly than GRPO.
Sensational fact: 4.35× throughput speedup in an agentic simulation and +3.4 pp maj@32 over GRPO.
🧵 Read on 👇
🧵2/n. 🧩 The problem with group-based training
Group-based methods sample many responses per prompt to compute a relative baseline, but when every response in a group is correct, or every one is wrong, the advantages become 0 and the step gives no gradient.
Heuristics like dynamic sampling try to force a non-zero advantage, but they add complexity and keep a synchronization barrier that slows large-scale training.
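A toy numpy contrast between the two advantage computations, my own sketch of the mechanics rather than either paper's code:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages: normalize within one prompt's group of responses."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

print(grpo_advantages([1, 1, 1, 1]))  # all-correct group -> all zeros, no gradient signal

def spo_advantages(rewards, baselines):
    """Single-stream style: 1 response per prompt vs. a persistent per-prompt baseline,
    then normalize across the whole batch."""
    adv = np.asarray(rewards, float) - np.asarray(baselines, float)
    return (adv - adv.mean()) / (adv.std() + 1e-8)

print(spo_advantages([1, 0, 1, 1], [0.8, 0.4, 0.2, 0.9]))  # every prompt still contributes
```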
Sep 19 • 7 tweets • 5 min read
LLM for financial trading/decision making.
A 4B-parameter financial-domain model, Trading-R1, that writes clear analyst theses and turns them into trades.
It's trained on 100K cases over 18 months across 14 tickers, and its backtests show better risk-adjusted returns with smaller drawdowns.
The problem it tackles is simple: quant models are hard to read, and general LLMs write nice text that does not translate into disciplined trades.
The solution starts by forcing a strict thesis format, with separate sections for market data, fundamentals, and sentiment, and every claim must point to evidence from the given context.
Then it learns decisions by mapping outcomes into 5 labels, strong buy, buy, hold, sell, strong sell, using returns that are normalized by volatility over several horizons.
For training, it first copies high-quality reasoning distilled from stronger black-box models using supervised fine-tuning, then it improves with a reinforcement method called group relative policy optimization.
In held-out tests on NVDA, AAPL, AMZN, META, MSFT, and SPY, the combined approach beats small and large baselines on Sharpe and max drawdown, and the authors position it as research support, not high-frequency automation.
🧵 Read on 👇
🧵2/n. The 3 steps used to train Trading-R1.
The first step is Structure. The model is taught how to write a thesis in a clear format. It must separate parts like market trends, company fundamentals, and sentiment, and it has to place each claim in the right section.
The second step is Claims. Here the model learns that any claim it makes must be supported by evidence. For example, if it says revenue is growing, it must back that with a source or number provided in the context.
The third step is Decision. The model turns the structured thesis into an actual trading action. It predicts outcomes like strong buy, buy, hold, sell, or strong sell. Its prediction is checked against the true outcome, and it gets rewards or penalties depending on accuracy.
Each step first uses supervised fine-tuning, which means training on examples with correct answers, and then reinforcement fine-tuning, which means refining the model by giving rewards when it produces better outputs.
Finally, all stages are combined, producing Trading-R1, a model that can both write well-structured financial reasoning and map that reasoning into actual trading decisions.
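A toy sketch of the label mapping described above, where volatility-normalized returns are bucketed into the 5 classes; the thresholds here are my own placeholders, not the paper's:

```python
def trade_label(forward_return: float, volatility: float) -> str:
    """Map a volatility-normalized return to one of 5 discrete labels (toy thresholds)."""
    z = forward_return / max(volatility, 1e-8)
    if z > 1.0:
        return "strong buy"
    if z > 0.3:
        return "buy"
    if z > -0.3:
        return "hold"
    if z > -1.0:
        return "sell"
    return "strong sell"

print(trade_label(0.06, 0.04))   # z = 1.5  -> strong buy
print(trade_label(-0.01, 0.05))  # z = -0.2 -> hold
```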
Sep 18 • 12 tweets • 7 min read
👨🔧 The DeepSeek R1 Nature paper’s supplementary notes are a goldmine across 83 solid pages.
Everything from training data and hyperparameters to why the base model matters.
Reinforcement learning, not just supervised fine-tuning, is what pushes DeepSeek‑R1 to generate long, reflective reasoning that actually fixes its own mistakes.
They train with Group Relative Policy Optimization, drop the value model, manage divergence to a moving reference, and let the model scale test‑time thinking to crack harder problems.
🧵 Read on 👇
🧵2/n. 🔁 GRPO, not PPO
Group Relative Policy Optimization samples a small group of answers for each prompt, scores them, normalizes those scores within the group, and updates the policy toward the better ones while skipping a separate value model.
They control drift with an unbiased Kullback–Leibler estimate against a reference policy and periodically refresh that reference, which avoids over‑penalizing long responses and cuts memory and compute.
On the same backbone, Proximal Policy Optimization needed careful lambda tuning to approach GRPO on math, which made GRPO the lower‑friction choice in practice.
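A small numpy sketch of those two ingredients, group-normalized advantages plus an unbiased KL estimate against a reference policy. Illustrative only, not DeepSeek's implementation:

```python
import numpy as np

def group_advantages(rewards):
    """Compare sampled answers within the same prompt's group, no value model needed."""
    r = np.asarray(rewards, float)
    return (r - r.mean()) / (r.std() + 1e-8)

def kl_unbiased_estimate(logp_policy, logp_ref):
    """The k3 estimator: ratio - log(ratio) - 1 is nonnegative and unbiased in expectation."""
    log_ratio = np.asarray(logp_ref, float) - np.asarray(logp_policy, float)
    return np.mean(np.exp(log_ratio) - log_ratio - 1.0)

adv = group_advantages([1, 0, 0, 1, 1])            # rewards for 5 sampled answers to one prompt
kl = kl_unbiased_estimate([-1.2, -0.7], [-1.0, -0.9])
print(adv, kl)
```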
Sep 18 • 5 tweets • 4 min read
🇨🇳 DeepSeek-R1 was published in Nature yesterday as the cover article for their BRILLIANT latest research.
They show that pure Reinforcement Learning with answer-only rewards can grow real reasoning skills, no human step-by-step traces required.
So completely skip human reasoning traces and still get SOTA reasoning via pure RL.
It's such a powerful revelation, because instead of forcing the model to copy human reasoning steps, it only rewards getting the final answer right, which gives the model freedom to invent its own reasoning strategies that can actually go beyond human examples.
Earlier methods capped models at what humans could demonstrate, but this breaks that ceiling and lets reasoning emerge naturally.
Those skills include self-checking, verification, and changing strategy mid-solution, and they beat supervised baselines on tasks where answers can be checked.
Models trained this way also pass those patterns down to smaller models through distillation.
AIME 2024 pass@1 jumps from 15.6% to 77.9%, and hits 86.7% with self-consistency.
⚙️ The Core Concepts
The paper replaces human-labelled reasoning traces with answer-graded RL, so the model only gets a reward when its final answer matches ground truth, which frees it to search its own reasoning style.
The result is longer thoughts with built-in reflection, verification, and trying backups when stuck, which are exactly the skills needed for math, coding, and STEM problems where correctness is checkable.
This matters because supervised traces cap the model at human patterns, while answer-graded RL lets it discover non-human routes that still land on correct answers.
🧪 How R1-Zero is trained
R1-Zero starts from DeepSeek-V3 Base and uses GRPO, a group-based variant of PPO that compares several sampled answers per question, then pushes the policy toward the higher-reward ones while staying close to a reference model.
Training enforces a simple output structure with separate thinking and final answer sections, and uses rule-based rewards for accuracy and formatting, avoiding neural reward models that are easy to game at scale.
This minimal setup is intentional, because fewer constraints make it easier to observe what reasoning behaviours show up on their own.
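A toy rule-based reward in that spirit: grade the final answer plus the thinking/answer format, with no neural reward model. The tag names and the small format bonus are my own assumptions, not the exact reward used in the paper:

```python
import re

def reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: exact-match accuracy plus a small bonus for the expected format."""
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.S))
    match = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    answer = match.group(1).strip() if match else ""
    accuracy = 1.0 if answer == ground_truth.strip() else 0.0
    return accuracy + (0.1 if format_ok else 0.0)

print(reward("<think>2+2=4</think> <answer>4</answer>", "4"))  # 1.1
print(reward("The answer is 4", "4"))                          # 0.0
```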
Sep 16 • 11 tweets • 5 min read
"The Impact of Artificial Intelligence on Human Thought"
A big 132 page report.
AI is shifting real thinking work onto external systems, which boosts convenience but can weaken the effort that builds understanding and judgment.
The paper frames this pattern through cognitive offloading and cognitive load theory, then tracks it into social effects like standardized language, biased information flows, and manipulation tactics that target human psychology.
It says use AI to cut noise and routine steps, keep humans doing the heavy mental lifting, and add controls because personalization, deepfakes, and opaque models can steer choices at scale.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
Cognitive load theory says working memory is limited, so AI helps when it reduces extraneous load and hurts when it replaces the germane load needed to build skill.
In plain terms, let tools clean up the interface and fetch data, but keep people doing the analysis, explanation, and sense‑making.
Sep 15 • 8 tweets • 6 min read
Powerful new discoveries in this paper for autonomous software design.🎯
Will completely shift the way Software and AI programming will be written.
1/ Tau is in the process of constructing the next wave of AI.
Tau Language lets you write a spec of what a program should and shouldn’t do, and its logical engine automatically constructs a program mathematically guaranteed to meet your spec, removing manual implementation.
The most time-consuming aspect of software dev used to be writing correct code; now, it's about conveying intent accurately in specifications and getting correct-by-construction software.
This foundation is also the subject of a U.S. patent that covers using such temporal logics and Boolean‑algebraic theories for safe AI and a software‑spec logic, which matches the design.
2/ How this is different from today
With Tau, you directly state properties of the program like a formalization of “never send private data over the network”, and it produces a provably correct implementation that satisfies them.
This breaks away from today's coding, in which you write how and what a program should do at each step. And unlike Tau, in code you can't say what the program should never do; you test and hope you covered edge cases.
In the Tau Language, programs, inputs, and outputs can be sentences in the Tau language itself, which is the first logic ever that can consistently refer to its own sentences.
Why LLMs Fall Short:
People expect deterministic and correct output from probabilistic tools, which can't be trusted to be reliable. Imagine the disastrous results if an airplane manufacturer decided to use code generated by LLMs; how many of you would take that flight?
Gen AI's probabilistic nature creates entropy precisely where complex systems need precision and reliability.
The V-model dev model is the standard for critical products developed for the medical industry.
The deeper you are into your V-model product development cycle, the more it costs to fix a defect.
🧩 What makes Tau Language a huge breakthrough:
Tau’s Founder @ohadasor made several novel inventions, which together work as a masterpiece in theoretical computer science.
Tau Language straddles a fine line, retaining decidability while being expressive enough to write specs of complex systems in their entirety, where other decidable formal languages simply aren't strong enough.
Let’s dive deeper into Tau Language’s novel research:
🧵 1/n
🧵 2/n. 🧩 NSO (Nullary Second Order), the self‑referential core
NSO abstracts sentences so aggressively that each sentence becomes just a Boolean algebra element, which lets the logic speak about sentences in the same language without running into the usual truth paradoxes.
Because it deals with countable atomless Boolean algebras, you keep decidability, and even NSO[C] is decidable iff C is decidable, so extensions stay manageable.
This relies on the Lindenbaum–Tarski algebra view of logics, which collects sentences into equivalence classes under logical equivalence, turning syntax into algebra and letting the rest be handled with ordinary Boolean operations.
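For reference, the textbook Lindenbaum–Tarski construction being alluded to (standard definition, not taken from the thread): sentences are grouped into equivalence classes under provable equivalence, and the classes form a Boolean algebra.

```latex
\[
  \varphi \sim \psi \iff \vdash \varphi \leftrightarrow \psi, \qquad
  [\varphi] \wedge [\psi] := [\varphi \wedge \psi], \quad
  [\varphi] \vee [\psi] := [\varphi \vee \psi], \quad
  \neg[\varphi] := [\neg\varphi],
\]
\[
  \top := [\varphi \vee \neg\varphi], \qquad \bot := [\varphi \wedge \neg\varphi].
\]
```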
Sep 14 • 5 tweets • 3 min read
🧠 🧩 🚦 📉 ⚙️ University of Sheffield argues LLM hallucinations are mathematically inevitable.
And using confidence thresholds the way OpenAI proposes would cut hallucinations but break consumer UX and spike costs.
The core claim is that next-token generation stacks errors across a sentence, so even with perfect data the total mistake rate grows.
A language model builds a sentence word by word. At each step, it picks the next word it thinks is most likely. If it makes one small mistake early on, that mistake affects the words that come after it. The sentence then drifts further from the correct answer.
Now compare that to a yes/no question. The model only has to pick between two options: “yes” or “no.” There is just one decision, so fewer chances for error.
A yes/no question is like a baseline: one single prediction, no chain of dependencies. But a sentence is a long chain of predictions, and each link in the chain can go wrong.
This is why the study says the error rate for full sentences will always be at least 2 times higher than for simple yes/no answers: in sentences, errors can accumulate word by word instead of being contained in a single decision.
In plain terms, incentives today still favor fast, cheap, confident replies over slower, cautious, correct ones, so hallucinations will stick around.
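A back-of-the-envelope illustration of the chain-of-decisions point, my own arithmetic rather than the paper's numbers:

```python
# One yes/no decision with a 2% error rate stays at 2%, but a 20-token answer whose
# tokens each carry a 2% error rate goes wrong far more often.
p_token_error = 0.02
tokens = 20
p_sentence_wrong = 1 - (1 - p_token_error) ** tokens
print(f"yes/no error: {p_token_error:.1%}, 20-token sentence error: {p_sentence_wrong:.1%}")
# roughly 33% vs 2%
```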
2/n. Rarer facts are even more prone to hallucinations.
When a model is trained, it learns facts from the data it sees. Some facts show up many times during training, so the model gets strong evidence for them. Other facts only appear once or twice, so the model has weak evidence for them.
The study gives an example with birthdays. Suppose 20% of the people in the training set only have their birthday mentioned once. For those people, the model basically has just a single memory of the fact. That is too little for the model to reliably recall it later.
As a result, when you ask about those birthdays, the model will likely get at least 20% of them wrong, because those facts were too rare in its training data.
Sep 12 • 6 tweets • 2 min read
Congrats to @CreaoAI for hitting #1 on Product Hunt (Sept 11) 🚀
Just used it myself, and it's quite a smooth experience.
CREAO is an AI Agent that builds full-stack mini-SaaS from one sentence.
One sentence in → frontend + backend + data layer out.
They are building a platform to provide the critical interface for people to build apps where humans and AI agents can collaborate seamlessly.
So its entire infrastructure is engineered with an "AI-native first" philosophy.
🧵1/n.
🧵2/n. ⚡ All-in-one build.
CREAO gave me a deployable product — frontend, backend, database together.
#1 on Product Hunt (Sept 11).
Sep 11 • 5 tweets • 3 min read
🇨🇳China unveils world's first brain-like AI Model SpikingBrain1.0
Up to 100X faster while being trained on less than 2% of the data typically required.
Designed to mimic human brain functionality, uses much less energy. A new paradigm in efficiency and hardware independence.
Marks a significant shift from current AI architectures
Unlike models such as GPT and LLaMA, which use attention mechanisms to process all input in parallel, SpikingBrain1.0 employs localized attention, focusing only on the most relevant recent context.
Potential Applications:
- Real-time, low-power environments
- Autonomous drones and edge computing
- Wearable devices requiring efficient processing
- Scenarios where energy consumption is critical
This project is part of a larger scientific pursuit of neuromorphic computing, which aims to replicate the remarkable efficiency of the human brain, which operates on only about 20 watts of power.
---
arxiv.org/abs/2509.05276
🧠 The idea: human-brain-inspired linear or hybrid-linear LLMs for the SpikingBrain architecture.
- SpikingBrain replaces most quadratic attention with linear and local attention, mixes in selective full attention where it matters, and adds an adaptive spiking activation so the model computes only on meaningful events.
- It proves the whole recipe works at scale by training and serving on MetaX C550 GPUs, which are non‑NVIDIA devices, without giving up quality on common benchmarks.
- The headline efficiencies come from 3 levers working together, linear attention for compressed memory, MoE for token-wise sparsity, and spiking for micro-level sparsity.
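A toy numpy contrast between quadratic softmax attention and a kernelized linear-attention variant of the kind described here; a sketch of the general idea, not SpikingBrain's code:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the [n, n] score matrix makes cost grow quadratically in length."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized variant: a fixed-size [d, d] summary replaces the [n, n] score matrix."""
    kv_state = phi(K).T @ V                      # compressed memory, independent of sequence length
    norm = phi(Q) @ phi(K).sum(axis=0)           # per-query normalizer
    return (phi(Q) @ kv_state) / norm[:, None]

rng = np.random.default_rng(1)
n, d = 128, 16
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```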
Sep 11 • 4 tweets • 3 min read
Fantastic paper from ByteDance 👏
Shows how to train LLM agents to finish long, multi step tasks by letting them act in real environments with reinforcement learning.
Across 27 tasks, the trained agents rival or beat top proprietary models.
Most agents are trained on single turn data, so they fail when a job needs many decisions with noisy feedback.
AgentGym-RL splits the system into separate parts, the environments, the agent loop, and training, so each can improve on its own.
It supports mainstream algorithms and realistic tasks, and the agent learns by acting, seeing results, and adjusting across different settings.
The key method, ScalingInter-RL, starts with short interactions to master basics, then slowly allows longer runs so the agent can explore and plan.
This staged horizon schedule stabilizes learning, prevents pointless loops, and encourages planning, reflection, and recovery after mistakes.
A 7B model trained with this setup matches or beats much larger open models and competes well with strong commercial ones.
They also find that putting more compute into training and test time interaction, like more steps or samples, often helps more than adding parameters.
How the AgentGym-RL framework works.
At the center is the LLM agent. It takes an instruction, interacts with an environment for several turns, and then produces actions. Each action changes the environment, and the environment sends feedback back to the agent. This cycle repeats many times.
The environment itself is handled by a server that can simulate different types of tasks. These include web browsing, searching, coding, playing games, doing science tasks, or controlling embodied agents. The environment client manages the interaction and communicates through standard protocols.
Every full cycle of actions and observations is called a trajectory. These trajectories are collected and then used to update the agent’s policy with reinforcement learning algorithms like PPO, GRPO, RLOO, or REINFORCE++.
The framework is modular. The environment, the agent, and the training part are separated. This makes it flexible, easy to extend, and suitable for many types of realistic tasks.
The diagram highlights how the agent learns not by memorizing answers, but by trying actions, getting feedback, and improving its decision making across different domains.
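A generic sketch of that interaction loop; the class and method names below are placeholders of my own, not AgentGym-RL's actual API:

```python
def collect_trajectory(agent, env, instruction, max_turns=20):
    """The agent acts, the environment responds, and the full trajectory is recorded."""
    observation, trajectory = env.reset(instruction), []
    for _ in range(max_turns):
        action = agent.act(instruction, observation, trajectory)
        observation, reward, done = env.step(action)
        trajectory.append((action, observation, reward))
        if done:
            break
    return trajectory

def train(agent, env, tasks, update_policy, epochs=3):
    """Collected trajectories feed the policy update, e.g. PPO / GRPO / RLOO / REINFORCE++."""
    for _ in range(epochs):
        batch = [collect_trajectory(agent, env, task) for task in tasks]
        update_policy(agent, batch)
```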
Sep 9 • 7 tweets • 4 min read
📢 Another brilliant piece of research just dropped from @GoogleResearch, a major advancement toward a systematic way to generate expert-level scientific software automatically.
An LLM plus tree search turns scientific coding into a score driven search engine.
This work builds an LLM + Tree Search loop that writes and improves scientific code by chasing a single measurable score for each task.
The key idea is to treat coding for scientific tasks as a scorable search problem.
That means every candidate program can be judged by a simple numeric score, like how well it predicts, forecasts, or integrates data. Once you have a clear score, you can let an LLM rewrite code again and again, run the code in a sandbox, and use tree search to keep the best branches while discarding weaker ones.
With compact research ideas injected into the prompt, the system reaches expert level and beats strong baselines across biology, epidemiology, geospatial, neuroscience, time series, and numerical methods.
Training speed: less than 2 hours on 1 T4 vs 36 hours on 16 A100s.
In bioinformatics, it came up with 40 new approaches for single-cell data analysis that beat the best human-designed methods on a public benchmark.
In epidemiology, it built 14 models that set state-of-the-art results for predicting COVID-19 hospitalizations.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
Empirical software is code built to maximize a quality score on observed data, and any task that fits this framing becomes a scorable task.
This view turns software creation into a measurable search problem, because every candidate program is judged by the same numeric target.
This framing also explains why the method can travel across domains, since only the scoring function changes.
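A simplified sketch of that score-driven search loop; `llm_rewrite` and `score_in_sandbox` are placeholder functions standing in for the LLM call and the sandboxed scorer, not Google's API:

```python
import heapq
import itertools

def tree_search_for_code(seed_program, llm_rewrite, score_in_sandbox, budget=100, beam=5):
    """Repeatedly ask the LLM for rewrites, score each candidate, and keep the best branches."""
    tie = itertools.count()                                  # tie-breaker so the heap never compares programs
    best_program, best_score = seed_program, score_in_sandbox(seed_program)
    frontier = [(-best_score, next(tie), seed_program)]      # max-heap via negated scores
    for _ in range(budget):
        if not frontier:
            break
        _, _, program = heapq.heappop(frontier)
        for child in llm_rewrite(program, n_children=3):     # ask the LLM for candidate edits
            score = score_in_sandbox(child)                  # run the code, measure the task metric
            if score > best_score:
                best_program, best_score = child, score
            heapq.heappush(frontier, (-score, next(tie), child))
        frontier = heapq.nsmallest(beam, frontier)           # keep only the most promising branches
        heapq.heapify(frontier)
    return best_program, best_score
```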
Sep 9 • 9 tweets • 3 min read
Fei-Fei Li (@drfeifei) on limitations of LLMs.
"There's no language out there in nature. You don't go out in nature and there's words written in the sky for you.. There is a 3D world that follows laws of physics."
Language is purely generated signal.
AI models trained on linguistic signals fail when the task requires embodied physical common sense in a world with real constraints.
Sep 8 • 10 tweets • 5 min read
BRILLIANT paper.
LLMs get stuck when they think too long in a single line: early tokens steer them into a narrow path and they rarely recover, which the authors call Tunnel Vision.
ParaThinker trains native parallel thinking: it spins up multiple distinct reasoning paths at once and then fuses them into one answer, which lifts accuracy a lot with tiny latency cost.
Sensational fact, if you only keep 1 thing: 12.3% average gain for 1.5B, 7.5% for 7B, with only 7.1% extra latency.
ParaThinker shows that training LLMs to think in parallel paths instead of just longer single chains avoids tunnel vision, giving up to 12.3% accuracy gains with only 7.1% extra latency, letting smaller models beat much larger ones.
🧵 Read on 👇
🧵2/n. 🧩 Why longer thinking stalls
When the model makes a mistake early on, it keeps building on that mistake.
The longer it goes down that wrong path, the less chance it has to recover.
This stuck behavior is what the authors call Tunnel Vision, and it explains why just letting the model think longer doesn’t always improve accuracy.
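A toy illustration of the parallel-paths intuition. ParaThinker fuses paths inside the model with special training, but a simple majority vote over independently sampled paths (my own stand-in, with `sample_path` as a placeholder for one full reasoning attempt) shows why several fresh starts beat one long tunnel:

```python
from collections import Counter

def parallel_answer(question, sample_path, n_paths=8):
    """Sample several independent reasoning paths and return the answer most of them agree on."""
    answers = [sample_path(question) for _ in range(n_paths)]  # distinct paths, e.g. different seeds
    return Counter(answers).most_common(1)[0][0]
```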