Compiling in real-time, the race towards AGI. 🗞️ Don't miss my daily top 1% AI analysis newsletter directly to your inbox 👉 https://t.co/6LBxO8215l
Sep 16 11 tweets 5 min read
"The Impact of Artificial Intelligence on Human Thought"

A big 132 page report.

AI is shifting real thinking work onto external systems, which boosts convenience but can weaken the effort that builds understanding and judgment.

It is a pattern the paper frames through cognitive offloading and cognitive load theory, then tracks into social effects like standardized language, biased information flows, and manipulation tactics that target human psychology.

It says use AI to cut noise and routine steps, keep humans doing the heavy mental lifting, and add controls because personalization, deepfakes, and opaque models can steer choices at scale.

🧵 Read on 👇

🧵2/n. ⚙️ The Core Concepts

Cognitive load theory says working memory is limited, so AI helps when it reduces extraneous load and hurts when it replaces the germane load needed to build skill.

In plain terms, let tools clean up the interface and fetch data, but keep people doing the analysis, explanation, and sense‑making.
Sep 15 8 tweets 6 min read
Powerful new discoveries in this paper for autonomous software design.🎯

Will completely shift the way software and AI programming are written.

1/ Tau is in the process of constructing the next wave of AI.

Tau Language lets you write a spec of what a program should and shouldn’t do, and its logical engine automatically constructs a program mathematically guaranteed to meet your spec, removing manual implementation.

The most time-consuming aspect of software development used to be writing correct code; now it is about conveying intent accurately in specifications and getting correct-by-construction software.

This foundation is also the subject of a U.S. patent that covers using such temporal logics and Boolean‑algebraic theories for safe AI and a software‑spec logic, which matches the design.

2/ How this is different from today

With Tau, you directly state properties of the program like a formalization of “never send private data over the network”, and it produces a provably correct implementation that satisfies them.

This breaks away from today’s coding, in which you write how and what a program should do at each step. And unlike in Tau, in ordinary code you can’t say what the program should never do; you test and hope you covered the edge cases.
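To make the contrast concrete, here is a toy Python sketch (my own illustration, not Tau syntax or Tau's engine): the spec is a predicate over a program's behavior, including a "never" clause, and a program only counts as correct if it satisfies the spec on every input in the domain.

```python
# Toy illustration (not Tau): the spec is a predicate over a program's full
# behavior on a finite input domain, including a "shall never" clause, and a
# program only counts as correct if it satisfies the spec everywhere.

INPUTS = range(-10, 11)

def spec(program) -> bool:
    """Require doubling, and NEVER a negative output for a non-negative input."""
    for x in INPUTS:
        y = program(x)
        if x >= 0 and y < 0:     # the "never do this" part of the spec
            return False
        if y != 2 * x:           # the "do this" part of the spec
            return False
    return True

# A tiny candidate space; a synthesizer would construct the program from the
# spec instead of enumerating guesses like this.
candidates = [lambda x: x + x, lambda x: x * x, lambda x: -2 * x]

print([spec(p) for p in candidates])   # -> [True, False, False]
```

Tau's engine constructs the implementation from the spec with a mathematical guarantee; the toy above only shows what it means for a spec to state both what a program must do and what it must never do.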

In the Tau Language, programs, inputs, and outputs can be sentences in the Tau language itself, which is the first logic ever that can consistently refer to its own sentences.

Why LLMs Fall Short:

People expect deterministic and correct output from probabilistic tools, which can’t be trusted to be reliable. Imagine the disastrous results if an airplane manufacturer shipped code generated by LLMs; how many of you would take that flight?

Gen AI's probabilistic nature creates entropy precisely where complex systems need precision and reliability.

The V-model is the standard development process for safety-critical products such as medical devices.

The deeper you are into your V-model product development cycle, the more it costs to fix a defect.

🧩 What makes Tau Language a huge breakthrough:

Tau’s Founder @ohadasor made several novel inventions, which together work as a masterpiece in theoretical computer science.

Tau Language straddles a fine line retaining decidability while being expressive enough to write specs of complex systems in their entirety, where other decidable formal languages simply aren’t strong enough.

Let’s dive deeper into Tau Language’s novel research:

🧵 1/n

🧵 2/n. 🧩 NSO (Nullary Second Order), the self‑referential core

NSO abstracts sentences so aggressively that each sentence becomes just a Boolean algebra element, which lets the logic speak about sentences in the same language without running into the usual truth paradoxes.

Because it deals with countable atomless Boolean algebras, you keep decidability, and even NSO[C] is decidable iff C is decidable, so extensions stay manageable.

This relies on the Lindenbaum–Tarski algebra view of logics, which collects sentences into equivalence classes under logical equivalence, turning syntax into algebra and letting the rest be handled with ordinary Boolean operations.
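A minimal sketch of the Lindenbaum–Tarski move for ordinary propositional logic (my own toy, far simpler than NSO): identify each sentence with its truth table, so logically equivalent sentences collapse to one algebra element and the connectives become Boolean-algebra operations.

```python
from itertools import product

# Toy Lindenbaum–Tarski construction for propositional logic over 2 variables.
# Each sentence is identified with its truth table over all valuations, so
# logically equivalent sentences collapse to the same algebra element and the
# connectives become ordinary Boolean operations on those elements.

VARS = ["p", "q"]
VALUATIONS = list(product([False, True], repeat=len(VARS)))

def cls(formula):
    """Map a formula (a function of the valuation dict) to its equivalence class."""
    return tuple(formula({v: b for v, b in zip(VARS, vals)}) for vals in VALUATIONS)

p = cls(lambda env: env["p"])
q = cls(lambda env: env["q"])
AND = lambda a, b: tuple(x and y for x, y in zip(a, b))
OR  = lambda a, b: tuple(x or y for x, y in zip(a, b))
NOT = lambda a: tuple(not x for x in a)

# "p -> q" and "not p or q" are syntactically different sentences but land in
# the same algebra element, which is exactly the quotient described above.
implies = cls(lambda env: (not env["p"]) or env["q"])
assert implies == OR(NOT(p), q)
assert AND(p, NOT(p)) == (False,) * 4      # contradictions collapse to the bottom element
print("p -> q collapses to not-p or q:", implies)
```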
Sep 14 16 tweets 7 min read
One of the best papers of recent weeks.

The big takeaway: scaling up model size doesn’t just make models smarter in terms of knowledge, it makes them last longer on multi-step tasks, which is what really matters for agents.

Shows that small models can usually do one step perfectly, but when you ask them to keep going for many steps, they fall apart quickly.

Even if they never miss on the first step, their accuracy drops fast as the task gets longer.

Large models, on the other hand, stay reliable across many more steps, even though the basic task itself doesn’t require extra knowledge or reasoning.

The paper says this is not because big models "know more," but because they are better at consistently executing without drifting into errors.

The paper names a failure mode called self-conditioning, where seeing earlier mistakes causes more mistakes, and they show that with thinking steps GPT-5 runs 1000+ steps in one go while others are far lower.

🧵 Read on 👇

🧵2/n. 🧠 The idea

The work separates planning from execution, then shows that even when the plan and the needed knowledge are handed to the model, reliability drops as the task gets longer, which makes small accuracy gains suddenly matter a lot.
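A quick back-of-the-envelope sketch of why that matters (illustrative numbers, not the paper's): if each step independently succeeds with probability p, an N-step task succeeds with probability p^N, so small per-step gains buy much longer reliable horizons.

```python
import math

# Back-of-the-envelope: if each step independently succeeds with probability p,
# an N-step task succeeds with probability p**N (illustrative numbers only).

def task_success(p: float, n_steps: int) -> float:
    return p ** n_steps

def horizon(p: float, target: float = 0.5) -> int:
    """Longest task that still succeeds with probability >= target."""
    return int(math.log(target) / math.log(p))

for p in (0.99, 0.999, 0.9999):
    print(f"per-step acc {p}: 100-step success {task_success(p, 100):.2f}, "
          f"~{horizon(p)} steps before success falls below 50%")
```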
Sep 14 5 tweets 3 min read
🧠 🧩 🚦 📉 ⚙️ University of Sheffield argues LLM hallucinations are mathematically inevitable.

And using confidence thresholds the way OpenAI proposes would cut hallucinations but break consumer UX and spike costs.

The core claim is that next-token generation stacks errors across a sentence, so even with perfect data the total mistake rate grows.

A language model builds a sentence word by word. At each step, it picks the next word it thinks is most likely. If it makes one small mistake early on, that mistake affects the words that come after it. The sentence then drifts further from the correct answer.

Now compare that to a yes/no question. The model only has to pick between two options: “yes” or “no.” There is just one decision, so fewer chances for error.

An "yes/no question" is like a baseline: one single prediction, no chain of dependencies. But a sentence is a long chain of predictions, and each link in the chain can go wrong.

This is why the study says the error rate for full sentences will always be at least 2 times higher than for simple yes/no answers. Because in sentences, errors can accumulate word by word, instead of being contained in a single decision.
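A tiny arithmetic sketch of that compounding claim (the per-decision error rate here is an assumed number, not from the study):

```python
# One yes/no decision vs a chain of T next-token decisions, each with the same
# small error rate e (illustrative numbers, not the study's).
e = 0.02          # assumed per-decision error rate
T = 20            # tokens in a generated sentence

yes_no_error   = e                       # one decision, one chance to be wrong
sentence_error = 1 - (1 - e) ** T        # wrong if any of the T decisions is wrong

print(f"yes/no error:   {yes_no_error:.3f}")
print(f"sentence error: {sentence_error:.3f}  "
      f"({sentence_error / yes_no_error:.1f}x the single-decision rate)")
```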

In plain terms, incentives today still favor fast, cheap, confident replies over slower, cautious, correct ones, so hallucinations will stick around.

2/n. Rarer facts are even more prone to hallucinations.

When a model is trained, it learns facts from the data it sees. Some facts show up many times during training, so the model gets strong evidence for them. Other facts only appear once or twice, so the model has weak evidence for them.

The study gives an example with birthdays. Suppose 20% of the people in the training set only have their birthday mentioned once. For those people, the model basically has just a single memory of the fact. That is too little for the model to reliably recall it later.

As a result, when you ask about those birthdays, the model will likely get at least 20% of them wrong, because those facts were too rare in its training data.
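The singleton argument in miniature (toy data, not the study's corpus): the share of facts seen exactly once in training roughly lower-bounds the error rate on questions about those facts.

```python
from collections import Counter

# Toy version of the singleton argument: facts mentioned only once in training
# give the model too little evidence, and the singleton fraction roughly
# lower-bounds the error rate on queries about them (toy data only).
training_mentions = [
    "alice:1990-01-01", "alice:1990-01-01", "alice:1990-01-01",
    "bob:1985-06-15",   "bob:1985-06-15",
    "carol:1977-03-09",                    # mentioned exactly once
    "dave:2001-12-31",                     # mentioned exactly once
]

counts = Counter(training_mentions)
singleton_rate = sum(1 for c in counts.values() if c == 1) / len(counts)
print(f"singleton facts: {singleton_rate:.0%} -> expect at least that share "
      "of birthday questions answered wrong")
```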
Sep 12 6 tweets 2 min read
Congrats to @CreaoAI for hitting #1 on Product Hunt (Sept 11) 🚀

I just used it myself, and it was quite a smooth experience.

CREAO is an AI Agent that builds full-stack mini-SaaS from one sentence.

One sentence in → frontend + backend + data layer out.

They are building a platform to provide the critical interface for people to build apps where humans and AI agents can collaborate seamlessly.

So its entire infrastructure is engineered with an "AI-native first" philosophy.

🧵1/n.

🧵2/n. ⚡ All-in-one build.

CREAO gave me a deployable product — frontend, backend, database together.

#1 on Product Hunt (Sept 11).
Sep 11 5 tweets 3 min read
🇨🇳China unveils world's first brain-like AI Model SpikingBrain1.0

Up to 100X faster while being trained on less than 2% of the data typically required.

Designed to mimic human brain functionality, uses much less energy. A new paradigm in efficiency and hardware independence.

Marks a significant shift from current AI architectures

Unlike models such as GPT and LLaMA, which use attention mechanisms to process all input in parallel, SpikingBrain1.0 employs localized attention, focusing only on the most relevant recent context.

Potential Applications:

- Real-time, low-power environments
- Autonomous drones and edge computing
- Wearable devices requiring efficient processing
- Scenarios where energy consumption is critical

This project is part of a larger scientific pursuit of neuromorphic computing, which aims to replicate the remarkable efficiency of the human brain, which operates on only about 20 watts of power.

---

arxiv.org/abs/2509.05276

🧠 The idea: human-brain-inspired linear or hybrid-linear LLMs in the SpikingBrain architecture.

- SpikingBrain replaces most quadratic attention with linear and local attention, mixes in selective full attention where it matters, and adds an adaptive spiking activation so the model computes only on meaningful events.

- It proves the whole recipe works at scale by training and serving on MetaX C550 GPUs, which are non‑NVIDIA devices, without giving up quality on common benchmarks.

- The headline efficiencies come from 3 levers working together: linear attention for compressed memory, MoE for token-wise sparsity, and spiking for micro-level sparsity.
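A rough numpy sketch of two of those levers under heavy simplification (this is not the SpikingBrain code): a sliding-window mask so each token attends only to recent context, and a threshold "spiking" activation that zeroes sub-threshold values so compute happens only on salient events.

```python
import numpy as np

# Rough sketch (not the SpikingBrain implementation): sliding-window "local"
# attention plus a threshold spiking activation that keeps only salient values.

def local_attention(q, k, v, window: int):
    """Each position attends only to the last `window` positions (inclusive)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    pos = np.arange(T)
    mask = (pos[None, :] <= pos[:, None]) & (pos[:, None] - pos[None, :] < window)
    scores = np.where(mask, scores, -1e9)            # block everything outside the window
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def spiking_activation(x, threshold: float = 0.5):
    """Zero out sub-threshold activity so only 'spikes' propagate (adds sparsity)."""
    return np.where(np.abs(x) > threshold, x, 0.0)

T, d = 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out = spiking_activation(local_attention(q, k, v, window=4))
print(f"fraction of activations kept: {np.mean(out != 0):.2f}")
```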
Sep 11 4 tweets 3 min read
Fantastic paper from ByteDance 👏

Shows how to train LLM agents to finish long, multi step tasks by letting them act in real environments with reinforcement learning.

Across 27 tasks, the trained agents rival or beat top proprietary models.

Most agents are trained on single turn data, so they fail when a job needs many decisions with noisy feedback.

AgentGym-RL splits the system into separate parts, the environments, the agent loop, and training, so each can improve on its own.

It supports mainstream algorithms and realistic tasks, and the agent learns by acting, seeing results, and adjusting across different settings.

The key method, ScalingInter-RL, starts with short interactions to master basics, then slowly allows longer runs so the agent can explore and plan.

This staged horizon schedule stabilizes learning, prevents pointless loops, and encourages planning, reflection, and recovery after mistakes.

A 7B model trained with this setup matches or beats much larger open models and competes well with strong commercial ones.

They also find that putting more compute into training and test time interaction, like more steps or samples, often helps more than adding parameters.

How the AgentGym-RL framework works.

At the center is the LLM agent. It takes an instruction, interacts with an environment for several turns, and then produces actions. Each action changes the environment, and the environment sends feedback back to the agent. This cycle repeats many times.

The environment itself is handled by a server that can simulate different types of tasks. These include web browsing, searching, coding, playing games, doing science tasks, or controlling embodied agents. The environment client manages the interaction and communicates through standard protocols.

Every full cycle of actions and observations is called a trajectory. These trajectories are collected and then used to update the agent’s policy with reinforcement learning algorithms like PPO, GRPO, RLOO, or REINFORCE++.

The framework is modular. The environment, the agent, and the training part are separated. This makes it flexible, easy to extend, and suitable for many types of realistic tasks.

The diagram highlights how the agent learns not by memorizing answers, but by trying actions, getting feedback, and improving its decision making across different domains.
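A skeletal sketch of that loop with toy stand-ins (the ToyEnv, ToyAgent, and horizon numbers below are my own, not AgentGym-RL's code): collect multi-turn trajectories under a horizon cap that grows across phases, then hand each batch to an RL update.

```python
import random

# Skeletal sketch of the described loop (toy stand-ins, not AgentGym-RL code):
# collect multi-turn trajectories under a horizon cap that grows across phases,
# then hand the batch to an RL update (PPO / GRPO / RLOO would sit behind it).

class ToyEnv:
    """Stands in for the environment server (web, search, games, ...)."""
    def reset(self, instruction):
        self.steps_left = random.randint(3, 30)
        return f"obs for: {instruction}"
    def step(self, action):
        self.steps_left -= 1
        done = self.steps_left <= 0
        return "next obs", (1.0 if done else 0.0), done

class ToyAgent:
    def act(self, obs):            # a real agent would call the LLM policy here
        return "some action"
    def update(self, batch):       # a real agent would run PPO/GRPO on the batch
        pass

def rollout(agent, env, instruction, max_turns):
    obs, traj = env.reset(instruction), []
    for _ in range(max_turns):
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        traj.append((obs, action, reward))
        if done:
            break
    return traj

# ScalingInter-style schedule: short horizons first, then progressively longer.
agent, env = ToyAgent(), ToyEnv()
for max_turns in (5, 15, 40):
    batch = [rollout(agent, env, f"task {i}", max_turns) for i in range(8)]
    agent.update(batch)
    print(f"horizon {max_turns}: avg trajectory length "
          f"{sum(len(t) for t in batch) / len(batch):.1f}")
```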
Sep 9 7 tweets 4 min read
📢 Another brilliant piece of research just dropped from @GoogleResearch - a major advance toward systematically generating expert-level scientific software automatically.

An LLM plus tree search turns scientific coding into a score driven search engine.

This work builds an LLM + Tree Search loop that writes and improves scientific code by chasing a single measurable score for each task.

The key idea is to treat coding for scientific tasks as a scorable search problem.

That means every candidate program can be judged by a simple numeric score, like how well it predicts, forecasts, or integrates data. Once you have a clear score, you can let an LLM rewrite code again and again, run the code in a sandbox, and use tree search to keep the best branches while discarding weaker ones.

With compact research ideas injected into the prompt, the system reaches expert level and beats strong baselines across biology, epidemiology, geospatial, neuroscience, time series, and numerical methods.

Training speed: less than 2 hours on 1 T4 vs 36 hours on 16 A100s.

In bioinformatics, it came up with 40 new approaches for single-cell data analysis that beat the best human-designed methods on a public benchmark.

In epidemiology, it built 14 models that set state-of-the-art results for predicting COVID-19 hospitalizations.

🧵 Read on 👇

🧵2/n. ⚙️ The Core Concepts

Empirical software is code built to maximize a quality score on observed data, and any task that fits this framing becomes a scorable task.

This view turns software creation into a measurable search problem, because every candidate program is judged by the same numeric target.

This framing also explains why the method can travel across domains, since only the scoring function changes.
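A toy sketch of the score-driven search loop (the `rewrite` and `score` functions below are stand-ins for the LLM call and the sandboxed benchmark, not Google's system): keep the best-scoring candidate programs, expand them, and prune weaker branches.

```python
import heapq

# Sketch of the score-driven search idea (toy stand-ins): `score` replaces the
# sandboxed task benchmark, `rewrite` replaces the LLM proposing code edits.

def score(program: str) -> float:
    """Stand-in for running the candidate in a sandbox and scoring it."""
    return -abs(len(program) - 40)        # toy objective: prefer ~40-char programs

def rewrite(program: str) -> list[str]:
    """Stand-in for the LLM proposing edited variants of the program."""
    return [program + " # tweak", program[: max(1, len(program) - 5)]]

def tree_search(seed: str, rounds: int = 10, beam: int = 4) -> str:
    frontier = [(-score(seed), seed)]                 # max-heap via negated scores
    best = seed
    for _ in range(rounds):
        candidates = []
        for _, prog in heapq.nsmallest(beam, frontier):
            candidates.extend(rewrite(prog))          # expand the best branches
        for prog in candidates:
            heapq.heappush(frontier, (-score(prog), prog))
            if score(prog) > score(best):
                best = prog
        frontier = heapq.nsmallest(beam, frontier)    # prune weaker branches
    return best

print(tree_search("def model(x): return x"))
```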
Sep 9 9 tweets 3 min read
Fei-Fei Li (@drfeifei) on limitations of LLMs.

"There's no language out there in nature. You don't go out in nature and there's words written in the sky for you.. There is a 3D world that follows laws of physics."

Language is purely generated signal.

AI models trained on linguistic signals fail when the task requires embodied physical common sense in a world with real constraints.
Sep 8 10 tweets 5 min read
BRILLIANT paper.

LLMs get stuck when they think too long in a single line, early tokens steer them into a narrow path and they rarely recover, which the authors call Tunnel Vision.

ParaThinker trains native parallel thinking, it spins up multiple distinct reasoning paths at once and then fuses them into 1 answer, which lifts accuracy a lot with tiny latency cost.

Sensational fact, if you only keep 1 thing: 12.3% average gain for 1.5B, 7.5% for 7B, with only 7.1% extra latency.

ParaThinker shows that training LLMs to think in parallel paths instead of just longer single chains avoids tunnel vision, giving up to 12.3% accuracy gains with only 7.1% extra latency, letting smaller models beat much larger ones.

🧵 Read on 👇

🧵2/n. 🧩 Why longer thinking stalls

When the model makes a mistake early on, it keeps building on that mistake.

The longer it goes down that wrong path, the less chance it has to recover.

This stuck behavior is what the authors call Tunnel Vision, and it explains why just letting the model think longer doesn’t always improve accuracy.
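A simple illustration of the parallel idea (not ParaThinker's trained mechanism, just stubbed sampling): draw several independent reasoning paths and fuse them, instead of betting everything on one long chain that can lock into an early mistake.

```python
import random
from collections import Counter

# Illustration of the parallel-paths idea (not ParaThinker's trained mechanism):
# sample several independent reasoning paths and fuse them, instead of betting
# everything on one long chain that can lock into an early mistake.

random.seed(0)

def one_reasoning_path(p_good_start: float = 0.7) -> str:
    """Toy stand-in for a sampled chain of thought: if the path starts badly,
    it tends to stay bad (the 'Tunnel Vision' failure)."""
    return "correct" if random.random() < p_good_start else "wrong"

def fuse(answers: list[str]) -> str:
    """Fuse parallel paths; majority vote is the simplest possible fusion."""
    return Counter(answers).most_common(1)[0][0]

single_path = one_reasoning_path()
parallel    = fuse([one_reasoning_path() for _ in range(8)])
print("single path:", single_path, "| fused over 8 paths:", parallel)
```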
Sep 8 4 tweets 3 min read
Another great @GoogleDeepMind paper.

Shows how to speed up LLM agents while cutting cost and keeping answers unchanged.

30% lower total cost and 60% less wasted cost at comparable acceleration.

Agents plan step by step, so each call waits for the previous one, which drags latency.

Speculative planning fixes that by having a cheap draft agent guess next steps while a stronger agent checks them in parallel.

Fixed guess lengths backfire, small guesses barely help, big guesses waste tokens when a check disagrees.

Dynamic Speculative Planning learns how far to guess, then stops early to avoid wasted calls.

A tiny online predictor learns how many steps will be right using reinforcement learning.

1 knob lets teams bias for speed or cost, either by skewing training or adding a small offset.

If a guess is wrong, extra threads stop and execution resumes from the verified step.

Across OpenAGI and TravelPlanner, the dynamic policy matches the fastest fixed policy while spending fewer tokens.

The result is clear: faster responses, lower bills, and 0 loss in task quality.

How Dynamic Speculative Planning manages when and how far to guess ahead during an agent’s planning.

The top line called Predictor decides how many future steps to guess, marked by k. For example, k=2 means guess 2 steps ahead, while k=3 means guess 3 steps ahead. These guesses are carried out by a lighter agent called Approximation, and then checked in parallel by a stronger agent called Target.

If the guesses match the stronger agent, they are confirmed and execution continues. If they don’t match, shown with an X, all ongoing speculative threads are canceled, and the system resumes from the last correct step. This prevents wasted work from wrong guesses.

At the same time, an online Trainer collects data about each state and the chosen k. This data is then used to update the Predictor so it learns better over time without slowing down the agent. In other words, the system keeps improving its ability to guess how far it can safely look ahead.

So overall, the figure captures this cycle: make a guess, verify, cancel if wrong, and then use that experience to improve the predictor for the next run.
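A toy sketch of that speculate-then-verify cycle (stand-in agents and an assumed agreement rate, not the paper's code): draft k steps ahead, verify them, cancel the rest on a mismatch, and adapt k from what actually got accepted.

```python
import random

# Toy sketch of the speculate-then-verify cycle (stand-in agents and an assumed
# agreement rate, not the paper's code).

random.seed(1)
AGREEMENT = 0.8                    # assumed chance the target accepts a drafted step

def verify(step_index: int) -> bool:
    """Stand-in for the stronger target agent checking one drafted step."""
    return random.random() < AGREEMENT

def speculative_plan(total_steps: int, k: int = 3):
    done, wasted = 0, 0
    while done < total_steps:
        guesses = min(k, total_steps - done)   # draft k steps ahead
        accepted, mismatch = 0, False
        for i in range(guesses):               # the target checks them in parallel
            if verify(done + i):
                accepted += 1
            else:
                mismatch = True
                wasted += guesses - i - 1      # cancel the remaining speculation
                break
        # on a mismatch, the target's own answer for that step still moves us forward
        done += accepted + (1 if mismatch else 0)
        k = max(1, accepted + 1)               # crude stand-in for the learned predictor
    return done, wasted

steps, wasted = speculative_plan(30)
print(f"finished {steps} steps with {wasted} wasted speculative calls")
```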
Sep 6 13 tweets 8 min read
OpenAI released a new paper.

"Why language models hallucinate"

Simple answer - LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty.

The paper puts this on a statistical footing with simple, test-like incentives that reward confident wrong answers over honest “I don’t know” responses.

The fix is to grade differently, give credit for appropriate uncertainty and penalize confident errors more than abstentions, so models stop being optimized for blind guessing.

OpenAI is showing that 52% abstention gives substantially fewer wrong answers than 1% abstention, proving that letting a model admit uncertainty reduces hallucinations even if accuracy looks lower.

Abstention means the model refuses to answer when it is unsure and simply says something like “I don’t know” instead of making up a guess.

Hallucinations drop because most wrong answers come from bad guesses. If the model abstains instead of guessing, it produces fewer false answers.

🧵 Read on 👇
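A tiny worked example of the grading change (the penalty value is my own choice, not the paper's exact scheme): if a wrong answer costs more than an abstention, guessing only pays once confidence clears a threshold, so the model is rewarded for saying "I don't know" on shaky facts.

```python
# Tiny worked example of abstention-aware grading (assumed penalty value, not
# the paper's exact scheme): +1 for a correct answer, 0 for "I don't know",
# -PENALTY for a confident wrong answer.

PENALTY = 3.0        # assumed cost of a confident wrong answer

def expected_score_if_guessing(confidence: float) -> float:
    return confidence * 1.0 + (1 - confidence) * (-PENALTY)

threshold = PENALTY / (PENALTY + 1)      # guess only above this confidence
for confidence in (0.2, 0.5, 0.75, 0.9):
    guess = expected_score_if_guessing(confidence)
    decision = "guess" if confidence > threshold else "abstain"
    print(f"confidence {confidence:.2f}: guessing scores {guess:+.2f} "
          f"vs 0.00 for abstaining -> {decision}")
print(f"break-even confidence: {threshold:.2f}")
```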
🧵2/n. This figure is showing the idea of Is-It-Valid.

On the left side, you see examples. Some are valid outputs (in black), and others are errors (in red). Valid examples are simple and correct statements like “There are 2 D’s in LADDER” or “I don’t know Zdan’s birthday.” Error examples are things that look fluent but are wrong, like “There are 3 L’s in SPELL” or giving random birthdays.

The diagrams on the right show why errors happen differently depending on the task. For spelling, the model can learn clear rules, so valid and invalid answers separate cleanly. For counting, the model is weaker, so valid and invalid mix more. For birthdays, there is no real pattern in the data at all, so the model cannot separate correct from incorrect—this is why hallucinations occur on such facts.

So the figure proves: when there is a clear pattern (like spelling), the model learns it well. When the task has weak or no pattern (like birthdays), the model produces confident but wrong answers, which are hallucinations.
Sep 4 11 tweets 4 min read
AWS is betting heavily on its custom Trainium chips, with Anthropic as the anchor customer, to regain momentum in the AI cloud race.

~ A solid Semi Analysis report.

AWS is building multi-gigawatt data centers packed with Trainium2 hardware, designed to give a better cost per unit of memory bandwidth compared to Nvidia GPUs.

And this memory-vs-compute tradeoff has become super important, because for much advanced AI work, especially reinforcement learning and reasoning-heavy training, it is less about raw compute and more about how quickly and cheaply memory can be moved.

🧩 Anthropic has become AWS’s anchor customer for AI capacity.

Anthropic, which has grown revenue to $5B annualized in 2025, is deeply tied into this effort, even co-designing features of Trainium to match its roadmap. That makes Trainium increasingly look like semi-custom silicon tuned for Anthropic’s workloads.

Azure’s surge shows why an anchor matters, since OpenAI’s ~$10B cloud spend lives there today.

"Trainium2 is converging toward an Anthropic custom-silicon program. This will enable Anthropic to be, alongside Google DeepMind, the only AI labs benefiting from tight hardware–software co-design in the near horizon."

🧵 Read on 👇

🧵2/n. 🏗️ AWS is finishing 3 campuses with over 1.3GW of IT capacity focused on Anthropic’s training runs.

SemiAnalysis expects these clusters to lift AWS growth above 20% YoY as they enter service.
Sep 2 8 tweets 3 min read
🇨🇳 China's Tencent open-sources translation model beats Google, OpenAI in top global AI competition

Hunyuan-MT-7B came first in 30 out of the 31 tests in a general machine-translation competition held as part of the upcoming WMT25 conference.

Supports 33 languages, available on @huggingface

commercial use allowed.

Hunyuan-MT-7B’s strength is that it uses a small number of parameters to deliver results that measure up to or even surpass much larger models.

Tencent said its Hunyuan translation model had been employed across a range of in-house products, such as the Zoom-like Tencent Meeting, a web browser and the enterprise version of the WeChat messaging app.

🧵 Read on 👇

🧵2/n. English language pairs tested in the competition included Arabic, Estonian and Maasai, which is spoken by 1.5 million people living in southern Kenya and northern Tanzania.

Other language pairs included Czech-Ukrainian and Japanese-simplified Chinese. The only English language pair Hunyuan did not ace was Bhojpuri, a language spoken by around 50.5 million people in parts of northern India and Nepal.
Sep 1 15 tweets 8 min read
Someone let ChatGPT run a stock portfolio.

Over 2 months, ChatGPT’s portfolio is up +29.22% vs. the S&P 500’s +4.11% over the same window.

(Prompts, Code, Github listed)

The process works as follows.

ChatGPT is given real market data each trading day, including prices, volumes and benchmarks, stored on GitHub.

On weekends it uses that data to research deeply, reevaluate the portfolio, and look for new stock ideas.

The portfolio is simulated daily based on any changes, and then the person manually executes those trades in a real brokerage account.

ChatGPT has full authority to make buy or sell decisions, but only within U.S. micro-cap stocks under $300M market cap.

github.com/LuckyOne7777/C…
Aug 31 11 tweets 6 min read
BRILLIANT @GoogleDeepMind research.

Even the best embeddings cannot represent all possible query-document combinations, which means some answers are mathematically impossible to recover.

Reveals a sharp truth, embedding models can only capture so many pairings, and beyond that, recall collapses no matter the data or tuning.

🧠 Key takeaway

Embeddings have a hard ceiling, set by dimension, on how many top‑k document combinations they can represent exactly.

They prove this with sign‑rank bounds, then show it empirically and with a simple natural‑language dataset where even strong models stay under 20% recall@100.

When queries force many combinations, single‑vector retrievers hit that ceiling, so other architectures are needed.

4096‑dim embeddings already break near 250M docs for top‑2 combinations, even in the best case.

🛠️ Practical Implications

For applications like search, recommendation, or retrieval-augmented generation, this means scaling up models or datasets alone will not fix recall gaps.

At large index sizes, even very high-dimensional embeddings fail to capture all combinations of relevant results.

So embeddings cannot work as the sole retrieval backbone. We will need hybrid setups, combining dense vectors with sparse methods, multi-vector models, or rerankers to patch the blind spots.

This shifts how we should design retrieval pipelines, treating embeddings as one useful tool but not a universal solution.
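A minimal sketch of that hybrid direction (toy documents, made-up weights, and a hash-based stand-in for a real embedding model, not a benchmarked setup): blend a dense cosine score with a sparse keyword-overlap score so documents the single vector misses can still surface.

```python
import re
import numpy as np

# Minimal hybrid-retriever sketch (toy data only): blend dense cosine scores
# with a sparse keyword-overlap score so documents the single-vector embedding
# misses can still surface.

docs = ["Jon Durben likes Quokkas and Apples",
        "Mia Chen likes Quokkas and Tea",
        "Ravi Patel likes Ferrets and Coffee"]
query = "Who likes Quokkas?"

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model: a deterministic random unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def sparse_score(q: str, d: str) -> float:
    """Keyword overlap as a crude BM25 stand-in."""
    q_terms = set(re.findall(r"\w+", q.lower()))
    d_terms = set(re.findall(r"\w+", d.lower()))
    return len(q_terms & d_terms) / len(q_terms)

dense  = np.stack([embed(d) for d in docs]) @ embed(query)   # cosine (unit vectors)
sparse = np.array([sparse_score(query, d) for d in docs])
hybrid = 0.5 * dense + 0.5 * sparse                          # simple score blend

for doc, s in sorted(zip(docs, hybrid), key=lambda pair: -pair[1]):
    print(f"{s:+.2f}  {doc}")
```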

🧵 Read on 👇

This figure explains LIMIT, a tiny natural-language dataset they built to test whether single-vector embeddings can represent all combinations of relevant documents for each query.

The left grid is the target relevance pattern, and the task is to rank exactly the k=2 correct documents for every query.

The right side shows the mapping into simple text, queries like “Who likes Quokkas?” paired with short bios such as “Jon Durben likes Quokkas and Apples,” so language complexity is not the challenge.

The key point: even with this simple setup, strong MTEB embedders stay under 20% recall@100, revealing a capacity limit of single-vector retrieval.
Aug 27 15 tweets 5 min read
Google shares TPUv7 details for the first time, at Hot Chips 2025.

Super valuable insight that could not otherwise be easily gleaned.

Ironwood is said to offer 2x the perf-per-watt of Google’s previous generation TPU, Trillium.

With up to 9,216 chips in a node, Ironwood can scale up to a MASSIVE 42.5 Exaflops in performance.

Though with 10MW of power consumption, that performance doesn’t come cheap.

But, like all of Google’s TPUs, this is solely for Google’s use as part of their Google Cloud services, so Ironwood is not available to look at outside of Google.

🧵 Read on 👇

🧵2/n. Ironwood TPU comes with several innovations.

The big one is how big the SuperPods can go. Now up to 9,216 chips, thanks to the use of optical circuit switches (OCS) to share memory throughout the pod. There’s 1.77 PB of directly addressable HBM altogether.

This generation also brings a focus on RAS features in order to have reliable systems.

Power efficiency also gets a boost, of course. Google is claiming a 2x perf-per-watt improvement – though it’s unclear if this is at iso-datatype.
Aug 26 10 tweets 5 min read
💼 Finally a solid 57-page report on AI's effect on the job market, from Stanford University.

THE SHIFT HAS STARTED.

Entry‑level workers in the most AI‑exposed jobs are seeing clear employment drops, while older peers and less‑exposed roles keep growing.

Though overall employment continues to grow, employment growth for young workers in particular has been stagnant.

The drop shows up mainly as fewer hires and headcount, not lower pay, and it is sharpest where AI usage looks like automation rather than collaboration.

22‑25 year olds in the most exposed jobs show a 13% relative employment decline after controls.

⚙️ The paper tracked millions of workers and boils recent AI labor effects into 6 concrete facts

The headline being entry‑level contraction in AI‑exposed occupations and muted wage movement.

AI is replacing the codified knowledge that juniors supply more of, rather than the tacit knowledge that seniors accumulate.

🧵 Read on 👇

🧵2/n. 📊 The Data

The study uses administrative payroll records from ADP, which handles pay for over 25M workers, letting the authors observe monthly headcount and base salary with high granularity.

They build a balanced panel of firms present from 2021‑01 to 2025‑07, restrict to ages 18‑70 with recorded titles mapped to Standard Occupational Classification codes, and end up with 3.5M–5M workers per month in the main sample.
Aug 24 11 tweets 6 min read
MASSIVE claim in this paper 🫡

Top universities from the US, UK, EU, China, Canada, Singapore, and Australia collaborated.

Will completely change research paper writing.

They proved that AI can already draft proposals, run experiments, and write papers.

The authors built aiXiv, a new open-access platform where AI and humans can submit, review, and revise research in a closed-loop system.

The system uses multiple AI reviewers, retrieval-augmented feedback, and defenses against prompt injection to ensure that papers actually improve after review.

And the process worked: AI-generated proposals and papers get much better after iterative review, with acceptance rates jumping from near 0% to 45% for proposals and from 10% to 70% for papers.

🧵 Read on 👇

🧵2/n. Across real experiments it hits 77% proposal ranking accuracy, 81% paper ranking accuracy, blocks prompt‑injection with up to 87.9% accuracy, and pushes post‑revision acceptance for papers from 10% to 70%.

81% paper accuracy, 87.9% injection detection, papers 10%→70% after revision.
Aug 23 24 tweets 9 min read
This is that original MIT report that said 95% of AI pilots fail and which spooked investors across the US stock market.

The report says most companies are stuck, because 95% of GenAI pilots produce zero ROI, while a small 5% win by using systems that learn, plug into real workflows, and improve with use.

Teams keep buying or building static tools that demo well but cannot remember context, adapt, or fit daily operations, and this report maps exactly how the few winners do it differently.

🧪 How they ran the study

They combined a review of 300+ public implementations with 52 structured interviews and 153 senior‑leader surveys across January to June 2025, which gives the patterns below real footing.

🧵 Read on 👇
The big split they call the GenAI Divide is simple: 95% of organizations get nothing from GenAI pilots while a tiny 5% extract millions, and the driver is not the model itself but whether the system can learn, remember, and fit the workflow.