Top universities from the US, UK, EU, China, Canada, Singapore, and Australia collaborated on this one.
It could completely change research paper writing.
They show that AI can already draft proposals, run experiments, and write papers.
The authors built aiXiv, a new open-access platform where AI and humans can submit, review, and revise research in a closed-loop system.
The system uses multiple AI reviewers, retrieval-augmented feedback, and defenses against prompt injection to ensure that papers actually improve after review.
And the process worked: AI-generated proposals and papers get much better after iterative review, with acceptance rates jumping from near 0% to 45% for proposals and from 10% to 70% for papers.
🧵 Read on 👇
🧵2/n. Across real experiments it hits 77% proposal ranking accuracy, 81% paper ranking accuracy, blocks prompt‑injection with up to 87.9% accuracy, and pushes post‑revision acceptance for papers from 10% to 70%.
🧵3/n. This diagram shows aiXiv’s closed-loop system where AI and humans submit work, get automated reviews, revise, and then publish once quality clears the bar.
It means the platform is not a simple preprint dump, it is a workflow that forces measurable improvement each cycle.
Review agents score novelty, soundness, clarity, and feasibility using retrieval so feedback is grounded, and a prompt-injection detector screens malicious instructions before any model reads the file.
If the revised version looks better in pairwise checks, it moves forward, then a panel of LLMs votes, and 3 of 5 accepts trigger publication.
So the figure is saying aiXiv operationalizes end-to-end research, from idea to accepted paper, with guardrails and iteration built in.
🧵4/n. 🚧 Why this is needed
LLMs can already draft proposals, run experiments, and write papers, but journals resist AI authors and preprints miss screening, so strong AI‑generated research has nowhere credible to land.
This platform targets that gap by pairing automated review with structured revision so content quality is tracked and improved, not just posted.
🧵5/n. ⚙️ The Core Concepts
An AI or human submits a proposal or paper, review agents score novelty, soundness, clarity, and feasibility, then return concrete fixes, the author revises, and the loop repeats until it clears the bar.
The loop is submission → automated review → revision → re‑evaluation → decision, which keeps pressure on actual improvements rather than one‑shot verdicts.
Each accepted item gets a DOI and explicit IP credit to the model developer and any initiating human, so attribution is clear from day one.
A public UI lets people like, comment, and discuss, which gives extra feedback signals to steer agent behavior.
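As a rough sketch, the loop could be wired up like this in Python; `review_agents`, `author`, and `pairwise_better` are hypothetical helpers standing in for the paper's actual agents, not its implementation:

```python
def aixiv_loop(submission, review_agents, author, pairwise_better, max_rounds=3):
    # Closed loop: automated review -> revision -> re-evaluation, repeated until
    # the revision stops beating the previous version or the round budget runs out.
    current = submission
    for _ in range(max_rounds):
        reviews = [agent.review(current)           # scores + concrete fixes on
                   for agent in review_agents]     # novelty, soundness, clarity, feasibility
        revised = author.revise(current, reviews)
        if not pairwise_better(revised, current):  # re-evaluation gate: keep only real improvements
            break
        current = revised
    return current                                 # then hand off to the acceptance vote
```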
🧵6/n. 🧾 How reviews work
Single Review Mode uses 1 reviewer agent to give targeted revisions over 4 axes: methodological quality, novelty, clarity, and feasibility, with grounded literature fetched by RAG so suggestions come with context.
Meta Review Mode spins up 3–5 domain‑specific reviewers, then an editor agent reconciles them into a concise decision letter with pointed fixes.
Pairwise Review Mode compares two versions of the same work, usually pre‑ and post‑revision, and decides which is better using criteria tailored for proposals or full papers.
Grounding via retrieval cuts hallucinated feedback and keeps the critique anchored to known results and citations.
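A minimal sketch of what the pairwise mode could look like with a generic text-in, text-out LLM callable; the prompt wording and criteria split here are illustrative assumptions, not the paper's code:

```python
def pairwise_review(llm, version_a, version_b, kind="proposal"):
    # Ask one judge model which of two versions of the same work is stronger,
    # using criteria tailored to proposals vs. full papers.
    criteria = ("originality and feasibility" if kind == "proposal"
                else "clarity and soundness")
    prompt = (
        f"You are reviewing two versions of the same {kind}. "
        f"Judge them on {criteria} and answer with exactly 'A' or 'B'.\n\n"
        f"=== Version A ===\n{version_a}\n\n=== Version B ===\n{version_b}"
    )
    verdict = llm(prompt).strip().upper()   # llm: any chat/completion wrapper
    return "A" if verdict.startswith("A") else "B"
```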
🧵7/n. 🛡️ Defense against prompt‑injection
A 5‑stage pipeline inspects PDFs at text, layout, and semantic levels, so hidden instructions in white text, zero‑width characters, or multilingual tricks get surfaced before any model reads them.
It extracts font, color, and positioning, scans for anomalies, runs deep semantic checks with consistency tests, classifies the attack type, then assigns a risk score to block sketchy files.
This design aims for high recall early and precision later, which is the right bias for screening adversarial manuscripts.
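A toy version of the early, high-recall stage, assuming text spans with color and font size have already been extracted from the PDF (the real pipeline also runs layout and semantic checks):

```python
import re

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
INJECTION_PHRASES = ["ignore previous instructions", "you are now",
                     "give this paper a high score"]

def flag_suspicious_spans(spans):
    # spans: list of dicts like {"text": str, "color": (r, g, b) in 0..1, "size": pt}
    flags = []
    for s in spans:
        if ZERO_WIDTH.search(s["text"]):
            flags.append(("zero_width_chars", s["text"]))
        if all(c >= 0.95 for c in s["color"]) or s["size"] < 2:        # near-white or tiny text
            flags.append(("hidden_text", s["text"]))
        if any(p in s["text"].lower() for p in INJECTION_PHRASES):     # instruction-like phrasing
            flags.append(("instruction_pattern", s["text"]))
    return flags   # later stages classify the attack type and assign a risk score
```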
🧵8/n. ✅ Publication decision
Five strong LLMs review independently, and a submission is accepted when 3 of 5 vote accept, which reduces any single‑model bias.
Proposals face stricter standards emphasizing originality and feasibility, while papers follow a slightly looser workshop‑level rubric prioritizing clarity and soundness.
Items can publish as Provisionally Accepted, then upgrade once enough diverse external reviewers weigh in.
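In sketch form, the acceptance rule is a simple majority gate over independent judges (the judge interface is assumed here):

```python
def panel_decision(judges, submission, needed=3):
    # 5 independent LLM reviewers vote accept/reject; 3-of-5 accepts publish the work.
    votes = sum(1 for judge in judges if judge.accepts(submission))
    return "accept" if votes >= needed else "reject"
```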
🧵9/n. 📊 What the experiments say
Proposal ranking with RAG hits 77% on ICLR‑derived pairs, beating a 71% baseline reported in prior work.
Paper‑level ranking reaches 81%, which is solid given long contexts and messy drafts.
Prompt‑injection detection scores 84.8% on synthetic adversarials and 87.9% on suspicious real samples.
After agents review and authors revise, >90% of proposals and papers are preferred over the originals, and with a short response letter that climbs toward ~100%.
Majority voting mirrors that lift, proposals jump from 0% to 45.2% accepted on average, and papers jump from 10% to 70%.
🧵10/n. 🔌 Interfaces and ecosystem
An API plus Model Context Protocol (MCP) lets heterogeneous agents plug in as authors, reviewers, and meta‑reviewers without glue code.
Accepted items get a DOI and explicit IP attribution, which matters for crediting both the human initiator and the model developer.
Community reactions, likes and comments, feed back as weak signals to help align agent behavior with evolving norms.
Paper Title: "aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists"
arxiv.org/abs/2508.15126
This is that original MIT report that said 95% of AI pilots fail and spooked investors across the US stock market.
The report says most companies are stuck because 95% of GenAI pilots produce zero ROI, while a small 5% win by using systems that learn, plug into real workflows, and improve with use.
Teams keep buying or building static tools that demo well but cannot remember context, adapt, or fit daily operations, and this report maps exactly how the few winners do it differently.
🧪 How they ran the study
They combined a review of 300+ public implementations with 52 structured interviews and 153 senior‑leader surveys conducted from January to June 2025, which gives the patterns below real footing.
🧵 Read on 👇
The big split they call the GenAI Divide is simple, 95% of organizations get nothing from GenAI pilots while a tiny 5% extract millions, and the driver is not the model itself but whether the system can learn, remember, and fit the workflow.
The steep drop from pilots to production for task-specific GenAI tools reveals the GenAI divide
The first method to achieve 99.9% on AIME 2025 with open-source models! 🤯
DeepConf uses a model’s own token confidence to keep only its strongest reasoning, hitting 99.9% on AIME 2025 with GPT-OSS-120B while cutting tokens by up to 84.7% compared to standard parallel thinking.
Most systems still lean on self-consistency with majority voting, which lifts accuracy but hits diminishing returns and burns a lot of tokens.
🧠 The key idea
DeepConf is a test-time method that scores the model’s reasoning locally for confidence, filters weak traces, and often improves accuracy with fewer tokens without any extra training or tuning.
🧱 Why majority voting hits a wall
Parallel thinking samples many chains and votes, accuracy grows slowly as samples rise so compute scales linearly and the benefit flattens, which is exactly the pain DeepConf targets.
🔎 The confidence signals
Token confidence is the negative mean log probability of the top k candidates at each step, which gives a direct signal of how sure the model is at that moment.
Group confidence averages token confidence over a sliding window so local dips are visible without noise from the whole trace.
Tail confidence averages the last chunk of tokens because the ending steps decide the final answer and are where good traces often slip.
Bottom 10% group confidence looks at the worst parts of a trace, which is a strong indicator that the overall reasoning is shaky.
Lowest group confidence picks the single weakest window along a trace, which turns out to be a clean gate for dropping that trace early.
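Here is a minimal NumPy sketch of those signals computed from the per-step top-k log-probabilities; the window and tail sizes are placeholders, not the paper's exact settings:

```python
import numpy as np

def token_confidence(topk_logprobs):
    # topk_logprobs: (T, k) log-probs of the top-k candidates at each generation step
    return -topk_logprobs.mean(axis=1)                      # higher = more confident

def group_confidence(tok_conf, window=128):
    # sliding-window mean so local dips stand out without whole-trace noise
    return np.array([tok_conf[max(0, t - window + 1): t + 1].mean()
                     for t in range(len(tok_conf))])

def tail_confidence(tok_conf, tail=256):
    return tok_conf[-tail:].mean()                          # ending steps decide the answer

def bottom10_group_confidence(grp_conf):
    k = max(1, int(0.1 * len(grp_conf)))
    return np.sort(grp_conf)[:k].mean()                     # average of the worst 10% of windows

def lowest_group_confidence(grp_conf):
    return grp_conf.min()                                   # single weakest window = drop gate
```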
✅ Bottom line
DeepConf is a plug-in test-time compression recipe that filters or halts weak reasoning in place, so teams get higher accuracy and a big token cut without retraining or new hyperparameters.
🧮 Offline mode, smarter voting
DeepConf ranks traces by a confidence score and does confidence-weighted majority voting after optionally keeping only the top 10% or the top 90% by confidence.
With 512 traces, GPT-OSS-120B reaches 99.9% on AIME 2025 using tail or lowest-group confidence with filtering, compared to 97.0% for plain voting and 91.8% for pass@1.
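A small sketch of that confidence-weighted vote with top-fraction filtering, assuming each trace already carries a final answer and a confidence score:

```python
from collections import defaultdict

def deepconf_offline_vote(traces, keep_frac=0.1):
    # traces: list of (answer, confidence); confidence = e.g. lowest group confidence
    kept = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = kept[: max(1, int(keep_frac * len(kept)))]   # keep only the most confident slice
    ballots = defaultdict(float)
    for answer, conf in kept:
        ballots[answer] += conf                         # weight each vote by its confidence
    return max(ballots, key=ballots.get)
```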
⚡ Online mode, early stop while generating
A short warmup of 16 traces sets a stopping threshold s from the confidence distribution for the current problem.
During live generation, a trace stops the moment its lowest group confidence falls below s, so weak lines of thought do not waste tokens.
An adaptive sampling loop adds traces until the consensus is high enough, or a set budget like 512 is reached.
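Roughly, the online loop could look like this; `generate_trace` is an assumed helper that returns (answer, lowest_group_confidence) and aborts mid-generation when given a stopping threshold:

```python
def deepconf_online(generate_trace, n_warmup=16, budget=512, consensus=0.95):
    # Warmup traces run unfiltered to calibrate the stopping bar for this problem.
    warmup = [generate_trace(stop_threshold=None) for _ in range(n_warmup)]
    confs = sorted(c for _, c in warmup)
    s = confs[len(confs) // 10]                        # low percentile as the bar (assumption)
    answers = [a for a, _ in warmup]
    for _ in range(budget - n_warmup):
        answer, _ = generate_trace(stop_threshold=s)   # trace halts if confidence dips below s
        if answer is not None:                         # only surviving traces get to vote
            answers.append(answer)
        top = max(set(answers), key=answers.count)
        if answers.count(top) / len(answers) >= consensus:
            break                                      # stop sampling once consensus is high
    return max(set(answers), key=answers.count)
```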
In short, package stable context up front, give exact instructions and examples, restate the current ask, let the model reason, and demand a strict output format.
🧵 Read on 👇
🧵2/n Start with task context. Tell the model who it is, what domain it is in, and what outcome matters. In the demo, the first try misread the images as a skiing incident. Adding “you are assisting a Swedish car-insurance claims adjuster” fixed that because it anchored the model in the right world and goal.
🧵3/n Add tone context. Specify how to behave, for example “be factual, be confident only when evidence is clear, say you are unsure if you cannot tell.” This reduces guessing and aligns the model’s attitude with the task. The presenters explicitly ask the model not to invent details and to avoid a verdict unless it is sure.
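A hedged template of that ordering as a Python string constant; the section numbering, wording, and claim ID are illustrative, not the presenters' exact prompt:

```python
PROMPT_TEMPLATE = """\
# 1. Task context (stable, reusable across requests)
You are assisting a Swedish car-insurance claims adjuster reviewing accident photos.

# 2. Tone context
Be factual. Only state conclusions the evidence clearly supports, and say you are unsure otherwise.

# 3. Detailed instructions and examples
Describe the visible damage and its likely cause, e.g. "rear bumper dented, consistent with a low-speed rear-end collision".

# 4. Restate the current ask
For claim {claim_id}, decide whether the attached photos match the incident described in the claim.

# 5. Reasoning, then strict output format
Think step by step first, then answer with three labeled fields: verdict, confidence, notes.
"""

# usage (claim ID is hypothetical): PROMPT_TEMPLATE.format(claim_id="SE-2025-0142")
```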
A small Qwen2.5 model is fine-tuned to think over retrieved documents, so a single lean setup can answer domain questions on resource-constrained local hardware.
Using summarised NHS pages, retrieval hits the right condition among top‑5 in 76% of queries, and the fine‑tuned model predicts the exact condition correctly 56% of the time, close to larger frontier models.
The whole pipeline is built for private deployments, so teams can run it without sending data to external APIs.
🔒 The problem they tackle
Many teams cannot ship prompts or data outside their network, especially in health and government, so cloud LLM endpoints are off the table.
They aim for a single lean model that can read retrieved evidence and reason over it, all running locally, so answers stay grounded and private.
The target setting is messy queries over a closed corpus, where retrieval constrains facts and the reasoning step interprets symptoms and next actions.
🧩 The pipeline in this paper
The system indexes a corpus, retrieves the most relevant pieces for each query, then generates an answer that reasons over those pieces.
They use a classic retriever plus generator design, with retrieval first then reasoning, which fits decision tasks better than free‑form answering.
The chat flow lets a conversational agent decide when to call retrieval, then passes the retrieved context to the reasoning model to produce the answer.
🧵 Read on 👇
🧲 The retriever at work
Documents are split into overlapping chunks and embedded with a sentence transformer, then stored in a vector database for fast similarity search.
They use sentence-transformers all‑mpnet‑base‑v2, which maps text into a 768‑dimensional space with a max sequence of 384 tokens, and a Chroma store with L2 similarity.
If any chunk from a document makes the top‑k, the pipeline feeds the full original document to the LLM, so the model sees full context around the hit.
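A compact sketch of that retriever with sentence-transformers and Chroma; the chunk sizes and collection name are assumptions, while the model name and L2 similarity come from the paper:

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")        # 768-dim vectors
client = chromadb.Client()
col = client.create_collection(name="nhs_pages", metadata={"hnsw:space": "l2"})  # L2 similarity

def index_corpus(docs, chunk_words=200, overlap=40):
    # docs: {doc_id: full_text}; overlapping word-level chunks, each embedded and stored
    for doc_id, text in docs.items():
        words = text.split()
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words - overlap)]
        col.add(ids=[f"{doc_id}-{j}" for j in range(len(chunks))],
                documents=chunks,
                embeddings=embedder.encode(chunks).tolist(),
                metadatas=[{"doc_id": doc_id} for _ in chunks])

def retrieve_full_docs(query, docs, k=5):
    hits = col.query(query_embeddings=embedder.encode([query]).tolist(), n_results=k)
    # if any chunk of a document lands in the top-k, hand the *full* document to the LLM
    doc_ids = {m["doc_id"] for m in hits["metadatas"][0]}
    return [docs[d] for d in doc_ids]
```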
The image below shows the whole training loop for their lean, retrieval-augmented reasoning setup.
It starts with a private knowledge base of about 1,000 NHS condition pages. GPT-4o generates about 2,000 synthetic patient queries from those pages, so they have realistic questions tied to known answers.
For each query, a retriever pulls the top 5 likely documents. DeepSeek-R1 reads those documents and the query, then produces a final label plus a step-by-step reasoning trace. That bundle becomes one training example.
They then fine-tune Qwen-32B-Instruct on this data and distill it into a smaller t0-1 reasoning model. The result is a compact model that learns to reason over retrieved evidence from the approved corpus, so it can run locally and stay grounded.
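In sketch form, each training example bundles the synthetic query, the retrieved evidence, and the teacher's reasoning; the helper names and output keys here are assumptions:

```python
def build_training_example(query, retriever, teacher_llm, k=5):
    # retriever returns the top-k candidate documents for the synthetic patient query;
    # the teacher (DeepSeek-R1 in the paper) reads them and emits a reasoning trace
    # plus a final condition label, which becomes the fine-tuning target.
    evidence = retriever(query, k=k)
    teacher_out = teacher_llm(query=query, documents=evidence)  # {"reasoning": ..., "label": ...}
    return {
        "prompt": {"query": query, "documents": evidence},
        "completion": teacher_out["reasoning"] + "\nAnswer: " + teacher_out["label"],
    }
```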
Absolutely beautiful and exhaustive 82-page survey paper on Efficient Architectures for Large Language Models.
It maps the ways to make LLMs cheaper, longer-context, and near real-time.
Transformers compare every token with every other token, so if text is 2x longer, the work is about 4x. That burns memory because past keys and values are stored for every attention head, and it drags latency during long chats or reasoning loops.
The survey groups fixes into 4 buckets. Linear sequence models redo the math so cost grows with length, not length squared.
They include linear attention, recurrent networks that carry a small state, and state space models like Mamba, which track history with a running summary, so no big cache.
Sparse attention keeps the Transformer idea but only connects important pairs. Most tokens look locally, a few tokens act as global anchors, and some methods route tokens to the right places. You get large savings without throwing away core behavior.
Efficient full attention keeps exact attention but makes it hardware friendly. Input output aware kernels such as FlashAttention cut reads and writes, and multi-query or grouped-query attention lets many heads share 1 key-value set, cutting cache and bandwidth.
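For the grouped-query idea, a minimal PyTorch sketch (shapes only, no masking or positional encodings); this illustrates the KV sharing, not any specific model's implementation:

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (B, H, T, D) query heads; k, v: (B, G, T, D) shared key/value groups, H % G == 0
    B, H, T, D = q.shape
    G = k.shape[1]
    k = k.repeat_interleave(H // G, dim=1)           # each KV group serves H/G query heads
    v = v.repeat_interleave(H // G, dim=1)
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5    # (B, H, T, T) attention logits
    return scores.softmax(dim=-1) @ v                # (B, H, T, D) attended values
```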
Sparse Mixture of Experts adds conditional compute. Only a few experts run per token, so capacity grows without paying full cost each step, and memory tricks compress, quantize, or prune the cache to stretch context.
The theme is simple, move less data. Methods that cut memory traffic tend to win on modern GPUs, which enables longer context, faster training, and lower serving cost.
This figure is a roadmap of how to make LLMs faster and cheaper from input tokens to output tokens.
The center shows Efficient Sequence Modeling. One path makes sequence cost scale linearly using things like linear attention, linear recurrent networks, and state space models, plus test-time-training variants and unified linear sequence models.
Another path saves work by using sparse attention so the model only looks at the most useful token pairs.
A third path keeps full attention but makes it cheaper with input-output aware scheduling, grouped attention, mixtures of different attention types, and quantization.
Below that sits Sparse Mixture-of-Experts. The model grows capacity by keeping many experts but routes each token to only a few, so compute per token stays low. Different routing rules, expert designs, and conversion tricks live here.
To the right are Hybrid Architectures. These mix building blocks across layers or inside a layer to hit better speed and accuracy tradeoffs.
Next is Diffusion LLM. This family targets non-autoregressive generation so many tokens can be produced in parallel, with methods to connect back to standard autoregressive decoding and to extend into multimodal settings.
The final column highlights reach beyond text, showing where these efficiency ideas apply to vision, audio, and multimodal tasks.
How can we break through the Transformer’s efficiency ceiling? Is costly "intelligence" our only path forward?
LEANN: The Tiniest Vector Database that Democratizes Personal AI with Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index
Researchers from UC Berkeley, CUHK, Amazon Web Services, and UC Davis have developed LEANN, a storage-efficient ANN search index optimized for resource-limited personal devices.
RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast, accurate, and 100% private RAG application on your personal device.