Even the best embeddings cannot represent all possible query-document combinations, which means some answers are mathematically impossible to recover.
This paper reveals a sharp truth: embedding models can only capture so many pairings, and beyond that, recall collapses no matter the data or tuning.
🧠 Key takeaway
Embeddings have a hard ceiling, set by dimension, on how many top‑k document combinations they can represent exactly.
They prove this with sign‑rank bounds, then show it empirically and with a simple natural‑language dataset where even strong models stay under 20% recall@100.
When queries force many combinations, single‑vector retrievers hit that ceiling, so other architectures are needed.
4096‑dim embeddings already break near 250M docs for top‑2 combinations, even in the best case.
🛠️ Practical Implications
For applications like search, recommendation, or retrieval-augmented generation, this means scaling up models or datasets alone will not fix recall gaps.
At large index sizes, even very high-dimensional embeddings fail to capture all combinations of relevant results.
So embeddings cannot work as the sole retrieval backbone. We will need hybrid setups, combining dense vectors with sparse methods, multi-vector models, or rerankers to patch the blind spots.
This shifts how we should design retrieval pipelines, treating embeddings as one useful tool but not a universal solution.
🧵 Read on 👇
This figure explains LIMIT, a tiny natural-language dataset they built to test whether single-vector embeddings can represent all combinations of relevant documents for each query.
The left grid is the target relevance pattern, and the task is to rank exactly the k=2 correct documents for every query.
The right side shows the mapping into simple text, queries like “Who likes Quokkas?” paired with short bios such as “Jon Durben likes Quokkas and Apples,” so language complexity is not the challenge.
The key point: even with this simple setup, strong MTEB embedders stay under 20% recall@100, revealing a capacity limit of single-vector retrieval.
2. ⚙️ The core concepts
They formalize retrieval as a binary relevance matrix over queries and documents, then ask for a low‑rank score matrix that, for each query, puts relevant documents ahead of the others.
They show that “row‑wise order preserving” and “row‑wise thresholdable” are the same requirement for binary labels, so both describe the exact capacity a single‑vector model needs.
They connect that capacity to sign rank, which is the smallest dimension that can reproduce the positive or negative pattern of a matrix, and they derive tight lower and upper bounds on the needed embedding dimension from it.
A direct consequence is stark: for any fixed dimension d, there exist top‑k combinations no query embedding can ever retrieve, regardless of training data.
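The "row‑wise order preserving" requirement can be made concrete with a small check (toy matrices of my own, not the paper's): given query and document embeddings, does the score matrix put every relevant document above every irrelevant one in each row?

```python
import numpy as np

def row_wise_order_preserved(rel, q_emb, d_emb):
    """True iff Q @ D.T ranks, for every query row, all relevant
    documents strictly above all irrelevant ones."""
    scores = q_emb @ d_emb.T
    for row_rel, row_scores in zip(rel, scores):
        pos = row_scores[row_rel == 1]
        neg = row_scores[row_rel == 0]
        if len(pos) and len(neg) and pos.min() <= neg.max():
            return False
    return True

# 2 queries, 3 docs, d=2: this top-2 pattern IS representable in 2 dims
rel = np.array([[1, 1, 0],
                [0, 1, 1]])
q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
d = np.array([[2.0, -1.0],
              [1.0, 1.0],
              [-1.0, 2.0]])
print(row_wise_order_preserved(rel, q, d))  # → True
```

The theorem says that for fixed d, some relevance matrices make this check fail for every possible choice of q and d embeddings.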
3. 🧪 Best‑case test, no language in the loop
They remove language completely and directly optimize the document and query vectors against the target relevance matrix with full‑batch contrastive training, which is the friendliest possible setup for embeddings.
For k=2, they increase the number of documents until optimization cannot reach 100% accuracy, calling that cutoff the critical‑n for a given dimension.
The curve of critical‑n versus dimension fits a 3rd‑degree polynomial with r^2=0.999, which lets them extrapolate how scale pushes the ceiling.
The extrapolated breakpoints are 500K @ 512, 1.7M @ 768, 4M @ 1024, 107M @ 3072, 250M @ 4096, and this is already the best case any retriever could hope for.
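The free‑embedding test can be sketched as follows. This is a toy reimplementation: plain gradient descent on a binary cross‑entropy loss stands in for the paper's full‑batch contrastive objective, and the sizes are far smaller than the real runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, dim = 8, 8                       # toy sizes, far below the paper's
pairs = [(i, j) for i in range(n_docs) for j in range(i + 1, n_docs)]
rel = np.zeros((len(pairs), n_docs))     # one query per top-2 combination
for qi, (i, j) in enumerate(pairs):
    rel[qi, i] = rel[qi, j] = 1.0

Q = rng.normal(scale=0.1, size=(len(pairs), dim))  # free query vectors
D = rng.normal(scale=0.1, size=(n_docs, dim))      # free doc vectors
lr = 0.5
for _ in range(5000):
    scores = Q @ D.T
    grad = 1.0 / (1.0 + np.exp(-scores)) - rel     # dBCE/dscores
    gQ, gD = grad @ D / len(pairs), grad.T @ Q / len(pairs)
    Q -= lr * gQ
    D -= lr * gD

# a dimension "breaks" when some query can no longer rank its 2 docs on top
top2 = np.argsort(-(Q @ D.T), axis=1)[:, :2]
solved = sum(set(t) == {i, j} for t, (i, j) in zip(top2, pairs))
print(f"{solved}/{len(pairs)} combinations ranked correctly")
```

Holding dim fixed and growing n_docs until solved drops below 100% recovers the critical‑n idea the paper fits its polynomial to.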
4. 🧩 What LIMIT actually is
They build a natural‑language dataset that encodes all 2‑document combinations across a small pool, phrased as simple queries like “who likes X” and short biography‑style documents.
Each document lists fewer than 50 liked attributes to keep texts short, and each query asks for exactly 1 attribute, which keeps the language trivial and isolates the combination pressure.
They use 50K documents and 1000 queries, and pick 46 special documents because 46 choose 2 equals 1035, which is just above 1000, then they also provide a 46‑doc small split.
They randomize names and attributes, dedupe with lexical checks, and ensure every relevant pair appears somewhere, so the dataset is simple in wording but dense in combinations.
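A minimal generator for the LIMIT construction might look like this, with placeholder names and attribute strings standing in for the paper's randomized vocabulary:

```python
import itertools

n_special = 46
names = [f"Person_{i}" for i in range(n_special)]          # placeholder names
pairs = list(itertools.combinations(range(n_special), 2))  # 46C2 = 1035
attrs = [f"attribute_{k}" for k in range(len(pairs))]      # one attr per doc pair

likes = {i: [] for i in range(n_special)}
for attr, (i, j) in zip(attrs, pairs):   # each attribute liked by exactly 2 docs
    likes[i].append(attr)
    likes[j].append(attr)

# each special doc likes 45 attributes, comfortably under the paper's 50 cap
docs = [f"{names[i]} likes {', '.join(likes[i])}." for i in range(n_special)]
# 1000 of the 1035 attributes become "who likes X" queries with 2 relevant docs
queries = [(f"Who likes {a}?", set(p)) for a, p in zip(attrs, pairs)][:1000]
print(len(pairs), len(queries), len(likes[0]))
```

The wording is trivial by design; the difficulty comes entirely from covering every 2‑document combination.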
5. 📉 How current embedders fared
On the full LIMIT, leading single‑vector models struggle to even touch 20% recall@100, despite the plain language and the easy query form.
Performance grows with dimension but stays low, while more expressive matchers do much better: the high‑dimensional sparse baseline BM25 reaches about 93.6 recall@100, and the late‑interaction multi‑vector model GTE‑ModernColBERT sits near 54.8 recall@100.
This pattern matches the theory, more dimensions or more expressive matching buys more combinations, but a single compact vector hits limits fast.
6. 🪙 Even 46 docs are hard
With just 46 documents, models still do not reach 100% even by recall@20, and recall@10 sits notably lower.
Concrete numbers help: Promptriever 4096 reaches 97.7 recall@20, GritLM 4096 reaches 90.5, and E5‑Mistral 4096 reaches 85.2, which is high but still not perfect for such a tiny pool.
7. 🧪 Not a domain‑shift issue
Fine‑tuning an embedder on a matching training split barely moves the needle, with best gains only around 2.8 recall@10, so the failure is not because the language is unfamiliar.
Training on the test split lets the model overfit and reach near‑100% in this small case, which mirrors the free‑embedding result and confirms this is a capacity limit, not a distribution gap.
8. 🧮 Why the label pattern matters
When the qrels are dense, meaning many document pairs co‑occur across queries, scores crash across models, which shows that combination count, not language tricks, drives the difficulty.
For example, E5‑Mistral drops from 40.4 recall@100 to 4.8, and GritLM loses about 50 absolute points moving to the dense pattern.
9. 🧱 What to use when single vectors hit the wall
A long‑context reranker solved the small LIMIT with 100% accuracy by reading all 46 documents and 1000 queries together, which avoids the single‑vector bottleneck but costs more compute.
Multi‑vector late‑interaction retrieval lifts scores well above single‑vector baselines, and sparse term‑matching like BM25 stays strong because its effective dimension is huge compared to dense embeddings.
Paper: "On the Theoretical Limitations of Embedding-Based Retrieval" (arxiv.org/abs/2508.21038)
Google shares TPUv7 details for the first time, at Hot Chips 2025.
Super valuable insight that could not otherwise be easily gleaned.
Ironwood is said to offer 2x the perf-per-watt of Google’s previous generation TPU, Trillium.
With up to 9,216 chips in a node, Ironwood can scale up to a MASSIVE 42.5 Exaflops in performance.
Though with 10MW of power consumption, that performance doesn’t come cheap.
But, like all of Google’s TPUs, this is solely for Google’s use as part of their Google Cloud services, so Ironwood cannot be examined outside of Google.
🧵 Read on 👇
🧵2/n. Ironwood TPU comes with several innovations.
The big one is how large the SuperPods can scale: now up to 9,216 chips, thanks to optical circuit switches (OCS) that share memory throughout the pod. There’s 1.77 PB of directly addressable HBM altogether.
This generation also brings a focus on RAS (reliability, availability, serviceability) features to keep these huge systems dependable.
Power efficiency also gets a boost, of course. Google is claiming a 2x perf-per-watt improvement – though it’s unclear if this is at iso-datatype.
"The Impact of Artificial Intelligence on Human Thought"
A big 132-page report.
AI is shifting real thinking work onto external systems, which boosts convenience but can weaken the effort that builds understanding and judgment.
The paper frames this pattern through cognitive offloading and cognitive load theory, then tracks it into social effects like standardized language, biased information flows, and manipulation tactics that target human psychology.
It says use AI to cut noise and routine steps, keep humans doing the heavy mental lifting, and add controls because personalization, deepfakes, and opaque models can steer choices at scale.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
Cognitive load theory says working memory is limited, so AI helps when it reduces extraneous load and hurts when it replaces the germane load needed to build skill.
In plain terms, let tools clean up the interface and fetch data, but keep people doing the analysis, explanation, and sense‑making.
🧵3/n. 🧰 Offloading and memory
Handing memory, calculation, or choosing to an external aid frees attention now, yet steady offloading can dull recall and critical habits later.
The paper casts web search, note apps, and assistants as a human‑machine transactive memory system, useful when sources are reliable, risky when they are biased or wrong.
That is why trust and verification routines matter as much as speed.
💼 Finally a solid 57-page report on AI's effect on job-market from Stanford University.
THE SHIFT HAS STARTED.
Entry‑level workers in the most AI‑exposed jobs are seeing clear employment drops, while older peers and less‑exposed roles keep growing.
Though overall employment continues to grow, employment growth for young workers in particular has been stagnant.
The drop shows up mainly as fewer hires and headcount, not lower pay, and it is sharpest where AI usage looks like automation rather than collaboration.
22‑25 year olds in the most exposed jobs show a 13% relative employment decline after controls.
⚙️ The paper tracked millions of workers and boils recent AI labor effects into 6 concrete facts.
The headline is entry‑level contraction in AI‑exposed occupations alongside muted wage movement.
AI replaces the codified knowledge that juniors supply more of, rather than the tacit knowledge that seniors accumulate.
🧵 Read on 👇
🧵2/n. 📊 The Data
The study uses administrative payroll records from ADP, which processes pay for over 25M workers, letting the authors observe monthly headcount and base salary with high granularity.
They build a balanced panel of firms present from 2021‑01 to 2025‑07, restrict to ages 18‑70 with recorded titles mapped to Standard Occupational Classification codes, and end up with 3.5M–5M workers per month in the main sample.
🧵3/n. 🧭 How AI exposure is measured
One exposure signal comes from occupational task links to GPT‑4 capabilities, aggregated to occupations, which ranks jobs by how model‑amenable their tasks look.
A second signal comes from the Anthropic Economic Index that tags millions of Claude chats by occupation tasks and classifies usage as automative or augmentative, which lets the authors separate substitute‑like usage from complement‑like usage.
Top universities from the US, UK, EU, China, Canada, Singapore, and Australia collaborated.
This could completely change research paper writing.
They show that AI can already draft proposals, run experiments, and write papers.
The authors built aiXiv, a new open-access platform where AI and humans can submit, review, and revise research in a closed-loop system.
The system uses multiple AI reviewers, retrieval-augmented feedback, and defenses against prompt injection to ensure that papers actually improve after review.
And the process worked: AI-generated proposals and papers get much better after iterative review, with acceptance rates jumping from near 0% to 45% for proposals and from 10% to 70% for papers.
🧵 Read on 👇
🧵2/n. Across real experiments it hits 77% proposal ranking accuracy, 81% paper ranking accuracy, blocks prompt‑injection with up to 87.9% accuracy, and pushes post‑revision acceptance for papers from 10% to 70%.
🧵3/n. This diagram shows aiXiv’s closed-loop system where AI and humans submit work, get automated reviews, revise, and then publish once quality clears the bar.
It means the platform is not a simple preprint dump, it is a workflow that forces measurable improvement each cycle.
Review agents score novelty, soundness, clarity, and feasibility using retrieval so feedback is grounded, and a prompt-injection detector screens malicious instructions before any model reads the file.
If the revised version looks better in pairwise checks, it moves forward, then a panel of LLMs votes, and 3 of 5 accepts trigger publication.
So the figure is saying aiXiv operationalizes end-to-end research, from idea to accepted paper, with guardrails and iteration built in.
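As a toy sketch of that loop (all functions here are stand-ins I made up, with an integer quality score in place of real papers and reviews):

```python
def aixiv_loop(draft_score, revise, panel_votes, max_rounds=5, accept_at=3):
    """Toy closed loop: review, revise, re-check. A revision is kept only
    if the pairwise comparison prefers it, and the paper publishes once
    the panel's yes-votes reach the 3-of-5 bar."""
    for round_ in range(max_rounds):
        if sum(panel_votes(draft_score)) >= accept_at:
            return "published", round_
        new_score = revise(draft_score)
        if new_score > draft_score:   # pairwise check: keep only real improvements
            draft_score = new_score
    return "rejected", max_rounds

# stand-ins: 5 reviewers, each voting yes once quality clears its own bar
votes_fn = lambda s: [s > t for t in (30, 40, 50, 60, 70)]
status, rounds = aixiv_loop(20, revise=lambda s: s + 15, panel_votes=votes_fn)
print(status, rounds)  # → published 3
```

The point the figure makes is exactly this shape: each cycle must demonstrably improve the draft before the vote is retried.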
This is that original MIT report that said 95% of AI pilots fail and spooked investors across the US stock market.
The report says most companies are stuck because 95% of GenAI pilots produce zero ROI, while a small 5% win by using systems that learn, plug into real workflows, and improve with use.
Teams keep buying or building static tools that demo well but cannot remember context, adapt, or fit daily operations, and this report maps exactly how the few winners do it differently.
🧪 How they ran the study
They combined a review of 300+ public implementations with 52 structured interviews and 153 senior‑leader surveys across January to June 2025, which gives the patterns below real footing.
🧵 Read on 👇
The big split they call the GenAI Divide is simple, 95% of organizations get nothing from GenAI pilots while a tiny 5% extract millions, and the driver is not the model itself but whether the system can learn, remember, and fit the workflow.
The steep drop from pilots to production for task-specific GenAI tools reveals the GenAI Divide.
The first method to achieve 99.9% on AIME 2025 with open-source models! 🤯
DeepConf uses a model’s own token confidence to keep only its strongest reasoning, reaching that score with GPT-OSS-120B while cutting tokens by up to 84.7% compared to standard parallel thinking.
Most systems still lean on self-consistency with majority voting, which lifts accuracy but hits diminishing returns and burns a lot of tokens.
🧠 The key idea
DeepConf is a test-time method that scores the model’s reasoning locally for confidence, filters weak traces, and often improves accuracy with fewer tokens without any extra training or tuning.
🧱 Why majority voting hits a wall
Parallel thinking samples many chains and votes, accuracy grows slowly as samples rise so compute scales linearly and the benefit flattens, which is exactly the pain DeepConf targets.
🔎 The confidence signals
Token confidence is the negative mean log probability of the top k candidates at each step, which gives a direct signal of how sure the model is at that moment.
Group confidence averages token confidence over a sliding window so local dips are visible without noise from the whole trace.
Tail confidence averages the last chunk of tokens because the ending steps decide the final answer and are where good traces often slip.
Bottom 10% group confidence looks at the worst parts of a trace, which is a strong indicator that the overall reasoning is shaky.
Lowest group confidence picks the single weakest window along a trace, which turns out to be a clean gate for dropping that trace early.
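These five signals can be sketched directly from per-step top-k log-probabilities; the window and tail sizes below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def token_confidence(topk_logprobs):
    """Negative mean log-prob of the top-k candidates per step.
    A peaked distribution pushes non-top candidates very low, so
    higher values mean a more confident step."""
    return -np.mean(topk_logprobs, axis=-1)

def group_confidence(tok_conf, window=3):
    """Sliding-window average of token confidence."""
    return np.convolve(tok_conf, np.ones(window) / window, mode="valid")

def tail_confidence(tok_conf, tail=2):
    return tok_conf[-tail:].mean()

def bottom_10pct_confidence(groups):
    k = max(1, int(np.ceil(0.1 * len(groups))))
    return np.sort(groups)[:k].mean()

def lowest_group_confidence(groups):
    return groups.min()

# toy top-2 probabilities for a 4-step trace: confident, confident, shaky, confident
probs = np.array([[0.90, 0.05], [0.85, 0.08], [0.40, 0.35], [0.88, 0.06]])
tok = token_confidence(np.log(probs))
grp = group_confidence(tok)
print(tok.round(2), lowest_group_confidence(grp).round(2))
```

Note how the shaky third step drags down every windowed signal, which is what makes the lowest window a clean gate.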
✅ Bottom line
DeepConf is a plug-in test-time compression recipe that filters or halts weak reasoning in place, so teams get higher accuracy and a big token cut without retraining or new hyperparameters.
🧮 Offline mode, smarter voting
DeepConf ranks traces by a confidence score and does confidence-weighted majority voting after optionally keeping only the top 10% or the top 90% by confidence.
With 512 traces, GPT-OSS-120B reaches 99.9% on AIME 2025 using tail or lowest-group confidence with filtering, compared to 97.0% for plain voting and 91.8% for pass@1.
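The offline filter-then-vote step might be sketched like this (toy traces with made-up answers and confidences; real runs use up to 512 samples):

```python
from collections import Counter

def confidence_weighted_vote(traces, keep_frac=0.1):
    """Keep the top `keep_frac` of traces by confidence, then tally
    final answers weighted by each kept trace's confidence."""
    ranked = sorted(traces, key=lambda t: -t["conf"])
    kept = ranked[: max(1, int(len(ranked) * keep_frac))]
    tally = Counter()
    for t in kept:
        tally[t["answer"]] += t["conf"]
    return tally.most_common(1)[0][0]

# toy traces: final answer plus a trace-level confidence score
traces = [{"answer": "42", "conf": 0.9}, {"answer": "41", "conf": 0.2},
          {"answer": "42", "conf": 0.8}, {"answer": "7",  "conf": 0.1}]
print(confidence_weighted_vote(traces, keep_frac=0.5))  # → 42
```

Keeping only the top slice before voting is what separates this from plain majority voting.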
⚡ Online mode, early stop while generating
A short warmup of 16 traces sets a stopping threshold s from the confidence distribution for the current problem.
During live generation, a trace stops the moment its lowest group confidence falls below s, so weak lines of thought do not waste tokens.
An adaptive sampling loop adds traces until the consensus is high enough, or a set budget like 512 is reached.
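A sketch of the online gate; the percentile rule for turning the 16 warmup traces into s is my assumption about the concrete cutoff, but the stop condition follows the description above:

```python
import numpy as np

def warmup_threshold(warmup_lowest_confs, keep_frac=0.9):
    """Derive stop threshold s from the warmup traces' lowest-group
    confidences (the percentile cut is an assumed concrete rule)."""
    return np.percentile(warmup_lowest_confs, (1 - keep_frac) * 100)

def should_stop(live_group_confs, s):
    """Kill a live trace the moment any window falls below s."""
    return min(live_group_confs) < s

warmup = np.linspace(0.5, 2.0, 16)   # 16 warmup traces' lowest confidences
s = warmup_threshold(warmup)         # 10th percentile of the warmup set
print(should_stop([1.2, 0.6], s), should_stop([1.2, 0.9], s))
```

The budget loop then just keeps sampling traces that survive this gate until consensus or the 512-trace cap.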