Rohan Paul
Aug 31 · 11 tweets · 6 min read
BRILLIANT @GoogleDeepMind research.

Even the best embeddings cannot represent all possible query-document combinations, which means some answers are mathematically impossible to recover.

It reveals a sharp truth: embedding models can only capture so many pairings, and beyond that, recall collapses no matter the data or tuning.

🧠 Key takeaway

Embeddings have a hard ceiling, set by dimension, on how many top‑k document combinations they can represent exactly.

They prove this with sign‑rank bounds, then show it empirically and with a simple natural‑language dataset where even strong models stay under 20% recall@100.

When queries force many combinations, single‑vector retrievers hit that ceiling, so other architectures are needed.

4096‑dim embeddings already break near 250M docs for top‑2 combinations, even in the best case.

🛠️ Practical Implications

For applications like search, recommendation, or retrieval-augmented generation, this means scaling up models or datasets alone will not fix recall gaps.

At large index sizes, even very high-dimensional embeddings fail to capture all combinations of relevant results.

So embeddings cannot work as the sole retrieval backbone. We will need hybrid setups, combining dense vectors with sparse methods, multi-vector models, or rerankers to patch the blind spots.
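To make "hybrid" concrete, here is a minimal sketch of reciprocal rank fusion, one common way to merge a dense-embedding ranking with a BM25 ranking. This is a generic illustration, not something from the paper; the function name and the k=60 constant are just the usual convention.

```python
# Sketch of reciprocal rank fusion (RRF): combine rankings from several
# retrievers (e.g. a dense embedder and BM25) into one fused ranking.
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked doc-id lists, one list per retriever
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```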

This shifts how we should design retrieval pipelines, treating embeddings as one useful tool but not a universal solution.

🧵 Read on 👇
This figure explains LIMIT, a tiny natural-language dataset they built to test whether single-vector embeddings can represent all combinations of relevant documents for each query.

The left grid is the target relevance pattern, and the task is to rank exactly the k=2 correct documents for every query.

The right side shows the mapping into simple text, queries like “Who likes Quokkas?” paired with short bios such as “Jon Durben likes Quokkas and Apples,” so language complexity is not the challenge.

The key point: even with this simple setup, strong MTEB embedders stay under 20% recall@100, revealing a capacity limit of single-vector retrieval.
2. ⚙️ The core concepts

They formalize retrieval as a binary relevance matrix over queries and documents, then ask for a low‑rank score matrix that, for each query, puts relevant documents ahead of the others.

They show that “row‑wise order preserving” and “row‑wise thresholdable” are the same requirement for binary labels, so both describe the exact capacity a single‑vector model needs.

They connect that capacity to sign rank, which is the smallest dimension that can reproduce the positive or negative pattern of a matrix, and they derive tight lower and upper bounds on the needed embedding dimension from it.

A direct consequence is stark: for any fixed dimension d, there exist top‑k combinations that no query embedding can ever retrieve, regardless of training data.
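To pin down what "row-wise thresholdable" means, here is a small sketch (mine, not from the paper) that checks whether a score matrix separates every query's relevant documents from the rest with a per-row threshold. Since single-vector scores come from Q @ D.T, their rank is capped at the embedding dimension d, and that cap is what limits which relevance patterns are achievable.

```python
import numpy as np

def row_wise_thresholdable(scores, rel):
    # scores: (num_queries, num_docs) similarity matrix
    # rel:    (num_queries, num_docs) binary relevance matrix
    # True if, for every query, some per-row threshold puts all relevant
    # docs above all irrelevant docs.
    rel = rel.astype(bool)
    for s, r in zip(scores, rel):
        if r.any() and (~r).any() and s[r].min() <= s[~r].max():
            return False
    return True

# With query embeddings Q (m x d) and doc embeddings D (n x d), scores = Q @ D.T
# has rank at most d, so only some binary patterns can ever pass this check.
```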
3. 🧪 Best‑case test, no language in the loop

They remove language completely and directly optimize the document and query vectors against the target relevance matrix with full‑batch contrastive training, which is the friendliest possible setup for embeddings.

For k=2, they increase the number of documents until optimization cannot reach 100% accuracy, calling that cutoff the critical‑n for a given dimension.

The curve of critical‑n versus dimension fits a 3rd‑degree polynomial with r^2=0.999, which lets them extrapolate how scale pushes the ceiling.

The extrapolated breakpoints are 500K docs at d=512, 1.7M at d=768, 4M at d=1024, 107M at d=3072, and 250M at d=4096, and this is already the best case any retriever could hope for.
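For intuition, a stripped-down version of that best-case experiment might look like the sketch below (my simplification, not the authors' code; the step count and learning rate are placeholders): optimize free query and document vectors directly against the target relevance matrix with a full-batch contrastive-style loss, then test whether every query ranks its relevant documents on top.

```python
import torch

def free_embedding_solves(rel, dim, steps=2000, lr=0.01):
    # rel: (num_queries, num_docs) binary relevance matrix (the target pattern)
    rel = torch.as_tensor(rel, dtype=torch.float32)
    m, n = rel.shape
    Q = torch.randn(m, dim, requires_grad=True)   # free query vectors
    D = torch.randn(n, dim, requires_grad=True)   # free document vectors
    opt = torch.optim.Adam([Q, D], lr=lr)
    for _ in range(steps):
        scores = Q @ D.T                           # full-batch similarity matrix
        log_probs = torch.log_softmax(scores, dim=1)
        loss = -(log_probs * rel).sum() / rel.sum()  # push relevant docs up
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        k = int(rel.sum(dim=1).max())              # k = 2 in the paper's sweep
        topk = (Q @ D.T).topk(k, dim=1).indices
        hits = torch.gather(rel, 1, topk).sum(dim=1)
        return bool((hits == rel.sum(dim=1)).all())  # every query solved?
```

Sweeping the number of documents until this stops succeeding for a given dim is, roughly, the critical-n measurement the polynomial above is fit to.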
4. 🧩 What LIMIT actually is

They build a natural‑language dataset that encodes all 2‑document combinations across a small pool, phrased as simple queries like “who likes X” and short biography‑style documents.

Each document lists fewer than 50 liked attributes to keep texts short, and each query asks for exactly 1 attribute, which keeps the language trivial and isolates the combination pressure.

They use 50K documents and 1000 queries, and pick 46 special documents because 46 choose 2 equals 1035, which is just above 1000, then they also provide a 46‑doc small split.

They randomize names and attributes, dedupe with lexical checks, and ensure every relevant pair appears somewhere, so the dataset is simple in wording but dense in combinations.
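A hypothetical generator for LIMIT-style items, just to make the construction concrete (the names, attribute strings, and counts here are illustrative stand-ins, not the paper's actual lists): give every pair of the 46 special documents its own attribute, so each "Who likes X?" query has exactly 2 relevant documents.

```python
from itertools import combinations
import random

names = [f"Person {i}" for i in range(46)]                 # the 46 special documents
attribute_pool = [f"attribute {i}" for i in range(1100)]   # stand-ins for "Quokkas", "Apples", ...
random.shuffle(attribute_pool)

likes = {name: [] for name in names}
queries = []
# 46 choose 2 = 1035 pairs; one unique attribute per pair, so every
# "Who likes X?" query has exactly the 2 intended relevant documents
for attr, (a, b) in zip(attribute_pool, combinations(names, 2)):
    likes[a].append(attr)
    likes[b].append(attr)
    queries.append((f"Who likes {attr}?", {a, b}))

# each special document ends up listing 45 attributes, under the <50 cap
docs = {name: f"{name} likes {', '.join(attrs)}." for name, attrs in likes.items()}
# the full benchmark then pads the corpus out to ~50K documents with distractors
```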
5. 📉 How current embedders fared

On the full LIMIT, leading single‑vector models struggle to even touch 20% recall@100, despite the plain language and the easy query form.

Performance grows with dimension but stays low. A late‑interaction multi‑vector model does much better, with GTE‑ModernColBERT near 54.8 recall@100, while the high‑dimensional sparse baseline BM25 reaches about 93.6 recall@100.

This pattern matches the theory: more dimensions or more expressive matching buys more combinations, but a single compact vector hits limits fast.
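For reference, the recall@k metric quoted throughout is simply the fraction of each query's relevant documents that land in its top-k results, averaged over queries; a minimal sketch:

```python
import numpy as np

def recall_at_k(scores, rel, k=100):
    # scores: (num_queries, num_docs) similarities, rel: binary relevance matrix
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(rel, topk, axis=1).sum(axis=1)
    return float((hits / rel.sum(axis=1)).mean())
```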
6. 🪙 Even 46 docs are hard

With just 46 documents, models still do not reach 100% even by recall@20, and recall@10 sits notably lower.

Concrete numbers help: Promptriever (4096 dims) reaches 97.7 recall@20, GritLM (4096) reaches 90.5, and E5‑Mistral (4096) reaches 85.2, which is high but still not perfect for such a tiny pool.
7. 🧪 Not a domain‑shift issue

Fine‑tuning an embedder on a matching training split barely moves the needle, with best gains of only about 2.8 points of recall@10, so the failure is not because the language is unfamiliar.

Training on the test split lets the model overfit and reach near‑100% in this small case, which mirrors the free‑embedding result and confirms this is a capacity limit, not a distribution gap.
8. 🧮 Why the label pattern matters

When the qrels are dense, meaning many document pairs co‑occur across queries, scores crash across models, which shows that combination count, not language tricks, drives the difficulty.

For example, E5‑Mistral drops from 40.4 recall@100 to 4.8, and GritLM loses about 50 absolute points moving to the dense pattern.
9. 🧱 What to use when single vectors hit the wall

A long‑context reranker solved the small LIMIT with 100% accuracy by reading all 46 documents and 1000 queries together, which avoids the single‑vector bottleneck but costs more compute.

Multi‑vector late‑interaction retrieval lifts scores well above single‑vector baselines, and sparse term‑matching like BM25 stays strong because its effective dimension is huge compared to dense embeddings.
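To show why late interaction escapes the single-vector bottleneck, here is a rough ColBERT-style MaxSim scorer (a generic sketch, not tied to GTE-ModernColBERT or any specific library): each query token keeps its own vector, so relevance no longer has to be compressed into a single point per query.

```python
import numpy as np

def maxsim_score(query_token_vecs, doc_token_vecs):
    # query_token_vecs: (q_tokens, dim), doc_token_vecs: (d_tokens, dim)
    sims = query_token_vecs @ doc_token_vecs.T   # token-by-token similarities
    return sims.max(axis=1).sum()                # best doc token per query token

def rank(query_token_vecs, corpus):
    # corpus: list of per-document token matrices
    scores = [maxsim_score(query_token_vecs, d) for d in corpus]
    return np.argsort(scores)[::-1]              # best-scoring documents first
```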
Paper: "On the Theoretical Limitations of Embedding-Based Retrieval"
arxiv.org/abs/2508.21038
