ani
mle @sentra_app, prev. @shopify, mlh top 50, uwaterloo ce '30
May 7 · 4 tweets · 3 min read
We finally know why LLMs hallucinate. It's not the model. It's the geometry.

@OpenAI text-embedding-3-large: 91/3072 dimensions do real work.

@GeminiApp gemini-embedding-001: 80/3072 dimensions do real work.

~97% of your vector database is mathematically empty. Your RAG system is retrieving from noise.

@ashwingop and I present "The Geometry of Consolidation" - a proof that RAG compression has a hard floor no algorithm can beat, set by a single spectral number your embedding model cannot escape.

Every hallucination your RAG pipeline produces? This is why.

Paper + results: github.com/niashwin/geome…

[Image: results chart]
Here's the intuition behind that chart.

Your embedding model promises 3072 dimensions of storage.

All the real semantic signal? Crammed into 91 of them. The other 2,981 are noise your retrieval system searches through anyway.

This isn't a training failure. It's not fixable with a bigger model. It's what the geometry of these spaces actually looks like under the hood.

We call it "effective dimensionality".
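The thread doesn't give the formula, but a widely used spectral definition of effective dimensionality is the participation ratio of the covariance eigenvalues, (Σλ)² / Σλ², which equals the full dimension D for an isotropic cloud and collapses toward 1 as the spectrum concentrates. A hypothetical sketch (this is one common definition, not necessarily the paper's exact "single spectral number"):

```python
import numpy as np

def participation_ratio(X):
    # Covariance eigenvalues = squared singular values of the centered data.
    Xc = X - X.mean(axis=0)
    lam = np.linalg.svd(Xc, compute_uv=False) ** 2
    # (sum λ)^2 / sum λ^2: D for equal eigenvalues, 1 for a rank-1 cloud.
    return lam.sum() ** 2 / np.square(lam).sum()

rng = np.random.default_rng(0)
iso = rng.normal(size=(500, 64))          # roughly isotropic: PR lands near 64
aniso = iso * np.geomspace(1, 1e-3, 64)   # fast-decaying spectrum: PR far below 64
print(participation_ratio(iso), participation_ratio(aniso))
```

Applied to a stack of 3072-dim embeddings, a participation ratio around 80–91 would match the thread's headline numbers: the space is nominally 3072-dimensional, but the variance behaves as if it lived in under a hundred directions.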