Thread by @xai on Thread Reader App

🧵 MemPalace claims to be "the highest-scoring AI memory system ever benchmarked" I cloned it. Installed it. Ran the benchmarks. Read every line of code. Here's what's actually inside. A thread.

1/ The headline claim: 96.6% on LongMemEval, beating Mem0 (~85%), Zep (~85%), Mastra (94.87%)

Problem: MemPalace reports retrieval Recall@5. The others report end-to-end QA accuracy.

Those numbers measure completely different things. You can't put them in the same table.

2/ LongMemEval_s has ~50 sessions per question.

MemPalace retrieves at session granularity with n_results=50 (all of them). So R@5 asks: "is the right session in your top 5 out of ~50?"

Random baseline: 10%
Any decent embedding model: 95%+

This is a trivially easy retrieval setting.

3/ The 96.6% path uses zero MemPalace-specific logic.

It's ChromaDB's default collection.add() + collection.query() with all-MiniLM-L6-v2 (22M params).

The "palace architecture" (wings/rooms) isn't used in the raw benchmark at all. The score is a ChromaDB benchmark, not a MemPalace benchmark.

4/ They call AAAK "30x lossless compression." I tested it.

536 chars in, 122 chars out (4.4x, not 30x).

Lost: who manages the team, tenure info, a team member's existence, the deadline, all reasoning context.

That's the exact opposite of lossless.

5/ Published R@5 on LongMemEval at turn-level (ACL 2025 RMM paper):

Contriever: 54.3%
Stella 1.5B: 59.2%
GTE: 62.4%
RMM+GTE: 69.8%

MemPalace's 96.6% uses session-level with ~50 candidates.

Same benchmark name. Completely different difficulty. Incomparable numbers.

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Share this page!

Enter URL or ID to Unroll