Abstract
Memory benchmarks for LLM agents are dominated by single-vendor pipelines: each system is evaluated end-to-end against its own published numbers, with vendor-controlled generators, judges, and extraction pipelines. Side-by-side numbers from different papers are not directly comparable, and when they appear in pitch decks they hide as much as they reveal.
We run Atlaso against mem0 on LongMemEval-S (n = 500) under a shared reader (Qwen 3.5-9B) and four independent judges (Anthropic Haiku 4.5 strict, GPT-5 permissive, GPT-4o strict, and mem0's own verbatim judge prompt). Atlaso wins on every judge, by +9.8 to +14.8 percentage points.
A 2×2 retriever-reader swap with MemPalace separates retrieval recall (where MemPalace wins with R@5 = 96.2%) from end-to-end QA accuracy (where the Atlaso reader wins, regardless of which retriever feeds it). Substrate carries QA; raw retrieval recall does not.
We also publish the limits of this study: mem0's published 93.4% does not reproduce under matched conditions (we observe 44.2% with their own judge prompt, a 49.2pp gap), but we did not reproduce mem0's full published pipeline, so the gap is best read as “methodology + pipeline,” not “methodology alone.” On the adversarial LoCoMo subset Atlaso loses to mem0 by 11.5 percentage points, and we discuss why.
1. Why this study
The memory-system landscape has matured fast over the last year. mem0, LangMem, Letta, Cognee, Honcho, MemPalace and a handful of others all publish benchmark numbers on LongMemEval, LoCoMo and adjacent fixtures. Yet the published numbers are largely incomparable: each system evaluates itself with its own judge, its own reader, its own fixture preprocessing, and its own definition of what counts as a correct answer.
This is not bad faith — it is the natural shape of an early field. But it produces a specific failure mode: a pitch deck slide that lines up “mem0 93.4% vs Atlaso X%” can be off by 50 percentage points purely because the two numbers were generated under different rules.
We wrote this study to do one thing: hold every variable except the memory system fixed, then measure.
1.1 What we hold fixed
- Shared fixture: LongMemEval-S, n = 500 questions, identical question IDs across all arms.
- Shared reader: Qwen 3.5-9B for the final answer step. Every system feeds its retrieved context to the same reader with the same prompt template.
- Shared judges: four independent LLM judges score the same answers: Anthropic Haiku 4.5 (strict, calibrated-abstention F1), OpenAI GPT-5 (permissive), OpenAI GPT-4o (strict), and mem0's own verbatim judge prompt as fetched from their public benchmarks repo.
- Label-blind normalizer: before each judge sees an answer, we strip system-identifying fingerprints (e.g. `<CONFIRMED>` polarity headers, “Atlaso says:” banners, signature abstention phrases) so judges cannot identify the arm from the answer text.
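The label-blind normalizer can be pictured as a small text filter that runs before any judge call. The sketch below is illustrative only: the regex patterns, the abstention phrases, and the choice to map abstentions to one neutral sentence are assumptions, not the study's actual rules, which live in its repository.

```python
import re

# Hypothetical fingerprint patterns, applied in order (header first, then banner).
FINGERPRINT_PATTERNS = [
    re.compile(r"<CONFIRMED>\s*", re.IGNORECASE),                       # polarity header
    re.compile(r"^\s*Atlaso says:\s*", re.IGNORECASE | re.MULTILINE),   # banner
]

# Hypothetical signature abstention phrases; each is mapped to one neutral
# sentence so the wording of an abstention does not reveal the arm.
ABSTENTION_PATTERNS = [
    re.compile(r"no matching deposit found\.?", re.IGNORECASE),
    re.compile(r"i don't have that in memory\.?", re.IGNORECASE),
]
NEUTRAL_ABSTENTION = "I don't know."


def normalize_answer(text: str) -> str:
    """Remove system-identifying markers and collapse whitespace."""
    for pattern in FINGERPRINT_PATTERNS:
        text = pattern.sub("", text)
    for pattern in ABSTENTION_PATTERNS:
        text = pattern.sub(NEUTRAL_ABSTENTION, text)
    return " ".join(text.split())
```

Every arm's answers would pass through the same function, so a judge sees only the normalized text regardless of which system produced it.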
2. Method
2.1 Fixture
LongMemEval-S is a published benchmark of 500 long-context conversational-memory questions spanning five categories: user-said-facts, updated-facts, preferences, multi-chat synthesis, and time-based reasoning. We use the exact JSONL fixture and gold answers from the LongMemEval repository, pinned by commit SHA, with no preprocessing.
2.2 Arms
Each arm consists of (i) an ingestion step that reads the conversation history and writes to the system's memory store, and (ii) a retrieval step that produces a context for the reader. mem0 default uses gpt-4o-mini as its extraction LLM at ingestion time, per its OSS default configuration. Atlaso writes deposits with zero LLM calls at ingestion.
2.3 Judges
Each system answer is scored by all four judges independently. The mem0 verbatim judge prompt is reproduced from the memory-benchmarks repository on GitHub (fetched 2026-05-06); we make no edits beyond formatting. Per-question judge outputs are stored in the run JSONLs alongside the system answer, retrieved contexts, and gold answer, so any third party can re-judge with a different model.
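Because the run JSONLs keep the system answer next to the gold answer, re-judging reduces to a loop over records. The following is a minimal sketch, assuming hypothetical field names (`question`, `gold_answer`, `system_answer`) and a stand-in substring judge; the real schema and judge prompts are in the repository and may differ.

```python
import json
from typing import Callable, Iterable

# A judge takes (question, gold_answer, system_answer) and returns a verdict.
Judge = Callable[[str, str, str], bool]


def rejudge(jsonl_lines: Iterable[str], judge: Judge) -> float:
    """Score every record with `judge` and return accuracy in [0, 1]."""
    verdicts = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Field names below are assumptions for illustration.
        verdicts.append(
            bool(judge(record["question"], record["gold_answer"], record["system_answer"]))
        )
    if not verdicts:
        raise ValueError("no records to judge")
    return sum(verdicts) / len(verdicts)


def substring_judge(question: str, gold: str, answer: str) -> bool:
    """Stand-in judge: correct if the gold string appears in the answer."""
    return gold.strip().lower() in answer.lower()
```

Swapping `substring_judge` for a function that calls a different LLM judge leaves the loop unchanged, which is the point of storing answers and gold labels together.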
A note on pre-registration. The v1.8 protocol — under which the canonical n = 500 numbers were generated — was iteratively converged across prereg-option3-v1.1 through prereg-option3-v1.8 during pilot runs. Every version is a public git tag; the full v1.1 → v1.8 diff history, pilot run JSONLs, and the locked v1.8 specification are all in the repository. We do not claim the canonical run was preregistered before any data was collected; we publish the entire version trail so the protocol-tuning surface is auditable rather than hidden.
3. Main result: four judges, same direction
Across four independent judges Atlaso beats mem0 on the same 500 questions, with deltas ranging from +9.8pp (GPT-4o strict) to +14.8pp (Haiku 4.5 strict). No judge flips the direction.
| Judge | Atlaso | mem0 | Δ |
|---|---|---|---|
| Haiku 4.5 strict | 61.8% (309/500) | 47.0% (235/500) | +14.8pp |
| mem0 verbatim | 56.4% (282/500) | 44.2% (221/500) | +12.2pp |
| GPT-5 permissive | 57.8% (289/500) | 46.4% (232/500) | +11.4pp |
| GPT-4o strict | 69.2% (346/500) | 59.4% (297/500) | +9.8pp |
Table 1: Atlaso vs mem0-default on LongMemEval-S (n = 500), shared Qwen 3.5-9B reader, four independent judges.
The four-judge structure is the point. A single judge can be biased toward one system's answer style — for example a permissive judge might over-credit hedged answers, while a strict judge might over-penalize abstention. By scoring the same 500 answers four times under four independent prompts (one of which is mem0's own), we get a sensitivity band rather than a single number. Atlaso wins in every band.
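The sensitivity band can be recomputed directly from the correct-answer counts in Table 1. The short sketch below uses only those published counts; the variable and function names are ours.

```python
# Correct-answer counts copied from Table 1: judge -> (Atlaso, mem0).
N = 500
TABLE_1 = {
    "Haiku 4.5 strict": (309, 235),
    "mem0 verbatim": (282, 221),
    "GPT-5 permissive": (289, 232),
    "GPT-4o strict": (346, 297),
}


def delta_pp(a_correct: int, b_correct: int, n: int) -> float:
    """Accuracy difference in percentage points, rounded to one decimal."""
    return round(100 * (a_correct - b_correct) / n, 1)


deltas = {judge: delta_pp(a, m, N) for judge, (a, m) in TABLE_1.items()}
band = (min(deltas.values()), max(deltas.values()))       # sensitivity band
same_direction = all(d > 0 for d in deltas.values())      # no judge flips the sign

if __name__ == "__main__":
    for judge, d in deltas.items():
        print(f"{judge}: {d:+.1f}pp")
    print("band:", band, "same direction:", same_direction)
```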
4. Why mem0's published 93.4 does not reproduce
mem0 publishes a headline number of 93.4% accuracy on LongMemEval-S in their memory-benchmarks repository. That number anchors most external comparisons to mem0. We tried to reproduce it.
Running mem0 with its default OSS pipeline (gpt-4o-mini extractor, default top-k, default storage configuration) on the same n = 500 fixture, and scoring with mem0's own verbatim judge prompt, we observe 44.2% — a 49.2 percentage point gap from the published number, under mem0's own scoring rule.
We are careful about how we describe this gap. We did not reproduce mem0's full published pipeline (their managed-platform configuration, top-k settings, and proprietary post-processing). So the 49.2pp gap is best read as “methodology + pipeline,” not “methodology alone.” What we can say definitively:
- Under mem0's own published judge prompt, mem0's OSS default pipeline scores 44.2%, not 93.4%.
- Under the same conditions, Atlaso scores 56.4% — a +12.2pp lead, with no extraction LLM and no proprietary platform configuration.
- The honest comparison floor is somewhere between mem0's 44.2% and their published 93.4%; future work that reproduces mem0's full managed-platform pipeline will narrow this band.
5. Retrieval recall is not QA accuracy
MemPalace publishes an R@5 (top-5 retrieval recall) of 96.6% on LongMemEval, a striking number that is sometimes lined up against end-to-end QA numbers. These are different axes. R@5 measures whether the right session is in the top-5 retrieved chunks; QA accuracy measures whether the final reader produces a correct answer. A system can dominate one axis and lose the other.
We ran a clean 2 × 2 retriever-reader swap to test this. Both cells share the Atlaso reader and the same LongMemEval-S fixture; the retrieval source is the only variable.
| Configuration | R@5 (n=479) | End-to-end QA (n=500) |
|---|---|---|
| Atlaso retriever + Atlaso reader | 64.9% | 61.8% |
| MemPalace retriever + Atlaso reader | 96.2% | 58.6% |
Table 2: 2×2 retriever-reader swap on LongMemEval-S, shared Atlaso reader. R@5 is computed on the 479 questions with a gold session label; end-to-end QA is on the full n = 500. MemPalace wins R@5 by +31pp; Atlaso wins end-to-end QA by ~3pp.
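For reference, R@5 as described here can be sketched as a hit rate over the questions that carry a gold session label. This is a minimal illustration, assuming that a question counts as a hit when any of its gold sessions appears in the top k retrieved items; the data layout and names are hypothetical, and the study's own scoring script may differ.

```python
from typing import Dict, List, Set


def recall_at_k(
    retrieved: Dict[str, List[str]],   # question id -> ranked session ids
    gold: Dict[str, Set[str]],         # question id -> gold session ids (labeled questions only)
    k: int = 5,
) -> float:
    """Fraction of labeled questions with a gold session in the top-k results."""
    if not gold:
        raise ValueError("no labeled questions")
    hits = 0
    for qid, gold_sessions in gold.items():
        top_k = retrieved.get(qid, [])[:k]
        if any(session in gold_sessions for session in top_k):
            hits += 1
    # The denominator is the number of labeled questions (479 in Table 2),
    # not the full fixture size.
    return hits / len(gold)
```

Note that this metric never looks at the reader's answer, which is why a high R@5 and a lower end-to-end QA score can coexist in the same row.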
The category-level breakdown is more striking. MemPalace returns whole sessions (dense embeddings on session-level documents) while Atlaso returns specific high-relevance turns (BM25 + scope filtering on deposit-level units). On preference questions, where the right answer is one short user statement within a long session, coarse session retrieval drops the signal: accuracy falls from 56.7% with the Atlaso retriever to 16.7% with the MemPalace retriever, a 40 percentage point drop.
The interpretation: high R@5 is not free QA accuracy. It is a useful component, but the substrate the reader runs against (what is stored, in what shape, and how it is filtered) is where end-to-end QA is won or lost.
6. Cost
Atlaso is cheaper per query and structurally cheaper at ingestion. The query-time number is measured directly from instrumented runs; the ingestion-time number is a definitional consequence of the architecture (Atlaso writes no LLM-extracted summaries).
| Axis | Atlaso | mem0 | Ratio |
|---|---|---|---|
| $ per query | $0.0096 | $0.0115 | 1.20× cheaper |
| LLM calls at ingestion | 0 | 1 × gpt-4o-mini per turn | ∞ |
Table 3: Cost on LongMemEval-S. Per-query cost measured on instrumented Wall I runs; mem0 extractor cost is modeled (gpt-4o-mini default), not traced API calls.
We deliberately do not publish a $/year savings extrapolation. Early drafts of this work projected dollar savings at 100K daily-active-users; on review the extrapolation multiplied a per-question extractor cost by a per-message rate, which is a unit error. The defensible cost claim is the 1.20× per-query number on the measured fixture, plus the structural ingestion-time difference; any TCO calculation downstream of that depends on assumptions about message volume, retention horizon, and storage tier that vary by deployment.
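The unit error described above is the kind that a small unit-carrying type catches mechanically. The sketch below is illustrative: the `Rate` class and the volume figures are hypothetical and are not a cost projection; only the two per-query prices come from Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """A quantity expressed as `numerator` per `denominator`, e.g. usd per query."""
    value: float
    numerator: str
    denominator: str

    def times(self, other: "Rate") -> "Rate":
        # Multiplication is valid only when the units cancel:
        # (usd / query) * (query / day) -> usd / day.
        if self.denominator != other.numerator:
            raise ValueError(
                f"unit mismatch: per-{self.denominator} times {other.numerator}-rate"
            )
        return Rate(self.value * other.value, self.numerator, other.denominator)


# Measured per-query costs from Table 3.
atlaso = Rate(0.0096, "usd", "query")
mem0 = Rate(0.0115, "usd", "query")
per_query_ratio = round(mem0.value / atlaso.value, 2)

# Hypothetical volumes, for demonstrating the check only.
queries_per_day = Rate(1_000, "query", "day")
messages_per_day = Rate(50_000, "message", "day")

daily = mem0.times(queries_per_day)   # units cancel: usd per day
# mem0.times(messages_per_day) raises ValueError: per-query times message-rate
```

The valid product and the rejected one differ only in the unit of the volume figure, which is exactly the distinction the early extrapolation missed.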
7. Where we lose: LoCoMo
LoCoMo is an adversarial conversational-memory benchmark designed with planted distractors and temporal-reasoning traps. It is a different distribution from LongMemEval. We ran the same shared-reader protocol on a 200-question LoCoMo subset. Atlaso loses.
| System | Correct | Accuracy |
|---|---|---|
| mem0-default | 71/200 | 35.5% |
| Atlaso | 48/200 | 24.0% (−11.5pp) |
Table 4: LoCoMo adversarial subset (n = 200, Haiku 4.5 strict judge). Atlaso loses to mem0-default by 11.5pp.
The honest interpretation is that Atlaso's contradiction-aware retrieval is currently tuned for conversational-memory questions where the answer either exists clearly in the field or does not. On LoCoMo's temporal- and distractor-heavy questions, the system over-abstains, or returns the planted distractor with high confidence. We publish this loss because hiding it would make the rest of the study less trustworthy, not more — and because where a system fails is more diagnostic than where it succeeds.
We do not yet have matched-condition LoCoMo numbers for the other systems in the broader landscape, so we do not draw conclusions about the LoCoMo leaderboard. The fixed comparison here is Atlaso vs mem0, same fixture, same judge.
8. Limitations
We have tried to be conservative about what this study establishes. Specifically:
- mem0's full pipeline was not reproduced. The 49.2pp gap between 93.4% and 44.2% conflates judge methodology with pipeline configuration. Until mem0's full managed-platform pipeline is reproduced at or near 93.4% on the same fixture, the gap should be read as “methodology + pipeline,” not isolated to either one.
- Other memory systems were run only under exploratory protocols. Exploratory runs of LangMem, Letta, Cognee, and Honcho exist in the repository (`research/option3/runs/histogram_arm*.jsonl` and adjacent files) under earlier configurations (different readers, different top-k, different judge regimes) and were not re-run under the locked shared-reader, shared-judge protocol used here. They are excluded from the head-to-head table for that reason; matched-condition re-runs are v1.1 work. Cross-system comparisons on this page are restricted to mem0 (run end-to-end on Wall I) and MemPalace (used only as a retriever in the 2×2 swap).
- Pre-registration is a version trail, not a clean prior commit. The v1.8 protocol was iteratively converged through pilot runs (v1.1 → v1.8 are all public tags). We do not claim the canonical run was preregistered before any data was collected; we publish the full version trail so the protocol-tuning surface is auditable rather than hidden.
- Cost is per-query, not TCO. The 1.20× number is measured on the run; deployment cost depends on assumptions we do not make.
- Single subset of LoCoMo. The LoCoMo result is on n = 200, Haiku 4.5 strict, one judge. The other three judges were not run on LoCoMo at this scope.
9. Reproducibility
The entire study is reproducible from the open-source repository. The arm scripts, fixtures, judge prompts (including mem0's verbatim prompt as fetched), run JSONLs with raw per-question scores and contexts, the bootstrap CI script, and the protocol version history are all in the public repo. There is no proprietary platform call in the Atlaso path; mem0 is run via its public OSS configuration with the default extractor.
- Code: github.com/imashishkh21/atlaso
- Fixture: LongMemEval-S (commit SHA pinned in the run config)
- Judges: Anthropic Haiku 4.5, OpenAI GPT-5, OpenAI GPT-4o, mem0 verbatim (prompts in `research/option3/judge.py`)
- Per-question outputs: every system answer, retrieved context, judge rationale and gold answer is in the run JSONLs under `research/option3/runs/`
- Protocol version: `prereg-option3-v1.8`
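For readers who want a confidence interval on a head-to-head delta without opening the repository, a paired percentile bootstrap over per-question verdicts is one common approach. The sketch below is an assumption about the general technique, not a copy of the study's bootstrap CI script; function and parameter names are ours.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    a_correct: List[int],   # 1/0 verdict per question for system A
    b_correct: List[int],   # 1/0 verdict per question for system B, same order
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for the accuracy delta (A minus B) in percentage points."""
    if len(a_correct) != len(b_correct) or not a_correct:
        raise ValueError("need two non-empty verdict lists of equal length")
    rng = random.Random(seed)
    n = len(a_correct)
    # Pairing by question keeps the per-question difficulty shared by both arms.
    diffs = [a - b for a, b in zip(a_correct, b_correct)]
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)   # resample questions with replacement
        means.append(100.0 * sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Feeding it the per-question verdicts of one judge for both arms gives an interval for that judge's row; an interval that excludes zero supports the direction of the delta for that judge.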
The practical takeaway: hold the reader, the fixture, and the judge fixed before you compare memory systems. When you do, the gap between what systems claim and what they deliver becomes visible — and the direction of the answer stops depending on whose judge you used.