Abstract

Memory benchmarks for LLM agents are dominated by single-vendor pipelines: each system is evaluated end-to-end against its own published numbers, with vendor-controlled generators, judges, and extraction pipelines. Side-by-side numbers from different papers are not directly comparable, and when they appear in pitch decks they hide as much as they reveal.

We run Atlaso against mem0 on LongMemEval-S (n = 500) under a shared reader (Qwen 3.5-9B) and four independent judges (Anthropic Haiku 4.5 strict, GPT-5 permissive, GPT-4o strict, and mem0's own verbatim judge prompt). Atlaso wins on every judge, by +9.8 to +14.8 percentage points.

A 2×2 retriever-reader swap with MemPalace separates retrieval recall (where MemPalace wins with R@5 = 96.2% against Atlaso's 64.9%) from end-to-end QA accuracy (where the Atlaso retriever wins 61.8% to 58.6% under the same reader). Substrate carries QA; raw retrieval recall does not.

We also publish the limits of this study: mem0's published 93.4% does not reproduce under matched conditions (we observe 44.2% with their own judge prompt, a 49.2pp gap), but we did not reproduce mem0's full published pipeline, so the gap is best read as "methodology + pipeline," not "methodology alone." On the adversarial LoCoMo subset Atlaso loses to mem0 by 11.5 percentage points, and we discuss why.

1. Why this study

The memory-system landscape has matured fast over the last year. mem0, LangMem, Letta, Cognee, Honcho, MemPalace and a handful of others all publish benchmark numbers on LongMemEval, LoCoMo and adjacent fixtures. Yet the published numbers are largely incomparable: each system evaluates itself with its own judge, its own reader, its own fixture preprocessing, and its own definition of what counts as a correct answer.

This is not bad faith — it is the natural shape of an early field. But it produces a specific failure mode: a pitch deck slide that lines up "mem0 93.4% vs Atlaso X%" can be off by 50 percentage points purely because the two numbers were generated under different rules.

We wrote this study to do one thing: hold every variable except the memory system fixed, then measure.

1.1 What we hold fixed

Shared fixture — LongMemEval-S, n = 500 questions, identical question IDs across all arms.
Shared reader — Qwen 3.5-9B for the final answer step. Every system feeds its retrieved context to the same reader with the same prompt template.
Shared judges — four independent LLM judges score the same answers: Anthropic Haiku 4.5 (strict, calibrated-abstention F1), OpenAI GPT-5 (permissive), OpenAI GPT-4o (strict), and mem0's own verbatim judge prompt as fetched from their public benchmarks repo.
Label-blind normalizer — before each judge sees an answer, we strip system-identifying fingerprints (e.g. <CONFIRMED> polarity headers, "Atlaso says:" banners, signature abstention phrases) so judges cannot identify the arm from the answer text.
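
As an illustration of the label-blind normalizer, the sketch below strips the fingerprints named above before an answer reaches a judge. The regex patterns, the abstention phrases, and the names normalize_answer, FINGERPRINT_PATTERNS, and NEUTRAL_ABSTENTION are assumptions made for this example; the study's actual normalizer may differ.

```python
import re

# Hypothetical fingerprint patterns, based only on the examples named in the text.
FINGERPRINT_PATTERNS = [
    re.compile(r"</?CONFIRMED>\s*", re.IGNORECASE),                    # polarity header tags
    re.compile(r"^\s*Atlaso says:\s*", re.IGNORECASE | re.MULTILINE),  # banner at line start
]

# Hypothetical signature abstention phrases, mapped to one neutral wording
# so a judge cannot tell arms apart by how they decline to answer.
ABSTENTION_PHRASES = [
    "I have no record of that in memory.",
    "No matching memory was found.",
]
NEUTRAL_ABSTENTION = "I don't know."


def normalize_answer(text: str) -> str:
    """Remove arm-identifying markers and collapse whitespace."""
    for pattern in FINGERPRINT_PATTERNS:
        text = pattern.sub("", text)
    for phrase in ABSTENTION_PHRASES:
        text = text.replace(phrase, NEUTRAL_ABSTENTION)
    return " ".join(text.split())
```

The same function would be applied to every arm's answer, including arms that carry no fingerprints, so the judge input format is uniform.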

2. Method

2.1 Fixture

LongMemEval-S is a published benchmark of 500 long-context conversational-memory questions spanning five categories: user-said-facts, updated-facts, preferences, multi-chat synthesis, and time-based reasoning. We use the exact JSONL fixture and gold answers from the LongMemEval repository, pinned by commit SHA, with no preprocessing.

2.2 Arms

Each arm consists of (i) an ingestion step that reads the conversation history and writes to the system's memory store, and (ii) a retrieval step that produces a context for the reader. mem0 default uses gpt-4o-mini as its extraction LLM at ingestion time, per its OSS default configuration. Atlaso writes deposits with zero LLM calls at ingestion.
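
The two-step arm structure can be sketched as a small interface plus a shared reader call. Everything named here (MemoryArm, KeywordArm, build_reader_prompt, run_arm, and the prompt wording) is hypothetical; the toy arm ranks stored turns by word overlap purely to make the example runnable and is not how Atlaso or mem0 retrieve.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Protocol


def tokens(text: str) -> set[str]:
    """Lowercased word set, used only by the toy arm below."""
    return set(re.findall(r"\w+", text.lower()))


class MemoryArm(Protocol):
    """Hypothetical interface: (i) ingestion, (ii) retrieval."""
    def ingest(self, session_id: str, turns: list[str]) -> None: ...
    def retrieve(self, question: str, k: int) -> list[str]: ...


@dataclass
class KeywordArm:
    """Toy arm: stores raw turns, makes no LLM calls at ingestion."""
    turns: list[str] = field(default_factory=list)

    def ingest(self, session_id: str, turns: list[str]) -> None:
        self.turns.extend(turns)

    def retrieve(self, question: str, k: int) -> list[str]:
        q = tokens(question)
        ranked = sorted(self.turns, key=lambda t: len(q & tokens(t)), reverse=True)
        return ranked[:k]


def build_reader_prompt(question: str, context: list[str]) -> str:
    """One template for every arm (wording is an assumption)."""
    bullet_list = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{bullet_list}\n\nQuestion: {question}\nAnswer:"


def run_arm(arm: MemoryArm, history: dict[str, list[str]], question: str,
            reader: Callable[[str], str], k: int = 5) -> str:
    """Ingest the history, retrieve context, and hand it to the shared reader."""
    for session_id, turns in history.items():
        arm.ingest(session_id, turns)
    return reader(build_reader_prompt(question, arm.retrieve(question, k)))
```

Because the reader and the template sit outside the arm, swapping the arm changes only what context reaches the reader.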

2.3 Judges

Each system answer is scored by all four judges independently. The mem0 verbatim judge prompt is reproduced from the memory-benchmarks repository on GitHub (fetched 2026-05-06); we make no edits beyond formatting. Per-question judge outputs are stored in the run JSONLs alongside the system answer, retrieved contexts, and gold answer, so any third party can re-judge with a different model.
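
Re-judging from the stored run files might look like the sketch below. The JSONL field names (question, gold_answer, system_answer, judges) and the substring judge are assumptions for illustration; the actual schema is whatever the run JSONLs in the repository use, and a real judge would call an LLM.

```python
import json
from typing import Callable, Iterable

# A judge takes (question, gold answer, system answer) and returns a verdict.
Judge = Callable[[str, str, str], bool]


def rejudge(lines: Iterable[str], judge: Judge, judge_name: str) -> list[dict]:
    """Score every stored answer with a new judge, keeping earlier verdicts."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        verdict = judge(record["question"], record["gold_answer"], record["system_answer"])
        record.setdefault("judges", {})[judge_name] = verdict
        records.append(record)
    return records


def accuracy(records: list[dict], judge_name: str) -> float:
    """Fraction of records the named judge marked correct."""
    return sum(r["judges"][judge_name] for r in records) / len(records)


def substring_judge(question: str, gold: str, answer: str) -> bool:
    """Stand-in judge: correct if the gold string appears in the answer."""
    return gold.strip().lower() in answer.strip().lower()
```

With a file on disk, the lines argument can simply be an open file handle.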

A note on pre-registration. The v1.8 protocol — under which the canonical n = 500 numbers were generated — was iteratively converged across prereg-option3-v1.1 through prereg-option3-v1.8 during pilot runs. Every version is a public git tag; the full v1.1 → v1.8 diff history, pilot run JSONLs, and the locked v1.8 specification are all in the repository. We do not claim the canonical run was preregistered before any data was collected; we publish the entire version trail so the protocol-tuning surface is auditable rather than hidden.

3. Main result: four judges, same direction

Across four independent judges Atlaso beats mem0 on the same 500 questions, with deltas ranging from +9.8pp (GPT-4o strict) to +14.8pp (Haiku 4.5 strict). No judge flips the direction.

Judge	Atlaso	mem0	Δ
Haiku 4.5 strict	61.8% (309/500)	47.0% (235/500)	+14.8pp
mem0 verbatim	56.4% (282/500)	44.2% (221/500)	+12.2pp
GPT-5 permissive	57.8% (289/500)	46.4% (232/500)	+11.4pp
GPT-4o strict	69.2% (346/500)	59.4% (297/500)	+9.8pp

Table 1: Atlaso vs mem0-default on LongMemEval-S (n = 500), shared Qwen 3.5-9B reader, four independent judges.

Figure 1. Atlaso vs mem0-default on LongMemEval-S (n = 500). Four independent judges, same direction every time. Lowest delta (GPT-4o strict) is +9.8pp; highest (Haiku 4.5 strict) is +14.8pp.

The four-judge structure is the point. A single judge can be biased toward one system's answer style: a permissive judge might over-credit hedged answers, while a strict judge might over-penalize abstention. By scoring the same 500 answers four times under four independent prompts (one of which is mem0's own), we get a sensitivity band rather than a single number. Atlaso leads at every point in that band, from +9.8pp to +14.8pp.
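
The deltas and the sensitivity band follow directly from the per-judge correct counts in Table 1. A short sketch that recomputes them; the counts are those reported in the table, and the variable names are illustrative.

```python
N = 500  # questions per arm

# (Atlaso correct, mem0 correct) per judge, as reported in Table 1.
CORRECT = {
    "Haiku 4.5 strict": (309, 235),
    "mem0 verbatim": (282, 221),
    "GPT-5 permissive": (289, 232),
    "GPT-4o strict": (346, 297),
}


def delta_pp(atlaso: int, mem0: int, n: int = N) -> float:
    """Accuracy difference in percentage points."""
    return 100.0 * (atlaso - mem0) / n


deltas = {judge: delta_pp(a, m) for judge, (a, m) in CORRECT.items()}
band = (min(deltas.values()), max(deltas.values()))    # sensitivity band
same_direction = all(d > 0 for d in deltas.values())   # no judge flips the sign

for judge, d in deltas.items():
    print(f"{judge}: {d:+.1f}pp")
```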

4. Why mem0's published 93.4 does not reproduce

mem0 publishes a headline number of 93.4% accuracy on LongMemEval-S in their memory-benchmarks repository. That number anchors most external comparisons to mem0. We tried to reproduce it.

Running mem0 with its default OSS pipeline (gpt-4o-mini extractor, default top-k, default storage configuration) on the same n = 500 fixture, and scoring with mem0's own verbatim judge prompt, we observe 44.2% — a 49.2 percentage point gap from the published number, under mem0's own scoring rule.

Figure 2. Mem0's published 93.4% (left bar, mem0's own pipeline + own judge) does not reproduce under matched conditions. With our infrastructure but mem0's own verbatim judge prompt, mem0 scores 44.2% and Atlaso scores 56.4% on the same 500 questions.

We are careful with how we describe this gap. We did not reproduce mem0's full published pipeline (their managed-platform configuration, top-k settings, and proprietary post-processing). So the 49.2pp gap is best read as "methodology + pipeline," not "methodology alone." What we can say definitively:

Under mem0's own published judge prompt, mem0's OSS default pipeline scores 44.2%, not 93.4%.
Under the same conditions, Atlaso scores 56.4% — a +12.2pp lead, with no extraction LLM and no proprietary platform configuration.
mem0's matched-condition score lies somewhere between the 44.2% we observe and their published 93.4%; future work that reproduces mem0's full managed-platform pipeline will narrow this band.

5. Retrieval recall is not QA accuracy

MemPalace publishes an R@5 (top-5 retrieval recall) of 96.6% on LongMemEval, a striking number that is sometimes lined up against end-to-end QA numbers. These are different axes. R@5 measures whether the right session is in the top-5 retrieved chunks; QA accuracy measures whether the final reader produces a correct answer. A system can dominate one axis and lose the other.

We ran a clean 2 × 2 retriever-reader swap to test this. Both cells share the Atlaso reader and the same LongMemEval-S fixture; the retrieval source is the only variable.

Configuration	R@5 (n=479)	End-to-end QA (n=500)
Atlaso retriever + Atlaso reader	64.9%	61.8%
MemPalace retriever + Atlaso reader	96.2%	58.6%

Table 2: 2×2 retriever-reader swap on LongMemEval-S, shared Atlaso reader. R@5 is computed on the 479 questions with a gold session label; end-to-end QA is on the full n = 500. MemPalace wins R@5 by +31pp; Atlaso wins end-to-end QA by ~3pp.

Figure 3. The 2×2 retriever-reader swap. MemPalace wins the R@5 axis (96.2% vs 64.9%, +31pp). Atlaso's retriever wins the end-to-end QA axis (61.8% vs 58.6%) even though the reader is identical in both cells. Retrieval recall and QA accuracy are not commensurable.

The category-level breakdown is more striking. MemPalace returns whole sessions (dense embeddings on session-level documents), while Atlaso returns specific high-relevance turns (BM25 + scope filtering on deposit-level units). On preference questions, where the right answer is one short user statement within a long session, coarse session retrieval drops the signal: accuracy falls from 56.7% with Atlaso's retriever to 16.7% with MemPalace's, a 40 percentage point drop.

Figure 4. Per-category breakdown of the retriever-reader swap. MemPalace's session-level retrieval drops 40 percentage points on Preference questions because the answer is one user statement inside a long session, not the session itself.

The interpretation: high R@5 is not free QA accuracy. It is a useful component, but end-to-end QA is won or lost in the substrate the reader runs against: what is stored, in what shape, and how it is filtered.
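
For reference, the R@5 axis can be computed along the lines below. This is a minimal sketch under two assumptions not stated in the text: a question counts as a hit if any gold session appears in the top k, and questions without a gold session label are dropped from the denominator (which is how a 479-question denominator could arise from a 500-question fixture). The function and argument names are illustrative.

```python
def recall_at_k(retrieved: dict[str, list[str]],
                gold: dict[str, set[str]],
                k: int = 5) -> float:
    """Session-level recall@k.

    retrieved: question ID -> ranked list of retrieved session IDs
    gold:      question ID -> set of gold session IDs (empty set = no label)
    """
    scored = [qid for qid, sessions in gold.items() if sessions]  # labeled questions only
    hits = sum(
        1 for qid in scored
        if set(retrieved.get(qid, [])[:k]) & gold[qid]  # any gold session in the top k
    )
    return hits / len(scored)
```

Nothing in this metric looks at the reader's answer, which is why a retriever can score well here and still lose on end-to-end QA.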

6. Cost

Atlaso is cheaper per query and structurally cheaper at ingestion. The query-time number is measured directly from instrumented runs; the ingestion-time number is a definitional consequence of the architecture (Atlaso writes no LLM-extracted summaries).

Axis	Atlaso	mem0	Ratio
$ per query	$0.0096	$0.0115	1.20× cheaper
LLM calls at ingestion	0	1 × gpt-4o-mini per turn	∞

Table 3: Cost on LongMemEval-S. Per-query cost measured on instrumented Wall I runs; mem0 extractor cost is modeled (gpt-4o-mini default), not traced API calls.

Figure 5. Measured $ per query on LongMemEval-S. Atlaso runs at $0.0096/q vs mem0 at $0.0115/q (1.20× cheaper). mem0 extractor cost is modeled from default gpt-4o-mini, not traced API calls.

We deliberately do not publish a $/year savings extrapolation. Early drafts of this work projected dollar savings at 100K daily-active-users; on review the extrapolation multiplied a per-question extractor cost by a per-message rate, which is a unit error. The defensible cost claim is the 1.20× per-query number on the measured fixture, plus the structural ingestion-time difference; any TCO calculation downstream of that depends on assumptions about message volume, retention horizon, and storage tier that vary by deployment.
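
The per-query ratio, and the kind of unit check that would have caught the extrapolation error, can be sketched as follows. UnitCost, Volume, and total_cost are hypothetical helpers, and the 1,000-query volume is an arbitrary illustration, not a figure from the study.

```python
from dataclasses import dataclass

ATLASO_PER_QUERY = 0.0096  # $ per query, Table 3
MEM0_PER_QUERY = 0.0115    # $ per query, Table 3

ratio = MEM0_PER_QUERY / ATLASO_PER_QUERY  # about 1.20


@dataclass(frozen=True)
class UnitCost:
    dollars: float
    per: str  # the unit the cost is quoted in, e.g. "query" or "message"


@dataclass(frozen=True)
class Volume:
    count: float
    unit: str  # the unit being counted


def total_cost(cost: UnitCost, volume: Volume) -> float:
    """Multiply a unit cost by a volume only when the units agree."""
    if cost.per != volume.unit:
        raise ValueError(f"cannot multiply $/{cost.per} by a count of {volume.unit}s")
    return cost.dollars * volume.count


# Same units: fine.
print(total_cost(UnitCost(ATLASO_PER_QUERY, "query"), Volume(1000, "query")))

# Mixed units (a per-question cost times a per-message rate) are rejected.
try:
    total_cost(UnitCost(MEM0_PER_QUERY, "question"), Volume(1000, "message"))
except ValueError as err:
    print("unit error:", err)
```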

7. Where we lose: LoCoMo

LoCoMo is an adversarial conversational-memory benchmark designed with planted distractors and temporal-reasoning traps. It is a different distribution from LongMemEval. We ran the same shared-reader protocol on a 200-question LoCoMo subset. Atlaso loses.

System	Correct	Accuracy
mem0-default	71/200	35.5%
Atlaso	48/200	24.0% (−11.5pp)

Table 4: LoCoMo adversarial subset (n = 200, Haiku 4.5 strict judge). Atlaso loses to mem0-default by 11.5pp.

Figure 6. LoCoMo adversarial subset (n = 200). Atlaso loses to mem0 by 11.5 percentage points on temporal-reasoning and planted-distractor questions. We publish this loss alongside the gains; it is informative about where the substrate currently breaks.

The honest interpretation is that Atlaso's contradiction-aware retrieval is currently tuned for conversational-memory questions where the answer either exists clearly in the field or does not. On LoCoMo's temporal- and distractor-heavy questions, the system over-abstains, or returns the planted distractor with high confidence. We publish this loss because hiding it would make the rest of the study less trustworthy, not more — and because where a system fails is more diagnostic than where it succeeds.

We do not yet have matched-condition LoCoMo numbers for the other systems in the broader landscape, so we do not draw conclusions about the LoCoMo leaderboard. The fixed comparison here is Atlaso vs mem0, same fixture, same judge.

8. Limitations

We have tried to be conservative about what this study establishes. Specifically:

Mem0's full pipeline was not reproduced. The 49.2pp gap between 93.4% and 44.2% conflates judge methodology with pipeline configuration. Until mem0's full managed-platform pipeline is reproduced at or near 93.4% on the same fixture, the gap should be read as "methodology + pipeline," not isolated to either one.
Other memory systems were run only under exploratory protocols. Exploratory runs of LangMem, Letta, Cognee, and Honcho exist in the repository (research/option3/runs/histogram_arm\.jsonl* and adjacent files) under earlier configurations — different readers, different top-k, different judge regimes — and were not re-run under the locked shared-reader, shared-judge protocol used here. They are excluded from the head-to-head table for that reason; matched-condition re-runs are v1.1 work. Cross-system comparisons on this page are restricted to mem0 (run end-to-end on Wall I) and MemPalace (used only as a retriever in the 2×2 swap).
Pre-registration is a version trail, not a clean prior commit. The v1.8 protocol was iteratively converged through pilot runs (v1.1 → v1.8 are all public tags). We do not claim the canonical run was preregistered before any data was collected; we publish the full version trail so the protocol-tuning surface is auditable rather than hidden.
Cost is per-query, not TCO. The 1.20× number is measured on the run; deployment cost depends on assumptions we do not make.
Single subset of LoCoMo. The LoCoMo result is on n = 200, Haiku 4.5 strict, one judge. The other three judges were not run on LoCoMo at this scope.

9. Reproducibility

The entire study is reproducible from the open-source repository. The arm scripts, fixtures, judge prompts (including mem0's verbatim prompt as fetched), run JSONLs with raw per-question scores and contexts, the bootstrap CI script, and the protocol version history are all in the public repo. There is no proprietary platform call in the Atlaso path; mem0 is run via its public OSS configuration with the default extractor.

Code: github.com/atlaso-labs
Fixture: LongMemEval-S (commit SHA pinned in the run config)
Judges: Anthropic Haiku 4.5, OpenAI GPT-5, OpenAI GPT-4o, mem0 verbatim (prompts in research/option3/judge.py)
Per-question outputs: every system answer, retrieved context, judge rationale and gold answer are in the run JSONLs under research/option3/runs/
Protocol version: prereg-option3-v1.8
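
For readers who want an interval around a delta from the per-question outputs, a paired percentile bootstrap is one common choice. The sketch below is not the repository's bootstrap CI script; it assumes two aligned lists of 0/1 correctness values (one per arm, same question order), and the function name and defaults are illustrative.

```python
import random


def paired_bootstrap_ci(a: list[int], b: list[int], n_resamples: int = 2000,
                        alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile CI for the accuracy difference (a minus b), in percentage points.

    Questions are resampled with replacement, and both arms are read at the
    same resampled indices so the pairing between arms is preserved.
    """
    if len(a) != len(b):
        raise ValueError("arms must be scored on the same questions")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(100.0 * sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Resampling question indices rather than each arm separately matters here because both arms answer the same 500 questions, so their errors are correlated.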

The practical takeaway: hold the reader, the fixture, and the judge fixed before you compare memory systems. When you do, the gap between what systems claim and what they deliver becomes visible — and the direction of the answer stops depending on whose judge you used.
