Research Blog · Pilot

A Growing Memory Field for LLM Agents: Pilot Evidence from Delivery and Trading

Ashish Khandelwal · Independent Researcher · April 19, 2026

Lead

A small model, a growing deposit store, and a +11.87pp lift on delivery tasks.

On a longitudinal benchmark of 300 adversarial-building delivery-sim tasks, we ran Qwen3.5-9B on vLLM across 10 seeds under two arms: one with an empty retrieval store, one where each successful task's outcome was written back and made retrievable to the next task's agent. Same model, same prompts, same tasks — the only difference was whether earlier outcomes were visible as deposits.

The empty arm landed at 85.0–85.7%. The growing arm landed at 97.0–98.0%. Mean lift: +11.87pp (95% CI +11.53 to +12.24pp, paired-seed bootstrap; std 0.56pp), 10 of 10 seeds favored the growing store, exact two-sided sign test p = 0.00195. The runner is parity-verified — on seed 1 the vLLM batch path matches a sequential Hugging Face reference (identical 0.9700 final score, 297/300 action match) — so the number is not a decoding artifact.

We have not yet separated persistent field structure from a matched-token successful-trajectory ICL baseline, so we describe this result as a success-gated retrieval lift rather than evidence the system has acquired institutional memory.

This is pilot-grade. Three things we have not done, in order of how much they would change the story:

  1. Matched-token-budget ablation. The growing arm sees more context than the empty arm. Until we run a matched-token, uniformly-sampled successful-trajectory baseline, we cannot rule out that plain in-context demos do the same work.
  2. Frontier-model comparison. We ran 9B vs 9B. We don't yet know if a stronger base model narrows or widens this gap.
  3. A third domain. Delivery sim is the only clean run; the trading-sim pilot is older and n=1.

So: a success-gated retrieval store — a lab notebook for the agent, nothing grander yet — reliably lifts a 9B model on one benchmark under one configuration. Next up: the matched-token ablation. That single experiment decides whether this is memory or just demos.

What the field is

The memory field is a shared, append-only store that every agent reads from and writes to. After each task an agent finishes cleanly, the substrate synthesizes a short structured summary — which building was visited, which approach worked, which didn't — and deposits it. Future agents solving nearby tasks retrieve the most relevant prior deposits before they decide. The field is not a vector of latent numbers; it's human-legible text. You could read any entry and know exactly what the agent learned and when.
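The loop above can be sketched as a tiny append-only store. This is a minimal illustration; the class and field names are ours, not the actual substrate API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Deposit:
    """One human-legible memory entry, written after a clean win."""
    task_id: int
    building: str      # which building was visited
    approach: str      # what worked
    avoid: str         # what didn't
    summary: str       # short structured text, readable as-is

@dataclass
class MemoryField:
    """Shared, append-only store: every agent reads from and writes to it."""
    entries: list[Deposit] = field(default_factory=list)

    def deposit(self, entry: Deposit) -> None:
        self.entries.append(entry)  # append-only: no edits, no deletes

    def retrieve(self, building: str, k: int = 3) -> list[Deposit]:
        # Surface the most recent deposits relevant to the current task.
        relevant = [e for e in self.entries if e.building == building]
        return relevant[-k:]
```

Because entries are plain text rather than latent vectors, any deposit can be read directly to see what the agent learned and when.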

The 10-seed result

Setup

We ran a longitudinal delivery simulation — 300 adversarial navigation tasks per seed, delivered one at a time, with a growing memory field that accumulates across the run. The agent is a fine-tuned 9B model (Qwen3.5-9B base, our V3 fine-tune, bf16 precision via vLLM). We compared two arms on the same 300 tasks: empty (no field) and growing (field + deposits written after each clean win). We repeated this for seeds 1 through 10 — 10 full replications, 6,000 task runs total. Wall-clock: about 2 hours 17 minutes on a Colab G4 Blackwell GPU.


Figure 1: Paired-seed slope chart. Each line connects a single seed's final success rate in the Empty arm (left) to its final success rate in the Growing arm (right). All 10 seeds lift; the exact two-sided sign test over the 10 paired outcomes gives p = 0.00195. Y-axis is 83–99% — not truncated.

Per-seed table

Seed   Empty    Growing   Lift       Deposits at end
1      0.8500   0.9767    +12.67pp   224
2      0.8567   0.9700    +11.33pp   224
3      0.8567   0.9700    +11.33pp   224
4      0.8567   0.9700    +11.33pp   224
5      0.8567   0.9700    +11.33pp   224
6      0.8567   0.9700    +11.33pp   224
7      0.8533   0.9733    +12.00pp   226
8      0.8533   0.9800    +12.67pp   227
9      0.8533   0.9767    +12.34pp   224
10     0.8533   0.9767    +12.34pp   224

Table 1: Per-seed final success rates over the 300 tasks (mean over all tasks, not a rolling-window value). 6 of 10 seeds land at exactly +11.33pp because per-seed rates are quantized to multiples of 1/300, and a 34-task gap (34/300 = 0.1133) is the most common outcome at these near-ceiling rates — not a sign of fabrication.

Aggregates

  • Mean lift: +11.87 percentage points (95% CI +11.53 to +12.24pp, paired-seed bootstrap; std 0.56pp)
  • Range: +11.33pp to +12.67pp (1.34pp spread across 10 seeds)
  • 10 out of 10 seeds positive. Exact two-sided sign test: p = 0.00195
  • Empty-arm ceiling: 85.4% mean (stable 85.00–85.67% across seeds)
  • Growing-arm ceiling: 97.3% mean (stable 97.00–98.00% across seeds)
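These aggregates can be re-derived from Table 1 alone. A minimal check of the sign test and a paired-seed bootstrap, assuming 1000 draws as used for the Figure 2 CI band:

```python
import math
import random

# Per-seed final success rates from Table 1.
empty   = [0.8500, 0.8567, 0.8567, 0.8567, 0.8567,
           0.8567, 0.8533, 0.8533, 0.8533, 0.8533]
growing = [0.9767, 0.9700, 0.9700, 0.9700, 0.9700,
           0.9700, 0.9733, 0.9800, 0.9767, 0.9767]
lifts = [g - e for g, e in zip(growing, empty)]

# Exact two-sided sign test: 10 of 10 seeds positive.
wins = sum(l > 0 for l in lifts)
p = 2 * sum(math.comb(10, k) for k in range(wins, 11)) / 2**10  # 2/1024

# Paired-seed bootstrap CI on the mean lift: resample the 10 seed pairs.
rng = random.Random(0)
boot = sorted(sum(rng.choices(lifts, k=10)) / 10 for _ in range(1000))
ci_lo, ci_hi = boot[25], boot[975]
```

Run as-is this reproduces p = 0.00195 and a mean lift of +11.87pp; the exact CI endpoints vary slightly with the bootstrap seed and draw count.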

Interpretation

Every seed shows the lift. The tightness is unusual for a stochastic LLM-agent workload: the growing-arm success rate falls inside a 1pp band across 10 independent runs. This is the mechanism asserting itself consistently — not a single-seed anomaly that would dissolve on replication. The empty-arm ceiling at 85% tells us the 9B model is already quite good at the core task; the growing field is closing most of the remaining gap to 100%, not creating competence from scratch.

Figure 2: Rolling success rate per arm (median across 10 seeds; top panel) and the Growing − Empty gap with a 95% paired-seed bootstrap CI band (bottom panel). Window = 30 tasks. CI is 95% paired-seed bootstrap on the growing-minus-empty gap (resampled over the 10 seed pairs, 1000 draws). The gap opens within the first 50 tasks as the deposit store fills, and never closes.
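The rolling curves are plain windowed means over each arm's 0/1 outcome sequence. An illustrative sketch of the computation (window = 30 in the figure), not the analysis code itself:

```python
def rolling_success(outcomes: list[int], window: int = 30) -> list[float]:
    """Windowed mean success rate over a 0/1 outcome sequence.

    Early positions use a shorter (partial) window so the curve is
    defined from task 1 onward.
    """
    rates = []
    for i in range(len(outcomes)):
        span = outcomes[max(0, i - window + 1): i + 1]
        rates.append(sum(span) / len(span))
    return rates
```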

The second pilot — trading data

In a separate pilot, we ran the same field mechanism on a real dataset: 500 trades sampled from a 3.5-year, 8,434-trade book. Each trade's context (symbol, direction, session, day, hour, recent streak) was fed to a binary profitable / not-profitable classifier, with the option to seed the classifier with accumulated outcomes from prior similar trades. A GPT-5.4 frontier model alone hit 43.8% accuracy. The same frontier model with an accumulating field of 500 prior outcomes hit 58.4% — a +14.6 percentage-point lift. Our local 9B went from 40.2% to 44.2% under the same treatment (+4pp). This is a single-run pilot, not a benchmark; the result says the mechanism carries signal on real data, not that we've built a trading system.

Portability — the most distinctive finding

A 4-billion-parameter model — smaller and weaker than the 9B — was run on a held-out task set with no field, scoring 26%. We then seeded its field with 38 deposits that the 9B had written during its own deployment on a different set of tasks. The 4B's success rate jumped to 66% — a +40 percentage-point lift from deposits written by a different, larger model. Those 38 deposits beat 200 hand-curated synthetic deposits at matched field size (66% vs 62%). Memory written by one model transferred to another, and agent-curated experience turned out to be denser per entry than enumerated rules. This is the finding we're most curious to see hold up across more model pairs; the one test does not a law make.

What we have NOT done

  • Matched-budget ablation. The growing arm sees more tokens per task than the empty arm (the retrieved deposits). We have not yet run a content-shuffle placebo or a matched curated-ICL control to confirm the lift is coming from the content of the field, not the length of the prompt. This is the single most common and most lethal objection to results like this. It's first in the queue.
  • Direct frontier comparison on identical task instances. Our frontier comparisons so far mix different seeds, different task-set sizes, and different scoring windows. A clean head-to-head on the same 300 tasks per seed, same ordering, same scoring is pending.
  • Third longitudinal domain at full scale. The 10-seed result is one domain (delivery). The trading pilot is a single small run. A third independent long-horizon domain would close the “domain-specific artifact” objection.
  • Ablations on the portability claim. We have one model pair (9B writes → 4B reads), one task set, and one deposit subset. Multiple pairs and randomized deposit subsets would make the portability claim more than a single data point.
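For concreteness, the content-shuffle placebo in the first bullet amounts to permuting deposit bodies while holding retrieval keys and token budget fixed. An illustrative sketch, with assumed dictionary field names, not our pipeline code:

```python
import random

def shuffle_contents(deposits: list[dict], seed: int = 0) -> list[dict]:
    """Content-shuffle placebo control.

    Keeps each deposit's retrieval keys and the total token budget
    intact, but permutes the deposited lessons across entries.
    If the lift survives this shuffle, it was coming from prompt
    length, not deposit content.
    """
    rng = random.Random(seed)
    bodies = [d["summary"] for d in deposits]
    rng.shuffle(bodies)
    return [{**d, "summary": b} for d, b in zip(deposits, bodies)]
```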

Mechanism

The field substrate has four permanent design choices that had to be right for the memory loop to work at all. (1) A strict-4-axis retriever that only surfaces deposits matching the current task's building family, direction, session, and recent-streak context. (2) Success-gated deterministic deposit synthesis — only clean wins (success AND cost-ratio ≤ 1.2) produce a deposit, and the deposit content is deterministically derived from the step history, not free-form. (3) Sequential-within-seed task ordering — task N+1's retrieval happens after task N's deposit is written, not at the same batch step. (4) Cross-seed batched inference for throughput, with per-request seeds to preserve determinism. A parity check against a trusted sequential reference (HF seed 1: identical 0.9700 final score, 297/300 action match) is what made this result trustable.
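Choices (1) and (2) reduce to two small predicates. The sketch below uses assumed field and parameter names, not the production substrate:

```python
def clean_win(success: bool, cost: float, optimal_cost: float) -> bool:
    # Choice (2): success-gated deposits. Only clean wins
    # (success AND cost-ratio <= 1.2) are written back to the field.
    return success and cost <= 1.2 * optimal_cost

def strict_match(deposit: dict, task: dict) -> bool:
    # Choice (1): strict-4-axis retriever. A deposit surfaces only if
    # ALL four context axes match the current task; a miss on any one
    # axis means the deposit stays invisible.
    axes = ("building_family", "direction", "session", "streak")
    return all(deposit[a] == task[a] for a in axes)
```

The strictness is deliberate: a near-miss deposit retrieved for the wrong context is worse than no deposit at all.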

Figure 3: Use vs. presence. Within the Growing arm, tasks are split by whether the agent retrieved at least one deposit (retrieval_count > 0) or none. The lift tracks field use, not field presence: Growing · retrieved sits at or near 100% in every bin, while Growing · no-retrieval tracks the Empty baseline or drops below it when the task happens to miss the retrieval gates. Faint bars show the task count N in each cell. This defeats the “field is just decoration” strawman but does not substitute for the matched-token-ICL ablation.

What's next

  • Matched-budget ablation on seed 1. About 1 hour of Colab time. This is the objection that sinks results like ours most often in peer review; it's the first thing we'll run.
  • Multi-pair portability. Re-run the 9B → 4B transfer with additional source/target model pairs and randomized deposit subsets. Needed before the portability figure becomes a real claim rather than one data point.
  • Third domain. Delivery and trading are two. A third independent long-horizon domain at full scale would make the mechanism claim much harder to dismiss as a simulator artifact.

Summary


Figure 4: Headline summary. Two-bar comparison of Empty vs Growing mean success rates across 10 seeds × 300 tasks, with per-seed rates overlaid as translucent dots. The mean lift is +11.87pp (95% CI [+11.53, +12.24], paired-seed bootstrap). Not a standalone result — read in the context of the mandatory caveat in the lead.

The field is a boring idea dressed up as an exciting one: shared human-readable memory for LLM agents. What's surprising is how cleanly it compounds in our simulations when the retrieval and deposit mechanics are handled carefully, and how portable the memory turns out to be across model sizes. What we have right now is pilot evidence across two domains and one cross-model transfer — enough to make the mechanism worth investigating, not enough to make sweeping claims about frontier equivalence. We're running the ablations next.
