Lead
On Qwen3.5-9B across ten seeds × 300 tasks per seed, a Growing arm with scope-matched, success-gated retrieval lands at 97.33%, against 85.47% for Empty context. That is a +11.87 percentage-point lift, with all 10 seeds positive and an exact sign-test p = 0.00195.
Two ablations, Arm C and Arm D, break that lift into its parts:
- Token count alone (Arm C — matched-token random prior successes) produces ≈ 0pp; it lands at 84.93%, statistically indistinguishable from Empty.
- Scope-matched content with the deposit schema stripped (Arm D — same retrieved trajectories, flattened to plain text under an “examples” key) lands at 96.70%. This is ~94% of the total lift, recovered without the deposit JSON structure or the “field” namespace.
- Deposit schema and namespace add a smaller, reproducible +0.70pp on top (Growing − Arm D, 95% CI [+0.60, +0.80], sign-test p = 0.00195, 10/10 seeds positive). A renderer-invariance falsifier across three byte-different separators (space, newline, pipe) holds the gap inside a 0.11pp band — so it is not a serialization artifact.
The decomposition. Of the 12-point lift, scope-matched retrieval contributes about 94%; the structured deposit format contributes about 6%. The big lever is finding the right prior outcomes for the current task. The small lever is presenting them inside the trained memory-namespace rather than as generic in-context examples.
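The shares quoted here are plain arithmetic on the reported gaps. A minimal sketch, assuming the two paired gaps stated in this post and treating the retrieval share as whatever the format gap does not account for; the variable names are illustrative, not taken from the project code.

```python
# Reported paired gaps from this post, in percentage points.
total_lift_pp = 11.87   # Growing - Empty
format_gap_pp = 0.70    # Growing - Arm D (deposit schema + namespace)

# Assumption: everything not attributed to format is attributed to
# scope-matched retrieval and content. Token count contributes ~0pp (Arm C).
format_share = format_gap_pp / total_lift_pp
retrieval_share = 1.0 - format_share

print(f"format share:    {format_share:.1%}")     # 5.9%
print(f"retrieval share: {retrieval_share:.1%}")  # 94.1%
```

Rounded to whole percentages, this gives the 6% and 94% split.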
What this is not. A test against frontier models, a third independent domain, or a portability-at-scale claim. Each of those is queued separately. This post is one mechanism question, asked carefully, and one answer in two slices.
What the field is
The memory field is a shared, append-only store that every agent reads from and writes to. After each task an agent finishes cleanly, the substrate synthesizes a short structured summary — which building was visited, which approach worked, which didn't — and deposits it. Future agents solving nearby tasks retrieve the most relevant prior deposits before they decide. The field is not a vector of latent numbers; it's human-legible text. You could read any entry and know exactly what the agent learned and when.
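To make the shape of the field concrete, here is a minimal sketch of a shared, append-only store of human-legible deposits. The class names, the `scope` tuple, and the "most recent k exact matches" retrieval rule are all assumptions made for illustration; the actual substrate may rank relevance differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Deposit:
    task_id: int
    scope: Tuple[str, ...]  # hypothetical scope key for "nearby tasks"
    summary: str            # plain text a human can read directly


@dataclass
class MemoryField:
    _entries: List[Deposit] = field(default_factory=list)

    def deposit(self, entry: Deposit) -> None:
        # Append-only: entries are never edited or removed.
        self._entries.append(entry)

    def retrieve(self, scope: Tuple[str, ...], k: int = 3) -> List[Deposit]:
        # Assumption: relevance is approximated by exact scope match,
        # keeping the k most recent matches.
        matches = [d for d in self._entries if d.scope == scope]
        return matches[-k:] if k > 0 else []

    def __len__(self) -> int:
        return len(self._entries)


shared = MemoryField()
shared.deposit(Deposit(1, ("warehouse", "north"),
                       "Visited warehouse; loading-dock route worked, front gate did not."))
shared.deposit(Deposit(2, ("office", "south"),
                       "Visited office; stairwell route worked."))
hits = shared.retrieve(("warehouse", "north"))
print([d.summary for d in hits])
```

Every agent would hold a reference to the same `MemoryField`, so a deposit written after one task is visible to the next retrieval.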
The 10-seed result
Setup
We ran a longitudinal delivery simulation — 300 adversarial navigation tasks per seed, delivered one at a time, with a growing memory field that accumulates across the run. The agent is a fine-tuned 9B model (Qwen3.5-9B base, our V3 fine-tune, bf16 precision via vLLM). We compared two arms on the same 300 tasks: empty (no field) and growing (field + deposits written after each clean win). We repeated this for seeds 1 through 10 — 10 full replications, 6,000 task runs total. Wall-clock: about 2 hours 17 minutes on a Colab G4 Blackwell GPU.
Per-seed table
| Seed | Empty | Growing | Lift | Deposits |
|---|---|---|---|---|
| 1 | 0.8500 | 0.9767 | +12.67pp | 224 |
| 2 | 0.8567 | 0.9700 | +11.33pp | 224 |
| 3 | 0.8567 | 0.9700 | +11.33pp | 224 |
| 4 | 0.8567 | 0.9700 | +11.33pp | 224 |
| 5 | 0.8567 | 0.9700 | +11.33pp | 224 |
| 6 | 0.8567 | 0.9700 | +11.33pp | 224 |
| 7 | 0.8533 | 0.9733 | +12.00pp | 226 |
| 8 | 0.8533 | 0.9800 | +12.67pp | 227 |
| 9 | 0.8533 | 0.9767 | +12.33pp | 224 |
| 10 | 0.8533 | 0.9767 | +12.33pp | 224 |
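Table 1: Per-seed final success rates across 300 tasks. Five of 10 seeds (2 through 6) land at exactly +11.33pp because each arm's rate moves in steps of 1/300 (about 0.33pp), so near-ceiling runs collapse onto a handful of values such as 34/300 = 0.1133. The repetition is a granularity effect, not a sign of fabrication.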
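The headline statistics can be re-derived from Table 1. The sketch below assumes each rate is an exact count out of 300 (for example 0.8500 is 255 wins) and uses a standard two-sided exact sign test; the function name is illustrative.

```python
from math import comb

TASKS = 300
# Wins per seed, recovered from Table 1 as rate * 300.
empty_wins = [255, 257, 257, 257, 257, 257, 256, 256, 256, 256]
growing_wins = [293, 291, 291, 291, 291, 291, 292, 294, 293, 293]

diffs = [g - e for g, e in zip(growing_wins, empty_wins)]
mean_lift_pp = 100 * sum(diffs) / (TASKS * len(diffs))


def sign_test_two_sided(values):
    """Exact two-sided sign test against a fair coin, ignoring zero differences."""
    nonzero = [v for v in values if v != 0]
    n = len(nonzero)
    positives = sum(v > 0 for v in nonzero)
    k = max(positives, n - positives)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_two_sided(diffs)
print(f"mean lift {mean_lift_pp:.2f}pp, sign-test p = {p_value:.5f}")
# mean lift 11.87pp, sign-test p = 0.00195
```

With 10 of 10 seeds positive, the p-value is 2 × (1/2)^10 = 1/512, which is the smallest value a two-sided sign test can return at this sample size.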
Aggregates
- Empty arm: 85.47% mean (stable 85.00–85.67% across 10 seeds)
- Arm C — matched-token random-success ICL: 84.93% mean. Indistinguishable from Empty. Token-count effect alone: ~0pp.
- Arm D — scope-matched, schema stripped: 96.70% mean across 10 seeds. Recovers ~94% of the total lift with no deposit JSON structure and no “field” namespace. Renderer-invariance check across three byte-different separators holds the Growing−Arm-D gap inside a 0.11pp band — so the structured contribution is not a serialization artifact.
- Growing arm: 97.33% mean (stable 97.00–98.00% across 10 seeds)
- Growing − Empty: +11.87pp (95% CI [+11.53, +12.24], paired-seed bootstrap; 10/10 positive; sign-test p = 0.00195)
- Growing − Arm C: +12.40pp (95% CI [+12.13, +12.67]; 10/10 positive; sign-test p = 0.00195)
- Growing − Arm D: +0.70pp (95% CI [+0.60, +0.80]; 10/10 positive; sign-test p = 0.00195). Renderer-invariance: per-renderer mean spread 0.11pp — gap reproduces across serializations.
- Decomposition. Token count: ~0% · Scope-matched retrieval + content: ~94% · Deposit schema/namespace: ~6%.
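For readers who want to check the Growing − Empty interval, here is a sketch of a paired-seed percentile bootstrap over the ten per-seed differences from Table 1. The resample count, the random seed, and the percentile method are assumptions; the exact procedure behind the reported interval is not specified in this post, so the endpoints will only land close to it.

```python
import random
from statistics import mean

# Per-seed Growing - Empty differences in task counts out of 300, derived
# from Table 1 (for example seed 1: 293 - 255 = 38).
diff_counts = [38, 34, 34, 34, 34, 34, 36, 38, 37, 37]
lifts_pp = [100 * d / 300 for d in diff_counts]


def paired_bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-seed differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample seeds with replacement; each draw keeps the pair intact
    # because it is already a Growing - Empty difference.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


ci_low, ci_high = paired_bootstrap_ci(lifts_pp)
print(f"mean lift {mean(lifts_pp):.2f}pp, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling differences rather than the two arms separately is what makes the bootstrap paired: seed-level noise shared by both arms cancels before the interval is computed.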
Interpretation
Every seed shows the lift. The tightness is unusual for a stochastic LLM-agent workload: the growing-arm success rate falls inside a 1pp band across 10 independent runs. This is the mechanism asserting itself consistently — not a single-seed anomaly that would dissolve on replication. The empty-arm ceiling at 85% tells us the 9B model is already quite good at the core task; the growing field is closing most of the remaining gap to 100%, not creating competence from scratch.
Figure: Rolling success rate (median across 10 seeds, window = 30)
Figure: Growing − Empty gap (95% paired-seed bootstrap CI)
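A rolling curve of this kind can be computed as follows. This is a sketch under assumptions: it uses a trailing window that is shorter at the start of a run, and the toy outcome lists stand in for per-task results that are not published in this post.

```python
from statistics import median


def rolling_rate(outcomes, window=30):
    """Trailing-window success rate; the window is shorter at the start of the run."""
    rates = []
    running = 0
    for i, won in enumerate(outcomes):
        running += won
        if i >= window:
            running -= outcomes[i - window]  # drop the task that left the window
        rates.append(running / min(i + 1, window))
    return rates


def median_across_seeds(per_seed_rates):
    """Median of the rolling rate at each task index, taken across seeds."""
    return [median(column) for column in zip(*per_seed_rates)]


# Toy data: three hypothetical seeds of four tasks each (1 = success).
toy_runs = [[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 1]]
curves = [rolling_rate(run, window=2) for run in toy_runs]
print(median_across_seeds(curves))  # [1.0, 0.5, 0.5, 1.0]
```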
The second pilot — trading data
In a separate pilot, we ran the same field mechanism on a real dataset: 500 trades sampled from a 3.5-year, 8,434-trade book. Each trade's context (symbol, direction, session, day, hour, recent streak) was fed to a binary profitable / not-profitable classifier, with the option to seed the classifier with accumulated outcomes from prior similar trades. A GPT-5.4 frontier model alone hit 43.8% accuracy. The same frontier model with an accumulating field of 500 prior outcomes hit 58.4%, a +14.6 percentage-point lift. Our local 9B went from 40.2% to 44.2% under the same treatment (+4.0pp). This is a single-run pilot, not a benchmark; the result says the mechanism carries signal on real data, not that we've built a trading system.
Portability — the most distinctive finding
A 4-billion-parameter model, smaller and weaker than the 9B, was run on a held-out task set with no field, scoring 26%. We then seeded its field with 38 deposits that the 9B had written during its own deployment on a different set of tasks. The 4B's success rate jumped to 66%, a +40 percentage-point lift from deposits written by a different, larger model. Those 38 deposits beat 200 hand-curated synthetic deposits at matched field size (66% vs 62%). Memory written by one model transferred to another, and agent-curated experience turned out to be denser per entry than enumerated rules. This is the finding we're most curious to see hold up across more model pairs; one test does not a law make.
What we have NOT done
- Tested whether the 6% deposit-format effect is itself a trained behavior of this fine-tune. The V3 model was trained on substrate-V04 data where retrieved content sits under a “field” key. The +0.70pp gap we measure could be the model behaving as trained rather than a generic property of structured deposits. A second-fine-tune ablation, or testing a base-model variant, would address this.
- Direct frontier comparison on identical task instances. Our frontier comparisons so far mix different seeds, different task-set sizes, and different scoring windows. A clean head-to-head on the same 300 tasks per seed, same ordering, same scoring is pending.
- Third longitudinal domain at full scale. The 10-seed result is one domain (delivery). The trading pilot is a single small run. A third independent long-horizon domain would close the “domain-specific artifact” objection.
- Ablations on the portability claim. We have one model pair (9B writes → 4B reads), one task set, and one deposit subset. Multiple pairs and randomized deposit subsets would make the portability claim more than a single data point.
Mechanism
The field substrate has four permanent design choices that had to be right for the memory loop to work at all. (1) A strict-4-axis retriever that only surfaces deposits matching the current task's building family, direction, session, and recent-streak context. (2) Success-gated deterministic deposit synthesis: only clean wins (success AND cost-ratio ≤ 1.2) produce a deposit, and the deposit content is deterministically derived from the step history, not free-form. (3) Sequential-within-seed task ordering: task N+1's retrieval happens after task N's deposit is written, not at the same batch step. (4) Cross-seed batched inference for throughput, with per-request seeds to preserve determinism. A bit-identical parity check against a trusted sequential reference (HF seed 1: 0.9700 = 0.9700, 297/300 action match) is what made this result trustable.
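The first three design choices can be sketched in a few lines. Everything below is a hypothetical illustration: the type names, the `toy_act` stand-in for the model call, and the way the deposit text is built are assumptions, and the batched-inference choice (4) is not modelled.

```python
from dataclasses import dataclass
from typing import List, NamedTuple, Optional, Tuple

MAX_COST_RATIO = 1.2  # clean-win threshold stated in the post


class Scope(NamedTuple):
    building_family: str
    direction: str
    session: str
    recent_streak: str


@dataclass(frozen=True)
class Outcome:
    success: bool
    cost_ratio: float
    steps: Tuple[str, ...]


@dataclass(frozen=True)
class Deposit:
    scope: Scope
    text: str


def retrieve(field: List[Deposit], scope: Scope) -> List[Deposit]:
    # (1) Strict match: all four axes must be equal.
    return [d for d in field if d.scope == scope]


def synthesize_deposit(scope: Scope, outcome: Outcome) -> Optional[Deposit]:
    # (2) Success gate, then text derived deterministically from the step history.
    if not (outcome.success and outcome.cost_ratio <= MAX_COST_RATIO):
        return None
    return Deposit(scope, " -> ".join(outcome.steps))


def run_seed(tasks, act):
    # (3) Sequential ordering: a task's retrieval sees every earlier deposit.
    field: List[Deposit] = []
    context_sizes = []
    for scope in tasks:
        context = retrieve(field, scope)
        context_sizes.append(len(context))
        deposit = synthesize_deposit(scope, act(scope, context))
        if deposit is not None:
            field.append(deposit)
    return context_sizes, field


def toy_act(scope, context):
    # Stand-in for the model call: "office" tasks succeed but at too high a cost.
    cost = 1.5 if scope.building_family == "office" else 1.1
    return Outcome(True, cost, ("enter", "deliver"))


a = Scope("warehouse", "north", "am", "win")
b = Scope("office", "north", "am", "win")
context_sizes, final_field = run_seed([a, a, b, b], toy_act)
print(context_sizes, len(final_field))  # [0, 1, 0, 0] 2
```

In the toy run, the second warehouse task retrieves the deposit written by the first, while the office tasks never produce a deposit because their cost ratio exceeds the gate.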
What's next
- Frontier head-to-head on identical task instances. Same 300 tasks, same ordering, same scoring. Until we run this, we cannot compare 9B+field to a frontier model in a way that survives review.
- Full factorial attribution sweep. Content-identity × position × scope, ~30 hours on a G4. The gate to the strongest mechanism claims. Useful for the preprint, not the blog.
- Multi-pair portability. Re-run the 9B → 4B transfer with additional source/target model pairs and randomized deposit subsets. Needed before the portability figure becomes a real claim rather than one data point.
- Third domain. Delivery and trading are two. A third independent long-horizon domain at full scale would make the mechanism claim much harder to dismiss as a simulator artifact.
Summary
The 12-point lift on Qwen3.5-9B is real, replicates across ten seeds, and decomposes cleanly: about 94% comes from scope-matched retrieval finding the right prior outcomes for the current task, and about 6% comes from presenting them in the trained deposit structure rather than as generic in-context examples. The 6% slice is small but reproducible — it survives three byte-different serializations of the same retrieved content. The big lever is retrieval. The small lever is format. We have not run the frontier head-to-head, the third domain, or the full attribution factorial — those are queued. This is one honest mechanism question, asked in two slices.
Previous paper
When to start communicating
Links
- Code: imashishkh21/atlaso
- Contact: Ashish Khandelwal on LinkedIn