Research Blog · Pilot

A Growing Memory Field for LLM Agents: Pilot Evidence from Delivery and Trading

Ashish Khandelwal · Independent Researcher · April 19, 2026

Lead

A small model, a growing deposit store, and a +11.87pp lift on delivery tasks.

On a longitudinal benchmark of 300 adversarial-building delivery-sim tasks, we ran Qwen3.5-9B on vLLM across 10 seeds under two arms: one with an empty retrieval store, one where each successful task's outcome was written back and made retrievable to the next task's agent. Same model, same prompts, same tasks — the only difference was whether earlier outcomes were visible as deposits.

The empty arm landed at 85.0–85.7%. The growing arm landed at 97.0–98.0%. Mean lift: +11.87pp (95% CI +11.53 to +12.24pp, paired-seed bootstrap; std 0.56pp), 10 of 10 seeds favored the growing store, exact two-sided sign test p = 0.00195. The runner is parity-verified — on seed 1 the vLLM batch path matches a sequential Hugging Face reference (identical 0.9700 final score, 297/300 action match) — so the number is not a decoding artifact.

We have not yet separated persistent field structure from a matched-token successful-trajectory ICL baseline, so we describe this result as a success-gated retrieval lift rather than evidence the system has acquired institutional memory.

This is pilot-grade. Three things we have not done, in order of how much they would change the story:

  1. Matched-token-budget ablation. The growing arm sees more context than the empty arm. Until we run a matched-token, uniformly-sampled successful-trajectory baseline, we cannot rule out that plain in-context demos do the same work.
  2. Frontier-model comparison. We ran 9B vs 9B. We don't yet know if a stronger base model narrows or widens this gap.
  3. A third domain. Delivery sim is the only clean run; the trading-sim pilot is older and n=1.

So: a success-gated retrieval store — a lab notebook for the agent, nothing grander yet — reliably lifts a 9B model on one benchmark under one configuration. Next up: the matched-token ablation. That single experiment decides whether this is memory or just demos.

What the field is

The memory field is a shared, append-only store that every agent reads from and writes to. After each task an agent finishes cleanly, the substrate synthesizes a short structured summary — which building was visited, which approach worked, which didn't — and deposits it. Future agents solving nearby tasks retrieve the most relevant prior deposits before they decide. The field is not a vector of latent numbers; it's human-legible text. You could read any entry and know exactly what the agent learned and when.
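The loop above can be sketched as a tiny append-only store. This is a minimal illustration; the class and field names are ours, not the actual substrate API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Deposit:
    """One human-legible memory entry, written after a clean win."""
    task_id: int
    building: str      # which building was visited
    approach: str      # what worked
    avoid: str         # what didn't
    summary: str       # short structured text, readable as-is

@dataclass
class MemoryField:
    """Shared, append-only store: every agent reads from and writes to it."""
    entries: list[Deposit] = field(default_factory=list)

    def deposit(self, entry: Deposit) -> None:
        self.entries.append(entry)  # append-only: no edits, no deletes

    def retrieve(self, building: str, k: int = 3) -> list[Deposit]:
        # Surface the most recent deposits relevant to the current task.
        relevant = [e for e in self.entries if e.building == building]
        return relevant[-k:]
```

Because entries are plain text rather than latent vectors, any deposit can be read directly to see what the agent learned and when.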

The 10-seed result

Setup

We ran a longitudinal delivery simulation — 300 adversarial navigation tasks per seed, delivered one at a time, with a growing memory field that accumulates across the run. The agent is a fine-tuned 9B model (Qwen3.5-9B base, our V3 fine-tune, bf16 precision via vLLM). We compared two arms on the same 300 tasks: empty (no field) and growing (field + deposits written after each clean win). We repeated this for seeds 1 through 10 — 10 full replications, 6,000 task runs total. Wall-clock: about 2 hours 17 minutes on a Colab G4 Blackwell GPU.


Figure 1: Paired-seed slope chart. Each line connects a single seed's final success rate in the Empty arm (left) to its final success rate in the Growing arm (right). All 10 seeds lift; the exact two-sided sign test over the 10 paired outcomes gives p = 0.00195. Y-axis is 83–99% — not truncated.

Per-seed table

Seed   Empty    Growing   Lift       Deposits at end
1      0.8500   0.9767    +12.67pp   224
2      0.8567   0.9700    +11.33pp   224
3      0.8567   0.9700    +11.33pp   224
4      0.8567   0.9700    +11.33pp   224
5      0.8567   0.9700    +11.33pp   224
6      0.8567   0.9700    +11.33pp   224
7      0.8533   0.9733    +12.00pp   226
8      0.8533   0.9800    +12.67pp   227
9      0.8533   0.9767    +12.34pp   224
10     0.8533   0.9767    +12.34pp   224

Table 1: Per-seed final success rates over the 300 tasks (mean over all tasks, not a rolling-window value). 6 of 10 seeds land at exactly +11.33pp because per-seed rates are quantized to multiples of 1/300, and a 34-task gap (34/300 = 0.1133) is the most common outcome at these near-ceiling rates — not a sign of fabrication.

Aggregates

  • Mean lift: +11.87 percentage points (95% CI +11.53 to +12.24pp, paired-seed bootstrap; std 0.56pp)
  • Range: +11.33pp to +12.67pp (1.34pp spread across 10 seeds)
  • 10 out of 10 seeds positive. Exact two-sided sign test: p = 0.00195
  • Empty-arm ceiling: 85.4% mean (stable 85.00–85.67% across seeds)
  • Growing-arm ceiling: 97.3% mean (stable 97.00–98.00% across seeds)
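These aggregates can be re-derived from Table 1 alone. A minimal check of the sign test and a paired-seed bootstrap, assuming 1000 draws as used for the Figure 2 CI band:

```python
import math
import random

# Per-seed final success rates from Table 1.
empty   = [0.8500, 0.8567, 0.8567, 0.8567, 0.8567,
           0.8567, 0.8533, 0.8533, 0.8533, 0.8533]
growing = [0.9767, 0.9700, 0.9700, 0.9700, 0.9700,
           0.9700, 0.9733, 0.9800, 0.9767, 0.9767]
lifts = [g - e for g, e in zip(growing, empty)]

# Exact two-sided sign test: 10 of 10 seeds positive.
wins = sum(l > 0 for l in lifts)
p = 2 * sum(math.comb(10, k) for k in range(wins, 11)) / 2**10  # 2/1024

# Paired-seed bootstrap CI on the mean lift: resample the 10 seed pairs.
rng = random.Random(0)
boot = sorted(sum(rng.choices(lifts, k=10)) / 10 for _ in range(1000))
ci_lo, ci_hi = boot[25], boot[975]
```

Run as-is this reproduces p = 0.00195 and a mean lift of +11.87pp; the exact CI endpoints vary slightly with the bootstrap seed and draw count.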

Interpretation

Every seed shows the lift. The tightness is unusual for a stochastic LLM-agent workload: the growing-arm success rate falls inside a 1pp band across 10 independent runs. This is the mechanism asserting itself consistently — not a single-seed anomaly that would dissolve on replication. The empty-arm ceiling at 85% tells us the 9B model is already quite good at the core task; the growing field is closing most of the remaining gap to 100%, not creating competence from scratch.

Figure 2: Rolling success rate per arm (median across 10 seeds; top panel) and the Growing − Empty gap with a 95% paired-seed bootstrap CI band (bottom panel). Window = 30 tasks. CI is 95% paired-seed bootstrap on the growing-minus-empty gap (resampled over the 10 seed pairs, 1000 draws). The gap opens within the first 50 tasks as the deposit store fills, and never closes.
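The rolling curves are plain windowed means over each arm's 0/1 outcome sequence. An illustrative sketch of the computation (window = 30 in the figure), not the analysis code itself:

```python
def rolling_success(outcomes: list[int], window: int = 30) -> list[float]:
    """Windowed mean success rate over a 0/1 outcome sequence.

    Early positions use a shorter (partial) window so the curve is
    defined from task 1 onward.
    """
    rates = []
    for i in range(len(outcomes)):
        span = outcomes[max(0, i - window + 1): i + 1]
        rates.append(sum(span) / len(span))
    return rates
```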

The second pilot — trading data

In a separate pilot, we ran the same field mechanism on a real dataset: 500 trades sampled from a 3.5-year, 8,434-trade book. Each trade's context (symbol, direction, session, day, hour, recent streak) was fed to a binary profitable / not-profitable classifier, with the option to seed the classifier with accumulated outcomes from prior similar trades. A GPT-5.4 frontier model alone hit 43.8% accuracy. The same frontier model with an accumulating field of 500 prior outcomes hit 58.4% — a +14.6 percentage-point lift. Our local 9B went from 40.2% to 44.2% under the same treatment (+4pp). This is a single-run pilot, not a benchmark; the result says the mechanism carries signal on real data, not that we've built a trading system.

Portability — the most distinctive finding

A 4-billion-parameter model — smaller and weaker than the 9B — was run on a held-out task set with no field, scoring 26%. We then seeded its field with 38 deposits that the 9B had written during its own deployment on a different set of tasks. The 4B's success rate jumped to 66% — a +40 percentage-point lift from deposits written by a different, larger model. Those 38 deposits beat 200 hand-curated synthetic deposits at matched field size (66% vs 62%). Memory written by one model transferred to another, and agent-curated experience turned out to be denser per entry than enumerated rules. This is the finding we're most curious to see hold up across more model pairs; the one test does not a law make.

What we have NOT done

  • Matched-budget ablation. The growing arm sees more tokens per task than the empty arm (the retrieved deposits). We have not yet run a content-shuffle placebo or a matched curated-ICL control to confirm the lift is coming from the content of the field, not the length of the prompt. This is the single most common and most lethal objection to results like this. It's first in the queue.
  • Direct frontier comparison on identical task instances. Our frontier comparisons so far mix different seeds, different task-set sizes, and different scoring windows. A clean head-to-head on the same 300 tasks per seed, same ordering, same scoring is pending.
  • Third longitudinal domain at full scale. The 10-seed result is one domain (delivery). The trading pilot is a single small run. A third independent long-horizon domain would close the “domain-specific artifact” objection.
  • Ablations on the portability claim. We have one model pair (9B writes → 4B reads), one task set, and one deposit subset. Multiple pairs and randomized deposit subsets would make the portability claim more than a single data point.
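For concreteness, the content-shuffle placebo in the first bullet amounts to permuting deposit bodies while holding retrieval keys and token budget fixed. An illustrative sketch, with assumed dictionary field names, not our pipeline code:

```python
import random

def shuffle_contents(deposits: list[dict], seed: int = 0) -> list[dict]:
    """Content-shuffle placebo control.

    Keeps each deposit's retrieval keys and the total token budget
    intact, but permutes the deposited lessons across entries.
    If the lift survives this shuffle, it was coming from prompt
    length, not deposit content.
    """
    rng = random.Random(seed)
    bodies = [d["summary"] for d in deposits]
    rng.shuffle(bodies)
    return [{**d, "summary": b} for d, b in zip(deposits, bodies)]
```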

Mechanism

The field substrate has four permanent design choices that had to be right for the memory loop to work at all. (1) A strict-4-axis retriever that only surfaces deposits matching the current task's building family, direction, session, and recent-streak context. (2) Success-gated deterministic deposit synthesis — only clean wins (success AND cost-ratio ≤ 1.2) produce a deposit, and the deposit content is deterministically derived from the step history, not free-form. (3) Sequential-within-seed task ordering — task N+1's retrieval happens after task N's deposit is written, not at the same batch step. (4) Cross-seed batched inference for throughput, with per-request seeds to preserve determinism. A parity check against a trusted sequential reference (HF seed 1: identical 0.9700 final score, 297/300 action match) is what made this result trustable.
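Choices (1) and (2) reduce to two small predicates. The sketch below uses assumed field and parameter names, not the production substrate:

```python
def clean_win(success: bool, cost: float, optimal_cost: float) -> bool:
    # Choice (2): success-gated deposits. Only clean wins
    # (success AND cost-ratio <= 1.2) are written back to the field.
    return success and cost <= 1.2 * optimal_cost

def strict_match(deposit: dict, task: dict) -> bool:
    # Choice (1): strict-4-axis retriever. A deposit surfaces only if
    # ALL four context axes match the current task; a miss on any one
    # axis means the deposit stays invisible.
    axes = ("building_family", "direction", "session", "streak")
    return all(deposit[a] == task[a] for a in axes)
```

The strictness is deliberate: a near-miss deposit retrieved for the wrong context is worse than no deposit at all.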

Figure 3: Use vs. presence. Within the Growing arm, tasks are split by whether the agent retrieved at least one deposit (retrieval_count > 0) or none. The lift tracks field use, not field presence: Growing · retrieved sits at or near 100% in every bin, while Growing · no-retrieval tracks the Empty baseline or drops below it when the task happens to miss the retrieval gates. Faint bars show the task count N in each cell. This defeats the “field is just decoration” strawman but does not substitute for the matched-token-ICL ablation.

What's next

  • Matched-budget ablation on seed 1. About 1 hour of Colab time. This is the objection that sinks results like ours most often in peer review; it's the first thing we'll run.
  • Multi-pair portability. Re-run the 9B → 4B transfer with additional source/target model pairs and randomized deposit subsets. Needed before the portability figure becomes a real claim rather than one data point.
  • Third domain. Delivery and trading are two. A third independent long-horizon domain at full scale would make the mechanism claim much harder to dismiss as a simulator artifact.

Summary


Figure 4: Headline summary. Two-bar comparison of Empty vs Growing mean success rates across 10 seeds × 300 tasks, with per-seed rates overlaid as translucent dots. The mean lift is +11.87pp (95% CI [+11.53, +12.24], paired-seed bootstrap). Not a standalone result — read in the context of the mandatory caveat in the lead.

The field is a boring idea dressed up as an exciting one: shared human-readable memory for LLM agents. What's surprising is how cleanly it compounds in our simulations when the retrieval and deposit mechanics are handled carefully, and how portable the memory turns out to be across model sizes. What we have right now is pilot evidence across two domains and one cross-model transfer — enough to make the mechanism worth investigating, not enough to make sweeping claims about frontier equivalence. We're running the ablations next.
