Architecture

Retrieval pipeline

Query → filter → rank → aggregate. The full read-time pipeline, including the constants you can tune.

7 min read · Updated May 18, 2026

The four phases

Query — FTS5 MATCH per polarity over deposits_fts (contentless virtual table).
Filter — polarity is applied in the SQL WHERE; optional scope-facet filter applied client side.
Rank — BM25 score weighted by an exponential freshness decay.
Aggregate — group by 6-facet scope bag, compute bag_precision / has_conflict / is_confident, sort conflict bags first, flatten.
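To make the Query and Filter phases concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a SQLite build with FTS5 enabled. The deposits table, its columns, and the query_polarity helper are hypothetical stand-ins for illustration; only the deposits_fts name and its contentless setup come from this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base table; the real schema is not shown on this page.
    CREATE TABLE deposits (
        id       INTEGER PRIMARY KEY,
        polarity TEXT NOT NULL,
        body     TEXT NOT NULL
    );
    -- content='' makes the FTS5 table contentless: it keeps the index only.
    CREATE VIRTUAL TABLE deposits_fts USING fts5(body, content='');
""")

rows = [
    (1, "positive", "cold brew coffee worked well"),
    (2, "negative", "instant coffee was rejected"),
    (3, "positive", "green tea worked well"),
]
for rowid, polarity, body in rows:
    conn.execute("INSERT INTO deposits VALUES (?, ?, ?)", (rowid, polarity, body))
    # A contentless table needs the rowid supplied explicitly.
    conn.execute("INSERT INTO deposits_fts(rowid, body) VALUES (?, ?)", (rowid, body))


def query_polarity(conn, match_expr, polarity, limit):
    """One MATCH per polarity, with the polarity filter in the SQL WHERE."""
    sql = """
        SELECT d.id, d.polarity, bm25(deposits_fts) AS score
        FROM deposits_fts
        JOIN deposits AS d ON d.id = deposits_fts.rowid
        WHERE deposits_fts MATCH ? AND d.polarity = ?
        ORDER BY score   -- FTS5 bm25() is lower-is-better
        LIMIT ?
    """
    return conn.execute(sql, (match_expr, polarity, limit)).fetchall()


positive_hits = query_polarity(conn, "coffee", "positive", 50)
negative_hits = query_polarity(conn, "coffee", "negative", 50)
```

Because the FTS table is contentless, everything other than the rowid and the score has to come from the joined base table.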

BM25 + asymmetric freshness decay

python

# from _engine/retrieval.py
_HALF_LIVES = {
    "positive":    14.0,   # days
    "open":        14.0,
    "negative":    90.0,   # ~three months
    "cautionary":  90.0,
}

# rank score = bm25 * exp(-age_days * ln(2) / half_life)

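The asymmetry is deliberate. A known failure is a constraint that should stay fresh for months, so negative and cautionary deposits decay with a 90-day half-life. A positive observation needs to be re-evidenced more often before the dispersion pipeline takes it seriously, so positive deposits (and open ones) decay with a 14-day half-life.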
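A small worked example shows what the two half-lives mean in practice. This is a sketch of the formula above, not the engine's code: freshness_weight and rank_score are hypothetical names, and bm25 is assumed to be a relevance score where larger means more relevant (the engine's actual sign convention is not shown here).

```python
import math

# Half-lives in days, copied from the table above.
HALF_LIVES = {
    "positive": 14.0,
    "open": 14.0,
    "negative": 90.0,
    "cautionary": 90.0,
}


def freshness_weight(polarity, age_days):
    """Exponential decay: the weight halves every half_life days."""
    return math.exp(-age_days * math.log(2) / HALF_LIVES[polarity])


def rank_score(bm25, polarity, age_days):
    """BM25 relevance scaled by the freshness weight."""
    return bm25 * freshness_weight(polarity, age_days)


# Two deposits of the same age, two weeks old:
w_positive = freshness_weight("positive", 14.0)  # exactly one half-life
w_negative = freshness_weight("negative", 14.0)  # 14/90 of a half-life
```

After two weeks a positive deposit keeps half its weight, while a negative deposit of the same age keeps roughly 90%. After 90 days the negative deposit is at one half and the positive one is at about 1%.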

k distribution

recall(limit=N) fans out across the four polarities with a 30/30/20/20 split and a minimum of 1 each:

python

# from _engine/aware.py
k_positive   = max(1, int(limit * 0.30))
k_negative   = max(1, int(limit * 0.30))
k_cautionary = max(1, int(limit * 0.20))
k_open       = max(1, int(limit * 0.20))

# candidate over-fetch:
candidates = max(50, k * 10)
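Because int() truncates and each polarity is floored at 1, the four quotas do not always add up to limit. The sketch below wraps the four lines above in a hypothetical helper, split_k, to make that visible.

```python
def split_k(limit):
    """Per-polarity quotas using the 30/30/20/20 split with a floor of 1."""
    return {
        "positive": max(1, int(limit * 0.30)),
        "negative": max(1, int(limit * 0.30)),
        "cautionary": max(1, int(limit * 0.20)),
        "open": max(1, int(limit * 0.20)),
    }


totals = {limit: sum(split_k(limit).values()) for limit in (1, 5, 10, 20)}
```

With limit=10 the split is 3/3/2/2 and sums to 10. With limit=5 truncation gives 1/1/1/1, so the total is 4. With limit=1 the floor of 1 per polarity also gives a total of 4, which is more than the requested limit.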

Scope-bag aggregation

Every candidate is grouped by its scope bag. Engine field names (left) map to public field names (right) on SearchResult:

bag_size — total deposits in the bag.
dominant_polarity — the directional polarity with the most deposits.
bag_precision → public agreement_score — fraction of directional deposits matching the dominant polarity.
is_single_sample → public is_thin_evidence — bag has exactly one directional deposit.
has_conflict → public has_disagreement — bag contains two or more distinct directional polarities (any pair drawn from {positive, negative, cautionary}).
conflict_peers — ids of the opposing-polarity hits that flipped the bag.
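The definitions above can be sketched for a single bag as follows. This is an illustration under assumptions, not the engine's implementation: summarize_bag is a hypothetical helper, a bag with no directional deposits is given a precision of 0.0, and ties for the dominant polarity go to whichever polarity appears first.

```python
from collections import Counter

# "open" is the only polarity that is not directional.
DIRECTIONAL = {"positive", "negative", "cautionary"}


def summarize_bag(polarities):
    """Compute the engine-side fields for one scope bag.

    polarities: the polarity of each deposit in the bag.
    """
    directional = [p for p in polarities if p in DIRECTIONAL]
    counts = Counter(directional)
    if counts:
        dominant, dominant_count = counts.most_common(1)[0]
        precision = dominant_count / len(directional)
    else:
        # Assumption: no directional evidence means no dominant polarity.
        dominant, precision = None, 0.0
    return {
        "bag_size": len(polarities),               # open deposits count too
        "dominant_polarity": dominant,
        "bag_precision": precision,                # public: agreement_score
        "is_single_sample": len(directional) == 1, # public: is_thin_evidence
        "has_conflict": len(counts) >= 2,          # public: has_disagreement
    }


bag = summarize_bag(["positive", "positive", "negative", "open"])
```

Here the bag holds four deposits, three of them directional. Positive dominates with two of the three, so bag_precision is 2/3 and has_conflict is true.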

The is_confident rule (again)

python

is_confident = (
    bag_precision >= 0.99
    and not is_single_sample
    and not has_conflict
)
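A few cases show why all three clauses are needed. The function below restates the rule as a standalone helper; the bags and their labels are made up for illustration.

```python
def is_confident(bag_precision, is_single_sample, has_conflict):
    """Restates the rule above as a plain function."""
    return bag_precision >= 0.99 and not is_single_sample and not has_conflict


cases = {
    # Three positives, nothing else: full agreement, enough samples.
    "three agreeing": is_confident(1.0, False, False),
    # One lone positive: precision is 1.0 but the evidence is thin.
    "single sample": is_confident(1.0, True, False),
    # Two positives and one negative: precision 2/3 and a conflict.
    "mixed bag": is_confident(2 / 3, False, True),
    # 99 positives and one negative: precision passes the 0.99 threshold,
    # so only the has_conflict clause stops this bag from being confident.
    "large bag, one dissent": is_confident(99 / 100, False, True),
}
```

Only the first case is confident. The last one shows that the precision threshold alone would let a large bag with a single dissenting deposit through.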

FTS5 query sanitisation

FTS5 treats a broad set of punctuation as query syntax. The SDK strips these characters at the boundary and OR-joins the remaining tokens. The full set: " ( ) : * ? , ; . @ ! # $ % ^ & + = [ ] { } | \ < > ~ `.

python

# Caller passes "what kind of coffee?"
# Becomes: "what OR kind OR of OR coffee"
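A minimal sketch of that sanitisation step, using the character set listed above. The sanitize_query name is hypothetical, and replacing each stripped character with a space (rather than deleting it outright) is an assumption; the SDK's exact behaviour is not shown on this page.

```python
# The characters listed above, as one string (the backslash is escaped).
STRIP_CHARS = '"():*?,;.@!#$%^&+=[]{}|\\<>~`'

# Assumption: each stripped character becomes a space, so "a.b" splits
# into two tokens instead of collapsing into "ab".
_TO_SPACE = str.maketrans({ch: " " for ch in STRIP_CHARS})


def sanitize_query(raw):
    """Strip FTS5 syntax characters and OR-join the remaining tokens."""
    tokens = raw.translate(_TO_SPACE).split()
    return " OR ".join(tokens)


query = sanitize_query("what kind of coffee?")
```

For the input in the comment above, this returns what OR kind OR of OR coffee. An input made only of stripped characters returns an empty string, which the caller would need to handle before passing it to MATCH.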

The query log

Each recall() writes one row to query_log summarising the top hit's flags, the bag size, and the result count. This table powers Coverage in the Field Maturity Index; without it, Coverage is structurally vacuous.

