# Architecture

<!-- Canonical: https://www.atlaso.ai/docs/architecture -->

How atlaso's moving parts fit together.

## One-line mental model

Append-only SQLite ledger of *typed* claims (polarity + evidence grade + scope). A **gate** rejects under-evidenced writes. A **dispersion-aware pipeline** groups hits by scope at read time and annotates each result with whether its neighbors agree, disagree, or conflict.
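
A minimal sketch of that model in use. The `Memory` construction is illustrative (constructor arguments aren't covered on this page); the call shapes follow the sequence diagrams below:

```python
from atlaso import DepositRejectedError, Memory

mem = Memory()  # illustrative default construction

# Write a typed claim: polarity + evidence grade (+ optional scope).
# The gate may reject under-evidenced writes.
try:
    mem.add(
        "batch_size=64 OOMs on a 24GB GPU",
        user_id="alice",
        polarity="negative",
        evidence_grade="observed",
    )
except DepositRejectedError as e:
    print(e.gate_reason)

# Read back: hits are grouped by scope and annotated with agreement flags.
results = mem.recall("batch size OOM", user_id="alice", limit=10)
print(results.is_confident, results.has_disagreement)
```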

## Module map

```mermaid
graph TD
    App[App / Caller]
    MCP[atlaso.mcp.server<br/>FastMCP]
    Admin[atlaso.admin<br/>cross-tenant ops]
    Mem[Memory / AsyncMemory<br/>+ UserHandle]
    Val[_validation<br/>user_id regex]
    Path[_path_resolver<br/>.atlaso/ discovery]
    Local[LocalBackend<br/>per-user FieldStore cache]
    Gate[_engine.gate<br/>write-time policy]
    Aware[_engine.aware<br/>bag stats + conflict]
    Retr[_engine.retrieval<br/>BM25 + decay]
    Maturity[_engine.maturity<br/>FMI geometric mean]
    Store[_engine.store<br/>FieldStore]
    DB[(SQLite + FTS5<br/>field.db)]

    App --> Mem
    MCP --> Mem
    Admin --> Local
    Mem --> Val
    Mem --> Local
    Local --> Path
    Local --> Gate
    Local --> Aware
    Local --> Maturity
    Aware --> Retr
    Aware --> Store
    Maturity --> Aware
    Maturity --> Store
    Retr --> Store
    Store --> DB
```

## What happens on `add()`

1. Validate `user_id` against the strict regex (`_validation.py`).
2. Lazily resolve storage path (constructor / `ATLASO_PATH` / ancestor walk / project marker / cwd).
3. Open or fetch the cached per-user `FieldStore` at `<base>/users/<sha256(user_id)[:16]>/field.db`.
4. Run the gate. If rejected, raise `DepositRejectedError(gate_reason=...)`.
5. Insert into `deposits` + mirror into the contentless `deposits_fts` FTS5 virtual table.
6. Write one `contradictions` edge per id in `contradicts=[...]`.
7. Re-read the row, translate to a public `Deposit`, return `AddResult`.

```mermaid
sequenceDiagram
    autonumber
    participant App as Caller
    participant API as Memory.add
    participant V as _validation
    participant BE as LocalBackend
    participant G as _engine.gate
    participant S as FieldStore

    App->>API: add(text, user_id, polarity, evidence_grade, scope)
    API->>V: check_user_id_shape(user_id)
    API->>BE: add(...)
    BE->>BE: resolve path, open per-user FieldStore
    BE->>G: evaluate_deposit_request(...)
    G-->>BE: GateDecision(accept | reject + reason)
    alt accepted
        BE->>S: INSERT deposits + deposits_fts + contradictions
        S-->>BE: deposit_id
        BE-->>API: AddResult(deposit)
    else rejected
        BE-->>API: raise DepositRejectedError(gate_reason)
    end
```
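
Step 3's per-user path is a pure function of the `user_id`. A minimal sketch of the hashing scheme (UTF-8 encoding of the id is assumed; the base directory is illustrative):

```python
import hashlib
from pathlib import Path

def per_user_db_path(base: Path, user_id: str) -> Path:
    # First 16 hex characters of SHA-256(user_id) name the per-user directory.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]
    return base / "users" / digest / "field.db"

print(per_user_db_path(Path(".atlaso"), "alice"))
# .atlaso/users/<16 hex chars>/field.db
```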

## What happens on `recall()`

1. Sanitise the FTS5 query (strip ``" ( ) : * ? , ; . @ ! # $ % ^ & + = [ ] { } | \ < > ~ ` `` and OR-join remaining tokens).
2. `_engine.aware.query_aware` fans out per polarity with freshness decay.
3. Group every candidate into `BagStats` keyed by the six scope facets.
4. Compute `bag_precision` / `is_single_sample` / `has_conflict` / `conflict_peers` per bag.
5. Write one summary row to `query_log` — powers FMI Coverage.
6. Aggregate `has_disagreement` / `is_confident` across *all* bags *before* slicing.
7. Sort conflict bags first, flatten into per-deposit `SearchResult` rows.

```mermaid
sequenceDiagram
    autonumber
    participant App as Caller
    participant API as Memory.recall
    participant BE as LocalBackend
    participant A as _engine.aware
    participant S as FieldStore

    App->>API: recall(query, user_id, limit, scope)
    API->>BE: recall(...)
    BE->>BE: sanitise FTS5 query (strip syntax chars, OR-join)
    BE->>A: query_aware(store, query, scope_filter, limit)
    A->>S: list_recent(1000) and BM25 per polarity
    S-->>A: candidate deposits
    A->>A: group by scope_bag, compute BagStats
    A->>S: log_query(bag_key, is_confident, has_conflict, ...)
    A-->>BE: { results: [bags], field_health }
    BE->>BE: aggregate flags across ALL bags, sort conflicts first, flatten
    BE-->>API: SearchResults(items, has_disagreement, is_confident)
```
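
Step 1 is worth spelling out. A minimal sketch of the sanitiser, assuming the character list above is exhaustive (the helper name is illustrative):

```python
_FTS5_SYNTAX = set('"():*?,;.@!#$%^&+=[]{}|\\<>~`')

def sanitise_fts5_query(raw: str) -> str:
    # Replace FTS5 syntax characters with spaces, then OR-join the surviving
    # tokens so free text can never be parsed as FTS5 operators.
    cleaned = "".join(" " if ch in _FTS5_SYNTAX else ch for ch in raw)
    return " OR ".join(cleaned.split())

print(sanitise_fts5_query('threshold=0.7 "always" optimal?'))
# threshold OR 0 OR 7 OR always OR optimal
```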

---

# The gate

A write-time policy that decides whether a deposit may enter the store, based on `polarity × scope_breadth × evidence_grade`.

> The bar is `expected_harm × scope_breadth`, not polarity. A broad positive claim ("X always works") needs replicated evidence; a narrow negative claim needs provenance so the field doesn't fill with "didn't work on my machine" sludge. Open questions are always cheap to record.
> — `_engine/gate.py`

The gate fires **at write time** and only decides whether the deposit is allowed. Conflict flagging happens at read time in `_engine/aware.py`.

## Scope breadth

```python
# _engine/gate.py
def _scope_breadth(scope: Scope) -> Literal["narrow", "broad"]:
    facets = sum(1 for f in (
        scope.model, scope.dataset, scope.env,
        scope.version, scope.n, scope.seed,
    ) if f is not None)
    return "narrow" if facets >= 2 else "broad"
```

## Rules

- `open` → always accept.
- `positive` + `broad` → requires `replicated` or `verified`.
- `positive` + `narrow` → requires `observed` or stronger.
- `negative` + `broad` → requires `observed` or stronger.
- `negative` + `narrow` → requires provenance: at least one `artifact_refs` entry OR a scoped `env` or `version`.
- `cautionary` → requires provenance.
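
Collapsed into code, the table reads roughly as below. This is a sketch of the policy, not the vendored implementation; helper names are illustrative and the grade ordering follows the schema (`anecdotal < observed < replicated < verified`):

```python
_GRADE_RANK = {"anecdotal": 0, "observed": 1, "replicated": 2, "verified": 3}

def _has_provenance(artifact_refs: list[str], scope) -> bool:
    # Provenance: at least one artifact reference, or a scoped env / version.
    return bool(artifact_refs) or scope.env is not None or scope.version is not None

def gate_allows(polarity: str, grade: str, breadth: str,
                artifact_refs: list[str], scope) -> bool:
    rank = _GRADE_RANK[grade]
    if polarity == "open":
        return True                                         # always cheap to record
    if polarity == "positive":
        return rank >= (2 if breadth == "broad" else 1)     # replicated+ / observed+
    if polarity == "negative" and breadth == "broad":
        return rank >= 1                                    # observed or stronger
    # negative + narrow, and cautionary: provenance required
    return _has_provenance(artifact_refs, scope)
```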

## Author-role modifier

When `author_role` normalises to `redteam` (strip non-alphanumeric + lowercase, so `redteam`, `red_team`, `Red-Team`, `RedTeam` all match), `negative` and `cautionary` claims get a +1 evidence-grade bump. Positive and open deposits are unaffected.
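
A sketch of the normalisation and the bump, read here as raising the claim's effective grade before the rules above run (function names are illustrative):

```python
import re

_GRADES = ["anecdotal", "observed", "replicated", "verified"]

def _is_redteam(author_role: str | None) -> bool:
    # Strip non-alphanumerics and lowercase: "red_team", "Red-Team", "RedTeam" all match.
    return bool(author_role) and re.sub(r"[^a-z0-9]", "", author_role.lower()) == "redteam"

def effective_grade(grade: str, polarity: str, author_role: str | None) -> str:
    # Red-team authors get a +1 grade bump on negative / cautionary claims only.
    if _is_redteam(author_role) and polarity in ("negative", "cautionary"):
        return _GRADES[min(_GRADES.index(grade) + 1, len(_GRADES) - 1)]
    return grade
```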

## When rejected

```python
from atlaso import DepositRejectedError

try:
    user.add(
        "threshold 0.7 is always optimal",
        polarity="positive",
        evidence_grade="anecdotal",
    )
except DepositRejectedError as e:
    print(e.gate_reason)
    # "positive/broad claim requires evidence_grade>=replicated;
    #  upgrade evidence or narrow scope (e.g., add model= or dataset=)."
```

---

# Retrieval pipeline

```
query → filter → rank → aggregate
```

## BM25 + asymmetric freshness decay

```python
# _engine/retrieval.py
_HALF_LIVES = {
    "positive":   14.0,   # days
    "open":       14.0,
    "negative":   90.0,   # ~three months
    "cautionary": 90.0,
}
# rank score = bm25 * exp(-age_days * ln2 / half_life)
```

Asymmetric on purpose. A known failure is a constraint that should stay fresh for months. A positive observation needs to be re-evidenced more often before the dispersion pipeline takes it seriously.
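
Worked numbers for the formula above, using the same half-lives:

```python
import math

_HALF_LIVES = {"positive": 14.0, "open": 14.0, "negative": 90.0, "cautionary": 90.0}

def freshness(age_days: float, polarity: str) -> float:
    # exp(-age * ln2 / half_life): 1.0 when brand new, 0.5 at one half-life.
    return math.exp(-age_days * math.log(2) / _HALF_LIVES[polarity])

# A 30-day-old positive hit keeps ~23% of its BM25 score;
# a 30-day-old negative hit keeps ~79%.
print(round(freshness(30, "positive"), 2), round(freshness(30, "negative"), 2))
```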

## k distribution

```python
# _engine/aware.py — for recall(limit=N)
k_positive   = max(1, int(limit * 0.30))
k_negative   = max(1, int(limit * 0.30))
k_cautionary = max(1, int(limit * 0.20))
k_open       = max(1, int(limit * 0.20))

candidates = max(50, k * 10)   # candidate over-fetch
```

## Scope-bag aggregation

Engine field names (left) → public field names (right) on `SearchResult`:

- `bag_size` — total deposits in the bag.
- `dominant_polarity` — directional polarity with the most deposits.
- `bag_precision` → public `agreement_score` — fraction matching the dominant polarity.
- `is_single_sample` → public `is_thin_evidence`.
- `has_conflict` → public `has_disagreement` — bag contains ≥2 distinct directional polarities (any pair drawn from `{positive, negative, cautionary}`).
- `conflict_peers` — ids of the opposing-polarity hits that flipped the bag.
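
A sketch of how those statistics fall out of one bag's polarities. How `open` deposits enter the precision denominator is a detail of the vendored engine; this sketch excludes them:

```python
from collections import Counter

def bag_stats(polarities: list[str]) -> dict:
    # Only directional polarities count toward dominance and conflict.
    directional = [p for p in polarities if p != "open"]
    counts = Counter(directional)
    dominant, dominant_n = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "bag_size": len(polarities),
        "dominant_polarity": dominant,
        "bag_precision": dominant_n / len(directional) if directional else 0.0,
        "is_single_sample": len(polarities) == 1,
        "has_conflict": len(counts) >= 2,   # >= 2 distinct directional polarities
    }

stats = bag_stats(["negative", "negative", "positive", "open"])
print(stats["dominant_polarity"], round(stats["bag_precision"], 2), stats["has_conflict"])
# negative 0.67 True
```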

## The `is_confident` rule

```python
is_confident = (
    bag_precision >= 0.99
    and not is_single_sample
    and not has_conflict
)
```

## Query log

Each `recall()` writes one row to `query_log` summarising the top hit's flags + bag size + result count. Powers **Coverage** in the Field Maturity Index.

> v0.1 retrieval is lexical BM25 — fast, deterministic, no embedding service required. v0.2 will add an optional embedding-backed recall behind the same `recall()` signature.

---

# Storage layer

One SQLite file per user, FTS5-indexed. WAL mode. No `~/.atlaso/` fallback for field data — atlaso writes alongside your project.

## On-disk layout

```
<project-root>/.atlaso/
├── users/
│   └── <sha256(user_id)[:16]>/
│       └── field.db                   # per-user store
├── _unscoped/
│   └── field.db                       # atlaso.admin writes
└── _idempotency.db                    # 24h Stripe-style dedup keys
```

## Schema

```sql
CREATE TABLE deposits (
    id              TEXT PRIMARY KEY,
    content         TEXT NOT NULL,
    polarity        TEXT NOT NULL CHECK (polarity IN ('positive','negative','cautionary','open')),
    evidence_grade  TEXT NOT NULL CHECK (evidence_grade IN ('anecdotal','observed','replicated','verified')),
    author          TEXT NOT NULL,
    task_id         TEXT,
    repro_status    TEXT NOT NULL CHECK (repro_status IN ('unreplicated','replicated','failed_repro')),
    created_at      TEXT NOT NULL,
    scope_note      TEXT NOT NULL,
    scope_model     TEXT,
    scope_dataset   TEXT,
    scope_env       TEXT,
    scope_version   TEXT,
    scope_n         INTEGER,
    scope_seed      INTEGER,
    tags_json       TEXT NOT NULL,
    artifact_refs_json TEXT NOT NULL,
    author_role     TEXT
);
CREATE INDEX deposits_created_at ON deposits(created_at);
CREATE INDEX deposits_polarity   ON deposits(polarity);

CREATE VIRTUAL TABLE deposits_fts USING fts5(
    body,
    content=''      -- contentless: stores tokenization, not source text
);

-- No ON DELETE CASCADE — hard delete in _local.py removes
-- contradiction edges manually in FK-safe order.
CREATE TABLE contradictions (
    from_deposit_id  TEXT NOT NULL,
    to_deposit_id    TEXT NOT NULL,
    reason           TEXT NOT NULL,
    created_at       TEXT NOT NULL,
    PRIMARY KEY (from_deposit_id, to_deposit_id),
    FOREIGN KEY (from_deposit_id) REFERENCES deposits(id),
    FOREIGN KEY (to_deposit_id)   REFERENCES deposits(id)
);

CREATE TABLE query_log (
    id               TEXT PRIMARY KEY,       -- UUID string
    created_at       TEXT NOT NULL,
    bag_key          TEXT,
    is_confident     INTEGER,
    has_conflict     INTEGER,
    is_single_sample INTEGER,
    bag_size         INTEGER,
    result_count     INTEGER
);
```

## FTS5 is contentless

`deposits_fts` uses `content=''` — stores tokenization, not source text. The indexed `body` column is `content + tags + scope.note` merged at insert time so one FTS query matches across all three fields.
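
A sketch of the mirrored insert. The merged `body` is as described above; linking the FTS row to `deposits` via an explicit rowid is an assumption here, not lifted from the vendored code:

```python
import sqlite3

def mirror_into_fts(conn: sqlite3.Connection, rowid: int,
                    content: str, tags: list[str], scope_note: str) -> None:
    # One FTS row per deposit: content + tags + scope note merged, so a single
    # MATCH covers all three fields. With content='' FTS5 keeps only the index.
    body = " ".join([content, *tags, scope_note])
    conn.execute(
        "INSERT INTO deposits_fts(rowid, body) VALUES (?, ?)",
        (rowid, body),
    )
```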

## PRAGMAs

- `foreign_keys=ON`
- `journal_mode=WAL`
- `busy_timeout=5000`

Connections are serialised by a `threading.RLock` at the SDK boundary — multi-threaded applications are safe without user-side coordination.
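
A sketch of what that connection setup looks like (an illustrative wrapper, not the SDK's internals):

```python
import sqlite3
import threading

_lock = threading.RLock()

def open_field_db(path: str) -> sqlite3.Connection:
    # check_same_thread=False lets one connection serve many threads;
    # serialisation happens through the lock below.
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA foreign_keys=ON")
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
    return conn

def execute_serialised(conn: sqlite3.Connection, sql: str, params: tuple = ()):
    # Every statement funnels through one re-entrant lock, mirroring the SDK boundary.
    with _lock:
        return conn.execute(sql, params)
```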

## Soft retract vs hard delete

Soft retract adds an `atlaso:retracted=<reason>` tombstone tag. The deposit becomes invisible to `recall()` but the row survives.

Hard delete removes the row and triggers an FK-safe contentless delete on the FTS index. Default is soft. Use `hard_delete=True` only for GDPR / PII.
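
A sketch of the soft-retract path at the SQL level, using the tombstone tag format above (the query shape is illustrative):

```python
import json
import sqlite3

def soft_retract(conn: sqlite3.Connection, deposit_id: str, reason: str) -> None:
    # Append the tombstone tag; the row stays in place and recall() filters it out.
    (tags_json,) = conn.execute(
        "SELECT tags_json FROM deposits WHERE id = ?", (deposit_id,)
    ).fetchone()
    tags = json.loads(tags_json)
    tags.append(f"atlaso:retracted={reason}")
    conn.execute(
        "UPDATE deposits SET tags_json = ? WHERE id = ?",
        (json.dumps(tags), deposit_id),
    )
```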

## No `~/.atlaso/` for field data

Atlaso intentionally never falls back to `~/.atlaso/` for the field database — a global home-dir store would silently merge memories across unrelated projects. (`atlaso install-hooks` does write hook shell scripts to `~/.atlaso/hooks/` — that's tooling, not field data.)

## The vendored engine

The `_engine/` directory is **byte-identical-vendored** from the upstream monorepo at wheel build time (Hatch hook in `tools/vendor_engine.py`). A CI gate (`tools/verify_engine_parity.py`) blocks releases where the SDK's copy has drifted.

---

<!-- atlaso:doc-trailer -->
**Source:** <https://www.atlaso.ai/docs/architecture>  
**Edit on GitHub:** <https://github.com/imashishkh21/atlaso/tree/main/docs/architecture.md>  
**Updated:** 2026-05-12
