This document describes a memory architecture for AI agents, humans, and subsystems sharing one substrate. The substrate stores entities, facts, state-changes, decision traces, and interaction records. Lenses (persistent, role-keyed, DB-resident filters) render the substrate for each consumer. Frames (transient task-keyed bindings) hold the intent, plan, footprint, and suppression state for the work currently in front of a consumer. The architectural identity is one equation: view = render(substrate, lens, frame).
The architecture is orthogonal across three axes. The storage axis (Living Memory, Alexandria, Episodic Continuity) tells you where a fact lives. The interpretation axis (substrate vs lens vs frame) tells you whose view of the substrate it belongs to. The action axis (state-change events, outcome → procedure feedback, intentional non-action) tells you how the substrate evolves under work. Every memory feature gets classified on all three.
The substrate runs on my own development fleet: on Forge (TypeScript / Node / SQLite), with a second implementation under way on Strata (Python / FastAPI / PostgreSQL + pgvector). Lens contracts, decision traces, the frame primitive, the typed state-change stream, and lens-filtered cross-session retrieval are live; the compounding-learning loop (reweave, conflict resolution, forgetting) is in active build, not yet live. The companion landscape and pitch decks cover positioning; this document is the architectural reference.
Most of what's below, though not all of it, is built and running, and is used daily on the Forge platform and in SFDC dev workflows. Rows that are not fully live are marked Partial, Planned, Designed, or Building. The goal is to employ it as a real use case, iterate, and prove out the architecture before any broader rollout.
| Capability | Status | Detail |
|---|---|---|
| Lessons DB + connections + entities | Built | Consolidated agent-state store (SQLite), partitioned by workflow. Write-coalescing with debounced persistence and incremental export. |
| Decision traces DB | Built | Durable audit trail of agent decisions: what was decided, why, what alternatives were considered. Queryable via HTTP API. |
| Session lifecycle | Built | Register, getContext, do work, session-end reflection. Full HTTP API with WebSocket broadcasting. |
| getContext (intelligent retrieval) | Built | Workflow-scoped, entity-anchored, token-budgeted, multi-pass (berrypicking). Context selector with precedent matching. |
| Session-end reflection | Built | Agents write lessons, decision traces, guardrail suggestions at session end. Structured knowledge capture throughout. |
| Digestive pipeline | Partial | Reduce, Reflect, and Archive run asynchronously post-session via worker pool today. Reweave and Verify (the compounding step that integrates and reconciles knowledge) are in active build, not yet live. |
| Forgetting by design | Planned | Pruning, supersession, condensation, stale detection, demand-driven processing: designed, and part of the current build. Not yet live. |
| Multi-workflow support | Built | Core Forge platform and SFDC dev share one framework, data partitioned by workflow. Workflow engine with pluggable adapters. |
| Scratchpad + session summary | Built | Per-task state + recent work history. Created/loaded via API. File-backed state: the NLAH paper's #1 most valuable harness module. |
| Evaluator system (LLM + criteria) | Built | LLM-based evaluator with ISC criteria parser and verifier. Auto-pass/fail/escalate decisions on backlog items. Aligns with Anthropic's capability vs regression eval distinction. |
| Verifier alignment logging | Built | Tracks whether evaluator decisions match human reviewer decisions. Directly implements the NLAH paper's critical finding: verification layers that diverge from the evaluator's acceptance criteria reduce quality. |
| Guardrail system (proposals + accumulation) | Built | Agent-proposed guardrails with human review. Self-improvement loop: failures generate guardrail updates. Per-agent guardrail files. |
| Failure taxonomy | Built | Named failure modes with recovery actions (context exhaustion, loop detection, verification divergence, tool failure, etc.). NLAH's sixth required harness component. |
| Golden principles | Built | Codified mechanical rules for codebase consistency. Per OpenAI's harness pattern: opinionated rules that agents can self-enforce. |
| Eval suites + regression framework | Built | YAML-defined eval suites with grader registry, regression sources, and result tracking. Fixtures for reproducible evaluation. |
| Orchestration coordination | Built | Multi-agent coordination with collision detection, session isolation, and bounded authority. Prevents concurrent conflicts. |
| Approval queue + human-in-the-loop | Built | Full approval workflow with JWT auth, review routing, and WebSocket-based real-time updates. Human review at the highest-leverage points. |
| Alexandria renderer / Living Wiki | Built | WikiCompiler renders auto-promoted lessons into a scannable wiki with [[wiki-links]], cross-references, and topic grouping. Incremental compilation on file change. |
| Wiki quality gates | Built | Configurable promotion thresholds. Claims must meet confidence and quality gates before entering the wiki. Validation happens at write time. |
| Wiki curator + scheduler | Built | Automated curation: scheduled synthesis reports, staleness detection, and maintenance passes. The wiki maintains itself. |
| Dashboard (metrics + management) | Built | Living Memory metrics, backlog, session management, approval workflow, real-time session monitoring. |
| Lesson connections (graph edges) | Built | Typed connections with optional reason text for propositional links. Backlinks for bidirectional traversal. |
| Trust tiers + memory integrity | Designed | ADR written. Content trust tiers, read-time validation, and write-event audit log sufficient for rollback. |
| Alexandria knowledge graph | Building | Full propositional knowledge graph with entity-centric claims, graph traversal, and progressive disclosure. Being built for personal and intra-team use, architected for scale. |
Cross-platform validation: The M/L/P architecture is being proven on two entirely different platforms. On Forge (an agentic project-orchestration platform on TypeScript / Node / SQLite), agents coordinate through a shared backlog with living memory. On Strata (Python / FastAPI / PostgreSQL + pgvector), eight governed agents collaborate through scoped repositories with decision traces. The same architectural principles (governed writes, learning loops, cost-conscious inference) apply naturally to both. This gives us confidence the patterns are transferable across frameworks.
The architecture spans three orthogonal axes. Confusion between axes is the most common architectural mistake: conflating storage with interpretation, or interpretation with action, produces designs that fight themselves.
Living Memory holds user/workspace context (preferences, behavioral feedback, project facts). Alexandria holds cross-session, cross-project learnings as bi-temporal claims with decision traces. Episodic Continuity holds session-scoped state: scratchpads (narrative) and per-ticket context (now generalized to frames). The storage axis is operational and documented in MEMORY-TIERS.md.
Substrate is the universal storage of state: entities, facts, typed state-changes, relationships, provenance, permissions, history. Bi-temporal validity columns (valid_at / invalid_at), decision traces, lesson chains, and interaction-typed records all live here. Lens is a persistent consumer-keyed filter declaring entity types loaded by default, relationships in scope, watched state-changes, RBAC capabilities, cross-lens read scope, and expiration policy. Lenses are DB-resident (lens_contracts + lens_preferences + consumer_lens_assignments + permission_groups + permissions). Frame is a transient task-keyed binding holding intents (canonical 10), next_steps with assigned consumers, footprint, surfaced substrate refs, and suppression state. Frames are also DB-resident; scratchpads carry frame_id at creation.
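To make the validity columns concrete, here is a minimal sketch of an as-of filter over `valid_at` / `invalid_at`. Only those two column names come from the text; the `Fact` class, its other fields, and `facts_as_of` are hypothetical names chosen for illustration, and the half-open interval is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    # Hypothetical record shape; only valid_at / invalid_at are named in the text.
    entity: str
    claim: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means "still valid"


def facts_as_of(facts: list[Fact], when: datetime) -> list[Fact]:
    """Return facts whose validity interval [valid_at, invalid_at) contains `when`."""
    return [
        f for f in facts
        if f.valid_at <= when and (f.invalid_at is None or when < f.invalid_at)
    ]


facts = [
    Fact("vendor-x", "supports SAML", datetime(2025, 1, 1), datetime(2025, 6, 1)),
    Fact("vendor-x", "supports SAML and OIDC", datetime(2025, 6, 1)),
]
current = facts_as_of(facts, datetime(2025, 7, 1))
earlier = facts_as_of(facts, datetime(2025, 3, 1))
```

Superseding a fact closes its interval instead of deleting the row, so the earlier view stays queryable.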
Substrate writes emit typed state-change events with before/after diffs, lens_id and frame_id annotation, and evidence references. Outcomes (eval results, gate failures, learning-proposal status) flow back to procedure (SKILL.md mutations) through an approval-gated mutator that records the mutation as a state-change. Triggers (hooks, watchdog, scheduler, evaluator) emit explicit act / wait / escalate / no-op decisions with rationale. Non-action is a first-class action when it is intentional and persisted, and silent failure is structurally distinguishable from intentional restraint.
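The sketch below illustrates one possible shape for a typed state-change event and a persisted trigger decision. The field names follow the text (before/after, `lens_id`, `frame_id`, evidence, act / wait / escalate / no-op); the class names, `diff`, and `silent_triggers` are assumptions made for this example, not the production schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Decision(Enum):
    ACT = "act"
    WAIT = "wait"
    ESCALATE = "escalate"
    NO_OP = "no-op"


@dataclass(frozen=True)
class StateChange:
    # Hypothetical event shape assembled from the fields the text lists.
    entity: str
    before: dict[str, Any]
    after: dict[str, Any]
    lens_id: str
    frame_id: str
    evidence: tuple[str, ...] = ()

    def diff(self) -> dict[str, tuple[Any, Any]]:
        """Map each changed key to its (before, after) pair."""
        keys = sorted(self.before.keys() | self.after.keys())
        return {
            k: (self.before.get(k), self.after.get(k))
            for k in keys
            if self.before.get(k) != self.after.get(k)
        }


@dataclass(frozen=True)
class TriggerDecision:
    trigger: str
    decision: Decision
    rationale: str  # why the trigger chose this, including for no-op


def silent_triggers(expected: set[str], log: list[TriggerDecision]) -> set[str]:
    """Triggers that were expected to decide but persisted no decision at all."""
    return expected - {d.trigger for d in log}


change = StateChange(
    entity="ticket-42",
    before={"status": "open", "owner": "bob"},
    after={"status": "closed", "owner": "bob"},
    lens_id="developer",
    frame_id="frame-7",
    evidence=("eval-run-19",),
)
log = [
    TriggerDecision("watchdog", Decision.NO_OP, "queue depth below threshold"),
    TriggerDecision("evaluator", Decision.ESCALATE, "criteria ambiguous"),
]
missing = silent_triggers({"watchdog", "evaluator", "scheduler"}, log)
```

Because the watchdog's no-op is a persisted record with a rationale, it reads differently from the scheduler, which left no record at all.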
Dreams: retrospective synthesis. The action axis includes scheduled retrospections inspired by Anthropic's Managed Agents Dreams primitive (research preview, 2026-04). Dreams run asynchronously over a rolling window of state-changes, scratchpad archives, decision traces, and the docs corpus, surfacing duplicate facts to merge, contradicted facts to retire, stale entries to age out, and skill-improvement proposals based on outcome patterns. Anthropic's version is developer-triggered and adoption-gated by hand. The local implementation extends Anthropic's framing with scheduled cadence on the local-LLM fleet (mini1/2/3, studio1) and an automated adoption gate: per-proposal-type confidence and risk thresholds with mandatory approval-queue routing for exceptions. Output flows through the same approval-gated mutator as every other substrate change, with provenance tagging so subsequent dreams skip their own prior outputs.
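A minimal sketch of the automated adoption gate, assuming each proposal carries a confidence and a risk score between 0 and 1. The proposal type names, threshold values, and the `route` function are hypothetical; only the idea of per-proposal-type thresholds with approval-queue routing for exceptions comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    kind: str          # hypothetical type label, e.g. "merge_duplicate"
    confidence: float  # 0.0 to 1.0
    risk: float        # 0.0 to 1.0


# Hypothetical thresholds: (minimum confidence, maximum risk) per proposal type.
THRESHOLDS = {
    "merge_duplicate": (0.90, 0.20),
    "age_out_stale": (0.80, 0.30),
}


def route(p: Proposal) -> str:
    """Return "auto_adopt" when the proposal clears its type's thresholds,
    otherwise "approval_queue"."""
    limits = THRESHOLDS.get(p.kind)
    if limits is None:
        return "approval_queue"  # unknown types are treated as exceptions
    min_confidence, max_risk = limits
    if p.confidence >= min_confidence and p.risk <= max_risk:
        return "auto_adopt"
    return "approval_queue"
```

In this sketch a proposal type with no configured thresholds is never auto-adopted, which keeps new proposal kinds behind human review until thresholds are calibrated.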
flowchart TB
subgraph storage["Storage Axis"]
LM["Living Memory
user/workspace"]
AL["Alexandria
cross-session learnings"]
EC["Episodic Continuity
scratchpads + frames"]
end
subgraph interp["Interpretation Axis"]
S["Substrate
entities, facts, state-changes,
decision traces, interactions"]
L["Lens
DB-resident, RBAC,
consumer-keyed"]
F["Frame
DB-resident, transient,
task-keyed"]
end
subgraph action["Action Axis"]
SC["Typed state-changes
before/after + lens_id + frame_id"]
OP["Outcome → Procedure
approval-gated SKILL.md mutator"]
NA["Intentional non-action
act / wait / escalate / no-op"]
end
storage -.->|"writes through"| interp
interp -.->|"evolves via"| action
action -.->|"persisted in"| storage
The render equation: view = render(substrate, lens, frame). The same substrate read through the Architect lens with a frame for “evaluate vendor for new identity broker” produces a different view than the same substrate through the Steward lens with a frame for “sprint plan.” The Memory / Learning / Personalization decomposition (MAPLE) sits inside the substrate side of the interpretation axis: Memory = substrate storage; Learning = digestive pipeline that mutates substrate; Personalization = render(substrate, lens, frame). M/L/P remains accurate; substrate / lens / frame is the load-bearing vocabulary.
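The render equation can be sketched as a pure function. This is an illustration under stated assumptions, not the production renderer: the `Record`, `Lens`, and `Frame` shapes are simplified, and treating the frame's footprint as a set of substrate refs that narrows the view is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    entity_type: str
    ref: str
    text: str


@dataclass(frozen=True)
class Lens:
    name: str
    entity_types: frozenset[str]  # entity types loaded by default


@dataclass(frozen=True)
class Frame:
    intent: str
    footprint: frozenset[str]                 # substrate refs the task touches (assumed)
    suppressed: frozenset[str] = frozenset()  # refs the consumer asked not to see


def render(substrate: list[Record], lens: Lens, frame: Frame) -> list[Record]:
    """view = render(substrate, lens, frame): the lens filters by entity type,
    the frame narrows to its footprint and drops suppressed refs."""
    return [
        r for r in substrate
        if r.entity_type in lens.entity_types
        and r.ref in frame.footprint
        and r.ref not in frame.suppressed
    ]


substrate = [
    Record("adr", "adr-12", "Adopt OIDC for identity broker"),
    Record("vendor", "vendor-x", "Vendor X supports SAML and OIDC"),
    Record("backlog_item", "item-88", "Migrate login flow"),
    Record("sprint", "sprint-5", "Sprint 5 capacity: 30 points"),
]
architect = Lens("Architect", frozenset({"adr", "vendor"}))
steward = Lens("Steward", frozenset({"backlog_item", "sprint"}))
vendor_frame = Frame(
    "evaluate vendor for new identity broker",
    frozenset({"adr-12", "vendor-x", "item-88"}),
)
sprint_frame = Frame("sprint plan", frozenset({"item-88", "sprint-5"}))

architect_view = render(substrate, architect, vendor_frame)
steward_view = render(substrate, steward, sprint_frame)
```

The substrate list is never copied or forked per consumer; the two views differ only because the lens and frame arguments differ.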
Standing on strong shoulders. The substrate-vs-lens vocabulary was crystallized by Sentra's Company Brain series (Ashwin Gopinath, 2026-04). The frame primitive is a local extension Andrew added to disambiguate transient task-binding from persistent role-shape. Earlier influences remain load-bearing: Karpathy (Software 2.0, LLM OS, where memory is the file system), Cherny (context engineering as systems discipline), PAL (structured artifacts over free-form prose), Cornelius ("the vault is the riverbed, the sessions are the water"), and NLAH/Pan et al. (file-backed state as the highest-value harness module). The M/L/P naming draws from MAPLE. The deeper lineage (UXF, ASO26 / IA-EA, EDR > Decision Support) proved the substrate-vs-lens pattern in production at Macmillan years before Sentra named it.
Every agent session follows the same lifecycle. The loop is the same whether the agent is working on a core platform ticket or a Salesforce development task.
sequenceDiagram
participant Agent
participant API as HTTP API
participant Store as Learning Store
participant Pipeline as Digestive Pipeline
Agent->>API: POST /sessions (register)
API-->>Agent: Session ID + scratchpad
Agent->>API: GET /sessions/{id}/context
API->>Store: Query lessons, precedent (workflow-scoped)
Store-->>API: Relevant lessons + decision traces
API-->>Agent: Context payload (token-budgeted)
Note over Agent: Agent works (skills, tools, code)
Note over Agent: Updates scratchpad every 3-5 actions
Agent->>API: DELETE /sessions/{id} (end session)
Note over Agent: Includes: summary, reflection,<br/>lessons, decision traces
API->>Store: Write lessons + traces (scoped)
API->>Pipeline: Queue digestive processing
Pipeline->>Store: Reduce → Reflect → Reweave → Verify → Archive
Note over Store: Next getContext returns<br/>these lessons
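The lifecycle in the diagram above can be walked through with an in-memory stand-in. This sketch is not the real HTTP API: `MemoryService` and its method names are hypothetical, the comments only note which endpoint each method mirrors, and the digestive pipeline is skipped so lessons become visible immediately.

```python
import itertools


class MemoryService:
    """In-memory stand-in for the session API; illustrative only."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._sessions: dict[int, str] = {}       # session id -> workflow
        self._lessons: dict[str, list[str]] = {}  # workflow -> lessons

    def register(self, workflow: str) -> int:
        # Mirrors POST /sessions
        session_id = next(self._ids)
        self._sessions[session_id] = workflow
        return session_id

    def get_context(self, session_id: int) -> list[str]:
        # Mirrors GET /sessions/{id}/context, scoped to the session's workflow
        return list(self._lessons.get(self._sessions[session_id], []))

    def end_session(self, session_id: int, lessons: list[str]) -> None:
        # Mirrors DELETE /sessions/{id}, which carries the reflection payload
        workflow = self._sessions.pop(session_id)
        self._lessons.setdefault(workflow, []).extend(lessons)


svc = MemoryService()
s1 = svc.register("sfdc-dev")
first = svc.get_context(s1)
svc.end_session(s1, ["Deploy metadata before running Apex tests"])

s2 = svc.register("sfdc-dev")
second = svc.get_context(s2)
s3 = svc.register("forge-core")
other = svc.get_context(s3)
```

The second session in the same workflow sees the lesson written at the end of the first, while the session in the other workflow does not, which is the workflow partitioning the lifecycle relies on.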
Raw knowledge entering the system is not yet useful. It is food awaiting digestion before it becomes capability. The digestive pipeline transforms it. Each phase runs in isolation with fresh context to prevent contamination between steps.
flowchart LR
R["Reduce
Raw → atomic claims"]
RE["Reflect
Find connections"]
RW["Reweave
Update existing"]
V["Verify
Schema + dedup"]
A["Archive
Mark processed"]
R --> RE --> RW --> V --> A
| Phase | What it does | Analogy |
|---|---|---|
| Reduce | Extract atomic claims from source material. A 2000-word battlecard yields five entity-anchored claims. The rest is discarded. | Digestion: break down bulk into building blocks. |
| Reflect | Find connections. Where does each new claim link to what already exists? Connections make both claims more retrievable. | Cross-referencing: linking new knowledge to existing understanding. |
| Reweave | Backward maintenance. "If this existing claim were written today, knowing what we now know, what would be different?" New material strengthens existing structures. | Revision: the graph gets stronger claims, not just more claims. |
| Verify | Immune system. Schema validation, retrieval filter quality check, entity resolution, duplicate detection. Malformed claims never enter the graph. | Type checker: correctness checks on installed software. |
| Archive | Processed source material exits the active workspace. What remains is extracted value: clean, connected, integrated. | Waste removal: keep the nutrients, release the bulk. |
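A toy version of three of the phases, to show the isolation property: each phase is a function that sees only its explicit inputs. The `entity: claim` line format, the `Claim` class, and the function names are assumptions for illustration; the live Reduce step is far richer, and Reweave and Archive are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    entity: str
    text: str
    links: tuple[str, ...] = ()


def reduce_phase(raw: str) -> list[Claim]:
    """Keep only lines shaped like 'entity: claim'; discard the rest."""
    claims = []
    for line in raw.splitlines():
        entity, sep, text = line.partition(":")
        if sep and entity.strip() and text.strip():
            claims.append(Claim(entity.strip(), text.strip()))
    return claims


def reflect_phase(claims: list[Claim], existing: list[Claim]) -> list[Claim]:
    """Link each new claim to existing claims about the same entity."""
    return [
        Claim(c.entity, c.text, tuple(e.text for e in existing if e.entity == c.entity))
        for c in claims
    ]


def verify_phase(claims: list[Claim], existing: list[Claim]) -> list[Claim]:
    """Drop claims that duplicate an existing (entity, text) pair."""
    seen = {(e.entity, e.text) for e in existing}
    kept = []
    for c in claims:
        key = (c.entity, c.text)
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept


existing = [Claim("acme", "uses SSO")]
raw = "acme: uses SSO\nmeeting ran long\nacme: renewal in Q3"
reduced = reduce_phase(raw)
verified = verify_phase(reflect_phase(reduced, existing), existing)
```

Passing state only through arguments and return values is one simple way to keep a later phase from inheriting context it should not see.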
Retrieval is purpose-aware and token-budgeted. The same entity returns different claim subsets depending on the consumer's workflow and task.
flowchart TB
Q["Agent calls getContext
(workflow, task, scopes)"]
F["Filter by workflow + scope
(project → team → framework → org)"]
R["Rank by: priority × recency
× tag match × value-per-token"]
B["Apply token budget
(cap total context size)"]
BP["Berrypicking: multi-pass
(reassess, refine, re-query)"]
A["Assemble response:
lessons + precedent + (optional) Alexandria index"]
Q --> F --> R --> B --> BP --> A
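The rank and budget steps in the flow above can be sketched as follows. The four factors come from the diagram, but the formulas are assumptions: recency as exponential decay with a 30-day half-life, tag match as the fraction of task tags covered, and value-per-token as the inverse of token count. The `Lesson` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    text: str
    priority: float
    age_days: float
    tags: frozenset[str]
    tokens: int


def score(lesson: Lesson, task_tags: frozenset[str], half_life_days: float = 30.0) -> float:
    """priority x recency x tag match x value-per-token (all formulas assumed)."""
    recency = 0.5 ** (lesson.age_days / half_life_days)
    tag_match = len(lesson.tags & task_tags) / max(len(task_tags), 1)
    value_per_token = 1.0 / max(lesson.tokens, 1)
    return lesson.priority * recency * tag_match * value_per_token


def select(lessons: list[Lesson], task_tags: frozenset[str], budget: int) -> list[Lesson]:
    """Greedy fill: take lessons in score order while they fit the token budget."""
    ranked = sorted(lessons, key=lambda l: score(l, task_tags), reverse=True)
    chosen: list[Lesson] = []
    used = 0
    for lesson in ranked:
        if score(lesson, task_tags) <= 0:
            continue  # no tag overlap, so nothing to offer this task
        if used + lesson.tokens <= budget:
            chosen.append(lesson)
            used += lesson.tokens
    return chosen


lessons = [
    Lesson("Run migrations before seeding", 3.0, 0.0, frozenset({"db", "deploy"}), 100),
    Lesson("Prefer small PRs", 2.0, 30.0, frozenset({"review"}), 50),
    Lesson("Seed data lives in fixtures/", 1.0, 0.0, frozenset({"db"}), 50),
]
tight = select(lessons, frozenset({"db"}), budget=120)
roomy = select(lessons, frozenset({"db"}), budget=150)
```

A greedy fill is the simplest budget policy; it can leave budget unused when a large lesson is skipped, which a knapsack-style selection would avoid at higher cost.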
Retrieval spans four layers, from narrow to broad:
| Layer | Scope |
|---|---|
| Project | This task, this ticket |
| Team / domain | Shared best practices |
| Framework | Cross-workflow lessons |
| Organization | Entity graph, org-wide |
Retrieval isn't always one-shot. The system can do multiple passes: an initial pass finds relevant lessons, the retrieval logic reassesses ("given what I found, what else do I need?"), and runs another pass with a refined query.
This mirrors how people actually look things up. The search evolves as understanding grows.
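A minimal sketch of the multi-pass idea, assuming lessons are indexed by sets of terms. The reassessment step here is deliberately naive (terms from each hit are added to the query); the real retrieval logic is not described at this level, so `berrypick` and the corpus shape are hypothetical.

```python
def berrypick(
    query_terms: set[str],
    corpus: dict[str, set[str]],
    max_passes: int = 3,
) -> list[str]:
    """Multi-pass retrieval: after each pass, widen the query with terms
    from what was found, then search again for lessons not yet returned."""
    terms = set(query_terms)
    found: list[str] = []
    for _ in range(max_passes):
        hits = [
            doc for doc, doc_terms in corpus.items()
            if doc not in found and terms & doc_terms
        ]
        if not hits:
            break  # nothing new turned up, so stop early
        found.extend(hits)
        for doc in hits:
            terms |= corpus[doc]
    return found


corpus = {
    "lesson-1": {"oauth", "token"},
    "lesson-2": {"token", "refresh"},
    "lesson-3": {"css"},
}
results = berrypick({"oauth"}, corpus)
```

The query for `oauth` reaches `lesson-2` only on the second pass, through the `token` term picked up from `lesson-1`; a one-shot search would have missed it.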
The graph is entity-centric: it models the real things the organization cares about and the relationships between them. Entities are the nouns of institutional knowledge.
flowchart LR
subgraph SRC["Sources"]
direction TB
s1["Agent reflections"]
s2["Research"]
s3["Conversations"]
s4["Events"]
end
subgraph INC["Input Contract"]
c1["entity + claim +
source + confidence +
timestamp + author"]
end
subgraph KG["Knowledge Graph"]
direction TB
e["Entities"]
r["Relationships"]
kc["Claims"]
b["Backlinks"]
e --- r
r --- kc
kc --- b
end
subgraph OC["Output Contract"]
o1["Purpose-aware
context assembly"]
end
subgraph CON["Consumers"]
direction TB
a1["Agents"]
a2["Dashboards"]
a3["Decision support"]
a4["Search"]
end
SRC --> INC --> KG --> OC --> CON
Claims are installed capabilities. When an agent loads a knowledge claim, it gains a capability it didn't have before. A claim about competitive positioning enables the agent to reason about strategy. A vague claim is a buggy function. A stale claim is a regression. Quality gates are correctness checks on installed software, not tidiness conventions.
Agents are one consumer class. Humans are another. Subsystems (watchdog, scheduler, dispatcher) are a third. All read the same substrate through different lenses. A consumer can wear more than one lens; multi-lens permissions combine by UNION, so wearing more lenses can only widen scope, never restrict it. Andrew, as Principal, can wear any lens to view the work through that role's filter.
| Lens | Consumers | Default scope |
|---|---|---|
| Architect | alex | Wide; ADR + planning + cross-system entities; high decision-trace density |
| Developer | bob, doug, remy, quinn, evelyn, sentinel, jerry | Code, tests, gates, eval results; narrow per-task |
| Analyst | bea | Requirements, decision matrices, stakeholder context |
| Steward | stewart, oscar | Backlog, sprint state, dependencies, collisions |
| Writer | sherry | Docs, Confluence, documentation drift |
| Principal | andrew (default; can wear any lens) | Workspace-wide; longitudinal; decision-trace heavy |
| Subsystem | watchdog, scheduler-daemon, dispatcher | Event-conditioned; trigger-shaped; non-action capable |
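The union rule for multi-lens permissions can be sketched in a few lines. The lens names come from the roster above; the permission strings and `effective_permissions` are hypothetical placeholders, not the contents of the real permissions tables.

```python
# Hypothetical permission sets per lens; the real ones live in the database.
LENS_PERMISSIONS: dict[str, set[str]] = {
    "Architect": {"read:adr", "read:planning"},
    "Steward": {"read:backlog", "read:sprint"},
    "Writer": {"read:docs"},
}


def effective_permissions(worn_lenses: list[str]) -> set[str]:
    """Union the permissions of every lens the consumer is wearing."""
    permissions: set[str] = set()
    for lens in worn_lenses:
        permissions |= LENS_PERMISSIONS[lens]
    return permissions


one_lens = effective_permissions(["Architect"])
two_lenses = effective_permissions(["Architect", "Steward"])
```

Because the combination is a set union, adding a lens can never remove a permission the consumer already had.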
Skills are workspace-wide capabilities. A loose-coupling join records which lens each skill is primary for, which skills bridge between lenses (Bea's R×S to Backlog hands work from Analyst to Steward), and which are lens-agnostic (the career-coach skill operates outside any analyst/developer lens). The lens prioritizes and surfaces skills; permissions gate invocation. Skills don't gate against the active lens; carrying a skill while wearing the “wrong” lens is permitted, just not surfaced by default.
Not every agent should learn. The substrate still serves four levels of agent complexity, distinguished by their relationship to knowledge: stateless tools (run, return, forget), knowledge consumers (read substrate, no writes), scoped learners (write within a single lens), and full-metabolism agents (write across substrate with reflection and decision-trace contribution). The 65/20/10/3 distribution still holds. The lens layer cuts orthogonally to autonomy: a stateless tool can be Developer-lensed (formatter); a full-metabolism agent can be Architect-lensed (alex). Lens determines scope; autonomy determines write capacity.
We pressure-tested the architecture against 30+ external sources across 10+ sessions. Seven validated strengths. Five remaining gaps.
Propositional framing ("Since [[X]], therefore Y"): no evaluated system has an equivalent. Cornelius's "notes as callable functions" framing independently converges with this approach.
Memory evolution (decay, condensation, supersession): ahead of the field; most systems implement only formation and retrieval.
Forgetting design: the most complete in any source evaluated.
3D hierarchical token-level memory: placed at the apex of the 2026 academic survey's taxonomy.
Session lifecycle + scratchpad: independently validated by Anthropic, MAPLE, NLAH (file-backed state is the #1 most valuable harness module), and the academic survey. Two 2026 empirical papers additionally quantify what NLAH names: Cao et al. (arXiv 2603.20432) show +17.3% gains from filesystem-organized context across long-context reasoning, RAG, and QA; Lee et al. / Stanford (arXiv 2603.28052) show +7.7 points with 4× fewer tokens via a filesystem-backed agentic proposer. The .agent-data/, scratchpads, and decision traces in this architecture are exactly the pattern both papers measure.
Harness engineering: failure taxonomy, golden principles, evaluator system with verifier alignment logging, and eval suites, all built and running. Implements all six NLAH required components (Contracts, Roles, Stage Structure, Adapters, State Semantics, Failure Taxonomy).
Context engineering as architecture: token-budgeted getContext, berrypicking retrieval, progressive disclosure, and workflow-scoped serving. Per Cherny: the bottleneck is the context the model receives.
| Gap | Impact | How we close it |
|---|---|---|
| Memory integrity and trust model | No content trust tiers. No read-time validation. No write-event audit log sufficient for rollback. Defenses are write-time schema/format correctness only. | ADR written. Trust tiers, audit log, and read-time integrity checks designed. Implementation is the next priority for the knowledge graph layer. |
| Security / blast radius of poisoned claims | Alexandria's propositional framing means a poisoned claim is reasoned from, not merely referenced. Higher epistemic authority means higher blast radius. Three attack vectors identified: direct lesson injection, reflection poisoning, and pipeline injection. | Mitigations designed (trust tiers + audit log + integrity checks). The harness engineering layer (guardrails, eval system, verifier alignment) provides the governance structure; trust tiers extend it to knowledge claims. |
| Context pressure for long-running agents | Session-boundary model works well for bounded tasks. Breaks at 1+ hour sessions. We have context rotation signals and a failure taxonomy entry; we need architectural support. | HumanLayer's Research→Plan→Implement workflow (context compaction between phases) addresses this. Our existing scratchpad rotation and context budget zones (40-60% target) are partial solutions. Full architectural support (mid-session reflection triggers) is next. |
| Conflict resolution (clash) | When getContext returns contradictory lessons, there is no resolution logic: the agent receives contradictory signals with no guidance. | Design in progress. The knowledge graph's typed connections enable contradiction detection (CONTRADICTS edge type). Resolution requires recency-wins, confidence-wins, or escalate-to-human logic. Per NLAH: verifier alignment data (which we now collect) can inform which resolution strategy matches human judgment. |
| Benchmarks measuring the wrong things | Quantitative benchmarks were added early (eval suites infrastructure with YAML-defined tasks, grader registry, result tracking, and the verifier alignment log). They run, but they measure retrieval speed and recall, while the things the substrate is actually good at go unmeasured: discovery (connections not sought), reasoning (propositional claims as installed capabilities), evolution (forgetting at scale), governance (trust tiers, lens-scoped permission resolution), and longitudinal compounding. Without the right metrics, we can't validate that the system is working end-to-end or improve it deliberately. | Build memory-specific eval tasks aligned to the substrate's value: cross-frame connection-discovery rate, decision-trace coverage, lens-scoped permission-resolver correctness, supersession lineage accuracy under contradiction, suppression-honoring rate after rejection, and longitudinal compounding (does the same task produce a richer view six months in than at start). Measurement is ongoing: phased calibration windows (7d standup → 30d steady-state) drive threshold tuning per surface. |
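For the conflict-resolution gap in the table above, the three candidate strategies could look like the sketch below. The design is still in progress, so this is only one possible shape: the `Lesson` fields, the `resolve` function, the tie handling, and the 0.2 confidence gap are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Lesson:
    text: str
    confidence: float
    written: date


def resolve(a: Lesson, b: Lesson, strategy: str, min_gap: float = 0.2) -> Optional[Lesson]:
    """Pick a winner between two contradictory lessons.
    Returns None when the conflict should be escalated to a human."""
    if strategy == "recency-wins":
        if a.written == b.written:
            return None  # assumed: ties escalate
        return a if a.written > b.written else b
    if strategy == "confidence-wins":
        if abs(a.confidence - b.confidence) < min_gap:
            return None  # assumed: near-ties escalate
        return a if a.confidence > b.confidence else b
    if strategy == "escalate-to-human":
        return None
    raise ValueError(f"unknown strategy: {strategy}")


old = Lesson("Use batch size 200 for bulk loads", 0.9, date(2025, 1, 10))
new = Lesson("Use batch size 50 for bulk loads", 0.6, date(2025, 5, 2))
by_recency = resolve(old, new, "recency-wins")
by_confidence = resolve(old, new, "confidence-wins")
```

The two strategies disagree on this pair, which is the kind of case where verifier alignment data could indicate which strategy better matches human judgment.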
The benchmarking reality: For simple retrieval, the mechanism doesn't matter. Agents search well with any tool. Our graph's value is not faster retrieval. It is discovery (connections not sought), reasoning (propositional claims as installed capabilities), evolution (forgetting at scale), and governance (trust tiers). These don't show up on standard benchmarks. They show up in multi-agent, cross-domain scenarios. We need to build evaluation methods that test what we're actually good at.
The substrate, lenses, and frames are real and running; the compounding-learning loop is the current build. The vision is what comes next: the architectural shape carrying forward to additional consumer classes (enterprise, customer-facing), additional input sources (interaction memory beyond the local agent loop), and longitudinal compounding measured against the right metrics, the ones suited to the substrate rather than retrieval-speed benchmarks borrowed from RAG.
flowchart TB
subgraph now["Operational"]
S["Substrate
(facts, state-changes,
decision traces, interactions)"]
L["Lens contracts
(7-lens roster, RBAC)"]
F["Frame primitive
(intents, footprint, suppression)"]
D["Dreams
(retrospective synthesis,
local-LLM scheduled)"]
S --> L
S --> F
L --> F
S -.->|"feeds"| D
D -.->|"proposes mutations to"| S
end
subgraph extending["Extending"]
IM["Interaction memory
(meeting transcripts, calls,
messaging, opt-in)"]
LP["Lens packs
(Sales, Success,
Support, On-call, Exec)"]
CC["Customer-facing surface
(UXF descendant)"]
MET["Substrate-aligned metrics
(discovery, evolution,
governance, compounding)"]
end
S -.->|"new producers"| IM
L -.->|"productization"| LP
L -.->|"new consumer class"| CC
now -.->|"validated by"| MET
Today's lens roster covers agents, the principal, and named subsystems. Three additional classes are architecturally supported and ready to onboard: enterprise role lenses (Sales, Success, Support, On-call, Exec) as productizable lens packs; customer-facing lenses as the UXF descendant; peer principals for multi-human collaboration. Adding a class is a seed migration: the substrate doesn't fork.
Interaction memory is schema-ready but lightly fed. Capture-chat already accepts external transcripts; the next step is opt-in producers for meetings, calls, and messaging surfaces, scoped by directory or channel. Each new source produces interaction-typed substrate records and feeds Dreams' retrospective window without changing the substrate's shape.
The substrate's value lies in discovery, evolution, governance, and longitudinal compounding rather than in retrieval speed. The eval suites and verifier alignment log run, but they capture the wrong metrics. The next benchmark generation tests cross-frame connection-discovery rate, supersession lineage accuracy, suppression-honoring rate, lens-scoped permission-resolver correctness, and whether the same task produces a richer view six months in than at start.
Any source publishes knowledge into the substrate through one contract: {entity, claim, source, confidence, timestamp, author, lens_id, frame_id}. The lens and frame annotations are what make the contract substrate-shaped: every write knows whose view produced it and which task it served. A well-formed claim is a capability the agent gains on load. The framework grows by accretion.
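The input contract maps naturally onto a validated record type. The eight field names below come from the contract itself; the `ClaimWrite` class name, the non-empty checks, and the assumption that confidence is a value between 0 and 1 are illustrative choices, not the enforced schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ClaimWrite:
    entity: str
    claim: str
    source: str
    confidence: float
    timestamp: datetime
    author: str
    lens_id: str
    frame_id: str

    def __post_init__(self) -> None:
        # Reject writes that cannot say what they claim or whose view produced them.
        for name in ("entity", "claim", "source", "author", "lens_id", "frame_id"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be non-empty")
        # Assumed range; the contract does not state one.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


write = ClaimWrite(
    entity="vendor-x",
    claim="Supports SAML and OIDC",
    source="vendor datasheet",
    confidence=0.8,
    timestamp=datetime(2025, 6, 1, 9, 30),
    author="alex",
    lens_id="architect",
    frame_id="frame-7",
)
```

Validating at construction time means a malformed write fails before it reaches the substrate, in line with the write-time gates described earlier.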
The strategic opportunity: The industry is converging on substrate ideas (Sentra), graphs (Glean), and agent memory (Anthropic, Mem0, Zep). What no evaluated competitor ships is the full architectural triad (substrate plus declared role-shaped lenses plus transient task frames) with state-change as a typed primitive, intentional non-action at the trigger layer, and substrate-aligned measurement. The substrate generalizes; lens packs are the productizable surface; the credibility artifact is years of solo dogfood under load. Per-tool memory is fragmented; one shared substrate is the moat.