We've built a shared memory system for AI agents that lets them learn from each other's sessions, share precedent across workflows, and maintain that knowledge automatically. The system is in active use on two workflows (core platform development and Salesforce development), where we're dogfooding it to prove out the architecture and fine-tune the real-world experience. The overall design has been pressure-tested against 30+ external sources from academia, industry, and thought leadership, though the current implementation doesn't yet address all of the findings.
The architecture decomposes into three layers: Memory (storage), Learning (extraction and maintenance), and Personalization (serving the right knowledge to the right agent). We operate at Level 4 (agentic learning and memory) today. Level 5 (Alexandria: the full knowledge library with propositional claims and graph traversal) is currently being built for personal and intra-team use, architected to scale.
This document focuses on the architecture: how it works, step-by-step processes, honest challenges, and the vision for what it could become. For the business case and landscape context, see the companion decks (Alexandria pitch book and Memory, Context & Graphs landscape primer).
Everything below is built and running, used daily on the PDT and SFDC dev workflows. The goal is to exercise it on real work, iterate, and prove out the architecture before any broader rollout.
| Capability | Status | Detail |
|---|---|---|
| Lessons DB + connections + entities | Built | Consolidated agent-state store (SQLite), partitioned by workflow. Write-coalescing with debounced persistence and incremental export. |
| Decision traces DB | Built | Durable audit trail of agent decisions — what was decided, why, what alternatives were considered. Queryable via HTTP API. |
| Session lifecycle | Built | Register, getContext, do work, session-end reflection. Full HTTP API with WebSocket broadcasting. |
| getContext (intelligent retrieval) | Built | Workflow-scoped, entity-anchored, token-budgeted, multi-pass (berrypicking). Context selector with precedent matching. |
| Session-end reflection | Built | Agents write lessons, decision traces, guardrail suggestions at session end. Structured knowledge capture, not free-form. |
| Digestive pipeline | Built | Reduce, Reflect, Reweave, Verify, Archive. Runs asynchronously post-session via worker pool. |
| Forgetting by design | Built | Pruning, supersession, condensation, stale detection, demand-driven processing. |
| Multi-workflow support | Built | Core PDT (PCC) and SFDC dev share one framework, data partitioned by workflow. Workflow engine with pluggable adapters. |
| Scratchpad + session summary | Built | Per-task state + recent work history. Created/loaded via API. File-backed state — the NLAH paper's #1 most valuable harness module. |
| Capability | Status | Detail |
|---|---|---|
| Evaluator system (LLM + criteria) | Built | LLM-based evaluator with ISC criteria parser and verifier. Auto-pass/fail/escalate decisions on backlog items. Aligns with Anthropic's capability vs regression eval distinction. |
| Verifier alignment logging | Built | Tracks whether evaluator decisions match human reviewer decisions. Directly implements the NLAH paper's critical finding: verification layers that diverge from the evaluator's acceptance criteria reduce quality. |
| Guardrail system (proposals + accumulation) | Built | Agent-proposed guardrails with human review. Self-improvement loop: failures generate guardrail updates. Per-agent guardrail files. |
| Failure taxonomy | Built | Named failure modes with recovery actions (context exhaustion, loop detection, verification divergence, tool failure, etc.). NLAH's sixth required harness component. |
| Golden principles | Built | Codified mechanical rules for codebase consistency. Per OpenAI's harness pattern: opinionated rules that agents can self-enforce. |
| Eval suites + regression framework | Built | YAML-defined eval suites with grader registry, regression sources, and result tracking. Fixtures for reproducible evaluation. |
| Orchestration coordination | Built | Multi-agent coordination with collision detection, session isolation, and bounded authority. Prevents concurrent conflicts. |
| Approval queue + human-in-the-loop | Built | Full approval workflow with JWT auth, review routing, and WebSocket-based real-time updates. Human review at the highest-leverage points. |
| Capability | Status | Detail |
|---|---|---|
| Alexandria renderer / Living Wiki | Built | WikiCompiler renders auto-promoted lessons into a scannable wiki with [[wiki-links]], cross-references, and topic grouping. Incremental compilation on file change. |
| Wiki quality gates | Built | Configurable promotion thresholds. Claims must meet confidence and quality gates before entering the wiki. Write-time validation, not post-hoc cleanup. |
| Wiki curator + scheduler | Built | Automated curation: scheduled synthesis reports, staleness detection, and maintenance passes. The wiki maintains itself. |
| Dashboard (metrics + management) | Built | Living Memory metrics, backlog, session management, approval workflow, real-time session monitoring. |
| Lesson connections (graph edges) | Built | Typed connections with optional reason text for propositional links. Backlinks for bidirectional traversal. |
| Capability | Status | Detail |
|---|---|---|
| Trust tiers + memory integrity | Designed | ADR written. Content trust tiers, read-time validation, and write-event audit log sufficient for rollback. |
| Alexandria knowledge graph | Building | Full propositional knowledge graph with entity-centric claims, graph traversal, and progressive disclosure. Being built for personal and intra-team use, architected for scale. |
Cross-platform validation: The M/L/P architecture has now been validated on two entirely different platforms. On the PDT (TypeScript / Node.js / SQLite), agents coordinate through a shared backlog with living memory. On Strata (Python / FastAPI / PostgreSQL + pgvector), eight governed agents collaborate through scoped repositories with decision traces. The same architectural principles — governed writes, learning loops, cost-conscious inference — apply naturally to both. This gives us confidence the patterns are transferable, not framework-specific.
The architecture spans three layers with explicit boundaries between them. Failures concentrate at those boundaries: conflating a retrieval problem (Personalization) with a storage problem (Memory) or an extraction problem (Learning) is the most common design mistake in agent memory systems.
flowchart TB
subgraph P["Personalization — Real-time serving
Context engineering (Cherny)"]
direction LR
gc["getContext"]
rf["Retrieval filters"]
tb["Token budgets"]
pd["Progressive disclosure"]
gc --- rf --- tb --- pd
end
subgraph L["Learning — Async extraction
Structured artifacts (PAL, Karpathy)"]
direction LR
refl["Session-end reflection"]
dp["Digestive pipeline"]
cond["Condensation"]
forg["Forgetting"]
refl --- dp --- cond --- forg
end
subgraph M["Memory — Storage infrastructure
Vault as identity (Cornelius), File-backed state (NLAH)"]
direction LR
les["Lessons DB"]
conn["Connections"]
ent["Entities"]
sp["Scratchpads"]
dt["Decision traces"]
les --- conn --- ent --- sp --- dt
end
P -->|"Usage signals + outcomes"| M
M -->|"Raw material"| L
L -->|"Structured knowledge"| P
Standing on strong shoulders. This architecture didn't emerge in isolation. It synthesizes ideas from Karpathy (Software 2.0, LLM OS — the model is the program, memory is the file system), Cherny (context engineering as a systems-level discipline, not prompt crafting), PAL (structured code-as-reasoning artifacts, not free-form prose), Cornelius (the vault constitutes identity — "the vault is the riverbed, the sessions are the water"), and NLAH/Pan et al. (six-component harness decomposition, file-backed state as the #1 most valuable module). The M/L/P naming draws from MAPLE. Each layer has different failure modes and maintenance rhythms. The memory layer is an active participant in its own maintenance (agentic memory), not a passive store.
Every agent session follows the same lifecycle. The loop is the same whether the agent is working on a core platform ticket or a Salesforce development task.
sequenceDiagram
participant Agent
participant API as HTTP API
participant Store as Learning Store
participant Pipeline as Digestive Pipeline
Agent->>API: POST /sessions (register)
API-->>Agent: Session ID + scratchpad
Agent->>API: GET /sessions/{id}/context
API->>Store: Query lessons, precedent (workflow-scoped)
Store-->>API: Relevant lessons + decision traces
API-->>Agent: Context payload (token-budgeted)
Note over Agent: Agent works (skills, tools, code)
Note over Agent: Updates scratchpad every 3-5 actions
Agent->>API: DELETE /sessions/{id} (end session)
Note over Agent: Includes: summary, reflection,
lessons, decision traces
API->>Store: Write lessons + traces (scoped)
API->>Pipeline: Queue digestive processing
Pipeline->>Store: Reduce → Reflect → Reweave → Verify → Archive
Note over Store: Next getContext returns
these lessons
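The lifecycle above can be sketched as a minimal in-memory simulation. The real system runs over the HTTP API in the diagram; the class names, claim shape, and workflow labels here are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class LearningStore:
    lessons: list = field(default_factory=list)

    def query(self, workflow: str) -> list:
        # Workflow-scoped retrieval: only lessons from the same partition.
        return [l for l in self.lessons if l["workflow"] == workflow]

    def write(self, new_lessons: list) -> None:
        self.lessons.extend(new_lessons)

@dataclass
class Session:
    store: LearningStore
    workflow: str
    scratchpad: dict = field(default_factory=dict)  # created at registration

    def get_context(self) -> list:
        return self.store.query(self.workflow)

    def end(self, summary: str, new_lessons: list) -> None:
        # Session-end reflection: structured lessons are written back,
        # scoped to this session's workflow, then queued for digestion.
        self.store.write([{**l, "workflow": self.workflow} for l in new_lessons])

store = LearningStore()
s1 = Session(store, workflow="pdt")
s1.end("fixed flaky test", [{"claim": "retry CI step before failing"}])

s2 = Session(store, workflow="pdt")
print(len(s2.get_context()))   # the next PDT session sees the lesson

s3 = Session(store, workflow="sfdc")
print(len(s3.get_context()))   # the SFDC partition is unaffected
```

The key property this illustrates: knowledge written at session end becomes context for the next session, but only within the same workflow partition.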
Raw knowledge entering the system is not yet useful. It is food, not capability. The digestive pipeline transforms it. Each phase runs in isolation with fresh context to prevent contamination between steps.
flowchart LR
R["Reduce
Raw → atomic claims"]
RE["Reflect
Find connections"]
RW["Reweave
Update existing"]
V["Verify
Schema + dedup"]
A["Archive
Mark processed"]
R --> RE --> RW --> V --> A
| Phase | What it does | Analogy |
|---|---|---|
| Reduce | Extract atomic claims from source material. A 2000-word battlecard yields five entity-anchored claims. The rest is discarded. | Digestion: break down bulk into building blocks. |
| Reflect | Find connections. Where does each new claim link to what already exists? Connections make both claims more retrievable. | Cross-referencing: linking new knowledge to existing understanding. |
| Reweave | Backward maintenance. "If this existing claim were written today, knowing what we now know, what would be different?" New material strengthens existing structures. | Revision: the graph gets stronger claims, not just more claims. |
| Verify | Immune system. Schema validation, retrieval filter quality check, entity resolution, duplicate detection. Malformed claims never enter the graph. | Type checker: correctness checks on installed software. |
| Archive | Processed source material exits the active workspace. What remains is extracted value: clean, connected, integrated. | Waste removal: keep the nutrients, release the bulk. |
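The five phases above can be sketched as pure functions chained over a claims list. Function names and the claim dictionary shape are assumptions for illustration; the real phases run in isolated contexts via the worker pool.

```python
def reduce_phase(raw_text: str) -> list[dict]:
    # Reduce: extract atomic claims; the rest of the bulk is discarded.
    return [{"claim": line.strip(), "links": []}
            for line in raw_text.splitlines() if line.strip()]

def reflect_phase(claims: list[dict], graph: list[dict]) -> list[dict]:
    # Reflect: connect each new claim to existing claims (here, naive
    # keyword overlap stands in for real connection finding).
    for c in claims:
        c["links"] = [g["claim"] for g in graph
                      if set(c["claim"].split()) & set(g["claim"].split())]
    return claims

def reweave_phase(graph: list[dict], claims: list[dict]) -> list[dict]:
    # Reweave: flag existing claims that new material touches for revision.
    touched = {link for c in claims for link in c["links"]}
    for g in graph:
        g["needs_review"] = g["claim"] in touched
    return graph

def verify_phase(claims: list[dict]) -> list[dict]:
    # Verify: schema check + dedup; malformed claims never enter the graph.
    seen, out = set(), []
    for c in claims:
        if c.get("claim") and c["claim"] not in seen:
            seen.add(c["claim"])
            out.append(c)
    return out

graph = [{"claim": "pgvector needs an ivfflat index"}]
raw = "pgvector queries are slow without an index\n\npgvector queries are slow without an index"
new = verify_phase(reflect_phase(reduce_phase(raw), graph))
graph = reweave_phase(graph, new)   # Archive would then retire `raw`
```

Note how the duplicate claim in `raw` is removed by Verify, and Reweave flags the pre-existing claim for backward maintenance because new material links to it.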
Retrieval is purpose-aware and token-budgeted. The same entity returns different claim subsets depending on the consumer's workflow and task.
flowchart TB
Q["Agent calls getContext
(workflow, task, scopes)"]
F["Filter by workflow + scope
(project → team → framework → org)"]
R["Rank by: priority × recency
× tag match × value-per-token"]
B["Apply token budget
(cap total context size)"]
BP["Berrypicking: multi-pass
(reassess, refine, re-query)"]
A["Assemble response:
lessons + precedent + (optional) Alexandria index"]
Q --> F --> R --> B --> BP --> A
Retrieval spans four layers, from narrow to broad:
| Layer | Scope |
|---|---|
| Project | This task, this ticket |
| Team / domain | Shared best practices |
| Framework | Cross-workflow lessons |
| Organization | Entity graph, org-wide |
Retrieval isn't always one-shot. The system can do multiple passes: an initial pass finds relevant lessons, the retrieval logic reassesses ("given what I found, what else do I need?"), and runs another pass with a refined query.
This mirrors how people actually look things up. The search evolves as understanding grows.
The graph is entity-centric: it models the real things the organization cares about and the relationships between them. Entities are the nouns of institutional knowledge.
flowchart LR
subgraph SRC["Sources"]
direction TB
s1["Agent reflections"]
s2["Research"]
s3["Conversations"]
s4["Events"]
end
subgraph INC["Input Contract"]
c1["entity + claim +
source + confidence +
timestamp + author"]
end
subgraph KG["Knowledge Graph"]
direction TB
e["Entities"]
r["Relationships"]
kc["Claims"]
b["Backlinks"]
e --- r
r --- kc
kc --- b
end
subgraph OC["Output Contract"]
o1["Purpose-aware
context assembly"]
end
subgraph CON["Consumers"]
direction TB
a1["Agents"]
a2["Dashboards"]
a3["Decision support"]
a4["Search"]
end
SRC --> INC --> KG --> OC --> CON
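A minimal sketch of the graph's shape: claims attach to entities via the input contract, and typed connections get automatic backlinks for bidirectional traversal. Class names, field names, and the example claims are assumptions for illustration.

```python
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self) -> None:
        self.claims_by_entity: dict = defaultdict(list)
        self.edges: list = []                      # (src, type, dst, reason)
        self.backlinks: dict = defaultdict(list)   # dst -> [(src, type)]

    def add_claim(self, entity, claim, source, confidence, timestamp, author):
        # The single input contract: every source publishes the same shape.
        record = {"entity": entity, "claim": claim, "source": source,
                  "confidence": confidence, "timestamp": timestamp,
                  "author": author}
        self.claims_by_entity[entity].append(record)
        return record

    def connect(self, src_claim, edge_type, dst_claim, reason=""):
        # Typed connection with optional reason text for propositional links.
        self.edges.append((src_claim, edge_type, dst_claim, reason))
        # Backlink enables traversal from the target side too.
        self.backlinks[dst_claim].append((src_claim, edge_type))

g = KnowledgeGraph()
a = g.add_claim("pgvector", "needs ivfflat index", "session-42", 0.9, 0, "agent")
b = g.add_claim("Strata", "uses pgvector", "session-17", 0.8, 0, "agent")
g.connect(b["claim"], "SUPPORTS", a["claim"], "index advice applies to Strata")
```

The backlink is written at connect time, not computed at query time, so bidirectional traversal costs a dictionary lookup.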
Claims are installed capabilities. When an agent loads a knowledge claim, it gains a capability it didn't have before. A claim about competitive positioning enables the agent to reason about strategy. A vague claim is a buggy function. A stale claim is a regression. Quality gates are correctness checks on installed software, not tidiness conventions.
Not every agent should learn. The architecture serves four types of agents, distinguished by their relationship to knowledge. Moving along this spectrum is incremental.
flowchart LR
subgraph S["Stateless Tools"]
s1["Code formatter
Data validator"]
end
subgraph C["Knowledge Consumers"]
c1["Pipeline reporter
Meeting prep agent"]
end
subgraph L["Scoped Learners"]
l1["Coding assistant
Sales assistant"]
end
subgraph F["Full-Metabolism Agents"]
f1["Strategic architect
Business analyst"]
end
S -->|"Add graph reads"| C
C -->|"Add feedback loop"| L
L -->|"Add metabolic machinery"| F
| Type | % of fleet | Reads graph | Learns | Full metabolism |
|---|---|---|---|---|
| Stateless tools | 65-70% | No | No | No |
| Knowledge consumers | 20-25% | Yes | No | No |
| Scoped learners | 5-10% | Domain slice | Bounded | No |
| Full-metabolism agents | 1-3% | Full graph | Yes | Yes |
The key design choice: the graph must serve stateless tools and knowledge consumers (the majority) with minimal friction. A single API call returns what they need. No session registration, no scratchpad. The metabolic machinery exists for the 1-3% that justify it.
We pressure-tested the architecture against 30+ external sources across 10+ sessions. Seven validated strengths. Five remaining gaps.
- Propositional framing ("Since [[X]], therefore Y"): no evaluated system has an equivalent. Cornelius's "notes as callable functions" framing independently converges with this approach.
- Memory evolution (decay, condensation, supersession): ahead of the field; most systems implement only formation and retrieval.
- Forgetting design: the most complete in any source evaluated.
- 3D hierarchical token-level memory: placed at the apex of the 2026 academic survey's taxonomy.
- Session lifecycle + scratchpad: independently validated by Anthropic, MAPLE, NLAH (file-backed state is the #1 most valuable harness module), and the academic survey.
- Harness engineering: failure taxonomy, golden principles, evaluator system with verifier alignment logging, and eval suites — all built and running. Implements all six NLAH required components (Contracts, Roles, Stage Structure, Adapters, State Semantics, Failure Taxonomy).
- Context engineering as architecture: token-budgeted getContext, berrypicking retrieval, progressive disclosure, and workflow-scoped serving. Per Cherny: the bottleneck isn't the model; it's the context the model receives.
| Gap | Impact | How we close it |
|---|---|---|
| Memory integrity and trust model | No content trust tiers. No read-time validation. No write-event audit log sufficient for rollback. Defenses are write-time schema/format correctness only. | ADR written. Trust tiers, audit log, and read-time integrity checks designed. Implementation is the next priority for the knowledge graph layer. |
| Security / blast radius of poisoned claims | Alexandria's propositional framing means a poisoned claim is reasoned from, not just referenced. Higher epistemic authority means higher blast radius. Three attack vectors identified: direct lesson injection, reflection poisoning, and pipeline injection. | Mitigations designed (trust tiers + audit log + integrity checks). The harness engineering layer (guardrails, eval system, verifier alignment) provides the governance structure; trust tiers extend it to knowledge claims. |
| Context pressure for long-running agents | Session-boundary model works well for bounded tasks. Breaks at 1+ hour sessions. We have context rotation signals and a failure taxonomy entry; we need architectural support. | HumanLayer's Research→Plan→Implement workflow (context compaction between phases) addresses this. Our existing scratchpad rotation and context budget zones (40-60% target) are partial solutions. Full architectural support — mid-session reflection triggers — is next. |
| Conflict resolution (clash) | When getContext returns contradictory lessons, no resolution logic. Agent receives contradictory signals with no guidance. | Design in progress. The knowledge graph's typed connections enable contradiction detection (CONTRADICTS edge type). Resolution requires either recency-wins, confidence-wins, or escalate-to-human logic. Per NLAH: verifier alignment data (which we now collect) can inform which resolution strategy matches human judgment. |
| No quantitative benchmarks | Can't quantitatively prove "memory helps" or "graph beats filesystem." Letta scores 74% on LoCoMo with a plain filesystem. | We now have eval suites infrastructure (YAML-defined tasks, grader registry, result tracking). Need to build memory-specific eval tasks that test what we're actually good at: discovery, reasoning, evolution, and governance — not just retrieval speed. |
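The clash resolution logic in the gaps table is still a design in progress; one way to make the options concrete is a pluggable strategy function. This is a sketch under that design, not built behavior — the claim shapes and strategy names are assumptions.

```python
def resolve(a: dict, b: dict, strategy: str = "recency") -> dict:
    """Resolve two contradicting claims (a CONTRADICTS edge was detected)."""
    if strategy == "recency":
        # Recency-wins: the newer claim supersedes the older one.
        return a if a["timestamp"] >= b["timestamp"] else b
    if strategy == "confidence":
        # Confidence-wins: the better-evidenced claim prevails.
        return a if a["confidence"] >= b["confidence"] else b
    # Default: escalate-to-human — surface both claims for review.
    return {"escalate": [a, b]}

old = {"claim": "use REST API", "timestamp": 100, "confidence": 0.9}
new = {"claim": "use Bulk API", "timestamp": 200, "confidence": 0.6}
```

Note that the two automatic strategies disagree on this pair (recency picks the new claim, confidence picks the old one), which is exactly why verifier alignment data is needed to choose between them.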
The benchmarking reality: For simple retrieval, the mechanism doesn't matter. Agents search well with any tool. Our graph's value is not faster retrieval. It is discovery (surfacing connections the agent never searched for), reasoning (propositional claims as installed capabilities), evolution (forgetting at scale), and governance (trust tiers). These don't show up on standard benchmarks. They show up in multi-agent, cross-domain scenarios. We need to build evaluation methods that test what we're actually good at.
The vision is federated memory with two tiers and a governed bridge. We're building this for personal and intra-team use today, architected to scale. The local tier is operational. The organizational tier is where the knowledge graph, propositional claims, and graph traversal live. Quality gates apply at the promotion boundary, not at local writes.
flowchart TB
subgraph today["Current State (operational)"]
L3["Living Memory store
(lessons, connections, entities)"]
L4["Agentic learning loop
(getContext, reflection, shared)"]
harness["Harness engineering
(evals, guardrails, failure taxonomy)"]
L3 --> L4
L4 --> harness
end
subgraph target["Target State (in progress)"]
local["Local tier
.md files, git-backed
personal/team scope
Cornelius: vault as identity"]
bridge["Bridge
Quality gates at
promotion boundary
Wiki gates (built)"]
org["Alexandria graph
Propositional claims
graph traversal
Karpathy: LLM OS file system"]
ctx["Context serving
Token-budgeted
berrypicking, disclosure
Cherny: context engineering"]
local <-->|"query / promote"| bridge
bridge <-->|"governed"| org
org --> ctx
end
L4 -.->|"feeds"| bridge
L3 -.->|"foundation"| org
The filesystem-vs-graph divide is not either/or. Individual developers use lightweight, file-based memory (markdown in a workspace, git-backed). Per Cornelius: "the vault is the riverbed, the sessions are the water." The graph tier provides the governed, cross-domain knowledge layer. The architecture serves both without forcing one onto the other.
.md files in workspace, git-backed, personal/team scope. Slash commands (/note, /remember). The agent becomes itself by reading what previous sessions left behind. Low ceremony, high velocity. No quality gates at local writes.
Knowledge graph as the "file system" of the agent operating system. Cross-domain, propositional, governed, forgetting-enabled. Claims become installed capabilities. Agents are processes; the graph is the shared state they read from and write to.
Local projects query Alexandria for best practices and precedent. Local learnings promote to Alexandria when they have broader value. Quality gates (wiki gates, confidence thresholds) apply at the promotion boundary. Per PAL: structured artifacts flow upward, not free-form prose.
Any source publishes knowledge into the graph through one contract: {entity, claim, source, confidence, timestamp, author}. Per Cornelius: notes aren't reminders — they're callable functions. A well-formed claim is a capability the agent gains on load. The framework grows by accretion, not revolution.
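A write-time check of that contract might look like the following. The required fields come from the contract above; the confidence range, error messages, and function name are illustrative assumptions.

```python
REQUIRED = ("entity", "claim", "source", "confidence", "timestamp", "author")

def validate_claim(payload: dict) -> list[str]:
    """Return a list of contract violations; empty means the claim may enter."""
    errors = [f"missing {field}" for field in REQUIRED if field not in payload]
    conf = payload.get("confidence")
    if conf is not None and not (0.0 <= conf <= 1.0):
        errors.append("confidence must be in [0, 1]")
    return errors

ok = validate_claim({"entity": "pgvector", "claim": "needs ivfflat index",
                     "source": "session-42", "confidence": 0.9,
                     "timestamp": "2025-01-01", "author": "agent-7"})
bad = validate_claim({"entity": "pgvector", "confidence": 1.4})
```

Because every source passes through this one contract, the gate is enforced once at the write boundary rather than re-checked by every consumer — the write-time validation posture the wiki quality gates already take.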
The strategic opportunity: The industry is converging on graphs, agents, learning loops, and forgetting. No one in the evaluated landscape is building the full knowledge engineering discipline: dependency tracking, impact analysis, capability testing, deprecation management for knowledge claims. This moves from a novel framing to a novel discipline — genuinely unprecedented and difficult to replicate.