A purpose-built development operating system where specialized agents decompose, implement, test, and validate work end-to-end — under governance, with continuity at scale.
What Is This?
Autogenous Synthesis (Forge) is a purpose-built engineering framework that makes agent-driven software delivery reliable, auditable, and continuously improving.
The backlog operates as a live state machine with transactional guarantees. Agents register sessions, claim work, check for file conflicts, execute under structured procedures, generate and validate test suites, and submit for review — without a human orchestrating each step.
The human role: set direction, define acceptance criteria, review output. Everything between is Forge.
AI-Assisted vs. AI-First
Most AI-augmented development keeps the human as the message bus: read the ticket, write the prompt, paste the output, update the board. The AI is a faster keyboard. Forge is structured differently — it’s the surrounding system that makes agents reliable actors rather than capable tools.
| | Typical AI-Augmented Workflow | Forge |
|---|---|---|
| Work state | Ticket in task manager; prompt in chat; output pasted back manually | Live state machine — agents claim work, execute, and submit via API; all transitions are logged |
| Acceptance criteria | Prose description in ticket body | Structured ISC table; at least one machine-verifiable check required; all-prose tickets are rejected by the v2 validator |
| Code quality gate | Linter and test suite; code review is unstructured human judgment, no formal pass/fail criteria | Linter + quality test suite + blocking code review gate; LLM agent review layer added for scope-sensitive changes |
| TDD | Convention; often skipped under time pressure | Mechanically enforced by the verification gate; a submission without a RED → GREEN trace fails before reaching review |
| Context across sessions | Context limit means starting over; prior work is lost or repeated | Structured scratchpad persists the full execution checkpoint; the next session reads it and continues from the last completed step |
| Risk routing | Uniform treatment for all changes | risk_class (trivial / standard / sensitive) determines gate strictness, model assignment, and approval path per ticket |
Four Defining Properties
Agent-First
Agents serve as primary builders, not developer assistants. Every component — session protocol, backlog schema, SOP library, work packages — is designed to maximize what agents accomplish reliably without hand-holding. Work is legible, bounded, and enforceable.
High Throughput
A single engineer runs multiple agent sessions in parallel across unrelated concerns. File lock arbitration prevents collisions before agents are dispatched. Work that would serialize for a solo engineer runs concurrently across the network.
High Confidence
Implementation is test-driven mechanically. RED → GREEN → REFACTOR is enforced by the verification gate, not just a guideline. Machine-verifiable acceptance criteria (ISC format) make pass/fail binary. The evaluator auto-resolves routine reviews. No work reaches completion without passing automated validation.
Continuous Learning
Agent sessions generate structured scratchpads that persist across context limits. The living memory system — with digestive pipeline, auto-promotion gates, and decay — means agents begin sessions knowing what the fleet has already learned. Knowledge compounds across every session, every agent, and every project.
The Backlog: Live State Machine
Every backlog item moves through five stages with automated transitions:
| Stage | Count (example) | What happens |
|---|---|---|
| Intake | 15 items | Needs acceptance criteria written |
| Ready | 8 items | AC verified, work package auto-generated |
| In Progress | 1 item | Agent claimed, scratchpad active |
| Review | 2 items | Eval runs first; human only if ambiguous |
| Done | 8 items | Validated, merged, session archived |
All state changes route through a single HTTP API server with atomic writes, 10-retry exponential backoff, and file-lock conflict detection. Every transition is an API call with a full audit trail — no manual drag-and-drop, no ambiguous state.
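The retry discipline behind those API calls can be sketched as a small helper. The function names and base delay below are illustrative, not Forge's actual client; only the 10-retry exponential-backoff behavior comes from the doc:

```python
import time
from typing import Callable

def backoff_delays(retries: int = 10, base: float = 0.1) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def with_retries(call: Callable[[], bool], retries: int = 10, base: float = 0.1) -> bool:
    """Retry `call` (e.g. an HTTP state-transition request) until it
    succeeds or the backoff schedule is exhausted."""
    for delay in backoff_delays(retries, base):
        if call():
            return True
        time.sleep(delay)
    return False
```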
Ticket Schema v2
Every backlog ticket is validated against a structured schema before it can be claimed. Key required fields beyond title and description:
- `risk_class` — trivial | standard | sensitive. Controls gate strictness, which model tier is assigned, and the approval path. Sensitive tickets stay in human-supervised paste mode indefinitely.
- `file_scope` — Declared list of expected file paths and a max-file-count lock. Prevents agents from drifting outside the intended change boundary.
- `acceptance_criteria` — Structured array. At least one entry must have a machine-verifiable `kind` (test / grep / http / snapshot). All-prose AC is rejected by the API validator.
- `verification` — Exact commands and a timeout, so the agent knows precisely how to confirm the work is done.
Legacy v1 tickets (prose-only description, no structured fields) are read-only. Attempting to claim one returns HTTP 400 `LEGACY_V1_BLOCKED`.
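The validator rules above can be condensed into a sketch. The function shape and error codes (other than `LEGACY_V1_BLOCKED`) are assumptions for illustration, not Forge's actual implementation:

```python
# Sketch of the v2 ticket validator described above. Field names follow
# the doc; the error-code strings (except LEGACY_V1_BLOCKED) are invented.
MACHINE_KINDS = {"test", "grep", "http", "snapshot"}

def validate_ticket(ticket: dict) -> list[str]:
    """Return a list of validation errors; empty means claimable."""
    if ticket.get("schema_version", 1) < 2:
        return ["LEGACY_V1_BLOCKED"]          # v1 tickets are read-only
    errors = []
    if ticket.get("risk_class") not in {"trivial", "standard", "sensitive"}:
        errors.append("INVALID_RISK_CLASS")
    if not ticket.get("file_scope"):
        errors.append("MISSING_FILE_SCOPE")
    ac = ticket.get("acceptance_criteria", [])
    if not any(c.get("kind") in MACHINE_KINDS for c in ac):
        errors.append("NO_MACHINE_VERIFIABLE_AC")  # all-prose AC rejected
    if not ticket.get("verification"):
        errors.append("MISSING_VERIFICATION")
    return errors
```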
Acceptance Criteria: Machine-Verifiable
Every ready-stage item includes an ISC (Ideal State Criteria) table: binary pass/fail conditions with explicit verification methods. No subjective assessments.
Example:
AC1: Migrations 029–038 applied cleanly → verified: alembic current == head
AC2: All 6 new models importable → verified: pytest -k test_models
AC3: Rollback to 028 succeeds → verified: alembic downgrade -1
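Each `verified:` command above is executable with a binary outcome, which is what makes the ISC table machine-checkable. A minimal runner sketch — the interface is assumed, not Forge's published API:

```python
import subprocess

def run_verification(commands: list[str], timeout: int) -> dict[str, bool]:
    """Run each verification command and record binary pass/fail.
    A non-zero exit code or a timeout both count as failure."""
    results = {}
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, shell=True, capture_output=True,
                                  timeout=timeout)
            results[cmd] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[cmd] = False
    return results
```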
Quality Pipeline
TDD as Infrastructure
All implementation work is test-driven mechanically — not by convention. Any submission without a RED → GREEN → REFACTOR trace fails the verification gate before reaching review. Stub implementations (`throw new Error('Not implemented')`, empty returns, TODO placeholders) in production files are caught by pre-commit hooks and rejected. If full implementation isn’t possible in scope, a new backlog item is created — not a placeholder left in code.
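A stub scan of this kind is a simple pattern match over staged files. The pattern list below is an illustrative guess; the actual pre-commit hook's patterns are not published:

```python
import re

# Hypothetical stub patterns; the real hook's list may be broader.
STUB_PATTERNS = [
    r"throw new Error\(['\"]Not implemented['\"]\)",  # JS/TS stub throw
    r"raise NotImplementedError",                     # Python stub
    r"\bTODO\b",                                      # TODO placeholder
]

def find_stubs(source: str) -> list[str]:
    """Return the stub patterns found in a production file's source."""
    return [p for p in STUB_PATTERNS if re.search(p, source)]
```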
3-Tier Code Review Gate
Code review is a blocking gate, not an advisory step. Every submission passes through three tiers:
Tier 1 — Automated metrics (always runs): Cyclomatic complexity, function length, nesting depth, duplication, import cycles, dead code, naming conventions, and TODO density are scored and weighted. Threshold: 60. Below it, the submission fails.
Tier 2 — Scope-triggered agent analysis: Activates when 3+ files change, 100+ lines change, or changes touch auth, security, approval-queue, or config paths. Also fires for all security and infrastructure ticket types, and whenever new dependencies are added. Minimum confidence 0.7 to auto-pass.
Tier 3 — Human review: Required when Tier 2 confidence is low, critical findings exist, or the ticket carries a high risk level, security type, or breaking-change / public-api tags.
The `bypass.codeReview` field exists for trivial tickets only and requires an explicit reason — it’s enforced at the schema level, not by convention.
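The Tier 2 trigger conditions listed above reduce to a predicate over the change set. This is a sketch with assumed parameter names, not the gate's actual code:

```python
# Illustrative sketch of the Tier 2 trigger logic described above.
SENSITIVE_PATH_SEGMENTS = ("auth", "security", "approval-queue", "config")

def tier2_required(files_changed: int, lines_changed: int,
                   paths: list[str], ticket_type: str,
                   new_dependencies: bool) -> bool:
    """True when scope-triggered agent analysis must run."""
    return (
        files_changed >= 3
        or lines_changed >= 100
        or any(seg in p for p in paths for seg in SENSITIVE_PATH_SEGMENTS)
        or ticket_type in {"security", "infrastructure"}
        or new_dependencies
    )
```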
Human-in-the-Loop Governance
Speed with bounded authority. Governance is automation-first, not approval-first.
Step 1 — Automated evaluation: When agents submit work, Eval runs immediately: acceptance criteria (ISC table), linting, regression tests. All checks are binary pass/fail with full audit trails persisted to SQLite.
Step 2 — Green → auto-done / Red → back to agent: All checks pass: ticket marks done, human notified. Any check fails: ticket routes back to the responsible agent with failure details. No human in either path.
Step 3 — Escalate only when uncertain: When Eval can’t determine pass/fail with confidence — genuinely ambiguous architectural decisions, novel failure patterns — it escalates to humans with a structured evidence package explaining exactly what passed, what’s uncertain, and why escalation occurred. This reduces review load by 80%+ without sacrificing oversight.
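The three steps above amount to a small routing decision. The type and outcome names here are assumptions for illustration; Eval's real interface is internal:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    checks_passed: bool   # all binary checks (ISC, lint, regression) green
    confident: bool       # Eval could determine pass/fail unambiguously

def route(result: EvalResult) -> str:
    """Route a submission per the governance steps: green -> auto-done,
    red -> back to agent, uncertain -> human escalation with evidence."""
    if not result.confident:
        return "escalate_to_human"
    return "done" if result.checks_passed else "return_to_agent"
```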
Guardrails Are Enforced, Not Suggested
Global guardrails in `.agent-data/guardrails/global.md` define hard mechanical rules:
- Never write to `.env*` files
- Never execute `rm -rf`
- Never force-push to main
- Never commit secrets
These aren’t reminders — they’re validated by automated test suites running on every change. Agents violating guardrail rules fail the verification gate.
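A guardrail check of this kind can be expressed as a pure function over a proposed change set. The detection logic below is a sketch, not the actual test suite:

```python
# Illustrative guardrail check; rule wording follows global.md above,
# but the detection heuristics are assumptions.
def violates_guardrails(changed_paths: list[str],
                        commands: list[str]) -> list[str]:
    """Return human-readable violations found in a proposed change set."""
    violations = []
    for path in changed_paths:
        if path.split("/")[-1].startswith(".env"):      # .env* write ban
            violations.append(f"write to {path}")
    for cmd in commands:
        if "rm -rf" in cmd:
            violations.append("rm -rf")
        if "--force" in cmd and "main" in cmd:          # force-push to main
            violations.append("force-push to main")
    return violations
```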
Sessions, Scratchpads & Continuity
No work is lost when context runs out.
Every agent session generates a structured scratchpad at .agent-data/scratchpads/active/. The scratchpad tracks:
- Decomposition plan and sub-task status
- What each sub-agent returned
- Whether results met acceptance criteria
- Accumulating synthesis (the actual output)
When agents hit context limits, they write state and stop cleanly. The next invocation reads the scratchpad, identifies the last completed checkpoint, and resumes from there. Zero restart cost. No duplicated work.
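Resuming from a scratchpad reduces to finding the first incomplete checkpoint. The JSON schema below is hypothetical — the doc specifies scratchpad locations and contents, not their serialization:

```python
import json

def resume_point(scratchpad_json: str) -> int:
    """Return the index of the first incomplete sub-task, so the next
    session continues from there instead of restarting. Assumes a
    hypothetical {"subtasks": [{"status": ...}, ...]} layout."""
    pad = json.loads(scratchpad_json)
    for i, task in enumerate(pad["subtasks"]):
        if task["status"] != "done":
            return i
    return len(pad["subtasks"])  # everything complete
```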
Work Packages
When Stewart promotes backlog items to ready, Forge auto-generates Work Packages — self-contained handoff documents containing:
- Problem statement and binary ISC table
- Exact file paths and modification requirements
- Quick-start commands to verify the environment
- Hard constraints and what not to touch
Work packages bridge “human writes ticket” and “agent starts work.” They’re the automated prompt-engineering layer that makes agent output reliable rather than hopeful.
Observability
The system knows what it’s doing and can show you.
- Daily health reports — Backlog state, session activity, agent throughput, test coverage metrics
- Event log — Every agent action logged to `.agent-data/events/event-log.yaml`: agent, ticket, action, timestamp
- Eval results — Every review decision records what was checked, which AC rows passed, what evidence was cited, and whether it was auto-resolved or escalated
- Scratchpad archive — Completed scratchpads archive to `.agent-data/scratchpads/archive/YYYY-MM/`. Every decision made during execution is preserved and auditable
What Changes: The Developer’s Role
“Forge doesn’t make you faster. It makes you an architect of systems that develop.”
Without Forge: Engineers work sequentially. One task at a time. Context switches are expensive. Tests get deferred. Documentation follows later — usually never. State lives in the engineer’s head. When they’re not working, nothing is.
With Forge: Multiple parallel agent sessions execute concurrently across unrelated concerns. Tests are generated before implementation. Documentation is produced alongside code. State persists and is auditable. The system works whether or not anyone is at a keyboard.
The key capability Forge creates isn’t “AI writes my code.” It’s institutional execution capacity — the ability to run more work, in parallel, with higher confidence, than any individual developer could sustain. The system gets measurably better over time because the memory system compounds what the agent network learns across every session.
Project Status
Autogenous Synthesis is the framework powering development of Strata — an autonomous job search operating system — and several other projects in the workspace. It is under active development.
- 12+ specialist agent skills operational
- 5 lifecycle stages with automated transitions
- 150+ automated test suites enforcing quality gates
- 80% review load reduction through automated evaluation
Recent Work
- Ticket schema v2 — Hard cutover complete. Structured AC, risk classification, file-scope declaration, and machine-verifiable verification commands are now required fields. Legacy v1 tickets are read-only.
- Blocking code review gate — The 3-tier review gate is now enforced on every submission. Tier 1 runs automated quality metrics; Tier 2 triggers agent analysis on scope-sensitive changes; Tier 3 escalates to human review when confidence is low.
- Living memory (Alexandria) — Session reflections are automatically extracted and stored. The `UserPromptSubmit` hook injects relevant prior context at session start, so agents begin with knowledge of what the fleet has already learned.
- Prompt injection detection — External content (job listings, scraped pages, API responses) passes through a sanitizer before reaching agent context. Injection patterns are detected and stripped.
- Planner self-check — The planning skill now refuses to produce a work package unless the ticket has at least one machine-verifiable acceptance criterion. All-prose tickets are blocked at the planning stage as well as at the API level.
Built by Andrew Crenshaw