A purpose-built development operating system where specialized agents decompose, implement, test, and validate work end-to-end — under governance, with continuity at scale.
What Is This?
Autogenous Synthesis (Forge) is a purpose-built engineering framework that makes agent-driven software delivery reliable, auditable, and continuously improving.
The backlog operates as a live state machine with transactional guarantees. Agents register sessions, claim work, check for file conflicts, execute under structured procedures, generate and validate test suites, and submit for review — without a human orchestrating each step.
The human role: set direction, define acceptance criteria, review output. Everything between is Forge.
AI-Assisted vs. AI-First
Most AI-augmented development keeps the human as the message bus: read the ticket, write the prompt, paste the output, update the board. The AI is a faster keyboard. Forge is structured differently — it’s the surrounding system that makes agents reliable actors rather than capable tools.
| | Typical AI-Augmented Workflow | Forge |
|---|---|---|
| Work state | Ticket in task manager; prompt in chat; output pasted back manually | Live state machine — agents claim work, execute, and submit via API; all transitions are logged |
| Acceptance criteria | Prose description in ticket body | Structured ISC table; at least one machine-verifiable check required; all-prose tickets are rejected by the v2 validator |
| Code quality gate | Linter and test suite; code review is unstructured human judgment, no formal pass/fail criteria | Linter + quality test suite + blocking code review gate; LLM agent review layer added for scope-sensitive changes |
| TDD | Convention; often skipped under time pressure | Mechanically enforced by the verification gate; a submission without a RED → GREEN trace fails before reaching review |
| Context across sessions | Context limit means starting over; prior work is lost or repeated | Structured scratchpad persists the full execution checkpoint; the next session reads it and continues from the last completed step |
| Risk routing | Uniform treatment for all changes | risk_class (trivial / standard / sensitive) determines gate strictness, model assignment, and approval path per ticket |
Four Defining Properties
Agent-First
Agents serve as primary builders, not developer assistants. Every component — session protocol, backlog schema, SOP library, work packages — is designed to maximize what agents accomplish reliably without hand-holding. Work is legible, bounded, and enforceable.
High Throughput
A single engineer runs multiple agent sessions in parallel across unrelated concerns. File lock arbitration prevents collisions before agents are dispatched. Work that would serialize for a solo engineer runs concurrently across the network.
High Confidence
Implementation is test-driven mechanically. RED → GREEN → REFACTOR is enforced by the verification gate, not just a guideline. Machine-verifiable acceptance criteria (ISC format) make pass/fail binary. The evaluator auto-resolves routine reviews. No work reaches completion without passing automated validation.
Continuous Learning
Agent sessions generate structured scratchpads that persist across context limits. The living memory system — with digestive pipeline, auto-promotion gates, and decay — means agents begin sessions knowing what the fleet has already learned. Knowledge compounds across every session, every agent, and every project.
The Backlog: Live State Machine
Every backlog item moves through five stages with automated transitions:
| Stage | Count (example) | What happens |
|---|---|---|
| Intake | 15 items | Needs acceptance criteria written |
| Ready | 8 items | AC verified, work package auto-generated |
| In Progress | 1 item | Agent claimed, scratchpad active |
| Review | 2 items | Eval runs first; human only if ambiguous |
| Done | 8 items | Validated, merged, session archived |
All state changes route through a single HTTP API server with atomic writes, 10-retry exponential backoff, and file-lock conflict detection. Every transition is an API call with a full audit trail — no manual drag-and-drop, no ambiguous state.
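The retry discipline behind those API calls can be sketched as a small helper. The function names and base delay below are illustrative, not Forge's actual client; only the 10-retry exponential-backoff behavior comes from the doc:

```python
import time
from typing import Callable

def backoff_delays(retries: int = 10, base: float = 0.1) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def with_retries(call: Callable[[], bool], retries: int = 10, base: float = 0.1) -> bool:
    """Retry `call` (e.g. an HTTP state-transition request) until it
    succeeds or the backoff schedule is exhausted."""
    for delay in backoff_delays(retries, base):
        if call():
            return True
        time.sleep(delay)
    return False
```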
Ticket Schema v2
Every backlog ticket is validated against a structured schema before it can be claimed. Key required fields beyond title and description:
- `risk_class` — trivial | standard | sensitive. Controls gate strictness, which model tier is assigned, and the approval path. Sensitive tickets stay in human-supervised paste mode indefinitely.
- `file_scope` — Declared list of expected file paths and a max-file-count lock. Prevents agents from drifting outside the intended change boundary.
- `acceptance_criteria` — Structured array. At least one entry must have a machine-verifiable `kind` (test / grep / http / snapshot). All-prose AC is rejected by the API validator.
- `verification` — Exact commands and a timeout, so the agent knows precisely how to confirm the work is done.
Legacy v1 tickets (prose-only description, no structured fields) are read-only. Attempting to claim one returns HTTP 400 `LEGACY_V1_BLOCKED`.
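The validator rules above can be condensed into a sketch. The function shape and error codes (other than `LEGACY_V1_BLOCKED`) are assumptions for illustration, not Forge's actual implementation:

```python
# Sketch of the v2 ticket validator described above. Field names follow
# the doc; the error-code strings (except LEGACY_V1_BLOCKED) are invented.
MACHINE_KINDS = {"test", "grep", "http", "snapshot"}

def validate_ticket(ticket: dict) -> list[str]:
    """Return a list of validation errors; empty means claimable."""
    if ticket.get("schema_version", 1) < 2:
        return ["LEGACY_V1_BLOCKED"]          # v1 tickets are read-only
    errors = []
    if ticket.get("risk_class") not in {"trivial", "standard", "sensitive"}:
        errors.append("INVALID_RISK_CLASS")
    if not ticket.get("file_scope"):
        errors.append("MISSING_FILE_SCOPE")
    ac = ticket.get("acceptance_criteria", [])
    if not any(c.get("kind") in MACHINE_KINDS for c in ac):
        errors.append("NO_MACHINE_VERIFIABLE_AC")  # all-prose AC rejected
    if not ticket.get("verification"):
        errors.append("MISSING_VERIFICATION")
    return errors
```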
Acceptance Criteria: Machine-Verifiable
Every ready-stage item includes an ISC (Ideal State Criteria) table: binary pass/fail conditions with explicit verification methods. No subjective assessments.
Example:
AC1: Migrations 029–038 applied cleanly → verified: alembic current == head
AC2: All 6 new models importable → verified: pytest -k test_models
AC3: Rollback to 028 succeeds → verified: alembic downgrade -1
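Each `verified:` command above is executable with a binary outcome, which is what makes the ISC table machine-checkable. A minimal runner sketch — the interface is assumed, not Forge's published API:

```python
import subprocess

def run_verification(commands: list[str], timeout: int) -> dict[str, bool]:
    """Run each verification command and record binary pass/fail.
    A non-zero exit code or a timeout both count as failure."""
    results = {}
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, shell=True, capture_output=True,
                                  timeout=timeout)
            results[cmd] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[cmd] = False
    return results
```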
Quality Pipeline
TDD as Infrastructure
All implementation work is test-driven mechanically — not by convention. Any submission without a RED → GREEN → REFACTOR trace fails the verification gate before reaching review. Stub implementations (`throw new Error('Not implemented')`, empty returns, TODO placeholders) in production files are caught by pre-commit hooks and rejected. If full implementation isn’t possible in scope, a new backlog item is created — not a placeholder left in code.
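A stub scan of this kind is a simple pattern match over staged files. The pattern list below is an illustrative guess; the actual pre-commit hook's patterns are not published:

```python
import re

# Hypothetical stub patterns; the real hook's list may be broader.
STUB_PATTERNS = [
    r"throw new Error\(['\"]Not implemented['\"]\)",  # JS/TS stub throw
    r"raise NotImplementedError",                     # Python stub
    r"\bTODO\b",                                      # TODO placeholder
]

def find_stubs(source: str) -> list[str]:
    """Return the stub patterns found in a production file's source."""
    return [p for p in STUB_PATTERNS if re.search(p, source)]
```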
3-Tier Code Review Gate
Code review is a blocking gate, not an advisory step. Every submission passes through three tiers:
Tier 1 — Automated metrics (always runs): Cyclomatic complexity, function length, nesting depth, duplication, import cycles, dead code, naming conventions, and TODO density are scored and weighted. Threshold: 60. Below it, the submission fails.
Tier 2 — Scope-triggered agent analysis: Activates when 3+ files change, 100+ lines change, or changes touch auth, security, approval-queue, or config paths. Also fires for all security and infrastructure ticket types, and whenever new dependencies are added. Minimum confidence 0.7 to auto-pass.
Tier 3 — Human review: Required when Tier 2 confidence is low, critical findings exist, or the ticket carries a high risk level, security type, or breaking-change / public-api tags.
The `bypass.codeReview` field exists for trivial tickets only and requires an explicit reason — it’s enforced at the schema level, not by convention.
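The Tier 2 trigger conditions listed above reduce to a predicate over the change set. This is a sketch with assumed parameter names, not the gate's actual code:

```python
# Illustrative sketch of the Tier 2 trigger logic described above.
SENSITIVE_PATH_SEGMENTS = ("auth", "security", "approval-queue", "config")

def tier2_required(files_changed: int, lines_changed: int,
                   paths: list[str], ticket_type: str,
                   new_dependencies: bool) -> bool:
    """True when scope-triggered agent analysis must run."""
    return (
        files_changed >= 3
        or lines_changed >= 100
        or any(seg in p for p in paths for seg in SENSITIVE_PATH_SEGMENTS)
        or ticket_type in {"security", "infrastructure"}
        or new_dependencies
    )
```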
Human-in-the-Loop Governance
Speed with bounded authority. Governance is automation-first, not approval-first.
Step 1 — Automated evaluation: When agents submit work, Eval runs immediately: acceptance criteria (ISC table), linting, regression tests. All checks are binary pass/fail with full audit trails persisted to SQLite.
Step 2 — Green → auto-done / Red → back to agent: All checks pass: ticket marks done, human notified. Any check fails: ticket routes back to the responsible agent with failure details. No human in either path.
Step 3 — Escalate only when uncertain: When Eval can’t determine pass/fail with confidence — genuinely ambiguous architectural decisions, novel failure patterns — it escalates to humans with a structured evidence package explaining exactly what passed, what’s uncertain, and why escalation occurred. This reduces review load by 80%+ without sacrificing oversight.
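The three steps above amount to a small routing decision. The type and outcome names here are assumptions for illustration; Eval's real interface is internal:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    checks_passed: bool   # all binary checks (ISC, lint, regression) green
    confident: bool       # Eval could determine pass/fail unambiguously

def route(result: EvalResult) -> str:
    """Route a submission per the governance steps: green -> auto-done,
    red -> back to agent, uncertain -> human escalation with evidence."""
    if not result.confident:
        return "escalate_to_human"
    return "done" if result.checks_passed else "return_to_agent"
```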
Guardrails Are Enforced, Not Suggested
Global guardrails in `.agent-data/guardrails/global.md` define hard mechanical rules:
- Never write to `.env*` files
- Never execute `rm -rf`
- Never force-push to main
- Never commit secrets
These aren’t reminders — they’re validated by automated test suites running on every change. Agents violating guardrail rules fail the verification gate.
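A guardrail check of this kind can be expressed as a pure function over a proposed change set. The detection logic below is a sketch, not the actual test suite:

```python
# Illustrative guardrail check; rule wording follows global.md above,
# but the detection heuristics are assumptions.
def violates_guardrails(changed_paths: list[str],
                        commands: list[str]) -> list[str]:
    """Return human-readable violations found in a proposed change set."""
    violations = []
    for path in changed_paths:
        if path.split("/")[-1].startswith(".env"):      # .env* write ban
            violations.append(f"write to {path}")
    for cmd in commands:
        if "rm -rf" in cmd:
            violations.append("rm -rf")
        if "--force" in cmd and "main" in cmd:          # force-push to main
            violations.append("force-push to main")
    return violations
```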
Sessions, Scratchpads & Continuity
No work is lost when context runs out.
Every agent session generates a structured scratchpad at .agent-data/scratchpads/active/. The scratchpad tracks:
- Decomposition plan and sub-task status
- What each sub-agent returned
- Whether results met acceptance criteria
- Accumulating synthesis (the actual output)
When agents hit context limits, they write state and stop cleanly. The next invocation reads the scratchpad, identifies the last completed checkpoint, and resumes from there. Zero restart cost. No duplicated work.
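Resuming from a scratchpad reduces to finding the first incomplete checkpoint. The JSON schema below is hypothetical — the doc specifies scratchpad locations and contents, not their serialization:

```python
import json

def resume_point(scratchpad_json: str) -> int:
    """Return the index of the first incomplete sub-task, so the next
    session continues from there instead of restarting. Assumes a
    hypothetical {"subtasks": [{"status": ...}, ...]} layout."""
    pad = json.loads(scratchpad_json)
    for i, task in enumerate(pad["subtasks"]):
        if task["status"] != "done":
            return i
    return len(pad["subtasks"])  # everything complete
```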
Work Packages
When Stewart promotes backlog items to ready, Forge auto-generates Work Packages — self-contained handoff documents containing:
- Problem statement and binary ISC table
- Exact file paths and modification requirements
- Quick-start commands to verify the environment
- Hard constraints and what not to touch
Work packages bridge “human writes ticket” and “agent starts work.” They’re the automated prompt-engineering layer that makes agent output reliable rather than hopeful.
Observability
The system knows what it’s doing and can show you.
- Daily health reports — Backlog state, session activity, agent throughput, test coverage metrics
- Event log — Every agent action logged to `.agent-data/events/event-log.yaml`: agent, ticket, action, timestamp
- Eval results — Every review decision records what was checked, which AC rows passed, what evidence was cited, and whether it was auto-resolved or escalated
- Scratchpad archive — Completed scratchpads archive to `.agent-data/scratchpads/archive/YYYY-MM/`. Every decision made during execution is preserved and auditable
What Changes: The Developer’s Role
“Forge doesn’t make you faster. It makes you an architect of systems that develop.”
Without Forge: Engineers work sequentially. One task at a time. Context switches are expensive. Tests get deferred. Documentation follows later — usually never. State lives in the engineer’s head. When they’re not working, nothing is.
With Forge: Multiple parallel agent sessions execute concurrently across unrelated concerns. Tests are generated before implementation. Documentation is produced alongside code. State persists and is auditable. The system works whether or not anyone is at a keyboard.
The key capability Forge creates isn’t “AI writes my code.” It’s institutional execution capacity — the ability to run more work, in parallel, with higher confidence, than any individual developer could sustain. The system gets measurably better over time because the memory system compounds what the agent network learns across every session.
Project Status
Autogenous Synthesis is the framework powering development of Strata — an autonomous job search operating system — and several other projects in the workspace. It is under active development.
- 12+ specialist agent skills operational
- 5 lifecycle stages with automated transitions
- 150+ automated test suites enforcing quality gates
- 80% review load reduction through automated evaluation
Recent Work
- Ticket schema v2 — Hard cutover complete. Structured AC, risk classification, file-scope declaration, and machine-verifiable verification commands are now required fields. Legacy v1 tickets are read-only.
- Blocking code review gate — The 3-tier review gate is now enforced on every submission. Tier 1 runs automated quality metrics; Tier 2 triggers agent analysis on scope-sensitive changes; Tier 3 escalates to human review when confidence is low.
- Living memory (Alexandria) — Session reflections are automatically extracted and stored. The `UserPromptSubmit` hook injects relevant prior context at session start, so agents begin with knowledge of what the fleet has already learned.
- Prompt injection detection — External content (job listings, scraped pages, API responses) passes through a sanitizer before reaching agent context. Injection patterns are detected and stripped.
- Planner self-check — The planning skill now refuses to produce a work package unless the ticket has at least one machine-verifiable acceptance criterion. All-prose tickets are blocked at the planning stage as well as at the API level.
Built by Andrew Crenshaw