πŸ— ARCHITECTURE 5 min read

How I designed the SDLC state machine for agentic coding

Eight stages, two human gates, four memory layers. Why this exact shape, and what I tried that didn't work.

The architecture page on the site shows the state machine as a clickable diagram. This post walks through the design decisions that produced that shape, including the alternatives I tried and abandoned.

The pipeline, in one diagram

$ init
  └─ archetype-detect
       └─ architect          (ARCH.md)
            └─ ⚐ gate: plan         ← human approval #1
                 └─ pm              (decompose, parallelism plan)
                      └─ senior-dev × N   (parallel git worktrees)
                           └─ specialist review × 5  (parallel)
                                └─ ⚐ gate: ship       ← human approval #2
                                     └─ devops        (merge, deploy)
                                          └─ l3-support
                                               ↘ (incident pattern)
                                                  continuous-learner
                                                       ↗ (inject lesson into next architect run)

Eight runtime stages. Two human gates. One out-of-band feedback path (the dashed loop through continuous-learner). Every node maps to a single specialist agent.

Decision 1: number of human gates

I started with seven gates: plan, design-review, security-review, qa-review, performance-review, compliance-review, ship.

Every early user complained: "this is just waterfall, but now I am reviewing AI outputs instead of writing code." Each gate added 5-15 minutes of human reading. Seven gates × 10 minutes = 70 minutes of attention per feature. Worse than manual.

Two gates work because they sit at the right scope-vs-quality break points:

  1. plan, right after architect: you approve the shape of the feature (ARCH.md and the intended decomposition) before any code exists, while changing course is still cheap.
  2. ship, after all specialist reviews: you approve the final diff once, with every reviewer's verdict in front of you, right before merge and deploy.

Everything in between is the agents' problem. If reviewers disagree with each other, the gate surfaces the disagreement explicitly instead of papering it over. If they all agree, you approve quickly.

Three is not better than two. I tried three (added "design-review" between plan and pm) for a month. The middle gate added 8-12 minutes of reading time per feature, and in practice I approved 47 of 47 features at it unchanged. Removed.

Decision 2: parallelism is structural, not opportunistic

The pm step does not just decompose tasks. It explicitly tags tasks as [parallel] or [serial], schedules them into a DAG, and assigns each parallel task its own git worktree.
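
To make that concrete, here is a minimal Python sketch of the shape of a pm plan. The Task fields, the .worktrees/ path, and the scheduling rule are illustrative, not the plugin's actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        id: str
        mode: str                             # "parallel" or "serial"
        depends_on: list[str] = field(default_factory=list)
        worktree: str | None = None           # filled in for parallel tasks

    def schedule(tasks: list[Task]) -> list[list[Task]]:
        """Order tasks into waves: parallel tasks whose dependencies are
        met share a wave (one git worktree each); a serial task runs alone."""
        done: set[str] = set()
        waves: list[list[Task]] = []
        pending = list(tasks)
        while pending:
            ready = [t for t in pending if set(t.depends_on) <= done]
            if not ready:
                raise ValueError("dependency cycle in task DAG")
            serial = [t for t in ready if t.mode == "serial"]
            wave = serial[:1] if serial else ready   # serial tasks run solo
            for t in wave:
                if t.mode == "parallel":
                    t.worktree = f".worktrees/{t.id}"
            waves.append(wave)
            done |= {t.id for t in wave}
            pending = [t for t in pending if t.id not in done]
        return waves

    plan = schedule([
        Task("api", "parallel"),
        Task("ui", "parallel"),
        Task("integrate", "serial", depends_on=["api", "ui"]),
    ])
    # -> [[api, ui], [integrate]]: api and ui run at once, each in its own worktree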

This matters because the temptation in an LLM pipeline is to serialize everything ("agents think too fast, let's not race them"). That's wrong. Modern agentic coding tools (Claude Code in particular) handle parallel worktrees cleanly. The bottleneck is human attention, not compute. Running 4 senior-devs in parallel for 38 minutes is the same total LLM cost as serial, but cuts wall-clock by 3-4×.

Reviewers are also parallel: 5 specialist reviewers run concurrently against the merged diff, each looking at a different aspect (QA, security, performance, archetype-specific compliance, code quality across 12 angles). They do not block each other.
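
The fan-out itself is unremarkable, which is the point. A sketch, where run_reviewer is a hypothetical stand-in for however the reviewer agent actually gets invoked:

    from concurrent.futures import ThreadPoolExecutor

    REVIEWERS = ["qa", "security", "performance", "compliance", "code-quality"]

    def run_reviewer(name: str, diff: str) -> dict:
        # Hypothetical stand-in for invoking one specialist reviewer agent.
        return {"reviewer": name, "findings": []}

    def review(diff: str) -> dict[str, dict]:
        # All five see the same diff and never block each other; the ship
        # gate later presents their verdicts side by side.
        with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
            futures = {n: pool.submit(run_reviewer, n, diff) for n in REVIEWERS}
            return {n: f.result() for n, f in futures.items()}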

The only serial steps are architect (must precede pm) and devops (must follow approval).

Decision 3: memory is per-project, per-org, and cross-project

The dashed loop in the diagram is the part most agentic coding tools skip. It is the part that compounds.

Four layers:

  1. Per-session. Conversation history. Disappears at session end. Cheap.
  2. Per-project (.great_cto/lessons.md). Decisions, rejected approaches, incident detection patterns. Survives session restarts. Git-trackable.
  3. Per-org (~/.great_cto/decisions.md). Patterns confirmed across ≥3 projects. Promoted from per-project after manual review. Used as Step 0 context for architect.
  4. Cross-project (incident patterns). Pattern hash + winning detection order, stored when a P0 resolves. Next time a similar incident shape hits in a different project, the agent's Step 0 includes the prior detection order (sketched below).
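
A sketch of how small that fourth layer can be. The hashing scheme, symptom strings, and helper names are illustrative, not the actual storage format:

    import hashlib
    import json

    def pattern_key(symptoms: list[str]) -> str:
        """Hash the normalized incident shape, so a similar incident in a
        different project maps to the same key."""
        normalized = sorted(s.strip().lower() for s in symptoms)
        return hashlib.sha256(json.dumps(normalized).encode()).hexdigest()[:16]

    # Written when a P0 resolves: the detection order that actually found it.
    winning_order: dict[str, list[str]] = {}
    winning_order[pattern_key(["p95 latency spike", "5xx on /checkout"])] = [
        "check last deploy", "check connection pool", "check upstream timeouts",
    ]

    def step0_context(symptoms: list[str]) -> list[str]:
        # The next incident with the same shape starts from the prior
        # winning order instead of re-exploring hypotheses from scratch.
        return winning_order.get(pattern_key(symptoms), [])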

This last layer is where the MTTR -94% claim comes from. It is not the agent being smarter; it is the agent skipping hypothesis exploration time because someone already paid for that exploration.

I tried a vector-DB-backed memory layer for two weeks. Abandoned. The cognitive overhead of a "search before you write" step in every agent prompt was worse than just listing 3-5 recent lessons in a markdown file and trusting the LLM's context window. Plain text + git history is the moat, not embeddings.
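
The replacement is almost embarrassingly small. A sketch, assuming lessons.md keeps one lesson per "## " heading, which is my assumption about the format:

    from pathlib import Path

    def recent_lessons(path: str = ".great_cto/lessons.md", n: int = 5) -> str:
        """Return the last n lessons verbatim, to be prepended to the next
        architect prompt. No index, no embeddings, no retrieval step."""
        chunks = Path(path).read_text().split("## ")[1:]
        return "\n".join("## " + c.strip() for c in chunks[-n:])

The point is the absence of a step: the agent never has to decide whether to search.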

Decision 4: archetype-specific agents are opt-in

The 34 specialist agents in the pool are not all firing every time. For a typical fintech feature, only 7 run: architect, pm, 2× senior-dev, qa-engineer, security-officer (PCI focus), code-reviewer. The voice-AI reviewer does not load because the archetype is not voice.

This is more important than it sounds. Early versions ran "all reviewers always." Cost ballooned because every reviewer wants context (the diff + ARCH.md + project README), and most of them produced zero useful output. Now: archetype detection at init picks the relevant 5-7, no extras.

The detection is signal-based: regex matches in package.json, README, infra/. False positives happen (the static-site generator that got TCPA threat-modeled), but auto-attach is reversible: PROJECT.md lists the active packs and you can remove one with a line edit.
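
A sketch of that signal-based detection. The regexes, archetype names, and file list are made up for the example, not the plugin's real signal table:

    import re
    from pathlib import Path

    # Illustrative signal table: archetype -> regexes tried against key files.
    SIGNALS = {
        "fintech":  [r"\bstripe\b", r"\bpci\b", r"\bledger\b"],
        "voice-ai": [r"\btwilio\b", r"\bwebrtc\b", r"\bspeech[- ]to[- ]text\b"],
    }
    FILES = ["package.json", "README.md", "infra/main.tf"]  # simplified list

    def detect_archetypes(root: str = ".") -> list[str]:
        text = " ".join(
            Path(root, f).read_text().lower()
            for f in FILES
            if Path(root, f).is_file()
        )
        # One regex hit attaches the pack; a false positive is a one-line
        # removal from PROJECT.md.
        return [
            a for a, pats in SIGNALS.items()
            if any(re.search(p, text) for p in pats)
        ]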

What the state machine does not do

Where this fits

This is the architecture I argue every agentic SDLC needs. Not this specific tool. If you are building your own, the shape that works is: explicit state, two gates, parallel specialists, memory loop. Anything looser breaks at scale.

The clickable version of this diagram, with every node linking to its agent's source on GitHub, is at greatcto.systems/architecture.


About: I build GreatCTO, a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Twitter: @avelikiy.