A senior CTO emailed me last month: "We rolled out Devin across two teams. After three weeks the agents had merged 47 PRs. Three of them broke prod. Two contained a credential in the commit. One disabled rate limiting because the test fixtures didn't pass with rate limiting on. We're rolling back."
Everyone with eyes on agentic coding has heard a version of this story. The most common diagnosis is "the model isn't good enough yet." Reasonable on the surface. Wrong as a diagnosis.
I've spent the last 4 months building a multi-agent SDLC layer on top of Claude Code. 34 specialist agents, 25 archetype overlays, two human gates per feature. The clearest finding from this work: the failures CTOs describe almost never trace to bad code generation. They trace to missing gates.
This article walks through why, and shows the state machine I think every agentic SDLC needs.
The problem with how everyone does it
The default architecture for agentic coding is one autonomous loop:
```
loop:
    diff = llm.generate(task, context)
    apply(diff)
    if run_tests():
        commit(diff)
    else:
        revise(context)
```
This is fine for prototypes. It is a disaster for shipped code. Three reasons.
1. Tests aren't enough. Tests verify correctness against assertions you wrote. They do not verify: "is this PCI-DSS scope appropriate", "does this respect TCPA recording consent", "did we just add a hidden N+1 query", "is this idempotent under retry storms". You need humans, or specialist reviewers that act like humans, for each of those.
2. One agent can't review itself. Even if you ask GPT-4 or Claude Opus to review its own output, the same biases that wrote the bug are reading the diff. We have decades of evidence from code review at Google, Microsoft, and Apache that independent reviewers catch ~3× more defects than authors. Independence requires separation. Agents aren't different.
3. Speed compounds errors. When the loop runs unattended, errors accumulate quietly between human checkpoints. By the time a human sees the work, the agent has rebuilt on top of three earlier mistakes. You can't fix the lowest-level mistake without unwinding everything above it.
The pattern that keeps emerging across teams that ship agentic systems successfully is explicit gates + specialist reviewers, not bigger models.
What this article will show
- The 8-stage state machine I think every agentic SDLC needs
- Why two human gates per feature is the sweet spot
- The parallel implementer + parallel reviewer pattern
- How memory feedback closes the loop (the "94% MTTR" claim, with caveats)
- What this all costs (~$2 per small feature, with receipts)
The state machine
The full pipeline, as a deterministic state machine:
```mermaid
flowchart TD
    Init["$ init"] --> Detect["archetype-detect"]
    Detect --> Architect["architect (ARCH.md)"]
    Architect --> GatePlan{"gate: plan"}
    GatePlan -->|human approve| PM["pm (decompose)"]
    PM --> Impl["senior-dev × N (parallel)"]
    Impl --> Review["specialist review × 5 (parallel)"]
    Review --> GateShip{"gate: ship"}
    GateShip -->|human approve| Deploy["devops"]
    Deploy --> Operate["l3-support"]
    Operate -.->|incident pattern| Learner["continuous-learner"]
    Learner -.->|inject lesson| Architect
```
The two diamond nodes are human gates. Everything else runs unattended.
A few things to notice:
Parallelism is structural, not accidental. At the implement stage, independent tasks run in isolated git worktrees. At the review stage, 5 reviewers run concurrently because they look at different aspects (QA, security, performance, archetype-specific compliance, 12-angle code review).
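A minimal sketch of that shape, assuming a generic run_agent helper (the names below are illustrative, not GreatCTO's actual API):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_agent(name: str, payload: str, cwd: str = ".") -> str:
    """Placeholder for the actual Claude Code agent invocation."""
    raise NotImplementedError

def implement_in_worktree(task_id: str, spec: str, base: str = "main") -> str:
    # Each implementer works in an isolated git worktree, so parallel
    # senior-dev agents never step on each other's working copy.
    path = f"../worktrees/{task_id}"
    subprocess.run(["git", "worktree", "add", "-b", f"feat/{task_id}", path, base],
                   check=True)
    return run_agent("senior-dev", spec, cwd=path)

def review_in_parallel(diff: str) -> dict[str, str]:
    # Reviewers are read-only, so all five can look at the same diff concurrently.
    reviewers = ["qa-engineer", "security-officer", "performance-reviewer",
                 "compliance-reviewer", "code-reviewer"]
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        verdicts = pool.map(lambda r: run_agent(r, diff), reviewers)
        return dict(zip(reviewers, verdicts))
```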
The memory loop is dashed. It's an out-of-band feedback path. When a P0 incident resolves, the continuous-learner agent extracts the detection pattern and writes it to .great_cto/lessons.md. Next time a similar incident shape hits, the agent's Step 0 includes the prior detection order. This is where the MTTR savings come from.
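The write path is small enough to sketch in a few lines. Assume a lesson entry carries a pattern hash, the detection order that worked, and a rationale (the format is illustrative, not the tool's actual schema):

```python
from datetime import date
from pathlib import Path

LESSONS = Path(".great_cto/lessons.md")   # per-project, git-tracked

def record_lesson(pattern_hash: str, detection_order: list[str], rationale: str) -> None:
    # Called by the continuous-learner after a P0 resolves: append the
    # winning detection order so the next incident with the same shape
    # starts from it instead of from hypothesis zero.
    entry = (
        f"\n## lesson {pattern_hash} ({date.today().isoformat()})\n"
        f"- detection order: {' -> '.join(detection_order)}\n"
        f"- rationale: {rationale}\n"
    )
    LESSONS.parent.mkdir(parents=True, exist_ok=True)
    with LESSONS.open("a", encoding="utf-8") as f:
        f.write(entry)
```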
Specialists run only when archetype matches. The 34 agents in the pool aren't all firing every time. For a typical fintech feature, only 7 run: architect, pm, 2× senior-dev, qa-engineer, security-officer (PCI focus), code-reviewer. The voice-AI reviewer doesn't load because the archetype isn't voice-AI.
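The selection itself is just a filter over the agent pool. A sketch, with an invented registry to show the idea:

```python
# Illustrative registry: each agent declares which archetypes it applies to.
# "*" means the agent runs for every feature.
AGENT_POOL: dict[str, set[str] | str] = {
    "architect":         "*",
    "pm":                "*",
    "senior-dev":        "*",
    "qa-engineer":       "*",
    "code-reviewer":     "*",
    "security-officer":  {"fintech", "healthcare"},
    "voice-ai-reviewer": {"voice-ai"},
}

def agents_for(archetype: str) -> list[str]:
    # Only agents whose archetype filter matches the detected archetype load.
    return [name for name, scope in AGENT_POOL.items()
            if scope == "*" or archetype in scope]

# agents_for("fintech") keeps security-officer and drops voice-ai-reviewer;
# senior-dev then fans out to N parallel instances at the implement stage.
```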
Two gates, not seven
The hardest design question is: how many human gates?
I started with seven: plan, design-review, security-review, qa-review, performance-review, compliance-review, ship. The complaint from every early user was: "this is just the human checkpoint problem from waterfall, but worse, because now I'm reviewing AI outputs."
Down to two. Specifically:
Gate 1: plan. You approve the ARCH note + cost estimate + task decomposition before any code is written. This is the cheapest decision in the pipeline: if scope is wrong, fixing it now is free. If you approve it, you've committed to "ship this if implementation passes."
Gate 2: ship. You see the full review panel: 5 verdicts, with rationale and diff per reviewer. APPROVED chips and BLOCKED chips. You either approve, or push back on a specific reviewer.
Everything in between is the agents' problem. If they disagree with each other, the gate fails and surfaces the disagreement explicitly.
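To make "the gate fails and surfaces the disagreement" concrete, here is a minimal ship-gate sketch (the Verdict shape is invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    reviewer: str
    approved: bool
    rationale: str

def ship_gate(verdicts: list[Verdict]) -> bool:
    # Blocked verdicts print first so disagreement is explicit,
    # never papered over by a majority vote among the agents.
    for v in sorted(verdicts, key=lambda v: v.approved):
        chip = "APPROVED" if v.approved else "BLOCKED"
        print(f"[{chip}] {v.reviewer}: {v.rationale}")
    # The human makes exactly one decision here: ship, or go push back
    # on a specific reviewer and re-run the gate.
    return input("Approve ship? [y/N] ").strip().lower() == "y"
```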
Why this specific shape:
- Gate 1 controls scope. You decide what gets built.
- Gate 2 controls quality. You decide whether the agents got it right.
You don't decide how in between. The agents do. If you're making more than 2 decisions per feature, you're a bottleneck, and the whole pipeline collapses to your reading speed.
This is the part that most agentic systems get wrong. They either show you everything (and you can't keep up), or they show you nothing (and you wake up to broken prod). Two well-chosen gates is the sweet spot.
The memory loop is the real moat
Most agentic coding tools have no memory. They start each session from zero. This is fine for syntax errors and dead code. It is bad for the kind of bugs that recur with different surface signatures.
Real example. In Q1 of this year I hit Postgres connection pool exhaustion during a burst load. The log said Connection refused. It looked like a network issue. I spent 4 hours unwinding network config before finally checking pg_stat_activity and seeing that pool size was the cap. In Q3, the same shape hits in a different project: different framework, different stack. Pattern hash matches. Agent's Step 0 includes the prior detection order. 28 minutes to resolution.
This is not the agent being smarter. It's the agent skipping hypothesis exploration time.
Across 47 paired P0 incidents in 12 repositories (full methodology and 4 honest memory-miss cases published here), the median MTTR reduction was 94.1%. The mean was 92.6%, skewed by a couple of near-100% cases. Not an RCT. Observational. Caveats are listed in the methodology.
The mechanism is simple. The agent stores: (pattern_hash, detection_order_that_worked, rationale). On a match, it tries the winning detection first. If that's wrong (4 of 47 cases were misses), it falls back to systematic exploration. No worse than baseline.
What makes the memory layer work is that it's local, file-backed, and git-trackable. Not a vector DB. Not a cloud service. Plain markdown in .great_cto/lessons.md (per-project) and ~/.great_cto/decisions.md (cross-project). You can read it, edit it, version-control it.
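And the read path: hash the incident's symptom shape, look for a prior lesson, and fall back to systematic exploration on a miss. Again, an illustrative sketch rather than the actual implementation:

```python
import hashlib
from pathlib import Path

LESSONS = Path(".great_cto/lessons.md")

def pattern_hash(symptoms: list[str]) -> str:
    # Hash the normalized symptom shape, not the raw log text, so
    # "Connection refused" under burst load matches across stacks.
    normalized = "|".join(sorted(s.strip().lower() for s in symptoms))
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

def lookup_lesson(symptoms: list[str]) -> str | None:
    # Step 0 of incident response: try the detection order that worked last time.
    if not LESSONS.exists():
        return None
    h = pattern_hash(symptoms)
    for block in LESSONS.read_text(encoding="utf-8").split("\n## "):
        if block.startswith(f"lesson {h}"):
            return block        # matched: follow the stored detection order first
    return None                 # miss: fall back to systematic exploration
```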
Edge cases worth knowing
A few things that surprised me during the build:
Agent count doesn't matter much. I shipped 12 agents, then 24, then 34. The marginal value of adding the 35th agent is small. What matters is coverage of distinct review angles. After 12, you mostly add archetype-specific compliance reviewers, and each one is opt-in based on archetype detection.
Disagreement between reviewers is a feature, not a bug. When security-officer blocks a PR that qa-engineer approves, you want this visible at the gate, not papered over. The state machine surfaces both verdicts.
Cost is dominated by output tokens. A typical feature: $3.40 in LLM calls. ~80% is in the agents that write (senior-devs, architect). The reviewers are cheap because they output verdicts, not code. If costs balloon, look at how much code is being generated, not how many agents.
Auto-approve flag is the slippery slope. I considered an --auto-approve flag for trivial features. Killed it. The minute you have that flag, the cycle that produces broken prod starts. The two gates are load-bearing.
Where this fits
The thesis isn't "you need this specific tool." It's that any agentic SDLC needs explicit state, explicit gates, and a memory loop. Without them, you're shipping a faster version of the agent system that already burned the teams I mentioned at the top.
If you want to inspect the exact state machine, the live SVG with every node clickable to its source on GitHub is here. A real shipped feature, walked stage by stage with artifacts and costs, is here.
TL;DR
- Agentic coding failures trace to missing gates, not bad models.
- The pattern that ships safely is 2 human gates + parallel implementers + parallel reviewers + memory loop.
- "Bigger model" is rarely the right answer. "More specialist review angles" usually is.
- Cost per shipped feature on this architecture: $1-4 in LLM, ~45 min wall-clock, 2 human clicks.
- Memory is the difference between "fast at one-off code generation" and "improves over time at your codebase's recurring bugs".
About: I build GreatCTO β a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Twitter: @avelikiy. GitHub: @avelikiy/great_cto.