Proof — one real GreatCTO run, end to end

The run · 2026-05-14

From prompt to shipped in 1h 26m.

One feature: add a domain-pack overlay so voice-AI startups (Sierra, Cresta, Phonely, …) get the right gates automatically. Below: every stage, every agent, every artifact. PR #22 merged at 17:17:48 +0200.

wall-clock: 1h 26m
llm cost: ~$3.40
human gates: 2 (plan + ship)
lines shipped: 414 (+test infra)

15:51 · T+0 operator (human)

Prompt: "Voice-AI startups (Sierra, Cresta, Phonely) keep tripping on TCPA + state recording-consent + STIR/SHAKEN. Ship a domain pack so the gates fire automatically when the archetype is detected."

15:54 · T+3m architect ~$0.32

Read existing pack examples (clinical-pack, hr-ai-pack). Identified the contract: a pack ships 1 pack-spec + 1 reviewer agent + ≥3 EVAL fixtures + 1 CLI signal in packs.ts. Drafted ARCH.note listing the 5 OWASP-LLM threats specific to voice (PII leakage in transcripts, prompt injection via caller speech, synth-voice disclosure, call-handoff identity drift, recording-consent state-by-state matrix).

📄 reference: clinical-pack.md (prior art)
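The contract the architect identified (1 pack-spec + 1 reviewer agent + ≥3 EVAL fixtures + 1 CLI signal) can be written down as a type plus a check. This is a minimal sketch, assuming a hypothetical `PackManifest` shape; the field names and the `checkPackContract` helper are illustrative and not taken from the GreatCTO codebase. Only the file paths are the ones named in this run.

```typescript
// Hypothetical description of what a pack delivers; field names are illustrative.
interface PackManifest {
  name: string;
  specPath: string | null;     // the pack-spec
  reviewerPath: string | null; // the reviewer agent prompt
  evalPaths: string[];         // EVAL fixtures
  hasCliSignal: boolean;       // whether packs.ts carries a detection entry
}

const MIN_EVALS = 3; // the contract asks for at least three EVAL fixtures

// Returns a list of contract violations; an empty list means the pack conforms.
function checkPackContract(pack: PackManifest): string[] {
  const problems: string[] = [];
  if (!pack.specPath) problems.push("missing pack-spec");
  if (!pack.reviewerPath) problems.push("missing reviewer agent");
  if (pack.evalPaths.length < MIN_EVALS) {
    problems.push(`needs >= ${MIN_EVALS} EVAL fixtures, found ${pack.evalPaths.length}`);
  }
  if (!pack.hasCliSignal) problems.push("missing CLI signal");
  return problems;
}

const voicePack: PackManifest = {
  name: "voice-pack",
  specPath: "skills/great_cto/packs/voice-pack.md",
  reviewerPath: "agents/voice-ai-reviewer.md",
  evalPaths: [
    "EVAL-voice-call-handoff-safety.md",
    "EVAL-voice-pii-leakage.md",
    "EVAL-voice-prompt-injection.md",
    "EVAL-voice-synth-disclosure.md",
  ],
  hasCliSignal: true,
};

console.log(checkPackContract(voicePack)); // []
```

Returning a list of problems rather than a boolean lets a reviewer see every missing part in one pass.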

15:58 · T+7m ⚐ GATE: PLAN · operator (human) ~30 s

Operator reviewed the ARCH.note. Approved scope: voice-pack, 4 evals (handoff, PII, injection, synth-disclosure), TCPA + STIR/SHAKEN + state-recording-consent in the threat model. Rejected scope creep: no IVR-specific overlay yet (defer to v2.9).

APPROVED · proceed with 4-eval pack · skip IVR overlay

15:59 · T+8m pm ~$0.08

Decomposed into 4 independent tasks (no dependencies between them, so they can run in parallel): voice-pack.md spec, voice-ai-reviewer.md agent, EVAL fixtures × 4, CLI signal in packs.ts. Filed as beads tasks.

16:04 · T+13m senior-dev #1 (parallel) ~$0.42

Authored skills/great_cto/packs/voice-pack.md — pack spec: detection signals (Twilio/Vonage/Retell SDKs, keywords "voice agent / IVR / phone tree"), gates added when pack overlays an archetype, references to laws and standards.

📄 voice-pack.md (65 lines)

16:11 · T+20m senior-dev #2 (parallel) ~$0.68

Authored agents/voice-ai-reviewer.md — reviewer agent prompt: when to fire, gates it owns (gate:voice-compliance), threat model, pre-implementation checklist, sign-off criteria. 200 lines.

🤖 voice-ai-reviewer.md (200 lines)

16:19 · T+28m senior-dev #3 (parallel) ~$0.74

Authored 4 EVAL fixtures, one per threat approved at the PLAN gate (handoff, PII, injection, synth-disclosure). Each fixture has a scenario, an expected verdict, a refusal pattern, and red-team probes, following the existing EVAL-*.md template.

🧪 EVAL-voice-call-handoff-safety.md (42 ln)
🧪 EVAL-voice-pii-leakage.md (48 ln)
🧪 EVAL-voice-prompt-injection.md (29 ln)
🧪 EVAL-voice-synth-disclosure.md (30 ln)

16:25 · T+34m senior-dev #4 (parallel) ~$0.18

Added voice-pack detection signals to packages/cli/src/packs.ts: exact-match keywords ("twilio", "vonage", "retell", "voice agent", "IVR"), README hints, and dependency-tree probes. Wrote unit tests covering all signals.

⚙️ packs.ts (+pack entry)
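To make the keyword and dependency signals concrete, here is a minimal sketch of how such detection might work. It is not the code in packs.ts: the `ProjectSnapshot` type and the `findKeywordHits` and `detectVoicePack` helpers are hypothetical, and reading "exact-match" as a case-insensitive whole-word match is an assumption of this sketch.

```typescript
// Keywords quoted in the write-up, lowercased so matching is case-insensitive
// (that normalization is an assumption of this sketch).
const VOICE_KEYWORDS = ["twilio", "vonage", "retell", "voice agent", "ivr"];

// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Whole-word (or whole-phrase) match: "retell" matches "Retell SDK"
// but not "retelling".
function findKeywordHits(text: string, keywords: string[]): string[] {
  const haystack = text.toLowerCase();
  return keywords.filter((kw) =>
    new RegExp(`\\b${escapeRegExp(kw)}\\b`).test(haystack)
  );
}

// Hypothetical view of a project: README text plus dependency names.
interface ProjectSnapshot {
  readme: string;
  dependencies: string[];
}

// Suggest the pack if either the README or any dependency name hits a keyword.
function detectVoicePack(project: ProjectSnapshot) {
  const readmeHits = findKeywordHits(project.readme, VOICE_KEYWORDS);
  const depHits = project.dependencies.filter(
    (dep) => findKeywordHits(dep, VOICE_KEYWORDS).length > 0
  );
  return {
    suggested: readmeHits.length + depHits.length > 0,
    readmeHits,
    depHits,
  };
}

const sample = detectVoicePack({
  readme: "An IVR replacement built on Twilio.",
  dependencies: ["express", "twilio"],
});
console.log(sample.suggested, sample.readmeHits, sample.depHits);
```

Whole-word matching trades a little recall for fewer false positives, which matters when a hit switches on extra compliance gates.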

16:35 · T+44m ai-eval-engineer (review) ~$0.24

Verified each EVAL fixture covers a distinct threat (no overlap, no gaps). Checked refusal patterns are testable and not just narrative.

APPROVED · 4 evals cover the 4 threats with non-overlapping scenarios
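The "no overlap, no gaps" part of this review can be approximated mechanically if each fixture is tagged with the threat it targets. The sketch below assumes such a tag exists; the `EvalFixture` type, the threat labels, and `checkCoverage` are hypothetical, and the check does not replace the reviewer's judgment about whether refusal patterns are testable.

```typescript
// Threat labels follow the four evals approved at the PLAN gate.
type Threat = "handoff" | "pii" | "injection" | "synth-disclosure";

// Hypothetical fixture record: a file plus the single threat it targets.
interface EvalFixture {
  file: string;
  threat: Threat;
}

const REQUIRED_THREATS: Threat[] = ["handoff", "pii", "injection", "synth-disclosure"];

// A gap is a required threat with no fixture; an overlap is a threat with more than one.
function checkCoverage(fixtures: EvalFixture[]): { gaps: Threat[]; overlaps: Threat[] } {
  const gaps = REQUIRED_THREATS.filter((t) =>
    fixtures.every((f) => f.threat !== t)
  );
  const overlaps = REQUIRED_THREATS.filter(
    (t) => fixtures.filter((f) => f.threat === t).length > 1
  );
  return { gaps, overlaps };
}

const voiceFixtures: EvalFixture[] = [
  { file: "EVAL-voice-call-handoff-safety.md", threat: "handoff" },
  { file: "EVAL-voice-pii-leakage.md", threat: "pii" },
  { file: "EVAL-voice-prompt-injection.md", threat: "injection" },
  { file: "EVAL-voice-synth-disclosure.md", threat: "synth-disclosure" },
];

console.log(checkCoverage(voiceFixtures)); // { gaps: [], overlaps: [] }
```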

16:42 · T+51m ai-security-reviewer (review) ~$0.31

Mapped each EVAL to OWASP LLM Top-10. Flagged one gap: missing eval for LLM-08 (excessive agency via tool use). Asked: "Does the pack's reviewer agent enforce tool-allowlisting in the agent loop?"

PARTIAL · add tool-allowlist clause to voice-ai-reviewer.md OR ship eval for LLM-08 in v2.8.1
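For readers unfamiliar with the term, tool-allowlisting in an agent loop means refusing any tool call whose name is not on a pre-approved list. A minimal sketch follows, assuming a hypothetical `ToolCall` shape; the tool names and the `gateToolCall` helper are invented for illustration and do not come from voice-ai-reviewer.md.

```typescript
// Hypothetical shape of a tool call proposed by the model.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

type ToolDecision =
  | { allowed: true }
  | { allowed: false; reason: string };

// Illustrative allowlist for a voice agent; these tool names are made up.
const ALLOWED_TOOLS: ReadonlyArray<string> = ["lookup_account", "transfer_to_human"];

// Deny by default: anything not explicitly listed is refused before execution.
function gateToolCall(call: ToolCall, allowlist: ReadonlyArray<string>): ToolDecision {
  if (allowlist.indexOf(call.tool) === -1) {
    return { allowed: false, reason: `tool "${call.tool}" is not on the allowlist` };
  }
  return { allowed: true };
}

console.log(gateToolCall({ tool: "lookup_account", args: { id: "123" } }, ALLOWED_TOOLS));
console.log(gateToolCall({ tool: "issue_refund", args: { amount: 500 } }, ALLOWED_TOOLS));
```

The deny-by-default direction is the point: a caller who talks the model into proposing a new action still cannot make the loop execute it.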

16:48 · T+57m senior-dev #2 (re-claim) ~$0.16

Added tool-allowlist clause to voice-ai-reviewer.md per the security reviewer's request. Re-submitted.

16:58 · T+1h7m tests/run-packs-e2e.mjs CI · $0

Ran the full pack-chain validator: voice-pack detection signals → reviewer agent → 4 EVAL files → CLI suggester. 47 assertions passed across the voice-pack fixture; 456 across all 10 packs.

✓ run-packs-e2e.mjs (456 assertions)

PASS · 47/47 voice-pack assertions · CLI suggests voice-pack on Twilio fixture

17:10 · T+1h19m ⚐ GATE: SHIP · operator (human) ~30 s

Operator reviewed all artifacts, the security reviewer's PARTIAL verdict together with the fix that resolved it, and the e2e output. Approved merge to main.

APPROVED · ship as v2.8.0 · merged in PR #22

17:17 · T+1h26m devops ~$0.07

PR #22 merged at 17:17:48 +0200. CI green. npm publish v2.8.0 triggered.

⎘ commit b3087ec (initial)
⎘ commit 018e337 (merge)
📦 npm v2.8.0

17:20 · T+1h29m continuous-learner ~$0.05

Extracted lesson: "pack-rollout pattern": every new pack ships the same four parts (1 pack-spec + 1 reviewer agent + ≥3 EVAL fixtures + 1 CLI signal). Wrote it to .great_cto/lessons.md, then promoted it to ~/.great_cto/decisions.md once voice-pack became the third pack to follow the pattern (after clinical-pack and hr-ai-pack).

LESSON SAVED · next pack rollout will skip 80 % of the exploration time
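The promotion step reads like a simple threshold rule: a lesson stays project-local until enough packs have confirmed it. The sketch below assumes the threshold is three distinct packs, inferred from "the 3rd pack hit"; the `Lesson` type and both helpers are hypothetical, not the continuous-learner's actual logic.

```typescript
const PROMOTE_AFTER = 3; // assumed threshold, inferred from promotion on the 3rd pack

// Hypothetical lesson record: a name plus the packs that confirmed it.
interface Lesson {
  name: string;
  seenIn: string[];
}

// Record that a pack followed the pattern; each pack counts once.
function recordHit(lesson: Lesson, pack: string): Lesson {
  if (lesson.seenIn.indexOf(pack) !== -1) return lesson;
  return { name: lesson.name, seenIn: lesson.seenIn.concat(pack) };
}

// A lesson is promoted to the global decisions file once enough packs confirm it.
function shouldPromote(lesson: Lesson): boolean {
  return lesson.seenIn.length >= PROMOTE_AFTER;
}

let lesson: Lesson = { name: "pack-rollout pattern", seenIn: [] };
for (const pack of ["clinical-pack", "hr-ai-pack", "voice-pack"]) {
  lesson = recordHit(lesson, pack);
  console.log(pack, shouldPromote(lesson));
}
```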


⚠ Honest caveats

9 more packs shipped the same way.