Every time someone runs npx great-cto init, the CLI has to decide:
- What kind of project is this? (one of ~25 archetypes)
- Which compliance packs apply on top? (voice / clinical / fintech / lending / 6 more)
- Are any of those guesses wrong enough that the user will get a useless threat model and abandon the tool?
That last question is what makes the detection logic interesting. Get it wrong and the first impression is "this is producing nonsense about regulations I don't care about." Get it too conservative and the user has to manually configure packs that should have auto-attached, defeating the point.
After four months in production, here is what works.
What I tried first: LLM-based detection
Original design (rejected after 2 weeks): pipe the repo's README, package.json, and top-level directory listing into Claude and ask it to classify.
Problems, in order of severity:
- Latency. First run of `init` now takes 12-18 seconds instead of <1s. Users perceive this as broken.
- Cost. Roughly $0.04 per `init`. Negligible per user, real money at scale.
- Hallucinations. Claude classified a Helm chart for an internal Kubernetes operator as "fintech, because the README mentions billing in the Operator's logging section." It does not. The word "billing" appeared once, describing log volume.
- Variance. Same repo, same prompt, two runs: voice-AI then mlops. Probably temperature noise. Not acceptable for a decision that shapes the rest of the pipeline.
Killed it. Went to a regex-based detector. Latency dropped from 15s to 180ms. Cost dropped to $0. Variance dropped to zero.
The trade-off: regex cannot read intent. It reads tokens. A repo that says it does voice AI in its README but actually contains a music-recommender model will get the voice pack. That is a false positive I accept because the alternative (LLM in the loop) had its own false positives and was 80× slower.
The current detector
Three signal layers:
Layer 1: package.json dependencies. twilio / livekit / deepgram / elevenlabs → voice pack. stripe / plaid / dwolla → fintech. tensorflow / pytorch + transformers → ml-pack (different from voice-pack). And so on for ~80 strong-signal tokens.
Layer 2: file paths. clinical/, fda/, phi/, hipaa/ in directory names → clinical pack. webhook/ + signature-related code → api-platform-pack.
Layer 3: README + top-level docs grep. Exact-match keywords only, not fuzzy. "AEDT", "automated employment decision", "NYC Local Law 144" → hr-ai pack. "21 CFR Part 11", "SaMD", "FDA pre-submission" → clinical pack.
Each pack has a minimum signal count. voice-pack needs ≥2 of its 11 tokens. fintech needs ≥3 of 14. This threshold is what cut false positives roughly in half.
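For concreteness, here is a minimal sketch of that scoring in the shape detect.ts takes. The interfaces, the detectPacks name, and the truncated token lists are mine for illustration; the shipped tables hold ~80 tokens.

```typescript
// Minimal sketch of the three-layer scoring. Token lists are illustrative,
// not the shipped tables in packages/cli/src/detect.ts.
interface RepoSignals {
  deps: string[];      // layer 1: dependency names from package.json
  paths: string[];     // layer 2: repo-relative directory paths
  docTokens: string[]; // layer 3: exact-match hits from README + top-level docs
}

interface PackRule {
  pack: string;
  deps: string[];
  paths: string[];
  keywords: string[];
  minSignals: number;  // the per-pack threshold that cut false positives in half
}

const RULES: PackRule[] = [
  {
    pack: "voice-pack",
    deps: ["twilio", "livekit", "deepgram", "elevenlabs"],
    paths: [],
    keywords: [],
    minSignals: 2, // ≥2 of its 11 tokens
  },
  {
    pack: "fintech",
    deps: ["stripe", "plaid", "dwolla"],
    paths: [],
    keywords: ["billing"], // illustrative; a solo hit never clears the bar
    minSignals: 3, // ≥3 of 14
  },
];

export function detectPacks(s: RepoSignals): string[] {
  return RULES.filter((rule) => {
    const hits =
      rule.deps.filter((d) => s.deps.includes(d)).length +
      rule.paths.filter((p) => s.paths.some((dir) => dir.includes(p))).length +
      rule.keywords.filter((k) => s.docTokens.includes(k)).length;
    return hits >= rule.minSignals; // all-or-nothing: no confidence score
  }).map((rule) => rule.pack);
}
```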
The false positives I have logged
Across 4 months and ~340 init runs (counted via telemetry), 12 confirmed false positives:
| repo type | wrongly attached pack | trigger | fix |
|---|---|---|---|
| static-site generator | voice-pack | README explicitly disclaiming Twilio | exact-match keywords only |
| music-recommender ML | voice-pack | "audio" in package description | removed "audio" as solo trigger |
| internal Helm chart | fintech | "billing" in operator log section | minimum 3 signals |
| docs-only repo | clinical | "patient" in user-research subfolder | excluded docs/ from path scan |
| game-server prototype | mlops | torch in optional dev-dep | only scan dependencies, not devDependencies |
| 7 others | various | various | each addressed via test case in tests/detection.test.mjs |
The 12 cases are committed as regression tests. If the detector ever re-introduces one of these false positives, CI fails.
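The actual suite is tests/detection.test.mjs; one such case might look roughly like this, written against the hypothetical detectPacks from the sketch above (the node:test harness is an assumption):

```typescript
// One regression case, sketched. detectPacks is the illustrative function
// above; the shipped suite lives in tests/detection.test.mjs.
import { test } from "node:test";
import assert from "node:assert/strict";

test("helm chart: one 'billing' mention must not attach fintech", () => {
  const packs = detectPacks({
    deps: [],                    // no stripe / plaid / dwolla
    paths: ["charts/operator/"],
    docTokens: ["billing"],      // appeared once, describing log volume
  });
  assert.ok(!packs.includes("fintech")); // 1 signal < minimum of 3
});
```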
The case I worry about: silent false negatives
Easier to log a false positive (user complains "why is this thing telling me about TCPA"). Harder to catch a false negative (user runs init on a repo that should have hr-ai pack attached, doesn't, ships with no bias audit, gets fined two years later).
Mitigations:
- `/migrate` command. Rerun detection with updated rules. New packs (or new keywords for existing packs) get a second chance to attach.
- PROJECT.md is editable. The `packs:` list is plain YAML. Users can add a pack manually if detection missed it (see the example after this list).
- Public catalogue. greatcto.systems/companies.html lists 200+ companies and the packs that would auto-attach to each. If a user's similar competitor is in the catalogue, they get a sanity check on whether their detection is correct.
- Telemetry on no-pack runs. When init detects zero packs, we log it (anon, opt-in). If a class of project keeps coming through with no pack and the cost-of-miss is high (regulated industry), I add detection rules.
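For the manual escape hatch, a hypothetical PROJECT.md fragment; only the packs: list is described in this post, so the surrounding shape is an assumption:

```yaml
# PROJECT.md (illustrative fragment; only packs: is documented above)
packs:
  - voice-pack
  - hr-ai    # added by hand after detection missed it
```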
I have not had a confirmed regulatory false negative yet. That is partly because the user population is small (~500 active installs as of writing) and partly because the high-stakes archetypes (clinical, fintech, lending) have strong-signal vocabulary that is hard to miss.
What I will not add
People keep asking for two features I have rejected:
- "Pack confidence scores." The detector should output 0-1 confidence per pack so the user can sort. I rejected this: it implies a precision the regex layer does not actually have, and users will treat a 0.6 score as "halfway right" when really it means "one signal matched, probably noise."
- "Auto-update detection from telemetry." If we see 10 users with
xyzin their repo overriding our detection, automatically addxyzas a fintech signal. Rejected: too easy to poison. One determined attacker registers 10 fakexyz/random-namerepos with manual fintech tags and the global detector starts attaching fintech to everyone usingxyz.
Both of these are textbook examples of "the obvious feature that becomes a backdoor."
What I might add
- LLM in the loop, but only for ambiguous cases. If 2+ packs have signal but below threshold for any one, pipe the README into Claude with a strict "pick one or 'unclear'" prompt. Latency penalty only on the 5-10% of repos that are ambiguous, not all of them. A sketch of that gate follows this list.
- Per-language detection. Right now everything assumes Node/Python/JVM-ish patterns. Rust and Go projects sometimes have weak signal even when they are clearly fintech or healthcare. Not urgent: those communities are smaller in the user base.
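The ambiguity gate would be cheap to express. A sketch under my assumptions: the score shape, the prompt wording, and callClaude are all placeholders, not a shipped API.

```typescript
// Sketch of the possible fallback: fires only when 2+ packs have some
// signal but none clears its threshold. callClaude is a stand-in for
// whatever client this would use; nothing here ships today.
async function resolveAmbiguous(
  scores: Map<string, { hits: number; min: number }>,
  readme: string,
  callClaude: (prompt: string) => Promise<string>
): Promise<string | null> {
  const entries = [...scores.entries()];
  const anyCleared = entries.some(([, s]) => s.hits >= s.min);
  const partial = entries.filter(([, s]) => s.hits > 0 && s.hits < s.min);
  if (anyCleared || partial.length < 2) return null; // regex result stands

  const candidates = partial.map(([pack]) => pack);
  const prompt =
    `Classify this README as exactly one of: ${candidates.join(", ")}, ` +
    `or the single word "unclear". Reply with one token only.\n\n${readme}`;

  const answer = (await callClaude(prompt)).trim();
  return candidates.includes(answer) ? answer : null; // "unclear" → no pack
}
```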
The detection logic is small, boring, and one of the parts of the system I am most protective of. It is the first thing every user sees, and a wrong first guess loses them.
About: I build GreatCTO β a multi-agent SDLC plugin for Claude Code. MIT, runs locally. The detector source is in packages/cli/src/detect.ts β read or fork.