The Code Factory pattern is a control-plane architecture where coding agents write 100% of the code and the repository enforces deterministic, risk-aware checks before merge. Evidence is machine-verifiable, review loops are automated, and incidents feed back into harness coverage.
Shipwright implements every layer of the Code Factory pattern — and extends it with capabilities that go beyond the baseline.
    Agent writes code → Risk policy gate classifies PR → CI runs tier-appropriate checks
    → Code review agent validates → Findings remediated in-branch → Clean evidence for current SHA
    → Bot-only threads auto-resolved → Merge with auditable proof → Incidents create harness gaps

Every step is deterministic. Every decision is traceable to policy. Every merge is backed by machine-verifiable evidence tied to the exact commit SHA being merged.
All risk tiers, merge requirements, docs drift rules, evidence specs, and harness gap SLAs live in one file: config/policy.json.
    {
      "riskTierRules": {
        "critical": [".github/workflows/**", "config/policy.json"],
        "high": ["scripts/sw-pipeline.sh", "scripts/sw-daemon.sh"],
        "medium": ["scripts/sw-*.sh", "dashboard/**"],
        "low": ["docs/**", "website/**", "**"]
      },
      "mergePolicy": {
        "critical": {
          "requiredChecks": [
            "risk-policy-gate",
            "tests",
            "e2e-smoke",
            "platform-health",
            "code-review-agent"
          ],
          "requireDocsDriftCheck": true
        },
        "low": {
          "requiredChecks": ["risk-policy-gate"]
        }
      }
    }

Why this matters: No ambiguity. No silent drift between scripts, workflows, and docs. One contract governs all merge decisions, and CI validates that contract on every push.
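To make the merge contract concrete, here is a minimal sketch of how the `requiredChecks` for a tier could be compared with the checks that have passed. The policy fragment is copied from the excerpt above; the helper name `missing_checks` and the shape of its input are assumptions for illustration, not Shipwright's actual implementation.

```python
from __future__ import annotations

# Fragment of config/policy.json, copied from the excerpt above.
MERGE_POLICY = {
    "critical": {
        "requiredChecks": [
            "risk-policy-gate",
            "tests",
            "e2e-smoke",
            "platform-health",
            "code-review-agent",
        ],
        "requireDocsDriftCheck": True,
    },
    "low": {"requiredChecks": ["risk-policy-gate"]},
}


def missing_checks(tier: str, passed: set[str]) -> list[str]:
    """Return the required checks for `tier` that are not in `passed`.

    Hypothetical helper: `passed` is assumed to hold the names of check
    runs that succeeded for the PR.
    """
    required = MERGE_POLICY[tier]["requiredChecks"]
    return [name for name in required if name not in passed]


if __name__ == "__main__":
    print(missing_checks("critical", {"risk-policy-gate", "tests"}))
```

An empty list means the tier's check requirements are satisfied; anything else names exactly what still blocks the merge.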
The risk-policy-gate workflow runs first on every PR and classifies it against riskTierRules (highest tier wins).

Only after the gate passes do expensive CI jobs (tests, builds, security scans) fan out. This saves CI minutes on PRs that are already policy-blocked.
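The "highest tier wins" rule can be sketched as follows. The patterns are copied from the policy excerpt; the function names are hypothetical, and the sketch assumes Python's `fnmatchcase`, in which `*` also matches `/`, so `**` behaves like a plain wildcard here. The real gate's glob semantics may differ.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# Patterns copied from the riskTierRules excerpt.
RISK_TIER_RULES = {
    "critical": [".github/workflows/**", "config/policy.json"],
    "high": ["scripts/sw-pipeline.sh", "scripts/sw-daemon.sh"],
    "medium": ["scripts/sw-*.sh", "dashboard/**"],
    "low": ["docs/**", "website/**", "**"],
}
# Ordered from highest to lowest risk.
TIER_ORDER = ["critical", "high", "medium", "low"]


def tier_for_file(path: str) -> str:
    """Return the highest tier whose patterns match `path`."""
    for tier in TIER_ORDER:
        if any(fnmatchcase(path, pattern) for pattern in RISK_TIER_RULES[tier]):
            return tier
    return "low"


def tier_for_pr(changed_files: list[str]) -> str:
    """A PR takes the highest tier among its changed files."""
    tiers = [tier_for_file(path) for path in changed_files]
    return min(tiers, key=TIER_ORDER.index, default="low")


if __name__ == "__main__":
    print(tier_for_pr(["docs/intro.md", "scripts/sw-daemon.sh"]))
```

Checking tiers from highest to lowest matters because the catch-all `**` pattern in the low tier would otherwise match every file.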
    PR opened → risk-policy-gate (3s) → pass? → tests + e2e + security (5-10min)
                                      → fail? → blocked, no CI wasted

The most critical safety invariant concerns evidence freshness. Shipwright enforces that all evidence (check runs, reviews, approvals) corresponds to the current PR head SHA.
Without this, you can merge a PR using “clean” evidence from an older commit that no longer applies.
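A minimal sketch of the current-head check, assuming each piece of evidence records the commit it was produced for. The `Evidence` type and its field names are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    name: str     # e.g. a check run, review, or approval
    sha: str      # commit the evidence was produced for
    passed: bool


def blocking_evidence(head_sha: str, evidence: list[Evidence]) -> list[str]:
    """Return names of evidence that is stale (wrong SHA) or failing."""
    return [e.name for e in evidence if e.sha != head_sha or not e.passed]


if __name__ == "__main__":
    records = [
        Evidence("tests", "abc1234", True),
        Evidence("code-review-agent", "9f8e7d6", True),  # produced for an older commit
    ]
    print(blocking_evidence("abc1234", records))
```

In this sketch a passing review from an older commit blocks the merge just as a failing check would, which is the point of the invariant.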
When multiple workflows can request review reruns, duplicate bot comments and race conditions appear. Shipwright uses a single canonical rerun writer (sw-review-rerun.sh) that:

- Uses a marker comment (`<!-- shipwright-review-rerun -->`) for identification
- Deduplicates on `sha:<head>` to prevent duplicate requests for the same commit

For example:

    shipwright review-rerun request 42 abc1234 greptile
    # Only posts if no rerun was already requested for sha:abc1234

When a code review finds actionable issues, the review-remediation workflow remediates the findings in-branch.
The remediation agent is constrained: minimum necessary changes, no new features, no unrelated refactoring. The model and effort level are pinned for reproducibility.
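Returning to the rerun writer described earlier, its SHA-keyed deduplication can be sketched as below. The marker string comes from the text above; the comment body layout and the function names are assumptions, not the format `sw-review-rerun.sh` actually writes.

```python
from __future__ import annotations

MARKER = "<!-- shipwright-review-rerun -->"


def already_requested(comment_bodies: list[str], head_sha: str) -> bool:
    """True if a marked rerun comment already exists for this exact SHA."""
    token = f"sha:{head_sha}"
    # Compare whole whitespace-separated tokens so that one SHA never
    # matches as a prefix of a longer one.
    return any(MARKER in body and token in body.split() for body in comment_bodies)


def maybe_request_rerun(comment_bodies: list[str], head_sha: str, agent: str) -> str | None:
    """Return the comment to post, or None if this SHA was already requested."""
    if already_requested(comment_bodies, head_sha):
        return None
    # Hypothetical body layout: marker, request line, SHA token.
    return f"{MARKER}\n@{agent} please re-review\nsha:{head_sha}"


if __name__ == "__main__":
    first = maybe_request_rerun([], "abc1234", "greptile")
    print(first is not None)
    print(maybe_request_rerun([first], "abc1234", "greptile"))
```

Routing every request through one writer that checks before posting makes the operation idempotent per commit, which is what removes the duplicate comments.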
After a clean current-head review rerun, Shipwright auto-resolves unresolved PR threads where all comments are from bots. Human-participated threads are never touched.
This is controlled by policy:
    {
      "codeReviewAgent": {
        "autoResolveBotsOnlyThreads": true,
        "neverAutoResolveHumanThreads": true
      }
    }

The workflow uses GraphQL to inspect thread participants and only resolves when every author matches known bot patterns.
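The bot-only decision can be sketched as a pure function over the authors of a thread. The policy keys come from the excerpt above; the bot pattern (logins ending in `[bot]`) and the function names are assumptions, and the GraphQL fetch of thread participants is left out.

```python
from __future__ import annotations

import re

POLICY = {
    "autoResolveBotsOnlyThreads": True,
    "neverAutoResolveHumanThreads": True,
}

# Hypothetical bot patterns; the real list is not given in the text.
BOT_PATTERNS = [re.compile(r"\[bot\]$")]


def is_bot(login: str) -> bool:
    return any(pattern.search(login) for pattern in BOT_PATTERNS)


def should_auto_resolve(authors: list[str], review_is_clean: bool, policy: dict) -> bool:
    """Resolve only after a clean current-head review, and only if every author is a bot."""
    if not policy.get("autoResolveBotsOnlyThreads", False):
        return False
    if not review_is_clean or not authors:
        return False
    return all(is_bot(login) for login in authors)


if __name__ == "__main__":
    print(should_auto_resolve(["github-actions[bot]"], True, POLICY))
    print(should_auto_resolve(["github-actions[bot]", "alice"], True, POLICY))
```

Requiring every author to match, rather than just one, means a single human comment keeps the thread open.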
The blog post recommends browser evidence for UI changes. Shipwright generalizes this into a multi-type evidence framework that covers every surface an agent can change:
| Evidence Type | What It Proves | Example |
|---|---|---|
| browser | UI renders correctly | Dashboard loads, pipeline status page shows stages |
| api | REST/GraphQL contracts hold | Health endpoint returns 200, response is valid JSON |
| database | Schema integrity maintained | Migrations are current, no orphaned tables |
| cli | Commands produce correct output | shipwright pipeline status --json exits 0 with valid JSON |
| webhook | Callback endpoints respond | Webhook receiver accepts POST with expected status |
| custom | Anything else | User-defined verification scripts |
    # Capture all evidence types
    npm run harness:evidence:capture

    # Capture specific type only
    npm run harness:evidence:capture:api
    npm run harness:evidence:capture:cli
    npm run harness:evidence:capture:database

    # Verify manifest and freshness
    npm run harness:evidence:verify

    # Pre-PR: capture + verify in one step
    npm run harness:evidence:pre-pr

Collectors are defined in config/policy.json under the evidence section:
    {
      "evidence": {
        "artifactMaxAgeMinutes": 30,
        "requireFreshArtifacts": true,
        "collectors": [
          {
            "name": "dashboard-api-health",
            "type": "api",
            "method": "GET",
            "url": "http://localhost:8767/api/health",
            "expectedStatus": 200,
            "assertions": ["status-ok", "response-has-version"]
          },
          {
            "name": "pipeline-cli-smoke",
            "type": "cli",
            "command": "bash scripts/sw-pipeline.sh status",
            "expectedExitCode": 0,
            "assertions": ["has-pipeline-state"]
          },
          {
            "name": "db-schema-integrity",
            "type": "database",
            "command": "bash scripts/sw-db.sh health",
            "expectedExitCode": 0,
            "assertions": ["schema-valid", "db-accessible"]
          }
        ]
      }
    }

Merge policy enforces which evidence types are required per risk tier:
    {
      "mergePolicy": {
        "critical": { "requiredEvidence": ["cli", "api"] },
        "high": { "requiredEvidence": ["cli"] },
        "medium": { "requiredEvidence": [] }
      }
    }

Every evidence artifact records the capture timestamp, collector type, pass/fail status, and type-specific details (HTTP status, exit code, response body, etc.) in a machine-readable manifest.
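As a rough sketch, verifying a manifest against the policy combines two checks: each required evidence type must be present and passing, and the artifact must be newer than `artifactMaxAgeMinutes`. The policy values are taken from the excerpts above; the manifest field names (`type`, `passed`, `capturedAt`) and the function name are assumptions, since the actual manifest schema is not shown.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

ARTIFACT_MAX_AGE_MINUTES = 30
REQUIRED_EVIDENCE = {"critical": ["cli", "api"], "high": ["cli"], "medium": []}


def evidence_problems(tier: str, manifest: list[dict], now: datetime) -> list[str]:
    """List the required evidence types lacking a fresh, passing artifact."""
    max_age = timedelta(minutes=ARTIFACT_MAX_AGE_MINUTES)
    problems = []
    for required in REQUIRED_EVIDENCE.get(tier, []):
        ok = any(
            entry["type"] == required
            and entry["passed"]
            and now - entry["capturedAt"] <= max_age
            for entry in manifest
        )
        if not ok:
            problems.append(f"missing fresh passing {required} evidence")
    return problems


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    manifest = [
        {"type": "cli", "passed": True, "capturedAt": now - timedelta(minutes=10)},
        {"type": "api", "passed": True, "capturedAt": now - timedelta(minutes=45)},
    ]
    print(evidence_problems("critical", manifest, now))
```

Here the API artifact passed but is 45 minutes old, so a critical-tier PR would still be blocked until evidence is recaptured.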
Every production regression must produce a test case:
    production regression → incident detected → harness gap issue created
    → test case written → gap resolved → SLA tracked

The shipwright incident gap commands manage this loop:
    shipwright incident gap list    # Show all open gaps
    shipwright incident gap sla     # Show SLA compliance metrics
    shipwright incident gap resolve gap-inc-123 scripts/sw-auth-test.sh

SLAs are enforced by policy: the harness gap SLAs live in config/policy.json.
Gaps that exceed SLA are flagged as overdue. GitHub issues are auto-created for tracking.
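Overdue detection can be sketched as a comparison between each open gap's age and an SLA window. The 72-hour value, the gap record fields, and the function name are all hypothetical; the actual SLA values are defined in config/policy.json and are not shown here.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

SLA_HOURS = 72  # hypothetical value; the real SLA lives in config/policy.json


def overdue_gaps(gaps: list[dict], now: datetime, sla_hours: int = SLA_HOURS) -> list[str]:
    """Return IDs of unresolved gaps that have been open longer than the SLA."""
    window = timedelta(hours=sla_hours)
    return [
        gap["id"]
        for gap in gaps
        if gap["resolvedAt"] is None and now - gap["openedAt"] > window
    ]


if __name__ == "__main__":
    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    gaps = [
        {"id": "gap-inc-123", "openedAt": now - timedelta(hours=100), "resolvedAt": None},
        {"id": "gap-inc-124", "openedAt": now - timedelta(hours=10), "resolvedAt": None},
    ]
    print(overdue_gaps(gaps, now))
```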
12-Stage Pipeline
Not just build-test-merge. A full 12-stage pipeline with intake, planning, design, adversarial review, compound quality gates, deployment, validation, and monitoring. Each stage has configurable quality gates.
Predictive Risk
Risk isn’t just path-based classification. The intelligence layer scores issues using GitHub signals — security alerts, similar past failures, contributor expertise, file churn patterns — before a single line of code is written.
Self-Healing Builds
When tests fail, the pipeline re-enters the build loop with error context. Convergence detection prevents infinite loops. Error classification routes retries intelligently. The system learns which fixes work.
Persistent Memory
Every pipeline run feeds back into persistent memory: failure patterns, fix effectiveness, prediction accuracy. The next run benefits from every previous one. Cross-repo global memory shares learnings across projects.
18 Autonomous Agents
Specialized agents for every role: PM, code reviewer, security auditor, test generator, incident commander, architecture enforcer, and more. Each agent has defined responsibilities and quality standards.
Fleet Operations
The Code Factory pattern applied across your entire organization. Fleet daemons watch every repo, shared worker pools rebalance based on priority, and aggregate metrics track delivery health org-wide.
    # Evidence framework: capture and verify all types
    npm run harness:evidence:capture            # All collectors
    npm run harness:evidence:capture:api        # API endpoints only
    npm run harness:evidence:capture:cli        # CLI commands only
    npm run harness:evidence:capture:database   # Database checks only
    npm run harness:evidence:capture:browser    # Browser/UI only
    npm run harness:evidence:verify             # Verify manifest + freshness
    npm run harness:evidence:pre-pr             # Capture + verify in one step

    # Risk and policy
    npm run harness:risk-tier
    npm run harness:policy-gate

    # Core pipeline
    npm test
    npm run test:smoke
    npm run test:integration

    # Incident-to-harness loop
    shipwright incident gap list
    shipwright incident gap sla
    shipwright incident gap resolve <gap-id> <test-file>

    # Review rerun management
    shipwright review-rerun request <pr#> <sha> [agent]
    shipwright review-rerun check <pr#>
    shipwright review-rerun wait <pr#> <sha> [timeout]

    # Evidence CLI (all types)
    shipwright evidence capture [type]
    shipwright evidence verify
    shipwright evidence pre-pr [type]
    shipwright evidence status
    shipwright evidence types

Four GitHub Actions workflows implement the Code Factory control plane:
| Workflow | Trigger | Role |
|---|---|---|
| risk-policy-gate.yml | PR open/sync | Classify risk, enforce preflight |
| review-remediation.yml | Review submitted | Auto-fix review findings |
| auto-resolve-threads.yml | Check suite complete | Clean up bot-only threads |
| shipwright-pipeline.yml | Issue labeled | Full autonomous delivery |
Plus the existing CI workflows (test.yml, e2e-smoke.yml) that run after the preflight gate passes.
config/policy.json is the single source of truth.

The result: a repo where agents implement, validate, and are reviewed with deterministic, auditable standards, and the system gets smarter with every run.