
Conformance suite — off-app QA

The Hub is QA'd from outside itself, by a Cowork plugin (helm-qa-sweep) driving a real browser session against a real Helm instance. The Hub itself ships no self-test code, no diagnostic tab, no in-app conformance runner. The substrate the off-app sweep reads against is the Process Library (BIKE.L<n>-NNNN designated processes + realizes(spec→binding) edges, migrations 130–132).

Hub-side QA is deliberately removed

v0.6.464 shipped one in-app "Run Diagnostic" (replacing three earlier QA viewers). v0.6.465 made it canon-driven and able to ship clean to any vertical (migration 129 added diagnostic_checks). v0.6.466 then removed it from the Hub entirely, dropping the diagnostic_checks table via migration 133 and stripping the Health tab. v0.6.467 cleaned up the remaining doc orphans (including a draft conformance spec).

The decision: a system cannot honestly grade itself. The Hub may compute its own pass/fail and claim conformance against a canon the Hub itself ships, but that loop is closed — the same code authors and grades. Off-app QA is the only honest reading. Do NOT re-add diagnostic/conformance code to the Hub.

Architecture

+---------------------------+             +-----------------------+
| Cowork "helm-qa-sweep"    |  drives →   | Real browser tab      |
| plugin                    |             | pointed at Helm QA    |
| SKILL.md + scenarios      |             | instance              |
+-------------+-------------+             +----------+------------+
              |                                      |
              | reads the canon                      | observes outcomes
              v                                      v
+---------------------------+             +-----------------------+
| Process Library           |             | helm-qa-sweep         |
| software_process,         | ←── ←── ←── | dashboard             |
| realizes_links,           |             | (verdicts + audit     |
| BIKE.L<n>-NNNN catalog    |             | chain diff)           |
+---------------------------+             +-----------------------+

The plugin follows the precedent set by gm-dealer-audit: a SKILL.md driving Claude-in-Chrome through scenarios and posting verdicts to a dashboard. The Hub plays no part in the verification loop; it just gets exercised.

The canon the sweep reads against

The Process Library is the source of truth for what the Hub is supposed to do. The sweep does not invent scenarios; it walks the canon and verifies behaviour matches.

| Layer | Stored in | What the sweep does |
|---|---|---|
| L1 — Code | software_process WHERE layer='L1' + code_anchor pointing at src/index.js:apiXxx | Exercises the API path via the UI and asserts the audit chain entry lands |
| L2 — Vertical canon | software_process WHERE layer='L2' + designated BIKE.L2-NNNN | Asserts the business rule holds end-to-end (e.g. "walk-in service ticket requires deposit") |
| L3 — SME / per-shop lore | software_process WHERE layer='L3' + per-shop variants | Runs against a staging replica of the actual shop's D1; asserts shop-specific behaviour holds |

Permanent designations (migration 130, v0.6.466)

Every Process Library row is keyed by a BIKE.L<n>-NNNN designation (Tom calls the roster the Dr. Strangelove roster). The designation is permanent and stable across renames; the sweep keys verdicts by designation, so a process that has been re-titled doesn't lose its history. Examples:

  • BIKE.L1-0042 — the apiSalesPost code path
  • BIKE.L2-0017 — the walk-in-service-requires-deposit rule
  • BIKE.L3-0008 — Swicked-specific chainring-scenario awaiting_customer flip
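Keying verdicts by designation rather than by title can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the names Verdict, VerdictHistory and parse_designation are hypothetical, the process titles are invented, and the pattern assumes n is 1 to 3 and NNNN is exactly four digits.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Assumption: BIKE.L<n>-NNNN means layer 1-3 plus a four-digit serial.
DESIGNATION_RE = re.compile(r"BIKE\.L([123])-(\d{4})")


@dataclass(frozen=True)
class Verdict:
    designation: str  # permanent key, e.g. "BIKE.L2-0017"
    title: str        # display title at the time of the run; may change
    passed: bool


def parse_designation(designation: str) -> tuple[int, int]:
    """Return (layer, serial), or raise ValueError if the key is malformed."""
    match = DESIGNATION_RE.fullmatch(designation)
    if match is None:
        raise ValueError(f"not a BIKE.L<n>-NNNN designation: {designation!r}")
    return int(match.group(1)), int(match.group(2))


class VerdictHistory:
    """Hypothetical in-memory store that keys on designation, never on title."""

    def __init__(self) -> None:
        self._by_designation: dict[str, list[Verdict]] = defaultdict(list)

    def record(self, verdict: Verdict) -> None:
        parse_designation(verdict.designation)  # reject malformed keys early
        self._by_designation[verdict.designation].append(verdict)

    def for_designation(self, designation: str) -> list[Verdict]:
        return list(self._by_designation.get(designation, []))


store = VerdictHistory()
store.record(Verdict("BIKE.L2-0017", "Walk-in service requires deposit", True))
# The process is re-titled between runs; the designation does not change.
store.record(Verdict("BIKE.L2-0017", "Deposit required for walk-in tickets", False))
```

Both verdicts land under BIKE.L2-0017, so the re-titled process keeps one continuous history.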

The connective seam (migrations 131 + 132, Step A + extend)

realizes_links (migration 131) is the join between spec (L2/L3 narrative) and binding (L1 code anchor). Before Step A there were 148 spec keys and 239 binding keys, of which only 2 were shared, and 0 cross-layer realizes edges. The seam was missing, which meant a canon-driven runner had nothing to traverse.

  • Step A (migration 131) — initial seed of realizes(spec→binding) edges
  • Step A extend (migration 132) — 40 more edges

Each realizes_links row says "this L2/L3 narrative is realized by this L1 code anchor." That's the join the off-app sweep needs to translate a canon row into an executable scenario.
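A rough sketch of that traversal, using an in-memory SQLite database as a stand-in for D1. The table names software_process and realizes_links and the columns layer and code_anchor come from this page; every other column name (designation, title, spec_designation, binding_designation), the anchor string, and the pairing of BIKE.L2-0017 with BIKE.L1-0042 are assumptions for illustration, not the real schema or data from migrations 130 to 132.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE software_process (
    designation TEXT PRIMARY KEY,   -- assumed column name
    layer       TEXT NOT NULL,
    title       TEXT NOT NULL,      -- assumed column name
    code_anchor TEXT
);
CREATE TABLE realizes_links (       -- column names are assumptions
    spec_designation    TEXT NOT NULL,
    binding_designation TEXT NOT NULL
);
INSERT INTO software_process VALUES
    ('BIKE.L1-0042', 'L1', 'apiSalesPost code path', 'src/index.js:apiSalesPost'),
    ('BIKE.L2-0017', 'L2', 'walk-in service requires deposit', NULL),
    ('BIKE.L3-0008', 'L3', 'chainring scenario awaiting_customer flip', NULL);
-- Illustrative edge only: which binding realizes which spec is invented here.
INSERT INTO realizes_links VALUES ('BIKE.L2-0017', 'BIKE.L1-0042');
""")


def executable_scenarios(db: sqlite3.Connection) -> list[tuple]:
    """Specs (L2/L3) joined to the L1 code anchors that realize them."""
    return db.execute("""
        SELECT spec.designation, binding.designation, binding.code_anchor
        FROM realizes_links AS link
        JOIN software_process AS spec
          ON spec.designation = link.spec_designation
        JOIN software_process AS binding
          ON binding.designation = link.binding_designation
        WHERE spec.layer IN ('L2', 'L3') AND binding.layer = 'L1'
        ORDER BY spec.designation
    """).fetchall()


def unrealized_specs(db: sqlite3.Connection) -> list[str]:
    """Specs with no realizes edge: nothing for a sweep to traverse yet."""
    rows = db.execute("""
        SELECT spec.designation
        FROM software_process AS spec
        LEFT JOIN realizes_links AS link
          ON link.spec_designation = spec.designation
        WHERE spec.layer IN ('L2', 'L3') AND link.spec_designation IS NULL
        ORDER BY spec.designation
    """).fetchall()
    return [row[0] for row in rows]
```

With the sample rows, executable_scenarios(conn) yields the single BIKE.L2-0017 to BIKE.L1-0042 pairing, while unrealized_specs(conn) reports BIKE.L3-0008 as a spec that still lacks an edge.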

What the sweep does — and what it does NOT do

| Does | Does NOT |
|---|---|
| Walk the canon and exercise behaviours through the real UI | Read the Hub's own self-grading |
| Post pass/fail verdicts to its own dashboard | Trust any "diagnostic OK" result emitted by the Hub |
| Diff audit_events.chain_hash before and after each scenario as a tamper-check | Auto-gate the Hub's deploy (verdicts inform Tom; they don't block prod) |
| Run against the QA Helm instance (mockup-only-swicked-helm) for vertical-level regressions | Run against Swicked's prod D1 directly (L3 lore runs against a staging replica) |
| Cite the canon row (BIKE.L2-0017) in every verdict so a failure traces back to a spec | Invent scenarios from operator memory without a canon row to anchor them |
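The before/after diff of audit_events.chain_hash listed above might look roughly like this. The sketch assumes the audit chain is append-only and that the sweep can snapshot the chain_hash values in order; how the Hub computes each hash is not described here, so the code only compares snapshots. ChainDiff, diff_chain and the sample hash strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChainDiff:
    appended: list[str]  # hashes added by the scenario
    tampered: bool       # True if any pre-existing entry changed or vanished


def diff_chain(before: list[str], after: list[str]) -> ChainDiff:
    """Compare ordered chain_hash snapshots taken around one scenario."""
    # Append-only assumption: the old chain must survive as an exact prefix.
    prefix_intact = after[: len(before)] == before
    appended = after[len(before):] if prefix_intact else []
    return ChainDiff(appended=appended, tampered=not prefix_intact)


def scenario_passes(diff: ChainDiff, expected_new_entries: int = 1) -> bool:
    """Pass only if history is intact and the expected entries landed."""
    return not diff.tampered and len(diff.appended) >= expected_new_entries


before = ["h1", "h2"]
clean = diff_chain(before, ["h1", "h2", "h3"])       # one new entry appended
rewritten = diff_chain(before, ["h1", "hX", "h3"])   # an old entry was altered
silent = diff_chain(before, ["h1", "h2"])            # nothing was audited
```

Here clean passes, rewritten fails as tampered, and silent fails because the scenario left no audit entry, which is the L1 condition that the audit chain entry lands.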

The chain of doctrine that landed here

  1. v0.27 — Process Library three-layer model (migration 085). Code / vertical canon / SME knowledge, joined by concept_key.
  2. v0.28 — QA Tool runner (in-Hub) on the qa-tool branch. Phase 1 shipped, Phase 2 partial. Replaced by:
  3. v0.6.464–v0.6.466 — Hub QA collapse + removal. Three QA viewers consolidated into one Run Diagnostic, made canon-driven, then removed entirely. The reasoning: a self-grading system isn't honest QA.
  4. Off-app via Cowork. helm-qa-sweep, with the pattern modelled on the gm-dealer-audit plugin (agentic SKILL.md driving Claude-in-Chrome with verdicts to a dashboard).
  5. Connective seam first. Step A (migrations 131 + 132) lays the realizes(spec→binding) edges that any canon-driven sweep needs.

The line that survives: the Hub authors the canon and the code; something else grades how well they match.

See also

  • Process Library — the L1/L2/L3 canon the sweep reads against
  • Audit-everything principle — the chain the sweep diffs as oracle
  • migrations/130_process_designations.sql — BIKE.L<n>-NNNN designations
  • migrations/131_realizes_links.sql — initial spec→binding edges
  • migrations/132_realizes_links_extend.sql — Step A extend (40 more edges)
  • migrations/133_drop_diagnostic_checks.sql — Hub-side QA removed
  • Cowork plugin helm-qa-sweep (separate repo) — the off-app sweep itself