Debug an incident

A shop owner texts you: "Helm just gave me an error when I tried to ring up a sale." Here's the playbook for getting from that message to a fix.

Drafted from planning · v0.1

Step 1: Ask for the request ID (30 seconds)

Helm's error toasts include a copyable request_id (the value of the X-Request-Id header). Ask the owner: "Can you copy the request ID from the error message and send it to me?"

If they can no longer see it (the toast has been dismissed), ask:

  • What were you doing? ("Trying to refund the sale from yesterday for Mrs. Johnson")
  • When? ("About 5 minutes ago")
  • What did the error say? Photograph if possible.

Step 2: Look up the request (1 minute)

# If you have the request_id
wrangler tail --env {slug} --search "{request_id}"

# wrangler tail only streams live logs. For a request that has already
# happened, search the Logpush archive in R2 (query with DuckDB if needed).

The log line has:

  • path, method, status, ms
  • error_class, error_msg
  • staff_id, staff_label
  • error_context (table, row IDs, etc.)
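When the request is no longer in the live tail, the same fields can be pulled out of archived log lines. The sketch below assumes each archived line is one JSON object whose keys match the field names listed above, plus a request_id key. The exact shape of Helm's log records is an assumption, and the names HelmLogLine and findRequest are hypothetical.

```typescript
// Hypothetical shape of one log record; key names follow the list above.
interface HelmLogLine {
  request_id: string; // assumed key name for the X-Request-Id value
  path: string;
  method: string;
  status: number;
  ms: number;
  error_class?: string;
  error_msg?: string;
  staff_id?: string;
  staff_label?: string;
  error_context?: Record<string, unknown>;
}

// Return every record in a newline-delimited JSON dump that matches the ID.
// Blank lines and lines that do not parse to a JSON object are skipped.
function findRequest(ndjson: string, requestId: string): HelmLogLine[] {
  const matches: HelmLogLine[] = [];
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const record = JSON.parse(line) as Partial<HelmLogLine>;
      if (record.request_id === requestId) matches.push(record as HelmLogLine);
    } catch {
      // Not a usable JSON object (for example a truncated line); ignore it.
    }
  }
  return matches;
}
```

The error_class, error_msg, and error_context of the matching record are usually enough to decide which table and rows to look at in Step 3.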

Step 3: Look at the audit log (1 minute)

-- Recent activity around the time of the incident
SELECT e.id, e.at, e.staff_label, e.action, m.summary
FROM audit_events e
JOIN audit_mutations m ON m.event_id = e.id
WHERE e.at BETWEEN datetime('now', '-30 minutes') AND datetime('now')
AND e.staff_id = ?
ORDER BY e.at DESC;

This shows what the staff member did in the 30 minutes before you ran the query, so run it soon after the report comes in. It often shows the precise state right before the broken operation.
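For an incident that happened earlier in the day, the window can be centered on the incident time instead of on now. A minimal sketch, assuming e.at is stored as UTC text in the "YYYY-MM-DD HH:MM:SS" format that SQLite's datetime() produces; the helper names are hypothetical.

```typescript
// Format a Date the way SQLite's datetime() does: "YYYY-MM-DD HH:MM:SS" in UTC.
function toSqliteDatetime(d: Date): string {
  return d.toISOString().slice(0, 19).replace("T", " ");
}

// Build from/to bounds that span `minutes` on each side of the incident.
function incidentWindow(incidentAt: Date, minutes: number): { from: string; to: string } {
  const spanMs = minutes * 60 * 1000;
  return {
    from: toSqliteDatetime(new Date(incidentAt.getTime() - spanMs)),
    to: toSqliteDatetime(new Date(incidentAt.getTime() + spanMs)),
  };
}
```

The two values would then be bound to a WHERE e.at BETWEEN ? AND ? clause in place of the datetime('now', ...) expressions.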

Step 4: Reproduce locally (5-15 minutes)

# Get a fresh copy of production state for this slug
wrangler d1 export helm-{slug}-db --output ./incident.sql --remote
# Apply to local (D1 loads a SQL file through `d1 execute --file`)
wrangler d1 execute helm-dev --local --file=./incident.sql
# Restart wrangler dev
wrangler dev --env=development --local

Now repeat the action that triggered the incident; verify the bug reproduces.

If it reproduces, write a test that captures the failure. Then fix.

If it doesn't reproduce, the production state may have changed since the error, or the failure may depend on timing (a race between two requests). Investigate the audit log more closely.

Step 5: Fix and deploy (varies)

Fix on a branch. Verify the test now passes. Land on main. Rebase the shop's branch. Deploy.

Step 6: Verify in production (2 minutes)

Test the fixed path in production yourself: sign in with your own account and perform the operation that broke. Verify that it succeeds.

Step 7: Tell the owner (1 minute)

"Found it: a bug where split-tender refunds weren't computing the cash portion correctly. I've deployed a fix and tested the case from your error. It should be working now; try Mrs. Johnson's refund again."

Total time for a typical incident: 15-45 minutes from shop's text to fix-in-prod.

Common bug classes

NOT NULL constraint violations

The Worker passes null to a column that's NOT NULL. This is common when a frontend field is left empty and the backend doesn't coerce it.

Fix: coerce on the backend; also fix the frontend to send a value.
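A minimal sketch of backend coercion, assuming a hypothetical sale-line payload in which note maps to a NOT NULL text column. The field names and the choice of an empty string as the default are illustrative, not Helm's actual schema.

```typescript
// Hypothetical payload for a sale line; `note` maps to a NOT NULL column.
interface SaleLineInput {
  sku: string;
  quantity: number;
  note?: string | null;
}

interface SaleLineRow {
  sku: string;
  quantity: number;
  note: string;
}

// Coerce optional fields to a non-null default before binding them to SQL.
// Required fields are rejected instead of being silently defaulted.
function toSaleLineRow(input: SaleLineInput): SaleLineRow {
  if (typeof input.sku !== "string" || input.sku.trim() === "") {
    throw new Error("sku is required");
  }
  if (!Number.isInteger(input.quantity) || input.quantity <= 0) {
    throw new Error("quantity must be a positive integer");
  }
  return {
    sku: input.sku.trim(),
    quantity: input.quantity,
    note: input.note ?? "", // null or undefined becomes an empty string
  };
}
```

Defaulting is only safe for fields where an empty value is meaningful; for everything else, a validation error that names the field is easier to debug than a constraint violation from the database.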

Race conditions

Two operators make conflicting changes nearly simultaneously. Optimistic locking (in-situ editing §6.2) catches these wherever it is applied, but some operations are not covered yet.

Fix: add a version column + check-and-update for the affected operation.
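The check-and-update pattern can be sketched as follows. An in-memory Map stands in for the table so the example runs without a database; the row shape and the function name are hypothetical.

```typescript
interface VersionedRow {
  id: string;
  version: number;
  amountCents: number;
}

// In-memory stand-in for a table, so the sketch runs without a database.
const rows = new Map<string, VersionedRow>();

// Equivalent in spirit to:
//   UPDATE t SET amount_cents = ?, version = version + 1
//   WHERE id = ? AND version = ?
// Returns true when the row was updated, false when it was missing or
// another writer changed it first.
function updateIfUnchanged(id: string, expectedVersion: number, amountCents: number): boolean {
  const current = rows.get(id);
  if (current === undefined || current.version !== expectedVersion) {
    return false; // zero rows affected
  }
  rows.set(id, { id, version: current.version + 1, amountCents });
  return true;
}
```

In the Worker, the boolean would come from the affected-row count of the UPDATE: zero rows means the caller lost the race and should reload before retrying.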

Stale frontend state

An operator opens a screen and loads data; another operator changes that data; the first operator then submits an edit based on the stale copy.

Fix: include a version or updated_at in the frontend's submit; reject if it doesn't match.
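One way the rejection could look on the backend, as a sketch. The submission shape, the field name updatedAtSeen, and the use of HTTP status 409 are assumptions made for illustration.

```typescript
interface EditSubmission {
  id: string;
  updatedAtSeen: string; // the updated_at value the screen was loaded with
  changes: Record<string, unknown>;
}

type EditDecision =
  | { status: 200 }
  | { status: 409; message: string; currentUpdatedAt: string };

// Compare the timestamp the client saw with the row's current timestamp.
function decideEdit(submission: EditSubmission, currentUpdatedAt: string): EditDecision {
  if (submission.updatedAtSeen !== currentUpdatedAt) {
    return {
      status: 409,
      message: "This record changed since you opened it. Reload and try again.",
      currentUpdatedAt,
    };
  }
  return { status: 200 };
}
```

The comparison and the write need to happen in the same statement or transaction; checking first and writing afterwards leaves a window for the race described in the previous section.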

Idempotency-key reuse

An external call succeeded and we recorded the success; a later request then reuses the same key, so it is treated as the same operation and never actually runs.

Fix: idempotency key includes attempt number; new attempt = new key.
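A sketch of a key that carries the attempt number. The key format and the function name are hypothetical; any format works as long as a network retry of the same attempt produces the same key and a new attempt produces a different one.

```typescript
// Build a key that is stable for retries of the same attempt and different
// for each new attempt of the same operation.
function idempotencyKey(operation: string, entityId: string, attempt: number): string {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new Error("attempt must be a positive integer");
  }
  return `${operation}:${entityId}:attempt-${attempt}`;
}
```

The attempt number has to be stored with the operation and incremented only when the operator deliberately starts over, not on automatic retries of a request that may have already succeeded.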

When you can't reproduce

Some bugs are environment-specific:

  • Operator's browser (try their version: Safari, Edge, old Chrome)
  • Network (a flaky shop Wi-Fi might cause partial requests)
  • D1 read replication lag (rare; the operator might have read stale data right after a write)

If you can't reproduce after 30 minutes, add more logging around the failure path, deploy, and wait for the bug to recur with better telemetry.
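Extra logging around a failure path can be as small as a wrapper that records context when a step throws. The sketch below reuses the error_class, error_msg, and error_context field names from Step 2; the wrapper name, the step label, and the default of writing to console.error are assumptions.

```typescript
// Run `fn`; on failure, emit one structured log line with extra context,
// then rethrow so the caller's error handling is unchanged.
function withFailureLog<T>(
  step: string,
  context: Record<string, unknown>,
  fn: () => T,
  log: (line: string) => void = console.error,
): T {
  try {
    return fn();
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    log(
      JSON.stringify({
        step,
        error_class: error.name,
        error_msg: error.message,
        error_context: context,
      }),
    );
    throw error;
  }
}
```

This version only covers synchronous steps; an async step would need a variant that awaits fn() inside the try block.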

See also