Debug an incident
A shop owner texts you: "Helm just gave me an error when I tried to ring up a sale." Here's the playbook for getting from that message to a fix.
Step 1: Ask for the request ID (30 seconds)
Helm's error toasts include a copy-able request_id (X-Request-Id header). Ask the owner: "Can you copy the request ID from the error message and send it to me?"
If they can't reach it (the error is gone), ask:
- What were you doing? ("Trying to refund the sale from yesterday for Mrs. Johnson")
- When? ("About 5 minutes ago")
- What did the error say? Photograph if possible.
Step 2: Look up the request (1 minute)
# If you have the request_id
wrangler tail --env {slug} --search "{request_id}"
# Or grep yesterday's logs if it's in the past
# (Logpush archives in R2; query with DuckDB if needed)
The log line has:
path,method,status,mserror_class,error_msgstaff_id,staff_labelerror_context(table, row IDs, etc.)
Step 3: Look at the audit log (1 minute)
-- Recent activity around the time of the incident
SELECT e.id, e.at, e.staff_label, e.action, m.summary
FROM audit_events e
JOIN audit_mutations m ON m.event_id = e.id
WHERE e.at BETWEEN datetime('now', '-30 minutes') AND datetime('now')
AND e.staff_id = ?
ORDER BY e.at DESC;
This shows what the staff was doing in the 30 minutes around the error. Often shows the precise state right before the broken operation.
Step 4: Reproduce locally (5-15 minutes)
# Get a fresh copy of production state for this slug
wrangler d1 export helm-{slug}-db --output ./incident.sql --remote
# Apply to local
wrangler d1 import helm-dev ./incident.sql --local
# Restart wrangler dev
wrangler dev --env=development --local
Now repeat the action that triggered the incident; verify the bug reproduces.
If it reproduces, write a test that captures the failure. Then fix.
If it doesn't reproduce, the production state might have drifted by now (a race?). Investigate the audit log more closely.
Step 5: Fix and deploy (varies)
Fix on a branch. Verify the test now passes. Land on main. Rebase the shop's branch. Deploy.
Step 6: Verify in production (2 minutes)
Test the fixed path in production yourself (sign in as you, perform the operation that broke). Verify success.
Step 7: Tell the owner (1 minute)
"Found it — a bug where refunds with split-tender weren't computing the cash portion right. Deployed a fix. Tested the case from your error. Should be working now; try Mrs. Johnson's refund again."
Total time for a typical incident: 15-45 minutes from shop's text to fix-in-prod.
Common bug classes
NOT NULL constraint violations
Worker passes null to a column that's NOT NULL. Common when a frontend field is empty and the backend doesn't coerce.
Fix: coerce on the backend; also fix the frontend to send a value.
Race conditions
Two operators making conflicting changes nearly simultaneously. Optimistic locking (in-situ editing §6.2) catches these in the right places; gaps exist.
Fix: add a version column + check-and-update for the affected operation.
Stale frontend state
Operator opens a screen, gets data; another operator changes the data; first operator submits a stale-data-based edit.
Fix: include a version or updated_at in the frontend's submit; reject if it doesn't match.
Idempotency-key reuse
External call succeeded; we recorded success; later request reuses the same key thinking it's the same op.
Fix: idempotency key includes attempt number; new attempt = new key.
When you can't reproduce
Some bugs are environment-specific:
- Operator's browser (try their version: Safari, Edge, old Chrome)
- Network (a flaky shop Wi-Fi might cause partial requests)
- D1 read replication lag (rare; the operator might have read stale data right after a write)
If you can't reproduce after 30 minutes, add more logging around the failure path, deploy, and wait for the bug to recur with better telemetry.