Disaster recovery
The runbook for the day things go wrong. Calm, sequential, with the steps already written so you don't have to think under stress.
Severity classification
SEV1 — money or data integrity at risk. Examples:
- Charges happening but transactions not committing
- Audit chain broken
- Shop's database unavailable
SEV2 — significant feature broken, workaround possible. Examples:
- SMS not sending
- Receipts not generating
- Specific endpoint returning 500s
SEV3 — minor; can wait for next business day. Examples:
- A typo in the UI
- A non-critical cron didn't run (and isn't financial)
SEV1 response
- Stop further damage. If the shop is actively losing money on each transaction, instruct them to pause sales until you say go.
- Diagnose. wrangler tail, the audit log, the alert dashboard.
- Communicate. Text the shop owner: "I'm on it. Status update in 30 minutes."
- Apply the fix or roll back. Prefer rollback if you're not 100% sure of the fix.
- Verify. Test the broken path; verify audit shows a clean transaction.
- Post-mortem. Within 48 hours, write a short post-mortem: what happened, what we did, what we'll do to prevent it. Email it to the shop owner if the SEV1 affected them.
Specific scenarios
Audit chain verification failed
This is always SEV1.
The daily cron caught a chain break. Possible causes:
- A DBA-style direct edit happened (rare; investigate who has D1 console access)
- A withAudit write failed mid-transaction and left a gap
- A bug in our chain-hash computation
Response:
- wrangler d1 execute helm-{slug}-db --remote --command "SELECT id, at, action FROM audit_events ORDER BY id DESC LIMIT 50;"
- Identify the first broken row (where prev_chain_hash doesn't match the previous row's chain_hash)
- Compare to the audit_archive in R2 (monthly archives); restore from there if the broken row falls before the archive boundary
- Re-verify
- Investigate root cause; deploy fix; resume
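The "identify the first broken row" step can be sketched as a small link check. This is a minimal illustration, not the project's actual verifier: the AuditRow shape and the findFirstBrokenRow name are hypothetical, it assumes rows are fetched in ascending id order with prev_chain_hash and chain_hash selected, and it only compares links (it does not recompute any hash).

```typescript
// Hypothetical row shape; only the columns needed for the link check.
interface AuditRow {
  id: number;
  prev_chain_hash: string | null; // assumed null for the very first row
  chain_hash: string;
}

// Returns the id of the first row whose prev_chain_hash does not equal
// the chain_hash of the row before it, or null if every link is intact.
// Rows must already be sorted by ascending id.
function findFirstBrokenRow(rows: AuditRow[]): number | null {
  let previous: AuditRow | undefined;
  for (const row of rows) {
    if (previous !== undefined && row.prev_chain_hash !== previous.chain_hash) {
      return row.id;
    }
    previous = row;
  }
  return null;
}
```

A null result on the fetched window only means the links in that window are consistent; a break further back would need a wider query.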
D1 unavailable
Rare. Cloudflare D1 incidents are usually short. Response:
- Confirm via https://www.cloudflarestatus.com
- Tell the shop owner; status update every 15 minutes
- Wait for Cloudflare; we can't accelerate
If D1 has lost data (catastrophic; has never happened):
- Restore from yesterday's backup in R2: wrangler d1 import helm-{slug}-db backups/db/{date}.sql
- Compare with Stripe charges since the backup time; manually re-create any missing transactions
- Re-deploy
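The Stripe comparison step can be sketched as a pure reconciliation function. Everything named here is an assumption for illustration: StripeChargeSummary, findMissingCharges, and the idea that Helm's restored transactions can be reduced to a set of Stripe charge ids. Only the `created` field being Unix seconds comes from how Stripe reports charges.

```typescript
// Hypothetical, trimmed-down view of a Stripe charge.
interface StripeChargeSummary {
  id: string;      // Stripe charge id
  created: number; // Unix seconds, as Stripe reports it
}

// Returns charges created at or after the backup time that have no
// matching record in the restored database, oldest first.
function findMissingCharges(
  charges: StripeChargeSummary[],
  recordedChargeIds: Set<string>,
  backupTimeUnixSeconds: number,
): StripeChargeSummary[] {
  return charges
    .filter((charge) => charge.created >= backupTimeUnixSeconds)
    .filter((charge) => !recordedChargeIds.has(charge.id))
    .sort((a, b) => a.created - b.created);
}
```

The returned list is the worklist for manual re-creation; sorting oldest first keeps re-created transactions in their original order.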
Stripe webhook signatures failing
Symptoms: every webhook returns 400 from our endpoint. The shop's payments are landing in Stripe but Helm isn't recording them.
Response:
- Verify the secret: wrangler secret list --env {slug} should show STRIPE_WEBHOOK_SECRET
- Check Stripe dashboard → Webhooks → Endpoints; the signing secret there should match the deployed one (wrangler secret list shows names only, not values, so when in doubt re-set it)
- If they differ, the most common cause is that someone (the shop or otherwise) rotated the Stripe secret; run wrangler secret put STRIPE_WEBHOOK_SECRET --env {slug} with the new value
- Stripe automatically retries failed webhooks for ~3 days; missed events will catch up
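When diagnosing, it can help to rule out a malformed or stale Stripe-Signature header before blaming the secret. The sketch below parses the documented `t=...,v1=...` header format and checks the timestamp against a tolerance. The function names are hypothetical, the 300-second default mirrors the tolerance Stripe's official libraries use, and the HMAC comparison itself is deliberately left out.

```typescript
interface ParsedStripeSignature {
  timestamp: number;      // value of t=, Unix seconds
  v1Signatures: string[]; // every v1= value in the header
}

// Parses a Stripe-Signature header such as "t=1492774577,v1=abc,v0=def".
// Returns null when the timestamp or all v1 signatures are missing.
function parseStripeSignatureHeader(header: string): ParsedStripeSignature | null {
  let timestamp = Number.NaN;
  const v1Signatures: string[] = [];
  for (const part of header.split(",")) {
    const separator = part.indexOf("=");
    if (separator === -1) continue;
    const key = part.slice(0, separator).trim();
    const value = part.slice(separator + 1).trim();
    if (value === "") continue;
    if (key === "t") timestamp = Number(value);
    if (key === "v1") v1Signatures.push(value);
  }
  if (!Number.isInteger(timestamp) || v1Signatures.length === 0) return null;
  return { timestamp, v1Signatures };
}

// True when the signed timestamp is within the tolerance window of "now".
function isWithinTolerance(
  timestamp: number,
  nowUnixSeconds: number,
  toleranceSeconds = 300,
): boolean {
  return Math.abs(nowUnixSeconds - timestamp) <= toleranceSeconds;
}
```

If the header parses and the timestamp is fresh but verification still fails on every event, a secret mismatch is the likely cause.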
A session token leaked
Someone outside the shop has a valid helm_session cookie. Response:
- wrangler d1 execute helm-{slug}-db --remote --command "DELETE FROM staff_sessions WHERE staff_id = ?;" kills all sessions for that staff member (substitute the affected staff ID for ?)
- Rotate the staff member's PIN: ask them to set a new PIN
- Audit-log the incident
- Email the owner; explain what we did
- Hard expiry on sessions (12 hours) limits the damage
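To make the 12-hour hard expiry concrete, here is a minimal sketch of the check and of the remaining exposure window for a leaked token. The function names and the millisecond timestamps are assumptions; the real session schema is not shown in this runbook.

```typescript
// 12 hours, per the hard expiry stated above.
const SESSION_HARD_EXPIRY_MS = 12 * 60 * 60 * 1000;

// True once a session has reached its hard expiry, regardless of activity.
function isSessionExpired(createdAtMs: number, nowMs: number): boolean {
  return nowMs - createdAtMs >= SESSION_HARD_EXPIRY_MS;
}

// How long a leaked token stays usable if nobody deletes the session.
// Never negative: an already-expired session has zero remaining exposure.
function remainingExposureMs(createdAtMs: number, nowMs: number): number {
  return Math.max(0, createdAtMs + SESSION_HARD_EXPIRY_MS - nowMs);
}
```

Deleting the session rows remains the first step; the expiry only bounds the damage if that step is delayed.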
A shop's Worker is somehow misconfigured to talk to another shop's D1
This must not happen under the single-tenant-per-shop model. If it does:
- Halt deploys immediately
- Verify wrangler.jsonc has the correct bindings for the affected shop
- Roll back to a known-good version
- Investigate how the binding got swapped (likely a manual edit; tighten CI checks)
- Audit both shops' data: did the wrong shop's data get written? If so, restore from yesterday's backup
This is the worst-case scenario the architecture is designed against. The bindings shouldn't allow it.
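One way to tighten the CI checks is a guard that fails the build when an environment binds a database that isn't its own. The sketch below assumes one wrangler environment per shop, named by slug (as the --env {slug} commands suggest), and the helm-{slug}-db naming used throughout this runbook. The function name is hypothetical, and the input is wrangler.jsonc already parsed into an object (the file needs a JSONC-aware parser).

```typescript
// Subset of the wrangler config shape relevant to D1 bindings.
interface D1Binding {
  binding: string;
  database_name: string;
  database_id: string;
}
interface WranglerEnv {
  d1_databases?: D1Binding[];
}
interface WranglerConfig {
  env?: Record<string, WranglerEnv>;
}

// Returns a list of problems; an empty list means each environment
// binds only its own database and no database_id is shared across shops.
function findCrossShopBindings(config: WranglerConfig): string[] {
  const problems: string[] = [];
  const ownerById = new Map<string, string>();
  for (const [slug, env] of Object.entries(config.env ?? {})) {
    const expectedName = `helm-${slug}-db`;
    for (const db of env.d1_databases ?? []) {
      if (db.database_name !== expectedName) {
        problems.push(`env "${slug}" binds ${db.database_name}, expected ${expectedName}`);
      }
      const owner = ownerById.get(db.database_id);
      if (owner !== undefined && owner !== slug) {
        problems.push(`env "${slug}" shares database_id ${db.database_id} with env "${owner}"`);
      }
      ownerById.set(db.database_id, slug);
    }
  }
  return problems;
}
```

A CI step would fail the build whenever the returned list is non-empty, before any deploy runs.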
Data restoration from backup
# List available backups
wrangler r2 object list helm-{slug}-assets --prefix backups/db/
# Download the desired backup
wrangler r2 object get helm-{slug}-assets backups/db/2026-05-08.sql --file ./backup.sql
# Apply (this REPLACES the current data; use with care)
wrangler d1 import helm-{slug}-db ./backup.sql
A planned scripts/restore-from-backup.ps1 will wrap this with confirmation prompts.
Post-incident checklist
After every SEV1:
- Shop owner emailed with summary
- Post-mortem written (in docs/incidents/{date}.md)
- Cause identified
- Fix deployed
- Prevention measure decided (test? alert? guardrail?)
- Prevention measure implemented