Disaster recovery

The runbook for the day things go wrong. Calm, sequential, with the steps already written so you don't have to think under stress.

Drafted from planning · v0.1

Severity classification

SEV1 — money or data integrity at risk. Examples:

Charges happening but transactions not committing
Audit chain broken
Shop's database unavailable

SEV2 — significant feature broken, workaround possible. Examples:

SMS not sending
Receipts not generating
Specific endpoint returning 500s

SEV3 — minor; can wait for next business day. Examples:

A typo in the UI
A non-critical cron didn't run (and isn't financial)

SEV1 response

Stop further damage. If the shop is actively losing money on each transaction, instruct them to pause sales until you say go.
Diagnose. Check wrangler tail, the audit log, and the alert dashboard.
Communicate. Text the shop owner: "I'm on it. Status update in 30 minutes."
Apply the fix or roll back. Prefer rollback if you're not 100% sure of the fix.
Verify. Test the broken path; verify audit shows a clean transaction.
Post-mortem. Within 48 hours, write a short post-mortem: what happened, what we did, and what we'll do to prevent it. Email it to the shop owner if the incident affected them.

Specific scenarios

Audit chain verification failed

This is always SEV1.

The daily cron caught a chain break. Possible causes:

A DBA-style direct edit happened (rare; investigate who has D1 console access)
A withAudit write failed mid-transaction and left a gap
A bug in our chain-hash computation

Response:

Pull the most recent audit rows, including the hash columns: wrangler d1 execute helm-{slug}-db --remote --command "SELECT id, at, action, prev_chain_hash, chain_hash FROM audit_events ORDER BY id DESC LIMIT 50;"
Identify the first broken row (where prev_chain_hash doesn't match the previous row's chain_hash)
Compare against the audit_archive in R2 (archived monthly); if the broken row is older than the latest archive boundary, restore it from the archive
Re-verify
Investigate root cause; deploy fix; resume
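
Finding the first broken row can be done mechanically once the rows are exported. The following is a minimal sketch in Python, assuming the rows are available as dictionaries sorted oldest-first (reverse the DESC output of the query) with the id, prev_chain_hash and chain_hash columns named above. The function name and row shape are illustrative, not Helm's actual verifier, and the check covers linkage only: it does not recompute each hash, because the hash formula is not specified here.

```python
from typing import Mapping, Optional, Sequence


def first_broken_row(rows: Sequence[Mapping[str, object]]) -> Optional[object]:
    """Return the id of the first row whose prev_chain_hash does not match
    the previous row's chain_hash, or None if the linkage is intact.

    `rows` must be sorted by id ascending (oldest first).
    """
    for prev, curr in zip(rows, rows[1:]):
        if curr["prev_chain_hash"] != prev["chain_hash"]:
            return curr["id"]
    return None


# Hypothetical data: row 3 points at a hash that row 2 never had.
rows = [
    {"id": 1, "prev_chain_hash": None, "chain_hash": "aaa"},
    {"id": 2, "prev_chain_hash": "aaa", "chain_hash": "bbb"},
    {"id": 3, "prev_chain_hash": "xxx", "chain_hash": "ccc"},
]
print(first_broken_row(rows))
```

A None result on the last 50 rows only proves that window is linked; widen the LIMIT if the daily cron reported a break further back.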

D1 unavailable

Rare. Cloudflare D1 incidents are usually short. Response:

Confirm via https://www.cloudflarestatus.com
Tell the shop owner; status update every 15 minutes
Wait for Cloudflare; we can't speed this up

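If D1 has lost data (catastrophic; has never happened):

Restore from yesterday's backup in R2: download backups/db/{date}.sql to a local file (see "Data restoration from backup"), then apply it with wrangler d1 execute helm-{slug}-db --remote --file ./backup.sql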
Compare with Stripe charges since the backup time; manually re-create any missing transactions
Re-deploy
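
The comparison with Stripe comes down to a set difference. Here is a minimal sketch in Python, assuming you have exported the Stripe charge ids created since the backup time and the charge ids recorded in the restored database; both inputs and the function name are hypothetical, and how Helm stores the Stripe charge id on a transaction is not specified in this runbook.

```python
from typing import Iterable, List


def missing_charges(stripe_charge_ids: Iterable[str],
                    recorded_charge_ids: Iterable[str]) -> List[str]:
    """Return Stripe charge ids with no matching transaction, in Stripe's order."""
    recorded = set(recorded_charge_ids)
    return [cid for cid in stripe_charge_ids if cid not in recorded]


# Hypothetical ids: Stripe saw three charges since the backup, but the
# restored database only knows about one of them.
stripe_ids = ["ch_1", "ch_2", "ch_3"]
db_ids = ["ch_2"]
print(missing_charges(stripe_ids, db_ids))
```

Each id in the result is a transaction to re-create by hand.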

Stripe webhook signatures failing

Symptoms: every webhook returns 400 from our endpoint. The shop's payments are landing in Stripe but Helm isn't recording them.

Response:

Verify the secret exists: wrangler secret list --env {slug} should show STRIPE_WEBHOOK_SECRET (names only; wrangler cannot print the stored value)
Check Stripe dashboard → Webhooks → Endpoints and reveal the endpoint's signing secret; this is the value the Worker must hold
If the values differ, or you cannot tell, the most common cause is that the shop or someone else rotated the Stripe secret; run wrangler secret put STRIPE_WEBHOOK_SECRET --env {slug} and enter the current value
Stripe automatically retries failed webhooks for ~3 days; missed events will catch up
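
To confirm that a given secret is the right one before putting it, you can check it offline against a captured request. Stripe signs each webhook with HMAC-SHA256 over "{timestamp}.{raw body}" and sends the result in the Stripe-Signature header as t=...,v1=.... The sketch below is a diagnostic aid in Python, not Helm's handler: the function names and sample values are hypothetical, and it deliberately skips the timestamp tolerance check that a production verifier should also apply.

```python
import hashlib
import hmac


def compute_v1(secret: str, timestamp: str, payload: bytes) -> str:
    """Compute the v1 signature Stripe would send for this body and timestamp."""
    signed = timestamp.encode() + b"." + payload
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def signature_matches(secret: str, header: str, payload: bytes) -> bool:
    """Check a Stripe-Signature header value against a candidate signing secret."""
    parts = [item.strip().split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((value for key, value in parts if key == "t"), None)
    candidates = [value for key, value in parts if key == "v1"]
    if timestamp is None or not candidates:
        return False
    expected = compute_v1(secret, timestamp, payload)
    return any(hmac.compare_digest(expected, c) for c in candidates)


# Hypothetical captured request, signed here with a made-up secret.
body = b'{"id":"evt_test"}'
header = "t=1700000000,v1=" + compute_v1("whsec_example", "1700000000", body)
print(signature_matches("whsec_example", header, body))
print(signature_matches("whsec_rotated", header, body))
```

The payload must be the raw request bytes exactly as received; a body that has been re-serialized from parsed JSON will not verify even with the correct secret.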

A session token leaked

Someone outside the shop has a valid helm_session cookie. Response:

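Kill all sessions for that staff member: wrangler d1 execute helm-{slug}-db --remote --command "DELETE FROM staff_sessions WHERE staff_id = {staff_id};" (substitute the real staff ID, quoted if it is text; wrangler does not bind ? placeholders from the command line)
Rotate the PIN: have the staff member set a new one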
Audit-log the incident
Email the owner; explain what we did
Note that the hard expiry on sessions (12 hours) limits the damage window
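
When writing the email to the owner, it helps to state exactly when the leaked token would have stopped working on its own. This is a minimal sketch in Python of that arithmetic, assuming the session's creation time is known; the function names are hypothetical, and only the 12-hour figure comes from this runbook.

```python
from datetime import datetime, timedelta, timezone

SESSION_HARD_EXPIRY = timedelta(hours=12)  # hard expiry stated in this runbook


def session_expired(created_at: datetime, now: datetime) -> bool:
    """True once a session has reached the hard expiry, regardless of activity."""
    return now - created_at >= SESSION_HARD_EXPIRY


def exposure_ends_at(created_at: datetime) -> datetime:
    """Latest moment a token issued at `created_at` could still be accepted."""
    return created_at + SESSION_HARD_EXPIRY


# Hypothetical session issued at 09:00 UTC.
issued = datetime(2026, 5, 8, 9, 0, tzinfo=timezone.utc)
print(exposure_ends_at(issued).isoformat())
```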

A shop's Worker is somehow misconfigured to talk to another shop's D1

This should be impossible under the single-tenant-per-shop design. If it happens anyway:

Halt deploys immediately
Verify wrangler.jsonc has the correct bindings for the affected shop
Roll back to a known-good version
Investigate how the binding got swapped (likely a manual edit; tighten CI checks)
Audit both shops' data: did the wrong shop's data get written? If so, restore from yesterday's backup

This is the worst-case scenario the architecture is designed against. The bindings shouldn't allow it.
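
One way to tighten the CI check is to assert that every environment's bindings follow the naming pattern used throughout this runbook. The sketch below is in Python and rests on several assumptions: wrangler.jsonc has already been parsed into a dictionary (comments stripped), each shop is a wrangler environment named by its slug (as --env {slug} suggests), and resources are named helm-{slug}-db and helm-{slug}-assets. The function name is hypothetical.

```python
from typing import Any, Dict, List


def binding_errors(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    errors: List[str] = []
    for slug, env in config.get("env", {}).items():
        expected_db = f"helm-{slug}-db"
        for db in env.get("d1_databases", []):
            if db.get("database_name") != expected_db:
                errors.append(
                    f"env {slug}: D1 binding {db.get('binding')} points at "
                    f"{db.get('database_name')}, expected {expected_db}"
                )
        expected_bucket = f"helm-{slug}-assets"
        for bucket in env.get("r2_buckets", []):
            if bucket.get("bucket_name") != expected_bucket:
                errors.append(
                    f"env {slug}: R2 binding {bucket.get('binding')} points at "
                    f"{bucket.get('bucket_name')}, expected {expected_bucket}"
                )
    return errors


# Hypothetical config in which shop "beta" was pointed at shop "alpha"'s database.
config = {
    "env": {
        "alpha": {"d1_databases": [{"binding": "DB", "database_name": "helm-alpha-db"}]},
        "beta": {"d1_databases": [{"binding": "DB", "database_name": "helm-alpha-db"}]},
    }
}
for problem in binding_errors(config):
    print(problem)
```

The name is only a label; the database_id is what wrangler actually binds, so a fuller check would also compare each database_id against a known-good id per shop.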

Data restoration from backup

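# List available backups (if your wrangler version has no "r2 object list" command, browse the bucket in the Cloudflare dashboard instead)
wrangler r2 object list helm-{slug}-assets --prefix backups/db/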

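# Download the desired backup (the object path is {bucket}/{key})
wrangler r2 object get helm-{slug}-assets/backups/db/2026-05-08.sql --file ./backup.sql

# Apply the dump to the remote database (destructive; use with care).
# Whether it cleanly replaces existing data depends on how the dump was produced, so test against a scratch database first.
wrangler d1 execute helm-{slug}-db --remote --file ./backup.sql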

A planned scripts/restore-from-backup.ps1 will wrap this with confirmation prompts.
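
Until that script exists, the shape of such a wrapper might look like the following. This is a sketch in Python rather than PowerShell, under stated assumptions: the function names are hypothetical, the dump is applied with wrangler d1 execute --file, and the confirmation rule (type the shop slug) is one possible design, not a decided one. It only builds the commands and checks the confirmation; running them is left to the caller.

```python
from typing import Callable, List


def restore_commands(slug: str, date: str,
                     local_file: str = "./backup.sql") -> List[List[str]]:
    """Build the download and apply commands for one shop and one backup date."""
    bucket = f"helm-{slug}-assets"
    database = f"helm-{slug}-db"
    return [
        ["wrangler", "r2", "object", "get",
         f"{bucket}/backups/db/{date}.sql", "--file", local_file],
        ["wrangler", "d1", "execute", database, "--remote", "--file", local_file],
    ]


def confirmed(slug: str, ask: Callable[[str], str] = input) -> bool:
    """Require the operator to type the exact shop slug before anything runs."""
    answer = ask(f"This overwrites data in helm-{slug}-db. Type the shop slug to continue: ")
    return answer.strip() == slug


# Hypothetical shop slug; print the plan instead of executing it.
for command in restore_commands("demo-shop", "2026-05-08"):
    print(" ".join(command))
```

Requiring the slug rather than a plain "yes" makes it harder to restore the wrong shop's database out of habit.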

Post-incident checklist

After every SEV1:

Shop owner emailed with summary
Post-mortem written (in docs/incidents/{date}.md)
Cause identified
Fix deployed
Prevention measure decided (test? alert? guardrail?)
Prevention measure implemented
