Disaster recovery

The runbook for the day things go wrong. Calm, sequential, with the steps already written so you don't have to think under stress.

Drafted from planning · v0.1

Severity classification

SEV1 — money or data integrity at risk. Examples:

  • Charges happening but transactions not committing
  • Audit chain broken
  • Shop's database unavailable

SEV2 — significant feature broken, workaround possible. Examples:

  • SMS not sending
  • Receipts not generating
  • Specific endpoint returning 500s

SEV3 — minor; can wait for next business day. Examples:

  • A typo in the UI
  • A non-critical cron didn't run (and isn't financial)

SEV1 response

  1. Stop further damage. If the shop is actively losing money on each transaction, instruct them to pause sales until you say go.
  2. Diagnose. Check wrangler tail, the audit log, and the alert dashboard.
  3. Communicate. Text the shop owner: "I'm on it. Status update in 30 minutes."
  4. Apply the fix or roll back. Prefer rollback if you're not 100% sure of the fix.
  5. Verify. Test the broken path; verify audit shows a clean transaction.
  6. Post-mortem. Within 48 hours, write a short post-mortem: what happened, what we did, and what we'll do to prevent a repeat. Email it to the shop owner if the incident affected them.

Specific scenarios

Audit chain verification failed

This is always SEV1.

The daily cron caught a chain break. Possible causes:

  • A DBA-style direct edit happened (rare; investigate who has D1 console access)
  • A withAudit write failed mid-transaction and left a gap
  • A bug in our chain-hash computation

Response:

  1. wrangler d1 execute helm-{slug}-db --remote --command "SELECT id, at, action, prev_chain_hash, chain_hash FROM audit_events ORDER BY id DESC LIMIT 50;"
  2. Identify the first broken row (where prev_chain_hash doesn't match the previous row's chain_hash)
  3. Compare against the monthly audit_archive in R2; if the broken row falls before the most recent archive boundary, restore the affected rows from the archive
  4. Re-verify
  5. Investigate root cause; deploy fix; resume
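
Step 2 can be done by eye on 50 rows, but a small helper makes the check mechanical. The sketch below is illustrative TypeScript, not Helm's actual verifier. The AuditRow shape is an assumption built from the two column names mentioned above, and it checks only the linkage between rows, not whether each chain_hash recomputes correctly from the row's contents.

```typescript
// Hypothetical row shape: only prev_chain_hash and chain_hash are named in
// this runbook; everything else about the table is an assumption.
interface AuditRow {
  id: number;
  prev_chain_hash: string | null;
  chain_hash: string;
}

// Returns the first row whose prev_chain_hash does not equal the chain_hash
// of the row immediately before it, or null if every link matches.
// Rows must be sorted by id ascending (oldest first).
function findFirstBrokenRow(rows: AuditRow[]): AuditRow | null {
  for (let i = 1; i < rows.length; i++) {
    if (rows[i].prev_chain_hash !== rows[i - 1].chain_hash) {
      return rows[i];
    }
  }
  return null;
}

const sampleRows: AuditRow[] = [
  { id: 1, prev_chain_hash: null, chain_hash: "aaa" },
  { id: 2, prev_chain_hash: "aaa", chain_hash: "bbb" },
  { id: 3, prev_chain_hash: "zzz", chain_hash: "ccc" }, // link to row 2 is broken
  { id: 4, prev_chain_hash: "ccc", chain_hash: "ddd" },
];

console.log(findFirstBrokenRow(sampleRows)?.id); // prints 3
```

The query in step 1 returns the newest rows first, so reverse the result before passing it in.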

D1 unavailable

Rare. Cloudflare D1 incidents are usually short. Response:

  1. Confirm via https://www.cloudflarestatus.com
  2. Tell the shop owner; status update every 15 minutes
  3. Wait for Cloudflare; we can't accelerate

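If D1 has actually lost data (catastrophic; this has never happened):

  1. Restore from yesterday's backup in R2: download it, then apply it with wrangler d1 execute helm-{slug}-db --remote --file ./backup.sql (Wrangler has no d1 import subcommand; see "Data restoration from backup")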
  2. Compare with Stripe charges since the backup time; manually re-create any missing transactions
  3. Re-deploy
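
The comparison in step 2 is a set difference: charges Stripe knows about since the backup, minus the ones present in the restored database. A minimal sketch follows, assuming you have already pulled both lists. The field names (stripe_charge_id in particular) are hypothetical and not taken from Helm's schema.

```typescript
// Hypothetical shapes; the field names are illustrative only.
interface StripeCharge {
  id: string;      // Stripe charge ID
  amount: number;  // minor units (for example, cents)
  created: number; // Unix seconds, as Stripe reports it
}

interface HelmTransaction {
  stripe_charge_id: string;
}

// Charges created at or after the backup time that have no matching
// transaction in the restored database, oldest first.
function findMissingCharges(
  charges: StripeCharge[],
  restored: HelmTransaction[],
  backupTimeUnix: number,
): StripeCharge[] {
  const recorded = new Set(restored.map((t) => t.stripe_charge_id));
  return charges
    .filter((c) => c.created >= backupTimeUnix && !recorded.has(c.id))
    .sort((a, b) => a.created - b.created);
}

const exampleCharges: StripeCharge[] = [
  { id: "ch_1", amount: 1200, created: 100 },
  { id: "ch_2", amount: 800, created: 200 },
  { id: "ch_3", amount: 450, created: 300 },
];
const exampleRestored: HelmTransaction[] = [
  { stripe_charge_id: "ch_1" },
  { stripe_charge_id: "ch_2" },
];

// Backup taken at t=150: ch_1 predates it, ch_2 is already recorded.
const missing = findMissingCharges(exampleCharges, exampleRestored, 150);
console.log(missing.map((c) => c.id).join(", ")); // prints ch_3
```

Each charge in the result is a transaction to re-create by hand.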

Stripe webhook signatures failing

Symptoms: every webhook returns 400 from our endpoint. The shop's payments are landing in Stripe but Helm isn't recording them.

Response:

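  1. Confirm the secret exists: wrangler secret list --env {slug} should show STRIPE_WEBHOOK_SECRET (the command lists names only; values cannot be read back)
  2. Find the endpoint's signing secret in the Stripe dashboard → Webhooks → Endpoints
  3. Because the deployed value can't be compared directly, re-set it from the dashboard if there is any doubt: run wrangler secret put STRIPE_WEBHOOK_SECRET --env {slug} and paste the value at the prompt. The most common cause of a mismatch is that the shop, or someone else with dashboard access, rotated the signing secret in Stripe
  4. Stripe retries failed webhook deliveries with backoff for up to about 3 days in live mode, so missed events catch up once the secret is fixed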
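
It helps to see why a rotated secret turns every webhook into a 400. Stripe signs the string "{timestamp}.{raw body}" with HMAC-SHA256 keyed by the endpoint's signing secret and sends the result in the Stripe-Signature header as t=...,v1=.... The sketch below reproduces that check with the Web Crypto API. It is an illustration, not Helm's handler: the function names are made up, and it omits the timestamp tolerance check that a production verifier should include. Stripe's official library is the safer route for real traffic.

```typescript
// Hex-encoded HMAC-SHA256 of `${timestamp}.${rawBody}`, keyed with the
// endpoint's signing secret.
async function computeSignature(secret: string, timestamp: string, rawBody: string): Promise<string> {
  const enc = new TextEncoder();
  const key = await crypto.subtle.importKey(
    "raw",
    enc.encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign"],
  );
  const mac = await crypto.subtle.sign("HMAC", key, enc.encode(`${timestamp}.${rawBody}`));
  return Array.from(new Uint8Array(mac), (b) => b.toString(16).padStart(2, "0")).join("");
}

// Parses a header of the form "t=<unix seconds>,v1=<hex>[,v1=<hex>...]".
function parseSignatureHeader(header: string): { timestamp: string; v1: string[] } {
  let timestamp = "";
  const v1: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const name = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (name === "t") timestamp = value;
    if (name === "v1") v1.push(value);
  }
  return { timestamp, v1 };
}

// Compares two strings without exiting early on the first differing character.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}

// True if any v1 signature in the header matches the one we compute.
async function isValidStripeSignature(secret: string, header: string, rawBody: string): Promise<boolean> {
  const { timestamp, v1 } = parseSignatureHeader(header);
  if (!timestamp || v1.length === 0) return false;
  const expected = await computeSignature(secret, timestamp, rawBody);
  return v1.some((candidate) => constantTimeEqual(candidate, expected));
}
```

If the Worker holds the old secret, the signature it computes can never match the one Stripe sent, so every event is rejected until the secret is updated.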

A session token leaked

Someone outside the shop has a valid helm_session cookie. Response:

  1. Kill all sessions for the affected staff member. Substitute the real ID, because --command does not bind ? placeholders: wrangler d1 execute helm-{slug}-db --remote --command "DELETE FROM staff_sessions WHERE staff_id = {staff_id};"
  2. Rotate the PIN: have the staff member set a new one
  3. Audit-log the incident
  4. Email the owner; explain what we did
  5. Keep in mind that the 12-hour hard expiry on sessions caps how long the leaked cookie could have worked, which bounds the damage
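
For the incident notes it is useful to state the latest moment the leaked cookie could have been valid. A small sketch of the 12-hour rule, assuming a session records its creation time in Unix milliseconds (the StaffSession shape and created_at field are hypothetical):

```typescript
const SESSION_HARD_EXPIRY_MS = 12 * 60 * 60 * 1000; // 12 hours, per this runbook

// Hypothetical session shape; created_at in Unix milliseconds is an assumption.
interface StaffSession {
  staff_id: number;
  created_at: number;
}

// A session is usable only while it is younger than the hard expiry,
// regardless of how recently it was used.
function isSessionLive(session: StaffSession, nowMs: number): boolean {
  return nowMs - session.created_at < SESSION_HARD_EXPIRY_MS;
}

// Latest moment a leaked session could still be used if it is never revoked.
function exposureEndsAt(session: StaffSession): number {
  return session.created_at + SESSION_HARD_EXPIRY_MS;
}

const leaked: StaffSession = { staff_id: 7, created_at: Date.UTC(2026, 4, 8, 9, 0, 0) };
console.log(new Date(exposureEndsAt(leaked)).toISOString()); // prints 2026-05-08T21:00:00.000Z
```

The exposure window runs from the moment the cookie leaked to whichever came first: the DELETE in step 1 or the time returned by exposureEndsAt.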

A shop's Worker is somehow misconfigured to talk to another shop's D1

Under the single-tenant-per-shop design this must never happen. If it does:

  1. Halt deploys immediately
  2. Verify wrangler.jsonc has the correct bindings for the affected shop
  3. Roll back to a known-good version
  4. Investigate how the binding got swapped (likely a manual edit; tighten CI checks)
  5. Audit both shops' data: did the wrong shop's data get written? If so, restore from yesterday's backup

This is the worst-case scenario the architecture is designed against. The bindings shouldn't allow it.
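
One way to tighten the CI checks mentioned in step 4 is to fail the build when an environment's bindings don't follow the helm-{slug}-db and helm-{slug}-assets naming used throughout this runbook. The sketch below works on an already-parsed config object. The keys d1_databases, database_name, r2_buckets and bucket_name are Wrangler's own; the assumption that each shop is an env entry keyed by its slug is inferred from the --env {slug} usage above, and the slugs in the example are invented.

```typescript
// Minimal slice of a parsed wrangler config; only the fields checked here.
interface EnvConfig {
  d1_databases?: { binding: string; database_name: string }[];
  r2_buckets?: { binding: string; bucket_name: string }[];
}

interface WranglerConfig {
  env?: Record<string, EnvConfig>;
}

// Returns one message per binding whose target name does not belong to the
// environment's own shop. An empty array means the names line up.
function findCrossTenantBindings(config: WranglerConfig): string[] {
  const problems: string[] = [];
  for (const [slug, env] of Object.entries(config.env ?? {})) {
    for (const db of env.d1_databases ?? []) {
      if (db.database_name !== `helm-${slug}-db`) {
        problems.push(`env ${slug}: D1 binding ${db.binding} points at ${db.database_name}`);
      }
    }
    for (const bucket of env.r2_buckets ?? []) {
      if (bucket.bucket_name !== `helm-${slug}-assets`) {
        problems.push(`env ${slug}: R2 binding ${bucket.binding} points at ${bucket.bucket_name}`);
      }
    }
  }
  return problems;
}

// Invented slugs for illustration; "bolt" is wired to the wrong database.
const exampleConfig: WranglerConfig = {
  env: {
    acme: { d1_databases: [{ binding: "DB", database_name: "helm-acme-db" }] },
    bolt: { d1_databases: [{ binding: "DB", database_name: "helm-acme-db" }] },
  },
};

console.log(findCrossTenantBindings(exampleConfig)); // reports one problem, for env bolt
```

This checks names only. A binding with the right database_name but another shop's database_id would still pass, so a stricter check would also compare IDs against a known-good list.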

Data restoration from backup

# List available backups
wrangler r2 object list helm-{slug}-assets --prefix backups/db/

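# Download the desired backup (the object path is {bucket}/{key}; on Wrangler v4, add --remote to read the real bucket rather than local dev storage)
wrangler r2 object get helm-{slug}-assets/backups/db/2026-05-08.sql --file ./backup.sql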

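# Apply (this REPLACES the current data; use with care)
# Wrangler has no "d1 import" subcommand; SQL dumps are applied with "d1 execute --file"
wrangler d1 execute helm-{slug}-db --remote --file ./backup.sql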

A planned scripts/restore-from-backup.ps1 will wrap this with confirmation prompts.

Post-incident checklist

After every SEV1:

  • Shop owner emailed with summary
  • Post-mortem written (in docs/incidents/{date}.md)
  • Cause identified
  • Fix deployed
  • Prevention measure decided (test? alert? guardrail?)
  • Prevention measure implemented
