Fail quietly, recover loudly

When a non-critical thing fails during a critical operation, the critical operation should still succeed and the failure should be reported asynchronously. When a critical operation itself fails, it must fail in a way that lets the operator do something useful immediately, not just see a stack trace.

Drafted from planning · v0.1

The shortest version

The till must always ring. Receipts can email later. Audit rows can replay later. Twilio messages can retry later. SMS to the operator about a failure is fine. SMS to the customer that fails silently is not.

The hierarchy of "what must work"

In order of importance during a customer interaction:

  1. The mutation lands in D1. Money records, inventory, tickets, status changes. If this fails, the operator must know now and the operation must abort cleanly.
  2. The audit row writes. Tied to (1): the mutation does not commit if the audit row cannot be written. (See audit-everything.)
  3. The receipt prints. The operator can re-print from the transaction record if printing failed.
  4. The customer's SMS notification sends. Operator gets a hint if it didn't; can resend manually.
  5. The receipt PDF lands in R2. Regenerable on demand from the transaction.
  6. Every analytics / observability log line. Best-effort; we don't fail user actions because Logpush is slow.

The pattern

// Pseudo
async function handleSale(req, env, ctx) {
  // CRITICAL: must succeed or fail loudly
  const txn = await commitSaleAndAudit(req, env, ctx);

  // BEST EFFORT: failures are reported, not aborted
  const printResult = await tryAndCatch('receipt:print', () => printReceipt(txn));
  const smsResult = await tryAndCatch('sms:send', () => sendCustomerSms(txn));
  const r2Result = await tryAndCatch('r2:put', () => storeReceiptPdf(txn));

  return {
    transaction_id: txn.id,
    warnings: [printResult.warning, smsResult.warning, r2Result.warning].filter(Boolean),
  };
}

The UI shows the success state with any warnings as small inline notes: "Saved. Couldn't text customer (Twilio error). [Retry]"

The operator sees they got the win. They also see the asterisks. They can act on the asterisks if they want, or come back later.
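
The pattern above calls tryAndCatch without defining it. A minimal sketch of one possible shape follows; the return fields (ok, value, warning) and the injectable log argument are assumptions made for illustration, not the project's actual helper.

```javascript
// Hypothetical helper: runs a best-effort step and turns a thrown error
// into a warning object instead of letting it abort the caller.
async function tryAndCatch(label, fn, log = console.warn) {
  try {
    const value = await fn();
    return { ok: true, value, warning: null };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Log with the step label so the failure can be traced later.
    log(`[best-effort] ${label} failed: ${message}`);
    return { ok: false, value: null, warning: { step: label, message } };
  }
}
```

Because a successful step returns warning: null, the filter(Boolean) call in handleSale keeps only the steps that actually failed.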

What "fail loudly" actually means

When a critical operation fails, the operator does not see "Server Error" or a stack trace. They see:

  • A clear, brief description of what failed: "Couldn't save the line item (database busy)"
  • The action they should take: "Tap Save again to retry. If it keeps failing, take a photo of the screen and contact Kvick."
  • The context they need: a copy-button next to the request_id

The error UI is consistent across the app: a bottom toast with a red border, dismissible, and persistent until dismissed (errors that block work never auto-fade).
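
The three parts of a loud failure (what failed, what to do, and the request_id) can be carried in one payload that the toast renders. The sketch below assumes a hypothetical error-code lookup; the function name, field names, and fallback wording are illustrative only.

```javascript
// Hypothetical mapping from internal error codes to operator-facing copy.
const OPERATOR_ERRORS = new Map([
  ['D1_BUSY', {
    message: "Couldn't save the line item (database busy)",
    action: 'Tap Save again to retry. If it keeps failing, take a photo of the screen and contact Kvick.',
  }],
]);

// Assumed fallback for codes with no specific copy.
const FALLBACK_ERROR = {
  message: "Couldn't complete the action",
  action: 'Take a photo of the screen and contact Kvick.',
};

// Builds the payload the error toast renders: never a stack trace.
function toOperatorError(code, requestId) {
  const copy = OPERATOR_ERRORS.get(code) ?? FALLBACK_ERROR;
  return {
    message: copy.message,
    action: copy.action,
    request_id: requestId, // shown next to a copy button
    dismissible: true,
    auto_fade: false,      // blocking errors stay until dismissed
  };
}
```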

What "fail quietly" actually means

When a best-effort operation fails:

  • The success path completes normally
  • The failure is logged with full context (see observability)
  • The UI shows a small warning icon or inline note
  • The shop owner sees a daily digest of all the warnings (so patterns become visible)
  • If the failure recurs across multiple operations (e.g., R2 returning 503 on every PUT), the warning is escalated to an alert
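
The escalation rule in the last bullet can be sketched as a sliding-window counter per step. The threshold of 3 and the 10-minute window are assumed values, and the in-memory Map is for illustration only: Worker isolates do not reliably share memory, so a real version would presumably keep these counts in durable storage.

```javascript
// Hypothetical tracker: counts best-effort failures per step within a
// sliding time window and escalates once the threshold is reached.
function createWarningTracker({ threshold = 3, windowMs = 10 * 60 * 1000 } = {}) {
  const failures = new Map(); // step label -> timestamps (ms) of recent failures

  return {
    // Records one failure and returns 'alert' or 'warning'.
    record(step, now = Date.now()) {
      const recent = (failures.get(step) ?? []).filter((t) => now - t < windowMs);
      recent.push(now);
      failures.set(step, recent);
      return recent.length >= threshold ? 'alert' : 'warning';
    },
  };
}
```

A single failed R2 PUT stays a warning; a third failure of the same step inside the window becomes an alert.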

Retry semantics

We retry inside the Worker only when the upstream is known to be transiently flaky and the operation is idempotent.

Upstream | Retry? | Strategy
D1 | Yes (limited) | Up to 2 retries with exponential backoff starting at 100 ms, on D1_TIMEOUT/D1_BUSY only
R2 | Yes | Up to 3 retries, idempotent puts only
Stripe write | No (within request) | Idempotency key on the original; client-side "Retry" button for explicit reattempt
Twilio send | Yes (queued) | Failed sends go to a pending_sms table; cron retries 3 more times over 24 hours
Claude API | Yes | Up to 2 retries on 5xx; tool use has its own bounded loop
Webhook receivers | No retries; rely on the source's retry | Stripe and GBP retry their own webhooks; we just need to be idempotent

Retries beyond these are the cron's job, not the request handler's. The request handler's job is to return a deterministic answer in bounded time.
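
As an illustration of the D1 row in the table, here is a minimal sketch of a bounded retry wrapper. It assumes errors carry a code property with the values named in the table; how D1 actually surfaces these conditions is not specified here, and withD1Retry and its options are hypothetical names.

```javascript
// Only these codes are treated as transient (taken from the table above).
const RETRYABLE_D1_CODES = new Set(['D1_TIMEOUT', 'D1_BUSY']);

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Runs fn, retrying at most `retries` times on retryable codes, with
// exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
// fn must be idempotent, since it may run more than once.
async function withD1Retry(fn, { retries = 2, baseDelayMs = 100, sleep = defaultSleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable = RETRYABLE_D1_CODES.has(err?.code);
      // Non-transient error or budget exhausted: surface it to the caller.
      if (!retryable || attempt >= retries) throw err;
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

With the defaults, the worst case is three attempts and 300 ms of added delay, which keeps the request handler's response time bounded.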

Idempotency keys everywhere

For every external write that matters (Stripe charges, Twilio sends, Claude tool calls that mutate D1):

  • The Worker generates an idempotency key for each operation, stored alongside the operation record
  • Retries use the same key
  • The upstream's idempotency contract prevents double-charges, double-texts, double-actions

For Stripe charges specifically: the idempotency key is txn_<id>_v<retry_n> where retry_n only increments if the operator manually clicks Retry. Network retries inside the Worker keep retry_n constant.
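
The key scheme can be sketched in a few lines. The helper names are hypothetical, and starting retry_n at 0 is an assumption; the text only fixes the txn_<id>_v<retry_n> format and the rule for when retry_n changes.

```javascript
// Builds the Stripe idempotency key in the format described above.
function stripeIdempotencyKey(txnId, retryN) {
  if (!Number.isInteger(retryN) || retryN < 0) {
    throw new RangeError('retry_n must be a non-negative integer');
  }
  return `txn_${txnId}_v${retryN}`;
}

// Called only when the operator clicks Retry. Network retries inside the
// Worker reuse the stored record unchanged, so they send the same key.
function onManualRetry(record) {
  return { ...record, retry_n: record.retry_n + 1 };
}

// Usage: the key is stable until the operator explicitly retries.
const attempt = { txn_id: 'abc', retry_n: 0 };
const firstKey = stripeIdempotencyKey(attempt.txn_id, attempt.retry_n);
const retried = onManualRetry(attempt);
const secondKey = stripeIdempotencyKey(retried.txn_id, retried.retry_n);
```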

What this principle tells us about UX

A few concrete design choices that fall out:

  • Auto-save with optimistic UI — see in-situ editing. The UI shows the change instantly; the server commit happens asynchronously; failures roll back the UI and surface the error.
  • Undo toasts on destructive actions — 5-second window where the user can undo. Reversible actions reduce the cost of a mistake.
  • Progress indicators that show when something's slow — but never spinners that block all interaction. The operator can keep working in another part of the screen while a slow API completes.
  • No "your session has expired" full-page errors — the operator gets a small re-auth prompt that doesn't lose their work.
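
The optimistic auto-save in the first bullet can be sketched as follows. This is a simplified illustration under stated assumptions: state is a plain object, commit is a hypothetical function that performs the server write, and concurrent edits to the same field are not handled.

```javascript
// Applies a change to local state immediately, then commits it to the
// server; on failure, restores the previous value and reports the error.
async function optimisticUpdate(state, key, newValue, commit, onError) {
  const previous = state[key];
  state[key] = newValue; // the UI shows the change instantly
  try {
    await commit(key, newValue); // server commit happens asynchronously
    return true;
  } catch (err) {
    state[key] = previous; // roll back the UI
    onError(err);          // surface the error to the operator
    return false;
  }
}
```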

What this principle is not

  • Not "swallow errors." Best-effort failures are still logged and reported. The customer SMS that didn't send is on the daily digest. The audit log records the attempt.
  • Not "retry forever." Bounded retries with explicit fallback. Eventually, give up and surface the failure.
  • Not "make everything async." Sync paths are simpler when they're fast enough. We make things async when the latency cost would hurt the operator's flow.

Offline mode is the same principle at network scale

When the shop's whole internet connection drops, every Worker call fails the way a single call does during a network blip. The same discipline applies: the till must keep ringing cash sales and accepting drop-offs. We fail loudly only on the operations we genuinely can't do without the cloud: card terminal, SMS, server-arbitrated state. Everything else queues, retries, and reconciles when the connection is back. See offline architecture for the design and ADR-0015 for the idempotency-key contract that makes the queue safe under retry.
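
To show why idempotency keys make replay safe, here is a rough sketch of a client-side queue. It is not the design from offline architecture or ADR-0015; createOfflineQueue, the send callback, and the in-memory array are all assumptions made for illustration.

```javascript
// Hypothetical offline queue. Each operation carries an idempotency key
// chosen when it is enqueued, so replaying it after a lost response
// cannot apply it twice, provided the server deduplicates by key.
function createOfflineQueue(send) {
  const pending = [];

  return {
    enqueue(idempotencyKey, payload) {
      pending.push({ idempotencyKey, payload });
    },
    size() {
      return pending.length;
    },
    // Replays queued operations in order. Stops at the first failure and
    // keeps that operation queued; returns true once the queue is empty.
    async flush() {
      while (pending.length > 0) {
        try {
          await send(pending[0]);
        } catch {
          return false; // still offline, or the response was lost
        }
        pending.shift();
      }
      return true;
    },
  };
}
```

If a request reaches the server but the response is lost, the operation stays queued and is sent again with the same key on the next flush.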

Examples

Good fail-quiet: R2 PUT for a receipt PDF returns 503. The transaction commits, the operator sees "Saved. Receipt couldn't be archived (will retry tonight)." A cron job retries the PUT later. The receipt is regenerable from D1 anyway.

Good fail-loud: D1 INSERT for a transaction fails with SQLITE_FULL. The Worker returns 503 to the UI. The UI shows "Couldn't save the sale (storage full). Don't try again — contact Kvick immediately." Kvick gets paged.

Bad fail-quiet: Stripe charge succeeds but D1 INSERT fails. UI shows success, customer was charged, but the transaction record doesn't exist. We deliberately design against this — Stripe charges happen after the transaction row is committed (with status='pending'), and the post-charge update flips status to 'paid'. Daily reconciliation catches mismatches.

Bad fail-loud: A network blip on the operator's Wi-Fi causes the auto-save to fail. The UI shows a giant red modal "ERROR: NETWORK FAILURE." The operator panics. Better: a small inline "Saving... retrying..." indicator that quietly succeeds when the network comes back.
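
The ordering that guards against the bad fail-quiet case (row committed as 'pending', then the charge, then the flip to 'paid') can be sketched as below. The db and stripe objects are hypothetical interfaces rather than real client APIs, and returning needs_reconciliation when the final update fails is an assumed design choice.

```javascript
async function chargeSale(db, stripe, sale) {
  // 1. Commit the row first. If this throws, no charge is ever attempted.
  const txn = await db.insertTransaction({ ...sale, status: 'pending' });

  // 2. Charge only once the row exists. A failure here leaves a
  //    'pending' row with no charge, and the error reaches the caller.
  await stripe.charge(txn.id, sale.amount);

  // 3. Flip the status. If this fails, the customer was charged but the
  //    row still says 'pending'; reconciliation picks it up.
  try {
    await db.updateStatus(txn.id, 'paid');
    return { id: txn.id, status: 'paid' };
  } catch {
    return { id: txn.id, status: 'pending', needs_reconciliation: true };
  }
}

// Daily reconciliation: rows still 'pending' whose charge succeeded.
function findChargedButPending(rows, chargedTxnIds) {
  const charged = new Set(chargedTxnIds);
  return rows.filter((row) => row.status === 'pending' && charged.has(row.id));
}
```

With this ordering, a charge can never exist without a transaction row; the worst case is a row with a stale status, which reconciliation can find and fix.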

See also