
Observability

A one-person ops team cannot watch 50 shops simultaneously. The system has to tell you when something's wrong, with enough context to fix it without having to ask the shop's operator what happened. That's what observability means here: not dashboards as decoration, but signals that drive specific actions.

Drafted from planning · v0.1

The baseline (structured logs + Workers analytics) is enabled. The dashboards described below are a slice-1-follow-on.

The shortest version

Every Worker request emits one structured JSON log line at end-of-request: {request_id, shop, staff_id, path, method, status, ms, error_class?, error_msg?}. Every cron run emits a summary line: {cron, run_id, rows_touched, ms, errors_count}. Aggregate these into per-shop dashboards. Alert on the deltas that actually need a human.

What we instrument

HTTP requests

Every request through the Worker. The log line shape:

{
  "ts": "2026-05-10T14:23:11Z",
  "request_id": "req_a8h3k2",
  "shop": "swicked",
  "staff_id": 3,
  "staff_label": "Robbie",
  "path": "/api/tickets/2506/lines",
  "method": "POST",
  "status": 200,
  "ms": 47,
  "d1_queries": 4,
  "r2_ops": 0,
  "external_calls": ["stripe:0", "twilio:0", "claude:0"]
}

Error responses add error_class, error_msg, and a redacted error_context:

{
  "ts": "2026-05-10T14:23:11Z",
  "request_id": "req_a8h3k2",
  "shop": "swicked",
  "staff_id": 3,
  "path": "/api/tickets/2506",
  "method": "DELETE",
  "status": 409,
  "ms": 12,
  "error_class": "BlockedByDependents",
  "error_msg": "Cannot delete ticket with attached transactions",
  "error_context": {"ticket_id": 2506, "blocker": "transactions"}
}

PII is never logged. staff_label carries a first name only (for example "Robbie"), which is acceptable. Customer names never appear in log lines, and SMS bodies are never logged.
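One way to enforce the no-PII rule is to pass error_context through an allowlist before it is logged. The sketch below is an assumption, not the project's actual helper: the name redactContext, the allowed key set, and the 64-character string cap are all hypothetical.

```javascript
// Hypothetical allowlist: only identifiers and enum-like fields may reach the log.
const ALLOWED_CONTEXT_KEYS = new Set(["ticket_id", "blocker", "line_id", "transaction_id"]);

// Returns a copy of `context` holding only allowlisted keys whose values are
// numbers, booleans, or short strings, so free text (names, SMS bodies) is dropped.
function redactContext(context) {
  const safe = {};
  for (const [key, value] of Object.entries(context ?? {})) {
    if (!ALLOWED_CONTEXT_KEYS.has(key)) continue;
    const isShortString = typeof value === "string" && value.length <= 64;
    if (typeof value === "number" || typeof value === "boolean" || isShortString) {
      safe[key] = value;
    }
  }
  return safe;
}

const redacted = redactContext({
  ticket_id: 2506,
  blocker: "transactions",
  customer_name: "Jane Doe",        // dropped: not allowlisted
  sms_body: "Your order is ready",  // dropped: not allowlisted
});
console.log(JSON.stringify(redacted)); // {"ticket_id":2506,"blocker":"transactions"}
```

An allowlist fails closed: a newly added context field stays out of the logs until someone decides it is safe, which is the safer default for a rule stated as "never".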

Cron jobs

Each cron run emits one summary at end:

{
  "ts": "2026-05-10T07:00:00Z",
  "cron": "daily_reconciliation",
  "run_id": "cron_d_2026_05_10",
  "shop": "swicked",
  "ms": 8420,
  "results": {
    "sales_reconciled": 47,
    "low_stock_alerts": 3,
    "audit_chain_verified": true,
    "audit_chain_length": 18437
  },
  "errors": []
}
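A minimal sketch of how a cron run could produce that summary line, assuming a wrapper around each job. The function name runCronWithSummary, the run_id scheme, the use of the start time for ts, and the injectable now and log parameters are illustrative assumptions; in a Worker, something like this would be called from the scheduled handler.

```javascript
// Runs one cron job and emits its summary line, even when the job throws.
// `now` and `log` are injectable so the sketch can be exercised outside a Worker.
async function runCronWithSummary({ cron, shop, job, now = () => new Date(), log = console.log }) {
  const started = now();
  // Hypothetical run_id scheme: cron name plus the UTC date of the run.
  const day = started.toISOString().slice(0, 10).replace(/-/g, "_");
  const run_id = `cron_${cron}_${day}`;
  const t0 = performance.now();
  let results = {};
  const errors = [];
  try {
    results = (await job()) ?? {};
  } catch (e) {
    // A failed job still produces a summary; the failure goes into `errors`.
    errors.push({
      error_class: e?.constructor?.name ?? "UnknownError",
      error_msg: e?.message ?? String(e),
    });
  }
  const summary = {
    ts: started.toISOString(),
    cron, run_id, shop,
    ms: Math.round(performance.now() - t0),
    results,
    errors,
  };
  log(JSON.stringify(summary));
  return summary;
}

// Usage with a stand-in job that returns its counters.
runCronWithSummary({
  cron: "daily_reconciliation",
  shop: "swicked",
  job: async () => ({ sales_reconciled: 47, low_stock_alerts: 3 }),
});
```

Emitting the line on the failure path matters for alerting: a run that crashed is then distinguishable from a run that never started, which has no end log at all.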

External calls

Each call to Stripe / Twilio / Claude / GBP is logged with: target service, endpoint, duration, status, retry count. Failed calls log the response body (minus secrets) for forensic value.
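The fields listed above could be captured by a thin wrapper around fetch. This is a sketch under stated assumptions: loggedCall, scrubSecrets, the retry policy (5xx and 429 only, no backoff), and the secret-matching pattern are hypothetical, and thrown network errors are not handled here.

```javascript
// Masks strings that look like Stripe-style keys and caps the logged body size.
// The pattern is illustrative; other providers' secrets need their own rules.
function scrubSecrets(text) {
  return text.replace(/\b(sk|rk|whsec)_[A-Za-z0-9_]+/g, "[redacted]").slice(0, 2000);
}

// Calls an external service and logs one line with service, endpoint,
// duration, final status, and how many retries were needed.
async function loggedCall({ service, url, init = {}, maxRetries = 2, fetchImpl = fetch, log = console.log }) {
  const endpoint = new URL(url).pathname; // path only: query strings can carry tokens
  const t0 = performance.now();
  let retries = 0;
  let res;
  for (;;) {
    res = await fetchImpl(url, init);
    const retryable = res.status >= 500 || res.status === 429;
    if (!retryable || retries >= maxRetries) break;
    retries++;
  }
  const line = {
    ts: new Date().toISOString(),
    service, endpoint,
    ms: Math.round(performance.now() - t0),
    status: res.status,
    retries,
  };
  if (!res.ok) {
    // Failed calls keep the response body (minus secrets) for forensics.
    // clone() leaves the original body readable by the caller.
    line.response_body = scrubSecrets(await res.clone().text());
  }
  log(JSON.stringify(line));
  return res;
}
```

Retrying a non-idempotent POST is only safe when the provider supports an idempotency key, so a real version would need a per-call opt-in rather than a blanket retry.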

D1 query timings

Each query through env.DB.prepare(...).run() is timed. The end-of-request log line counts queries and includes the longest one's signature (parameterized SQL, no values). This lets us spot accidental N+1 patterns.
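A per-request tracker is one way to collect these numbers. The sketch below assumes a hypothetical createQueryTracker helper; d1_queries matches the log line shown earlier, while d1_slowest_sql, d1_slowest_ms, d1_repeated, and the repeat threshold of 5 are invented for illustration.

```javascript
// Collects query counts and timings for a single request.
function createQueryTracker() {
  const bySignature = new Map(); // parameterized SQL -> number of executions
  let count = 0;
  let slowest = { sql: null, ms: 0 };

  // Records one finished query. `sql` must be the parameterized text, never bound values.
  function record(sql, ms) {
    count++;
    bySignature.set(sql, (bySignature.get(sql) ?? 0) + 1);
    if (slowest.sql === null || ms > slowest.ms) slowest = { sql, ms };
  }

  // Times an async query function, e.g.
  //   await tracker.timed(sql, () => env.DB.prepare(sql).bind(id).run());
  async function timed(sql, run) {
    const t0 = performance.now();
    try {
      return await run();
    } finally {
      record(sql, performance.now() - t0);
    }
  }

  // Fields to merge into the end-of-request log line. A signature executed
  // `repeatThreshold` times or more in one request is a likely N+1 pattern.
  function summary(repeatThreshold = 5) {
    const repeated = [...bySignature]
      .filter(([, n]) => n >= repeatThreshold)
      .map(([sql, n]) => ({ sql, count: n }));
    return {
      d1_queries: count,
      d1_slowest_sql: slowest.sql,
      d1_slowest_ms: Math.round(slowest.ms),
      d1_repeated: repeated,
    };
  }

  return { record, timed, summary };
}

// Example: one ticket lookup followed by six per-line lookups (an N+1 shape).
const tracker = createQueryTracker();
tracker.record("SELECT * FROM tickets WHERE id = ?", 9);
for (let i = 0; i < 6; i++) tracker.record("SELECT * FROM ticket_lines WHERE id = ?", 2);
console.log(JSON.stringify(tracker.summary()));
```

Keying on the parameterized SQL text is what makes the repeat count meaningful: six executions with different bound values still collapse into one signature.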

Where logs go

Cloudflare's Workers Logs (Logpush enabled, default) pipes every line to:

  • Cloudflare's Analytics for short-term retention (30 days)
  • An R2 archive bucket (Kvick-owned, not per-shop) for long-term retention (1 year)
  • Optionally, a third-party log aggregator (Logflare, Axiom) when shop count justifies it — TBD

The R2 archive bucket is the source of truth for "what happened on day X" — query with DuckDB over the Parquet exports.

What we alert on

Alerts go to Kvick's email + an oncall pager (when oncall starts being a thing). Alerts are tuned for action required, not "FYI."

| Alert | Trigger | What you do |
| --- | --- | --- |
| 5xx rate > 1% | Any shop, sustained 5 minutes | Worker logs, identify error class, deploy fix |
| Stripe webhook signature failure | 3+ in 1 hour for one shop | Possible secret rotation needed; check shop's Stripe config |
| D1 latency p95 > 200ms | Any shop, sustained 10 minutes | D1 status page, possibly index missing |
| Audit chain verification failed | Daily cron | High priority: investigate immediately, don't deploy until resolved |
| Twilio cost projection > 150% of budget | Weekly summary | Owner contact about SMS volume |
| AI cost projection > monthly cap | Daily | Pause AI; notify shop owner |
| Cron didn't run | Expected start + 30min, no end log | Check Cloudflare cron status; manually run if needed |
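To make "sustained" concrete, here is a sketch of how the 5xx rule could be evaluated over request log lines. It assumes one reading of the rule: every one of the last five one-minute buckets must exceed the threshold, and a minute with no traffic does not count as breaching. The function name shopsBreaching5xx and that interpretation are assumptions, not the documented alert logic.

```javascript
// Returns the shops whose 5xx rate exceeded `threshold` in every one of the
// last `minutes` one-minute buckets before `nowMs`.
function shopsBreaching5xx(lines, nowMs, { threshold = 0.01, minutes = 5 } = {}) {
  const windowStart = nowMs - minutes * 60_000;
  const buckets = new Map(); // "shop|minuteIndex" -> { total, errors }
  const shops = new Set();

  for (const line of lines) {
    const t = Date.parse(line.ts);
    if (t < windowStart || t >= nowMs) continue; // outside the window
    shops.add(line.shop);
    const key = `${line.shop}|${Math.floor((t - windowStart) / 60_000)}`;
    const bucket = buckets.get(key) ?? { total: 0, errors: 0 };
    bucket.total++;
    if (line.status >= 500) bucket.errors++;
    buckets.set(key, bucket);
  }

  const breaching = [];
  for (const shop of shops) {
    let sustained = true;
    for (let m = 0; m < minutes; m++) {
      const bucket = buckets.get(`${shop}|${m}`);
      // A quiet minute or a minute at or below the threshold breaks the streak.
      if (!bucket || bucket.errors / bucket.total <= threshold) {
        sustained = false;
        break;
      }
    }
    if (sustained) breaching.push(shop);
  }
  return breaching;
}
```

Requiring every bucket to breach keeps a single bad deploy-time blip from paging anyone, at the cost of reacting a few minutes later to a real outage.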

We don't alert on traffic spikes (Workers handles them), CPU usage (Workers isn't user-priced on CPU), or "warning" log lines (they're fine, that's what they're for).

Dashboards (per shop + aggregate)

One dashboard per shop, plus one aggregate dashboard for Kvick. Per-shop shows: requests/hr, error rate, p50/p95 latency, today's revenue (read directly from transactions), open tickets count, R2 storage growth.

Aggregate shows: shops with elevated errors, total cron runs today, total AI spend, top error classes across shops.

The dashboards are not the primary observability tool — alerts are. The dashboards are for confirming-things-are-fine and for ad-hoc investigations.

Request IDs

Every request gets a request_id (cf-ray + a short random suffix). The ID flows:

  • Generated at the edge in middleware
  • Set as a response header X-Request-Id (so the operator sees it in error toasts)
  • Included in every log line for the request
  • Propagated to external API calls as a header where supported (Stripe respects Idempotency-Key; we use the request_id as the seed for that)
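The flow in the list above can be sketched with three small helpers. The names makeRequestId, withRequestId, and idempotencyKey, the req_ prefix layout, the "noray" fallback, and the key format are assumptions for illustration; only the cf-ray header, the X-Request-Id header, and Stripe's Idempotency-Key come from the text.

```javascript
// Builds a request_id from the cf-ray header plus a short random suffix.
function makeRequestId(request) {
  const ray = request.headers.get("cf-ray") ?? "noray"; // header may be absent in local dev
  // Six base-36 characters; Math.random is fine here because the ID is not a secret.
  const suffix = Math.random().toString(36).slice(2, 8).padEnd(6, "0");
  return `req_${ray}_${suffix}`;
}

// Returns a copy of `response` carrying X-Request-Id, so the UI can show the ID
// in error toasts. A copy is needed because upstream response headers can be immutable.
function withRequestId(response, requestId) {
  const headers = new Headers(response.headers);
  headers.set("X-Request-Id", requestId);
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}

// Derives a Stripe Idempotency-Key from the request_id. The operation label keeps
// two different Stripe calls made within the same request from sharing a key.
function idempotencyKey(requestId, operation) {
  return `${requestId}:${operation}`;
}
```

Because the key is derived from the request_id, a retry of the same Stripe call inside one request reuses the key, while a fresh request from the user gets a new one.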

When a shop reports "I clicked save and got an error," the request_id from their error toast leads to a log query, and from there to the root cause in under a minute. This is the value loop.

Build version as a forensic dimension

Alongside request_id, every audit row carries the build_version that produced it. See build versioning for the full design. The forensic pivots this enables:

  • "Did this bug land in v0.4.10 or v0.4.11?": group recent error events by build_version, find the version where the error class first appears.
  • "Which builds did this customer's records flow through?": SELECT DISTINCT build_version FROM audit_events WHERE context_customer_id = ?
  • "Show me everything from build X": WHERE build_version = ?, fast via the filtered index.
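The first pivot depends on ordering versions numerically, since a plain string sort would place v0.4.10 before v0.4.9. A small sketch, assuming versions look like the vMAJOR.MINOR.PATCH examples above and that error events are available as objects with build_version and error_class fields (the event shape and function names are hypothetical):

```javascript
// "v0.4.10" -> [0, 4, 10]
function parseVersion(version) {
  return version.replace(/^v/, "").split(".").map(Number);
}

// Numeric, component-wise comparison suitable for Array.prototype.sort.
function compareVersions(a, b) {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Returns the earliest build_version in which `errorClass` appears, or null.
function firstBuildWithError(events, errorClass) {
  const versions = events
    .filter((e) => e.error_class === errorClass)
    .map((e) => e.build_version);
  if (versions.length === 0) return null;
  return versions.sort(compareVersions)[0];
}

const events = [
  { build_version: "v0.4.9", error_class: "Timeout" },
  { build_version: "v0.4.11", error_class: "BlockedByDependents" },
  { build_version: "v0.4.10", error_class: "BlockedByDependents" },
];
console.log(firstBuildWithError(events, "BlockedByDependents")); // v0.4.10
```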

The operator-facing surface: the user-menu shows the current build version. When the operator opens a Beta Comments report, the version snapshot is captured at click time, so feedback resolves to the exact build the operator was on — not just the date.

What we deliberately don't have

  • Distributed tracing (OpenTelemetry). Workers is a single-runtime system; the value of tracing is span-across-services and we don't have that.
  • Custom metrics for everything. The structured log lines + Cloudflare's metrics give us what we need. Adding StatsD or Prometheus would be a separate system to maintain.
  • Real-time tail of every shop's logs. Logs are searchable in retrospect, not streamed live. wrangler tail --env swicked is available when you need it.
  • Heavy APM. Datadog / New Relic / Sentry's full APM is heavyweight for a Workers app. Sentry's error-only product is reasonable; we'll add it when error volume crosses a threshold that makes hand-grepping logs painful.

How instrumentation lands in code

The simplest pattern: a middleware that wraps the request handler.

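// Sketch: wraps the top-level fetch handler so every request emits one log line.
// makeRequestId is assumed to be defined elsewhere (cf-ray + short random suffix).
async function withInstrumentation(handler, request, env, ctx) {
  const request_id = makeRequestId(request);
  const t0 = performance.now();
  let status = 500, error_class = null, error_msg = null;
  try {
    // request_id is passed as its own argument. Spreading ctx ({ ...ctx }) copies
    // only own enumerable properties, so methods such as ctx.waitUntil would be lost.
    const res = await handler(request, env, ctx, request_id);
    status = res.status;
    return res;
  } catch (e) {
    // Unhandled exception: record it and return a generic 500 that still
    // carries the request ID for the operator's error toast.
    error_class = e?.constructor?.name ?? 'UnknownError';
    error_msg = e?.message ?? String(e);
    return new Response('Internal Error', {
      status: 500,
      headers: { 'X-Request-Id': request_id }
    });
  } finally {
    // Runs on both the success and error paths: exactly one line per request.
    const ms = Math.round(performance.now() - t0);
    console.log(JSON.stringify({
      ts: new Date().toISOString(),
      request_id, shop: env.SHOP_SLUG,
      path: new URL(request.url).pathname,
      method: request.method,
      status, ms, error_class, error_msg
    }));
  }
}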

This will live in src/lib/instrumentation.js once slice 1 lands. The wrapper sits around the top-level fetch handler, so every code path is instrumented automatically.

See also