Adding Structured Logs Where They Matter

What This Concept Is

A structured log is a record with stable, queryable fields -- not a sentence with values spliced into it. Instead of "user alice failed login from 10.0.0.5", you emit:

{"ts":"2026-04-22T14:31:02Z","level":"warn","event":"auth.login.failed",
"user_id":"alice","src_ip":"10.0.0.5","reason":"bad_password","request_id":"r-7f3a","trace_id":"t-9a12"}

The difference is not cosmetic. The second form can be aggregated, filtered, joined with traces, and turned into SLI counts. The first cannot -- without a regex zoo that will rot in six months.
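As a minimal sketch of how such a record could be produced with only the Python standard library (the helper name format_event is hypothetical, not part of any logging library):

```python
import json
from datetime import datetime, timezone


def format_event(level, event, **fields):
    """Render one structured log record as a single JSON line."""
    now = datetime.now(timezone.utc)
    record = {
        # UTC timestamp with millisecond precision, "Z" suffix instead of "+00:00"
        "ts": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "event": event,
    }
    # Event-specific context goes in named keys, never into a prose message.
    record.update(fields)
    return json.dumps(record, separators=(",", ":"))


line = format_event(
    "warn",
    "auth.login.failed",
    user_id="alice",
    src_ip="10.0.0.5",
    reason="bad_password",
    request_id="r-7f3a",
    trace_id="t-9a12",
)
print(line)

# Any consumer can read the fields back without a regex.
parsed = json.loads(line)
```

In a real service a structured-logging library would normally do this for you; the point of the sketch is only that every value lands in a named, machine-readable key.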

"Where they matter" means decision boundaries: places where your code made a choice that you might later need to reconstruct. Authentication, authorization, retries, fallbacks, partial failures, background job transitions, external API responses, circuit-breaker state changes. Logging inside a tight inner loop is almost always noise; logging the decision the loop produced is usually the event.

Structured logging sits inside the broader OpenTelemetry "signals" model -- logs, metrics, traces -- and OTel's semantic conventions prescribe stable attribute names (e.g. http.request.method, user.id, exception.type) so that the same field means the same thing in Python, Go, and a shipped Grafana dashboard. Adopt a subset of the conventions early; the cost of renaming userID -> user.id across four services during the production readiness review (PRR) is higher than the cost of agreeing now.

Finally: structured logs are not a dumping ground. Each log line should carry the minimum context to reconstruct the decision -- request_id, trace_id, tenant_id, a reason code, a duration when relevant -- and not the full request body or any PII you would not want printed on a billboard.

Why It Matters Here (In the Capstone)

Downstream of this module, logs drive three things:

  • SLIs: good / total counts come from aggregating structured events with known names.
  • Traces: spans get anchored to log lines through trace_id and span_id fields.
  • Runbooks: the first five checks in any runbook look like "filter logs for event=X and tenant=Y."

If you emit unstructured text you will spend more time writing grep patterns than debugging. If you emit structured logs everywhere you will drown in volume and pay too much. The judgment is where. Concept 7 (STRIDE) also relies on logs for Repudiation defenses -- without structured audit events, you cannot prove who did what when under even mild scrutiny.

Concrete Example(s) -- from a real capstone

A tiny webhook-handler service with three decision boundaries. Good logs look like this:

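# Body of the webhook handler; rid, tid, provider, body, sig, secret,
# topic and event are assumed to be set by the surrounding code.
log.info("webhook.received", request_id=rid, provider=provider, bytes=len(body), trace_id=tid)

# Decision boundary 1: is the signature valid?
try:
    verify_signature(body, sig, secret)
except BadSignature:
    # reason is a bounded code, not the exception text, so it can be grouped on
    log.warn("webhook.signature_rejected", request_id=rid, provider=provider,
             reason="bad_signature", trace_id=tid)
    return 401

# Decision boundary 2: is this provider within its rate limit?
if not within_rate_limit(provider):
    log.warn("webhook.rate_limited", request_id=rid, provider=provider, trace_id=tid)
    return 429

# Decision boundary 3: did the event reach the queue?
try:
    queue.publish(event, attempt=1)
    log.info("webhook.enqueued", request_id=rid, topic=topic, trace_id=tid)
except QueueTimeout:
    log.error("webhook.enqueue_failed", request_id=rid, topic=topic,
              attempt=1, next_action="retry_async", trace_id=tid)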

Five events total: one at ingress, then one per outcome at the three decision boundaries (signature check, rate limit, enqueue). Field names are stable: request_id, trace_id, provider, reason, topic, attempt, next_action. From this alone you can:

  • count rejections per provider (event=webhook.signature_rejected | group by provider)
  • compute an SLI (enqueued / received per 5m)
  • filter to one bad request ID in under a second
  • jump from a log line into a trace by clicking the trace_id, if your log UI supports log-to-trace correlation
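The first two queries in the list above can be sketched in plain Python over JSON log lines. This is an illustration under assumptions: the sample lines and the provider names acme and globex are made up, each webhook event is assumed to carry a provider field, and a real aggregator would run the equivalent query inside a 5-minute window.

```python
import json
from collections import Counter

# Hypothetical sample of structured log lines, one JSON object per line.
SAMPLE_LINES = [
    '{"event":"webhook.received","provider":"acme","request_id":"r-1"}',
    '{"event":"webhook.enqueued","provider":"acme","request_id":"r-1"}',
    '{"event":"webhook.received","provider":"acme","request_id":"r-2"}',
    '{"event":"webhook.signature_rejected","provider":"acme","request_id":"r-2"}',
    '{"event":"webhook.received","provider":"acme","request_id":"r-3"}',
    '{"event":"webhook.enqueued","provider":"acme","request_id":"r-3"}',
    '{"event":"webhook.received","provider":"globex","request_id":"r-4"}',
    '{"event":"webhook.enqueued","provider":"globex","request_id":"r-4"}',
]


def parse(lines):
    """Decode each non-empty line into a dict."""
    return [json.loads(line) for line in lines if line.strip()]


def rejections_by_provider(records):
    """event=webhook.signature_rejected | group by provider"""
    return Counter(
        r.get("provider", "unknown")
        for r in records
        if r.get("event") == "webhook.signature_rejected"
    )


def enqueue_ratio(records):
    """SLI: enqueued / received; None when nothing was received."""
    counts = Counter(r.get("event") for r in records)
    received = counts["webhook.received"]
    return counts["webhook.enqueued"] / received if received else None


records = parse(SAMPLE_LINES)
print(rejections_by_provider(records))
print(enqueue_ratio(records))
```

Neither function needs a regex or any knowledge of message wording; both depend only on the event name and one field staying stable.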

What this does not include: a log inside for item in body: or after every database read. Those belong in traces or metrics, not logs.

Capstone event schema (library/raw/logging.md, one page):

| Field | Type | Always? | Notes |
| --- | --- | --- | --- |
| ts | ISO-8601 | yes | UTC, millisecond precision |
| level | enum | yes | debug / info / warn / error |
| event | area.object.verb | yes | past tense, dotted, stable |
| request_id | string | yes at request ingress | flows end-to-end |
| trace_id | string | when tracing active | 32 hex, OTel format |
| tenant_id | string | when tenant-scoped | never PII |
| reason | enum | on warn/error | bounded cardinality |
| duration_ms | int | on completed operations | for latency rollups |

Eight fields. Everything else is event-specific and goes in named keys, not prose.
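One way to keep a schema like this from drifting is a small check that runs in tests or a pre-commit hook. The sketch below is an assumption about how such a check might look, not the capstone's actual tooling; the function name validate and the exact rules (for example, accepting event names with two or more dotted segments) are illustrative choices.

```python
import re

LEVELS = {"debug", "info", "warn", "error"}
# Dotted lowercase name with at least two segments, e.g. webhook.enqueued
EVENT_NAME = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
# OTel trace IDs are 16 bytes, rendered as 32 lowercase hex characters
TRACE_ID = re.compile(r"^[0-9a-f]{32}$")


def validate(record):
    """Return a list of schema problems; an empty list means the record passes."""
    problems = []
    for field in ("ts", "level", "event"):
        if field not in record:
            problems.append(f"missing required field: {field}")

    level = record.get("level")
    if level is not None and (not isinstance(level, str) or level not in LEVELS):
        problems.append(f"level not in schema: {level!r}")

    event = record.get("event")
    if event is not None and not (isinstance(event, str) and EVENT_NAME.match(event)):
        problems.append(f"event is not a dotted lowercase name: {event!r}")

    for field in ("request_id", "trace_id", "tenant_id", "reason"):
        if field in record and not isinstance(record[field], str):
            problems.append(f"{field} must be a string")

    trace_id = record.get("trace_id")
    if isinstance(trace_id, str) and not TRACE_ID.match(trace_id):
        problems.append("trace_id must be 32 lowercase hex characters")

    if level in ("warn", "error") and "reason" not in record:
        problems.append("reason is required on warn/error")

    duration = record.get("duration_ms")
    # bool is a subclass of int in Python, so reject it explicitly
    if "duration_ms" in record and (isinstance(duration, bool) or not isinstance(duration, int)):
        problems.append("duration_ms must be an int")

    return problems


good = {
    "ts": "2026-04-22T14:31:02.000Z",
    "level": "warn",
    "event": "auth.login.failed",
    "request_id": "r-7f3a",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "reason": "bad_password",
}
bad = {"ts": "2026-04-22T14:31:02.000Z", "level": "warning",
       "event": "Login Failed", "duration_ms": "12"}

print(validate(good))
print(validate(bad))
```

Running this against a sample of each service's emitted lines would catch type drift, such as one service sending duration_ms as a string, before the aggregator does.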

Common Confusion / Misconceptions

"We use printf-style logs and grep them." Fine at hello-world scale; catastrophic past that. Logs without fields become a liability the first time you have to correlate two services -- and you will, the first time an incident crosses a process boundary.

"Structured logs are just JSON." Structured means stable field names with stable types. If one service logs user_id as a string and another as an int, your aggregator will treat them as different fields. Define a small schema -- the table above is enough for a capstone -- and enforce it in a linter or a pre-commit hook.

"Log everything just in case." Volume is not free: it costs money to store, it dilutes signal, and it leaks PII. Structured logs still require sampling and redaction. Log the decision and the minimum context required to reconstruct it -- not the full request body.
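A redaction pass can be a small function applied to every record before it is serialized. The sketch below is one possible shape, with assumptions labeled: the field lists, the _hash suffix, and REDACTION_KEY are hypothetical, and a real key would come from secret configuration rather than source code. A keyed hash (HMAC) is used because a plain hash of a low-entropy value such as an SSN can be reversed by brute force.

```python
import hashlib
import hmac

# Hypothetical policy: which keys to hash and which to drop entirely.
HASH_FIELDS = {"email", "ssn", "card_number"}
DROP_FIELDS = {"body"}
# Placeholder only; load a real key from secret storage.
REDACTION_KEY = b"replace-with-a-secret-from-config"


def redact(record):
    """Return a copy of the record with PII hashed and bulky fields removed."""
    clean = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue  # never ship the full request body
        if key in HASH_FIELDS:
            digest = hmac.new(REDACTION_KEY, str(value).encode("utf-8"), hashlib.sha256)
            # A truncated keyed hash still lets you group by "same user"
            clean[f"{key}_hash"] = digest.hexdigest()[:16]
        else:
            clean[key] = value
    return clean


raw = {
    "event": "auth.login.failed",
    "request_id": "r-7f3a",
    "email": "alice@example.com",
    "body": "{...full request payload...}",
}
redacted = redact(raw)
print(redacted)
```

Hashing rather than dropping keeps the field useful for correlation (the same email always maps to the same hash) without storing the value itself.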

"Logs and traces are redundant." They overlap, but they solve different problems. Logs tell you that a decision was made and why; traces tell you how long it took and what else was called during it. You want both, correlated by trace_id -- concept 6 depends on this correlation.

"We'll sample logs the same way we sample traces." Tempting, mostly wrong. Traces sample whole requests; logs sample event types. A signature-verification failure event that happens twenty times a day should not be downsampled even if traces for its parent requests are dropped, because the pattern is what matters, not the individual request.
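Sampling by event type, rather than by request, might look like the sketch below. The rate table and the rule that warn and error are never sampled are assumptions chosen for illustration; the names SAMPLE_RATES and should_emit are hypothetical.

```python
import random

# Hypothetical per-event sample rates; events not listed are always kept.
SAMPLE_RATES = {
    "webhook.received": 0.1,  # high volume, low information per line
}
# Assumption for this sketch: rejections and failures are never downsampled.
ALWAYS_KEEP_LEVELS = {"warn", "error"}


def should_emit(record, rng=random.random):
    """Decide per event type, independent of whether the trace was sampled."""
    if record.get("level") in ALWAYS_KEEP_LEVELS:
        return True
    rate = SAMPLE_RATES.get(record.get("event"), 1.0)
    return rng() < rate


rejection = {"level": "warn", "event": "webhook.signature_rejected"}
received = {"level": "info", "event": "webhook.received"}

print(should_emit(rejection))  # always True under the assumed policy
print(should_emit(received))   # True for roughly 10% of calls
```

The rng parameter exists only so the decision can be tested deterministically. If you sample an event that feeds an SLI, remember to scale the counts back up by the sample rate.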

How To Use It (In Your Capstone)

  1. List the decision boundaries in your capstone. Usually 5-15 per service.
  2. Name each event with a dotted, stable name: <area>.<object>.<verb>, past tense. Examples: auth.login.failed, webhook.enqueued, payment.refund.issued.
  3. Define a small field schema (see table above). Use the same field names across services; adopt a subset of OpenTelemetry's semantic conventions where applicable (http.*, db.*, messaging.*).
  4. Emit at info for normal outcomes, warn for user-visible rejections, error only for things that need on-call attention.
  5. Add request-ID propagation on every ingress and pass it through to every outgoing call so logs across services can be joined.
  6. Add a redaction pass in the logger middleware: PII fields (email, ssn, card_number, full body) are stripped or hashed before they hit storage.
  7. Confirm in your aggregator (CloudWatch Logs Insights, Loki, Datadog, etc.) that you can filter by any schema field without a regex. If you cannot, the schema is aspirational.
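Step 5, request-ID propagation, can be sketched with the standard library's contextvars module. Everything named here is an assumption for illustration: X-Request-ID is a common but non-standard header, and begin_request, outgoing_headers and log_event are hypothetical helpers rather than any framework's API.

```python
import contextvars
import json
import uuid

# Holds the current request's ID for the duration of the request context.
_request_id = contextvars.ContextVar("request_id", default=None)

HEADER = "X-Request-ID"  # assumed header name


def begin_request(headers):
    """At ingress: reuse the caller's ID if present, otherwise mint one."""
    rid = headers.get(HEADER) or f"r-{uuid.uuid4().hex[:8]}"
    _request_id.set(rid)
    return rid


def outgoing_headers():
    """Attach the current request ID to every outgoing call."""
    rid = _request_id.get()
    return {HEADER: rid} if rid else {}


def log_event(level, event, **fields):
    """Every log line picks up request_id without the caller passing it."""
    record = {"level": level, "event": event, "request_id": _request_id.get(), **fields}
    return json.dumps(record)


# Usage: an upstream service already assigned an ID, so we keep it.
begin_request({"X-Request-ID": "r-7f3a"})
line = log_event("info", "webhook.enqueued", topic="events")
print(line)
print(outgoing_headers())
```

Because the ID lives in a context variable, handlers do not need to thread it through every function signature, and concurrent requests running in separate contexts do not see each other's IDs.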

Check Yourself

  1. Why is event="webhook.enqueued" a better log key than the free-text "enqueued webhook for provider X"?
  2. What three fields should appear on almost every log line for correlation to work across services?
  3. Name two places in a typical request path where a log is not worth the cost.
  4. If one service logs user_id as a string and another as an int, what practical thing breaks first?
  5. What problem do OpenTelemetry's semantic conventions solve that a per-team schema cannot?
  6. Which log level should a "user-visible rejection" be emitted at, and why not error?

Mini Drill or Application (Capstone-scoped)

  1. Pick one service in your capstone and, in 20 minutes, list every decision boundary in it.
  2. Give each a dotted event name following area.object.verb past tense. Commit the list.
  3. Define the minimum field set for that service and write it into library/raw/logging.md as a one-page schema.
  4. Replace the top five free-text logs with structured equivalents, passing trace_id and request_id through.
  5. Confirm in your aggregator that you can filter by any field without a regex. Paste a screenshot into the runbook for the top incident.

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.