Adding Structured Logs Where They Matter
What This Concept Is
A structured log is a record with stable, queryable fields -- not a sentence with values spliced into it. Instead of "user alice failed login from 10.0.0.5", you emit:
{"ts":"2026-04-22T14:31:02Z","level":"warn","event":"auth.login.failed",
"user_id":"alice","src_ip":"10.0.0.5","reason":"bad_password","request_id":"r-7f3a","trace_id":"t-9a12"}
The difference is not cosmetic. The second form can be aggregated, filtered, joined with traces, and turned into SLI counts. The first cannot -- without a regex zoo that will rot in six months.
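A minimal sketch of producing such a record, assuming only Python's standard library. The helper name `format_event` is hypothetical and not part of any logging library; a real service would more likely configure an existing structured logger.

```python
import json
from datetime import datetime, timezone


def format_event(level, event, **fields):
    """Build one JSON log line: fixed keys first, event-specific keys after."""
    now = datetime.now(timezone.utc)
    record = {
        # UTC timestamp with millisecond precision and a trailing Z
        "ts": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "event": event,
    }
    record.update(fields)
    return json.dumps(record)


line = format_event(
    "warn", "auth.login.failed",
    user_id="alice", src_ip="10.0.0.5", reason="bad_password",
    request_id="r-7f3a", trace_id="t-9a12",
)
print(line)

# Filtering becomes a key lookup instead of a regex over a sentence.
parsed = json.loads(line)
is_bad_password = (
    parsed["event"] == "auth.login.failed" and parsed["reason"] == "bad_password"
)
```

The same lookup against the free-text sentence would need a pattern that breaks as soon as someone rewords the message.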
"Where they matter" means decision boundaries: places where your code made a choice that you might later need to reconstruct. Authentication, authorization, retries, fallbacks, partial failures, background job transitions, external API responses, circuit-breaker state changes. Logging inside a tight inner loop is almost always noise; logging the decision the loop produced is usually the event.
Structured logging sits inside the broader OpenTelemetry "signals" model -- logs, metrics, traces -- and OTel's semantic conventions prescribe stable attribute names (e.g. http.request.method, user.id, exception.type) so that the same field means the same thing in Python, in Go, and in a shipped Grafana dashboard. Adopt a subset of the conventions early; the cost of renaming userID -> user.id across four services during the production readiness review (PRR) is higher than the cost of agreeing now.
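If a rename does become necessary, one way to stage it is a small shim that maps old keys to the agreed names before a record is emitted. This is only a sketch: the legacy keys on the left (`userID`, `method`, `exc_type`) are hypothetical, and only the target names come from the conventions mentioned above.

```python
# Hypothetical legacy keys on the left; agreed convention names on the right.
LEGACY_TO_SEMCONV = {
    "userID": "user.id",
    "method": "http.request.method",
    "exc_type": "exception.type",
}


def normalize_fields(fields):
    """Return a copy of fields with legacy keys renamed; other keys pass through."""
    return {LEGACY_TO_SEMCONV.get(key, key): value for key, value in fields.items()}


normalized = normalize_fields(
    {"userID": "alice", "method": "GET", "request_id": "r-7f3a"}
)
```

A shim like this buys time, but it is still cheaper to agree on the names before the first service ships.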
Finally: structured logs are not a dumping ground. Each log line should carry the minimum context to reconstruct the decision -- request_id, trace_id, tenant_id, a reason code, a duration when relevant -- and not the full request body or any PII you would not want printed on a billboard.
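One way to keep that minimum context on every line without passing it by hand is to bind it once per request. The sketch below assumes the standard-library `contextvars` module; `bind_context` and `log_line` are hypothetical helper names, and the timestamp is omitted for brevity.

```python
import contextvars
import json

# Per-request correlation fields; the default dict is never mutated in place.
_log_context = contextvars.ContextVar("log_context", default={})


def bind_context(**fields):
    """Merge correlation fields into the current context and return a reset token."""
    merged = {**_log_context.get(), **fields}
    return _log_context.set(merged)


def log_line(level, event, **fields):
    """Render one JSON line that carries whatever context is currently bound."""
    record = {"level": level, "event": event, **_log_context.get(), **fields}
    return json.dumps(record)


# At request ingress: bind once, then every later line carries the IDs.
token = bind_context(request_id="r-7f3a", trace_id="t-9a12", tenant_id="tn-42")
line = log_line("info", "webhook.enqueued", topic="orders")

# At the end of the request: restore the previous (empty) context.
_log_context.reset(token)
after_reset = log_line("info", "job.started")
```

Because the context lives in a `ContextVar`, concurrent requests in the same process do not see each other's IDs.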
Why It Matters Here (In the Capstone)
Downstream of this module, logs drive three things:
- SLIs: good / total counts come from aggregating structured events with known names.
- Traces: spans get anchored to log lines through trace_id and span_id fields.
- Runbooks: the first five checks in any runbook look like "filter logs for event=X and tenant=Y."
If you emit unstructured text you will spend more time writing grep patterns than debugging. If you emit structured logs everywhere you will drown in volume and pay too much. The judgment is where. Concept 7 (STRIDE) also relies on logs for Repudiation defenses -- without structured audit events, you cannot prove who did what when under even mild scrutiny.
Concrete Example(s) -- from a real capstone
A tiny webhook-handler service with three decision boundaries. Good logs look like this:
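    # Sketch of the handler; log, verify_signature, within_rate_limit, queue,
    # and the exception types are assumed to be defined elsewhere.
    def handle_webhook(body, sig, secret, provider, topic, event, rid, tid):
        log.info("webhook.received", request_id=rid, provider=provider,
                 bytes=len(body), trace_id=tid)

        # Decision boundary 1: signature verification.
        try:
            verify_signature(body, sig, secret)
        except BadSignature:
            # reason is a bounded enum value, not the exception text
            log.warn("webhook.signature_rejected", request_id=rid, provider=provider,
                     reason="bad_signature", trace_id=tid)
            return 401

        # Decision boundary 2: rate limiting.
        if not within_rate_limit(provider):
            log.warn("webhook.rate_limited", request_id=rid, provider=provider,
                     reason="rate_limit_exceeded", trace_id=tid)
            return 429

        # Decision boundary 3: hand-off to the queue.
        try:
            queue.publish(event, attempt=1)
            log.info("webhook.enqueued", request_id=rid, topic=topic, trace_id=tid)
        except QueueTimeout:
            log.error("webhook.enqueue_failed", request_id=rid, topic=topic,
                      reason="queue_timeout", attempt=1, next_action="retry_async",
                      trace_id=tid)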
Five events total: one at ingress, plus one for each logged outcome of the three decision boundaries. Field names are stable: request_id, trace_id, provider, reason, topic, attempt, next_action. From this alone you can:
- count rejections per provider (event=webhook.signature_rejected | group by provider)
- compute an SLI (enqueued / received per 5m)
- filter to one bad request ID in under a second
- jump from a log line into a trace by clicking the trace_id, if your log UI supports log-to-trace correlation
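The first three queries can be sketched in plain Python over a handful of JSON lines. The sample lines, provider names, and topic below are hypothetical, and the 5-minute window is left out for brevity; a real aggregator would run the equivalent query server-side.

```python
import json
from collections import Counter

# Hypothetical log lines shaped like the handler's events.
LINES = [
    '{"event":"webhook.received","provider":"acme","request_id":"r-1"}',
    '{"event":"webhook.enqueued","topic":"orders","request_id":"r-1"}',
    '{"event":"webhook.received","provider":"globex","request_id":"r-2"}',
    '{"event":"webhook.signature_rejected","provider":"globex","request_id":"r-2"}',
    '{"event":"webhook.received","provider":"acme","request_id":"r-3"}',
    '{"event":"webhook.enqueued","topic":"orders","request_id":"r-3"}',
    '{"event":"webhook.received","provider":"acme","request_id":"r-4"}',
    '{"event":"webhook.enqueued","topic":"orders","request_id":"r-4"}',
]
events = [json.loads(line) for line in LINES]

# Count rejections per provider: filter on event, group by a field.
rejections = Counter(
    e["provider"] for e in events if e["event"] == "webhook.signature_rejected"
)

# SLI: enqueued / received, guarding against an empty window.
by_name = Counter(e["event"] for e in events)
received = by_name["webhook.received"]
sli = by_name["webhook.enqueued"] / received if received else None

# Everything that happened to one request ID.
one_request = [e["event"] for e in events if e["request_id"] == "r-2"]

print(rejections, sli, one_request)
```

None of these steps needs a regex, which is the practical test of whether the fields are stable.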
What this does not include: a log inside for item in body: or after every database read. Those belong in traces or metrics, not logs.
Capstone event schema (library/raw/logging.md, one page):
| Field | Type | Always? | Notes |
|---|---|---|---|
| ts | ISO-8601 | yes | UTC, millisecond precision |
| level | enum | yes | debug / info / warn / error |
| event | area.object.verb | yes | past tense, dotted, stable |
| request_id | string | yes at request ingress | flows end-to-end |
| trace_id | string | when tracing active | 32 hex, OTel format |
| tenant_id | string | when tenant-scoped | never PII |
| reason | enum | on warn/error | bounded cardinality |
| duration_ms | int | on completed operations | for latency rollups |
Eight fields. Everything else is event-specific and goes in named keys, not prose.
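A schema this small can be checked mechanically, for example in a test or a pre-commit hook. The sketch below is one possible shape for such a check, not a standard tool: `validate` is a hypothetical helper, and it covers only presence, types, the level enum, the dotted naming, and the reason rule. It does not check the timestamp format or past tense.

```python
import re

REQUIRED = {"ts": str, "level": str, "event": str}
CONDITIONAL = {
    "request_id": str, "trace_id": str, "tenant_id": str,
    "reason": str, "duration_ms": int,
}
LEVELS = ("debug", "info", "warn", "error")
# Two or more lowercase segments separated by dots, e.g. auth.login.failed
EVENT_NAME = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+)+$")


def validate(record):
    """Return a list of schema problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED:
        if field not in record:
            problems.append(f"missing required field: {field}")
    for field, expected in {**REQUIRED, **CONDITIONAL}.items():
        if field in record and not isinstance(record[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    level = record.get("level")
    if isinstance(level, str) and level not in LEVELS:
        problems.append(f"unknown level: {level}")
    event = record.get("event")
    if isinstance(event, str) and not EVENT_NAME.match(event):
        problems.append(f"event name is not dotted lowercase: {event}")
    if level in ("warn", "error") and "reason" not in record:
        problems.append("warn/error events need a reason")
    return problems


good = {"ts": "2026-04-22T14:31:02.000Z", "level": "info",
        "event": "webhook.enqueued", "request_id": "r-7f3a", "duration_ms": 12}
bad = {"ts": "2026-04-22T14:31:02.000Z", "level": "warn",
       "event": "webhook.rate_limited", "duration_ms": "12"}
print(validate(good))
print(validate(bad))
```

Running a check like this over the events each service emits in its tests is one way to catch type drift before the aggregator does.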
Common Confusion / Misconceptions
"We use printf-style logs and grep them." Fine at hello-world scale; catastrophic past that. Logs without fields become a liability the first time you have to correlate two services -- and you will, the first time an incident crosses a process boundary.
"Structured logs are just JSON." Structured means stable field names with stable types. If one service logs user_id as a string and another as an int, your aggregator will treat them as different fields. Define a small schema -- the table above is enough for a capstone -- and enforce it in a linter or a pre-commit hook.
"Log everything just in case." Volume is not free: it costs money to store, it dilutes signal, and it leaks PII. Structured logs still require sampling and redaction. Log the decision and the minimum context required to reconstruct it -- not the full request body.
"Logs and traces are redundant." They overlap, but they solve different problems. Logs tell you that a decision was made and why; traces tell you how long it took and what else was called during it. You want both, correlated by trace_id -- concept 6 depends on this correlation.
"We'll sample logs the same way we sample traces." Tempting, mostly wrong. Traces sample whole requests; logs sample event types. A signature-verification failure event that happens twenty times a day should not be downsampled even if traces for its parent requests are dropped, because the pattern is what matters, not the individual request.
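A sketch of per-event-type sampling under these assumptions: the event names are taken from the webhook example, the 10% rate is illustrative, and `should_keep` is a hypothetical helper rather than a feature of any particular log pipeline. Hashing the event name and request ID makes the decision deterministic, so a retry of the same decision gives the same answer.

```python
import hashlib

# Rare, security-relevant, or failure events are never downsampled.
ALWAYS_KEEP = {"webhook.signature_rejected", "webhook.enqueue_failed"}
# High-volume events get an explicit rate; the value here is illustrative.
SAMPLE_RATES = {"webhook.received": 0.1}
DEFAULT_RATE = 1.0


def should_keep(event, request_id):
    """Decide per event type whether to keep a line, deterministically."""
    if event in ALWAYS_KEEP:
        return True
    rate = SAMPLE_RATES.get(event, DEFAULT_RATE)
    digest = hashlib.sha256(f"{event}:{request_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


request_ids = [f"r-{i}" for i in range(10000)]
kept_rejections = sum(should_keep("webhook.signature_rejected", r) for r in request_ids)
kept_received = sum(should_keep("webhook.received", r) for r in request_ids)
print(kept_rejections, kept_received)
```

The rejection events all survive regardless of what the trace sampler decided for their parent requests, while the high-volume ingress event is thinned to roughly a tenth.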
How To Use It (In Your Capstone)
- List the decision boundaries in your capstone. Usually 5-15 per service.
- Name each event with a dotted, stable name: <area>.<object>.<verb>, past tense. Examples: auth.login.failed, webhook.enqueued, payment.refund.issued.
- Define a small field schema (see table above). Use the same field names across services; adopt a subset of OpenTelemetry's semantic conventions where applicable (http.*, db.*, messaging.*).
- Emit at info for normal outcomes, warn for user-visible rejections, error only for things that need on-call attention.
- Add request-ID propagation on every ingress and pass it through to every outgoing call so logs across services can be joined.
- Add a redaction pass in the logger middleware: PII fields (email, ssn, card_number, full body) are stripped or hashed before they hit storage.
- Confirm in your aggregator (CloudWatch Logs Insights, Loki, Datadog, etc.) that you can filter by any schema field without a regex. If you cannot, the schema is aspirational.
See also (integrative)
- S8 M04 Cluster 5 -- observability three pillars (metrics/logs/traces), where the taxonomy and correlation-ID discipline were introduced: ../../../../semester-08-system-design-leadership/module-04-scale-reliability-performance/concepts/cluster-05-incident-and-observability/13-observability-three-pillars-metrics-logs-traces-primary.md.
- S9 M05 Cluster 4 -- structured logging and log routing in cloud: log aggregation, retention, and redaction pipelines: ../../../../semester-09-cloud-devops/module-05-cloud-security-observability/concepts/cluster-04-observability-pillars-in-cloud/11-structured-logging-and-log-routing-primary.md.
- S9 M05 Cluster 2 -- data classification & minimization: decides what to strip before a log hits persistence: ../../../../semester-09-cloud-devops/module-05-cloud-security-observability/concepts/cluster-02-secrets-keys-and-data/06-data-classification-and-minimization-primary.md.
- S5 M05 -- HTTP status codes, the vocabulary event names and reason enums build on: ../../../../semester-05-os-networking/module-05-network-protocols-sockets/concepts/cluster-04-application-protocols-and-http/10-http-1-1-request-response-methods-status-codes-primary.md.
- OpenTelemetry -- Concepts -- the signals model (logs, metrics, traces) and how they interlock.
- OpenTelemetry -- Semantic conventions -- stable attribute names you should adopt instead of inventing your own.
- Google SRE Workbook -- Monitoring -- when logs are the right signal versus metrics or traces (speed, cardinality, retention tradeoffs).
Check Yourself
- Why is event="webhook.enqueued" a better log key than the free-text "enqueued webhook for provider X"?
- What three fields should appear on almost every log line for correlation to work across services?
- Name two places in a typical request path where a log is not worth the cost.
- If one service logs user_id as a string and another as an int, what practical thing breaks first?
- What problem do OpenTelemetry's semantic conventions solve that a per-team schema cannot?
- Which log level should a "user-visible rejection" be emitted at, and why not error?
Mini Drill or Application (Capstone-scoped)
- Pick one service in your capstone and in 20 minutes: list every decision boundary in that service.
- Give each a dotted event name following area.object.verb, past tense. Commit the list.
- Define the minimum field set for that service and write it into library/raw/logging.md as a one-page schema.
- Replace the top five free-text logs with structured equivalents, passing trace_id and request_id through.
- Confirm in your aggregator that you can filter by any field without a regex. Paste a screenshot into the runbook for the top incident.
Source Backbone
Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.
- Building Secure and Reliable Systems - primary security and reliability backbone.
- Software Engineering at Google - operational process and engineering discipline.
- Designing Distributed Systems - service and reliability pattern support.