
Structured Logging and Log Routing

What This Concept Is

A structured log is a log line that is machine-parseable: it has stable keys, typed values, and a known schema. In practice this almost always means JSON (or Protobuf) with a consistent set of fields:

{"ts":"2026-04-22T18:30:45.123Z","level":"ERROR","service":"checkout","env":"prod","region":"us-east-1","trace_id":"8c1b...","span_id":"d2a9...","event":"payment_declined","order_id":"o_1234","reason_code":"insufficient_funds"}

Compare to an unstructured line:

2026-04-22 18:30:45 ERROR checkout: payment declined for order 1234 (insufficient funds)

The structured version is queryable by any field, joinable with traces via trace_id, routable by env or service, and safe to throw at a log processor without writing a regex.
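
To make the contrast concrete, here is a minimal Python sketch that asks the same question of both lines above: which order was declined, and why. The regex and the helper names are illustrative assumptions, not part of any library.

```python
from __future__ import annotations

import json
import re

STRUCTURED = (
    '{"ts":"2026-04-22T18:30:45.123Z","level":"ERROR","service":"checkout",'
    '"env":"prod","event":"payment_declined","order_id":"o_1234",'
    '"reason_code":"insufficient_funds"}'
)
UNSTRUCTURED = (
    "2026-04-22 18:30:45 ERROR checkout: "
    "payment declined for order 1234 (insufficient funds)"
)


def from_structured(line: str) -> tuple[str, str]:
    """Field access by key; unaffected by field order or newly added fields."""
    record = json.loads(line)
    return record["order_id"], record["reason_code"]


# Hypothetical pattern, tied to the exact wording of the message.
DECLINE_RE = re.compile(r"payment declined for order (\d+) \(([^)]+)\)")


def from_unstructured(line: str) -> tuple[str, str] | None:
    """Returns None as soon as someone rewords the message."""
    match = DECLINE_RE.search(line)
    return (match.group(1), match.group(2)) if match else None
```

The structured reader keeps working when fields are added or reordered; the regex reader silently returns nothing once the message text changes.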

Log routing is the pipeline that takes log events from where they are produced to where they are stored, analyzed, alerted on, and eventually expired. A typical cloud-native route:

app stdout
-> sidecar or node agent (Fluent Bit, Vector, OTel Collector)
-> aggregation + filtering + redaction
-> one or more sinks:
- hot store for search (Loki, Elasticsearch, CloudWatch Logs)
- cold store for retention (S3, GCS with lifecycle)
- SIEM for security analytics
- metrics extraction for specific high-signal events

The OWASP Logging Cheat Sheet's guidance maps cleanly onto this model: differentiate operational vs security logging, decide where to record, decide what to record, and enforce that sensitive data is not in log bodies.

Why It Matters Here

Logs are the observability pillar with the highest volume and, when left unstructured, the lowest signal-to-noise ratio. Unstructured logs are behind many "we have the data but cannot query it in time" incidents.

Routing matters because not every log deserves the same treatment. Audit logs belong in a tamper-resistant store with long retention. Debug logs belong in a hot store with short retention. Security-relevant events belong in the SIEM. Logs that contain PII belong nowhere -- they should be redacted before they leave the process.

Structured logging is also what makes the three pillars compose. A structured log line can carry the trace ID so you can pivot to the trace (Concept 12); it can carry the service and env labels that match your metrics (Concept 10); it can carry the request ID that matches your dashboards (Concept 13).

Concrete Example

A checkout service writes logs to stdout as structured JSON using a shared library that injects standard fields.

Common keys across all logs:

  • ts -- RFC3339 timestamp, UTC
  • level -- INFO / WARN / ERROR
  • service -- checkout
  • env -- prod / staging / dev
  • region -- us-east-1
  • trace_id, span_id -- from the active OpenTelemetry context
  • user_hash -- stable pseudonymous hash of the user, never the raw email
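
One way to produce a user_hash like the one listed above is a keyed hash. The sketch below assumes HMAC-SHA-256 with a secret key held outside the logs; the source does not say which scheme the shared library uses, and the function signature and the 16-character truncation are illustrative choices.

```python
import hashlib
import hmac


def user_hash(email: str, secret_key: bytes) -> str:
    """Keyed hash: stable for a given user, not reversible without the key."""
    # Normalize so "Alice@Example.com " and "alice@example.com" match.
    normalized = email.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    # Truncated for readability in log lines (an assumption, not a rule).
    return digest[:16]
```

A keyed hash is used here rather than a bare SHA-256 because an unkeyed hash of an email address can be reversed by hashing a list of known addresses.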

Per-event keys describe what happened:

  • event -- an enum like payment_declined, order_created, retry_exhausted
  • order_id, amount_cents, currency, reason_code -- event-specific
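
A shared library that injects the common keys could look roughly like the following. This is a sketch built on Python's standard logging module, not the service's actual library: JsonFormatter, build_logger, and the convention of passing event fields under extra={"fields": ...} are assumptions made for illustration, and trace_id/span_id are left out to keep it short.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object with the common keys."""

    def __init__(self, service, env, region):
        super().__init__()
        self.static = {"service": service, "env": env, "region": region}

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        doc = {
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # The stdlib calls this level WARNING; the schema above uses WARN.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            **self.static,
            "event": record.getMessage(),
        }
        # Event-specific keys arrive via extra={"fields": {...}}.
        doc.update(getattr(record, "fields", {}))
        return json.dumps(doc, separators=(",", ":"))


def build_logger(name, service, env, region, stream=None):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter(service, env, region))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage: log = build_logger("checkout", "checkout", "prod", "us-east-1"), then log.error("payment_declined", extra={"fields": {"order_id": "o_1234", "reason_code": "insufficient_funds"}}).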

Routing pipeline:

  • stdout -> node-level OpenTelemetry Collector
  • Collector parses JSON, adds cluster/namespace/pod resource attributes
  • redaction processor strips any field in a denylist (email, card_number, authorization)
  • events where event starts with security_ are also forked to a SIEM sink
  • everything goes to a search-optimized hot store (7-day retention)
  • the full stream (after redaction) is mirrored to object storage with a 400-day lifecycle for audit
  • high-signal events (payment_declined, order_created) are converted to counters at the Collector and exposed as metrics
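
The routing decisions in this list can be modelled in a few lines of Python. This is not Collector configuration; it is a sketch of the logic only, with in-memory lists standing in for the sinks and a Counter standing in for the extracted metrics. The function name and the sink names are assumptions.

```python
from collections import Counter

DENYLIST = {"email", "card_number", "authorization"}
HIGH_SIGNAL = {"payment_declined", "order_created"}


def route(event: dict, sinks: dict, counters: Counter) -> None:
    """Fan one parsed log event out to the sinks it belongs in."""
    # Redact first so that no sink ever sees a denylisted field.
    clean = {k: v for k, v in event.items() if k not in DENYLIST}
    name = clean.get("event", "")

    # Security events are forked to the SIEM in addition to the usual sinks.
    if name.startswith("security_"):
        sinks["siem"].append(clean)

    sinks["hot"].append(clean)   # short-retention search store
    sinks["cold"].append(clean)  # long-retention object storage

    # High-signal events also become counters.
    if name in HIGH_SIGNAL:
        counters[name] += 1
```

The ordering is the design point: redaction runs before any fan-out, so adding a new sink later cannot bypass it.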

An alert fires when the rate of event=payment_declined jumps; the on-call engineer clicks through from the dashboard and reads the actual log lines in the hot store; each line carries a trace_id that pivots to the distributed trace.

Concrete queries over the hot store show why structure matters. In Loki/LogQL:

# Rate of payment declines per reason code, last 15m
sum by (reason_code) (
  rate({service="checkout", env="prod"} | json | event="payment_declined" [15m])
)

# All ERRORs for one trace
{env="prod"} | json | trace_id="8c1b0a5e..." | level="ERROR"

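# p99 latency from a latency_ms field in logs, last 5m
# (grouped by service so the extracted JSON labels do not split the result
# into one series per label combination)
quantile_over_time(0.99,
  {service="checkout"} | json | unwrap latency_ms [5m]
) by (service)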

An unstructured version of any of these requires regex-parsing the whole stream at query time, which is slow, fragile, and breaks the first time someone changes the message format.

Common Confusion / Misconception

"Logs = observability." Logs are one pillar. Alone, they are expensive to search and hard to alert on. Metrics tell you "something is wrong"; traces tell you "where"; logs tell you "what happened at that moment". All three compose.

"printf-style logging is structured." Not just because the message contains fields. Until the fields are keys in a parseable document, you are asking queries to parse regex. "User 1234 logged in from 1.2.3.4" is unstructured; {"event":"user_login","user_hash":"...","ip":"1.2.3.4"} is structured.

"Log levels filter data volume." They do not. Levels exist so that humans can filter at read time. If you log an INFO for every request, you will fill disks; if you log only ERROR, you will lose the context that makes ERRORs actionable. Volume control is done via sampling (per-event) and routing (per-destination), not levels.

"Redaction can happen downstream." Redaction must happen before a log leaves the process, or at worst at the first hop. Relying on downstream redaction is how PII ends up in backups you cannot retract. The OWASP Logging Cheat Sheet is explicit on this: sensitive material should never be logged, full stop.

"Audit logs and application logs can share a pipeline." They can share the transport, but audit logs need tamper-resistance (append-only, WORM storage, signed chains), long retention (years, not weeks), and restricted access. Route them to a different sink with different policies. Mixing them makes audit trails look like debug output and erodes their evidentiary value.

"Trace ID is a nice-to-have." It is the join key. Without trace_id in every log line, pivoting from a log to a trace is guesswork. Enforce it in the shared logging library: if the OpenTelemetry context is active, trace_id and span_id are auto-injected.

"Structured means JSON." JSON is the common choice; Protobuf/Cap'n Proto are faster and type-safe; logfmt (key=value) is adequate for many cases. The commitment is to typed, keyed fields with a shared schema, not a specific serialization.

How To Use It

For each service, define:

  1. Schema: the fixed set of keys every log line must have (timestamp, level, service, env, region, trace context).
  2. Event vocabulary: the enum of event values, with a one-line description each. New events are added by PR.
  3. Denylist: the set of keys that must never appear in logs (emails, card numbers, raw tokens), enforced at the logger.
  4. Routing: which sinks receive which events, with retention per sink.
  5. Correlation: trace_id and span_id injected automatically from the OpenTelemetry context.
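
Items 1 to 3 lend themselves to an automated check, for example in a unit test or a CI step. The sketch below is one possible shape; the constant names, the validate function, and the wording of the problem strings are assumptions, and the vocabulary is just the three example events used earlier.

```python
from __future__ import annotations

REQUIRED_KEYS = frozenset(
    {"ts", "level", "service", "env", "region", "trace_id", "span_id"}
)
EVENT_VOCABULARY = frozenset(
    {"payment_declined", "order_created", "retry_exhausted"}
)
DENYLIST = frozenset({"email", "card_number", "authorization"})


def validate(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record passes."""
    problems = []
    keys = set(record)

    # 1. Schema: every required key must be present.
    for key in sorted(REQUIRED_KEYS - keys):
        problems.append(f"missing required key: {key}")

    # 2. Event vocabulary: the event must be a known enum value.
    if record.get("event") not in EVENT_VOCABULARY:
        problems.append(f"unknown event: {record.get('event')!r}")

    # 3. Denylist: forbidden keys must never appear.
    for key in sorted(DENYLIST & keys):
        problems.append(f"denylisted key present: {key}")

    return problems
```

Running a check like this against sample output from each service catches schema drift at review time rather than at query time.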

Check Yourself

  1. Give one concrete query you can run against structured logs that is very hard against unstructured logs.
  2. Why is trace_id a required field in a structured log schema?
  3. What is the operational difference between a hot log store and a cold one, and why do you need both?

Mini Drill or Application

Take five log lines from a service you have worked on. Rewrite them as structured JSON with a consistent schema. Identify any field that should be removed or pseudonymized. Mark each log line with the right level and event name.

Source Backbone

Security and observability require official docs, but these books provide the systems and reliability backbone behind the practices.