Data Classification and Minimization

What This Concept Is

Data classification is the practice of assigning every piece of data a sensitivity label so that engineering decisions (where it is stored, who can read it, whether it is encrypted, whether it is logged, how long it is kept) are driven by the data, not by convenience.

A typical ladder has four rungs:

Public -- press releases, marketing, open-source code
Internal -- most business data; leakage is embarrassing but not catastrophic
Confidential -- customer content, financial records, non-public plans
Restricted / Regulated -- credentials, credit cards, health records, government IDs, anything with a legal definition attached (PCI, HIPAA, GDPR special categories, etc.)
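
One way to make the ladder usable in code is an ordered enum, so that "more sensitive than" becomes a plain comparison. The sketch below is illustrative only: the names `Classification`, `most_sensitive`, and `may_be_written_to` are hypothetical, and the rule that a container inherits its most sensitive label is a common convention assumed here, not something this page prescribes.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Four-rung ladder; a higher value means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def most_sensitive(labels):
    """Assumed convention: a record or bucket takes the highest label it contains."""
    return max(labels, default=Classification.PUBLIC)


def may_be_written_to(data_label, sink_ceiling):
    """A sink (log stream, warehouse, cache) accepts data at or below its ceiling."""
    return data_label <= sink_ceiling
```

With this shape, a check such as `may_be_written_to(Classification.CONFIDENTIAL, Classification.INTERNAL)` evaluates to `False`, which is the kind of answer a pipeline needs before copying data into a less protected store.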

Data minimization is the complementary habit: do not collect, store, or process data you do not need, and delete what you collected once its purpose is served. In regulatory frameworks this is a legal requirement (GDPR Article 5(1)(c), for example); in engineering terms it is a large reduction in blast radius, because data you never held cannot leak.

Together, classification and minimization answer two questions every system eventually has to answer: what is this data worth, and why do we still have it?

Why It Matters Here

Cloud storage is cheap enough to feel free. That is a trap. Every extra byte of sensitive data is a potential incident; every extra month of retention widens the window for compromise; every extra log line with PII is another place the data has to be defended.

Classification and minimization are what let you route sensitive data through the expensive controls (KMS, strict IAM, audit, narrower network paths) and let less sensitive data go through cheaper paths. Without classification, you either over-control everything (slow, expensive, friction) or under-control everything (incident waiting).

The AWS, Google Cloud, and Azure Well-Architected security guidance all call out data classification as a foundational practice, because downstream decisions -- encryption, access, logging, retention -- cascade from it.

Concrete Example

An e-commerce checkout collects: email, shipping address, credit card number, order history.

Naive approach:

everything in one table, one service, one log stream, same retention
the application logs full request bodies at DEBUG level
backups run nightly and are kept "forever"
analytics queries hit the same table directly

Classified + minimized approach:

Public: product catalog (safe to cache widely)
Internal: order history with hashed user ID (safe in analytics warehouse)
Confidential: email, shipping address (encrypted at rest, restricted access, short log retention)
Restricted: credit card number -- never stored. Tokenized by a payment processor, and only the token persists. Logs at any level are filtered so the card cannot appear even in a stack trace.
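
The log-filtering rule for card numbers could be sketched as a formatter that scrubs the fully rendered log line, exception text included. This is a minimal illustration, not a complete PCI control: `RedactingFormatter`, `redact_card_numbers`, and the `[REDACTED-PAN]` marker are hypothetical names, and the pattern (13 to 19 digits that pass a Luhn check) is an assumption that will miss unusual formats.

```python
import logging
import re

# Candidate card numbers: 13-19 digits, optionally separated by spaces or hyphens.
_PAN_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def _luhn_ok(digits: str) -> bool:
    """Luhn checksum; filters out most order IDs and other long numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_card_numbers(text: str) -> str:
    def _mask(match):
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED-PAN]" if _luhn_ok(digits) else match.group()

    return _PAN_CANDIDATE.sub(_mask, text)


class RedactingFormatter(logging.Formatter):
    """Redacts after formatting, so message args and tracebacks are covered too."""

    def format(self, record: logging.LogRecord) -> str:
        return redact_card_numbers(super().format(record))
```

The formatter only protects handlers it is installed on, so in this sketch every handler would need it. Redaction is a backstop; keeping the card number out of the service entirely, as tokenization does, remains the primary control.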

Minimization decisions:

we do not need the full raw IP, just a country for fraud heuristics -- store the country code
we do not need the user-agent for orders older than 30 days -- drop it
card tokens are scoped to this merchant; once the card expires, the token is revoked
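
The first two decisions above can be sketched as an allow-list applied at ingest plus a periodic cleanup. Everything named here is hypothetical: the field names, the `minimize_order_event` and `expire_user_agent` helpers, and `country_for_ip`, which stands in for whatever IP-to-country lookup the system already has.

```python
from datetime import datetime, timedelta, timezone

USER_AGENT_RETENTION = timedelta(days=30)


def minimize_order_event(raw, country_for_ip):
    """Build the stored record from an explicit allow-list.

    Anything not named here (raw IP, card number, unexpected fields)
    is never persisted.
    """
    return {
        "order_id": raw["order_id"],
        "card_token": raw["card_token"],
        "country": country_for_ip(raw["ip"]),  # derived value kept, raw IP dropped
        "user_agent": raw.get("user_agent"),
        "created_at": raw["created_at"],
    }


def expire_user_agent(order, now):
    """Return a copy with the user-agent cleared once the order is older than 30 days."""
    if now - order["created_at"] > USER_AGENT_RETENTION:
        return {**order, "user_agent": None}
    return order
```

An allow-list is used rather than a deny-list so that a new upstream field is dropped by default until someone classifies it and decides it is needed.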

The result: one class of regulated data is simply gone from the system; one class is encrypted with careful access; the rest is safe enough to use in dashboards and ML without anxiety.

Common Confusion / Misconception

"Encryption replaces classification." It does not. Encryption is a control; classification is what tells you which controls to apply where. You can encrypt public data (waste of effort) or fail to encrypt restricted data (incident) if you skip the labeling step.

"Classification is a security-team activity." No, it is an engineering activity. The teams that own the data know best what is in it. A central policy sets the ladder; the labels are applied per field by the team that writes the field. A central team cannot hand-classify the schema of 200 services.

"Minimization means never log anything." It does not. Observability still needs log data, metrics, and traces. The rule is: log what you need to operate the system, do not log payloads, and if you must log an identifier, prefer a stable pseudonymous one (hashed user ID, not raw email). Concept 11 is where this lands in practice.

"Pseudonymization = anonymization." GDPR and most privacy regulations treat these very differently. Pseudonymization (a hashed or tokenized identifier that can be re-linked with a separate key) still counts as personal data. Anonymization (no practical way to re-identify) moves data out of the personal-data category -- and is genuinely hard, because joining with auxiliary data often re-identifies. Assume pseudonymization unless you can prove otherwise.

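A pseudonymous identifier of the kind described above can be sketched with a keyed hash from the Python standard library. The function name `pseudonymize` and the key handling are assumptions for illustration; in practice the key would live in a secrets manager, separate from the data.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, key: bytes) -> str:
    """Stable pseudonym for logs and analytics: HMAC-SHA256 of the identifier.

    Anyone holding the key can recompute the mapping for a known user,
    which is why the result is pseudonymous rather than anonymous.
    """
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

A keyed hash is chosen over a plain SHA-256 of the email because an unkeyed hash of a guessable value can be reversed by hashing a list of candidate emails. The same input and key always yield the same pseudonym, so joins across log lines still work.
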
"Retention is a storage decision." It is a classification decision. Retention windows come from the class: Restricted usually has a regulatory minimum and a business-driven maximum; Internal is bounded by usefulness; Public can live forever. Encode retention in lifecycle policies on object storage, TTLs on databases, and log-store configs -- not in someone's memory.

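As one illustration of encoding retention in configuration rather than memory, the sketch below turns a class-to-days table into expiration rules keyed on an object tag. The dictionary follows the general shape of an S3 bucket lifecycle configuration, but the tag key `data-classification`, the function name, and the day counts are placeholders assumed for the example; real windows come from policy and regulation.

```python
# Hypothetical retention windows in days. Public is omitted: no expiry rule.
RETENTION_DAYS = {"internal": 365, "confidential": 90, "restricted": 30}


def lifecycle_rules(retention_days, tag_key="data-classification"):
    """Build one expiration rule per classification label."""
    return {
        "Rules": [
            {
                "ID": f"expire-{label}",
                "Filter": {"Tag": {"Key": tag_key, "Value": label}},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
            for label, days in sorted(retention_days.items())
        ]
    }
```

Generating the rules from a single table keeps the retention policy reviewable in one place; the exact field names should be checked against the storage provider's documentation before applying it.
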
How To Use It

For any new feature or new service:

List every field the system will touch (in, store, out, log). Include derived fields and join keys.
Label each field on the 4-rung ladder. Use a shared taxonomy across services so the label means the same thing everywhere.
For each Confidential or Restricted field, write a one-line rule: where it lives, who can read it, how long it is kept, and whether it ever appears in logs/metrics/traces.
For each field, ask: "do we need this to do the job?" If not, drop it. If maybe, keep it for a defined retention window (30/90/365 days) and review later.
Write the decisions down next to the code -- a data-classification.md in the service repo, or inline tags in the schema. A classification that lives only in someone's head is a classification that will drift.
Wire the labels into enforcement. Examples: field- or column-level encryption for Restricted; log-redaction denylists for Confidential; automatic object-storage lifecycle rules for retention; DLP scanners on egress paths; IAM conditions that reference a data-classification tag.
Review quarterly. New features add new fields; features retire and leave data stranded. A quarterly classification review is how "we still have 2019 session tokens" stops happening.
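
The "inline tags in the schema" option from the steps above might look like the following sketch, which attaches a label and a one-line rule to each field and reports gaps that a CI job could fail on. The `classified` helper, the `Customer` schema, and the message wording are all hypothetical.

```python
from dataclasses import dataclass, field, fields

LADDER = ("public", "internal", "confidential", "restricted")
NEEDS_RULE = {"confidential", "restricted"}


def classified(label, rule=""):
    """Attach a classification label and an optional handling rule to a field."""
    return field(metadata={"classification": label, "rule": rule})


@dataclass
class Customer:
    customer_id: str = classified("internal")
    email: str = classified(
        "confidential",
        rule="encrypted at rest; support role only; 90 days; never logged",
    )
    shipping_address: str = classified("confidential")  # rule forgotten
    referrer: str  # label forgotten


def classification_problems(schema):
    """List fields with no valid label, or sensitive fields with no rule."""
    problems = []
    for f in fields(schema):
        label = f.metadata.get("classification")
        if label not in LADDER:
            problems.append(f"{f.name}: missing or unknown label")
        elif label in NEEDS_RULE and not f.metadata.get("rule"):
            problems.append(f"{f.name}: {label} field has no handling rule")
    return problems
```

Here `classification_problems(Customer)` reports `shipping_address` and `referrer`, so an unlabeled field shows up at review time instead of during an incident.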

Check Yourself

Give one example of a field you have seen treated as "obviously internal" that should probably have been Confidential or Restricted.
Why does minimization reduce blast radius specifically (as opposed to just saving money)?
What is the difference between pseudonymization and anonymization, and which one is usually achievable?

Mini Drill or Application

Pick a small feature you have shipped. Write the 4-row classification table for its fields. Then list three fields you could have dropped without changing the feature. Those three fields are future incidents avoided.

Depth Path

Source Backbone

Security and observability specifics should come from official documentation; these books provide the systems and reliability background behind the practices.

Building Secure and Reliable Systems - primary book backbone for security/reliability tradeoffs.
Software Engineering at Google - support for operational engineering and process.
The Linux Command Line - support for operational investigation and automation.
