Runtime Detection and Response: CSPM and CWPP
What This Concept Is
Even with threat modeling, identity-first design, envelope encryption, a tight network moat, and hardened images, things will still drift. Runtime detection is the set of controls that watches what is actually happening and flags when it stops matching what should be happening.
Two categories dominate the cloud vocabulary:
- CSPM -- Cloud Security Posture Management. Continuously compares the configuration of your cloud accounts to a policy baseline. Examples: "this bucket is public", "this role has *:*", "this database is not encrypted with a CMK", "this security group allows SSH from the internet", "this new account has no MFA enforcement". CSPM is about the control plane.
- CWPP -- Cloud Workload Protection Platform. Watches workloads at runtime -- containers, VMs, serverless. Examples: "this container just spawned a shell", "this pod is making a DNS request to a known-bad domain", "this process wrote to /etc/passwd", "this binary is not in the SBOM", "this file integrity check failed". CWPP is about the data plane.
Cloud providers bundle these (Defender for Cloud on Azure, Security Command Center on GCP, GuardDuty + Security Hub + Inspector on AWS). There are also vendor-independent tools (Falco for CWPP, various OSS scanners for CSPM).
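As a rough illustration of what "compares configuration to a policy baseline" means, here is a minimal Python sketch. The resource dictionaries, field names (public_read, ingress, cmk_encrypted), and rule names are hypothetical and do not correspond to any provider's API; a real CSPM tool reads this state from the cloud control plane.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    resource_id: str
    rule: str
    severity: str


# A rule is (name, severity, predicate). The predicate returns True when
# the resource VIOLATES the baseline. All field names are assumptions.
Rule = tuple[str, str, Callable[[dict], bool]]

BASELINE: list[Rule] = [
    ("bucket-not-public", "CRITICAL",
     lambda r: r["type"] == "bucket" and r.get("public_read", False)),
    ("no-ssh-from-internet", "HIGH",
     lambda r: r["type"] == "security_group" and any(
         ing["port"] == 22 and ing["cidr"] == "0.0.0.0/0"
         for ing in r.get("ingress", []))),
    ("db-encrypted-with-cmk", "HIGH",
     lambda r: r["type"] == "database" and not r.get("cmk_encrypted", False)),
]


def evaluate(resources: list[dict], baseline: list[Rule] = BASELINE) -> list[Finding]:
    """Return one finding per (resource, violated rule) pair."""
    findings = []
    for resource in resources:
        for name, severity, violates in baseline:
            if violates(resource):
                findings.append(Finding(resource["id"], name, severity))
    return findings


# A hypothetical configuration snapshot of one account.
snapshot = [
    {"id": "bucket-a", "type": "bucket", "public_read": True},
    {"id": "bucket-b", "type": "bucket", "public_read": False},
    {"id": "sg-1", "type": "security_group",
     "ingress": [{"port": 22, "cidr": "0.0.0.0/0"}]},
    {"id": "db-1", "type": "database", "cmk_encrypted": True},
]

findings = evaluate(snapshot)
for f in findings:
    print(f.severity, f.rule, f.resource_id)
```

Running the same evaluation on every configuration change, rather than on a schedule, is what turns a check like this into drift detection.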
Why It Matters Here
The controls in the other concept pages are preventive. They work only while they are configured correctly. Posture drifts -- through new services, new teams, bad defaults, rushed fixes. Workloads get compromised -- through zero-days, logic bugs, leaked credentials.
CSPM and CWPP close the gap between "we set this up right" and "we still know what is going on today".
This is also where security meets observability: both disciplines depend on continuous visibility into the system. A CWPP event is a log and a metric and a trace, routed through a different pipeline and with a different audience (the security team or on-call), but the engineering muscle is the same.
Concrete Example
A platform team runs 30 microservices across one cloud account. A CSPM tool is on, a CWPP agent runs on every node, and the security team is paged like any other on-call.
Day 1: CSPM flags that a new S3 bucket was created with public read enabled. The team gets a ticket within minutes; the bucket is made private before anyone writes sensitive data to it.
Day 7: CSPM flags that an IAM role for a new service was given broad write permissions. A PR fixes the Terraform module to use least privilege; a rescan confirms the finding is gone.
Day 14: CWPP (Falco) alerts that a container spawned /bin/sh, which then ran curl against an external IP. The workload is killed automatically, a forensic snapshot is taken, and the team investigates. They find a dependency-confusion attack in the pipeline; SLSA provenance and signature checks (see Concept 8) confirm, before the fix is rolled out, that the image was tampered with.
The Falco rule that fired is a few lines:
- rule: Unexpected Shell In Container
  desc: A shell was spawned inside a container that should not have one
  condition: >
    spawned_process and container
    and shell_procs
    and not container.image.repository in (trusted_images)
  output: >
    Shell spawned in container (user=%user.name container=%container.id
    image=%container.image.repository command=%proc.cmdline parent=%proc.pname)
  priority: WARNING
  tags: [container, shell, mitre_execution]
The tags line maps to MITRE ATT&CK techniques so that dashboards can aggregate by tactic (execution, persistence, exfiltration) rather than rule name.
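To make the aggregation concrete, the following Python sketch counts alerts by tactic instead of by rule name. It assumes alerts arrive as dictionaries with a tags list and that tactic tags use the mitre_ prefix seen in the rule above; the second and third alerts are hypothetical.

```python
from collections import Counter

TACTIC_PREFIX = "mitre_"  # convention taken from the rule's tags line


def tactics_of(tags):
    """Extract tactic names from tags such as 'mitre_execution'."""
    return [t[len(TACTIC_PREFIX):] for t in tags if t.startswith(TACTIC_PREFIX)]


def count_by_tactic(alerts):
    """Count alerts per tactic; an alert with two tactic tags counts twice."""
    counts = Counter()
    for alert in alerts:
        for tactic in tactics_of(alert["tags"]):
            counts[tactic] += 1
    return counts


alerts = [
    {"rule": "Unexpected Shell In Container",
     "tags": ["container", "shell", "mitre_execution"]},
    {"rule": "Hypothetical Rule A", "tags": ["container", "mitre_execution"]},
    {"rule": "Hypothetical Rule B", "tags": ["network", "mitre_exfiltration"]},
]

by_tactic = count_by_tactic(alerts)
print(dict(by_tactic))  # {'execution': 2, 'exfiltration': 1}
```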
Day 30: CSPM reports a drop in multi-factor compliance because a new identity provider is being tested. The team fixes the policy.
Without CSPM, the Day 1 finding is a future data incident. Without CWPP, the Day 14 compromise is a major breach. Both events are found at operational speed, not via a customer support ticket three months later.
Common Confusion / Misconception
"CSPM is a compliance scanner." Not quite. Compliance tools check a quarterly snapshot; CSPM watches continuously and alerts on drift. Compliance is a subset of what CSPM covers, and "passed audit" in Q2 tells you nothing about Q3 Week 3.
"CWPP is EDR for laptops." Shape looks similar (agent watches process activity, flags bad things) but CWPP is scoped to cloud workloads, ties into orchestrator metadata (pod, namespace, image digest, workload identity), and must be cheap at container density. An EDR designed for laptops typically cannot handle 10k pods per node-minute churn.
"CSPM/CWPP alerts are audit noise." They are an observability signal. If they are not routed to an on-call that responds, they are decoration. The same discipline that makes symptom alerts actionable (Concept 14) applies: each rule needs a runbook, a severity, and an owner, or it stops being signal.
"Detection replaces prevention." CSPM and CWPP do not replace threat modeling, image hardening, or IAM hygiene. They are detection. Detection without prevention is "learn about breaches faster". A program with 200 open CRITICAL CSPM findings has a process problem, not a tooling problem.
"Agentless CSPM is as good as agented CWPP." They answer different questions. Agentless CSPM reads the cloud API; it sees config, not process behavior. An attacker who does not change config is invisible to CSPM and visible to CWPP. Both, together.
"MITRE ATT&CK mapping is marketing." It is a labeling discipline that lets you measure coverage: which tactics (initial access, execution, persistence, credential access, lateral movement, exfiltration) have at least one rule, which do not. Gaps are visible at a glance; without a mapping, coverage is a vibe.
How To Use It
For a system you operate, answer:
- Is there a CSPM tool enabled at the account level, and who owns the findings?
- What is the turnaround SLA for a CRITICAL CSPM finding? A HIGH one?
- Is there a CWPP agent on every workload? Are its events routed to the same on-call rotation that handles incidents?
- What percentage of findings from either tool are ignored today? If the answer is "most", the tool is miscalibrated.
- Is there a regular review that removes rules that no longer apply? (Noise deletion is its own skill.)
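Two of the questions above (SLA turnaround and the share of ignored findings) reduce to simple arithmetic over the findings list. The Python sketch below shows one possible shape; the SLA targets, status values, and finding fields are assumptions chosen for illustration, not recommended values.

```python
from datetime import datetime, timedelta, timezone

# Assumed SLA targets; pick your own per severity.
SLA = {"CRITICAL": timedelta(hours=4), "HIGH": timedelta(days=1)}


def sla_breaches(findings, now):
    """Return ids of open findings that are older than their severity's SLA."""
    breached = []
    for f in findings:
        limit = SLA.get(f["severity"])
        if f["status"] == "open" and limit is not None and now - f["opened_at"] > limit:
            breached.append(f["id"])
    return breached


def ignored_ratio(findings):
    """Fraction of findings marked ignored; 0.0 for an empty list."""
    if not findings:
        return 0.0
    ignored = sum(1 for f in findings if f["status"] == "ignored")
    return ignored / len(findings)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
findings = [
    {"id": "f1", "severity": "CRITICAL", "status": "open",
     "opened_at": now - timedelta(hours=6)},
    {"id": "f2", "severity": "HIGH", "status": "open",
     "opened_at": now - timedelta(hours=2)},
    {"id": "f3", "severity": "HIGH", "status": "ignored",
     "opened_at": now - timedelta(days=9)},
    {"id": "f4", "severity": "CRITICAL", "status": "ignored",
     "opened_at": now - timedelta(days=3)},
]

print(sla_breaches(findings, now))  # ['f1']
print(ignored_ratio(findings))      # 0.5
```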
Check Yourself
- What is the difference between CSPM and CWPP, and what question does each answer?
- Why is CSPM continuous rather than periodic?
- Why is a CWPP event most useful when it carries orchestrator metadata (pod, namespace, image digest)?
Mini Drill or Application
Pick three CSPM rules you would enable on day one of a new cloud account (e.g. "no public buckets", "no inbound 0.0.0.0/0 SSH", "MFA required for all humans"). For each, write the response SLA and who owns it.
See also (external)
- Google Cloud: Well-Architected Security Pillar (Detect) -- detect and respond patterns mapping to Security Command Center.
- AWS Well-Architected: Security Pillar (Detection) -- detection design principles and their mapping to GuardDuty, Security Hub, Inspector, Macie.
- Falco Documentation -- the canonical open-source CWPP: rule language, syscall sources, and OTel output for routing into observability pipelines.
- MITRE ATT&CK for Containers -- the canonical taxonomy of attacker techniques against containerized workloads; every CWPP rule should tag to one.
- Building Secure and Reliable Systems, Ch. 15: Investigating Systems -- how detection plugs into forensics and incident response.
- Cloud Security Alliance: Cloud Controls Matrix -- a mapping of security controls to cloud architectures, useful for CSPM rule libraries.
Depth Path
Source Backbone
Security and observability require official docs, but these books provide the systems and reliability backbone behind the practices.
- Building Secure and Reliable Systems - primary book backbone for security/reliability tradeoffs.
- Software Engineering at Google - support for operational engineering and process.
- The Linux Command Line - support for operational investigation and automation.