Security Groups, NACLs, and VPC Endpoints: The Network Moat

What This Concept Is

Even in an identity-first world, you still want the network layer to narrow what is reachable. In cloud networks, three controls do most of the heavy lifting:

Security groups (SGs) -- stateful firewalls attached to instances or interfaces. They evaluate allow rules, and return traffic is automatically allowed. They are the primary per-workload control.
Network ACLs (NACLs) -- stateless firewalls at the subnet level. They evaluate numbered allow and deny rules in order and stop at the first match, and return traffic needs its own rule. They are coarser and are usually used as a last-ditch subnet-wide guardrail.
VPC endpoints -- private routes between your VPC and a cloud provider's managed service (object storage, KMS, queue service, etc.), so that calls to that service never traverse the public internet.

Vocabulary varies by provider: AWS uses the names above; GCP has firewall rules + VPC Service Controls + Private Google Access; Azure has NSGs + Service Endpoints / Private Endpoints. The shape of the moat is the same.

Why It Matters Here

The network layer is no longer a perimeter, but it is still a filter. A well-designed network moat:

reduces the attack surface reachable from the internet
contains lateral movement if a workload is compromised
keeps sensitive traffic (KMS, databases, internal APIs) off the public internet
makes data exfiltration harder by cutting off unnecessary egress

Identity-based controls defend "who can call this"; network controls defend "who can reach this to try". Both matter.

Concrete Example

A web app with three tiers: a load balancer, an application service, and a database. The company also uses a managed object store for user uploads and a managed KMS for envelope encryption.

A badly scoped network design:

SG on app servers allows all inbound from 0.0.0.0/0 on port 443
SG on the DB allows inbound from the entire VPC CIDR
NACLs are default-open
Calls to object storage and KMS go out the internet gateway

A well-scoped network moat:

LB security group: inbound 443 from 0.0.0.0/0, outbound only to the app-tier SG on the app port
App-tier SG: inbound from the LB SG on the app port; outbound to the DB SG on 5432, to the object store endpoint, and to the KMS endpoint
DB SG: inbound from the app-tier SG on 5432 only; outbound constrained to managed service endpoints for backups
NACL on the DB subnet: allow inbound 5432 only from the app subnet CIDR (with a matching outbound rule for return traffic) and deny everything else, as a belt-and-suspenders subnet-level guardrail
VPC endpoints: private endpoints for the object store, KMS, and any queue services so that traffic to them never traverses the public internet. The app-tier SG does not need internet egress at all

Now a compromised app instance can still reach the database on 5432 (the application needs that), but it can reach KMS only through the endpoint and cannot exfiltrate to an arbitrary internet host. A foothold on some other subnet cannot reach the DB at all.
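
As a rough illustration of the "KMS only through the endpoint" part, the sketch below creates an interface endpoint for KMS whose own security group accepts HTTPS only from the app tier. It assumes AWS and the Terraform AWS provider; the variable names (vpc_id, region, app_subnet_ids, app_sg_id) and the "kms-endpoint" group name are hypothetical placeholders, not part of the original design.

```hcl
# Assumptions: these inputs are hypothetical and would come from your own modules.
variable "vpc_id" { type = string }
variable "region" { type = string }
variable "app_subnet_ids" { type = list(string) }
variable "app_sg_id" { type = string }

# Security group attached to the endpoint's network interfaces.
resource "aws_security_group" "kms_endpoint" {
  name   = "kms-endpoint"
  vpc_id = var.vpc_id
}

# Only the app tier may open HTTPS connections to the endpoint.
resource "aws_security_group_rule" "kms_endpoint_from_app" {
  type                     = "ingress"
  security_group_id        = aws_security_group.kms_endpoint.id
  source_security_group_id = var.app_sg_id
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  description              = "HTTPS from app tier"
}

# Interface endpoint: KMS API calls resolve to private IPs inside the VPC.
resource "aws_vpc_endpoint" "kms" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.kms"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.app_subnet_ids
  security_group_ids  = [aws_security_group.kms_endpoint.id]
  private_dns_enabled = true
}
```

Private DNS on the endpoint requires DNS support and DNS hostnames to be enabled on the VPC; with it on, the application keeps using the normal KMS hostname and needs no code change.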

Common Confusion / Misconception

"Why both SGs and NACLs?" SGs are stateful and attached to workloads (the right tool for per-service rules: "app tier may reach DB on 5432"). NACLs are stateless and subnet-scoped (the right tool for coarse, broad-sweep guardrails: "nothing on this subnet ever talks to the internet"). Use SGs as the primary rule source; use NACLs sparingly for sweeping rules. Stateless means return traffic needs its own explicit rule, typically one covering the client's ephemeral port range -- that is the footgun.

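To make the stateless footgun concrete, here is a minimal sketch of a NACL for the DB subnet, assuming AWS and the Terraform AWS provider. The variables (vpc_id, db_subnet_id, app_subnet_cidr) are hypothetical. Note the second rule: without it, replies from the database would be dropped even though the inbound request was allowed.

```hcl
# Assumptions: hypothetical inputs; app_subnet_cidr would be something like "10.0.1.0/24".
variable "vpc_id" { type = string }
variable "db_subnet_id" { type = string }
variable "app_subnet_cidr" { type = string }

# A custom NACL denies everything that no numbered rule allows.
resource "aws_network_acl" "db" {
  vpc_id     = var.vpc_id
  subnet_ids = [var.db_subnet_id]
}

# Inbound: Postgres from the app subnet only.
resource "aws_network_acl_rule" "db_in_postgres" {
  network_acl_id = aws_network_acl.db.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = var.app_subnet_cidr
  from_port      = 5432
  to_port        = 5432
}

# Outbound: replies to the app subnet go to the client's ephemeral port,
# so the stateless NACL needs an explicit rule for that range.
resource "aws_network_acl_rule" "db_out_return" {
  network_acl_id = aws_network_acl.db.id
  rule_number    = 100
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = var.app_subnet_cidr
  from_port      = 1024
  to_port        = 65535
}
```

This sketch covers only the app-to-DB path. Any other traffic the DB subnet needs (for example, backups to managed service endpoints) would need its own pair of rules.
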
"VPC endpoints are a performance feature." Performance is a side-effect. Their real security value is cutting off a major exfiltration path: traffic to the managed service stays on the provider's private network instead of crossing the public internet, and the endpoint policy can be scoped per-bucket or per-key. For example, an S3 VPC endpoint policy that requires aws:ResourceOrgID to equal <your-org> prevents compromised credentials from writing to an attacker-owned bucket through that endpoint; also requiring aws:PrincipalOrgID to equal <your-org> blocks credentials from outside your organization from using the endpoint at all.

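A hedged sketch of such an endpoint policy on an S3 gateway endpoint follows, assuming AWS and the Terraform AWS provider. The variables (vpc_id, region, private_route_table_ids, org_id) are hypothetical, and the policy is deliberately broad on actions; a real one would likely narrow Action and Resource further.

```hcl
# Assumptions: hypothetical inputs; org_id is your AWS Organizations ID (o-...).
variable "vpc_id" { type = string }
variable "region" { type = string }
variable "private_route_table_ids" { type = list(string) }
variable "org_id" { type = string }

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids

  # Allow S3 calls through this endpoint only when both the caller and the
  # target bucket belong to the organization.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "OrgPrincipalsToOrgBucketsOnly"
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:*"
      Resource  = "*"
      Condition = {
        StringEquals = {
          "aws:PrincipalOrgID" = var.org_id
          "aws:ResourceOrgID"  = var.org_id
        }
      }
    }]
  })
}
```

One caveat with this pattern: buckets owned by the provider or by third parties (package mirrors, for instance) sit outside your organization, so workloads that depend on them would need an explicit exception statement.
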
"The default VPC is production-ready." The default VPC in most cloud accounts ships with convenience defaults that are not production-safe. In AWS, for example, its subnets are public (routed to an internet gateway, with public IPs auto-assigned), the default NACL allows all traffic, and the default SG allows all egress. New accounts should get a hardened baseline before any real workload lands -- delete the default VPC, use account factory / landing zone patterns.

"Security groups alone are enough." SGs defend north/south and east/west reachability at L4. They do not inspect the application layer, enforce mTLS, or rate-limit. Pair them with an L7 control (service mesh, WAF, API gateway) when request-level enforcement matters (Concept 3).

"Egress rules are unnecessary because the workload is trusted." The workload is only trusted until it is compromised. Default-deny egress with a narrow allowlist (DNS, the managed services you actually use, a specific API partner) is what turns a credential leak from "data exfil" into "failed connection". The 2019 Capital One breach, in which credentials stolen from a compromised workload were used to copy data out of object storage, is often cited as an example of how far stolen workload credentials can reach when nothing narrows the network path.

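One way to express "only the managed services you actually use" for S3 behind a gateway endpoint is to allow egress to the endpoint's managed prefix list rather than to a CIDR. The sketch assumes AWS and the Terraform AWS provider; app_sg_id and s3_prefix_list_id are hypothetical inputs (the latter is exported by an S3 gateway endpoint resource as prefix_list_id).

```hcl
# Assumptions: hypothetical inputs supplied by the modules that own the
# app-tier security group and the S3 gateway endpoint.
variable "app_sg_id" { type = string }
variable "s3_prefix_list_id" { type = string }

# HTTPS egress to S3 address ranges only; there is no 0.0.0.0/0 rule on the
# group, so a connection to any other internet host simply fails.
resource "aws_security_group_rule" "app_to_s3" {
  type              = "egress"
  security_group_id = var.app_sg_id
  prefix_list_ids   = [var.s3_prefix_list_id]
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  description       = "S3 via gateway endpoint"
}
```

The prefix list limits where packets can go, not which bucket they address. Pair it with an endpoint policy if the bucket also needs to be constrained.
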
"Cilium / eBPF replaces security groups." They add L3-L7 NetworkPolicy enforcement inside the cluster and can replace kube-proxy. They do not replace cloud-level SGs at the VPC edge; both layers still matter.

How To Use It

For each workload, fill out a small table:

Direction	Source	Destination	Port	Rationale
Inbound	LB SG	App SG	8080	HTTP from the LB
Outbound	App SG	DB SG	5432	DB access
Outbound	App SG	S3 endpoint	443	Uploads
Outbound	App SG	KMS endpoint	443	Envelope unwrap
Outbound	App SG	DNS resolver	53	Service discovery
...	...	...	...	...

If a row has rationale "because it worked", remove it and see what breaks. Default egress should be deny; every allowed destination is a conscious choice.
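
The table can also drive the rules directly, so that a row without a rationale has nowhere to hide. The following is a sketch assuming AWS and the Terraform AWS provider; the variables and the local name app_egress are hypothetical, and only the rows whose destination is another security group are shown.

```hcl
# Assumptions: hypothetical inputs for the SGs named in the table.
variable "app_sg_id" { type = string }
variable "db_sg_id" { type = string }
variable "kms_endpoint_sg_id" { type = string }

locals {
  # One entry per outbound table row: destination SG, port, rationale.
  app_egress = {
    db  = { dest_sg = var.db_sg_id, port = 5432, rationale = "DB access" }
    kms = { dest_sg = var.kms_endpoint_sg_id, port = 443, rationale = "Envelope unwrap" }
  }
}

# One egress rule per row; the rationale becomes the rule description.
resource "aws_security_group_rule" "app_egress" {
  for_each = local.app_egress

  type                     = "egress"
  security_group_id        = var.app_sg_id
  source_security_group_id = each.value.dest_sg # for egress, the destination SG
  from_port                = each.value.port
  to_port                  = each.value.port
  protocol                 = "tcp"
  description              = each.value.rationale
}
```

Deleting a row from the map removes the rule on the next apply, which is a cheap way to run the "remove it and see what breaks" experiment.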

A sample Terraform snippet for an app-tier SG with default-deny egress:

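resource "aws_security_group" "app" {
  name   = "app-tier"
  vpc_id = var.vpc_id

  # No inline egress blocks on purpose. When Terraform creates a VPC security
  # group it removes AWS's default allow-all egress rule, so egress starts as
  # default-deny. Rules are added with aws_security_group_rule below; mixing
  # inline rules and standalone rule resources on one group causes conflicts.
}

# Egress from the app tier to the DB tier on Postgres only.
resource "aws_security_group_rule" "app_to_db" {
  type                     = "egress"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.db.id # for egress, this is the destination SG
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  description              = "Postgres"
}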

Check Yourself

Why is a security group better than a NACL for per-service rules?
When is a NACL actually the right tool instead of a security group?
What exfiltration path does a VPC endpoint close that an SG alone does not?

Mini Drill or Application

For a 3-tier app you know, draw the network moat. Mark each edge with port and rationale. Then remove any edge you cannot defend. If the system still works, the moat is tighter than it was.

Depth Path

Source Backbone

Security and observability require official docs, but these books provide the systems and reliability backbone behind the practices.

Building Secure and Reliable Systems - primary book backbone for security/reliability tradeoffs.
Software Engineering at Google - support for operational engineering and process.
The Linux Command Line - support for operational investigation and automation.
