DNS, Private Endpoints, and Service Discovery

What This Concept Is

Names are how services find each other. The cloud gives you layered DNS:

  • Public DNS - a hosted zone for example.com holding A/AAAA/CNAME records that resolve from the public internet. Managed via Route 53, Cloud DNS, Azure DNS, etc.
  • Private DNS (private hosted zones) - a hosted zone that resolves only inside your VPC(s), so internal names do not leak to the internet.
  • Service-discovery registries - a directory of service instances keyed by name (ECS Service Discovery, Cloud Map, Consul, Kubernetes CoreDNS). Each registered instance gets a DNS record automatically, often an SRV record exposing the port too.
  • Private endpoints / VPC endpoints (PrivateLink) - a way to reach an AWS service (or another VPC/SaaS) over a private IP inside your VPC, bypassing the public internet entirely.

Combined, these answer: "given a service name, how do I reach it, and does that traffic cross the internet?"

Every provider offers the same four layers with different names. GCP has Cloud DNS private zones, Service Directory, and Private Service Connect (PSC) as the PrivateLink equivalent. Azure has Azure DNS Private Zones, Private Endpoints, and Private Link. The architectural questions are identical across providers; only the config shapes differ.

Why It Matters Here

Naming and private connectivity are where security, cost, and correctness converge:

  • routing traffic from private subnets to S3 through a NAT Gateway incurs per-GB data processing charges; a VPC Gateway Endpoint keeps it in-region with no endpoint charge
  • publishing your internal API under a public DNS name exposes it to scanners; a private hosted zone solves that
  • service discovery lets a container that just moved IPs still be reachable by payments.myservice.local
  • cross-account integrations (SaaS or partner) via PrivateLink keep data off the public internet and simplify security reviews
  • a DNS outage is a total outage - the cost of misreading TTLs is a production event

DNS is also the classic "it's always DNS" incident. Bake the mental model early or pay for it later.

Concrete Example

An internal API talks to S3 and to a partner's SaaS API.

DNS layout:

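  • public hosted zone example.com -> api.example.com alias A record -> ALB DNS name (Route 53 ALIAS; ALB IPs change, so never hard-code them, and an ALIAS rather than a CNAME is required if you ever point the apex example.com at the ALB)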
  • private hosted zone internal.example in the VPC -> payments.internal.example -> service-discovery name for ECS tasks
  • partner SaaS endpoint resolvable inside the VPC via PrivateLink (vpce-1a2b3c.partner.amazonaws.com)

Traffic paths:

  • external clients call api.example.com -> public DNS -> ALB (public subnets)
  • ALB forwards to app tasks in private subnets
  • app tasks resolve payments.internal.example -> private DNS -> current IPs of payments tasks (updated by service discovery)
  • app tasks call S3 via a Gateway VPC Endpoint -> S3 in region over the AWS backbone, no NAT traffic
  • app tasks call the partner SaaS via PrivateLink -> partner's service endpoint exposed into your VPC, no public internet

Shell-level debugging of DNS flows (the right moves during an incident):

dig +short api.example.com          # public resolution
dig +short payments.internal.example @10.0.0.2 # ask the VPC resolver directly
getent hosts payments.internal.example # how the libc resolver sees it
cat /etc/resolv.conf # what the OS is configured to use
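curl -v https://s3.us-east-1.amazonaws.com/<bucket> # Gateway Endpoint: still resolves to public S3 IPs; the route table steers the traffic

If dig returns the expected IP but curl hangs, suspect the route table or, for an Interface Endpoint, its security group. Gateway Endpoints do not change DNS: S3 and DynamoDB names still resolve to public IPs, and a prefix-list route in the subnet's route table sends that traffic to the endpoint, so check the route table rather than dig to confirm one is in use. Interface Endpoints with private DNS enabled should resolve to private IPs in your VPC; a public IP there means private DNS is off or the endpoint is missing.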
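
When reading dig output during an incident, it helps to see at a glance whether each answer is a private (RFC 1918) address or a public one. The following is a minimal sketch of such a filter, not a standard tool: the function names is_private_ipv4 and classify_answers are hypothetical, it handles IPv4 only, and the sample addresses are canned stand-ins for real dig +short output.

```shell
#!/usr/bin/env bash
# Return success if the IPv4 address in $1 falls in an RFC 1918 range.
is_private_ipv4() {
  local a b rest
  IFS=. read -r a b rest <<<"$1"
  case $a in
    10)  return 0 ;;                                # 10.0.0.0/8
    172) [ "$b" -ge 16 ] && [ "$b" -le 31 ] ;;      # 172.16.0.0/12
    192) [ "$b" -eq 168 ] ;;                        # 192.168.0.0/16
    *)   return 1 ;;
  esac
}

# Read one answer per line on stdin (the shape of `dig +short` output)
# and label each IPv4 address. CNAME targets and IPv6 lines are skipped.
classify_answers() {
  local ip
  while read -r ip; do
    [ -n "$ip" ] || continue
    case $ip in *[!0-9.]*) continue ;; esac
    if is_private_ipv4 "$ip"; then
      echo "$ip private"
    else
      echo "$ip public"
    fi
  done
}

# Canned input standing in for: dig +short <name> | classify_answers
printf '%s\n' 10.0.12.34 203.0.113.10 | classify_answers
```

In real use, pipe dig +short for the name you are debugging into classify_answers and compare the labels against what you expect for that kind of endpoint.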

Common Confusion / Misconception

"Private hosted zones are secret." They are unreachable from outside the associated VPCs, but any principal inside the VPC can resolve them. Private DNS is a scoping tool, not an authentication tool.

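"A CNAME at the zone apex works." At the apex (example.com), a CNAME cannot coexist with the SOA and NS records that must live there (RFC 1034, clarified by RFC 2181), so it is not allowed; you need A/AAAA records, or a cloud-specific alias record (Route 53 ALIAS, Cloud DNS ALIAS) that the provider resolves to the target's addresses on the server side.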

"All AWS traffic from my VPC stays inside AWS." Only if you have the right endpoint. Without a Gateway VPC Endpoint for S3, traffic from your private subnet to S3 goes via NAT -> IGW -> public S3 endpoint. That is still within AWS's network in most cases, but you pay NAT data processing fees and lose one layer of privacy.

"Interface endpoints and Gateway endpoints are equivalent." They are not. Gateway endpoints (S3 and DynamoDB only) are free and attached via route tables. Interface endpoints (most other services, and optionally S3 and DynamoDB as well) cost an hourly fee per AZ plus per-GB processed and are attached via ENIs. Don't mix them up in a cost model.

"TTLs only matter on the public internet." Records in private hosted zones are cached too. A TTL of 172800 s on a private-zone record means a service-discovery IP flip may stay stale for clients for up to two days. Set internal TTLs based on churn rate.

Gotcha: TTL on DNS matters. If you run a blue/green deploy by flipping DNS and your clients cache at 300 s TTL, some clients will keep hitting the old target for 5 minutes. Design deploys around that TTL, not against it. And remember: many clients (especially JVMs) cache DNS longer than the TTL unless explicitly configured otherwise.
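
To put a number on that window, the sketch below reads the TTL from a dig answer line and adds any fixed client-side cache lifetime on top. It assumes the worst case, where a client fetches the old record just before the resolver's copy expires and then holds it for its own full cache lifetime. The function names ttl_of and stale_window are hypothetical, and the answer line is canned rather than fetched live.

```shell
#!/usr/bin/env bash
# Print the TTL (2nd column) of the first line of `dig +noall +answer` output.
ttl_of() { awk '{print $2; exit}'; }

# Upper bound, in seconds, on how long a client may keep using the old target
# after a DNS flip: the record TTL plus any fixed client-side cache lifetime
# that ignores the TTL (0 if the client honours the TTL).
stale_window() {
  local ttl=$1 client_cache=${2:-0}
  echo $(( ttl + client_cache ))
}

# Canned answer line standing in for: dig +noall +answer api.example.com
line='api.example.com.  300  IN  A  203.0.113.10'
ttl=$(printf '%s\n' "$line" | ttl_of)
echo "record TTL: ${ttl}s"
echo "upper bound with a 600 s client cache: $(stale_window "$ttl" 600)s"
```

The point of the arithmetic is that lowering the record TTL alone does not shrink the window if the client caches on its own schedule.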

How To Use It

For each service and dependency:

  1. Decide whether the name is public, private, or discovered dynamically.
  2. Use private hosted zones for internal service names; never publish internal names on public DNS.
  3. Register dynamic services with a discovery registry (Cloud Map, ECS Service Discovery, or Kubernetes DNS) instead of hard-coding IPs.
  4. Use VPC Gateway Endpoints for S3 and DynamoDB, and PrivateLink for other AWS or SaaS services where egress cost or privacy matters.
  5. Keep TTLs low (30-60 s) for records that change; keep TTLs high (3600+) for stable records to reduce DNS query cost.
  6. Document every zone, endpoint, and its owning team; DNS drift is a silent outage source.
  7. Alarm on resolver error rates and SERVFAIL-style metrics - DNS failures are usually visible in metrics 30+ seconds before users notice.
  8. In multi-account setups, use Route 53 Resolver rules (or shared Private DNS on GCP/Azure) so one central zone can be queried from many VPCs without cross-account chaos.
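
As a sketch of step 4, the helper below prints, without running it, the AWS CLI call that would add an S3 Gateway Endpoint to a VPC. The function name s3_gateway_endpoint_cmd and the VPC and route-table IDs are placeholders; review the printed command and substitute your own IDs before running it.

```shell
#!/usr/bin/env bash
# Print (do not execute) the AWS CLI command for an S3 Gateway Endpoint.
# Arguments: VPC ID, region, then one or more route table IDs.
s3_gateway_endpoint_cmd() {
  local vpc_id=$1 region=$2
  shift 2
  printf 'aws ec2 create-vpc-endpoint --vpc-id %s --vpc-endpoint-type Gateway --service-name com.amazonaws.%s.s3 --route-table-ids %s\n' \
    "$vpc_id" "$region" "$*"
}

# Placeholder IDs for illustration only.
s3_gateway_endpoint_cmd vpc-0123456789abcdef0 us-east-1 rtb-0aaa1111bbbb2222c
```

Printing rather than executing keeps the change reviewable; the route table IDs matter because a Gateway Endpoint only affects subnets whose route tables are associated with it.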

Check Yourself

  1. What is the difference between a private hosted zone and a public hosted zone?
  2. What problem does a VPC Gateway Endpoint solve that a NAT Gateway does not?
  3. Why is DNS TTL a deployment concern?
  4. You switch api.example.com from an ALB in region A to an ALB in region B. Users in one office still hit region A for 20 minutes. Name the three layers of caching that could be responsible.
  5. A team wants to call a partner SaaS "without going over the internet." Which primitive (on each of AWS/GCP/Azure) achieves that?

Mini Drill or Application

For a two-service app (public API plus internal worker), sketch the DNS plan in fifteen minutes. Include: public zone records, private zone records, service-discovery vs static names, one VPC endpoint, and one PrivateLink case. Mark which names are reachable from the public internet.

Extension: on a real VPC, list every Gateway and Interface endpoint you currently pay for (or should) using aws ec2 describe-vpc-endpoints. For each, answer: "what traffic would it eliminate from NAT?" If you cannot justify it, turn it off; if you cannot find one for S3, turn it on.
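
One way to start that inventory is to flatten the describe-vpc-endpoints output to one row per endpoint and count by type. The sketch below assumes a configured AWS CLI for the real call (shown only in a comment) and runs here against a canned row; summarize_endpoints is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count endpoints by type from tab-separated rows: id, type, service, state.
# Intended input (assumes AWS CLI credentials are configured):
#   aws ec2 describe-vpc-endpoints \
#     --query 'VpcEndpoints[].[VpcEndpointId,VpcEndpointType,ServiceName,State]' \
#     --output text | summarize_endpoints
summarize_endpoints() {
  awk -F'\t' '
    { count[$2]++; if ($2 == "Gateway" && $3 ~ /\.s3$/) s3 = 1 }
    END {
      printf "Gateway=%d Interface=%d\n", count["Gateway"], count["Interface"]
      if (!s3) print "no S3 Gateway endpoint found"
    }'
}

# Canned row standing in for real CLI output.
printf 'vpce-0aaa\tInterface\tcom.amazonaws.us-east-1.ecr.api\tavailable\n' | summarize_endpoints
```

Interface rows are the ones that carry an hourly charge, so those are the ones to justify; a missing S3 Gateway row is the prompt to add one.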

Read This Only If Stuck