Load Balancers: L4 vs L7, Health Checks, TLS Termination

What This Concept Is

A load balancer is a traffic distributor that accepts incoming connections or requests and spreads them across a pool of healthy backend targets.

Two important layers:

  • L4 (Layer 4, transport) - operates on TCP/UDP connections. Decisions are made on IP addresses and ports. Examples: AWS Network Load Balancer (NLB), GCP TCP/UDP Load Balancer, Azure Load Balancer. Fast, protocol-agnostic, good for gRPC/TCP, static IPs.
  • L7 (Layer 7, application) - operates on HTTP(S) requests. Decisions can use host, path, headers, cookies, JWT claims. Examples: AWS Application Load Balancer (ALB), GCP External HTTPS Load Balancer, Azure Application Gateway. Richer routing, per-request logging, WAF integration.

Both add:

  • Health checks - periodic probes to each target; unhealthy targets are removed from rotation automatically.
  • TLS termination - decrypting HTTPS at the load balancer so the backend receives plain HTTP (or traffic the load balancer has re-encrypted). Moves certificate management to one place.
  • Spread across AZs - a well-configured LB has targets in multiple AZs so one AZ failing does not take the whole service down.
  • Access logs - every connection/request captured with timing, status, and target. The single most valuable signal during an incident. (On an AWS NLB, access logs are only produced for TLS listeners.)
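The health-check bullet above can be made concrete with a small state tracker. This is a minimal sketch, not any provider's implementation: the class name `TargetHealth` and the default thresholds are illustrative assumptions. It shows why a single failed probe does not eject a target and a single success does not readmit one.

```python
from dataclasses import dataclass


@dataclass
class TargetHealth:
    """Tracks one target's rotation state from consecutive probe results."""
    healthy_threshold: int = 2    # consecutive successes needed to enter rotation (assumed)
    unhealthy_threshold: int = 3  # consecutive failures needed to leave rotation (assumed)
    healthy: bool = False         # new targets start out of rotation
    _streak: int = 0              # run of results that disagree with `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the target is in rotation."""
        if probe_ok == self.healthy:
            # Result agrees with the current state, so any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy
```

With these defaults, a target needs two passing probes in a row before it receives traffic and three failing probes in a row before it is pulled, so the probe interval times the unhealthy threshold is roughly how long a dead target can keep receiving requests.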

A third variant, the Gateway Load Balancer (L3), exists for transparent inline appliances (firewalls, IDS). You rarely design one from scratch; you configure it when a security product tells you to.

Why It Matters Here

The load balancer is the first thing your users hit and often the last thing your team understands. Misunderstandings here cause:

  • "we forgot the health-check path and every deploy drops 30% of connections"
  • "TLS certs are expiring and nobody knows where they live"
  • "L4 vs L7 picked wrong - we can't route /api/v2 without the app rewriting headers"
  • "the ALB is only in one AZ because we forgot to check all three subnet boxes"
  • "our NLB preserves client IP to the backend, and our security group only allowed the VPC CIDR, so everything was blocked"
  • "the Kubernetes Ingress controller is creating an ALB per Ingress and we are over the account limit"

Later modules (Kubernetes Ingress, mesh, observability) rest on these primitives. Kubernetes Ingress is fundamentally an L7 load-balancer abstraction with declarative wiring.

Concrete Example

Consider two services behind a single public entry point.

L7 / ALB:

  • listener: HTTPS on :443 with an ACM certificate
  • rule 1: host api.example.com path /v1/* -> target group api-v1 (Fargate tasks)
  • rule 2: host api.example.com path /v2/* -> target group api-v2 (Fargate tasks, different service)
  • rule 3: host admin.example.com -> target group admin (internal only, SG restricts source)
  • target-group health check: GET /healthz, 2 successes to mark healthy, 3 failures to mark unhealthy, interval 15 s
  • TLS terminated at the ALB; backend traffic can be HTTP or re-encrypted HTTPS
  • access logs to S3 (s3://acme-alb-logs/prod/), partitioned by date
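To see what "L7 routing" means in practice, the three rules above can be sketched as an ordered match on host and path. This is an illustration only: the `route` function and the `RULES` table are hypothetical, and glob matching via Python's `fnmatch` stands in for the load balancer's own pattern syntax.

```python
from fnmatch import fnmatchcase

# (host pattern, path pattern, target group), evaluated in priority order.
RULES = [
    ("api.example.com", "/v1/*", "api-v1"),
    ("api.example.com", "/v2/*", "api-v2"),
    ("admin.example.com", "*", "admin"),
]


def route(host, path, rules=RULES, default=None):
    """Return the target group of the first rule whose host and path both match."""
    host = host.lower()  # hostnames are case-insensitive; paths are not
    for host_pattern, path_pattern, target_group in rules:
        if fnmatchcase(host, host_pattern) and fnmatchcase(path, path_pattern):
            return target_group
    return default  # stands in for the listener's default action


print(route("api.example.com", "/v2/orders"))  # api-v2
```

Note that a request for `/v1` with no trailing slash does not match `/v1/*` in this sketch, which is the kind of edge case worth testing against your real rules.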

L4 / NLB for a gRPC service:

  • listener: TCP :443, target group uses health check of TCP:9000 (or HTTPS:8080 if preferred)
  • no awareness of HTTP paths or headers
  • client IP preserved to the backend (so the backend SG must allow the clients' source range, such as the public internet CIDR, and not only the VPC CIDR)
  • much higher throughput, lower latency than an ALB; typically used for non-HTTP or latency-sensitive workloads
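By contrast, an L4 balancer only sees connection-level fields. The sketch below picks a target by hashing those fields so that every packet of one connection lands on the same backend. It is a simplified illustration: the function name `pick_target` is hypothetical, and real load balancers use their own hash inputs and algorithms.

```python
import hashlib


def pick_target(src_ip, src_port, dst_ip, dst_port, protocol, targets):
    """Map a connection's 5-tuple to one entry of a non-empty target list."""
    flow = f"{protocol}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    digest = hashlib.sha256(flow).digest()
    # The same 5-tuple always hashes to the same index, so the flow stays pinned.
    index = int.from_bytes(digest[:8], "big") % len(targets)
    return targets[index]
```

Nothing in the inputs is an HTTP path or header, which is why path-based routing is not possible at this layer.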

Writing a health endpoint that doesn't lie (Python/Flask):

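from flask import Flask, jsonify

app = Flask(__name__)
# db and cache are the application's own clients, assumed to be defined elsewhere.

@app.get("/healthz")  # the .get shortcut needs Flask 2.0 or later
def healthz():
    try:
        db.execute("SELECT 1")  # exercise the dependencies real requests use
        cache.ping()
        return jsonify(status="ok"), 200
    except Exception as e:
        # A 503 tells the load balancer to stop sending traffic to this target.
        return jsonify(status="degraded", error=str(e)), 503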

Split livez (is the process up?) from readyz (can it serve traffic?) for Kubernetes-style probes.
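One way to sketch that split is as two framework-free handlers, each returning a body and a status code. The function names and the `checks` mapping are assumptions for illustration; in a Flask app each would sit behind its own route.

```python
from typing import Callable, Dict, Tuple


def livez() -> Tuple[dict, int]:
    """Liveness: the process is up and able to answer. No dependency checks."""
    return {"status": "ok"}, 200


def readyz(checks: Dict[str, Callable[[], None]]) -> Tuple[dict, int]:
    """Readiness: run every dependency check; any exception means not ready."""
    failed = []
    for name, check in checks.items():
        try:
            check()
        except Exception:
            failed.append(name)
    if failed:
        return {"status": "not ready", "failed": failed}, 503
    return {"status": "ok"}, 200
```

Keeping dependencies out of `livez` matters in Kubernetes: a failing liveness probe restarts the container, which does not help when the real problem is an unreachable database, while a failing readiness probe only takes the pod out of rotation.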

Common Confusion / Misconception

"L7 is always better." L7 adds latency, cost, and complexity. If you do not need path/host routing, an NLB is simpler, faster, and gives you static IP addresses (one per AZ on AWS).

"TLS termination means I don't need HTTPS to the backend." If your threat model includes anyone on the VPC network (bad actor, sidecar, misconfigured peer), you want mTLS or at least TLS to the backend. Terminating at the LB decrypts traffic; whether that decrypted traffic is then re-encrypted is a separate decision.

"Health checks are just a liveness probe." A weak health check (e.g., returning 200 from /) is worse than none. It claims the process is healthy even when its downstream database is unreachable. Write a health check that exercises the dependencies the request path actually uses, or split into liveness (is the process up?) and readiness (can it serve traffic?).

"Sticky sessions solve session consistency." They solve it until the backend dies and the cookie points at a corpse. Use sticky sessions only when you have measured a reason; state in an external store (cache, DB) is usually the better answer.

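Gotcha: If cross-zone load balancing is disabled and one AZ has 1 target while another has 10, each AZ still receives 50% of the traffic, so the lone target takes ten times the load of each target in the other AZ. Turn cross-zone on. On AWS it is on by default for ALBs (where it can only be turned off per target group) and off by default for NLBs, so NLBs are where this bites most often.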
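The arithmetic behind this gotcha can be worked through in a few lines. This sketch assumes traffic arrives evenly at every AZ that has targets, and the function name `per_target_share` is hypothetical.

```python
def per_target_share(targets_per_az, cross_zone):
    """Fraction of total traffic that each single target receives, keyed by AZ."""
    active = {az: n for az, n in targets_per_az.items() if n > 0}
    total_targets = sum(active.values())
    shares = {}
    for az, count in active.items():
        if cross_zone:
            # Every target is in one shared pool.
            shares[az] = 1 / total_targets
        else:
            # Each AZ gets an equal slice, split only among its own targets.
            shares[az] = (1 / len(active)) / count
    return shares


print(per_target_share({"az-a": 1, "az-b": 10}, cross_zone=False))
# {'az-a': 0.5, 'az-b': 0.05}
```

With cross-zone on, the same input gives every target 1/11 of the traffic (about 9%), so the single target in az-a drops from 50% to roughly 9%.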

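Second gotcha: An NLB can preserve the client IP (on AWS this is the default for instance targets). Your backend security group must then allow that client CIDR (potentially 0.0.0.0/0 for internet-facing), not just the VPC CIDR. Otherwise the security group silently drops the packets and clients see connections hang and time out, with no ALB-style rejection log to explain why. tcpdump on the backend is your friend: if no SYN packets arrive at all, suspect the security group.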
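A quick way to reason about this before deploying is to test a sample client address against the CIDRs your backend rule allows. The sketch below uses only Python's standard `ipaddress` module; the function name and the example addresses are illustrative assumptions, and it models source-range matching only, not ports or protocols.

```python
import ipaddress


def source_allowed(client_ip, allowed_cidrs):
    """True if client_ip falls inside at least one allowed CIDR block."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# With client IP preservation, the backend sees the real public address.
print(source_allowed("203.0.113.7", ["10.0.0.0/16"]))               # False
print(source_allowed("203.0.113.7", ["10.0.0.0/16", "0.0.0.0/0"]))  # True
```

The first call is the failure in this gotcha: a rule that only admits the VPC range rejects a preserved public client address.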

How To Use It

For any internet-facing service:

  1. Pick L7 if you need routing by host, path, header, or cookie; otherwise L4.
  2. Terminate TLS at the load balancer; use ACM (AWS) or the provider's managed cert service; auto-rotate.
  3. Put the LB across at least 2 AZs (ideally 3); confirm targets exist in each AZ.
  4. Write a health check against a real GET /healthz endpoint that returns 200 only if critical dependencies work.
  5. Publish LB access logs to S3/GCS and metrics to CloudWatch/Cloud Monitoring; alert on 5xx rate and target health counts.
  6. If public-facing, put a WAF in front (L7) or use security group rules (L4) to limit source ranges.
  7. Model deploys around connection draining - ensure your deployment system deregisters targets (or fails their health check) before killing the process, and set a deregistration delay at least as long as your slowest request.
  8. Test renewal by rotating a cert manually once - the first incident is not the time to learn your ACM setup.

Check Yourself

  1. When would an NLB be a better choice than an ALB for a high-throughput service?
  2. Why is a weak health check dangerous?
  3. What does "TLS termination at the load balancer" mean for certificate management and for traffic to the backend?
  4. Your ALB is across three AZs but traffic is uneven (one target drowning). Name two causes and a fix for each.
  5. A developer says "we don't need health checks, we run rolling deploys." Why is that still wrong?

Mini Drill or Application

Pick a small service (for example, a REST API at api.example.com). In fifteen minutes, write a one-page load-balancer plan: L4 or L7 with reasons, listener and rules, health-check path, TLS approach, AZ spread, and two failure scenarios with how the LB handles them.

Extension: from any reachable machine, run curl -I https://your.example.com && openssl s_client -connect your.example.com:443 -servername your.example.com </dev/null | openssl x509 -noout -dates. Record the cert's notAfter. Write a one-line alert that fires 30 days before it.
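If you want to check your alert logic, here is one possible sketch in Python. It assumes you pass in the `notAfter=...` line exactly as openssl prints it; the function names and the 30-day default are illustrative, and `ssl.cert_time_to_seconds` is the standard-library helper for this date format.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left on a cert, given openssl's 'notAfter=Jan  5 09:34:43 2018 GMT' line."""
    cert_time = not_after.split("=", 1)[-1].strip()  # drop the 'notAfter=' prefix if present
    expires = ssl.cert_time_to_seconds(cert_time)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def should_alert(not_after, threshold_days=30, now=None):
    """True once the certificate is within threshold_days of expiring."""
    return days_until_expiry(not_after, now) <= threshold_days
```

Run on a schedule, `should_alert` returning True is the signal to page or open a ticket; how that signal is delivered depends on your monitoring stack.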

Read This Only If Stuck