Microservices Katas

Focused, repeatable drills that build fluency in the four skills this module is really about: decomposition, contract authoring, resilience wiring, and trace reading. Complete each kata multiple times until the setup feels automatic.

Kata 1: Decompose an E-Commerce Monolith

Time limit: 20 minutes
Goal: Propose 4-6 services with owned data and a migration cut.
Setup: You inherit a monolithic retail system with the following modules: catalog, search, cart, checkout, payments, orders, inventory, fulfillment, shipping, accounts, auth, notifications, reviews, recommendations, admin. Single Postgres DB with ~60 tables. One deploy pipeline. 14-engineer company, 3 teams.

Produce:

  1. Bounded contexts (list them, 1-sentence ubiquitous language each).
  2. Proposed service list (4-6, with owned verbs and owned data).
  3. Which modules stay in a modular monolith for now (not extracted).
  4. The first strangler-fig extraction and its rollback plan.
  5. One anti-pattern you explicitly refused to create, and why.

Repeat until: You can do this without referring to the concept pages, for at least two different industries (for example e-commerce, SaaS, logistics, healthcare, banking).
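
One way to self-check item 2 is to write the proposal down as data and test it mechanically. The sketch below is only an illustration: the service names and table names are hypothetical, not a model answer. It checks two things the kata asks for: the service count is in the 4-6 range, and no table is owned by more than one service.

```python
from collections import defaultdict

# Hypothetical answer sheet: service -> tables it owns (names are illustrative).
ownership = {
    "catalog": {"products", "categories", "reviews"},
    "ordering": {"carts", "orders", "order_lines"},
    "payments": {"payment_intents", "refunds"},
    "fulfillment": {"stock_levels", "shipments"},
    "identity": {"accounts", "credentials"},
}


def shared_tables(ownership):
    """Return tables claimed by more than one service, with their claimants."""
    owners = defaultdict(list)
    for service, tables in ownership.items():
        for table in tables:
            owners[table].append(service)
    return {t: sorted(s) for t, s in owners.items() if len(s) > 1}


def check(ownership, min_services=4, max_services=6):
    """Return a list of human-readable problems; an empty list means it passes."""
    problems = []
    if not min_services <= len(ownership) <= max_services:
        problems.append(
            f"expected {min_services}-{max_services} services, got {len(ownership)}"
        )
    for table, services in sorted(shared_tables(ownership).items()):
        problems.append(f"table '{table}' is owned by {services}")
    return problems


print(check(ownership))
```

A table that shows up under two services is usually the shared-database anti-pattern in disguise, which makes it a reasonable candidate for item 5.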


Kata 2: Design a Contract Test

Time limit: 20 minutes
Goal: Write a consumer-driven contract test for one real-ish interaction.
Setup: Inventory service reads from Orders at GET /orders/{id} to build reservations for confirmed orders.

Produce:

  1. Consumer name, provider name.
  2. Provider state declaration ("given order X is confirmed").
  3. Request shape (method, path, headers).
  4. Expected response shape, including:
    • a required field consumer uses (must be present)
    • an optional field consumer ignores
    • an enum field (consumer must tolerate unknown values)
  5. Three expectations that, if broken on the provider, should fail this test.
  6. One "safe" provider change that the test should not fail on.
  7. One breaking provider change that the test catches.

Repeat until: You can produce all seven items in under 10 minutes, and the "safe vs breaking" distinction is automatic.
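
The seven items can be sketched without a contract-testing framework, which keeps the "safe vs breaking" distinction visible. Everything below is an assumption made for illustration: the field names (id, status, items), the enum values, and the helper names verify and reservable are hypothetical, and a real project would more likely use a dedicated tool for this.

```python
KNOWN_STATUSES = {"PENDING", "CONFIRMED", "CANCELLED"}  # assumed enum values

CONTRACT = {
    "consumer": "inventory",
    "provider": "orders",
    "provider_state": "order 42 is confirmed",
    "request": {"method": "GET", "path": "/orders/42",
                "headers": {"Accept": "application/json"}},
    "response": {
        "status": 200,
        # Only fields the consumer actually reads; everything else is ignored.
        "required": {"id": str, "status": str, "items": list},
    },
}


def verify(contract, status_code, body):
    """Return a list of contract violations for one provider response."""
    expected = contract["response"]
    violations = []
    if status_code != expected["status"]:
        violations.append(f"expected HTTP {expected['status']}, got {status_code}")
    for field, expected_type in expected["required"].items():
        if field not in body:
            violations.append(f"missing required field '{field}'")
        elif not isinstance(body[field], expected_type):
            violations.append(f"field '{field}' is not a {expected_type.__name__}")
    return violations


def reservable(body):
    """Consumer logic: reserve stock only for orders known to be confirmed."""
    status = body["status"] if body["status"] in KNOWN_STATUSES else "UNKNOWN"
    return status == "CONFIRMED"


baseline = {"id": "42", "status": "CONFIRMED",
            "items": [{"sku": "A1", "quantity": 2}]}
# Safe change: a new optional field and a new enum value the consumer has never seen.
safe_change = {**baseline, "giftMessage": "hi", "status": "ON_HOLD"}
# Breaking change: a field the consumer reads was renamed.
breaking_change = {"id": "42", "status": "CONFIRMED", "lines": baseline["items"]}

print(verify(CONTRACT, 200, baseline))         # []
print(verify(CONTRACT, 200, safe_change))      # []
print(verify(CONTRACT, 200, breaking_change))  # ["missing required field 'items'"]
print(reservable(safe_change))                 # False: unknown status is tolerated, not reserved
```

The three expectations that fail this check are the status code, the presence of each required field, and the type of each required field. Extra fields and unknown enum values pass by design.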


Kata 3: Wire Circuit Breaker Logic

Time limit: 20 minutes
Goal: Implement the circuit-breaker state machine in pseudocode.
Setup: You are writing a small library. It wraps a callable downstream() and enforces the closed / open / half-open state machine with configurable thresholds.

Produce:

  1. State enum: CLOSED, OPEN, HALF_OPEN.
  2. Config: failure_threshold (percent or count), window_size, open_cooldown_ms, half_open_probes.
  3. Counter: rolling window of successes/failures.
  4. Pseudocode for the call path:

        function call(downstream):
            if state == OPEN:
                if now() < opened_at + open_cooldown_ms:
                    raise CircuitOpen                 # fast-fail during cooldown
                state = HALF_OPEN                     # cooldown elapsed: allow probes
                probes_remaining = half_open_probes
                probe_successes = 0
            if state == HALF_OPEN:
                if probes_remaining <= 0:
                    raise CircuitOpen                 # all probes dispatched; wait for results
                probes_remaining -= 1                 # this call consumes one probe slot
            try:
                result = downstream()
                record_success()
                if state == HALF_OPEN:
                    probe_successes += 1
                    if probe_successes >= half_open_probes:
                        state = CLOSED; reset_counters()
                return result
            except TransientError:
                record_failure()
                if state == HALF_OPEN or (state == CLOSED and failure_rate() >= failure_threshold):
                    state = OPEN; opened_at = now()   # any failed probe reopens the circuit
                raise

  5. One test case for each transition: CLOSED->OPEN, OPEN->HALF_OPEN, HALF_OPEN->CLOSED, HALF_OPEN->OPEN.

Repeat until: You can write the transitions and test cases without the reference above, and you can explain each decision (why cooldown, why probes, why reset counters on close).
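
For checking your own attempt, here is a minimal runnable sketch of the same state machine in Python. It is single-threaded and makes several assumptions not stated in the kata: the clock is injected so tests can move time by hand, failure_threshold is a rate between 0 and 1, and the breaker only trips once the rolling window is full (a minimum-volume rule added here to stop a single early failure from opening the circuit).

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpen(Exception):
    """Raised when the breaker fast-fails instead of calling downstream."""


class TransientError(Exception):
    """Stand-in for failures that should count against the breaker."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_size=4,
                 open_cooldown_ms=1000, half_open_probes=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = deque(maxlen=window_size)  # rolling outcomes, True = failure
        self.open_cooldown_s = open_cooldown_ms / 1000
        self.half_open_probes = half_open_probes
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.probes_left = 0
        self.probe_successes = 0

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()

    def _failure_rate(self):
        return sum(self.window) / len(self.window)

    def call(self, downstream):
        if self.state is State.OPEN:
            if self.clock() < self.opened_at + self.open_cooldown_s:
                raise CircuitOpen("cooling down")
            self.state = State.HALF_OPEN          # cooldown elapsed: allow probes
            self.probes_left = self.half_open_probes
            self.probe_successes = 0
        if self.state is State.HALF_OPEN:
            if self.probes_left <= 0:
                raise CircuitOpen("probes already dispatched")
            self.probes_left -= 1
        try:
            result = downstream()
        except TransientError:
            self.window.append(True)
            window_full = len(self.window) == self.window.maxlen
            if self.state is State.HALF_OPEN:
                self._trip()                      # a failed probe reopens immediately
            elif window_full and self._failure_rate() >= self.failure_threshold:
                self._trip()
            raise
        self.window.append(False)
        if self.state is State.HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_probes:
                self.state = State.CLOSED
                self.window.clear()               # start the closed state with a clean window
        return result


def flaky():
    raise TransientError("downstream timeout")


fake_now = [0.0]                                  # hand-cranked clock, in seconds
breaker = CircuitBreaker(window_size=2, clock=lambda: fake_now[0])
for _ in range(2):
    try:
        breaker.call(flaky)
    except TransientError:
        pass
print(breaker.state)                              # State.OPEN
fake_now[0] = 1.0                                 # cooldown (1000 ms) has elapsed
print(breaker.call(lambda: "ok"), breaker.state)  # ok State.CLOSED
```

Clearing the window on close matters: without it, the failures that opened the circuit would still be in the window and the next single failure could trip it again.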


Kata 4: Model a Distributed Trace for One User Request

Time limit: 25 minutes
Goal: Produce a realistic trace waterfall for one request, including one latency bottleneck and one retried call.
Setup: A user hits POST /checkout on a mobile app. The path is: mobile app -> API Gateway -> Mobile BFF -> fan-out to accounts, cart, orders. Orders calls payments and publishes OrderConfirmed. Payments calls an external PSP with one retry. The PSP call is slow today, so the retry fires.

Produce:

  1. The list of spans with trace_id, span_id, parent_span_id, service name, operation name, start/end ms.
  2. The waterfall rendering (text, as in concept 13).
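  3. Identification of the bottleneck span: the slowest span on the critical path. (A p99 is a percentile over many requests, so a single trace has a bottleneck, not a p99.)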
  4. Identification of the retry (hint: two sibling spans with same parent and operation, first one errored).
  5. The traceparent header propagation: which hops produce it and which hops must forward it.
  6. Three log lines from three different services, each stamped with the same trace_id.
  7. One alert rule that would fire on this trace (e.g., "payments p99 > 1s" or "retry rate > 5%").
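
Items 1 to 4 can be sketched as data plus three small functions. All span IDs, operation names, and timings below are invented for illustration, and the "bottleneck" here is defined as the span with the most self time (time not covered by its children), which is one reasonable definition among several.

```python
from dataclasses import dataclass
from typing import Optional

TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"  # shared by every span below


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    service: str
    operation: str
    start_ms: int
    end_ms: int
    error: bool = False

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


# Hypothetical timings: the first PSP call times out after 1 s, the retry succeeds.
SPANS = [
    Span("m1", None, "mobile-app", "POST /checkout", 0, 1960),
    Span("a1", "m1", "api-gateway", "POST /checkout", 30, 1930),
    Span("b1", "a1", "mobile-bff", "POST /checkout", 35, 1920),
    Span("c1", "b1", "accounts", "GET /accounts/me", 40, 70),
    Span("c2", "b1", "cart", "GET /cart", 40, 90),
    Span("c3", "b1", "orders", "POST /orders", 95, 1910),
    Span("d1", "c3", "payments", "POST /charges", 110, 1880),
    Span("e1", "d1", "payments", "PSP authorize", 120, 1120, error=True),
    Span("e2", "d1", "payments", "PSP authorize", 1130, 1870),
    Span("d2", "c3", "orders", "publish OrderConfirmed", 1885, 1900),
]


def self_time_ms(span, spans):
    """Time inside `span` not covered by any child (children may overlap)."""
    children = sorted((c.start_ms, c.end_ms) for c in spans
                      if c.parent_span_id == span.span_id)
    covered, cursor = 0, span.start_ms
    for start, end in children:
        start = max(start, cursor)
        if end > start:
            covered += end - start
            cursor = end
    return span.duration_ms - covered


def bottleneck(spans):
    return max(spans, key=lambda s: self_time_ms(s, spans))


def find_retries(spans):
    """Pairs (failed, retry): consecutive siblings, same operation, first errored."""
    groups = {}
    for s in sorted(spans, key=lambda s: s.start_ms):
        groups.setdefault((s.parent_span_id, s.operation), []).append(s)
    return [(a, b) for g in groups.values() for a, b in zip(g, g[1:]) if a.error]


def render(spans, width=40):
    total = max(s.end_ms for s in spans)
    by_parent = {}
    for s in spans:
        by_parent.setdefault(s.parent_span_id, []).append(s)
    lines = []

    def walk(parent_id, depth):
        for s in sorted(by_parent.get(parent_id, []), key=lambda s: s.start_ms):
            left = s.start_ms * width // total
            bar = max(1, s.duration_ms * width // total)
            label = f"{'  ' * depth}{s.service}: {s.operation}"
            flag = " ERR" if s.error else ""
            lines.append(f"{label:<42}{' ' * left}{'#' * bar} {s.duration_ms}ms{flag}")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return "\n".join(lines)


print(render(SPANS))
slow = bottleneck(SPANS)
print("bottleneck:", slow.span_id, slow.service, slow.operation)
print("retries:", [(a.span_id, b.span_id) for a, b in find_retries(SPANS)])
```

With these made-up numbers the bottleneck is the first PSP call (e1), and the retry shows up as the pair (e1, e2) under the same payments parent.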

Repeat until: You can read a text-format trace and immediately point at the bottleneck, and you can reproduce the waterfall for a similar request (tweet post, photo upload, ride request) from memory.
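
For items 5 and 6, it helps to see what "produce" and "forward" mean at the header level. The sketch below follows the W3C Trace Context layout for traceparent (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function names and the log format are hypothetical, and real services would normally rely on a tracing SDK rather than hand-rolled helpers.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """First instrumented hop: mint a fresh trace ID and span ID."""
    trace_id = secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)     # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming):
    """Every later hop: keep the trace ID and flags, swap in its own span ID."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return new_traceparent()       # missing or invalid header: start a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def log_line(service, message, traceparent):
    """Hypothetical log format: every line carries the trace ID."""
    trace_id = traceparent.split("-")[1]
    return f"service={service} trace_id={trace_id} msg={message!r}"


gateway = new_traceparent()            # produced at the edge
bff = child_traceparent(gateway)       # forwarded with a new span ID
orders = child_traceparent(bff)

for service, header in [("api-gateway", gateway), ("mobile-bff", bff), ("orders", orders)]:
    print(log_line(service, "handling POST /checkout", header))
```

The three printed lines share one trace_id even though each hop has its own span ID, which is the property that lets logs from different services be joined to the same trace.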


Completion Standard

  • Can complete each kata within its time limit
  • Can explain the decisions in each kata without rereading the concept page
  • Can do kata 1 for at least two different industries
  • Can do kata 4 with a retry, a timeout, and a circuit-open fast-fail shown in the same trace
  • Have done the full set at least twice, a week apart