Microservices Katas
Focused, repeatable drills that build fluency in the four skills this module is really about: decomposition, contract authoring, resilience wiring, and trace reading. Complete each kata multiple times until the setup feels automatic.
Kata 1: Decompose an E-Commerce Monolith
Time limit: 20 minutes
Goal: Propose 4-6 services with owned data and a migration cut.
Setup: You inherit a monolithic retail system with the following modules: catalog, search, cart, checkout, payments, orders, inventory, fulfillment, shipping, accounts, auth, notifications, reviews, recommendations, admin. Single Postgres DB with ~60 tables. One deploy pipeline. 14-engineer company, 3 teams.
Produce:
- Bounded contexts (list them, 1-sentence ubiquitous language each).
- Proposed service list (4-6, with owned verbs and owned data).
- Which modules stay in a modular monolith for now (not extracted).
- The first strangler-fig extraction and its rollback plan.
- One anti-pattern you explicitly refused to create, and why.
Repeat until: You can do this from memory, without referring to the concept pages, for at least two different industries (for example e-commerce, SaaS, logistics, healthcare, or banking).
Kata 2: Design a Contract Test
Time limit: 20 minutes
Goal: Write a consumer-driven contract test for one real-ish interaction.
Setup: Inventory service reads from Orders at GET /orders/{id} to build reservations for confirmed orders.
Produce:
- Consumer name, provider name.
- Provider state declaration ("given order X is confirmed").
- Request shape (method, path, headers).
- Expected response shape, including:
  - a required field consumer uses (must be present)
  - an optional field consumer ignores
  - an enum field (consumer must tolerate unknown values)
- Three expectations that, if broken on the provider, should fail this test.
- One "safe" provider change that the test should not fail on.
- One breaking provider change that the test catches.
Repeat until: You can produce all seven items in under 10 minutes, and the "safe vs breaking" distinction is automatic.
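One way to practice the "safe vs breaking" distinction is to write the contract as data and check provider responses against it. The following is a minimal hand-rolled sketch, not a Pact test: the field names (`id`, `status`, `items`), the order ID 42, the `Accept` header, and the helper names `verify_response` and `should_reserve` are all assumptions made for illustration.

```python
# Hypothetical contract between the inventory consumer and the orders provider.
# Field names, types, and sample values are illustrative assumptions.
CONTRACT = {
    "consumer": "inventory",
    "provider": "orders",
    "provider_state": "order 42 is confirmed",
    "request": {
        "method": "GET",
        "path": "/orders/42",
        "headers": {"Accept": "application/json"},
    },
    "response": {
        "status": 200,
        # Only the fields the consumer actually reads are listed here.
        "required": {"id": str, "status": str, "items": list},
    },
}


def verify_response(status_code, body):
    """Return a list of contract violations; an empty list means the contract holds."""
    problems = []
    expected = CONTRACT["response"]
    if status_code != expected["status"]:
        problems.append(f"status {status_code} != {expected['status']}")
    for field, field_type in expected["required"].items():
        if field not in body:
            problems.append(f"missing required field: {field}")
        elif not isinstance(body[field], field_type):
            problems.append(f"wrong type for {field}: {type(body[field]).__name__}")
    # Extra fields and unknown enum values are deliberately not checked.
    return problems


def should_reserve(body):
    """Consumer-side tolerant reader: any status other than CONFIRMED means no reservation."""
    return body["status"] == "CONFIRMED"


confirmed = {"id": "42", "status": "CONFIRMED", "items": [{"sku": "A1", "qty": 2}]}

# Safe provider change: a new optional field appears.
assert verify_response(200, {**confirmed, "gift_note": "hi"}) == []

# Safe provider change: a new enum value the consumer has never seen.
on_hold = {**confirmed, "status": "ON_HOLD"}
assert verify_response(200, on_hold) == []
assert should_reserve(on_hold) is False

# Breaking provider change: the required field is renamed.
renamed = {"id": "42", "state": "CONFIRMED", "items": []}
assert verify_response(200, renamed) == ["missing required field: status"]
```

The check lists only what the consumer reads, which is what keeps additive provider changes from failing the test. A real consumer-driven contract tool would also replay the request against a running provider in the declared provider state.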
Kata 3: Wire Circuit Breaker Logic
Time limit: 20 minutes
Goal: Implement the circuit-breaker state machine in pseudocode.
Setup: You are writing a small library. It wraps a callable downstream() and enforces the closed / open / half-open state machine with configurable thresholds.
Produce:
- State enum: CLOSED, OPEN, HALF_OPEN.
- Config: failure_threshold (percent or count), window_size, open_cooldown_ms, half_open_probes.
- Counter: rolling window of successes/failures.
- Pseudocode for:
    function call(downstream):
        if state == OPEN:
            if now() < opened_at + open_cooldown_ms:
                raise CircuitOpen            # fast-fail while cooling down
            state = HALF_OPEN                # cooldown elapsed: allow probes
            probe_remaining = half_open_probes
        if state == HALF_OPEN:
            if probe_remaining <= 0:
                raise CircuitOpen            # all probes already in flight
            probe_remaining -= 1             # claim one probe slot
        try:
            result = downstream()
            record_success()
            if state == HALF_OPEN:
                state = CLOSED
                reset_counters()
            return result
        except TransientError:
            record_failure()
            # a failed probe reopens immediately; otherwise trip on the failure rate
            if state == HALF_OPEN or failure_rate() >= failure_threshold:
                state = OPEN
                opened_at = now()
            raise
- One test case for each transition: CLOSED->OPEN, OPEN->HALF_OPEN, HALF_OPEN->CLOSED, HALF_OPEN->OPEN.
Repeat until: You can write the transitions and test cases without the reference above, and you can explain each decision (why cooldown, why probes, why reset counters on close).
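To check your own pseudocode against something executable, the state machine can be sketched in Python as below. This is one possible single-threaded shape, not a reference implementation. Assumptions: the class and exception names, `failure_threshold` expressed as a rate between 0 and 1, an injectable `clock` returning seconds so tests can step through the cooldown, and a minimum-volume guard (the breaker only trips from CLOSED once the window is full) that the pseudocode above does not have.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpen(Exception):
    """Raised when the breaker fast-fails without calling downstream."""


class TransientError(Exception):
    """Hypothetical marker for failures that count against the breaker."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_size=4,
                 open_cooldown_ms=1000, half_open_probes=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failure rate in [0, 1]
        self.window = deque(maxlen=window_size)     # True = failure, False = success
        self.open_cooldown_ms = open_cooldown_ms
        self.half_open_probes = half_open_probes
        self.clock = clock                          # injectable so tests control time
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.probe_remaining = 0

    def _now_ms(self):
        return self.clock() * 1000.0

    def call(self, downstream):
        if self.state is State.OPEN:
            if self._now_ms() < self.opened_at + self.open_cooldown_ms:
                raise CircuitOpen("cooling down")
            self.state = State.HALF_OPEN
            self.probe_remaining = self.half_open_probes
        if self.state is State.HALF_OPEN:
            if self.probe_remaining <= 0:
                raise CircuitOpen("probes already in flight")
            self.probe_remaining -= 1
        try:
            result = downstream()
        except TransientError:
            self.window.append(True)
            # Assumption: require a full window before tripping from CLOSED.
            window_full = len(self.window) == self.window.maxlen
            rate = sum(self.window) / len(self.window)
            if self.state is State.HALF_OPEN or (window_full and rate >= self.failure_threshold):
                self.state = State.OPEN
                self.opened_at = self._now_ms()
            raise
        self.window.append(False)
        if self.state is State.HALF_OPEN:
            self.state = State.CLOSED
            self.window.clear()  # start the closed state with fresh counters
        return result


# Usage with a fake clock so the cooldown can be stepped deterministically.
now = [0.0]
breaker = CircuitBreaker(window_size=2, open_cooldown_ms=100, clock=lambda: now[0])


def failing():
    raise TransientError("downstream unavailable")


def attempt(fn):
    """Call through the breaker and report the outcome as a string."""
    try:
        return breaker.call(fn)
    except (TransientError, CircuitOpen) as exc:
        return type(exc).__name__
```

Driving `attempt` with `failing` and with a succeeding callable, while advancing `now[0]` past the cooldown, exercises all four transitions. Making this safe under concurrent callers would need a lock around the state changes, which the sketch leaves out.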
Kata 4: Model a Distributed Trace for One User Request
Time limit: 25 minutes
Goal: Produce a realistic trace waterfall for one request, including one latency bottleneck and one retried call.
Setup: A user hits POST /checkout on a mobile app. The path is: mobile app -> API Gateway -> Mobile BFF -> fan-out to accounts, cart, orders. Orders calls payments and publishes OrderConfirmed. Payments calls an external PSP with one retry. The PSP call is slow today: the first attempt times out, so the retry fires.
Produce:
- The list of spans with trace_id, span_id, parent_span_id, service name, operation name, start/end ms.
- The waterfall rendering (text, as in concept 13).
- Identification of the bottleneck span (the one that dominates the request's end-to-end latency).
- Identification of the retry (hint: two sibling spans with same parent and operation, first one errored).
- The traceparent header propagation: which hops produce it and which hops must forward it.
- Three log lines from three different services, each stamped with the same trace_id.
- One alert rule that would fire on this trace (e.g., "payments p99 > 1s" or "retry rate > 5%").
Repeat until: You can read a text-format trace and immediately point at the bottleneck, and you can reproduce the waterfall for a similar request (tweet post, photo upload, ride request) from memory.
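The span list, the waterfall, and the retry hint can be rehearsed with a small script like the one below. It is a sketch under stated assumptions: every span ID, operation name, and timing is invented for the drill, the bottleneck is approximated as the span with the largest self time (duration not covered by its children), and the text rendering is one possible layout rather than the format used on the concept page.

```python
from dataclasses import dataclass
from typing import Optional

TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"  # illustrative value


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    service: str
    operation: str
    start_ms: int
    end_ms: int
    error: bool = False
    trace_id: str = TRACE_ID


# Made-up timings: the first PSP attempt times out, the retry succeeds slowly.
SPANS = [
    Span("a1", None, "api-gateway", "POST /checkout", 0, 2350),
    Span("b1", "a1", "mobile-bff", "POST /checkout", 5, 2340),
    Span("c1", "b1", "accounts", "GET /accounts/me", 10, 40),
    Span("c2", "b1", "cart", "GET /cart", 10, 60),
    Span("c3", "b1", "orders", "POST /orders", 65, 2330),
    Span("d1", "c3", "payments", "POST /charges", 80, 2280),
    Span("e1", "d1", "payments", "PSP authorize", 90, 1090, error=True),
    Span("e2", "d1", "payments", "PSP authorize", 1140, 2270),
    Span("d2", "c3", "orders", "publish OrderConfirmed", 2290, 2320),
]


def self_time(span, spans):
    """Duration of `span` not covered by any child span (children may overlap)."""
    kids = sorted((s.start_ms, s.end_ms) for s in spans if s.parent_span_id == span.span_id)
    covered, cursor = 0, span.start_ms
    for start, end in kids:
        start = max(start, cursor)  # skip time already counted for an overlapping sibling
        if end > start:
            covered += end - start
            cursor = end
    return (span.end_ms - span.start_ms) - covered


def find_retries(spans):
    """Pairs (failed, retry): same parent, service, and operation, earlier span errored."""
    pairs = []
    ordered = sorted(spans, key=lambda s: s.start_ms)
    for i, first in enumerate(ordered):
        if not first.error:
            continue
        for later in ordered[i + 1:]:
            same_call = (later.parent_span_id, later.service, later.operation) == (
                first.parent_span_id, first.service, first.operation)
            if same_call and later.start_ms >= first.end_ms:
                pairs.append((first.span_id, later.span_id))
                break
    return pairs


def depth(span, by_id):
    """Number of ancestors between this span and the root."""
    d = 0
    while span.parent_span_id is not None:
        span = by_id[span.parent_span_id]
        d += 1
    return d


def waterfall(spans, width=40):
    """Render one line per span: indented label, proportional bar, duration."""
    by_id = {s.span_id: s for s in spans}
    total = max(s.end_ms for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: (s.start_ms, s.span_id)):
        left = s.start_ms * width // total
        right = max(left + 1, s.end_ms * width // total)
        bar = " " * left + "#" * (right - left)
        label = "  " * depth(s, by_id) + f"{s.service}: {s.operation}"
        flag = " [error]" if s.error else ""
        lines.append(f"{label:<40} |{bar:<{width}}| {s.end_ms - s.start_ms}ms{flag}")
    return "\n".join(lines)


bottleneck = max(SPANS, key=lambda s: self_time(s, SPANS))
```

With these timings, `bottleneck` is the second PSP attempt and `find_retries(SPANS)` pairs it with the errored first attempt. Printing `waterfall(SPANS)` gives a text view to practice reading before swapping in timings of your own.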
Completion Standard
- Can complete each kata within its time limit
- Can explain the decisions in each kata without rereading the concept page
- Can do kata 1 on at least two different industries
- Can do kata 4 with a retry, a timeout, and a circuit-open shown in the same trace
- Have done the full set at least twice, a week apart