Long-Running Operations, Webhooks, and Async Callbacks

What This Concept Is

Some operations do not finish inside a single HTTP request. Three common patterns for signaling "this will take a while, here is how to find out when it is done":

Long-Running Operations (LROs): the server responds immediately with an Operation resource the client can poll. Client pulls.
Webhooks: the server calls back a URL the client registered. Server pushes.
Async callbacks via a message broker: the server emits an event on a topic the client is subscribed to. Bus pushes.

None of these are "asynchronous HTTP" - they are contracts for what the client gets back and what the server does next when work is not instant.

Why It Matters Here

Synchronous RPC is a leaky fiction for operations that take longer than a browser timeout. Forcing a 2-minute video encode behind a single POST fails every reliability dimension: timeouts, retries, load balancing, client UX. Naming the async pattern explicitly in the contract fixes all of those.

Delivery guarantees also differ across the three patterns. LRO is "pull for truth" - nothing is delivered, so there is no delivery problem: the client reads the current state whenever it chooses and can read it again. Webhooks are "at-least-once, server retries" - consumers must dedupe. Events on a broker inherit whatever the broker guarantees.

Concrete Example

Pattern 1: Long-Running Operation

A video upload that takes minutes to transcode.

POST /videos
Content-Type: application/json

{ "source_url": "...", "title": "Team meeting" }

HTTP/1.1 202 Accepted
Content-Type: application/json
Location: /operations/op_01HX

{
  "name": "operations/op_01HX",
  "metadata": {
    "type": "transcode",
    "video_id": "vid_99",
    "progress_percent": 0
  },
  "done": false
}

Client polls:

GET /operations/op_01HX
-> 200 OK
{
  "name": "operations/op_01HX",
  "metadata": { "type": "transcode", "video_id": "vid_99", "progress_percent": 42 },
  "done": false
}

Eventually:

GET /operations/op_01HX
-> 200 OK
{
  "name": "operations/op_01HX",
  "metadata": { "type": "transcode", "video_id": "vid_99", "progress_percent": 100 },
  "done": true,
  "response": {
    "video_id": "vid_99",
    "duration_seconds": 832,
    "variants": [ { "resolution": "1080p", "url": "..." } ]
  }
}

On failure:

{
  "name": "operations/op_01HX",
  "done": true,
  "error": { "code": "SOURCE_UNREACHABLE", "message": "..." }
}

Client rules are uniform across LROs: done=false means keep polling; done=true with response means success; done=true with error means failure. Clients can cancel: POST /operations/op_01HX:cancel.
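
The client rules above fit in one small loop. This is a minimal sketch, not a definitive client: `wait_for_operation`, `OperationFailed`, the injected `fetch` callable (standing in for one GET /operations/{id}), and the default interval and timeout are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict


class OperationFailed(Exception):
    """Raised when an operation finishes with done=true and an error."""


def wait_for_operation(
    fetch: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the operation is done; return its response or raise.

    `fetch` is a hypothetical callable that performs one
    GET /operations/{id} and returns the decoded JSON body.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        op = fetch()
        if op.get("done"):
            if "error" in op:
                # done=true with error means failure
                err = op["error"]
                raise OperationFailed(f"{err.get('code')}: {err.get('message')}")
            # done=true with response means success
            return op.get("response", {})
        # done=false means keep polling, but not forever
        if time.monotonic() >= deadline:
            raise TimeoutError("operation still running after timeout")
        sleep(interval_seconds)
```

Passing `fetch` and `sleep` in keeps the loop independent of any HTTP library and easy to exercise with canned responses.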

Pattern 2: Webhook

Client registers a URL; server calls it when events happen.

POST /webhooks
{
  "url": "https://client.example.com/hooks/payments",
  "events": ["payment.succeeded", "payment.failed"],
  "signing_secret_version": 1
}

When the event fires, the server POSTs to the client:

POST /hooks/payments HTTP/1.1
Host: client.example.com
X-Event-Id: evt_01HX
X-Event-Type: payment.succeeded
X-Signature: sha256=HEX...
X-Timestamp: 2026-04-10T15:20:00Z
Content-Type: application/json

{
  "id": "evt_01HX",
  "type": "payment.succeeded",
  "created_at": "2026-04-10T15:20:00Z",
  "data": { "payment_id": "pay_42", "amount": 4599, "currency": "USD" }
}

Delivery contract to document:

At-least-once: clients must dedupe by X-Event-Id.
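Retry policy: server retries on any non-2xx response or timeout, with exponential backoff, for N hours.
Signature: HMAC-SHA256(secret, timestamp + "." + raw_body) to prevent spoofing; the client also rejects timestamps outside a short tolerance window, which is what limits replay.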
Ordering: not guaranteed; events may arrive out of sequence.
Response: client returns 2xx within a few seconds to acknowledge; anything else is a retryable failure.
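
The signature rule in the contract above can be sketched on the consumer side as follows. The `sign` and `verify` names, the 5-minute tolerance, and the assumption that the timestamp is ISO 8601 in UTC are illustrative choices, not part of the contract as written.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone
from typing import Optional


def sign(secret: bytes, timestamp: str, raw_body: bytes) -> str:
    """Compute HMAC-SHA256(secret, timestamp + "." + raw_body) as sha256=HEX."""
    message = timestamp.encode("utf-8") + b"." + raw_body
    return "sha256=" + hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(
    secret: bytes,
    timestamp: str,
    raw_body: bytes,
    signature_header: str,
    now: Optional[datetime] = None,
    tolerance: timedelta = timedelta(minutes=5),  # assumed window
) -> bool:
    """Return True only for a fresh request with a matching signature."""
    now = now or datetime.now(timezone.utc)
    try:
        # Accept a trailing "Z" as UTC, as in the X-Timestamp example.
        sent_at = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    except ValueError:
        return False
    if sent_at.tzinfo is None:
        return False
    # Stale or far-future timestamps are rejected to limit replay.
    if abs(now - sent_at) > tolerance:
        return False
    expected = sign(secret, timestamp, raw_body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

Verification has to run over the raw request bytes, before any JSON parsing, because re-serializing the body can change whitespace or key order and break the signature.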

Dashboard for operators: list recent deliveries, their status, and a "redeliver" action.
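
The retry policy in the delivery contract can be sketched on the sender side like this. The function names, the base delay, the cap, and the 24-hour window are assumptions standing in for whatever "N hours" means in a given contract; jitter is left out for brevity.

```python
from datetime import timedelta
from typing import List, Optional


def should_retry(status: Optional[int]) -> bool:
    """A timeout (None) or any non-2xx status counts as a failed delivery."""
    return status is None or not (200 <= status < 300)


def retry_delays(
    base: timedelta = timedelta(minutes=1),
    factor: float = 2.0,
    cap: timedelta = timedelta(hours=1),
    window: timedelta = timedelta(hours=24),
) -> List[timedelta]:
    """List the waits between attempts: exponential, capped, within the window."""
    zero = timedelta(0)
    if base <= zero or cap <= zero or factor < 1:
        raise ValueError("base and cap must be positive and factor >= 1")
    delays: List[timedelta] = []
    elapsed = zero
    delay = base
    # Stop scheduling once the next wait would run past the retry window.
    while elapsed + delay <= window:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap)
    return delays
```

Publishing the resulting schedule alongside the contract tells consumers how long an outage they can ride out before an event needs manual redelivery.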

Pattern 3: Async callback via broker

Used between internal services; same contract shape but delivery is via Kafka / SQS / PubSub:

topic: payments.events
key:   pay_42
value: {
  "id": "evt_01HX",
  "type": "payment.succeeded",
  "data": { "payment_id": "pay_42", ... }
}

Broker-level guarantees (at-least-once, ordered-by-key in Kafka) are now your contract. Document them in AsyncAPI (Cluster 4).
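
To make "ordered-by-key" concrete, here is a toy model of key-based partitioning. It is only an illustration: `partition_for` and `route` are hypothetical helpers, and CRC32 is a stand-in hash, since real brokers ship their own partitioners. The point is that events sharing a key land in the same partition and keep their relative order there, while nothing is promised across keys.

```python
import zlib
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition deterministically (stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(
    events: List[Dict[str, Any]], num_partitions: int
) -> DefaultDict[int, List[Dict[str, Any]]]:
    """Append each event to its key's partition, preserving publish order."""
    partitions: DefaultDict[int, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        partitions[partition_for(event["key"], num_partitions)].append(event)
    return partitions


events = [
    {"key": "pay_42", "type": "payment.succeeded"},
    {"key": "pay_7", "type": "payment.failed"},
    {"key": "pay_42", "type": "payment.refunded"},
]
partitions = route(events, num_partitions=4)
# Both pay_42 events sit in one partition, in the order they were published.
pay_42_events = partitions[partition_for("pay_42", 4)]
```

Choosing the key is therefore a contract decision: keying by payment ID orders events per payment, not per customer or per account.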

Common Confusion / Misconception

"We'll just make the client wait - it usually finishes in under 30 seconds." Production networks are not usually anything. A connection dropping at second 29 reruns the job; 30-second operations become hour-long nightmares under load. Commit to async as soon as the operation is not user-perceptible as instant.

"Webhooks are fire-and-forget." They are fire-and-retry. Every webhook client must be idempotent (dedupe by event ID) and every webhook server must have a replay tool. Failing to design for this leads to phantom duplicate orders and double-sent emails.

"LRO polling is wasteful." A GET /operations/{id} every few seconds for a minute is cheap. The alternative - long-held connections or websockets - is usually harder to operate for little gain.

"Events and webhooks are the same thing." Webhooks are HTTP events, push-to-URL. Broker events are decoupled from any specific consumer. A webhook system has N registered URLs; a broker has N consumer groups. Picking between them depends on who you are integrating with.

"We don't need to sign webhooks over HTTPS." You absolutely do. Anyone on the internet can spoof an HTTP POST to your client's public URL. Signature + timestamp prevents replay and forgery.

How To Use It

When designing an async operation:

Define the trigger: which synchronous call kicks it off, and what does that call return?
Define the status contract: how does the client know when it is done and with what result? (LRO resource, webhook event, broker message)
Define delivery guarantees: at-most-once / at-least-once / exactly-once, ordered or unordered.
Define failure handling: timeouts, retries, poisoned payloads, dead-letter topics.
Define security: signatures for webhooks, auth for brokers, ACLs for LRO polling.
Define observability: how does the consumer replay, inspect, and debug?

Write these six items as a one-page "async contract" for every such operation.

Check Yourself

Why do webhook consumers have to dedupe? What property does "at-least-once delivery" guarantee and not guarantee?
When should you prefer LRO over webhooks? When should you prefer webhooks over LRO?
If you cannot give webhook consumers ordering guarantees, how do you let them reconstruct the correct final state of a resource?

Mini Drill or Application

Design the async side of a payments API:

Model POST /payments as an LRO (authorize + capture takes time). Write the 202 Accepted response and the GET /operations/{id} responses for in-flight, success, and failure.
Design the webhook contract for payment.succeeded, payment.failed, and payment.refunded: headers, body, signature algorithm, retry policy.
Write a one-paragraph note on ordering: can the consumer assume succeeded arrives before refunded? If not, how do they reconstruct state?
