Context

Grounded in notes from Models of distributed systems and practical experience building APIs that behave well under retries.

Modern services operate over unreliable networks and imperfect nodes. Two ideas quietly underpin reliable behavior at scale: failure detection and idempotency. This post bridges theory and production patterns you can ship.

Failure detection: from theory to production

  • System model awareness: Most production systems are effectively partially synchronous, with fair-loss links and crash-recovery nodes. Design detectors accordingly.
  • Timeouts are not truth: A timeout doesn’t prove a crash; it indicates suspicion. Treat detectors as eventually perfect, not oracles.
  • Probe patterns:
    • Heartbeats with adaptive timeouts (quantile-based per-peer RTTs; see the sketch after this list).
    • Gossip to spread suspicion; converge faster without centralized bottlenecks.
    • Lease-based liveness for leaders; tie safety to expirations, not heartbeats alone.
  • Operational practices:
    • Expose detector state (suspected, confirmed, cleared) and transitions in metrics/logs.
    • Separate detection from action. Let higher layers decide to shed, reroute, or degrade.
    • Use jittered backoff to avoid thundering herds during partitions.
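
The adaptive timeout above can be as simple as a rolling quantile per peer. A minimal sketch, assuming a window of recent RTT samples (the class and parameter names here are illustrative, not from any particular library):

import statistics
from collections import deque

class AdaptiveTimeout:
    """Per-peer probe timeout derived from a rolling RTT quantile (sketch)."""

    def __init__(self, window=256, margin=2.0, floor=0.010, default=1.0):
        self.rtts = deque(maxlen=window)  # recent round-trip samples, in seconds
        self.margin = margin              # headroom for jitter and GC pauses
        self.floor = floor                # never suspect faster than 10 ms
        self.default = default            # conservative until warmed up

    def observe(self, rtt_seconds):
        self.rtts.append(rtt_seconds)

    def timeout(self):
        if len(self.rtts) < 10:
            return self.default
        p99 = statistics.quantiles(self.rtts, n=100)[98]  # 99th percentile
        return max(self.floor, self.margin * p99)

Pair the computed timeout with the jittered backoff above so that simultaneous suspicions do not synchronize retries.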

Reliable links may still reorder messages. To achieve FIFO delivery per peer:

  1. Attach a monotonically increasing sequence number per sender → receiver pair.
  2. Receiver accepts seq == expected; buffers future (seq > expected); drops duplicates (seq < expected).
  3. Persist expected in crash-recovery systems; replay buffer bounded by window size.
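
The appendix works through the receiver side; for symmetry, here is a sender-side sketch (the store and transport interfaces are assumptions):

class FifoSender:
    """Stamps each message with a per-receiver sequence number (sketch)."""

    def __init__(self, store, transport, receiver_id):
        self.store = store          # durable KV store, as on the receiver side
        self.transport = transport  # anything with send(receiver_id, seq, payload)
        self.receiver_id = receiver_id
        self.next_seq = store.get((receiver_id, 'next_seq')) or 0

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        # Persist before sending so recovery from a crash never reuses a number.
        self.store.put((self.receiver_id, 'next_seq'), self.next_seq)
        self.transport.send(self.receiver_id, seq, payload)
        return seq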

This mirrors the exercise in Models of distributed systems, adding production concerns (persistence, bounds, metrics).

Idempotency: the contract that tames retries

  • Why: Clients retry on timeouts; networks duplicate and reorder. Without idempotency, a retried POST can create duplicate effects.
  • Contract: Client supplies an Idempotency-Key that scopes the intended effect. Server guarantees that repeats with the same key are safe.
  • Store design:
    • Durable write-ahead record keyed by (client_id, idempotency_key) with status and final response pointer.
    • Lock-free fast path: the first request wins; concurrent duplicates return the stored result (or wait while it is pending). See the handler sketch after this list.
    • TTL cleanup for incomplete/abandoned attempts.
  • Side effects:
    • Decompose operations to identify the “commit point”. Ensure external effects (emails, webhooks) are tied to committed records and deduplicated by key.
    • Outbox pattern to make side effects at-least-once and idempotent downstream.
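
Appendix C gives the SQL; for orientation, a hypothetical handler tying the pieces together might look like this (the db transaction interface and its method names are assumptions, not a real driver):

class PendingElsewhere(Exception):
    """A concurrent request holds the same key; the client should retry later."""

def create_order(db, client_id, idem_key, sku, qty):
    with db.transaction():
        if not db.try_insert_key(client_id, idem_key, status='pending'):
            record = db.get_key(client_id, idem_key)    # a first writer already exists
            if record.status == 'pending':
                raise PendingElsewhere()
            return record.response                      # replay the stored result
        order_id = db.insert_order(client_id, sku, qty) # commit point
        db.insert_outbox('order', order_id, 'OrderCreated',
                         {'order_id': str(order_id)})   # side effects go via the outbox
        response = {'order_id': str(order_id)}
        db.mark_key_succeeded(client_id, idem_key, response)
        return response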

API surface: practical guidance

  • Require Idempotency-Key on mutating endpoints where retries are common (payments, orders, jobs).
  • Return 202 Accepted for long-running work with a status_url for polling; a repeat with the same key returns the same job.
  • Make read-after-write consistency explicit; document eventual semantics and polling intervals.
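
From the client's side, the key must be generated once per logical operation and reused across retries. A sketch using the requests library (the /orders endpoint and response fields are illustrative):

import random
import time
import uuid

import requests

def create_order_with_retries(base_url, sku, qty, attempts=5):
    key = str(uuid.uuid4())  # generated once, reused on every retry
    for attempt in range(attempts):
        try:
            resp = requests.post(
                f"{base_url}/orders",
                json={"sku": sku, "qty": qty},
                headers={"Idempotency-Key": key},
                timeout=2.0,
            )
            if resp.status_code == 202:
                return resp.json()["status_url"]  # long-running: poll this URL
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            # Jittered exponential backoff; the shared key makes the retry safe.
            time.sleep(min(2 ** attempt, 30) * (0.5 + random.random() / 2))
    raise RuntimeError("gave up; the server may still complete the keyed operation")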

Observability and SLOs

  • Track: retry rate, duplicate drop rate, idempotency cache hit rate, false-suspect rate from failure detector, and end-to-end tail latencies (P95/P99).
  • Alert on a rising false-suspect rate: it usually indicates network jitter or GC pauses; tune timeouts and widen windows before paging humans.
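
For concreteness, the counters might be declared like this (Prometheus-style; the metric names are illustrative):

from prometheus_client import Counter, Histogram

RETRIES = Counter('client_retries_total', 'Requests retried after a timeout')
DUP_DROPS = Counter('fifo_duplicate_drops_total', 'Messages dropped as duplicates')
IDEM_REPLAYS = Counter('idempotency_replays_total', 'Responses served from the idempotency store')
FALSE_SUSPECTS = Counter('detector_false_suspects_total', 'Suspicions later cleared by a heartbeat')
LATENCY = Histogram('request_latency_seconds', 'End-to-end request latency',
                    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0))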

TL;DR

Treat failure detectors as “suspicion services,” not truth oracles. Pair them with idempotent APIs and outbox-backed side effects. Persist sequence state if you need FIFO over crash-recovery. You’ll convert network vagaries into predictable, user-safe behavior.


Technical appendix

A. Failure detector properties and φ-accrual outline

  • Completeness: every crashed process is eventually suspected by every correct process.
  • Accuracy (eventual weak): there is a time after which some correct process is no longer suspected by any correct process.

The φ-accrual detector computes a suspicion level from the heartbeat inter-arrival distribution:

\[ \phi(t) = -\log_{10}\bigl(1 - F(t - t_{\text{last}})\bigr) \]

where \(F\) is the CDF of the inter-arrival times (often approximated by a normal distribution over a rolling window). Suspect when \(\phi \geq \theta\) (e.g., 5–8).

import math
import statistics
from collections import deque

class PhiAccrual:
    """Suspicion level from heartbeat inter-arrival times (formula above)."""
    def __init__(self, window=100):
        self.samples = deque(maxlen=window)
        self.last = None
 
    def heartbeat(self, now):
        if self.last is not None:
            self.samples.append(now - self.last)
        self.last = now
 
    def phi(self, now) -> float:
        if not self.samples:
            return 0.0
        mu = statistics.mean(self.samples)
        sigma = max(1e-3, statistics.pstdev(self.samples))
        x = now - self.last
        # Normal CDF approximation
        z = (x - mu) / sigma
        F = 0.5 * (1 + math.erf(z / math.sqrt(2)))
        F = min(max(F, 1e-12), 1 - 1e-12)
        return -math.log10(1 - F)

Tune \(\theta\) per environment; expose the false-suspect rate and adjust using quantile-based adaptive timeouts.

B. Per-sender FIFO delivery

Guarantee per-sender FIFO delivery using sequence numbers and a bounded buffer; persist the expected sequence number across crashes.

class FifoReceiver:
    def __init__(self, store, sender_id, window=1024):
        self.store = store  # durable KV store
        self.sender_id = sender_id
        self.expected = store.get((sender_id, 'expected')) or 0
        self.window = window  # bound on buffered out-of-order messages
        self.buffer = {}
 
    def on_message(self, seq, payload):
        if seq < self.expected:
            return  # duplicate
        if seq == self.expected:
            self.deliver(payload)
            self.expected += 1
            self.store.put((self.sender_id, 'expected'), self.expected)
            # drain buffer
            while self.expected in self.buffer:
                self.deliver(self.buffer.pop(self.expected))
                self.expected += 1
                self.store.put((self.sender_id, 'expected'), self.expected)
        elif seq < self.expected + self.window:
            self.buffer[seq] = payload  # future message, within the window
        # else: beyond the window; drop and rely on sender retransmission
 
    def deliver(self, payload):
        pass  # application handler

Space is bounded by the per-sender window. Safety: the receiver never delivers out of order, by construction; liveness depends on eventual (re)delivery by the sender.

C. Idempotency store and outbox (PostgreSQL)

create table idempotency_keys (
  client_id text not null,
  key text not null,
  status text not null check (status in ('pending','succeeded','failed')),
  response jsonb,
  created_at timestamptz default now(),
  primary key (client_id, key)
);
 
create table orders (
  id uuid primary key default gen_random_uuid(),
  client_id text not null,
  sku text not null,
  qty int not null,
  created_at timestamptz default now()
);
 
create table outbox (
  id bigserial primary key,
  aggregate text not null,
  aggregate_id uuid not null,
  event_type text not null,
  payload jsonb not null,
  created_at timestamptz default now(),
  dispatched boolean default false
);

Server flow (single transaction; PL/pgSQL-style sketch):

-- 1) Reserve key
insert into idempotency_keys (client_id, key, status)
values ($1, $2, 'pending')
on conflict (client_id, key) do nothing;
 
-- 2) If not inserted, return stored response
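--    (read-back sketch: replay `response` when succeeded; treat 'pending'
--     as an in-flight duplicate, e.g. 409 or retry-after)
select status, response from idempotency_keys where client_id = $1 and key = $2;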
-- else create order and enqueue side effects
insert into orders (client_id, sku, qty) values ($1, $3, $4) returning id into order_id;
insert into outbox (aggregate, aggregate_id, event_type, payload)
values ('order', order_id, 'OrderCreated', jsonb_build_object('order_id', order_id));
 
-- 3) Mark success
update idempotency_keys set status = 'succeeded', response = jsonb_build_object('order_id', order_id)
where client_id = $1 and key = $2;

The outbox dispatcher reads rows in order and delivers them at least once; consumers must dedupe by (aggregate, event_type, aggregate_id).
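
A minimal dispatch loop, assuming a generic db handle and a publish callable (placeholders are psycopg-style, but any driver works):

import time

def dispatch_outbox(db, publish, batch=100, idle_sleep=0.5):
    # At-least-once by construction: a crash between publish() and the
    # update simply redelivers the row on the next pass.
    while True:
        rows = db.query(
            "select id, aggregate, aggregate_id, event_type, payload"
            " from outbox where not dispatched order by id limit %s", (batch,))
        for row in rows:
            publish(row)  # may raise; the row stays undispatched and is retried
            db.execute("update outbox set dispatched = true where id = %s", (row.id,))
        if not rows:
            time.sleep(idle_sleep)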