Context
Grounded in notes from Models of distributed systems and practical experience building APIs that behave well under retries.
Modern services operate over unreliable networks and imperfect nodes. Two ideas quietly underpin reliable behavior at scale: failure detection and idempotency. This post bridges theory and production patterns you can ship.
Failure detection: from theory to production
- System model awareness: Most prod systems are effectively partially-synchronous with fair-loss links and crash-recovery nodes. Design detectors accordingly.
- Timeouts are not truth: A timeout doesn’t prove a crash; it indicates suspicion. Treat detectors as eventually perfect, not oracles.
- Probe patterns:
  - Heartbeats with adaptive timeouts (quantile-based per-peer RTTs).
  - Gossip to spread suspicion; converge faster without centralized bottlenecks.
  - Lease-based liveness for leaders; tie safety to expirations, not to heartbeats alone.
- Operational practices:
  - Expose detector state (suspected, confirmed, cleared) and its transitions in metrics/logs.
  - Separate detection from action: let higher layers decide to shed, reroute, or degrade.
  - Use jittered backoff to avoid thundering herds during partitions.
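The adaptive-timeout and jittered-backoff ideas above can be sketched together. This is a minimal sketch; the class and function names are illustrative, not from any library:

```python
import random
import statistics
from collections import deque

class AdaptiveTimeout:
    """Per-peer probe timeout derived from a rolling RTT window."""

    def __init__(self, window=200, quantile=0.99, floor=0.05, cap=5.0):
        self.rtts = deque(maxlen=window)
        self.quantile = quantile
        self.floor = floor  # seconds; never time out faster than this
        self.cap = cap      # seconds; never wait longer than this

    def observe(self, rtt):
        self.rtts.append(rtt)

    def timeout(self):
        if len(self.rtts) < 10:
            return self.cap  # be conservative until we have samples
        # 99th-percentile RTT, doubled for headroom, clamped to [floor, cap].
        cuts = statistics.quantiles(self.rtts, n=100)
        q = cuts[int(self.quantile * 100) - 1]
        return min(max(2 * q, self.floor), self.cap)

def jittered_backoff(attempt, base=0.1, cap=10.0):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Full jitter (rather than equal or decorrelated jitter) is the simplest way to spread retries across an interval; the floor/cap clamps keep a noisy RTT window from producing pathological timeouts.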
FIFO from reliable links: a pragmatic recipe
Reliable links guarantee delivery but may reorder messages. To achieve per-peer FIFO delivery:
- Attach a monotonically increasing sequence number per sender → receiver pair.
- Receiver accepts `seq == expected`; buffers future messages (`seq > expected`); drops duplicates (`seq < expected`).
- Persist `expected` in crash-recovery systems; the replay buffer is bounded by the window size.
This mirrors the exercise in Models of distributed systems, adding production concerns (persistence, bounds, metrics).
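On the sending side, the same recipe needs a persisted counter. A minimal sketch, where the `store` KV interface and the `transport` callable are assumptions for illustration:

```python
class FifoSender:
    """Stamps outgoing messages with a per-receiver sequence number."""

    def __init__(self, store, receiver_id):
        self.store = store  # durable KV store (assumed interface: get/put)
        self.receiver_id = receiver_id
        # Survive crash-recovery: numbering must never regress.
        self.next_seq = store.get((receiver_id, 'next_seq')) or 0

    def send(self, payload, transport):
        seq = self.next_seq
        self.next_seq += 1
        # Persist the counter and the message before transmitting, so a
        # crash-recovered sender retransmits the same (seq, payload) rather
        # than skipping seq; the receiver drops any resulting duplicates.
        self.store.put((self.receiver_id, 'next_seq'), self.next_seq)
        self.store.put((self.receiver_id, 'out', seq), payload)
        transport(seq, payload)
        return seq
```

Retransmission of unacknowledged `(seq, payload)` pairs, and pruning acknowledged ones, is left out; both are needed for liveness over fair-loss links and to bound the store.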
Idempotency: the contract that tames retries
- Why: Clients retry on timeouts; networks duplicate and reorder. Without idempotency, a retried `POST` can create duplicate effects.
- Contract: Client supplies an `Idempotency-Key` that scopes the intended effect. Server guarantees that repeats with the same key are safe.
- Store design:
  - Durable write-ahead record keyed by `(client_id, idempotency_key)` with status and a pointer to the final response.
  - Lock-free fast path: the first request wins; losers return the stored result.
  - TTL cleanup for incomplete/abandoned attempts.
- Side effects:
- Decompose operations to identify the “commit point”. Ensure external effects (emails, webhooks) are tied to committed records and deduplicated by key.
- Outbox pattern to make side effects at-least-once and idempotent downstream.
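The commit-point idea can be sketched as one transactional write. This is a sketch only: `conn` is a DB-API connection (sqlite-style `?` placeholders for illustration), and `orders`/`outbox` tables are assumed to exist:

```python
import json
import uuid

def create_order(conn, client_id, sku, qty):
    """Write the order row and its outbox event in a single transaction, so
    the side effect is tied to the commit point: both commit, or neither does.
    """
    order_id = str(uuid.uuid4())
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "insert into orders (id, client_id, sku, qty) values (?, ?, ?, ?)",
            (order_id, client_id, sku, qty))
        conn.execute(
            "insert into outbox (aggregate, aggregate_id, event_type, payload) "
            "values ('order', ?, 'OrderCreated', ?)",
            (order_id, json.dumps({'order_id': order_id})))
    return order_id
```

Because the event row commits atomically with the order, a later dispatcher can deliver it at-least-once without ever announcing an order that doesn't exist.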
API surface: practical guidance
- Require `Idempotency-Key` on mutating endpoints where retries are common (payments, orders, jobs).
- Return `202 Accepted` for long-running work, with a `status_url` for polling; a repeat with the same key returns the same job.
- Make read-after-write consistency explicit; document eventual semantics and polling intervals.
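The server-side contract above can be sketched framework-free. `IdempotencyCache` below is a hypothetical in-memory stand-in for the durable key store; production needs durability and TTL cleanup as described earlier:

```python
import threading

class IdempotencyCache:
    """In-memory stand-in for a durable idempotency-key store (sketch only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # (client_id, key) -> {'status', 'response'}

    def handle(self, client_id, key, do_work):
        with self._lock:
            record = self._records.get((client_id, key))
            if record is not None:
                # Repeat with the same key: return the stored result.
                # (A real store would make a still-'pending' caller wait.)
                return record['response']
            record = {'status': 'pending', 'response': None}
            self._records[(client_id, key)] = record
        response = do_work()  # the effect runs once per key
        record['status'] = 'succeeded'
        record['response'] = response
        return response
```

The lock only guards the reserve step, mirroring the "first request wins" fast path; the effect itself runs outside the lock.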
Observability and SLOs
- Track: retry rate, duplicate drop rate, idempotency cache hit rate, false-suspect rate from failure detector, and end-to-end tail latencies (P95/P99).
- A rising false-suspect rate usually indicates network jitter or GC pauses; tune timeouts and widen windows before paging humans.
TL;DR
Treat failure detectors as “suspicion services,” not truth oracles. Pair them with idempotent APIs and outbox-backed side effects. Persist sequence state if you need FIFO over crash-recovery. You’ll convert network vagaries into predictable, user-safe behavior.
Technical appendix
A. Failure detector properties and φ-accrual outline
- Completeness: every crashed process is eventually suspected by every correct process.
- Accuracy (eventual weak): there is a time after which correct processes are not suspected.
The φ-accrual detector computes a suspicion level from the heartbeat inter-arrival distribution:

\[ \phi(t) = -\log_{10}\bigl(1 - F(t - t_{last})\bigr) \]

where \(F\) is the CDF of the inter-arrival times (often approximated by a normal distribution over a rolling window). Suspect when \(\phi \geq \theta\) (e.g., 5–8).
```python
import math
import statistics
from collections import deque

class PhiAccrual:
    def __init__(self, window=100):
        self.samples = deque(maxlen=window)  # rolling inter-arrival window
        self.last = None

    def heartbeat(self, now):
        if self.last is not None:
            self.samples.append(now - self.last)
        self.last = now

    def phi(self, now) -> float:
        if not self.samples:
            return 0.0
        mu = statistics.mean(self.samples)
        sigma = max(1e-3, statistics.pstdev(self.samples))
        x = now - self.last
        # Normal CDF approximation
        z = (x - mu) / sigma
        F = 0.5 * (1 + math.erf(z / math.sqrt(2)))
        F = min(max(F, 1e-12), 1 - 1e-12)
        return -math.log10(1 - F)
```

Tune \(\theta\) per environment; expose the false-suspect rate and adjust using quantile-based adaptive timeouts.
B. FIFO over reliable links with crash-recovery
Guarantee per-sender FIFO delivery using sequence numbers and a bounded buffer; persist expected sequence across crashes.
```python
class FifoReceiver:
    def __init__(self, store, sender_id):
        self.store = store  # durable KV store
        self.sender_id = sender_id
        self.expected = store.get((sender_id, 'expected')) or 0
        self.buffer = {}  # out-of-order messages awaiting delivery

    def on_message(self, seq, payload):
        if seq < self.expected:
            return  # duplicate
        if seq == self.expected:
            self.deliver(payload)
            self.expected += 1
            self.store.put((self.sender_id, 'expected'), self.expected)
            # drain any buffered successors
            while self.expected in self.buffer:
                self.deliver(self.buffer.pop(self.expected))
                self.expected += 1
                self.store.put((self.sender_id, 'expected'), self.expected)
        else:
            self.buffer[seq] = payload  # future message

    def deliver(self, payload):
        pass  # application handler
```

Space is bounded by the maximum in-flight window per sender. Safety: never delivers out of order, by construction; liveness depends on eventual delivery of every message.
C. Idempotency store and outbox (PostgreSQL)
```sql
create table idempotency_keys (
  client_id  text not null,
  key        text not null,
  status     text not null check (status in ('pending', 'succeeded', 'failed')),
  response   jsonb,
  created_at timestamptz default now(),
  primary key (client_id, key)
);

create table orders (
  id         uuid primary key default gen_random_uuid(),
  client_id  text not null,
  sku        text not null,
  qty        int not null,
  created_at timestamptz default now()
);

create table outbox (
  id           bigserial primary key,
  aggregate    text not null,
  aggregate_id uuid not null,
  event_type   text not null,
  payload      jsonb not null,
  created_at   timestamptz default now(),
  dispatched   boolean default false
);
```

Server flow (single transaction):
```sql
-- 1) Reserve the key
insert into idempotency_keys (client_id, key, status)
values ($1, $2, 'pending')
on conflict (client_id, key) do nothing;

-- 2) If not inserted, return the stored response;
--    otherwise create the order and enqueue side effects
insert into orders (client_id, sku, qty) values ($1, $3, $4) returning id into order_id;
insert into outbox (aggregate, aggregate_id, event_type, payload)
values ('order', order_id, 'OrderCreated', jsonb_build_object('order_id', order_id));

-- 3) Mark success
update idempotency_keys
set status = 'succeeded', response = jsonb_build_object('order_id', order_id)
where client_id = $1 and key = $2;
```

The outbox dispatcher reads in order and delivers at-least-once; consumers must dedupe by `(aggregate, event_type, aggregate_id)`.
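A single dispatcher pass can be sketched as below. This is a sketch, not a production dispatcher: `conn` is a DB-API connection using sqlite-style `?` placeholders for illustration, and `publish` is an assumed callable that hands one row to the message bus:

```python
def dispatch_outbox(conn, publish, batch=100):
    """One pass: read undispatched events in order, publish each, mark done.

    A crash between publish and commit re-publishes the batch on the next
    pass -- at-least-once delivery, hence consumer-side dedupe by key.
    """
    with conn:  # one transaction per batch
        rows = conn.execute(
            "select id, aggregate, aggregate_id, event_type, payload "
            "from outbox where not dispatched order by id limit ?",
            (batch,)).fetchall()
        for row in rows:
            publish(row)
            conn.execute("update outbox set dispatched = true where id = ?",
                         (row[0],))
    return len(rows)
```

With PostgreSQL and multiple dispatcher workers, select the batch with `for update skip locked` so workers don't contend on the same rows.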