Info: Anchored in notes from "Data Models and Query Languages" and in the pragmatic trade-offs teams make when shipping features fast.
Senior engineers are often asked: “Should we go relational, document, or graph?” The right answer optimizes for iteration speed today while leaving options open for tomorrow. Here’s a decision framework.
1) Start from access patterns, not ideology
- Enumerate the top 10 read/write paths and their SLAs. Prototype queries first.
- If you’re rendering whole aggregates (profile, order, report) → document/locality is your friend.
- If you’re joining across many-to-many, enforcing invariants, and ad-hoc querying → relational wins.
- If relationships are the product (recommendations, network paths) → graph may be the simplest.
2) Locality vs normalization
- Document models maximize storage locality; fewer roundtrips and indexes for aggregate reads.
- Normalized relational schemas keep writes simple, constraints enforceable, and queries composable.
- Hybridize intentionally: denormalize read models (materialized views) fed from normalized writes.
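As a toy illustration of that hybrid (the names `record_order` and `totals_by_customer` are invented here), normalized append-only writes can feed a denormalized read model the same way a materialized view would inside the database:

```python
from collections import defaultdict

orders = []                              # normalized write model: one fact per row
totals_by_customer = defaultdict(float)  # denormalized read model for aggregate reads

def record_order(customer_id, amount):
    # write path stays normalized and simple
    orders.append({"customer_id": customer_id, "amount": amount})
    # project the write into the read model (a materialized view or
    # stream consumer would do this step in a real system)
    totals_by_customer[customer_id] += amount

record_order("c1", 30.0)
record_order("c1", 12.5)
record_order("c2", 5.0)
```

The read side answers "total per customer" without a join or scan, at the cost of keeping the projection in sync.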
3) Evolution and schema
- “Schemaless” is actually schema-on-read. You still need contracts; they just move.
- In fast-changing domains, prefer document models for rapid iteration; codify validation at the edges.
- When the domain stabilizes, migrate critical paths to relational for constraints and rich querying.
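The "contracts just move" point can be sketched as a reader that tolerates two document shapes at read time; the field names here are illustrative, not from the text:

```python
# Schema-on-read: the contract lives in reader code, not the store.
# Old documents look like {"name": ...}; new ones split the field into
# {"first_name": ..., "last_name": ...}. The reader handles both.

def read_user(doc):
    if "first_name" in doc:                      # new shape
        name = f"{doc['first_name']} {doc['last_name']}"
    else:                                        # legacy shape
        name = doc.get("name", "")
    return {"name": name, "email": doc.get("email")}

old = {"name": "Ada Lovelace", "email": "ada@example.com"}
new = {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com"}
assert read_user(old) == read_user(new)
```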
4) Performance levers before you switch stores
- Indexes (B-Tree/GIN) tuned to queries before changing models.
- Materialized views for heavy aggregations; schedule refreshes aligned to business needs.
- Caching layers for hot reads; invalidate on writes using domain keys.
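A minimal sketch of domain-key invalidation, with a plain dict standing in for the cache (Redis, memcached) and a hypothetical `load_order` standing in for the database read:

```python
cache = {}

def load_order(order_id):
    # stand-in for the real DB query
    return {"id": order_id, "status": "open"}

def get_order(order_id):
    key = f"order:{order_id}"        # key by domain id, not by query text
    if key not in cache:
        cache[key] = load_order(order_id)
    return cache[key]

def update_order(order_id, **fields):
    # write to the DB here, then invalidate using the same domain key,
    # so the next read repopulates from the source of truth
    cache.pop(f"order:{order_id}", None)
```

Keying by domain id rather than query shape is what makes write-time invalidation tractable: every code path that touches an order knows which key to evict.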
5) Scaling patterns and when to use them
- Vertical scale until it hurts. Simpler ops, fewer moving parts.
- Replication for read scaling; mind consistency choices (sync → latency; async → staleness).
- Sharding only when necessary; it’s the last resort due to operational and query complexity.
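One common way to live with async-replication staleness is session-level read routing ("read your own writes"): route a session to the leader for a window after it writes, and to followers otherwise. This sketch assumes a 5-second upper bound on replication lag, which is an assumption, not a universal constant:

```python
import time

LAG_BOUND_S = 5.0      # assumed upper bound on replica lag
last_write_at = {}     # session_id -> timestamp of that session's last write

def record_write(session_id):
    last_write_at[session_id] = time.monotonic()

def choose_target(session_id):
    wrote = last_write_at.get(session_id)
    if wrote is not None and time.monotonic() - wrote < LAG_BOUND_S:
        return "leader"      # guarantees the session sees its own write
    return "follower"        # staleness-tolerant read path
```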
6) Migration playbook
- Keep write path stable; introduce parallel read models first.
- Dual-write with strong observability; verify parity before cutover.
- Backfills as idempotent jobs; throttle and checkpoint.
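Putting those three bullets together, a backfill loop might look like this sketch, where `fetch_batch`, `upsert`, and `save_checkpoint` are hypothetical adapters supplied by the caller:

```python
import time

def run_backfill(fetch_batch, upsert, save_checkpoint,
                 start_after=0, batch_size=500, pause_s=0.0):
    """Process rows in id order; safe to kill and restart from the checkpoint."""
    last_id = start_after
    while True:
        rows = fetch_batch(after_id=last_id, limit=batch_size)
        if not rows:
            return last_id
        for row in rows:
            upsert(row)                # idempotent: rerunning a batch is harmless
        last_id = rows[-1]["id"]
        save_checkpoint(last_id)       # resume point if the job dies mid-run
        time.sleep(pause_s)            # throttle to protect the primary
```

Because `upsert` is idempotent and the checkpoint is the high-water mark, a crash between batches replays at most one batch on restart.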
7) A staff engineer’s rubric
- Does the chosen model minimize cross-aggregate transactions for the hot paths?
- Can we enforce the key invariants natively (constraints, unique indexes)? If not, what guardrails replace them?
- Is there a clear evolution path (validation, backfills, dual reads) for the next 6–12 months?
- Are we reaching for sharding too early? Can indexing/denormalization buy us time?
TL;DR
Model around access patterns and product velocity. Use document for locality and speed of change; relational for constraints and complex queries; graph when relationships are first-class. Delay sharding; invest in indexes, materialized views, and caches first.
Technical appendix
A. B-Tree vs LSM-Tree
- B-Tree: O(log n) point/range reads, in-place page updates, good read latency; writes are random I/O that rewrite whole pages (moderate write amplification).
- LSM-Tree: sequential appends to immutable segments plus compaction; excellent write throughput at the cost of higher read amplification; a good fit for write-heavy workloads and SSDs.
Compaction strategies:
- Leveled (RocksDB's default): higher write amplification but lower read amplification; predictable space usage.
- Tiered: fewer, larger compactions, so lower write amplification but higher space (and read) amplification.
Bloom filters reduce disk lookups for absent keys; false positive rate ≈ (1 − e^(−kn/m))^k, where m = bits, n = items, k = hash functions.
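Plugging numbers into that formula for a 1 MiB filter holding a million items (using the standard optimal hash count k ≈ (m/n)·ln 2, an assumption not stated above):

```python
import math

def bloom_fpr(m_bits, n_items, k_hashes):
    # false positive rate ≈ (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

m = 8 * 2**20                    # 1 MiB filter = 8,388,608 bits
n = 1_000_000                    # items stored
k = round(m / n * math.log(2))   # optimal hash count for this m/n ratio -> 6
fpr = bloom_fpr(m, n, k)         # roughly 2% for this sizing
```

So for write-heavy LSM workloads, about 1 byte of filter per key keeps most negative lookups off disk.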
B. Index choices and examples (PostgreSQL)
-- JSONB containment and GIN
create index idx_user_profile_gin on users using gin (profile jsonb_path_ops);
-- Composite B-Tree to cover query and order
create index idx_orders_customer_created on orders (customer_id, created_at desc);
-- Partial index for hot subset
create index idx_orders_open on orders (status) where status = 'open';

C. Materialized views and refresh
create materialized view mv_daily_sales as
select date_trunc('day', created_at) as day,
sum(amount) as gross
from payments
group by 1;
-- Concurrent refresh keeps the view readable during refresh (requires a unique
-- index on the view); pair with an incremental staging table if needed
refresh materialized view concurrently mv_daily_sales;
For near-real-time aggregates, prefer a streaming read model (Debezium → Kafka → consumer → table) over frequent full refreshes.
D. Replication and consistency knobs
- Synchronous replication (in PostgreSQL, synchronous_standby_names with synchronous_commit = on) trades commit latency for durability across replicas.
- Async replication yields lower latency with potential staleness on failover.
- Reads: choose follower reads for staleness-tolerant paths; leader reads where monotonicity is required.
E. Sharding notes
- Consistent hashing over N virtual nodes per physical node smooths rebalancing; prefer power-of-two shard counts.
- Co-locate hot aggregates; avoid fan-out cross-shard transactions.
import bisect
import zlib

class ConsistentHash:
    def __init__(self, nodes, vnodes=256):
        # each physical node gets `vnodes` positions on a 32-bit ring;
        # crc32 is stable across processes, unlike Python's salted built-in hash()
        self.ring = []
        for n in nodes:
            for v in range(vnodes):
                self.ring.append((zlib.crc32(f"{n}-{v}".encode()), n))
        self.ring.sort()

    def pick(self, key):
        # walk clockwise to the first vnode at or after the key's position
        h = zlib.crc32(str(key).encode())
        i = bisect.bisect_left(self.ring, (h,))
        return self.ring[i % len(self.ring)][1]