The model is not the product. The system and workflows around it are.

This playbook distills practical patterns from building LLM products in production. It complements notes in Practical Guide To Building LLM Products, Shortwave’s RAG Pipeline, and Your AI Product Needs Evals.

1) System blueprint

  • Orchestrator: Controls flows (chains/DAGs/state machines), retries, fallbacks, timeouts.
  • Retrieval & tools: Hybrid search (keyword + vector), domain adapters, structured connectors. Tool selection should be explicit and observable.
  • Guardrails: Content filters, schema validators, business invariants.
  • Caching: Content-hash and semantic caches to cut latency/cost. Layer before the model.
  • Evaluations: Task-specific offline evals + online checks. Prevent regressions; guide model/flow choices.
  • Data flywheel: Collect inputs, outputs, corrections, and failure cases for continuous improvement.

Minimal viable architecture

Start with a simple chain and a keyword baseline. Add vector/hybrid search only when it beats the baseline. Add tools incrementally; keep flows observable end-to-end.
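
One possible shape for that starting point, with search and generate injected as stand-ins for your keyword index and model client:

type Doc = { id: string; text: string }
type Search = (query: string, k: number) => Promise<Doc[]>
type Generate = (prompt: string) => Promise<string>

// Minimal linear chain: keyword retrieval feeding one generation step.
async function answer(query: string, search: Search, generate: Generate): Promise<string> {
  const docs = await search(query, 5) // tuned keyword baseline first
  const context = docs.map((d) => `[${d.id}] ${d.text}`).join("\n")
  const prompt = `Answer using only the context below.\n\n${context}\n\nQ: ${query}`
  return generate(prompt) // one observable step; no DAG until metrics demand it
}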

2) Prompt and context craft

  • Single-responsibility prompts: Prefer small prompts that do one thing well over “god prompts”.
  • Structured I/O: Match model preferences (e.g., XML for Claude; Markdown/JSON for GPT). Validate outputs strictly.
  • Context curation: Curate, don’t dump. Allocate tokens intentionally across sources; prefer high information density.
  • Heuristics for context allocation: Cap per-source tokens; down-rank redundant/low-signal chunks; dedupe aggressively (sketch after this list).
  • Diversity beyond temperature: Shuffle candidate lists; enforce novelty windows to reduce repetition.
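
A greedy packing sketch for those heuristics; token counts and relevance scores are assumed to come from your own tokenizer and retriever:

type Chunk = { source: string; text: string; score: number; tokens: number }

// Greedy context packing: highest-score first, exact-dup removal, per-source caps.
function packContext(chunks: Chunk[], budget: number, perSourceCap: number): Chunk[] {
  const seen = new Set<string>()
  const perSource = new Map<string, number>()
  const picked: Chunk[] = []
  let used = 0
  for (const c of [...chunks].sort((a, b) => b.score - a.score)) {
    const key = c.text.trim().toLowerCase()
    if (seen.has(key)) continue // dedupe exact repeats
    const sourceUsed = perSource.get(c.source) ?? 0
    if (sourceUsed + c.tokens > perSourceCap) continue // cap per source
    if (used + c.tokens > budget) continue // keep trying smaller chunks
    seen.add(key)
    perSource.set(c.source, sourceUsed + c.tokens)
    picked.push(c)
    used += c.tokens
  }
  return picked
}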

3) Retrieval-Augmented Generation (RAG)

  • Treat retrieval as its own product surface. Measure it.
  • Metrics: MRR/Recall@k for retrieval; exact-match/F1/task metrics for generation; time-to-first-token and P50/P95 latency for UX (retrieval-metric sketch after this list).
  • Hybrid search: Always compare against a tuned keyword baseline before/after deploying embeddings.
  • Index quality: Prefer chunking that preserves semantics over mechanical windows.
  • Cost control: Collapse near-duplicate results; cache cross-encoders; batch reranking.
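
A minimal sketch of the two retrieval metrics, assuming per-query relevance labels:

type Labeled = { rankedIds: string[]; relevantIds: Set<string> }

// Mean reciprocal rank of the first relevant result per query.
function mrr(queries: Labeled[]): number {
  const sum = queries.reduce((acc, q) => {
    const i = q.rankedIds.findIndex((id) => q.relevantIds.has(id))
    return acc + (i === -1 ? 0 : 1 / (i + 1))
  }, 0)
  return sum / queries.length
}

// Fraction of queries with at least one relevant doc in the top k.
function recallAtK(queries: Labeled[], k: number): number {
  const hits = queries.filter((q) =>
    q.rankedIds.slice(0, k).some((id) => q.relevantIds.has(id)),
  ).length
  return hits / queries.length
}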

See: Shortwave’s RAG Pipeline.

4) Workflow patterns that scale

  • Explicit planning step: Ask the model to plan before acting; keep the plan auditable.
  • Rewrite user → agent prompts: Useful, but lossy. Preserve critical constraints explicitly.
  • Chains vs DAGs vs state machines: Start linear. Move to DAGs/state only when observability and reuse justify the complexity.
  • Fallbacks: Smaller/faster models for easy cases; heavyweight flows for hard ones. Route by difficulty.
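
A sketch of that escalation pattern; tiers are ordered cheap to heavy, and acceptable encodes your confidence and schema checks:

type Tier = { name: string; run: (input: string) => Promise<string> }

// Try the cheapest tier first; promote on failure signals.
async function routeByDifficulty(
  input: string,
  tiers: Tier[], // ordered cheap → heavy
  acceptable: (output: string) => boolean,
): Promise<string> {
  let lastOutput = ""
  for (const tier of tiers) {
    lastOutput = await tier.run(input)
    if (acceptable(lastOutput)) return lastOutput // stop at the cheapest pass
  }
  return lastOutput // heaviest tier's answer, even if imperfect
}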

5) Caching: the underrated lever

  • Deterministic cache: Content hash of inputs (including retrieved context) → output.
  • Semantic cache: Embedding similarity for “close enough” requests where exactness isn’t required.
  • Partial caching: Cache tool responses (e.g., search results) separately from generation.
  • Expiration policy: Data-source aware TTLs; invalidate on underlying content changes.
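
A sketch combining the deterministic and expiration ideas; the prefix invalidation assumes keys embed a source id, which is an illustrative convention:

type Entry = { value: string; expiresAt: number }

// In-memory deterministic cache with per-source TTLs and explicit invalidation.
class DeterministicCache {
  private store = new Map<string, Entry>()
  constructor(private ttlMsBySource: Map<string, number>) {}

  get(key: string): string | undefined {
    const e = this.store.get(key)
    if (!e || Date.now() > e.expiresAt) return undefined
    return e.value
  }

  set(key: string, source: string, value: string): void {
    const ttl = this.ttlMsBySource.get(source) ?? 60_000 // default 1 min
    this.store.set(key, { value, expiresAt: Date.now() + ttl })
  }

  // Invalidate when underlying content changes (key prefix = source id).
  invalidatePrefix(prefix: string): void {
    for (const k of this.store.keys()) if (k.startsWith(prefix)) this.store.delete(k)
  }
}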

6) Evaluations as the control plane

  • Fight dev–prod skew: Maintain holdouts and “vibe check” sets representative of production.
  • What to measure:
    • Task success (precision/recall/F1), safety violations, adherence to schema, abstention rate.
    • Retrieval MRR/Recall@k; end-to-end P50/P95/P99 latencies; cost per successful task.
  • Workflow-level evals: Don’t only score the final output—score each stage (retrieval, plan quality, tool selection, guardrails).
  • Regression discipline: Ship with canaries; gate releases on eval deltas, not vibes.
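
A sketch of gating on eval deltas; metric names and tolerances are illustrative:

type Metrics = Record<string, number> // e.g., { taskF1: 0.81, recallAt5: 0.74 }

// Block the release if any tracked metric regresses beyond its tolerance.
function passesGate(
  baseline: Metrics,
  candidate: Metrics,
  tolerance: Record<string, number>, // allowed absolute drop per metric
): { pass: boolean; regressions: string[] } {
  const regressions = Object.keys(baseline).filter(
    (m) => baseline[m] - (candidate[m] ?? 0) > (tolerance[m] ?? 0),
  )
  return { pass: regressions.length === 0, regressions }
}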

See: Your AI Product Needs Evals and Practical Guide To Building LLM Products.

7) Model strategy: pragmatism wins

  • Pick the smallest model that clears the bar. Use flow-engineering to compensate before upgrading size.
  • When to fine-tune: Only after proving foundation models are insufficient for your specific data/task.
  • Latency budgets: Be explicit about the P95 budget per step; route by budget (sketch after this list).
  • Self-hosting vs vendor APIs: Self-host for constraints (privacy, rate limits, cost predictability). Vendor APIs for speed and breadth.
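
A sketch of budget-aware model selection, assuming measured P95 profiles per model:

type ModelProfile = { name: string; p95Ms: number; quality: number }

// Pick the highest-quality model whose P95 fits the remaining budget.
function pickModel(profiles: ModelProfile[], remainingBudgetMs: number): ModelProfile | undefined {
  return profiles
    .filter((m) => m.p95Ms <= remainingBudgetMs)
    .sort((a, b) => b.quality - a.quality)[0]
}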

8) Operational guardrails

  • Contracts: Strong schemas for inputs/outputs; reject/repair on violation.
  • Observability: Trace every step (prompt, retrieved docs, decisions, outputs). Log redactions for privacy.
  • Safety: Pre/post filters; human-in-the-loop for medium/low confidence.
  • Reliability: Timeouts, retries with idempotency keys; circuit breakers; golden paths for critical intents.

9) Staff-level checklist

  • Does a tuned keyword baseline beat current retrieval on MRR/Recall@k?
  • Are prompts single-purpose with strict schemas and validation?
  • Is there a deterministic and a semantic cache in front of the model?
  • Do offline evals predict online success for the top 5 intents?
  • Is routing in place to match model/flow to difficulty and latency budget?
  • Are cost and latency tracked per-stage and per-intent at P50/P95/P99?
  • Can we explain any output via traces within 2 minutes?

10) Anti-patterns to avoid

  • One mega-prompt handling every intent.
  • Vector search without a keyword baseline.
  • “Exactly-once” illusions without idempotency.
  • Unbounded context growth and unlogged prompt drift.
  • No holdout data; shipping based on demos.

TL;DR

Start with a boring, robust system: baseline retrieval, single-responsibility prompts, strict schemas, caches, and evals. Scale complexity only when metrics demand it. The workflows, data, and evaluation harness are the durable moat—models will change.


Technical appendix

A. Orchestrator skeleton (timeouts, retries, circuit breaker)

type StepResult<T> = { ok: true; value: T } | { ok: false; error: Error }
 
// Race a promise against a timer; always clear the timer to avoid leaks.
async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let t: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>(
    (_, rej) => (t = setTimeout(() => rej(new Error("timeout")), ms)),
  )
  try {
    return await Promise.race([p, timeout])
  } finally {
    clearTimeout(t)
  }
}
 
// Trips open after `threshold` consecutive failures; probes again after a cool-down.
class CircuitBreaker {
  private consecutiveFailures = 0
  private state: "closed" | "open" | "half-open" = "closed"
  constructor(
    private readonly threshold = 5,
    private readonly coolDownMs = 2000,
  ) {}
  async exec<T>(fn: () => Promise<T>): Promise<StepResult<T>> {
    // While open, fail fast instead of hammering a struggling dependency.
    if (this.state === "open") return { ok: false, error: new Error("circuit-open") }
    try {
      const res = await fn()
      this.consecutiveFailures = 0
      this.state = "closed" // success closes the circuit, including half-open probes
      return { ok: true, value: res }
    } catch (e) {
      this.consecutiveFailures++
      if (this.consecutiveFailures >= this.threshold) {
        this.state = "open"
        // After the cool-down, let a single probe through (half-open).
        setTimeout(() => (this.state = "half-open"), this.coolDownMs)
      }
      return { ok: false, error: e as Error }
    }
  }
}
 
// Retry with exponential backoff plus jitter; rethrow the last error.
async function retry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let last: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn()
    } catch (e) {
      last = e
      if (i < attempts - 1) {
        // Don't sleep after the final failed attempt.
        await new Promise((r) => setTimeout(r, 2 ** i * 100 + Math.random() * 50))
      }
    }
  }
  throw last
}

B. Strict output contracts

Define schemas and hard-validate. Example with JSON Schema and a tolerant repair pass:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Answer",
  "type": "object",
  "required": ["answer", "citations"],
  "properties": {
    "answer": { "type": "string", "minLength": 1 },
    "citations": {
      "type": "array",
      "items": { "type": "string", "format": "uri" },
      "maxItems": 5
    },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
  },
  "additionalProperties": false
}

Repair strategy: attempt parse → schema validate → on failure, ask the model to repair with the schema and validation errors included → validate again → hard-fail if still invalid.
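
A sketch of that loop with validate and repair injected (a JSON Schema validator such as Ajv can back validate; repair prompts the model with the schema and errors):

type Validator = (data: unknown) => { valid: boolean; errors: string[] }
type Repair = (raw: string, schema: string, errors: string[]) => Promise<string>

// Parse → validate → one model-assisted repair → validate → hard-fail.
async function parseWithRepair(
  raw: string,
  schema: string,
  validate: Validator,
  repair: Repair,
): Promise<unknown> {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      const data = JSON.parse(raw)
      const { valid, errors } = validate(data)
      if (valid) return data
      if (attempt === 0) raw = await repair(raw, schema, errors)
    } catch {
      if (attempt === 0) raw = await repair(raw, schema, ["invalid JSON"])
    }
  }
  throw new Error("schema violation after repair attempt")
}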

C. Retrieval and ranking math

  • Mean Reciprocal Rank: MRR = (1/|Q|) · Σ_{q ∈ Q} 1/rank_q, where rank_q is the position of the first relevant document for query q.
  • Recall@k: fraction of queries where any relevant doc appears in the top k.
  • Reranking with a cross-encoder: score s(q, d) for each candidate d; keep the top k′ by score.

Chunking heuristic: maximize mutual information within chunks; avoid splitting across discourse boundaries (headings, bullets).
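
One coarse way to respect those boundaries, assuming markdown-ish input; oversized blocks pass through and would need a further split:

// Split on discourse boundaries (headings, bullets) before falling back to size.
function chunkByStructure(text: string, maxChars: number): string[] {
  const blocks = text.split(/\n(?=#{1,6} |[-•] )/) // start a block at each heading/bullet
  const chunks: string[] = []
  let current = ""
  for (const block of blocks) {
    if (current && current.length + block.length > maxChars) {
      chunks.push(current)
      current = block
    } else {
      current = current ? `${current}\n${block}` : block
    }
  }
  if (current) chunks.push(current)
  return chunks
}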

D. Semantic and deterministic caches

Deterministic key: sha256(modelId || temperature || userPrompt || retrievedDocIds || toolResultsHash).
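
A sketch of that key using Node's crypto; the field list mirrors the formula above:

import { createHash } from "node:crypto"

// Deterministic cache key: hash every input that can change the output.
function cacheKey(k: {
  modelId: string
  temperature: number
  userPrompt: string
  retrievedDocIds: string[]
  toolResultsHash: string
}): string {
  const parts = [
    k.modelId,
    String(k.temperature),
    k.userPrompt,
    k.retrievedDocIds.join(","),
    k.toolResultsHash,
  ]
  // ASCII unit separator avoids collisions from naive concatenation.
  return createHash("sha256").update(parts.join("\u001F")).digest("hex")
}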

Semantic acceptance: if cosine(emb(x), emb(y)) ≥ τ, reuse cached result for x when serving y.

// Cosine similarity between embedding vectors; returns 0 for degenerate inputs.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((acc, v, i) => acc + v * b[i], 0)
  const na = Math.hypot(...a)
  const nb = Math.hypot(...b)
  return na && nb ? dot / (na * nb) : 0
}

Pick τ via offline evals to trade precision vs reuse rate. Use ANN (e.g., HNSW) for sublinear lookups.

E. Routing by difficulty and budgets

  • Predict difficulty via retrieval density, the entropy of a cheap draft, or a learned classifier.
  • Route easy cases to small model; promote to heavy flow on low confidence or schema violations.
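
A sketch tying both bullets together; the density threshold is illustrative and should be tuned offline:

type Runner = (input: string, docs: string[]) => Promise<{ text: string; confident: boolean }>

// Density-based router: cheap model when retrieval looks strong, else heavy flow.
async function route(
  input: string,
  scored: { score: number; text: string }[], // retrieval scores assumed in [0, 1]
  small: Runner,
  heavy: Runner,
  densityThreshold = 0.6, // illustrative; tune offline
): Promise<string> {
  const density = scored.length
    ? scored.reduce((s, d) => s + d.score, 0) / scored.length
    : 0
  const texts = scored.map((d) => d.text)
  if (density >= densityThreshold) {
    const res = await small(input, texts)
    if (res.confident) return res.text // cheap path handled it
  }
  return (await heavy(input, texts)).text // promote hard or low-confidence cases
}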