Info

These are notes from the blog post - What We’ve Learned From A Year of Building with LLMs - plus personal opinions on the concepts discussed.

Tactical: Nuts & Bolts of Working with LLMs

Prompting

  • The right prompting techniques, used correctly, can get us very far!
  • Focus on fundamental prompting techniques: n-shot prompting + in-context learning, chain-of-thought, and providing relevant resources.
    • N-shot prompting - Make sure the examples are representative of the production distribution
  • Structured inputs and outputs - Be aware that each LLM family has its own preferences (see the sketch after this list)
    • Claude - XML
    • GPT - Markdown and JSON
  • Have small prompts that do one thing, and only one thing, well
    • A common anti-pattern in software is the “God Object”: a single class or function that does everything. The same applies to prompts.
  • Crafting the context - Structure the context to underscore the relationships between its parts and make extraction as simple as possible.
    • Rule of thumb: check whether the context is structured in a way a human could easily understand.
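
A minimal sketch of the prompting points above: the same (hypothetical) few-shot examples rendered as XML tags for Claude-family models and as Markdown + JSON for GPT-family models. The helper names and the sentiment task are illustrative assumptions, not from the original post.

```python
import json

# Hypothetical few-shot examples; keep them representative of the production distribution.
EXAMPLES = [
    {"review": "Battery died after two days.", "sentiment": "negative"},
    {"review": "Exactly what I needed, great value.", "sentiment": "positive"},
]

def build_claude_prompt(review: str) -> str:
    """XML-tagged prompt, the structure Claude-family models tend to follow well."""
    shots = "\n".join(
        f"<example>\n  <review>{e['review']}</review>\n  <sentiment>{e['sentiment']}</sentiment>\n</example>"
        for e in EXAMPLES
    )
    return (
        "Classify the sentiment of the review below.\n"
        f"<examples>\n{shots}\n</examples>\n"
        f"<review>{review}</review>\n"
        "Reply with a single <sentiment> tag."
    )

def build_gpt_prompt(review: str) -> str:
    """Markdown + JSON prompt, the structure GPT-family models tend to follow well."""
    shots = "\n".join(json.dumps(e) for e in EXAMPLES)
    payload = json.dumps({"review": review})
    return (
        'Classify the sentiment of the review. Respond with JSON: {"sentiment": "..."}\n\n'
        f"### Examples\n{shots}\n\n"
        f"### Input\n{payload}"
    )
```

Either prompt can then be sent to the corresponding model; the point is that the examples and the expected output format are explicit and easy to parse.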

Retrieval Augmented Generation (RAG)

  • Only as good as the retrieved documents’ relevance and detail
    • Mean Reciprocal Rank (MRR) - evaluates how well a system places the first relevant result in the ranked list (a small worked example follows this list)
    • Prefer documents with high information density
  • Don’t forget keyword search - use it as a baseline and as a component of hybrid search
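
As a worked example of MRR, a small self-contained sketch; the document IDs and relevance judgements are made up for illustration.

```python
def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """MRR = mean over queries of 1 / rank of the first relevant document (0 if none is retrieved)."""
    reciprocal_ranks = []
    for ranking, relevant in zip(ranked_lists, relevant_ids):
        rr = 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Query 1: first relevant doc at rank 2; query 2: at rank 1 -> MRR = (1/2 + 1/1) / 2 = 0.75
print(mean_reciprocal_rank([["d3", "d7", "d1"], ["d2", "d9"]], [{"d7"}, {"d2"}]))
```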

Tuning and Optimising Workflows

Quote

It’s not required that every agent prompt requests structured output, but structured outputs help a lot to interface with whatever system is orchestrating the agent’s interactions with the environment

  • Step-by-step, multi-turn “flows” can give large boosts. A few things to try:
    • A tightly-specified, explicit planning step
    • Rewriting original user prompts into agent prompts (beware: it is a lossy process)
    • Agent behaviours as linear chains, DAGs, and state machines
      • Avoid over-complicating the workflow; simple chains often provide 90% of the results.
  • Getting more diverse outputs beyond temperature
    • Increasing the temperature does not guarantee that the LLM will sample outputs from the probability distribution you expect (e.g. uniformly at random). Suggestions:
      • In case of lists: Shuffle the order of items
      • Keep a short list of recent outputs to prevent redundancy
  • Caching is underrated
    • Use a hash of the content as its ID; for example, when summarising a piece of content, use the content’s hash as the cache key (see the sketch after this list)
    • Semantic similarity lookups are an alternative to conventional exact-match caching for retrieving cached results
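
A minimal sketch of the hash-keyed caching idea, assuming a hypothetical `summarise_with_llm` callable; identical content hashes to the same key, so the model is only called on cache misses.

```python
import hashlib

cache: dict[str, str] = {}  # in production this would typically be Redis, a DB table, etc.

def cached_summarise(content: str, summarise_with_llm) -> str:
    # The content's hash is the cache key: the same document never hits the LLM twice.
    key = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = summarise_with_llm(content)  # LLM call only on a cache miss
    return cache[key]
```

A semantic cache would instead embed the new request and look for a sufficiently similar cached one, trading exact matches for a higher hit rate.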

Operational: Day-to-day and Org concerns

Data and evaluation

  • Check for development-prod skew. This can be categorised into two types (a sketch for catching the structural case follows this list):
    • Structural - formatting discrepancies, such as a JSON dictionary where a list-type value is expected, etc.
    • Content-based - Differences in meaning or context of the data
  • Choose the smallest model that gets the job done
  • In the long term, we expect to see more examples of flow-engineering with smaller models as the optimal balance of output quality, latency, and cost.
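
A rough sketch of catching structural skew, comparing the shape (key sets and value types) of a development output against a production output; the sample payloads are hypothetical.

```python
def shape(value):
    """Reduce a parsed JSON value to its structural signature (keys and value types only)."""
    if isinstance(value, dict):
        return {k: shape(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return [shape(value[0])] if value else []
    return type(value).__name__

dev_output = {"tags": ["refund", "billing"], "confidence": 0.9}
prod_output = {"tags": "refund", "confidence": "high"}  # list became a string, float became a string

if shape(dev_output) != shape(prod_output):
    print("Structural skew:", shape(dev_output), "vs", shape(prod_output))
```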

Early detection

  • Clustering embeddings of inputs/outputs to detect semantic drift (see the sketch after this list)
  • For prompt engineering, always have hold-out datasets and “vibe checks” to confirm outputs align with expectations.
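
A rough sketch of the embedding-clustering idea, assuming embeddings have already been computed for a reference window and a recent window of traffic (uses scikit-learn; the cluster count and alert threshold are assumptions to tune per system).

```python
import numpy as np
from sklearn.cluster import KMeans

def drift_score(reference: np.ndarray, recent: np.ndarray, k: int = 8) -> float:
    """Fit clusters on reference traffic, then compare how far each window sits from the centroids."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(reference)
    ref_dist = km.transform(reference).min(axis=1).mean()  # typical distance to the nearest centroid
    new_dist = km.transform(recent).min(axis=1).mean()
    return new_dist / ref_dist  # ~1.0 looks like the reference distribution; much higher suggests drift

# e.g. alert when drift_score(reference_embeddings, recent_embeddings) > 1.5
```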

Product

  • Always design your UX for Human-In-The-Loop; quality annotations are integral to improving LLM systems

Strategy: Building with LLMs without Getting Out-Manoeuvred

Quote

As exciting as it is and as much as it seems like everyone else is doing it, developing and maintaining machine learning infrastructure takes a lot of resources. This includes gathering data, training and evaluating models, and deploying them. If you’re still validating product-market fit, these efforts will divert resources from developing your core product. Even if you had the compute, data, and technical chops, the pretrained LLM may become obsolete in months.

  • Training from scratch (almost) never makes sense

  • Do not fine-tune until you’ve proven it’s necessary. Only do it if:

    • The use case requires data not available in the mostly-open web-scale datasets used to train existing models
    • You’ve already built an MVP that demonstrates the existing models are insufficient
  • Do not be afraid of self-hosting; it circumvents limitations imposed by inference providers

  • The model is not the product; the system and workflows around it are. Focus your effort on things that provide lasting value, such as:

    • Evals: To reliably measure performance on your task across models
    • Guardrails: To prevent undesired outputs no matter the model
    • Caching: To reduce latency and cost by avoiding the model altogether
    • Data flywheel: To power the iterative improvement of everything above

Quote

There’s a large class of problems that are easy to imagine and build demos for, but extremely hard to make products out of. For example, self-driving: It’s easy to demo a car self-driving around a block; making it into a product takes a decade. - Andrej Karpathy