Info
These are notes from the blog post "What We've Learned From a Year of Building with LLMs", plus personal opinions on the discussed concepts.
Tactical: Nuts & Bolts of Working with LLMs
Prompting
- The right prompting techniques, used correctly, can get us very far!
- Focus on fundamental prompting techniques: n-shot prompting + in-context learning, chain-of-thought, and providing relevant resources.
- N-shot prompting - Make sure the examples are representative of the production distribution
- Structured inputs and outputs - Be aware that each LLM family has their own preferences
- Claude - XML
- GPT - Markdown and JSON
- Have small prompts that do one thing, and only one thing, well
- A common anti-pattern in software is the "God Object": a single class or function that does everything. The same applies to prompts.
- Crafting the context - Structure the context to underscore the relationships between parts of it and make extraction as simple as possible.
- Rule of thumb: Check if the context is structured in a way for a human to understand.
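The formatting preferences above can be sketched as a small helper. This is a minimal illustration, not any library's API; the function name and document-wrapping conventions are assumptions:

```python
# Illustrative sketch: tailor prompt structure to the model family.
# Claude-family models tend to parse XML-tagged context well, while
# GPT-family models handle Markdown (or JSON) comfortably.

def build_prompt(question: str, documents: list[str], family: str) -> str:
    """Wrap retrieved documents in the structure the model family prefers."""
    if family == "claude":
        docs = "\n".join(f"<doc>{d}</doc>" for d in documents)
        return f"<documents>\n{docs}\n</documents>\n\n{question}"
    # Default: Markdown headings for GPT-style models
    docs = "\n\n".join(f"### Document {i + 1}\n{d}"
                       for i, d in enumerate(documents))
    return f"{docs}\n\n**Question:** {question}"
```

Either way, the structure makes the boundaries between context and question explicit, which is the "easy for a human to parse" rule of thumb in code form.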
Retrieval Augmented Generation (RAG)
- Only as good as the retrieved documents’ relevance and detail
- Mean Reciprocal Rank (MRR) - evaluates how well a system places the 1st relevant result in the ranked list
- Always choose documents with high information density
- Don’t forget keyword search - use it as a baseline and in hybrid search
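The MRR metric mentioned above is simple to compute by hand. A minimal sketch (the function name is my own):

```python
def mean_reciprocal_rank(ranked_results: list[list[bool]]) -> float:
    """MRR over a set of queries. Each inner list marks, per rank
    position, whether that result is relevant; a query contributes
    1/rank of its first relevant result (0 if none is relevant)."""
    total = 0.0
    for results in ranked_results:
        for rank, is_relevant in enumerate(results, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Query 1: first relevant doc at rank 1; query 2: at rank 2
# -> (1/1 + 1/2) / 2 = 0.75
score = mean_reciprocal_rank([[True, False], [False, True]])
```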
Tuning and Optimising Workflows
Quote
It’s not required that every agent prompt requests structured output, but structured outputs help a lot to interface with whatever system is orchestrating the agent’s interactions with the environment
- Step-by-step, multi-turn "flows" can give large boosts. A few things to try:
- A tightly-specified, explicit planning step
- Rewriting original user prompts into agent prompts (beware: this is a lossy process)
- Agent behaviours as linear chains, DAGs, and state machines
- Avoid over-complicating the workflow; Often simple chains provide 90% of the results.
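The "simple chain" case above can be sketched in a few lines; the step functions here are hypothetical stand-ins for LLM calls:

```python
# Minimal "linear chain" sketch: each step is a function that consumes
# the previous step's output. In a real system each step would call a
# model (e.g. plan -> rewrite -> answer).

from typing import Callable

def run_chain(user_input: str, steps: list[Callable[[str], str]]) -> str:
    state = user_input
    for step in steps:
        state = step(state)
    return state
```

DAGs and state machines generalise this when steps can branch or loop, but as the note says, a plain chain like this often gets you most of the way.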
- Getting more diverse outputs beyond temperature
- Increasing temperature does not guarantee that the LLM will sample outputs from the probability distribution you expect (e.g. uniform random)
Suggestions:
- In case of lists: Shuffle the order of items
- Keep a short list of recent outputs to prevent redundancy
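Both suggestions above can be combined in a small pre-processing step. A sketch with assumed function and parameter names:

```python
import random

def diversify(items: list[str], recent: list[str],
              keep_last: int = 5) -> list[str]:
    """Shuffle candidate items and drop any that appeared in the last
    few outputs, nudging the model toward less repetitive choices."""
    pool = [item for item in items if item not in recent[-keep_last:]]
    random.shuffle(pool)
    return pool
```

For example, when asking an LLM to pick a product to feature from a list, shuffling the list per request counters positional bias, and the recent-output filter prevents the same item being picked repeatedly.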
- Caching is underrated
- Use hash functions to create an ID for a given piece of content; for example, when summarising a document, use the hash of its content as the cache key
- Using semantic similarity to retrieve cached results is an alternative to conventional exact-match caching
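The content-hash caching idea can be sketched in a few lines; `summarise_cached` and its signature are my own invention for illustration:

```python
import hashlib

cache: dict[str, str] = {}

def content_key(text: str) -> str:
    """Deterministic cache key derived from the content itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def summarise_cached(text: str, summarise) -> str:
    """Only invoke the (expensive) summariser on a cache miss."""
    key = content_key(text)
    if key not in cache:
        cache[key] = summarise(text)  # e.g. an LLM call
    return cache[key]
```

Because the key is derived from the content, identical inputs always hit the cache, regardless of where the request came from.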
Operational: Day-to-day and Org concerns
Data and evaluation
- Check for development-prod skew. This can be categorised into two types:
- Structural - Formatting discrepancies, such as a field that is a JSON dictionary in development but a list in production
- Content-based - Differences in meaning or context of the data
- Choose the smallest model that gets the job done
- In the long term, we expect to see more examples of flow-engineering with smaller models as the optimal balance of output quality, latency, and cost.
Advanced detection
- Clustering embeddings of input/output to detect semantic drift
- As with prompt engineering, always have hold-out datasets and "vibe checks" to align with expectations.
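A structural skew check like the one described above can be automated cheaply. This is a minimal sketch (function name assumed), comparing a reference dev record against a production record:

```python
def check_structural_skew(expected: dict, actual: dict) -> list[str]:
    """Compare key presence and value types between a reference (dev)
    record and a production record; return mismatch descriptions."""
    issues = []
    for key, ref_value in expected.items():
        if key not in actual:
            issues.append(f"missing key: {key}")
        elif type(actual[key]) is not type(ref_value):
            issues.append(
                f"type mismatch on {key}: "
                f"{type(ref_value).__name__} vs {type(actual[key]).__name__}"
            )
    return issues
```

Content-based skew (meaning drift) is harder; that is where the embedding-clustering approach below comes in.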
Product
- Always design your UX for Human-In-The-Loop — Quality annotations are integral to improve LLM systems
Strategy: Building with LLMs without Getting Out-Manoeuvred
Quote
As exciting as it is and as much as it seems like everyone else is doing it, developing and maintaining machine learning infrastructure takes a lot of resources. This includes gathering data, training and evaluating models, and deploying them. If you’re still validating product-market fit, these efforts will divert resources from developing your core product. Even if you had the compute, data, and technical chops, the pretrained LLM may become obsolete in months.
- Training from scratch (almost) never makes sense
- Do not finetune until you've proven it's necessary. Only finetune if:
- The use case requires data not available in the mostly-open web-scale datasets used to train existing models
- You've already built an MVP that demonstrates the existing models are insufficient
- Do not be afraid of self-hosting; it circumvents limitations imposed by inference providers
- The model is not the product; the system and workflows around it are. Focus your efforts on things that provide lasting value, such as:
- Evals: To reliably measure performance on your task across models
- Guardrails: To prevent undesired outputs no matter the model
- Caching: To reduce latency and cost by avoiding the model altogether
- Data flywheel: To power the iterative improvement of everything above
Quote
There’s a large class of problems that are easy to imagine and build demos for, but extremely hard to make products out of. For example, self-driving: It’s easy to demo a car self-driving around a block; making it into a product takes a decade. - Andrej Karpathy