Info
These are notes from the blog post "What We've Learned From a Year of Building with LLMs", plus personal opinions on the discussed concepts.
Tactical: Nuts & Bolts of Working with LLMs
Prompting
- The right prompting techniques, used correctly, can get us very far!
- Focus on fundamental prompting techniques: n-shot prompting + in-context learning, chain-of-thought, and providing relevant resources.
- N-shot prompting - Make sure the examples are representative of the production distribution
- Structured inputs and outputs - Be aware that each LLM family has their own preferences
- Claude - XML
- GPT - Markdown and JSON
- Have small prompts that do one thing, and only one thing, well
- A common anti-pattern in software is the "God Object": a single class or function that does everything. The same applies to prompts.
- Crafting the context - Structure the context to underscore the relationships between parts of it and make extraction as simple as possible.
- Rule of thumb: Check if the context is structured in a way for a human to understand.
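The formatting preferences above can be sketched as a small helper. This is a minimal illustration, not any library's API; the function name and document-wrapping conventions are assumptions:

```python
# Illustrative sketch: tailor prompt structure to the model family.
# Claude-family models tend to parse XML-tagged context well, while
# GPT-family models handle Markdown (or JSON) comfortably.

def build_prompt(question: str, documents: list[str], family: str) -> str:
    """Wrap retrieved documents in the structure the model family prefers."""
    if family == "claude":
        docs = "\n".join(f"<doc>{d}</doc>" for d in documents)
        return f"<documents>\n{docs}\n</documents>\n\n{question}"
    # Default: Markdown headings for GPT-style models
    docs = "\n\n".join(f"### Document {i + 1}\n{d}"
                       for i, d in enumerate(documents))
    return f"{docs}\n\n**Question:** {question}"
```

Either way, the structure makes the boundaries between context and question explicit, which is the "easy for a human to parse" rule of thumb in code form.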
Retrieval Augmented Generation (RAG)
- Only as good as the retrieved documents’ relevance and detail
- Mean Reciprocal Rank (MRR) - evaluates how well a system places the 1st relevant result in the ranked list
- Always choose documents with high information density
- Don’t forget keyword search - use it as a baseline and in hybrid search
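The MRR metric mentioned above is simple to compute by hand. A minimal sketch (the function name is my own):

```python
def mean_reciprocal_rank(ranked_results: list[list[bool]]) -> float:
    """MRR over a set of queries. Each inner list marks, per rank
    position, whether that result is relevant; a query contributes
    1/rank of its first relevant result (0 if none is relevant)."""
    total = 0.0
    for results in ranked_results:
        for rank, is_relevant in enumerate(results, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Query 1: first relevant doc at rank 1; query 2: at rank 2
# -> (1/1 + 1/2) / 2 = 0.75
score = mean_reciprocal_rank([[True, False], [False, True]])
```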
Tuning and Optimising Workflows
Quote
It’s not required that every agent prompt requests structured output, but structured outputs help a lot to interface with whatever system is orchestrating the agent’s interactions with the environment
- Step-by-step, multi-turn "flows" can give large boosts. A few things to try:
- A tightly-specified, explicit planning step
- Rewriting original user prompts into agent prompts (beware: this is a lossy process)
- Agent behaviours as linear chains, DAGs, and state machines
- Avoid over-complicating the workflow; Often simple chains provide 90% of the results.
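The "simple chain" case above can be sketched in a few lines; the step functions here are hypothetical stand-ins for LLM calls:

```python
# Minimal "linear chain" sketch: each step is a function that consumes
# the previous step's output. In a real system each step would call a
# model (e.g. plan -> rewrite -> answer).

from typing import Callable

def run_chain(user_input: str, steps: list[Callable[[str], str]]) -> str:
    state = user_input
    for step in steps:
        state = step(state)
    return state
```

DAGs and state machines generalise this when steps can branch or loop, but as the note says, a plain chain like this often gets you most of the way.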
- Getting more diverse outputs beyond temperature
- Increasing temperature does not guarantee that the LLM will sample outputs from the probability distribution you expect (e.g. uniform random)
Suggestions:
- In case of lists: Shuffle the order of items
- Keep a short list of recent outputs to prevent redundancy
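Both suggestions above can be combined in a small pre-processing step. A sketch with assumed function and parameter names:

```python
import random

def diversify(items: list[str], recent: list[str],
              keep_last: int = 5) -> list[str]:
    """Shuffle candidate items and drop any that appeared in the last
    few outputs, nudging the model toward less repetitive choices."""
    pool = [item for item in items if item not in recent[-keep_last:]]
    random.shuffle(pool)
    return pool
```

For example, when asking an LLM to pick a product to feature from a list, shuffling the list per request counters positional bias, and the recent-output filter prevents the same item being picked repeatedly.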
- Caching is underrated
- Use hash functions to create an ID for a given piece of content; for example, when summarising a document, use the hash of its content as the cache key
- Using semantic similarity to retrieve cached results is an alternative to conventional exact-match caching
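The content-hash caching idea can be sketched in a few lines; `summarise_cached` and its signature are my own invention for illustration:

```python
import hashlib

cache: dict[str, str] = {}

def content_key(text: str) -> str:
    """Deterministic cache key derived from the content itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def summarise_cached(text: str, summarise) -> str:
    """Only invoke the (expensive) summariser on a cache miss."""
    key = content_key(text)
    if key not in cache:
        cache[key] = summarise(text)  # e.g. an LLM call
    return cache[key]
```

Because the key is derived from the content, identical inputs always hit the cache, regardless of where the request came from.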
Operational: Day-to-day and Org concerns
Data and evaluation
- Check for development-prod skew. This can be categorised into two types:
- Structural - Formatting discrepancies, such as a field that is a JSON dictionary in development but a list in production
- Content-based - Differences in meaning or context of the data
- Choose the smallest model that gets the job done
- In the long term, we expect to see more examples of flow-engineering with smaller models as the optimal balance of output quality, latency, and cost.
Advanced detection
- Clustering embeddings of input/output to detect semantic drift
- As with prompt engineering, always have hold-out datasets and "vibe checks" to align with expectations.
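A structural skew check like the one described above can be automated cheaply. This is a minimal sketch (function name assumed), comparing a reference dev record against a production record:

```python
def check_structural_skew(expected: dict, actual: dict) -> list[str]:
    """Compare key presence and value types between a reference (dev)
    record and a production record; return mismatch descriptions."""
    issues = []
    for key, ref_value in expected.items():
        if key not in actual:
            issues.append(f"missing key: {key}")
        elif type(actual[key]) is not type(ref_value):
            issues.append(
                f"type mismatch on {key}: "
                f"{type(ref_value).__name__} vs {type(actual[key]).__name__}"
            )
    return issues
```

Content-based skew (meaning drift) is harder; that is where the embedding-clustering approach below comes in.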
Product
- Always design your UX for Human-In-The-Loop — Quality annotations are integral to improve LLM systems
Strategy: Building with LLMs without Getting Out-Manoeuvred
Quote
As exciting as it is and as much as it seems like everyone else is doing it, developing and maintaining machine learning infrastructure takes a lot of resources. This includes gathering data, training and evaluating models, and deploying them. If you’re still validating product-market fit, these efforts will divert resources from developing your core product. Even if you had the compute, data, and technical chops, the pretrained LLM may become obsolete in months.
- Training from scratch (almost) never makes sense
- Do not finetune until you've proven it's necessary. Only finetune if:
- The use case requires data not available in the mostly-open web-scale datasets used to train existing models
- You've already built an MVP that demonstrates the existing models are insufficient
- Do not be afraid of self-hosting; it circumvents limitations imposed by inference providers
- The model is not the product; the system and workflows around it are. Focus your efforts on things that provide lasting value, such as:
- Evals: To reliably measure performance on your task across models
- Guardrails: To prevent undesired outputs no matter the model
- Caching: To reduce latency and cost by avoiding the model altogether
- Data flywheel: To power the iterative improvement of everything above
Quote
There’s a large class of problems that are easy to imagine and build demos for, but extremely hard to make products out of. For example, self-driving: It’s easy to demo a car self-driving around a block; making it into a product takes a decade. - Andrej Karpathy