Choosing a Model for Production
A structured framework for picking between frontier and open-source models based on cost, latency, and quality.
White papers, tutorials, documentation, podcasts, and quick guides - all in one place.
Longer-form, cited write-ups on applied AI topics.
From document ingestion to evaluation - a practitioner's view of building retrieval-augmented systems that actually work.
Token math, caching, batching, and the trade-offs that decide whether an AI feature pays for itself.
Step-by-step walk-throughs you can actually follow along with.
Build a minimal chat app that calls a hosted LLM, streams responses, and handles basic errors.
Ingest a folder of PDFs, chunk and embed them, and answer questions grounded in the source docs.
Adapt an open-weights model to your own task on a single GPU using LoRA/QLoRA.
Reference material - links to provider docs and internal cheat sheets.
Anthropic - Official documentation for the Claude API, tool use, and prompt caching.
OpenAI - Reference for GPT and o-series models, function calling, and the Assistants API.
Google - Multimodal usage, long context, and GCP integration.
Hugging Face - Transformers, datasets, PEFT, and TRL - the open-source stack.
Conversations on AI - hosted, featured, or recommended.
A short intro episode on what I'm hoping to build here and who it's for.
Working through a real model-selection decision in the open.
Prompting vs. RAG vs. fine-tuning - how to tell which one you actually need.
5-minute reads. One idea per guide.
Six patterns that reliably improve model output, with short before/after examples.
How to estimate tokens, plan for context limits, and avoid surprise bills.
Fixed-size, recursive, and semantic chunking - and how to choose.
The single highest-leverage thing you can do before shipping an LLM feature.