Large Language Models

What LLMs actually are, how they're trained, how they generate text, and how we evaluate them.

What is an LLM?

A large language model (LLM) is a neural network trained to predict the next token in a sequence of text. Scaled up with enough data, parameters, and compute, that simple objective produces models that can write code, summarize documents, reason through problems, and hold conversations. "Large" typically means billions to trillions of parameters.

Core concepts

Tokens

LLMs don't read letters or words directly. Text is split into tokens - subword chunks like "gonza", "lo", " mu". A rough rule of thumb: 1 token ≈ 4 characters of English, or ~¾ of a word. Pricing, context limits, and latency are all measured in tokens, not words.
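The ~4 characters/token heuristic above can be turned into a quick budgeting sketch. This is only an estimate, not a real tokenizer (which splits text with learned subword rules); the price parameter is a hypothetical rate, not any provider's actual pricing.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count via the ~4 characters-of-English-per-token rule."""
    return max(1, round(len(text) / 4))

def estimate_cost(text: str, usd_per_million_tokens: float) -> float:
    """Approximate prompt cost at a given (hypothetical) per-million-token price."""
    return estimate_tokens(text) / 1_000_000 * usd_per_million_tokens
```

For anything where the exact count matters (context limits, billing), use the model's own tokenizer instead of this heuristic.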

The transformer

Modern LLMs are built on the transformer architecture (Vaswani et al., 2017). Its key ingredient is self-attention: for each token, the model learns how much to "pay attention" to every other token in the context. Stacked transformer layers let the model build up increasingly abstract representations of the input.
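Self-attention can be sketched in a few lines: score each query against every key, turn the scores into weights with a softmax, and mix the value vectors by those weights. This toy version works on tiny plain-Python vectors; real implementations are batched matrix operations with learned projections, multiple heads, and a causal mask.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over toy vectors.

    For each query, compute similarity with every key, softmax the scores
    into attention weights, and return the weighted sum of the values.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out
```

The softmax is what makes attention "pay attention": positions with higher query-key similarity contribute more of their value vector to the output.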

Context window

The context window is the maximum number of tokens a model can consider at once - your prompt plus its reply plus any tool calls and documents. Modern models range from 8K to over 1M tokens. Longer isn't always better: cost, latency, and "lost in the middle" effects still matter.
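In practice the context window is a budget you manage. A minimal sketch, assuming a crude per-message token heuristic (a real tokenizer should be used for exact counts): drop the oldest messages until the conversation fits, keeping the most recent ones.

```python
def trim_to_context(messages, max_tokens, count_tokens=lambda m: len(m) // 4 + 1):
    """Drop oldest messages until the conversation fits the token budget.

    `count_tokens` is a stand-in heuristic; the most recent messages win.
    """
    kept, total = [], 0
    for msg in reversed(messages):  # newest first
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```

Real systems often summarize or retrieve dropped history instead of discarding it outright, but the budgeting logic is the same.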

Decoding & sampling

At inference time the model produces a probability distribution over the next token. A decoding strategy picks from it: greedy (argmax), top-k, top-p (nucleus), or temperature-scaled sampling. Low temperature gives more deterministic, focused output (temperature 0 is effectively greedy); higher temperature gives more variety.
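The strategies above compose naturally: scale the logits by temperature, softmax into probabilities, keep only the nucleus of tokens covering top-p probability mass, then sample. A sketch over a toy {token: logit} dict:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature + nucleus (top-p) sampling over a {token: logit} dict."""
    if temperature <= 0:  # treat temperature 0 as greedy decoding
        return max(logits, key=logits.get)
    # Temperature-scaled softmax (subtract max for stability).
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()),
                   key=lambda tp: tp[1], reverse=True)
    # Nucleus: keep the smallest top set covering top_p probability mass.
    kept, cum = [], 0.0
    for token, p in probs:
        kept.append((token, p))
        cum += p
        if cum >= top_p:
            break
    # Sample from the kept tokens, renormalized.
    r = rng.random() * cum
    for token, p in kept:
        r -= p
        if r <= 0:
            return token
    return kept[-1][0]
```

Top-k works the same way, except the cutoff is a fixed count of tokens rather than a probability mass.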

How LLMs are trained

  1. Pretraining - the model learns next-token prediction on a massive corpus (web, books, code). This builds raw knowledge and language ability.
  2. Supervised fine-tuning (SFT) - the model is trained on curated instruction/response pairs so it learns to follow instructions.
  3. Preference optimization - using human or AI feedback (RLHF, DPO, constitutional methods) the model is shaped to be helpful, honest, and safe.
  4. Post-training for capabilities - additional training for tool use, long-context reasoning, coding, and agentic behavior.
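The objective underlying step 1 (and, with curated data, step 2) is next-token cross-entropy: penalize the model by minus the log-probability it assigned to the token that actually came next. A toy version, assuming the model's output at each position is already a probability list:

```python
import math

def next_token_loss(token_probs, target_ids):
    """Average next-token cross-entropy: -log p(correct next token).

    `token_probs[i]` is the model's probability distribution at position i;
    `target_ids[i]` is the index of the token that actually came next.
    """
    losses = [-math.log(dist[t]) for dist, t in zip(token_probs, target_ids)]
    return sum(losses) / len(losses)
```

Steps 3 and 4 swap this objective for preference- or task-based ones, but the model architecture stays the same.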

How they're used in practice

Prompting

The first and most important lever. Clear instructions, relevant examples (few-shot), structured outputs (JSON schemas), and "think step by step" style guidance can dramatically change results without any model changes.
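A few-shot prompt is usually assembled as a list of role-tagged messages: a system instruction, alternating example inputs and outputs, then the real query. The role/content dict shape below mirrors a common chat-API convention; exact field names vary by provider.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a chat-style message list with few-shot examples.

    `examples` is a list of (input, expected_output) pairs that demonstrate
    the task before the real query is asked.
    """
    messages = [{"role": "system", "content": instruction}]
    for ex_input, ex_output in examples:
        messages.append({"role": "user", "content": ex_input})
        messages.append({"role": "assistant", "content": ex_output})
    messages.append({"role": "user", "content": query})
    return messages
```

The demonstrations show the model the expected format and style, which is often more effective than describing them.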

Retrieval-augmented generation (RAG)

When the model needs facts it wasn't trained on - internal documents, recent events, private data - retrieve the relevant chunks and include them in the prompt. RAG is usually the right first step before reaching for fine-tuning.
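The retrieve-then-prompt pattern can be sketched end to end. Word overlap stands in for the embedding similarity a real system would use, and the prompt template is illustrative:

```python
def retrieve(query, chunks, k=2):
    """Rank chunks by word overlap with the query; return the top k.

    Real RAG systems score chunks with embedding similarity instead.
    """
    q_words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q_words & set(c.lower().split())),
                  reverse=True)[:k]

def build_rag_prompt(query, chunks, k=2):
    """Stuff the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Grounding the answer in retrieved text also makes it auditable: you can show the user which chunks the model was given.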

Tool use & agents

Modern LLMs can call functions, APIs, and run code. An agent is a loop: the model chooses a tool, observes the result, and decides what to do next. This is how assistants browse the web, run queries, edit files, or control other systems.
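The loop described above fits in a dozen lines. Here `model` is any callable mapping the transcript so far to an action dict; the `{"tool": ..., "input": ...}` / `{"answer": ...}` protocol is a hypothetical stand-in for the structured function-calling schemas real APIs use.

```python
def run_agent(model, tools, task, max_steps=5):
    """Minimal agent loop: pick a tool, run it, observe, repeat.

    Stops when the model emits an answer or the step budget runs out.
    """
    transcript = [("task", task)]
    for _ in range(max_steps):
        action = model(transcript)
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](action["input"])  # execute the tool
        transcript.append((action["tool"], result))      # feed observation back
    return None
```

The `max_steps` cap matters in practice: without it, a confused model can loop on tool calls indefinitely.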

Fine-tuning

When prompting and retrieval hit their limits, you can adapt the model itself to your task, style, or domain. See the Fine-Tuning page.

Evaluating LLMs

Limitations to keep in mind