What LLMs actually are, how they're trained, how they generate text, and how we evaluate them.
A large language model (LLM) is a neural network trained to predict the next token in a sequence of text. Scaled up with enough data, parameters, and compute, that simple objective produces models that can write code, summarize documents, reason through problems, and hold conversations. "Large" typically means billions to trillions of parameters.
LLMs don't read letters or words directly. Text is split into tokens - subword chunks such as "gonza", "lo", and " mu" (note the leading space: tokens often absorb the whitespace before a word). A rough rule of thumb: 1 token ≈ 4 characters of English, or ~¾ of a word. Pricing, context limits, and latency are all measured in tokens, not words.
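The 4-characters-per-token rule of thumb can be turned into a quick budgeting helper. This is a hypothetical sketch (`estimate_tokens` is not a real API); production code should count tokens with the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb from above: 1 token ≈ 4 characters of English.
    # Only an estimate - real tokenizers give exact counts.
    return max(1, len(text) // 4)

print(estimate_tokens("A rough rule of thumb for budgeting prompts."))
```

Useful for rough cost and context-window estimates before an exact count is available.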
Modern LLMs are built on the transformer architecture (Vaswani et al., 2017). Its key ingredient is self-attention: for each token, the model learns how much to "pay attention" to every other token in the context. Stacked transformer layers let the model build up increasingly abstract representations of the input.
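The core of self-attention is scaled dot-product attention: score the query against each key, softmax the scores into weights, then take a weighted average of the values. A minimal single-query sketch in plain Python (real models do this in parallel over many heads and positions):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Dot product of the query with every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted average of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(q, K, V)
# The query matches the first key more strongly, so the first value dominates.
```

The weights are exactly the "how much to pay attention to every other token" quantity described above.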
The context window is the maximum number of tokens a model can consider at once - your prompt plus its reply plus any tool calls and documents. Modern models range from 8K to over 1M tokens. Longer isn't always better: cost, latency, and "lost in the middle" effects still matter.
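Because the window is finite, long conversations must be trimmed to fit. A hypothetical sketch of the simplest policy, keeping the most recent messages that fit a token budget (real systems count exact tokens and usually pin the system prompt):

```python
def fit_to_budget(messages, budget_tokens, estimate=lambda s: max(1, len(s) // 4)):
    """Keep the most recent messages whose estimated token cost fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Smarter variants summarize or retrieve old turns instead of dropping them outright.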
At inference time the model produces a probability distribution over the next token. A decoding strategy picks from it: greedy (argmax), top-k, top-p (nucleus), or temperature-scaled sampling. Low temperature gives deterministic, focused output; higher temperature gives more variety.
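The decoding strategies above can be combined in one sampler: scale logits by temperature, convert to probabilities, then restrict to the nucleus (top-p) before sampling. A self-contained sketch, with greedy decoding as the zero-temperature limit:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0):
    """Temperature-scaled nucleus (top-p) sampling over a token->logit dict."""
    if temperature <= 0:  # greedy decoding: just take the argmax
        return max(logits, key=logits.get)
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    # Nucleus: keep the smallest set of tokens whose cumulative probability >= top_p.
    kept, cum = [], 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]

logits = {"the": 5.0, "a": 1.0, "zebra": -3.0}
print(sample_next(logits, temperature=0))        # greedy: always "the"
print(sample_next(logits, temperature=1.5))      # more variety
```

Top-k works the same way but keeps a fixed count of tokens instead of a probability mass.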
Prompt engineering is the first and most important lever. Clear instructions, relevant examples (few-shot), structured outputs (JSON schemas), and "think step by step" style guidance can dramatically change results without any model changes.
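Those techniques can be combined mechanically. A hypothetical `build_prompt` helper showing one common layout (instructions, then few-shot examples, then the query, with JSON output requested):

```python
import json

def build_prompt(task, examples, query):
    """Assemble a few-shot prompt that asks for structured JSON output."""
    lines = [task, "Respond with JSON only."]
    for inp, out in examples:  # few-shot: show worked input/output pairs
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {json.dumps(out)}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # leave the final output for the model to complete
    return "\n".join(lines)

prompt = build_prompt(
    "Classify the sentiment of each review.",
    [("Loved it!", {"sentiment": "positive"}),
     ("Never again.", {"sentiment": "negative"})],
    "It was fine, I guess.",
)
print(prompt)
```

The exact layout is a convention, not a requirement; what matters is that instructions, examples, and the output format are unambiguous.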
When the model needs facts it wasn't trained on - internal documents, recent events, private data - retrieve the relevant chunks and include them in the prompt. This is retrieval-augmented generation (RAG), and it's usually the right first step before reaching for fine-tuning.
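The retrieve-then-prompt loop can be sketched end to end. This toy version uses a bag-of-words "embedding" and cosine similarity purely for illustration; real RAG systems use learned embedding models and a vector index:

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words count vector (real RAG uses a learned model).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
]
context = retrieve("how long do refunds take", chunks)[0]
prompt = f"Answer using only this context:\n{context}\n\nQ: How long do refunds take?"
```

The retrieved chunk is pasted into the prompt, so the model answers from your data instead of its training set.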
Modern LLMs can call functions, query APIs, and run code. An agent is a loop: the model chooses a tool, observes the result, and decides what to do next. This is how assistants browse the web, run queries, edit files, or control other systems.
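The choose-tool, observe, decide loop can be sketched as follows. Everything here is hypothetical - the tool, the action format, and the scripted stand-in for a real model - but the loop structure is the point:

```python
def calculator(expression: str) -> str:
    # Hypothetical tool: evaluate simple arithmetic (builtins disabled for safety).
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_agent(model, task, max_steps=5):
    """Minimal agent loop: the model picks a tool or a final answer each step."""
    observation = task
    for _ in range(max_steps):
        action = model(observation)  # e.g. {"tool": "calculator", "input": "2+2"}
        if action.get("final"):
            return action["final"]
        result = TOOLS[action["tool"]](action["input"])
        observation = f"Tool returned: {result}"  # feed the result back to the model
    return "step limit reached"

# A scripted stand-in for a real model, for illustration only.
def scripted_model(obs):
    if obs.startswith("Tool returned:"):
        return {"final": obs.split(": ", 1)[1]}
    return {"tool": "calculator", "input": "17 * 3"}

print(run_agent(scripted_model, "What is 17 * 3?"))  # -> 51
```

Real frameworks replace `scripted_model` with an LLM call and parse its tool choice from structured output, but the control flow is the same.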
Fine-tuning comes last: when prompting and retrieval hit their limits, you can adapt the model itself to your task, style, or domain. See the Fine-Tuning page.