How Does a Large Language Model Actually Work?

How does a large language model actually work from an engineering perspective? In practice, I can call an LLM API, write a prompt, and ship a feature. What’s harder is explaining what actually happens between the user’s first character and the model’s first output token. I’ve come to see that gap as more than a curiosity. It shows up in real product decisions: why does my context window fill up faster with certain languages? Why does fine-tuning sometimes make things worse? Why does my inference cost spike at longer prompts? The answers live in the stack itself.

In this article, I walk that full stack from the ground up, covering how I think about LLMs from an engineering perspective at every layer. Raw text enters as token IDs and flows through stacked transformer blocks where attention scores determine which context influences each position. The model gets trained via pretraining and alignment pipelines at massive compute scale, then served through overlapping optimization strategies that trade precision, memory, and latency. Each layer has engineering decisions baked into it. Understanding them is what lets me make better choices at every junction.

This is the kind of breakdown I publish regularly on imlucas.dev: direct, technically grounded, no paraphrasing of documentation. If you’re building LLM-powered systems or evaluating infrastructure decisions, this is the mental model I’d want you to have.

From raw text to model input: how tokenization actually works

Before a model processes a single character, that character must be converted into a token ID, an integer that indexes into the model’s vocabulary. This step is called tokenization, and the algorithm most modern LLMs use is Byte Pair Encoding (BPE). BPE starts with a corpus split into individual bytes or characters, counts the frequency of every adjacent pair, merges the most frequent pair into a single new token, and repeats until it hits a target vocabulary size. For the string “aaabdaabac”, the most frequent pair is “a”+“a”, which collapses to “aa”; counting and merging then repeat on the updated sequence until the merge budget is used up. The ordered list of merges is the tokenizer: at inference time the same merges are replayed on new text.
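The merge loop described above can be sketched in a few lines. This is a minimal character-level illustration rather than a production tokenizer: the function names are my own, pairs are counted with overlaps, and ties go to the pair seen first, all of which are simplifying assumptions that real implementations may handle differently.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair, or None if there is no pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    # Ties resolve to the pair that was inserted first (assumption).
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Run `num_merges` merge steps and return the tokens plus the merge list."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

tokens, merges = train_bpe("aaabdaabac", num_merges=1)
print(merges)  # [('a', 'a')]
print(tokens)  # ['aa', 'a', 'b', 'd', 'aa', 'b', 'a', 'c']
```

In a real tokenizer the number of merges is what sets the vocabulary size: base symbols plus one new token per merge.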

Byte-level BPE is the standard for good reason. Character-level BPE struggles with non-Latin scripts because its base vocabulary has to contain every character it will ever see: covering all of Unicode (CJK alone contributes tens of thousands of characters) bloats the vocabulary, and anything left out becomes an out-of-vocabulary token. Byte-level BPE operates on UTF-8 bytes (0–255), so any text encoded in UTF-8 can be expressed as a sequence of known tokens without out-of-vocabulary failures. Code, symbols, and emoji are generally handled more robustly under this scheme as well, since no special-case rules are needed.
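A quick way to see why a 256-entry base vocabulary is enough is to look at the UTF-8 bytes directly. The sketch below uses only Python’s built-in encoding; the helper names `to_byte_ids` and `from_byte_ids` are hypothetical and stand in for the base layer of a byte-level tokenizer before any merges are applied.

```python
# One character from each category: ASCII, accented Latin, CJK, emoji.
samples = {
    "ascii": "A",
    "accented": "\u00e9",     # e with acute accent
    "cjk": "\u8a9e",          # a CJK ideograph
    "emoji": "\U0001F642",    # slightly smiling face
}

for name, ch in samples.items():
    raw = ch.encode("utf-8")
    print(f"{name:9s} {len(raw)} byte(s): {list(raw)}")

def to_byte_ids(text):
    """Map text to base token IDs in the range 0..255 (one per UTF-8 byte)."""
    return list(text.encode("utf-8"))

def from_byte_ids(ids):
    """Invert to_byte_ids; valid for any ID sequence produced by it."""
    return bytes(ids).decode("utf-8")

ids = to_byte_ids("h\u00e9llo \u8a9e \U0001F642")
assert from_byte_ids(ids) == "h\u00e9llo \u8a9e \U0001F642"
```

Every input round-trips through IDs below 256, which is the no-out-of-vocabulary guarantee. The cost is visible too: before merges, a single CJK character or emoji already occupies three or four base tokens.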

After tokenization, each token ID is mapped to a dense vector via an embedding matrix lookup. That matrix has shape vocab_size x embedding_dim; a vocabulary of 131,000 tokens with a 4096-dimensional embedding uses over half a billion parameters just in the embedding layer. More importantly, token count, not character count, determines how much of your context window you consume. A language whose script is poorly covered by the learned merges (often a non-Latin script, where each character already takes two to four UTF-8 bytes) will typically produce more tokens per sentence than English, and those extra tokens cost you both context budget and inference compute.
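The embedding arithmetic is easy to check. The snippet below reproduces the parameter count from the example figures above; the 2-bytes-per-parameter line assumes FP16 or BF16 storage, which is my assumption rather than something the example specifies.

```python
def embedding_params(vocab_size, embedding_dim):
    """Parameter count of a vocab_size x embedding_dim lookup table."""
    return vocab_size * embedding_dim

params = embedding_params(131_000, 4096)
bytes_16bit = params * 2  # assumes 2 bytes per parameter (FP16/BF16)

print(f"{params:,} parameters")            # 536,576,000 parameters
print(f"{bytes_16bit / 1e9:.2f} GB at 16-bit precision")
```

If the output projection does not share weights with the embedding, the same count is paid a second time at the top of the model.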

Vocabulary size sits at the intersection of several trade-offs. Larger vocabularies (GPT-4 uses roughly 100k tokens; LLaMA-3 uses roughly 128k) compress sequences more aggressively, reducing token count per prompt. But they increase the embedding layer’s memory footprint. Critically, changing vocabulary after pretraining typically requires significant embedding rework or full retraining of that layer, though mapping and adapter strategies do exist. Vocabulary design is a decision made once, before any pretraining compute is spent, and its effects propagate through every downstream system.

Inside the transformer: attention, weights, and the forward pass

Self-attention and scaled dot-product

Every transformer layer performs the same two operations in sequence: multi-head self-attention, then a position-wise feed-forward network. The attention mechanism is where the representational work happens. Given an input embedding matrix X, the model projects it into three spaces via learned weight matrices: queries (Q = XW_Q), keys (K = XW_K), and values (V = XW_V). The attention output is then computed as softmax(QK^T / sqrt(d_k)) V, where the softmax term holds the attention weights (each row sums to 1) and d_k is the dimension of the query and key vectors.
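The formula above translates almost line for line into code. This is a dependency-free sketch on plain Python lists, written for readability rather than speed; it assumes Q, K, and V have already been produced by the learned projections and applies no mask.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major lists of vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One scaled score per key for this query position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return output

# Two identical keys get equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
# [[2.0, 3.0]]
```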

The sqrt(d_k) scaling is not cosmetic. Without it, dot products between Q and K grow in magnitude as d_k increases, pushing softmax inputs into regions where the function saturates near 0 or 1. Once softmax saturates, its gradient approaches zero, and the model stops learning useful attention patterns. Dividing by sqrt(d_k) keeps pre-softmax variance close to 1 regardless of embedding size, maintaining stable gradients throughout training. For a clear, step-by-step derivation of self-attention and its numerics, see self-attention from scratch.
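The variance argument can be checked empirically. The simulation below assumes query and key components are independent standard normal values, which is an idealization of what a trained model produces; the function name, sample count, and seed are arbitrary choices of mine.

```python
import math
import random

def dot_product_variance(d_k, num_samples=2000, scale=False, seed=0):
    """Sample variance of q.k (optionally divided by sqrt(d_k)) for random q, k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / len(dots)
    return sum((x - mean) ** 2 for x in dots) / len(dots)

for d_k in (8, 32, 128):
    raw = dot_product_variance(d_k)
    scaled = dot_product_variance(d_k, scale=True)
    print(f"d_k={d_k:3d}  raw variance ~ {raw:6.1f}  scaled variance ~ {scaled:.2f}")
```

Under these assumptions the unscaled variance tracks d_k, while the scaled variance stays near 1 at every size.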

Multi-head attention and feed-forward blocks

Multi-head attention runs this computation in parallel across h subspaces. Instead of one large Q, K, V projection, the embedding dimension is split into h heads, each with its own slice of the projection matrices. Interpretability research suggests that individual heads can learn to attend to different types of relationships (syntactic dependencies, coreference, positional proximity), though the precise roles vary by model and layer. The outputs from all heads are concatenated and projected back to the full embedding dimension via W_O. In PyTorch, the core computation reduces to a single call to torch.nn.functional.scaled_dot_product_attention(query, key, value, attn_mask=None) on tensors that carry the heads as a batch dimension; on CUDA it can dispatch to fused FlashAttention or memory-efficient attention kernels when the inputs qualify.
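The split, attend, and concatenate flow is easier to see without a tensor library. The sketch below is deliberately stripped down: it treats the learned projections W_Q, W_K, W_V, and W_O as identity matrices (an assumption made purely for brevity), so each head simply attends over its own slice of the embedding.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of vectors, no mask."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice each d_model-wide row into num_heads chunks of width d_head."""
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(num_heads)]

def multi_head_self_attention(X, num_heads):
    # Assumption: projections are identity, so Q = K = V = the head's slice of X.
    heads = split_heads(X, num_heads)
    head_outputs = [attention(h, h, h) for h in heads]
    # Concatenate head outputs per position to restore the d_model width.
    return [
        sum((head_outputs[h][t] for h in range(num_heads)), [])
        for t in range(len(X))
    ]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(X, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Production code does the same thing with one batched tensor operation rather than a Python loop over heads.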

After attention, each transformer block passes through a feed-forward network: two linear layers with a nonlinearity between them (GELU in the GPT family; many recent models use gated variants such as SwiGLU), applied position-wise. Residual connections wrap both the attention sub-layer and the feed-forward sub-layer, with layer normalization applied to each. This residual structure is what makes very deep networks trainable. Stacking 32, 64, or 96 of these blocks gives large models their representational depth. The O(n^2 * d) attention complexity is also why long contexts are expensive: doubling sequence length quadruples attention compute.
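The quadratic term is worth making concrete. The estimate below counts only the two large matrix multiplications inside attention (QK^T and the weighted sum over V) and treats a multiply-add as 2 FLOPs; projections, softmax, and the feed-forward network are left out, so this is a rough sketch of the n^2 part only.

```python
def attention_matmul_flops(seq_len, d_model):
    """Approximate FLOPs for QK^T plus (weights @ V) in one layer."""
    qk = 2 * seq_len * seq_len * d_model  # scores: (n x d) @ (d x n)
    av = 2 * seq_len * seq_len * d_model  # output: (n x n) @ (n x d)
    return qk + av

short = attention_matmul_flops(2048, 4096)
long = attention_matmul_flops(4096, 4096)
print(long / short)  # 4.0
```

Doubling d_model only doubles this cost, which is why context length, not width, dominates the attention bill at long sequence lengths.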

How LLMs learn: pretraining, fine-tuning, and alignment

Pretraining is autoregressive next-token prediction at scale. The model receives a sequence of tokens and predicts the next one at every position; the loss is cross-entropy over the vocabulary, summed (or averaged) across the sequence. This objective is entirely self-supervised: no human labels required, just massive amounts of text. The scale is considerable: Meta’s Llama 3 report describes a 54-day snapshot period of the 405B model’s pretraining on a cluster of roughly 16,000 H100 GPUs. At that scale, data quality and mixture ratios become as important as compute: the proportion of code, multilingual text, and scientific content in the pretraining corpus shapes the model’s capabilities in ways that are difficult to correct later.
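The objective itself is compact. The sketch below computes next-token cross-entropy from raw logits for a single sequence; it averages over positions rather than summing (the two differ only by a constant factor), and the function name and input layout are my own conventions.

```python
import math

def next_token_loss(logits, token_ids):
    """Mean cross-entropy where logits[t] scores the prediction of token_ids[t + 1]."""
    total = 0.0
    num_predictions = len(token_ids) - 1
    for t in range(num_predictions):
        row = logits[t]
        m = max(row)
        # log of the softmax normalizer, computed stably
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[token_ids[t + 1]]
    return total / num_predictions

# A model that knows nothing (uniform logits over 4 tokens) pays ln(4) per token.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(next_token_loss(uniform, [0, 1, 2, 3]))  # ~1.386
```

The targets are just the input shifted by one position, which is why no labels are needed.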

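The standard optimizer is AdamW with linear warmup followed by cosine learning rate decay. Typical pretraining hyperparameters in published recipes: beta_1=0.9, beta_2=0.95, epsilon between 1e-8 and 1e-5, weight decay ~0.1, and a peak learning rate of roughly 1e-4 to 6e-4 for small and mid-sized models, lower for the very largest. (Values such as beta_2=0.999 and peak rates of 3e-5 to 5e-5 are more typical of fine-tuning than of pretraining.) The warmup phase prevents large early gradient updates from destabilizing the freshly initialized parameters. Cosine decay then brings the learning rate smoothly down to a small fraction of its peak (often around 10%, sometimes near zero) by the end of training, squeezing the final performance out of the data budget.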
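The schedule shape is simple enough to write out. The function below returns a multiplier on the peak learning rate, so it is independent of whichever peak value you pick; `min_ratio` (the floor as a fraction of peak) and the step counts in the example are illustrative assumptions.

```python
import math

def lr_multiplier(step, warmup_steps, total_steps, min_ratio=0.0):
    """Linear warmup to 1.0, then cosine decay down to min_ratio."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine

for step in (0, 9, 10, 55, 100):
    print(step, round(lr_multiplier(step, warmup_steps=10, total_steps=100, min_ratio=0.1), 3))
```

The actual learning rate at a given step is the peak value times this multiplier, which is also the shape that a lambda-based scheduler expects in most training frameworks.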

Supervised fine-tuning (SFT) adapts the pretrained model to follow instructions by training on curated instruction-output pairs, still using next-token prediction but now on labeled data with clear correct outputs. SFT works well for tasks where gold-label outputs are well-defined: information extraction, summarization with clear objectives, classification. It falls short when the task involves subjective quality, nuanced helpfulness, or calibrated refusals. Those gaps are where RLHF comes in.

RLHF adds a reward signal on top of SFT. A separate reward model is trained on human preference comparisons between model outputs. Then Proximal Policy Optimization (PPO) updates the LLM’s parameters to maximize the reward signal. A KL divergence penalty between the updated policy and the original SFT model limits reward hacking, where the model finds degenerate output patterns that score high on the reward model but are useless in practice. In production, SFT and RLHF are typically used sequentially, not as alternatives. SFT establishes instruction-following behavior; RLHF refines it toward human preference.
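The KL penalty is usually folded into the reward before the policy update. The sketch below shows that shaping step for one sampled response, using the common per-sample estimate of KL (the sum of log-probability differences over the generated tokens); the function name and the default `beta` of 0.1 are hypothetical choices, not values from any particular pipeline.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward model score minus beta times an estimate of KL(policy || reference).

    policy_logprobs / reference_logprobs: log-probabilities that each model
    assigns to the tokens the policy actually generated.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate

# Policy still matches the SFT reference: no penalty.
print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))  # 1.0

# Policy has drifted toward tokens the reference finds less likely: reward shrinks.
print(kl_penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0]))
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower movement toward the reward model’s preferences.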

Scaling to billions of parameters: parallelism strategies and hardware realities

Training large models requires distributing work across hundreds or thousands of GPUs simultaneously, and the strategy you choose affects both memory usage and communication overhead. Data parallelism replicates the full model across devices and splits the training data; each replica computes gradients on its shard, then gradients are aggregated via AllReduce. This is straightforward but doesn’t help when the model itself doesn’t fit in a single GPU’s memory. Tensor parallelism splits individual weight matrices across devices, enabling single-layer operations that exceed one GPU’s VRAM. Pipeline parallelism splits the model depth-wise, with different layers on different devices and micro-batches flowing through stages. Megatron-LM has reported strong scaling efficiency combining data and tensor parallelism across hundreds of GPUs, though exact figures vary by configuration and hardware generation.
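The reason data parallelism is the easy case is that averaging per-shard gradients reproduces the full-batch gradient when shards are equal in size. The toy below simulates that on a one-parameter model in a single process; `all_reduce_mean` is a stand-in I wrote for the collective operation, not a real communication call.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Simulated AllReduce: every replica ends up with the mean of all inputs."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
num_replicas = 4
shard = len(xs) // num_replicas

# Each replica sees only its own shard of the batch.
local_grads = [
    grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(num_replicas)
]
synced = all_reduce_mean(local_grads)
full = grad_mse(w, xs, ys)

print(local_grads)  # different on every replica
print(synced[0], full)  # identical after the reduction
```

Because every replica applies the same averaged gradient, the model copies stay in sync without ever exchanging weights.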

ZeRO optimization addresses a different problem: redundant memory state in data-parallel training. Standard data parallelism replicates optimizer state, gradients, and parameters on every GPU. ZeRO eliminates that redundancy by partitioning those states across data-parallel ranks. Stage 1 partitions optimizer state only, cutting per-GPU model-state memory by up to roughly 4x. Stage 2 adds gradient partitioning for up to roughly 8x. Stage 3 partitions parameters as well, so the reduction scales near-linearly with the number of data-parallel GPUs. (The 4x and 8x figures assume mixed-precision Adam and a large data-parallel degree.) Stage 3 carries the highest communication overhead since parameters must be gathered before use and released afterwards during every forward and backward pass, but within pure data parallelism it is the stage that makes training possible when the parameters alone don’t fit on one GPU.
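The stage-by-stage savings follow from simple accounting. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer state (FP32 master weights, momentum, and variance). Activations and temporary buffers are ignored, and the 7.5B-parameter, 64-GPU example is illustrative.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Model-state bytes per GPU under ZeRO stage 0 (plain DP) through 3."""
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:6.1f} GB per GPU")
```

With 64 GPUs the stage 1 saving is a little under 4x because the unpartitioned weights and gradients still cost 4 bytes per parameter; the headline ratios are the limit as the GPU count grows.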

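The hardware reality at this scale is instructive. Full fine-tuning of BLOOM-176B needs approximately 3 TB of GPU memory for weights, gradients, and Adam optimizer state alone (about 16 bytes per parameter under mixed precision). At 80 GB per A100 that is a floor of roughly 36 GPUs before activations, communication buffers, and fragmentation are counted, which is why practical estimates run considerably higher, with figures of roughly 72 A100 80GB GPUs often quoted. The H100 GPUs in Meta’s clusters each run at 700W TDP with HBM3 memory. When you operate at thousands of GPUs, individual hardware failures become statistically certain over the course of a multi-week training run. Reliability engineering, not just parallelism strategy, becomes a first-class concern: fast checkpoint recovery and training re-initialization mechanisms to limit the time lost when failures occur, and high-speed interconnects (NVLink, InfiniBand) to keep GPU utilization high the rest of the time.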
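“Statistically certain” can be put into numbers with a back-of-the-envelope model. The sketch below assumes failures are independent and identically likely across GPUs and days, which real clusters only approximate, and the per-GPU daily failure probability is a made-up illustrative value, not a measured rate.

```python
def prob_any_failure(num_gpus, days, p_fail_per_gpu_day):
    """P(at least one GPU failure) assuming independent daily failure chances."""
    return 1.0 - (1.0 - p_fail_per_gpu_day) ** (num_gpus * days)

# Hypothetical rate: 1 failure per 10,000 GPU-days.
p = 1e-4
for gpus in (8, 512, 16_000):
    print(f"{gpus:6d} GPUs, 54 days: {prob_any_failure(gpus, 54, p):.4f}")
```

Even at a failure rate that looks negligible for a single machine, the probability of an interruption-free multi-week run on a large cluster is effectively zero, which is what turns checkpointing from a convenience into a requirement.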

Engineering inference for production: the full optimization stack

A deployed LLM has two distinct computational phases: prefill, where the model processes the entire input prompt in parallel, and decode, where it generates one output token at a time. These phases have different bottlenecks. Prefill is compute-bound, dominated by large matrix multiply operations across the full sequence. During decode, the bottleneck shifts to memory bandwidth: the model must load all weights and the KV cache for each single-token generation step. Optimizing each phase requires different strategies.
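The memory-bandwidth bound on decode gives a useful latency floor: at batch size 1, every generated token requires streaming all the weights from HBM at least once. The estimate below ignores the KV cache, kernel overheads, and multi-GPU effects, and both inputs in the example (a hypothetical 7B-parameter model in 16-bit precision and an assumed 3.35 TB/s of memory bandwidth) are illustrative numbers you should replace with your own.

```python
def decode_step_floor_ms(num_params, bytes_per_param, hbm_bytes_per_s):
    """Lower bound on per-token decode time from weight traffic alone, in ms."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / hbm_bytes_per_s * 1e3

floor_ms = decode_step_floor_ms(
    num_params=7e9,           # hypothetical model size
    bytes_per_param=2,        # FP16/BF16 weights
    hbm_bytes_per_s=3.35e12,  # assumed accelerator memory bandwidth
)
print(f"at least {floor_ms:.2f} ms per token at batch size 1")
```

Batching amortizes the same weight traffic over many requests, which is why decode throughput scales with batch size until compute or KV cache traffic takes over.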

The KV cache stores key and value vectors for all tokens already processed, so the model doesn’t recompute attention from scratch on every decode step. Without it, generating each new token would require recomputing O(n^2) attention over the full sequence. Continuous batching extends this efficiency to request management: instead of waiting for all in-flight requests to complete before accepting new ones, it dynamically inserts new requests as existing ones finish. This keeps GPU occupancy high even with variable sequence lengths. Benchmarks on production workloads have shown substantial throughput improvements over static batching when continuous batching is applied; some reported configurations show an order of magnitude or more. For techniques that target long-context inference and KV-cache behavior specifically, see NVIDIA’s guidance on optimizing inference for long context and KV cache.
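The cache is cheap per token and expensive in aggregate, so it helps to size it explicitly. The formula below counts one key and one value vector per layer per KV head per token; the model shape in the example (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for seq_len tokens per sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size

per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token)               # 524288 bytes, i.e. 0.5 MiB per token
print(one_request / 2**30)     # 2.0 GiB for a single 4096-token sequence
```

Multiply by the number of concurrent requests and the cache can easily outgrow the weights, which is what makes batching policy and cache management a memory problem first.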

Quantization reduces the numerical precision of model weights and activations from FP16 to INT8 or INT4. This cuts the memory needed for weights by roughly 2-4x and can reduce inference cost substantially (figures around 50% are commonly cited), usually with small accuracy degradation, though the impact should be measured per workload. Post-training quantization is faster to deploy and suitable for most cases. Quantization-aware training produces better results at very low precision (INT4 and below) but requires integrating quantization into the training loop itself. KV cache quantization specifically targets the memory bandwidth bottleneck during decode, where moving data between HBM and compute units is often the limiting factor.
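The simplest form of post-training quantization fits in a dozen lines. The sketch below is symmetric, per-tensor, round-to-nearest INT8 quantization; production schemes typically use per-channel or per-group scales and calibration data, so treat this as the baseline idea rather than a recommended recipe.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.02, -1.5, 0.73, 0.0, 1.49]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print([round(abs(a - b), 5) for a, b in zip(weights, restored)])  # per-weight error
```

The round-trip error is bounded by half the scale, and the scale is set by the largest weight, which is why a single outlier can hurt precision for everything else in the tensor.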

Several techniques deliver outsized returns for specific workloads. Prefix caching reuses KV cache entries across requests that share a common input prefix, such as the same system prompt. For workloads with repetitive prefixes, this can deliver substantial cost savings with no model changes required. Speculative decoding uses a smaller draft model to propose several candidate tokens ahead of the main model; the main model then verifies all of them in a single forward pass and keeps the longest accepted run, which yields meaningful decode latency improvements when spare compute is available and the draft model agrees with the main model often enough. PagedAttention, introduced by vLLM and echoed in the paged KV-cache management of other serving engines such as SGLang, uses block-based memory allocation for KV cache, similar to virtual memory paging, reducing fragmentation dramatically and improving throughput over naive implementations. See this overview of PagedAttention for a concise explanation of the technique.
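The accept-or-reject logic of speculative decoding is easier to follow on a toy. The sketch below covers the greedy case only: both “models” are deterministic functions I made up, verification is written as a loop where a real system would use one batched forward pass, and the bonus token that real implementations take after a fully accepted draft is omitted for brevity.

```python
def greedy_decode(next_token, prompt, num_tokens):
    """Plain autoregressive decoding with a single model."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(next_token(seq))
    return seq

def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding: draft proposes up to k tokens, target verifies."""
    seq = list(prompt)
    goal = len(prompt) + num_tokens
    while len(seq) < goal:
        # 1) The cheap draft model proposes a short continuation.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2) The target checks each position; the first mismatch is replaced
        #    with the target's own token and the rest of the draft is dropped.
        accepted = []
        for token in proposal:
            expected = target_next(seq + accepted)
            accepted.append(expected)
            if token != expected:
                break
        seq.extend(accepted)
    return seq

def target_next(seq):
    # Hypothetical "large model": a fixed arithmetic rule on the last token.
    return (seq[-1] * 3 + 1) % 7

def draft_next(seq):
    # Hypothetical "small model": agrees with the target except after token 5.
    return 0 if seq[-1] == 5 else (seq[-1] * 3 + 1) % 7

fast = speculative_decode(target_next, draft_next, [1], num_tokens=20)
slow = greedy_decode(target_next, [1], num_tokens=20)
print(fast == slow)  # True
```

The output always matches what the target model would have produced on its own; the draft model only changes how many target passes are needed to get there.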

For teams sequencing these optimizations, the order matters. Start with prompt caching (prefix caching as exposed by your provider or serving engine), which on prefix-heavy workloads can cut costs by a reported 50-90% with little or no infrastructure change. Continuous batching comes next and typically yields 3-10x throughput gains. Quantization follows, but requires accuracy validation in a staging environment before production rollout. The optimizations compound: teams that implement caching and batching before reaching for structural serving changes consistently see stronger results than those who jump straight to complex infrastructure.

Understanding how a large language model works from an engineering perspective: putting it together

The full stack now has a clean structure. Raw text enters as token IDs shaped by vocabulary design decisions made before pretraining began. Those IDs pass through embedding lookup into a sequence of transformer blocks where scaled dot-product attention computes context-weighted representations across every position. Stacked blocks with residual connections and feed-forward networks build the representational depth that allows the model to generalize. Pretraining instills general capability; SFT aligns behavior to instructions; RLHF refines toward human preference with a KL penalty preventing collapse. At inference, KV caching, continuous batching, quantization, and paged memory management turn a theoretically expensive O(n^2) operation into something economically viable at scale.

This engineering perspective on LLM internals is what I lean on when I’m in an architecture review or a cost postmortem. Why a larger vocabulary shortens your token sequences but increases embedding memory. Why fine-tuning on a small dataset can degrade general capability through overfitting or distribution shift. Why INT4 quantization tends to be safe for retrieval workloads but introduces more risk on math-heavy tasks. Why your prefix caching hit rate depends entirely on how you structure your system prompt. These aren’t abstract concerns; they’re the questions that determine real product outcomes.

I publish more breakdowns at this depth regularly at imlucas.dev: system design analysis, ML engineering deep-dives, and production lessons grounded in the perspective I’ve applied here.