LLM Architecture¶
How Large Language Models work — from transformer internals and attention mechanisms through training pipelines, quantization formats, model distribution formats, and knowledge distillation.
Transformer Architecture¶
The transformer is the neural network architecture behind virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, it replaced recurrent neural networks (RNNs/LSTMs) by processing all tokens in a sequence simultaneously rather than sequentially.
Why Transformers Replaced RNNs¶
RNNs process tokens one at a time, left to right. This sequential bottleneck means:
- Training cannot be parallelized across sequence positions
- Long-range dependencies decay over distance (vanishing gradients)
- Wall-clock time grows with sequence length, since each step must wait for the previous one
Transformers address all three problems through self-attention, which computes relationships between every pair of tokens in a single matrix operation — fully parallelizable on GPUs (at the cost of compute that grows quadratically with sequence length).
High-Level Data Flow¶
```mermaid
graph LR
    A[Raw Text] --> B[Tokenizer]
    B --> C[Token IDs]
    C --> D[Embedding Layer]
    D --> E[+ Positional Encoding]
    E --> F[Transformer Blocks x N]
    F --> G[Output Layer / Logits]
    G --> H[Softmax → Probability Distribution]
    H --> I[Next Token]
```
- Tokenization — text is split into subword tokens (integers from a fixed vocabulary)
- Embedding — each token ID maps to a dense vector via a learned embedding table
- Positional Encoding — positional signals are added so the model knows token order (attention itself is order-agnostic)
- Transformer Blocks — a stack of N identical layers, each containing self-attention + feed-forward network + residual connections + layer normalization
- Output Layer — projects hidden states to vocabulary-sized logits
- Softmax — converts logits to a probability distribution over the vocabulary
Modern LLMs stack anywhere from a dozen to over a hundred transformer blocks. Deeper stacks enable richer hierarchical abstractions.
Encoder-Decoder vs Decoder-Only¶
The original transformer had two halves:
| Architecture | Used By | How It Works |
|---|---|---|
| Encoder-Decoder | T5, BART, original Transformer | Encoder reads full input bidirectionally; decoder generates output autoregressively |
| Encoder-Only | BERT, RoBERTa | Bidirectional attention for understanding tasks (classification, NER) |
| Decoder-Only | GPT series, LLaMA, Claude, Mistral | Causal (left-to-right) attention; generates text one token at a time |
Nearly all modern generative LLMs use the decoder-only variant. The encoder-only approach lives on in embedding models and classification tasks.
Self-Attention Mechanism¶
Self-attention is the core innovation that makes transformers work. It allows every token to "attend to" every other token in the sequence, computing relevance scores dynamically.
Query, Key, Value (QKV)¶
For each token, the model computes three vectors from the input embedding:
- Query (Q) — "what am I looking for?"
- Key (K) — "what do I contain?"
- Value (V) — "what information do I provide?"
The attention score between two tokens is the dot product of one token's Query with another's Key, scaled and passed through softmax:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Where $d_k$ is the dimension of the key vectors (scaling prevents dot products from growing too large).
Causal Masking¶
In decoder-only models, a causal mask is applied to the attention matrix: the upper triangle is set to $-\infty$ before softmax, preventing tokens from attending to future positions. This ensures autoregressive generation — each token can only see tokens that came before it.
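The formula and the causal mask can be sketched together in a few lines of NumPy (single head, no batch dimension; shapes and names are illustrative):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d_k) for a single head.
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) relevance scores
    # Causal mask: -inf in the upper triangle so softmax zeroes out future tokens
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf
    # Row-wise softmax (numerically stabilized by subtracting the row max)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # weighted mixture of Values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                     # 5 tokens, d_k = 8
out = causal_attention(x, x, x)
```

Note that the first token can only attend to itself, so its output row is exactly its own Value vector — a direct consequence of the mask.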
Multi-Head Attention¶
Rather than computing a single attention function, transformers use multiple attention heads (typically 32–128), each with independent Q/K/V projections. Different heads learn to capture different types of relationships (syntactic, semantic, positional). The outputs are concatenated and linearly projected.
Attention Variants¶
| Variant | Description | Used By |
|---|---|---|
| Multi-Head Attention (MHA) | Each head has its own K, V projections | Original Transformer, GPT-2 |
| Multi-Query Attention (MQA) | All heads share a single K, V projection | PaLM, Falcon |
| Grouped-Query Attention (GQA) | Heads grouped into clusters sharing K, V | LLaMA 2/3, Mistral, Gemma |
GQA is the current standard — it reduces KV cache memory by 4-8x compared to MHA with minimal quality loss.
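The KV-cache savings fall out of simple arithmetic. A sketch with hypothetical model dimensions (32 layers, 128-dim heads, 8K context, 16-bit cache entries):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    """KV cache size: 2 tensors (K and V) per layer, one per KV head."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

# MHA: every query head has its own KV head (32 here)
mha = kv_cache_gb(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
# GQA: groups of 4 query heads share each of 8 KV heads
gqa = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
print(mha, gqa, mha / gqa)  # the ratio is exactly kv_heads_mha / kv_heads_gqa
```

With these assumed dimensions GQA cuts the cache 4x; a more aggressive grouping (e.g. 4 KV heads) would give 8x, matching the range quoted above.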
Tokenization and Embeddings¶
Tokenization¶
Tokenization converts raw text into integer token IDs from a fixed vocabulary. LLMs use subword tokenization — a middle ground between character-level (too fine) and word-level (can't handle unknown words).
Byte Pair Encoding (BPE) is the dominant algorithm:
- Start with a vocabulary of 256 byte values
- Find the most frequent adjacent byte pair in the training corpus
- Merge that pair into a new token, add to vocabulary
- Repeat until vocabulary reaches target size (30K–100K tokens)
Common words become single tokens; rare words decompose into known subword pieces.
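The merge loop above can be illustrated with a toy character-level trainer (real tokenizers start from the 256 byte values and train on vastly larger corpora):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    words = Counter(tuple(w) for w in corpus.split())   # word symbols -> frequency
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                # most frequent pair
        merges.append(best)
        # Rewrite every word, replacing the pair with the merged symbol
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = bpe_train("low lower lowest low low", 3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

After three merges the frequent stem "low" has become a single token, while the rarer suffixes remain decomposed — exactly the behavior described above.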
| Algorithm | Description | Used By |
|---|---|---|
| BPE (byte-level) | Merge most frequent byte pairs | GPT-2/3/4, LLaMA 3, Claude, Mistral |
| WordPiece | Merge pairs that maximize corpus likelihood | BERT, DistilBERT |
| SentencePiece | Language-agnostic, operates on raw text | LLaMA 1/2, Mistral (earlier), T5 |
| Unigram | Probabilistic model, prunes vocabulary down | SentencePiece variant, XLNet |
Tokenization Quirks
Many LLM "failures" trace back to tokenization. Math errors occur because multi-digit numbers split into arbitrary subword tokens. Spelling struggles happen because the model never sees individual characters. "Glitch tokens" — tokens frequent in tokenizer training data but rare in model training — produce unpredictable outputs.
Embeddings¶
The embedding layer maps each integer token ID to a dense vector (typically 4096–12288 dimensions). These vectors are learned during pretraining and encode semantic relationships: similar tokens have similar vectors.
Positional encoding adds sequence-order information since attention is inherently order-agnostic. Modern LLMs use Rotary Position Embeddings (RoPE), which encode relative positions directly into the Q/K dot product, enabling better extrapolation to longer sequences than the model was trained on.
Context Windows¶
The context window is the maximum number of tokens an LLM can process in one request. All input (system prompt, conversation history, user query) and output share this budget.
| Era | Typical Context | Example Models |
|---|---|---|
| 2018–2020 | 512–2048 | BERT, GPT-2 |
| 2022–2023 | 4K–32K | GPT-4, Claude 2 |
| 2024–2025 | 128K–1M | Claude 3.5, Gemini 1.5, GPT-4 Turbo |
| 2025–2026 | 1M–10M | Gemini 2.0, Claude 4 |
Lost in the Middle
Research shows LLMs attend strongly to tokens at the beginning and end of context but drop 30%+ accuracy on information in the middle (Liu et al., Stanford 2024). Placing critical information at the start or end of prompts improves retrieval quality.
Training Pipeline¶
Phase 1: Pretraining¶
The model learns general knowledge by predicting the next token across trillions of tokens from web crawls, books, code, and curated datasets. This is by far the most expensive phase — DeepSeek-V3 required 2.788 million H800 GPU hours (~$5.6M) for 14.8 trillion tokens.
The training objective is simple causal language modeling:
$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}) $$
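The loss is simply the summed negative log-probability the model assigned to each token that actually occurred. A tiny worked example with made-up distributions over a 3-token vocabulary:

```python
import math

def causal_lm_loss(probs, token_ids):
    """Negative log-likelihood of a token sequence.

    probs[t] is the model's distribution over the vocabulary after seeing
    tokens 0..t-1; token_ids[t] is the token that actually occurred.
    """
    return -sum(math.log(probs[t][tok]) for t, tok in enumerate(token_ids))

# Hypothetical distributions for a 2-token sequence
probs = [
    {0: 0.7, 1: 0.2, 2: 0.1},   # P(x_1)
    {0: 0.1, 1: 0.8, 2: 0.1},   # P(x_2 | x_1)
]
loss = causal_lm_loss(probs, [0, 1])   # = -log(0.7) - log(0.8)
```

Training drives this quantity down, which is the same as making the observed corpus maximally probable under the model.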
Phase 2: Supervised Fine-Tuning (SFT)¶
The pretrained model is further trained on curated instruction-response pairs to learn:
- Instruction following
- Output formatting (JSON, markdown, structured responses)
- Safety behaviors
- Task-specific patterns
SFT datasets are much smaller (thousands to millions of examples) but high quality.
Phase 3: Alignment¶
RLHF (Reinforcement Learning from Human Feedback)¶
The traditional alignment pipeline:
```mermaid
graph LR
    A[SFT Model] --> B[Generate Multiple Responses]
    B --> C[Human Annotators Rank Outputs]
    C --> D[Train Reward Model]
    D --> E[Optimize Policy via PPO]
    E --> F[Aligned Model]
```
- Generate multiple responses per prompt
- Human annotators rank them
- Train a reward model to predict human preferences
- Use PPO (Proximal Policy Optimization) to optimize the base model against the reward model
Downsides: complex, expensive, unstable training, susceptible to reward hacking.
DPO (Direct Preference Optimization)¶
Introduced by Rafailov et al. (2023), DPO simplifies alignment by eliminating the reward model entirely. It reframes preference learning as a binary classification problem:
- Given a chosen response and a rejected response, directly optimize the model to increase the probability of the chosen response relative to the rejected one
- Requires only 2 models (policy + frozen reference) vs RLHF's 4
- Standard supervised learning infrastructure — no RL instability
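The DPO objective is compact enough to sketch. Assuming summed log-probabilities for each full response (names are illustrative; beta is the usual strength hyperparameter controlling deviation from the reference):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair, given summed log-probs of the
    chosen and rejected responses under the policy and frozen reference."""
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

# A policy that favors the chosen response scores lower loss than one that
# hasn't moved from the reference at all:
learned = dpo_loss(policy_chosen_lp=-5, policy_rejected_lp=-10,
                   ref_chosen_lp=-7, ref_rejected_lp=-7)
unlearned = dpo_loss(policy_chosen_lp=-7, policy_rejected_lp=-7,
                     ref_chosen_lp=-7, ref_rejected_lp=-7)
print(learned < unlearned)  # True
```

At zero margin the loss is exactly log 2; gradient descent on this quantity is ordinary supervised learning, which is the whole point.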
By 2025, 70% of enterprises use RLHF or DPO for alignment, with DPO adoption growing 45% year-over-year.
Constitutional AI (CAI)¶
Developed by Anthropic, CAI replaces human preference labeling with self-critique based on ethical principles (a "constitution"). The model generates responses, critiques its own outputs against the constitution, and revises — enabling scalable alignment without massive human annotation.
Phase 4: Reinforcement Learning for Reasoning¶
Models like DeepSeek-R1 and OpenAI o1/o3 add an RL phase specifically targeting step-by-step reasoning:
- Train the model to generate and verify chains of thought
- Reward correct final answers and valid reasoning steps
- Results: DeepSeek-R1 achieves 97.3% on MATH-500 and ~80% on AIME competition problems
Mixture of Experts (MoE)¶
MoE introduces sparsity into the model: instead of activating all parameters for every token, only a subset of specialized "expert" sub-networks fire. This achieves the quality of massive models at the compute cost of much smaller ones.
How MoE Works¶
```mermaid
graph TD
    A[Input Token] --> B[Router / Gating Network]
    B --> C[Expert 1]
    B --> D[Expert 2]
    B --> E[Expert 3]
    B --> F["Expert N (inactive)"]
    C --> G[Weighted Sum of Active Expert Outputs]
    D --> G
    E --> G
    G --> H[Output]
    style F fill:#ccc,stroke:#999
```
- A router (small neural network) scores all experts for each input token
- The top-K experts (typically top-2) are selected
- Their outputs are combined via weighted sum
- Remaining experts are not computed, skipping most of the layer's FLOPs (~75% with top-2-of-8 routing, ~94% for DeepSeek-V3's ratio of active to total parameters)
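The routing steps above can be sketched as a toy single-token forward pass (the router weights and expert functions below are made up for illustration):

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    """Sparse MoE forward for one token: score all experts, run only the
    top_k, and combine their outputs with softmax-renormalized gates."""
    logits = router_w @ x                      # one score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over the top-k only
    # Experts outside `top` are never evaluated -- that is the FLOP saving
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Hypothetical 4-expert layer where the router ties experts 0 and 1:
x = np.array([1.0, 1.0])
router_w = np.array([[0.5, 0.5], [0.5, 0.5], [0.0, 0.0], [0.0, 0.0]])
experts = [lambda v: 2 * v, lambda v: 4 * v, lambda v: v, lambda v: v]
out = moe_layer(x, router_w, experts)          # 0.5*(2x) + 0.5*(4x) = 3x
```

Production routers add load-balancing terms (or, in DeepSeek-V3's case, an auxiliary-loss-free balancing scheme) so tokens don't all pile onto the same experts.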
Key MoE Models¶
| Model | Total Params | Active Params | Experts | Innovation |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | First major open-source MoE; static top-2 routing |
| Mixtral 8x22B | 141B | ~39B | 8 | Scaled Mixtral architecture |
| DeepSeek-V3 | 671B | 37B | 256 | Fine-grained experts; auxiliary-loss-free load balancing; FP8 training |
| DeepSeek-R1 | 671B | 37B | 256 | RL-first reasoning on V3 base; 97.3% MATH-500 |
| Llama 4 Scout | 109B | 17B | 16 | Meta's first MoE; 10M token context |
| Llama 4 Maverick | 400B | 17B | 128 | 128 experts, top-1 routing |
DeepSeek's MoE Innovations¶
DeepSeek introduced two key strategies:
- Fine-grained experts — segment into many small experts (256 instead of 8), activate a small subset, allowing more flexible combinations
- Shared experts — isolate some experts as "shared" across all tokens to capture common knowledge, reducing redundancy in routed experts
As of 2025, MoE is the dominant frontier architecture: confirmed for open models like Llama 4, DeepSeek-V3, and Mixtral, and widely reported for closed models such as GPT-4, Gemini, and Claude.
MoE Memory Tradeoff
MoE memory scales with total parameters, not active parameters. A 671B MoE model needs hundreds of GB of VRAM even though only 37B parameters fire per token. This forces multi-GPU deployments for large MoE models.
Quantization Formats¶
Quantization reduces model weight precision from high-bit (FP32/FP16) to lower-bit (INT8/INT4) representations, dramatically reducing memory and improving inference speed.
Numeric Precision Types¶
| Format | Bits | Bytes/Param | Description | Use Case |
|---|---|---|---|---|
| FP32 | 32 | 4 | Full precision float | Training optimizer states only |
| BF16 | 16 | 2 | Brain Float 16 — wider dynamic range than FP16 | Standard training & full-quality inference; post-2022 default |
| FP16 | 16 | 2 | Half precision float | Legacy inference; same size as BF16 but narrower dynamic range (overflow-prone in training) |
| FP8 | 8 | 1 | 8-bit float; native on Hopper/Blackwell GPUs | Production sweet spot on modern NVIDIA hardware |
| INT8 | 8 | 1 | 8-bit integer | ~50% memory reduction vs FP16; broad hardware support |
| INT4 | 4 | 0.5 | 4-bit integer | ~75% memory reduction; per-group scaling preserves quality |
| INT2 | 2 | 0.25 | 2-bit integer | Extreme compression; significant quality loss |
How Quantization Works¶
Full-precision weights (e.g. FP16) are mapped to a smaller set of representable values:
- Per-tensor quantization — one scale factor for the entire weight tensor (fast but lossy)
- Per-channel quantization — one scale factor per output channel (better quality)
- Per-group quantization — divides weights into small groups (typically 32–128 elements), each with its own scale (best quality/size tradeoff for INT4)
The scale factor maps the quantized integer range back to the original floating-point range during inference.
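A sketch of symmetric per-group quantization to 4-bit integers (the group size and the symmetric [-8, 7] mapping are illustrative choices; production schemes add zero-points and importance weighting):

```python
import numpy as np

def quantize_int4_groups(w, group_size=128):
    """Symmetric per-group INT4 quantization: one FP scale per group,
    weights stored as integers in [-8, 7]."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7   # map max |w| -> 7
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Map the integers back to floats via each group's scale factor."""
    return q * scales

w = np.random.default_rng(0).normal(size=(1024,)).astype(np.float32)
q, scales = quantize_int4_groups(w)
w_hat = dequantize(q, scales).reshape(-1)
print(np.abs(w - w_hat).max())   # bounded by half a quantization step
```

Each group's worst-case rounding error is half its scale, which is why smaller groups (more scales, slightly more storage) preserve quality better.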
Quality Impact by Precision¶
Perplexity benchmarks (Llama-2-7B, lower is better):
| Format | Perplexity | Quality Loss |
|---|---|---|
| FP16 (baseline) | 7.4924 | — |
| Q8_0 | 7.4933 | Negligible |
| Q5_K_M | ~7.52 | Minimal |
| Q4_K_M | 7.5692 | Acceptable |
| Q3_K_M | ~7.85 | Noticeable |
| Q2_K | 8.6501 | Significant degradation |
Low-Bit Caveats
At Q2/Q3, models start ignoring parts of system prompts and hallucinating JSON formatting. Avoid INT4 and below for math, code generation, and reasoning-heavy tasks where quality loss is most noticeable.
Size and Speed Example (Llama 2 13B)¶
| Metric | FP16 | Q4_K_M |
|---|---|---|
| Model size | 26 GB | 7.9 GB (70% reduction) |
| RAM required | 32 GB+ | 12 GB |
| Speed | 8 tok/s | 15 tok/s |
| Quality | 100% | ~95% |
Model Formats and Quantization Methods¶
GGUF (GPT-Generated Unified Format)¶
GGUF is a self-contained file format created by the llama.cpp project. It bundles weights, tokenizer, architecture metadata, and chat template into a single .gguf file.
Key properties:
- Runs on everything — CPU, NVIDIA, AMD, Apple Silicon
- mmap-able (OS maps file into memory without loading it all)
- Endian-safe and versioned
- Powers Ollama and LM Studio under the hood
GGUF quantization naming convention:
| Name | Approx Bits/Weight | Type | Quality |
|---|---|---|---|
| Q2_K | ~2.6 | K-quant | Extreme compression, noticeable degradation |
| Q3_K_S / Q3_K_M / Q3_K_L | ~3.3–3.9 | K-quant | Budget-conscious, some quality loss |
| Q4_K_S / Q4_K_M | ~4.3–4.8 | K-quant | Best balance of quality and size |
| Q5_K_S / Q5_K_M | ~5.3–5.7 | K-quant | Near-lossless for most tasks |
| Q6_K | ~6.6 | K-quant | Very close to FP16 |
| Q8_0 | ~8.5 | Legacy | Near-identical to FP16 |
| IQ2_XXS / IQ3_S | ~2.1–3.4 | I-quant | State-of-art low-bit; uses lookup tables |
The "K" indicates k-quant method (importance-aware mixed-precision); S/M/L are compression aggressiveness levels.
GPTQ (GPT-Quantized)¶
Calibration-based 4-bit integer quantization using approximate second-order (Hessian) information to minimize quantization error. Requires a small calibration dataset. AutoGPTQ was archived in April 2025; succeeded by GPTQModel v5.8.0.
Verdict: Use only if AWQ or EXL2 versions are unavailable. Both offer better quality-per-bit.
AWQ (Activation-Aware Weight Quantization)¶
MIT research. Identifies the <1% of "salient" weights by observing activations during calibration, then preserves them at higher precision.
- ~3 percentage points better than GPTQ on MMLU at 4 bits
- Marlin-AWQ kernel: ~741 tok/s on A10G — fastest 4-bit for NVIDIA
- Best choice for vLLM multi-user deployments on NVIDIA
EXL2 (ExLlamaV2)¶
Mixed bit-width quantization — can use 2, 3, 4, 5, 6, 8 bits within a single model and even within individual layers. Supports fractional average bitwidths (e.g., 4.5 bpw).
- Fastest for interactive single-user generation on NVIDIA GPUs (40–70% faster than llama.cpp)
- NVIDIA CUDA only, no CPU fallback
- Best for single-user interactive sessions at 4–6 bpw
Quick Format Decision Guide¶
| Scenario | Best Format |
|---|---|
| CPU / Laptop / Apple Silicon | GGUF (Q4_K_M or Q5_K_M) |
| NVIDIA GPU, max serving throughput | AWQ with Marlin kernels |
| NVIDIA GPU, single-user interactive | EXL2 at 4–6 bpw |
| NVIDIA H100/Blackwell production | FP8 |
| Fine-tuning | bitsandbytes (QLoRA) |
| Limited VRAM (≤8GB) | GGUF Q4_K_M with CPU offloading |
| General starting point | Ollama with Q4_K_M |
MLX (Apple Silicon)¶
MLX is Apple's open-source array framework for machine learning on Apple Silicon. Designed for Mac-native LLM inference and fine-tuning.
Key Design Principles¶
- Unified memory — arrays live in shared CPU/GPU memory; no data transfer overhead
- Lazy computation — operations are materialized only when needed, enabling automatic fusion
- Dynamic graphs — no recompilation on shape changes (unlike TensorRT)
- Familiar APIs — Python API mirrors NumPy; mlx.nn mirrors PyTorch
MLX LM¶
The mlx-lm package provides one-command model download, quantization conversion, and inference:
```shell
# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Llama-3-8B --quantize --q-bits 4

# Generate text
mlx_lm.generate --model mlx-community/Llama-3-8B-4bit --prompt "Explain attention"
```
Performance on Apple Silicon¶
| Chip | Memory Bandwidth | 14B Dense (BF16) TTFT | 30B MoE (4-bit) TTFT |
|---|---|---|---|
| M4 | 120 GB/s | ~12s | ~4s |
| M5 | 153 GB/s | <10s | <3s |
The M5's 19–27% improvement over the M4 tracks its ~28% memory bandwidth increase — token generation is memory-bandwidth-bound.
Research shows vllm-mlx achieves 21–87% higher throughput than llama.cpp on Apple Silicon, thanks to zero-copy tensor operations and lazy evaluation.
A MacBook Pro 24GB can hold an 8B model in BF16 or a 30B MoE at 4-bit quantization comfortably.
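Those sizing claims follow from back-of-envelope arithmetic (the 10% overhead factor below is an assumption; KV cache grows with context and adds more on top):

```python
def model_memory_gb(params_billions, bits_per_weight, overhead=1.1):
    """Rough weight-memory estimate: parameters x bits/weight, plus an
    assumed ~10% overhead for activations and runtime buffers."""
    # billions of params x bytes/weight gives GB directly
    return params_billions * bits_per_weight / 8 * overhead

print(round(model_memory_gb(8, 16), 1))     # 8B dense in BF16 -> 17.6 GB
print(round(model_memory_gb(30, 4.5), 1))   # 30B MoE at ~4.5 bpw -> 18.6 GB
```

Both estimates land under 24 GB, consistent with the MacBook Pro example above — though with little headroom for long contexts.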
Knowledge Distillation¶
Knowledge distillation compresses a large teacher model into a smaller student model that mimics the teacher's behavior while being far cheaper to run.
How It Works¶
```mermaid
graph LR
    A[Input Data] --> B[Teacher Model - Large]
    A --> C[Student Model - Small]
    B --> D[Soft Targets / Probabilities]
    D --> E[Distillation Loss]
    C --> F[Student Predictions]
    F --> E
    E --> G[Update Student Weights]
```
Three main distillation approaches:
| Method | What Transfers | Description |
|---|---|---|
| Response-based | Output probabilities ("soft targets") | Student learns teacher's probability distribution over vocabulary, not just the argmax |
| Feature-based | Intermediate layer activations | Student aligns internal representations via L2 or cosine similarity |
| Attention-based | Attention maps | Student replicates teacher's attention patterns (used in DistilBERT) |
Why Soft Targets Matter¶
Instead of training on hard labels (the single correct answer), the student learns from the teacher's full probability distribution. The relative probabilities encode the teacher's learned generalizations — for example, that "dog" and "puppy" are similar while "dog" and "table" are not.
A temperature parameter $T$ (typically 2–5) controls how "soft" the distribution is: higher temperature spreads probability more evenly, exposing more of the teacher's learned structure.
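A quick sketch of temperature scaling (the logits are made up; note how the higher temperature gives the runner-up token visible probability mass):

```python
import math

def soften(logits, T):
    """Temperature-scaled softmax: higher T flattens the distribution,
    exposing the teacher's relative preferences among non-argmax tokens."""
    scaled = [z / T for z in logits]
    m = max(scaled)                             # stabilize the exponentials
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 4.5, -2.0]        # hypothetical scores for "dog", "puppy", "table"
p1 = soften(logits, T=1)         # peaked: the argmax dominates
p4 = soften(logits, T=4)         # softer: "puppy" now carries real signal
print(p1, p4)
```

The student is trained against the softened distribution (with the same T applied to its own logits), so these inter-class similarities become part of its training signal.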
Results¶
- Typical compression: 5–10x smaller, retaining 90–95% accuracy
- DistilBERT: 60% of BERT's size, 97% of its performance, 60% faster
- DeepSeek-R1-Distill models: distilled from 671B to 7B/14B/32B variants with strong reasoning capabilities
Emerging Trends (2025)¶
- Chain-of-Thought Distillation — transfers reasoning processes (not just final answers) from teacher to student using CoT rationales as training signal
- Curriculum Distillation — organizes training easy-to-hard to gradually build reasoning capacity
- Multi-Teacher Distillation — combines expertise from multiple specialized teachers with dynamic weighting
- Few-Shot Distillation — effective with as few as 8–512 calibration samples using counterfactual explanations
Scaling Laws¶
Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) established empirical scaling laws for LLMs:
- Model performance (loss) improves predictably as a power law of: model size (parameters), dataset size (tokens), and compute budget (FLOPs)
- Chinchilla-optimal: for a given compute budget, parameter count and training tokens should be scaled up in equal proportion, landing at roughly 20 training tokens per parameter
- DeepSeek-V3 trained its 671B-total (37B-active) parameters on 14.8T tokens — roughly 400 tokens per active parameter, far beyond Chinchilla-optimal, but MoE's sparse activation changes the calculus
These laws guide decisions about how to allocate training budgets: bigger model vs more data vs longer training.
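The Chinchilla rule of thumb can be sketched with the common approximations that training costs C ≈ 6·N·D FLOPs and that the compute-optimal data budget is D ≈ 20·N tokens:

```python
import math

def chinchilla_optimal(compute_flops):
    """Rule-of-thumb compute-optimal sizing.

    With C = 6*N*D and D = 20*N, solving C = 6*N*(20*N) = 120*N^2
    gives N = sqrt(C / 120) parameters and D = 20*N tokens.
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget was roughly 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs:
n, d = chinchilla_optimal(5.88e23)
print(f"{n/1e9:.0f}B params, {d/1e12:.1f}T tokens")  # ~70B params, ~1.4T tokens
```

Plugging in Chinchilla's own compute budget recovers its published 70B-parameter, 1.4T-token configuration, which is a useful sanity check on the approximation.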
Post-Transformer Architectures¶
While transformers dominate, alternatives are emerging:
| Architecture | Key Innovation | Status |
|---|---|---|
| Mamba (State Space Models) | Selective state updates; linear-time sequence processing; no quadratic attention | Competitive with transformers at small-medium scale |
| RWKV | RNN-transformer hybrid; linear attention | Active open-source community |
| Hyena | Long convolutions replace attention | Research stage |
| PaTH Attention (MIT, 2025) | Adds data-dependent down-weighting to standard attention | Improves reasoning and long-context tasks |
None have yet displaced transformers at frontier scale, but Mamba-based hybrids (Jamba by AI21) show promise for efficiency-critical applications.