
LLM Architecture

How Large Language Models work — from transformer internals and attention mechanisms through training pipelines, quantization formats, model distribution formats, and knowledge distillation.


Transformer Architecture

The transformer is the neural network architecture behind virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, it replaced recurrent neural networks (RNNs/LSTMs) by processing all tokens in a sequence simultaneously rather than sequentially.

Why Transformers Replaced RNNs

RNNs process tokens one at a time, left to right. This sequential bottleneck means:

  • Training cannot be parallelized across sequence positions
  • Long-range dependencies decay over distance (vanishing gradients)
  • Training time scales linearly with sequence length

Transformers solve all three problems through self-attention, which computes relationships between every pair of tokens in a single matrix operation — fully parallelizable on GPUs.

High-Level Data Flow

```mermaid
graph LR
    A[Raw Text] --> B[Tokenizer]
    B --> C[Token IDs]
    C --> D[Embedding Layer]
    D --> E[+ Positional Encoding]
    E --> F[Transformer Blocks x N]
    F --> G[Output Layer / Logits]
    G --> H[Softmax → Probability Distribution]
    H --> I[Next Token]
```

  1. Tokenization — text is split into subword tokens (integers from a fixed vocabulary)
  2. Embedding — each token ID maps to a dense vector via a learned embedding table
  3. Positional Encoding — positional signals are added so the model knows token order (attention itself is order-agnostic)
  4. Transformer Blocks — a stack of N identical layers, each containing self-attention + feed-forward network + residual connections + layer normalization
  5. Output Layer — projects hidden states to vocabulary-sized logits
  6. Softmax — converts logits to a probability distribution over the vocabulary
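The pipeline above can be sketched end to end with toy NumPy arrays. The weights and token IDs here are random stand-ins, not a real model, and the transformer blocks themselves (steps 3–4) are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64

# Hypothetical toy weights standing in for a trained model.
embedding = rng.normal(size=(vocab_size, d_model))   # step 2: embedding table
w_out = rng.normal(size=(d_model, vocab_size))       # step 5: output projection

token_ids = np.array([3, 17, 42, 7, 99])             # step 1: tokenizer output
hidden = embedding[token_ids]                        # step 2: look up dense vectors
# (steps 3-4, positional encoding and transformer blocks, omitted)

logits = hidden[-1] @ w_out                          # step 5: project last position
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # step 6: softmax over vocabulary
next_token = int(probs.argmax())                     # greedy pick of the next token
```

In a real model, `hidden` would pass through all N transformer blocks before the output projection, and sampling strategies (temperature, top-p) would replace the greedy `argmax`.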

Modern LLMs use 12 to several hundred transformer blocks. Deeper stacks enable richer hierarchical abstractions.

Encoder-Decoder vs Decoder-Only

The original transformer had two halves:

| Architecture | Used By | How It Works |
|---|---|---|
| Encoder-Decoder | T5, BART, original Transformer | Encoder reads full input bidirectionally; decoder generates output autoregressively |
| Encoder-Only | BERT, RoBERTa | Bidirectional attention for understanding tasks (classification, NER) |
| Decoder-Only | GPT series, LLaMA, Claude, Mistral | Causal (left-to-right) attention; generates text one token at a time |

Nearly all modern generative LLMs use the decoder-only variant. The encoder-only approach lives on in embedding models and classification tasks.


Self-Attention Mechanism

Self-attention is the core innovation that makes transformers work. It allows every token to "attend to" every other token in the sequence, computing relevance scores dynamically.

Query, Key, Value (QKV)

For each token, the model computes three vectors from the input embedding:

  • Query (Q) — "what am I looking for?"
  • Key (K) — "what do I contain?"
  • Value (V) — "what information do I provide?"

The attention score between two tokens is the dot product of one token's Query with another's Key, scaled and passed through softmax:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where $d_k$ is the dimension of the key vectors (scaling prevents dot products from growing too large).
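The formula maps directly to a few lines of NumPy. This is a minimal single-head sketch with random vectors, without the learned Q/K/V projection matrices a real layer would apply first:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq, seq) scaled relevance scores
    weights = softmax(scores)         # each row is a probability distribution
    return weights @ V                # weighted mix of value vectors

rng = np.random.default_rng(0)
seq, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq, d_k)) for _ in range(3))
out = attention(Q, K, V)              # one output vector per input token
```

Each output row is a convex combination of the value vectors, weighted by how strongly that token's query matches every key.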

Causal Masking

In decoder-only models, a causal mask is applied to the attention matrix: the upper triangle is set to $-\infty$ before softmax, preventing tokens from attending to future positions. This ensures autoregressive generation — each token can only see tokens that came before it.
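A small NumPy sketch of the mask itself: with uniform (zero) scores, each position ends up attending equally over itself and the past, and gives exactly zero weight to the future:

```python
import numpy as np

seq = 4
scores = np.zeros((seq, seq))                          # toy pre-softmax scores
# Upper triangle (future positions) set to -inf before softmax.
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row i attends uniformly over positions 0..i; exp(-inf) = 0 kills the future.
```

Row 0 attends only to itself (`[1, 0, 0, 0]`); the last row spreads weight evenly over all four positions.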

Multi-Head Attention

Rather than computing a single attention function, transformers use multiple attention heads (commonly 16–128 in modern LLMs, depending on model size), each with independent Q/K/V projections. Different heads learn to capture different types of relationships (syntactic, semantic, positional). The outputs are concatenated and linearly projected.

Attention Variants

| Variant | Description | Used By |
|---|---|---|
| Multi-Head Attention (MHA) | Each head has its own K, V projections | Original Transformer, GPT-2 |
| Multi-Query Attention (MQA) | All heads share a single K, V projection | PaLM, Falcon |
| Grouped-Query Attention (GQA) | Heads grouped into clusters sharing K, V | LLaMA 2/3, Mistral, Gemma |

GQA is the current standard — it reduces KV cache memory by 4-8x compared to MHA with minimal quality loss.


Tokenization and Embeddings

Tokenization

Tokenization converts raw text into integer token IDs from a fixed vocabulary. LLMs use subword tokenization — a middle ground between character-level (too fine) and word-level (can't handle unknown words).

Byte Pair Encoding (BPE) is the dominant algorithm:

  1. Start with a vocabulary of 256 byte values
  2. Find the most frequent adjacent byte pair in the training corpus
  3. Merge that pair into a new token, add to vocabulary
  4. Repeat until vocabulary reaches target size (30K–100K tokens)

Common words become single tokens; rare words decompose into known subword pieces.
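The merge loop above can be implemented in a few dozen lines. This is a toy character-level trainer on a made-up corpus, not a production byte-level BPE:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE: learn merge rules from a list of words, starting from characters."""
    words = Counter(tuple(w) for w in corpus)   # word -> frequency, as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():        # step 2: count adjacent pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merges.append(best)                     # step 3: new vocabulary entry
        merged = best[0] + best[1]
        new_words = Counter()
        for word, freq in words.items():        # rewrite corpus with the merge
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(word[i]); i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
# "l"+"o" merges first, then "lo"+"w" -- frequent sequences become single tokens.
```

Production tokenizers start from 256 byte values rather than characters and run tens of thousands of merges, but the loop is the same.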

| Algorithm | Description | Used By |
|---|---|---|
| BPE (byte-level) | Merge most frequent byte pairs | GPT-2/3/4, LLaMA 3, Claude, Mistral |
| WordPiece | Merge pairs that maximize corpus likelihood | BERT, DistilBERT |
| SentencePiece | Language-agnostic, operates on raw text | LLaMA 1/2, Mistral (earlier), T5 |
| Unigram | Probabilistic model, prunes vocabulary down | SentencePiece variant, XLNet |

Tokenization Quirks

Many LLM "failures" trace back to tokenization. Math errors occur because multi-digit numbers split into arbitrary subword tokens. Spelling struggles happen because the model never sees individual characters. "Glitch tokens" — tokens frequent in tokenizer training data but rare in model training — produce unpredictable outputs.

Embeddings

The embedding layer maps each integer token ID to a dense vector (typically 4096–12288 dimensions). These vectors are learned during pretraining and encode semantic relationships: similar tokens have similar vectors.

Positional encoding adds sequence-order information since attention is inherently order-agnostic. Modern LLMs use Rotary Position Embeddings (RoPE), which encode relative positions directly into the Q/K dot product, enabling better extrapolation to longer sequences than the model was trained on.
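A minimal sketch of the RoPE rotation. Each pair of dimensions is rotated by an angle proportional to the token's position; the pairing convention varies between implementations (interleaved here, versus the split-half layout some LLaMA codebases use), but the key property is the same: dot products between rotated vectors depend only on relative position.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate (dim pairs of) x by position-dependent angles -- a RoPE sketch."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # one frequency per dim pair
    angles = positions[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: (pos 3, pos 5) and (pos 10, pos 12) give the
# same Q.K dot product because both differ by 2.
v = np.ones((1, 8))
a = rope(v, np.array([3])) @ rope(v, np.array([5])).T
b = rope(v, np.array([10])) @ rope(v, np.array([12])).T
assert np.allclose(a, b)
```

Because the position enters as a rotation inside the Q/K dot product rather than as an added vector, no separate positional embedding table is needed.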

Context Windows

The context window is the maximum number of tokens an LLM can process in one request. All input (system prompt, conversation history, user query) and output share this budget.

| Era | Typical Context | Example Models |
|---|---|---|
| 2018–2020 | 512–2048 | BERT, GPT-2 |
| 2022–2023 | 4K–32K | GPT-4, Claude 2 |
| 2024–2025 | 128K–1M | Claude 3.5, Gemini 1.5, GPT-4 Turbo |
| 2025–2026 | 1M–10M | Gemini 2.0, Claude 4 |

Lost in the Middle

Research shows LLMs attend strongly to tokens at the beginning and end of context but drop 30%+ accuracy on information in the middle (Liu et al., Stanford 2024). Placing critical information at the start or end of prompts improves retrieval quality.


Training Pipeline

Phase 1: Pretraining

The model learns general knowledge by predicting the next token across trillions of tokens from web crawls, books, code, and curated datasets. This is by far the most expensive phase — DeepSeek-V3 required 2.788 million H800 GPU hours (~$5.6M) for 14.8 trillion tokens.

The training objective is simple causal language modeling:

$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}) $$
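The loss above is just the summed negative log-probability the model assigns to each token that actually came next. A minimal NumPy version, with random logits standing in for model outputs:

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Summed negative log-likelihood of the actual next tokens."""
    # logits: (T, vocab) predictions; targets: the token observed at each step.
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 sequence positions, vocab of 10
targets = np.array([2, 7, 1, 9])    # tokens that actually followed
loss = causal_lm_loss(logits, targets)
```

Frameworks compute the mean rather than the sum and fuse the softmax into the loss, but the objective is identical.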

Phase 2: Supervised Fine-Tuning (SFT)

The pretrained model is further trained on curated instruction-response pairs to learn:

  • Instruction following
  • Output formatting (JSON, markdown, structured responses)
  • Safety behaviors
  • Task-specific patterns

SFT datasets are much smaller (thousands to millions of examples) but high quality.

Phase 3: Alignment

RLHF (Reinforcement Learning from Human Feedback)

The traditional alignment pipeline:

```mermaid
graph LR
    A[SFT Model] --> B[Generate Multiple Responses]
    B --> C[Human Annotators Rank Outputs]
    C --> D[Train Reward Model]
    D --> E[Optimize Policy via PPO]
    E --> F[Aligned Model]
```

  1. Generate multiple responses per prompt
  2. Human annotators rank them
  3. Train a reward model to predict human preferences
  4. Use PPO (Proximal Policy Optimization) to optimize the base model against the reward model

Downsides: complex, expensive, unstable training, susceptible to reward hacking.

DPO (Direct Preference Optimization)

Introduced by Rafailov et al. (2023), DPO simplifies alignment by eliminating the reward model entirely. It reframes preference learning as a binary classification problem:

  • Given a chosen response and a rejected response, directly optimize the model to increase the probability of the chosen response relative to the rejected one
  • Requires only 2 models (policy + frozen reference) vs RLHF's 4
  • Standard supervised learning infrastructure — no RL instability
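The DPO objective for a single preference pair is compact enough to write out directly. This sketch takes sequence log-probabilities as plain numbers; β (here 0.1, a typical but arbitrary choice) controls how far the policy may drift from the reference:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Margin: how much MORE the policy prefers chosen over rejected,
    # relative to the frozen reference model's preference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more than the reference does -> low loss.
low = dpo_loss(-10.0, -30.0, ref_logp_chosen=-20.0, ref_logp_rejected=-25.0)
# Policy prefers the rejected response -> high loss.
high = dpo_loss(-30.0, -10.0, ref_logp_chosen=-25.0, ref_logp_rejected=-20.0)
```

Because this is an ordinary differentiable loss over log-probabilities, it trains with standard supervised infrastructure, which is exactly the simplification DPO buys over PPO-based RLHF.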

By 2025, 70% of enterprises use RLHF or DPO for alignment, with DPO adoption growing 45% year-over-year.

Constitutional AI (CAI)

Developed by Anthropic, CAI replaces human preference labeling with self-critique based on ethical principles (a "constitution"). The model generates responses, critiques its own outputs against the constitution, and revises — enabling scalable alignment without massive human annotation.

Phase 4: Reinforcement Learning for Reasoning

Models like DeepSeek-R1 and OpenAI o1/o3 add an RL phase specifically targeting step-by-step reasoning:

  • Train the model to generate and verify chains of thought
  • Reward correct final answers and valid reasoning steps
  • Results: DeepSeek-R1 achieves 97.3% on MATH, ~80% on AIME competition problems

Mixture of Experts (MoE)

MoE introduces sparsity into the model: instead of activating all parameters for every token, only a subset of specialized "expert" sub-networks fire. This achieves the quality of massive models at the compute cost of much smaller ones.

How MoE Works

```mermaid
graph TD
    A[Input Token] --> B[Router / Gating Network]
    B --> C[Expert 1]
    B --> D[Expert 2]
    B --> E[Expert 3]
    B --> F["Expert N (inactive)"]
    C --> G[Weighted Sum of Active Expert Outputs]
    D --> G
    E --> G
    G --> H[Output]
    style F fill:#ccc,stroke:#999
```

  1. A router (small neural network) scores all experts for each input token
  2. The top-K experts (typically top-2) are selected
  3. Their outputs are combined via weighted sum
  4. Remaining experts are not computed — saving ~90% of FLOPs
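The routing steps above reduce to a few lines. Here the experts are hypothetical linear layers standing in for the FFN sub-networks of a real MoE block:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, router_w, experts, top_k=2):
    """Route one token to its top-k experts and mix their outputs."""
    scores = softmax(x @ router_w)             # step 1: router scores all experts
    top = np.argsort(scores)[-top_k:]          # step 2: select top-k experts
    gate = scores[top] / scores[top].sum()     # renormalize gate weights
    # Steps 3-4: only the selected experts run; the rest cost no FLOPs.
    return sum(g * experts[i](x) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
router_w = rng.normal(size=(d, n_experts))
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, w=w: x @ w for w in weights]   # toy stand-in experts

y = moe_forward(rng.normal(size=d), router_w, experts)
```

Production routers add load-balancing losses (or, in DeepSeek-V3's case, bias-based balancing) so tokens don't all pile onto the same few experts; that machinery is omitted here.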

Key MoE Models

| Model | Total Params | Active Params | Experts | Innovation |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | First major open-source MoE; static top-2 routing |
| Mixtral 8x22B | 141B | ~39B | 8 | Scaled Mixtral architecture |
| DeepSeek-V3 | 671B | 37B | 256 | Fine-grained experts; auxiliary-loss-free load balancing; FP8 training |
| DeepSeek-R1 | 671B | 37B | 256 | RL-first reasoning on V3 base; 97.3% MATH |
| Llama 4 Scout | 109B | 17B | 16 | Meta's first MoE; 10M token context |
| Llama 4 Maverick | 400B | 17B | 128 | 128 experts, top-1 routing |

DeepSeek's MoE Innovations

DeepSeek introduced two key strategies:

  1. Fine-grained experts — segment into many small experts (256 instead of 8), activate a small subset, allowing more flexible combinations
  2. Shared experts — isolate some experts as "shared" across all tokens to capture common knowledge, reducing redundancy in routed experts

As of 2025, nearly all frontier models (GPT-4, Gemini, Claude, Llama 4, DeepSeek, Mistral Large) use MoE architectures.

MoE Memory Tradeoff

MoE memory scales with total parameters, not active parameters. A 671B MoE model needs hundreds of GB of VRAM even though only 37B parameters fire per token. This forces multi-GPU deployments for large MoE models.


Quantization Formats

Quantization reduces model weight precision from high-bit (FP32/FP16) to lower-bit (INT8/INT4) representations, dramatically reducing memory and improving inference speed.

Numeric Precision Types

| Format | Bits | Bytes/Param | Description | Use Case |
|---|---|---|---|---|
| FP32 | 32 | 4 | Full precision float | Training optimizer states only |
| BF16 | 16 | 2 | Brain Float 16 — wider dynamic range than FP16 | Standard training & full-quality inference; post-2022 default |
| FP16 | 16 | 2 | Half precision float | Legacy inference; same memory as BF16 but narrower dynamic range |
| FP8 | 8 | 1 | 8-bit float; native on Hopper/Blackwell GPUs | Production sweet spot on modern NVIDIA hardware |
| INT8 | 8 | 1 | 8-bit integer | ~50% memory reduction vs FP16; broad hardware support |
| INT4 | 4 | 0.5 | 4-bit integer | ~75% memory reduction; per-group scaling preserves quality |
| INT2 | 2 | 0.25 | 2-bit integer | Extreme compression; significant quality loss |

How Quantization Works

Full-precision weights (e.g. FP16) are mapped to a smaller set of representable values:

  1. Per-tensor quantization — one scale factor for the entire weight tensor (fast but lossy)
  2. Per-channel quantization — one scale factor per output channel (better quality)
  3. Per-group quantization — divides weights into groups of 128 elements, each with its own scale (best quality/size tradeoff for INT4)

The scale factor maps the quantized integer range back to the original floating-point range during inference.
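A minimal sketch of symmetric per-group quantization and dequantization (scheme 3 above). Real quantizers add zero-points, calibration, and importance weighting; this shows only the core scale-factor mechanics:

```python
import numpy as np

def quantize_per_group(w, group_size=128, bits=4):
    """Symmetric per-group quantization: each group of weights gets its own scale."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for INT4
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to floats using the stored scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1024).astype(np.float32)   # toy weight tensor
q, scale = quantize_per_group(w)                           # 8 groups of 128
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()    # bounded by half a quantization step per group
```

Storage drops from 4 bytes to roughly 0.5 bytes per weight plus one scale per 128-element group, which is where the "~4.5 bits/weight" figures for 4-bit formats come from.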

Quality Impact by Precision

Perplexity benchmarks (Llama-2-7B, lower is better):

| Format | Perplexity | Quality Loss |
|---|---|---|
| FP16 | 7.4924 | Baseline |
| Q8_0 | 7.4933 | Negligible |
| Q5_K_M | ~7.52 | Minimal |
| Q4_K_M | 7.5692 | Acceptable |
| Q3_K_M | ~7.85 | Noticeable |
| Q2_K | 8.6501 | Significant degradation |

Low-Bit Caveats

At Q2/Q3, models start ignoring parts of system prompts and hallucinating JSON formatting. Avoid INT4 and below for math, code generation, and reasoning-heavy tasks where quality loss is most noticeable.

Size and Speed Example (Llama 2 13B)

| Metric | FP16 | Q4_K_M |
|---|---|---|
| Model size | 26 GB | 7.9 GB (70% reduction) |
| RAM required | 32 GB+ | 12 GB |
| Speed | 8 tok/s | 15 tok/s |
| Quality | 100% | ~95% |

Model Formats and Quantization Methods

GGUF (GPT-Generated Unified Format)

GGUF is a self-contained file format created by the llama.cpp project. It bundles weights, tokenizer, architecture metadata, and chat template into a single .gguf file.

Key properties:

  • Runs on everything — CPU, NVIDIA, AMD, Apple Silicon
  • mmap-able (OS maps file into memory without loading it all)
  • Endian-safe and versioned
  • Powers Ollama and LM Studio under the hood

GGUF quantization naming convention:

| Name | Approx Bits/Weight | Type | Quality |
|---|---|---|---|
| Q2_K | ~2.6 | K-quant | Extreme compression, noticeable degradation |
| Q3_K_S / Q3_K_M / Q3_K_L | ~3.3–3.9 | K-quant | Budget-conscious, some quality loss |
| Q4_K_S / Q4_K_M | ~4.3–4.8 | K-quant | Best balance of quality and size |
| Q5_K_S / Q5_K_M | ~5.3–5.7 | K-quant | Near-lossless for most tasks |
| Q6_K | ~6.6 | K-quant | Very close to FP16 |
| Q8_0 | ~8.5 | Legacy | Near-identical to FP16 |
| IQ2_XXS / IQ3_S | ~2.1–3.4 | I-quant | State-of-art low-bit; uses lookup tables |

The "K" indicates k-quant method (importance-aware mixed-precision); S/M/L are compression aggressiveness levels.

GPTQ (GPT-Quantized)

Calibration-based 4-bit integer quantization using approximate second-order (Hessian) information to minimize quantization error. Requires a small calibration dataset. AutoGPTQ was archived in April 2025; succeeded by GPTQModel v5.8.0.

Verdict: Use only if AWQ or EXL2 versions are unavailable. Both offer better quality-per-bit.

AWQ (Activation-Aware Weight Quantization)

MIT research. Identifies the <1% of "salient" weights by observing activations during calibration, then preserves them at higher precision.

  • ~3 percentage points better than GPTQ on MMLU at 4 bits
  • Marlin-AWQ kernel: ~741 tok/s on A10G — fastest 4-bit for NVIDIA
  • Best choice for vLLM multi-user deployments on NVIDIA

EXL2 (ExLlamaV2)

Mixed bit-width quantization — can use 2, 3, 4, 5, 6, 8 bits within a single model and even within individual layers. Supports fractional average bitwidths (e.g., 4.5 bpw).

  • Fastest for interactive single-user generation on NVIDIA GPUs (40–70% faster than llama.cpp)
  • NVIDIA CUDA only, no CPU fallback
  • Best for single-user interactive sessions at 4–6 bpw

Quick Format Decision Guide

| Scenario | Best Format |
|---|---|
| CPU / Laptop / Apple Silicon | GGUF (Q4_K_M or Q5_K_M) |
| NVIDIA GPU, max serving throughput | AWQ with Marlin kernels |
| NVIDIA GPU, single-user interactive | EXL2 at 4–6 bpw |
| NVIDIA H100/Blackwell production | FP8 |
| Fine-tuning | bitsandbytes (QLoRA) |
| Limited VRAM (≤8GB) | GGUF Q4_K_M with CPU offloading |
| General starting point | Ollama with Q4_K_M |

MLX (Apple Silicon)

MLX is Apple's open-source array framework for machine learning on Apple Silicon. Designed for Mac-native LLM inference and fine-tuning.

Key Design Principles

  • Unified memory — arrays live in shared CPU/GPU memory; no data transfer overhead
  • Lazy computation — operations are materialized only when needed, enabling automatic fusion
  • Dynamic graphs — no recompilation on shape changes (unlike TensorRT)
  • Familiar APIs — Python API mirrors NumPy; mlx.nn mirrors PyTorch

MLX LM

The mlx-lm package provides one-command model download, quantization conversion, and inference:

```shell
# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Llama-3-8B --quantize --q-bits 4

# Generate text
mlx_lm.generate --model mlx-community/Llama-3-8B-4bit --prompt "Explain attention"
```

Performance on Apple Silicon

| Chip | Memory Bandwidth | 14B Dense (BF16) TTFT | 30B MoE (4-bit) TTFT |
|---|---|---|---|
| M4 | 120 GB/s | ~12s | ~4s |
| M5 | 153 GB/s | <10s | <3s |

The M5 provides 19–27% improvement over M4, directly proportional to its 28% memory bandwidth increase.

Research shows vllm-mlx achieves 21–87% higher throughput than llama.cpp on Apple Silicon, thanks to zero-copy tensor operations and lazy evaluation.

A MacBook Pro 24GB can hold an 8B model in BF16 or a 30B MoE at 4-bit quantization comfortably.


Knowledge Distillation

Knowledge distillation compresses a large teacher model into a smaller student model that mimics the teacher's behavior while being far cheaper to run.

How It Works

```mermaid
graph LR
    A[Input Data] --> B[Teacher Model - Large]
    A --> C[Student Model - Small]
    B --> D[Soft Targets / Probabilities]
    D --> E[Distillation Loss]
    C --> F[Student Predictions]
    F --> E
    E --> G[Update Student Weights]
```

Three main distillation approaches:

| Method | What Transfers | Description |
|---|---|---|
| Response-based | Output probabilities ("soft targets") | Student learns teacher's probability distribution over vocabulary, not just the argmax |
| Feature-based | Intermediate layer activations | Student aligns internal representations via L2 or cosine similarity |
| Attention-based | Attention maps | Student replicates teacher's attention patterns (used in DistilBERT) |

Why Soft Targets Matter

Instead of training on hard labels (the single correct answer), the student learns from the teacher's full probability distribution. The relative probabilities encode the teacher's learned generalizations — for example, that "dog" and "puppy" are similar while "dog" and "table" are not.

A temperature parameter $T$ (typically 2–5) controls how "soft" the distribution is: higher temperature spreads probability more evenly, exposing more of the teacher's learned structure.
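The temperature-softened distillation loss is a KL divergence between teacher and student distributions. A toy sketch with made-up logits over a four-token vocabulary; the T² factor is the standard rescaling from Hinton et al.'s formulation:

```python
import numpy as np

def softmax(x, T=1.0):
    e = np.exp((x - x.max()) / T)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=3.0):
    """KL divergence from the softened teacher distribution to the student's."""
    p = softmax(teacher_logits, T)      # soft targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)

# Toy logits over a 4-token vocab: "dog", "puppy", "cat", "table"
teacher = np.array([5.0, 4.2, 0.1, -3.0])
student = np.array([4.0, 1.0, 0.5, -1.0])
loss = distillation_loss(student, teacher)

# Higher temperature spreads probability mass, exposing the teacher's
# similarity structure ("puppy" and even "table" gain visible probability).
p_hard, p_soft = softmax(teacher, T=1.0), softmax(teacher, T=3.0)
```

With hard labels the student would only learn "the answer is 'dog'"; the soft targets additionally teach it that "puppy" is nearly as plausible and "table" is not.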

Results

  • Typical compression: 5–10x smaller, retaining 90–95% accuracy
  • DistilBERT: 60% of BERT's size, 97% of its performance, 60% faster
  • DeepSeek-R1-Distill models: distilled from 671B to 7B/14B/32B variants with strong reasoning capabilities

Advanced Distillation Techniques

  • Chain-of-Thought Distillation — transfers reasoning processes (not just final answers) from teacher to student using CoT rationales as training signal
  • Curriculum Distillation — organizes training easy-to-hard to gradually build reasoning capacity
  • Multi-Teacher Distillation — combines expertise from multiple specialized teachers with dynamic weighting
  • Few-Shot Distillation — effective with as few as 8–512 calibration samples using counterfactual explanations
Scaling Laws

Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) established empirical scaling laws for LLMs:

  • Model performance (loss) improves predictably as a power law of: model size (parameters), dataset size (tokens), and compute budget (FLOPs)
  • Chinchilla-optimal: for a given compute budget, model size and training tokens should scale roughly equally (the "1:1 ratio" in parameter-token space)
  • DeepSeek-V3 trained on 14.8T tokens with 671B params — heavily over-training relative to Chinchilla, but MoE's sparse activation changes the calculus

These laws guide decisions about how to allocate training budgets: bigger model vs more data vs longer training.
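The Chinchilla allocation can be worked through numerically. Using the common approximation C ≈ 6·N·D for training FLOPs and the rule of thumb D ≈ 20·N tokens per parameter (both standard simplifications, not exact fits from the paper):

```python
import math

def chinchilla_optimal(compute_flops):
    """Split a compute budget C ~ 6*N*D using the D ~ 20*N rule of thumb."""
    n_params = math.sqrt(compute_flops / (6 * 20))   # solve C = 6 * N * (20N)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own training budget:
n, d = chinchilla_optimal(5.76e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # -> 69B params, 1.4T tokens
```

This recovers approximately Chinchilla's actual configuration (70B parameters, 1.4T tokens). Models like LLaMA and DeepSeek-V3 deliberately train far past this optimum because inference cost, not training cost, dominates their deployment economics.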


Post-Transformer Architectures

While transformers dominate, alternatives are emerging:

| Architecture | Key Innovation | Status |
|---|---|---|
| Mamba (State Space Models) | Selective state updates; linear-time sequence processing; no quadratic attention | Competitive with transformers at small-medium scale |
| RWKV | RNN-transformer hybrid; linear attention | Active open-source community |
| Hyena | Long convolutions replace attention | Research stage |
| PaTH Attention (MIT, 2025) | Adds data-dependent down-weighting to standard attention | Improves reasoning and long-context tasks |

None have yet displaced transformers at frontier scale, but Mamba-based hybrids (Jamba by AI21) show promise for efficiency-critical applications.