LLM Architecture¶
How Large Language Models work — from transformer internals and attention mechanisms through training pipelines, quantization formats, model distribution formats, and knowledge distillation.
Transformer Architecture¶
The transformer is the neural network architecture behind virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, it replaced recurrent neural networks (RNNs/LSTMs) by processing all tokens in a sequence simultaneously rather than sequentially.
Why Transformers Replaced RNNs¶
RNNs process tokens one at a time, left to right. This sequential bottleneck means:
- Training cannot be parallelized across sequence positions
- Long-range dependencies decay over distance (vanishing gradients)
- Training time scales linearly with sequence length
Transformers solve all three problems through self-attention, which computes relationships between every pair of tokens in a single matrix operation — fully parallelizable on GPUs.
High-Level Data Flow¶
graph LR
A[Raw Text] --> B[Tokenizer]
B --> C[Token IDs]
C --> D[Embedding Layer]
D --> E[+ Positional Encoding]
E --> F[Transformer Blocks x N]
F --> G[Output Layer / Logits]
G --> H[Softmax → Probability Distribution]
H --> I[Next Token]
- Tokenization — text is split into subword tokens (integers from a fixed vocabulary)
- Embedding — each token ID maps to a dense vector via a learned embedding table
- Positional Encoding — positional signals are added so the model knows token order (attention itself is order-agnostic)
- Transformer Blocks — a stack of N identical layers, each containing self-attention + feed-forward network + residual connections + layer normalization
- Output Layer — projects hidden states to vocabulary-sized logits
- Softmax — converts logits to a probability distribution over the vocabulary
Modern LLMs stack anywhere from about a dozen to well over a hundred transformer blocks (GPT-3 uses 96). Deeper stacks enable richer hierarchical abstractions.
Inside a Transformer Block¶
Each transformer block contains two main sub-layers wrapped in residual connections and normalization:
graph TD
A[Input] --> B[Layer Norm]
B --> C[Multi-Head Self-Attention]
C --> D[+ Residual Connection]
D --> E[Layer Norm]
E --> F[Feed-Forward Network]
F --> G[+ Residual Connection]
G --> H[Output to Next Block]
Feed-Forward Network (FFN)¶
The FFN provides the model's primary source of nonlinearity and parameter capacity. While attention handles communication between tokens, the FFN handles computation within each token's representation — this is where the model stores and applies learned knowledge.
The original transformer used a two-layer FFN with ReLU activation and a 4x hidden dimension expansion:
$$ \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2 $$
Modern LLMs have evolved significantly:
| Component | Original Transformer | Modern LLMs (LLaMA/Mistral) |
|---|---|---|
| Activation | ReLU | SwiGLU (SiLU-gated) |
| Expansion ratio | 4x | ~2.7x (compensated by gating) |
| Normalization | LayerNorm (Post-LN) | RMSNorm (Pre-LN) |
SwiGLU Activation¶
SwiGLU is a gated variant that has become the standard in LLaMA-family models. It works like a learned gate: up_proj(x) carries the information, and SiLU(gate_proj(x)) controls how much passes through:
$$ \text{SwiGLU}(x) = (\text{SiLU}(xW_{\text{gate}})) \odot (xW_{\text{up}}) $$
Even with a nominally lower expansion ratio (~2.7x vs 4x), a SwiGLU FFN has roughly the same parameter count as the original design, because it uses three weight matrices (gate, up, and down projections) instead of two. The gate mechanism adds expressive power at that matched budget. Gemma uses GeGLU, a closely related variant.
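A minimal pure-Python sketch of the gated FFN described above may make the data flow concrete. The helper names (`matvec`, `swiglu_ffn`) and the explicit down-projection `W_down` are assumptions for illustration: the formula above shows only the gated hidden activation, and real implementations use batched tensor operations.

```python
import math

def silu(z: float) -> float:
    """SiLU (swish): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """(SiLU(x W_gate) * (x W_up)) W_down; W_down is an assumed output projection."""
    gate = [silu(g) for g in matvec(x, W_gate)]   # how much passes through
    up = matvec(x, W_up)                          # the information carried
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(hidden, W_down)
```

With three matrices of shape d × (8/3)d, the parameter count is 8d², the same as the original two-matrix FFN at 4x expansion (2 × 4d²).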
RMSNorm vs LayerNorm¶
LayerNorm performs two operations: centering (subtracting the mean) and scaling (dividing by standard deviation). RMSNorm removes centering entirely, normalizing only by root mean square — empirical studies found centering contributes little to training stability while scaling does the heavy lifting.
$$ \text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2}} \cdot \gamma $$
RMSNorm yields comparable performance to LayerNorm but shows 7–64% speed improvement.
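The difference between the two normalizations can be sketched in a few lines. This is an illustrative version, not a library implementation; the `eps` stabilizer is an assumption that does not appear in the formula above but is standard practice to avoid division by zero.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Center (subtract mean), then scale by standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-6):
    """No centering: divide by the root mean square only."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]
```

RMSNorm skips the mean computation and the bias term, which is where its speed advantage comes from.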
Pre-LN vs Post-LN¶
| Placement | Description | Stability | Used By |
|---|---|---|---|
| Post-LN (original) | Norm applied after residual add | Requires careful LR warmup; gradient issues in deep nets | Original Transformer, BERT |
| Pre-LN (modern) | Norm applied before each sub-layer | Much more stable; tolerates short or no warmup | GPT-2/3, LLaMA, Mistral |
Pre-LN normalizes input to each sub-layer, preventing activation explosions. The residual path remains clean, allowing gradients to flow easily. By LLaMA's release (2023), Pre-LN with RMSNorm became the undisputed standard.
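The two placements differ only in where the normalization sits relative to the residual add. The sketch below treats `attn`, `ffn`, and `norm` as caller-supplied functions (hypothetical stand-ins for the real sub-layers) and, for brevity, reuses one `norm` where real models keep a separate one per sub-layer.

```python
def add(a, b):
    """Element-wise residual addition."""
    return [x + y for x, y in zip(a, b)]

def post_ln_block(x, attn, ffn, norm):
    """Original Transformer: normalize AFTER each residual add."""
    x = norm(add(x, attn(x)))
    x = norm(add(x, ffn(x)))
    return x

def pre_ln_block(x, attn, ffn, norm):
    """Modern LLMs: normalize the sub-layer INPUT; the residual path is untouched."""
    x = add(x, attn(norm(x)))
    x = add(x, ffn(norm(x)))
    return x
```

In the Pre-LN version the input `x` reaches the output through additions only, which is the "clean residual path" the text refers to.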
Residual Connections¶
Residual (skip) connections add each sub-layer's input directly to its output: $\text{output} = \text{sublayer}(x) + x$. This allows gradients to flow through hundreds of layers without vanishing and lets each layer learn a refinement rather than a complete transformation.
Weight Tying¶
Many models tie the input embedding matrix with the output projection matrix (the layer that produces logits). Since both map between token IDs and hidden dimensions, sharing weights reduces parameter count and can improve generalization. GPT-2 and many smaller models use weight tying; larger models like LLaMA do not.
Encoder-Decoder vs Decoder-Only¶
The original transformer had two halves:
| Architecture | Used By | How It Works |
|---|---|---|
| Encoder-Decoder | T5, BART, original Transformer | Encoder reads full input bidirectionally; decoder generates output autoregressively |
| Encoder-Only | BERT, RoBERTa | Bidirectional attention for understanding tasks (classification, NER) |
| Decoder-Only | GPT series, LLaMA, Claude, Mistral | Causal (left-to-right) attention; generates text one token at a time |
Nearly all modern generative LLMs use the decoder-only variant. The encoder-only approach lives on in embedding models and classification tasks.
Self-Attention Mechanism¶
Self-attention is the core innovation that makes transformers work. It allows every token to "attend to" every other token in the sequence, computing relevance scores dynamically.
Query, Key, Value (QKV)¶
For each token, the model computes three vectors from the input embedding:
- Query (Q) — "what am I looking for?"
- Key (K) — "what do I contain?"
- Value (V) — "what information do I provide?"
The attention score between two tokens is the dot product of one token's Query with another's Key, scaled and passed through softmax:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Where $d_k$ is the dimension of the key vectors (scaling prevents dot products from growing too large).
Causal Masking¶
In decoder-only models, a causal mask is applied to the attention matrix: the upper triangle is set to $-\infty$ before softmax, preventing tokens from attending to future positions. This ensures autoregressive generation — each token can only see tokens that came before it.
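The attention formula and the causal mask can be combined in one small sketch. It is a single-head, unbatched illustration in plain Python; the function names are assumptions, and production code uses fused tensor kernels instead of loops.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d_k)) V, optionally masking future positions."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                scores.append(-math.inf)      # upper triangle: future token
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        w = softmax(scores)                   # exp(-inf) becomes exactly 0
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights
```

With `causal=True`, the first token can attend only to itself, so its output is exactly its own value vector.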
Multi-Head Attention¶
Rather than computing a single attention function, transformers use multiple attention heads (typically 32–128), each with independent Q/K/V projections. Different heads learn to capture different types of relationships (syntactic, semantic, positional). The outputs are concatenated and linearly projected.
Attention Variants¶
| Variant | Description | Used By |
|---|---|---|
| Multi-Head Attention (MHA) | Each head has its own K, V projections | Original Transformer, GPT-2 |
| Multi-Query Attention (MQA) | All heads share a single K, V projection | PaLM, Falcon |
| Grouped-Query Attention (GQA) | Heads grouped into clusters sharing K, V | LLaMA 2/3, Mistral, Gemma |
GQA is the current standard. Because only the K/V heads are cached, it shrinks KV cache memory by the ratio of query heads to K/V heads (typically 4–8x versus MHA) with minimal quality loss.
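The memory saving follows directly from the cache shape. The arithmetic below is a sketch under assumed, illustrative dimensions (80 layers, 64 query heads, head size 128, 4096 cached tokens, 2-byte values), chosen to resemble a 70B-class model rather than to describe any specific one.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Two cached tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# MHA: every one of the 64 query heads has its own K/V head.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
# GQA: 64 query heads share 8 K/V heads (groups of 8).
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB per sequence")
```

Under these assumptions the cache shrinks by exactly the head ratio, 64 / 8 = 8x, and the cost is per sequence, so it multiplies with batch size.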
Tokenization and Embeddings¶
Tokenization¶
Tokenization converts raw text into integer token IDs from a fixed vocabulary. LLMs use subword tokenization — a middle ground between character-level (too fine) and word-level (can't handle unknown words).
Byte Pair Encoding (BPE) is the dominant algorithm:
- Start with a vocabulary of 256 byte values
- Find the most frequent adjacent byte pair in the training corpus
- Merge that pair into a new token, add to vocabulary
- Repeat until the vocabulary reaches its target size (roughly 32K–256K tokens in current LLMs)
Common words become single tokens; rare words decompose into known subword pieces.
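The merge loop above can be sketched as a toy trainer. This is a simplified illustration, not a production tokenizer: it trains on a single string, counts overlapping pairs naively, and stops early when no pair repeats. The function name `train_bpe` is an assumption.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Learn up to num_merges byte-pair merges; return (merges, final tokens)."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]   # start from raw bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:                                     # nothing worth merging
            break
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)               # new, longer token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens
```

Running three merges on `"aaabdaaabac"` collapses the repeated `aaab` run into a single token while the rare bytes stay separate.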
| Algorithm | Description | Used By |
|---|---|---|
| BPE (byte-level) | Merge most frequent byte pairs | GPT-2/3/4, LLaMA 3, Claude, Mistral |
| WordPiece | Merge pairs that maximize corpus likelihood | BERT, DistilBERT |
| SentencePiece | Language-agnostic tokenizer library that operates on raw text; implements both BPE and Unigram | LLaMA 1/2, Mistral (earlier), T5 |
| Unigram | Probabilistic model; starts from a large vocabulary and prunes it down | T5, XLNet (via SentencePiece) |
Tokenization Quirks
Many LLM "failures" trace back to tokenization. Math errors occur because multi-digit numbers split into arbitrary subword tokens. Spelling struggles happen because the model never sees individual characters. "Glitch tokens" — tokens frequent in tokenizer training data but rare in model training — produce unpredictable outputs.
Embeddings¶
The embedding layer maps each integer token ID to a dense vector (typically 4096–12288 dimensions). These vectors are learned during pretraining and encode semantic relationships: similar tokens have similar vectors.
Positional encoding adds sequence-order information since attention is inherently order-agnostic.
Positional Encoding Methods¶
| Method | Type | How It Works | Used By |
|---|---|---|---|
| Sinusoidal | Absolute | Fixed sine/cosine functions at each position | Original Transformer |
| Learned Absolute | Absolute | Trainable embedding per position (up to max length) | GPT-2, BERT |
| RoPE (Rotary Position Embedding) | Relative | Encodes relative positions via rotation matrices applied to Q/K vectors | LLaMA 1/2/3, Mistral, Qwen, Gemma |
| ALiBi (Attention with Linear Biases) | Relative | Adds linear penalty proportional to token distance directly to attention scores | BLOOM, MPT |
| YaRN | Relative (extended) | Extends RoPE to longer contexts via NTK-aware interpolation | Long-context LLaMA variants |
RoPE is the dominant method in 2025. It applies a rotation matrix to Q and K vectors such that the dot product $q \cdot k$ depends only on their relative position, not absolute. This enables better length extrapolation than learned absolute embeddings and avoids ALiBi's precision issues (see below).
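The relative-position property can be checked numerically. The sketch below rotates adjacent dimension pairs, the interleaved layout from the original RoFormer formulation; some implementations pair dimension i with i + d/2 instead, which is equivalent up to a permutation. The `base=10000.0` default and the function names are assumptions for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]
# Same offset (2) at different absolute positions gives the same score.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
```

Here `score_a` and `score_b` agree up to floating-point error, while changing the offset changes the score.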
ALiBi adds a simple linear bias $-m \cdot |i - j|$ to each attention score, where $m$ is a head-specific slope and $|i-j|$ is the distance between tokens. While elegant, ALiBi has a critical interaction with reduced precision: in FP16, the last 20 positions of a head may map to only 5 distinct values, and in BF16 they may all collapse to the same value. This limits ALiBi's effectiveness for long-context inference.
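The bias term is simple enough to write out directly. The slope schedule below is the geometric sequence used in the ALiBi paper for power-of-two head counts; the `-inf` entries for future positions are an added assumption so the matrix can double as a causal mask.

```python
def alibi_slopes(n_heads):
    """Head-specific slopes m: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for j <= i; future positions are masked."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]
```

The bias is added to the attention scores before softmax, so distant tokens are penalized linearly and heads with small slopes retain a longer effective range.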
Context Windows¶
The context window is the maximum number of tokens an LLM can process in one request. All input (system prompt, conversation history, user query) and output share this budget.
| Era | Typical Context | Example Models |
|---|---|---|
| 2018–2020 | 512–2048 | BERT, GPT-2 |
| 2022–2023 | 4K–100K | GPT-4 (8K/32K), Claude 2 (100K) |
| 2024–2025 | 128K–1M | Claude 3.5, Gemini 1.5, GPT-4 Turbo |
| 2025–2026 | 1M–10M | Gemini 2.0, Claude 4 |
Lost in the Middle
Research shows LLMs attend strongly to tokens at the beginning and end of context but drop 30%+ accuracy on information in the middle (Liu et al., Stanford 2024). Placing critical information at the start or end of prompts improves retrieval quality.
Training Pipeline¶
Phase 1: Pretraining¶
The model learns general knowledge by predicting the next token across trillions of tokens from web crawls, books, code, and curated datasets. This is by far the most expensive phase — DeepSeek-V3 required 2.788 million H800 GPU hours (~$5.6M) for 14.8 trillion tokens.
The training objective is simple causal language modeling:
$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}) $$
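As a worked instance of the objective, the loss is just the summed negative log-probability the model assigned to each actual next token. The sketch assumes those per-token probabilities are already available as a list; `perplexity` is included because it is the exponentiated per-token average of the same quantity.

```python
import math

def causal_lm_loss(token_probs):
    """token_probs[t] is P(x_t | x_1..x_{t-1}) for the token that actually occurred."""
    return -sum(math.log(p) for p in token_probs)

def perplexity(token_probs):
    """exp of the mean per-token loss; lower is better."""
    return math.exp(causal_lm_loss(token_probs) / len(token_probs))
```

A model that assigns probability 0.5 to every correct token has a perplexity of 2: it is as uncertain as a fair coin flip at each step.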
Pretraining Data Curation¶
The quality and composition of pretraining data is as critical as model architecture:
| Step | What It Does | Why It Matters |
|---|---|---|
| Deduplication | Remove near-duplicate documents (MinHash, exact substring matching) | Duplicated data causes memorization, degrades generalization, inflates benchmark scores |
| Quality filtering | Score documents via heuristics or classifier (perplexity, language ID, content quality) | Removes spam, boilerplate, machine-generated text |
| Toxicity/PII removal | Filter harmful content and personally identifiable information | Safety and legal compliance |
| Domain mixing | Control proportions of web, code, books, scientific papers, multilingual data | Affects which capabilities the model develops |
| Data scheduling | Vary data mix during training (e.g., increase code/math ratio later) | Optimizes learning curriculum |
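The MinHash technique named in the deduplication row can be sketched with the standard library. This is a simplified illustration: the word-trigram shingling, the 64 salted BLAKE2 hashes, and all function names are assumptions, and real pipelines add locality-sensitive hashing to avoid comparing every pair of documents.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    return [
        min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a chosen threshold are treated as near-duplicates, and all but one copy is dropped.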
Modern data pipelines use classifier-based filtering — training a small model on known high-quality text (e.g., Wikipedia, textbooks) and scoring all candidate documents. LLaMA 3 used this approach extensively.
Synthetic Data in Pretraining¶
Synthetic data — generated by existing LLMs — is increasingly used to augment pretraining corpora:
- Textbook-quality data: Phi models (Microsoft) demonstrated that small models trained on LLM-generated "textbook-style" data can outperform much larger models on reasoning benchmarks
- Code generation: synthetic programming problems and solutions supplement natural code repositories
- Math and reasoning: step-by-step solutions generated by strong models provide training signal for reasoning capabilities
- Instruction data: synthetic instruction-response pairs bootstrap SFT datasets at scale
Model Collapse
Training on too much synthetic data without sufficient real data can cause "model collapse" — progressive degradation of quality as the model learns from its own distribution rather than the true data distribution. Careful mixing ratios (typically <30% synthetic) and quality filtering mitigate this risk.
Phase 2: Supervised Fine-Tuning (SFT)¶
The pretrained model is further trained on curated instruction-response pairs to learn:
- Instruction following
- Output formatting (JSON, markdown, structured responses)
- Safety behaviors
- Task-specific patterns
SFT datasets are much smaller (thousands to millions of examples) but high quality.
Phase 3: Alignment¶
RLHF (Reinforcement Learning from Human Feedback)¶
The traditional alignment pipeline:
graph LR
A[SFT Model] --> B[Generate Multiple Responses]
B --> C[Human Annotators Rank Outputs]
C --> D[Train Reward Model]
D --> E[Optimize Policy via PPO]
E --> F[Aligned Model]
- Generate multiple responses per prompt
- Human annotators rank them
- Train a reward model to predict human preferences
- Use PPO (Proximal Policy Optimization) to optimize the SFT model (the policy) against the reward model, usually with a KL penalty that keeps it close to the SFT starting point
Downsides: complex, expensive, unstable training, susceptible to reward hacking.
DPO (Direct Preference Optimization)¶
Introduced by Rafailov et al. (2023), DPO simplifies alignment by eliminating the reward model entirely. It reframes preference learning as a binary classification problem:
- Given a chosen response and a rejected response, directly optimize the model to increase the probability of the chosen response relative to the rejected one
- Requires only 2 models (policy + frozen reference) vs RLHF's 4
- Standard supervised learning infrastructure — no RL instability
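The per-pair DPO objective can be written as a short function. The sketch assumes the summed log-probabilities of each response under the policy and the frozen reference are already computed; the argument names and the `beta=0.1` default are assumptions for illustration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    z = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the chosen response's probability relative to the rejected one, which is the binary-classification view described above.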
By 2025, 70% of enterprises use RLHF or DPO for alignment, with DPO adoption growing 45% year-over-year.
Constitutional AI (CAI)¶
Developed by Anthropic, CAI replaces human preference labeling with self-critique based on ethical principles (a "constitution"). The model generates responses, critiques its own outputs against the constitution, and revises — enabling scalable alignment without massive human annotation.
Phase 4: Reinforcement Learning for Reasoning¶
Models like DeepSeek-R1 and OpenAI o1/o3 add an RL phase specifically targeting step-by-step reasoning:
- Train the model to generate and verify chains of thought
- Reward correct final answers and valid reasoning steps
- Results: DeepSeek-R1 achieves 97.3% on MATH-500 and ~80% on AIME 2024 competition problems
Mixture of Experts (MoE)¶
MoE introduces sparsity into the model: instead of activating all parameters for every token, only a subset of specialized "expert" sub-networks fire. This achieves the quality of massive models at the compute cost of much smaller ones.
How MoE Works¶
graph TD
A[Input Token] --> B[Router / Gating Network]
B --> C[Expert 1]
B --> D[Expert 2]
B --> E[Expert 3]
B --> F["Expert N (inactive)"]
C --> G[Weighted Sum of Active Expert Outputs]
D --> G
E --> G
G --> H[Output]
style F fill:#ccc,stroke:#999
- A router (small neural network) scores all experts for each input token
- The top-K experts (typically top-2) are selected
- Their outputs are combined via weighted sum
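- Remaining experts are not computed, so per-token expert FLOPs scale with K rather than with the total expert count (top-2 of 8 runs a quarter of the expert compute; DeepSeek-V3 activates about 5.5% of its parameters per token)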
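The routing steps above can be sketched as follows. Experts are modeled as plain Python callables and the router logits are passed in directly, both assumptions for illustration; renormalizing with a softmax over only the selected logits is one common choice rather than the only one.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts; weights are a softmax over their logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the rest are never called."""
    out = [0.0] * len(x)
    for e, w in top_k_route(router_logits, k):
        y = experts[e](x)                       # only active experts run
        out = [o + w * v for o, v in zip(out, y)]
    return out
```

Because unselected experts are never invoked, compute scales with `k` while the parameter count scales with `len(experts)`.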
Key MoE Models¶
| Model | Total Params | Active Params | Experts | Innovation |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | First major open-source MoE; static top-2 routing |
| Mixtral 8x22B | 141B | ~39B | 8 | Scaled Mixtral architecture |
| DeepSeek-V3 | 671B | 37B | 256 | Fine-grained experts; auxiliary-loss-free load balancing; FP8 training |
| DeepSeek-R1 | 671B | 37B | 256 | RL-first reasoning on V3 base; 97.3% on MATH-500 |
| Llama 4 Scout | 109B | 17B | 16 | Meta's first MoE; 10M token context |
| Llama 4 Maverick | 400B | 17B | 128 | 128 experts, top-1 routing |
DeepSeek's MoE Innovations¶
DeepSeek introduced two key strategies:
- Fine-grained experts — segment into many small experts (256 instead of 8), activate a small subset, allowing more flexible combinations
- Shared experts — isolate some experts as "shared" across all tokens to capture common knowledge, reducing redundancy in routed experts
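As of 2025, MoE is the prevailing design among frontier models. Llama 4, DeepSeek-V3/R1, Mixtral, and Gemini 1.5 are publicly documented as MoE; GPT-4 is widely reported to be one but OpenAI has not confirmed it, and Anthropic has not published Claude's architecture.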
Load Balancing and Expert Collapse¶
A critical challenge in MoE training is expert collapse — the router learns to send most tokens to a few "popular" experts while others receive little traffic and stop learning. This wastes capacity and reduces model quality.
Solutions:
| Technique | How It Works | Used By |
|---|---|---|
| Auxiliary load-balancing loss | Adds a penalty term that encourages equal token distribution across experts | Mixtral, Switch Transformer |
| Expert capacity factor | Caps the max tokens per expert; overflow tokens are dropped or sent to a default expert | GShard, Switch Transformer |
| Auxiliary-loss-free balancing | Uses a bias term in the router to balance load without distorting the main training loss | DeepSeek-V3 |
| Shared experts | Reserve some experts as "always active" to handle common knowledge, reducing pressure on routed experts | DeepSeek-V2/V3 |
DeepSeek-V3's auxiliary-loss-free approach is notable because traditional auxiliary losses can conflict with the main training objective, forcing a trade-off between load balance and model quality. By using a separate bias term, DeepSeek avoids this conflict entirely.
MoE Memory Tradeoff
MoE memory scales with total parameters, not active parameters. A 671B MoE model needs hundreds of GB of VRAM even though only 37B parameters fire per token. This forces multi-GPU deployments for large MoE models.
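The arithmetic behind this tradeoff is short. The sketch counts weight storage only (no KV cache, activations, or runtime overhead) and uses the 671B / 37B figures from the text; the function name is an assumption.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight storage only, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

TOTAL_PARAMS = 671e9    # all experts must be resident in memory
ACTIVE_PARAMS = 37e9    # parameters used per token; drives compute, not storage

resident_fp8 = weight_memory_gb(TOTAL_PARAMS, 1.0)    # 1 byte per parameter
resident_bf16 = weight_memory_gb(TOTAL_PARAMS, 2.0)   # 2 bytes per parameter
touched_fp8 = weight_memory_gb(ACTIVE_PARAMS, 1.0)

print(f"resident: {resident_fp8:.0f} GB (FP8), {resident_bf16:.0f} GB (BF16); "
      f"touched per token: {touched_fp8:.0f} GB")
```

Even at one byte per parameter, the resident weights exceed the memory of any single current GPU, while the per-token working set is a small fraction of that.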
Quantization Formats¶
Quantization reduces model weight precision from high-bit (FP32/FP16) to lower-bit (INT8/INT4) representations, dramatically reducing memory and improving inference speed.
Numeric Precision Types¶
Floating-Point Bit Layout¶
Understanding the sign/exponent/mantissa structure explains why these formats differ:
FP32: [1 sign] [8 exponent] [23 mantissa] — 32 bits total
FP16: [1 sign] [5 exponent] [10 mantissa] — 16 bits total
BF16: [1 sign] [8 exponent] [ 7 mantissa] — 16 bits total
FP8: [1 sign] [4 exponent] [ 3 mantissa] — 8 bits total (E4M3 variant)
| Property | FP32 | BF16 | FP16 |
|---|---|---|---|
| Dynamic range (decades) | ~83 | ~79 | ~12 |
| Epsilon (precision near 1.0) | ~1.2e-7 | ~7.8e-3 | ~9.8e-4 |
| Max value | ~3.4e38 | ~3.4e38 | ~65,504 |
| Loss scaling needed? | No | Rarely | Often yes |
BF16 has the same 8-bit exponent as FP32, giving it nearly identical dynamic range — this means it can represent extremely small gradients and large activations without underflow/overflow. The trade-off is lower precision (7 mantissa bits vs FP16's 10). In practice, BF16 "just works" for training because you rarely need loss scaling.
FP16 has higher precision within a narrow range but risks overflow during training. Loss scaling (multiplying the loss by a large factor, then dividing gradients back) is often required to prevent gradient underflow.
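The range-versus-precision tradeoff can be observed with Python's `struct` module, which supports IEEE half precision through the `e` format. BF16 has no `struct` code, so the sketch emulates it by truncating an FP32 encoding to its top 16 bits; that truncation (instead of round-to-nearest) is a simplifying assumption.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE FP16; raises OverflowError for values far above 65504."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate BF16: keep sign, 8 exponent bits, and 7 mantissa bits of the FP32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

tiny_gradient = 1e-8
print(to_fp16(tiny_gradient), to_bf16(tiny_gradient))   # FP16 underflows to zero
print(to_fp16(1.0 + 2**-10), to_bf16(1.0 + 2**-10))     # BF16 loses the small increment
```

A gradient of 1e-8 vanishes in FP16 but survives in BF16, which is why loss scaling exists; conversely, FP16 resolves an increment of 2^-10 near 1.0 that BF16 cannot.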
| Format | Bits | Bytes/Param | Description | Use Case |
|---|---|---|---|---|
| FP32 | 32 | 4 | Full precision float | Master weights and optimizer states in mixed-precision training |
| BF16 | 16 | 2 | Brain Float 16 — wider dynamic range than FP16 | Standard training & full-quality inference; post-2022 default |
| FP16 | 16 | 2 | Half precision float | Legacy training and inference; same memory as BF16 with more mantissa precision but a much narrower dynamic range |
| FP8 | 8 | 1 | 8-bit float; native on Hopper/Blackwell GPUs | Production sweet spot on modern NVIDIA hardware |
| INT8 | 8 | 1 | 8-bit integer | ~50% memory reduction vs FP16; broad hardware support |
| INT4 | 4 | 0.5 | 4-bit integer | ~75% memory reduction; per-group scaling preserves quality |
| INT2 | 2 | 0.25 | 2-bit integer | Extreme compression; significant quality loss |
How Quantization Works¶
Full-precision weights (e.g. FP16) are mapped to a smaller set of representable values:
- Per-tensor quantization — one scale factor for the entire weight tensor (fast but lossy)
- Per-channel quantization — one scale factor per output channel (better quality)
- Per-group quantization — divides weights into groups of 128 elements, each with its own scale (best quality/size tradeoff for INT4)
The scale factor maps the quantized integer range back to the original floating-point range during inference.
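A symmetric absmax scheme is one simple way to realize the mapping described above. The sketch is illustrative only (no zero-points, no bit packing, not the exact scheme of any particular format), and the function names are assumptions.

```python
def quantize_group(weights, bits=4):
    """Map one group to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate floating-point weights at inference time."""
    return [v * scale for v in q]

def quantize_per_group(weights, group_size=128, bits=4):
    """Split a flat weight list into groups, each with its own scale."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]
```

Smaller groups isolate outliers: with one scale for the whole tensor, a single large weight forces small weights to round to zero, whereas per-group scales keep them representable.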
Quality Impact by Precision¶
Perplexity benchmarks (Llama-2-7B, lower is better):
| Format | Perplexity | Quality Loss |
|---|---|---|
| FP16 (baseline) | 7.4924 | — |
| Q8_0 | 7.4933 | Negligible |
| Q5_K_M | ~7.52 | Minimal |
| Q4_K_M | 7.5692 | Acceptable |
| Q3_K_M | ~7.85 | Noticeable |
| Q2_K | 8.6501 | Significant degradation |
Low-Bit Caveats
At Q2/Q3, models start ignoring parts of system prompts and hallucinating JSON formatting. Avoid INT4 and below for math, code generation, and reasoning-heavy tasks where quality loss is most noticeable.
Size and Speed Example (Llama 2 13B)¶
| Metric | FP16 | Q4_K_M |
|---|---|---|
| Model size | 26 GB | 7.9 GB (70% reduction) |
| RAM required | 32 GB+ | 12 GB |
| Speed | 8 tok/s | 15 tok/s |
| Quality | 100% | ~95% |
Model Formats and Quantization Methods¶
GGUF (GPT-Generated Unified Format)¶
GGUF is a self-contained file format created by the llama.cpp project. It bundles weights, tokenizer, architecture metadata, and chat template into a single .gguf file.
Key properties:
- Runs on everything — CPU, NVIDIA, AMD, Apple Silicon
- mmap-able (OS maps file into memory without loading it all)
- Endian-safe and versioned
- Powers Ollama and LM Studio under the hood
GGUF quantization naming convention:
| Name | Approx Bits/Weight | Type | Quality |
|---|---|---|---|
| Q2_K | ~2.6 | K-quant | Extreme compression, noticeable degradation |
| Q3_K_S / Q3_K_M / Q3_K_L | ~3.3–3.9 | K-quant | Budget-conscious, some quality loss |
| Q4_K_S / Q4_K_M | ~4.3–4.8 | K-quant | Best balance of quality and size |
| Q5_K_S / Q5_K_M | ~5.3–5.7 | K-quant | Near-lossless for most tasks |
| Q6_K | ~6.6 | K-quant | Very close to FP16 |
| Q8_0 | ~8.5 | Legacy | Near-identical to FP16 |
| IQ2_XXS / IQ3_S | ~2.1–3.4 | I-quant | State-of-art low-bit; uses lookup tables |
The "K" indicates k-quant method (importance-aware mixed-precision); S/M/L are compression aggressiveness levels.
GPTQ (GPT-Quantized)¶
Calibration-based 4-bit integer quantization using approximate second-order (Hessian) information to minimize quantization error. Requires a small calibration dataset. AutoGPTQ was archived in April 2025; succeeded by GPTQModel v5.8.0.
Verdict: Use only if AWQ or EXL2 versions are unavailable. Both offer better quality-per-bit.
AWQ (Activation-Aware Weight Quantization)¶
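Developed at MIT. AWQ identifies the roughly 1% of "salient" weight channels by observing activation magnitudes during calibration, then protects them by rescaling those channels before quantization, so every weight stays in the same low-bit format and no mixed-precision storage is needed.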
- ~3 percentage points better than GPTQ on MMLU at 4 bits
- Marlin-AWQ kernel: ~741 tok/s on A10G — fastest 4-bit for NVIDIA
- Best choice for vLLM multi-user deployments on NVIDIA
EXL2 (ExLlamaV2)¶
Mixed bit-width quantization — can use 2, 3, 4, 5, 6, 8 bits within a single model and even within individual layers. Supports fractional average bitwidths (e.g., 4.5 bpw).
- Fastest for interactive single-user generation on NVIDIA GPUs (40–70% faster than llama.cpp)
- NVIDIA CUDA only, no CPU fallback
- Best for single-user interactive sessions at 4–6 bpw
Quick Format Decision Guide¶
| Scenario | Best Format |
|---|---|
| CPU / Laptop / Apple Silicon | GGUF (Q4_K_M or Q5_K_M) |
| NVIDIA GPU, max serving throughput | AWQ with Marlin kernels |
| NVIDIA GPU, single-user interactive | EXL2 at 4–6 bpw |
| NVIDIA H100/Blackwell production | FP8 |
| Fine-tuning | bitsandbytes (QLoRA) |
| Limited VRAM (≤8GB) | GGUF Q4_K_M with CPU offloading |
| General starting point | Ollama with Q4_K_M |
MLX (Apple Silicon)¶
MLX is Apple's open-source array framework for machine learning on Apple Silicon. Designed for Mac-native LLM inference and fine-tuning.
Key Design Principles¶
- Unified memory — arrays live in shared CPU/GPU memory; no data transfer overhead
- Lazy computation — operations are materialized only when needed, enabling automatic fusion
- Dynamic graphs — no recompilation on shape changes (unlike TensorRT)
- Familiar APIs: Python API mirrors NumPy; mlx.nn mirrors PyTorch
MLX LM¶
The mlx-lm package provides one-command model download, quantization conversion, and inference:
# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Llama-3-8B --quantize --q-bits 4
# Generate text
mlx_lm.generate --model mlx-community/Llama-3-8B-4bit --prompt "Explain attention"
Performance on Apple Silicon¶
| Chip | Memory Bandwidth | 14B Dense (BF16) TTFT | 30B MoE (4-bit) TTFT |
|---|---|---|---|
| M4 | 120 GB/s | ~12s | ~4s |
| M5 | 153 GB/s | <10s | <3s |
The M5 provides a 19–27% improvement over the M4, roughly tracking its 28% memory bandwidth increase.
Research shows vllm-mlx achieves 21–87% higher throughput than llama.cpp on Apple Silicon, thanks to zero-copy tensor operations and lazy evaluation.
A MacBook Pro 24GB can hold an 8B model in BF16 or a 30B MoE at 4-bit quantization comfortably.
Knowledge Distillation¶
Knowledge distillation compresses a large teacher model into a smaller student model that mimics the teacher's behavior while being far cheaper to run.
How It Works¶
graph LR
A[Input Data] --> B[Teacher Model - Large]
A --> C[Student Model - Small]
B --> D[Soft Targets / Probabilities]
D --> E[Distillation Loss]
C --> F[Student Predictions]
F --> E
E --> G[Update Student Weights]
Three main distillation approaches:
| Method | What Transfers | Description |
|---|---|---|
| Response-based | Output probabilities ("soft targets") | Student learns teacher's probability distribution over vocabulary, not just the argmax |
| Feature-based | Intermediate layer activations | Student aligns internal representations via L2 or cosine similarity |
| Attention-based | Attention maps | Student replicates teacher's attention patterns (used in TinyBERT and MiniLM) |
Why Soft Targets Matter¶
Instead of training on hard labels (the single correct answer), the student learns from the teacher's full probability distribution. The relative probabilities encode the teacher's learned generalizations — for example, that "dog" and "puppy" are similar while "dog" and "table" are not.
A temperature parameter $T$ (typically 2–5) controls how "soft" the distribution is: higher temperature spreads probability more evenly, exposing more of the teacher's learned structure.
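The effect of temperature and the response-based loss can be sketched together. The KL form and the T² scaling follow the classic formulation by Hinton et al. (2015); the function names and the `T=2.0` default are assumptions, and practical recipes usually mix this term with a hard-label loss.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax with temperature T; larger T gives a flatter distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Raising `T` shifts probability mass from the top class to the runners-up, so the student also sees which wrong answers the teacher considers nearly right.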
Results¶
- Typical compression: 5–10x smaller, retaining 90–95% accuracy
- DistilBERT: 60% of BERT's size, 97% of its performance, 60% faster
- DeepSeek-R1-Distill models: distilled from 671B to 7B/14B/32B variants with strong reasoning capabilities
Emerging Trends (2025)¶
- Chain-of-Thought Distillation — transfers reasoning processes (not just final answers) from teacher to student using CoT rationales as training signal
- Curriculum Distillation — organizes training easy-to-hard to gradually build reasoning capacity
- Multi-Teacher Distillation — combines expertise from multiple specialized teachers with dynamic weighting
- Few-Shot Distillation — effective with as few as 8–512 calibration samples using counterfactual explanations
Scaling Laws¶
Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) established empirical scaling laws for LLMs:
- Model performance (loss) improves predictably as a power law of: model size (parameters), dataset size (tokens), and compute budget (FLOPs)
- Chinchilla-optimal: for a given compute budget, model size and training tokens should be scaled in equal proportion, which works out to roughly 20 training tokens per parameter
- DeepSeek-V3 trained on 14.8T tokens with 671B total parameters: about 22 tokens per total parameter (close to Chinchilla), but about 400 tokens per active parameter (37B), which is heavy over-training by the dense-model rule. MoE's sparse activation changes the calculus
These laws guide decisions about how to allocate training budgets: bigger model vs more data vs longer training.
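A back-of-the-envelope allocation can be computed from two widely used approximations, both assumptions here rather than exact laws: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and the Chinchilla rule of thumb D ≈ 20·N.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) using C = 6*N*D and D = r*N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Budget equivalent to training a 70B model on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = chinchilla_optimal(budget)
print(f"{n_params / 1e9:.0f}B parameters, {n_tokens / 1e12:.1f}T tokens")
```

Because both N and D grow with the square root of compute, quadrupling the budget doubles the recommended model size and the recommended token count.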
Inference-Time Compute Scaling (Test-Time Compute)¶
A newer scaling axis discovered in 2024–2025: instead of only scaling training compute, you can scale inference compute by letting models "think longer" at test time.
| Approach | How It Works | Example |
|---|---|---|
| Chain-of-Thought (CoT) | Generate step-by-step reasoning before the final answer | GPT-4, Claude |
| Best-of-N sampling | Generate N candidate answers, select the best one via verifier | Used in math benchmarks |
| Tree search | Explore multiple reasoning paths, backtrack when stuck | AlphaProof, Tree of Thoughts |
| Self-verification | Model checks its own answer and retries if wrong | DeepSeek-R1 |
| Extended thinking | Dedicated "thinking" token budget separate from the visible response | Claude 3.7+ extended thinking, OpenAI o1/o3 |
OpenAI's o1/o3 and DeepSeek-R1 demonstrated that inference-time compute scaling can yield dramatic improvements on reasoning-heavy tasks, sometimes matching models 10x their size on math and coding benchmarks. The key insight: a smaller model thinking longer can outperform a larger model answering immediately.
Model Merging¶
Model merging combines the weights of multiple fine-tuned LLMs into a single model — no additional training required, no GPU needed. This creates models that combine capabilities from different specializations.
Why Merge?¶
- Combine a code-focused model with a math-focused model into one that excels at both
- Merge different LoRA adapters trained on different tasks
- Reduce the cost of multi-task deployment (one merged model vs multiple specialized ones)
- Experiment cheaply — thousands of merged models appear on the Open LLM Leaderboard
Merging Techniques¶
| Method | How It Works | Strengths | Limitations |
|---|---|---|---|
| Linear / LERP | Weighted average of model weights: $W = \alpha W_A + (1-\alpha) W_B$ | Simplest, fast | Naive averaging can cause interference between conflicting weight updates |
| SLERP (Spherical Linear Interpolation) | Interpolates along the hypersphere surface, preserving vector magnitudes | Maintains geometric properties; smoother than linear | Limited to merging exactly 2 models |
| TIES (Trim, Elect Sign & Merge) | Resets tiny deltas, resolves sign conflicts by majority vote, then merges cleaned updates | Handles interference between models; works with many models | More complex pipeline |
| DARE (Drop And REscale) | Randomly drops 90–99% of delta parameters, rescales remaining by $\frac{1}{1-p}$ | Effective even at extreme sparsity; reduces parameter interference | Random dropping adds variance |
| DARE + TIES | Combines DARE's random sparsification with TIES sign resolution | Best of both approaches | Requires tuning drop rate and thresholds |
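Two of the methods in the table are small enough to sketch on flat weight lists. This is an illustration of the formulas only, not MergeKit's implementation: real merges operate tensor by tensor, and the function names and the seeded `random.Random` are assumptions.

```python
import random

def linear_merge(w_a, w_b, alpha=0.5):
    """LERP: W = alpha * W_A + (1 - alpha) * W_B."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(w_a, w_b)]

def dare_delta(base, finetuned, drop_p=0.9, rng=None):
    """Drop each delta with probability drop_p; rescale survivors by 1 / (1 - drop_p)."""
    if rng is None:
        rng = random.Random(0)
    scale = 1.0 / (1.0 - drop_p)
    return [0.0 if rng.random() < drop_p else (f - b) * scale
            for b, f in zip(base, finetuned)]

def apply_delta(base, delta, weight=1.0):
    """Add a (sparsified) task vector back onto the base weights."""
    return [b + weight * d for b, d in zip(base, delta)]
```

The rescaling keeps the expected value of each delta unchanged, which is why DARE can discard most of a fine-tune's updates and still preserve its behavior on average.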
Tooling: MergeKit¶
MergeKit (by Arcee AI) is the standard open-source tool for model merging. It provides an extensible framework supporting all major algorithms and has been used to create thousands of merged models. Configuration is YAML-based:
models:
  - model: code-specialist/model
    parameters:
      weight: 0.6
  - model: math-specialist/model
    parameters:
      weight: 0.4
merge_method: ties
base_model: base/model
parameters:
  density: 0.5
  normalize: true
dtype: bfloat16
Emerging Trends (2025)¶
- Reasoning model merging: merging "slow-thinking" reasoning models with "fast" conventional LLMs can reduce token consumption by ~50% while maintaining accuracy
- Newer algorithms: NuSLERP, DELLA (Drop and Rescale via Sampling with Magnitude), and SCE (Select, Calculate, and Erase) offer incremental improvements
- All merging methods still fall short of individually fine-tuned models on their specific tasks — merging trades peak specialization for broader capability
Post-Transformer Architectures¶
While transformers dominate, alternatives are emerging:
| Architecture | Key Innovation | Status |
|---|---|---|
| Mamba (State Space Models) | Selective state updates; linear-time sequence processing; no quadratic attention | Competitive with transformers at small-medium scale |
| RWKV | RNN-transformer hybrid; linear attention | Active open-source community |
| Hyena | Long convolutions replace attention | Research stage |
| PaTH Attention (MIT, 2025) | Adds data-dependent down-weighting to standard attention | Improves reasoning and long-context tasks |
None have yet displaced transformers at frontier scale, but Mamba-based hybrids (Jamba by AI21) show promise for efficiency-critical applications.