LLM Architecture¶

How Large Language Models work — from transformer internals and attention mechanisms through training pipelines, quantization formats, model distribution formats, and knowledge distillation.

Transformer Architecture¶

The transformer is the neural network architecture behind virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, it replaced recurrent neural networks (RNNs/LSTMs) by processing all tokens in a sequence simultaneously rather than sequentially.

Why Transformers Replaced RNNs¶

RNNs process tokens one at a time, left to right. This sequential bottleneck means:

Training cannot be parallelized across sequence positions
Long-range dependencies decay over distance (vanishing gradients)
Training time scales linearly with sequence length

Transformers solve all three problems through self-attention, which computes relationships between every pair of tokens in a single matrix operation — fully parallelizable on GPUs.

High-Level Data Flow¶

graph LR
    A[Raw Text] --> B[Tokenizer]
    B --> C[Token IDs]
    C --> D[Embedding Layer]
    D --> E[+ Positional Encoding]
    E --> F[Transformer Blocks x N]
    F --> G[Output Layer / Logits]
    G --> H[Softmax → Probability Distribution]
    H --> I[Next Token]

Tokenization — text is split into subword tokens (integers from a fixed vocabulary)
Embedding — each token ID maps to a dense vector via a learned embedding table
Positional Encoding — positional signals are added so the model knows token order (attention itself is order-agnostic)
Transformer Blocks — a stack of N identical layers, each containing self-attention + feed-forward network + residual connections + layer normalization
Output Layer — projects hidden states to vocabulary-sized logits
Softmax — converts logits to a probability distribution over the vocabulary

Modern LLMs stack anywhere from about a dozen to well over a hundred transformer blocks. Deeper stacks enable richer hierarchical abstractions.

Inside a Transformer Block¶

Each transformer block contains two main sub-layers wrapped in residual connections and normalization:

graph TD
    A[Input] --> B[Layer Norm]
    B --> C[Multi-Head Self-Attention]
    C --> D[+ Residual Connection]
    D --> E[Layer Norm]
    E --> F[Feed-Forward Network]
    F --> G[+ Residual Connection]
    G --> H[Output to Next Block]

Feed-Forward Network (FFN)¶

The FFN provides the model's primary source of nonlinearity and parameter capacity. While attention handles communication between tokens, the FFN handles computation within each token's representation — this is where the model stores and applies learned knowledge.

The original transformer used a two-layer FFN with ReLU activation and a 4x hidden dimension expansion:

$$ \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2 $$

Modern LLMs have evolved significantly:

Component	Original Transformer	Modern LLMs (LLaMA/Mistral)
Activation	ReLU	SwiGLU (SiLU-gated)
Expansion ratio	4x	~2.7x (compensated by gating)
Normalization	LayerNorm (Post-LN)	RMSNorm (Pre-LN)

SwiGLU Activation¶

SwiGLU is a gated variant that has become the standard in LLaMA-family models. It works like a learned gate: up_proj(x) carries the information, and SiLU(gate_proj(x)) controls how much passes through:

$$ \text{SwiGLU}(x) = (\text{SiLU}(xW_{\text{gate}})) \odot (xW_{\text{up}}) $$

Even with a nominally lower expansion ratio (~2.7x vs 4x), SwiGLU-based FFNs have similar or greater effective capacity because the gate mechanism provides additional expressive power. Gemma uses GeGLU, a closely related variant.
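As a rough illustration of the gated FFN, the sketch below implements the SwiGLU formula above in plain Python and then applies a down projection back to the model width. The `W_down` matrix and the helper names (`matvec`, `swiglu_ffn`) are assumptions added for this example; real implementations use batched tensor kernels rather than Python lists.

```python
import math


def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, W):
    """Multiply row vector x (length d_in) by W (d_in rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN: (SiLU(x W_gate) * (x W_up)) W_down."""
    gate = [silu(g) for g in matvec(x, W_gate)]   # controls how much passes through
    up = matvec(x, W_up)                          # carries the information
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(hidden, W_down)                 # project back to model width


# Toy sizes: model width 2, hidden width 3 (all values are arbitrary).
x = [1.0, -2.0]
W_gate = [[0.5, -1.0, 0.0], [0.25, 0.0, 1.0]]
W_up = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y)
```

A gate pre-activation of zero gives `silu(0) = 0`, so the corresponding hidden unit contributes nothing regardless of what `W_up` produces.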

RMSNorm vs LayerNorm¶

LayerNorm performs two operations: centering (subtracting the mean) and scaling (dividing by standard deviation). RMSNorm removes centering entirely, normalizing only by root mean square — empirical studies found centering contributes little to training stability while scaling does the heavy lifting.

$$ \text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2}} \cdot \gamma $$

RMSNorm yields comparable performance to LayerNorm but shows 7–64% speed improvement.
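The difference between the two normalizations can be sketched in a few lines of plain Python. The `eps` stabilizer and the function names are assumptions for this example (the formula above omits `eps`); the `beta` shift in `layer_norm` is the usual learned bias of LayerNorm.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """Scale by the root mean square only; no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


def layer_norm(x, gamma, beta, eps=1e-6):
    """Center (subtract the mean), then scale by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


x = [3.0, 4.0]
ones, zeros = [1.0, 1.0], [0.0, 0.0]
print(rms_norm(x, ones))            # both outputs stay positive: no centering
print(layer_norm(x, ones, zeros))   # outputs are centered around zero
```

RMSNorm skips the mean and variance passes, which is where its speed advantage comes from.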

Pre-LN vs Post-LN¶

Placement	Description	Stability	Used By
Post-LN (original)	Norm applied after residual add	Requires careful LR warmup; gradient issues in deep nets	Original Transformer, BERT
Pre-LN (modern)	Norm applied before each sub-layer	Much more stable; often trains with little or no warmup	GPT-2, GPT-3, LLaMA, Mistral

Pre-LN normalizes input to each sub-layer, preventing activation explosions. The residual path remains clean, allowing gradients to flow easily. By LLaMA's release (2023), Pre-LN with RMSNorm became the undisputed standard.

Residual Connections¶

Residual (skip) connections add each sub-layer's input directly to its output: $\text{output} = \text{sublayer}(x) + x$. This allows gradients to flow through hundreds of layers without vanishing and lets each layer learn a refinement rather than a complete transformation.

Weight Tying¶

Many models tie the input embedding matrix with the output projection matrix (the layer that produces logits). Since both map between token IDs and hidden dimensions, sharing weights reduces parameter count and can improve generalization. GPT-2 and many smaller models use weight tying; larger models like LLaMA do not.

Encoder-Decoder vs Decoder-Only¶

The original transformer had two halves, an encoder and a decoder. Later architectures keep both or only one:

Architecture	Used By	How It Works
Encoder-Decoder	T5, BART, original Transformer	Encoder reads full input bidirectionally; decoder generates output autoregressively
Encoder-Only	BERT, RoBERTa	Bidirectional attention for understanding tasks (classification, NER)
Decoder-Only	GPT series, LLaMA, Claude, Mistral	Causal (left-to-right) attention; generates text one token at a time

Nearly all modern generative LLMs use the decoder-only variant. The encoder-only approach lives on in embedding models and classification tasks.

Self-Attention Mechanism¶

Self-attention is the core innovation that makes transformers work. It allows every token to "attend to" every other token in the sequence, computing relevance scores dynamically.

Query, Key, Value (QKV)¶

For each token, the model computes three vectors from the input embedding:

Query (Q) — "what am I looking for?"
Key (K) — "what do I contain?"
Value (V) — "what information do I provide?"

The attention score between two tokens is the dot product of one token's Query with another's Key, scaled and passed through softmax:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where $d_k$ is the dimension of the key vectors (scaling prevents dot products from growing too large).

Causal Masking¶

In decoder-only models, a causal mask is applied to the attention matrix: the upper triangle is set to $-\infty$ before softmax, preventing tokens from attending to future positions. This ensures autoregressive generation — each token can only see tokens that came before it.
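To make the formula and the mask concrete, here is a minimal single-head sketch in plain Python. It assumes Q, K, and V are already projected and given as lists of row vectors (one per token); the function names are illustrative, and production code uses fused GPU kernels instead.

```python
import math


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with future positions masked out."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))   # upper triangle: future token
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights


# Three tokens, head dimension 2 (values are arbitrary).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, attn = causal_attention(Q, K, V)
for row in attn:
    print([round(p, 3) for p in row])
```

The first token can only attend to itself, so its attention row is `[1.0, 0.0, 0.0]` and its output equals its own value vector.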

Multi-Head Attention¶

Rather than computing a single attention function, transformers use multiple attention heads (typically 32–128), each with independent Q/K/V projections. Different heads learn to capture different types of relationships (syntactic, semantic, positional). The outputs are concatenated and linearly projected.

Attention Variants¶

Variant	Description	Used By
Multi-Head Attention (MHA)	Each head has its own K, V projections	Original Transformer, GPT-2
Multi-Query Attention (MQA)	All heads share a single K, V projection	PaLM, Falcon
Grouped-Query Attention (GQA)	Heads grouped into clusters sharing K, V	LLaMA 2/3, Mistral, Gemma

GQA is the current standard — it reduces KV cache memory by 4-8x compared to MHA with minimal quality loss.
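The memory saving follows directly from the number of K/V heads that must be cached. The arithmetic below is a sketch for a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 8192-token context, 2-byte values); none of these numbers come from a specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV cache: one K and one V tensor per layer (factor 2)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical model shape, used only for illustration.
LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)   # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)         # 8 KV groups
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)         # single shared KV head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.3f} GiB")
```

With 32 query heads grouped into 8 KV heads, the cache shrinks by a factor of 4; the reduction factor is simply the ratio of query heads to KV heads.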

Tokenization and Embeddings¶

Tokenization¶

Tokenization converts raw text into integer token IDs from a fixed vocabulary. LLMs use subword tokenization — a middle ground between character-level (too fine) and word-level (can't handle unknown words).

Byte Pair Encoding (BPE) is the dominant algorithm:

Start with a vocabulary of 256 byte values
Find the most frequent adjacent byte pair in the training corpus
Merge that pair into a new token, add to vocabulary
Repeat until vocabulary reaches target size (30K–100K tokens)

Common words become single tokens; rare words decompose into known subword pieces.
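The merge loop described above can be sketched as follows. This is a toy trainer that treats the whole input as one byte sequence; production tokenizers typically pre-split text and use faster data structures. The early stop when no pair repeats is an added convenience, not part of the steps above.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges starting from the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}                      # (left_id, right_id) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                # nothing repeats: stop early
            break
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)
print(merges)
```

On this string the first merge joins the byte pair `(97, 97)` ("aa") into token 256, since it is the most frequent adjacent pair.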

Algorithm	Description	Used By
BPE (byte-level)	Merge most frequent byte pairs	GPT-2/3/4, LLaMA 3, Claude, Mistral
WordPiece	Merge pairs that maximize corpus likelihood	BERT, DistilBERT
SentencePiece	Language-agnostic, operates on raw text	LLaMA 1/2, Mistral (earlier), T5
Unigram	Probabilistic model, prunes vocabulary down	SentencePiece variant, XLNet

Tokenization Quirks

Many LLM "failures" trace back to tokenization. Math errors occur because multi-digit numbers split into arbitrary subword tokens. Spelling struggles happen because the model never sees individual characters. "Glitch tokens" — tokens frequent in tokenizer training data but rare in model training — produce unpredictable outputs.

Embeddings¶

The embedding layer maps each integer token ID to a dense vector (typically 4096–12288 dimensions). These vectors are learned during pretraining and encode semantic relationships: similar tokens have similar vectors.

Positional encoding adds sequence-order information since attention is inherently order-agnostic.

Positional Encoding Methods¶

Method	Type	How It Works	Used By
Sinusoidal	Absolute	Fixed sine/cosine functions at each position	Original Transformer
Learned Absolute	Absolute	Trainable embedding per position (up to max length)	GPT-2, BERT
RoPE (Rotary Position Embedding)	Relative	Encodes relative positions via rotation matrices applied to Q/K vectors	LLaMA 1/2/3, Mistral, Qwen, Gemma
ALiBi (Attention with Linear Biases)	Relative	Adds linear penalty proportional to token distance directly to attention scores	BLOOM, MPT
YaRN	Relative (extended)	Extends RoPE to longer contexts via NTK-aware interpolation	Long-context LLaMA variants

RoPE is the dominant method in 2025. It applies a rotation matrix to Q and K vectors such that the dot product $q \cdot k$ depends only on their relative position, not absolute. This enables better length extrapolation than learned absolute embeddings and avoids ALiBi's precision issues (see below).
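The relative-position property can be checked numerically. The sketch below rotates adjacent pairs of dimensions by an angle proportional to the position, using the common frequency base of 10000; the pairing layout and function names are assumptions for this example (some implementations pair dimension i with i + d/2 instead).

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each (x[i], x[i+1]) pair by pos * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
far = dot(rope(q, 103), rope(k, 100))    # positions 103 and 100, offset 3
print(near, far)
```

Both scores agree up to floating-point error because only the offset between the two positions enters the dot product.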

ALiBi adds a simple linear bias $-m \cdot |i - j|$ to each attention score, where $m$ is a head-specific slope and $|i-j|$ is the distance between tokens. While elegant, ALiBi has a critical interaction with reduced precision: in FP16, the last 20 positions of a head may map to only 5 distinct values, and in BF16 they may all collapse to the same value. This limits ALiBi's effectiveness for long-context inference.

Context Windows¶

The context window is the maximum number of tokens an LLM can process in one request. All input (system prompt, conversation history, user query) and output share this budget.

Era	Typical Context	Example Models
2018–2020	512–2048	BERT, GPT-2
2022–2023	4K–32K	GPT-3.5, GPT-4
2024–2025	128K–1M	Claude 3.5, Gemini 1.5, GPT-4 Turbo
2025–2026	1M–10M	Gemini 2.0, Llama 4 Scout

Lost in the Middle

Research shows LLMs attend strongly to tokens at the beginning and end of context but drop 30%+ accuracy on information in the middle (Liu et al., Stanford 2024). Placing critical information at the start or end of prompts improves retrieval quality.

Training Pipeline¶

Phase 1: Pretraining¶

The model learns general knowledge by predicting the next token across trillions of tokens from web crawls, books, code, and curated datasets. This is by far the most expensive phase — DeepSeek-V3 required 2.788 million H800 GPU hours (~$5.6M) for 14.8 trillion tokens.

The training objective is simple causal language modeling:

$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}) $$
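As a small worked example of this objective, the sketch below sums the negative log-probability the model assigned to each actual next token. The probability rows are made-up numbers over a three-token vocabulary, and the perplexity helper (exponential of the mean loss) is added for illustration.

```python
import math


def causal_lm_loss(step_probs, target_ids):
    """Sum over positions of -log P(x_t | x_1..x_{t-1})."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))


def perplexity(step_probs, target_ids):
    """exp of the average per-token loss."""
    return math.exp(causal_lm_loss(step_probs, target_ids) / len(target_ids))


# Predicted distributions at two positions (hypothetical values).
step_probs = [
    [0.50, 0.25, 0.25],   # model's distribution for position 1
    [0.10, 0.80, 0.10],   # model's distribution for position 2
]
target_ids = [0, 1]       # tokens that actually occurred

loss = causal_lm_loss(step_probs, target_ids)
print(loss, perplexity(step_probs, target_ids))
```

The loss here is `-(ln 0.5 + ln 0.8)`; assigning higher probability to the observed tokens lowers it.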

Pretraining Data Curation¶

The quality and composition of pretraining data is as critical as model architecture:

Step	What It Does	Why It Matters
Deduplication	Remove near-duplicate documents (MinHash, exact substring matching)	Duplicated data causes memorization, degrades generalization, inflates benchmark scores
Quality filtering	Score documents via heuristics or classifier (perplexity, language ID, content quality)	Removes spam, boilerplate, machine-generated text
Toxicity/PII removal	Filter harmful content and personally identifiable information	Safety and legal compliance
Domain mixing	Control proportions of web, code, books, scientific papers, multilingual data	Affects which capabilities the model develops
Data scheduling	Vary data mix during training (e.g., increase code/math ratio later)	Optimizes learning curriculum

Modern data pipelines use classifier-based filtering — training a small model on known high-quality text (e.g., Wikipedia, textbooks) and scoring all candidate documents. LLaMA 3 used this approach extensively.

Synthetic Data in Pretraining¶

Synthetic data — generated by existing LLMs — is increasingly used to augment pretraining corpora:

Textbook-quality data: Phi models (Microsoft) demonstrated that small models trained on LLM-generated "textbook-style" data can outperform much larger models on reasoning benchmarks
Code generation: synthetic programming problems and solutions supplement natural code repositories
Math and reasoning: step-by-step solutions generated by strong models provide training signal for reasoning capabilities
Instruction data: synthetic instruction-response pairs bootstrap SFT datasets at scale

Model Collapse

Training on too much synthetic data without sufficient real data can cause "model collapse" — progressive degradation of quality as the model learns from its own distribution rather than the true data distribution. Careful mixing ratios (typically <30% synthetic) and quality filtering mitigate this risk.

Phase 2: Supervised Fine-Tuning (SFT)¶

The pretrained model is further trained on curated instruction-response pairs to learn:

Instruction following
Output formatting (JSON, markdown, structured responses)
Safety behaviors
Task-specific patterns

SFT datasets are much smaller (thousands to millions of examples) but high quality.

Phase 3: Alignment¶

RLHF (Reinforcement Learning from Human Feedback)¶

The traditional alignment pipeline:

graph LR
    A[SFT Model] --> B[Generate Multiple Responses]
    B --> C[Human Annotators Rank Outputs]
    C --> D[Train Reward Model]
    D --> E[Optimize Policy via PPO]
    E --> F[Aligned Model]

Generate multiple responses per prompt
Human annotators rank them
Train a reward model to predict human preferences
Use PPO (Proximal Policy Optimization) to optimize the SFT model (the policy) against the reward model

Downsides: complex, expensive, unstable training, susceptible to reward hacking.

DPO (Direct Preference Optimization)¶

Introduced by Rafailov et al. (2023), DPO simplifies alignment by eliminating the reward model entirely. It reframes preference learning as a binary classification problem:

Given a chosen response and a rejected response, directly optimize the model to increase the probability of the chosen response relative to the rejected one
Requires only 2 models (policy + frozen reference) vs RLHF's 4
Standard supervised learning infrastructure — no RL instability
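The classification view can be sketched for a single preference pair. The inputs are summed log-probabilities of each full response under the policy and under the frozen reference; the function name and the `beta=0.1` default are assumptions chosen for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z))
    return math.log1p(math.exp(-logit))


# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(start, improved)
```

At initialization the loss equals `ln 2` (a coin flip between the two responses), and it falls as the policy favors the chosen response more strongly than the reference does.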

By 2025, 70% of enterprises use RLHF or DPO for alignment, with DPO adoption growing 45% year-over-year.

Constitutional AI (CAI)¶

Developed by Anthropic, CAI replaces human preference labeling with self-critique based on ethical principles (a "constitution"). The model generates responses, critiques its own outputs against the constitution, and revises — enabling scalable alignment without massive human annotation.

Phase 4: Reinforcement Learning for Reasoning¶

Models like DeepSeek-R1 and OpenAI o1/o3 add an RL phase specifically targeting step-by-step reasoning:

Train the model to generate and verify chains of thought
Reward correct final answers and valid reasoning steps
Results: DeepSeek-R1 reports 97.3% on MATH-500 and 79.8% on AIME 2024 competition problems

Mixture of Experts (MoE)¶

MoE introduces sparsity into the model: instead of activating all parameters for every token, only a subset of specialized "expert" sub-networks fire. This achieves the quality of massive models at the compute cost of much smaller ones.

How MoE Works¶

graph TD
    A[Input Token] --> B[Router / Gating Network]
    B --> C[Expert 1]
    B --> D[Expert 2]
    B --> E[Expert 3]
    B --> F["Expert N (inactive)"]
    C --> G[Weighted Sum of Active Expert Outputs]
    D --> G
    E --> G
    G --> H[Output]
    style F fill:#ccc,stroke:#999

A router (small neural network) scores all experts for each input token
The top-K experts (typically top-2) are selected
Their outputs are combined via weighted sum
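Remaining experts are not computed, saving most of the FFN FLOPs; the exact fraction depends on K and the expert count (about 75% of expert compute for top-2 of 8)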
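The routing steps above can be sketched as follows. Router scores are passed in directly rather than computed by a learned layer, experts are plain Python callables, and the gate weights are renormalized over the selected experts only; all of these are simplifying assumptions for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the top-k experts and mix their outputs by gate weight."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_logits[e] for e in chosen])
    out = [0.0] * len(x)
    for gate, e in zip(gates, chosen):
        y = experts[e](x)                       # unselected experts never run
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen


calls = []   # records which experts were actually executed


def make_expert(scale):
    def expert(x):
        calls.append(scale)
        return [scale * v for v in x]
    return expert


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward([1.0], [0.0, 2.0, 1.0, -1.0], experts, k=2)
print(out, chosen, calls)
```

Only the experts at indices 1 and 2 execute here; the other two contribute neither compute nor output.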

Key MoE Models¶

Model	Total Params	Active Params	Experts	Innovation
Mixtral 8x7B	46.7B	12.9B	8	First major open-source MoE; static top-2 routing
Mixtral 8x22B	141B	~39B	8	Scaled Mixtral architecture
DeepSeek-V3	671B	37B	256	Fine-grained experts; auxiliary-loss-free load balancing; FP8 training
DeepSeek-R1	671B	37B	256	RL-first reasoning on V3 base; 97.3% on MATH-500
Llama 4 Scout	109B	17B	16	Meta's first MoE; 10M token context
Llama 4 Maverick	400B	17B	128	128 experts, top-1 routing

DeepSeek's MoE Innovations¶

DeepSeek introduced two key strategies:

Fine-grained experts — segment into many small experts (256 instead of 8), activate a small subset, allowing more flexible combinations
Shared experts — isolate some experts as "shared" across all tokens to capture common knowledge, reducing redundancy in routed experts

As of 2025, many frontier models use MoE architectures, including Llama 4, DeepSeek-V3, and Gemini 1.5; the architectures of several closed models (for example GPT-4 and Claude) have not been publicly confirmed.

Load Balancing and Expert Collapse¶

A critical challenge in MoE training is expert collapse — the router learns to send most tokens to a few "popular" experts while others receive little traffic and stop learning. This wastes capacity and reduces model quality.

Solutions:

Technique	How It Works	Used By
Auxiliary load-balancing loss	Adds a penalty term that encourages equal token distribution across experts	Mixtral, Switch Transformer
Expert capacity factor	Caps the max tokens per expert; overflow tokens are dropped or sent to a default expert	GShard, Switch Transformer
Auxiliary-loss-free balancing	Uses a bias term in the router to balance load without distorting the main training loss	DeepSeek-V3
Shared experts	Reserve some experts as "always active" to handle common knowledge, reducing pressure on routed experts	DeepSeek-V2/V3

DeepSeek-V3's auxiliary-loss-free approach is notable because traditional auxiliary losses can conflict with the main training objective, forcing a trade-off between load balance and model quality. By using a separate bias term, DeepSeek avoids this conflict entirely.

MoE Memory Tradeoff

MoE memory scales with total parameters, not active parameters. A 671B MoE model needs hundreds of GB of VRAM even though only 37B parameters fire per token. This forces multi-GPU deployments for large MoE models.

Quantization Formats¶

Quantization reduces model weight precision from high-bit (FP32/FP16) to lower-bit (INT8/INT4) representations, dramatically reducing memory and improving inference speed.

Numeric Precision Types¶

Floating-Point Bit Layout¶

Understanding the sign/exponent/mantissa structure explains why these formats differ:

FP32:  [1 sign] [8 exponent] [23 mantissa]  — 32 bits total
FP16:  [1 sign] [5 exponent] [10 mantissa]  — 16 bits total
BF16:  [1 sign] [8 exponent] [ 7 mantissa]  — 16 bits total
FP8:   [1 sign] [4 exponent] [ 3 mantissa]  — 8 bits total (E4M3 variant)

Property	FP32	BF16	FP16
Dynamic range (decades)	~83	~79	~12
Epsilon (precision near 1.0)	~1.2e-7	~7.8e-3	~9.8e-4
Max value	~3.4e38	~3.4e38	~65,504
Loss scaling needed?	No	Rarely	Often yes

BF16 has the same 8-bit exponent as FP32, giving it nearly identical dynamic range — this means it can represent extremely small gradients and large activations without underflow/overflow. The trade-off is lower precision (7 mantissa bits vs FP16's 10). In practice, BF16 "just works" for training because you rarely need loss scaling.

FP16 has higher precision within a narrow range but risks overflow during training. Loss scaling (multiplying the loss by a large factor, then dividing gradients back) is often required to prevent gradient underflow.
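These range and precision differences can be observed with the standard library alone. In the sketch below, FP16 uses the `struct` module's half-precision format, while BF16 is emulated by keeping the top 16 bits of a float32; real hardware usually rounds to nearest rather than truncating, so the BF16 helper is an approximation. The loss scale of 65536 is an arbitrary example value.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fp16_overflows(x: float) -> bool:
    """True if x is too large to be stored as a finite FP16 value."""
    try:
        struct.pack("<e", x)
        return False
    except (OverflowError, struct.error):
        return True


tiny_grad = 1e-8
print(to_fp16(tiny_grad))       # underflows to 0.0
print(to_bf16(tiny_grad))       # survives, with reduced precision
print(fp16_overflows(70000.0))  # above FP16's max of 65,504
print(to_bf16(70000.0))         # representable, but coarse: 69632.0
print(to_fp16(1.001), to_bf16(1.001))  # FP16 is finer near 1.0

# Loss scaling: scale up before the FP16 cast, divide back afterwards.
scale = 65536.0
recovered = to_fp16(tiny_grad * scale) / scale
print(recovered)
```

The scaled gradient lands inside FP16's normal range, so dividing by the scale afterwards recovers a value close to the original instead of zero.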

Format	Bits	Bytes/Param	Description	Use Case
FP32	32	4	Full precision float	Training optimizer states only
BF16	16	2	Brain Float 16 — wider dynamic range than FP16	Standard training & full-quality inference; post-2022 default
FP16	16	2	Half precision float	Legacy training and inference; same memory as BF16 with finer precision but a much narrower dynamic range
FP8	8	1	8-bit float; native on Hopper/Blackwell GPUs	Production sweet spot on modern NVIDIA hardware
INT8	8	1	8-bit integer	~50% memory reduction vs FP16; broad hardware support
INT4	4	0.5	4-bit integer	~75% memory reduction; per-group scaling preserves quality
INT2	2	0.25	2-bit integer	Extreme compression; significant quality loss

How Quantization Works¶

Full-precision weights (e.g. FP16) are mapped to a smaller set of representable values:

Per-tensor quantization — one scale factor for the entire weight tensor (fast but lossy)
Per-channel quantization — one scale factor per output channel (better quality)
Per-group quantization — divides weights into small groups (commonly 32 to 128 elements), each with its own scale (best quality/size tradeoff for INT4)

The scale factor maps the quantized integer range back to the original floating-point range during inference.
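The effect of scale granularity can be seen in a toy symmetric quantizer. This is a simplified sketch (absmax scaling, no zero-point, no calibration), not the scheme of any particular format; a group size equal to the tensor length stands in for per-tensor quantization.

```python
def quantize_group(group, bits=4):
    """Symmetric absmax quantization of one group to signed integers."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in group)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in group]
    return q, scale


def quantize_per_group(weights, group_size=128, bits=4):
    """Split into groups and give each its own scale factor."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


def dequantize(quantized):
    """Map integers back to floats using each group's scale."""
    return [q * scale for qs, scale in quantized for q in qs]


def max_error(weights, group_size):
    restored = dequantize(quantize_per_group(weights, group_size))
    return max(abs(a - b) for a, b in zip(weights, restored))


# Small-magnitude weights next to large ones (arbitrary values).
weights = [0.07, -0.03, 0.01, 0.05, 7.0, -3.0, 1.0, 5.0]
print(max_error(weights, group_size=8))   # one scale for everything
print(max_error(weights, group_size=4))   # one scale per group of 4
```

With a single scale, the large weights dictate the step size and the small ones all round to zero; separate group scales preserve both ranges.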

Quality Impact by Precision¶

Perplexity benchmarks (Llama-2-7B, lower is better):

Format	Perplexity	Quality Loss
FP16 (baseline)	7.4924	—
Q8_0	7.4933	Negligible
Q5_K_M	~7.52	Minimal
Q4_K_M	7.5692	Acceptable
Q3_K_M	~7.85	Noticeable
Q2_K	8.6501	Significant degradation

Low-Bit Caveats

At Q2/Q3, models start ignoring parts of system prompts and hallucinating JSON formatting. Avoid INT4 and below for math, code generation, and reasoning-heavy tasks where quality loss is most noticeable.

Size and Speed Example (Llama 2 13B)¶

Metric	FP16	Q4_K_M
Model size	26 GB	7.9 GB (70% reduction)
RAM required	32 GB+	12 GB
Speed	8 tok/s	15 tok/s
Quality	100%	~95%

Model Formats and Quantization Methods¶

GGUF (GPT-Generated Unified Format)¶

GGUF is a self-contained file format created by the llama.cpp project. It bundles weights, tokenizer, architecture metadata, and chat template into a single .gguf file.

Key properties:

Runs on everything — CPU, NVIDIA, AMD, Apple Silicon
mmap-able (OS maps file into memory without loading it all)
Endian-safe and versioned
Powers Ollama and LM Studio under the hood

GGUF quantization naming convention:

Name	Approx Bits/Weight	Type	Quality
Q2_K	~2.6	K-quant	Extreme compression, noticeable degradation
Q3_K_S / Q3_K_M / Q3_K_L	~3.3–3.9	K-quant	Budget-conscious, some quality loss
Q4_K_S / Q4_K_M	~4.3–4.8	K-quant	Best balance of quality and size
Q5_K_S / Q5_K_M	~5.3–5.7	K-quant	Near-lossless for most tasks
Q6_K	~6.6	K-quant	Very close to FP16
Q8_0	~8.5	Legacy	Near-identical to FP16
IQ2_XXS / IQ3_S	~2.1–3.4	I-quant	State-of-art low-bit; uses lookup tables

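The "K" indicates the k-quant method (importance-aware mixed precision); the S/M/L suffixes (small, medium, large) indicate how much of the model is kept at higher precision, so S is the most aggressively compressed.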

GPTQ (GPT-Quantized)¶

Calibration-based 4-bit integer quantization using approximate second-order (Hessian) information to minimize quantization error. Requires a small calibration dataset. AutoGPTQ was archived in April 2025; succeeded by GPTQModel v5.8.0.

Verdict: Use only if AWQ or EXL2 versions are unavailable. Both offer better quality-per-bit.

AWQ (Activation-Aware Weight Quantization)¶

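Developed at MIT. Identifies the roughly 1% of "salient" weight channels by observing activations during calibration, then protects them with per-channel scaling before quantization, so the whole tensor can stay in a uniform low-bit format rather than mixing precisions.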

~3 percentage points better than GPTQ on MMLU at 4 bits
Marlin-AWQ kernel: ~741 tok/s on A10G — fastest 4-bit for NVIDIA
Best choice for vLLM multi-user deployments on NVIDIA

EXL2 (ExLlamaV2)¶

Mixed bit-width quantization — can use 2, 3, 4, 5, 6, 8 bits within a single model and even within individual layers. Supports fractional average bitwidths (e.g., 4.5 bpw).

Fastest for interactive single-user generation on NVIDIA GPUs (40–70% faster than llama.cpp)
NVIDIA CUDA only, no CPU fallback
Best for single-user interactive sessions at 4–6 bpw

Quick Format Decision Guide¶

Scenario	Best Format
CPU / Laptop / Apple Silicon	GGUF (Q4_K_M or Q5_K_M)
NVIDIA GPU, max serving throughput	AWQ with Marlin kernels
NVIDIA GPU, single-user interactive	EXL2 at 4–6 bpw
NVIDIA H100/Blackwell production	FP8
Fine-tuning	bitsandbytes (QLoRA)
Limited VRAM (≤8GB)	GGUF Q4_K_M with CPU offloading
General starting point	Ollama with Q4_K_M

MLX (Apple Silicon)¶

MLX is Apple's open-source array framework for machine learning on Apple Silicon. Designed for Mac-native LLM inference and fine-tuning.

Key Design Principles¶

Unified memory — arrays live in shared CPU/GPU memory; no data transfer overhead
Lazy computation — operations are materialized only when needed, enabling automatic fusion
Dynamic graphs — no recompilation on shape changes (unlike TensorRT)
Familiar APIs — Python API mirrors NumPy; mlx.nn mirrors PyTorch

MLX LM¶

The mlx-lm package provides one-command model download, quantization conversion, and inference:

# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Llama-3-8B --quantize --q-bits 4

# Generate text
mlx_lm.generate --model mlx-community/Llama-3-8B-4bit --prompt "Explain attention"

Performance on Apple Silicon¶

Chip	Memory Bandwidth	14B Dense (BF16) TTFT	30B MoE (4-bit) TTFT
M4	120 GB/s	~12s	~4s
M5	153 GB/s	<10s	<3s

The M5 provides a 19–27% improvement over the M4, roughly tracking its 28% memory bandwidth increase, since token generation is typically memory-bandwidth-bound.

Research shows vllm-mlx achieves 21–87% higher throughput than llama.cpp on Apple Silicon, thanks to zero-copy tensor operations and lazy evaluation.

A MacBook Pro 24GB can hold an 8B model in BF16 or a 30B MoE at 4-bit quantization comfortably.

Knowledge Distillation¶

Knowledge distillation compresses a large teacher model into a smaller student model that mimics the teacher's behavior while being far cheaper to run.

How It Works¶

graph LR
    A[Input Data] --> B[Teacher Model - Large]
    A --> C[Student Model - Small]
    B --> D[Soft Targets / Probabilities]
    D --> E[Distillation Loss]
    C --> F[Student Predictions]
    F --> E
    E --> G[Update Student Weights]

Three main distillation approaches:

Method	What Transfers	Description
Response-based	Output probabilities ("soft targets")	Student learns teacher's probability distribution over vocabulary, not just the argmax
Feature-based	Intermediate layer activations	Student aligns internal representations via L2 or cosine similarity
Attention-based	Attention maps	Student replicates teacher's attention patterns (used in TinyBERT and MiniLM)

Why Soft Targets Matter¶

Instead of training on hard labels (the single correct answer), the student learns from the teacher's full probability distribution. The relative probabilities encode the teacher's learned generalizations — for example, that "dog" and "puppy" are similar while "dog" and "table" are not.

A temperature parameter $T$ (typically 2–5) controls how "soft" the distribution is: higher temperature spreads probability more evenly, exposing more of the teacher's learned structure.
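A short sketch shows how temperature softens the teacher's distribution and how a response-based distillation loss can be formed from it. The logits are invented values for three tokens, and the `T*T` factor follows the common convention of rescaling the loss so gradient magnitudes stay comparable across temperatures.

```python
import math


def softmax_t(logits, T=1.0):
    """Softmax over logits divided by temperature T."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits, T=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl


# Hypothetical teacher logits for the tokens "dog", "puppy", "table".
teacher = [5.0, 3.0, -2.0]
student = [4.0, 0.5, 0.0]

print([round(p, 4) for p in softmax_t(teacher, T=1.0)])
print([round(p, 4) for p in softmax_t(teacher, T=4.0)])
print(distill_loss(teacher, student, T=2.0))
```

At the higher temperature the top token's probability drops and the others rise, which exposes how the teacher ranks the alternatives.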

Results¶

Typical compression: 5–10x smaller, retaining 90–95% accuracy
DistilBERT: 60% of BERT's size, 97% of its performance, 60% faster
DeepSeek-R1-Distill models: distilled from 671B to 7B/14B/32B variants with strong reasoning capabilities

Emerging Trends (2025)¶

Chain-of-Thought Distillation — transfers reasoning processes (not just final answers) from teacher to student using CoT rationales as training signal
Curriculum Distillation — organizes training easy-to-hard to gradually build reasoning capacity
Multi-Teacher Distillation — combines expertise from multiple specialized teachers with dynamic weighting
Few-Shot Distillation — effective with as few as 8–512 calibration samples using counterfactual explanations

Scaling Laws¶

Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) established empirical scaling laws for LLMs:

Model performance (loss) improves predictably as a power law of: model size (parameters), dataset size (tokens), and compute budget (FLOPs)
Chinchilla-optimal: for a given compute budget, model size and training tokens should scale in equal proportion, which works out to roughly 20 training tokens per parameter
DeepSeek-V3 trained on 14.8T tokens with 671B total params (about 22 tokens per parameter, close to Chinchilla) but only 37B active params (about 400 tokens per active parameter), so whether it counts as over-trained depends on how MoE's sparse activation is accounted for

These laws guide decisions about how to allocate training budgets: bigger model vs more data vs longer training.
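A back-of-the-envelope version of this budgeting is sketched below. It relies on two rules of thumb that are assumptions here rather than statements from this page: training compute is approximately 6 × parameters × tokens, and the compute-optimal ratio is about 20 tokens per parameter.

```python
import math


def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens) using C ~= 6 * N * D."""
    # With D = r * N:  C = 6 * r * N^2  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def tokens_per_parameter(tokens, params):
    return tokens / params


# Budget equivalent to a 70B-parameter model trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n, d = chinchilla_optimal(budget)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")

# DeepSeek-V3 figures quoted above: 14.8T tokens, 671B total, 37B active.
print(tokens_per_parameter(14.8e12, 671e9))   # per total parameter
print(tokens_per_parameter(14.8e12, 37e9))    # per active parameter
```

The same token count looks near-optimal against total parameters and heavily over-trained against active parameters, which is why MoE models do not map cleanly onto dense scaling laws.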

Inference-Time Compute Scaling (Test-Time Compute)¶

A newer scaling axis discovered in 2024–2025: instead of only scaling training compute, you can scale inference compute by letting models "think longer" at test time.

Approach	How It Works	Example
Chain-of-Thought (CoT)	Generate step-by-step reasoning before the final answer	GPT-4, Claude
Best-of-N sampling	Generate N candidate answers, select the best one via verifier	Used in math benchmarks
Tree search	Explore multiple reasoning paths, backtrack when stuck	AlphaProof, Tree of Thoughts
Self-verification	Model checks its own answer and retries if wrong	DeepSeek-R1
Extended thinking	Dedicated "thinking" token budget separate from the visible response	Claude 3.7+ extended thinking, OpenAI o1/o3

OpenAI's o1/o3 and DeepSeek-R1 demonstrated that inference-time compute scaling can yield dramatic improvements on reasoning-heavy tasks, sometimes matching models 10x their size on math and coding benchmarks. The key insight: a smaller model thinking longer can outperform a larger model answering immediately.

Model Merging¶

Model merging combines the weights of multiple fine-tuned LLMs into a single model — no additional training required, no GPU needed. This creates models that combine capabilities from different specializations.

Why Merge?¶

Combine a code-focused model with a math-focused model into one that excels at both
Merge different LoRA adapters trained on different tasks
Reduce the cost of multi-task deployment (one merged model vs multiple specialized ones)
Experiment cheaply — thousands of merged models appear on the Open LLM Leaderboard

Merging Techniques¶

Method	How It Works	Strengths	Limitations
Linear / LERP	Weighted average of model weights: $W = \alpha W_A + (1-\alpha) W_B$	Simplest, fast	Naive averaging can cause interference between conflicting weight updates
SLERP (Spherical Linear Interpolation)	Interpolates along the hypersphere surface, preserving vector magnitudes	Maintains geometric properties; smoother than linear	Limited to merging exactly 2 models
TIES (Trim, Elect Sign & Merge)	Resets tiny deltas, resolves sign conflicts by majority vote, then merges cleaned updates	Handles interference between models; works with many models	More complex pipeline
DARE (Drop And REscale)	Randomly drops 90–99% of delta parameters, rescales remaining by $\frac{1}{1-p}$	Effective even at extreme sparsity; reduces parameter interference	Random dropping adds variance
DARE + TIES	Combines DARE's random sparsification with TIES sign resolution	Best of both approaches	Requires tuning drop rate and thresholds
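Two of the simpler methods in the table can be sketched on flat weight lists. This is an illustrative toy, not MergeKit's implementation: real merges operate tensor by tensor on full checkpoints, and the function names and fixed seed are assumptions for the example.

```python
import random


def linear_merge(w_a, w_b, alpha):
    """Weighted average: alpha * W_A + (1 - alpha) * W_B."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(w_a, w_b)]


def dare_merge(base, finetuned, drop_p, seed=0):
    """Drop each delta with probability drop_p, rescale the rest by 1/(1-p)."""
    if not 0.0 <= drop_p < 1.0:
        raise ValueError("drop_p must be in [0, 1)")
    rng = random.Random(seed)
    rescale = 1.0 / (1.0 - drop_p)
    merged = []
    for b, f in zip(base, finetuned):
        delta = f - b                       # what fine-tuning changed
        if rng.random() < drop_p:
            delta = 0.0                     # dropped: fall back to base weight
        else:
            delta *= rescale                # kept: rescaled to preserve the mean
        merged.append(b + delta)
    return merged


print(linear_merge([1.0, 2.0], [3.0, 4.0], alpha=0.5))

base = [0.0] * 10_000
finetuned = [1.0] * 10_000
sparse = dare_merge(base, finetuned, drop_p=0.9)
dropped = sum(1 for w in sparse if w == 0.0) / len(sparse)
mean = sum(sparse) / len(sparse)
print(dropped, mean)
```

Roughly 90% of the deltas are zeroed, yet the average update stays close to the original because the survivors are scaled up by about 10.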

Tooling: MergeKit¶

MergeKit (by Arcee AI) is the standard open-source tool for model merging. It provides an extensible framework supporting all major algorithms and has been used to create thousands of merged models. Configuration is YAML-based:

models:
  - model: code-specialist/model
    parameters:
      weight: 0.6
  - model: math-specialist/model
    parameters:
      weight: 0.4
merge_method: ties
base_model: base/model
parameters:
  density: 0.5
  normalize: true
dtype: bfloat16

Emerging Trends (2025)¶

Reasoning model merging: merging "slow-thinking" reasoning models with "fast" conventional LLMs can reduce token consumption by ~50% while maintaining accuracy
Newer algorithms: NuSLERP, DELLA (Drop and Rescale via Sampling with Magnitude), and SCE (Select, Calculate, and Erase) offer incremental improvements
All merging methods still fall short of individually fine-tuned models on their specific tasks — merging trades peak specialization for broader capability

Post-Transformer Architectures¶

While transformers dominate, alternatives are emerging:

Architecture	Key Innovation	Status
Mamba (State Space Models)	Selective state updates; linear-time sequence processing; no quadratic attention	Competitive with transformers at small-medium scale
RWKV	RNN-transformer hybrid; linear attention	Active open-source community
Hyena	Long convolutions replace attention	Research stage
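PaTH Attention (MIT, 2025)	Data-dependent position encoding built from accumulated Householder transformations, used in place of a fixed scheme such as RoPE inside standard attention	Reported gains on state-tracking, reasoning, and long-context tasks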

None has yet displaced transformers at frontier scale, but Mamba-based hybrids (for example, Jamba by AI21) show promise for efficiency-critical applications.

RNN & SSM Internals¶

Source: ref - alisa-book-of-llms

Vanilla RNN¶

A vanilla RNN processes sequences one timestep at a time, maintaining a hidden state $h_t$:

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b), \quad y_t = W_\text{out} h_t + b_\text{out}$$

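The gradient through time involves $\partial h_t / \partial h_{t-1} = \text{diag}(\tanh'(z_t)) \cdot W_h$, where $z_t = W_x x_t + W_h h_{t-1} + b$. Backpropagating across $k$ timesteps multiplies $k$ such Jacobians. Since $\tanh' \in (0, 1]$, the product shrinks exponentially unless $W_h$ has large singular values (in which case it can explode instead), so RNNs struggle with long-range dependencies.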
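
To make the shrinking product concrete, here is a minimal sketch for a scalar RNN (hidden size 1), where the Jacobian reduces to the number $\tanh'(z_t) \cdot w_h$. The function name, weights, and inputs are illustrative assumptions.

```python
import math

def rnn_gradient_norms(w_h, w_x, inputs, h0=0.0, b=0.0):
    """Scalar RNN: return |d h_t / d h_0| for every timestep t."""
    h, grad, norms = h0, 1.0, []
    for x in inputs:
        z = w_x * x + w_h * h + b
        h = math.tanh(z)
        grad *= (1.0 - h * h) * w_h  # tanh'(z) * w_h, the per-step Jacobian
        norms.append(abs(grad))
    return norms

# 50 timesteps with a recurrent weight slightly below 1.
norms = rnn_gradient_norms(w_h=0.9, w_x=1.0, inputs=[0.5] * 50)
```

Each factor is at most 0.9 here, so the influence of $h_0$ on $h_{50}$ is bounded by $0.9^{50}$, which is already below 1%.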

LSTM¶

LSTMs introduce a cell state $c_t$ (long-term memory) flowing through a "highway" with only elementwise operations — no matmuls or nonlinearities. Information is added or removed only through gates:

Gate	Formula	Purpose
Forget	$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$	What to erase from memory
Input	$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$	What new info to write
Candidate	$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$	Proposed new memory content
Cell update	$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$	Combined memory
Output	$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$	What to expose
Hidden state	$h_t = o_t \odot \tanh(c_t)$	Working output

The separation of cell state and hidden state is what makes LSTMs work — a vanilla RNN tries to make a single vector serve as both long-term memory and current output.
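
The gate equations can be traced in a small sketch with scalar state (hidden size 1), so every $\odot$ becomes ordinary multiplication. The `params` layout, the gate names, and the candidate term $\tilde{c}_t = \tanh(\cdot)$ written out explicitly are assumptions of this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with scalar state; params[name] = (w_h, w_x, bias)."""
    def affine(name):
        w_h, w_x, bias = params[name]
        return w_h * h_prev + w_x * x_t + bias

    f_t = sigmoid(affine("forget"))            # what to erase from memory
    i_t = sigmoid(affine("input"))             # what new info to write
    c_tilde = math.tanh(affine("candidate"))   # proposed new memory content
    o_t = sigmoid(affine("output"))            # what to expose
    c_t = f_t * c_prev + i_t * c_tilde         # cell update: elementwise only
    h_t = o_t * math.tanh(c_t)                 # working output
    return h_t, c_t

# Forget gate saturated open, input gate saturated shut: memory passes through.
keep_params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (0.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.3, c_prev=0.7, params=keep_params)
```

With these gate settings the cell state is carried forward almost unchanged while the hidden state is a gated view of it, which is the separation the paragraph above describes.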

State Space Models (Mamba)¶

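SSMs model sequences as discrete dynamical systems with hidden state $x_k$, input $u_k$, and output $y_k$: $x_k = A x_{k-1} + B_k u_k$, $y_k = C_k x_k$. Mamba makes the input matrix $B$, the output matrix $C$, and the discretization step $\Delta$ (which controls how quickly $A$ decays the state) input-dependent: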

$$B_k = f_B(u_k), \quad C_k = f_C(u_k), \quad \Delta_k = f_\Delta(u_k)$$

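Recurrent mode: $O(n)$ total time, $O(1)$ memory per step (only the fixed-size state is carried forward)
Parallel mode: efficient for training (a convolution for time-invariant SSMs; a parallel scan for Mamba, whose input-dependent parameters rule out a fixed convolution kernel)
vs. Transformers: $O(n)$ instead of $O(n^2)$ in sequence length, but the past is compressed into a fixed-size state, so the model cannot look back at an arbitrary earlier token the way attention can (attention has $O(1)$ random access to every past token)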
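
A scalar sketch of the recurrent mode may help. It assumes specific forms that the text leaves open: $f_B$ and $f_C$ are linear in $u_k$, $f_\Delta$ is a softplus, and the discretization is $\bar{A}_k = \exp(\Delta_k A)$, $\bar{B}_k = \Delta_k B_k$. All names and default values are hypothetical.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_ssm(inputs, a=-1.0, w_b=1.0, w_c=1.0, w_delta=1.0):
    """Run a scalar selective SSM in recurrent mode; return the outputs y_k."""
    state, outputs = 0.0, []
    for u in inputs:
        b_k, c_k = w_b * u, w_c * u        # input-dependent B_k and C_k (assumed linear)
        delta_k = softplus(w_delta * u)    # input-dependent step size, always > 0
        a_bar = math.exp(delta_k * a)      # discretized transition, in (0, 1) when a < 0
        # Only `state` is carried between steps: O(1) memory per step.
        state = a_bar * state + delta_k * b_k * u
        outputs.append(c_k * state)
    return outputs

outputs = selective_ssm([1.0, 0.0, 1.0])
```

The loop touches each input once, which is the $O(n)$ cost; everything the model remembers about earlier inputs has to fit in the single number `state`.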

Post-Training Algorithms¶

Source: ref - alisa-book-of-llms

Policy Gradients (REINFORCE)¶

An LLM generates trajectory $\tau = (s_0, a_0, \ldots)$ where each action (token) $a_t \sim \pi_\theta(\cdot \mid s_t)$. The objective is to maximize expected reward $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, whose gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$$

Derived via the log-derivative trick: $\nabla P = P \, \nabla \log P$.

High variance

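Without a baseline, the raw reward scales every log-probability gradient, so when rewards are all positive every sampled response gets reinforced, including the worst ones in the batch. The baselined policy gradient subtracts $V_\psi(s_t)$ to center the reward signal. This doesn't change the expected gradient because $\mathbb{E}_{a \sim \pi}[\nabla \log \pi(a \mid s)] = \sum_a \nabla \pi(a \mid s) = \nabla 1 = 0$.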

The advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better action $a$ is compared to the average action in state $s$.
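
The effect of a baseline can be checked exactly on a one-step problem. The sketch below assumes a softmax policy over three actions with made-up rewards and computes the expectation by summing over actions rather than sampling; the function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_policy_gradient(logits, rewards, baseline=0.0):
    """Exact E_a[ grad log pi(a) * (R(a) - baseline) ] with respect to the logits."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        for j in range(len(logits)):
            # d log pi(a) / d logit_j = 1[a == j] - pi(j)
            score = (1.0 if a == j else 0.0) - probs[j]
            grad[j] += p_a * score * (rewards[a] - baseline)
    return grad

def gradient_second_moment(logits, rewards, baseline=0.0):
    """E_a[ ||grad log pi(a) * (R(a) - baseline)||^2 ], a proxy for estimator variance."""
    probs = softmax(logits)
    total = 0.0
    for a, p_a in enumerate(probs):
        sq = sum(((1.0 if a == j else 0.0) - probs[j]) ** 2 for j in range(len(probs)))
        total += p_a * sq * (rewards[a] - baseline) ** 2
    return total

logits = [0.0, 0.0, 0.0]
rewards = [10.0, 11.0, 12.0]            # all positive: every action looks "good"
mean_reward = sum(rewards) / len(rewards)
grad_raw = expected_policy_gradient(logits, rewards)
grad_centered = expected_policy_gradient(logits, rewards, baseline=mean_reward)
spread_raw = gradient_second_moment(logits, rewards)
spread_centered = gradient_second_moment(logits, rewards, baseline=mean_reward)
```

The two expected gradients agree, while the second moment of the per-sample gradient drops sharply once the rewards are centered.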

Off-Policy Policy Gradient¶

On-policy methods require inference from the current policy for every gradient step. Off-policy methods reuse trajectories from $\pi_{\theta_\text{old}}$ via importance sampling:

$$\mathcal{J}^\text{surrogate}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[\sum_t \underbrace{\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}}_{r_t} R(\tau) \right]$$

PPO (Proximal Policy Optimization)¶

PPO clips the importance ratio $r_t$ to prevent destructive policy updates:

$$\mathcal{J}^\text{CLIP}(\theta) = \mathbb{E}\left[\sum_t \min\left(r_t A_t, \; \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t\right)\right]$$

The clipping behavior depends on the sign of the advantage:

Condition	Objective term	Effect
$r_t > 1+\epsilon$, $A_t > 0$	$(1+\epsilon)A_t$	Don't over-reinforce good actions
$r_t < 1-\epsilon$, $A_t > 0$	$r_t A_t$ (unclipped)	Gradient pushes $\theta$ to increase $r_t$
$r_t > 1+\epsilon$, $A_t < 0$	$r_t A_t$ (unclipped)	Gradient pushes $\theta$ to decrease $r_t$
$r_t < 1-\epsilon$, $A_t < 0$	$(1-\epsilon)A_t$	Don't over-penalize bad actions

PPO collects a batch of trajectories from the current policy, then takes multiple gradient steps using the clipped objective.
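
The four cases in the table can be reproduced with a few lines. This is a per-token sketch of the clipped term only; the function name and the choice $\epsilon = 0.2$ are assumptions for illustration.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-token PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# (ratio, advantage) pairs in the same order as the rows of the table.
cases = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
values = [ppo_clip_term(r, a) for r, a in cases]
```

In the first and last cases the result no longer depends on `ratio`, so its gradient with respect to $\theta$ is zero and the update stops pushing in that direction.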

RLHF¶

RLHF adds a KL penalty to prevent the policy from drifting too far from the reference model:

$$\mathcal{J}^\text{RLHF}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) - \beta D_\text{KL}(\pi_\theta \,\|\, \pi_\text{ref})\right]$$

The reward model is trained on human preference pairs using the Bradley-Terry model:

$$P(y_w \succ y_l) = \sigma(R(x, y_w) - R(x, y_l))$$
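
Training the reward model then amounts to minimizing the negative log-likelihood of the observed preferences. A minimal sketch for one pair, assuming the two scalar rewards have already been computed (the function name is illustrative):

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

loss_tied = bradley_terry_loss(0.0, 0.0)       # model has no preference
loss_correct = bradley_terry_loss(2.0, 0.0)    # model agrees with the label
loss_wrong = bradley_terry_loss(0.0, 2.0)      # model disagrees with the label
```

Only the reward difference matters, so the reward model is identified up to an additive constant per prompt.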

GRPO (Group Relative Policy Optimization)¶

Eliminates the need for a separate value function. For each prompt, sample $G$ completions and compute group-normalized advantages:

$$A^{(i)} = \frac{r^{(i)} - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

Then apply PPO-style clipping with these advantages. The group acts as a built-in baseline.
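
The group normalization is short enough to write out. In this sketch the population standard deviation and the small `eps` guard against identical rewards are assumptions; implementations differ on both.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize the rewards of G completions sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored 1 (correct) or 0 (incorrect).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
no_signal = group_advantages([1.0, 1.0, 1.0, 1.0])
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient.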

DPO (Direct Preference Optimization)¶

DPO derives the optimal policy in closed form from the RLHF objective, then substitutes it into the Bradley-Terry loss to eliminate the reward model entirely:

$$\mathcal{L}^\text{DPO}(\theta) = -\mathbb{E}_{(x,y_w,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$

No reward model training, no RL loop — just supervised learning on preference pairs.
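
For a single preference pair the loss needs only four numbers: the summed log-probabilities of the chosen and rejected responses under the policy and under the reference model. A minimal sketch, with the function name, argument names, and $\beta = 0.1$ as illustrative assumptions:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair, from sequence log-probabilities log pi(y | x)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is zero.
loss_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability mass toward the chosen response.
loss_better = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

The loss starts at $\log 2$ when the policy equals the reference and falls as the policy raises the chosen response relative to the rejected one.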

GPU Parallelism¶

Source: ref - alisa-book-of-llms

Core Collective Operations¶

Operation	Pattern	Description
Broadcast	One → All	One GPU sends identical copy to every other GPU
AllGather	Shards → Full	Each GPU has a shard; every GPU gets the full array
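ReduceScatter	Full → Reduced shards	Each GPU ends with one shard of the elementwise reduction (e.g., sum) over all GPUs
AllReduce	Full → Reduced full	Every GPU ends with the full reduced array; equivalent to ReduceScatter followed by AllGather, at the same total cost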

Ring AllReduce

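Implemented as ReduceScatter (does all arithmetic, no redundant copying) + AllGather (does all copying, no arithmetic). Each device sends and receives about $2(N-1)/N$ times the array size, so for large arrays communication time depends almost entirely on array size and link bandwidth, not on the number of devices $N$ (per-step latency still grows with $N$).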
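
The two phases can be simulated on plain lists. In this sketch each of the $N$ devices holds $N$ chunks of one number each, and a "send" is modeled by writing into the neighbor's list; the function name and data layout are assumptions for illustration.

```python
def ring_allreduce(buffers):
    """Simulate a ring AllReduce (sum); buffers[d] holds N chunks for device d."""
    n = len(buffers)
    data = [list(buf) for buf in buffers]

    # Phase 1, ReduceScatter: in step s device d sends chunk (d - s) % n to
    # device d + 1, which adds it. Afterwards device d owns the complete sum
    # of chunk (d + 1) % n.
    for step in range(n - 1):
        sends = [(d, (d - step) % n, data[d][(d - step) % n]) for d in range(n)]
        for d, chunk, value in sends:
            data[(d + 1) % n][chunk] += value

    # Phase 2, AllGather: circulate the finished chunks around the ring,
    # overwriting instead of adding.
    for step in range(n - 1):
        sends = [(d, (d + 1 - step) % n, data[d][(d + 1 - step) % n]) for d in range(n)]
        for d, chunk, value in sends:
            data[(d + 1) % n][chunk] = value
    return data

result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Every device sends one chunk per step for $2(N-1)$ steps, and each chunk is $1/N$ of the array, which is where the $2(N-1)/N$ traffic factor comes from.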

Data Parallelism (ZeRO Stages)¶

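Stage	Shards	Memory per GPU ($N$ = number of GPUs)	Communication
DDP (naive)	Nothing	Full parameters, gradients, and optimizer states	AllReduce grads
ZeRO-1	Optimizer states	$\sim 1/N$ optimizer states; full parameters and gradients	ReduceScatter grads + AllGather updated params (same volume as DDP)
ZeRO-2	+ Gradients	$\sim 1/N$ optimizer states and gradients; full parameters	Same volume as DDP
ZeRO-3 (FSDP)	+ Parameters	$\sim 1/N$ everything	AllGather params just-in-time in forward and backward (about 1.5× DDP volume)

ZeRO-1 and ZeRO-2 have the same communication volume as naive DDP, so their memory savings are essentially free. ZeRO-3 adds parameter AllGathers in both the forward and backward passes, which raises communication volume to roughly 1.5× that of DDP.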

DDP: when the full training state (parameters, gradients, and optimizer states) fits on a single device, this is usually the simplest and fastest choice
FSDP (ZeRO-3): can train models that don't fit on one GPU by AllGathering parameters just before each layer
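
A back-of-the-envelope calculator shows how much each stage saves. It assumes mixed-precision Adam with 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for FP32 optimizer state (master weights, momentum, variance), and it ignores activations and buffers. The function name and the example model size are illustrative.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB; stage 0 means plain DDP."""
    params = 2 * num_params   # 16-bit weights
    grads = 2 * num_params    # 16-bit gradients
    optim = 12 * num_params   # FP32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus     # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus     # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= num_gpus    # ZeRO-3 also shards parameters
    return (params + grads + optim) / 1e9

# A 7.5B-parameter model on 64 GPUs.
per_stage = [zero_memory_gb(7.5e9, 64, stage) for stage in range(4)]
```

Most of the saving arrives with ZeRO-1, because optimizer state dominates the per-parameter cost under these assumptions.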

Pipeline Parallelism¶

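Split model layers across GPUs, and divide each batch into micro-batches so that different stages work on different micro-batches at the same time. Naive model parallelism (one group of layers per GPU, serial execution) doesn't improve throughput because only one GPU is busy at a time; micro-batching shrinks these idle pipeline bubbles but does not remove them entirely.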
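
The size of the bubble can be estimated with a standard formula for a GPipe-style schedule: with $p$ stages and $m$ micro-batches, the idle fraction is $(p-1)/(m+p-1)$. The sketch assumes every stage takes the same time per micro-batch; the function name is illustrative.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

serial = pipeline_bubble_fraction(num_stages=4, num_microbatches=1)
pipelined = pipeline_bubble_fraction(num_stages=4, num_microbatches=32)
```

With a single micro-batch, four stages sit idle 75% of the time; 32 micro-batches bring that below 9%.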

Tensor Parallelism¶

Split individual weight matrices across GPUs:

Column parallel ($W_\text{up}$, $W_\text{gate}$): each GPU computes a slice of hidden dim
Row parallel ($W_\text{down}$): each GPU computes partial result, then AllReduce to sum
Attention TP: split by heads (they're independent). One AllReduce per layer

Pattern: Column parallel → activation → Row parallel requires only one AllReduce.
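
The column-then-row pattern can be verified on a tiny MLP. This sketch simulates the GPUs with a loop, uses ReLU as the activation, omits the gate projection, and assumes the hidden size divides evenly across GPUs; all names and values are illustrative.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k x n matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def mlp_reference(x, w_up, w_down):
    return matmul(relu(matmul(x, w_up)), w_down)

def mlp_tensor_parallel(x, w_up, w_down, num_gpus):
    shard = len(w_down) // num_gpus  # hidden units per GPU
    partials = []
    for g in range(num_gpus):
        cols = range(g * shard, (g + 1) * shard)
        w_up_shard = [[row[j] for j in cols] for row in w_up]  # column parallel
        w_down_shard = [w_down[j] for j in cols]               # row parallel
        h_shard = relu(matmul(x, w_up_shard))    # local: the activation is elementwise
        partials.append(matmul(h_shard, w_down_shard))
    # The single AllReduce: sum the partial outputs across GPUs.
    return [sum(vals) for vals in zip(*partials)]

x = [1.0, -2.0]
w_up = [[1.0, -1.0, 2.0, 0.5], [0.5, 1.0, -1.0, 2.0]]
w_down = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reference = mlp_reference(x, w_up, w_down)
sharded = mlp_tensor_parallel(x, w_up, w_down, num_gpus=2)
```

No communication is needed between the two matmuls because each GPU's slice of the hidden vector is exactly the input its slice of $W_\text{down}$ needs.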

5D Parallelism¶

Dimension	What It Splits	What It Scales
Data (DP)	Batch	Throughput
Tensor (TP)	Weight matrices within layers	Model memory
Pipeline (PP)	Layers/stages across GPUs	Model memory
Sequence (SP)	Sequence length	Activation memory
Expert (EP)	MoE experts	Expert memory

Precision & Mixed Training¶

Source: ref - alisa-book-of-llms

Mixed Precision Strategy¶

Component	Precision	Rationale
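Master weights	FP32	Small updates would be rounded away if added to large BF16 weights
Forward/backward activations	BF16	Matmuls tolerate rounding noise
Gradients	BF16 → accumulate FP32	Individual gradients are tiny relative to the weights

Intuition: matmuls are tolerant of rounding noise (BF16 forward/backward is fine), but master weights in FP32 are necessary because individual gradient updates are tiny compared with the weights they modify. Like any floating-point format, BF16 has fine absolute resolution near zero (it can represent values very close to 0.0001) but coarse resolution near one: with 7 mantissa bits, the next value after 1.0 is 1.0078125, so 1.0001 rounds back to 1.0 and the update is lost.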
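
The rounding behavior can be reproduced without a GPU by truncating a float32 bit pattern to its top 16 bits. The helper below is a sketch that assumes round-to-nearest-even and ignores NaN handling; `to_bf16` is a hypothetical name, not a library function.

```python
import struct

def to_bf16(x):
    """Round a Python float to the nearest BF16 value, returned as a float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                  # round to nearest, ties to even
    bits &= 0xFFFF0000                                   # keep sign, 8 exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

weight = 1.0
update = 1e-4
after_bf16_update = to_bf16(weight + update)  # the update disappears
small_value = to_bf16(update)                 # but 1e-4 itself survives closely
next_after_one = to_bf16(1.0078125)           # smallest BF16 step above 1.0
```

Adding the update to an FP32 master copy and casting to BF16 only for the forward pass keeps the small contributions accumulating instead of vanishing one by one.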

PyTorch Precision Patterns¶

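import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumes Hugging Face transformers; `...` stands for the model name or path.

# Option 1: Load entire model in BF16 (fine for inference)
model = AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.bfloat16)

# Option 2: Automatic mixed precision (per-operation precision management)
# `x` is an input batch already on the GPU
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(x)
# matmul in BF16, softmax/layernorm in FP32

# Option 3: INT8 quantization via bitsandbytes
# (newer transformers releases expect a quantization config object
# rather than the bare load_in_8bit=True keyword)
model = AutoModelForCausalLM.from_pretrained(
    ..., quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)
# weights in INT8, activations in FP16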

Multimodality¶

Source: ref - alisa-book-of-llms

Vision Transformer (ViT)¶

Turn an image into a sequence of patch vectors, then run a standard transformer encoder. Each patch is linearly projected to the model dimension $D$ (e.g., 4096).
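
The patch step can be sketched on a nested-list image of shape H × W × C. The example assumes the height and width are divisible by the patch size and leaves out the learned linear projection and position embeddings; `patchify` is an illustrative name.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches

# A 4 x 4 image with 3 channels; each pixel stores (row, column, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)

# Sequence length for a common configuration: 224 x 224 image, 16 x 16 patches.
num_tokens = (224 // 16) ** 2
```

Each flattened patch has `patch_size * patch_size * C` values, and the projection to $D$ is a single matrix multiply applied to every patch independently.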

LLaVA Architecture¶

ViT encoder → linear projector → concatenate visual tokens with text tokens → LLM decoder.

LLaVA-NeXT adds dynamic resolution: split high-res images into multiple crops, encode each separately, then concatenate. Each crop produces a fixed number of visual tokens (e.g., 256).

CLIP¶

Trained with contrastive learning: maximize similarity of text/image embeddings for correct pairings, minimize for incorrect. Provides the visual encoder for many multimodal LLMs.
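
The contrastive objective can be written as a symmetric cross-entropy over a batch similarity matrix. In this sketch, pair `i` of images and texts is the correct match, embeddings are L2-normalized, and the temperature of 0.07 is an assumed constant rather than a learned parameter; `clip_loss` is an illustrative name.

```python
import math

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image_embs[i] and text_embs[i] are a true pair."""
    def normalize(v):
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    def cross_entropy(logits, target):
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        return log_sum_exp - logits[target]

    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sims[i][j]: scaled cosine similarity between image i and text j.
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]
    loss_i2t = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    loss_t2i = sum(cross_entropy([sims[i][j] for i in range(n)], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is closest to its own caption and large when the pairings are swapped.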

Current paradigm: understanding via ViT encoder → features; generation via diffusion, either in pixel space or in a learned latent space.