Alisa's Book of LLMs¶
Distilled reference from Alisa Liu's Book of LLMs — a comprehensive, derivation-heavy technical notebook covering the full LLM stack. Alisa Liu is a UW NLP researcher (alisawuffles).
Neural Net Basics¶
Multi-Layer Perceptrons¶
A single neuron: $y = f(w^\top x + b)$. A layer with $n_\text{in}$ inputs and $n_\text{out}$ neurons is a matrix multiply:
$$H = f(XW + b) \quad \text{where } X \in \mathbb{R}^{m \times n_\text{in}}, \; W \in \mathbb{R}^{n_\text{in} \times n_\text{out}}$$
PyTorch convention
PyTorch stores $W$ as $n_\text{out} \times n_\text{in}$ and computes $X @ W^T$. The transpose is free (only changes stride). This way gradients for $W$ naturally come out as $n_\text{out} \times n_\text{in}$, matching $W$'s shape.
Linear layer backprop for $Z = XW + b$:
$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} W^\top, \qquad \frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Z}, \qquad \frac{\partial L}{\partial b_j} = \sum_{i=1}^m \frac{\partial L}{\partial z_{ij}}$$
Batch gradient intuition
If a tensor is shared across the batch (like $W$), the batch dimension is summed out. If it's not shared (like $X$, activations), the batch dimension is preserved.
Activation Functions¶
Key activations and their derivatives:
| Function | Formula | Derivative |
|---|---|---|
| Sigmoid | $\sigma(x) = 1/(1+e^{-x})$ | $\sigma(x)(1 - \sigma(x))$ |
| Tanh | $\tanh(x)$ | $1 - \tanh^2(x)$ |
| ReLU | $\max(0, x)$ | $\mathbb{1}[x > 0]$ |
| Swish/SiLU | $x \cdot \sigma(x)$ | $\sigma(x) + \text{Swish}(x)(1 - \sigma(x))$ |
Backpropagation¶
- Computation graph: each "gate" has easy local gradients; backprop is repeated chain rule
- Upstream gradient (already computed) $\times$ local gradient = gradient flowing backward
- Gradients w.r.t. activations flow backward; gradients w.r.t. parameters are stored for the update
Softmax + cross-entropy gradient
For CE loss $L = -\log p_t$ with softmax output $\mathbf{p}$, the combined gradient is simply $\partial L / \partial \mathbf{z} = \mathbf{p} - \operatorname{one_hot}(t)$.
Optimizers (AdamW)¶
Adam with decoupled weight decay:
- Gradient: $g = \nabla_\theta L$
- First moment (momentum): $m \leftarrow \beta_1 m + (1-\beta_1) g$
- Second moment (RMSProp): $v \leftarrow \beta_2 v + (1-\beta_2) g^2$
- Bias correction: $\hat{m} = m/(1-\beta_1^t)$, $\hat{v} = v/(1-\beta_2^t)$
- Weight decay (before Adam step): $\theta \leftarrow \theta - \eta \lambda \theta$
- Update: $\theta \leftarrow \theta - \eta \hat{m} / (\sqrt{\hat{v}} + \epsilon)$
Gradient clipping: compute global norm of all grads, scale down uniformly if it exceeds max.
Learning Rate¶
- Warmup reduces primacy effect of early training examples
- Common schedules: warmup + cosine decay, warmup + linear decay
Modern Transformer Architecture¶
Dimension Reference¶
| Symbol | Meaning |
|---|---|
| $B$ | Batch size |
| $L$ | Number of layers |
| $T$ | Sequence length (tokens to generate) |
| $S$ | Sequence length (provided context) |
| $V$ | Vocab size |
| $D$ | Hidden dimension |
| $H$ | Head dimension |
| $F$ | MLP hidden dim ($\approx 8D/3$) |
| $N$ | Number of query heads ($N \cdot H = D$) |
| $K$ | Number of KV heads ($K < N$ in GQA) |
Forward Pass¶
- Token embedding: $X^{(0)} = W_e[\text{tokens}]$, $W_e \in \mathbb{R}^{V \times D}$
- For each layer $\ell$:
- RMSNorm: $\bar{X} = X / (\text{RMS}(X) + \epsilon) \odot \gamma$
- QKV projection: $Q = \bar{X} W_Q$, $K = \bar{X} W_K$, $V = \bar{X} W_V$
- RoPE: rotate Q, K by position-dependent angles
- Attention: $A = \text{softmax}(QK^\top / \sqrt{H})$, apply causal mask, $O = AV$
- Output projection: $O_\text{proj} = O W_O$
- Residual: $X \leftarrow X + O_\text{proj}$
- SwiGLU FFN: $F = \text{Swish}(\bar{X} W_\text{gate}) \odot (\bar{X} W_\text{up})$, then $\bar{X} W_\text{down}$
- Residual: $X^{(\ell+1)} = X^{(\ell)} + F$
- Final RMSNorm + Unembedding: $Z = X_\text{final} W_u \in \mathbb{R}^{B \times T \times V}$
Parameter & FLOP Accounting¶
| Component | Parameters | FLOPs (forward) |
|---|---|---|
| Embedding | $VD$ | — |
| Attention (per layer) | $\approx 4D^2$ | $8BSD^2 + 4BS^2D$ |
| FFN (per layer) | $3DF$ | $6BSDF \approx 16BSD^2$ |
| Total (per layer) | $\approx 12D^2$ | $2BSD(12D + 2S)$ |
| Full model | $2VD + 12LD^2$ | $2BSD(12LD + 2LS + V)$ |
Backward pass is $\approx 2\times$ forward FLOPs.
RoPE (Rotary Position Embedding)¶
Rotate each pair $(x_{2i}, x_{2i+1})$ of Q and K by position-dependent angle $m\theta_i$:
$$R_m^{(i)} = \begin{bmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \ \sin(m\theta_i) & \cos(m\theta_i) \end{bmatrix}$$
where $\theta_i = \Theta^{-2i/H}$ and $\Theta$ is a hyperparameter (typically 10000).
Key property: inner product depends only on relative position $m - n$:
$$\langle R_{m\theta} \mathbf{q}, R_{n\theta} \mathbf{k} \rangle = \mathbf{q}^\top R_{(n-m)\theta} \mathbf{k}$$
Complex number view
Pair $(x_{2i}, x_{2i+1})$ as complex number $z = x_{2i} + i \, x_{2i+1}$. Rotation by $\theta$ is multiplication by $e^{i\theta}$.
RMSNorm¶
$$\bar{X} = \frac{X}{\text{RMS}(X) + \epsilon} \odot \gamma, \quad \text{RMS}(X) = \sqrt{\frac{1}{D}\sum_{i=1}^D x_i^2}$$
Learnable gain $\gamma \in \mathbb{R}^D$: $\gamma_i > 1$ amplifies dimension $i$, $\gamma_i \approx 0$ effectively kills it.
SwiGLU FFN¶
$$F = \underbrace{\text{Swish}(\bar{X} W_\text{gate})}\text{gating} \odot \underbrace{(\bar{X} W\text{up})}\text{value} \cdot W\text{down}$$
The gate $G$ selectively passes information from the up-projection $U$.
Inference¶
- Latency: time to complete a single request (seconds)
- Throughput: tokens processed per unit time across all requests (tokens/second)
KV Cache¶
Pre-allocate cache with shape (batch, num_heads, max_seq_len, head_dim). At each decode step, only process the new token and attend over all cached positions:
def forward_with_cache(model, new_token, kv_cache, position):
x = model.embed(new_token) # (batch, 1, d_model)
for layer_idx, layer in enumerate(model.layers):
q, k, v = layer.qkv_proj(x).chunk(3, dim=-1)
kv_cache[layer_idx]['k'][:, :, position, :] = k.squeeze(2)
kv_cache[layer_idx]['v'][:, :, position, :] = v.squeeze(2)
k_full = kv_cache[layer_idx]['k'][:, :, :position+1, :]
v_full = kv_cache[layer_idx]['v'][:, :, :position+1, :]
x = attention(q, k_full, v_full)
x = layer.ffn(x)
return model.lm_head(x)
Reducing KV Cache Size¶
| Technique | How It Works |
|---|---|
| MHA (standard) | Each head has own K, V. Cache: $L \times N \times S \times H \times 2$ |
| MQA | All heads share K, V. Cache shrinks by $N\times$ |
| GQA | Heads grouped; each group shares K, V. Middle ground |
| MLA (DeepSeek v2) | Cache a smaller latent vector, project back at decode time |
| Cross-layer attention | Share KV across layers |
| Sliding window | Discard KV outside window. $O(nW)$ vs $O(n^2)$ |
Batching Strategies¶
- Continuous batching: slot new requests as sequences finish (batch always full)
- Selective batching: mix prefill (compute-heavy) and decode (memory-bound) sequences
- Sequence packing: concatenate examples to max length with attention masks
- Token-budget batching: batch by total token count (used in fine-tuning)
Speculative Decoding¶
Use a small draft model to propose $K$ tokens, then verify with the target model in one parallel forward pass. Acceptance probability: $\min(1, p(x)/q(x))$.
Flash Attention¶
- Standard attention materializes $N \times N$ attention matrix: $O(n^2)$ memory
- Flash Attention tiles the computation, never materializing the full matrix: $O(n)$ memory
- Uses the online softmax trick (track running max $m_k$, denominator $d_k$, output $o_k$)
- Exact method (not approximate)
- Each
(batch, head)pair is independent → massive parallelism
Sampling Strategies¶
def sample(logits, temperature=1.0, top_k=None, top_p=None):
logits = logits / temperature
if top_k is not None:
values, indices = torch.topk(logits, top_k)
logits = torch.full_like(logits, float('-inf'))
logits.scatter_(-1, indices, values)
if top_p is not None:
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
sorted_mask = cumulative_probs > top_p
sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
sorted_mask[..., 0] = False
indices_to_remove = sorted_mask.scatter(-1, sorted_indices, sorted_mask)
logits = logits.masked_fill(indices_to_remove, float('-inf'))
probs = F.softmax(logits, dim=-1)
return torch.multinomial(probs, num_samples=1)
Scaling Laws¶
- $\mu P$ (maximal update parameterization) enables hyperparameter transfer from small to large models
- Loss follows power law: $\mathcal{L}(C) = \mathcal{L}_\infty + \beta C^{-\alpha}$
- Learning rate follows power law: $\text{LR}(C) = \beta C^{-\alpha}$
- Fit with least squares: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
Other Architectures¶
Vanilla RNN¶
$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b), \quad y_t = W_\text{out} h_t + b_\text{out}$$
Gradient through time: $\partial h_t / \partial h_{t-1} = \text{diag}(\tanh'(z_t)) \cdot W_h$. Since $\tanh' \in (0, 1]$, repeated multiplication causes vanishing gradients.
LSTM¶
Introduces cell state $c_t$ (long-term memory) flowing through a highway with only elementwise operations (no matmuls):
- Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
- Input gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
- Cell update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
- Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
- Hidden state: $h_t = o_t \odot \tanh(c_t)$
State Space Models (Mamba)¶
Input-dependent transitions: $x_k = A x_{k-1} + B_k u_k$ where $B_k = f_B(u_k)$, $C_k = f_C(u_k)$.
- Recurrent mode: $O(n)$ time, $O(1)$ memory per step
- Parallel (convolutional) mode: efficient training
- vs. Transformers: $O(n)$ instead of $O(n^2)$, but can't revisit past tokens
Post-Training¶
Policy Gradients (REINFORCE)¶
LLM generates trajectory $\tau = (s_0, a_0, \ldots)$ where $a_t \sim \pi_\theta(\cdot \mid s_t)$. Objective: maximize expected reward.
$$\nabla_\theta J(\theta) = \mathbb{E}{\tau \sim \pi\theta} \left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$$
Baselined policy gradient subtracts a value function to reduce variance:
$$\nabla_\theta J(\theta) = \mathbb{E}{\tau \sim \pi\theta} \left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \underbrace{(R(\tau) - V_\psi(s_t))}_{\text{advantage } A^\pi(s,a)} \right]$$
The baseline doesn't change the expected gradient (proven by showing $\mathbb{E}_{a_t \mid s_t}[\nabla \log \pi \cdot b(s_t)] = 0$ since $\sum_a \nabla \pi(a \mid s) = \nabla 1 = 0$).
PPO (Proximal Policy Optimization)¶
Clips the importance ratio $r_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ to prevent destructive updates:
$$\mathcal{J}^\text{CLIP}(\theta) = \mathbb{E}\left[\sum_t \min\left(r_t A_t, \; \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t\right)\right]$$
- If $r_t > 1+\epsilon$ and $A_t > 0$: clip at $(1+\epsilon)A_t$ — don't over-reinforce
- If $r_t < 1-\epsilon$ and $A_t > 0$: use $r_t A_t$ — gradient pushes $\theta$ to increase $r_t$
RLHF¶
$$\mathcal{J}^\text{RLHF}(\theta) = \mathbb{E}{\tau \sim \pi\theta}\left[R(\tau) - \beta D_\text{KL}(\pi_\theta | \pi_\text{ref})\right]$$
- KL penalty computed per-token and folded into per-token reward
- Reward model trained on preference pairs with Bradley-Terry: $P(y_w \succ y_l) = \sigma(R(x, y_w) - R(x, y_l))$
GRPO (Group Relative Policy Optimization)¶
No separate value function. Sample $G$ completions per prompt, normalize rewards within the group:
$$A^{(i)} = \frac{r^{(i)} - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$
Then apply PPO-style clipped objective with these group-normalized advantages.
DPO (Direct Preference Optimization)¶
Derives the optimal policy in closed form from the RLHF objective:
$$R(y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)$$
Substituting into the Bradley-Terry loss eliminates the reward model:
$$\mathcal{L}^\text{DPO}(\theta) = -\mathbb{E}{(x,y_w,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$
Precision & Mixed Training¶
Mixed Precision Strategy¶
| Component | Precision | Rationale |
|---|---|---|
| Master weights | FP32 | Small gradients add to large weights |
| Forward/backward activations | BF16 | Matmuls tolerate rounding noise |
| Gradients | BF16 → accumulate FP32 | Individual grads tiny vs weights |
- BF16 has more precision near 0 (can represent 0.0001) but not near 1 (can't represent 1.0001)
- Activations are harder to quantize than weights
Precision Options¶
| Format | Size | Notes |
|---|---|---|
| FP32 | 4 bytes | Full precision |
| FP16 | 2 bytes | Half memory |
| BF16 | 2 bytes | Better numerical stability than FP16 |
| INT8 | 1 byte | Requires quantization |
| INT4 | 0.5 bytes | Even smaller |
PyTorch Precision Patterns¶
# Load entire model in BF16 (fine for inference)
model = Model.from_pretrained(..., torch_dtype=torch.bfloat16)
# Automatic mixed precision (manages precision per operation)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
output = model(x)
# matmul in BF16, softmax/layernorm in FP32
GPU Parallelism¶
Collective Operations¶
| Operation | Description |
|---|---|
| Broadcast | One GPU sends identical copy to all |
| AllGather | Each GPU has a shard → every GPU gets the full array |
| ReduceScatter | Reduce + each GPU keeps its shard |
| AllReduce | ReduceScatter + AllGather (equivalent cost) |
Communication cost
For AllGather, ReduceScatter, and AllReduce, communication time depends only on array size and bandwidth, NOT on the number of devices.
Data Parallelism (ZeRO Stages)¶
| Stage | What's Sharded | Memory per GPU | Communication |
|---|---|---|---|
| DDP (naive) | Nothing | Full model + optimizer | AllReduce grads |
| ZeRO-1 | Optimizer states | $\approx 1/M$ optimizer | ReduceScatter + AllGather |
| ZeRO-2 | + Gradients | $\approx 1/M$ grads | Same cost as DDP |
| ZeRO-3 (FSDP) | + Parameters | $\approx 1/N$ everything | AllGather params just-in-time |
ZeRO stages 1, 2, 3 all have the same communication cost — free memory wins.
Pipeline Parallelism¶
Split layers across GPUs, divide batch into micro-batches for overlapping compute:
model = nn.Sequential(
nn.Linear(512, 512).to('cuda:0'),
nn.ReLU().to('cuda:0'),
nn.Linear(512, 512).to('cuda:1'),
nn.ReLU().to('cuda:1'),
)
model = Pipe(model, chunks=8) # splits batch into 8 micro-batches
Tensor Parallelism¶
Split weight matrices within a single layer across GPUs:
- Column parallel (split $W_\text{up}$, $W_\text{gate}$ by columns): each GPU computes a slice of the hidden dim
- Row parallel (split $W_\text{down}$ by rows): each GPU computes partial result, then AllReduce
- Attention: split by heads (they're independent). Column parallel QKV → compute → row parallel output. One AllReduce per layer.
5D Parallelism Summary¶
| Dimension | Splits | Scales |
|---|---|---|
| Data (DP) | Batch | Throughput |
| Tensor (TP) | Weight matrices | Model memory |
| Pipeline (PP) | Layers/stages | Model memory |
| Sequence (SP) | Sequence length | Activation memory |
| Expert (EP) | MoE experts | Expert memory |
Multimodality¶
- ViT (Vision Transformer): turn image into a sequence of patch vectors, then run a standard transformer encoder
- LLaVA: ViT encoder → projector → concatenate with text tokens → LLM decoder
- LLaVA-NeXT: dynamic resolution — split high-res images into multiple crops, encode each separately, concatenate
- CLIP: contrastive learning to maximize similarity of correct text/image embedding pairs
Current paradigm: understanding via ViT encoder → features; generation via diffusion in pixel space.
Information Theory¶
- Cross-entropy: $\operatorname{CE}(p, q) = -\mathbb{E}_p[\log q] = -\sum_x p(x) \log q(x)$
- KL divergence: $\operatorname{KL}(p | q) = \sum_x p(x) (\log p(x) - \log q(x))$
- Entropy: $H(p) = -\sum_x p(x) \log p(x)$
- Relationship: $\operatorname{CE}(p, q) = \operatorname{KL}(p | q) + H(p)$
- Autoregressive loss: $\mathcal{L}(x) = -\sum_{t=1}^T \log p(x_t \mid x_{<t})$
Numerical Stability¶
Stable Softmax¶
Subtract $x_\text{max}$ before exponentiating:
$$\text{softmax}(x)i = \frac{e^{x_i - x\text{max}}}{\sum_j e^{x_j - x_\text{max}}}$$
Log-Sum-Exp Trick¶
$$\log \sum_i e^{x_i} = x_\text{max} + \log \sum_i e^{x_i - x_\text{max}}$$
Online Softmax (Flash Attention building block)¶
Maintain running statistics $(m_k, d_k)$. When processing $x_{k+1}$:
$$m_{k+1} \leftarrow \max(m_k, x_{k+1}), \quad d_{k+1} \leftarrow d_k \cdot e^{m_k - m_{k+1}} + e^{x_{k+1} - m_{k+1}}$$
Extended to weighted sums $o = \sum_i p_i v_i$ for Flash Attention output computation.
Gradient Flow Through Sampling¶
- Gumbel-Softmax: add Gumbel noise and use temperature for differentiable approximation to sampling
- Straight-through estimator: forward uses discrete sample, backward uses continuous relaxation: $\partial \mathcal{L}/\partial x = (\partial \mathcal{L}/\partial y) \cdot f(x)$