Skip to content

Alisa's Book of LLMs

Distilled reference from Alisa Liu's Book of LLMs — a comprehensive, derivation-heavy technical notebook covering the full LLM stack. Alisa Liu is a UW NLP researcher (alisawuffles).

Neural Net Basics

Multi-Layer Perceptrons

A single neuron: $y = f(w^\top x + b)$. A layer with $n_\text{in}$ inputs and $n_\text{out}$ neurons is a matrix multiply:

$$H = f(XW + b) \quad \text{where } X \in \mathbb{R}^{m \times n_\text{in}}, \; W \in \mathbb{R}^{n_\text{in} \times n_\text{out}}$$

PyTorch convention

PyTorch stores $W$ as $n_\text{out} \times n_\text{in}$ and computes $X @ W^T$. The transpose is free (only changes stride). This way gradients for $W$ naturally come out as $n_\text{out} \times n_\text{in}$, matching $W$'s shape.

Linear layer backprop for $Z = XW + b$:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} W^\top, \qquad \frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Z}, \qquad \frac{\partial L}{\partial b_j} = \sum_{i=1}^m \frac{\partial L}{\partial z_{ij}}$$

Batch gradient intuition

If a tensor is shared across the batch (like $W$), the batch dimension is summed out. If it's not shared (like $X$, activations), the batch dimension is preserved.

Activation Functions

Key activations and their derivatives:

Function Formula Derivative
Sigmoid $\sigma(x) = 1/(1+e^{-x})$ $\sigma(x)(1 - \sigma(x))$
Tanh $\tanh(x)$ $1 - \tanh^2(x)$
ReLU $\max(0, x)$ $\mathbb{1}[x > 0]$
Swish/SiLU $x \cdot \sigma(x)$ $\sigma(x) + \text{Swish}(x)(1 - \sigma(x))$

Backpropagation

  • Computation graph: each "gate" has easy local gradients; backprop is repeated chain rule
  • Upstream gradient (already computed) $\times$ local gradient = gradient flowing backward
  • Gradients w.r.t. activations flow backward; gradients w.r.t. parameters are stored for the update

Softmax + cross-entropy gradient

For CE loss $L = -\log p_t$ with softmax output $\mathbf{p}$, the combined gradient is simply $\partial L / \partial \mathbf{z} = \mathbf{p} - \operatorname{one_hot}(t)$.

Optimizers (AdamW)

Adam with decoupled weight decay:

  1. Gradient: $g = \nabla_\theta L$
  2. First moment (momentum): $m \leftarrow \beta_1 m + (1-\beta_1) g$
  3. Second moment (RMSProp): $v \leftarrow \beta_2 v + (1-\beta_2) g^2$
  4. Bias correction: $\hat{m} = m/(1-\beta_1^t)$, $\hat{v} = v/(1-\beta_2^t)$
  5. Weight decay (before Adam step): $\theta \leftarrow \theta - \eta \lambda \theta$
  6. Update: $\theta \leftarrow \theta - \eta \hat{m} / (\sqrt{\hat{v}} + \epsilon)$

Gradient clipping: compute global norm of all grads, scale down uniformly if it exceeds max.

Learning Rate

  • Warmup reduces primacy effect of early training examples
  • Common schedules: warmup + cosine decay, warmup + linear decay

Modern Transformer Architecture

Dimension Reference

Symbol Meaning
$B$ Batch size
$L$ Number of layers
$T$ Sequence length (tokens to generate)
$S$ Sequence length (provided context)
$V$ Vocab size
$D$ Hidden dimension
$H$ Head dimension
$F$ MLP hidden dim ($\approx 8D/3$)
$N$ Number of query heads ($N \cdot H = D$)
$K$ Number of KV heads ($K < N$ in GQA)

Forward Pass

  1. Token embedding: $X^{(0)} = W_e[\text{tokens}]$, $W_e \in \mathbb{R}^{V \times D}$
  2. For each layer $\ell$:
  3. RMSNorm: $\bar{X} = X / (\text{RMS}(X) + \epsilon) \odot \gamma$
  4. QKV projection: $Q = \bar{X} W_Q$, $K = \bar{X} W_K$, $V = \bar{X} W_V$
  5. RoPE: rotate Q, K by position-dependent angles
  6. Attention: $A = \text{softmax}(QK^\top / \sqrt{H})$, apply causal mask, $O = AV$
  7. Output projection: $O_\text{proj} = O W_O$
  8. Residual: $X \leftarrow X + O_\text{proj}$
  9. SwiGLU FFN: $F = \text{Swish}(\bar{X} W_\text{gate}) \odot (\bar{X} W_\text{up})$, then $\bar{X} W_\text{down}$
  10. Residual: $X^{(\ell+1)} = X^{(\ell)} + F$
  11. Final RMSNorm + Unembedding: $Z = X_\text{final} W_u \in \mathbb{R}^{B \times T \times V}$

Parameter & FLOP Accounting

Component Parameters FLOPs (forward)
Embedding $VD$
Attention (per layer) $\approx 4D^2$ $8BSD^2 + 4BS^2D$
FFN (per layer) $3DF$ $6BSDF \approx 16BSD^2$
Total (per layer) $\approx 12D^2$ $2BSD(12D + 2S)$
Full model $2VD + 12LD^2$ $2BSD(12LD + 2LS + V)$

Backward pass is $\approx 2\times$ forward FLOPs.

RoPE (Rotary Position Embedding)

Rotate each pair $(x_{2i}, x_{2i+1})$ of Q and K by position-dependent angle $m\theta_i$:

$$R_m^{(i)} = \begin{bmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \ \sin(m\theta_i) & \cos(m\theta_i) \end{bmatrix}$$

where $\theta_i = \Theta^{-2i/H}$ and $\Theta$ is a hyperparameter (typically 10000).

Key property: inner product depends only on relative position $m - n$:

$$\langle R_{m\theta} \mathbf{q}, R_{n\theta} \mathbf{k} \rangle = \mathbf{q}^\top R_{(n-m)\theta} \mathbf{k}$$

Complex number view

Pair $(x_{2i}, x_{2i+1})$ as complex number $z = x_{2i} + i \, x_{2i+1}$. Rotation by $\theta$ is multiplication by $e^{i\theta}$.

RMSNorm

$$\bar{X} = \frac{X}{\text{RMS}(X) + \epsilon} \odot \gamma, \quad \text{RMS}(X) = \sqrt{\frac{1}{D}\sum_{i=1}^D x_i^2}$$

Learnable gain $\gamma \in \mathbb{R}^D$: $\gamma_i > 1$ amplifies dimension $i$, $\gamma_i \approx 0$ effectively kills it.

SwiGLU FFN

$$F = \underbrace{\text{Swish}(\bar{X} W_\text{gate})}\text{gating} \odot \underbrace{(\bar{X} W\text{up})}\text{value} \cdot W\text{down}$$

The gate $G$ selectively passes information from the up-projection $U$.

Inference

  • Latency: time to complete a single request (seconds)
  • Throughput: tokens processed per unit time across all requests (tokens/second)

KV Cache

Pre-allocate cache with shape (batch, num_heads, max_seq_len, head_dim). At each decode step, only process the new token and attend over all cached positions:

def forward_with_cache(model, new_token, kv_cache, position):
    x = model.embed(new_token)  # (batch, 1, d_model)
    for layer_idx, layer in enumerate(model.layers):
        q, k, v = layer.qkv_proj(x).chunk(3, dim=-1)
        kv_cache[layer_idx]['k'][:, :, position, :] = k.squeeze(2)
        kv_cache[layer_idx]['v'][:, :, position, :] = v.squeeze(2)
        k_full = kv_cache[layer_idx]['k'][:, :, :position+1, :]
        v_full = kv_cache[layer_idx]['v'][:, :, :position+1, :]
        x = attention(q, k_full, v_full)
        x = layer.ffn(x)
    return model.lm_head(x)

Reducing KV Cache Size

Technique How It Works
MHA (standard) Each head has own K, V. Cache: $L \times N \times S \times H \times 2$
MQA All heads share K, V. Cache shrinks by $N\times$
GQA Heads grouped; each group shares K, V. Middle ground
MLA (DeepSeek v2) Cache a smaller latent vector, project back at decode time
Cross-layer attention Share KV across layers
Sliding window Discard KV outside window. $O(nW)$ vs $O(n^2)$

Batching Strategies

  • Continuous batching: slot new requests as sequences finish (batch always full)
  • Selective batching: mix prefill (compute-heavy) and decode (memory-bound) sequences
  • Sequence packing: concatenate examples to max length with attention masks
  • Token-budget batching: batch by total token count (used in fine-tuning)

Speculative Decoding

Use a small draft model to propose $K$ tokens, then verify with the target model in one parallel forward pass. Acceptance probability: $\min(1, p(x)/q(x))$.

Flash Attention

  • Standard attention materializes $N \times N$ attention matrix: $O(n^2)$ memory
  • Flash Attention tiles the computation, never materializing the full matrix: $O(n)$ memory
  • Uses the online softmax trick (track running max $m_k$, denominator $d_k$, output $o_k$)
  • Exact method (not approximate)
  • Each (batch, head) pair is independent → massive parallelism
output = F.scaled_dot_product_attention(q, k, v, is_causal=True)

Sampling Strategies

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    logits = logits / temperature
    if top_k is not None:
        values, indices = torch.topk(logits, top_k)
        logits = torch.full_like(logits, float('-inf'))
        logits.scatter_(-1, indices, values)
    if top_p is not None:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        sorted_mask = cumulative_probs > top_p
        sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
        sorted_mask[..., 0] = False
        indices_to_remove = sorted_mask.scatter(-1, sorted_indices, sorted_mask)
        logits = logits.masked_fill(indices_to_remove, float('-inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

Scaling Laws

  • $\mu P$ (maximal update parameterization) enables hyperparameter transfer from small to large models
  • Loss follows power law: $\mathcal{L}(C) = \mathcal{L}_\infty + \beta C^{-\alpha}$
  • Learning rate follows power law: $\text{LR}(C) = \beta C^{-\alpha}$
  • Fit with least squares: $\hat{\beta} = (X^\top X)^{-1} X^\top y$

Other Architectures

Vanilla RNN

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b), \quad y_t = W_\text{out} h_t + b_\text{out}$$

Gradient through time: $\partial h_t / \partial h_{t-1} = \text{diag}(\tanh'(z_t)) \cdot W_h$. Since $\tanh' \in (0, 1]$, repeated multiplication causes vanishing gradients.

LSTM

Introduces cell state $c_t$ (long-term memory) flowing through a highway with only elementwise operations (no matmuls):

  • Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
  • Input gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
  • Cell update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
  • Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
  • Hidden state: $h_t = o_t \odot \tanh(c_t)$

State Space Models (Mamba)

Input-dependent transitions: $x_k = A x_{k-1} + B_k u_k$ where $B_k = f_B(u_k)$, $C_k = f_C(u_k)$.

  • Recurrent mode: $O(n)$ time, $O(1)$ memory per step
  • Parallel (convolutional) mode: efficient training
  • vs. Transformers: $O(n)$ instead of $O(n^2)$, but can't revisit past tokens

Post-Training

Policy Gradients (REINFORCE)

LLM generates trajectory $\tau = (s_0, a_0, \ldots)$ where $a_t \sim \pi_\theta(\cdot \mid s_t)$. Objective: maximize expected reward.

$$\nabla_\theta J(\theta) = \mathbb{E}{\tau \sim \pi\theta} \left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$$

Baselined policy gradient subtracts a value function to reduce variance:

$$\nabla_\theta J(\theta) = \mathbb{E}{\tau \sim \pi\theta} \left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \underbrace{(R(\tau) - V_\psi(s_t))}_{\text{advantage } A^\pi(s,a)} \right]$$

The baseline doesn't change the expected gradient (proven by showing $\mathbb{E}_{a_t \mid s_t}[\nabla \log \pi \cdot b(s_t)] = 0$ since $\sum_a \nabla \pi(a \mid s) = \nabla 1 = 0$).

PPO (Proximal Policy Optimization)

Clips the importance ratio $r_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ to prevent destructive updates:

$$\mathcal{J}^\text{CLIP}(\theta) = \mathbb{E}\left[\sum_t \min\left(r_t A_t, \; \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t\right)\right]$$

  • If $r_t > 1+\epsilon$ and $A_t > 0$: clip at $(1+\epsilon)A_t$ — don't over-reinforce
  • If $r_t < 1-\epsilon$ and $A_t > 0$: use $r_t A_t$ — gradient pushes $\theta$ to increase $r_t$

RLHF

$$\mathcal{J}^\text{RLHF}(\theta) = \mathbb{E}{\tau \sim \pi\theta}\left[R(\tau) - \beta D_\text{KL}(\pi_\theta | \pi_\text{ref})\right]$$

  • KL penalty computed per-token and folded into per-token reward
  • Reward model trained on preference pairs with Bradley-Terry: $P(y_w \succ y_l) = \sigma(R(x, y_w) - R(x, y_l))$

GRPO (Group Relative Policy Optimization)

No separate value function. Sample $G$ completions per prompt, normalize rewards within the group:

$$A^{(i)} = \frac{r^{(i)} - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

Then apply PPO-style clipped objective with these group-normalized advantages.

DPO (Direct Preference Optimization)

Derives the optimal policy in closed form from the RLHF objective:

$$R(y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry loss eliminates the reward model:

$$\mathcal{L}^\text{DPO}(\theta) = -\mathbb{E}{(x,y_w,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$

Precision & Mixed Training

Mixed Precision Strategy

Component Precision Rationale
Master weights FP32 Small gradients add to large weights
Forward/backward activations BF16 Matmuls tolerate rounding noise
Gradients BF16 → accumulate FP32 Individual grads tiny vs weights
  • BF16 has more precision near 0 (can represent 0.0001) but not near 1 (can't represent 1.0001)
  • Activations are harder to quantize than weights

Precision Options

Format Size Notes
FP32 4 bytes Full precision
FP16 2 bytes Half memory
BF16 2 bytes Better numerical stability than FP16
INT8 1 byte Requires quantization
INT4 0.5 bytes Even smaller

PyTorch Precision Patterns

# Load entire model in BF16 (fine for inference)
model = Model.from_pretrained(..., torch_dtype=torch.bfloat16)

# Automatic mixed precision (manages precision per operation)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(x)
# matmul in BF16, softmax/layernorm in FP32

GPU Parallelism

Collective Operations

Operation Description
Broadcast One GPU sends identical copy to all
AllGather Each GPU has a shard → every GPU gets the full array
ReduceScatter Reduce + each GPU keeps its shard
AllReduce ReduceScatter + AllGather (equivalent cost)

Communication cost

For AllGather, ReduceScatter, and AllReduce, communication time depends only on array size and bandwidth, NOT on the number of devices.

Data Parallelism (ZeRO Stages)

Stage What's Sharded Memory per GPU Communication
DDP (naive) Nothing Full model + optimizer AllReduce grads
ZeRO-1 Optimizer states $\approx 1/M$ optimizer ReduceScatter + AllGather
ZeRO-2 + Gradients $\approx 1/M$ grads Same cost as DDP
ZeRO-3 (FSDP) + Parameters $\approx 1/N$ everything AllGather params just-in-time

ZeRO stages 1, 2, 3 all have the same communication cost — free memory wins.

Pipeline Parallelism

Split layers across GPUs, divide batch into micro-batches for overlapping compute:

model = nn.Sequential(
    nn.Linear(512, 512).to('cuda:0'),
    nn.ReLU().to('cuda:0'),
    nn.Linear(512, 512).to('cuda:1'),
    nn.ReLU().to('cuda:1'),
)
model = Pipe(model, chunks=8)  # splits batch into 8 micro-batches

Tensor Parallelism

Split weight matrices within a single layer across GPUs:

  • Column parallel (split $W_\text{up}$, $W_\text{gate}$ by columns): each GPU computes a slice of the hidden dim
  • Row parallel (split $W_\text{down}$ by rows): each GPU computes partial result, then AllReduce
  • Attention: split by heads (they're independent). Column parallel QKV → compute → row parallel output. One AllReduce per layer.

5D Parallelism Summary

Dimension Splits Scales
Data (DP) Batch Throughput
Tensor (TP) Weight matrices Model memory
Pipeline (PP) Layers/stages Model memory
Sequence (SP) Sequence length Activation memory
Expert (EP) MoE experts Expert memory

Multimodality

  • ViT (Vision Transformer): turn image into a sequence of patch vectors, then run a standard transformer encoder
  • LLaVA: ViT encoder → projector → concatenate with text tokens → LLM decoder
  • LLaVA-NeXT: dynamic resolution — split high-res images into multiple crops, encode each separately, concatenate
  • CLIP: contrastive learning to maximize similarity of correct text/image embedding pairs

Current paradigm: understanding via ViT encoder → features; generation via diffusion in pixel space.

Information Theory

  • Cross-entropy: $\operatorname{CE}(p, q) = -\mathbb{E}_p[\log q] = -\sum_x p(x) \log q(x)$
  • KL divergence: $\operatorname{KL}(p | q) = \sum_x p(x) (\log p(x) - \log q(x))$
  • Entropy: $H(p) = -\sum_x p(x) \log p(x)$
  • Relationship: $\operatorname{CE}(p, q) = \operatorname{KL}(p | q) + H(p)$
  • Autoregressive loss: $\mathcal{L}(x) = -\sum_{t=1}^T \log p(x_t \mid x_{<t})$

Numerical Stability

Stable Softmax

Subtract $x_\text{max}$ before exponentiating:

$$\text{softmax}(x)i = \frac{e^{x_i - x\text{max}}}{\sum_j e^{x_j - x_\text{max}}}$$

Log-Sum-Exp Trick

$$\log \sum_i e^{x_i} = x_\text{max} + \log \sum_i e^{x_i - x_\text{max}}$$

Online Softmax (Flash Attention building block)

Maintain running statistics $(m_k, d_k)$. When processing $x_{k+1}$:

$$m_{k+1} \leftarrow \max(m_k, x_{k+1}), \quad d_{k+1} \leftarrow d_k \cdot e^{m_k - m_{k+1}} + e^{x_{k+1} - m_{k+1}}$$

Extended to weighted sums $o = \sum_i p_i v_i$ for Flash Attention output computation.

Gradient Flow Through Sampling

  • Gumbel-Softmax: add Gumbel noise and use temperature for differentiable approximation to sampling
  • Straight-through estimator: forward uses discrete sample, backward uses continuous relaxation: $\partial \mathcal{L}/\partial x = (\partial \mathcal{L}/\partial y) \cdot f(x)$