Alisa's Book of LLMs¶

Distilled reference from Alisa Liu's Book of LLMs — a comprehensive, derivation-heavy technical notebook covering the full LLM stack. Alisa Liu is a UW NLP researcher (alisawuffles).

Neural Net Basics¶

Multi-Layer Perceptrons¶

A single neuron: $y = f(w^\top x + b)$. A layer with $n_\text{in}$ inputs and $n_\text{out}$ neurons is a matrix multiply:

$$H = f(XW + b) \quad \text{where } X \in \mathbb{R}^{m \times n_\text{in}}, \; W \in \mathbb{R}^{n_\text{in} \times n_\text{out}}$$

PyTorch convention

PyTorch stores $W$ as $n_\text{out} \times n_\text{in}$ and computes $X @ W^T$. The transpose is free (only changes stride). This way gradients for $W$ naturally come out as $n_\text{out} \times n_\text{in}$, matching $W$'s shape.

Linear layer backprop for $Z = XW + b$:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} W^\top, \qquad \frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Z}, \qquad \frac{\partial L}{\partial b_j} = \sum_{i=1}^m \frac{\partial L}{\partial z_{ij}}$$

Batch gradient intuition

If a tensor is shared across the batch (like $W$), the batch dimension is summed out. If it's not shared (like $X$, activations), the batch dimension is preserved.

Activation Functions¶

Key activations and their derivatives:

Function	Formula	Derivative
Sigmoid	$\sigma(x) = 1/(1+e^{-x})$	$\sigma(x)(1 - \sigma(x))$
Tanh	$\tanh(x)$	$1 - \tanh^2(x)$
ReLU	$\max(0, x)$	$\mathbb{1}[x > 0]$
Swish/SiLU	$x \cdot \sigma(x)$	$\sigma(x) + \text{Swish}(x)(1 - \sigma(x))$

Backpropagation¶

Computation graph: each "gate" has easy local gradients; backprop is repeated chain rule
Upstream gradient (already computed) $\times$ local gradient = gradient flowing backward
Gradients w.r.t. activations flow backward; gradients w.r.t. parameters are stored for the update

Softmax + cross-entropy gradient

For CE loss $L = -\log p_t$ with softmax output $\mathbf{p}$, the combined gradient is simply $\partial L / \partial \mathbf{z} = \mathbf{p} - \operatorname{one_hot}(t)$.

Optimizers (AdamW)¶

Adam with decoupled weight decay:

Gradient: $g = \nabla_\theta L$
First moment (momentum): $m \leftarrow \beta_1 m + (1-\beta_1) g$
Second moment (RMSProp): $v \leftarrow \beta_2 v + (1-\beta_2) g^2$
Bias correction: $\hat{m} = m/(1-\beta_1^t)$, $\hat{v} = v/(1-\beta_2^t)$
Weight decay (before Adam step): $\theta \leftarrow \theta - \eta \lambda \theta$
Update: $\theta \leftarrow \theta - \eta \hat{m} / (\sqrt{\hat{v}} + \epsilon)$

Gradient clipping: compute global norm of all grads, scale down uniformly if it exceeds max.

Learning Rate¶

Warmup reduces primacy effect of early training examples
Common schedules: warmup + cosine decay, warmup + linear decay

Modern Transformer Architecture¶

Dimension Reference¶

Symbol	Meaning
$B$	Batch size
$L$	Number of layers
$T$	Sequence length (tokens to generate)
$S$	Sequence length (provided context)
$V$	Vocab size
$D$	Hidden dimension
$H$	Head dimension
$F$	MLP hidden dim ($\approx 8D/3$)
$N$	Number of query heads ($N \cdot H = D$)
$K$	Number of KV heads ($K < N$ in GQA)

Forward Pass¶

Token embedding: $X^{(0)} = W_e[\text{tokens}]$, $W_e \in \mathbb{R}^{V \times D}$
For each layer $\ell$:
RMSNorm: $\bar{X} = X / (\text{RMS}(X) + \epsilon) \odot \gamma$
QKV projection: $Q = \bar{X} W_Q$, $K = \bar{X} W_K$, $V = \bar{X} W_V$
RoPE: rotate Q, K by position-dependent angles
Attention: $A = \text{softmax}(QK^\top / \sqrt{H})$, apply causal mask, $O = AV$
Output projection: $O_\text{proj} = O W_O$
Residual: $X \leftarrow X + O_\text{proj}$
SwiGLU FFN: $F = \text{Swish}(\bar{X} W_\text{gate}) \odot (\bar{X} W_\text{up})$, then $\bar{X} W_\text{down}$
Residual: $X^{(\ell+1)} = X^{(\ell)} + F$
Final RMSNorm + Unembedding: $Z = X_\text{final} W_u \in \mathbb{R}^{B \times T \times V}$

Parameter & FLOP Accounting¶

Component	Parameters	FLOPs (forward)
Embedding	$VD$	—
Attention (per layer)	$\approx 4D^2$	$8BSD^2 + 4BS^2D$
FFN (per layer)	$3DF$	$6BSDF \approx 16BSD^2$
Total (per layer)	$\approx 12D^2$	$2BSD(12D + 2S)$
Full model	$2VD + 12LD^2$	$2BSD(12LD + 2LS + V)$

Backward pass is $\approx 2\times$ forward FLOPs.

RoPE (Rotary Position Embedding)¶

Rotate each pair $(x_{2i}, x_{2i+1})$ of Q and K by position-dependent angle $m\theta_i$:

$$R_m^{(i)} = \begin{bmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \ \sin(m\theta_i) & \cos(m\theta_i) \end{bmatrix}$$

where $\theta_i = \Theta^{-2i/H}$ and $\Theta$ is a hyperparameter (typically 10000).

Key property: inner product depends only on relative position $m - n$:

$$\langle R_{m\theta} \mathbf{q}, R_{n\theta} \mathbf{k} \rangle = \mathbf{q}^\top R_{(n-m)\theta} \mathbf{k}$$

Complex number view

Pair $(x_{2i}, x_{2i+1})$ as complex number $z = x_{2i} + i \, x_{2i+1}$. Rotation by $\theta$ is multiplication by $e^{i\theta}$.

RMSNorm¶

$$\bar{X} = \frac{X}{\text{RMS}(X) + \epsilon} \odot \gamma, \quad \text{RMS}(X) = \sqrt{\frac{1}{D}\sum_{i=1}^D x_i^2}$$

Learnable gain $\gamma \in \mathbb{R}^D$: $\gamma_i > 1$ amplifies dimension $i$, $\gamma_i \approx 0$ effectively kills it.

SwiGLU FFN¶

$$F = \underbrace{\text{Swish}(\bar{X} W_\text{gate})}\text{gating} \odot \underbrace{(\bar{X} W\text{up})}\text{value} \cdot W\text{down}$$

The gate $G$ selectively passes information from the up-projection $U$.

Inference¶

Latency: time to complete a single request (seconds)
Throughput: tokens processed per unit time across all requests (tokens/second)

KV Cache¶

Pre-allocate cache with shape (batch, num_heads, max_seq_len, head_dim). At each decode step, only process the new token and attend over all cached positions:

def forward_with_cache(model, new_token, kv_cache, position):
    x = model.embed(new_token)  # (batch, 1, d_model)
    for layer_idx, layer in enumerate(model.layers):
        q, k, v = layer.qkv_proj(x).chunk(3, dim=-1)
        kv_cache[layer_idx]['k'][:, :, position, :] = k.squeeze(2)
        kv_cache[layer_idx]['v'][:, :, position, :] = v.squeeze(2)
        k_full = kv_cache[layer_idx]['k'][:, :, :position+1, :]
        v_full = kv_cache[layer_idx]['v'][:, :, :position+1, :]
        x = attention(q, k_full, v_full)
        x = layer.ffn(x)
    return model.lm_head(x)

Reducing KV Cache Size¶

Technique	How It Works
MHA (standard)	Each head has own K, V. Cache: $L \times N \times S \times H \times 2$
MQA	All heads share K, V. Cache shrinks by $N\times$
GQA	Heads grouped; each group shares K, V. Middle ground
MLA (DeepSeek v2)	Cache a smaller latent vector, project back at decode time
Cross-layer attention	Share KV across layers
Sliding window	Discard KV outside window. $O(nW)$ vs $O(n^2)$

Batching Strategies¶

Continuous batching: slot new requests as sequences finish (batch always full)
Selective batching: mix prefill (compute-heavy) and decode (memory-bound) sequences
Sequence packing: concatenate examples to max length with attention masks
Token-budget batching: batch by total token count (used in fine-tuning)

Speculative Decoding¶

Use a small draft model to propose $K$ tokens, then verify with the target model in one parallel forward pass. Acceptance probability: $\min(1, p(x)/q(x))$.

Flash Attention¶

Standard attention materializes $N \times N$ attention matrix: $O(n^2)$ memory
Flash Attention tiles the computation, never materializing the full matrix: $O(n)$ memory
Uses the online softmax trick (track running max $m_k$, denominator $d_k$, output $o_k$)
Exact method (not approximate)
Each (batch, head) pair is independent → massive parallelism

output = F.scaled_dot_product_attention(q, k, v, is_causal=True)

Sampling Strategies¶

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    logits = logits / temperature
    if top_k is not None:
        values, indices = torch.topk(logits, top_k)
        logits = torch.full_like(logits, float('-inf'))
        logits.scatter_(-1, indices, values)
    if top_p is not None:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        sorted_mask = cumulative_probs > top_p
        sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
        sorted_mask[..., 0] = False
        indices_to_remove = sorted_mask.scatter(-1, sorted_indices, sorted_mask)
        logits = logits.masked_fill(indices_to_remove, float('-inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

Scaling Laws¶

$\mu P$ (maximal update parameterization) enables hyperparameter transfer from small to large models
Loss follows power law: $\mathcal{L}(C) = \mathcal{L}_\infty + \beta C^{-\alpha}$
Learning rate follows power law: $\text{LR}(C) = \beta C^{-\alpha}$
Fit with least squares: $\hat{\beta} = (X^\top X)^{-1} X^\top y$

Other Architectures¶

Vanilla RNN¶

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b), \quad y_t = W_\text{out} h_t + b_\text{out}$$

Gradient through time: $\partial h_t / \partial h_{t-1} = \text{diag}(\tanh'(z_t)) \cdot W_h$. Since $\tanh' \in (0, 1]$, repeated multiplication causes vanishing gradients.

LSTM¶

Introduces cell state $c_t$ (long-term memory) flowing through a highway with only elementwise operations (no matmuls):

Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
Input gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
Cell update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
Hidden state: $h_t = o_t \odot \tanh(c_t)$

State Space Models (Mamba)¶

Input-dependent transitions: $x_k = A x_{k-1} + B_k u_k$ where $B_k = f_B(u_k)$, $C_k = f_C(u_k)$.

Recurrent mode: $O(n)$ time, $O(1)$ memory per step
Parallel (convolutional) mode: efficient training
vs. Transformers: $O(n)$ instead of $O(n^2)$, but can't revisit past tokens

Post-Training¶

Policy Gradients (REINFORCE)¶

LLM generates trajectory $\tau = (s_0, a_0, \ldots)$ where $a_t \sim \pi_\theta(\cdot \mid s_t)$. Objective: maximize expected reward.

$$\nabla_\theta J(\theta) = \mathbb{E}{\tau \sim \pi\theta} \left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$$

Baselined policy gradient subtracts a value function to reduce variance:

$$\nabla_\theta J(\theta) = \mathbb{E}{\tau \sim \pi\theta} \left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \underbrace{(R(\tau) - V_\psi(s_t))}_{\text{advantage } A^\pi(s,a)} \right]$$

The baseline doesn't change the expected gradient (proven by showing $\mathbb{E}_{a_t \mid s_t}[\nabla \log \pi \cdot b(s_t)] = 0$ since $\sum_a \nabla \pi(a \mid s) = \nabla 1 = 0$).

PPO (Proximal Policy Optimization)¶

Clips the importance ratio $r_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ to prevent destructive updates:

$$\mathcal{J}^\text{CLIP}(\theta) = \mathbb{E}\left[\sum_t \min\left(r_t A_t, \; \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t\right)\right]$$

If $r_t > 1+\epsilon$ and $A_t > 0$: clip at $(1+\epsilon)A_t$ — don't over-reinforce
If $r_t < 1-\epsilon$ and $A_t > 0$: use $r_t A_t$ — gradient pushes $\theta$ to increase $r_t$

RLHF¶

$$\mathcal{J}^\text{RLHF}(\theta) = \mathbb{E}{\tau \sim \pi\theta}\left[R(\tau) - \beta D_\text{KL}(\pi_\theta | \pi_\text{ref})\right]$$

KL penalty computed per-token and folded into per-token reward
Reward model trained on preference pairs with Bradley-Terry: $P(y_w \succ y_l) = \sigma(R(x, y_w) - R(x, y_l))$

GRPO (Group Relative Policy Optimization)¶

No separate value function. Sample $G$ completions per prompt, normalize rewards within the group:

$$A^{(i)} = \frac{r^{(i)} - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

Then apply PPO-style clipped objective with these group-normalized advantages.

DPO (Direct Preference Optimization)¶

Derives the optimal policy in closed form from the RLHF objective:

$$R(y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry loss eliminates the reward model:

$$\mathcal{L}^\text{DPO}(\theta) = -\mathbb{E}{(x,y_w,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$

Precision & Mixed Training¶

Mixed Precision Strategy¶

Component	Precision	Rationale
Master weights	FP32	Small gradients add to large weights
Forward/backward activations	BF16	Matmuls tolerate rounding noise
Gradients	BF16 → accumulate FP32	Individual grads tiny vs weights

BF16 has more precision near 0 (can represent 0.0001) but not near 1 (can't represent 1.0001)
Activations are harder to quantize than weights

Precision Options¶

Format	Size	Notes
FP32	4 bytes	Full precision
FP16	2 bytes	Half memory
BF16	2 bytes	Better numerical stability than FP16
INT8	1 byte	Requires quantization
INT4	0.5 bytes	Even smaller

PyTorch Precision Patterns¶

# Load entire model in BF16 (fine for inference)
model = Model.from_pretrained(..., torch_dtype=torch.bfloat16)

# Automatic mixed precision (manages precision per operation)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(x)
# matmul in BF16, softmax/layernorm in FP32

GPU Parallelism¶

Collective Operations¶

Operation	Description
Broadcast	One GPU sends identical copy to all
AllGather	Each GPU has a shard → every GPU gets the full array
ReduceScatter	Reduce + each GPU keeps its shard
AllReduce	ReduceScatter + AllGather (equivalent cost)

Communication cost

For AllGather, ReduceScatter, and AllReduce, communication time depends only on array size and bandwidth, NOT on the number of devices.

Data Parallelism (ZeRO Stages)¶

Stage	What's Sharded	Memory per GPU	Communication
DDP (naive)	Nothing	Full model + optimizer	AllReduce grads
ZeRO-1	Optimizer states	$\approx 1/M$ optimizer	ReduceScatter + AllGather
ZeRO-2	+ Gradients	$\approx 1/M$ grads	Same cost as DDP
ZeRO-3 (FSDP)	+ Parameters	$\approx 1/N$ everything	AllGather params just-in-time

ZeRO stages 1, 2, 3 all have the same communication cost — free memory wins.

Pipeline Parallelism¶

Split layers across GPUs, divide batch into micro-batches for overlapping compute:

model = nn.Sequential(
    nn.Linear(512, 512).to('cuda:0'),
    nn.ReLU().to('cuda:0'),
    nn.Linear(512, 512).to('cuda:1'),
    nn.ReLU().to('cuda:1'),
)
model = Pipe(model, chunks=8)  # splits batch into 8 micro-batches

Tensor Parallelism¶

Split weight matrices within a single layer across GPUs:

Column parallel (split $W_\text{up}$, $W_\text{gate}$ by columns): each GPU computes a slice of the hidden dim
Row parallel (split $W_\text{down}$ by rows): each GPU computes partial result, then AllReduce
Attention: split by heads (they're independent). Column parallel QKV → compute → row parallel output. One AllReduce per layer.

5D Parallelism Summary¶

Dimension	Splits	Scales
Data (DP)	Batch	Throughput
Tensor (TP)	Weight matrices	Model memory
Pipeline (PP)	Layers/stages	Model memory
Sequence (SP)	Sequence length	Activation memory
Expert (EP)	MoE experts	Expert memory

Multimodality¶

ViT (Vision Transformer): turn image into a sequence of patch vectors, then run a standard transformer encoder
LLaVA: ViT encoder → projector → concatenate with text tokens → LLM decoder
LLaVA-NeXT: dynamic resolution — split high-res images into multiple crops, encode each separately, concatenate
CLIP: contrastive learning to maximize similarity of correct text/image embedding pairs

Current paradigm: understanding via ViT encoder → features; generation via diffusion in pixel space.

Information Theory¶

Cross-entropy: $\operatorname{CE}(p, q) = -\mathbb{E}_p[\log q] = -\sum_x p(x) \log q(x)$
KL divergence: $\operatorname{KL}(p | q) = \sum_x p(x) (\log p(x) - \log q(x))$
Entropy: $H(p) = -\sum_x p(x) \log p(x)$
Relationship: $\operatorname{CE}(p, q) = \operatorname{KL}(p | q) + H(p)$
Autoregressive loss: $\mathcal{L}(x) = -\sum_{t=1}^T \log p(x_t \mid x_{<t})$

Numerical Stability¶

Stable Softmax¶

Subtract $x_\text{max}$ before exponentiating:

$$\text{softmax}(x)i = \frac{e^{x_i - x\text{max}}}{\sum_j e^{x_j - x_\text{max}}}$$

Log-Sum-Exp Trick¶

$$\log \sum_i e^{x_i} = x_\text{max} + \log \sum_i e^{x_i - x_\text{max}}$$

Online Softmax (Flash Attention building block)¶

Maintain running statistics $(m_k, d_k)$. When processing $x_{k+1}$:

$$m_{k+1} \leftarrow \max(m_k, x_{k+1}), \quad d_{k+1} \leftarrow d_k \cdot e^{m_k - m_{k+1}} + e^{x_{k+1} - m_{k+1}}$$

Extended to weighted sums $o = \sum_i p_i v_i$ for Flash Attention output computation.

Gradient Flow Through Sampling¶

Gumbel-Softmax: add Gumbel noise and use temperature for differentiable approximation to sampling
Straight-through estimator: forward uses discrete sample, backward uses continuous relaxation: $\partial \mathcal{L}/\partial x = (\partial \mathcal{L}/\partial y) \cdot f(x)$