LLM Architecture

How Large Language Models work — from transformer internals and attention mechanisms through training pipelines, quantization formats, model distribution formats, and knowledge distillation.


Transformer Architecture

The transformer is the neural network architecture behind virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, it replaced recurrent neural networks (RNNs/LSTMs) by processing all tokens in a sequence simultaneously rather than sequentially.

Why Transformers Replaced RNNs

RNNs process tokens one at a time, left to right. This sequential bottleneck means:

  • Training cannot be parallelized across sequence positions
  • Long-range dependencies decay over distance (vanishing gradients)
  • The number of sequential steps grows linearly with sequence length, so long sequences cannot be sped up by adding parallel hardware

Transformers solve all three problems through self-attention, which computes relationships between every pair of tokens in a single matrix operation — fully parallelizable on GPUs.

High-Level Data Flow

graph LR
    A[Raw Text] --> B[Tokenizer]
    B --> C[Token IDs]
    C --> D[Embedding Layer]
    D --> E[+ Positional Encoding]
    E --> F[Transformer Blocks x N]
    F --> G[Output Layer / Logits]
    G --> H[Softmax → Probability Distribution]
    H --> I[Next Token]

  1. Tokenization — text is split into subword tokens (integers from a fixed vocabulary)
  2. Embedding — each token ID maps to a dense vector via a learned embedding table
  3. Positional Encoding — positional signals are added so the model knows token order (attention itself is order-agnostic)
  4. Transformer Blocks — a stack of N identical layers, each containing self-attention + feed-forward network + residual connections + layer normalization
  5. Output Layer — projects hidden states to vocabulary-sized logits
  6. Softmax — converts logits to a probability distribution over the vocabulary

Modern LLMs range from about a dozen transformer blocks in small models to over a hundred in the largest ones. Deeper stacks enable richer hierarchical abstractions.

Inside a Transformer Block

Each transformer block contains two main sub-layers wrapped in residual connections and normalization:

graph TD
    A[Input] --> B[Layer Norm]
    B --> C[Multi-Head Self-Attention]
    C --> D[+ Residual Connection]
    D --> E[Layer Norm]
    E --> F[Feed-Forward Network]
    F --> G[+ Residual Connection]
    G --> H[Output to Next Block]

Feed-Forward Network (FFN)

The FFN provides the model's primary source of nonlinearity and parameter capacity. While attention handles communication between tokens, the FFN handles computation within each token's representation — this is where the model stores and applies learned knowledge.

The original transformer used a two-layer FFN with ReLU activation and a 4x hidden dimension expansion:

$$ \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2 $$

Modern LLMs have evolved significantly:

| Component | Original Transformer | Modern LLMs (LLaMA/Mistral) |
| --- | --- | --- |
| Activation | ReLU | SwiGLU (SiLU-gated) |
| Expansion ratio | 4x | ~2.7x (compensated by gating) |
| Normalization | LayerNorm (Post-LN) | RMSNorm (Pre-LN) |

SwiGLU Activation

SwiGLU is a gated variant that has become the standard in LLaMA-family models. It works like a learned gate: up_proj(x) carries the information, and SiLU(gate_proj(x)) controls how much passes through:

$$ \text{SwiGLU}(x) = (\text{SiLU}(xW_{\text{gate}})) \odot (xW_{\text{up}}) $$

Even with a nominally lower expansion ratio (~2.7x vs 4x), SwiGLU-based FFNs have similar or greater effective capacity because the gate mechanism provides additional expressive power. Gemma uses GeGLU, a closely related variant.
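
The gating formula can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the helper names (`silu`, `matvec`, `swiglu_ffn`), the tiny weight matrices, and the final down projection `W_down` (the usual LLaMA-style projection back to model width, which the formula above omits) are assumptions made for the example.

```python
import math

def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x: list[float], W: list[list[float]]) -> list[float]:
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(x, W_gate)]   # controls how much passes through
    up = matvec(x, W_up)                          # carries the information
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(hidden, W_down)                 # project back to model width

# Toy sizes: model width 2, hidden width 3 (illustrative values only).
x = [1.0, -2.0]
W_gate = [[0.5, -0.1, 0.3], [0.2, 0.4, -0.6]]
W_up = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
W_down = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y)
```

Setting `W_gate` to all zeros makes every gate value `silu(0) = 0`, so the output collapses to zeros, which is a quick way to see that the gate really does control what passes.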

RMSNorm vs LayerNorm

LayerNorm performs two operations: centering (subtracting the mean) and scaling (dividing by standard deviation). RMSNorm removes centering entirely, normalizing only by root mean square — empirical studies found centering contributes little to training stability while scaling does the heavy lifting.

$$ \text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2}} \cdot \gamma $$

RMSNorm yields comparable performance to LayerNorm but shows 7–64% speed improvement.

Pre-LN vs Post-LN

| Placement | Description | Stability | Used By |
| --- | --- | --- | --- |
| Post-LN (original) | Norm applied after residual add | Requires careful LR warmup; gradient issues in deep nets | Original Transformer, BERT |
| Pre-LN (modern) | Norm applied before each sub-layer | Much more stable; often trains with little or no warmup | GPT-2/GPT-3, LLaMA, Mistral |

Pre-LN normalizes the input to each sub-layer, which helps prevent activation explosions. The residual path remains clean, allowing gradients to flow easily. Since LLaMA's release (2023), Pre-LN with RMSNorm has been the de facto standard for open-weight LLMs.

Residual Connections

Residual (skip) connections add each sub-layer's input directly to its output: $\text{output} = \text{sublayer}(x) + x$. This allows gradients to flow through hundreds of layers without vanishing and lets each layer learn a refinement rather than a complete transformation.

Weight Tying

Many models tie the input embedding matrix with the output projection matrix (the layer that produces logits). Since both map between token IDs and hidden dimensions, sharing weights reduces parameter count and can improve generalization. GPT-2 and many smaller models use weight tying; larger models like LLaMA do not.

Encoder-Decoder vs Decoder-Only

The original transformer had two halves:

| Architecture | Used By | How It Works |
| --- | --- | --- |
| Encoder-Decoder | T5, BART, original Transformer | Encoder reads full input bidirectionally; decoder generates output autoregressively |
| Encoder-Only | BERT, RoBERTa | Bidirectional attention for understanding tasks (classification, NER) |
| Decoder-Only | GPT series, LLaMA, Claude, Mistral | Causal (left-to-right) attention; generates text one token at a time |

Nearly all modern generative LLMs use the decoder-only variant. The encoder-only approach lives on in embedding models and classification tasks.


Self-Attention Mechanism

Self-attention is the core innovation that makes transformers work. It allows every token to "attend to" every other token in the sequence, computing relevance scores dynamically.

Query, Key, Value (QKV)

For each token, the model computes three vectors from the input embedding:

  • Query (Q) — "what am I looking for?"
  • Key (K) — "what do I contain?"
  • Value (V) — "what information do I provide?"

The attention score between two tokens is the dot product of one token's Query with another's Key, scaled and passed through softmax:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where $d_k$ is the dimension of the key vectors (scaling prevents dot products from growing too large).

Causal Masking

In decoder-only models, a causal mask is applied to the attention matrix: entries above the diagonal are set to $-\infty$ before softmax, so they receive zero weight and tokens cannot attend to future positions. This is what makes generation autoregressive: each token can only see itself and the tokens that came before it.
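
The attention formula and the causal mask fit together as in the following single-head sketch. It is written in plain Python for readability rather than speed, and the function names and toy matrices are assumptions for illustration.

```python
import math

def softmax(row: list[float]) -> list[float]:
    m = max(row)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in row]         # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))      # future position: masked out
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = causal_attention(Q, K, V)
for row in attn:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower-triangular: the first token attends only to itself, and each later token spreads its weight over itself and earlier positions.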

Multi-Head Attention

Rather than computing a single attention function, transformers use multiple attention heads (typically 32–128), each with independent Q/K/V projections. Different heads learn to capture different types of relationships (syntactic, semantic, positional). The outputs are concatenated and linearly projected.

Attention Variants

| Variant | Description | Used By |
| --- | --- | --- |
| Multi-Head Attention (MHA) | Each head has its own K, V projections | Original Transformer, GPT-2 |
| Multi-Query Attention (MQA) | All heads share a single K, V projection | PaLM, Falcon |
| Grouped-Query Attention (GQA) | Heads grouped into clusters sharing K, V | LLaMA 2/3, Mistral, Gemma |

GQA is the current standard — it reduces KV cache memory by 4-8x compared to MHA with minimal quality loss.
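
The memory saving follows directly from how the KV cache is sized. The sketch below uses a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 8192-token context, 2 bytes per value); these numbers are assumptions chosen for round results, not the specification of a particular model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2, batch: int = 1) -> int:
    # Factor 2: one K tensor and one V tensor are cached per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch

cfg = dict(layers=32, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(kv_heads=32, **cfg)   # every query head has its own K/V
gqa = kv_cache_bytes(kv_heads=8, **cfg)    # 4 query heads share each K/V head
mqa = kv_cache_bytes(kv_heads=1, **cfg)    # all query heads share one K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB")
```

With 8 KV heads instead of 32, the cache shrinks by exactly the ratio of query heads to KV heads (4x here); models with 64 query heads and 8 KV heads get the 8x end of the range.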


Tokenization and Embeddings

Tokenization

Tokenization converts raw text into integer token IDs from a fixed vocabulary. LLMs use subword tokenization — a middle ground between character-level (too fine) and word-level (can't handle unknown words).

Byte Pair Encoding (BPE) is the dominant algorithm:

  1. Start with a vocabulary of 256 byte values
  2. Find the most frequent adjacent byte pair in the training corpus
  3. Merge that pair into a new token, add to vocabulary
  4. Repeat until vocabulary reaches target size (historically 30K–100K tokens; LLaMA 3 uses about 128K)

Common words become single tokens; rare words decompose into known subword pieces.
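
The four steps above can be sketched as a toy byte-level BPE trainer. This is a teaching sketch under simplifying assumptions: it trains on a single string, ignores the pre-tokenization rules real tokenizers apply, and the function names are made up for the example.

```python
from collections import Counter

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text: str, num_merges: int):
    ids = list(text.encode("utf-8"))      # step 1: start from raw bytes (0..255)
    merges: dict[tuple[int, int], int] = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]   # step 2: most frequent adjacent pair
        if count < 2:
            break                               # nothing left worth merging
        merges[pair] = next_id                  # step 3: add merged token to vocab
        ids = merge(ids, pair, next_id)
        next_id += 1                            # step 4: repeat
    return ids, merges

ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(len("aaabdaaabac"), "bytes ->", len(ids), "tokens")
print(merges)
```

Each merge shortens the sequence and grows the vocabulary by one entry, which is the trade-off the target vocabulary size controls.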

| Algorithm | Description | Used By |
| --- | --- | --- |
| BPE (byte-level) | Merge most frequent byte pairs | GPT-2/3/4, LLaMA 3, Claude, Mistral |
| WordPiece | Merge pairs that maximize corpus likelihood | BERT, DistilBERT |
| SentencePiece | Language-agnostic, operates on raw text | LLaMA 1/2, Mistral (earlier), T5 |
| Unigram | Probabilistic model, prunes vocabulary down | SentencePiece variant, XLNet |

Tokenization Quirks

Many LLM "failures" trace back to tokenization. Math errors occur because multi-digit numbers split into arbitrary subword tokens. Spelling struggles happen because the model never sees individual characters. "Glitch tokens" — tokens frequent in tokenizer training data but rare in model training — produce unpredictable outputs.

Embeddings

The embedding layer maps each integer token ID to a dense vector (typically 4096–12288 dimensions). These vectors are learned during pretraining and encode semantic relationships: similar tokens have similar vectors.

Positional encoding adds sequence-order information since attention is inherently order-agnostic.

Positional Encoding Methods

| Method | Type | How It Works | Used By |
| --- | --- | --- | --- |
| Sinusoidal | Absolute | Fixed sine/cosine functions at each position | Original Transformer |
| Learned Absolute | Absolute | Trainable embedding per position (up to max length) | GPT-2, BERT |
| RoPE (Rotary Position Embedding) | Relative | Encodes relative positions via rotation matrices applied to Q/K vectors | LLaMA 1/2/3, Mistral, Qwen, Gemma |
| ALiBi (Attention with Linear Biases) | Relative | Adds linear penalty proportional to token distance directly to attention scores | BLOOM, MPT |
| YaRN | Relative (extended) | Extends RoPE to longer contexts via NTK-aware interpolation | Long-context LLaMA variants |

RoPE is the dominant method in 2025. It applies a rotation matrix to Q and K vectors such that the dot product $q \cdot k$ depends only on their relative position, not absolute. This enables better length extrapolation than learned absolute embeddings and avoids ALiBi's precision issues (see below).
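
The relative-position property can be checked numerically. The sketch below rotates adjacent dimension pairs, as in the original RoPE formulation; some implementations pair dimension `i` with `i + d/2` instead, and the function names and `base=10000` default here are assumptions for the example.

```python
import math

def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)    # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

near = dot(rope(q, 5), rope(k, 3))        # positions 5 and 3: offset 2
far = dot(rope(q, 102), rope(k, 100))     # positions 102 and 100: offset 2
print(near, far)
```

Both dot products agree up to floating-point error because only the offset between the two positions enters the score, not the absolute positions.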

ALiBi adds a simple linear bias $-m \cdot |i - j|$ to each attention score, where $m$ is a head-specific slope and $|i-j|$ is the distance between tokens. While elegant, ALiBi has a critical interaction with reduced precision: in FP16, the last 20 positions of a head may map to only 5 distinct values, and in BF16 they may all collapse to the same value. This limits ALiBi's effectiveness for long-context inference.

Context Windows

The context window is the maximum number of tokens an LLM can process in one request. All input (system prompt, conversation history, user query) and output share this budget.

| Era | Typical Context | Example Models |
| --- | --- | --- |
| 2018–2020 | 512–2048 | BERT, GPT-2 |
| 2022–2023 | 4K–32K | GPT-3.5 Turbo, GPT-4 |
| 2024–2025 | 128K–1M | Claude 3.5, Gemini 1.5, GPT-4 Turbo |
| 2025–2026 | 1M–10M | Gemini 2.0, Claude 4 |

Lost in the Middle

Research shows LLMs attend strongly to tokens at the beginning and end of context but drop 30%+ accuracy on information in the middle (Liu et al., Stanford 2024). Placing critical information at the start or end of prompts improves retrieval quality.


Training Pipeline

Phase 1: Pretraining

The model learns general knowledge by predicting the next token across trillions of tokens from web crawls, books, code, and curated datasets. This is by far the most expensive phase — DeepSeek-V3 required 2.788 million H800 GPU hours (~$5.6M) for 14.8 trillion tokens.

The training objective is simple causal language modeling:

$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}) $$
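
As a worked instance of this objective, the sketch below sums the negative log-probability of each target token given raw logits. The function name and the toy logits are assumptions; real training code computes the same quantity in batched tensor form and usually averages over tokens.

```python
import math

def causal_lm_loss(logits: list[list[float]], targets: list[int]) -> float:
    """Sum over positions of -log P(target_t | previous tokens).

    logits[t] holds the model's scores over the vocabulary at position t,
    computed from tokens before t; targets[t] is the true token ID at t.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # log-sum-exp
        total += -(row[target] - log_z)                           # -log softmax
    return total

# Vocabulary of 4 tokens, sequence of 3 positions.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
confident = [[9.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0], [0.0, 0.0, 9.0, 0.0]]
targets = [0, 1, 2]

print(causal_lm_loss(uniform, targets))    # 3 * ln(4): the model is guessing
print(causal_lm_loss(confident, targets))  # near 0: the model predicts well
```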

Pretraining Data Curation

The quality and composition of pretraining data is as critical as model architecture:

| Step | What It Does | Why It Matters |
| --- | --- | --- |
| Deduplication | Remove near-duplicate documents (MinHash, exact substring matching) | Duplicated data causes memorization, degrades generalization, inflates benchmark scores |
| Quality filtering | Score documents via heuristics or classifier (perplexity, language ID, content quality) | Removes spam, boilerplate, machine-generated text |
| Toxicity/PII removal | Filter harmful content and personally identifiable information | Safety and legal compliance |
| Domain mixing | Control proportions of web, code, books, scientific papers, multilingual data | Affects which capabilities the model develops |
| Data scheduling | Vary data mix during training (e.g., increase code/math ratio later) | Optimizes learning curriculum |

Modern data pipelines use classifier-based filtering — training a small model on known high-quality text (e.g., Wikipedia, textbooks) and scoring all candidate documents. LLaMA 3 used this approach extensively.

Synthetic Data in Pretraining

Synthetic data — generated by existing LLMs — is increasingly used to augment pretraining corpora:

  • Textbook-quality data: Phi models (Microsoft) demonstrated that small models trained on LLM-generated "textbook-style" data can outperform much larger models on reasoning benchmarks
  • Code generation: synthetic programming problems and solutions supplement natural code repositories
  • Math and reasoning: step-by-step solutions generated by strong models provide training signal for reasoning capabilities
  • Instruction data: synthetic instruction-response pairs bootstrap SFT datasets at scale

Model Collapse

Training on too much synthetic data without sufficient real data can cause "model collapse" — progressive degradation of quality as the model learns from its own distribution rather than the true data distribution. Careful mixing ratios (typically <30% synthetic) and quality filtering mitigate this risk.

Phase 2: Supervised Fine-Tuning (SFT)

The pretrained model is further trained on curated instruction-response pairs to learn:

  • Instruction following
  • Output formatting (JSON, markdown, structured responses)
  • Safety behaviors
  • Task-specific patterns

SFT datasets are much smaller (thousands to millions of examples) but high quality.

Phase 3: Alignment

RLHF (Reinforcement Learning from Human Feedback)

The traditional alignment pipeline:

graph LR
    A[SFT Model] --> B[Generate Multiple Responses]
    B --> C[Human Annotators Rank Outputs]
    C --> D[Train Reward Model]
    D --> E[Optimize Policy via PPO]
    E --> F[Aligned Model]

  1. Generate multiple responses per prompt
  2. Human annotators rank them
  3. Train a reward model to predict human preferences
  4. Use PPO (Proximal Policy Optimization) to optimize the SFT model (the policy) against the reward model

Downsides: complex, expensive, unstable training, susceptible to reward hacking.

DPO (Direct Preference Optimization)

Introduced by Rafailov et al. (2023), DPO simplifies alignment by eliminating the reward model entirely. It reframes preference learning as a binary classification problem:

  • Given a chosen response and a rejected response, directly optimize the model to increase the probability of the chosen response relative to the rejected one
  • Requires only 2 models (policy + frozen reference) vs RLHF's 4
  • Standard supervised learning infrastructure — no RL instability
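
For a single preference pair, the DPO objective from Rafailov et al. can be written as a short function. The inputs are summed log-probabilities of each full response under the policy and the frozen reference model; the argument names and the `beta=0.1` default are assumptions for this sketch.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * (chosen gain - rejected gain)) for one pair."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # vs. reference
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start, improved)
```

The loss starts at $\ln 2$ when the policy equals the reference and falls as the policy widens the margin in favor of the chosen response, which is the binary-classification view described above.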

By 2025, 70% of enterprises use RLHF or DPO for alignment, with DPO adoption growing 45% year-over-year.

Constitutional AI (CAI)

Developed by Anthropic, CAI replaces human preference labeling with self-critique based on ethical principles (a "constitution"). The model generates responses, critiques its own outputs against the constitution, and revises — enabling scalable alignment without massive human annotation.

Phase 4: Reinforcement Learning for Reasoning

Models like DeepSeek-R1 and OpenAI o1/o3 add an RL phase specifically targeting step-by-step reasoning:

  • Train the model to generate and verify chains of thought
  • Reward correct final answers and valid reasoning steps
  • Results: DeepSeek-R1 achieves 97.3% on MATH-500 and ~80% on AIME 2024 competition problems

Mixture of Experts (MoE)

MoE introduces sparsity into the model: instead of activating all parameters for every token, only a subset of specialized "expert" sub-networks fire. This achieves the quality of massive models at the compute cost of much smaller ones.

How MoE Works

graph TD
    A[Input Token] --> B[Router / Gating Network]
    B --> C[Expert 1]
    B --> D[Expert 2]
    B --> E[Expert 3]
    B --> F["Expert N (inactive)"]
    C --> G[Weighted Sum of Active Expert Outputs]
    D --> G
    E --> G
    G --> H[Output]
    style F fill:#ccc,stroke:#999

  1. A router (small neural network) scores all experts for each input token
  2. The top-K experts (typically top-2) are selected
  3. Their outputs are combined via weighted sum
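  4. Remaining experts are not computed, which skips most of the expert FLOPs for that token (for example, top-2 of 8 experts skips 75% of the expert computation)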
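
The routing steps above can be sketched as follows. This assumes one common gating choice, a softmax over the selected experts' logits; the function names and the toy "experts" (simple scaling functions standing in for feed-forward networks) are illustrative only.

```python
import math

def top_k_route(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    order = sorted(range(len(router_logits)),
                   key=lambda e: router_logits[e], reverse=True)
    top = order[:k]
    m = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - m) for e in top}
    z = sum(exps.values())
    return {e: exps[e] / z for e in top}

def moe_layer(x: list[float], router_logits: list[float], experts, k: int = 2):
    weights = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for e, w in weights.items():
        y = experts[e](x)                        # only selected experts run
        out = [o + w * v for o, v in zip(out, y)]
    return out

calls = []                                       # records which experts ran

def make_expert(scale: float):
    def expert(x):
        calls.append(scale)
        return [scale * v for v in x]
    return expert

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
out = moe_layer([1.0, -1.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(out, "experts run:", calls)
```

Only two of the four experts are ever called, which is where the compute saving comes from; the router itself still scores every expert.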

Key MoE Models

| Model | Total Params | Active Params | Experts | Innovation |
| --- | --- | --- | --- | --- |
| Mixtral 8x7B | 46.7B | 12.9B | 8 | First major open-source MoE; static top-2 routing |
| Mixtral 8x22B | 141B | ~39B | 8 | Scaled Mixtral architecture |
| DeepSeek-V3 | 671B | 37B | 256 | Fine-grained experts; auxiliary-loss-free load balancing; FP8 training |
| DeepSeek-R1 | 671B | 37B | 256 | RL-first reasoning on V3 base; 97.3% MATH-500 |
| Llama 4 Scout | 109B | 17B | 16 | Meta's first MoE; 10M token context |
| Llama 4 Maverick | 400B | 17B | 128 | 128 experts, top-1 routing |

DeepSeek's MoE Innovations

DeepSeek introduced two key strategies:

  1. Fine-grained experts — segment into many small experts (256 instead of 8), activate a small subset, allowing more flexible combinations
  2. Shared experts — isolate some experts as "shared" across all tokens to capture common knowledge, reducing redundancy in routed experts

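As of 2025, many frontier models with publicly documented architectures (Gemini 1.5, Llama 4, DeepSeek-V3, Mixtral) use MoE. The architectures of several closed models, including GPT-4 and Claude, have not been officially disclosed.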

Load Balancing and Expert Collapse

A critical challenge in MoE training is expert collapse — the router learns to send most tokens to a few "popular" experts while others receive little traffic and stop learning. This wastes capacity and reduces model quality.

Solutions:

| Technique | How It Works | Used By |
| --- | --- | --- |
| Auxiliary load-balancing loss | Adds a penalty term that encourages equal token distribution across experts | Mixtral, Switch Transformer |
| Expert capacity factor | Caps the max tokens per expert; overflow tokens are dropped or sent to a default expert | GShard, Switch Transformer |
| Auxiliary-loss-free balancing | Uses a bias term in the router to balance load without distorting the main training loss | DeepSeek-V3 |
| Shared experts | Reserve some experts as "always active" to handle common knowledge, reducing pressure on routed experts | DeepSeek-V2/V3 |

DeepSeek-V3's auxiliary-loss-free approach is notable because traditional auxiliary losses can conflict with the main training objective, forcing a trade-off between load balance and model quality. By using a separate bias term, DeepSeek avoids this conflict entirely.

MoE Memory Tradeoff

MoE memory scales with total parameters, not active parameters. A 671B MoE model needs hundreds of GB of VRAM even though only 37B parameters fire per token. This forces multi-GPU deployments for large MoE models.
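
A back-of-the-envelope calculation shows the gap. It counts weights only (no KV cache, activations, or runtime overhead), and the helper name is an assumption for this sketch.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

total_params = 671e9    # all experts must be resident in memory
active_params = 37e9    # parameters actually used per token

for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    resident = weight_memory_gb(total_params, bits)
    touched = weight_memory_gb(active_params, bits)
    print(f"{label}: {resident:,.0f} GB resident, {touched:,.0f} GB touched per token")
```

Even at 4 bits the resident weights exceed 300 GB, while the weights touched for any one token would fit on a single accelerator.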


Quantization Formats

Quantization reduces model weight precision from high-bit (FP32/FP16) to lower-bit (INT8/INT4) representations, dramatically reducing memory and improving inference speed.

Numeric Precision Types

Floating-Point Bit Layout

Understanding the sign/exponent/mantissa structure explains why these formats differ:

FP32:  [1 sign] [8 exponent] [23 mantissa]  — 32 bits total
FP16:  [1 sign] [5 exponent] [10 mantissa]  — 16 bits total
BF16:  [1 sign] [8 exponent] [ 7 mantissa]  — 16 bits total
FP8:   [1 sign] [4 exponent] [ 3 mantissa]  — 8 bits total (E4M3 variant)

| Property | FP32 | BF16 | FP16 |
| --- | --- | --- | --- |
| Dynamic range (decades) | ~83 | ~79 | ~12 |
| Epsilon (precision near 1.0) | ~1.2e-7 | ~7.8e-3 | ~9.8e-4 |
| Max value | ~3.4e38 | ~3.4e38 | ~65,504 |
| Loss scaling needed? | No | Rarely | Often yes |

BF16 has the same 8-bit exponent as FP32, giving it nearly identical dynamic range — this means it can represent extremely small gradients and large activations without underflow/overflow. The trade-off is lower precision (7 mantissa bits vs FP16's 10). In practice, BF16 "just works" for training because you rarely need loss scaling.

FP16 has higher precision within a narrow range but risks overflow during training. Loss scaling (multiplying the loss by a large factor, then dividing gradients back) is often required to prevent gradient underflow.
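
These range and precision trade-offs can be reproduced with the standard library alone. The sketch uses `struct`'s half-precision format code `e` for FP16 and emulates BF16 by truncating a float32 to its top 16 bits; truncation (rather than round-to-nearest) and the helper names are simplifying assumptions.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; overflow becomes infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

big, tiny = 70000.0, 1e-8
print("FP16 big: ", to_fp16(big))            # overflows past 65,504
print("BF16 big: ", to_bf16(big))            # representable, but coarse
print("FP16 tiny:", to_fp16(tiny))           # underflows to zero
print("BF16 tiny:", to_bf16(tiny))           # survives thanks to 8 exponent bits
print("FP16 tiny * 1024:", to_fp16(tiny * 1024))  # loss scaling rescues it
```

The last line is loss scaling in miniature: multiplying a tiny gradient by a constant moves it into FP16's representable range, and the factor is divided back out afterwards in higher precision.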

| Format | Bits | Bytes/Param | Description | Use Case |
| --- | --- | --- | --- | --- |
| FP32 | 32 | 4 | Full precision float | Training optimizer states only |
| BF16 | 16 | 2 | Brain Float 16; wider dynamic range than FP16 | Standard training & full-quality inference; post-2022 default |
| FP16 | 16 | 2 | Half precision float | Legacy training and inference; same memory as BF16, with more mantissa precision but a much narrower dynamic range |
| FP8 | 8 | 1 | 8-bit float; native on Hopper/Blackwell GPUs | Production sweet spot on modern NVIDIA hardware |
| INT8 | 8 | 1 | 8-bit integer | ~50% memory reduction vs FP16; broad hardware support |
| INT4 | 4 | 0.5 | 4-bit integer | ~75% memory reduction; per-group scaling preserves quality |
| INT2 | 2 | 0.25 | 2-bit integer | Extreme compression; significant quality loss |

How Quantization Works

Full-precision weights (e.g. FP16) are mapped to a smaller set of representable values:

  1. Per-tensor quantization: one scale factor for the entire weight tensor (fast but lossy)
  2. Per-channel quantization: one scale factor per output channel (better quality)
  3. Per-group quantization: divides weights into small groups (commonly 32–128 elements), each with its own scale (best quality/size tradeoff for INT4)

The scale factor maps the quantized integer range back to the original floating-point range during inference.
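
A small experiment shows why finer-grained scales help. The sketch uses symmetric absmax quantization to signed 4-bit integers; that scheme, the function names, and the toy weights are assumptions for illustration, and production formats (GPTQ, AWQ, GGUF k-quants) add further refinements.

```python
def quantize(weights: list[float], group_size: int, bits: int = 4):
    """Symmetric absmax quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        scale = max(abs(w) for w in chunk) / qmax
        if scale == 0.0:
            scale = 1.0                           # all-zero group: any scale works
        q = [round(w / scale) for w in chunk]     # integers in [-qmax, qmax]
        groups.append((q, scale))
    return groups

def dequantize(groups) -> list[float]:
    """Map integers back to floats using each group's scale factor."""
    return [qi * scale for q, scale in groups for qi in q]

# Four small weights next to four large ones (an outlier-heavy tensor).
w = [0.01, -0.02, 0.03, 0.015, 5.0, -4.0, 3.0, 1.0]

per_tensor = dequantize(quantize(w, group_size=len(w)))
per_group = dequantize(quantize(w, group_size=4))
print("per-tensor:", [round(v, 4) for v in per_tensor])
print("per-group: ", [round(v, 4) for v in per_group])
```

With a single scale, the large weights dictate the step size and all four small weights round to zero; with one scale per group of four, the small weights keep their sign and approximate magnitude.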

Quality Impact by Precision

Perplexity benchmarks (Llama-2-7B, lower is better):

| Format | Perplexity | Quality Loss |
| --- | --- | --- |
| FP16 (baseline) | 7.4924 | |
| Q8_0 | 7.4933 | Negligible |
| Q5_K_M | ~7.52 | Minimal |
| Q4_K_M | 7.5692 | Acceptable |
| Q3_K_M | ~7.85 | Noticeable |
| Q2_K | 8.6501 | Significant degradation |

Low-Bit Caveats

At Q2/Q3, models start ignoring parts of system prompts and hallucinating JSON formatting. Avoid INT4 and below for math, code generation, and reasoning-heavy tasks where quality loss is most noticeable.

Size and Speed Example (Llama 2 13B)

| Metric | FP16 | Q4_K_M |
| --- | --- | --- |
| Model size | 26 GB | 7.9 GB (70% reduction) |
| RAM required | 32 GB+ | 12 GB |
| Speed | 8 tok/s | 15 tok/s |
| Quality | 100% | ~95% |

Model Formats and Quantization Methods

GGUF (GPT-Generated Unified Format)

GGUF is a self-contained file format created by the llama.cpp project. It bundles weights, tokenizer, architecture metadata, and chat template into a single .gguf file.

Key properties:

  • Runs on everything — CPU, NVIDIA, AMD, Apple Silicon
  • mmap-able (OS maps file into memory without loading it all)
  • Endian-safe and versioned
  • Powers Ollama and LM Studio under the hood

GGUF quantization naming convention:

| Name | Approx Bits/Weight | Type | Quality |
| --- | --- | --- | --- |
| Q2_K | ~2.6 | K-quant | Extreme compression, noticeable degradation |
| Q3_K_S / Q3_K_M / Q3_K_L | ~3.3–3.9 | K-quant | Budget-conscious, some quality loss |
| Q4_K_S / Q4_K_M | ~4.3–4.8 | K-quant | Best balance of quality and size |
| Q5_K_S / Q5_K_M | ~5.3–5.7 | K-quant | Near-lossless for most tasks |
| Q6_K | ~6.6 | K-quant | Very close to FP16 |
| Q8_0 | ~8.5 | Legacy | Near-identical to FP16 |
| IQ2_XXS / IQ3_S | ~2.1–3.4 | I-quant | State-of-art low-bit; uses lookup tables |

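The "K" indicates the k-quant method (block-wise quantization that assigns more bits to the tensors that matter most); the S/M/L suffixes stand for small, medium, and large variants, trading file size for quality.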

GPTQ (GPT-Quantized)

Calibration-based 4-bit integer quantization using approximate second-order (Hessian) information to minimize quantization error. Requires a small calibration dataset. AutoGPTQ was archived in April 2025; succeeded by GPTQModel v5.8.0.

Verdict: Use only if AWQ or EXL2 versions are unavailable. Both offer better quality-per-bit.

AWQ (Activation-Aware Weight Quantization)

MIT research. Identifies the <1% of "salient" weight channels by observing activations during calibration, then protects them by scaling those channels before quantization, which avoids a hardware-unfriendly mixed-precision format.

  • ~3 percentage points better than GPTQ on MMLU at 4 bits
  • Marlin-AWQ kernel: ~741 tok/s on A10G — fastest 4-bit for NVIDIA
  • Best choice for vLLM multi-user deployments on NVIDIA

EXL2 (ExLlamaV2)

Mixed bit-width quantization — can use 2, 3, 4, 5, 6, 8 bits within a single model and even within individual layers. Supports fractional average bitwidths (e.g., 4.5 bpw).

  • Fastest for interactive single-user generation on NVIDIA GPUs (40–70% faster than llama.cpp)
  • NVIDIA CUDA only, no CPU fallback
  • Best for single-user interactive sessions at 4–6 bpw

Quick Format Decision Guide

| Scenario | Best Format |
| --- | --- |
| CPU / Laptop / Apple Silicon | GGUF (Q4_K_M or Q5_K_M) |
| NVIDIA GPU, max serving throughput | AWQ with Marlin kernels |
| NVIDIA GPU, single-user interactive | EXL2 at 4–6 bpw |
| NVIDIA H100/Blackwell production | FP8 |
| Fine-tuning | bitsandbytes (QLoRA) |
| Limited VRAM (≤8GB) | GGUF Q4_K_M with CPU offloading |
| General starting point | Ollama with Q4_K_M |

MLX (Apple Silicon)

MLX is Apple's open-source array framework for machine learning on Apple Silicon. Designed for Mac-native LLM inference and fine-tuning.

Key Design Principles

  • Unified memory — arrays live in shared CPU/GPU memory; no data transfer overhead
  • Lazy computation — operations are materialized only when needed, enabling automatic fusion
  • Dynamic graphs — no recompilation on shape changes (unlike TensorRT)
  • Familiar APIs — Python API mirrors NumPy; mlx.nn mirrors PyTorch

MLX LM

The mlx-lm package provides one-command model download, quantization conversion, and inference:

# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Llama-3-8B --quantize --q-bits 4

# Generate text
mlx_lm.generate --model mlx-community/Llama-3-8B-4bit --prompt "Explain attention"

Performance on Apple Silicon

| Chip | Memory Bandwidth | 14B Dense (BF16) TTFT | 30B MoE (4-bit) TTFT |
| --- | --- | --- | --- |
| M4 | 120 GB/s | ~12s | ~4s |
| M5 | 153 GB/s | <10s | <3s |

The M5 provides 19–27% improvement over M4, directly proportional to its 28% memory bandwidth increase.

Research shows vllm-mlx achieves 21–87% higher throughput than llama.cpp on Apple Silicon, thanks to zero-copy tensor operations and lazy evaluation.

A MacBook Pro 24GB can hold an 8B model in BF16 or a 30B MoE at 4-bit quantization comfortably.


Knowledge Distillation

Knowledge distillation compresses a large teacher model into a smaller student model that mimics the teacher's behavior while being far cheaper to run.

How It Works

graph LR
    A[Input Data] --> B[Teacher Model - Large]
    A --> C[Student Model - Small]
    B --> D[Soft Targets / Probabilities]
    D --> E[Distillation Loss]
    C --> F[Student Predictions]
    F --> E
    E --> G[Update Student Weights]

Three main distillation approaches:

| Method | What Transfers | Description |
| --- | --- | --- |
| Response-based | Output probabilities ("soft targets") | Student learns teacher's probability distribution over vocabulary, not just the argmax |
| Feature-based | Intermediate layer activations | Student aligns internal representations via L2 or cosine similarity |
| Attention-based | Attention maps | Student replicates teacher's attention patterns (used in TinyBERT and MiniLM) |

Why Soft Targets Matter

Instead of training on hard labels (the single correct answer), the student learns from the teacher's full probability distribution. The relative probabilities encode the teacher's learned generalizations — for example, that "dog" and "puppy" are similar while "dog" and "table" are not.

A temperature parameter $T$ (typically 2–5) controls how "soft" the distribution is: higher temperature spreads probability more evenly, exposing more of the teacher's learned structure.
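
The effect of temperature, and the resulting response-based loss, can be sketched as below. The loss is the KL divergence between softened teacher and student distributions, multiplied by $T^2$ as in the classic distillation recipe of Hinton et al.; the function names and toy logits are assumptions for the example.

```python
import math

def softmax_t(logits: list[float], T: float = 1.0) -> list[float]:
    """Softmax over logits divided by temperature T."""
    scaled = [v / T for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(teacher_logits: list[float], student_logits: list[float],
                 T: float = 2.0) -> float:
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return T * T * kl

teacher = [4.0, 2.0, -1.0]       # e.g. scores for "dog", "puppy", "table"
student = [1.0, 1.5, 0.5]

print([round(p, 3) for p in softmax_t(teacher, T=1.0)])   # sharp
print([round(p, 3) for p in softmax_t(teacher, T=4.0)])   # softer
print(distill_loss(teacher, student, T=2.0))
```

At higher temperature the second-ranked class receives visibly more probability, which is exactly the similarity signal the student is meant to pick up.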

Results

  • Typical compression: 5–10x smaller, retaining 90–95% accuracy
  • DistilBERT: 60% of BERT's size, 97% of its performance, 60% faster
  • DeepSeek-R1-Distill models: distilled from 671B to 7B/14B/32B variants with strong reasoning capabilities

Distillation variants that build on the basic recipe:

  • Chain-of-Thought Distillation: transfers reasoning processes (not just final answers) from teacher to student using CoT rationales as training signal
  • Curriculum Distillation: organizes training easy-to-hard to gradually build reasoning capacity
  • Multi-Teacher Distillation: combines expertise from multiple specialized teachers with dynamic weighting
  • Few-Shot Distillation: can be effective with only 8 to 512 calibration samples when using counterfactual explanations

Scaling Laws

Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) established empirical scaling laws for LLMs:

  • Model performance (loss) improves predictably as a power law of: model size (parameters), dataset size (tokens), and compute budget (FLOPs)
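  • Chinchilla-optimal: for a given compute budget, model size and training tokens should be scaled in roughly equal proportion, which works out to about 20 training tokens per parameter
  • DeepSeek-V3 trained on 14.8T tokens with 671B params: about 22 tokens per total parameter, but about 400 tokens per active parameter (37B), so whether it counts as over-trained relative to Chinchilla depends on which count is used; MoE's sparse activation changes the calculus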

These laws guide decisions about how to allocate training budgets: bigger model vs more data vs longer training.
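
The arithmetic behind these comparisons is short enough to check directly. The sketch uses the standard approximation of about $6ND$ training FLOPs for $N$ parameters and $D$ tokens, and the published Chinchilla configuration (70B parameters, 1.4T tokens) as a reference point; the helper names are assumptions.

```python
def tokens_per_param(tokens: float, params: float) -> float:
    return tokens / params

def train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens

chinchilla = tokens_per_param(1.4e12, 70e9)       # reference: 70B params, 1.4T tokens
v3_total = tokens_per_param(14.8e12, 671e9)       # counting all parameters
v3_active = tokens_per_param(14.8e12, 37e9)       # counting active parameters only

print(f"Chinchilla:           {chinchilla:.0f} tokens/param")
print(f"DeepSeek-V3 (total):  {v3_total:.0f} tokens/param")
print(f"DeepSeek-V3 (active): {v3_active:.0f} tokens/param")
print(f"Approx. FLOPs, 70B on 1.4T tokens: {train_flops(70e9, 1.4e12):.2e}")
```

For an MoE model, per-token compute tracks the active parameter count, so the active figure is arguably the more relevant one when reasoning about compute-optimal training.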

Inference-Time Compute Scaling (Test-Time Compute)

A newer scaling axis discovered in 2024–2025: instead of only scaling training compute, you can scale inference compute by letting models "think longer" at test time.

| Approach | How It Works | Example |
| --- | --- | --- |
| Chain-of-Thought (CoT) | Generate step-by-step reasoning before the final answer | GPT-4, Claude |
| Best-of-N sampling | Generate N candidate answers, select the best one via verifier | Used in math benchmarks |
| Tree search | Explore multiple reasoning paths, backtrack when stuck | AlphaProof, OpenAI o1 |
| Self-verification | Model checks its own answer and retries if wrong | DeepSeek-R1 |
| Extended thinking | Dedicated "thinking" token budget separate from the visible response | Claude 3.7+ extended thinking, OpenAI o1/o3 |

OpenAI's o1/o3 and DeepSeek-R1 demonstrated that inference-time compute scaling can yield dramatic improvements on reasoning-heavy tasks, sometimes matching models 10x their size on math and coding benchmarks. The key insight: a smaller model thinking longer can outperform a larger model answering immediately.


Model Merging

Model merging combines the weights of multiple fine-tuned LLMs into a single model — no additional training required, no GPU needed. This creates models that combine capabilities from different specializations.

Why Merge?

  • Combine a code-focused model with a math-focused model into one that excels at both
  • Merge different LoRA adapters trained on different tasks
  • Reduce the cost of multi-task deployment (one merged model vs multiple specialized ones)
  • Experiment cheaply — thousands of merged models appear on the Open LLM Leaderboard

Merging Techniques

| Method | How It Works | Strengths | Limitations |
| --- | --- | --- | --- |
| Linear / LERP | Weighted average of model weights: $W = \alpha W_A + (1-\alpha) W_B$ | Simplest, fast | Naive averaging can cause interference between conflicting weight updates |
| SLERP (Spherical Linear Interpolation) | Interpolates along the hypersphere surface, preserving vector magnitudes | Maintains geometric properties; smoother than linear | Limited to merging exactly 2 models |
| TIES (Trim, Elect Sign & Merge) | Resets tiny deltas, resolves sign conflicts by majority vote, then merges cleaned updates | Handles interference between models; works with many models | More complex pipeline |
| DARE (Drop And REscale) | Randomly drops 90–99% of delta parameters, rescales remaining by $\frac{1}{1-p}$ | Effective even at extreme sparsity; reduces parameter interference | Random dropping adds variance |
| DARE + TIES | Combines DARE's random sparsification with TIES sign resolution | Best of both approaches | Requires tuning drop rate and thresholds |
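
Two of these methods are simple enough to sketch on flat lists of weights. The code below illustrates linear merging and the DARE drop-and-rescale step applied to the delta between a base and a fine-tuned model; it is not MergeKit's implementation, and the function names and toy values are assumptions.

```python
import random

def linear_merge(w_a: list[float], w_b: list[float], alpha: float) -> list[float]:
    """W = alpha * W_A + (1 - alpha) * W_B, elementwise."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(w_a, w_b)]

def dare(base: list[float], finetuned: list[float], p: float,
         rng: random.Random) -> list[float]:
    """Drop each delta with probability p; rescale survivors by 1 / (1 - p)."""
    merged = []
    for b, f in zip(base, finetuned):
        delta = f - b
        keep = rng.random() >= p
        merged.append(b + (delta / (1 - p) if keep else 0.0))
    return merged

print(linear_merge([0.0, 2.0], [4.0, 6.0], alpha=0.25))

rng = random.Random(0)                   # fixed seed for repeatability
base = [0.0] * 10_000
finetuned = [1.0] * 10_000               # every delta equals 1.0
sparse = dare(base, finetuned, p=0.9, rng=rng)
kept = sum(1 for v in sparse if v != 0.0)
print("kept fraction:", kept / len(sparse))
print("mean delta:   ", sum(sparse) / len(sparse))
```

Rescaling by $\frac{1}{1-p}$ keeps the expected delta unchanged, which is why the mean stays close to 1.0 even though roughly 90% of the deltas were zeroed.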

Tooling: MergeKit

MergeKit (by Arcee AI) is the standard open-source tool for model merging. It provides an extensible framework supporting all major algorithms and has been used to create thousands of merged models. Configuration is YAML-based:

models:
  - model: code-specialist/model
    parameters:
      weight: 0.6
  - model: math-specialist/model
    parameters:
      weight: 0.4
merge_method: ties
base_model: base/model
parameters:
  density: 0.5
  normalize: true
dtype: bfloat16

  • Reasoning model merging: merging "slow-thinking" reasoning models with "fast" conventional LLMs can reduce token consumption by ~50% while maintaining accuracy
  • Newer algorithms: NuSLERP, DELLA (Drop and Rescale via Sampling with Magnitude), and SCE (Select, Calculate, and Erase) offer incremental improvements
  • All merging methods still fall short of individually fine-tuned models on their specific tasks — merging trades peak specialization for broader capability

Post-Transformer Architectures

While transformers dominate, alternatives are emerging:

| Architecture | Key Innovation | Status |
| --- | --- | --- |
| Mamba (State Space Models) | Selective state updates; linear-time sequence processing; no quadratic attention | Competitive with transformers at small-medium scale |
| RWKV | RNN-transformer hybrid; linear attention | Active open-source community |
| Hyena | Long convolutions replace attention | Research stage |
| PaTH Attention (MIT, 2025) | Adds data-dependent down-weighting to standard attention | Improves reasoning and long-context tasks |

None have yet displaced transformers at frontier scale, but Mamba-based hybrids (Jamba by AI21) show promise for efficiency-critical applications.