Alisa's Math Notes¶
Distilled reference from Alisa Liu's math notes — a concise probability, statistics, and combinatorics reference aimed at machine learning practitioners. Alisa Liu is a UW NLP researcher (alisawuffles).
Distribution Quick Reference¶
| Description | Concept |
|---|---|
| Memoryless, waiting | Geometric, Exponential |
| Trials with success & failure | Bernoulli, Binomial |
| Large sample, average | CLT, approximately Gaussian |
| Bound given only mean / variance | Markov / Chebyshev inequalities |
| Estimating parameter from data | MLE, MAP, Bayesian update |
| Update belief from evidence | Bayes rule |
| Function of a RV | Jensen's inequality |
| Expected count of things | Linearity + indicators |
| Expected hitting time / return probability | First-step analysis, solve recurrence |
Discrete vs Continuous¶
| Discrete | Continuous | |
|---|---|---|
| Counting | Binomial: how many successes in $n$ trials? | Poisson: how many events in fixed window? |
| Waiting | Geometric: how many trials until first success? | Exponential: how long until first event? |
Discrete Distributions¶
Bernoulli¶
$X \sim \operatorname{Bernoulli}(p)$ models a single trial with $p$ probability of success.
$$P(X=1)=p, \quad P(X=0)=1-p$$
- Compact PMF: $P(X=x) = p^x(1-p)^{1-x}$
- $\mathbb{E}[X] = p$, $\operatorname{Var}[X] = p(1-p)$
- Categorical RV generalizes Bernoulli from 2 to $k$ outcomes
Indicator squaring
For any binary variable $X \in {0,1}$, $X^2 = X$, so $\mathbb{E}[X^2] = \mathbb{E}[X]$.
Binomial¶
$X \sim \operatorname{Binomial}(n, p)$ is the sum of $n$ independent $\operatorname{Bernoulli}(p)$ trials.
$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$
- $\mathbb{E}[X] = np$, $\operatorname{Var}[X] = np(1-p)$
- Binomial theorem: $(a+b)^n = \sum_{k=0}^n \binom{n}{k} a^k b^{n-k}$
Poisson¶
$X \sim \operatorname{Poisson}(\lambda)$ models the number of events in a fixed interval when they occur independently at average rate $\lambda$.
$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
- $\mathbb{E}[X] = \operatorname{Var}[X] = \lambda$
- Derived as the binomial limit: divide interval into $n$ tiny slots, $np = \lambda$
- Sums of independent Poisson are Poisson with rates that add
Geometric¶
$X \sim \operatorname{Geometric}(p)$ models the number of trials until first success (inclusive).
$$P(X=k) = (1-p)^{k-1} \cdot p$$
- Memoryless: $P(X > m+n \mid X > m) = P(X > n)$ — the only memoryless discrete distribution
- $\mathbb{E}[X] = 1/p$, $\operatorname{Var}(X) = (1-p)/p^2$
- Geometric series: $\sum_{k=0}^\infty r^k = \frac{1}{1-r}$ for $|r| < 1$
Continuous Distributions¶
Uniform¶
$X \sim \operatorname{Uniform}(a, b)$ — equally likely anywhere on $[a,b]$.
- $\mathbb{E}[X] = (a+b)/2$, $\operatorname{Var}(X) = (b-a)^2/12$
- PDF: $f_X(x) = 1/(b-a)$ for $a \leq x \leq b$
Exponential¶
$X \sim \operatorname{Exp}(\lambda)$ — waiting time until next event at rate $\lambda$.
$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$
- Memoryless: $P(X > s+t \mid X > s) = P(X > t)$ — the only continuous memoryless distribution
- $\mathbb{E}[X] = 1/\lambda$, $\operatorname{Var}(X) = 1/\lambda^2$
- Tail (survival): $P(X > x) = e^{-\lambda x}$
- Minimum of exponentials: $\min(X_1, X_2) \sim \operatorname{Exp}(\lambda_1 + \lambda_2)$
Poisson-Exponential duality
"How many events in the next hour?" is Poisson. "How long until the next event?" is Exponential. Same underlying process.
Gaussian¶
$X \sim \mathcal{N}(\mu, \sigma^2)$ — standardize via $Z = (X - \mu)/\sigma \sim \mathcal{N}(0,1)$.
- Sum of independent Gaussians is Gaussian: $X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- The percentile $z_\alpha$ satisfies $P(Z \leq z_\alpha) = \alpha$
- $(1-\alpha)$ confidence interval uses endpoints $z_{\alpha/2}$ and $z_{1-\alpha/2}$
Expectations & Variance¶
- Expectation: $\mathbb{E}[X] = \sum_x x \, P(X=k)$ or $\int x \, f(x) \, dx$
- Variance: $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$
- Linearity of expectation (regardless of independence): $\mathbb{E}[aX + bY + c] = a\mathbb{E}[X] + b\mathbb{E}[Y] + c$
- For independent $X, Y$: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ and $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$
- Tail formula for non-negative RVs: $\mathbb{E}[X] = \int_0^\infty P(X > x) \, dx$
Inequalities¶
| Inequality | Statement | Requires |
|---|---|---|
| Markov | $P(X \geq a) \leq \mathbb{E}[X]/a$ | Non-negative RV, $a > 0$ |
| Chebyshev | $P(\lvert X - \mu \rvert \geq k\sigma) \leq 1/k^2$ | Mean $\mu$, variance $\sigma^2$ |
| Jensen | $\varphi$ convex $\Rightarrow$ $\varphi(\mathbb{E}[X]) \leq \mathbb{E}[\varphi(X)]$ | Convexity/concavity of $\varphi$ |
| Cauchy-Schwarz | $\mathbb{E}[XY]^2 \leq \mathbb{E}[X^2]\mathbb{E}[Y^2]$ | — |
| Union bound | $P(A_1 \cup \cdots \cup A_n) \leq \sum P(A_i)$ | — |
Jensen mnemonic
Use $\varphi(x) = x^2$ (convex) with $X \in {-1, +1}$: $\varphi(\mathbb{E}[X]) = 0 \leq 1 = \mathbb{E}[\varphi(X)]$.
Common Convex / Concave Functions¶
- Convex ($\varphi'' \geq 0$): $x^2$, $|x|$, $a^x$ ($a > 1$), $1/x$ ($x > 0$), $\max$
- Concave ($\varphi'' \leq 0$): $\sqrt{x}$, $\log x$, $x^p$ ($0 < p < 1$), $\min$
- Worth memorizing: $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$
Limit Theorems¶
Central Limit Theorem¶
Mean of $n$ iid RVs with mean $\mu$ and variance $\sigma^2$ is approximately Gaussian:
$$\bar{X}n = \frac{1}{n}\sum\right)$$}^n X_i \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n
- Standard error: $\sigma / \sqrt{n}$ — need 100x samples to reduce error 10x
- $(1-\alpha)$ CI: $\bar{X}n \pm z$} \cdot \sigma/\sqrt{n
Law of Large Numbers¶
$P(|\bar{X}_n - \mu| > \epsilon) \to 0$ for any $\epsilon > 0$. Proved from Chebyshev.
Conditional Probabilities¶
- Bayes rule: $P(A \mid B) = P(B \mid A) P(A) / P(B)$
- Law of total expectation (tower property): $\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X \mid Y]]$
- Law of total probability: $P(A) = \sum_y P(A \mid Y=y) \cdot P(Y=y)$
Combinatorics¶
| Order matters? | Replacement? | Count |
|---|---|---|
| Yes | No | $n!/(n-k)!$ (permutations) |
| No | No | $\binom{n}{k}$ (combinations) |
| Yes | Yes | $n^k$ |
| No | Yes | $\binom{n+k-1}{k}$ (stars and bars) |
- Multinomial: arranging $n$ objects with $n_1$ of type 1, ..., $n_m$ of type $m$: $n! / (n_1! \cdots n_m!)$
- Inclusion-exclusion: $|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$
- Path counting: paths from $(0,0)$ to $(m,n)$ on a grid = $\binom{m+n}{m}$
Markov Chains¶
- Sequence of RVs satisfying the Markov property: future depends only on present
- Described by transition matrix $P$ where $P_{ij} = P(X_{t+1} = j \mid X_t = i)$
- Stationary distribution $\pi$ satisfies $\pi = \pi P$; expected return time to state $i$ is $1/\pi_i$
- First-step analysis for expected absorption time:
$$h_i = 1 + \sum_j P_{ij} h_j$$
Symmetry shortcut
On graphs where every node looks the same (cycles, complete graphs), expected hitting time between adjacent vertices is always $n-1$.
Maximum Likelihood Estimation¶
Given data $x_1, \ldots, x_n$ from distribution with unknown $\theta$:
$$\hat{\theta}{\text{MLE}} = \arg\max\theta \sum_{i=1}^n \log p(x_i; \theta)$$
| Distribution | MLE |
|---|---|
| $\operatorname{Bernoulli}(p)$ | $\hat{p} = k/n$ (sample frequency) |
| $\operatorname{Exp}(\lambda)$ | $\hat{\lambda} = n / \sum x_i$ (reciprocal of sample mean) |
| $\operatorname{Uniform}(0, \theta)$ | $\hat{\theta} = \max x_i$ |
Connection to ML training
Models trained to minimize cross-entropy loss $= -\frac{1}{N}\sum_i \log p(x_i)$ is the same as maximizing log-likelihood. Models are trained to make observed data as probable as possible.
Bias & Variance¶
- Bias: $\operatorname{bias}[\hat\theta] = \mathbb{E}[\hat\theta] - \theta$
- Variance: $\operatorname{Var}(\hat\theta) = \mathbb{E}[(\hat\theta - \mathbb{E}[\hat\theta])^2]$
- MSE decomposition: $\operatorname{MSE}(\hat\theta) = \operatorname{Bias}(\hat\theta)^2 + \operatorname{Var}(\hat\theta)$
Classic Puzzles¶
Birthday Problem¶
How many people $n$ until $P(\text{collision}) > 50\%$? Answer: $n \approx 23$ for 365 days.
$$P(\text{all unique}) = \frac{d!}{(d-n)! \cdot d^n}$$
Gambler's Ruin¶
Start with \$a, win/lose \$1 with probability $p$/$1-p$. Stop at \$N or \$0. For fair game ($p = 1/2$): $P(\text{reach } N) = a/N$. Expected absorption time: $h_i = i(N-i)$.
Coupon Collector¶
$n$ coupon types, draw uniformly at random. Expected time to collect all:
$$\mathbb{E}[X] = n \cdot H_n = n \sum_{j=1}^n \frac{1}{j} \approx n \ln n$$
Secretary Problem¶
Reject first $k$ candidates, then hire the next one better than all previous. Optimal threshold: $k = n/e$, giving success probability $\approx 1/e \approx 37\%$.
Reservoir Sampling¶
Stream $x_1, x_2, \ldots$ of unknown length. Keep $s = x_1$; for each new $x_n$, replace $s$ with $x_n$ with probability $1/n$. After $n$ items, each $x_k$ is stored with probability $1/n$.
Handy Equations¶
- Harmonic number: $H_n = \sum_{k=1}^n 1/k \approx \log n$
- $e = \lim_{n\to\infty}(1 + 1/n)^n$, $e^x = \lim_{n\to\infty}(1 + x/n)^n$
- Taylor series of $e^x$: $\sum_{n=0}^\infty x^n / n!$
- Common derivatives: $(x^n)' = nx^{n-1}$, $(e^x)' = e^x$, $(\log x)' = 1/x$, $(\sin x)' = \cos x$