Alisa's Math Notes¶

Distilled reference from Alisa Liu's math notes — a concise probability, statistics, and combinatorics reference aimed at machine learning practitioners. Alisa Liu is a UW NLP researcher (alisawuffles).

Distribution Quick Reference¶

Description	Concept
Memoryless, waiting	Geometric, Exponential
Trials with success & failure	Bernoulli, Binomial
Large sample, average	CLT, approximately Gaussian
Bound given only mean / variance	Markov / Chebyshev inequalities
Estimating parameter from data	MLE, MAP, Bayesian update
Update belief from evidence	Bayes rule
Function of a RV	Jensen's inequality
Expected count of things	Linearity + indicators
Expected hitting time / return probability	First-step analysis, solve recurrence

Discrete vs Continuous¶

	Discrete	Continuous
Counting	Binomial: how many successes in $n$ trials?	Poisson: how many events in fixed window?
Waiting	Geometric: how many trials until first success?	Exponential: how long until first event?

Discrete Distributions¶

Bernoulli¶

$X \sim \operatorname{Bernoulli}(p)$ models a single trial with $p$ probability of success.

$$P(X=1)=p, \quad P(X=0)=1-p$$

Compact PMF: $P(X=x) = p^x(1-p)^{1-x}$
$\mathbb{E}[X] = p$, $\operatorname{Var}[X] = p(1-p)$
Categorical RV generalizes Bernoulli from 2 to $k$ outcomes

Indicator squaring

For any binary variable $X \in {0,1}$, $X^2 = X$, so $\mathbb{E}[X^2] = \mathbb{E}[X]$.

Binomial¶

$X \sim \operatorname{Binomial}(n, p)$ is the sum of $n$ independent $\operatorname{Bernoulli}(p)$ trials.

$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$

$\mathbb{E}[X] = np$, $\operatorname{Var}[X] = np(1-p)$
Binomial theorem: $(a+b)^n = \sum_{k=0}^n \binom{n}{k} a^k b^{n-k}$

Poisson¶

$X \sim \operatorname{Poisson}(\lambda)$ models the number of events in a fixed interval when they occur independently at average rate $\lambda$.

$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

$\mathbb{E}[X] = \operatorname{Var}[X] = \lambda$
Derived as the binomial limit: divide interval into $n$ tiny slots, $np = \lambda$
Sums of independent Poisson are Poisson with rates that add

Geometric¶

$X \sim \operatorname{Geometric}(p)$ models the number of trials until first success (inclusive).

$$P(X=k) = (1-p)^{k-1} \cdot p$$

Memoryless: $P(X > m+n \mid X > m) = P(X > n)$ — the only memoryless discrete distribution
$\mathbb{E}[X] = 1/p$, $\operatorname{Var}(X) = (1-p)/p^2$
Geometric series: $\sum_{k=0}^\infty r^k = \frac{1}{1-r}$ for $|r| < 1$

Continuous Distributions¶

Uniform¶

$X \sim \operatorname{Uniform}(a, b)$ — equally likely anywhere on $[a,b]$.

$\mathbb{E}[X] = (a+b)/2$, $\operatorname{Var}(X) = (b-a)^2/12$
PDF: $f_X(x) = 1/(b-a)$ for $a \leq x \leq b$

Exponential¶

$X \sim \operatorname{Exp}(\lambda)$ — waiting time until next event at rate $\lambda$.

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

Memoryless: $P(X > s+t \mid X > s) = P(X > t)$ — the only continuous memoryless distribution
$\mathbb{E}[X] = 1/\lambda$, $\operatorname{Var}(X) = 1/\lambda^2$
Tail (survival): $P(X > x) = e^{-\lambda x}$
Minimum of exponentials: $\min(X_1, X_2) \sim \operatorname{Exp}(\lambda_1 + \lambda_2)$

Poisson-Exponential duality

"How many events in the next hour?" is Poisson. "How long until the next event?" is Exponential. Same underlying process.

Gaussian¶

$X \sim \mathcal{N}(\mu, \sigma^2)$ — standardize via $Z = (X - \mu)/\sigma \sim \mathcal{N}(0,1)$.

Sum of independent Gaussians is Gaussian: $X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
The percentile $z_\alpha$ satisfies $P(Z \leq z_\alpha) = \alpha$
$(1-\alpha)$ confidence interval uses endpoints $z_{\alpha/2}$ and $z_{1-\alpha/2}$

Expectations & Variance¶

Expectation: $\mathbb{E}[X] = \sum_x x \, P(X=k)$ or $\int x \, f(x) \, dx$
Variance: $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$
Linearity of expectation (regardless of independence): $\mathbb{E}[aX + bY + c] = a\mathbb{E}[X] + b\mathbb{E}[Y] + c$
For independent $X, Y$: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ and $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$
Tail formula for non-negative RVs: $\mathbb{E}[X] = \int_0^\infty P(X > x) \, dx$

Inequalities¶

Inequality	Statement	Requires
Markov	$P(X \geq a) \leq \mathbb{E}[X]/a$	Non-negative RV, $a > 0$
Chebyshev	$P(\lvert X - \mu \rvert \geq k\sigma) \leq 1/k^2$	Mean $\mu$, variance $\sigma^2$
Jensen	$\varphi$ convex $\Rightarrow$ $\varphi(\mathbb{E}[X]) \leq \mathbb{E}[\varphi(X)]$	Convexity/concavity of $\varphi$
Cauchy-Schwarz	$\mathbb{E}[XY]^2 \leq \mathbb{E}[X^2]\mathbb{E}[Y^2]$	—
Union bound	$P(A_1 \cup \cdots \cup A_n) \leq \sum P(A_i)$	—

Jensen mnemonic

Use $\varphi(x) = x^2$ (convex) with $X \in {-1, +1}$: $\varphi(\mathbb{E}[X]) = 0 \leq 1 = \mathbb{E}[\varphi(X)]$.

Common Convex / Concave Functions¶

Convex ($\varphi'' \geq 0$): $x^2$, $|x|$, $a^x$ ($a > 1$), $1/x$ ($x > 0$), $\max$
Concave ($\varphi'' \leq 0$): $\sqrt{x}$, $\log x$, $x^p$ ($0 < p < 1$), $\min$
Worth memorizing: $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$

Limit Theorems¶

Central Limit Theorem¶

Mean of $n$ iid RVs with mean $\mu$ and variance $\sigma^2$ is approximately Gaussian:

$$\bar{X}n = \frac{1}{n}\sum\right)$$}^n X_i \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n

Standard error: $\sigma / \sqrt{n}$ — need 100x samples to reduce error 10x
$(1-\alpha)$ CI: $\bar{X}n \pm z$} \cdot \sigma/\sqrt{n

Law of Large Numbers¶

$P(|\bar{X}_n - \mu| > \epsilon) \to 0$ for any $\epsilon > 0$. Proved from Chebyshev.

Conditional Probabilities¶

Bayes rule: $P(A \mid B) = P(B \mid A) P(A) / P(B)$
Law of total expectation (tower property): $\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X \mid Y]]$
Law of total probability: $P(A) = \sum_y P(A \mid Y=y) \cdot P(Y=y)$

Combinatorics¶

Order matters?	Replacement?	Count
Yes	No	$n!/(n-k)!$ (permutations)
No	No	$\binom{n}{k}$ (combinations)
Yes	Yes	$n^k$
No	Yes	$\binom{n+k-1}{k}$ (stars and bars)

Multinomial: arranging $n$ objects with $n_1$ of type 1, ..., $n_m$ of type $m$: $n! / (n_1! \cdots n_m!)$
Inclusion-exclusion: $|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$
Path counting: paths from $(0,0)$ to $(m,n)$ on a grid = $\binom{m+n}{m}$

Markov Chains¶

Sequence of RVs satisfying the Markov property: future depends only on present
Described by transition matrix $P$ where $P_{ij} = P(X_{t+1} = j \mid X_t = i)$
Stationary distribution $\pi$ satisfies $\pi = \pi P$; expected return time to state $i$ is $1/\pi_i$
First-step analysis for expected absorption time:

$$h_i = 1 + \sum_j P_{ij} h_j$$

Symmetry shortcut

On graphs where every node looks the same (cycles, complete graphs), expected hitting time between adjacent vertices is always $n-1$.

Maximum Likelihood Estimation¶

Given data $x_1, \ldots, x_n$ from distribution with unknown $\theta$:

$$\hat{\theta}{\text{MLE}} = \arg\max\theta \sum_{i=1}^n \log p(x_i; \theta)$$

Distribution	MLE
$\operatorname{Bernoulli}(p)$	$\hat{p} = k/n$ (sample frequency)
$\operatorname{Exp}(\lambda)$	$\hat{\lambda} = n / \sum x_i$ (reciprocal of sample mean)
$\operatorname{Uniform}(0, \theta)$	$\hat{\theta} = \max x_i$

Connection to ML training

Models trained to minimize cross-entropy loss $= -\frac{1}{N}\sum_i \log p(x_i)$ is the same as maximizing log-likelihood. Models are trained to make observed data as probable as possible.

Bias & Variance¶

Bias: $\operatorname{bias}[\hat\theta] = \mathbb{E}[\hat\theta] - \theta$
Variance: $\operatorname{Var}(\hat\theta) = \mathbb{E}[(\hat\theta - \mathbb{E}[\hat\theta])^2]$
MSE decomposition: $\operatorname{MSE}(\hat\theta) = \operatorname{Bias}(\hat\theta)^2 + \operatorname{Var}(\hat\theta)$

Classic Puzzles¶

Birthday Problem¶

How many people $n$ until $P(\text{collision}) > 50\%$? Answer: $n \approx 23$ for 365 days.

$$P(\text{all unique}) = \frac{d!}{(d-n)! \cdot d^n}$$

Gambler's Ruin¶

Start with \$a, win/lose \$1 with probability $p$/$1-p$. Stop at \$N or \$0. For fair game ($p = 1/2$): $P(\text{reach } N) = a/N$. Expected absorption time: $h_i = i(N-i)$.

Coupon Collector¶

$n$ coupon types, draw uniformly at random. Expected time to collect all:

$$\mathbb{E}[X] = n \cdot H_n = n \sum_{j=1}^n \frac{1}{j} \approx n \ln n$$

Secretary Problem¶

Reject first $k$ candidates, then hire the next one better than all previous. Optimal threshold: $k = n/e$, giving success probability $\approx 1/e \approx 37\%$.

Reservoir Sampling¶

Stream $x_1, x_2, \ldots$ of unknown length. Keep $s = x_1$; for each new $x_n$, replace $s$ with $x_n$ with probability $1/n$. After $n$ items, each $x_k$ is stored with probability $1/n$.

Handy Equations¶

Harmonic number: $H_n = \sum_{k=1}^n 1/k \approx \log n$
$e = \lim_{n\to\infty}(1 + 1/n)^n$, $e^x = \lim_{n\to\infty}(1 + x/n)^n$
Taylor series of $e^x$: $\sum_{n=0}^\infty x^n / n!$
Common derivatives: $(x^n)' = nx^{n-1}$, $(e^x)' = e^x$, $(\log x)' = 1/x$, $(\sin x)' = \cos x$