Skip to content

Alisa's Math Notes

Distilled reference from Alisa Liu's math notes — a concise probability, statistics, and combinatorics reference aimed at machine learning practitioners. Alisa Liu is a UW NLP researcher (alisawuffles).

Distribution Quick Reference

Description Concept
Memoryless, waiting Geometric, Exponential
Trials with success & failure Bernoulli, Binomial
Large sample, average CLT, approximately Gaussian
Bound given only mean / variance Markov / Chebyshev inequalities
Estimating parameter from data MLE, MAP, Bayesian update
Update belief from evidence Bayes rule
Function of a RV Jensen's inequality
Expected count of things Linearity + indicators
Expected hitting time / return probability First-step analysis, solve recurrence

Discrete vs Continuous

Discrete Continuous
Counting Binomial: how many successes in $n$ trials? Poisson: how many events in fixed window?
Waiting Geometric: how many trials until first success? Exponential: how long until first event?

Discrete Distributions

Bernoulli

$X \sim \operatorname{Bernoulli}(p)$ models a single trial with $p$ probability of success.

$$P(X=1)=p, \quad P(X=0)=1-p$$

  • Compact PMF: $P(X=x) = p^x(1-p)^{1-x}$
  • $\mathbb{E}[X] = p$, $\operatorname{Var}[X] = p(1-p)$
  • Categorical RV generalizes Bernoulli from 2 to $k$ outcomes

Indicator squaring

For any binary variable $X \in {0,1}$, $X^2 = X$, so $\mathbb{E}[X^2] = \mathbb{E}[X]$.

Binomial

$X \sim \operatorname{Binomial}(n, p)$ is the sum of $n$ independent $\operatorname{Bernoulli}(p)$ trials.

$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$

  • $\mathbb{E}[X] = np$, $\operatorname{Var}[X] = np(1-p)$
  • Binomial theorem: $(a+b)^n = \sum_{k=0}^n \binom{n}{k} a^k b^{n-k}$

Poisson

$X \sim \operatorname{Poisson}(\lambda)$ models the number of events in a fixed interval when they occur independently at average rate $\lambda$.

$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

  • $\mathbb{E}[X] = \operatorname{Var}[X] = \lambda$
  • Derived as the binomial limit: divide interval into $n$ tiny slots, $np = \lambda$
  • Sums of independent Poisson are Poisson with rates that add

Geometric

$X \sim \operatorname{Geometric}(p)$ models the number of trials until first success (inclusive).

$$P(X=k) = (1-p)^{k-1} \cdot p$$

  • Memoryless: $P(X > m+n \mid X > m) = P(X > n)$ — the only memoryless discrete distribution
  • $\mathbb{E}[X] = 1/p$, $\operatorname{Var}(X) = (1-p)/p^2$
  • Geometric series: $\sum_{k=0}^\infty r^k = \frac{1}{1-r}$ for $|r| < 1$

Continuous Distributions

Uniform

$X \sim \operatorname{Uniform}(a, b)$ — equally likely anywhere on $[a,b]$.

  • $\mathbb{E}[X] = (a+b)/2$, $\operatorname{Var}(X) = (b-a)^2/12$
  • PDF: $f_X(x) = 1/(b-a)$ for $a \leq x \leq b$

Exponential

$X \sim \operatorname{Exp}(\lambda)$ — waiting time until next event at rate $\lambda$.

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

  • Memoryless: $P(X > s+t \mid X > s) = P(X > t)$ — the only continuous memoryless distribution
  • $\mathbb{E}[X] = 1/\lambda$, $\operatorname{Var}(X) = 1/\lambda^2$
  • Tail (survival): $P(X > x) = e^{-\lambda x}$
  • Minimum of exponentials: $\min(X_1, X_2) \sim \operatorname{Exp}(\lambda_1 + \lambda_2)$

Poisson-Exponential duality

"How many events in the next hour?" is Poisson. "How long until the next event?" is Exponential. Same underlying process.

Gaussian

$X \sim \mathcal{N}(\mu, \sigma^2)$ — standardize via $Z = (X - \mu)/\sigma \sim \mathcal{N}(0,1)$.

  • Sum of independent Gaussians is Gaussian: $X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
  • The percentile $z_\alpha$ satisfies $P(Z \leq z_\alpha) = \alpha$
  • $(1-\alpha)$ confidence interval uses endpoints $z_{\alpha/2}$ and $z_{1-\alpha/2}$

Expectations & Variance

  • Expectation: $\mathbb{E}[X] = \sum_x x \, P(X=k)$ or $\int x \, f(x) \, dx$
  • Variance: $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$
  • Linearity of expectation (regardless of independence): $\mathbb{E}[aX + bY + c] = a\mathbb{E}[X] + b\mathbb{E}[Y] + c$
  • For independent $X, Y$: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ and $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$
  • Tail formula for non-negative RVs: $\mathbb{E}[X] = \int_0^\infty P(X > x) \, dx$

Inequalities

Inequality Statement Requires
Markov $P(X \geq a) \leq \mathbb{E}[X]/a$ Non-negative RV, $a > 0$
Chebyshev $P(\lvert X - \mu \rvert \geq k\sigma) \leq 1/k^2$ Mean $\mu$, variance $\sigma^2$
Jensen $\varphi$ convex $\Rightarrow$ $\varphi(\mathbb{E}[X]) \leq \mathbb{E}[\varphi(X)]$ Convexity/concavity of $\varphi$
Cauchy-Schwarz $\mathbb{E}[XY]^2 \leq \mathbb{E}[X^2]\mathbb{E}[Y^2]$
Union bound $P(A_1 \cup \cdots \cup A_n) \leq \sum P(A_i)$

Jensen mnemonic

Use $\varphi(x) = x^2$ (convex) with $X \in {-1, +1}$: $\varphi(\mathbb{E}[X]) = 0 \leq 1 = \mathbb{E}[\varphi(X)]$.

Common Convex / Concave Functions

  • Convex ($\varphi'' \geq 0$): $x^2$, $|x|$, $a^x$ ($a > 1$), $1/x$ ($x > 0$), $\max$
  • Concave ($\varphi'' \leq 0$): $\sqrt{x}$, $\log x$, $x^p$ ($0 < p < 1$), $\min$
  • Worth memorizing: $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$

Limit Theorems

Central Limit Theorem

Mean of $n$ iid RVs with mean $\mu$ and variance $\sigma^2$ is approximately Gaussian:

$$\bar{X}n = \frac{1}{n}\sum\right)$$}^n X_i \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n

  • Standard error: $\sigma / \sqrt{n}$ — need 100x samples to reduce error 10x
  • $(1-\alpha)$ CI: $\bar{X}n \pm z$} \cdot \sigma/\sqrt{n

Law of Large Numbers

$P(|\bar{X}_n - \mu| > \epsilon) \to 0$ for any $\epsilon > 0$. Proved from Chebyshev.

Conditional Probabilities

  • Bayes rule: $P(A \mid B) = P(B \mid A) P(A) / P(B)$
  • Law of total expectation (tower property): $\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X \mid Y]]$
  • Law of total probability: $P(A) = \sum_y P(A \mid Y=y) \cdot P(Y=y)$

Combinatorics

Order matters? Replacement? Count
Yes No $n!/(n-k)!$ (permutations)
No No $\binom{n}{k}$ (combinations)
Yes Yes $n^k$
No Yes $\binom{n+k-1}{k}$ (stars and bars)
  • Multinomial: arranging $n$ objects with $n_1$ of type 1, ..., $n_m$ of type $m$: $n! / (n_1! \cdots n_m!)$
  • Inclusion-exclusion: $|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$
  • Path counting: paths from $(0,0)$ to $(m,n)$ on a grid = $\binom{m+n}{m}$

Markov Chains

  • Sequence of RVs satisfying the Markov property: future depends only on present
  • Described by transition matrix $P$ where $P_{ij} = P(X_{t+1} = j \mid X_t = i)$
  • Stationary distribution $\pi$ satisfies $\pi = \pi P$; expected return time to state $i$ is $1/\pi_i$
  • First-step analysis for expected absorption time:

$$h_i = 1 + \sum_j P_{ij} h_j$$

Symmetry shortcut

On graphs where every node looks the same (cycles, complete graphs), expected hitting time between adjacent vertices is always $n-1$.

Maximum Likelihood Estimation

Given data $x_1, \ldots, x_n$ from distribution with unknown $\theta$:

$$\hat{\theta}{\text{MLE}} = \arg\max\theta \sum_{i=1}^n \log p(x_i; \theta)$$

Distribution MLE
$\operatorname{Bernoulli}(p)$ $\hat{p} = k/n$ (sample frequency)
$\operatorname{Exp}(\lambda)$ $\hat{\lambda} = n / \sum x_i$ (reciprocal of sample mean)
$\operatorname{Uniform}(0, \theta)$ $\hat{\theta} = \max x_i$

Connection to ML training

Models trained to minimize cross-entropy loss $= -\frac{1}{N}\sum_i \log p(x_i)$ is the same as maximizing log-likelihood. Models are trained to make observed data as probable as possible.

Bias & Variance

  • Bias: $\operatorname{bias}[\hat\theta] = \mathbb{E}[\hat\theta] - \theta$
  • Variance: $\operatorname{Var}(\hat\theta) = \mathbb{E}[(\hat\theta - \mathbb{E}[\hat\theta])^2]$
  • MSE decomposition: $\operatorname{MSE}(\hat\theta) = \operatorname{Bias}(\hat\theta)^2 + \operatorname{Var}(\hat\theta)$

Classic Puzzles

Birthday Problem

How many people $n$ until $P(\text{collision}) > 50\%$? Answer: $n \approx 23$ for 365 days.

$$P(\text{all unique}) = \frac{d!}{(d-n)! \cdot d^n}$$

Gambler's Ruin

Start with \$a, win/lose \$1 with probability $p$/$1-p$. Stop at \$N or \$0. For fair game ($p = 1/2$): $P(\text{reach } N) = a/N$. Expected absorption time: $h_i = i(N-i)$.

Coupon Collector

$n$ coupon types, draw uniformly at random. Expected time to collect all:

$$\mathbb{E}[X] = n \cdot H_n = n \sum_{j=1}^n \frac{1}{j} \approx n \ln n$$

Secretary Problem

Reject first $k$ candidates, then hire the next one better than all previous. Optimal threshold: $k = n/e$, giving success probability $\approx 1/e \approx 37\%$.

Reservoir Sampling

Stream $x_1, x_2, \ldots$ of unknown length. Keep $s = x_1$; for each new $x_n$, replace $s$ with $x_n$ with probability $1/n$. After $n$ items, each $x_k$ is stored with probability $1/n$.

Handy Equations

  • Harmonic number: $H_n = \sum_{k=1}^n 1/k \approx \log n$
  • $e = \lim_{n\to\infty}(1 + 1/n)^n$, $e^x = \lim_{n\to\infty}(1 + x/n)^n$
  • Taylor series of $e^x$: $\sum_{n=0}^\infty x^n / n!$
  • Common derivatives: $(x^n)' = nx^{n-1}$, $(e^x)' = e^x$, $(\log x)' = 1/x$, $(\sin x)' = \cos x$