11/03/2022, Southampton

General Info

Where to find all the information: Reading Group Website

If anyone wants to help me maintain the website, please let me know.

Check out the following items frequently:

  • Timetable
  • Materials

Got questions while reading the lecture notes? Check out:

  • Supplementary References

Presenter and Discussant

One presenter and one discussant each session. Refer to the website for a description of the roles.

Tips about the reading group by Percy Liang, including

  • leading discussions / giving presentations
  • following presentations
  • benefitting from talks that you can’t follow entirely

No rush policy:

  • please aim to finish in 1 hour, but no worries if you are behind. No need to rush to the end; we can run 10-20 minutes over every now and again.

Introduction of the introduction

Classic results

Let \(X_1, \dots, X_n \in \mathbb{R}\) with \(X_i\overset{i.i.d.}{\sim} F\), where the distribution is characterised by the (usually unknown) c.d.f. \[F(x)=P(X\le x).\]

Consider the empirical distribution function (e.d.f., also written e.c.d.f.) constructed from the data: \[ \mathbb{F}_n(x):=\dfrac{1}{n}\sum_{i=1}^n 1_{(-\infty, x]}(X_i) = \dfrac{1}{n}\sum_{i=1}^n 1(X_i\le x) \]
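As a quick illustration (a minimal Python/NumPy sketch, not part of the original notes), the e.d.f. can be computed by sorting the sample and counting via binary search:

```python
import numpy as np

def ecdf(samples):
    """Return x -> F_n(x), the e.d.f. of `samples` (an illustrative helper)."""
    samples = np.sort(np.asarray(samples))
    n = len(samples)
    # F_n(x) = (1/n) * #{i : X_i <= x}; searchsorted counts by binary search
    return lambda x: np.searchsorted(samples, x, side="right") / n

rng = np.random.default_rng(0)
Fn = ecdf(rng.normal(size=1000))
print(Fn(0.0))  # close to F(0) = 0.5 for standard normal data
```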

Uniform Results

  • Pointwise convergence (for each fixed \(x\)):

SLLN: \(\quad\mathbb{F}_n(x) \overset{a.s.}{\rightarrow} F(x)\quad\)

CLT: \(\quad\mathbb{G}_n(x):=\sqrt{n}(\mathbb{F}_n(x) - F(x)) \overset{d}{\rightarrow} N(0, F(x)(1-F(x)))\quad\)

The core of empirical process theory is developing the following results:

  • Uniform convergence (holds for all \(x\) simultaneously)

Thm 1.1 (Glivenko-Cantelli): \(\|\mathbb{F}_n-F\|_\infty:=\sup_{x}|\mathbb{F}_n(x)-F(x)|\overset{a.s.}{\rightarrow}0.\)

Thm 1.2 (Donsker): \(\mathbb{G}_n(x) \overset{d}{\rightarrow} \mathbb{U}\big(F(x)\big)\), where \(\mathbb{U}\) is a Brownian bridge (the convergence here is as processes, uniformly in \(x\)).
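A small simulation (a sketch, assuming standard normal data and using SciPy for \(F\)) makes the Glivenko-Cantelli theorem concrete: the sup-distance, attained at the order statistics, shrinks roughly like \(1/\sqrt{n}\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
for n in [100, 1000, 10000]:
    X = np.sort(rng.normal(size=n))
    F = norm.cdf(X)
    i = np.arange(1, n + 1)
    # sup_x |F_n(x) - F(x)| is attained at the order statistics: compare
    # F_n(X_(i)) = i/n and the left limit (i-1)/n against F(X_(i))
    d = max(np.max(i / n - F), np.max(F - (i - 1) / n))
    print(n, d)  # decreases toward 0, roughly at rate 1/sqrt(n)
```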

Review: Pointwise vs. Uniform

Uniform results are stronger than pointwise results, as they control the convergence speed uniformly over the index.

A typical example is \(f_n(x)=x^n\), \(x\in [0,1]\).

It converges pointwise to \(f(x)=0\) for \(x\in[0,1)\) and \(f(1)=1\).

Note this pointwise limit \(f\) is not continuous; since a uniform limit of continuous functions must be continuous, the convergence cannot be uniform.
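Numerically (a tiny sketch): for every \(n\) there is a point \(x_n=0.5^{1/n}\in[0,1)\) with \(f_n(x_n)=0.5\), so \(\sup_{x\in[0,1)}|f_n(x)-f(x)|\) never drops below \(1/2\):

```python
# For each n, x_n = 0.5**(1/n) lies in [0, 1) and satisfies x_n**n = 0.5,
# so the sup-distance to the pointwise limit stays at least 1/2: not uniform.
for n in [1, 10, 100, 1000]:
    x_n = 0.5 ** (1 / n)
    print(n, x_n, x_n ** n)  # x_n -> 1 while f_n(x_n) stays at 0.5
```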

Review: Convergence of Random Variables

Let \(X, X_1,X_2,\dots\) be random variables such that \(X\sim F\) and \(X_i\sim F_i\). Then \(X_n\overset{a.s.}{\rightarrow} X \implies X_n\overset{p}{\rightarrow} X \implies X_n\overset{d}{\rightarrow} X\), where

  • converges almost surely (almost everywhere / with probability 1): \(X_n\overset{a.s.}{\rightarrow} X\), if: \[ \mathbb{P}\big(\lim_{n\rightarrow\infty} X_n=X\big)=1 \]
  • converges in probability: \(X_n\overset{p}{\rightarrow} X\), if \[ \lim_{n\rightarrow\infty} \mathbb{P}\big(|X_n-X|>\varepsilon\big)=0 \text{ for all } \varepsilon>0 \]
  • converges in distribution (weak convergence): \(X_n\overset{d}{\rightarrow} X\), if \[\lim_{n\rightarrow\infty}F_n(x)=F(x) \text{ for all continuity points } x \text{ of } F\]

Review: Stochastic processes

Recall a stochastic process is a (possibly uncountably infinite) collection of random variables, usually admitting a certain dependence structure:

  • Markov Chain: \(\{X_{1}, X_{2}, X_3\dots\}\)
  • Markov process: \(\{X_t\}_{t>0}\)

Other examples: Gaussian process/ diffusion process/ Wiener process (Brownian motion) …

This collection of random variables is indexed by some mathematical set \(\mathcal{S}\) — e.g. \(\mathcal{S}=\mathbb{R}\), some \(d\)-dimensional Euclidean space, or some functional space.

You will hear the terminology “a process … indexed by \(\mathcal{S}\)” throughout this reading group.

Empirical processes (Version 1)

The simplest empirical process, from the previous example (indexed by \(\mathbb{R}\)), is defined as \[ \mathbb{G}_n(x)=\sqrt{n}\big(\mathbb{F}_n(x) - F(x)\big) \]
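One realisation of \(\mathbb{G}_n\) can be simulated directly (a sketch, assuming \(F\) is the standard normal c.d.f. via SciPy):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 500
X = np.sort(rng.normal(size=n))
grid = np.linspace(-3, 3, 7)
Fn = np.searchsorted(X, grid, side="right") / n   # e.d.f. on a coarse grid
Gn = np.sqrt(n) * (Fn - norm.cdf(grid))           # empirical process values
print(Gn)  # one sample path of G_n, evaluated on the grid
```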

1.1 A more general notation (as in functional space)

Review: Measures and Distributions

A (probability) measure \(\mu\) on some space \(\mathcal{X}\) is a map \(\mathcal{B}\rightarrow\mathbb{R}\) (with \(\mathcal{B}\) a \(\sigma\)-algebra of subsets of \(\mathcal{X}\)), and can be thought of as a measurement of the size of the sets in \(\mathcal{X}\), e.g.:

\(\mu(A)=a\) says the volume of set \(A\) under measure \(\mu\) is \(a\).

Lebesgue integration (Real Analysis): \(\int f d\mu=\lim_{n\rightarrow\infty}\int s_n d\mu\), where each \(s_n\) is a simple function (a finite linear combination of indicator functions) and the sequence \(s_n\) increases to \(f\).

A probability measure can be naturally associated with a distribution \(F\): if \(\mathcal{X}=\mathbb{R}\), let \(F(x)=\mu\big((-\infty, x]\big)\). From now on, we use measures in place of distributions.

Let’s say we have \(X_1, \dots, X_n\in \mathcal{X}\) drawn from a measure \(P\), from which we can construct the empirical measure \(\mathbb{P}_n\):

\(\mathbb{P}_n:=n^{-1} \sum_{i=1}^n\delta_{X_i},\quad\) where \(\delta_{x}(A)=1_A(x)\) (the Dirac measure).

Empirical Process (Version 2)

For a function \(f: \mathcal{X}\rightarrow \mathbb{R}\), we can write \[ \mathbb{P}_nf=\int f\,d\mathbb{P}_n=\dfrac{1}{n}\sum_{i=1}^n f(X_i) \] and \[ Pf=\int f\,dP \] For a collection of functions \(\mathcal{F}\), we can define the empirical process indexed by \(\mathcal{F}\): \(\{\mathbb{G}_nf:\,f\in\mathcal{F}\}\), where: \[ \mathbb{G}_nf:=\sqrt{n}(\mathbb{P}_nf-Pf) \]

Version 1 & 2 Comparison

|                        | Version 1        | Version 2               |
|------------------------|------------------|-------------------------|
| sample space           | \(\mathbb{R}\)   | general \(\mathcal{X}\) |
| distribution           | \(F\)            | \(P\)                   |
| empirical distribution | \(\mathbb{F}_n\) | \(\mathbb{P}_n\)        |
| operates on            | \(x\)            | \(f\)                   |
| index set              | \(\mathbb{R}\)   | \(\mathcal{F}\)         |

Version 1 can be regarded as a special case of Version 2, e.g., let \[f(\cdot)=1_{(-\infty,x]}(\cdot), \quad x\in \mathbb{R},\] then \(Pf=F(x)\) and \(\mathbb{P}_nf=\mathbb{F}_n(x)\).
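This special case is easy to check numerically (a sketch, assuming \(P=\text{Exp}(1)\) for concreteness):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.exponential(size=10_000)  # draws from P = Exp(1)

def Pn(f):
    # empirical measure applied to f: P_n f = (1/n) * sum_i f(X_i)
    return np.mean(f(X))

# f = 1_{(-inf, x]} recovers Version 1: P_n f = F_n(x) and P f = F(x)
x = 1.0
f = lambda t: (t <= x).astype(float)
print(Pn(f), 1 - np.exp(-x))  # F_n(1) vs. F(1) = 1 - e^{-1}
```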

Empirical process theory

We want to study uniform (over \(\mathcal{F}\)) approximation of \(Pf\) by \(\mathbb{P}_nf\), e.g., \[ \|\mathbb{P}_n-P\|_{\mathcal{F}}:=\sup_{f\in\mathcal{F}}|\mathbb{P}_nf-Pf| \quad\text{and}\quad \{\mathbb{G}_nf: f\in \mathcal{F}\} \]

Thm 1.1* (Glivenko-Cantelli): \(\|\mathbb{P}_n-P\|_\mathcal{F}\overset{a.s.}{\rightarrow}0.\) (Chap. 3)

Thm 1.2* (Donsker): \(\{\mathbb{G}_nf: f\in\mathcal{F}\}\) converges to some limiting process. (Chap. 11)

  • closely related to the complexity (richness) of the class (functional space) \(\mathcal{F}\) (Chap. 2, 3, 4, 7, 11)

  • useful in studying finite-sample concentration inequalities (Talagrand’s inequality, Chap. 8, 13)

1.2 Example: M-estimation (empirical risk minimisation)

M-estimation

Let \(X, X_1,\dots, X_n \in \mathcal{X}\overset{i.i.d.}\sim P\), and consider the estimator \[ \hat\theta_n:= \arg\max_{\theta\in\Theta} n^{-1}\sum_{i=1}^n m_\theta(X_i)= \arg\max_{\theta\in\Theta} \mathbb{P}_n(m_\theta) \] where \(-m_\theta\) is a loss function. Applications include:

  • MLE: \(m_\theta(x)=\log p_\theta(x)\)
  • Median/Mode estimator: \(m_\theta(x)=-|x-\theta|\) (median) or \(m_\theta(x)=1\{|x-\theta|\le 1\}\) (mode)
  • Regression: \(m_\theta(x,y)=-(y-\theta(x))^2\)

Parameter of interest: \(\theta_0=\arg\max_{\theta\in\Theta} {P}(m_\theta)\)
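For example (a minimal sketch, not from the notes): taking \(m_\theta(x)=-|x-\theta|\) and maximising \(\mathbb{P}_n(m_\theta)\) numerically recovers the sample median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
X = rng.standard_t(df=3, size=501)

# maximising P_n(m_theta) with m_theta(x) = -|x - theta| is the same as
# minimising the empirical risk theta -> (1/n) * sum_i |X_i - theta|
emp_risk = lambda theta: np.mean(np.abs(X - theta))
theta_hat = minimize_scalar(emp_risk).x
print(theta_hat, np.median(X))  # the M-estimator matches the sample median
```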

Uniform closeness

If the two processes \(\mathbb{P}_n(m_\theta)\) and \(P(m_\theta)\) (both indexed by \(\Theta\)) are uniformly close, then it is plausible that their respective maximisers \(\hat\theta_n\) and \(\theta_0\) are also close.

Therefore we are interested in \[\sup_{\theta\in\Theta} |(\mathbb{P}_n-P)(m_\theta)| \]

Example 1.3 (consistency)

Theorem 1.3. \(d(\hat\theta_n, \theta_0)\overset{p}\rightarrow 0\) if we assume the class \(\mathcal{F}:=\{m_\theta(\cdot): \theta\in\Theta\}\) is P-GC and \(\theta_0\) is well-separated (see below).

Proof sketch: We write \[ \mathbb{M}_n(\theta):=\mathbb{P}_n(m_\theta) \quad\text{and}\quad M(\theta)=P(m_\theta) \] For any \(\delta>0\), define \(O(\theta_0, \delta)=\{\theta\in \Theta; d(\theta, \theta_0)\ge \delta\}\). Let \[\psi(\delta):=M(\theta_0)-\sup_{\theta\in O(\theta_0, \delta)} M(\theta).\] Therefore \[ \mathbb{P}\big(d(\hat\theta_n, \theta_0)>\delta\big) \le \mathbb{P}\bigg(\sup_{\theta\in O(\theta_0, \delta)} \mathbb{M}_n(\theta)\ge \mathbb{M}_n(\theta_0) \bigg) \]

Example 1.3 (II)

\[ \begin{align*} \le&\mathbb{P}\bigg( \sup_{\theta\in O(\theta_0, \delta)} \{\mathbb{M}_n(\theta) -M(\theta)\} - (\mathbb{M}_n(\theta_0)-M(\theta_0))\ge {M}(\theta_0)-\sup_{\theta\in O(\theta_0, \delta)}M(\theta) \bigg)\\ \le &\mathbb{P}\bigg( 2\sup_{\theta\in \Theta} |\mathbb{M}_n(\theta) -M(\theta)|\ge \psi(\delta) \bigg) \end{align*} \]

If \(\theta_0\) is well-separated, i.e., \(\psi(\delta)>0\) for every \(\delta>0\), the probability above vanishes as \(n\rightarrow \infty\) by the P-GC assumption.

Example 1.4 (Classification)

Suppose \((Z, Y), (Z_1, Y_1), (Z_2, Y_2), \dots\) are i.i.d. observations from a classification problem, where \(Y, Y_1, \dots \in \{-1, +1\}\). Consider classifiers \(g\) from a class \(\mathcal{C}\), and define the loss (error) and empirical loss: \[ L(g):=\mathbb{P}(g(Z)\neq Y)\] \[L_n(g):=n^{-1}\sum_{i=1}^n1\big(g(Z_i)\neq Y_i\big)=\mathbb{P}_n 1\big(g(Z)\neq Y\big) \] Let \[ \hat g_n:=\arg\min_{g\in\mathcal{C}}L_n(g), \quad \text{and}\quad g^\ast:=\arg\min_{g\in\mathcal{C}}L(g) \] We want to know whether the loss \(L(\hat g_n)\) approximates \(L(g^\ast)\)

Example 1.4 (II)

Note \[ \begin{align*} L(\hat g_n) &=L(\hat g_n) + L(g^\ast) + L_n(\hat g_n)- L_n(\hat g_n)- L(g^\ast)\\ &\le L(\hat g_n) + L(g^\ast) + L_n(g^\ast) - L_n(\hat g_n) -L(g^\ast)\\ &\le L(g^\ast) +2\sup_{g\in\mathcal{C}}|(L_n-L)g| \end{align*} \] Again, if we know the class \(\{(z,y)\mapsto 1(g(z)\neq y): g\in\mathcal{C}\}\) is P-GC, then \(\sup_{g\in\mathcal{C}}|(L_n-L)g|\overset{a.s.}\rightarrow 0\), so \(L(\hat g_n)\) well-approximates \(L(g^\ast)\).

The details of these techniques will be illustrated in later sections.
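A toy simulation of \(\hat g_n\) (a sketch with a hypothetical class of threshold classifiers, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
Z = rng.normal(size=n)
Y = np.where(Z + 0.5 * rng.normal(size=n) > 0, 1, -1)  # noisy labels

# class C: threshold classifiers g_c(z) = sign(z - c), indexed by c
cs = np.linspace(-2.0, 2.0, 201)
emp_loss = np.array([(np.sign(Z - c) != Y).mean() for c in cs])
c_hat = cs[np.argmin(emp_loss)]
print(c_hat, emp_loss.min())  # \hat g_n chosen by empirical risk minimisation
```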

1.3 Goodness-of-Fit Testing - another motivating example

Goodness-of-fit test

Consider data \(X_1, \dots, X_n\) from an (unknown) distribution \(F\); we would like to test \[H_0: F=F_0 \text{ against } H_1: F\neq F_0\]

  • Kolmogorov-Smirnov test: \(D_n=\sqrt{n}\sup_x|\mathbb{F}_n(x)-F_0(x)|\)

  • Cramér-von Mises test: \(W_n=n\int[\mathbb{F}_n(x)-F_0(x)]^2dF_0(x)\)

Theorem 1.4 (HW): The null distribution of \(D_n\) (resp. \(W_n\)), i.e., its distribution under \(H_0\), is the same for any continuous \(F_0\).

Proof sketch: Under \(H_0\), the variables \(U_i:=F_0(X_i)\) are \(\text{Unif}(0,1)\); let \(H(y)=y\) be their c.d.f. and \(\mathbb{H}_n\) their e.d.f. Note \(F_0(x)=H(F_0(x))\) and \(\mathbb{F}_n(x)=\mathbb{H}_n(F_0(x))\). Since \(F_0\) is continuous, its range covers \((0,1)\), so \[ \sup_x|\mathbb{F}_n(x)-F_0(x)|=\sup_x|\mathbb{H}_n(F_0(x))-H(F_0(x))|=\sup_y|\mathbb{H}_n(y)-H(y)| \]
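Theorem 1.4 can be sanity-checked by simulation (a sketch: the simulated null laws of \(D_n\) for two different continuous \(F_0\)'s nearly coincide):

```python
import numpy as np
from scipy.stats import expon, norm

rng = np.random.default_rng(6)

def ks_stat(X, F0):
    X = np.sort(X)
    n = len(X)
    i = np.arange(1, n + 1)
    F = F0(X)
    # D_n = sqrt(n) * sup_x |F_n(x) - F0(x)|, attained at order statistics
    return np.sqrt(n) * max(np.max(i / n - F), np.max(F - (i - 1) / n))

d_norm = [ks_stat(rng.normal(size=50), norm.cdf) for _ in range(2000)]
d_expo = [ks_stat(rng.exponential(size=50), expon.cdf) for _ in range(2000)]
print(np.quantile(d_norm, [0.5, 0.95]))
print(np.quantile(d_expo, [0.5, 0.95]))  # nearly identical, as Thm 1.4 predicts
```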

Uniform Empirical Process

As a result, we can safely take \(F_0\) to be the \(\text{Unif}(0,1)\) c.d.f., i.e., study the null distribution of \(D_n\) using \(X_1,\dots, X_n\overset{i.i.d.}\sim\text{Unif}(0,1)\).

Define the uniform empirical process \(\mathbb{U}_n(t):= \sqrt{n}(\mathbb{H}_n(t)-t)\), where \(\mathbb{H}_n(t)\) is constructed from \(\xi_1, \dots, \xi_n\overset{i.i.d}\sim \text{Unif}(0,1)\): \[ \mathbb{H}_n(t):=\dfrac{1}{n}\sum_{i=1}^n 1(\xi_i\le t). \]

Theorem 1.5. \(\mathbb{U}_n\overset{d}{\rightarrow} \mathbb{U}\) (as processes). By the continuous mapping theorem we have \[\sup_t\mathbb{U}_n(t)\overset{d}{\rightarrow} \sup_t\mathbb{U}(t)\]

Review: Brownian Bridge

A Brownian Bridge \(\{\mathbb{U}(t): t\in [0,1]\}\) is a random process such that

  1. \(\mathbb{U}(0)=\mathbb{U}(1)=0\)
  2. for every \(t_1, t_2, \dots t_k\in (0,1)\), we have: \[\big(\mathbb{U}(t_1),\dots, \mathbb{U}(t_k)\big)^T\sim N_k(0, \Sigma), \text{ where } \Sigma_{ij}=\min(t_i, t_j)-t_i t_j.\]
  3. mapping \(t\rightarrow \mathbb{U}(t)\) is continuous on \([0,1]\)
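A simulation sketch of Theorem 1.5 (assuming the standard discretised Wiener construction \(\mathbb{U}(t)=W(t)-tW(1)\) of the bridge defined above): the sup of \(|\mathbb{U}_n|\) and the sup of \(|\mathbb{U}|\) have nearly the same distribution for moderate \(n\):

```python
import numpy as np

rng = np.random.default_rng(7)
k = 1000
t = np.arange(1, k + 1) / k  # grid on (0, 1]

def brownian_bridge():
    # U(t) = W(t) - t * W(1), with W a discretised Wiener process
    W = np.cumsum(rng.normal(scale=np.sqrt(1 / k), size=k))
    return W - t * W[-1]

def uniform_emp_process(n):
    xi = rng.uniform(size=n)
    Hn = np.searchsorted(np.sort(xi), t, side="right") / n
    return np.sqrt(n) * (Hn - t)  # U_n evaluated on the grid

sup_U = [np.abs(brownian_bridge()).max() for _ in range(2000)]
sup_Un = [np.abs(uniform_emp_process(200)).max() for _ in range(2000)]
print(np.quantile(sup_U, 0.95), np.quantile(sup_Un, 0.95))  # both near 1.36
```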

1.4 Asymptotic equicontinuity

Asymptotic equicontinuity

Let \(\{Z_n(f) : f \in \mathcal{F}\}\) be a stochastic process indexed by \(\mathcal{F}\). We say \(\{Z_n\}\) is asymptotically (or stochastically) equicontinuous at \(f_0\) if for each \(\eta > 0\) and \(\varepsilon > 0\), there exists a neighbourhood \(V\) of \(f_0\) for which \[ \limsup_{n\rightarrow\infty}\mathbb{P}(\sup_{f\in V}|Z_n(f)-Z_n(f_0)|>\eta)<\varepsilon \] Recall that in mathematical analysis, a family of functions is equicontinuous if all its members are continuous and their variation over any given small neighbourhood can be controlled uniformly across the family.

An equivalent characterisation of asymptotic equicontinuity:

Theorem 1.6 (HW). If a sequence \(\{\hat{f}_n\}_{n\ge 1} \subset \mathcal{F}\) such that \(\hat{f}_n\overset{p}\rightarrow f_0\), and \(\{Z_n(f) : f \in \mathcal{F}\}\) is asymptotically equicontinuous at \(f_0\), then \[ Z_n(\hat{f}_n)-Z_n(f_0)=o_p(1). \]

Asymptotic equicontinuity (continued)

Proof sketch: \[ \begin{align*} \mathbb{P}(|Z_n(\hat f_n)-Z_n(f_0)|>\eta)\le &\mathbb{P}\big(|Z_n(\hat f_n)-Z_n(f_0)|>\eta, \hat f_n\in V\big) + \mathbb{P}(\hat f_n\notin V)\\ &\le \mathbb{P}\big(\sup_{f\in V}|Z_n(f)-Z_n(f_0)|>\eta\big) + \mathbb{P}(\hat f_n\notin V) \end{align*} \] Taking the limsup, and using asymptotic equicontinuity together with \(\hat f_n\overset{p}\rightarrow f_0\), completes the proof.

  • Empirical process theory offers very efficient methods for establishing the asymptotic equicontinuity of \(\mathbb{G}_n\) over \(\mathcal{F}\).

  • For example, if \(\mathcal{F}\) is a VC class (we will see this in Chap. 7) with a square-integrable envelope, asymptotic equicontinuity holds.

Example 1.5 Sample mean absolute deviation

Suppose \(X, X_1,X_2, \dots\) are i.i.d. \(P\) with c.d.f. \(G\) admitting density \(g\), and write \(\mu:=\mathbb{E}X\). Consider \[M_n:=\mathbb{P}_n|X-\bar{X}_n|= \dfrac{1}{n}\sum_{i=1}^n|X_i-\bar{X}_n|. \] We want to study \(\lim_{n\rightarrow\infty} M_n\) and the asymptotic distribution of \(M_n\).

Theorem 1.7. \(M_n\overset{a.s.}\rightarrow M:= P|X-\mu|\), and \(\sqrt{n}(M_n-M)\) is asymptotically normally distributed.

Proof sketch: The summands \(|X_i-\bar{X}_n|\) are dependent (through \(\bar{X}_n\)), so we cannot directly apply the SLLN and CLT. Instead, consider the random function \[\mathbb{M}_n(t)=\mathbb{P}_n|X-t|, \text{ where } t \in (\mu-\delta, \mu+\delta).\]

Example 1.5 Almost sure convergence

Note \(M_n=\mathbb{M}_n(\bar{X}_n)\). Define the function class \(\mathcal{F}=\{f_t(x)=|x-t|:\,\,t \in (\mu-\delta, \mu+\delta)\}\). Then: \[ \begin{align*} M_n-M = &\mathbb{P}_n f_{\bar{X}_n}-P f_\mu\\ = &(\mathbb{P}_n-P)f_{\bar{X}_n} + [P(f_{\bar{X}_n}-f_\mu)] \end{align*} \]

Note, by the Glivenko-Cantelli theorem (assuming \(\mathcal{F}\) is \(P\)-GC, and noting \(\bar{X}_n\in(\mu-\delta, \mu+\delta)\) eventually a.s.), \[|(\mathbb{P}_n-P)f_{\bar{X}_n}| \le \sup_{f\in\mathcal{F}}|(\mathbb{P}_n-P)f|\overset{a.s.}\rightarrow 0.\]

Moreover, by the triangle inequality \(|f_{\bar{X}_n}-f_\mu|\le |\bar{X}_n-\mu|\), so \[|P(f_{\bar{X}_n}-f_\mu)|\le |\bar{X}_n-\mu|\overset{a.s.}\rightarrow 0,\] which implies \(M_n \overset{a.s.}\rightarrow M\).

Example 1.5 Weak convergence (I.)

\[ \begin{align*} \sqrt{n}(M_n-M) = &\sqrt{n}\big(\mathbb{P}_n f_{\bar{X}_n} -P f_\mu\big)\\ = &\sqrt{n}(\mathbb{P}_n-P)f_{\mu} + \sqrt{n}[\mathbb{P}_n(f_{\bar{X}_n}-f_\mu)]\\ = & \sqrt{n}(\mathbb{P}_n-P)f_{\mu} + \sqrt{n}[(\mathbb{P}_n-P)(f_{\bar{X}_n}-f_\mu)] + \sqrt{n}P(f_{\bar{X}_n}-f_\mu)\\ = & \mathbb{G}_n f_{\mu} + \mathbb{G}_n (f_{\bar{X}_n}-f_{\mu}) + \sqrt{n} P(f_{\bar{X}_n} -f_\mu) \end{align*} \] If we assume \(\mathbb{G}_n\) is asymptotically equicontinuous at \(f_\mu\), then by Theorem 1.6 (since \(f_{\bar{X}_n}\overset{p}\rightarrow f_\mu\)) \[\mathbb{G}_n (f_{\bar{X}_n}-f_{\mu}) = o_P(1).\]

Moreover, note \(\psi(t):=P(f_t)=\int f_t dP =\mathbb{E}\big(f_t(X)\big)=\mathbb{E}|X-t|\).

Example 1.5 Weak convergence (II.)

By a simple calculation, we know \[ \begin{align*} \mathbb{E}|X-t|=& \int|x-t|g(x)dx\\ =&\int_{-\infty}^t (t-x)g(x) dx + \int_{t}^{\infty} (x-t)g(x) dx\\ =&\int_{-\infty}^t (t-x)g(x) dx + \int_{-\infty}^{\infty} (x-t)g(x) dx - \int_{-\infty}^{t} (x-t)g(x) dx\\ =& 2tG(t)-2\int_{-\infty}^t x g(x)dx + \mu -t, \end{align*} \] and \[ \psi'(t)= 2G(t) + 2t g(t) -2t g(t) -1 = 2G(t) -1. \]

Example 1.5 Weak convergence (III.)

As a result: \[ \begin{align*} \qquad\qquad&\mathbb{G}_n f_{\mu} + \mathbb{G}_n (f_{\bar{X}_n}-f_{\mu}) + \sqrt{n} P(f_{\bar{X}_n} -f_\mu)\\ =& \mathbb{G}_n f_{\mu} + o_P(1) + \sqrt{n} (\bar{X}_n -\mu) \psi'(\mu) + o_P(1)\\ =& \mathbb{G}_n\big(f_{\mu} + X \psi'(\mu)\big) +o_P(1) \end{align*} \] where the second line uses a Taylor expansion of \(\psi\) at \(\mu\), and the last line uses \(\sqrt{n}(\bar{X}_n-\mu)=\mathbb{G}_n(X)\). Now applying the CLT yields that \(\sqrt{n}(M_n-M)\) is asymptotically normally distributed.
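A quick numerical check (a sketch under \(G=N(0,1)\), where \(\mu=0\), \(M=\sqrt{2/\pi}\), and \(\psi'(\mu)=2G(0)-1=0\), so the limiting variance is \(\operatorname{Var}|X|=1-2/\pi\)):

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 2000, 5000
M = np.sqrt(2 / np.pi)  # M = E|X - mu| for X ~ N(0,1)
stats = []
for _ in range(reps):
    X = rng.normal(size=n)
    Mn = np.mean(np.abs(X - X.mean()))
    stats.append(np.sqrt(n) * (Mn - M))

# psi'(0) = 2*G(0) - 1 = 0, so the limit variance is Var|X| = 1 - 2/pi
print(np.var(stats), 1 - 2 / np.pi)  # empirical vs. theoretical variance
```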

Conclusion

Conclusion

  • Empirical process theory — all about uniform SLLN and CLT
  • Empirical process indexed by function class
  • Useful in statistical applications