11/03/2022, Southampton

General Info

Where to find all the information: Reading Group Website

If anyone wants to help me maintain the website, please let me know.

Check out the following items frequently:

  • Timetable
  • Materials

Got questions while reading the lecture notes? Check out:

  • Supplementary References

Presenter and Discussant

One presenter and one discussant each session. Refer to the website for a description of the roles.

Tips about the reading group by Percy Liang, including

  • leading discussions / giving presentations
  • following presentations
  • benefitting from talks that you can’t follow entirely

No rush policy:

  • please aim to finish in 1 hour, but no worries if you are behind. No need to rush to the end; we can run 10-20 minutes over every now and again.

Introduction of the introduction

Classic results

Let \(X_1, \dots, X_n \in \mathbb{R}\) with \(X_i\overset{i.i.d.}{\sim} F\), where the distribution is characterised by the (usually unknown) c.d.f. \[F(x)=P(X\le x).\]

Consider the empirical distribution function (e.d.f., also written e.c.d.f.) constructed from the data: \[ \mathbb{F}_n(x):=\dfrac{1}{n}\sum_{i=1}^n 1_{(-\infty, x]}(X_i) = \dfrac{1}{n}\sum_{i=1}^n 1(X_i\le x) \]
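As a quick illustration (a minimal Python/NumPy sketch, not part of the original notes), the e.d.f. can be computed by sorting the sample and counting via binary search:

```python
import numpy as np

def ecdf(samples):
    """Return x -> F_n(x), the e.d.f. of `samples` (an illustrative helper)."""
    samples = np.sort(np.asarray(samples))
    n = len(samples)
    # F_n(x) = (1/n) * #{i : X_i <= x}; searchsorted counts by binary search
    return lambda x: np.searchsorted(samples, x, side="right") / n

rng = np.random.default_rng(0)
Fn = ecdf(rng.normal(size=1000))
print(Fn(0.0))  # close to F(0) = 0.5 for standard normal data
```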

Uniform Results

  • Pointwise convergence (for each fixed \(x\)):

SLLN: \(\quad\mathbb{F}_n(x) \overset{a.s.}{\rightarrow} F(x)\quad\)

CLT: \(\quad\mathbb{G}_n(x):=\sqrt{n}(\mathbb{F}_n(x) - F(x)) \overset{d}{\rightarrow} N(0, F(x)(1-F(x)))\quad\)

The core of empirical process theory is developing the following results:

  • Uniform convergence (holds for all \(x\) simultaneously)

Thm 1.1 (Glivenko-Cantelli): \(\|\mathbb{F}_n-F\|_\infty:=\sup_{x}|\mathbb{F}_n(x)-F(x)|\overset{a.s.}{\rightarrow}0.\)

Thm 1.2 (Donsker): \(\mathbb{G}_n(x) \overset{d}{\rightarrow} \mathbb{U}\big(F(x)\big)\), where \(\mathbb{U}\) is a Brownian bridge (the convergence here is as processes, uniformly in \(x\)).
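A small simulation (a sketch, assuming standard normal data and using SciPy for \(F\)) makes the Glivenko-Cantelli theorem concrete: the sup-distance, attained at the order statistics, shrinks roughly like \(1/\sqrt{n}\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
for n in [100, 1000, 10000]:
    X = np.sort(rng.normal(size=n))
    F = norm.cdf(X)
    i = np.arange(1, n + 1)
    # sup_x |F_n(x) - F(x)| is attained at the order statistics: compare
    # F_n(X_(i)) = i/n and the left limit (i-1)/n against F(X_(i))
    d = max(np.max(i / n - F), np.max(F - (i - 1) / n))
    print(n, d)  # decreases toward 0, roughly at rate 1/sqrt(n)
```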

Review: Pointwise vs. Uniform

Uniform results are stronger than pointwise results, as they control the convergence speed uniformly over the index.

A typical example is \(f_n(x)=x^n\), \(x\in [0,1]\).

It converges pointwise to \(f(x)=0\) for \(x\in[0,1)\) and \(f(1)=1\).

Note this pointwise limit \(f\) is not continuous; since a uniform limit of continuous functions must be continuous, the convergence cannot be uniform.
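Numerically (a tiny sketch): for every \(n\) there is a point \(x_n=0.5^{1/n}\in[0,1)\) with \(f_n(x_n)=0.5\), so \(\sup_{x\in[0,1)}|f_n(x)-f(x)|\) never drops below \(1/2\):

```python
# For each n, x_n = 0.5**(1/n) lies in [0, 1) and satisfies x_n**n = 0.5,
# so the sup-distance to the pointwise limit stays at least 1/2: not uniform.
for n in [1, 10, 100, 1000]:
    x_n = 0.5 ** (1 / n)
    print(n, x_n, x_n ** n)  # x_n -> 1 while f_n(x_n) stays at 0.5
```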

Review: Convergence of Random Variables

Let \(X, X_1,X_2,\dots\) be random variables such that \(X\sim F\) and \(X_i\sim F_i\). Then \(X_n\overset{a.s.}{\rightarrow} X \implies X_n\overset{p}{\rightarrow} X \implies X_n\overset{d}{\rightarrow} X\), where

  • converges almost surely (almost everywhere / with probability 1): \(X_n\overset{a.s.}{\rightarrow} X\), if: \[ \mathbb{P}\big(\lim_{n\rightarrow\infty} X_n=X\big)=1 \]
  • converges in probability: \(X_n\overset{p}{\rightarrow} X\), if \[ \lim_{n\rightarrow\infty} \mathbb{P}\big(|X_n-X|>\varepsilon\big)=0 \text{ for all } \varepsilon>0 \]
  • converges in distribution (weak convergence): \(X_n\overset{d}{\rightarrow} X\), if \[\lim_{n\rightarrow\infty}F_n(x)=F(x) \text{ for all continuity points } x \text{ of } F\]

Review: Stochastic processes

Recall a stochastic process is a (possibly uncountably infinite) collection of random variables, usually admitting a certain dependence structure:

  • Markov Chain: \(\{X_{1}, X_{2}, X_3\dots\}\)
  • Markov process: \(\{X_t\}_{t>0}\)

Other examples: Gaussian process/ diffusion process/ Wiener process (Brownian motion) …

This collection of random variables is indexed by some mathematical set \(\mathcal{S}\) — e.g. \(\mathcal{S}=\mathbb{R}\), some \(d\)-dimensional Euclidean space, or some functional space.

You will hear the terminology “a process … indexed by \(\mathcal{S}\)” throughout this reading group.

Empirical processes (Version 1)

The simplest empirical process, from the previous example (indexed by \(\mathbb{R}\)), is defined as \[ \mathbb{G}_n(x)=\sqrt{n}\big(\mathbb{F}_n(x) - F(x)\big) \]
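One realisation of \(\mathbb{G}_n\) can be simulated directly (a sketch, assuming \(F\) is the standard normal c.d.f. via SciPy):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 500
X = np.sort(rng.normal(size=n))
grid = np.linspace(-3, 3, 7)
Fn = np.searchsorted(X, grid, side="right") / n   # e.d.f. on a coarse grid
Gn = np.sqrt(n) * (Fn - norm.cdf(grid))           # empirical process values
print(Gn)  # one sample path of G_n, evaluated on the grid
```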

1.1 A more general notation (as in functional space)

Review: Measures and Distributions

A (probability) measure \(\mu\) on some space \(\mathcal{X}\) is a map \(\mathcal{B}\rightarrow\mathbb{R}\) (with \(\mathcal{B}\) a \(\sigma\)-algebra of subsets of \(\mathcal{X}\)), and can be thought of as a measurement of the size of the sets in \(\mathcal{X}\), e.g.:

\(\mu(A)=a\) says the volume of set \(A\) under measure \(\mu\) is \(a\).

Lebesgue integration (Real Analysis): \(\int f d\mu=\lim_{n\rightarrow\infty}\int s_n d\mu\), where each \(s_n\) is a simple function (a finite linear combination of indicator functions) and the sequence \(s_n\) increases to \(f\).

A probability measure can be naturally associated with a distribution \(F\): if \(\mathcal{X}=\mathbb{R}\), let \(F(x)=\mu\big((-\infty, x]\big)\). From now on, we use measures in place of distributions.

Let’s say we have \(X_1, \dots, X_n\in \mathcal{X}\) drawn from a measure \(P\), from which we can construct the empirical measure \(\mathbb{P}_n\):

\(\mathbb{P}_n:=n^{-1} \sum_{i=1}^n\delta_{X_i},\quad\) where \(\delta_{x}(A)=1_A(x)\) (the Dirac measure).

Empirical Process (Version 2)

For a function \(f: \mathcal{X}\rightarrow \mathbb{R}\), we can write \[ \mathbb{P}_nf=\int f\,d\mathbb{P}_n=\dfrac{1}{n}\sum_{i=1}^n f(X_i) \] and \[ Pf=\int f\,dP \] For a collection of functions \(\mathcal{F}\), we can define the empirical process indexed by \(\mathcal{F}\): \(\{\mathbb{G}_nf:\,f\in\mathcal{F}\}\), where: \[ \mathbb{G}_nf:=\sqrt{n}(\mathbb{P}_nf-Pf) \]

Version 1 & 2 Comparison

|                        | Version 1        | Version 2               |
|------------------------|------------------|-------------------------|
| sample space           | \(\mathbb{R}\)   | general \(\mathcal{X}\) |
| distribution           | \(F\)            | \(P\)                   |
| empirical distribution | \(\mathbb{F}_n\) | \(\mathbb{P}_n\)        |
| operates on            | \(x\)            | \(f\)                   |
| index set              | \(\mathbb{R}\)   | \(\mathcal{F}\)         |

Version 1 can be regarded as a special case of Version 2, e.g., let \[f(\cdot)=1_{(-\infty,x]}(\cdot), \quad x\in \mathbb{R},\] then \(Pf=F(x)\) and \(\mathbb{P}_nf=\mathbb{F}_n(x)\).
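This special case is easy to check numerically (a sketch, assuming \(P=\text{Exp}(1)\) for concreteness):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.exponential(size=10_000)  # draws from P = Exp(1)

def Pn(f):
    # empirical measure applied to f: P_n f = (1/n) * sum_i f(X_i)
    return np.mean(f(X))

# f = 1_{(-inf, x]} recovers Version 1: P_n f = F_n(x) and P f = F(x)
x = 1.0
f = lambda t: (t <= x).astype(float)
print(Pn(f), 1 - np.exp(-x))  # F_n(1) vs. F(1) = 1 - e^{-1}
```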

Empirical process theory

We want to study uniform (over \(\mathcal{F}\)) approximation of \(Pf\) by \(\mathbb{P}_nf\), e.g., \[ \|\mathbb{P}_n-P\|_{\mathcal{F}}:=\sup_{f\in\mathcal{F}}|\mathbb{P}_nf-Pf| \quad\text{and}\quad \{\mathbb{G}_nf: f\in \mathcal{F}\} \]

Thm 1.1* (Glivenko-Cantelli): \(\|\mathbb{P}_n-P\|_\mathcal{F}\overset{a.s.}{\rightarrow}0.\) (Chap. 3)

Thm 1.2* (Donsker): \(\{\mathbb{G}_nf: f\in\mathcal{F}\}\) converges to some limiting process. (Chap. 11)

  • closely related to the complexity (richness) of the class (functional space) \(\mathcal{F}\) (Chap. 2, 3, 4, 7, 11)

  • useful in studying finite-sample concentration inequalities (Talagrand’s inequality, Chap. 8, 13)

1.2 Example: M-estimation (empirical risk minimisation)

M-estimation

Let \(X, X_1,\dots, X_n \in \mathcal{X}\overset{i.i.d.}\sim P\), and consider the estimator \[ \hat\theta_n:= \arg\max_{\theta\in\Theta} n^{-1}\sum_{i=1}^n m_\theta(X_i)= \arg\max_{\theta\in\Theta} \mathbb{P}_n(m_\theta) \] where \(-m_\theta\) is a loss function. Applications include:

  • MLE: \(m_\theta(x)=\log p_\theta(x)\)
  • Median/Mode estimator: \(m_\theta(x)=-|x-\theta|\) (median) or \(m_\theta(x)=1\{|x-\theta|\le 1\}\) (mode)
  • Regression: \(m_\theta(x,y)=-(y-\theta(x))^2\)

Parameter of interest: \(\theta_0=\arg\max_{\theta\in\Theta} {P}(m_\theta)\)
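For example (a minimal sketch, not from the notes): taking \(m_\theta(x)=-|x-\theta|\) and maximising \(\mathbb{P}_n(m_\theta)\) numerically recovers the sample median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
X = rng.standard_t(df=3, size=501)

# maximising P_n(m_theta) with m_theta(x) = -|x - theta| is the same as
# minimising the empirical risk theta -> (1/n) * sum_i |X_i - theta|
emp_risk = lambda theta: np.mean(np.abs(X - theta))
theta_hat = minimize_scalar(emp_risk).x
print(theta_hat, np.median(X))  # the M-estimator matches the sample median
```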

Uniform closeness

If the two processes \(\mathbb{P}_n(m_\theta)\) and \(P(m_\theta)\) (both indexed by \(\Theta\)) are uniformly close, then it is plausible that their respective maximisers \(\hat\theta_n\) and \(\theta_0\) are also close.

Therefore we are interested in \[\sup_{\theta\in\Theta} |(\mathbb{P}_n-P)(m_\theta)| \]

Example 1.3 (consistency)

Theorem 1.3. \(d(\hat\theta_n, \theta_0)\overset{p}\rightarrow 0\) if we assume the class \(\mathcal{F}:=\{m_\theta(\cdot): \theta\in\Theta\}\) is P-GC and \(\theta_0\) is well-separated (see below).

Proof sketch: We write \[ \mathbb{M}_n(\theta):=\mathbb{P}_n(m_\theta) \quad\text{and}\quad M(\theta)=P(m_\theta) \] For any \(\delta>0\), define \(O(\theta_0, \delta)=\{\theta\in \Theta; d(\theta, \theta_0)\ge \delta\}\). Let \[\psi(\delta):=M(\theta_0)-\sup_{\theta\in O(\theta_0, \delta)} M(\theta).\] Therefore \[ \mathbb{P}\big(d(\hat\theta_n, \theta_0)>\delta\big) \le \mathbb{P}\bigg(\sup_{\theta\in O(\theta_0, \delta)} \mathbb{M}_n(\theta)\ge \mathbb{M}_n(\theta_0) \bigg) \]

Example 1.3 (II)

\[ \begin{align*} \le&\mathbb{P}\bigg( \sup_{\theta\in O(\theta_0, \delta)} \{\mathbb{M}_n(\theta) -M(\theta)\} - (\mathbb{M}_n(\theta_0)-M(\theta_0))\ge {M}(\theta_0)-\sup_{\theta\in O(\theta_0, \delta)}M(\theta) \bigg)\\ \le &\mathbb{P}\bigg( 2\sup_{\theta\in \Theta} |\mathbb{M}_n(\theta) -M(\theta)|\ge \psi(\delta) \bigg) \end{align*} \]

If \(\theta_0\) is well-separated, i.e., \(\psi(\delta)>0\) for every \(\delta>0\), the probability above vanishes as \(n\rightarrow \infty\) by the P-GC assumption.

Example 1.4 (Classification)

Suppose \((Z, Y), (Z_1, Y_1), (Z_2, Y_2), \dots\) are i.i.d. observations from a classification problem, where \(Y, Y_1, \dots \in \{-1, +1\}\). Consider classifiers \(g\) from a class \(\mathcal{C}\), and define the loss (error) and empirical loss: \[ L(g):=\mathbb{P}(g(Z)\neq Y)\] \[L_n(g):=n^{-1}\sum_{i=1}^n1\big(g(Z_i)\neq Y_i\big)=\mathbb{P}_n 1\big(g(Z)\neq Y\big) \] Let \[ \hat g_n:=\arg\min_{g\in\mathcal{C}}L_n(g), \quad \text{and}\quad g^\ast:=\arg\min_{g\in\mathcal{C}}L(g) \] We want to know whether the loss \(L(\hat g_n)\) approximates \(L(g^\ast)\)

Example 1.4 (II)

Note \[ \begin{align*} L(\hat g_n) &=L(\hat g_n) + L(g^\ast) + L_n(\hat g_n)- L_n(\hat g_n)- L(g^\ast)\\ &\le L(\hat g_n) + L(g^\ast) + L_n(g^\ast) - L_n(\hat g_n) -L(g^\ast)\\ &\le L(g^\ast) +2\sup_{g\in\mathcal{C}}|(L_n-L)g| \end{align*} \] Again, if we know the class \(\{(z,y)\mapsto 1(g(z)\neq y): g\in\mathcal{C}\}\) is P-GC, then \(\sup_{g\in\mathcal{C}}|(L_n-L)g|\overset{a.s.}\rightarrow 0\), so \(L(\hat g_n)\) well-approximates \(L(g^\ast)\).

The details of these techniques will be illustrated in later sections.
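A toy simulation of \(\hat g_n\) (a sketch with a hypothetical class of threshold classifiers, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
Z = rng.normal(size=n)
Y = np.where(Z + 0.5 * rng.normal(size=n) > 0, 1, -1)  # noisy labels

# class C: threshold classifiers g_c(z) = sign(z - c), indexed by c
cs = np.linspace(-2.0, 2.0, 201)
emp_loss = np.array([(np.sign(Z - c) != Y).mean() for c in cs])
c_hat = cs[np.argmin(emp_loss)]
print(c_hat, emp_loss.min())  # \hat g_n chosen by empirical risk minimisation
```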

1.3 Goodness-of-Fit Testing - another motivating example

Goodness-of-fit test

Consider data \(X_1, \dots, X_n\) from an (unknown) distribution \(F\); we would like to test \[H_0: F=F_0 \text{ against } H_1: F\neq F_0\]

  • Kolmogorov-Smirnov test: \(D_n=\sqrt{n}\sup_x|\mathbb{F}_n(x)-F_0(x)|\)

  • Cramér-von Mises test: \(W_n=n\int[\mathbb{F}_n(x)-F_0(x)]^2dF_0(x)\)

Theorem 1.4 (HW): The null distribution of \(D_n\) (resp. \(W_n\)), i.e., its distribution under \(H_0\), is the same for any continuous \(F_0\).

Proof sketch: Under \(H_0\), the variables \(U_i:=F_0(X_i)\) are \(\text{Unif}(0,1)\); let \(H(y)=y\) be their c.d.f. and \(\mathbb{H}_n\) their e.d.f. Note \(F_0(x)=H(F_0(x))\) and \(\mathbb{F}_n(x)=\mathbb{H}_n(F_0(x))\). Since \(F_0\) is continuous, its range covers \((0,1)\), so \[ \sup_x|\mathbb{F}_n(x)-F_0(x)|=\sup_x|\mathbb{H}_n(F_0(x))-H(F_0(x))|=\sup_y|\mathbb{H}_n(y)-H(y)| \]
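Theorem 1.4 can be sanity-checked by simulation (a sketch: the simulated null laws of \(D_n\) for two different continuous \(F_0\)'s nearly coincide):

```python
import numpy as np
from scipy.stats import expon, norm

rng = np.random.default_rng(6)

def ks_stat(X, F0):
    X = np.sort(X)
    n = len(X)
    i = np.arange(1, n + 1)
    F = F0(X)
    # D_n = sqrt(n) * sup_x |F_n(x) - F0(x)|, attained at order statistics
    return np.sqrt(n) * max(np.max(i / n - F), np.max(F - (i - 1) / n))

d_norm = [ks_stat(rng.normal(size=50), norm.cdf) for _ in range(2000)]
d_expo = [ks_stat(rng.exponential(size=50), expon.cdf) for _ in range(2000)]
print(np.quantile(d_norm, [0.5, 0.95]))
print(np.quantile(d_expo, [0.5, 0.95]))  # nearly identical, as Thm 1.4 predicts
```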

Uniform Empirical Process

As a result, we can safely take \(F_0\) to be the \(\text{Unif}(0,1)\) c.d.f., i.e., study the null distribution of \(D_n\) using \(X_1,\dots, X_n\overset{i.i.d.}\sim\text{Unif}(0,1)\).

Define the uniform empirical process \(\mathbb{U}_n(t):= \sqrt{n}(\mathbb{H}_n(t)-t)\), where \(\mathbb{H}_n(t)\) is constructed from \(\xi_1, \dots, \xi_n\overset{i.i.d}\sim \text{Unif}(0,1)\): \[ \mathbb{H}_n(t):=\dfrac{1}{n}\sum_{i=1}^n 1(\xi_i\le t). \]

Theorem 1.5. \(\mathbb{U}_n\overset{d}{\rightarrow} \mathbb{U}\) (as processes). By the continuous mapping theorem we have \[\sup_t\mathbb{U}_n(t)\overset{d}{\rightarrow} \sup_t\mathbb{U}(t)\]

Review: Brownian Bridge

A Brownian Bridge \(\{\mathbb{U}(t): t\in [0,1]\}\) is a random process such that

  1. \(\mathbb{U}(0)=\mathbb{U}(1)=0\)
  2. for every \(t_1, t_2, \dots t_k\in (0,1)\), we have: \[\big(\mathbb{U}(t_1),\dots, \mathbb{U}(t_k)\big)^T\sim N_k(0, \Sigma), \text{ where } \Sigma_{ij}=\min(t_i, t_j)-t_i t_j.\]
  3. mapping \(t\rightarrow \mathbb{U}(t)\) is continuous on \([0,1]\)
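A simulation sketch of Theorem 1.5 (assuming the standard discretised Wiener construction \(\mathbb{U}(t)=W(t)-tW(1)\) of the bridge defined above): the sup of \(|\mathbb{U}_n|\) and the sup of \(|\mathbb{U}|\) have nearly the same distribution for moderate \(n\):

```python
import numpy as np

rng = np.random.default_rng(7)
k = 1000
t = np.arange(1, k + 1) / k  # grid on (0, 1]

def brownian_bridge():
    # U(t) = W(t) - t * W(1), with W a discretised Wiener process
    W = np.cumsum(rng.normal(scale=np.sqrt(1 / k), size=k))
    return W - t * W[-1]

def uniform_emp_process(n):
    xi = rng.uniform(size=n)
    Hn = np.searchsorted(np.sort(xi), t, side="right") / n
    return np.sqrt(n) * (Hn - t)  # U_n evaluated on the grid

sup_U = [np.abs(brownian_bridge()).max() for _ in range(2000)]
sup_Un = [np.abs(uniform_emp_process(200)).max() for _ in range(2000)]
print(np.quantile(sup_U, 0.95), np.quantile(sup_Un, 0.95))  # both near 1.36
```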

1.4 Asymptotic equicontinuity

Asymptotic equicontinuity

Let \(\{Z_n(f) : f \in \mathcal{F}\}\) be a stochastic process indexed by \(\mathcal{F}\). We say \(\{Z_n\}\) is asymptotically (or stochastically) equicontinuous at \(f_0\) if for each \(\eta > 0\) and \(\varepsilon > 0\), there exists a neighbourhood \(V\) of \(f_0\) for which \[ \limsup_{n\rightarrow\infty}\mathbb{P}(\sup_{f\in V}|Z_n(f)-Z_n(f_0)|>\eta)<\varepsilon \] Recall that in mathematical analysis, a family of functions is equicontinuous if all its members are continuous and their variation over any given small neighbourhood can be controlled uniformly across the family.

An equivalent characterisation of asymptotic equicontinuity:

Theorem 1.6 (HW). If a sequence \(\{\hat{f}_n\}_{n\ge 1} \subset \mathcal{F}\) such that \(\hat{f}_n\overset{p}\rightarrow f_0\), and \(\{Z_n(f) : f \in \mathcal{F}\}\) is asymptotically equicontinuous at \(f_0\), then \[ Z_n(\hat{f}_n)-Z_n(f_0)=o_p(1). \]

Asymptotic equicontinuity (continued)

Proof sketch: \[ \begin{align*} \mathbb{P}(|Z_n(\hat f_n)-Z_n(f_0)|>\eta)\le &\mathbb{P}\big(|Z_n(\hat f_n)-Z_n(f_0)|>\eta, \hat f_n\in V\big) + \mathbb{P}(\hat f_n\notin V)\\ &\le \mathbb{P}\big(\sup_{f\in V}|Z_n(f)-Z_n(f_0)|>\eta\big) + \mathbb{P}(\hat f_n\notin V) \end{align*} \] Taking the limsup, and using asymptotic equicontinuity together with \(\hat f_n\overset{p}\rightarrow f_0\), completes the proof.

  • Empirical process theory offers very efficient methods for establishing the asymptotic equicontinuity of \(\mathbb{G}_n\) over \(\mathcal{F}\).

  • For example, if \(\mathcal{F}\) is a VC class (we will see this in Chap. 7) with a square-integrable envelope, asymptotic equicontinuity holds.

Example 1.5 Sample mean absolute deviation

Suppose \(X, X_1,X_2, \dots\) are i.i.d. \(P\) with c.d.f. \(G\) admitting density \(g\), and write \(\mu:=\mathbb{E}X\). Consider \[M_n:=\mathbb{P}_n|X-\bar{X}_n|= \dfrac{1}{n}\sum_{i=1}^n|X_i-\bar{X}_n|. \] We want to study \(\lim_{n\rightarrow\infty} M_n\) and the asymptotic distribution of \(M_n\).

Theorem 1.7. \(M_n\overset{a.s.}\rightarrow M:= P|X-\mu|\), and \(\sqrt{n}(M_n-M)\) is asymptotically normally distributed.

Proof sketch: The summands \(|X_i-\bar{X}_n|\) are dependent (through \(\bar{X}_n\)), so we cannot directly apply the SLLN and CLT. Instead, consider the random function \[\mathbb{M}_n(t)=\mathbb{P}_n|X-t|, \text{ where } t \in (\mu-\delta, \mu+\delta).\]

Example 1.5 Almost sure convergence

Note \(M_n=\mathbb{M}_n(\bar{X}_n)\). Define the function class \(\mathcal{F}=\{f_t(x)=|x-t|:\,\,t \in (\mu-\delta, \mu+\delta)\}\). Then: \[ \begin{align*} M_n-M = &\mathbb{P}_n f_{\bar{X}_n}-P f_\mu\\ = &(\mathbb{P}_n-P)f_{\bar{X}_n} + [P(f_{\bar{X}_n}-f_\mu)] \end{align*} \]

Note, by the Glivenko-Cantelli theorem (assuming \(\mathcal{F}\) is \(P\)-GC, and noting \(\bar{X}_n\in(\mu-\delta, \mu+\delta)\) eventually a.s.), \[|(\mathbb{P}_n-P)f_{\bar{X}_n}| \le \sup_{f\in\mathcal{F}}|(\mathbb{P}_n-P)f|\overset{a.s.}\rightarrow 0.\]

Moreover, by the triangle inequality \(|f_{\bar{X}_n}-f_\mu|\le |\bar{X}_n-\mu|\), so \[|P(f_{\bar{X}_n}-f_\mu)|\le |\bar{X}_n-\mu|\overset{a.s.}\rightarrow 0,\] which implies \(M_n \overset{a.s.}\rightarrow M\).

Example 1.5 Weak convergence (I.)

\[ \begin{align*} \sqrt{n}(M_n-M) = &\sqrt{n}\big(\mathbb{P}_n f_{\bar{X}_n} -P f_\mu\big)\\ = &\sqrt{n}(\mathbb{P}_n-P)f_{\mu} + \sqrt{n}[\mathbb{P}_n(f_{\bar{X}_n}-f_\mu)]\\ = & \sqrt{n}(\mathbb{P}_n-P)f_{\mu} + \sqrt{n}[(\mathbb{P}_n-P)(f_{\bar{X}_n}-f_\mu)] + \sqrt{n}P(f_{\bar{X}_n}-f_\mu)\\ = & \mathbb{G}_n f_{\mu} + \mathbb{G}_n (f_{\bar{X}_n}-f_{\mu}) + \sqrt{n} P(f_{\bar{X}_n} -f_\mu) \end{align*} \] If we assume \(\mathbb{G}_n\) is asymptotically equicontinuous at \(f_\mu\), then by Theorem 1.6 (since \(f_{\bar{X}_n}\overset{p}\rightarrow f_\mu\)) \[\mathbb{G}_n (f_{\bar{X}_n}-f_{\mu}) = o_P(1).\]

Moreover, note \(\psi(t):=P(f_t)=\int f_t dP =\mathbb{E}\big(f_t(X)\big)=\mathbb{E}|X-t|\).

Example 1.5 Weak convergence (II.)

By a simple calculation, we know \[ \begin{align*} \mathbb{E}|X-t|=& \int|x-t|g(x)dx\\ =&\int_{-\infty}^t (t-x)g(x) dx + \int_{t}^{\infty} (x-t)g(x) dx\\ =&\int_{-\infty}^t (t-x)g(x) dx + \int_{-\infty}^{\infty} (x-t)g(x) dx - \int_{-\infty}^{t} (x-t)g(x) dx\\ =& 2tG(t)-2\int_{-\infty}^t x g(x)dx + \mu -t, \end{align*} \] and \[ \psi'(t)= 2G(t) + 2t g(t) -2t g(t) -1 = 2G(t) -1. \]

Example 1.5 Weak convergence (III.)

As a result: \[ \begin{align*} \qquad\qquad&\mathbb{G}_n f_{\mu} + \mathbb{G}_n (f_{\bar{X}_n}-f_{\mu}) + \sqrt{n} P(f_{\bar{X}_n} -f_\mu)\\ =& \mathbb{G}_n f_{\mu} + o_P(1) + \sqrt{n} (\bar{X}_n -\mu) \psi'(\mu) + o_P(1)\\ =& \mathbb{G}_n\big(f_{\mu} + X \psi'(\mu)\big) +o_P(1) \end{align*} \] where the second line uses a Taylor expansion of \(\psi\) at \(\mu\), and the last line uses \(\sqrt{n}(\bar{X}_n-\mu)=\mathbb{G}_n(X)\). Now applying the CLT yields that \(\sqrt{n}(M_n-M)\) is asymptotically normally distributed.
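A quick numerical check (a sketch under \(G=N(0,1)\), where \(\mu=0\), \(M=\sqrt{2/\pi}\), and \(\psi'(\mu)=2G(0)-1=0\), so the limiting variance is \(\operatorname{Var}|X|=1-2/\pi\)):

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 2000, 5000
M = np.sqrt(2 / np.pi)  # M = E|X - mu| for X ~ N(0,1)
stats = []
for _ in range(reps):
    X = rng.normal(size=n)
    Mn = np.mean(np.abs(X - X.mean()))
    stats.append(np.sqrt(n) * (Mn - M))

# psi'(0) = 2*G(0) - 1 = 0, so the limit variance is Var|X| = 1 - 2/pi
print(np.var(stats), 1 - 2 / np.pi)  # empirical vs. theoretical variance
```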

Conclusion

Conclusion

  • Empirical process theory — all about uniform SLLN and CLT
  • Empirical process indexed by function class
  • Useful in statistical applications