Experimental Design and Analysis
Part II
3. Random Variables
The key concept of all statistics is the random variable. A formal definition of a random variable requires
a mathematical foundation (and elaboration) that takes us away from the main
focus of this course. We shall therefore not attempt a formal definition but
instead adopt a simpler, practical viewpoint: a random
variable is a quantity that one can observe many times but that takes
a different value each time it is observed, in an unpredictable, random way.
These values, however, follow a probability
distribution. The probability distribution is thus the defining property of
a random variable. Given a random variable, the immediate
question one can, and should always, ask
is: what is its distribution?
We denote a random variable by an upper case letter X (Y, Z, etc.). An observed value of such a random variable will be denoted by a lower
case letter x (y, z, etc.).
In view of the above discussion, given a random variable, one should
immediately think of the range of
possible values that it can take and its probability distribution over this range.
The definition of most statistical probability distributions involves parameters. Such a probability distribution
is completely fixed once the parameter values are known. Well known parametric
probability distributions are the normal, exponential, gamma, binomial and
Poisson.
A probability distribution is usually either discrete or continuous. A
discrete distribution takes a specific set of values, typically the integers 0,
1, 2, …. Each value i has a given probability p_i of
occurring. This set of probabilities is called its probability mass function.
Exercise 3.1: Plot the probability mass function of
(i) the binomial distribution, B(n, p)
(ii) the Poisson distribution, P(λ)
Write down what you know of each distribution. □
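One way to produce these plots is sketched below in Python; the use of numpy, scipy and matplotlib, and the parameter values n = 10, p = 0.3 and λ = 4, are illustrative assumptions rather than part of the exercise.

```python
# Minimal sketch for Exercise 3.1: probability mass functions of B(n, p) and P(lambda).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n, p = 10, 0.3      # illustrative binomial parameters
lam = 4.0           # illustrative Poisson rate

k_b = np.arange(0, n + 1)          # support used for the binomial plot
k_p = np.arange(0, 16)             # truncated support used for the Poisson plot

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.stem(k_b, stats.binom.pmf(k_b, n, p))
ax1.set_title("B(n=10, p=0.3) pmf")
ax2.stem(k_p, stats.poisson.pmf(k_p, lam))
ax2.set_title("P(λ=4) pmf")
plt.show()
```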
A continuous random variable, as its name implies, takes a continuous
range of values, for example all y ≥ 0. One way of defining its distribution is to give its probability density function (pdf), typically written f(y).
The pdf is not itself a probability; however, the product f(y)dy can be
interpreted as the probability that the variable falls in the small interval
(y, y + dy). This probability increment is a good way to view the pdf.
Exercise 3.2: Write down the pdf of
(i) the normal distribution, N(μ, σ²). Normal Distribution
(ii) the gamma distribution. Gamma Distribution
Plot the density functions. Write down what you know about each
distribution. □
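A corresponding sketch for the two densities (again the libraries and the particular parameter values are illustrative choices only):

```python
# Minimal sketch for Exercise 3.2: normal and gamma probability density functions.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

y = np.linspace(0.01, 15, 400)
plt.plot(y, stats.norm.pdf(y, loc=5, scale=2), label="Normal, μ=5, σ=2")
plt.plot(y, stats.gamma.pdf(y, a=2.0, scale=2.0), label="Gamma, shape=2, scale=2")
plt.xlabel("y")
plt.ylabel("f(y)")
plt.legend()
plt.show()
```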
Exercise 3.3: Suppose that X is a continuous random variable with density f(x). Let Y be a function of X, say Y = h(X).
What is the pdf, g(y), of Y in terms of f(x)? Give the pdf of Y = X² when X is a standard normal random variable.
What is the name of this random variable and what is the form of its pdf? □
An alternative way of defining a probability distribution, which applies
to either a discrete or continuous distribution, is to give its cumulative distribution function (cdf).
Exercise 3.4: Write down the main properties of a cdf. □
Exercise 3.5: Plot the cdfs of the distributions in
the previous examples. □
Exercise 3.6: What is the relation between the pdf and
the cdf for a continuous random variable? How is one obtained from the other? □
Exercise 3.7: Define the expected value of a random variable X in terms of its pdf f(x). Define the expected value of Y = h(X) in terms of the pdf of X. □
4. Fitting Parametric
Distributions to Random Samples; Input Modelling
Random samples are the simplest data sets that are encountered. A random sample is just a set of n independent and identically distributed observations (of a random variable). We write it as Y = {Y1, Y2, …, Yn}, where each Yi represents one of the observations.
Exercise 4.1: Generate random samples from
(i) the normal distribution. Normal Var Generator
(ii) the gamma distribution. Gamma Var Generator □
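If a spreadsheet generator is not to hand, the same samples can be drawn directly in Python; the sample size and parameter values below are illustrative assumptions.

```python
# Minimal sketch for Exercise 4.1: random samples from a normal and a gamma distribution.
import numpy as np

rng = np.random.default_rng(seed=1)
normal_sample = rng.normal(loc=5.0, scale=2.0, size=100)   # N(mu=5, sigma=2)
gamma_sample = rng.gamma(shape=2.0, scale=2.0, size=100)   # gamma, shape 2, scale 2
print(normal_sample.mean(), gamma_sample.mean())
```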
A basic problem arises when we wish to fit a parametric distribution to a random sample. This problem is an elementary form of modelling called input modelling.
Example 2.1 (continued): Suppose we are modelling a queueing system where service times are expected to have a gamma distribution and we have some actual data of the service times of a number of customers from which we wish to estimate the parameters of the distribution. This is an example of the input modelling problem. If we can estimate the parameters of the distribution, we will have identified the distribution completely and can then use it to study the characteristics of the system employing either queueing theory or simulation. □
To fit a distribution, a method of estimating the parameters is needed. The best method by far is the method of maximum likelihood (ML). The resulting estimates of parameters, which as we shall see shortly possess a number of very desirable properties, are called maximum likelihood estimates (MLEs). ML estimation is a completely general method that applies not only to input modelling problems but to all parametric estimation problems. We describe the method next.
5. Maximum Likelihood Estimation
Suppose Y = {Y1, Y2, …, Yn} is a set of observations where the ith observation, Yi, is a random variable drawn from a continuous
distribution with pdf f_i(y, θ), i = 1, 2, …, n. The subscript i
indicates that the distributions of the Yi
can all be different.
Example 5.1: Suppose Yi ~ N(μ, σ²) for all i. In this case

f_i(y, θ) = (2πσ²)^(−1/2) exp{−(y − μ)²/(2σ²)},  θ = (μ, σ²),   (5.1)

so that the observations are identically distributed. The set of observations is therefore a random sample in this case. □
Example 5.2: Suppose Yi ~ N(β₀ + β₁x_i, σ²), i = 1, 2, ..., n. In this case one often writes

Y_i = η(x_i; β₀, β₁) + ε_i,   (5.2)

where

η(x_i; β₀, β₁) = β₀ + β₁x_i   (5.3)

is called the regression function, and

ε_i ~ N(0, σ²)   (5.4)

can be thought of as an error term, or a perturbation affecting proper observation of the regression function. □

In the example, the regression
function is linear in both the parameters β₀, β₁ and in the explanatory variable x. In general the regression function can be highly nonlinear in
both the parameters and in the explanatory variables. Study of nonlinear
regression models forms a major part of this course.

In Example 5.2, the pdf of Yi is

f_i(y, θ) = (2πσ²)^(−1/2) exp{−[y − η(x_i; β₀, β₁)]²/(2σ²)},  θ = (β₀, β₁, σ²).   (5.5)
Thus Y is not a random sample
in this case, because the observations are not all identically distributed.
However, ML estimation still works.
We now describe the method. Suppose that y = {y1, y2, …, yn} is a sampled value of Y = {Y1, Y2, …, Yn}. Then we write down the joint distribution of Y evaluated at the sampled value y:

Lik(θ; y) = ∏_{i=1}^{n} f_i(y_i, θ).   (5.6)

This expression, treated as a function of θ, is called the likelihood (of the
sampled value y). The logarithm

L(θ; y) = log Lik(θ; y) = ∑_{i=1}^{n} log f_i(y_i, θ)   (5.7)

is called the loglikelihood.
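For instance, for the random sample of Example 5.1 (essentially part (i) of Exercise 5.1 below), every factor in (5.6) is an N(μ, σ²) density, and the loglikelihood (5.7) reduces to

L(μ, σ²; y) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^{n} (y_i − μ)².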
The ML estimate, θ̂, is that value of θ
which maximizes the loglikelihood.
The MLE is illustrated in Figure 5.1 in the one-parameter case. In some cases the maximum can be obtained explicitly as the solution of the vector equation

∂L(θ; y)/∂θ = 0,   (5.8)

which identifies the stationary points of the likelihood. The maximum is
often obtained at such a stationary point. This equation is called the likelihood equation. The MLE illustrated
in Figure 5.1 corresponds to a stationary point.
In certain situations, and this includes some well known standard ones,
the likelihood equations can be solved to give the ML estimators explicitly.
This is preferable when it can be done. However in general the likelihood equations
are not very tractable. Then a much more practical approach is to obtain the
maximum using a numerical search method.
Figure 5.1. The Maximum Likelihood Estimator
There are two immediate and important points to realise in using the ML
method.
(i) An expression for the
likelihood needs to be written down using (5.6) or (5.7).
(ii) A method has to be
available for carrying out the optimization.
We illustrate (i) with some examples.
Exercise 5.1: Write down the likelihood and loglikelihood for
(i) the sample of Example 5.1
(ii) the sample of Example 5.2
(iii) a sample of observations with the Bernoulli distributions (2.10)
(iv) the sample of Example 2.3 □
We now consider the second point, which concerns how to find the maximum
of the likelihood. There exist a number of powerful numerical optimization
methods, but these can be laborious to set up. An exception is the readily
accessible numerical optimizer Solver,
which can be called from an Excel spreadsheet. This can handle problems that
are not too large. A more flexible alternative is to use a direct search method
like the Nelder-Mead method. This is
discussed in more detail in the following reference: Nelder
Mead. A short Python sketch using a Nelder-Mead search is given after Exercise 5.3 below.
Exercise 5.2: NelderMeadDemo
This is a VBA implementation of the Nelder-Mead Algorithm. Insert a function of
your own to be optimized and see if it finds the optimum correctly.
Watchpoint: Check whether an optimizer minimizes or maximizes the
objective. Nelder-Mead implementations usually perform function minimization. □
Exercise 5.3: The following is a (random) sample of 47
observed times (in seconds) for vehicles to pay the toll at a booth when
crossing the
Watchpoints: Write down the loglikelihood for this example yourself, and
check that you know how it is incorporated in the spreadsheet. □
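The sketch below illustrates points (i) and (ii) together in Python, as an alternative to the spreadsheet: the gamma negative loglikelihood is written down directly and passed to a Nelder-Mead search (note the negation, in line with the watchpoint above). The data used here are a synthetic placeholder, since the 47 observed times are not reproduced in these notes.

```python
# Minimal sketch: fitting a gamma distribution by maximum likelihood with Nelder-Mead.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(seed=1)
y = rng.gamma(shape=2.0, scale=5.0, size=47)   # placeholder data; use the 47 observed times

def neg_loglik(theta, y):
    """Negative loglikelihood of a gamma sample; theta = (shape a, scale s)."""
    a, s = theta
    if a <= 0 or s <= 0:                       # keep the search inside the valid region
        return np.inf
    n = len(y)
    loglik = ((a - 1) * np.log(y)).sum() - y.sum() / s - n * (gammaln(a) + a * np.log(s))
    return -loglik

start = [1.0, y.mean()]                        # crude starting values
res = minimize(neg_loglik, start, args=(y,), method="Nelder-Mead")
a_hat, s_hat = res.x
print("ML estimates: shape =", a_hat, "scale =", s_hat)
```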
6. Accuracy of ML Estimators
A natural question to ask of an MLE is: How accurate is it? Now an MLE, being just a function of the
sample, is a statistic, and so is a random variable. Thus the question is
answered once we know its distribution.
An important property of the MLE, θ̂, is that
its asymptotic probability distribution is known to be normal. In fact it
is known that, as the sample size n → ∞,

θ̂ ~ N(θ₀, V(θ₀)),   (6.1)

where θ₀ is the unknown
true parameter value and the variance
has the form

V(θ₀) = [I(θ₀)]⁻¹,   (6.2)

where

I(θ₀) = −E[∂²L(θ; Y)/∂θ∂θᵀ], evaluated at θ = θ₀,   (6.3)

is called the information matrix.
Thus the asymptotic variance of θ̂ is the inverse of the information matrix
evaluated at θ₀. Its value cannot be computed precisely as it depends on the unknown
θ₀, but it can be approximated by

V(θ̂) = [I(θ̂)]⁻¹.   (6.4)
The expectation in the definition
of I(θ) is with respect to
the joint distribution of Y, and this
expectation can be hard to evaluate. In practice the approximation

I(θ̂) ≈ −∂²L(θ; y)/∂θ∂θᵀ, evaluated at θ = θ̂,   (6.5)

where we replace the information matrix by its sample analogue, called the observed information, is quite adequate. Practical experience indicates that it tends to give a better indication of the actual variability of the MLE. Thus the working version of (6.1) is

θ̂ ~ N(θ₀, [−∂²L(θ̂; y)/∂θ∂θᵀ]⁻¹).   (6.6)
The matrix of second derivatives of the
loglikelihood, ∂²L(θ; y)/∂θ∂θᵀ, that appears in the expression for the observed information
is called the Hessian (of L). It measures the rate of
change of the derivative of the loglikelihood; this is essentially the curvature of the loglikelihood. Thus it
will be seen that the variance is simply the inverse of the magnitude of this curvature at the
stationary point.
Though easier to calculate than
the expectation, the Hessian can still be very
messy to evaluate analytically. Again it is usually much easier to calculate
it numerically, using a finite-difference formula for the second derivatives.
The Hessian is a matrix of course, and the variance-covariance matrix of the
MLE is the negative of its inverse. A
numerical procedure is needed for this inversion.
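As an illustration, the observed information can be obtained by central finite differences along the following lines, continuing the gamma sketch of Section 5 (neg_loglik, y and res are assumed still in scope); this is only one possible implementation, not the one used in the course spreadsheets. Because neg_loglik is the negative loglikelihood, its Hessian is the observed information directly, so no further sign change is needed before inverting.

```python
# Minimal sketch: observed information by central finite differences, then its inverse.
import numpy as np

def num_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of the scalar function f at the point x."""
    x = np.asarray(x, dtype=float)
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = eps
            ej = np.zeros(k); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

obs_info = num_hessian(lambda t: neg_loglik(t, y), res.x)   # observed information at the MLE
cov_hat = np.linalg.inv(obs_info)                           # estimated variance-covariance matrix
print(cov_hat)
```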
The way that (6.6) is typically used is to provide confidence intervals. For example a (1−α)100% confidence interval for the coefficient θ₁ is

θ̂₁ ± z_{α/2} √(V̂₁₁),   (6.7)

where z_{α/2} is the upper 100α/2 percentage point of the standard
normal distribution and V̂₁₁ is the first diagonal element of V̂, the estimated variance-covariance matrix of θ̂ given by (6.6).
Often we are interested not in θ directly, but in some arbitrary, but given, function of θ, g(θ) say. ML estimation has the attractive general invariance property that the MLE of
g(θ) is

ĝ = g(θ̂).   (6.8)

An approximate (1−α)100% confidence interval for g(θ) is then

g(θ̂) ± z_{α/2} √( [∂g(θ̂)/∂θ]ᵀ V̂ [∂g(θ̂)/∂θ] ).   (6.9)

In this formula the first
derivative of g(θ) is required. If this is not tractable to obtain
analytically then, as with the evaluation of the information matrix, it should
be obtained numerically using a finite-difference calculation.
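Continuing the same illustrative gamma example (res.x and cov_hat are assumed to come from the sketches above), an interval of the form (6.9) might be computed as follows for, say, the mean of the fitted distribution, g(θ) = shape × scale; the choice of g and the use of a finite-difference gradient are assumptions of this sketch.

```python
# Minimal sketch: delta-method confidence interval for a function g of the parameters.
import numpy as np
from scipy.stats import norm

def num_grad(g, x, eps=1e-6):
    """Central finite-difference gradient of the scalar function g at the point x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = eps
        grad[i] = (g(x + e) - g(x - e)) / (2 * eps)
    return grad

g = lambda theta: theta[0] * theta[1]            # e.g. the gamma mean, shape * scale
grad = num_grad(g, res.x)
se = np.sqrt(grad @ cov_hat @ grad)              # standard error of g(theta_hat), as in (6.9)
z = norm.ppf(0.975)                              # upper 2.5% point, for a 95% interval
print("95% CI:", (g(res.x) - z * se, g(res.x) + z * se))
```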
Summarising, it will be seen that we need to:
(i) Formulate a statistical model of the data to be examined. (The data may or may not have been already collected. The data might arise from observation of a real situation, but it might just as well have been obtained from a simulation.)
(ii) Write down an expression for the loglikelihood of the data, identifying the parameters to be estimated.
(iii) Use this in a (Nelder-Mead say) numerical optimization of the loglikelihood.
(iv) Use the optimal parameter values to obtain estimates for the quantities of interest.
(v) Calculate confidence intervals for these quantities.
Example 6.1: Suppose that the gamma distribution fitted to the toll
booth data of Exercise 5.3 is used as the service distribution in the design of
an M/G/1 queue. Suppose the interarrival time distribution is known to be
exponential with pdf

f(t) = λ e^(−λt),  t ≥ 0,   (6.10)

but a range of possible values for the arrival rate, λ, needs to be considered.
Under these assumptions the steady-state mean waiting time in the queue
is known to be

W(λ) = λ E(S²) / [2(1 − λ E(S))],   (6.11)

where S denotes the service time.
Plot a graph of the mean waiting time for the queue for 0
< λ < 0.1 (per second),
assuming that the service time distribution is the fitted gamma distribution. Add 95% confidence intervals to this graph to take into
account the uncertainty arising because estimated parameter values
have been used. Gamma MLE □
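A possible starting point for the required plot is sketched below; the gamma parameter values are placeholders standing in for the toll-booth MLEs, and the confidence limits asked for in the example would then be added using the delta-method sketch above.

```python
# Minimal sketch: mean waiting time of an M/G/1 queue, formula (6.11), with gamma service times.
import numpy as np
import matplotlib.pyplot as plt

a_hat, s_hat = 2.0, 5.0                        # placeholder gamma MLEs (shape, scale)
ES = a_hat * s_hat                             # E(S) for a gamma distribution
ES2 = a_hat * (a_hat + 1) * s_hat ** 2         # E(S^2) for a gamma distribution

lam = np.linspace(0.001, 0.099, 200)           # arrival rates with lam * E(S) < 1
Wq = lam * ES2 / (2 * (1 - lam * ES))          # mean waiting time in the queue, as in (6.11)

plt.plot(lam, Wq)
plt.xlabel("arrival rate λ (per second)")
plt.ylabel("mean waiting time (seconds)")
plt.show()
```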
Example 6.2: Analyse the Moroccan Data of Example 2.3. Fit the model

Y_i = η(x_i; θ) + ε_i,  i = 1, 2, ..., 32,

where x_i is the age group of the ith observation, η(x_i; θ) is the regression function, and

ε_i ~ N(0, θ₁²).

Regression Fit Morocco Data □
The previous two examples contain all the key steps in the fitting of a statistical model to data. Both examples involve regression situations. However the method extends easily to other situations, like the Bernoulli model of Equation (2.10).
Example 6.3: Analyse the Vaso Constriction Data by fitting the Bernoulli model of Equation (2.10) using ML estimation. □
There are usually additional analyses that are subsequently required, such as model validation, model selection, and sensitivity analysis. We will discuss these in Part III.