Computer Analysis of Data and Models, Part III

7. Empirical Distribution Functions
Consider first a single random sample of observations Y_i, for i = 1, 2, ..., n. The empirical distribution function (EDF) is defined as

$$\tilde F_n(y) = \frac{1}{n}\,\#\{\, i : Y_i \le y \,\}. \qquad (27)$$
The EDF is illustrated in Figure 6. It is usually simplest to think of the observations as being ordered:
$$Y_{(1)} < Y_{(2)} < \cdots < Y_{(n)}. \qquad (28)$$
Figure 6: EDF of the Y_i, with the ordered values Y_(1), ..., Y_(n) on the horizontal axis.
These are what are depicted in Figure 6. Note that the subscripts are placed in brackets to indicate that this is an ordered sample.
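As a concrete illustration, the following is a minimal Python sketch of the EDF of equation (27); the function returns the proportion of the sample lying at or below a given point, and the small sample used at the end is invented purely to show the call.

import numpy as np

def edf(sample):
    # Return a function F_tilde(y) giving the proportion of the
    # sample that is <= y, i.e. the EDF of equation (27).
    y_sorted = np.sort(np.asarray(sample, dtype=float))
    n = len(y_sorted)

    def F_tilde(y):
        # searchsorted with side='right' counts observations <= y
        return np.searchsorted(y_sorted, y, side="right") / n

    return F_tilde

# Example: the EDF evaluated at a few points for a small illustrative sample
F = edf([2.3, 0.7, 1.9, 3.1, 0.2])
print(F(1.0), F(2.3), F(5.0))   # 0.4 0.8 1.0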
The key point is that the EDF estimates the (unknown) cumulative distribution function (CDF) of Y. We shall make repeated use of the following:
Fundamental Theorem of Sampling: As the size of a random sample tends to infinity, the EDF constructed from the sample tends, with probability one, to the underlying cumulative distribution function (CDF) of the distribution from which the sample is drawn.
(This result, when stated in full mathematical rigour, is known as the Glivenko-Cantelli Lemma, and it underpins all of statistical methodology. It guarantees that study of increasingly large samples is ultimately equivalent to studying the underlying population.)
In the previous section we studied the CDF of the Yi ’s of a random sample by fitting a parametric distribution and then studying the fitted parameters and the fitted parametric CDF. Using the EDF, we can do one of two things:
(i) We can study the properties of the Yi directly using the EDF, without bothering to fit a parametric model at all.
(ii) We can use the EDF to study properties of the fitted parameters and fitted parametric distribution.
We shall discuss both approaches. We shall focus first on (ii), as we wish to use bootstrapping to provide an alternative way of studying the properties of MLEs to that provided by the asymptotic normality theory discussed in Section 6.
We postpone discussion of (i) until later. We simply note at this juncture that the attraction of using the EDF directly, rather than a fitted parametric CDF, is that we make no assumption about the underlying distributional properties of Y. Thus Y can be either a continuous or a discrete random variable. Nor is it assumed to come from any particular family of distributions like the normal or Weibull. This flexibility is particularly important when studying or comparing the output from complex simulations where it is possible that the distribution of the output may be unusual. For example it may well be skew, or possibly even multimodal.
8. The Basic Bootstrap
The basic process of constructing a given statistic of interest is illustrated in Figure 7. This depicts the steps of drawing a sample Y = (Y1, Y2, ..., Yn) of size n from a distribution F0(y), and then calculating the statistic of interest T from Y. The problem is then to find the distribution of T.
Bootstrapping is a very general method for numerically estimating the distribution of a statistic. It is a resampling method that operates by sampling from the data used to calculate the original statistic.
Bootstrapping is based on the following idea. Suppose we could repeat the basic process, as depicted in Figure 7, a large number of times, B say. This would give a large sample {T1, T2, ..., TB} of statistic values, and, by the Fundamental Theorem of Section 7, the EDF of the Ti will tend to the CDF of T as B tends to infinity. Thus the EDF does not simply estimate the CDF: at least in principle it can be made arbitrarily accurate, simply by making B sufficiently large.
Unfortunately to apply this result requires repeating the basic process many times. In the present context this means having to repeat the simulation trials many times - something that is certainly too expensive and impractical to do.
The bootstrap method is based on the idea of replacing F0(y) by the best estimate we have for it. The best available estimate is the EDF constructed from the sample Y.
Thus we mimic the basic process depicted in Figure 7 but instead of sampling from F0(y) we sample from the EDF of Y. This is exactly the same as sampling with replacement from Y. We carry out this process B times to get B bootstrap samples Y1*, Y2*, ..., YB*. (We have adopted the standard convention of adding an asterisk to indicate that a sample is a bootstrap sample.) From each of these bootstrap samples we calculate a corresponding bootstrap statistic value Ti* = T(Yi*), i = 1, 2, ..., B. The process is depicted in Figure 8.
Example 12: The following is a random sample of 15 observations from an unknown distribution, for which the sample median has been obtained. Obtain 90% confidence limits for the unknown true median, assuming that the sample is normally distributed. (You will probably have to consult the literature to find the answer to this.)
Now calculate 90% confidence limits using bootstrapping. Bootstrap Median □
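For illustration, a sketch of the bootstrap part of Example 12 in Python might look as follows. The 15 values below are invented placeholders (the actual observations are in the accompanying spreadsheet), and the percentile method is used as one common way of turning the bootstrap medians into a 90% interval.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the 15 observations of Example 12
y = np.array([4.2, 1.1, 3.5, 2.8, 6.0, 0.9, 2.2, 5.1,
              3.3, 4.8, 1.7, 2.5, 3.9, 7.2, 2.0])

B = 1000
medians = np.empty(B)
for b in range(B):
    # resampling with replacement = sampling from the EDF of y
    resample = rng.choice(y, size=y.size, replace=True)
    medians[b] = np.median(resample)

# 90% bootstrap percentile interval for the median
lower, upper = np.quantile(medians, [0.05, 0.95])
print(f"sample median = {np.median(y):.2f}, "
      f"90% bootstrap interval = ({lower:.2f}, {upper:.2f})")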
The EDF of the bootstrap statistics, which without loss of generality we can assume to have been reordered so that T(1)* < T(2)* < ... < T(B)*, now estimates the distribution of T. This is depicted in Figure 9. Figure 9 also includes the original statistic value, T0. Its p-value, as estimated from the EDF, can be read off as
$$\hat p = \frac{\#\{\, i : T_i^* \ge T_0 \,\}}{B}. \qquad (29)$$
If the p-value is small then this is an indication that T0 is in some sense unusual. We shall see how this idea can be developed into a full methodology for making different kinds of comparisons.
Figure 9: Empirical Distribution Function of the T(i)
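Assuming the convention of equation (29) that large values of T count as unusual, reading off the p-value from the bootstrap sample is a one-liner:

import numpy as np

def bootstrap_p_value(T_star, T0):
    # proportion of bootstrap statistics at least as large as the observed T0
    return float(np.mean(np.asarray(T_star) >= T0))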
In practice typical values of B used for bootstrapping are 500 or 1000, and such a value is generally large enough. With current computing power resampling 1000 values is, to all intents and purposes, instantaneous.
Chernick (1999) shows how generally applicable the bootstrap is, listing some 83 pages of carefully selected references! A good handbook of bootstrap methods is Davison and Hinkley (1997).
We stress that all these methods hinge on the basic idea set out in this section: namely that replication of the original process depicted in Figure 7 would allow the distribution of T to be obtained to arbitrary accuracy, but that this is rarely possible; however the bootstrap version of this process, as depicted in Figure 8, is possible and is easy to do. The bootstrap samples are trivially easy to obtain using the algorithm given below in Section 9. There is no problem in calculating the bootstrap statistics Ti* from the bootstrap samples Yi*, as the procedure for doing this must already have been available to calculate T from Y.
All the remaining examples discussed in the course are essentially applications of the bootstrapping process of Figure 8 (and its parametric variant, which we describe in Section 11), with whatever procedure is necessary inserted to calculate T from Y.
9. Evaluating the Distribution of MLEs by Bootstrapping
Let us apply the bootstrapping idea to the evaluation of the distribution of MLEs. All that is required is simply to treat the MLE as the statistic T of interest. We then follow the scheme depicted in Figure 8: we generate bootstrap samples by resampling with replacement from the original sample, and use the same code that calculated the original MLE to produce the bootstrap T from each bootstrap sample. The pseudocode for the entire bootstrap process is as follows:
// y = (y(1), y(2), ..., y(n)) is the original sample.
// T = T(y) is the calculation that produced T from y.
For k = 1 to B
{
    For i = 1 to n
    {
        // Unif() returns a uniformly distributed U(0,1) variate each
        // time it is called; taking the integer part (Floor) turns
        // n × Unif() into a random index j in the range 1, ..., n.
        j = 1 + Floor(n × Unif())
        y*(i) = y(j)
    }
    T*(k) = T(y*)
}
The beautiful simplicity of bootstrapping is now apparent. The resampling
is trivially easy. The step that produces T*(k) invokes the procedure that produced T from y, only y is
replaced by y*. The key point
here is that no matter how elaborate the original procedure was to produce T, we will already have it available, as we must have set it up in order to
calculate T = T(y) in the first place. The bootstrap
procedure simply calls it a further B
times.
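The pseudocode translates directly into, for example, Python. The sketch below bootstraps the MLE of a gamma model, with scipy's fitting routine playing the role of the procedure T(y); the data are synthetic stand-ins generated only so that the code runs, and in practice y would be the original sample (such as the toll booth service times of Exercise 11).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the original sample y
y = rng.gamma(shape=2.0, scale=1.5, size=100)

def gamma_mle(sample):
    # MLE of the gamma (shape, scale) parameters: this is T(y) in Figure 8
    shape, loc, scale = stats.gamma.fit(sample, floc=0)
    return shape, scale

B = 1000
boot = np.empty((B, 2))
for k in range(B):
    y_star = rng.choice(y, size=y.size, replace=True)   # resample the EDF of y
    boot[k] = gamma_mle(y_star)

# Percentile intervals for the two parameters, to set against the
# intervals given by asymptotic normality theory
print(np.quantile(boot, [0.025, 0.975], axis=0))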
Example 13: Use bootstrapping to produce 100 bootstrap versions of the MLE of Exercise 11. Compare these with the confidence intervals produced for the parameters using the asymptotic normality theory. Gamma Bootstrap
Produce confidence intervals for the waiting time in the queue using bootstrapping. Again compare these results with those produced by asymptotic theory. □
Example 13½: Use bootstrapping to produce bootstrap
estimates of the parameters in the logistic
regression model for the Vaso
Constriction Data. □
10. Comparing Samples Using the Basic Bootstrap
Next we consider the problem where we have two samples of observations Y and Z and we wish to know if they have been drawn from the same or different distributions. We consider how the basic bootstrap provides a convenient way of answering this question.
To illustrate the ideas involved
we consider the simple situation where we have calculated the same statistic
from each of the samples Y and Z. This statistic might be the sample
mean or sample variance say. We shall call this statistic S and denote its values calculated from the two samples by S(Y)
and S(Z). An obvious statistic to use for the comparison, which we call the comparator statistic, is the difference T = S(Y) − S(Z). We therefore need the null distribution of T, corresponding to when Y and Z have been drawn from the same distribution. This null situation is depicted in Figure 10.
Figure 10: Calculation of a Comparator Statistic
The procedure for producing a bootstrap version of T under the (null) assumption that the two samples are drawn from the same distribution is simple. We obtain bootstrap samples Y* and Z* from just one of the samples, Y say. The bootstrap process is given in Figure 11. In the figure the EDF constructed from Y is used as the distribution from which the bootstrap samples are drawn. Alternatively the EDF of Z, or perhaps even better, the EDF of the two samples Y and Z when they are combined, could be used.
The test follows the standard test procedure. We calculate the p–value of the original comparator statistic T relative to the bootstrap EDF of the {Ti*}. The null hypothesis that the two samples Y and Z are drawn from the same distribution is then rejected if the p–value is too small.
Figure 11: Bootstrap Comparison of Two Samples
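A minimal sketch of this two-sample comparison in Python, using the pooled sample as the resampling distribution (one of the options mentioned above) and the difference of sample means as the comparator statistic:

import numpy as np

def bootstrap_two_sample(y, z, stat=np.mean, B=1000, seed=0):
    # Bootstrap p-value for T = S(Y) - S(Z) under the null hypothesis
    # that both samples come from the same distribution.
    rng = np.random.default_rng(seed)
    y, z = np.asarray(y, dtype=float), np.asarray(z, dtype=float)
    pooled = np.concatenate([y, z])
    T0 = stat(y) - stat(z)

    T_star = np.empty(B)
    for i in range(B):
        y_star = rng.choice(pooled, size=y.size, replace=True)
        z_star = rng.choice(pooled, size=z.size, replace=True)
        T_star[i] = stat(y_star) - stat(z_star)

    # two-sided p-value: how extreme is |T0| among the bootstrap values
    return T0, float(np.mean(np.abs(T_star) >= abs(T0)))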
Example 14: Consider the data given in Law and Kelton (1991) recording the results of a simulation experiment comparing the costs of five different inventory policies. Five simulation runs were made for each of the five policies. The cost of running the inventory was recorded in each run and these results are reproduced in Table 1, together with the means obtained using each policy. Use bootstrapping to see if there are any significant differences between the five means. Law and Kelton EG
Table 1 (Law and Kelton Table 10.5): Average cost per month for five replicates of each of five inventory policies.

Replicate j   Policy #1   Policy #2   Policy #3   Policy #4   Policy #5
     1         126.97      118.21      120.77      131.64      141.09
     2         124.31      120.22      129.32      137.07      143.86
     3         126.68      122.45      120.61      129.91      144.30
     4         122.66      122.68      123.65      129.97      141.72
     5         127.23      119.40      127.34      131.08      142.61
   Mean        125.57      120.59      124.34      131.93      142.72
□
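As one possible way of carrying out Example 14, the sketch below takes the range of the five policy means as an overall comparator statistic (this choice is ours; the example does not prescribe it) and resamples all 25 costs from the pooled sample, mimicking the null hypothesis that policy makes no difference to cost.

import numpy as np

rng = np.random.default_rng(2)

# Table 1 data: rows are replicates, columns are policies 1-5
costs = np.array([
    [126.97, 118.21, 120.77, 131.64, 141.09],
    [124.31, 120.22, 129.32, 137.07, 143.86],
    [126.68, 122.45, 120.61, 129.91, 144.30],
    [122.66, 122.68, 123.65, 129.97, 141.72],
    [127.23, 119.40, 127.34, 131.08, 142.61],
])

def spread_of_means(table):
    # comparator statistic: range of the five column (policy) means
    means = table.mean(axis=0)
    return means.max() - means.min()

T0 = spread_of_means(costs)          # observed value, about 22.1
pooled = costs.ravel()

B = 1000
T_star = np.empty(B)
for b in range(B):
    resampled = rng.choice(pooled, size=costs.shape, replace=True)
    T_star[b] = spread_of_means(resampled)

print(f"T0 = {T0:.2f}, bootstrap p-value = {np.mean(T_star >= T0):.3f}")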
11. The Parametric Bootstrap
The previous discussion has considered the use of the basic bootstrap. There is a second method of bootstrapping which is sometimes, unfairly, considered to be less effective. This view probably arises because the problems where it can be used are similar to those of the basic bootstrap, but are ones where it might be less accurate. However it can be advantageous in certain specific situations when we are comparing models. We will discuss these problems in Section 13, but before doing so we need to describe how the parametric bootstrap works.
Suppose we have fitted a parametric model to data. If the parametric
model is the correct one and describes the form of the data accurately, then
the fitted parametric model will be a close representation of the unknown true
parametric model. We can therefore generate bootstrap samples not by resampling
from the original data, but by sampling from the fitted parametric model. This
is called the parametric bootstrap.
The basic process is depicted in Figure 12. It will be seen that the
method is identical in structure to the basic bootstrap. (Compare Figure 12
with Figure 8.) The only difference is that in the parametric version one has to generate samples from a given parametric distribution, like the gamma say. A method for doing this has to be available; as was seen in Exercise 9, the inverse transform method is one convenient possibility.
Figure 12: Parametric Bootstrap Process
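A sketch of the parametric bootstrap of Figure 12 in Python for a gamma model. The data are again synthetic stand-ins so that the code runs; scipy's variate generator plays the role of the sampling method (such as the inverse transform) mentioned above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic stand-in for the data; in practice y would be the sample analysed
y = rng.gamma(shape=2.0, scale=1.5, size=80)
n = y.size

# Fit the parametric model (gamma, with location fixed at zero)
a_hat, _, scale_hat = stats.gamma.fit(y, floc=0)

B = 1000
boot = np.empty((B, 2))
for k in range(B):
    # Parametric bootstrap: sample from the *fitted* gamma, not from y itself
    y_star = stats.gamma.rvs(a_hat, scale=scale_hat, size=n, random_state=rng)
    a_star, _, s_star = stats.gamma.fit(y_star, floc=0)
    boot[k] = a_star, s_star

print(np.quantile(boot, [0.025, 0.975], axis=0))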
Example 15: Consider the queueing example where we have
already used asymptotic theory and the basic bootstrap to analyse the effect of
estimated parameter uncertainty. Repeat the exercise but now use the parametric
bootstrap instead of the basic bootstrap.
ParametricBS-GammaEG □
At first sight the parametric bootstrap does not seem to be a
particularly good idea because it adds an extra layer of uncertainty into the
process, requiring selection of a model that may or may not be right.
However there are situations where its use is advantageous and we discuss
one in the next section.
12 Goodness of Fit Testing
12.1 Classical Goodness of Fit
We consider the natural question: Does
the model that we have fitted actually fit the data very well? For instance
in Exercise 11 (at the end of Section 5) we fitted a gamma distribution to toll
booth service time data, but does the fitted gamma distribution capture the
characteristics of the data properly?
The classical way to answer this question is to use a goodness of fit test (GOF test). A very
popular test is the chi-squared goodness
of fit test. The main reason for its popularity is that it is relatively
easy to implement. The test statistic is easy to calculate and moreover it has
a known chi-squared distribution,
under the null, which makes critical values easy to obtain.
However the chi-squared test has two obvious weaknesses. It is actually
not all that powerful, and it has a certain subjective element because the user
has to divide the data into groups of her/his own choosing.
The best general GOF tests directly compare the EDF with the fitted CDF.
Such tests are called EDF goodness of fit
tests. In the past the Kolmogorov-Smirnov
test has been the most popular, but the Cramér
- von Mises test and the Anderson -
Darling test, defined below, are generally more powerful and should be used
in preference. The trouble with these tests is that, because of their
sensitivity, their critical values are very dependent on the model being
tested, and on whether the model has been fitted (with parameters having to be
estimated in consequence). This means that different tables of test values are
required for different models (see d’Agostino and Stephens, 1986).
First we describe EDF tests in more detail. Applying the Fundamental Theorem of Section 7, we see that a natural way to test if a sample has been drawn from the distribution with CDF F0(y) is to compare the EDF, \tilde F_n(y), with F0(y) by looking at some quantity involving the difference \tilde F_n(y) − F0(y). Such an EDF test statistic typically has the form

$$T = n \int_{-\infty}^{\infty} \left[ \tilde F_n(y) - F_0(y) \right]^2 \psi(y)\, dF_0(y).$$

Here ψ(y) is a weighting function. Special cases are the Cramér-von Mises test statistic

$$W^2 = n \int_{-\infty}^{\infty} \left[ \tilde F_n(y) - F_0(y) \right]^2 dF_0(y),$$

where ψ(y) = 1, and the Anderson-Darling test statistic

$$A^2 = n \int_{-\infty}^{\infty} \frac{\left[ \tilde F_n(y) - F_0(y) \right]^2}{F_0(y)\,[1 - F_0(y)]}\, dF_0(y),$$

where ψ(y) = [F0(y)(1 − F0(y))]^{-1}.
At first sight such GOF test statistics seem difficult to calculate, involving a nasty integration. In fact things are much more straightforward than this. For example, the Anderson-Darling statistic (which is conventionally denoted by A2) is easily calculated using the equivalent formula

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n} (2i - 1)\left[ \ln z_i + \ln\!\left(1 - z_{n+1-i}\right) \right],$$

where z_i = \hat F(y_{(i)}) is the value of the fitted CDF at the ith ordered observation.
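The closed-form expression above is straightforward to code. The helper below is a sketch that assumes the fitted CDF is supplied as a function of y; for a fitted gamma one might pass, say, lambda t: stats.gamma.cdf(t, a_hat, scale=scale_hat).

import numpy as np

def anderson_darling(y, fitted_cdf):
    # A^2 computed from the closed-form expression above;
    # fitted_cdf(y) must return the fitted CDF evaluated at y.
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = y_sorted.size
    z = fitted_cdf(y_sorted)                 # z_i = fitted CDF at y_(i)
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(z) + np.log(1 - z[::-1])))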
The basic idea in using a goodness of fit test statistic is as follows. When the sample has really been drawn from F0(y), then the value of the test statistic will be small. This follows from the Fundamental Theorem of Section 7, which guarantees that \tilde F_n(y) will be close in value to F0(y) across the range of possible y values. Thus T will be small. Nevertheless, because the test statistic is a random quantity, it will have some variability according to a null distribution depending on the sample size n. If the null distribution is known then we can assess an observed value of T against this distribution. If the sample is drawn from a distribution different from F0(y) then T will be large. Statistically, what is conventionally called its p-value will then be small, indicating that the sample has not been drawn from the supposed null distribution.
Figure 13 illustrates the process involved in calculating a GOF test statistic for the parametric case. Two cases are shown. In both cases the distribution from which Y has been drawn is assumed to be F(y; θ), where θ is unknown; thus in each case F(y; θ) has been fitted to the random sample Y, giving the ML estimate \hat θ of θ. However in the first case F(y; θ) is the correct model. Thus \tilde F_n(y) will be the EDF of a sample drawn from F(y; θ), and will therefore converge to this distribution. In the second case the true model, G(y; φ) say, is different from F(y; θ), and may even involve a set of parameters φ that is different from θ. Thus the EDF \tilde F_n(y) in this second case will not converge to F(y; θ), but to G(y; φ). Thus in the second case T, which is a measure of the difference between the two, will be larger than in the first case.
The null situation is the first case, where we are fitting the correct model. We need to calculate the distribution of T for this case. A complication arises because the difference between \tilde F_n(y) and the fitted CDF F(y; \hat θ) is smaller than the difference between \tilde F_n(y) and the unknown true F(y; θ). This is because the fitted distribution F(y; \hat θ) will follow the sample more closely than F(y; θ), because it has been fitted to the sample. This has to be allowed for in calculating the null distribution of the test statistic.
Figure 13: Process Underlying the Calculation of a GOF Test, T
It will be seen that the GOF test hinges on being able to calculate the null distribution. This is a big issue and has meant that many potentially powerful test statistics, like the Cramér - von Mises, have not been fully utilized in practice because the null distribution is difficult to obtain.
In the next subsection we show how resampling provides a simple and accurate way of resolving this problem.
12.2 Bootstrapping a GOF Statistic
The null case calculation of the GOF statistic depicted in Figure 13 is identical to that of Figure 7. Thus if we could obtain many values of T using this process then the EDF of the Ti will converge to the CDF of T. This is almost certainly too expensive or impractical to do. However we can get a close approximation by simply replacing the unknown θ by its MLE \hat θ. This is precisely the parametric bootstrap process as given in Figure 12. All being well, \hat θ will be close in value to θ. Thus the distributional properties of the bootstrap statistics T* will be close to those of a set of T values obtained under the null case calculation of Figure 13.
The parametric bootstrap method as it applies to a GOF statistic is illustrated in more detail in Figure 14, where \tilde F*_n(y) is the EDF of the bootstrap sample Y*, and \hat θ* is the bootstrap MLE of θ obtained from the bootstrap sample Y*.
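Putting the pieces together, a sketch of the scheme of Figure 14 for testing a gamma fit might look like this. It reuses the anderson_darling helper sketched in Section 12.1, and treats the proportion of bootstrap A2 values that exceed the observed one as the p-value, as in (29).

import numpy as np
from scipy import stats

def gamma_gof_pvalue(y, B=1000, seed=4):
    # Parametric bootstrap p-value for the A^2 test of a gamma fit,
    # following the scheme of Figure 14.
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    a_hat, _, s_hat = stats.gamma.fit(y, floc=0)
    A2_obs = anderson_darling(y, lambda t: stats.gamma.cdf(t, a_hat, scale=s_hat))

    A2_star = np.empty(B)
    for k in range(B):
        # sample from the fitted gamma, then refit to the bootstrap sample
        y_star = stats.gamma.rvs(a_hat, scale=s_hat, size=n, random_state=rng)
        a_s, _, s_s = stats.gamma.fit(y_star, floc=0)
        A2_star[k] = anderson_darling(
            y_star, lambda t: stats.gamma.cdf(t, a_s, scale=s_s))

    return A2_obs, float(np.mean(A2_star >= A2_obs))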
Example 16: Examine whether the gamma model is a good fit
to the toll booth data. Examine also whether the normal model is a good fit to
the toll booth data. Use the Anderson-Darling goodness of fit statistic A2, previously given.
Gamma Fit Toll Booth Data Normal Fit Toll Booth Data □
Figure 14: Bootstrap Process to Calculate the Distribution of a GOF Test, T
13 Comparison of Different Models; Model Selection
We consider the situation where we have a number of alternative models that might be fitted to a set of data, and we wish to choose which model is the best fit. We shall only discuss the regression modelling case where the data are of the form (X_i, Y_i), i = 1, 2, ..., n, with X_i = (X_{i1}, X_{i2}, ..., X_{iM}), but where it is not clear which of the components of X are required in the best model. A common example is in multiple linear regression, where Y depends on a number of factors X_1, X_2, ..., X_M but it is not clear which ones are important. Here a decision has to be taken on each factor as to whether to include it or not. Assuming that the constant θ0 is always fitted, there are $\binom{M}{k}$ different models with precisely k factors, k = 0, 1, ..., M, making a total of 2^M different possible models in all. Krzanowski (1998) gives three quantities that measure how well a given model fits:
(i) Residual Mean Square: RMS_k = SSE_{k−1}/(n − k), where SSE_{k−1} is the residual sum of squares when there are k − 1 factors in the model.
(ii) Coefficient of Determination: R^2_k = 1 − SSE_{k−1}/CSS, where CSS is the total sum of squares corrected for the mean.
(iii) Mallows' Statistic: C_k = RSS_k/RMS_{M+1} − (n − 2k), where RSS_k is the residual sum of squares from fitting k − 1 factors and RMS_{M+1} is the residual mean square from fitting the full model.
We do not discuss the methodology here except to note that when M is large then a systematic search through all possible models will be computationally expensive.
Less computationally expensive methods are the forward or backward elimination methods. Interesting cross-validation methods are discussed by Davison and Hinkley (1997). Some very effective methods are discussed by Goldsman and Nelson (Chapter 8, Banks, 1998).
We shall not discuss the pros and cons of model selection, but look at how bootstrapping can be of use. Suppose that we have selected the 'best' model according to one of the above criteria. How reliable is this choice? Put another way, what degree of confidence can we place in the choice made? Bootstrapping provides a very effective way of answering this question, as the sketch below indicates.
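One simple way to assess the reliability of the choice is sketched below: resample the cases, rerun the all-subsets search on each bootstrap sample, and record how often each subset of factors is selected. The function name, its interface, and the user-supplied criterion argument (for example Mallows' statistic, or any of the measures above, coded so that smaller is better) are all our own illustrative choices rather than anything prescribed in the text.

import numpy as np
from itertools import combinations

def selection_frequencies(X, y, criterion, B=500, seed=5):
    # Bootstrap check of how stable a model choice is: resample the rows
    # of (X, y), redo the all-subsets search, and count how often each
    # subset of columns is chosen.  criterion(X_sub, y) should return the
    # score of one candidate model, with smaller meaning better.
    rng = np.random.default_rng(seed)
    n, M = X.shape
    counts = {}

    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # resample cases with replacement
        Xb, yb = X[idx], y[idx]
        best, best_score = None, np.inf
        # search over all non-empty subsets of the M factors
        for k in range(1, M + 1):
            for cols in combinations(range(M), k):
                score = criterion(Xb[:, list(cols)], yb)
                if score < best_score:
                    best, best_score = cols, score
        counts[best] = counts.get(best, 0) + 1

    return counts   # proportion chosen = counts[subset] / B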
Example 17: The following data came from tests a cement manufacturer did on the amount of heat Y their cement gave off when setting, and how this depended on the amounts it contained of four chemical compounds: X1, X2, X3, X4. The question is: which of these compounds has the most influence? This is studied in Cement Data □
This completes our discussion of some of the basic problems that regularly occur in modelling.
14 Final Comments
This course has reviewed the process of constructing and fitting a statistical model to data whether this data arises from study of a real system or from a simulation.
The classical approach using the method of maximum likelihood has been described for fitting the model.
Two approaches for assessing the variability of fitted parameters have been discussed:
(i) The classical approach using asymptotic normality theory.
(ii) Bootstrap resampling.
Examples have been drawn mainly from regression and ANOVA applications. These have been for illustration only and I have not attempted to survey the range of statistical models likely to be encountered in practice, where typically different aspects of modelling need to be brought together. Krzanowski (1998) gives a very comprehensive but at the same time very accessible survey of the different types of situation commonly encountered. For instance, Krzanowski's Example 6.1 gives data relating to factors affecting the risk of heart problems: social class, smoking, alcohol and so on. The data is 'binary response' data (i.e. a patient reporting some form of 'heart trouble' or not). The factors are categorical (for example alcohol is coded as someone who drinks or someone who does not). The required model is therefore a logistic regression model but with a linear predictor involving the categorical factors. Though this example has not been discussed explicitly in this course, all the elements needed to analyse it using either classical or bootstrap methods have been considered, and despite its apparent complexity, this model is quite capable of being tackled straightforwardly using the methods of this course.
In fact the spreadsheets given for the examples in this course use a number of VBA macros that enable various commonly occurring analyses to be carried out. These macros have been designed to be sufficiently flexible and accessible to be used in other applications and you are encouraged to make use of them in this way.
References
Banks, J. (1998). Handbook of Simulation.
Chernick, M.R. (1999). Bootstrap Methods: A Practitioner's Guide.
D'Agostino, R.B. and Stephens, M.A. (1986). Goodness of Fit Techniques.
Davison, A.C. and Hinkley, D.V. (1997). Bootstrap Methods and their Application.
Krzanowski, W.J. (1998). An Introduction to Statistical Modelling.
Law, A.M. and Kelton, W.D. (1991). Simulation Modeling and Analysis, 2nd Ed.
Urban Hjorth, J.S. (1994). Computer Intensive Statistical Methods.