dstl Statistics Workshop 23rd November 2009

Additional Notes: dstl Statistics Workshop 23rd November 2009

Here are some additional examples raised by suggestions from workshop participants:

1. One-way analysis of variance (1-way ANOVA)

The problem below is a standard one, but care is needed over how parameters are included in the model, so as to avoid non-estimable parameters.

DAY 0	DAY 1	DAY 3	DAY 5	DAY 7	DAY 14	DAY 21
20.1	20.1	16.6	15.3	15.8	17.9	19.6
20.5	16.5	15.8	17.1	16.5	17.2	19.1
18	20.1	16.3	15.8	16.2	19.4	18.2
19.1	16.9	18.4	15.1	18.4	18.1	19.7
17.4	17.5	16.2	15.4	16.7	18.4	18.8
17.4	15.8	16.2	15.5	17.9	19.2	19.9
18.3	17.7	16.6	16.3	14.5	16.7
17.1	16.2	16.1	16.3	16.8	17.1
19.9	17.5	17.8	16.4	18.5	17.3
19.3	17.6	16.5	17.5	18.3	18
18.7	16.8	15.8	18.9	16.2	16.2
18.2	17.5	16.5	16.1	17.4	19.7
18.6	18.1	17	15.7	17.3
18.6	17.2	16.9	15.9	15.4
18.5	17.3	16.9	17.2	15.5
18.4	16.8	18.4	18.7	15.8
19.6	16.1	17.2	17	14.3
18.1	16.7	19.3	18.2	19.5
19.3	18.6	18.3	15.7
18.6	18.5	18.8	14.5
19	17.3	18.2	16
18.9	17.4	16.8	18
20	16.7	18.6	16.1
19.7	16.6	16.5	19.3
17.9	19.2	19.2
19.9	16.2	15.8
20.5	18.5	17.1
21	18.1	18.6
18.7	18.7	17.2
18.2	17.4	17.1
18.7	17.4
19.6	18.2
19.1	18.6
18.5	18.5
18.7	17.1
19.6	18
18.7
18.5
20
19
19.5
17.8

Different n at each time point. Is there a change in mass over time? (1-way ANOVA)

The Excel worksheet ANOVA is a simple way of checking if a parameterization is unsatisfactory.

ANOVD-1Way
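
The parameterization issue can be illustrated with a minimal Python sketch. It is not the workshop's Excel worksheet. It uses only the first six rows of the table above (the rows in which all seven time points are present) as an illustrative subset, and the function and variable names are assumptions made for this example. It computes the one-way ANOVA F statistic, then shows that a model with an overall mean plus one effect per day has non-unique parameters, whereas fixing DAY 0 as a reference level removes the ambiguity.

```python
from statistics import mean

# Illustrative subset: the first six rows of the table (all seven days present).
groups = {
    "DAY 0":  [20.1, 20.5, 18.0, 19.1, 17.4, 17.4],
    "DAY 1":  [20.1, 16.5, 20.1, 16.9, 17.5, 15.8],
    "DAY 3":  [16.6, 15.8, 16.3, 18.4, 16.2, 16.2],
    "DAY 5":  [15.3, 17.1, 15.8, 15.1, 15.4, 15.5],
    "DAY 7":  [15.8, 16.5, 16.2, 18.4, 16.7, 17.9],
    "DAY 14": [17.9, 17.2, 19.4, 18.1, 18.4, 19.2],
    "DAY 21": [19.6, 19.1, 18.2, 19.7, 18.8, 19.9],
}


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of label -> observations."""
    all_obs = [y for ys in groups.values() for y in ys]
    grand = mean(all_obs)
    ss_between = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in groups.values())
    ss_within = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    df_between = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


F, df_between, df_within = one_way_anova(groups)
print(f"F = {F:.2f} on {df_between} and {df_within} degrees of freedom")

# Over-parameterized model y_ij = mu + alpha_i + e_ij: 8 parameters, 7 group means.
# Shifting mu up by any constant and every alpha_i down by the same constant
# leaves all fitted values unchanged, so mu and the alpha_i are not estimable.
grand = mean([y for ys in groups.values() for y in ys])
effects = {g: mean(ys) - grand for g, ys in groups.items()}
fit_a = {g: grand + e for g, e in effects.items()}
fit_b = {g: (grand + 5.0) + (e - 5.0) for g, e in effects.items()}
max_diff = max(abs(fit_a[g] - fit_b[g]) for g in groups)
print("largest difference in fitted values:", max_diff)

# Reference-level coding (effect of DAY 0 fixed at zero) gives unique parameters:
baseline = mean(groups["DAY 0"])
contrasts = {g: mean(ys) - baseline for g, ys in groups.items()}
print("baseline:", round(baseline, 3))
print({g: round(c, 3) for g, c in contrasts.items()})
```

The same function applies to the full, unbalanced data, since the sums of squares above weight each group by its own sample size.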

2. Decay Rate Example

This concerns the estimation and comparison of decay rates.

	Method 1			Method 2
Time (mins)	Virus 1	Virus 2	Virus 3	Virus 1
0	491.4333	30900	82.26667	74
5	615.0467	9289.583	68.565	37
15	295.81	5175.315	55.33667	37
30	247.4067	4416.913	19.45	7.9
45	44.26	2426.28	12.13	7.9
60	16.06667	1113.45	9.416667	7.9
90	11.27667	441.5133	9.073333	7.9


Is there a difference between virus decay rates? Is there a difference in methods? (regression analysis?)

This can be modelled as a non-linear regression problem.

Alternatively, by assuming multiplicative errors, we can take logs of the data and fit a linear model.
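
As a rough illustration of the log-linear route, the sketch below fits log(count) against time by ordinary least squares for each column of the table. The exponential form count = A exp(−k t) is an assumption made for this example (the notes do not state the decay model), as are the function and variable names. No formal comparison of rates or methods is attempted here.

```python
import math

times = [0, 5, 15, 30, 45, 60, 90]  # minutes, from the table above
series = {
    "Method 1, Virus 1": [491.4333, 615.0467, 295.81, 247.4067, 44.26, 16.06667, 11.27667],
    "Method 1, Virus 2": [30900, 9289.583, 5175.315, 4416.913, 2426.28, 1113.45, 441.5133],
    "Method 1, Virus 3": [82.26667, 68.565, 55.33667, 19.45, 12.13, 9.416667, 9.073333],
    "Method 2, Virus 1": [74, 37, 37, 7.9, 7.9, 7.9, 7.9],
}


def fit_log_linear(t, counts):
    """Least-squares fit of log(count) = log(A) - k*t; returns (A, k)."""
    logs = [math.log(c) for c in counts]
    n = len(t)
    t_bar = sum(t) / n
    l_bar = sum(logs) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (li - l_bar) for ti, li in zip(t, logs))
    slope = sxy / sxx
    intercept = l_bar - slope * t_bar
    return math.exp(intercept), -slope


rates = {}
for name, counts in series.items():
    A, k = fit_log_linear(times, counts)
    rates[name] = k
    print(f"{name}: k = {k:.4f} per minute, half-life = {math.log(2) / k:.1f} minutes")
```

Differences between viruses or methods could then be examined by fitting all series in one linear model with separate and common slopes and comparing the fits.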

3. Poisson Regression Example

A weapon firing at a target at distance x hits with a certain probability, generating N(x) fragments.

A possible model is

N(x) ~ Poisson(λ(x))

i.e.

P(N(x) = n) = exp(−λ(x)) λ(x)ⁿ / n!,  n = 0, 1, 2, ...

Suppose

λ(x) = η(x | θ)

= , say,

a decreasing function of x. We have data

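n(x₁), n(x₂), n(x₃), ..., n(x_m)

The loglikelihood is

L(θ) = Σ_{i=1}^{m} [ n(x_i) log η(x_i | θ) − η(x_i | θ) − log n(x_i)! ]

and θ is estimated by maximizing the loglikelihood.

Under this model the probability that the weapon does not hit (no fragments) is

P(N(x) = 0) = exp(−λ(x))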

PoissonRegressionFit
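
A minimal sketch of the maximum likelihood fit follows. It is not the workshop's worksheet. The mean function η(x | θ) = exp(a + b x), with b expected to be negative, is an assumption made for this example, since the form used in the notes is not reproduced above. The distances and fragment counts are made-up illustrative values, and all function names are hypothetical. The loglikelihood is maximized by Newton's method.

```python
import math

# Hypothetical illustrative data: distances x_i and fragment counts n(x_i).
x = [10.0, 20.0, 30.0, 40.0, 50.0]
n = [12, 9, 5, 3, 1]


def eta(xi, a, b):
    """Assumed mean function eta(x | theta) = exp(a + b*x), theta = (a, b)."""
    return math.exp(a + b * xi)


def score(a, b):
    """Gradient of the loglikelihood with respect to (a, b)."""
    g0 = sum(ni - eta(xi, a, b) for xi, ni in zip(x, n))
    g1 = sum(xi * (ni - eta(xi, a, b)) for xi, ni in zip(x, n))
    return g0, g1


def fit(iterations=50):
    """Newton's method for the Poisson loglikelihood under the assumed eta."""
    a, b = math.log(sum(n) / len(n)), 0.0  # start from a constant mean
    for _ in range(iterations):
        g0, g1 = score(a, b)
        e = [eta(xi, a, b) for xi in x]
        # Negative Hessian (information matrix) entries.
        h00 = sum(e)
        h01 = sum(xi * ei for xi, ei in zip(x, e))
        h11 = sum(xi * xi * ei for xi, ei in zip(x, e))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-12:
            break
    return a, b


def prob_no_hit(xi, a, b):
    """P(N(x) = 0) = exp(-lambda(x)) under the fitted model."""
    return math.exp(-eta(xi, a, b))


a_hat, b_hat = fit()
print(f"a = {a_hat:.4f}, b = {b_hat:.5f}")
for xi in x:
    print(f"x = {xi:4.0f}: fitted mean = {eta(xi, a_hat, b_hat):6.2f}, "
          f"P(no hit) = {prob_no_hit(xi, a_hat, b_hat):.4f}")
```

Any other decreasing mean function could be substituted for `eta`, in which case a general-purpose optimizer may be more convenient than the hand-coded Newton step.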

4. Combining Uncertainty

The evaluation above supposes the sources of uncertainty are additive, giving an overall uncertainty of

Y = X₁ + X₂ + X₃ + X₄

where the X_i are independent random variables contributing to the total uncertainty Y.

Then the “Most Likely” variability is based on

E(Y) = E(X₁ + X₂ + X₃ + X₄)

= E(X₁) + E(X₂) + E(X₃) + E(X₄)

And the “Worst Case” is based on

V(Y) = V(X₁ + X₂ + X₃ + X₄)

= V(X₁) + V(X₂) + V(X₃) + V(X₄)

so that

SD(Y) = √[V(X₁) + V(X₂) + V(X₃) + V(X₄)]

A sampling approach is to create the CDF of Y numerically by simulation.

CombiningUncertaintiesbySimulation
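
The sampling approach can be sketched as below. The four distributions chosen for X₁ to X₄ are purely hypothetical stand-ins (the notes do not specify them), and the variable names are assumptions for this example. The sketch builds the empirical CDF of Y from simulated draws and compares the simulated SD with the root-sum-of-variances formula.

```python
import bisect
import math
import random
import statistics

random.seed(1)
N_SIM = 20000

# Hypothetical independent sources of uncertainty (assumed for illustration):
#   X1 ~ Normal(0, 0.5), X2 ~ Uniform(-1, 1),
#   X3 ~ Triangular(-1, 1) with mode 0, X4 ~ Normal(0, 1)
draws = sorted(
    random.gauss(0.0, 0.5)
    + random.uniform(-1.0, 1.0)
    + random.triangular(-1.0, 1.0, 0.0)
    + random.gauss(0.0, 1.0)
    for _ in range(N_SIM)
)


def ecdf(y):
    """Empirical CDF of Y: the proportion of simulated values <= y."""
    return bisect.bisect_right(draws, y) / N_SIM


# Variances of the four assumed components: 0.25, 4/12, 1/6 and 1.
sd_formula = math.sqrt(0.5 ** 2 + (2.0 ** 2) / 12.0 + 1.0 / 6.0 + 1.0 ** 2)
sd_sim = statistics.pstdev(draws)
upper_95 = draws[int(0.95 * N_SIM)]

print(f"SD from formula    = {sd_formula:.4f}")
print(f"SD from simulation = {sd_sim:.4f}")
print(f"simulated 95th percentile of Y = {upper_95:.3f}")
print(f"P(Y <= 1) is approximately {ecdf(1.0):.3f}")
```

Unlike the mean and variance formulas, the simulated CDF gives any percentile of Y directly, which matters when the components are far from normal.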

5. Combining information from two sensors

How to ‘fuse’ radar and infra-red sensor information?

Step 1: Treat each signal as a regression:

where

ε_ij ~ N(0, σ_i²)

and the f_i are probability density functions (pdf) to be fitted. This gives estimated parameter values .

Step 2: Then the best (in the sense of minimum variance) combined signal is the pdf of the random variable

where Z₁ has pdf and Z₂ has pdf .

If Y = aZ, where Z has pdf g and Y has pdf f, then f(y) dy = g(z) dz = g(z) dy/a, i.e. f(y) = g(z)/a = g(y/a)/a.

Thus Z has pdf that is the convolution:

This can be calculated numerically, or more easily by resampling.

SignalFusionExample
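
The resampling route can be sketched as follows. Because the combining formula is not reproduced in these notes, the sketch assumes the usual inverse-variance weighting Z = a Z₁ + (1 − a) Z₂ with a = σ₂²/(σ₁² + σ₂²), which minimizes the variance of a weighted average of two independent quantities. The two normal pdfs standing in for the fitted radar and infra-red signals, and all names, are hypothetical.

```python
import random
import statistics

random.seed(2)
N_SIM = 20000

# Hypothetical fitted signals (assumed): Z1 ~ Normal(10, sd 2), Z2 ~ Normal(10, sd 1).
var1, var2 = 2.0 ** 2, 1.0 ** 2
z1 = [random.gauss(10.0, 2.0) for _ in range(N_SIM)]
z2 = [random.gauss(10.0, 1.0) for _ in range(N_SIM)]


def fuse_by_resampling(sample1, sample2, v1, v2):
    """Combine paired independent draws with inverse-variance weights (assumed)."""
    a = v2 / (v1 + v2)  # weight on the first signal
    return [a * s1 + (1.0 - a) * s2 for s1, s2 in zip(sample1, sample2)]


z = fuse_by_resampling(z1, z2, var1, var2)
var_z = statistics.pvariance(z)
var_theory = var1 * var2 / (var1 + var2)

print(f"variance of fused signal (simulated) = {var_z:.3f}")
print(f"variance of fused signal (theory)    = {var_theory:.3f}")
```

A histogram or kernel density estimate of `z` then approximates the pdf of the fused signal without evaluating the convolution integral.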

6. Sequential Estimation of Confidence Intervals

A common problem is the construction of a confidence interval of given width and level of confidence. This can be tackled using a two-stage method or a fully sequential method.

Suppose we have observations:

X₁, X₂, ..., X_n, ...

where each is of the form

X = μ + ε, ε ~ N(0, σ²)

and we wish to estimate μ and find a confidence interval for it.

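Then μ is estimated by the sample mean X̄, and

P( X̄ − t_{n−1, α/2} s/√n ≤ μ ≤ X̄ + t_{n−1, α/2} s/√n ) = 1 − α    (1)

where

X̄ = (1/n) Σ_{i=1}^{n} X_i

and

s² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)²,    (2)

and t_{n−1, α/2} is the upper α/2 quantile of Student’s t distribution with n − 1 degrees of freedom.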

The width depends on s², the estimate of σ², which is not known at the outset.

A well-known solution is the two-stage method first proposed by Stein (1945). In the first stage one takes a pilot set of n observations to calculate an estimate of σ². Then, for any given interval width w and confidence level 1 − α, this allows a value N to be obtained such that, if a full set of N observations is taken (i.e. N − n additional observations), a confidence interval of the desired width can be found. Stein showed that setting the offset in (1) equal to the desired half width gives the number of additional observations needed.

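Stage 1: Sample n values of X, and calculate the sample variance s² from (2) above. Let h be the half width required (so that w = 2h). Then set

N = max{ n, ⌈ t²_{n−1, α/2} s² / h² ⌉ }

(where ⌈z⌉ denotes the smallest integer greater than or equal to z).

Stage 2: Sample N − n additional observations; then a 100(1 − α)% confidence interval is given by

X̄_N ± t_{n−1, α/2} s/√N

where X̄_N is the mean of all N observations and s² is the Stage 1 estimate; by construction the half width is at most h.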
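
The two-stage procedure can be sketched as below, under stated assumptions. The observations are simulated from a hypothetical normal population, the pilot size n = 10 and half width h = 1 are arbitrary choices, and the critical value 2.262 is the tabulated upper 2.5% point of Student's t with 9 degrees of freedom, passed in by hand. The function name is hypothetical.

```python
import math
import random
import statistics


def stein_total_n(pilot, half_width, t_crit):
    """Total sample size N = max(n, ceil(t^2 s^2 / h^2)) from a pilot sample.

    Returns (N, s2), where s2 is the pilot sample variance (divisor n - 1).
    """
    n = len(pilot)
    s2 = statistics.variance(pilot)
    big_n = max(n, math.ceil(t_crit ** 2 * s2 / half_width ** 2))
    return big_n, s2


random.seed(3)
mu_true, sigma_true = 50.0, 4.0   # hypothetical population, unknown in practice
h = 1.0                           # required half width (w = 2h)
t_crit = 2.262                    # upper 2.5% point of t with 9 df (95% interval)

# Stage 1: pilot sample of n = 10 observations.
pilot = [random.gauss(mu_true, sigma_true) for _ in range(10)]
N, s2 = stein_total_n(pilot, h, t_crit)

# Stage 2: N - n further observations, then the interval from all N values,
# still using the Stage 1 variance estimate and its n - 1 degrees of freedom.
extra = [random.gauss(mu_true, sigma_true) for _ in range(N - len(pilot))]
x_bar = statistics.mean(pilot + extra)
half = t_crit * math.sqrt(s2 / N)
interval = (x_bar - half, x_bar + half)

print(f"pilot variance s^2 = {s2:.3f}, total sample size N = {N}")
print(f"95% interval: ({interval[0]:.3f}, {interval[1]:.3f}), half width = {half:.3f}")
```

Because N is fixed using only the pilot variance, a poor pilot estimate can make N much larger than necessary, which is the inefficiency the fully sequential method addresses.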

Stein’s method is not fully efficient. A better, fully sequential, method allows observations to be added one at a time. This uses the same kind of probability statement

as starting point where . We transform the X_i to a sequence of independent variates (actually variates when ~):

so that if additional observations X_{n+1}, X_{n+2}, ... are included, then the corresponding U_n, U_{n+1}, ... can be added to the left-hand sum in the stopping rule below without changing the previous U_i. The stopping rule is based on one suggested by Anscombe (1953):

Take N as the first n ( ≥ 3 ) for which

The % confidence interval is

where and, ;

h₁, h₂ are both positive under the condition .

ExtraNotesOnSequentialMethods