12.2 Resampling Techniques 337
of an unknown distribution from a single sample, randomization methods serve to compare two or more samples taken under different conditions.
Remark: We may ask with some justification whether resampling makes sense, since randomly choosing elements from a sample does not seem to be as efficient as a systematic evaluation of the complete sample. Indeed, it should always be optimal to evaluate the parameter of interest directly using all elements of the sample with the same weight, either analytically or numerically; but, as in Monte Carlo simulations, the big advantage of parameter estimation by randomly selecting elements lies in the simplicity of the approach. Nowadays, lack of computing capacity is not a problem, and in the limit of an infinite number of drawn combinations of observations the complete available information is exhausted.
12.2.2 Definition of Bootstrap and Simple Examples
We sort the N observations of a given data sample {x1, x2, . . . , xN} according to their value, xi ≤ xi+1, and associate to each of them the probability 1/N. We call this discrete distribution P0(xi) = 1/N the sample distribution. We obtain a bootstrap sample {x*1, x*2, . . . , x*M} by generating M observations following P0. Bootstrap observations are marked with a star "*". A bootstrap sample may contain the same observation several times. The evaluation of interesting quantities follows closely that used in Monte Carlo simulations: the p.d.f. used for event generation in a Monte Carlo procedure is replaced by the sample distribution.
Example 143. Bootstrap evaluation of the accuracy of the estimated mean value of a distribution
We already dispose of an efficient method to estimate the variance; here we present an alternative approach in order to introduce and illustrate the bootstrap method. Given the sample of N = 10 observations {0.83, 0.79, 0.31, 0.09, 0.72, 2.31, 0.11, 0.32, 1.11, 0.75}, the estimate of the mean value
is obviously μ̂ = Σ xi/N = 0.74. In Sect. 3.2 we have also derived a formula to estimate the uncertainty, δμ = s/√(N − 1) = 0.21. When we treat the sample as representative of the distribution, we are able to produce an
empirical distribution of the mean value: We draw from the complete sample sequentially N observations (drawing with replacement) and obtain for instance the bootstrap sample {0.72, 0.32, 0.79, 0.32, 0.11, 2.31, 0.83, 0.83, 0.72, 1.11}. We compute the sample mean, repeat this procedure B times, and obtain in this way B mean values μ*b. The number of bootstrap replicates should be large compared to N; B equal to 100 or 1000 is typical for N = 10. From the distribution of the values μ*b, b = 1, . . . , B we can compute again the mean value μ̂* and its uncertainty δ*μ:
μ̂* = (1/B) Σ_{b=1}^{B} μ*b ,

δ*μ² = (1/B) Σ_{b=1}^{B} (μ*b − μ̂*)² .
Fig. 12.3 shows the sample distribution corresponding to the 10 observations and the bootstrap distribution of the mean values. The bootstrap estimates μ̂* = 0.74, δ*μ = 0.19, agree reasonably well with the directly obtained values. The larger value of δμ compared to δ*μ is due to the bias correction in
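The bootstrap procedure of Example 143 can be sketched in a few lines of Python; the function name, the seed, and the choice B = 1000 are ours, the sample values are those of the example.

```python
import random
import statistics

# The N = 10 observations from Example 143.
sample = [0.83, 0.79, 0.31, 0.09, 0.72, 2.31, 0.11, 0.32, 1.11, 0.75]

def bootstrap_mean_error(sample, B=1000, rng=None):
    """Estimate the uncertainty of the sample mean from B bootstrap
    replicates, each drawn with replacement from the sample."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    n = len(sample)
    # B bootstrap means mu*_b, each from n draws with replacement
    means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(B)]
    mu_star = statistics.fmean(means)      # mean of the bootstrap means
    delta_star = statistics.pstdev(means)  # their spread = delta*_mu
    return mu_star, delta_star

mu_star, delta_star = bootstrap_mean_error(sample)
```

The values come out close to μ̂* = 0.74 and δ*μ = 0.19 quoted above; they fluctuate slightly with the seed and with B.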
340 12 Auxiliary Methods
distance of random points, we extract from the distribution of Fig. 12.4 that the probability to find a distance less than 0.4 is approximately equal to 10%. Exact confidence intervals can only be derived from the exact distribution.
12.2.5 Precision of Classifiers
Classifiers like decision trees and ANNs usually subdivide the learning sample into two parts: one part is used to train the classifier and a smaller part is reserved to test it. The precision can be enhanced considerably by using bootstrap samples for both training and testing.
12.2.6 Random Permutations
In Chap. 10 we have treated the two-sample problem: “Do two experimental distributions belong to the same population?” In one of the tests, the energy test, we had used permutations of the observations to determine the distribution of the test quantity. The same method can be applied to an arbitrary test quantity which is able to discriminate between samples.
Example 146. Two-sample test with a decision tree
Let us assume that we want to test whether the two samples {x1, . . . , xN1} and {x′1, . . . , x′N2} of sizes N1 and N2 belong to the same population. This is our null hypothesis. Instead of using one of the established two-sample methods, we may train a decision tree to separate the two samples. As test quantity
serves the number of misclassifications Ñ, which of course is smaller than (N1 + N2)/2, half the size of the total sample. Now we combine the two samples, draw from the combined sample two new random samples of sizes N1 and N2, train again a decision tree to identify for each element the sample index, and count the number of misclassifications. We repeat this procedure many, say 1000, times and obtain in this way the distribution of the test statistic under the null hypothesis. The fraction of cases where the random selection yields a smaller number of misclassifications than the original samples is equal to the p-value of the null hypothesis.
Instead of a decision tree we can use any other classifier, for instance an ANN. The corresponding tests are potentially very powerful but also quite involved: even with today's computing facilities, training some 1000 decision trees or artificial neural networks is a considerable effort.
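A minimal pure-Python sketch of this permutation test follows. For simplicity a single threshold cut (a depth-1 "decision tree") stands in for the full decision tree of the example, and we count ties as smaller; function names and sample values are illustrative.

```python
import random

def stump_misclassifications(a, b):
    """Fit a single threshold cut (a depth-1 decision tree) separating
    sample a (label 0) from sample b (label 1); return the number of
    misclassifications N-tilde on the training data."""
    pts = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    labels = [lab for _, lab in pts]
    n, ones = len(labels), sum(labels)
    best = min(ones, n - ones)  # degenerate cut: assign one class to all
    ones_left = zeros_left = 0
    for lab in labels[:-1]:     # try every threshold between neighbours
        ones_left += lab
        zeros_left += 1 - lab
        zeros_right = (n - ones) - zeros_left
        ones_right = ones - ones_left
        best = min(best,
                   ones_left + zeros_right,   # predict 0 left, 1 right
                   zeros_left + ones_right)   # predict 1 left, 0 right
    return best

def permutation_pvalue(a, b, n_perm=1000, rng=None):
    """p-value of the null hypothesis that a and b stem from the same
    population, using the misclassification count as test statistic."""
    rng = rng or random.Random(1)
    observed = stump_misclassifications(a, b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random re-assignment to samples of sizes N1, N2
        if stump_misclassifications(pooled[:len(a)], pooled[len(a):]) <= observed:
            count += 1
    return count / n_perm

# Two clearly separated samples should yield a small p-value.
a = [0.10, 0.20, 0.15, 0.25, 0.30]
b = [0.90, 1.00, 0.95, 1.05, 1.10]
p = permutation_pvalue(a, b, n_perm=500)
```

Replacing the stump by a stronger classifier only changes `stump_misclassifications`; the permutation loop, which dominates the computing cost, stays the same.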
12.2.7 Jackknife and Bias Correction
The jackknife is mainly used for bias removal. It is assumed that to first order the bias decreases with N, i.e. that it is proportional to 1/N. How the jackknife works is explained in the following simple example, where we know the exact result from Chap. 3.
When we estimate the mean squared error of a quantity x from a sample of size N using the formula

δ²_N = Σ (xi − x̄)² / N ,
then δ²_N is biased. We assume that the expected bias is linear in 1/N. Thus it is possible to estimate the bias by changing N. We remove one observation at a time, compute each time the mean squared error δ²_{N−1,i}, and average the results:
δ²_{N−1} = (1/N) Σ_{i=1}^{N} δ²_{N−1,i} .
A linear dependence of the average bias bN on 1/N implies |
b_N N = b_{N−1} (N − 1) ,    (12.4)

and, with b_N = E(δ²_N) − σ²,

(E(δ²_N) − σ²) N = (E(δ²_{N−1}) − σ²)(N − 1) ,

σ² = N E(δ²_N) − (N − 1) E(δ²_{N−1}) .

Thus an improved estimate is δ²_c = N δ²_N − (N − 1) δ²_{N−1}.
Inserting the known expectation values (see Chap. 3),

E(δ²_N) = σ² (N − 1)/N ,    E(δ²_{N−1}) = σ² (N − 2)/(N − 1) ,

we confirm the bias corrected result E(δ²_c) = σ².
This result can be generalized to any estimate t with a dominantly linear bias, b ∝ 1/N:

t_c = N t_N − (N − 1) t̄_{N−1} ,

where t̄_{N−1} is the average of the N estimates obtained by omitting one observation at a time. The remaining bias is zero or of order 1/N² or higher.
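For the variance estimate above, the jackknife correction t_c = N t_N − (N − 1) t̄_{N−1} reproduces the unbiased estimate Σ(xi − x̄)²/(N − 1) exactly. A short Python sketch (the function name is ours; the sample is that of Example 143):

```python
import statistics

def jackknife_corrected_variance(xs):
    """Apply the jackknife bias correction t_c = N t_N - (N-1) t_bar_{N-1}
    to the biased variance estimate delta^2_N = sum (x_i - x_bar)^2 / N."""
    n = len(xs)
    t_n = statistics.pvariance(xs)  # delta^2_N, denominator N (biased)
    # leave-one-out estimates delta^2_{N-1,i}, averaged over i
    loo = [statistics.pvariance(xs[:i] + xs[i + 1:]) for i in range(n)]
    t_n1 = sum(loo) / n
    return n * t_n - (n - 1) * t_n1

xs = [0.83, 0.79, 0.31, 0.09, 0.72, 2.31, 0.11, 0.32, 1.11, 0.75]
corrected = jackknife_corrected_variance(xs)
```

Here `corrected` coincides (up to rounding) with `statistics.variance(xs)`, the estimate with denominator N − 1, as the derivation above predicts.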
13
Appendix
13.1 Large Number Theorems
13.1.1 The Chebyshev Inequality and the Law of Large Numbers
For a probability density f(x) with expected value µ, finite variance σ², and arbitrary given positive δ, the following inequality, known as the Chebyshev inequality, is valid:
P{|x − µ| ≥ δ} ≤ σ²/δ² .    (13.1)
This very general theorem says that a given, fixed deviation from the expected value becomes less probable when the variance becomes smaller. It is also valid for discrete distributions.
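The bound is easy to check numerically. The following sketch samples an exponential distribution (µ = σ = 1); the distribution, δ = 2, and the sample size are our illustrative choices.

```python
import random
import statistics

# Empirical check of P{|x - mu| >= delta} <= sigma^2/delta^2
# for an exponential distribution with mean 1 (so sigma^2 = 1).
rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(100_000)]

mu = statistics.fmean(xs)
var = statistics.pvariance(xs)

delta = 2.0
tail = sum(abs(x - mu) >= delta for x in xs) / len(xs)  # left-hand side
bound = var / delta ** 2                                # right-hand side
```

For this distribution the true tail probability is e⁻³ ≈ 0.05, far below the Chebyshev bound of 0.25; the inequality is general and therefore often loose.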
To prove the inequality, we use the definition
P_I ≡ P{x ∈ I} = ∫_I f(x) dx ,
where the domain of integration I is given by |x − µ|/δ ≥ 1. The assertion follows from the following inequalities for the integrals:
∫_I f(x) dx ≤ ∫_I ((x − µ)/δ)² f(x) dx ≤ ∫_{−∞}^{+∞} ((x − µ)/δ)² f(x) dx = σ²/δ² .
Applying (13.1) to the arithmetic mean x̄ of N independent, identically distributed random variables x1, . . . , xN results in one of the so-called laws of large numbers:
P{|x̄ − ⟨x⟩| ≥ δ} ≤ var(x)/(Nδ²) ,    (13.2)
with the relations ⟨x̄⟩ = ⟨x⟩ and var(x̄) = var(x)/N obtained in Sects. 3.2.2 and 3.2.3. The right-hand side vanishes for N → ∞; thus in this limit the probability to observe an arithmetic mean outside an arbitrary interval centered at the expected value approaches zero. We speak of stochastic convergence or convergence in probability, here of the arithmetic mean to the expected value.
We now apply (13.2) to the indicator function I_I(x) = 1 for x ∈ I, else 0. The sample mean Ī_I = Σ I_I(xi)/N is the observed relative frequency of events x ∈ I in the sample. The expected value and the variance are