
10.3 Goodness-of-Fit Tests


 

 

Fig. 10.7. Critical χ2 values as a function of the number of degrees of freedom, with the significance level as parameter.

The χ2 comparison becomes a test if we accept the theoretical description of the data only when the p-value exceeds a critical value, the significance level α, and reject it for p < α. The χ2 test is also called the Pearson test, after the statistician Karl Pearson, who introduced it in 1900.

Figure 10.7 gives the critical values of χ2 as a function of the number of degrees of freedom, with the significance level as parameter. To simplify the presentation we have replaced the discrete points by curves. The p-value as a function of χ2 with NDF as parameter is available in the form of tables or graphs in many books. For large f, about f > 20, the χ2 distribution can be approximated sufficiently well by a normal distribution with mean value x0 = f and variance s2 = 2f. We are then able to compute the p-values from integrals over the normal distribution. Tables can be found in the literature; alternatively, the computation can be performed with computer programs like Mathematica or Maple.
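As an alternative to tables or a computer algebra system, the same numbers can be obtained with a standard statistics library. The following is a minimal sketch (not from the book), assuming Python with SciPy; it evaluates the exact χ2 p-value and the normal approximation N(f, 2f) for large f.

```python
# Sketch: p-values of the chi^2 statistic and the normal approximation
# for large NDF; SciPy is assumed to be available.
import numpy as np
from scipy.stats import chi2, norm

def chi2_p_value(chi2_obs, ndf):
    """Exact p-value: probability to find a chi^2 at least as large."""
    return chi2.sf(chi2_obs, df=ndf)

def chi2_p_value_normal(chi2_obs, ndf):
    """Normal approximation N(f, 2f), reasonable for ndf > ~20."""
    return norm.sf(chi2_obs, loc=ndf, scale=np.sqrt(2.0 * ndf))

for ndf in (5, 20, 50):
    c = chi2.ppf(0.95, df=ndf)   # critical value for significance level 0.05
    print(ndf, c, chi2_p_value(c, ndf), chi2_p_value_normal(c, ndf))
```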

The Choice of Binning

There is no general rule for the choice of the number and width of the histogram bins for the χ2 comparison, but we note that the χ2 test loses significance when the number of bins becomes too large.

To estimate the effect of fine binning for a smooth deviation, we consider a systematic deviation which is constant over a certain region with a total number of entries N0 and which produces an excess of εN0 events. Partitioning the region into B bins would add to the statistical χ2 in each single bin the contribution:


 

 

 

Table 10.1. P-values for the χ2 and EDF statistics.

  test                  p-value
  χ2, 50 Bins           0.10
  χ2, 50 Bins, e.p.     0.05
  χ2, 20 Bins           0.08
  χ2, 20 Bins, e.p.     0.07
  χ2, 10 Bins           0.06
  χ2, 10 Bins, e.p.     0.11
  χ2, 5 Bins            0.004
  χ2, 5 Bins, e.p.      0.01
  Dmax                  0.005
  W2                    0.001
  A2                    0.0005

$$\chi_s^2 = \frac{(\varepsilon N_0/B)^2}{N_0/B} = \frac{\varepsilon^2 N_0}{B}\,.$$

For B bins we increase χ2 in total by ε2N0, which is to be compared to the purely statistical contribution χ02, which on average is equal to B. The significance S, i.e. the systematic deviation in units of the expected fluctuation √(2B), is

 

 

$$S = \frac{\varepsilon^2 N_0}{\sqrt{2B}}\,.$$

It decreases with the square root of the number of bins.

We recommend fine binning only if the deviations under consideration are restricted to narrow regions, for instance pick-up spikes. These are pretty rare in our applications. More commonly we have systematic deviations, produced by non-linearities of measurement devices or by background, which extend over a large region. Then wide intervals are to be preferred.

In [58] it is proposed to choose the number of bins according to the formula B = 2N^{2/5} as a function of the sample size N.
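The two relations above are easy to evaluate numerically. The following minimal sketch (not from the book) shows how the significance S = ε²N0/√(2B) degrades with the number of bins and evaluates the proposed bin-number rule; the values of ε, N0 and N are arbitrary illustration choices.

```python
# Sketch: degradation of the significance of a smooth systematic excess
# with the number of bins, and the bin-number rule B = 2 * N**(2/5).
# The values of eps, N0 and N are arbitrary illustration choices.
import numpy as np

eps, N0 = 0.05, 10000          # 5 % excess over a region with 10000 entries

for B in (5, 10, 20, 50):
    S = eps**2 * N0 / np.sqrt(2.0 * B)   # systematic chi^2 increase / fluctuation
    print(f"B = {B:3d}: S = {S:.2f}")

N = 1000                        # sample size
B_rule = 2 * N**0.4             # proposed number of bins, ~ 2 N^(2/5)
print("suggested number of bins:", round(B_rule))
```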

Example 127. Comparison of different tests for background under an exponential distribution

In Fig. 10.2 a histogrammed sample is compared to an exponential. The sample contains, besides observations following this distribution, a small contribution of uniformly distributed events. From Table 10.1 we recognize that this defect expresses itself by small p-values and that the corresponding decrease becomes more pronounced with decreasing number of bins.

Some statisticians propose to adjust the bin parameters such that the number of events is the same in all bins. In our table this partitioning is denoted by e.p. (equal probability). In the present example this does not improve the significance.
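For completeness, a minimal sketch (not from the book) of how equal-probability bin edges can be constructed when the hypothesis distribution function F0 is known, here taken to be an exponential as an assumed example; SciPy provides the quantile function.

```python
# Sketch: equal-probability ("e.p.") bin edges for a chi^2 test against
# an exponential with unit rate; SciPy is assumed to be available.
import numpy as np
from scipy.stats import expon

def equal_probability_edges(dist, n_bins):
    """Bin edges such that each bin has probability 1/n_bins under H0."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    return dist.ppf(quantiles)

edges = equal_probability_edges(expon(), 10)
print(edges)          # last edge is inf; expected content per bin is N/10

# Observed bin contents for a sample x would then be obtained with
# counts, _ = np.histogram(x, bins=edges)
```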

The value of χ2 is independent of the signs of the deviations. However, if several adjacent bins show an excess (or lack) of events, as in the left-hand histogram of Fig. 10.8, this indicates a systematic discrepancy which one would not expect at the same level for the central histogram, which produces the same value of χ2. Because correlations between neighboring bins do not enter into the test, a visual inspection is often more effective than the mathematical test. Sometimes it is helpful to present for every bin the value of χ2 multiplied by the sign of the deviation, either graphically or in the form of a table.


 


Fig. 10.8. The left hand and the central histogram produce the same χ2 p-value, the left hand and the right hand histograms produce the same Kolmogorov p-value.

Table 10.2. χ2 values with sign of deviation for a two-dimensional histogram.

  i \ j    1      2      3      4      5      6      7      8
  1        0.1   -0.5    1.3   -0.3    1.6   -1.1    2.0    1.2
  2              -1.9    0.5   -0.4    0.1   -1.2    1.3    1.5
  3                     -1.2   -0.8    0.2    0.1    1.3    1.9
  4                             0.2    0.7   -0.6    1.1    2.2


Example 128. χ2 comparison for a two-dimensional histogram

In Table 10.2 the values of χ2 for a two-dimensional histogram are presented together with the sign of the deviation. The absolute values are well confined within the range of our expectation, but near the right-hand border we observe an accumulation of positive deviations which points to a systematic effect.
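Such a signed table is easy to produce from the observed and predicted bin contents. A minimal sketch (not from the book), with placeholder numbers:

```python
# Sketch: per-bin chi^2 contribution multiplied by the sign of the deviation,
# as suggested in the text; d and t below are placeholder bin contents.
import numpy as np

d = np.array([[12., 7., 15.], [9., 11., 20.]])   # observed contents (example)
t = np.array([[10., 9., 11.], [10., 10., 14.]])  # predicted contents (example)

signed_chi2 = np.sign(d - t) * (d - t)**2 / t    # sign(deviation) * chi^2 per bin
print(np.round(signed_chi2, 2))
```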

Generalization to Arbitrary Measurements

The Pearson method can be generalized to arbitrary measurements yk with mean square errors δk2. For theoretical predictions tk we compute χ2,

$$\chi^2 = \sum_{k=1}^{N} \frac{(y_k - t_k)^2}{\delta_k^2}\,,$$

where χ2 follows a χ2 distribution of f = N degrees of freedom. A necessary condition for the validity of the χ2 distribution is that the uncertainties follow a normal distribution. For a large number N of individual measurements the central limit theorem applies and we can relax the condition of normality. Then χ2 is approximately normally distributed with mean N and variance 2N.



A further generalization is given in the Appendix 13.8 where weighted events and statistical errors of the theoretical predictions, resulting from the usual Monte Carlo calculation, are considered.

Remark: The quantity δk2 has to be calculated under the assumption that the theoretical description which is to be tested is correct. This means that normally the raw measurement error cannot be inserted. For example, instead of ascribing to a measured quantity an error δk which is proportional to its value yk, a corrected error

$$\delta_k' = \delta_k\,\frac{t_k}{y_k}$$

should be used.
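A minimal sketch (not from the book) of the generalized χ2 computation for measured points, including the error correction of the remark; the measurements, relative errors and predictions are placeholder values, and SciPy is assumed for the p-value.

```python
# Sketch: generalized chi^2 test for measurements y_k with errors delta_k
# against predictions t_k, with the errors rescaled to the prediction
# (delta_k' = delta_k * t_k / y_k) because they are proportional to y_k.
# SciPy is assumed to be available; the numbers are placeholders.
import numpy as np
from scipy.stats import chi2

y     = np.array([ 98., 105., 112., 121.])   # measurements
delta = 0.05 * y                              # 5 % relative raw errors
t     = np.array([100., 104., 110., 118.])   # theoretical predictions

delta_corr = delta * t / y                    # errors evaluated at the prediction
chi2_val = np.sum((y - t)**2 / delta_corr**2)
p_value = chi2.sf(chi2_val, df=len(y))        # f = N, no fitted parameters
print(chi2_val, p_value)
```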

Sometimes extremely small values of χ2 are presented. The reason is in most cases an overestimation of the errors.

The variable χ2 is frequently used to separate signal events from background. To this end, the experimental distribution of χ2 is transformed into a p-value distribution like the one presented in Fig. 10.5. In this situation it is not required that χ2 follows the χ2 distribution. It is only necessary that it is a discriminative test variable.

The χ2 Test for Composite Hypotheses

In most cases measurements do not serve to verify a fixed theory but to estimate one or more parameters. The method of least squares for parameter estimation has been discussed in Sect. 7.2. To fit a curve y = t(x, θ) to measured points yi with Gaussian errors σi, i = 1, . . . , N, we minimize the quantity

$$\chi^2 = \sum_{i=1}^{N} \frac{\bigl(y_i - t(x_i, \theta_1, \ldots, \theta_Z)\bigr)^2}{\sigma_i^2}\,, \qquad (10.6)$$

 

with respect to the Z free parameters θk.

It is plausible that with an increasing number of adjusted parameters the description of the data improves and χ2 decreases. In the extreme case where the number of parameters is equal to the number N of measured points or histogram bins, it becomes zero. The distribution of χ2 in the general case where Z parameters are adjusted follows, under conditions to be discussed below, a χ2 distribution with f = N − Z degrees of freedom.
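A minimal sketch (not from the book) of this procedure: a straight line with Z = 2 free parameters is fitted by least squares to N measured points, and the resulting χ2 is converted into a p-value with f = N − Z degrees of freedom; the data are placeholders and SciPy is assumed.

```python
# Sketch: chi^2 goodness of fit after a least-squares fit of a straight
# line (Z = 2 parameters) to N measured points with Gaussian errors.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import chi2

x     = np.array([1., 2., 3., 4., 5., 6.])          # placeholder abscissae
y     = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])  # placeholder measurements
sigma = np.full_like(y, 0.3)                         # placeholder errors

def line(x, a, b):
    return a + b * x

popt, _ = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
chi2_min = np.sum(((y - line(x, *popt)) / sigma)**2)

ndf = len(x) - len(popt)                      # f = N - Z
print(chi2_min, chi2.sf(chi2_min, df=ndf))    # p-value of the fit
```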

Setting zi = (yi − t(xi, θ))/σi in (10.6), we may interpret χ2 = z12 + · · · + zN2 as the (squared) distance of a point z with normally distributed components from the origin in an N-dimensional space. If all parameters are fixed except one, say θ1, which is left free and adjusted to the data by minimizing χ2, we have to set the derivative with respect to θ1 equal to zero:

 

 

 

 

$$-\frac{1}{2}\,\frac{\partial \chi^2}{\partial \theta_1} = \sum_{i=1}^{N} \frac{z_i}{\sigma_i}\,\frac{\partial t_i}{\partial \theta_1} = 0\,.$$

If t is a linear function of the parameters, an assumption which is often justified at least approximately4, the derivatives are constants, and we get a linear relation (constraint) of the form c1z1 + · · · + cN zN = 0. It defines an (N − 1)-dimensional subspace,

4 Note that the σi also have to be independent of the parameters.


a hyperplane containing the origin, of the N-dimensional z-space. Consequently, the distance in z-space is confined to this subspace and derived from N − 1 components. For Z free parameters we get Z constraints and an (N − Z)-dimensional subspace. The independent components (dimensions) of this subspace are called degrees of freedom. The number of degrees of freedom is f = N − Z, as stated above. Obviously, the sum of f squared components will follow a χ2 distribution with f degrees of freedom.

In the case of fitting a normalized distribution to a histogram with B bins which we have considered above, we had to set (see Sect. 3.6.7) f = B −1. This is explained by a constraint of the form z1 + · · · + zB = 0 which is valid due to the equality of the normalization for data and theory.

The χ2 Test for Small Samples

When the number of entries per histogram bin is small, the approximation that the variations are normally distributed is not justified. Consequently, the χ2 distribution should no longer be used to calculate the p-value.

Nevertheless we can use in this situation the sum of quadratic deviations χ2 as test statistic. Its distribution f0(χ2) then has to be determined by a Monte Carlo simulation. The test is then slightly biased, but the method still works pretty well.
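A minimal sketch (not from the book) of such a Monte Carlo determination for a histogram with small Poisson-distributed bin contents; the predicted and observed contents are placeholder values.

```python
# Sketch: Monte Carlo determination of the distribution of the chi^2
# statistic when bin contents are small and Poisson distributed, so that
# the asymptotic chi^2 distribution is unreliable.
import numpy as np

rng = np.random.default_rng(1)

t = np.array([2.5, 4.0, 3.0, 1.5, 0.8])          # predicted bin contents (H0)
d = np.array([4,   2,   5,   1,   0  ])          # observed bin contents

def chi2_stat(counts, expectation):
    return np.sum((counts - expectation)**2 / expectation)

chi2_obs = chi2_stat(d, t)

# Generate the H0 distribution of the test statistic by simulation.
sim = np.array([chi2_stat(rng.poisson(t), t) for _ in range(100_000)])
p_value = np.mean(sim >= chi2_obs)
print(chi2_obs, p_value)
```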

Warning

The assumption that the distribution of the test statistic under H0 is described by a χ2 distribution relies on the following conditions:

1. The entries in all bins of the histogram are normally distributed.
2. The expected number of entries depends linearly on the free parameters in the considered parameter range. An indication of non-linearity is asymmetric errors of the adjusted parameters.
3. The estimated uncertainties σi in the denominators of the summands of χ2 are independent of the parameters.

Deviations from these conditions affect mostly the distribution at large values of χ2 and thus the estimation of small p-values. Corresponding conditions have to be satisfied when we test the GOF of a curve to measured points. Whenever we are not convinced of their validity, we have to generate the distribution of χ2 by a Monte Carlo simulation.

10.3.4 The Likelihood Ratio Test

General Form

The likelihood ratio test compares H0 to a parameter-dependent alternative H1 which includes H0 as a special case. The two hypotheses are defined through the p.d.f.s f(x|θ) and f(x|θ0), where the parameter set θ0 is a subset of θ, often just a fixed value of θ. The test statistic is the likelihood ratio λ, the ratio of the likelihood of H0 and the likelihood of H1, where the parameters are chosen such that they maximize the likelihoods for the given observations x. It is given by the expression

$$\lambda = \frac{\sup L(\theta_0|x)}{\sup L(\theta|x)}\,, \qquad (10.7)$$

 

 


or equivalently by

ln λ = sup ln L(θ0|x) − sup ln L(θ|x) .

If θ0 is a fixed value, this expression simplifies to ln λ = ln L(θ0|x) − sup ln L(θ|x). From the definition (10.7) it follows that λ always obeys λ ≤ 1.

Example 129. Likelihood ratio test for a Poisson count

Let us assume that H0 predicts µ0 = 10 decays in an hour, and 8 are observed. The likelihood to observe 8 for the Poisson distribution is L0 = P(8|10) = e^{−10} 10^8/8!. The likelihood is maximal for µ = 8; it is L = P(8|8) = e^{−8} 8^8/8!. Thus the likelihood ratio is λ = P(8|10)/P(8|8) = e^{−2}(5/4)^8 = 0.807. The probability to observe a ratio smaller than or equal to 0.807 is

 

$$p = \sum_{k} P(k|10) \quad \text{for } k \text{ with } P(k|10) \le 0.807\,P(k|k)\,.$$

 

Relevant numerical values of λ(k, µ0) = P(k|µ0)/P(k|k) and P(k|µ0) for µ0 = 10 are given in Table 10.3.

Table 10.3. Values of λ and P, see text.

  k    8      9      10     11     12     13
  λ    0.807  0.950  1.000  0.953  0.829  0.663
  P    0.113  0.125  0.125  0.114  0.095  0.073

It is seen that the sum over k runs over all k except k = 9, 10, 11, 12:

$$p = \sum_{k=0}^{8} P(k|10) + \sum_{k=13}^{\infty} P(k|10) = 1 - \sum_{k=9}^{12} P(k|10) = 0.541\,,$$

which is certainly acceptable. This is an example of the p-value of a two-sided test.
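The example can be reproduced numerically with a few lines; the following sketch (not from the book) assumes SciPy's Poisson probability mass function.

```python
# Sketch: two-sided p-value of the likelihood ratio test for a Poisson count,
# H0: mu0 = 10, observed k_obs = 8, as in the example above.
import numpy as np
from scipy.stats import poisson

mu0, k_obs = 10, 8
lam = lambda k: poisson.pmf(k, mu0) / poisson.pmf(k, np.maximum(k, 1e-12))

lam_obs = lam(k_obs)                       # ~0.807
k = np.arange(0, 200)                      # effectively the full support
p_value = poisson.pmf(k, mu0)[lam(k) <= lam_obs].sum()
print(lam_obs, p_value)                    # ~0.807, ~0.54
```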

The likelihood ratio test in this general form is useful to discriminate between a specific and a more general hypothesis, a problem which we will study in Sect. 10.5.2. To apply it as a goodness-of-fit test, we have to histogram the data.

The Likelihood Ratio Test for Histograms

We have shown that the likelihood L0 = Πi f0(xi) of a sample cannot be used as a test statistic, but when we combine the data into bins, a likelihood ratio can be defined for the histogram and used as test quantity. The test variable is the ratio of the likelihood for the hypothesis that the bin content is predicted by H0 and the likelihood for the hypothesis that maximizes the likelihood for the given sample. The latter is the likelihood for the hypothesis where the prediction for each bin coincides with its content. If H0 is not simple, we take the ratio of the maximum likelihood allowed by H0 and the unconstrained maximum of L.

For a bin with content d, prediction t and p.d.f. f(d|t) this ratio is λ = f(d|t)/f(d|d) since at t = d the likelihood is maximal. For the histogram we have to multiply the ratios of the B individual bins. Instead we change to the log-likelihoods and use as test statistic

$$V = \ln\lambda = \sum_{i=1}^{B} \bigl[\ln f(d_i|t_i) - \ln f(d_i|d_i)\bigr]\,.$$


If the bin content follows the Poisson statistics we get (see Chap. 6, Sect. 6.5.6)

$$V = \sum_{i=1}^{B} \bigl[-t_i + d_i\ln t_i - \ln(d_i!) + d_i - d_i\ln d_i + \ln(d_i!)\bigr]
   = \sum_{i=1}^{B} \bigl[d_i - t_i + d_i\ln(t_i/d_i)\bigr]\,.$$

The distribution of the test statistic V is not universal, i.e. not independent of the distribution to be tested as in the case of χ2. It has to be determined by a Monte Carlo simulation. In case parameters of the prediction have been adjusted to data, the parameter adjustment has to be included in the simulation.
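A minimal sketch (not from the book) of the statistic V for Poisson-distributed bin contents, with its H0 distribution generated by simulation as required; no parameters are fitted here, and the predicted contents are placeholder values.

```python
# Sketch: likelihood ratio statistic V for a histogram with Poisson bin
# contents; its distribution under H0 is generated by Monte Carlo simulation.
import numpy as np

rng = np.random.default_rng(7)
t = np.array([20., 15., 10., 6., 3.])        # predicted bin contents (H0)

def V_stat(d, t):
    # d * ln(t/d) is set to 0 for d = 0 (limit of d ln d -> 0)
    ratio = np.where(d > 0, d * np.log(t / np.where(d > 0, d, 1.0)), 0.0)
    return np.sum(d - t + ratio)

d_obs = rng.poisson(t)                        # stand-in for observed data
V_obs = V_stat(d_obs, t)

sim = np.array([V_stat(rng.poisson(t), t) for _ in range(50_000)])
p_value = np.mean(sim <= V_obs)               # small (very negative) V signals disagreement
print(V_obs, p_value)
```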

The method can be extended to weighted events and to the case of Monte Carlo generated predictions with corresponding statistical errors, see Appendix 13.8.

Asymptotically, N → ∞, the test statistic V approaches −χ2/2 as is seen from the expansion of the logarithm, ln(1+x) ≈ x−x2/2. After introducing xi = (di−ti)/ti which, according to the law of large numbers, becomes small for large di, we find

$$\begin{aligned}
V &= \sum_{i=1}^{B} \bigl[t_i x_i - t_i(1+x_i)\ln(1+x_i)\bigr] \\
  &\approx \sum_{i=1}^{B} t_i \Bigl[x_i - (1+x_i)\bigl(x_i - \tfrac{1}{2}x_i^2\bigr)\Bigr] \\
  &\approx -\frac{1}{2}\sum_{i=1}^{B} t_i x_i^2
   = -\frac{1}{2}\sum_{i=1}^{B} \frac{(d_i - t_i)^2}{t_i}
   = -\frac{1}{2}\,\chi^2\,,
\end{aligned}$$

and thus −2V is distributed according to a χ2 distribution with B degrees of freedom; but then we may also use the χ2 test directly.

If the prediction is normalized to the data, we have to replace the Poisson distribution by the multinomial distribution. We omit the calculation and present the result:

$$V = \sum_{i=1}^{B} d_i \ln(t_i/d_i)\,.$$

In this case, V approaches asymptotically the χ2 distribution with B − 1 degrees of freedom.

10.3.5 The Kolmogorov–Smirnov Test

The subdivision of a sample into intervals is arbitrary and thus subjective. Unfortunately some experimenters use this freedom to choose histogram bins such that the data agree as well as possible with the theoretical description in which they believe. This problem is excluded in binning-free tests, which have the additional advantage that they are also applicable to small samples.

The Kolmogorov–Smirnov test compares the distribution function

$$F_0(x) = \int_{-\infty}^{x} f_0(x')\,dx'$$




Fig. 10.9. Comparison of the empirical distribution function S(x) with the theoretical distribution function F (x).

with the corresponding experimental quantity S,

$$S(x) = \frac{\text{number of observations with } x_i < x}{\text{total number } N}\,.$$

The test statistic is the maximum difference D between the two functions:

$$D = \sup_x |F(x) - S(x)| = \sup(D^+, D^-)\,.$$

The quantities D+ and D− denote the maximum positive and negative differences, respectively. S(x) is a step function, an experimental approximation of the distribution function, and is called the Empirical Distribution Function (EDF). It is depicted in Fig. 10.9 for an example and compared to the distribution function F(x) of H0. To calculate S(x) we sort all N elements in ascending order of their values, xi < xi+1, and add 1/N to S(x) at each location xi. Then S(xi) is the fraction of observations with x values less than or equal to xi,

S(xi) = i/N ,   S(xN) = 1 .

As in the χ2 test we can determine the expected distribution of D, which will depend on N, and transform the experimental value of D into a p-value. To get rid of the N dependence of the theoretical D distribution we use D* = √N D. Its distribution under H0 is, for not too small N (N ≳ 100), independent of N and available in the form of tables and graphs. For event numbers larger than about 20 the approximation D* = D(√N + 0.12 + 0.11/√N) is still very good5. The function p(D*) is displayed in Fig. 10.10.

5 D does not scale exactly with 1/√N because S increases in discrete steps.
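A minimal sketch (not from the book) of the Kolmogorov–Smirnov statistic: the sample is generated as a stand-in, the hypothesis distribution is taken to be an exponential as an assumed example, and SciPy's kstest is used for the p-value (it applies its own exact/asymptotic distribution rather than the approximation quoted above).

```python
# Sketch: Kolmogorov-Smirnov test of a sample against a fixed hypothesis
# distribution F0 (here a standard exponential); SciPy is assumed.
import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=200)      # stand-in for the data sample

# D = sup |F0(x) - S(x)| computed by hand ...
xs = np.sort(x)
S_hi = np.arange(1, len(xs) + 1) / len(xs)    # S just above each data point
S_lo = np.arange(0, len(xs)) / len(xs)        # S just below each data point
F0 = expon.cdf(xs)
D = max(np.max(S_hi - F0), np.max(F0 - S_lo))

# ... and the same statistic with its p-value from SciPy.
result = kstest(x, expon.cdf)
print(D, result.statistic, result.pvalue)
```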


 


Fig. 10.10. P-value as a function of the Kolmogorov test statistic D*.

The Kolmogorov–Smirnov test emphasizes the center of the distribution more than the tails, because at the borders the distribution function is tied to the values zero and one and is thus little sensitive to deviations there. Since it is based on the distribution function, deviations are integrated over a certain range. Therefore it is not very sensitive to deviations which are localized in a narrow region. In Fig. 10.8 the left-hand and the right-hand histograms have the same excess of entries in the region left of the center. The Kolmogorov–Smirnov test produces in both cases approximately the same value of the test statistic, even though we would think that the distribution of the right-hand histogram is harder to explain by a statistical fluctuation of a uniform distribution. This shows again that the power of a test depends strongly on the alternatives to H0. The deviations of the left-hand histogram are well detected by the Kolmogorov–Smirnov test, those of the right-hand histogram much better by the Anderson–Darling test, which we will present below.

There exist other EDF tests [57] which in most situations are more effective than the simple Kolmogorov–Smirnov test.

10.3.6 Tests of the Kolmogorov–Smirnov and Cramer–von Mises Families

In the Kuiper test one uses as the test statistic the sum V = D+ + D− of the two deviations of the empirical distribution function S from F. This quantity is designed for distributions “on the circle”. These are distributions where the beginning and the end of the distributed quantity are arbitrary, like the distribution of the azimuthal angle, which can be presented with equal justification in all intervals [ϕ0, ϕ0 + 2π] with arbitrary ϕ0.

The tests of the Cramer–von Mises family are based on the quadratic difference between F and S. The simple Cramer–von Mises test employs the test statistic


$$W^2 = \int_{-\infty}^{\infty} \bigl[F(x) - S(x)\bigr]^2\, dF\,.$$

In most situations the Anderson–Darling test with the test statistic A2 and the test of Watson with the test statistic U2

$$A^2 = N \int_{-\infty}^{\infty} \frac{[S(x) - F(x)]^2}{F(x)\,[1 - F(x)]}\,dF\,,$$

$$U^2 = N \int_{-\infty}^{\infty} \left[S(x) - F(x) - \int_{-\infty}^{\infty} [S(x) - F(x)]\,dF\right]^2 dF$$

are superior to the Kolmogorov–Smirnov test.

The test of Anderson and Darling emphasizes especially the tails of the distribution, while Watson's test has been developed for distributions on the circle. The formulas above look quite complicated at first sight. They simplify considerably when we perform a probability integral transformation (PIT). This term stands for a simple transformation of the variate x into the variate z = F0(x), which is uniformly distributed in the interval [0, 1] and has the simple distribution function H0(z) = z. With the transformed step distribution S*(z) of the sample we get

$$A^2 = N \int_{0}^{1} \frac{[S^*(z) - z]^2}{z(1 - z)}\,dz\,,$$

$$U^2 = N \int_{0}^{1} \left[S^*(z) - z - \int_{0}^{1} [S^*(z) - z]\,dz\right]^2 dz\,.$$

In Appendix 13.7 we show how to compute these test statistics; the asymptotic distributions are also collected there.
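For orientation only, a minimal sketch (not from the book, and not the formulas of Appendix 13.7) of the Anderson–Darling statistic after the PIT, using the standard closed-form sum over the ordered z values; the hypothesis distribution is an assumed exponential.

```python
# Sketch: Anderson-Darling statistic A^2 for a sample tested against a
# fully specified F0 (here a standard exponential), computed after the
# probability integral transformation z = F0(x).  The closed-form sum
# below is the standard expression for A^2 in terms of the ordered z_i.
import numpy as np
from scipy.stats import expon

def anderson_darling_A2(x, cdf):
    z = np.sort(cdf(x))
    n = len(z)
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(z) + np.log(1.0 - z[::-1])))

rng = np.random.default_rng(5)
x = rng.exponential(size=500)                 # stand-in data sample
print(anderson_darling_A2(x, expon.cdf))      # large A^2 signals disagreement
```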

10.3.7 Neyman’s Smooth Test

This test [59] is different from those discussed so far in that it parameterizes the alternative hypothesis. Neyman introduced the smooth test in 1937 (for a discussion by E. S. Pearson see [60]) as an alternative to the χ2 test, which is insensitive to deviations from H0 that are positive (or negative) in several consecutive bins. He insisted that in hypothesis testing the investigator has to bear in mind which departures from H0 are possible and thus to fix partially the p.d.f. of the alternative. The test is called “smooth” because, contrary to the χ2 test, the alternative hypothesis approaches H0 smoothly for vanishing parameter values. The hypothesis under test, H0, is again that the sample after the PIT, zi = F0(xi), follows a uniform distribution in the interval [0, 1].

The smooth test excludes alternative distributions of the form

 

$$g_k(z) = \sum_{i=0}^{k} \theta_i \pi_i(z)\,, \qquad (10.8)$$

where θi are parameters and the functions πi(z) are modified orthogonal Legendre polynomials that are normalized to the interval [0, 1] and symmetric or antisymmetric with respect to z = 1/2:
