
10.2 Some Deﬁnitions


When we test, for instance, the hypothesis that a coordinate is distributed according to N(0, 1), then for a sample consisting of a single measurement x, a reasonable test statistic is the absolute value |x|. We assume that if H0 is wrong then |x| would be large. A typical test statistic is the χ² deviation of a histogram from a prediction. Large values of χ² indicate that something might be wrong with the prediction.

Before we apply the test we have to fix a critical region K which leads to the rejection of H0 if t is located inside of it. Under the condition that H0 is true, the probability of rejecting H0 is α, P{t ∈ K | H0} = α, where α ∈ [0, 1] normally is a small quantity (e.g. 5 %). It is called the significance level or size of the test. For a test based on the χ² statistic, the critical region is defined by χ² > χ²max(α), where the parameter χ²max is a function of the significance level α. It fixes the range of the critical region.

To compute rejection probabilities we have to compute the p.d.f. f(t) of the test statistic. In some cases it is known as we will see below, but in other cases it has to be obtained by Monte Carlo simulation. The distribution f has to include all experimental conditions under which t is determined, e.g. the measurement uncertainties of t.

10.2.3 Errors of the First and Second Kind, Power of a Test

After the test parameters are selected, we can apply the test to our data. If the actually obtained value of t is outside the critical region, t ∉ K, then we accept H0, otherwise we reject it. This procedure implies four different outcomes with the following a priori probabilities:

1. H0 ∩ t ∈ K, P{t ∈ K | H0} = α: error of the first kind (H0 is true but rejected),

2. H0 ∩ t ∉ K, P{t ∉ K | H0} = 1 − α (H0 is true and accepted),

3. H1 ∩ t ∈ K, P{t ∈ K | H1} = 1 − β (H0 is false and rejected),

4. H1 ∩ t ∉ K, P{t ∉ K | H1} = β: error of the second kind (H0 is false but accepted).

When we apply the test to a large number of data sets or events, then the rate α, the error of the first kind, is the inefficiency in the selection of H0 events, while the rate β, the error of the second kind, represents the contamination of the selected events by H1 background. Of course, for given α, we would like to have β as small as possible. Given the rejection region K, which depends on α, β is also fixed for a given H1. For a reasonable test we expect that β decreases monotonically with increasing α: as α → 0 the critical region K shrinks, so the power 1 − β must decrease and the background is less suppressed. For fixed α, the power indicates the quality of a test, i.e. how well alternatives to H0 can be rejected.
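The trade-off between α and β can be made concrete with the single-measurement example from above, H0: x ~ N(0, 1) with test statistic |x|. The following is a minimal sketch, not part of the original text; the alternative H1: x ~ N(3, 1) and the function names are assumptions chosen purely for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)  # H0: x ~ N(0, 1)


def critical_value(alpha):
    """Return tc such that P(|x| > tc | H0) = alpha."""
    return STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def power(alpha, mu1):
    """Return 1 - beta = P(|x| > tc | H1) for the assumed H1: x ~ N(mu1, 1)."""
    tc = critical_value(alpha)
    h1 = NormalDist(mu1, 1.0)
    return (1.0 - h1.cdf(tc)) + h1.cdf(-tc)


if __name__ == "__main__":
    # Shrinking alpha shrinks the critical region and lowers the power.
    for alpha in (0.001, 0.01, 0.05, 0.2):
        print(f"alpha={alpha:<6} tc={critical_value(alpha):.3f} "
              f"power={power(alpha, mu1=3.0):.3f}")
```

For an alternative identical to H0 (mu1 = 0) the function returns α itself, which is the boundary case of an unbiased test.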

The power is a function, the power function, of the significance level α. Tests which provide maximum power 1 − β with respect to H1 for all values of α are called Uniformly Most Powerful (UMP) tests. Only in rare cases, where H1 is restricted in some way, does an optimum, i.e. UMP, test exist. If both hypotheses are simple then, as already mentioned in Chap. 6, Sect. 6.3, according to a lemma of Neyman and E. S. Pearson, the likelihood ratio can be used as test statistic to discriminate between H0 and H1 and provides a uniformly most powerful test.

248 10 Hypothesis Tests

The interpretation of α and β as error rates makes sense when many experiments or data sets of the same type are investigated. In a search experiment where we want to ﬁnd out whether a certain physical process or a phenomenon exists or in an isolated GOF test they refer to virtual experiments and it is not obvious which conclusions we can draw from their values.

10.2.4 P-Values

Strictly speaking, the result of a test is that a hypothesis is “accepted” or “rejected”. In most practical situations it is useful to replace this digital answer by a continuous parameter, the so-called p-value, which is a function p(t) of the test statistic t and which measures the compatibility of the sample with the null hypothesis, a small value of p casting some doubt on the validity of H0. For an observed value tobs of the test statistic, p is the probability to obtain a value t ≥ tobs under the null hypothesis:

p = P {t ≥ tobs|H0} .

To simplify its definition, we assume that the test statistic t is confined to values between zero and infinity with a critical region t > tc (see footnote 2). Let its distribution under H0 be f0(t). Then we have

$$p(t) = 1 - \int_0^t f_0(t')\,dt' = 1 - F_0(t) . \tag{10.1}$$

Since p is a unique monotonic function of t, we can consider p as a normalized test statistic which is completely equivalent to t.

The relationship between the different quantities which we have introduced is illustrated in Fig. 10.1. The upper graph represents the p.d.f. of the test statistic under H0. The critical region extends from tc to infinity. The a priori rejection probability for a sample under H0 is α, equal to the integral of the distribution of the test statistic over the critical region. The lower graph shows the p-value function. It starts at one and decreases continuously to zero at infinity. The smaller the test statistic is (think of χ²), the higher is the p-value. At t = tc the p-value is equal to the significance level α. The condition p < α leads to rejection of H0. Due to its construction, the p.d.f. of the p-value under H0 is uniform. The name p-value is derived from the word probability, but its experimentally observed value does not represent the probability that H0 is true. This is obvious from the fact that the p-value is a function of the selected test statistic. We will come back to this point when we discuss goodness-of-fit.
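As an illustration of relation (10.1), the p-value function can be written down explicitly for the earlier example of a single measurement x with H0: x ~ N(0, 1) and t = |x|. This is a sketch under that assumption; the names `p_value` and `t_c` are introduced here and do not appear in the original.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)  # H0: x ~ N(0, 1)


def p_value(t_obs):
    """p(t) = 1 - F0(t) for the test statistic t = |x| under H0."""
    # F0(t) = P(|x| <= t) = 2 * Phi(t) - 1 for t >= 0
    f0_cdf = 2.0 * STD_NORMAL.cdf(t_obs) - 1.0
    return 1.0 - f0_cdf


alpha = 0.05
# Critical value t_c defined by P(t > t_c | H0) = alpha
t_c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
```

Evaluating `p_value(t_c)` reproduces α up to rounding, and `p_value` falls monotonically from one at t = 0, which is the behaviour described for the lower graph of Fig. 10.1.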

10.2.5 Consistency and Bias of Tests

A test is called consistent if its power tends to unity as the sample size tends to inﬁnity. In other words: If we have an inﬁnitely large data sample, we should always be able to decide between H0 and the alternative H1.

We also require that, independent of α, the rejection probability for H1 is higher than for H0, i.e. α < 1 − β. Tests that violate this condition are called biased. Consistent tests are asymptotically unbiased.

Footnote 2: This condition can always be realized for one-sided tests. Two-sided tests are rare, see Example 129.

[Figure 10.1: upper panel, p.d.f. f(t) of the test statistic versus t, with the critical region starting at tc; lower panel, p-value p(t) versus t.]

Fig. 10.1. Distribution of a test statistic and corresponding p-value curve.

When H1 represents a family of distributions, consistency and unbiasedness are valid only if they apply to all members of the family. Thus, in case the alternative H1 is not specified, a test is biased if there is any hypothesis different from H0 with rejection probability less than α, and it is inconsistent if we can find a hypothesis different from H0 which is not rejected with power unity in the large sample limit.

Example 125. Bias and inconsistency of a test

Assume that in an experiment we select events of the type K⁰ → π⁺π⁻. The invariant mass mππ of the pion pairs has to match the K⁰ mass. Due to the finite experimental resolution the experimental masses of the pairs are normally distributed around the kaon mass mK with variance σ². With the null hypothesis H0 that we observe K⁰ → π⁺π⁻ decays, we may apply to our sample a test with the test quantity t = ⟨(mππ − mK)²⟩/σ², the normalized mean quadratic difference between the observed masses of N pairs and the nominal K⁰ mass. Our sample is accepted if it satisfies t < t0, where t0 is the critical quantity which determines the error of the first kind α and the acceptance 1 − α. The distribution of Nt under H0 is a χ² distribution with N degrees of freedom. Clearly, the test is biased, because we can imagine mass distributions with acceptance larger than 1 − α, for instance a uniform distribution in the range t ≤ t0. This test is also inconsistent, because it would favor this specific realization of H1 also for infinitely large samples. Nevertheless it is not unreasonable for very small samples in the considered case, and for N = 1 there is no alternative. The situation is different for large samples, where more powerful tests exist which take into account the Gaussian shape of the expected distribution under H0.


While consistency is a necessary condition for a sensible test applied to a large sample, bias and inconsistency of a test applied to a small sample cannot always be avoided and are tolerable under certain circumstances.

10.3 Goodness-of-Fit Tests

10.3.1 General Remarks

Goodness-of-fit (GOF) tests check whether a sample is compatible with a given distribution. Even though this is not possible in principle without a well-defined alternative, it works quite well in practice, the reason being that the choice of the test statistic is influenced by speculations about the behavior of alternatives, speculations which are based on our experience. Our presumptions depend on the specific problem to be solved and therefore very different testing procedures are on the market.

In empirical research outside the exact sciences, questions like “Is a certain drug effective?”, “Have girls less mathematical ability than boys?”, “Does the IQ follow a normal distribution?” are to be answered. In the natural sciences, GOF tests usually serve to detect unknown systematic errors in experimental results. When we measure the mean life of an unstable particle, we know that the lifetime distribution is exponential, but applying a GOF test is informative, because a low p-value may indicate a contamination of the events by background or problems with the experimental equipment. But there are also situations where we accept or reject hypotheses as a result of a test. Examples are event selection (e.g. B-quark production), particle track selection on the basis of the quality of reconstruction, and particle identification (e.g. electron identification based on calorimeter or Cerenkov information). Typical for these examples is that we examine a number of similar objects and accept a certain error rate α, while when we consider the p-value of the final result of an experiment, discussing an error rate does not make sense.

An experienced scientist has quite a good feeling for deviations between two distributions just by looking at a plot. For instance, when we examine the statistical distribution of Fig. 10.2, we will realize that its description by an exponential distribution is rather unsatisfactory. The question is: How can we quantify the disagreement? Without a concrete alternative it is rather difficult to make a judgement.

Let us discuss a different example: Throwing a die produces “1” ten times in sequence. Is this result compatible with the assumption H0 that the die is unbiased? Well, such a sequence does not occur frequently and the guess that something is wrong with the die is well justified. On the other hand, the sequence of ten times “1” is not less probable than any other sequence, namely (1/6)^10 ≈ 1.7 · 10⁻⁸. Our doubt relies on our experience: We have an alternative to H0 in mind, namely asymmetric dice. We can imagine asymmetric dice but not dice that produce with high probability a sequence like “4,5,1,6,3,3,6,2,5,2”. As a consequence we would choose a test which is sensitive to deviations from a uniform distribution. When we test a random number generator we would be interested, for example, in a periodicity of the results or a correlation between subsequent numbers and we would choose a different test. In GOF tests, we cannot specify H1 precisely, but we need to have an idea of it, which then enters into the selection of the test. We search for test parameters which we suppose discriminate between the null hypothesis and possible alternatives.

[Figure 10.2: number of events versus lifetime (0 to 1.0), showing the experimental distribution and the prediction.]

Fig. 10.2. Comparison of an experimental distribution to a prediction.

However, there is no such thing as a best test quantity as long as the alternative is not completely specified.

A typical test quantity is the χ² variable which we have introduced to adjust parameters of functions to experimental histograms or measured points with known error distributions. In the least-squares method of parameter inference, see Chap. 7, the parameters are fixed such that the sum χ² of the normalized quadratic deviations is minimum. Deviating parameter values produce larger values of χ²; consequently we expect the same effect when we compare the data to a wrong hypothesis. If χ² is abnormally large, it is likely that the null hypothesis is not correct.

Unfortunately, physicists use almost exclusively the χ² test, even though for many applications more powerful tests are available. Scientists also often overestimate the significance of the χ² test results. Other tests like the Kolmogorov–Smirnov test and tests of the Cramér–von Mises family avoid the always somewhat arbitrary binning of histograms in the χ² test. These tests are restricted to univariate distributions, however. Other binning-free methods can also be applied to multivariate distributions.

Sometimes students think that a good test statistic would be the likelihood L0 of the null hypothesis, i.e. for H0 with single event distribution f0(x) the product Πf0(xi). That this is not a good idea is illustrated in Fig. 10.3, where the null hypothesis is represented by a fully specified normal distribution. Of the two samples, the narrow one clearly fits the distribution worse but it has the higher likelihood. A sample where all observations are located at the center would by definition maximize the likelihood, but such a sample would certainly not support the null hypothesis.
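The argument can be checked numerically. The sketch below assumes H0 to be the standard normal distribution and constructs two hypothetical samples: one placed at the quantiles of H0, and one squeezed towards the centre by an arbitrary factor of 0.2. Neither sample is taken from the original figure.

```python
import math
from statistics import NormalDist

H0 = NormalDist(0.0, 1.0)  # fully specified null hypothesis (assumed N(0, 1))


def log_likelihood(sample):
    """Return ln L0, the sum of ln f0(x_i) under H0."""
    return sum(math.log(H0.pdf(x)) for x in sample)


n = 20
# Sample placed at the quantiles of H0: it spreads as H0 predicts.
typical = [H0.inv_cdf((i + 0.5) / n) for i in range(n)]
# Same shape, but squeezed towards the centre (arbitrary factor 0.2).
narrow = [0.2 * x for x in typical]
# Extreme case: all observations at the centre.
degenerate = [0.0] * n

ll_typical = log_likelihood(typical)
ll_narrow = log_likelihood(narrow)
ll_degenerate = log_likelihood(degenerate)
```

The likelihood grows as the sample is squeezed towards the centre, so `ll_degenerate > ll_narrow > ll_typical`, although the squeezed samples are clearly less compatible with the width predicted by H0.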

While the indicated methods are distribution-free, i.e. applicable to arbitrary distributions speciﬁed by H0, there are procedures to check the agreement of data with speciﬁc distributions like normal, uniform or exponential distributions. These methods are of inferior importance for physics applications. We will deal only with distribution-free methods.

Fig. 10.3. Two different samples and a hypothesis.

We will also exclude tests based on order statistics from our discussion. These tests are mainly used to test properties of time series and are not very powerful in most of our applications.

At the end of this section we want to stress that parameter inference with a valid hypothesis and GOF tests, which question the validity of a hypothesis, address two completely different problems. Whenever possible deviations can be parameterized, it is always appropriate to determine the likelihood function of the parameter and use the likelihood ratio to discriminate between different parameter values.

A good review of GOF tests can be found in [57], in which, however, more recent developments are missing.

10.3.2 P-Values

Interpretation and Use of P-Values

We have introduced p-values p in order to have at our disposal a quantity which measures the agreement between a sample and a distribution f0(t) of the test statistic t. Small p-values should indicate a bad agreement. Since the distribution of p under H0 is uniform in the interval [0, 1], all values of p in this interval are equally probable. When we reject a hypothesis under the condition p < 0.1, we have a probability of 10 % to reject a true H0. The rejection probability would be the same for a rejection region p > 0.9. The reason for cutting at low p-values is the expectation that distributions of H1 would produce low p-values.

The p-value is not the probability that the hypothesis under test is true. It is the probability under H0 to obtain a p-value which is smaller than the one actually observed. A p-value between zero and p is expected to occur in the fraction p of experiments if H0 is true.
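The uniformity of p under H0, and the accumulation at low p-values under an alternative, can be illustrated by a small simulation. This is a sketch only: it reuses the single-measurement test with t = |x| and H0: x ~ N(0, 1), and the alternative with mean 2 is an assumption made for the illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def p_value(x):
    """p-value of a single measurement x for H0: x ~ N(0, 1), t = |x|."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(x)))


def p_value_histogram(mean, n_experiments=20000, n_bins=10, seed=1):
    """Simulate experiments with x ~ N(mean, 1); return bin fractions of p."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_experiments):
        p = p_value(rng.gauss(mean, 1.0))
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return [c / n_experiments for c in counts]


h0_fractions = p_value_histogram(mean=0.0)  # H0 true: about 0.1 in every bin
h1_fractions = p_value_histogram(mean=2.0)  # assumed H1: excess in the first bin
```

Under H0 a cut at p < 0.1 and a cut at p > 0.9 reject the same fraction of experiments; only the behaviour under the alternative makes the low-p cut the sensible one.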

[Figure 10.4: two histograms of the number of observations versus x; panel A: p = 0.082, panel B: p = 0.073.]

Fig. 10.4. Comparison of two experimental histograms to a uniform distribution.

Example 126. The p-value and the probability of a hypothesis

In Fig. 10.4 we have histogrammed two distributions from two simulated experiments A and B. Are these uniform distributions? For experiment B with 10000 observations this is conceivable, while for experiment A with only 100 observations it is difficult to guess the shape of the distribution. Alternatives like strongly rising distributions are more strongly excluded in B than in A. We would therefore attribute a higher probability for the validity of the hypothesis of a uniform distribution to B than to A, but the p-values based on the χ² test are very similar in both cases, namely p ≈ 0.08. Thus the deviations from a uniform distribution would have the same significance in both cases.

We learn from this example also that the p-value is more sensitive to deviations in large samples than in small samples. Since in practice small unknown systematic errors can rarely be excluded, we should not be astonished that in high statistics experiments often small p-values occur. The systematic uncertainties which usually are not considered in the null hypothesis then dominate the purely statistical ﬂuctuation.

Even though we cannot transform signiﬁcant deviations into probabilities for the validity of a hypothesis, they provide useful hints for hidden measurement errors or contamination with background. In our example a linearly rising distribution has been added to uniform distributions. The fractions were 45% in experiment A and 5% in experiment B.

In some experimental situations we are able to compare many replicates of measurements to the same hypothesis. In particle physics experiments usually a huge number of tracks has to be reconstructed. The track parameters are adjusted by a χ² fit to measured points assuming normally distributed uncertainties. The χ² value of each fit can be used as a test statistic and transformed into a p-value, often called χ² probability. Histograms of p-values obtained in such a way are very instructive. They often look like the one shown in Fig. 10.5. The plot has two interesting features: It is slightly rising with increasing p-value, which indicates that the errors have been slightly overestimated. The peak at low p-values is due to fake tracks which do not correspond to particle trajectories and which we would eliminate almost completely by a cut at about pc = 0.05. We would have to pay for it by a loss of good tracks of somewhat less than 5 %. A more precise estimate of the loss can be obtained by an extrapolation of the smooth part of the p-value distribution to p = 0.

[Figure 10.5: histogram of the number of events versus p-value (0 to 1.0), with the cut marked at low p-values.]

Fig. 10.5. Experimental distribution of p-values.

Combination of p-Values

If two p-values p1, p2 which have been derived from independent test statistics t1, t2 are available, we would like to combine them to a single p-value p. The at first sight obvious idea to set p = p1p2 suffers from the fact that the distribution of p will not be uniform. A popular but arbitrary choice is

$$p = p_1 p_2 \left[ 1 - \ln(p_1 p_2) \right] \tag{10.2}$$

which can be shown to be uniformly distributed [65]. This choice has the unpleasant feature that the combination of the p-values is not associative, i.e. p[(p1, p2), p3] ≠ p[p1, (p2, p3)]. There is no satisfactory way to combine p-values.
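Both properties of relation (10.2), the uniform distribution of the combined value and the lack of associativity, can be checked numerically. The following sketch uses hypothetical input values 0.1, 0.5 and 0.9 and a simple Monte Carlo check; the function names are introduced here for illustration.

```python
import math
import random


def combine(p1, p2):
    """Combine two independent p-values with relation (10.2)."""
    q = p1 * p2
    return q * (1.0 - math.log(q))


# Non-associativity: the grouping of three p-values changes the result.
left = combine(combine(0.1, 0.5), 0.9)
right = combine(0.1, combine(0.5, 0.9))


def uniformity_check(n=50000, seed=7):
    """Fractions of combined p-values below 0.1, 0.5 and 0.9 for uniform inputs."""
    rng = random.Random(seed)
    # 1 - random() lies in (0, 1], which avoids log(0).
    combined = [combine(1.0 - rng.random(), 1.0 - rng.random()) for _ in range(n)]
    return [sum(p < cut for p in combined) / n for cut in (0.1, 0.5, 0.9)]


fractions = uniformity_check()
```

With these inputs the two groupings give roughly 0.49 and 0.28, while the simulated fractions stay close to the cut values 0.1, 0.5 and 0.9, as expected for a uniform distribution.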

We propose, if possible, not to use (10.2) but to go back to the original test statistics and construct from them a combined statistic t and the corresponding p-distribution. For instance, the obvious combination of two χ² statistics would be

$$t = \chi_1^2 + \chi_2^2 .$$

10.3.3 The χ2 Test in Generalized Form

The Idea of the χ2 Comparison

We consider a sample of N observations which are characterized by the values xi of a variable x and a prediction f0(x) of their distribution. We subdivide the range of x into B intervals to which we attach sequence numbers k. The prediction pk for the probability that an observation is contained in interval k is:

$$p_k = \int_k f_0(x)\,dx ,$$

with Σpk = 1. The integration extends over the interval k. The number of sample observations dk found in this bin has to be compared with the expectation value Npk. To interpret the deviation dk − Npk, we have to evaluate the expected mean quadratic deviation δk² under the condition that the prediction is correct. Since the distribution of the observations into bins follows a binomial distribution, we have

$$\delta_k^2 = N p_k (1 - p_k) .$$

Usually the observations are distributed into typically 10 to 50 bins. Thus the probabilities pk are small compared to unity and the expression in brackets can be omitted. This is the Poisson approximation of the binomial distribution. The mean quadratic deviation is equal to the number of expected observations in the bin:

$$\delta_k^2 = N p_k .$$

We now normalize the observed to the expected mean quadratic deviation,

$$\chi_k^2 = \frac{(d_k - N p_k)^2}{N p_k} ,$$

and sum over all B bins:

$$\chi^2 = \sum_{k=1}^{B} \frac{(d_k - N p_k)^2}{N p_k} . \tag{10.3}$$

By construction we have:

$$\langle \chi_k^2 \rangle \approx 1 , \quad \langle \chi^2 \rangle \approx B .$$

If the quantity χ² is considerably larger than the number of bins, then obviously the measurement deviates significantly from the prediction.

A significant deviation to small values, χ² ≪ B, even though considered unlikely, is tolerated, because we know that alternative hypotheses do not produce smaller ⟨χ²⟩ than H0.
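Relation (10.3) translates directly into a few lines of code. The sketch below is illustrative only; the bin contents and the uniform prediction are hypothetical numbers, not data from the text.

```python
def chi2_statistic(counts, probs):
    """Return (chi2, per-bin terms) following relation (10.3)."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same number of bins")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("bin probabilities must sum to one")
    n_total = sum(counts)  # N, the sample size
    terms = [(d - n_total * p) ** 2 / (n_total * p)
             for d, p in zip(counts, probs)]
    return sum(terms), terms


counts = [8, 12, 10, 10]   # hypothetical observed entries d_k
probs = [0.25] * 4         # hypothetical prediction p_k (uniform)
chi2, terms = chi2_statistic(counts, probs)
```

Here N = 40 and Npk = 10 in every bin, so only the first two bins contribute, each with (±2)²/10 = 0.4.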

The χ2 Distribution and the χ2 Test

We now want to be more quantitative. If H0 is valid, the distribution of χ² follows to a very good approximation the χ² distribution which we have introduced in Sect. 3.6.7 and which is displayed in Fig. 3.20. The approximation relies on the approximation of the distribution of observations per bin by a normal distribution, a condition which in most applications is sufficiently well satisfied if the expected number of entries per bin is larger than about 10. The parameter f of the χ² distribution, the number of degrees of freedom (NDF), is equal to the expectation value and to the number of bins minus one:

$$\langle \chi^2 \rangle = f = B - 1 . \tag{10.4}$$

[Figure 10.6: χ² distribution f(χ²) versus χ²; the area to the right of the observed value is the p-value P(χ²).]

Fig. 10.6. P-value for the observation χ̂².

Originally we had set ⟨χ²⟩ ≈ B, but this relation overestimates ⟨χ²⟩ slightly. The smaller value B − 1 is plausible because the individual deviations are somewhat smaller than one; remember, we had approximated the binomial distribution by a Poisson distribution. For instance, in the limit of a single bin, the mean deviation is not one but zero. We will come back to this point below.

In some cases we have not only a prediction of the shape of the distribution but also a prediction N0 of the total number of observations. Then the number of entries in each bin should follow a Poisson distribution with mean N0pk, and (10.3) has to be replaced by

$$\chi^2 = \sum_{k=1}^{B} \frac{(d_k - N_0 p_k)^2}{N_0 p_k} , \tag{10.5}$$

and we have f = B = ⟨χ²⟩.

In experiments with low statistics the approximation that the distribution of the number of entries in each bin follows a normal distribution is sometimes not justified and the distribution of the χ² quantity as defined by (10.3) or (10.5) is not very well described by a χ² distribution. Then we have the possibility to determine the distribution of our χ² variable under H0 by a Monte Carlo simulation (see footnote 3).
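A Monte Carlo determination of the distribution of χ² under H0 could look roughly as follows. This is a minimal sketch for the fixed-N case of relation (10.3); the five equal-probability bins, the sample size of 15 and the observed value 6.0 are hypothetical choices.

```python
import random


def chi2_statistic(counts, probs):
    """chi2 of relation (10.3) for observed counts and predicted probabilities."""
    n_total = sum(counts)
    return sum((d - n_total * p) ** 2 / (n_total * p)
               for d, p in zip(counts, probs))


def simulate_chi2(probs, n_obs, n_sim=4000, seed=3):
    """Simulate chi2 under H0 for experiments with n_obs observations each."""
    rng = random.Random(seed)
    bins = range(len(probs))
    values = []
    for _ in range(n_sim):
        counts = [0] * len(probs)
        # Distribute n_obs observations over the bins according to H0.
        for k in rng.choices(bins, weights=probs, k=n_obs):
            counts[k] += 1
        values.append(chi2_statistic(counts, probs))
    return values


def mc_p_value(chi2_obs, simulated):
    """Fraction of simulated experiments with chi2 >= the observed value."""
    return sum(v >= chi2_obs for v in simulated) / len(simulated)


probs = [0.2] * 5                           # hypothetical prediction, B = 5
simulated = simulate_chi2(probs, n_obs=15)  # only 3 expected entries per bin
p = mc_p_value(6.0, simulated)              # hypothetical observed chi2
```

The statistical precision of such a p-value is limited by the number of simulated experiments, which matters most for small p.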

In Fig. 10.6 we illustrate how we can deduce the p-value or χ² probability from the distribution and the experimental value χ̂² of our test statistic χ². The experimental value χ̂² divides the χ² distribution, which is fixed through the number of degrees of freedom and which is independent of the data, into two parts. According to its definition (10.1), the p-value p(χ̂²) is equal to the area of the right hand part. It is the fraction of many imagined experiments where χ² is larger than the experimentally observed value χ̂², always assuming that H0 is correct. As mentioned above, high values of χ² and correspondingly low values of p indicate that the theoretical description is inadequate to describe the data. The reason is in most cases found in experimental problems.
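For an integer number of degrees of freedom the area of the right hand part can be evaluated in closed form. The sketch below uses the standard finite-sum expressions for the upper tail of the χ² distribution; it is one possible implementation, not the procedure used by the authors, and the function name `chi2_sf` is introduced here.

```python
import math


def chi2_sf(chi2_obs, ndf):
    """Upper tail probability P(chi2 >= chi2_obs) for integer ndf >= 1."""
    if ndf < 1 or chi2_obs < 0:
        raise ValueError("need ndf >= 1 and chi2_obs >= 0")
    x = float(chi2_obs)
    if ndf % 2 == 0:
        # Even ndf: exp(-x/2) * sum_{r=0}^{ndf/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, ndf // 2):
            term *= x / (2.0 * r)
            total += term
        return math.exp(-x / 2.0) * total
    # Odd ndf: erfc(sqrt(x/2))
    #          + sqrt(2/pi) * exp(-x/2) * sum_r x^(r - 1/2) / (1*3*...*(2r-1))
    total = 0.0
    term = math.sqrt(x)
    for r in range(1, (ndf - 1) // 2 + 1):
        if r > 1:
            term *= x / (2.0 * r - 1.0)
        total += term
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total)
```

As a cross-check, the tabulated 5 % critical values 3.841 (f = 1), 5.991 (f = 2) and 18.307 (f = 10) all give p-values close to 0.05.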

Footnote 3: We have to be especially careful when α is small.