
7.2 Further Methods of Parameter Inference

distribution,

$$
f(y_1, \dots, y_N\,|\,\theta) \propto \exp\Big( -\frac{1}{2} \sum_{i,j=1}^{N} (y_i - t_i)\, V_{ij}\, (y_j - t_j) \Big) ,
$$

see Sect. 6.5.6. Maximizing the likelihood is again equivalent to minimizing χ² if the errors are normally distributed.

The sum χ² is not invariant under a non-linear transformation y′(y) of the variables. The least square method is also used when the error distribution is unknown; in this situation no better method is available.

Example 114. Least square method: fit of a straight line

We fit the parameters a, b of the straight line

y(x) = ax + b        (7.8)

to a sample of points (xi, yi) with uncertainties δi of the ordinates. We minimize χ²:

$$
\chi^2 = \sum_i \frac{(y_i - a x_i - b)^2}{\delta_i^2} ,
$$
$$
\frac{\partial \chi^2}{\partial a} = \sum_i \frac{2(-y_i + a x_i + b)\, x_i}{\delta_i^2} ,
$$
$$
\frac{\partial \chi^2}{\partial b} = \sum_i \frac{2(-y_i + a x_i + b)}{\delta_i^2} .
$$

We set the derivatives to zero and introduce the following abbreviations. (In parentheses we put the expressions for the special case where all uncertainties are equal, δi = δ):

$$
\bar{x} = \sum_i \frac{x_i}{\delta_i^2} \Big/ \sum_i \frac{1}{\delta_i^2} \quad \Big( = \sum_i x_i / N \Big) ,
$$
$$
\bar{y} = \sum_i \frac{y_i}{\delta_i^2} \Big/ \sum_i \frac{1}{\delta_i^2} \quad \Big( = \sum_i y_i / N \Big) ,
$$
$$
\overline{x^2} = \sum_i \frac{x_i^2}{\delta_i^2} \Big/ \sum_i \frac{1}{\delta_i^2} \quad \Big( = \sum_i x_i^2 / N \Big) ,
$$
$$
\overline{xy} = \sum_i \frac{x_i y_i}{\delta_i^2} \Big/ \sum_i \frac{1}{\delta_i^2} \quad \Big( = \sum_i x_i y_i / N \Big) .
$$

We obtain

$$
\hat{b} = \bar{y} - \hat{a}\, \bar{x}
$$
and
$$
\overline{xy} - \hat{a}\, \overline{x^2} - \hat{b}\, \bar{x} = 0 ,
$$
which yields
$$
\hat{a} = \frac{\overline{xy} - \bar{x}\, \bar{y}}{\overline{x^2} - \bar{x}^2} , \qquad
\hat{b} = \frac{\overline{x^2}\, \bar{y} - \bar{x}\, \overline{xy}}{\overline{x^2} - \bar{x}^2} .
$$


The problem is simplified when we put the origin of the abscissa at the center of gravity $\bar{x}$:

$$
x' = x - \bar{x} , \qquad \hat{a} = \frac{\overline{x'y}}{\overline{x'^2}} , \qquad \hat{b} = \bar{y} .
$$

Now the equation of the straight line reads

$$
y = \hat{a}\,(x - \bar{x}) + \hat{b} . \qquad (7.9)
$$

We gain an additional advantage: the errors of the estimated parameters are no longer correlated,

$$
\delta^2(\hat{a}) = 1 \Big/ \sum_i \frac{x_i'^2}{\delta_i^2} , \qquad
\delta^2(\hat{b}) = 1 \Big/ \sum_i \frac{1}{\delta_i^2} .
$$

We recommend always using the form (7.9) instead of (7.8).
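For concreteness, here is a minimal numerical sketch of the fit in Python (the data points and uncertainties are invented for illustration); it evaluates the weighted means defined above and the estimates â, b̂ in the centered form (7.9), together with their uncorrelated errors.

```python
import numpy as np

# Hypothetical data: points (x_i, y_i) with uncertainties delta_i on the ordinates.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
delta = np.array([0.3, 0.3, 0.4, 0.4, 0.5])

w = 1.0 / delta**2                       # weights 1/delta_i^2

def wmean(z):
    """Weighted mean, the 'bar' quantities of the abbreviations above."""
    return np.sum(w * z) / np.sum(w)

xbar = wmean(x)
xp = x - xbar                            # centered abscissa x' = x - xbar

a_hat = wmean(xp * y) / wmean(xp**2)     # slope
b_hat = wmean(y)                         # ordinate at xbar, uncorrelated with a_hat

var_a = 1.0 / np.sum(w * xp**2)          # delta^2(a_hat)
var_b = 1.0 / np.sum(w)                  # delta^2(b_hat)

print(f"y = {a_hat:.3f} (x - {xbar:.3f}) + {b_hat:.3f}")
print(f"delta(a) = {np.sqrt(var_a):.3f}, delta(b) = {np.sqrt(var_b):.3f}")
```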

7.2.3 Linear Regression

If the prediction depends only linearly on the parameters, we can compute the parameters which minimize χ2 analytically. We put

yt(θ) = a + Tθ .

(7.10)

Here θ is the P-dimensional parameter vector, a is a given N-dimensional vector, and yt is the N-dimensional vector of predictions. T, also called the design matrix, is a rectangular matrix of given elements with P columns and N rows.

The straight line fit discussed in Example 114 is a special case of (7.10) with

yt = θ₁x + θ₂ , a = 0, and

$$
T^T = \begin{pmatrix} x_1 & \cdots & x_N \\ 1 & \cdots & 1 \end{pmatrix} .
$$

We have to find the minimum of

$$
\chi^2 = (y - a - T\theta)^T\, V\, (y - a - T\theta) ,
$$
where, as usual, V is the weight matrix, the inverse of the covariance matrix, V = C⁻¹. In our case it is a diagonal N × N matrix with elements 1/δi². To simplify the notation we transform the observations,
$$
y' = y - a ,
$$
derive
$$
\chi^2 = (y' - T\theta)^T\, V\, (y' - T\theta)
$$
with respect to the parameters θ and set the derivatives equal to zero:

$$
\frac{1}{2}\, \frac{\partial \chi^2}{\partial \theta}\bigg|_{\hat{\theta}} = 0 = -T^T V\, (y' - T\hat{\theta}) . \qquad (7.11)
$$


From these so-called normal equations we get the estimate for the P parameters θ̂:
$$
\hat{\theta} = (T^T V T)^{-1}\, T^T V\, y' . \qquad (7.12)
$$

Note that TᵀVT is a symmetric P × P matrix with a unique inverse, which turns out to be the error (i.e. covariance) matrix of θ̂: with the usual propagation of errors (see Sect. 4.3) we obtain

$$
E_\theta = D\, C\, D^T = (T^T V T)^{-1} ,
$$
substituting for D the matrix $(T^T V T)^{-1} T^T V$ from the right-hand side of (7.12); the simplification follows after some algebra, using V = C⁻¹.
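As a sketch of how (7.10)–(7.12) can be translated into code, the following Python fragment repeats the straight-line fit of Example 114 as a linear regression; the data are again hypothetical, and numpy's matrix inverse stands in for a dedicated solver of the normal equations.

```python
import numpy as np

# Hypothetical data for a straight-line fit written as linear regression (7.10).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
delta = np.array([0.3, 0.3, 0.4, 0.4, 0.5])

T = np.column_stack([x, np.ones_like(x)])   # design matrix: columns for theta_1 (slope) and theta_2 (intercept)
V = np.diag(1.0 / delta**2)                 # weight matrix, inverse of the diagonal covariance matrix
a = np.zeros_like(y)                        # the fixed vector a; here a = 0
yp = y - a                                  # transformed observations y' = y - a

W = T.T @ V @ T                             # symmetric P x P matrix T^T V T
C_theta = np.linalg.inv(W)                  # covariance (error) matrix of the parameters
theta_hat = C_theta @ T.T @ V @ yp          # normal equations, Eq. (7.12)

print("theta_hat =", theta_hat)
print("errors    =", np.sqrt(np.diag(C_theta)))
```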

Linear regression provides an optimal solution only for normally distributed errors. If the error distribution is unknown apart from its variance⁴, the estimator is, according to the Gauss–Markov theorem, optimal in the following restricted sense: if the error distribution has mean value zero, the estimator is unbiased and has minimal variance among all unbiased estimators which are linear in the observations. In fact, the result then coincides with that of a likelihood fit if normally distributed errors are assumed.

Linear problems are rare. When the prediction is a non-linear function of the parameters, the problem can be linearized by a Taylor expansion as a first rough approximation. By iteration the precision can be improved.
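As an illustration of this iterative linearization, here is a rough Gauss–Newton-type sketch for a hypothetical exponential model; the data, the starting values and the fixed number of iterations are all assumptions made for the example.

```python
import numpy as np

# Hypothetical non-linear model y_t(theta) = theta_0 * exp(-theta_1 * x).
x = np.linspace(0.0, 4.0, 9)
y = np.array([2.05, 1.62, 1.30, 1.08, 0.85, 0.71, 0.55, 0.46, 0.36])
delta = np.full_like(y, 0.05)
V = np.diag(1.0 / delta**2)

def model(theta):
    return theta[0] * np.exp(-theta[1] * x)

def jacobian(theta):
    e = np.exp(-theta[1] * x)
    return np.column_stack([e, -theta[0] * x * e])    # derivatives of y_t w.r.t. theta

theta = np.array([1.0, 0.5])                          # rough starting values
for _ in range(10):
    T = jacobian(theta)                               # design matrix of the linearized problem
    r = y - model(theta)                              # residuals play the role of y'
    dtheta = np.linalg.solve(T.T @ V @ T, T.T @ V @ r)
    theta = theta + dtheta                            # improve the estimate by iteration

C = np.linalg.inv(jacobian(theta).T @ V @ jacobian(theta))
print("theta =", theta, " errors =", np.sqrt(np.diag(C)))
```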

The importance of non-linear parameter inference by iterative linear regression has decreased considerably. The minimum searching routines which we find in all computer libraries are more efficient and easier to apply. Some basic minimum searching approaches are presented in Appendix 13.9.

7.3 Comparison of Estimation Methods

The following table contains an evaluation of the virtues and properties of the estimation approaches which we have been discussing.

Whenever possible, the likelihood method should be applied. It requires a sample of observations and a p.d.f. in analytic or well-defined numerical form, and it is very sensitive to wrongly assigned observations in the sample. When the theoretical description of the data is given in the form of a simulated histogram, the Poisson likelihood adjustment of the simulation to the bin content should be chosen. When we have to fit a function to measured data points, we use the least square method. If computing time is a limitation, as in some on-line applications, the moments method lends itself. In many situations all three methods are equivalent.

All methods are sensitive to spurious background. Robust methods have been invented especially to solve this problem; an introduction and references are given in Appendix 13.15. For completeness, we present in Appendix 13.3.1 some frequentist criteria of point and interval estimation which are relevant when parameters of many objects of the same type, for instance particle tracks, are measured. In Appendix 13.6 we discuss the virtues of different point and interval inference approaches. Algorithms for minimum search are sketched in Appendix 13.9.

⁴Usually, when the variance is known, we also have an idea about the shape of the distribution.


Table 7.1. Virtues and caveats of different methods of parameter estimation.

                            moments                  χ²               max. likelihood
  simplicity                ++                                        +
  precision                                          +                ++
  individual observations   +                                         +
  measured points                                    +
  histograms                +                        +                +
  upper and lower limits                                              +
  external constraints                               +                +
  background included       +                                         +
  error assignment          from error propagation   χ²_min + 1       ln L_max − 0.5
  requirement               full p.d.f.              only variance    full p.d.f.

8 Interval Estimation

8.1 Introduction

In Chap. 4 we presented a short introduction to error calculus. It was based on probability theory. In principle, error estimation is an essential part of statistics and of similar importance as parameter estimation. Measurements result from point estimation of one or several parameters, measurement errors from interval¹ estimation. These two parts form an ensemble and have to be defined in a consistent way.

As we have already mentioned, the notation measurement error used by scientists is somewhat misleading; more precise is the term measurement uncertainty. In the field of statistics the common term is confidence intervals, an expression which is often restricted to the specific frequentist intervals introduced by Neyman, which we sketch in the Appendix.

It is in no way obvious how we ought to define error or confidence intervals, and this is why statisticians have very different opinions on this subject. There are various conventions in different fields of physics, and particle physicists have not yet adopted a common solution.

Let us start with a wish list which summarizes the properties we would like to realize in the single-parameter case. The extrapolation to several parameters is straightforward.

1. Error intervals should contain the wanted true parameter with a fixed probability.

2. For a given probability, the interval should be as short as possible.

3. The error interval should represent the mean square spread of measurements around the true parameter value. In allusion to the corresponding probability term we talk about standard deviation errors.

4. The definition has to be consistent, i.e. observations containing identical information about the parameters should lead to identical intervals. More precise measurements should have shorter intervals than less precise ones. The error interval has to contain the point estimate.

5. Error intervals should be invariant under transformation of the estimated parameter.

¹The term interval is not restricted to a single dimension. In n dimensions it describes an n-dimensional volume.


6. The computation of the intervals should be free from subjective, e.g. more or less arbitrary, model dependent assumptions.

7. A consistent method for the combination of measurements and for error propagation has to exist.

8. The approach has to be simple and transparent.

Unfortunately, it is impossible to fulfil all these conditions simultaneously, since they partially contradict each other. We will have to set priorities, and sometimes we will have to use ad hoc solutions which are justified only by experience and common sense. Under all circumstances we will satisfy point 4, i.e. consistency. As far as possible, we will follow the likelihood principle and derive the interval limits solely from the likelihood function.

It turns out that the same procedure is not always optimal for interval estimation. For instance, if we measure the size or the weight of an object, precision is the dominant requirement, i.e. properties denoting the reliability or reproducibility of the data. Here a quantity like the variance, corresponding to the mean quadratic deviation, is appropriate to describe the error or uncertainty intervals. In contrast, limits, for instance on the mass of a hypothetical particle like the Higgs particle, serve to verify theoretical predictions. Here the dominant aspect is probability, and we talk about confidence or credibility intervals². Confidence intervals are usually defined such that they contain a parameter with high probability, e.g. 90 % or 95 %, while error intervals comprise one standard deviation or something equivalent. The exact calculation of the standard deviation, as well as of the probability that a parameter is contained inside an interval, requires the knowledge of its p.d.f., which depends not only on the likelihood function but also on the prior density, which in most cases is unknown. Introducing a subjective prior, however, is something we want to avoid.

First we treat situations where the aspect precision dominates. There, as far as possible, we will base our considerations on the likelihood function only. Then we will discuss cases where the probability aspect is important. These will deal mainly with limits on hypothetical quantities, like masses of SUSY particles. There we will be obliged to include prior densities.

8.2 Error Intervals

The purpose of error intervals is to document the precision of a measurement. They are indispensable when we combine measurements. The combination of measurements permits us to improve continuously the precision of a parameter estimate with increasing number of measurements.

If the prior density is known with sufficient precision, we determine the probability density of the parameter(s) and subsequently the moments. But this condition is so rarely fulfilled that we need not discuss it. Normally, we are left with the likelihood function only.

In what follows we will always assume that the likelihood function is of simple shape, differentiable, with only one maximum and decreasing continuously to all

2The term credibility interval is used for Bayesian intervals.


sides. This condition is realized in most cases. In the remaining ones where it is of complicated shape we have to renounce the simple parametrization by point and interval estimates and present the full likelihood function.

The width of the likelihood function indicates how precise a measurement is. The standard error limits, as introduced in Sect. 6.5.1 – a decrease of the likelihood by a factor e^(1/2) from its maximum – rely on the likelihood ratio. These limits have the positive property of being independent of the parameter metric: in the one-dimensional case this means that for a parameter λ(θ) which is a monotonic function of θ, the limits λ1, λ2, θ1, θ2 fulfil the relations λ1 = λ(θ1) and λ2 = λ(θ2). It does not matter whether we write the likelihood as a function of θ or of λ.

In large experiments there are usually many different effects which influence the final result and consequently also many different independent sources of uncertainty, most of which are of the systematic type. Systematic errors (see Sect. 4.2.3) such as calibration uncertainties can only be treated in the Bayesian formalism. We have to estimate their p.d.f. or at least a mean value and a standard deviation.

8.2.1 Parabolic Approximation

The error assignment is problematic only for small samples. As is shown in Appendix 13.3, the likelihood function approaches a Gaussian with increasing size of the sample. At the same time its width decreases with increasing sample size, and we can neglect possible variations of the prior density in the region where the likelihood is significant. Under this condition we obtain a normally distributed p.d.f. for the parameter(s), with a standard deviation error interval given by the decrease of the likelihood by a factor e^(1/2) from its maximum. It includes the parameter with probability 68.3 % (see Sect. 4.5). The log-likelihood then is parabolic and the error interval corresponds to the region within which it decreases from its maximum by a value of 1/2, as we have already fixed previously. This situation is certainly realized for the large majority of all measurements which are published in the Particle Data Book [24].

In the parabolic approximation the MLE and the expectation value coincide, as well as the likelihood ratio error squared and the variance. Thus we can also derive the standard deviation δθ from the curvature of the likelihood function at its maximum. For a single parameter we can approximate the likelihood function by the expression

 

$$
-\ln L_p = \frac{1}{2}\, V\, (\theta - \hat{\theta})^2 + \mathrm{const.} \qquad (8.1)
$$

Consequently, a change of ln L_p by 1/2 corresponds to the second derivative of ln L at θ̂:
$$
(\delta\theta)^2 = V^{-1} = -\left( \frac{d^2 \ln L}{d\theta^2}\bigg|_{\hat{\theta}} \right)^{-1} .
$$

For several parameters the parabolic approximation can be expressed by
$$
-\ln L = \frac{1}{2} \sum_{i,j} (\theta_i - \hat{\theta}_i)\, V_{ij}\, (\theta_j - \hat{\theta}_j) + \mathrm{const.}
$$

We obtain the symmetric weight matrix³ V from the derivatives


 

$$
V_{ij} = -\frac{\partial^2 \ln L}{\partial\theta_i\, \partial\theta_j}\bigg|_{\hat{\theta}}
$$
and the covariance or error matrix from its inverse, C = V⁻¹.

³It is also called the Fisher information.

If we are interested in only part of the parameters, we can eliminate the remaining nuisance parameters by simply ignoring the part of the error matrix which contains the corresponding elements. This is a consequence of the considerations of Sect. 6.9.

In most cases the likelihood function is not known analytically. Usually, we have a computer program which delivers the likelihood function for arbitrary values of the parameters. Once we have determined the maximum, we are able to estimate the second derivatives and the weight matrix V by computing the likelihood function at parameter points close to the MLE. To ensure that the parabolic approximation is valid, we should increase the distance of the points and check whether the result remains consistent.
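The following sketch illustrates this numerical procedure for a hypothetical two-parameter problem, a Gaussian sample with unknown mean and width: the weight matrix V is estimated by finite differences of −ln L around the MLE and inverted to give the error matrix. The model, the data and the step size h are assumptions made only for the illustration.

```python
import numpy as np

def neg_log_L(theta, data):
    """Hypothetical -ln L: Gaussian sample with unknown mean and sigma."""
    mu, sigma = theta
    return 0.5 * np.sum((data - mu)**2) / sigma**2 + len(data) * np.log(sigma)

def weight_matrix(f, theta_hat, data, h=1e-3):
    """Finite-difference estimate of V_ij = d^2(-ln L)/dtheta_i dtheta_j at the MLE."""
    n = len(theta_hat)
    V = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                t = np.array(theta_hat, dtype=float)
                t[i] += di
                t[j] += dj
                return f(t, data)
            V[i, j] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h**2)
    return V

rng = np.random.default_rng(1)
data = rng.normal(1.0, 0.5, size=100)
theta_hat = np.array([data.mean(), data.std()])   # the MLE for this toy model

V = weight_matrix(neg_log_L, theta_hat, data)     # weight matrix
C = np.linalg.inv(V)                              # covariance (error) matrix
print("errors:", np.sqrt(np.diag(C)))
```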

In the literature we frequently find statements like “The measurement excludes the theoretical prediction by four standard deviations.” These kinds of statements have to be interpreted with caution. Their validity relies on the assumption that the log-likelihood is parabolic over a very wide parameter range. Neglecting tails can lead to completely wrong conclusions. We also have to remember that for a given number of standard deviations the probability decreases with the number of dimensions (see Tab. 4.2 in Sect. 4.5).

In the following section we address more problematic situations which usually occur with small data samples, where the asymptotic solutions are not appropriate. Fortunately, they are rather the exception. We keep in mind that a relatively rough estimate of the error is often sufficient, so that approximate methods are justified in most cases.

8.2.2 General Situation

As above, we again use the likelihood ratio to define the error limits, which now usually are asymmetric. In the one-dimensional case the two errors δ₋ and δ₊ satisfy
$$
\ln L(\hat{\theta}) - \ln L(\hat{\theta} - \delta_-) = \ln L(\hat{\theta}) - \ln L(\hat{\theta} + \delta_+) = 1/2 . \qquad (8.2)
$$

If the log-likelihood function deviates considerably from a parabola, it makes sense to supplement the one standard deviation limits (∆ ln L = −1/2) with the two standard deviation limits (∆ ln L = −2) to provide a better documentation of the shape of the likelihood function. This complication can be avoided if we can obtain an approximately parabolic likelihood function by an appropriate parameter transformation. In some situations it is useful to document, in addition to the mode of the likelihood function and the asymmetric errors, also the mean and the standard deviation, if available; these are relevant, for instance, in some cases of error propagation which we will discuss below.

Example 115. Error of a lifetime measurement

To determine the mean lifetime τ of a particle from a sample of observed decay times, we use the likelihood function

 

 

 

 

 


 

$$
L_\tau = \prod_{i=1}^{N} \frac{1}{\tau}\, e^{-t_i/\tau} = \frac{1}{\tau^N}\, e^{-N\bar{t}/\tau} . \qquad (8.3)
$$

The corresponding likelihood for the decay rate is
$$
L_\lambda = \prod_{i=1}^{N} \lambda\, e^{-\lambda t_i} = \lambda^N\, e^{-N\bar{t}\lambda} .
$$

The values of the functions are equal at equivalent values of the two parameters τ and λ, i.e. for λ = 1/τ:

Lλ(λ) = Lτ (τ) .

Fig. 8.1 shows the two log-likelihoods for a small sample of ten events with mean value t̄ = 0.5. The lower curves for the parameter τ are strongly asymmetric. This is also visible in the limits for changes of the log-likelihood by 0.5 or 2 units, which are indicated in the cut-outs on the right-hand side. The likelihood with the decay rate as parameter (upper figures) is much more symmetric than that of the mean life. This means that the decay rate is the more appropriate parameter to document the shape of the likelihood function, to average different measurements and to perform error propagation, see below. On the other hand, we can of course transform the maximum likelihood estimates and errors of the two parameters into each other without knowing the likelihood function itself.
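A small numerical sketch of how such asymmetric limits follow from condition (8.2) is given below; the ten decay times are invented so that t̄ = 0.5, and scipy's root finder locates the points where ln L has dropped by 1/2 from its maximum.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical sample of N = 10 decay times with mean 0.5.
t = np.array([0.12, 0.25, 0.31, 0.38, 0.44, 0.52, 0.58, 0.66, 0.79, 0.95])
N, tbar = len(t), t.mean()

def lnL_tau(tau):
    """ln L for the mean life, Eq. (8.3), up to an additive constant."""
    return -N * np.log(tau) - N * tbar / tau

tau_hat = tbar                         # MLE of the mean life
lnL_max = lnL_tau(tau_hat)

# One standard deviation limits from a log-likelihood drop of 1/2, Eq. (8.2).
def drop(tau):
    return lnL_max - lnL_tau(tau) - 0.5

lo = brentq(drop, 1e-6, tau_hat)       # tau_hat - delta_minus
hi = brentq(drop, tau_hat, 50.0)       # tau_hat + delta_plus
print(f"tau = {tau_hat:.3f}  -{tau_hat - lo:.3f}  +{hi - tau_hat:.3f}")
```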

Generally, it does not matter whether we use one or the other parameter to present the result but for further applications it is always simpler and more precise to work with approximately symmetric limits. For this reason usually 1/p (p is the absolute value of the momentum) instead of p is used as parameter when charged particle trajectories are fitted to the measured hits in a magnetic spectrometer.

In the general case we satisfy conditions 4 to 7 of our wish list, but the first three are only approximately valid. We can neither associate an exact probability content to the intervals, nor do the limits correspond to moments of a p.d.f.

8.3 Error Propagation

In many situations we have to evaluate a quantity which depends on one or several measurements with individual uncertainties. We thus have a problem of point estimation and of interval estimation. We look for the parameter which is best supported by the di erent measurements and for its uncertainty. Ideally, we are able to construct the likelihood function. In most cases this is not necessary and approximate procedures are adequate.

8.3.1 Averaging Measurements

In Chap. 4 we have shown that the mean of measurements with Gaussian errors δi which are independent of the measurements is given by the weighted sum of the individual measurements (4.7), with weights proportional to the inverse errors squared, 1/δi².

(Fig. 8.1. Likelihood functions for the parameters decay rate (top) and lifetime (below). The standard deviation limits are shown in the cut-outs on the right-hand side.)

In case the errors are correlated with the measurements, which occurs frequently with small event numbers, this procedure introduces a bias (see Example 56 in Chap. 4). From (6.6) we conclude that the exact method is to add the log-likelihoods of the individual measurements. Adding the log-likelihoods is equivalent to combining the raw data as if they were obtained in a single experiment. There is no loss of information and the method is not restricted to specific error conditions.
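The following sketch with invented numbers contrasts the two procedures for lifetime measurements: the weighted mean with weights 1/δi², and the exact combination obtained by adding the exponential log-likelihoods, which is maximized by the event-number-weighted mean of the individual estimates. The inputs and the error formula δi = τ̂i/√ni follow the setup of Example 116 below.

```python
import numpy as np

# Hypothetical input: three experiments quoting tau_i +- delta_i from n_i observed decays.
tau = np.array([0.95, 1.10, 1.02])
n = np.array([10, 40, 100])
delta = tau / np.sqrt(n)

# Weighted mean with weights 1/delta_i^2 (formula (4.7)); biased here, because the
# errors are correlated with the measured values.
w = 1.0 / delta**2
tau_weighted = np.sum(w * tau) / np.sum(w)

# Exact combination: the summed log-likelihood ln L(tau) = sum_i [-n_i ln(tau) - n_i tau_i / tau]
# is maximized by the event-number-weighted mean; the same error formula is applied
# to the combined sample.
tau_combined = np.sum(n * tau) / np.sum(n)
err_combined = tau_combined / np.sqrt(np.sum(n))

print(f"weighted mean      : {tau_weighted:.4f}")
print(f"combined likelihood: {tau_combined:.4f} +- {err_combined:.4f}")
```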

Example 116. Averaging lifetime measurements

N experiments quote lifetimes τ̂i ± δi of the same unstable particle. The estimates and their errors are computed from the individual measurements t_ij of the i-th experiment according to
$$
\hat{\tau}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} t_{ij} , \qquad \delta_i = \hat{\tau}_i / \sqrt{n_i} ,
$$
where n_i is the number of observed decays. We can reconstruct the individual log-likelihood functions and their sum ln L, with $n = \sum_{i=1}^{N} n_i$ the overall event number:
