
13.9 Extremum Search


The decisive advantage is its simplicity, which permits handling a large number of parameters at the same time. If convenient, rough approximations can be used for the calculation of the gradient; it is only important that the function decreases with each step. As opposed to the simplex and parabola methods, its complexity increases only linearly with the number of parameters. Therefore problems with huge parameter sets can be handled.

It is possible to evaluate a sample sequentially, element by element, which is especially useful for the back-propagation algorithm of neural networks.

An unsatisfactory feature is that the learning constant α is not dimensionless; in other words, the method is not independent of the parameter scales. For a space-time parameter set the gradient path will depend, for instance, on whether the parameters are measured in meters or millimeters, and in hours or seconds.

In regions where the function is flat the convergence is slow. In a narrow valley oscillations may appear. For too large values of α, oscillations make an exact determination of the minimum difficult.

The problems mentioned last can be reduced by various measures in which the step length and direction depend partially on the results of previous steps. When the function change is small and similar in successive steps, α is increased. Oscillations in a valley can be avoided by adding to the gradient in step i a fraction of the gradient of step i − 1:

$$\Delta\lambda_i = \alpha\left(\nabla f(\lambda_i) + 0.5\,\nabla f(\lambda_{i-1})\right)\,.$$

Oscillations near the minimum are easily recognized and removed by decreasing α.

The method of steepest descent is applied in ANNs and is useful for the updating of the alignment of tracking detectors [82].
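As an illustration (not from the original text), a minimal sketch of such a damped steepest-descent update in Python/NumPy; all names are hypothetical, and the step is taken opposite to the combined gradient so that the function decreases (the sign convention is an assumption here):

```python
import numpy as np

def steepest_descent(grad_f, lam0, alpha=0.05, beta=0.5, n_steps=1000):
    """Steepest descent; a fraction beta of the previous gradient is added
    to the current one to damp oscillations in narrow valleys."""
    lam = np.array(lam0, dtype=float)
    g_prev = np.zeros_like(lam)
    for _ in range(n_steps):
        g = grad_f(lam)
        lam = lam - alpha * (g + beta * g_prev)   # step opposite to the combined gradient
        g_prev = g
    return lam

# toy example: f(x, y) = (x - 1)^2 + 10 y^2, minimum at (1, 0)
grad = lambda p: np.array([2.0 * (p[0] - 1.0), 20.0 * p[1]])
print(steepest_descent(grad, [5.0, 3.0]))         # approaches [1, 0]
```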

13.9.5 Stochastic Elements in Minimum Search

A physical system which is cooled down to absolute zero will in principle occupy the energetic minimum. When cooled down fast it may, though, be trapped in a local (relative) minimum. An example is a particle in a potential well. At somewhat higher temperature it may leave the local minimum, thanks to the statistical energy distribution (Fig. 13.7). This is exploited, for instance, in the annealing of defects in solid matter.

This principle can be used for minimum search in general. A step in the wrong direction, in which the function increases by Δf, can be accepted in the method of steepest descent, e.g. with the probability

$$P(\Delta f) = \frac{1}{1 + e^{\Delta f / T}}\;.$$

The scale factor T (“temperature”) steers the strength of the effect. It has been shown that for a successively decreasing T the absolute minimum will be reached.
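A minimal sketch (not from the text; all names and the cooling schedule are illustrative assumptions) of a stochastic search that accepts uphill steps with the probability given above and gradually lowers T:

```python
import math, random

def anneal(f, x0, step=0.5, t_start=2.0, t_end=1e-3, cooling=0.99):
    """Accept a trial step with probability 1/(1 + exp(df/T)); lower T gradually."""
    x, t = x0, t_start
    while t > t_end:
        x_trial = x + random.uniform(-step, step)
        df = f(x_trial) - f(x)
        p = 1.0 / (1.0 + math.exp(min(df / t, 700.0)))   # capped to avoid overflow
        if random.random() < p:
            x = x_trial                                   # downhill steps are usually accepted
        t *= cooling                                      # cooling schedule
    return x

# toy example: double well with a local minimum near x = +1 and the
# global minimum near x = -1; the search can escape the local minimum
f = lambda x: (x**2 - 1.0)**2 + 0.2 * x
print(anneal(f, x0=1.0))
```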


Fig. 13.7. Stochastic annealing. A local minimum can be left with a certain probability.

13.10 Linear Regression with Constraints

We consider N measurements y at known locations x, with an N × N covariance matrix C_N and the corresponding weight matrix V_N = C_N^{-1}. (We indicate the dimensions of square matrices by an index.)

In the linear model the measurements are described by P < N parameters θ in form of linear relations

y = T(x)θ , with the rectangular N × P “design” matrix T.

In 7.2.3 we have found that the corresponding χ2 expression is minimized by

$$\hat{\theta} = \left(T^T V_N T\right)^{-1} T^T V_N\, y\;.$$

We now include constraints between the parameters, expressed by K < P linear relations:

Hθ = ρ ,

with H(x) a given rectangular K × P matrix and ρ a K-dimensional vector.

This problem is solved by introducing K Lagrange multipliers α and looking for a stationary point of the Lagrangian

$$\Lambda = (y - T\theta)^T\, V_N\, (y - T\theta) + 2\alpha^T (H\theta - \rho)\;.$$

 

Differentiating with respect to θ and α gives the normal equations

 

$$T^T V_N T\,\theta + H^T\alpha = T^T V_N\, y\;, \qquad (13.33)$$

$$H\theta = \rho \qquad (13.34)$$

to be solved for θ̂ and α̂. Note that Λ is minimized only with respect to θ, but maximized with respect to α: the stationary point is a saddle point, which complicates a direct extremum search. Solving (13.33) for θ and inserting it into (13.34), we find

$$\hat{\alpha} = C_K^{-1}\left(H C_P\, T^T V_N\, y - \rho\right)$$

and, re-inserting this estimate into (13.33), we obtain


$$\hat{\theta} = C_P\left[T^T V_N\, y - H^T C_K^{-1}\left(H C_P\, T^T V_N\, y - \rho\right)\right]\;,$$

where the abbreviations $C_P = (T^T V_N T)^{-1}$ and $C_K = H C_P H^T$ have been used.

The covariance matrix is found from linear error propagation, after a somewhat lengthy calculation, as

$$\mathrm{cov}(\hat{\theta}) = D\, C_N\, D^T = \left(I_P - C_P H^T C_K^{-1} H\right) C_P\;,$$

where

$$D = C_P\left(I_P - H^T C_K^{-1} H C_P\right) T^T V_N$$

has been used. The covariance matrix is symmetric and positive definite. Without constraints the negative term is absent and the covariance matrix equals C_P. Of course, the introduction of constraints reduces the errors and thus improves the parameter estimation.
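The following NumPy sketch (not from the text; the straight-line example and all names are made up) implements the formulas above for θ̂, α̂ and the covariance matrix:

```python
import numpy as np

def constrained_lsq(T, y, C_N, H, rho):
    """Linear least squares with linear constraints H theta = rho,
    following the Lagrange-multiplier solution given above."""
    V_N = np.linalg.inv(C_N)                     # weight matrix
    C_P = np.linalg.inv(T.T @ V_N @ T)           # unconstrained covariance
    C_K = H @ C_P @ H.T
    g = T.T @ V_N @ y                            # right-hand side T^T V_N y
    alpha = np.linalg.solve(C_K, H @ C_P @ g - rho)
    theta = C_P @ (g - H.T @ alpha)
    cov = (np.eye(len(theta)) - C_P @ H.T @ np.linalg.solve(C_K, H)) @ C_P
    return theta, cov

# toy usage: straight line y = a + b*x with the constraint a + b = 1
x = np.linspace(0.0, 1.0, 5)
T = np.column_stack([np.ones_like(x), x])
y = np.array([0.1, 0.3, 0.55, 0.8, 0.95])
C_N = 0.01 * np.eye(len(y))
H = np.array([[1.0, 1.0]])
rho = np.array([1.0])
theta_hat, cov = constrained_lsq(T, y, C_N, H, rho)
print(theta_hat, H @ theta_hat)                  # constraint satisfied: ~1
```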

13.11 Formulas Related to the Polynomial Approximation

Errors of the Expansion Coefficients

In Sect. 11.2.2 we have discussed the approximation of measurements by orthogonal polynomials and given the following formula for the error of the expansion coefficients a_k,

$$\mathrm{var}(a_k) = 1\Big/\sum_{\nu=1}^{N} \frac{1}{\delta_\nu^2}\;,$$

which is valid for all k = 1, . . . , K. Thus all errors are equal to the error of the weighted mean of the measurements yν .

Proof: from linear error propagation we have, for independent measurements yν ,

$$\mathrm{var}(a_k) = \mathrm{var}\Big(\sum_\nu w_\nu\, u_k(x_\nu)\, y_\nu\Big) = \sum_\nu w_\nu^2\, u_k^2(x_\nu)\, \delta_\nu^2 = \sum_\nu w_\nu\, u_k^2(x_\nu)\Big/\sum_\nu \frac{1}{\delta_\nu^2} = 1\Big/\sum_\nu \frac{1}{\delta_\nu^2}\;,$$

where in the third step we used the definition of the weights, and in the last step the normalization of the polynomials uk.

Polynomials for Data with Uniform Errors

If the errors δ1, . . . , δN are uniform, the weights become equal to 1/N, and for certain patterns of the locations x1, . . . , xN , for instance for an equidistant distribution, the orthogonalized polynomials uk(x) can be calculated. They are given in mathematical handbooks, for instance in Ref. [94]. Although the general expression is quite involved, we reproduce it here for the convenience of the reader. For


x defined in the domain [−1, 1] (if necessary after a linear transformation and shift), and N = 2M + 1 equidistant measured points x_ν = ν/M (with spacing Δx = 1/M), ν = 0, ±1, . . . , ±M, they are given by

$$u_k(x) = \left[\frac{(2M+1)(2k+1)\,[(2M)!]^2}{(2M+k+1)!\,(2M-k)!}\right]^{1/2} \sum_{i=0}^{k} (-1)^{i+k}\, \frac{(i+k)^{[2i]}\,(M+t)^{[i]}}{(i!)^2\,(2M)^{[i]}}\;,$$

for k = 0, 1, 2, . . . , 2M, where we used the notation t = x/Δx = xM and the definitions

$$z^{[i]} = z(z-1)(z-2)\cdots(z-i+1)\;, \quad z^{[0]} = 1\;,\ z \ge 0\;; \qquad 0^{[i]} = 0\;,\ i = 1, 2, \ldots\;.$$
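As a numerical cross-check (not part of the original text; helper names are hypothetical), the expression can be evaluated on the equidistant grid and tested for orthonormality with weights 1/N:

```python
import math

def falling(z, i):
    """Falling factorial z^[i] = z (z-1) ... (z-i+1), with z^[0] = 1."""
    out = 1.0
    for j in range(i):
        out *= (z - j)
    return out

def u(k, x, M):
    """Orthogonal polynomial u_k(x) on the 2M+1 equidistant points x = nu/M."""
    t = x * M
    norm = math.sqrt((2*M + 1) * (2*k + 1) * math.factorial(2*M)**2
                     / (math.factorial(2*M + k + 1) * math.factorial(2*M - k)))
    s = sum((-1)**(i + k) * falling(i + k, 2*i) * falling(M + t, i)
            / (math.factorial(i)**2 * falling(2*M, i))
            for i in range(k + 1))
    return norm * s

M = 5
N = 2*M + 1
xs = [nu / M for nu in range(-M, M + 1)]
for j in range(4):                         # print a 4x4 block of (1/N) sum u_j u_k
    for k in range(4):
        dot = sum(u(j, x, M) * u(k, x, M) for x in xs) / N
        print(f"{dot:6.3f}", end=" ")
    print()                                # ~ identity matrix
```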

13.12 Formulas for B-Spline Functions

13.12.1 Linear B-Splines

Linear B-splines cover an interval of length 2b and overlap with both neighbors:

$$B(x; x_0) = \begin{cases} \dfrac{x - x_0 + b}{b^2} & \text{for } x_0 - b \le x \le x_0\;, \\[2mm] \dfrac{-(x - x_0) + b}{b^2} & \text{for } x_0 \le x \le x_0 + b\;, \\[2mm] 0 & \text{else}\;. \end{cases}$$

They are normalized to unit area. Since the central values are equidistant, we fix

them by the lower limit xmin of the x-interval and count them as x0(k) = xmin + kb, with the index k running from kmin = 0 to kmax = (xmax − xmin)/b = K.

At the borders only half of a spline is used.

Remark: The border splines are defined in the same way as the other splines. After the fit, the part of the function outside of its original domain is ignored. In the literature the definition of the border splines is often different.
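A small sketch (illustrative only, names hypothetical) of the linear B-spline basis with centers x_0(k) = x_min + kb as defined above:

```python
def linear_bspline(x, x0, b):
    """Triangular B-spline of unit area, centered at x0, support [x0-b, x0+b]."""
    if x0 - b <= x <= x0:
        return (x - x0 + b) / b**2
    if x0 < x <= x0 + b:
        return (-(x - x0) + b) / b**2
    return 0.0

def basis(x, xmin, xmax, b):
    """Values of all linear B-splines k = 0 ... K at the point x."""
    K = round((xmax - xmin) / b)
    return [linear_bspline(x, xmin + k * b, b) for k in range(K + 1)]

# at x = 0.25 only the spline centered there is non-zero, with peak value 1/b
print(basis(0.25, 0.0, 1.0, 0.25))
```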

13.12.2 Quadratic B-Splines

The definition of quadratic splines is analogous:

$$B(x; x_0) = \begin{cases} \dfrac{1}{2b}\left(\dfrac{x - x_0 + 3b/2}{b}\right)^2 & \text{for } x_0 - 3b/2 \le x \le x_0 - b/2\;, \\[3mm] \dfrac{1}{2b}\left[\dfrac{3}{2} - 2\left(\dfrac{x - x_0}{b}\right)^2\right] & \text{for } x_0 - b/2 \le x \le x_0 + b/2\;, \\[3mm] \dfrac{1}{2b}\left(\dfrac{x - x_0 - 3b/2}{b}\right)^2 & \text{for } x_0 + b/2 \le x \le x_0 + 3b/2\;, \\[3mm] 0 & \text{else}\;. \end{cases}$$

The supporting points x_0 = x_min + (k − 1/2)b now lie partly outside of the x-domain. The index k runs from 0 to k_max = (x_max − x_min)/b + 2. Thus, the number K of splines is larger by two than the number of intervals. The relations (11.13) and (11.12) are valid as before.


13.12.3 Cubic B-Splines

Cubic B-splines are defined as follows:

$$B(x; x_0) = \begin{cases} \dfrac{1}{6b}\left(2 + \dfrac{x - x_0}{b}\right)^3 & \text{for } x_0 - 2b \le x \le x_0 - b\;, \\[3mm] \dfrac{1}{6b}\left[-3\left(\dfrac{x - x_0}{b}\right)^3 - 6\left(\dfrac{x - x_0}{b}\right)^2 + 4\right] & \text{for } x_0 - b \le x \le x_0\;, \\[3mm] \dfrac{1}{6b}\left[3\left(\dfrac{x - x_0}{b}\right)^3 - 6\left(\dfrac{x - x_0}{b}\right)^2 + 4\right] & \text{for } x_0 \le x \le x_0 + b\;, \\[3mm] \dfrac{1}{6b}\left(2 - \dfrac{x - x_0}{b}\right)^3 & \text{for } x_0 + b \le x \le x_0 + 2b\;, \\[3mm] 0 & \text{else}\;. \end{cases}$$

The shift of the center of the spline is performed as before: x0 = xmin + (k − 1)b. The index k runs from 0 to kmax = (xmax − xmin)/b + 3. The number kmax + 1 of splines is equal to the number of intervals plus 3.
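A sketch of the cubic B-spline written above (illustrative Python; the numerical integration is only a crude check of the unit-area normalization):

```python
def cubic_bspline(x, x0, b):
    """Cubic B-spline of unit area centered at x0 with support [x0-2b, x0+2b]."""
    t = (x - x0) / b
    if -2.0 <= t <= -1.0:
        return (2.0 + t)**3 / (6.0 * b)
    if -1.0 <= t <= 0.0:
        return (-3.0 * t**3 - 6.0 * t**2 + 4.0) / (6.0 * b)
    if 0.0 <= t <= 1.0:
        return (3.0 * t**3 - 6.0 * t**2 + 4.0) / (6.0 * b)
    if 1.0 <= t <= 2.0:
        return (2.0 - t)**3 / (6.0 * b)
    return 0.0

# crude midpoint-rule check of the unit-area normalization
b, n = 0.5, 100000
area = sum(cubic_bspline(-2.0 * b + 4.0 * b * (i + 0.5) / n, 0.0, b)
           for i in range(n)) * 4.0 * b / n
print(area)   # ~ 1.0
```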

13.13 Support Vector Classifiers

Support vector machines are described in some detail in Refs. [13, 71, 72, 73].

13.13.1 Linear Classifiers

Linear classifiers6 separate the two training samples by a hyperplane. Let us initially assume that in this way a complete separation is possible. Then an optimal hyperplane is the plane which divides the two samples with the largest margin. This is shown in Fig. 13.8. The hyperplane can be constructed in the following way: the shortest connection between the convex hulls7 of the two non-overlapping classes determines the direction w/|w| of the normal w of this plane, which cuts this connection at its center. We represent the hyperplane in the form

w · x + b = 0 ,

(13.35)

where b fixes its distance from the origin. Note that w is not normalized, a common factor in w and b does not change condition (13.35). Once we have found the hyperplane {w, b} which separates the two classes yi = ±1 of the training sample {(x1, y1), . . . , (xN , yN )} we can use it to classify new input:

yˆ = f(x) = sign(w · x + b) .

(13.36)

To find the optimal hyperplane, which divides the distance between the hulls into equal parts, we define the two marginal planes which touch the hulls:

 

6A linear classification scheme was already introduced in Sect. 11.4.1.

7The convex hull is the smallest polyhedron which contains all points and their connecting straight lines.


Fig. 13.8. The central hyperplane separates squares from circles. Shown are the convex hulls and the support vectors (open symbols).

w · x + b = ±1 .

If x_+, x_− are located on the two marginal hyperplanes, the following relations hold, which also fix the norm of w:

$$w \cdot (x_+ - x_-) = 2 \;\Rightarrow\; \frac{w}{|w|} \cdot (x_+ - x_-) = \frac{2}{|w|}\;.$$

The optimal hyperplane is now found by solving a constrained quadratic optimization problem

|w|2 = minimum , subject to yi(w · xi + b) ≥ 1 , i = 1, . . . , N .

For the solution, only the constraints with the equality sign are relevant. The vectors corresponding to points on the marginal planes form the so-called active set and are called support vectors (see Fig. 13.8). The optimal solution can be written as

$$w = \sum_i \alpha_i\, y_i\, x_i$$

with α_i > 0 for the active set, else α_i = 0, and furthermore Σ_i α_i y_i = 0. The last condition ensures translation invariance: w(x_i − a) = w(x_i). Together with the active constraints, after substituting the above expression for w, it provides just the required number of linear equations to fix α_i and b. Of course, the main problem is to find the active set. For realistic cases this requires the solution of a large quadratic optimization problem, subject to linear inequalities. For this purpose an extensive literature as well as program libraries exist.
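As an illustration (not the authors' code), the quadratic optimization can be delegated to a standard package; the sketch below assumes scikit-learn is available, uses a large penalty constant to approximate the hard-margin case, and reconstructs w = Σ α_i y_i x_i from the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# two linearly separable toy classes (hypothetical data)
rng = np.random.default_rng(1)
x_a = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
x_b = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(50, 2))
X = np.vstack([x_a, x_b])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)       # large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_
print("number of support vectors:", len(clf.support_vectors_))
print("w:", w, " b:", b)
print("predictions:", np.sign(X @ w.ravel() + b))  # reproduces clf.predict(X)
```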

This picture can be generalized to the case of overlapping classes. Assuming that the optimal separation is still given by a hyperplane, the picture remains essentially the same, but the optimization process is substantially more complex. The standard way is to introduce so-called soft margin classifiers. Here some points on the wrong side of their marginal plane are tolerated, but with a certain penalty in the optimization process. The penalty is chosen proportional to the sum of their distances or their squared distances from their own territory. The proportionality constant is adjusted to the given problem.


13.13.2 General Kernel Classifiers

All quantities determining the linear classifier ŷ (13.36) depend only on inner products of vectors of the input space. This concerns not only the dividing hyperplane, given by (13.35), but also the expressions for w, b and the factors α_i associated with the support vectors. The inner product x · x′, which is a bilinear symmetric scalar function of two vectors, is now replaced by another scalar function K(x, x′) of two vectors, the kernel, which need not be bilinear, but should also be symmetric, and is usually required to be positive definite. In this way a linear problem in an inner product space is mapped into a very non-linear problem in the original input space where the kernel is defined. We are then able to separate the classes by a hyperplane in the inner product space that may correspond to a very complicated hypersurface in the input space. This is the so-called kernel trick.

To illustrate how a non-linear surface can be mapped into a hyperplane, we consider a simple example. In order to work with a linear cut, i.e. with a dividing hyperplane, we transform our input variables x into new variables: x → X(x). For instance, if x_1, x_2, x_3 are momentum components and a cut in energy, x_1^2 + x_2^2 + x_3^2 < r^2, is to be applied, we could transform the momentum space into a space

$$X = \{\,x_1^2,\ x_2^2,\ x_3^2,\ \ldots\,\}\;,$$

where the cut corresponds to the hyperplane X_1 + X_2 + X_3 = r^2. Such a mapping can be realized by substituting the inner product by a kernel:

$$x \cdot x' \;\rightarrow\; K(x, x') = X(x) \cdot X(x')\;.$$

In our example a kernel of the so-called monomial form is appropriate:

$$K(x, x') = (x \cdot x')^d \quad \text{with } d = 2\;, \qquad (13.37)$$

$$(x \cdot x')^2 = (x_1 x'_1 + x_2 x'_2 + x_3 x'_3)^2 = X(x) \cdot X(x')$$

with

$$X(x) = \{\,x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1 x_3,\ \sqrt{2}\,x_2 x_3\,\}\;.$$

The sphere x_1^2 + x_2^2 + x_3^2 = r^2 in x-space is mapped into the 5-dimensional hyperplane X_1 + X_2 + X_3 = r^2 in 6-dimensional X-space. (A kernel inducing, instead of monomials of order d (13.37), polynomials of all orders up to order d is K(x, x') = (1 + x · x')^d.)
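A quick numerical check of this identity (illustrative code; the explicit feature map is the one given above):

```python
import numpy as np

def phi(x):
    """Explicit feature map X(x) for the monomial kernel with d = 2."""
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1 * x3,
                     np.sqrt(2) * x2 * x3])

x, xp = np.array([0.3, -1.2, 0.7]), np.array([1.1, 0.4, -0.5])
print(np.dot(x, xp) ** 2)          # kernel evaluated in the 3-dim input space
print(np.dot(phi(x), phi(xp)))     # the same number from the 6-dim feature space
```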

The most common kernel used for classification is the Gaussian (see Sect. 11.2.1):

$$K(x, x') = \exp\left(-\frac{(x - x')^2}{2 s^2}\right)\;.$$

 

It can be shown that it induces a mapping into a space of infinite dimensions [71] and that nevertheless the training vectors can in most cases be replaced by a relatively small number of support vectors. The only free parameter is the penalty constant which regulates the degree of overlap of the two classes. A high value leads to a very irregular shape of the hypersurface separating the training samples of the two classes to a high degree in the original space whereas for a low value its shape is much smoother and more minority observations are tolerated.

In practice, this mapping into the inner product space is not performed explicitly, in fact it is even rarely known. All calculations are performed in x-space, especially the determination of the support vectors and their weights α. The kernel trick merely


serves to prove that a classification with support vectors is feasible. The classification of new input then proceeds with the kernel K and the support vectors directly:

$$\hat{y} = \mathrm{sign}\left(\sum_{y_i = +1} \alpha_i\, K(x, x_i) \;-\; \sum_{y_i = -1} \alpha_i\, K(x, x_i)\right)\,.$$

The use of a relatively small number of support vectors (typically only about 5 % of all α_i are different from zero) drastically reduces the storage requirement and the computing time for the classification. Note that the result of the support vector classifier is not identical to that of the original kernel classifier, but very similar.
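A minimal sketch of this decision rule with a Gaussian kernel (illustrative only; the support vectors, labels and weights below are made-up placeholders which in practice come from the quadratic optimization):

```python
import numpy as np

def gauss_kernel(x, xp, s=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * s**2))

def classify(x, sv, sv_y, alpha, s=1.0):
    """Kernel classifier: weighted kernel sums of the two support-vector classes."""
    score = sum(a * yi * gauss_kernel(x, xi, s)
                for a, yi, xi in zip(alpha, sv_y, sv))
    return int(np.sign(score))

# hypothetical support vectors, labels and weights
sv    = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
sv_y  = np.array([+1, +1, -1])
alpha = np.array([0.7, 0.3, 1.0])

print(classify(np.array([0.2, 0.8]), sv, sv_y, alpha))   # -> +1 (near the +1 vectors)
print(classify(np.array([2.1, 1.9]), sv, sv_y, alpha))   # -> -1
```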

13.14 The Bayes Factor

In Chap. 6 we have introduced the likelihood ratio to discriminate between simple hypotheses. For two composite hypotheses H1 and H2 with free parameters, in the Bayesian approach the simple ratio is to be replaced by the so-called Bayes factors.

Let us assume for a moment that H1 applies. Then the actual parameters will follow a p.d.f. proportional to L_1(θ_1|x)π_1(θ_1), where L_1(θ_1|x) is the likelihood function and π_1(θ_1) the prior density of the parameters. The same reasoning is valid for

H2. The probability that H1 (H2) is true is proportional to the integral over the parameter space, ∫L_1(θ_1|x)π_1(θ_1)dθ_1 (∫L_2(θ_2|x)π_2(θ_2)dθ_2). The relative betting odds thus are given by the Bayes factor B,

$$B = \frac{\int L_1(\theta_1|x)\,\pi_1(\theta_1)\,d\theta_1}{\int L_2(\theta_2|x)\,\pi_2(\theta_2)\,d\theta_2}\;.$$

In the case with no free parameters, B reduces to the simple likelihood ratio L1/L2.

The two terms forming the ratio are called marginal likelihoods. The integration automatically introduces a penalty for additional parameters and the related overfitting: the higher the dimensionality of the parameter space is, the larger is on average the contribution of low likelihood regions to the integral. In this way the concept follows the philosophy of Ockham's razor8, which in short states that from different competing theories, the one with the fewest assumptions, i.e. the simplest, should be preferred.
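As an illustration of how such marginal likelihoods can be evaluated (a toy sketch, not the example of Chap. 10; the Gaussian model, the data and the uniform prior are assumptions made here):

```python
import numpy as np

# hypothetical sample, assumed Gaussian with unit standard deviation
x = np.array([0.8, 1.3, 0.2, 1.1, 0.6])

def likelihood(mu):
    """Likelihood of the sample for mean mu and unit standard deviation."""
    return np.prod(np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi))

# H1: free mean with uniform prior on [-3, 3];  H2: mean fixed to zero
mu_grid = np.linspace(-3.0, 3.0, 2001)
prior = 1.0 / 6.0                                   # uniform prior density on [-3, 3]
like_vals = np.array([likelihood(m) for m in mu_grid])
marginal_1 = np.sum(like_vals * prior) * (mu_grid[1] - mu_grid[0])
marginal_2 = likelihood(0.0)                        # no free parameter under H2

print("Bayes factor B =", marginal_1 / marginal_2)
```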

The Bayes factor is intended to replace the p-value of frequentist statistics.

H. Jeffreys [19] has suggested a classification of Bayes factors into different categories, ranging from < 3 (barely worth mentioning) to > 100 (decisive).

For the example of Chap. 10, Sect. 10.5, Fig. 10.18, with a resonance above a uniform background and uniform prior densities in the signal fraction t, 0 ≤ t ≤ 0.5, and the location µ, 0.2 ≤ µ ≤ 0.8, the Bayes factor is B = 54, which is considered very significant. This result is inversely proportional to the range in µ, as is expected, because the probability to find a fake signal in a flat background is proportional to its range. In the cited example we had found a likelihood ratio of 1.1 · 10^4 taken at the MLE. The corresponding p-value was p = 1.8 · 10^{-4} for the hypothesis of a flat background, much smaller than the betting odds of 1/54 for this hypothesis. While

8Postulated by William of Ockham, English logician in the 14th century.


the Bayes factor takes into account the uncertainty of the parameter estimate, this uncertainty is completely neglected in the p-value derived from the likelihood ratio taken simply at the MLE. On the other hand, for the calculation of the Bayes factor an at least partially subjective prior probability has to be included.

For the final rating the Bayes factor has to be multiplied by the prior factors of the competing hypotheses:

$$R = \frac{\pi_{H_1}\int L_1(\theta_1|x)\,\pi_1(\theta_1)\,d\theta_1}{\pi_{H_2}\int L_2(\theta_2|x)\,\pi_2(\theta_2)\,d\theta_2} = B\,\frac{\pi_{H_1}}{\pi_{H_2}}\;.$$

The posterior rating is equal to the prior rating times the Bayes factor.

The Bayes factor is a very reasonable and conceptually attractive concept which requires little computational effort. It is to be preferred to the frequentist p-value approach in decision making. However, for the documentation of a measurement it has the typical Bayesian drawback that it depends on prior densities, and unfortunately there is no objective way to fix those.

13.15 Robust Fitting Methods

13.15.1 Introduction

If one or a few observations in a sample are separated from the bulk of the data, we speak of outliers. The reasons for their existence range from trivial mistakes or detector failures to important physical effects. In any case, the assumed statistical model has to be questioned if one is not willing to admit that a large and very improbable fluctuation has occurred.

Outliers are quite disturbing: They can change parameter estimates by large amounts and increase their errors drastically.

Frequently outliers can be detected simply by inspection of appropriate plots. It goes without saying that simply dropping them is not good practice; in any case at least a complete documentation of such an event is required. Clearly, objective methods for their detection and treatment are preferable.

In the following, we restrict our treatment to the simple one-dimensional case of Gaussian-like distributions, where outliers are located far from the average, and where we are interested in the mean value. If a possible outlier is contained in the allowed variate range of the distribution – which is always true for a Gaussian – a statistical fluctuation cannot be excluded as a logical possibility. Since the outliers are removed on the basis of a statistical procedure, the corresponding modification of results due to the possible removal of correct observations can be evaluated.

We distinguish three cases:

1.The standard deviations of the measured points are known.

2.The standard deviations of the measured points are unknown but known to be the same for all points.

3.The standard deviations are unknown and di erent.


It is obvious that case 3 of unknown and unequal standard deviations cannot be treated.

The treatment of outliers, especially in situations like case 2, within the LS formalism is not really satisfying. If the data are of bad quality we may expect a sizeable fraction of outliers with large deviations. These may distort the LS fit to such an extent that outliers become difficult to define (masking of outliers). This kind of fragility of the LS method, and the fact that in higher dimensions the outlier detection becomes even more critical, has led statisticians to look for estimators which are less disturbed by data not obeying the assumed statistical model (typical are deviations from the assumed normal distribution), even when the efficiency suffers. In a second – not robust – fit procedure with cleaned data it is always possible to optimize the efficiency.

In particle physics, a typical problem is the reconstruction of particle tracks from hits in wire or silicon detectors. Here outliers due to other tracks or noise are a common difficulty, and for a first rough estimate of the track parameters and the associated hit selection for the pattern recognition, robust methods are useful.

13.15.2 Robust Methods

Truncated Least Square Fits

The simplest method to remove outliers is to eliminate those measurements which contribute excessively to the χ2 of a least square (LS) fit. In this truncated least square fit (LST) all observations that deviate by more than a certain number of standard deviations from the mean are excluded. Reasonable values lie between 1.5 and 2 standard deviations, corresponding to a χ2 cut χ2max = 2.25 to 4. The optimal value of this cut depends on the expected amount of background or false measurements and the number of observations. In case 2 the variance has to be estimated from the data, and the estimated variance δ̂² is, according to Chap. 3.2.3, given by

$$\hat{\delta}^2 = \sum_i (y_i - \hat{\mu})^2 / (N - 1)\;.$$

This method can be improved by removing outliers sequentially (LSTS). In a first step we use all measurements y_1, . . . , y_N with standard deviations δ_1, . . . , δ_N to determine the mean value µ̂, which in our case is just the weighted mean. Then we compute the normalized residuals, also called pulls, r_i = (y_i − µ̂)/δ_i and select the measurement with the largest value of r_i². The value of χ2 is computed with respect to the mean and variance of the remaining observations, and the measurement is excluded if it exceeds the parameter χ2max9. The fit is repeated until all measurements are within the margin. If all measurements are genuine Gaussian measurements, this procedure only marginally reduces the precision of the fit.

In both methods, LST and LSTS, a minimum fraction of measurements has to be retained. A reasonable value is 50 %, but depending on the problem other values may be appropriate.

9If the variance has to be estimated from the data its value is biased towards smaller values because for a genuine Gaussian distribution eliminating the measurement with the largest pull reduces the expected variance.
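A minimal sketch of the sequential procedure (LSTS) for the weighted mean with known measurement errors, case 1 (illustrative only; the cut value and the 50 % retention floor follow the description above):

```python
import numpy as np

def lsts_mean(y, delta, chi2_max=4.0, min_keep_frac=0.5):
    """Sequentially remove the measurement with the largest pull until all
    squared pulls are below chi2_max or the retention floor is reached."""
    keep = np.ones(len(y), dtype=bool)
    while keep.sum() > min_keep_frac * len(y):
        w = 1.0 / delta[keep] ** 2
        mu = np.sum(w * y[keep]) / np.sum(w)            # weighted mean of kept points
        pulls2 = ((y - mu) / delta) ** 2
        worst = np.argmax(np.where(keep, pulls2, -np.inf))
        if pulls2[worst] <= chi2_max:                   # everything within the margin
            break
        keep[worst] = False                             # drop the worst outlier, refit
    return mu, keep

y     = np.array([1.1, 0.9, 1.0, 1.2, 5.0, 0.95])       # one obvious outlier
delta = np.full_like(y, 0.1)
mu_hat, kept = lsts_mean(y, delta)
print(mu_hat, kept)
```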
