13.9 Extremum Search
•The decisive advantage is its simplicity, which permits handling a large number of parameters at the same time. If convenient, rough approximations can be used for the calculation of the gradient; all that matters is that the function decreases with each step. As opposed to the simplex and parabola methods, its complexity increases only linearly with the number of parameters. Therefore problems with huge parameter sets can be handled.
•It is possible to evaluate a sample sequentially, element by element, which is especially useful for the back-propagation algorithm of neural networks.
•Unsatisfactory is that the learning constant α is not dimensionless. In other words, the method is not independent of the parameter scales. For a space-time parameter set the gradient path will depend, for instance, on whether the parameters are measured in meters or millimeters, and in hours or seconds.
•In regions where the parameter space is flat the convergence is slow. In a narrow valley oscillations may appear. For too large values of α, oscillations make exact minimizing difficult.
The last mentioned problems can be reduced by various measures in which the step length and direction depend partially on the results of previous steps. When the function change is small and similar in successive steps, α is increased. Oscillations in a valley can be avoided by adding to the gradient in step i a fraction of the gradient of step i − 1:

Δλ_i = −α (∇f(λ_i) + 0.5 ∇f(λ_{i−1})) .
Oscillations near the minimum are easily recognized and removed by decreasing α.
The method of steepest descent is applied in ANNs and is useful for updating the alignment of tracking detectors [82].
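The update rule above can be sketched in a few lines. This is only a toy illustration, not from the text: the test function, the step length α = 0.02, the momentum fraction 0.5, and the step count are assumed values.

```python
import numpy as np

def descend(grad, x0, alpha=0.02, momentum=0.5, steps=300):
    """Steepest descent where each step adds a fraction of the
    previous gradient, damping oscillations in narrow valleys."""
    x = np.asarray(x0, dtype=float)
    g_prev = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        # step: -alpha * (gradient now + 0.5 * gradient of previous step)
        x = x - alpha * (g + momentum * g_prev)
        g_prev = g
    return x

# narrow quadratic valley f(x, y) = x^2 + 20 y^2, minimum at the origin
minimum = descend(lambda p: np.array([2 * p[0], 40 * p[1]]), [3.0, 1.0])
```

Note that a plain gradient step with the same α would be stable here anyway; the momentum term mainly helps when the valley is so narrow that successive steps alternate sides.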
13.9.5 Stochastic Elements in Minimum Search
A physical system which is cooled down to absolute zero will in principle occupy an energetic minimum. When cooled down fast, though, it may be captured in a local (relative) minimum. An example is a particle in a potential well. At somewhat higher temperature it may leave the local minimum, thanks to the statistical energy distribution (Fig. 13.7). This is used, for instance, in the thermal annealing of defects in solid matter.
This principle can be used for minimum search in general. When using the method of steepest descent, a step in the wrong direction, where the function increases by Δf, can be accepted, e.g. with a probability
P(Δf) = 1 / (1 + e^{Δf/T}) .
The scale factor T (“temperature”) controls the strength of the effect. It has been shown that for successively decreasing T the absolute minimum will be reached.
Fig. 13.7. Stochastic annealing. A local minimum can be left with a certain probability.
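The acceptance rule can be sketched as follows. The test function, step size, starting temperature and cooling schedule are all assumed for illustration; the text does not prescribe them.

```python
import math, random

def anneal(f, x, step=0.5, t0=5.0, cooling=0.999, iters=5000, seed=1):
    """Minimum search with stochastic acceptance: a trial step that
    increases f by df > 0 is still accepted with probability
    P = 1/(1 + exp(df/T)), while T ("temperature") is slowly lowered."""
    rng = random.Random(seed)
    t, best = t0, x
    for _ in range(iters):
        trial = x + rng.uniform(-step, step)
        df = f(trial) - f(x)
        if df < 0 or rng.random() < 1.0 / (1.0 + math.exp(min(df / t, 50.0))):
            x = trial          # downhill always accepted; uphill sometimes
        if f(x) < f(best):
            best = x           # keep track of the best point visited
        t *= cooling
    return best

x_min = anneal(lambda x: (x - 3.0) ** 2, 0.0)   # single well, minimum at x = 3
```

With a sufficiently high starting temperature and a slow enough schedule, the same routine can also escape local minima of a multi-well function, which is the point of the stochastic acceptance.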
13.10 Linear Regression with Constraints
We consider N measurements y at known locations x, with an N × N covariance matrix C_N and a corresponding weight matrix V_N = C_N^{-1}. (We indicate the dimensions of square matrices with an index.)
In the linear model the measurements are described by P < N parameters θ in the form of linear relations

y = T(x) θ ,

with the rectangular N × P “design” matrix T.
In Sect. 7.2.3 we have found that the corresponding χ² expression is minimized by
θ̂ = (T^T V_N T)^{-1} T^T V_N y .
We now include constraints between the parameters, expressed by K < P linear relations:
Hθ = ρ ,
with H(x) a given rectangular K × P matrix and ρ a K-dimensional vector.
This problem is solved by introducing K Lagrange multipliers α and looking for a stationary point of the Lagrangian
Λ = (y − Tθ)^T V_N (y − Tθ) + 2α^T (Hθ − ρ) .

Differentiating with respect to θ and α gives the normal equations

T^T V_N T θ + H^T α = T^T V_N y , (13.33)

Hθ = ρ , (13.34)
to be solved for θ̂ and α̂. Note that Λ is minimized only with respect to θ, but maximized with respect to α: the stationary point is a saddle point, which complicates a direct extremum search. Solving (13.33) for θ and inserting it into (13.34), we find
α̂ = C_K^{-1} (H C_P T^T V_N y − ρ) ,

and, re-inserting the estimate into (13.33), we obtain
θ̂ = C_P [T^T V_N y − H^T C_K^{-1} (H C_P T^T V_N y − ρ)] ,

where the abbreviations C_P = (T^T V_N T)^{-1}, C_K = H C_P H^T have been used.
The covariance matrix is found from linear error propagation, after a somewhat lengthy calculation, as

cov(θ̂) = D C_N D^T = (I_P − C_P H^T C_K^{-1} H) C_P ,

where

D = C_P (I_P − H^T C_K^{-1} H C_P) T^T V_N

has been used. The covariance matrix is symmetric and positive semi-definite. Without constraints the negative term is absent and it equals C_P. Of course, the introduction of constraints reduces the errors and thus improves the parameter estimation.
13.11 Formulas Related to the Polynomial Approximation
Errors of the Expansion Coefficients
In Sect. 11.2.2 we have discussed the approximation of measurements by orthogonal polynomials and given the following formula for the error of the expansion coefficients a_k,
var(a_k) = 1 / Σ_{ν=1}^{N} (1/δ_ν²) ,
which is valid for all k = 1, . . . , K. Thus all errors are equal to the error of the weighted mean of the measurements yν .
Proof: from linear error propagation we have, for independent measurements yν ,
var(a_k) = var( Σ_ν w_ν u_k(x_ν) y_ν )
= Σ_ν w_ν² (u_k(x_ν))² δ_ν²
= ( Σ_ν w_ν u_k²(x_ν) ) / Σ_ν (1/δ_ν²)
= 1 / Σ_ν (1/δ_ν²) ,
where in the third step we used the definition of the weights, and in the last step the normalization of the polynomials uk.
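The result can be verified numerically. The grid and errors below are invented; the orthonormal polynomials are built by Gram–Schmidt with respect to the weighted inner product ⟨f, g⟩ = Σ_ν w_ν f(x_ν) g(x_ν), which is one way (not necessarily the text's) to obtain them.

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
delta = np.array([0.1, 0.2, 0.1, 0.3, 0.2])    # measurement errors
w = (1 / delta**2) / np.sum(1 / delta**2)      # normalized weights

def dot(f, g):
    """Weighted inner product on the grid."""
    return np.sum(w * f * g)

# Gram-Schmidt: orthonormal polynomial values u_0, u_1, u_2 on the grid
u = []
for v in [x**k for k in range(3)]:
    for uk in u:
        v = v - dot(v, uk) * uk
    u.append(v / np.sqrt(dot(v, v)))

var_expected = 1 / np.sum(1 / delta**2)        # error of the weighted mean
var_ak = [np.sum(w**2 * uk**2 * delta**2) for uk in u]
```

All three coefficient variances come out equal to the variance of the weighted mean, independent of k, as stated.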
Polynomials for Data with Uniform Errors
If the errors δ1, . . . , δN are uniform, the weights become equal to 1/N, and for certain patterns of the locations x1, . . . , xN , for instance for an equidistant distribution, the orthogonalized polynomials uk(x) can be calculated. They are given in mathematical handbooks, for instance in Ref. [94]. Although the general expression is quite involved, we reproduce it here for the convenience of the reader. For
370 13 Appendix
x defined in the domain [−1, 1] (possibly after some linear transformation and shift), and N = 2M + 1 equidistant measured points x_ν = ν/M (with distance Δx = 1/M), ν = 0, ±1, . . . , ±M, they are given by
u_k(x) = [ (2M + 1)(2k + 1) [(2M)!]² / ( (2M + k + 1)! (2M − k)! ) ]^{1/2} Σ_{i=0}^{k} (−1)^{i+k} (i + k)^{[2i]} (M + t)^{[i]} / ( (i!)² (2M)^{[i]} ) ,
for k = 0, 1, 2, . . . 2M, where we used the notation t = x/Δx = xM and the definitions
z^{[i]} = z(z − 1)(z − 2) · · · (z − i + 1) , z^{[0]} = 1 (z ≥ 0) , 0^{[i]} = 0 for i = 1, 2, . . . .
13.12 Formulas for B-Spline Functions
13.12.1 Linear B-Splines
Linear B-splines cover an interval of length 2b and overlap with both neighbors:
B(x; x0) = (x − x0 + b)/b²   for x0 − b ≤ x ≤ x0 ,
= −(x − x0 − b)/b²   for x0 ≤ x ≤ x0 + b ,
= 0   else .
They are normalized to unit area. Since the central values are equidistant, we fix
them by the lower limit xmin of the x-interval and count them as x0(k) = xmin + kb, with the index k running from kmin = 0 to kmax = (xmax − xmin)/b = K.
At the borders only half of a spline is used.
Remark: The border splines are defined in the same way as the other splines. After the fit, the part of the function outside of its original domain is ignored. In the literature the definition of the border splines is often different.
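A minimal sketch of the linear B-spline above, assuming equidistant centers spaced by b; the numerical values are illustrative only.

```python
import numpy as np

def b_linear(x, x0, b):
    """Linear (triangular) B-spline centered at x0 with half-width b,
    normalized to unit area (peak value 1/b at x = x0)."""
    if x0 - b <= x <= x0:
        return (x - x0 + b) / b**2
    if x0 < x <= x0 + b:
        return -(x - x0 - b) / b**2
    return 0.0

b = 0.5
xs = np.linspace(-b, b, 10001)               # support of the spline at x0 = 0
area = sum(b_linear(t, 0.0, b) for t in xs) * (xs[1] - xs[0])
# neighboring splines spaced by b sum to 1/b (a partition of unity up to
# the factor b), so a fit with equal coefficients reproduces a constant
unity = b * (b_linear(0.37, 0.0, b) + b_linear(0.37, 0.5, b))
```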
13.12.2 Quadratic B-Splines

The definition of quadratic splines is analogous:
B(x; x0) = 1/(2b) ((x − x0 + 3b/2)/b)²   for x0 − 3b/2 ≤ x ≤ x0 − b/2 ,
= 1/(2b) [3/2 − 2((x − x0)/b)²]   for x0 − b/2 ≤ x ≤ x0 + b/2 ,
= 1/(2b) ((x − x0 − 3b/2)/b)²   for x0 + b/2 ≤ x ≤ x0 + 3b/2 ,
= 0   else .
The supporting points x0 = xmin + (k − 1/2)b now lie partly outside of the x-domain. The index k runs from 0 to kmax = (xmax − xmin)/b + 2. Thus, the number K of splines is higher by two than the number of intervals. The relations (11.13) and (11.12) are valid as before.
Fig. 13.8. The central hyperplane separates squares from circles. Shown are the convex hulls and the support vectors (open symbols).
w · x + b = ±1 .
If x+, x− are located on the two marginal hyperplanes, the following relations hold, which also fix the norm of w:

w · (x+ − x−) = 2 ,   (w/|w|) · (x+ − x−) = 2/|w| .
The optimal hyperplane is now found by solving a constrained quadratic optimization problem
|w|² = minimum , subject to y_i (w · x_i + b) ≥ 1 , i = 1, . . . , N .
For the solution, only the constraints with the equality sign are relevant. The vectors corresponding to points on the marginal planes form the so-called active set and are called support vectors (see Fig. 13.8). The optimal solution can be written as
w = Σ_i α_i y_i x_i
with α_i > 0 for the active set, α_i = 0 otherwise, and furthermore Σ_i α_i y_i = 0. The last condition ensures translation invariance: w({x_i − a}) = w({x_i}). Together with the active constraints, after substituting the above expression for w, it provides just the required number of linear equations to fix the α_i and b. Of course, the main problem is to find the active set. For realistic cases this requires the solution of a large quadratic optimization problem, subject to linear inequalities. For this purpose an extensive literature as well as program libraries exist.
This picture can be generalized to the case of overlapping classes. Assuming that the optimal separation is still given by a hyperplane, the picture remains essentially the same, but the optimization process is substantially more complex. The standard way is to introduce so-called soft margin classifiers: some points on the wrong side of their marginal plane are then tolerated, but with a certain penalty in the optimization process. The penalty is chosen proportional to the sum of their distances, or of their squared distances, from their own territory. The proportionality constant is adjusted to the given problem.
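The quadratic optimization problem can be sketched on a toy data set. This is not the text's procedure: for simplicity the bias b is set to zero (hyperplane through the origin), so the condition Σ α_i y_i = 0 drops out, and the dual problem — maximize Σ α_i − ½ Σ α_i α_j y_i y_j x_i·x_j with α_i ≥ 0 — is solved by plain projected gradient ascent. Data, step size, and iteration count are invented.

```python
import numpy as np

X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Z = y[:, None] * X          # rows y_i * x_i
Q = Z @ Z.T                 # Q_ij = y_i y_j x_i . x_j

alpha = np.zeros(len(y))
eta = 0.01                  # step size, must be below 2 / lambda_max(Q)
for _ in range(20000):
    # gradient of the dual is 1 - Q alpha; project back onto alpha >= 0
    alpha = np.maximum(0.0, alpha + eta * (1.0 - Q @ alpha))

w = (alpha * y) @ X         # w = sum_i alpha_i y_i x_i
margins = y * (X @ w)       # support vectors (alpha_i > 0) sit at margin 1
```

After convergence the active set consists of the points with α_i > 0; their margins equal 1, while all other points lie strictly outside the marginal planes.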
13.13.2 General Kernel Classifiers
All quantities determining the linear classifier ŷ (13.36) depend only on inner products of vectors of the input space. This concerns not only the dividing hyperplane, given by (13.35), but also the expressions for w, b and the factors α_i associated with the support vectors. The inner product x · x′, which is a bilinear symmetric scalar function of two vectors, is now replaced by another scalar function K(x, x′) of two vectors, the kernel, which need not be bilinear, but should also be symmetric, and is usually required to be positive definite. In this way a linear problem in an inner product space is mapped into a very non-linear problem in the original input space where the kernel is defined. We are then able to separate the classes by a hyperplane in the inner product space that may correspond to a very complicated hypersurface in the input space. This is the so-called kernel trick.
To illustrate how a non-linear surface can be mapped into a hyperplane, we consider a simple example. In order to work with a linear cut, i.e. with a dividing hyperplane, we transform our input variables x into new variables: x → X(x). For instance, if x1, x2, x3 are momentum components and a cut in energy, x1² + x2² + x3² < r², is to be applied, we could transform the momentum space into a space

X = {x1², x2², x3², . . .} ,
where the cut corresponds to the hyperplane X1 + X2 + X3 = r2. Such a mapping can be realized by substituting the inner product by a kernel:
x · x′ → K(x, x′) = X(x) · X(x′).
In our example a kernel of the so-called monomial form is appropriate:
K(x, x′) = (x · x′)^d   with d = 2 , (13.37)

(x · x′)² = (x1 x1′ + x2 x2′ + x3 x3′)² = X(x) · X(x′)

with

X(x) = {x1², x2², x3², √2 x1x2, √2 x1x3, √2 x2x3} .
The sphere x1² + x2² + x3² = r² in x-space is mapped into the 5-dimensional hyperplane X1 + X2 + X3 = r² in the 6-dimensional X-space. (A kernel that induces, instead of the monomials of order d of (13.37), polynomials of all orders up to d, is K(x, x′) = (1 + x · x′)^d.)
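The identity (x · x′)² = X(x) · X(x′) is easy to verify numerically; the example vectors below are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit map X(x) for the monomial kernel with d = 2 in three dimensions."""
    r2 = np.sqrt(2.0)
    return np.array([x[0]**2, x[1]**2, x[2]**2,
                     r2 * x[0] * x[1], r2 * x[0] * x[2], r2 * x[1] * x[2]])

x  = np.array([1.0, 2.0, -1.0])
xp = np.array([0.5, -1.0, 3.0])
lhs = (x @ xp)**2         # kernel evaluated in the 3-dim input space
rhs = phi(x) @ phi(xp)    # inner product in the 6-dim feature space
```

The first three components of X(x) sum to |x|², so the energy cut on the sphere indeed becomes the linear condition X1 + X2 + X3 = r².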
The most common kernel used for classification is the Gaussian (see Sect. 11.2.1):
K(x, x′) = exp( −(x − x′)² / (2s²) ) .
It can be shown that it induces a mapping into a space of infinite dimensions [71] and that nevertheless the training vectors can in most cases be replaced by a relatively small number of support vectors. The only free parameter is the penalty constant which regulates the degree of overlap of the two classes. A high value leads to a very irregular shape of the hypersurface separating the training samples of the two classes to a high degree in the original space whereas for a low value its shape is much smoother and more minority observations are tolerated.
In practice, this mapping into the inner product space is not performed explicitly, in fact it is even rarely known. All calculations are performed in x-space, especially the determination of the support vectors and their weights α. The kernel trick merely
While the Bayes factor takes into account the uncertainty of the parameter estimate, this uncertainty is completely neglected in the p-value derived from the likelihood ratio taken simply at the MLE. On the other hand, for the calculation of the Bayes factor an at least partially subjective prior probability has to be included.
For the final rating the Bayes factor has to be multiplied by the prior factors of the competing hypotheses:
R = B (π_H1 / π_H2) = ( ∫ L1(θ1|x) π1(θ1) dθ1 / ∫ L2(θ2|x) π2(θ2) dθ2 ) (π_H1 / π_H2) .
The posterior rating is equal to the prior rating times the Bayes factor.
The Bayes factor is a very reasonable and conceptually attractive concept which requires little computational effort. It is to be preferred to the frequentist p-value approach in decision making. However, for the documentation of a measurement it has the typical Bayesian drawback that it depends on prior densities, and unfortunately there is no objective way to fix those.
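A toy numerical example of the Bayes factor, with an invented setup (not from the text): a single measurement x with Gaussian resolution σ; under H1 the mean μ is free with prior N(0, τ²), under H2 the mean is exactly zero, so L2 has no free parameter and no integration is needed.

```python
import numpy as np

x, sigma, tau = 2.0, 1.0, 3.0      # assumed toy values

def gauss(z, mu, s):
    return np.exp(-0.5 * ((z - mu) / s)**2) / (np.sqrt(2.0 * np.pi) * s)

# marginal likelihood of H1: integrate L1(mu|x) * pi1(mu) over the prior
mu = np.linspace(-30.0, 30.0, 60001)
dmu = mu[1] - mu[0]
marg1 = np.sum(gauss(x, mu, sigma) * gauss(mu, 0.0, tau)) * dmu
marg2 = gauss(x, 0.0, sigma)       # H2 is simple: likelihood at mu = 0

B = marg1 / marg2                  # Bayes factor in favor of H1
```

For this conjugate Gaussian setup the integral is known in closed form, marg1 = N(x; 0, σ² + τ²), which provides a check of the numerical integration; the multiplication by the prior odds π_H1/π_H2 then gives the posterior rating R.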
13.15 Robust Fitting Methods
13.15.1 Introduction
If one or a few observations in a sample are separated from the bulk of the data, we speak of outliers. The reasons for their existence range from trivial mistakes or detector failures to important physical effects. In any case, the assumed statistical model has to be questioned if one is not willing to admit that a large and very improbable fluctuation occurred.
Outliers are quite disturbing: They can change parameter estimates by large amounts and increase their errors drastically.
Frequently, outliers can be detected simply by inspection of appropriate plots. It goes without saying that simply dropping them is not good practice. In any case, at least a complete documentation of such an event is required. Clearly, objective methods for their detection and treatment are preferable.
In the following, we restrict our treatment to the simple one-dimensional case of Gaussian-like distributions, where outliers are located far from the average, and where we are interested in the mean value. If a possible outlier is contained in the allowed variate range of the distribution – which is always true for a Gaussian – a statistical fluctuation cannot be excluded as a logical possibility. Since the outliers are removed on the basis of a statistical procedure, the corresponding modification of results due to the possible removal of correct observations can be evaluated.
We distinguish three cases:
1.The standard deviations of the measured points are known.
2.The standard deviations of the measured points are unknown but known to be the same for all points.
3.The standard deviations are unknown and different.
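For case 1 a simple procedure can be sketched. This is an assumed illustration, not the text's specific prescription: with known standard deviations, the point with the largest pull relative to the current weighted mean is removed if its pull exceeds a cut c, and the mean is recomputed until no point fails the cut.

```python
import numpy as np

def trim_outliers(y, delta, c=4.0):
    """Iteratively remove the worst point whose pull |y - mean|/delta
    exceeds c, recomputing the weighted mean each time (assumed sketch)."""
    y, delta = np.asarray(y, float), np.asarray(delta, float)
    keep = np.ones(len(y), dtype=bool)
    changed = True
    while changed:
        changed = False
        w = 1 / delta[keep]**2
        mean = np.sum(w * y[keep]) / np.sum(w)
        pulls = np.abs(y - mean) / delta
        worst = int(np.argmax(np.where(keep, pulls, -np.inf)))
        if pulls[worst] > c:
            keep[worst] = False
            changed = True
    return keep, mean

y = [1.0, 1.2, 0.9, 1.1, 5.0]          # last point is a gross outlier
delta = [0.1, 0.1, 0.1, 0.1, 0.1]
keep, mean = trim_outliers(y, delta)
```

Since the removal is based on a statistical cut, the probability of discarding a correct observation (here a fluctuation beyond c standard deviations) can be quantified, as noted above.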