10.5 Significance of Signals
Often the significance of a signal s is stated in units of standard deviations σ:
s = Ns / √(N0 + δ0²) .
Here Ns is the number of events associated with the signal, N0 is the number of events in the signal region expected from H0, and δ0 its uncertainty. In the Gaussian approximation, s can be transformed into a p-value via (10.23). Unless N0 is very large and δ0 very well known, this p-value has to be considered a lower limit or a rough guess.
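A minimal sketch of this calculation, with hypothetical counts; the one-sided Gaussian conversion p = 1 − Φ(s) stands in for the transformation referred to as (10.23):

```python
import math

def significance(n_s, n_0, delta_0):
    """Significance in standard deviations: s = Ns / sqrt(N0 + delta0^2)."""
    return n_s / math.sqrt(n_0 + delta_0**2)

def p_value(s):
    """One-sided Gaussian p-value, p = 1 - Phi(s)."""
    return 0.5 * math.erfc(s / math.sqrt(2.0))

# hypothetical example: 30 signal events over an expected background of 100 +- 5
s = significance(n_s=30.0, n_0=100.0, delta_0=5.0)
print(s, p_value(s))
```

Note how the background uncertainty δ0 enters the denominator: a poorly known background dilutes the significance of the same excess.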
11 Statistical Learning
11.1 Introduction
In the process of its mental evolution a child learns to classify objects, persons, animals, and plants. This process partially proceeds through explanations by parents and teachers (supervised learning), but partially also by cognition of the similarities of different objects (unsupervised learning). But the process of learning – of children and adults – is not restricted to developing the ability merely to classify; it also includes the realization of relations between similar objects, which leads to ordering and quantifying physical quantities like size, time, temperature, etc. This is relatively easy when the laws of nature governing a specific relation have been discovered. If this is not the case, we have to rely on approximations, like inter- or extrapolations.
Computers, too, when appropriately programmed, can perform learning processes in a similar way, though to a rather modest degree. The achievements of so-called artificial intelligence are still rather moderate in most areas; however, substantial progress has been achieved in the fields of supervised learning and classification, where computers profit from their ability to handle large amounts of data in a short time and to provide precise quantitative solutions to well-defined questions. The techniques and programs that allow computers to learn and to classify are summarized in the literature under the term machine learning.
Let us now specify the type of problem which we discuss in this chapter: for an input vector x we want to find an output ŷ. The input is also called the predictor, the output the response. Usually, each input consists of several components (attributes, properties) and is therefore written in boldface. Normally it is a metric (quantifiable) quantity, but it could also be a categorical quantity like a color or a particle type. The output can also contain several components or consist of a single real or discrete (yes or no) variable. Like a human being, a computer program learns from past experience. The teaching process, called training, uses a training sample {(x1, y1), (x2, y2), . . . , (xN, yN)}, where for each input vector xi the response yi is known. When we ask for the response to an arbitrary continuous input x, usually its estimate ŷ(x) will be more accurate when the distance to the nearest input vector of the training sample is small than when it is large. Consequently, the training sample should be as large as possible or affordable. The region of interest should be covered homogeneously with input vectors, and we should be aware that the accuracy of the estimate decreases at the boundary of this region.
are enhanced. When a fixed number of points is uniformly distributed over a hyper-cube of dimension d, the mean distance between the points is proportional to √d. The higher the dimension, the emptier the space. At the same time, the region where estimates become less accurate due to surface effects grows. The fraction of the volume taken by a hyper-sphere inscribed into a hyper-cube is only about 16% for d = 5, and the fraction of the volume within a distance to the surface of less than 10% of the edge length increases like 1 − 0.8^d, that is, from 20% for d = 1 to 67% for d = 5.
Example 139. Curse of dimensionality
A training sample of 1000 five-dimensional inputs is uniformly distributed over a hyper-cube of edge length a. To estimate the function value at the center of the region we take all sample elements within a distance of a/4 from the center. These are on average only about five ( 1000 × 0.164 × (1/2)^5 ≈ 5 ), while in one dimension 500 elements would contribute.
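The geometry of this example can be cross-checked numerically; the fraction of a hyper-cube covered by the inscribed d-ball follows from the standard volume formula π^(d/2)/(2^d Γ(d/2+1)), and a small Monte Carlo simulation confirms the expected neighbor count:

```python
import math
import random

def ball_fraction(d):
    """Volume of the d-ball inscribed in a hyper-cube, relative to the cube."""
    return math.pi**(d / 2) / (2**d * math.gamma(d / 2 + 1))

d, n = 5, 1000
# radius a/4 is half the inscribed radius a/2, hence the extra factor (1/2)^d
expected = n * ball_fraction(d) * 0.5**d
print(round(expected, 1))

# Monte Carlo cross-check: count sample points within a/4 of the cube center
random.seed(1)
trials, hits = 200, 0
for _ in range(trials):
    for _ in range(n):
        p = [random.random() - 0.5 for _ in range(d)]
        if sum(c * c for c in p) < 0.25**2:
            hits += 1
print(hits / trials)
```

Both numbers illustrate the curse of dimensionality: the same cut that keeps half the sample in one dimension keeps only a handful of points in five.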
In the following, we will first discuss the approximation of measurements afflicted with errors by analytic functions and the interpolation by smoothing techniques. Next we introduce factor analysis, including the so-called principal component analysis. The last section deals with classification methods based on artificial neural networks, kernel algorithms, and decision trees. In recent years we have observed fast progress in this field due to new developments such as support vector machines and boosting, and due to the availability of powerful general computer algorithms. This book can only introduce these methods, without any claim of completeness. A nice review of the whole field is given in [13].
11.2 Smoothing of Measurements and Approximation by Analytic Functions
We start with two simple examples, which illustrate applications:
i) In a sequence of measurements the gas amplification of an ionization chamber has been determined as a function of the applied voltage. We would like to describe the dependence in the form of a smooth curve.
ii) With optical probes it is possible to scan a surface profile point-wise. The objects may be workpieces, tools, or human bodies. The measurements can be used by milling machines or cutting devices to produce replicates or clothes. To steer these machines, a complete surface profile of the objects is needed, so the discrete points have to be approximated by a continuous function. When the surface is sufficiently smooth, this may be achieved by means of a spline approximation.
More generally, we are given a number N of measurements yi with uncertainties δi at fixed locations xi, the independent variables, but are interested in the values of the dependent or response variable y at different values of x; that is, we search for a function f(x) which approximates the measurements, improves their precision, and inter- and extrapolates in x. The simplest way to achieve this is to smooth the polygon connecting the data points.
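Smoothing the polygon amounts to linear interpolation between the measured points; a one-line sketch with hypothetical data:

```python
import numpy as np

# Hypothetical measurements y_i at fixed locations x_i
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.2, 2.9, 4.1])

# Linear interpolation along the polygon connecting the data points
print(np.interp(1.5, x, y))   # -> 2.55, midway between y = 2.2 and y = 2.9
```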
More efficient is the approximation of the measurements by a parameter-dependent analytic function f(x, θ). We then determine the parameters by a least squares fit, i.e. minimize the sum of the squared and normalized residuals, Σi [yi − f(xi, θ)]² / δi², with respect to θ. The approximation should be compatible with the measurements within their statistical errors, but the number of free parameters should be as small as possible. The accuracy of the measurements has a decisive influence on the number of free parameters which we permit in the fit: for large errors we also allow for large deviations of the approximation from the measurements. As a criterion for the number of free parameters we use statistical tests like the χ² test. The value of χ² should then be compatible with the number of constraints, i.e. the number of measured points minus the number of fitted parameters. Too low a number of parameters leads to a bias of the predictions, while too many parameters reduce the accuracy, since we profit less from the constraints.
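A minimal sketch of such a fit and the χ² compatibility check, with hypothetical data; `np.polyfit` with weights w = 1/δ performs exactly the minimization of Σ ((y − f)/δ)²:

```python
import numpy as np

def fit_poly_weighted(x, y, delta, deg):
    """Weighted least squares fit: minimizes sum(((y - f(x)) / delta)^2)."""
    coeffs = np.polyfit(x, y, deg, w=1.0 / delta)
    f = np.polyval(coeffs, x)
    chi2 = np.sum(((y - f) / delta) ** 2)
    ndf = len(x) - (deg + 1)   # number of constraints: points minus parameters
    return coeffs, chi2, ndf

# hypothetical data: linear truth plus Gaussian noise matching the errors
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
delta = np.full_like(x, 0.1)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, x.size)

coeffs, chi2, ndf = fit_poly_weighted(x, y, delta, deg=1)
print(chi2, ndf)   # chi2 should be compatible with ndf = 18
```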
Both approaches rely on the presumption that the true function is simple and smooth. Experience tells us that these conditions are justified in most cases.
The approximation of measurements which all have the same uncertainty by analytic functions is called regression analysis. Linear regression had been described in Chap. 7.2.3. In this section we treat the general non-linear case with arbitrary errors.
In principle, the independent variable may also be multi-dimensional. Since then the treatment is essentially the same as in the one-dimensional situation, we will mainly discuss the latter.
11.2.1 Smoothing Methods
We use the measured points in the neighborhood of x to get an estimate of the value of y(x). We denote the uncertainty of component j of the output vectors of the training sample by δj. When the points of the training sample have large errors, we average over a larger region than in the case of small errors. The better accuracy of the average over a larger region has to be paid for by a larger bias, due to the possibility of larger fluctuations of the true function within this region. Weighting methods work properly if the function is approximately linear. Difficulties arise in regions with a lot of structure and at the boundaries of the region if the function is not approximately constant there.
k-Nearest Neighbors
The simplest method of function approximation is similar to the density estimation treated in Chap. 9, which uses the nearest neighbors in the training sample. We define a distance di = |x − xi| and sort the elements of the training sample in the order of their distances, di ≤ di+1. We choose a number K of nearest neighbors and average over the corresponding output vectors:
ŷ(x) = (1/K) Σ_{i=1}^{K} yi .
This relation holds for constant errors. Otherwise, for the component j of y, the corresponding weighted mean

ŷj(x) = ( Σ_{i=1}^{K} yij/δij² ) / ( Σ_{i=1}^{K} 1/δij² )

is used.
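A sketch of the k-nearest-neighbor estimate with the error-weighted mean, for a one-dimensional output and hypothetical data; with equal errors it reduces to the plain average:

```python
import numpy as np

def knn_estimate(x, xs, ys, deltas, k):
    """Average the outputs of the k training points nearest to x,
    weighting each y_i by 1/delta_i^2."""
    idx = np.argsort(np.abs(xs - x))[:k]   # indices of the k nearest neighbors
    w = 1.0 / deltas[idx] ** 2
    return np.sum(w * ys[idx]) / np.sum(w)

# hypothetical training sample (x_i, y_i) with uncertainties delta_i
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.1, 1.1, 1.9, 3.2, 3.9])
deltas = np.array([0.1, 0.1, 0.2, 0.1, 0.1])

# neighbors of x = 1.4 are x = 1 and x = 2; the more precise point dominates
print(knn_estimate(1.4, xs, ys, deltas, k=2))   # -> 1.26
```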
Table 11.1. Characteristics of orthogonal polynomials.

Polynomial         Domain       Weight function
Legendre, Pi(x)    [−1, +1]     w(x) = 1
Hermite, Hi(x)     (−∞, +∞)     w(x) = exp(−x²)
Laguerre, Li(x)    [0, ∞)       w(x) = exp(−x)
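The orthogonality relations behind Table 11.1 can be checked numerically with Gauss quadrature, here for one Legendre and one Hermite pair; Gauss-Hermite quadrature has the weight exp(−x²) built into its weights:

```python
import numpy as np
from numpy.polynomial.legendre import Legendre, leggauss
from numpy.polynomial.hermite import Hermite, hermgauss

# Legendre: integral over [-1, 1] of P2 * P3 with w(x) = 1 vanishes
x, wq = leggauss(50)
legendre_overlap = np.sum(wq * Legendre.basis(2)(x) * Legendre.basis(3)(x))

# Hermite: integral over (-inf, inf) of H1 * H4 * exp(-x^2) vanishes
xh, wh = hermgauss(50)
hermite_overlap = np.sum(wh * Hermite.basis(1)(xh) * Hermite.basis(4)(xh))

print(legendre_overlap, hermite_overlap)   # both zero up to roundoff
```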
useful, some of which are displayed in Table 11.1. The orthogonal functions are proportional to polynomials pi(x) of degree i, multiplied by the square root of a weight function w(x): ui(x) = pi(x)√w(x). Specifying the domain [a, b] and w, and requiring orthogonality of the ui,
(ui, uj ) = ciδij ,
fixes the polynomials up to the somewhat conventional normalization factors √ci.
The most familiar orthogonal functions are the trigonometric functions used in the Fourier series mentioned above. From electrodynamics and quantum mechanics we are also familiar with Legendre polynomials and spherical harmonics. These functions are useful for data depending on variables defined on the circle or on the sphere, e.g. angular distributions. For example, the distribution of the intensity of the microwave background radiation, which contains information about the curvature of space, the baryon density, and the amount of dark matter in the universe, is usually described as a function of the solid angle by a superposition of spherical harmonics. In particle physics the angular distributions of scattered or produced particles can be described by Legendre polynomials or spherical harmonics. Functions extending to ±∞ are often approximated by the eigenfunctions of the harmonic oscillator, consisting of Hermite polynomials multiplied by the exponential exp(−x²/2), and functions bounded to x ≥ 0 by Laguerre polynomials multiplied by e^(−x/2).
In order to approximate a given measurement by one of the orthogonal function systems, one usually has to shift and scale the independent variable x.
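The required shift and scale is an affine map of the measurement interval onto the domain of the chosen system, e.g. [a, b] onto the Legendre domain [−1, +1]; the interval below is hypothetical:

```python
def to_legendre_domain(x, a, b):
    """Affine map of the measurement interval [a, b] onto [-1, +1]."""
    return 2.0 * (x - a) / (b - a) - 1.0

# hypothetical measurement interval [10, 30]
print(to_legendre_domain(10.0, 10.0, 30.0),   # -> -1.0
      to_legendre_domain(20.0, 10.0, 30.0),   # ->  0.0
      to_legendre_domain(30.0, 10.0, 30.0))   # -> +1.0
```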
Polynomial Approximation
The simplest function approximation is achieved with a simple polynomial f(x) = Σ_k akx^k, or more generally by f(x) = Σ_k akuk(x), where uk is a polynomial of order k. Given data yν with uncertainties δν at locations xν, we minimize

χ² = Σ_{ν=1}^{N} (1/δν²) [ yν − Σ_{k=0}^{K} akuk(xν) ]²     (11.3)
in order to determine the coefficients ak. To constrain the coefficients, their number K + 1 has to be smaller than the number N of measurements. All polynomial systems of the same order describe the data equally well but differ in the degree to which the coefficients are correlated. The order of the polynomial is increased until it is compatible with the data within statistics. The decision is based on a χ² criterion.
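A sketch of this order-selection loop on hypothetical data; the stopping rule used here, χ² < ndf + 2√(2·ndf), is one rough way to formalize "compatible within statistics", not the book's exact test:

```python
import numpy as np

def select_order(x, y, delta, max_deg=10):
    """Raise the polynomial order until chi2 is compatible with ndf = N - K - 1."""
    for deg in range(max_deg + 1):
        coeffs = np.polyfit(x, y, deg, w=1.0 / delta)
        chi2 = np.sum(((y - np.polyval(coeffs, x)) / delta) ** 2)
        ndf = len(x) - (deg + 1)
        if chi2 < ndf + 2.0 * np.sqrt(2.0 * ndf):   # rough chi2 compatibility cut
            return deg, chi2, ndf
    return max_deg, chi2, ndf

# hypothetical data: quadratic truth plus noise matching the stated errors
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 30)
delta = np.full_like(x, 0.05)
y = 0.5 - x + 2.0 * x**2 + rng.normal(0.0, 0.05, x.size)

deg, chi2, ndf = select_order(x, y, delta)
print(deg, round(chi2, 1), ndf)   # the loop should stop near deg = 2
```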
The purpose of this section is to show how we can select polynomials with uncorrelated coefficients. In principle, these polynomials and their coefficients can be computed through diagonalization of the error matrix, but they can also be obtained