
10.5 Significance of Signals


Often the significance of a signal s is stated in units of standard deviations σ:

$$ s = \frac{N_s}{\sqrt{N_0 + \delta_0^2}} . $$

Here N_s is the number of events associated with the signal, N_0 is the number of events in the signal region expected from H_0, and δ_0 is its uncertainty. In the Gaussian approximation it can be transformed into a p-value via (10.23). Unless N_0 is very large and δ_0 is very well known, this p-value has to be considered as a lower limit or a rough guess.
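For orientation, a minimal sketch of this conversion in Python, assuming the one-sided Gaussian convention behind (10.23); the function name and the numbers are illustrative only:

```python
from scipy.stats import norm

def significance_p_value(n_s, n_0, delta_0):
    """Gaussian approximation: s = N_s / sqrt(N_0 + delta_0^2),
    converted to a one-sided p-value (assumed convention)."""
    s = n_s / (n_0 + delta_0**2) ** 0.5
    return norm.sf(s)              # upper tail of the standard normal

print(significance_p_value(n_s=30, n_0=100, delta_0=5))   # s ≈ 2.7 sigma, p ≈ 4e-3
```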

11 Statistical Learning

11.1 Introduction

In the process of its mental evolution a child learns to classify objects, persons, animals, and plants. This process partially proceeds through explanations by parents and teachers (supervised learning), but partially also by cognition of the similarities of different objects (unsupervised learning). But the process of learning – of children and adults – is not restricted to the development of the ability merely to classify; it also includes the realization of relations between similar objects, which leads to ordering and quantifying physical quantities, like size, time, temperature, etc. This is relatively easy when the laws of nature governing a specific relation have been discovered. If this is not the case, we have to rely on approximations, like inter- or extrapolations.

Computers too, when appropriately programmed, can perform learning processes in a similar way, though to a rather modest degree. The achievements of so-called artificial intelligence are still rather moderate in most areas; however, substantial progress has been achieved in the fields of supervised learning and classification, where computers profit from their ability to handle a large amount of data in a short time and to provide precise quantitative solutions to well defined specific questions. The techniques and programs that allow computers to learn and to classify are summarized in the literature under the term machine learning.

Let us now specify the type of problems which we discuss in this chapter: for an input vector x we want to find an output ŷ. The input is also called predictor, the output response. Usually, each input consists of several components (attributes, properties) and is therefore written in boldface letters. Normally, it is a metric (quantifiable) quantity, but it could also be a categorical quantity like a color or a particle type. The output can also contain several components or consist of a single real or discrete (Yes or No) variable. Like a human being, a computer program learns from past experience. The teaching process, called training, uses a training sample

{(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)}, where for each input vector x_i the response y_i is known. When we ask for the response to an arbitrary continuous input x, its estimate ŷ(x) will usually be more accurate when the distance to the nearest input vector of the training sample is small than when it is far away. Consequently, the training sample should be as large as possible or affordable. The region of interest should be covered homogeneously with input vectors, and we should be aware that the accuracy of the estimate decreases at the boundary of this region.


Learning which exceeds simple memorizing relies on the existence of more or less simple relations between input and response: similar input corresponds to similar response. In our approach this translates into the requirement that the responses are similar for input vectors which are close. We cannot learn much from erratic distributions.

Example 135. Simple empirical relations

The resistance R of a wire is used for a measurement of the temperature T. In the teaching process, which here is called calibration, a sample of corresponding values R_i, T_i is acquired. In the application we want to find, for a given input R, an estimate of T. Usually a simple interpolation will solve this problem.

For more complicated relations, approximations with polynomials, higher spline functions or orthogonal functions are useful.
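As a minimal illustration of the simple case, a linear interpolation in Python; the calibration numbers below are invented and only loosely resemble a real resistance thermometer:

```python
import numpy as np

# calibration sample (R_i, T_i); values are made up for illustration
R_cal = np.array([100.0, 103.9, 107.8, 111.7])   # resistance in ohm
T_cal = np.array([0.0, 10.0, 20.0, 30.0])        # temperature in deg C

# estimate T for a measured resistance R by simple linear interpolation
print(np.interp(105.0, R_cal, T_cal))            # about 12.8 deg C
```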

Example 136. Search for common properties

A certain class of molecules has a positive medical effect. The structure, physical and chemical properties x of these molecules are known. In order to find out which combination of the properties is relevant, the distribution of all attributes of the molecules which represent the training objects is investigated. A linear method for the solution of this task is the principal component analysis.

Example 137. Two-class classification, SPAM mails

A sizable fraction of electronic mails is of no interest to the addressee and considered by him as a nuisance. Many mailing systems use filter programs to eliminate these undesired so-called SPAM1 mails. After evaluation of a training sample where the classification into Yes or No (accept or reject) is done by the user, the programs are able to take over the classification job. They identify certain characteristic words, like Viagra, sex, profit, advantage, meeting, experiment, university, and other attributes like large letters or colors to distinguish between SPAM and serious mails. This kind of problem is efficiently solved by decision trees and artificial neural networks.

The attributes are here categorical variables. In the following we will restrict ourselves mainly to continuous variables.

Example 138. Multi-class classification, pattern recognition

Hand-written letters or digits have to be recognized. Again a sample, for which the relation between the written pixels and the letters is known, is used to train the program. This problem, too, can be treated by decision trees, artificial neural networks, and by kernel methods. Here the attributes are the pixel coordinates.

As we have already observed, multivariate applications suffer from the curse of dimensionality. There are two reasons: i) with increasing number of dimensions d, the distance between the input vectors increases, and ii) the surface effects

1SPAM is an artificial nonsense word borrowed from a sketch of the British comedy series Monty Python’s Flying Circus, where in a cafe every meal contains SPAM.


are enhanced. When a fixed number of points is uniformly distributed over a hyper-cube of dimension d, the mean distance between the points is proportional to √d.

The higher the dimension, the emptier the space. At the same time the region where estimates become less accurate due to surface effects increases. The fraction of the volume taken by a hyper-sphere inscribed into a hyper-cube is only 5.2% for d = 5, and the fraction of the volume within a distance to the surface of less than 10% of the edge length increases like 1 − 0.8^d, that is, from 20% for d = 1 to 67% for d = 5.
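The surface effect can be checked numerically; the following sketch compares a Monte Carlo estimate with the formula 1 − 0.8^d for points uniform in the unit hypercube:

```python
import numpy as np

rng = np.random.default_rng(1)
for d in (1, 2, 5, 10):
    x = rng.random((100_000, d))                    # uniform points in [0, 1]^d
    # "near the surface": some coordinate closer than 10% of the edge length
    # to a face, i.e. the point lies outside the inner cube [0.1, 0.9]^d
    frac = ((x < 0.1) | (x > 0.9)).any(axis=1).mean()
    print(f"d={d:2d}:  MC {frac:.3f}   1 - 0.8^d = {1 - 0.8**d:.3f}")
```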

Example 139. Curse of dimensionality

A training sample of 1000 five-dimensional inputs is uniformly distributed over a hyper-cube of edge length a. To estimate the function value at the center of the region we take all sample elements within a distance of a/4 from the center. These are on average only one to two (1000 × 0.052 × 0.5^5 ≈ 1.6), while in one dimension 500 elements would contribute.

In the following, we will first discuss the approximation of measurements afflicted with errors by analytic functions and the interpolation by smoothing techniques. Next we introduce factor analysis, including the so-called principal component analysis. The last section deals with classification methods, based on artificial neural networks, kernel algorithms, and decision trees. In recent years we have observed fast progress in this field due to new developments such as support vector machines and boosting, and due to the availability of powerful general computer algorithms. This book can only introduce these methods, without claim of completeness. A nice review of the whole field is given in [13].

11.2 Smoothing of Measurements and Approximation by Analytic Functions

We start with two simple examples, which illustrate applications:

i) In a sequence of measurements the gas amplification of an ionization chamber has been determined as a function of the applied voltage. We would like to describe the dependence in the form of a smooth curve.

ii) With optical probes it is possible to scan a surface profile point-wise. The objects may be workpieces, tools, or human bodies. The measurements can be used by milling machines or cutting devices to produce replicas or clothes. To steer these machines, a complete surface profile of the objects is needed. The discrete points have to be approximated by a continuous function. When the surface is sufficiently smooth, this may be achieved by means of a spline approximation.

More generally, we are given a number N of measurements y_i with uncertainties δ_i at fixed locations x_i, the independent variables, but are interested in the values of the dependent or response variable y at different values of x; that is, we search for a function f(x) which approximates the measurements, improves their precision, and inter- and extrapolates in x. The simplest way to achieve this is to smooth the polygon connecting the data points.

More efficient is the approximation of the measurements by a parameter-dependent analytic function f(x, θ). We then determine the parameters by a least squares fit, i.e.


minimize the sum of the squared and normalized residuals, $\sum_i [y_i - f(x_i, \theta)]^2/\delta_i^2$, with respect to θ. The approximation should be compatible with the measurements within their statistical errors, but the number of free parameters should be as small as possible. The accuracy of the measurements has a decisive influence on the number of free parameters which we permit in the fit. For large errors we also allow for large deviations of the approximation from the measurements. As a criterion for the number of free parameters, we use statistical tests like the χ² test. The value of χ² should then be compatible with the number of constraints, i.e. the number of measured points minus the number of fitted parameters. Too low a number of parameters leads to a bias of the predictions, while too many parameters reduce the accuracy, since we profit less from the constraints.

Both approaches rely on the presumption that the true function is simple and smooth. Experience tells us that these conditions are justified in most cases.

The approximation by analytic functions of measurements which all have the same uncertainty is called regression analysis. Linear regression was described in Chap. 7.2.3. In this section we treat the general non-linear case with arbitrary errors.

In principle, the independent variable may also be multi-dimensional. Since then the treatment is essentially the same as in the one-dimensional situation, we will mainly discuss the latter.

11.2.1 Smoothing Methods

We use the measured points in the neighborhood of x to get an estimate of the value of y(x). We denote the uncertainties of the output vectors of the training sample by δ_j for the component j of y. When the points of the training sample have large errors, we average over a larger region than in the case of small errors. The better accuracy of the average over a larger region has to be paid for by a larger bias, due to the possibility of larger fluctuations of the true function in this region. Weighting methods work properly if the function is approximately linear. Difficulties arise in regions with a lot of structure and at the boundaries of the region if the function is not approximately constant there.

k-Nearest Neighbors

The simplest method for a function approximation is similar to the density estimation which we treat in Chap. 9 and which uses the nearest neighbors in the training sample. We define a distance di = |x − xi| and sort the elements of the training sample in the order of their distances di < di+1. We choose a number K of nearest neighbors and average over the corresponding output vectors:

$$ \hat{y}(x) = \frac{1}{K} \sum_{i=1}^{K} y_i . $$

This relation holds for constant errors. Otherwise, for the component j of y, the corresponding weighted mean

$$ \hat{y}_j(x) = \frac{\sum_{i=1}^{K} y_{ij}/\delta_{ij}^2}{\sum_{i=1}^{K} 1/\delta_{ij}^2} $$


has to be used. The choice of K depends on the density of points in the training sample and the expected variation of the true function y(x).
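A minimal sketch of the K-nearest-neighbor estimate for a one-dimensional input, covering both the plain and the weighted mean; function and variable names are illustrative only:

```python
import numpy as np

def knn_estimate(x, x_train, y_train, K, delta=None):
    """Average of the K nearest training responses; with errors delta
    the weighted mean with weights 1/delta^2 is used instead."""
    idx = np.argsort(np.abs(x_train - x))[:K]       # the K nearest neighbors of x
    if delta is None:
        return np.mean(y_train[idx])                # constant errors: plain average
    w = 1.0 / delta[idx] ** 2
    return np.sum(w * y_train[idx]) / np.sum(w)     # weighted mean
```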

If all individual points in the projection j have mean square errors δ_j², the error of the prediction δy_j is given by

$$ (\delta y_j)^2 = \frac{\delta_j^2}{K} + \langle y_j(x) - \hat{y}_j(x)\rangle^2 . \qquad (11.1) $$

The first term is the statistical fluctuation of the mean value. The second term is the bias which is equal to the systematic shift squared, and which is usually difficult to evaluate. There is the usual trade-off between the two error components: with increasing K the statistical term decreases, but the bias increases by an amount depending on the size of the fluctuations of the true function within the averaging region.

k-Nearest Neighbors with Linear Approximation

The simple average suffers from the drawback that at the boundary of the variable space the measurements contributing to the average are distributed asymmetrically with respect to the point of interest x. If, for instance, the function falls strongly toward the left-hand boundary of a one-dimensional space, averaging over points which are predominantly located on the right-hand side of x leads to too large a result. (See also the example at the end of this section.) This problem can be avoided by fitting a linear function through the K neighboring points instead of using the mean value of y.
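A sketch of this variant for one-dimensional x: a straight line is fitted to the K nearest neighbors and evaluated at the point of interest, which reduces the bias at the boundaries (names are illustrative):

```python
import numpy as np

def knn_linear_estimate(x, x_train, y_train, K):
    """Fit a straight line through the K nearest neighbors and evaluate it at x."""
    idx = np.argsort(np.abs(x_train - x))[:K]
    slope, intercept = np.polyfit(x_train[idx], y_train[idx], deg=1)  # local linear fit
    return slope * x + intercept
```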

Gaussian Kernels

Taking all K nearest neighbors into account with the same weight, independent of their distance to x, is certainly not optimal. Furthermore, the resulting output function is piecewise constant (or linear) and thus discontinuous. Better is a weighting procedure where the weights become smaller with increasing distance. An often used weighting or kernel function2 is the Gaussian. The sum is now taken over all N training inputs:

 

$$ \hat{y}(x) = \frac{\sum_{i=1}^{N} y_i\, e^{-\alpha |x - x_i|^2}}{\sum_{i=1}^{N} e^{-\alpha |x - x_i|^2}} . $$

The constant α determines the range of the correlation. Therefore the width $s = 1/\sqrt{2\alpha}$ of the Gaussian has to be adjusted to the density of the points and to the curvature of the function. If computing time has to be economized, the sum may of course be truncated and restricted to the neighborhood of x, for instance to the distance 2s. According to (11.1) the mean squared error becomes3:

$$ (\delta y_j)^2 = \delta_j^2\, \frac{\sum_i e^{-2\alpha |x - x_i|^2}}{\left( \sum_i e^{-\alpha |x - x_i|^2} \right)^2} + \langle y_j(x) - \hat{y}_j(x)\rangle^2 . $$

2The name kernel will be justified later, when we introduce classification methods.
3This relation has to be modified if not all errors are equal.
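A minimal sketch of the Gaussian kernel estimate for one-dimensional x and equal errors, parametrized by the width s = 1/√(2α); names are illustrative only:

```python
import numpy as np

def gaussian_kernel_estimate(x, x_train, y_train, s):
    """Weighted average over all training points with weights exp(-alpha |x - x_i|^2),
    where alpha = 1 / (2 s^2)."""
    alpha = 1.0 / (2.0 * s**2)
    w = np.exp(-alpha * (x_train - x) ** 2)
    return np.sum(w * y_train) / np.sum(w)
```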


11.2.2 Approximation by Orthogonal Functions

Complete orthogonal function systems offer three attractive features: i) the fitted function coefficients are uncorrelated, ii) the function systems are complete and thus able to approximate any well behaved, i.e. square integrable, function, iii) they are naturally ordered with increasing oscillation frequency4. The function system to be used depends on the specific problem, i.e. on the domain of the variable and the asymptotic behavior of the function. Since the standard orthogonal functions are well known to physicists, we will be very brief and omit all mathematical details; they can be looked up in mathematical handbooks.

Complete normalized orthogonal function systems {ui(x)} defined on the finite or infinite interval [a, b] fulfil the orthogonality and the completeness relations. To simplify the notation, we introduce the inner product (g, h)

$$ (g, h) \equiv \int_a^b g^*(x)\, h(x)\, dx $$

and have

$$ (u_i, u_j) = \delta_{ij} , \qquad \sum_i u_i^*(x')\, u_i(x) = \delta(x - x') . $$

For instance, the functions of the well known Fourier system for the interval [a, b] = [−L/2, L/2] are $u_n(x) = \frac{1}{\sqrt{L}}\, e^{i 2\pi n x/L}$.

Every square integrable function can be represented by the series

$$ f(x) = \sum_{i=0}^{\infty} a_i u_i(x) , \quad \text{with } a_i = (u_i, f) $$

in the sense that the squared difference converges to zero with increasing number of terms5:

$$ \lim_{K\to\infty} \left[ f(x) - \sum_{i=0}^{K} a_i u_i(x) \right]^2 = 0 . \qquad (11.2) $$

The coefficients a_i become small for large i, if f(x) is smooth as compared to the u_i(x), which oscillate faster and faster for increasing i. Truncation of the series therefore causes some smoothing of the function.

The approximation of measurements by orthogonal functions works quite well for very smooth data. When the measurements show strong short-range variations, sharp peaks or valleys, then a large number of functions is required to describe the data. Neglecting individually insignificant contributions may lead to a poor approximation. Typically, the truncation may produce spurious oscillations (“ringing”) in regions near the peaks, where the true function is already flat.
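As a sketch of the truncation just described, assuming equidistant points with equal errors, one can keep only the lowest Fourier coefficients of the data; the cutoff K is chosen by hand here, while in practice it would be fixed by a χ² criterion:

```python
import numpy as np

def fourier_smooth(y, K):
    """Smooth equidistant data by truncating its discrete Fourier expansion."""
    c = np.fft.rfft(y)              # coefficients of the trigonometric system
    c[K + 1:] = 0.0                 # drop the rapidly oscillating terms
    return np.fft.irfft(c, n=len(y))
```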

For large data sets with equidistant points and equal errors the Fast Fourier Transform, FFT, plays an important role, especially for data smoothing and image processing. Besides the trigonometric functions, other orthogonal systems are

4We use the term frequency also for spatial dimensions.

5At possible discontinuities, f(x) should be taken as [f(x + 0) + f(x − 0)]/2.


Table 11.1. Characteristics of orthogonal polynomials.

Polynomial          Domain        Weight function
Legendre, P_i(x)    [−1, +1]      w(x) = 1
Hermite, H_i(x)     (−∞, +∞)      w(x) = exp(−x²)
Laguerre, L_i(x)    [0, ∞)        w(x) = exp(−x)

useful, some of which are displayed in Table 11.1. The orthogonal functions are proportional to polynomials p_i(x) of degree i multiplied by the square root of a weight function w(x), $u_i(x) = p_i(x)\sqrt{w(x)}$. Specifying the domain [a, b] and w, and requiring orthogonality for u_i, u_j,

$$ (u_i, u_j) = c_i \delta_{ij} , $$

fixes the polynomials up to the somewhat conventional normalization factors c_i.

The most familiar orthogonal functions are the trigonometric functions used in the Fourier series mentioned above. From electrodynamics and quantum mechanics we are also familiar with Legendre polynomials and spherical harmonics. These functions are useful for data depending on variables defined on the circle or on the sphere, e.g. angular distributions. For example, the distribution of the intensity of the microwave background radiation, which contains information about the curvature of space, the baryon density and the amount of dark matter in the universe, is usually described as a function of the solid angle by a superposition of spherical harmonics. In particle physics the angular distributions of scattered or produced particles can be described by Legendre polynomials or spherical harmonics. Functions extending to ±∞ are often approximated by the eigenfunctions of the harmonic oscillator, consisting of Hermite polynomials multiplied by the exponential exp(−x²/2), and functions restricted to x ≥ 0 by Laguerre polynomials multiplied by exp(−x/2).

In order to approximate a given measurement by one of the orthogonal function systems, one usually has to shift and scale the independent variable x.
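As an illustration of this shifting and scaling, a sketch using the Legendre polynomials via numpy; this is an ordinary least-squares fit, not the uncorrelated-coefficient construction described below, and the data are invented:

```python
import numpy as np
from numpy.polynomial import Legendre

x = np.linspace(2.0, 5.0, 50)                       # measurement locations
y = np.sin(x) + 0.05 * np.random.default_rng(0).standard_normal(x.size)

# Legendre.fit maps [x.min(), x.max()] onto the natural domain [-1, +1],
# i.e. it shifts and scales the independent variable internally
approx = Legendre.fit(x, y, deg=6)
print(approx(3.3), np.sin(3.3))                     # smoothed estimate vs. true value
```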

Polynomial Approximation

The simplest function approximation is achieved with a simple polynomial $f(x) = \sum_k a_k x^k$, or more generally by $f(x) = \sum_k a_k u_k(x)$, where u_k is a polynomial of order k. Given data y_ν with uncertainties δ_ν at locations x_ν, we minimize

$$ \chi^2 = \sum_{\nu=1}^{N} \frac{1}{\delta_\nu^2} \left[ y_\nu - \sum_{k=0}^{K} a_k u_k(x_\nu) \right]^2 \qquad (11.3) $$

in order to determine the coefficients a_k. To constrain the coefficients, their number K + 1 has to be smaller than the number N of measurements. All polynomial systems of the same order describe the data equally well but differ in the degree to which the coefficients are correlated. The power of the polynomial is increased until it is compatible within statistics with the data. The decision is based on a χ² criterion.
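A sketch of this χ² criterion with a weighted polynomial fit: the degree is raised until χ² is compatible with the number of constraints N − (K + 1); the acceptance window of about one standard deviation is an illustrative choice:

```python
import numpy as np

def choose_degree(x, y, delta, K_max=10):
    """Raise the polynomial degree K until chi^2 is compatible with N - (K + 1)."""
    N = len(x)
    for K in range(K_max + 1):
        coef = np.polyfit(x, y, deg=K, w=1.0 / delta)      # weights multiply the residuals
        chi2 = np.sum(((y - np.polyval(coef, x)) / delta) ** 2)
        ndf = N - (K + 1)                                   # number of constraints
        if chi2 < ndf + np.sqrt(2.0 * ndf):                 # within about one sigma
            return K, chi2, ndf
    return K_max, chi2, ndf
```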

The purpose of this section is to show how we can select polynomials with uncorrelated coefficients. In principle, these polynomials and their coefficients can be computed through diagonalization of the error matrix but they can also be obtained


directly with the Gram–Schmidt method. This method has the additional advantage that the polynomials and their coefficients are given by simple algebraic relations.

For a given sample of measured points y_ν = f(x_ν) with errors δ_ν, we fix the weights in the usual way

$$ w_\nu = w(x_\nu) = \frac{1/\delta_\nu^2}{\sum_j 1/\delta_j^2} , $$

and now define the inner product of two functions g(x), h(x) by

 

 

$$ (g, h) = \sum_\nu w_\nu\, g(x_\nu)\, h(x_\nu) $$

with the requirement

$$ (u_i, u_j) = \delta_{ij} . $$

Minimizing χ² is equivalent to minimizing

$$ X^2 = \sum_{\nu=1}^{N} w_\nu \left[ y_\nu - \sum_{k=0}^{K} a_k u_k(x_\nu) \right]^2 . $$

For K = N − 1 the square bracket at the minimum of X² is zero,

$$ y_\nu - \sum_{k=0}^{N-1} a_k u_k(x_\nu) = 0 $$

for all ν, and forming the inner product with u_j we get

$$ (y, u_j) = a_j . \qquad (11.4) $$

This relation produces the coefficients also in the interesting case K < N − 1. To construct the orthogonal polynomials, we set v_0 = 1,

$$ u_i = \frac{v_i}{\sqrt{(v_i, v_i)}} , \qquad (11.5) $$

$$ v_{i+1} = x^{i+1} - \sum_{j=0}^{i} (u_j, x^{i+1})\, u_j . \qquad (11.6) $$

The first two terms in the corresponding expansion, a0u0 and a1u1, are easily calculated. From (11.5), (11.6), (11.4) and the following definition of the moments of the weighted sample

 

$$ \bar{x} = \sum_\nu w_\nu x_\nu , \quad s_x^2 = \sum_\nu w_\nu (x_\nu - \bar{x})^2 , \quad s_{xy} = \sum_\nu w_\nu (x_\nu y_\nu - \bar{x}\bar{y}) $$

we find the coefficients and functions which fix the polynomial expansion of y:

$$ y = \bar{y} + \frac{s_{xy}}{s_x^2}\,(x - \bar{x}) . $$

We recover the well known result for the best fit by a straight line in the form with independent coefficients: This is of course no surprise, as the functions that are
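A minimal numerical sketch of the weighted Gram–Schmidt construction (11.5), (11.6) with the coefficients from (11.4); it tabulates the orthogonal polynomials on the measured points only and, being based on raw powers of x, is meant for small K (function and variable names are illustrative):

```python
import numpy as np

def orthogonal_polynomial_fit(x, y, delta, K):
    """Return the fitted values sum_k a_k u_k(x_nu) with uncorrelated coefficients."""
    x = np.asarray(x, dtype=float)
    w = (1.0 / delta**2) / np.sum(1.0 / delta**2)    # normalized weights w_nu

    def inner(g, h):                                 # (g, h) = sum_nu w_nu g(x_nu) h(x_nu)
        return np.sum(w * g * h)

    us = []                                          # orthonormal polynomials u_0 .. u_K
    v = np.ones_like(x)                              # v_0 = 1
    for i in range(K + 1):
        u = v / np.sqrt(inner(v, v))                 # (11.5)
        us.append(u)
        # (11.6): next power of x minus its projections onto u_0 .. u_i
        v = x ** (i + 1) - sum(inner(u_j, x ** (i + 1)) * u_j for u_j in us)

    a = np.array([inner(y, u) for u in us])          # (11.4): a_k = (y, u_k)
    return sum(a_k * u_k for a_k, u_k in zip(a, us))
```

For K = 1 this reproduces the straight-line result y = ȳ + (s_xy/s_x²)(x − x̄) quoted above.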
