
11.3 Linear Factor Analysis and Principal Components

 

 

\[
\delta_p^2 = \frac{1}{N-1} \sum_{n=1}^{N} \left( X_{np} - \overline{X}_p \right)^2 .
\]

The quantity xnp is the normalized deviation of the measurement value of type p for the object n from the average over all objects for this measurement.

In the same way as in Chap. 4 we construct the correlation matrix for our sample by averaging the P × P products of the components xn1 . . . xnP over all N objects:

\[
C = \frac{1}{N-1}\, X^T X ,
\qquad
C_{pq} = \frac{1}{N-1} \sum_{n=1}^{N} x_{np}\, x_{nq} .
\]

It is a symmetric positive definite P × P matrix. Due to the normalization the diagonal elements are equal to unity.
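As a concrete illustration (not part of the original text), the following Python/NumPy sketch builds the standardized data matrix X and the correlation matrix C from a raw N × P measurement table; all names and the toy data are invented.

```python
import numpy as np

def correlation_matrix(X_raw):
    """Standardize an (N, P) measurement table and return (X, C).

    X contains the normalized deviations x_np, and C = X^T X / (N - 1)
    is the sample correlation matrix with unit diagonal.
    """
    N, P = X_raw.shape
    mean = X_raw.mean(axis=0)             # column averages over all N objects
    delta = X_raw.std(axis=0, ddof=1)     # sample spreads delta_p
    X = (X_raw - mean) / delta            # normalized deviations x_np
    C = X.T @ X / (N - 1)                 # P x P correlation matrix
    return X, C

# small illustration with random, partially correlated toy data
rng = np.random.default_rng(1)
raw = rng.normal(size=(500, 3))
raw[:, 2] += 0.8 * raw[:, 0]              # introduce a correlation between two features
X, C = correlation_matrix(raw)
print(np.round(C, 3))                     # diagonal elements are 1 by construction
```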

Then this matrix is brought into diagonal form by an orthogonal transformation corresponding to a rotation in the P -dimensional feature space.

C → VT CV = diag(λ1 . . . λP ) .

The uncorrelated feature vectors in the rotated space yn = {yn1, . . . , ynP } are given by

yn = VT xn , xn = Vyn .

To obtain eigenvalues and -vectors we solve the linear equation system

 

(C − λpI)vp = 0 ,

(11.14)

where λp is the eigenvalue belonging to the eigenvector vp:

Cvp = λpvp .

The P eigenvalues are found as the solutions of the characteristic equation det(C − λI) = 0 .

In the simple case described above of only two features, this is a quadratic equation

 

\[
\begin{vmatrix}
C_{11} - \lambda & C_{12} \\
C_{21} & C_{22} - \lambda
\end{vmatrix}
= 0 ,
\]

 

that fixes the two eigenvalues. The eigenvectors are calculated from (11.14) after substituting the respective eigenvalue. As they are fixed only up to an arbitrary factor, they are usually normalized. The rotation matrix V is constructed by taking the eigenvectors vp as its columns: vqp = (vp)q.
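As an illustration, with the unit diagonal the two-feature correlation matrix has a single independent off-diagonal element ρ = C12 = C21, and the characteristic equation can be solved in closed form:

\[
\begin{vmatrix} 1-\lambda & \rho \\ \rho & 1-\lambda \end{vmatrix}
= (1-\lambda)^2 - \rho^2 = 0
\quad\Rightarrow\quad
\lambda_{1,2} = 1 \pm \rho ,
\qquad
v_{1,2} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ \pm 1 \end{pmatrix} .
\]

The principal axes are the two diagonals of the feature plane; for strongly correlated features, ρ close to one, almost the full variance is carried by the first principal component.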

Since the eigenvalues are the diagonal elements in the rotated, diagonal correlation matrix, they correspond to the variances of the data distribution with respect to the principal axes. A small eigenvalue means that the projection of the data on this axis has a narrow distribution. The respective component is then, presumably, only of small influence on the data, and may perhaps be ignored in a model of the data. Large eigenvalues belong to the important principal components.
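A minimal sketch of this diagonalization step in Python/NumPy; the example matrix is invented, and `eigh` is used because C is symmetric:

```python
import numpy as np

# example correlation matrix for three standardized features (unit diagonal)
C = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.1],
              [0.2, 0.1, 1.0]])

# eigh is appropriate for symmetric matrices; it returns ascending eigenvalues
lam, V = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]            # sort principal components by decreasing variance
lam, V = lam[order], V[:, order]

print("eigenvalues (variances along the principal axes):", np.round(lam, 3))
print("rotation matrix V (eigenvectors as columns):\n", np.round(V, 3))

# V^T C V is diagonal, and the rotated vectors y_n = V^T x_n are uncorrelated
assert np.allclose(V.T @ C @ V, np.diag(lam))
```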

Factors fnp are obtained by standardization of the transformed variables ynp by division by the square root of the eigenvalues λp:


\[
f_{np} = \frac{y_{np}}{\sqrt{\lambda_p}} .
\]

By construction, these factors represent variates with zero mean and unit variance. In most cases they are assumed to be normally distributed. Their relation to the original data xnp is given by a linear (not orthogonal) transformation with a matrix A, the elements of which are called factor loadings. Its definition is

xn = A fn , or XT = A FT . (11.15)

Its components show how strongly the input data are influenced by the individual factors.

In the classical factor analysis, the idea is to reduce the number of factors such that the description of the data is still satisfactory within tolerable deviations ε:

\[
\begin{aligned}
x_1 &= a_{11} f_1 + \cdots + a_{1Q} f_Q + \varepsilon_1 \\
x_2 &= a_{21} f_1 + \cdots + a_{2Q} f_Q + \varepsilon_2 \\
&\;\;\vdots \\
x_P &= a_{P1} f_1 + \cdots + a_{PQ} f_Q + \varepsilon_P
\end{aligned}
\]

with Q < P , where the “factors” (latent random variables) f1, . . . , fQ are considered as uncorrelated and distributed according to N(0, 1), plus uncorrelated zero-mean Gaussian variables εp, with variances σp2, representing the residual statistical fluctuations not described by the linear combinations. As a first guess, Q is taken as the index of the smallest eigenvalue λQ that is still considered significant. In the ideal case Q = 1 a single dominant factor would suffice to describe the data.

Generally, the aim is to estimate the loadings apq, the eigenvalues λp, and the variances σp2 from the sampling data, in order to reduce the number of relevant quantities responsible for their description.

The same results as we have found above by the traditional method by solving the eigenvalue problem for the correlation matrix9 can be obtained directly by using the singular value decomposition (SVD) of the matrix X (remember that it has N rows and P columns):

X = U D VT ,

where U and V are orthogonal. U is not a square matrix; nevertheless UT U = I, where the unit matrix I has dimension P. D is a diagonal matrix with elements √λp, ordered according to decreasing values, which are called singular values here. The decomposition (11.15) is obtained by setting F = U and A = V D.
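The following Python/NumPy sketch, with invented toy data, illustrates the SVD route and the reduction to Q factors; it is an illustration of the idea, not code from the text. The relation between the singular values and the eigenvalues of C depends on whether the factor 1/(N − 1) is included in the definition of C; the sketch spells it out for the convention used here.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, Q = 400, 4, 2

# toy data: two hidden factors plus noise, then standardized column by column
factors = rng.normal(size=(N, Q))
loadings = rng.normal(size=(P, Q))
raw = factors @ loadings.T + 0.3 * rng.normal(size=(N, P))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# singular value decomposition X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# with C = X^T X / (N - 1), its eigenvalues are the squared singular values / (N - 1)
lam = d**2 / (N - 1)
C = X.T @ X / (N - 1)
assert np.allclose(np.sort(np.linalg.eigvalsh(C)), np.sort(lam))

# factors F = U and loadings A = V D reproduce X^T = A F^T exactly ...
F, A = U, Vt.T @ np.diag(d)
assert np.allclose(X, F @ A.T)

# ... and keeping only the Q leading components gives the reduced description
X_reduced = F[:, :Q] @ A[:, :Q].T
print("rms residual with", Q, "factors:", np.sqrt(np.mean((X - X_reduced) ** 2)))
```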

The decomposition (11.15) is not unique: If we multiply both F and A with a rotation matrix R from the right we get an equivalent decomposition:

\[
X = F A^T = \tilde{F} \tilde{A}^T = F R\,(A R)^T = F R\, R^T A^T = U D V^T , \qquad (11.16)
\]

which is the same as (11.15), with factors and loadings being rotated.

There exist program packages which perform the numerical calculation of principal components and factors.

Remarks:

9 Physicists may find the method familiar from the discussion of the inertia tensor and many similar problems.


1. The transformation of the correlation matrix to diagonal form makes sense, as we obtain in this way uncorrelated inputs. The new variables help to understand better the relations between the various measurements.

2. The tacit assumption that the principal components with larger eigenvalues are the more important ones is not always convincing: starting with uncorrelated measurements, the scaling procedure would result in eigenvalues which are all identical. An additional difficulty for interpreting the data comes from the ambiguity (11.16) concerning rotations of factors and loadings.

11.4 Classification

We have already come across classification when we treated goodness-of-fit. There the problem was either to accept or to reject a hypothesis without a clear alternative. Now we consider a situation where we have information about two or more classes of events.

The assignment of an object, according to some quality, to a class or category is described by a so-called categorical variable. For two categories we can label the two possibilities by discrete numbers; usually the values ±1 or 1 and 0 are chosen. In most cases it makes sense to return, instead of a discrete classification, a continuous variable as a measure of the confidence in the classification. The classification into more than two classes can be performed sequentially by first combining classes such that we have a two-class system and then splitting them further.

Classification is indispensable in data analysis in many areas. Examples in particle physics are the identification of particles from shower profiles or from Cherenkov ring images, of beauty, top or Higgs particles from kinematics and secondaries, and the separation of rare interactions from frequent ones. In astronomy the classification of galaxies and other stellar objects is of interest. But classification is also a precondition for decisions in many scientific fields and in everyday life.

We start with an example: A patient suffers from certain symptoms: stomach-ache, diarrhoea, temperature, headache. The doctor has to give a diagnosis. He will consider further factors, such as age, sex, earlier diseases, the possibility of infection, the duration of the illness, etc. The diagnosis is based on the experience and education of the doctor.

A computer program which is supposed to help the doctor in this matter should be able to learn from past cases and to compare new inputs in a sensible way with the stored data. Of course, as opposed to most problems in science, it is not possible here to provide a functional, parametric relation. Hence there is a need for suitable methods which interpolate or extrapolate in the space of the input variables. If these quantities cannot be ordered, e.g. sex, color, shape, they have to be classified. In a broad sense, all these problems may be considered as variants of function approximation.

The most important methods for this kind of problem are discriminant analysis, artificial neural nets, kernel or weighting methods, and decision trees. In recent years, remarkable progress in these fields has been achieved with the development of support vector machines, boosted decision trees, and random forest classifiers.

Before discussing these methods in more detail let us consider a further example.


 

[Figure: contamination (vertical axis, 0.0 to 1.0) plotted against efficiency (horizontal axis, 0.0 to 1.0).]

Fig. 11.7. Fraction of wrongly assigned events as a function of the efficiency.

Example 141. Classification of particles with calorimeters

The interactions of electrons and hadrons in calorimeter detectors of particle physics differ in many parameters. Calorimeters consist of a large number of detector elements, for which the signal heights are evaluated and recorded. The system should learn from a training sample, obtained from test measurements with known particle beams, to classify electrons and hadrons with minimal error rates.

An optimal classification is possible if the likelihood ratio is available which then is used as a cut variable. The goal of intelligent classification methods is to approximate the likelihood ratio or an equivalent variable which is a unique function of the likelihood ratio. The relation itself need not be known.

When we optimize a given method, it is not only the percentage of right decisions which is of interest; we will also consider the consequences of the various kinds of errors. It is less serious if a spam message has not been detected than if an important message is lost. In statistics this is taken into account by a loss function which has to be defined by the user. In the standard situation where we want to select a certain class of events, we have to consider efficiency and contamination10. The larger the efficiency, the larger is also the relative contamination by wrongly assigned events. A typical curve showing this relation is plotted in Fig. 11.7. The user will select the value of his cut variable on the basis of this curve.
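As an illustration of how such a curve is obtained, the following Python/NumPy sketch scans a cut on a classifier output for a labeled test sample; the score distributions and sample sizes are invented for the example, and purity is one minus the contamination as defined in the footnote.

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical classifier outputs for a labeled test sample:
# signal events peak near 1, background events peak near 0
score_sig = rng.beta(5, 2, size=10000)
score_bkg = rng.beta(2, 5, size=10000)

for cut in np.linspace(0.1, 0.9, 9):
    n_sig = np.sum(score_sig > cut)            # correctly selected signal events
    n_bkg = np.sum(score_bkg > cut)            # wrongly selected background events
    efficiency = n_sig / len(score_sig)
    contamination = n_bkg / (n_sig + n_bkg)    # purity = 1 - contamination
    print(f"cut {cut:.1f}: efficiency {efficiency:.2f}, contamination {contamination:.2f}")
```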

The loss has to be evaluated from the training sample. It is recommended to use one part of the training sample to develop the method and to reserve a certain fraction to validate it. As the statistics is nearly always too small, more economical methods of validation, cross validation and the bootstrap (see Sect. 12.2), have been developed which permit the use of the full sample to adjust the parameters of the method.

10One minus the contamination is called purity.


In an n-fold cross validation the whole sample is randomly divided into n equal parts of N/n events each. In turn, one of the parts is used for the validation of the training result obtained from the other n − 1 parts. All n validation results are then averaged. Typical choices are n equal to 5 or 10.
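A minimal sketch of the n-fold cross validation loop in Python/NumPy; the `train` and `loss` callables are placeholders for whatever training method and loss function are being validated:

```python
import numpy as np

def cross_validate(sample, targets, train, loss, n=5, seed=None):
    """Average validation loss from an n-fold cross validation.

    `train(X, y)` must return a fitted model and `loss(model, X, y)` its
    validation loss; both are placeholders for the method under study.
    `sample` and `targets` are NumPy arrays of equal length.
    """
    rng = np.random.default_rng(seed)
    N = len(sample)
    idx = rng.permutation(N)                    # random division of the whole sample
    folds = np.array_split(idx, n)              # n parts of roughly N/n events each

    losses = []
    for k in range(n):
        valid = folds[k]                        # one part is reserved for validation
        tr = np.concatenate([folds[j] for j in range(n) if j != k])
        model = train(sample[tr], targets[tr])  # train on the other n - 1 parts
        losses.append(loss(model, sample[valid], targets[valid]))
    return np.mean(losses)                      # average of the n validation results
```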

11.4.1 The Discriminant Analysis

The classical discriminant analysis as developed by Fisher is a special case of the classification method that we introduce in the following. We follow our discussions of Chap. 6, Sect. 6.3.

If we know the p.d.f.s f1(x) and f2(x) for two classes of events it is easy to assign an observation x to one of the two classes in such a way that the error rate is minimal (case 1):

x→ class 1, if f1(x) > f2(x) ,

x→ class 2, if f1(x) < f2(x) .

Normally we will get a different number of wrong assignments for the two classes: observations originating from the broader distribution will be misassigned more often (see Fig. 11.8) than those of the narrower distribution. In most cases it will matter whether an input from class 1 or from class 2 is wrongly assigned. An optimal classification is then reached using an appropriately adjusted likelihood ratio:

x→ class 1, if f1(x)/f2(x) > c ,

x→ class 2, if f1(x)/f2(x) < c .

If we want to have the same error rates (case 2), we must choose the constant c such that the integrals over the densities in the selected regions are equal:

\[
\int_{f_1/f_2 > c} f_1(x)\, dx = \int_{f_1/f_2 < c} f_2(x)\, dx . \qquad (11.17)
\]

This assignment has again a minimal error rate, but now under the constraint (11.17). We illustrate the two possibilities in Fig. 11.8 for univariate functions.
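To make the choice of the constant c in (11.17) concrete, here is a small numerical sketch (Python with NumPy and SciPy) that estimates the two error rates by Monte Carlo and scans c until they agree; the two univariate normal densities are invented for the illustration.

```python
import numpy as np
from scipy.stats import norm

# two hypothetical univariate class densities (chosen only for the illustration)
f1 = norm(loc=8.0, scale=2.0)    # class 1, narrower
f2 = norm(loc=12.0, scale=4.0)   # class 2, broader

# Monte Carlo samples from each class and their likelihood ratios f1/f2
rng = np.random.default_rng(4)
x1 = f1.rvs(size=100000, random_state=rng)
x2 = f2.rvs(size=100000, random_state=rng)
lr1 = f1.pdf(x1) / f2.pdf(x1)
lr2 = f1.pdf(x2) / f2.pdf(x2)

def error_rates(c):
    """Fractions of each class assigned to the wrong side of the cut f1/f2 = c."""
    wrong1 = np.mean(lr1 < c)    # class 1 events pushed into the class 2 region
    wrong2 = np.mean(lr2 > c)    # class 2 events pushed into the class 1 region
    return wrong1, wrong2

# scan c and keep the value where the two error rates agree, i.e. where (11.17) holds
cs = np.linspace(0.05, 20.0, 1000)
c_equal = min(cs, key=lambda c: abs(np.subtract(*error_rates(c))))
print("c for equal error rates:", round(float(c_equal), 2), error_rates(c_equal))
```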

For normal distributions we can formulate the condition for the classification explicitly: For case 2 we choose that class for which the observation x has the smallest distance to the mean measured in standard deviations. This condition can then be written as a function of the exponents. With the usual notations we get

(x − µ1)T V1(x − µ1) − (x − µ2)T V2(x − µ2) < 0 → class 1 ,

(x − µ1)T V1(x − µ1) − (x − µ2)T V2(x − µ2) > 0 → class 2 .

This condition can easily be generalized to more than two classes; the assignment according to the standardized distances will then, however, no longer lead to equal error rates for all classes.
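A sketch of this rule for two classes in Python/NumPy; the means and covariance matrices are invented, and V1, V2 are taken here to be the inverse covariance (weight) matrices:

```python
import numpy as np

def classify(x, mu1, cov1, mu2, cov2):
    """Assign x to the class whose mean is closer in standard deviations.

    V1, V2 are the weight matrices, i.e. the inverses of the covariance
    matrices; the sign of d1 - d2 determines the class (case 2 in the text).
    """
    V1, V2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    d1 = (x - mu1) @ V1 @ (x - mu1)    # squared standardized distance to class 1
    d2 = (x - mu2) @ V2 @ (x - mu2)    # squared standardized distance to class 2
    return 1 if d1 - d2 < 0 else 2

# invented example with two correlated two-dimensional classes
mu1, cov1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
mu2, cov2 = np.array([3.0, 2.0]), np.array([[2.0, -0.5], [-0.5, 1.5]])
print(classify(np.array([0.5, 0.2]), mu1, cov1, mu2, cov2))   # -> 1
print(classify(np.array([2.5, 2.0]), mu1, cov1, mu2, cov2))   # -> 2
```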

The classical discriminant analysis sets V1 = V2. The left-hand side in the above relations becomes a linear combination of the xp. The quadratic terms cancel. Equating it to zero defines a hyperplane which separates the two classes. The sign of this linear combination thus determines the class membership.


[Figure: two overlapping univariate densities f1 and f2 plotted as f(X) versus X, with the two possible separation cuts indicated.]

Fig. 11.8. Separation of two classes. The dashed line separates the events such that the error rate is minimal, the dotted line such that the wrongly assigned events are the same in both classes.

Note that the separating hyperplane cuts the line connecting the distribution centers at a right angle only for spherically symmetric distributions.

If the distributions are only known empirically from representative samples, we approximate them by continuous distributions, usually by a normal distribution, and fix their parameters to reproduce the empirical moments. In situations where the empirical distributions strongly overlap, for instance when a narrow distribution is located at the center of a broad one, the simple discriminant analysis no longer works. The classification methods introduced in the following sections have been developed for these and other more complicated situations in which the only source of information about the population is a training sample. The various approaches are all based on the continuity assumption that observations with similar attributes lead to similar outputs.

11.4.2 Artificial Neural Networks

Introduction

The application of artificial neural networks (ANNs) has seen a remarkable boom in recent decades, in parallel with the explosion of computing capacity. Of their many variants, the most popular in science are the relatively simple feed-forward ANNs with back-propagation, to which we restrict our discussion. The interested reader should consult the extensive specialized literature on this subject, which describes fascinating self-organizing networks that will certainly also play a role in science in the more distant future. One could, for example, imagine a self-organizing ANN that classifies a data set of events produced at an accelerator without human intervention and thus discovers new reactions and particles.


The type of network considered here has a comparatively modest aim: The network is trained in a first step to ascribe a certain output (response) to the inputs. In this supervised learning scheme, the response is compared with the target response, and the network parameters are then modified to improve the agreement. After the training phase the network is able to classify new data.

ANNs are used in many fields for a broad variety of problems. Examples are pattern recognition, e.g. for hand-written letters or figures, or the forecast of stock prices. They are successful in situations where the relations between many parameters are too complex for an analytical treatment. In particle physics they have been used, among other applications, to distinguish electron from hadron cascades and to identify reactions with heavy quarks.

With ANNs, many independent computing steps have to be performed. Therefore specialized computers have been developed which are able to evaluate the required functions very fast by parallel processing.

Primarily, the net approximates an algebraic function which transforms the input vector x into the response vector y,

y = f(x|w) .

Here w symbolizes a large set of parameters; typically there are, depending on the application, 10^3 to 10^4 parameters. The training process corresponds to a fitting problem. The parameters are adjusted such that the response agrees within the uncertainties with a target vector yt which is known for the events of the training sample.

There are two different applications of neural nets, simple function approximation11 and classification. The net could be trained, for instance, to estimate the energy of a hadronic shower from the energies deposited in different cells of the detector. The net could also be trained to separate electron from hadron showers. Then it should produce a number close to 1 for electrons and close to 0 for hadrons.

With the large number of parameters it is evident that the solution is not always unique. Networks with different parameters can perform the same function within the desired accuracy.

For the fitting of the large number of parameters, minimization programs like the simplex method (see Appendix 13.9) are not suited. The gradient descent method is much more practicable here. It is able to handle a large number of parameters and to process the input data sequentially.

A simple, more detailed introduction to the field of ANNs is given in [69].

Network Structure

Our network consists of two layers of knots (neurons), see Fig. 11.9. Each component xk of the n-component input vector x is transmitted to all knots, labeled i = 1, . . . , m,

of the first layer. Each individual data line k → i is ascribed a weight Wik(1). In each unit the weighted sum ui = Σk Wik(1) xk of the data components connected to it is calculated.

11Here function approximation is used to perform calculations. In the previous section its purpose was to parametrize data.


[Figure: a feed-forward network with inputs x1, x2, x3, . . . , xn connected through weights w(1) to a layer of knots, which are connected through weights w(2) to the output knots y1, y2, y3, . . . .]

Fig. 11.9. Back propagation. At each knot the sigmoid function of the sum of the weighted inputs is computed.

Each knot symbolizes a non-linear, so-called activation function x′i = s(ui), which is identical for all units. The first layer produces a new data vector x′. The second layer, with m′ knots, acts analogously on the outputs of the first one. We call the corresponding m′ × m weight matrix W(2). It produces the output vector y. The first layer is called hidden layer, since its output is not observed directly. In principle, additional hidden layers could be implemented, but experience shows that this does not improve the performance of the net.

The net executes the following functions:

 

 

 

\[
x'_j = s\left(\sum_k W^{(1)}_{jk} x_k\right) ,
\qquad
y_i = s\left(\sum_j W^{(2)}_{ij} x'_j\right) .
\]

 

 

This leads to the final result:

 

\[
y_i = s\left(\sum_j W^{(2)}_{ij}\, s\left(\sum_k W^{(1)}_{jk} x_k\right)\right) . \qquad (11.18)
\]

 

 

 

 

Sometimes it is appropriate to shift the input of each unit in the first layer by a constant amount (bias). This is easily realized by specifying an artificial additional input component x0 ≡ 1.

The number of weights (the parameters to be fitted) is, when we include the component x0, (n + 1) × m + m × m′.
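A minimal sketch of the network function (11.18) in Python/NumPy; the dimensions and weights are invented, and the bias is included through the artificial component x0 = 1:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (np.exp(-u) + 1.0)

def forward(x, W1, W2):
    """Evaluate the two-layer network response y = f(x|W1, W2), cf. eq. (11.18).

    W1 has shape (m, n + 1) to accommodate the bias input x0 = 1,
    W2 has shape (m_out, m).
    """
    x = np.append(x, 1.0)              # artificial bias component x0 = 1
    x_hidden = sigmoid(W1 @ x)         # hidden layer: x'_j = s(sum_k W1_jk x_k)
    return sigmoid(W2 @ x_hidden)      # output layer: y_i = s(sum_j W2_ij x'_j)

# invented example: 4 inputs, 6 hidden knots, 2 outputs
rng = np.random.default_rng(5)
n, m, m_out = 4, 6, 2
W1 = rng.normal(scale=0.5, size=(m, n + 1))
W2 = rng.normal(scale=0.5, size=(m_out, m))
print(forward(rng.normal(size=n), W1, W2))
```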

Activation Function

The activation function s(x) has to be non-linear in order to achieve that the superposition (11.18) is able to approximate a wide class of functions. It is plausible that it should be more sensitive to variations of the argument near zero than for very large absolute values.


 

[Figure: the sigmoid function s plotted against x for −6 ≤ x ≤ 6, rising from 0 to 1.]

Fig. 11.10. Sigmoid function.

The input bias x0 helps to shift input parameters which have a large mean value into the sensitive region. The activation function is usually standardized to vary between zero and one. The most popular activation function is the sigmoid function

\[
s(u) = \frac{1}{e^{-u} + 1} ,
\]

which is similar to the Fermi function. It is shown in Fig. 11.10.

The Training Process

In the training phase the weights will be adapted after each new input object. Each time the output vector of the network y is compared with the target vector yt. We define again the loss function E:

E = (y − yt)2,

(11.19)

which measures for each training object the deviation of the response from the expected one.

To reduce the error E we walk downwards in the weight space. This means we change each weight component by ΔW, proportional to the sensitivity ∂E/∂W of E to changes of W:

\[
\Delta W = -\frac{1}{2}\, \alpha\, \frac{\partial E}{\partial W} = -\alpha\, (y - y_t) \cdot \frac{\partial y}{\partial W} .
\]

The proportionality constant α, the learning rate, determines the step width. We now have to find the derivatives. Let us start with s:

\[
\frac{ds}{du} = \frac{e^{u}}{(e^{u} + 1)^2} = s(1 - s) . \qquad (11.20)
\]

From (11.18) and (11.20) we compute the derivatives with respect to the weight components of the first and the second layer,


 

 

 

 

\[
\frac{\partial y_i}{\partial W^{(2)}_{ij}} = y_i (1 - y_i)\, x'_j ,
\]
and
\[
\frac{\partial y_i}{\partial W^{(1)}_{jk}} = y_i (1 - y_i)\, W^{(2)}_{ij}\, x'_j (1 - x'_j)\, x_k .
\]

 

 

 

 

 

 

 

 

 

 

 

 

 

It is seen that the derivatives depend on the same quantities which have already been calculated for the determination of y (the forward run through the net). Now we run backwards, change first the matrix W(2) and then with already computed quantities also W(1). This is the reason why this process is called back propagation. The weights are changed in the following way:

\[
W^{(1)}_{jk} \;\to\; W^{(1)}_{jk} - \alpha \sum_i (y_i - y_{ti})\, y_i (1 - y_i)\, W^{(2)}_{ij}\, x'_j (1 - x'_j)\, x_k ,
\]
\[
W^{(2)}_{ij} \;\to\; W^{(2)}_{ij} - \alpha\, (y_i - y_{ti})\, y_i (1 - y_i)\, x'_j .
\]

Testing and Interpreting

The gradient descent minimum search has not necessarily reached the minimum after processing the training sample a single time, especially when the available sample is small. Then the sample should be used several times (e.g. 10 or 100 times). On the other hand, it may happen for too small a training sample that the net performs correctly for the training data but produces wrong results for new data. The network has, so to speak, learned the training data by heart. Similar to other minimizing concepts, the net interpolates and extrapolates the training data. When the number of fitted parameters (here the weights) becomes of the same order as the number of constraints from the training data, the net will occasionally, after sufficient training time, describe the training data exactly but fail for new input data. This effect is called over-fitting and is common to all fitting schemes when too many parameters are adjusted.

It is therefore indispensable to validate the network function after the optimization, with data not used in the training phase or to perform a cross validation. If in the training phase simulated data are used, it is easy to generate new data for testing. If only experimental data are available with no possibility to enlarge the sample size, usually a certain fraction of the data is reserved for testing. If the validation result is not satisfactory, one should try to solve the problem by reducing the number of network parameters or the number of repetitions of the training runs with the same data set.

The neural network generates from the input data the response through the fairly complicated function (11.18). It is impossible, by an internal analysis of this function, to gain an understanding of the relation between input and resulting response. Nevertheless, it is not necessary to regard the ANN as a “black box”. We have the possibility to display graphically the correlations between input quantities and the result, and all functional relations. In this way we gain some insight into possible connections. If, for instance, a physicist had the idea to train a net with an experimental data sample to predict for a certain gas the volume from the pressure and the temperature, he would be able to reproduce, with a certain accuracy, the results of the van der Waals equation. He could display the relations between the
