

three quantities graphically. Of course the analytic form of the equation and its interpretation cannot be delivered by the network.

Often a study of the optimized weights makes it possible to simplify the net. Very small weights can be set to zero, i.e. the corresponding connections between knots are cut. We can check whether switching off certain neurons has a sizable influence on the response. If this is not the case, these neurons can be eliminated. Of course, the modified network has to be trained again.

Practical Hints for the Application

Computer programs for ANNs with back-propagation are relatively simple and available in many places, but the effort to write an ANN program is also not very large. The number of input vector components n and the numbers of knots m and m′ are parameters to be chosen by the user; thus the program is universal, only the loss function has to be adapted to the specific problem.

The number of units in each layer should more or less match the number of input components. Some experts plead for a higher number. The user should try to find the optimal number.

The sigmoid function has values only between zero and unity. Therefore the output or the target value has to be appropriately scaled by the user.

The raw input components are usually correlated. The net is more efficient if the user orthogonalizes them. Then often some of the new components have a negligible effect on the output and can be discarded.

The weights have to be initialized at the beginning of the training phase. This can be done by a random number generator or they can be set to fixed values.

The loss function E (11.19) has to be adjusted to the problem to be solved.

The learning rate α should be chosen relatively high at the beginning of a training phase, e.g. α = 10. In the course of fitting it should be reduced to avoid oscillations.

The convergence of the minimizing process is slow if the gradient is small. If this is the case and the fit is still bad, it is recommended to increase the learning constant for a certain number of iterations.

In order to check whether a minimum is only local, one should train the net with different start values of the weights.

Other possibilities for the improvement of the convergence and the elimination of local minima can be found in the substantial literature. An ANN program package that proceeds automatically along many of the proposed steps is described in [70].
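Several of these hints can be illustrated in compact form. The following sketch is a minimal single-hidden-layer network with back-propagation in Python/NumPy; the toy data, the network size, the quadratic loss and the halving schedule for α are invented for illustration and are not the package of [70].

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

# toy regression problem: n input components, one output in (0, 1)
n, m, N = 4, 8, 200                       # inputs, hidden knots, training size
X = rng.uniform(-1, 1, size=(N, n))
t = sigmoid(X.sum(axis=1))                # target, already scaled to (0, 1)

# initialization of the weights by a random number generator
W1 = rng.normal(0.0, 0.5, size=(m, n))    # input -> hidden layer
W2 = rng.normal(0.0, 0.5, size=m)         # hidden layer -> output

alpha = 10.0                              # start with a large learning rate
for it in range(20000):
    i = rng.integers(N)                   # one training event per step
    h = sigmoid(W1 @ X[i])                # hidden layer response
    y = sigmoid(W2 @ h)                   # network output
    # back-propagate the gradient of the quadratic loss E = (y - t)^2
    delta = 2.0 * (y - t[i]) * y * (1.0 - y)
    grad_W2 = delta * h
    grad_W1 = np.outer(delta * W2 * h * (1.0 - h), X[i])
    W2 -= alpha * grad_W2
    W1 -= alpha * grad_W1
    if it % 5000 == 4999:
        alpha *= 0.5                      # reduce alpha to avoid oscillations

y_all = sigmoid(W2 @ sigmoid(W1 @ X.T))
print("mean quadratic loss:", np.mean((y_all - t) ** 2))
```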

Example: Čerenkov circles

Charged, relativistic particles can emit photons by the Čerenkov effect. The photons hit a detector plane at points located on a circle. Of interest are the radius and center of this circle, since they provide information on the direction and velocity of the emitting particle. The number of photons and the coordinates where they hit the detector

[Figure: “Cerenkov circles” – error (10⁻⁴ to 10⁻¹) versus number of iterations (10² to 10⁷) for learning constants α = 1, 5, 10, 20, 40; one curve for α = 20 without momentum term; a new learning constant is indicated.]

Fig. 11.11. Reconstruction of radii of circles through 5 points by means of an ANN with different sequences of learning constants α.

fluctuate statistically and are disturbed by spurious noise signals. It has turned out that ANNs can reconstruct the parameters of interest from the available coordinates with good efficiency and accuracy.

We study this problem by a Monte Carlo simulation. In a simplified model, we assume that exactly 5 photons are emitted by a particle and that the coordinate pairs are located on a circle and registered. The center, the radii, and the hit coordinates are generated stochastically. The input vector of the net thus consists of 10 components, the 5 coordinate pairs. The output is a single value, the radius R. The loss function is (R − Rtrue)², where the true value Rtrue is known from the simulation.

The relative accuracy of the reconstruction as a function of the iteration step is shown in Fig. 11.11. Different sequences of the learning rate have been tried. Typically, the process proceeds in steps, where a flat phase is followed by a rather abrupt improvement. The number of iterations required to reach the minimum is quite large.
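A minimal sketch of how such a training sample could be generated is given below; the helper make_event is hypothetical, and the ranges of the radius and center as well as the amount of hit smearing are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_event(sigma=0.02):
    """Return the 10-component input vector (5 coordinate pairs) and
    the true radius of the underlying circle."""
    R = rng.uniform(0.2, 1.0)                  # true radius (assumed range)
    cx, cy = rng.uniform(-1.0, 1.0, size=2)    # circle center (assumed range)
    phi = rng.uniform(0.0, 2 * np.pi, size=5)  # 5 photon hits on the circle
    x = cx + R * np.cos(phi) + rng.normal(0.0, sigma, 5)
    y = cy + R * np.sin(phi) + rng.normal(0.0, sigma, 5)
    return np.column_stack([x, y]).ravel(), R

# training sample of N input vectors with known target radii
N = 10000
events = [make_event() for _ in range(N)]
X = np.array([e[0] for e in events])           # shape (N, 10)
R_true = np.array([e[1] for e in events])
# the loss of a network prediction R_hat would be np.mean((R_hat - R_true)**2)
```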

Hardware Realization

The structure of a back-propagation network can be implemented by a hardware network. The weights are stored locally at the units, which are realized by rather simple microprocessors. Each microprocessor performs the knot function, e.g. the sigmoid function. A trained net can then calculate the fitted function very fast, since all processors work in parallel. Such processors can be employed for the triggering


in experiments where a quick decision is required on whether to accept an event and store the corresponding data.

11.4.3 Weighting Methods

For the decision whether to assign an observation at the location x to a certain class, an obvious option is to do this according to the classification of neighboring objects of the training sample. One possibility is to consider a certain region around x and to take a “majority vote” of the training objects inside this region to decide about the class membership of the input. The region to be considered here can be chosen in different ways; it can be a fixed volume around x, or a variable volume defined by requiring that it contains a fixed number of observations, or an infinite volume, introducing weights for the training objects which decrease with their distance from x.

In any case we need a metric to define the distance. The choice of a metric in multi-dimensional applications is often a rather intricate problem, especially if some of the input components are physically of very different nature. A way out seems to be to normalize the different quantities to equal variance and to eliminate global correlations by a linear variable transformation. This corresponds to the transformation to principal components discussed above (see Sect. 11.3) with subsequent scaling of the principal components. An alternative but equivalent possibility is to use a direction dependent weighting. The same result is achieved when we apply the Mahalanobis metric, which we have introduced in Sect. 10.3.9.
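As a sketch, such a metric can be built directly from the covariance matrix estimated from the training sample; the function below is hypothetical and assumes purely continuous input components.

```python
import numpy as np

def mahalanobis_metric(train):
    """Return a distance function d(a, b) = sqrt((a-b)^T V (a-b)), where V is
    the inverse covariance matrix estimated from the training sample."""
    V = np.linalg.inv(np.cov(train, rowvar=False))
    def dist(a, b):
        d = np.asarray(a) - np.asarray(b)
        return np.sqrt(d @ V @ d)
    return dist

# usage: equivalent to scaling the components to equal variance and removing
# global linear correlations before applying the Euclidean distance
train = np.random.default_rng(2).multivariate_normal(
    mean=[0.0, 0.0], cov=[[4.0, 1.5], [1.5, 1.0]], size=500)
dist = mahalanobis_metric(train)
print(dist(train[0], train[1]))
```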

For a large training sample the calculation of all distances is expensive in computing time. A drastic reduction of the number of distances to be calculated is in many cases possible by the so-called support vector machines which we will discuss below. Those are not machines, but programs which reduce the training sample to a few, but decisive inputs, without impairing the results.

K-Nearest Neighbors

We choose a number K which of course will depend on the size of the training sample and the overlap of the classes. For an input x we determine the K nearest neighbors and the numbers k1, k2 = K − k1, of observations that belong to class I and II, respectively. For a ratio k1/k2 greater than α, we assign the new observation to class I, in the opposite case to class II:

k1/k2 > α ⇒ class I ,    k1/k2 < α ⇒ class II .

The choice of α depends on the loss function. When the loss function treats all classes alike, then α will be unity and we get a simple majority vote. To find the optimal value of K we minimize the average of the loss function computed for all observations of the training sample.
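A minimal sketch of this rule (hypothetical function name, Euclidean metric, labels yi = ±1 for classes I and II):

```python
import numpy as np

def knn_classify(x, train_x, train_y, K=10, alpha=1.0):
    """Assign x to class I (+1) or class II (-1) by the vote of the K nearest
    training observations; train_y holds +1 for class I and -1 for class II."""
    d = np.linalg.norm(train_x - x, axis=1)   # distances to all training inputs
    nearest = np.argsort(d)[:K]               # indices of the K nearest ones
    k1 = np.sum(train_y[nearest] == +1)       # neighbors from class I
    k2 = K - k1                               # neighbors from class II
    return +1 if k1 > alpha * k2 else -1      # k1/k2 > alpha  =>  class I

# K (and alpha, if the loss is asymmetric) would be optimized by minimizing
# the average loss over the training sample itself.
```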

Distance Dependent Weighting

Instead of treating all training vector inputs x′ within a given region in the same way, one should attribute a larger weight to those located nearer to the input x. A


sensible choice is again a Gaussian kernel,

K(x, x′) = exp(−(x − x′)²/(2s²)) .

With this choice we obtain for the class β the weight wβ ,

wβ = Σi K(x, xβi) ,    (11.21)

where xβi are the locations of the training vectors of the class β. If there are only two classes, writing the training sample as

{x1, y1, . . . , xN , yN }

with the responses yi = ±1, the classification of a new input x is done according to the value ±1 of the classifier ŷ(x), given by

ŷ(x) = sign( Σyi=+1 K(x, xi) − Σyi=−1 K(x, xi) ) = sign( Σi yiK(x, xi) ) .    (11.22)

For a direction-dependent density of the training sample, we can use a direction-dependent kernel, possibly in the Mahalanobis form mentioned above:

K(x, x′) = exp(−(1/2)(x − x′)T V(x − x′)) ,

with the weight matrix V. When we first normalize the sample, this complication is not necessary. The parameter s of the matrix V, which determines the width of the kernel function, again is optimized by minimizing the loss for the training sample.
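A sketch of the classifier (11.22) with a Gaussian kernel could look as follows; the function names are invented, and the width s would be fixed by minimizing the number of wrong assignments on the training sample, as described above.

```python
import numpy as np

def gauss_kernel(x, xi, s):
    """K(x, xi) = exp(-(x - xi)^2 / (2 s^2)) for all training points xi."""
    return np.exp(-np.sum((xi - x) ** 2, axis=1) / (2.0 * s ** 2))

def kernel_classify(x, train_x, train_y, s):
    """Classifier (11.22): sign of the kernel-weighted sum of the labels
    y_i = +-1 of the training sample."""
    return np.sign(np.sum(train_y * gauss_kernel(x, train_x, s)))

def training_error(train_x, train_y, s):
    """Fraction of wrong assignments on the training sample, used to choose s."""
    wrong = sum(kernel_classify(x, train_x, train_y, s) != y
                for x, y in zip(train_x, train_y))
    return wrong / len(train_y)
```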

Support Vector Machines

Support vector machines (SVMs) produce similar results as ordinary distance-dependent weighting methods, but they require less memory for the storage of learning data and the classification is extremely fast. Therefore, they are especially useful in on-line applications.

The class assignment usually is the same for all elements in large connected regions of the variable x. Very often, in a two-class classification, there are only two regions separated by a hypersurface. For short-range kernels it is obvious then that for the classification of observations only those input vectors of the training sample are essential which are located in the vicinity of the hypersurface. These input vectors are called support vectors [73]. SVMs are programs which try to determine them, or rather their weights, in an optimal way, setting the weights of all other input vectors to zero.

In the one-dimensional case with non-overlapping classes it is sufficient to know those inputs of each class which are located nearest to the dividing limit between the classes. Sums like (11.21) then run over one element only. This, of course, makes the calculation extremely fast.

In higher-dimensional spaces with overlapping classes and for more than two classes the problem of determining support vectors is of course more complicated. But

[Figure: scatter plots; the assignment regions are labeled region 1 and region 2.]

Fig. 11.12. Separation of two classes. Top: learning sample, bottom: wrongly assigned events of a test sample.

also in these circumstances the number of relevant training inputs can be reduced drastically. The success of SVMs is based on the so-called kernel trick, by which nonlinear problems in the input space are treated as linear problems in some higher-dimensional space by well-known optimization algorithms. For the corresponding algorithms and proofs we refer to the literature, e.g. [13, 72]. A short introduction is given in Appendix 13.13.

Example and Discussion

The top panel of Fig. 11.12 shows two overlapping training samples of 500 inputs each. The loss function is the number of wrong assignments, independent of the respective class. Since the distributions are quite similar in both coordinates, we do not change the metric. We use a Gaussian kernel. The optimization of the parameter s by means of the training sample shows only a small change of the error rate when s is changed by a factor of four. The lower panel displays the result of the classification for a test sample of the same size (500 inputs per class). Only the wrong assignments are shown.

We realize that wrongly assigned training observations occur in two separate, non-overlapping regions which can be separated by a curve or a polygon chain as indicated


in the figure. Obviously all new observations would be assigned to the class corresponding to the region in which they are located. If we had used the K-nearest neighbors method instead of the distance-dependent weighting, the result would have been almost identical. Contrary to what one might expect, this more primitive method is more expensive in both the programming and the calculation, when compared to the weighting with a distance-dependent kernel.

Since for the classification only the separation curve between the classes is required, it must be sufficient to know the class assignment for those training observations which lie near this curve. They would define the support vectors of an SVM. Thus the number of inputs needed for the assignment of new observations would be drastically reduced. However, for a number of assignments below about 10⁶ the effort to determine support vectors usually does not pay. SVMs are useful for large event numbers in applications where computing time is relevant.
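For illustration, a support vector machine with a Gaussian (RBF) kernel can be set up with the class SVC of scikit-learn; the generated sample and the parameter values C and gamma below are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# two overlapping Gaussian classes with 500 training inputs each (invented)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(1.5, 1.0, size=(500, 2))])
y = np.hstack([np.full(500, +1), np.full(500, -1)])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # Gaussian kernel
clf.fit(X, y)

# only the support vectors near the dividing surface are retained
print("support vectors:", clf.support_vectors_.shape[0], "of", len(X))
print("training error:", np.mean(clf.predict(X) != y))
```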

11.4.4 Decision Trees

Simple Trees

We consider the simple case of two-class classification, i.e. the assignment of inputs to one of two classes I and II, and N observations with P features x1, x2, . . . , xP , which we consider, as before, as the components of an input vector.

In the first step we consider the first component x11, x21, . . . , xN1 of all N input vectors of the training sample. We search for a value xc1 which optimally divides the two classes and obtain a division of the training sample into two parts A and B. Each of these parts, which belong to two different subspaces, is now treated further separately. Next we take the subspace A, look at the feature x2, and divide it, in the same way as the full space before, again into two parts. Analogously we treat the subspace B. Now we can switch to the next feature or return to feature 1 and perform further splittings. The sequence of divisions leads to smaller and smaller subspaces, each of them assigned to a certain class. This subdivision process can be regarded as the development of a decision tree for input vectors for which the class membership is to be determined. The growing of the tree is stopped by a pruning rule. The final partitions are called leaves.

In Fig. 11.13 we show schematically the subdivision into subspaces and the corresponding decision tree for a training sample of 32 elements with only two features. The training sample which determines the decisions is indicated. At the end of the tree (here at the bottom) the decision about the class membership is taken.

It is not obvious how one should optimize the sequence of partitions and the positions of the cuts, nor under which circumstances the procedure should be stopped.

For the optimization of splits we must again define a loss function which will depend on the given problem. A simple possibility in the case of two classes is to maximize for each splitting the difference ΔN = Nr − Nf between right and wrong assignments. We used this in our example Fig. 11.13. For the first division this quantity was equal to 20 − 12 = 8. To some extent the position of the splitting hyperplane is still arbitrary, since the loss function changes its value only when it hits the nearest input. It could, for example, be put at the center between the two nearest


[Plot: training sample in the x1–x2 plane; x1 from 0 to 5, x2 from 2 to 10.]

Fig. 11.13. Decision tree (bottom) corresponding to the classification shown above.

points. Often the importance of efficiency and purity is different for the two classes. Then we would choose an asymmetric loss function.

Very popular is the following, slightly more complicated criterion: We define the impurity PI of class I

PI = NI / (NI + NII) ,    (11.23)

which for optimal classification would be 1 or 0. The quantity

G = PI(1 − PI) + PII(1 − PII) ,    (11.24)

the Gini index, should be as small as possible. For each separation of a parent node E with Gini index GE into two child nodes A and B with Gini indices GA and GB, we minimize the sum GA + GB.

The difference

D = GE − GA − GB

is taken as the stopping or pruning parameter. The quantity D measures the increase in purity; it is large for a parent node with large G and two child nodes with small G. When D becomes less than a certain critical value Dc, the branch is not split further and ends at a leaf. The leaf is assigned to the class which has the majority in it.

Besides the Gini index, other measures of purity or impurity are also used [13]. An interesting quantity is the entropy S = −PI ln PI − PII ln PII , a well-known measure of disorder, i.e. of impurity.
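A sketch of a single split based on the Gini index (11.23), (11.24) and on the pruning parameter D is given below; the exhaustive cut search, the labels yi = ±1 and the critical value Dc are illustrative choices.

```python
import numpy as np

def gini(y):
    """Gini index (11.24) of a node with labels y = +1 (class I), -1 (class II)."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y == +1)          # impurity P_I of (11.23)
    return 2.0 * p * (1.0 - p)    # P_I(1 - P_I) + P_II(1 - P_II)

def best_split(x, y):
    """Best cut on one feature x: minimize G_A + G_B over all cut positions."""
    best_g, best_cut = np.inf, None
    for cut in np.unique(x):
        g = gini(y[x <= cut]) + gini(y[x > cut])
        if g < best_g:
            best_g, best_cut = g, cut
    return best_g, best_cut

def split_node(x, y, D_c=0.05):
    """Split only if the purity gain D = G_E - G_A - G_B exceeds D_c."""
    g_sum, cut = best_split(x, y)
    D = gini(y) - g_sum
    return cut if D > D_c else None   # None: the node becomes a leaf
```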


The purity parameter, e.g. G, is also used to organize the splitting sequence. We always choose the input vector component for which the splitting produces the most significant separation.

A further possibility would be to generalize the orthogonal splitting by also allowing non-orthogonal planes, to reach better separations. But in the standard case all components are treated independently.

Unfortunately, the classification by decision trees is usually not perfect. The discontinuity at the boundaries and the fixed splitting sequence impair the accuracy. On the other hand, they are simple, transparent and the corresponding computer programs are extremely fast.

Boosted Decision Trees

Boosting [75] is based on a simple idea: by a weighted superposition of many moderately effective classifiers it should be possible to reach a fairly precise assignment. Instead of only one decision tree, many different trees are grown. Each time, before the development of a new tree is started, wrongly assigned training inputs are boosted to higher weights in order to lower their probability of being wrongly classified in the following tree. The final class assignment is then done by averaging the decisions from all trees. Obviously, the computing effort for these boosted decision trees is increased, but the precision is significantly enhanced. The results of boosted decision trees are usually as good as those of ANNs. Their algorithm is very well suited for parallel processing. There are first applications in particle physics [76].

Before the first run, all training inputs have the weight 1. In the following run each input gets a weight wi, determined by a certain boosting algorithm (see below) which depends on the particular method. The definition of the node impurity P for calculating the loss function, see (11.23), (11.24), is changed accordingly to

 

 

PI = ΣI wi / (ΣI wi + ΣII wi) ,

where the sums ΣI , ΣII run over all events in class I or II, respectively. Again the weights will be boosted and the next run started. Typically M ≈ 1000 trees are generated in this way.

If we indicate the decision of a tree m for the input xi by Tm(xi) = 1 (for class I) and = −1 (for class II), the final result will be given by the sign of the weighted sum over the results from all trees

 

TM(xi) = sign( Σm=1…M αmTm(xi) ) .

We proceed in the following way: To the first tree we assign the weight α1 = 1. The weights of the wrongly assigned input vectors are increased. The weight¹² α2 of the second tree T2(x) is chosen such that the overall loss from all input vectors of the training sample is minimal for the combination [α1T1(x) + α2T2(x)] / [α1 + α2]. We continue in the same way and add further trees. For tree i the weight αi is optimized such that the existing trees are complemented in an optimal way. How this is done depends, of course, on the loss function.

¹² We have two kinds of weights: weights of input vectors (wi) and weights of trees (αm).


A well-tested recipe for the choice of weights is AdaBoost [75]. The training algorithm proceeds as follows:

• The i-th input xi gets the weight wi = 1 and the value yi = 1 (= −1) if it belongs to class I (II).

• Tm(xi) = 1 (= −1) if the input ends in a leaf belonging to class I (II). Sm(xi) = (1 − yiTm(xi))/2 = 1 (= 0) if the assignment is wrong (right).

• The fraction εm of weighted wrong assignments is used to change the weights for the next iteration:

εm = Σi wiSm(xi) / Σi wi ,

αm = ln((1 − εm)/εm) ,

wi → wi exp(αmSm(xi)) .

Weights of correctly assigned training inputs thus remain unchanged. For example, for εm = 0.1, wrongly assigned inputs will be boosted by a factor of 0.9/0.1 = 9. Note that αm > 0 if εm < 0.5; this is required because otherwise the replacement Tm(xi) → −Tm(xi) would produce a better decision tree.

• The response for a new input which is to be classified is

 

TM(xi) = sign( Σm=1…M αmTm(xi) ) .

For εm = 0.1 the weight of the tree is αm = ln 9 ≈ 2.20. For certain applications it may be useful to reduce the weight factors αm somewhat, for instance αm = 0.5 ln ((1 − εm)/εm) [76].
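The following sketch implements the AdaBoost recipe above. For brevity the individual trees are replaced by decision stumps (a single cut on a single feature), so it is a simplified stand-in for boosted decision trees; the function names are invented.

```python
import numpy as np

def train_stump(X, y, w):
    """Weak classifier: one cut on one feature, chosen to minimize the
    weighted error; returns a function T with values +-1."""
    best_err, best_par = np.inf, None
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > cut, sign, -sign)
                err = np.sum(w * (pred != y)) / np.sum(w)
                if err < best_err:
                    best_err, best_par = err, (j, cut, sign)
    j, cut, sign = best_par
    return lambda Z: np.where(Z[:, j] > cut, sign, -sign)

def adaboost(X, y, M=50):
    """Boost the weights of wrongly assigned inputs; y holds the labels +-1."""
    w = np.ones(len(y))                     # all training weights start at 1
    trees, alphas = [], []
    for _ in range(M):
        T = train_stump(X, y, w)
        S = (1 - y * T(X)) / 2              # 1 if wrong, 0 if right
        eps = max(np.sum(w * S) / np.sum(w), 1e-12)
        alpha = np.log((1 - eps) / eps)
        w = w * np.exp(alpha * S)           # boost the wrongly assigned inputs
        trees.append(T)
        alphas.append(alpha)
    def classify(Z):
        votes = sum(a * T(Z) for a, T in zip(alphas, trees))
        return np.sign(votes)               # weighted vote of all trees
    return classify
```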

11.4.5 Bagging and Random Forest

Bagging

The concept of bagging was first introduced by Breiman [83]. He has shown that the performance of unstable classifiers can be improved considerably by training many classifiers with bootstrap replicates and then using a majority vote of those: from a training sample containing N input vectors, N vectors are drawn at random with replacement. Some vectors will be contained several times. This bootstrap¹³ sample is used to train a classifier. Many classifiers, typically 100 or 1000, are produced in this way. New inputs are run through all trees and each tree “votes” for a certain classification. The classification receiving the majority of votes is chosen. In a study of real data [83] a reduction of error rates by bagging between 20% and 47% was found. There the bagging concept had been applied to simple decision trees; however, the concept is quite general and can also be applied to other classifiers.

¹³ We will discuss bootstrap methods in the following chapter.
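A sketch of the bagging procedure, here built on the decision trees of scikit-learn; the number of classifiers and the labels ±1 are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=100, seed=0):
    """Train n_trees classifiers, each on a bootstrap replicate
    (N inputs drawn with replacement from the N training inputs)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)         # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, Z):
    """Majority vote of all classifiers (labels assumed to be +-1)."""
    votes = sum(tree.predict(Z) for tree in trees)
    return np.sign(votes)
```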


Random Forest

Another new development [84], which includes the bootstrap idea, is the extension of the decision tree concept to the random forest classifier.

Many trees are generated from bootstrap samples of the training sample, but now part of the input vector components are suppressed. A tree is constructed in the following way: First m out of the M components or attributes of the input vectors are selected at random. The tree is grown in an m-dimensional subspace of the full input vector space. It is not obvious how m is to be chosen, but the author proposes m ≪ M and says that the results show little dependence on m. With large m the individual trees are powerful but strongly correlated. The value of m is the same for all trees.

From the N truncated bootstrap vectors, Nb are separated, put into a bag and reserved for testing. A fraction f = Nb/N ≈ 1/3 is proposed. The remaining ones are used to generate the tree. For each split, the attribute out of the m available attributes is chosen which gives the smallest number of wrong classifications. Each leaf contains only elements of a single class. There is no pruning.

Following the bagging concept, the classification of new input vectors is obtained by the majority vote of all trees.

The out-of-the-bag (oob) data are used to estimate the error rate. To this end, each oob-vector of the k-th sample is run through the k-th tree and classified. The fraction of wrong classifications from all oob vectors is the error rate. (For T trees there are in total T × Nb oob vectors.) The oob data can also be used to optimize the constant m.
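A sketch following this description: m randomly selected attributes per tree, a fraction f of each bootstrap sample put aside as oob data, trees grown without pruning, and a majority vote. The tree implementation is taken from scikit-learn and all parameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, n_trees=100, m=2, f=1/3, seed=0):
    """Grow n_trees trees; each uses m randomly chosen attributes and a
    bootstrap sample of which a fraction f is kept out of the bag (oob)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    forest, oob_wrong, oob_total = [], 0, 0
    for _ in range(n_trees):
        feats = rng.choice(M, size=m, replace=False)   # random attributes
        boot = rng.integers(0, N, size=N)              # bootstrap sample
        n_oob = int(f * N)
        oob, train = boot[:n_oob], boot[n_oob:]        # bag and training part
        tree = DecisionTreeClassifier()                # grows to pure leaves
        tree.fit(X[train][:, feats], y[train])
        # each oob vector is classified by the tree of its own sample
        oob_wrong += np.sum(tree.predict(X[oob][:, feats]) != y[oob])
        oob_total += n_oob
        forest.append((feats, tree))
    return forest, oob_wrong / oob_total               # forest and oob error

def forest_predict(forest, Z):
    """Majority vote of all trees (labels assumed to be +-1)."""
    votes = sum(tree.predict(Z[:, feats]) for feats, tree in forest)
    return np.sign(votes)
```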

The random forest classifier has received quite some interest. The concept is simple and seems to be about as powerful as other classifiers. It is especially well suited for large data sets in high dimensions.

11.4.6 Comparison of the Methods

We have discussed various methods for classification. Each of them has its advantages and its drawbacks. It depends on the specific problem which one is the most suitable.

The discriminant analysis offers itself for one- or two-dimensional continuous distributions (preferably Gaussians or other unimodal distributions). It is useful for event selection in simple situations.

Kernel methods are relatively easy to apply. They work well if the division line between classes is sufficiently smooth and transitions between different classes are continuous. Categorical variables cannot be treated. The variant with support vectors reduces computing time and the memory space needed for the storage of the training sample. In standard cases with not too extensive statistics one should avoid this additional complication. Kernel methods can perform event selection in more complicated environments than is possible with the primitive discriminant analysis. The price for the better performance is, however, a reduced possibility of interpreting the results.

Artificial neural networks are, due to the enormous number of free parameters, able to solve any problem in an optimal way. They suffer from the disadvantage that the user usually has to intervene to guide the minimizing process to a correct minimum. The user has to check and improve the result by changing the network
