three quantities graphically. Of course the analytic form of the equation and its interpretation cannot be delivered by the network.
Often a study of the optimized weights makes it possible to simplify the net. Very small weights can be set to zero, i.e. the corresponding connections between knots are cut. We can check whether switching off certain neurons has a sizable influence on the response. If this is not the case, these neurons can be eliminated. Of course, the modified network has to be trained again.
Practical Hints for the Application
Computer programs for ANNs with back-propagation are relatively simple and available in many places, but the effort to write an ANN program is also not very large. The number of input vector components n and the numbers of knots m and m′ are parameters to be chosen by the user; thus the program is universal, only the loss function has to be adapted to the specific problem.
•The number of units in each layer should more or less match the number of input components. Some experts plead for a higher number. The user should try to find the optimal number.
•The sigmoid function has values only between zero and unity. Therefore the output or the target value has to be appropriately scaled by the user.
•The raw input components are usually correlated. The net is more efficient if the user orthogonalizes them. Then often some of the new components have negligible effect on the output and can be discarded.
•The weights have to be initialized at the beginning of the training phase. This can be done by a random number generator or they can be set to fixed values.
•The loss function E (11.19) has to be adjusted to the problem to be solved.
•The learning rate α should be chosen relatively high at the beginning of a training phase, e.g. α = 10. In the course of fitting it should be reduced to avoid oscillations.
•The convergence of the minimizing process is slow if the gradient is small. If this is the case and the fit is still bad, it is recommended to increase the learning constant for a certain number of iterations.
•In order to check whether a minimum is only local, one should train the net with different start values of the weights.
•Other possibilities for the improvement of the convergence and the elimination of local minima can be found in the extensive literature. An ANN program package that proceeds automatically along many of the proposed steps is described in [70]. A minimal training-loop sketch illustrating several of these hints is given below.
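The following sketch illustrates several of the hints above for a single hidden layer of sigmoid units trained by back-propagation (Python with NumPy is assumed; the network size, the standardization of the inputs and the particular learning-rate schedule are illustrative choices, not prescriptions of the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ann(x, y, m_hidden=10, alpha0=10.0, epochs=20000, seed=1):
    """Fit a single-hidden-layer net by back-propagation.
    x: (N, n) input vectors, y: (N,) targets scaled by the user into (0, 1)."""
    rng = np.random.default_rng(seed)
    N, n = x.shape
    x = (x - x.mean(axis=0)) / x.std(axis=0)            # scale the raw inputs
    w1 = rng.normal(0.0, 0.5, size=(n + 1, m_hidden))   # random weight initialization
    w2 = rng.normal(0.0, 0.5, size=(m_hidden + 1, 1))
    xb = np.hstack([x, np.ones((N, 1))])                # append bias column
    for epoch in range(epochs):
        alpha = alpha0 / (1.0 + epoch / 1000.0)         # start high, reduce to avoid oscillations
        h = sigmoid(xb @ w1)                            # hidden-layer responses
        hb = np.hstack([h, np.ones((N, 1))])
        out = sigmoid(hb @ w2)[:, 0]                    # network output, values in (0, 1)
        # gradients of the quadratic loss E = sum (out - y)^2 / (2N)
        d_out = (out - y) * out * (1.0 - out)
        grad_w2 = hb.T @ d_out[:, None] / N
        d_hid = (d_out[:, None] * w2[:-1].T) * h * (1.0 - h)
        grad_w1 = xb.T @ d_hid / N
        w2 -= alpha * grad_w2
        w1 -= alpha * grad_w1
    return w1, w2
```

Note that the same input standardization has to be applied when the trained net is evaluated, and the sigmoid output has to be mapped back to the physical range of the target.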
Example: Čerenkov circles
Charged, relativistic particles can emit photons by the Čerenkov effect. The photons hit a detector plane at points located on a circle. Of interest are the radius and the center of this circle, since they provide information on the direction and velocity of the emitting particle. The number of photons and the coordinates where they hit the detector fluctuate statistically and are disturbed by spurious noise signals. It has turned out that ANNs can reconstruct the parameters of interest from the available coordinates with good efficiency and accuracy.
We study this problem by a Monte Carlo simulation. In a simplified model, we assume that exactly 5 photons are emitted by a particle and that the coordinate pairs are located on a circle and registered. The center, the radii, and the hit coordinates are generated stochastically. The input vector of the net thus consists of 10 components, the 5 coordinate pairs. The output is a single value, the radius R. The loss function is $(R - R_{\mathrm{true}})^2$, where the true value $R_{\mathrm{true}}$ is known from the simulation.
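A minimal sketch of such a training-data generation could look as follows (Python with NumPy is assumed; the ranges of the circle parameters and the Gaussian smearing of the hit coordinates are illustrative assumptions, and the spurious noise hits mentioned above are omitted):

```python
import numpy as np

def generate_cerenkov_sample(n_events, sigma_hit=0.02, seed=2):
    """Simulate 5 photon hits per event on a circle with random center and radius.
    Returns the (n_events, 10) input matrix of coordinate pairs and the
    (n_events,) vector of true radii used as training targets."""
    rng = np.random.default_rng(seed)
    xc = rng.uniform(-1.0, 1.0, n_events)                 # circle centers
    yc = rng.uniform(-1.0, 1.0, n_events)
    r_true = rng.uniform(0.2, 1.0, n_events)              # true radii (targets)
    phi = rng.uniform(0.0, 2 * np.pi, (n_events, 5))      # 5 random hit angles per event
    x_hit = xc[:, None] + r_true[:, None] * np.cos(phi) + rng.normal(0, sigma_hit, (n_events, 5))
    y_hit = yc[:, None] + r_true[:, None] * np.sin(phi) + rng.normal(0, sigma_hit, (n_events, 5))
    inputs = np.concatenate([x_hit, y_hit], axis=1)       # 10 components per event
    return inputs, r_true
```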
The relative accuracy of the reconstruction as a function of the iteration step is shown in Fig. 11.11. Different sequences of the learning rate have been tried. Typically, the process proceeds in steps: a flat phase is followed by a rather abrupt improvement. The number of iterations required to reach the minimum is quite large.

Fig. 11.11. Reconstruction of the radii of circles through 5 points by means of an ANN with different sequences of the learning constant α (reconstruction error versus number of iterations; curves for α = 1, 5, 10, 20, 40, one of them without momentum term).
Hardware Realization
The structure of a back-propagation network can be implemented in a hardware network. The weights are stored locally at the units, which are realized by rather simple microprocessors. Each microprocessor performs the knot function, e.g. the sigmoid function. A trained net can then calculate the fitted function very fast, since all processors work in parallel. Such processors can be employed for the triggering
in experiments where a quick decision is required on whether to accept an event and to store the corresponding data.
11.4.3 Weighting Methods
For the decision whether to assign an observation at the location x to a certain class, an obvious option is to do this according to the classification of neighboring objects of the training sample. One possibility is to consider a certain region around x and to take a “majority vote” of the training objects inside this region to decide about the class membership of the input. The region to be considered here can be chosen in different ways; it can be a fixed volume around x, or a variable volume defined by requiring that it contains a fixed number of observations, or an infinite volume, introducing weights for the training objects which decrease with their distance from x.
In any case we need a metric to define the distance. The choice of a metric in multi-dimensional applications is often a rather intricate problem, especially if some of the input components are physically of very different nature. A way out is to normalize the different quantities to equal variance and to eliminate global correlations by a linear variable transformation. This corresponds to the transformation to principal components discussed above (see Sect. 11.3), with subsequent scaling of the principal components. An alternative but equivalent possibility is to use a direction-dependent weighting. The same result is achieved when we apply the Mahalanobis metric, which we have introduced in Sect. 10.3.9.
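As a small illustration, the Mahalanobis distances of an input x to all training vectors can be computed directly from the covariance matrix of the training sample (Python with NumPy is assumed; this is a sketch, not the book's code):

```python
import numpy as np

def mahalanobis_distances(x, training):
    """Distances of a single input x to all training vectors in the Mahalanobis
    metric defined by the covariance matrix of the training sample. This is
    equivalent to a Euclidean metric after decorrelating and scaling the
    components (principal-component transformation plus normalization)."""
    v = np.linalg.inv(np.cov(training, rowvar=False))   # weight matrix V
    d = training - x
    return np.sqrt(np.einsum("ij,jk,ik->i", d, v, d))
```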
For a large training sample the calculation of all distances is expensive in computing time. A drastic reduction of the number of distances to be calculated is in many cases possible with the so-called support vector machines, which we will discuss below. These are not machines but programs which reduce the training sample to a few decisive inputs without impairing the results.
K-Nearest Neighbors
We choose a number K which of course will depend on the size of the training sample and on the overlap of the classes. For an input x we determine the K nearest neighbors and the numbers $k_1$ and $k_2 = K - k_1$ of observations that belong to class I and class II, respectively. For a ratio $k_1/k_2$ greater than α we assign the new observation to class I, in the opposite case to class II:
\[
k_1/k_2 > \alpha \;\Rightarrow\; \text{class I}\,, \qquad k_1/k_2 < \alpha \;\Rightarrow\; \text{class II}\,.
\]
The choice of α depends on the loss function. When the loss function treats all classes alike, then α will be unity and we get a simple majority vote. To find the optimal value of K we minimize the average of the loss function computed for all observations of the training sample.
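A minimal K-nearest-neighbor classifier along these lines might look as follows (Python with NumPy is assumed; a Euclidean metric on already normalized inputs and class labels coded as +1 for class I and −1 for class II are illustrative choices):

```python
import numpy as np

def knn_classify(x, train_x, train_y, k=10, alpha=1.0):
    """Assign x to class I (+1) or class II (-1) by a vote of the K nearest
    training observations; alpha is the decision threshold on k1/k2."""
    dist = np.linalg.norm(train_x - x, axis=1)   # distances to all training vectors
    nearest = train_y[np.argsort(dist)[:k]]      # labels of the K nearest neighbors
    k1 = np.sum(nearest == +1)
    k2 = k - k1
    return +1 if k1 > alpha * k2 else -1         # k1/k2 > alpha  <=>  k1 > alpha*k2
```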
Distance Dependent Weighting
Instead of treating all training vector inputs x′ within a given region in the same way, one should attribute a larger weight to those located nearer to the input x. A
sensible choice is again a Gaussian kernel,
\[
K(x, x') = \exp\left(-\frac{(x - x')^2}{2s^2}\right).
\]

With this choice we obtain for the class β the weight $w_\beta$,
\[
w_\beta = \sum_i K(x, x_{\beta i}) \;, \qquad (11.21)
\]
where $x_{\beta i}$ are the locations of the training vectors of the class β. If there are only two classes, writing the training sample as
\[
\{(x_1, y_1), \ldots, (x_N, y_N)\}
\]
with the responses $y_i = \pm 1$, the classification of a new input x is done according to the value ±1 of the classifier $\hat{y}(x)$, given by
\[
\hat{y}(x) = \mathrm{sign}\!\left(\sum_{y_i = +1} K(x, x_i) - \sum_{y_i = -1} K(x, x_i)\right)
= \mathrm{sign}\!\left(\sum_i y_i\, K(x, x_i)\right). \qquad (11.22)
\]
For a direction-dependent density of the training sample we can use a direction-dependent kernel, for instance in the Mahalanobis form mentioned above,
\[
K(x, x') = \exp\left(-\tfrac{1}{2}\,(x - x')^{T} V\, (x - x')\right),
\]
with the weight matrix V. When we first normalize the sample, this complication is not necessary. The parameter s, respectively the matrix V, which determines the width of the kernel function, is again optimized by minimizing the loss for the training sample.
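A compact sketch of the classifier (11.22) with a spherical Gaussian kernel (Python with NumPy is assumed; the width s is a parameter to be optimized on the training sample as described):

```python
import numpy as np

def kernel_classify(x, train_x, train_y, s=0.5):
    """Distance-dependent weighting, Eq. (11.22):
    y_hat(x) = sign( sum_i y_i * K(x, x_i) ) with a Gaussian kernel."""
    d2 = np.sum((train_x - x) ** 2, axis=1)      # squared distances to training vectors
    kern = np.exp(-d2 / (2.0 * s ** 2))          # Gaussian kernel weights
    return np.sign(np.sum(train_y * kern))       # labels y_i = +/-1
```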
Support Vector Machines
Support vector machines (SVMs) produce results similar to those of the ordinary distance-dependent weighting methods, but they require less memory for the storage of the learning data and the classification is extremely fast. Therefore they are especially useful in on-line applications.
The class assignment is usually the same for all elements in large connected regions of the variable x. Very often, in a two-class classification, there are only two regions, separated by a hypersurface. For short-range kernels it is then obvious that for the classification of observations only the knowledge of those input vectors of the training sample is essential which are located in the vicinity of the hypersurface. These input vectors are called support vectors [73]. SVMs are programs which try to determine them, or rather their weights, in an optimal way, setting the weights of all other input vectors to zero.
In the one-dimensional case with non-overlapping classes it is sufficient to know those inputs of each class which are located nearest to the dividing limit between the classes. Sums like (11.21) then run over one element only. This, of course, makes the calculation extremely fast.
In higher-dimensional spaces with overlapping classes, and for more than two classes, the problem of determining the support vectors is of course more complicated. But also under these circumstances the number of relevant training inputs can be reduced drastically. The success of SVMs is based on the so-called kernel trick, by which nonlinear problems in the input space are treated as linear problems in some higher-dimensional space with well-known optimization algorithms. For the corresponding algorithms and proofs we refer to the literature, e.g. [13, 72]. A short introduction is given in Appendix 13.13.

Fig. 11.12. Separation of two classes. Top: learning sample; bottom: wrongly assigned events of a test sample, with the two regions (region 1, region 2) indicated.
Example and Discussion
The top panel of Fig. 11.12 shows two overlapping training samples of 500 inputs each. The loss function is the number of wrong assignments, independent of the respective class. Since the distributions are quite similar in both coordinates, we do not change the metric. We use a Gaussian kernel. The optimization of the parameter s by means of the training sample shows only a small change of the error rate when s is varied by a factor of four. The lower panel displays the result of the classification for a test sample of the same size (500 inputs per class); only the wrong assignments are shown.
We realize that wrongly assigned training observations occur in two separate, non-overlapping regions which can be separated by a curve or a polygon chain, as indicated
in the figure. Obviously all new observations would be assigned to the class corresponding to the region in which they are located. If we had used the K-nearest-neighbor method instead of the distance-dependent weighting, the result would have been almost identical. Contrary to what one might expect, this more primitive method is more expensive both in programming and in computation than the weighting with a distance-dependent kernel.
Since for the classification only the separation curve between the classes is required, it must be sufficient to know the class assignment for those training observations which lie near this curve. They would define the support vectors of an SVM. Thus the number of inputs needed for the assignment of new observations would be drastically reduced. However, for fewer than about $10^6$ assignments the effort to determine support vectors usually does not pay. SVMs are useful for large event numbers in applications where computing time is relevant.
11.4.4 Decision Trees
Simple Trees
We consider the simplest case, two-class classification, i.e. the assignment of inputs to one of two classes I and II, and N observations with P features $x_1, x_2, \ldots, x_P$, which we consider, as before, as the components of an input vector.
In the first step we consider the first component $x_{11}, x_{21}, \ldots, x_{N1}$ of all N input vectors of the training sample. We search for a value $x_{c1}$ which optimally divides the two classes and obtain a division of the training sample into two parts A and B. Each of these parts, which belong to two different subspaces, will now be treated further separately. Next we take the subspace A, look at the feature $x_2$, and divide it, in the same way as before the full space, again into two parts. Analogously we treat the subspace B. Now we can switch to the next feature or return to feature 1 and perform further splittings. The sequence of divisions leads to smaller and smaller subspaces, each of them assigned to a certain class. This subdivision process can be regarded as the development of a decision tree for input vectors for which the class membership is to be determined. The growing of the tree is stopped by a pruning rule. The final partitions are called leaves.
In Fig. 11.13 we show schematically the subdivision into subspaces and the corresponding decision tree for a training sample of 32 elements with only two features. The training sample which determines the decisions is indicated. At the end of the tree (here at the bottom) the decision about the class membership is taken.
It is not obvious how one should optimize the sequence of partitions and the positions of the cuts, nor under which circumstances the procedure should be stopped.
For the optimization of the splits we must again define a loss function, which will depend on the given problem. A simple possibility in the case of two classes is to maximize for each split the difference $\Delta N = N_r - N_f$ between right and wrong assignments. We used this in our example of Fig. 11.13. For the first division this quantity was equal to 20 − 12 = 8. To some extent the position of the splitting hyperplane is still arbitrary; the loss function changes its value only when it hits the nearest input. It could, for example, be put at the center between the two nearest points. Often the importance of efficiency and purity differs for the two classes. In that case we would choose an asymmetric loss function.

Fig. 11.13. Decision tree (bottom) corresponding to the classification of a training sample with two features $x_1$, $x_2$ (top).
Very popular is the following, slightly more complicated criterion: we define the impurity $P_I$ of class I,
\[
P_I = \frac{N_I}{N_I + N_{II}} \;, \qquad (11.23)
\]
which for optimal classification would be 1 or 0. The quantity
\[
G = P_I (1 - P_I) + P_{II} (1 - P_{II}) \;, \qquad (11.24)
\]
the Gini index, should be as small as possible. For each split of a parent node E with Gini index $G_E$ into two child nodes A and B with Gini indices $G_A$ and $G_B$, we minimize the sum $G_A + G_B$.
The difference
\[
D = G_E - G_A - G_B
\]
is taken as the stopping or pruning parameter. The quantity D measures the increase in purity; it is large for a parent node with large G and two child nodes with small G. When D becomes less than a certain critical value $D_c$, the branch is not split further and ends at a leaf. The leaf is assigned to the class which has the majority in it.
Besides the Gini index, other measures of the purity or impurity are also used [13]. An interesting quantity is the entropy $S = -P_I \ln P_I - P_{II} \ln P_{II}$, a well-known measure of disorder, i.e. of impurity.
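A sketch of the search for the best cut on a single feature with the Gini criterion (Python with NumPy is assumed; class labels are coded as ±1 and, as in the text, the unweighted sum $G_A + G_B$ is minimized):

```python
import numpy as np

def gini(y):
    """Gini index G = P_I(1 - P_I) + P_II(1 - P_II) of a node with labels y = +/-1."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y == +1)
    return 2.0 * p * (1.0 - p)   # identical to p(1-p) + (1-p)p

def best_split(x_feature, y):
    """Cut value on one feature that minimizes G_A + G_B of the two child nodes."""
    best_g, best_cut = np.inf, None
    for c in np.unique(x_feature):
        left, right = y[x_feature <= c], y[x_feature > c]
        g = gini(left) + gini(right)
        if g < best_g:
            best_g, best_cut = g, c
    return best_g, best_cut
```

The pruning parameter $D = G_E - G_A - G_B$ follows directly from these quantities.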
Random Forest
Another, newer development [84], which includes the bootstrap idea, is the extension of the decision tree concept to the random forest classifier.
Many trees are generated from bootstrap samples of the training sample, but now part of the input vector components are suppressed. A tree is constructed in the following way: first, m out of the M components or attributes of the input vectors are selected at random. The tree is grown in an m-dimensional subspace of the full input vector space. It is not obvious how m is to be chosen, but the author proposes m ≈ √M and states that the results show little dependence on m. With large m the individual trees are powerful but strongly correlated. The value of m is the same for all trees.
From the N truncated bootstrap vectors, $N_b$ are separated, put into a bag and reserved for testing. A fraction $f = N_b/N \approx 1/3$ is proposed. The remaining ones are used to generate the tree. For each split that attribute out of the m available attributes is chosen which gives the smallest number of wrong classifications. Each leaf contains only elements of a single class. There is no pruning.
Following the bagging concept, the classification of new input vectors is obtained by the majority vote of all trees.
The out-of-the-bag (oob) data are used to estimate the error rate. To this end, each oob vector of the k-th sample is run through the k-th tree and classified. The fraction of wrong classifications from all oob vectors is the error rate. (For T trees there are in total $T \times N_b$ oob vectors.) The oob data can also be used to optimize the constant m.
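The construction and the oob error estimate described above can be sketched as follows (Python is assumed; scikit-learn's DecisionTreeClassifier merely stands in for the simple, unpruned trees of the previous section, and the concrete parameter values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(train_x, train_y, n_trees=100, m=None, f_oob=1.0 / 3.0, seed=3):
    """Grow n_trees trees on bootstrap samples; each tree uses m randomly chosen
    attributes, and a fraction f_oob of the bootstrap vectors is set aside
    ("out of the bag") to estimate the error rate."""
    rng = np.random.default_rng(seed)
    n, n_features = train_x.shape
    m = m or max(1, int(np.sqrt(n_features)))
    trees, oob_wrong, oob_total = [], 0, 0
    for _ in range(n_trees):
        feats = rng.choice(n_features, size=m, replace=False)  # random attribute subset
        boot = rng.integers(0, n, size=n)                      # bootstrap sample of size N
        n_bag = int(n * (1.0 - f_oob))
        grow, oob = boot[:n_bag], boot[n_bag:]                 # growing set and oob set
        tree = DecisionTreeClassifier().fit(train_x[np.ix_(grow, feats)], train_y[grow])
        trees.append((tree, feats))
        pred = tree.predict(train_x[np.ix_(oob, feats)])
        oob_wrong += np.sum(pred != train_y[oob])
        oob_total += len(oob)
    return trees, oob_wrong / oob_total                        # forest and oob error rate

def forest_classify(trees, x):
    """Majority vote of all trees for a single input vector x (labels +/-1)."""
    votes = [tree.predict(x[feats].reshape(1, -1))[0] for tree, feats in trees]
    return np.sign(np.sum(votes))
```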
The random forest classifier has received quite some interest. The concept is simple and seems to be about as powerful as that of other classifiers. It is especially well suited for large data sets in high dimensions.
11.4.6 Comparison of the Methods
We have discussed various methods for classification. Each of them has its advantages and its drawbacks. Which one is the most suitable depends on the specific problem.
The discriminant analysis offers itself for one- or two-dimensional continuous distributions (preferably Gaussians or other unimodal distributions). It is useful for event selection in simple situations.
Kernel methods are relatively easy to apply. They work well if the division line between classes is sufficiently smooth and transitions between different classes are continuous. Categorical variables cannot be treated. The variant with support vectors reduces the computing time and the memory space for the storage of the training sample. In standard cases with not too extensive statistics one should avoid this additional complication. Kernel methods can perform event selection in more complicated environments than is possible with the primitive discriminant analysis. The price for the better performance is, however, a reduced possibility of interpreting the results.
Artificial neural networks are, due to the enormous number of free parameters, able to solve any problem in an optimal way. They suffer from the disadvantage that the user usually has to intervene to guide the minimizing process to a correct minimum. The user has to check and improve the result by changing the network
