

three quantities graphically. Of course the analytic form of the equation and its interpretation cannot be delivered by the network.

Often a study of the optimized weights makes it possible to simplify the net. Very small weights can be set to zero, i.e. the corresponding connections between knots are cut. We can check whether switching off certain neurons has a sizable influence on the response. If this is not the case, these neurons can be eliminated. Of course, the modified network has to be trained again.

Practical Hints for the Application

Computer programs for ANNs with back-propagation are relatively simple and available in many places, but the effort to write an ANN program is also not very large. The number of input vector components n and the numbers of knots m and m′ are parameters to be chosen by the user; thus the program is universal, only the loss function has to be adapted to the specific problem.

The number of units in each layer should more or less match the number of input components. Some experts plead for a higher number. The user should try to find the optimal number.

The sigmoid function has values only between zero and unity. Therefore the output or the target value has to be appropriately scaled by the user.

The raw input components are usually correlated. The net is more efficient if the user orthogonalizes them. Then often some of the new components have a negligible effect on the output and can be discarded.

The weights have to be initialized at the beginning of the training phase. This can be done by a random number generator or they can be set to fixed values.

The loss function E (11.19) has to be adjusted to the problem to be solved.

The learning rate α should be chosen relatively high at the beginning of a training phase, e.g. α = 10. In the course of fitting it should be reduced to avoid oscillations.

The convergence of the minimizing process is slow if the gradient is small. If this is the case and the fit is still bad, it is recommended to increase the learning constant for a certain number of iterations.

In order to check whether a minimum is only local, one should train the net with different start values of the weights.

Other possibilities for the improvement of the convergence and the elimination of local minima can be found in the substantial literature. An ANN program package that proceeds automatically along many of the proposed steps is described in [70].
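Several of these hints can be illustrated in compact form. The following sketch is a minimal single-hidden-layer network with back-propagation in Python/NumPy; the toy data, the network size, the quadratic loss and the halving schedule for α are invented for illustration and are not the package of [70].

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

# toy regression problem: n input components, one output in (0, 1)
n, m, N = 4, 8, 200                       # inputs, hidden knots, training size
X = rng.uniform(-1, 1, size=(N, n))
t = sigmoid(X.sum(axis=1))                # target, already scaled to (0, 1)

# initialization of the weights by a random number generator
W1 = rng.normal(0.0, 0.5, size=(m, n))    # input -> hidden layer
W2 = rng.normal(0.0, 0.5, size=m)         # hidden layer -> output

alpha = 10.0                              # start with a large learning rate
for it in range(20000):
    i = rng.integers(N)                   # one training event per step
    h = sigmoid(W1 @ X[i])                # hidden layer response
    y = sigmoid(W2 @ h)                   # network output
    # back-propagate the gradient of the quadratic loss E = (y - t)^2
    delta = 2.0 * (y - t[i]) * y * (1.0 - y)
    grad_W2 = delta * h
    grad_W1 = np.outer(delta * W2 * h * (1.0 - h), X[i])
    W2 -= alpha * grad_W2
    W1 -= alpha * grad_W1
    if it % 5000 == 4999:
        alpha *= 0.5                      # reduce alpha to avoid oscillations

y_all = sigmoid(W2 @ sigmoid(W1 @ X.T))
print("mean quadratic loss:", np.mean((y_all - t) ** 2))
```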

Example: Čerenkov circles

Charged, relativistic particles can emit photons by the Čerenkov effect. The photons hit a detector plane at points located on a circle. Of interest are the radius and center of this circle, since they provide information on the direction and velocity of the emitting particle. The number of photons and the coordinates where they hit the detector

[Figure: “Cerenkov circles” – error (10⁻⁴ to 10⁻¹) versus number of iterations (10² to 10⁷) for learning constants α = 1, 5, 10, 20, 40; one curve for α = 20 without momentum term; a new learning constant is indicated.]

Fig. 11.11. Reconstruction of radii of circles through 5 points by means of an ANN with different sequences of learning constants α.

fluctuate statistically and are disturbed by spurious noise signals. It has turned out that ANNs can reconstruct the parameters of interest from the available coordinates with good efficiency and accuracy.

We study this problem by a Monte Carlo simulation. In a simplified model, we assume that exactly 5 photons are emitted by a particle and that the coordinate pairs are located on a circle and registered. The center, the radii, and the hit coordinates are generated stochastically. The input vector of the net thus consists of 10 components, the 5 coordinate pairs. The output is a single value, the radius R. The loss function is (R − Rtrue)², where the true value Rtrue is known from the simulation.

The relative accuracy of the reconstruction as a function of the iteration step is shown in Fig. 11.11. Different sequences of the learning rate have been tried. Typically, the process proceeds in steps, where a flat phase is followed by a rather abrupt improvement. The number of iterations required to reach the minimum is quite large.
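A minimal sketch of how such a training sample could be generated is given below; the helper make_event is hypothetical, and the ranges of the radius and center as well as the amount of hit smearing are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_event(sigma=0.02):
    """Return the 10-component input vector (5 coordinate pairs) and
    the true radius of the underlying circle."""
    R = rng.uniform(0.2, 1.0)                  # true radius (assumed range)
    cx, cy = rng.uniform(-1.0, 1.0, size=2)    # circle center (assumed range)
    phi = rng.uniform(0.0, 2 * np.pi, size=5)  # 5 photon hits on the circle
    x = cx + R * np.cos(phi) + rng.normal(0.0, sigma, 5)
    y = cy + R * np.sin(phi) + rng.normal(0.0, sigma, 5)
    return np.column_stack([x, y]).ravel(), R

# training sample of N input vectors with known target radii
N = 10000
events = [make_event() for _ in range(N)]
X = np.array([e[0] for e in events])           # shape (N, 10)
R_true = np.array([e[1] for e in events])
# the loss of a network prediction R_hat would be np.mean((R_hat - R_true)**2)
```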

Hardware Realization

The structure of a back-propagation network can be implemented by a hardware network. The weights are stored locally at the units, which are realized by rather simple microprocessors. Each microprocessor performs the knot function, e.g. the sigmoid function. A trained net can then calculate the fitted function very fast, since all processors work in parallel. Such processors can be employed for the triggering


in experiments where a quick decision is required on whether to accept an event and store the corresponding data.

11.4.3 Weighting Methods

For the decision whether to assign an observation at the location x to a certain class, an obvious option is to do this according to the classification of neighboring objects of the training sample. One possibility is to consider a certain region around x and to take a “majority vote” of the training objects inside this region to decide about the class membership of the input. The region to be considered here can be chosen in different ways; it can be a fixed volume around x, or a variable volume defined by requiring that it contains a fixed number of observations, or an infinite volume, introducing weights for the training objects which decrease with their distance from x.

In any case we need a metric to define the distance. The choice of a metric in multi-dimensional applications is often a rather intricate problem, especially if some of the input components are physically of very different nature. A way out seems to be to normalize the different quantities to equal variance and to eliminate global correlations by a linear variable transformation. This corresponds to the transformation to principal components discussed above (see Sect. 11.3) with subsequent scaling of the principal components. An alternative but equivalent possibility is to use a direction dependent weighting. The same result is achieved when we apply the Mahalanobis metric, which we have introduced in Sect. 10.3.9.
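As a sketch, such a metric can be built directly from the covariance matrix estimated from the training sample; the function below is hypothetical and assumes purely continuous input components.

```python
import numpy as np

def mahalanobis_metric(train):
    """Return a distance function d(a, b) = sqrt((a-b)^T V (a-b)), where V is
    the inverse covariance matrix estimated from the training sample."""
    V = np.linalg.inv(np.cov(train, rowvar=False))
    def dist(a, b):
        d = np.asarray(a) - np.asarray(b)
        return np.sqrt(d @ V @ d)
    return dist

# usage: equivalent to scaling the components to equal variance and removing
# global linear correlations before applying the Euclidean distance
train = np.random.default_rng(2).multivariate_normal(
    mean=[0.0, 0.0], cov=[[4.0, 1.5], [1.5, 1.0]], size=500)
dist = mahalanobis_metric(train)
print(dist(train[0], train[1]))
```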

For a large training sample the calculation of all distances is expensive in computing time. A drastic reduction of the number of distances to be calculated is in many cases possible by the so-called support vector machines which we will discuss below. Those are not machines, but programs which reduce the training sample to a few, but decisive inputs, without impairing the results.

K-Nearest Neighbors

We choose a number K which of course will depend on the size of the training sample and the overlap of the classes. For an input x we determine the K nearest neighbors and the numbers k1, k2 = K − k1, of observations that belong to class I and II, respectively. For a ratio k1/k2 greater than α, we assign the new observation to class I, in the opposite case to class II:

k1/k2 > α ⇒ class I ,    k1/k2 < α ⇒ class II .

The choice of α depends on the loss function. When the loss function treats all classes alike, then α will be unity and we get a simple majority vote. To find the optimal value of K we minimize the average of the loss function computed for all observations of the training sample.
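A minimal sketch of this rule (hypothetical function name, Euclidean metric, labels yi = ±1 for classes I and II):

```python
import numpy as np

def knn_classify(x, train_x, train_y, K=10, alpha=1.0):
    """Assign x to class I (+1) or class II (-1) by the vote of the K nearest
    training observations; train_y holds +1 for class I and -1 for class II."""
    d = np.linalg.norm(train_x - x, axis=1)   # distances to all training inputs
    nearest = np.argsort(d)[:K]               # indices of the K nearest ones
    k1 = np.sum(train_y[nearest] == +1)       # neighbors from class I
    k2 = K - k1                               # neighbors from class II
    return +1 if k1 > alpha * k2 else -1      # k1/k2 > alpha  =>  class I

# K (and alpha, if the loss is asymmetric) would be optimized by minimizing
# the average loss over the training sample itself.
```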

Distance Dependent Weighting

Instead of treating all training vector inputs x′ within a given region in the same way, one should attribute a larger weight to those located nearer to the input x. A


sensible choice is again a Gaussian kernel,

K(x, x′) = exp(−(x − x′)²/(2s²)) .

With this choice we obtain for the class β the weight wβ ,

wβ = Σi K(x, xβi) ,    (11.21)

where xβi are the locations of the training vectors of the class β. If there are only two classes, writing the training sample as

{x1, y1, . . . , xN , yN }

with the responses yi = ±1, the classification of a new input x is done according to the value ±1 of the classifier ŷ(x), given by

ŷ(x) = sign( Σyi=+1 K(x, xi) − Σyi=−1 K(x, xi) ) = sign( Σi yiK(x, xi) ) .    (11.22)

For a direction-dependent density of the training sample, we can use a direction-dependent kernel, possibly in the Mahalanobis form mentioned above:

K(x, x′) = exp(−(1/2)(x − x′)T V(x − x′)) ,

with the weight matrix V. When we first normalize the sample, this complication is not necessary. The parameter s of the matrix V, which determines the width of the kernel function, again is optimized by minimizing the loss for the training sample.
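A sketch of the classifier (11.22) with a Gaussian kernel could look as follows; the function names are invented, and the width s would be fixed by minimizing the number of wrong assignments on the training sample, as described above.

```python
import numpy as np

def gauss_kernel(x, xi, s):
    """K(x, xi) = exp(-(x - xi)^2 / (2 s^2)) for all training points xi."""
    return np.exp(-np.sum((xi - x) ** 2, axis=1) / (2.0 * s ** 2))

def kernel_classify(x, train_x, train_y, s):
    """Classifier (11.22): sign of the kernel-weighted sum of the labels
    y_i = +-1 of the training sample."""
    return np.sign(np.sum(train_y * gauss_kernel(x, train_x, s)))

def training_error(train_x, train_y, s):
    """Fraction of wrong assignments on the training sample, used to choose s."""
    wrong = sum(kernel_classify(x, train_x, train_y, s) != y
                for x, y in zip(train_x, train_y))
    return wrong / len(train_y)
```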

Support Vector Machines

Support vector machines (SVMs) produce similar results as ordinary distance-dependent weighting methods, but they require less memory for the storage of learning data and the classification is extremely fast. Therefore, they are especially useful in on-line applications.

The class assignment usually is the same for all elements in large connected regions of the variable x. Very often, in a two-class classification, there are only two regions separated by a hypersurface. For short-range kernels it is obvious then that for the classification of observations only those input vectors of the training sample are essential which are located in the vicinity of the hypersurface. These input vectors are called support vectors [73]. SVMs are programs which try to determine them, or rather their weights, in an optimal way, setting the weights of all other input vectors to zero.

In the one-dimensional case with non-overlapping classes it is sufficient to know those inputs of each class which are located nearest to the dividing limit between the classes. Sums like (11.21) then run over one element only. This, of course, makes the calculation extremely fast.

In higher-dimensional spaces with overlapping classes and for more than two classes the problem of determining support vectors is of course more complicated. But

[Figure: scatter plots; the assignment regions are labeled region 1 and region 2.]

Fig. 11.12. Separation of two classes. Top: learning sample, bottom: wrongly assigned events of a test sample.

also in these circumstances the number of relevant training inputs can be reduced drastically. The success of SVMs is based on the so-called kernel trick, by which nonlinear problems in the input space are treated as linear problems in some higher-dimensional space by well-known optimization algorithms. For the corresponding algorithms and proofs we refer to the literature, e.g. [13, 72]. A short introduction is given in Appendix 13.13.

Example and Discussion

The top panel of Fig. 11.12 shows two overlapping training samples of 500 inputs each. The loss function is the number of wrong assignments, independent of the respective class. Since the distributions are quite similar in both coordinates, we do not change the metric. We use a Gaussian kernel. The optimization of the parameter s by means of the training sample shows only a small change of the error rate when s is changed by a factor of four. The lower panel displays the result of the classification for a test sample of the same size (500 inputs per class). Only the wrong assignments are shown.

We realize that wrongly assigned training observations occur in two separate, non-overlapping regions which can be separated by a curve or a polygon chain as indicated


in the figure. Obviously all new observations would be assigned to the class corresponding to the region in which they are located. If we had used the K-nearest neighbors method instead of the distance-dependent weighting, the result would have been almost identical. Contrary to what one might expect, this more primitive method is more expensive in both the programming and the calculation, when compared to the weighting with a distance-dependent kernel.

Since for the classification only the separation curve between the classes is required, it must be sufficient to know the class assignment for those training observations which lie near this curve. They would define the support vectors of an SVM. Thus the number of inputs needed for the assignment of new observations would be drastically reduced. However, for a number of assignments below about 10⁶ the effort to determine support vectors usually does not pay. SVMs are useful for large event numbers in applications where computing time is relevant.
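For illustration, a support vector machine with a Gaussian (RBF) kernel can be set up with the class SVC of scikit-learn; the generated sample and the parameter values C and gamma below are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# two overlapping Gaussian classes with 500 training inputs each (invented)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(1.5, 1.0, size=(500, 2))])
y = np.hstack([np.full(500, +1), np.full(500, -1)])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # Gaussian kernel
clf.fit(X, y)

# only the support vectors near the dividing surface are retained
print("support vectors:", clf.support_vectors_.shape[0], "of", len(X))
print("training error:", np.mean(clf.predict(X) != y))
```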

11.4.4 Decision Trees

Simple Trees

We consider the simple case of two-class classification, i.e. the assignment of inputs to one of two classes I and II, and N observations with P features x1, x2, . . . , xP , which we consider, as before, as the components of an input vector.

In the first step we consider the first component x11, x21, . . . , xN1 of all N input vectors of the training sample. We search for a value xc1 which optimally divides the two classes and obtain a division of the training sample into two parts A and B. Each of these parts, which belong to two different subspaces, is now treated further separately. Next we take the subspace A, look at the feature x2, and divide it, in the same way as the full space before, again into two parts. Analogously we treat the subspace B. Now we can switch to the next feature or return to feature 1 and perform further splittings. The sequence of divisions leads to smaller and smaller subspaces, each of them assigned to a certain class. This subdivision process can be regarded as the development of a decision tree for input vectors for which the class membership is to be determined. The growing of the tree is stopped by a pruning rule. The final partitions are called leaves.

In Fig. 11.13 we show schematically the subdivision into subspaces and the corresponding decision tree for a training sample of 32 elements with only two features. The training sample which determines the decisions is indicated. At the end of the tree (here at the bottom) the decision about the class membership is taken.

It is not obvious how one should optimize the sequence of partitions and the positions of the cuts, nor under which circumstances the procedure should be stopped.

For the optimization of splits we must again define a loss function which will depend on the given problem. A simple possibility in the case of two classes is to maximize for each splitting the difference ΔN = Nr − Nf between right and wrong assignments. We used this in our example Fig. 11.13. For the first division this quantity was equal to 20 − 12 = 8. To some extent the position of the splitting hyperplane is still arbitrary, since the loss function changes its value only when it hits the nearest input. It could, for example, be put at the center between the two nearest


[Plot: training sample in the x1–x2 plane; x1 from 0 to 5, x2 from 2 to 10.]

Fig. 11.13. Decision tree (bottom) corresponding to the classification shown above.

points. Often the importance of efficiency and purity is different for the two classes. Then we would choose an asymmetric loss function.

Very popular is the following, slightly more complicated criterion: We define the impurity PI of class I

PI = NI / (NI + NII) ,    (11.23)

which for optimal classification would be 1 or 0. The quantity

G = PI(1 − PI) + PII(1 − PII) ,    (11.24)

the Gini index, should be as small as possible. For each separation of a parent node E with Gini index GE into two child nodes A and B with Gini indices GA and GB, we minimize the sum GA + GB.

The difference

D = GE − GA − GB

is taken as the stopping or pruning parameter. The quantity D measures the increase in purity; it is large for a parent node with large G and two child nodes with small G. When D becomes less than a certain critical value Dc, the branch is not split further and ends at a leaf. The leaf is assigned to the class which has the majority in it.

Besides the Gini index, other measures of purity or impurity are also used [13]. An interesting quantity is the entropy S = −PI ln PI − PII ln PII , a well-known measure of disorder, i.e. of impurity.
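A sketch of a single split based on the Gini index (11.23), (11.24) and on the pruning parameter D is given below; the exhaustive cut search, the labels yi = ±1 and the critical value Dc are illustrative choices.

```python
import numpy as np

def gini(y):
    """Gini index (11.24) of a node with labels y = +1 (class I), -1 (class II)."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y == +1)          # impurity P_I of (11.23)
    return 2.0 * p * (1.0 - p)    # P_I(1 - P_I) + P_II(1 - P_II)

def best_split(x, y):
    """Best cut on one feature x: minimize G_A + G_B over all cut positions."""
    best_g, best_cut = np.inf, None
    for cut in np.unique(x):
        g = gini(y[x <= cut]) + gini(y[x > cut])
        if g < best_g:
            best_g, best_cut = g, cut
    return best_g, best_cut

def split_node(x, y, D_c=0.05):
    """Split only if the purity gain D = G_E - G_A - G_B exceeds D_c."""
    g_sum, cut = best_split(x, y)
    D = gini(y) - g_sum
    return cut if D > D_c else None   # None: the node becomes a leaf
```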


The purity parameter, e.g. G, is also used to organize the splitting sequence. We always choose the input vector component for which the splitting produces the most significant separation.

A further possibility would be to generalize the orthogonal splitting by also allowing non-orthogonal planes, to reach better separations. But in the standard case all components are treated independently.

Unfortunately, the classification by decision trees is usually not perfect. The discontinuity at the boundaries and the fixed splitting sequence impair the accuracy. On the other hand, they are simple, transparent and the corresponding computer programs are extremely fast.

Boosted Decision Trees

Boosting [75] is based on a simple idea: by a weighted superposition of many moderately effective classifiers it should be possible to reach a fairly precise assignment. Instead of only one decision tree, many different trees are grown. Each time, before the development of a new tree is started, wrongly assigned training inputs are boosted to higher weights in order to lower their probability of being wrongly classified in the following tree. The final class assignment is then done by averaging the decisions from all trees. Obviously, the computing effort for these boosted decision trees is increased, but the precision is significantly enhanced. The results of boosted decision trees are usually as good as those of ANNs. Their algorithm is very well suited for parallel processing. There are first applications in particle physics [76].

Before the first run, all training inputs have the weight 1. In the following run each input gets a weight wi, determined by a certain boosting algorithm (see below) which depends on the particular method. The definition of the node impurity P for calculating the loss function, see (11.23), (11.24), is changed accordingly to

 

 

PI = ΣI wi / (ΣI wi + ΣII wi) ,

where the sums ΣI , ΣII run over all events in class I or II, respectively. Again the weights will be boosted and the next run started. Typically M ≈ 1000 trees are generated in this way.

If we indicate the decision of a tree m for the input xi by Tm(xi) = 1 (for class I) and = −1 (for class II), the final result will be given by the sign of the weighted sum over the results from all trees

 

TM(xi) = sign( Σm=1…M αmTm(xi) ) .

We proceed in the following way: To the first tree we assign the weight α1 = 1. The weights of the wrongly assigned input vectors are increased. The weight¹² α2 of the second tree T2(x) is chosen such that the overall loss from all input vectors of the training sample is minimal for the combination [α1T1(x) + α2T2(x)] / [α1 + α2]. We continue in the same way and add further trees. For tree i the weight αi is optimized such that the existing trees are complemented in an optimal way. How this is done depends, of course, on the loss function.

¹² We have two kinds of weights: weights of input vectors (wi) and weights of trees (αm).


A well-tested recipe for the choice of weights is AdaBoost [75]. The training algorithm proceeds as follows:

• The i-th input xi gets the weight wi = 1 and the value yi = 1 (= −1) if it belongs to class I (II).

• Tm(xi) = 1 (= −1) if the input ends in a leaf belonging to class I (II). Sm(xi) = (1 − yiTm(xi))/2 = 1 (= 0) if the assignment is wrong (right).

• The fraction εm of weighted wrong assignments is used to change the weights for the next iteration:

εm = Σi wiSm(xi) / Σi wi ,

αm = ln((1 − εm)/εm) ,

wi → wi exp(αmSm(xi)) .

Weights of correctly assigned training inputs thus remain unchanged. For example, for εm = 0.1, wrongly assigned inputs will be boosted by a factor of 0.9/0.1 = 9. Note that αm > 0 if εm < 0.5; this is required because otherwise the replacement Tm(xi) → −Tm(xi) would produce a better decision tree.

• The response for a new input which is to be classified is

 

TM(xi) = sign( Σm=1…M αmTm(xi) ) .

For εm = 0.1 the weight of the tree is αm = ln 9 ≈ 2.20. For certain applications it may be useful to reduce the weight factors αm somewhat, for instance αm = 0.5 ln ((1 − εm)/εm) [76].
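The following sketch implements the AdaBoost recipe above. For brevity the individual trees are replaced by decision stumps (a single cut on a single feature), so it is a simplified stand-in for boosted decision trees; the function names are invented.

```python
import numpy as np

def train_stump(X, y, w):
    """Weak classifier: one cut on one feature, chosen to minimize the
    weighted error; returns a function T with values +-1."""
    best_err, best_par = np.inf, None
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > cut, sign, -sign)
                err = np.sum(w * (pred != y)) / np.sum(w)
                if err < best_err:
                    best_err, best_par = err, (j, cut, sign)
    j, cut, sign = best_par
    return lambda Z: np.where(Z[:, j] > cut, sign, -sign)

def adaboost(X, y, M=50):
    """Boost the weights of wrongly assigned inputs; y holds the labels +-1."""
    w = np.ones(len(y))                     # all training weights start at 1
    trees, alphas = [], []
    for _ in range(M):
        T = train_stump(X, y, w)
        S = (1 - y * T(X)) / 2              # 1 if wrong, 0 if right
        eps = max(np.sum(w * S) / np.sum(w), 1e-12)
        alpha = np.log((1 - eps) / eps)
        w = w * np.exp(alpha * S)           # boost the wrongly assigned inputs
        trees.append(T)
        alphas.append(alpha)
    def classify(Z):
        votes = sum(a * T(Z) for a, T in zip(alphas, trees))
        return np.sign(votes)               # weighted vote of all trees
    return classify
```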

11.4.5 Bagging and Random Forest

Bagging

The concept of bagging was first introduced by Breiman [83]. He has shown that the performance of unstable classifiers can be improved considerably by training many classifiers with bootstrap replicates and then using a majority vote of those: from a training sample containing N input vectors, N vectors are drawn at random with replacement. Some vectors will be contained several times. This bootstrap¹³ sample is used to train a classifier. Many classifiers, typically 100 or 1000, are produced in this way. New inputs are run through all trees and each tree “votes” for a certain classification. The classification receiving the majority of votes is chosen. In a study of real data [83] a reduction of error rates by bagging between 20% and 47% was found. There the bagging concept had been applied to simple decision trees; however, the concept is quite general and can also be applied to other classifiers.

¹³ We will discuss bootstrap methods in the following chapter.
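A sketch of the bagging procedure, here built on the decision trees of scikit-learn; the number of classifiers and the labels ±1 are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=100, seed=0):
    """Train n_trees classifiers, each on a bootstrap replicate
    (N inputs drawn with replacement from the N training inputs)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)         # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, Z):
    """Majority vote of all classifiers (labels assumed to be +-1)."""
    votes = sum(tree.predict(Z) for tree in trees)
    return np.sign(votes)
```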


Random Forest

Another new development [84], which includes the bootstrap idea, is the extension of the decision tree concept to the random forest classifier.

Many trees are generated from bootstrap samples of the training sample, but now part of the input vector components are suppressed. A tree is constructed in the following way: First m out of the M components or attributes of the input vectors are selected at random. The tree is grown in an m-dimensional subspace of the full input vector space. It is not obvious how m is to be chosen, but the author proposes m ≪ M and says that the results show little dependence on m. With large m the individual trees are powerful but strongly correlated. The value of m is the same for all trees.

From the N truncated bootstrap vectors, Nb are separated, put into a bag and reserved for testing. A fraction f = Nb/N ≈ 1/3 is proposed. The remaining ones are used to generate the tree. For each split, the attribute out of the m available attributes is chosen which gives the smallest number of wrong classifications. Each leaf contains only elements of a single class. There is no pruning.

Following the bagging concept, the classification of new input vectors is obtained by the majority vote of all trees.

The out-of-the-bag (oob) data are used to estimate the error rate. To this end, each oob-vector of the k-th sample is run through the k-th tree and classified. The fraction of wrong classifications from all oob vectors is the error rate. (For T trees there are in total T × Nb oob vectors.) The oob data can also be used to optimize the constant m.
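A sketch following this description: m randomly selected attributes per tree, a fraction f of each bootstrap sample put aside as oob data, trees grown without pruning, and a majority vote. The tree implementation is taken from scikit-learn and all parameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, n_trees=100, m=2, f=1/3, seed=0):
    """Grow n_trees trees; each uses m randomly chosen attributes and a
    bootstrap sample of which a fraction f is kept out of the bag (oob)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    forest, oob_wrong, oob_total = [], 0, 0
    for _ in range(n_trees):
        feats = rng.choice(M, size=m, replace=False)   # random attributes
        boot = rng.integers(0, N, size=N)              # bootstrap sample
        n_oob = int(f * N)
        oob, train = boot[:n_oob], boot[n_oob:]        # bag and training part
        tree = DecisionTreeClassifier()                # grows to pure leaves
        tree.fit(X[train][:, feats], y[train])
        # each oob vector is classified by the tree of its own sample
        oob_wrong += np.sum(tree.predict(X[oob][:, feats]) != y[oob])
        oob_total += n_oob
        forest.append((feats, tree))
    return forest, oob_wrong / oob_total               # forest and oob error

def forest_predict(forest, Z):
    """Majority vote of all trees (labels assumed to be +-1)."""
    votes = sum(tree.predict(Z[:, feats]) for feats, tree in forest)
    return np.sign(votes)
```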

The random forest classifier has received quite some interest. The concept is simple and seems to be about as powerful as other classifiers. It is especially well suited for large data sets in high dimensions.

11.4.6 Comparison of the Methods

We have discussed various methods for classification. Each of them has its advantages and its drawbacks. It depends on the specific problem which one is the most suitable.

The discriminant analysis offers itself for one- or two-dimensional continuous distributions (preferably Gaussians or other unimodal distributions). It is useful for event selection in simple situations.

Kernel methods are relatively easy to apply. They work well if the division line between classes is sufficiently smooth and transitions between different classes are continuous. Categorical variables cannot be treated. The variant with support vectors reduces computing time and the memory space needed for the storage of the training sample. In standard cases with not too extensive statistics one should avoid this additional complication. Kernel methods can perform event selection in more complicated environments than is possible with the primitive discriminant analysis. The price for the better performance is, however, a reduced possibility of interpreting the results.

Artificial neural networks are, due to the enormous number of free parameters, able to solve any problem in an optimal way. They suffer from the disadvantage that the user usually has to intervene to guide the minimizing process to a correct minimum. The user has to check and improve the result by changing the network
