
11.4 Classification


structure, the learning constant and the start values of the weights. New program packages are able to partially take over these tasks. ANN are able to separate classes in very involved situations and extract very rare events from large samples.

Decision trees are a very attractive alternative to ANN. One should, however, use boosted decision trees, random forests, or bagging, since these discriminate much better than simple trees. The advantage of simple trees is that they are very transparent and that they can be displayed graphically. Like ANN, decision trees can, with some modifications, also be applied to categorical variables.

At present, there is a lack of theoretical framework and experimental information on some of the new developments. We would like to know to what extent the different classifiers are equivalent and which classifier should be selected in a given situation. There will certainly be answers to these questions in the near future.

12 Auxiliary Methods

12.1 Probability Density Estimation

12.1.1 Introduction

In the subsection on function approximation we considered measurements y at fixed locations x, where y deviates from an unknown function because of statistical fluctuations. Now we start from a sample {x1, . . . , xN} that follows an unknown statistical distribution which we want to approximate. We have to estimate the density f̂(x) at the location x from the frequency of observations xi in the vicinity of x. The

corresponding technique, probability density estimation (PDE), is closely related to function approximation. Both problems are often treated together under the title smoothing methods. In this section we discuss only non-parametric approaches; a parametric method, where parameters are adjusted to approximate Gaussian-like distributions, has been described in Sect. 11.2.2. We will essentially present results and omit the derivations. For details the reader has to consult the specialized literature.

PDE serves mainly to visualize an empirical frequency distribution. Visualization of data is an important tool of scientific research. It can lead to new discoveries and often constitutes the basis of experimental decisions. PDE also helps to classify data and sometimes the density which has been estimated from some ancillary measurement is used in subsequent Monte Carlo simulations of experiments. However, to solve certain problems like the estimation of moments and other characteristic properties of a distribution, it is preferable to deal directly with the sample instead of performing a PDE. This path is followed by the bootstrap method which we will discuss in a subsequent section. When we have some knowledge about the shape of a distribution, then PDE can improve the precision of the bootstrap estimates. For instance there may exist good reasons to assume that the distribution has only one maximum and/or it may be known that the random variable is restricted to a certain region with known boundary.

The PDE f̂(x) of the true density f(x) is obtained by a smoothing procedure applied to the discrete experimental distribution of observations. This means that some kind of averaging is done, which introduces a bias; the bias is especially large where the distribution f(x) varies strongly in the vicinity of x.

The simplest and most common way to measure the quality of the PDE is to evaluate the integrated square error (ISE) L2


L_2 = \int_{-\infty}^{\infty} \left[ \hat f(x) - f(x) \right]^2 dx

and its expectation value E(L2), the mean integrated square error (MISE)¹. The mean quadratic difference E{[f̂(x) − f(x)]²} has two components, according to the usual decomposition:

E\left\{ \left[ \hat f(x) - f(x) \right]^2 \right\} = \mathrm{var}\left[ \hat f(x) \right] + \mathrm{bias}^2\left[ \hat f(x) \right] .

The first term, the variance caused by statistical fluctuations, decreases with increasing smoothing, while the second term, the squared bias, decreases with decreasing smoothing. The challenge is to find the optimal balance between these two contributions.
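To illustrate the decomposition numerically, the following minimal Python sketch (NumPy assumed; the standard normal test density, the evaluation point and all numerical settings are illustrative choices) estimates the variance and the squared bias of a simple fixed-interval density estimate at one point by repeating the experiment many times. A narrow interval gives small bias but large variance, a wide interval the opposite.

```python
import numpy as np

rng = np.random.default_rng(1)
f_true = lambda x: np.exp(-x * x / 2) / np.sqrt(2 * np.pi)  # test density: standard normal

x0, N, repetitions = 0.5, 200, 5000
for half_width in (0.05, 0.5, 2.0):          # little / moderate / strong smoothing
    estimates = np.empty(repetitions)
    for i in range(repetitions):
        sample = rng.normal(size=N)
        inside = np.abs(sample - x0) < half_width
        # density estimate: fraction of events in the interval divided by its length
        estimates[i] = inside.sum() / (N * 2 * half_width)
    var = estimates.var()
    bias2 = (estimates.mean() - f_true(x0)) ** 2
    print(f"half_width={half_width:4.2f}  var={var:.5f}  bias^2={bias2:.5f}  sum={var + bias2:.5f}")
```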

We will give a short introduction to PDE, mainly restricted to one-dimensional distributions. The generalization of the simpler methods to multi-dimensional distributions is straightforward, but for the more sophisticated ones it is more involved. A rather complete and comprehensive overview can be found in the books by J. S. Simonoff [85], A. W. Bowman and A. Azzalini [86], D. W. Scott [87] and W. Härdle et al. [89]. A summary is presented in an article by D. W. Scott and S. R. Sain [88].

12.1.2 Fixed Interval Methods

Histogram Approximation

The simplest and most popular method of density estimation is histogramming with fixed bin width. If νk of the N events fall into bin Bk of width h, the estimated density is

\hat f(x) = \frac{\nu_k}{N h} \quad \text{for } x \in B_k .
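As a concrete illustration, a minimal Python sketch of this estimator is given below (NumPy assumed; the toy sample, the range and the bin width are arbitrary illustrative choices; points outside the range are simply assigned to the outermost bin, a simplification):

```python
import numpy as np

def histogram_pde(sample, h, x_min, x_max):
    """Histogram density estimate f_hat(x) = nu_k / (N h) on fixed-width bins."""
    edges = np.arange(x_min, x_max + h, h)
    counts, edges = np.histogram(sample, bins=edges)
    N = sample.size
    def f_hat(x):
        # index of the bin containing x (clipped to the outermost bins)
        k = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, counts.size - 1)
        return counts[k] / (N * h)
    return f_hat

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
f_hat = histogram_pde(sample, h=0.35, x_min=-4.0, x_max=4.0)
print(f_hat(np.array([-1.0, 0.0, 1.0])))  # estimated density at three points
```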

It is easy to construct, transparent, does not contain hidden parameters which often are included in other more sophisticated methods, and indicates quite well which distributions are compatible with the data. However it has, as we have repeatedly stated, the disadvantage of the rather arbitrary binning and of its discontinuity. Fine binning provides a good resolution of structures and low bias but has to be paid for by large fluctuations. Histograms with wide bins have the advantage of small statistical errors but are biased. A sensible choice for the bin width h is derived from the requirement that the mean integrated square error should be as small as possible. The MISE for a histogram is

\mathrm{MISE} = \frac{1}{Nh} + \frac{1}{12}\, h^2 \int f'(x)^2\, dx + O(h^4) + O(N^{-1}) . \qquad (12.1)

The integral ∫ f′(x)²dx = R(f′) is called the roughness. For a normal density with variance σ² it is R = (4√π σ³)⁻¹. Neglecting the small terms (h → 0), we can derive [88] the optimal bin width h and the corresponding asymptotic mean integrated square error AMISE:

¹ The estimate f̂(x) is a function of the set {x1, . . . , xN} of random variables and thus also a random variable.


[Fig. 12.1: two histograms of the number of events versus x, left panel "signal sample", right panel "background reference sample".]

Fig. 12.1. Experimental signal with some background (left) and background reference sample with twice the exposure time (right). The fitted signal and background functions are indicated.

h \approx \left( \frac{6}{N \int f'(x)^2\, dx} \right)^{1/3} \approx 3.5\, \sigma N^{-1/3}, \qquad (12.2)

\mathrm{AMISE} \approx \left( \frac{9 \int f'(x)^2\, dx}{16 N^2} \right)^{1/3} \approx 0.43\, N^{-2/3} / \sigma .

The second part of relation (12.2) holds for a Gaussian p.d.f. with variance σ² and is a reasonable approximation for a distribution with typical width σ. Even though the derivative f′ and the bandwidth² σ are not precisely known, they can be estimated from the data. As expected, the optimal bin width is proportional to the bandwidth, whereas its N−1/3 dependence on the sample size N is less obvious.
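In practice one replaces σ, and if needed ∫f′(x)²dx, by estimates obtained from the data. A minimal Python sketch (NumPy assumed; the Gaussian toy sample is an illustrative choice) compares the bin width obtained from the normal-reference roughness in (12.2) with the shortcut 3.5σN−1/3:

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.0, scale=1.5, size=5000)
N, sigma = sample.size, sample.std(ddof=1)

# normal-reference roughness R(f') = 1/(4*sqrt(pi)*sigma^3), then h from relation (12.2)
roughness = 1.0 / (4.0 * np.sqrt(np.pi) * sigma**3)
h_exact = (6.0 / (N * roughness)) ** (1.0 / 3.0)
h_rule = 3.5 * sigma * N ** (-1.0 / 3.0)
print(f"h from (12.2): {h_exact:.3f},  shortcut 3.5*sigma*N^(-1/3): {h_rule:.3f}")
```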

In d dimensions similar relations hold. Of course the N-dependence has to be modified. For d-dimensional cubical bins the optimal bin width scales with N−1/(d+2) and the mean square error scales with N−2/(d+2).

Example 142. PDE of a background distribution and signal fit

We analyze a signal sample containing a Gaussian signal N(x|µ, σ) with unknown location and scale parameters µ, σ on top of some unknown background. In addition, we have at our disposal a reference sample containing only background. From the data taking times and fluxes we know that the background in the signal sample should nominally be half (r = 0.5) of that in the reference sample. In Fig. 12.1 we show the two experimental distributions.

² Contrary to the usual meaning of the term, in PDE the bandwidth describes the typical width of structures of the distribution. For a Gaussian it equals the standard deviation.


From the shape of the experimental background distribution we estimate the slope of its p.d.f. h(x) to be about 0.05 and, using relation (12.2), find a bin width of 2 units. The heights β1, β2, β3, β4 of the 4 equally wide bins of the histogram distribution are left as free parameters in the fit. Because of the normalization, we have β4 = 1 − β1 − β2 − β3. Further parameters are the expected rate of background events in the reference sample, ρ, and the fraction φ of true signal events in the signal sample. These 7 parameters are to be determined in a likelihood fit.

The log-likelihood function ln L = ln L1 + ln L2 + ln L3 + ln L4 comprises 4 terms, with:

1. L1, the likelihood of the ns events in the signal sample (superposition of signal and background distribution):

\ln L_1(\mu, \sigma, \varphi, \beta_1, \beta_2, \beta_3) = \sum_{i=1}^{n_s} \ln \left[ \varphi\, N(x_i|\mu, \sigma) + (1 - \varphi)\, h(x_i|\beta_1, \beta_2, \beta_3) \right] .

2. L2, the likelihood of the nr events of the reference sample (background distribution):

 

\ln L_2(\beta_1, \beta_2, \beta_3) = \sum_{i=1}^{n_r} \ln h(x_i|\beta_1, \beta_2, \beta_3) .

3. L3, the likelihood to observe nr reference events where ρ are expected (Poisson distribution):

ln L3 = −ρ + nr ln ρ − ln nr! .

4. L4, the likelihood to get ns(1 − φ) background events³ in the signal sample where rρ are expected (Poisson distribution):

ln L4 = −rρ + ns(1 − φ) ln(rρ) − ln {[ns(1 − φ)]!} .

The results of the fit are indicated in Fig. 12.1, which shows histograms of the observed events. The MLEs of the interesting parameters are µ = −0.18 ± 0.32, σ = 0.85 ± 0.22, φ = 0.60^{+0.05}_{−0.09}; the correlation coefficients are of the order of 0.3. We abstain from showing the full error matrix. The samples have been generated with the nominal parameter values µ0 = 0, σ0 = 1, φ = 0.6. To check the influence of the background parametrization, we repeat the fit with only two bins. The results change very little, to µ = −0.12 ± 0.34, σ = 0.85 ± 0.22, φ = 0.57^{+0.07}_{−0.11}. When we represent the background p.d.f. by a polygon (see the next section) instead of a histogram, the result again remains stable. We then get µ = −0.20 ± 0.30, σ = 0.82 ± 0.21, φ = 0.60^{+0.05}_{−0.09}. The method which we have applied in the present example is more precise than that of Sect. 6.5.10 but depends to a certain degree on the presumed shape of the background distribution.
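The structure of such a fit can be sketched in a few lines of Python. The sketch below minimizes −ln L with SciPy's general-purpose minimizer; the toy data, the helper names (h_pdf, neg_log_L) and the optimizer settings are illustrative choices and do not reproduce the actual samples of Fig. 12.1, but the four likelihood terms follow the expressions given above, with n! replaced by Γ(n + 1) via gammaln as recommended in the footnote.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln
from scipy.stats import norm

rng = np.random.default_rng(3)
# toy data standing in for Fig. 12.1: Gaussian signal plus uniform background,
# and a pure background reference sample with twice the exposure (r = 0.5)
r = 0.5
x_sig = np.concatenate([rng.normal(0.0, 1.0, 60), rng.uniform(-4.0, 4.0, 40)])
x_ref = rng.uniform(-4.0, 4.0, 80)
n_s, n_r = x_sig.size, x_ref.size
edges = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])          # 4 bins of width 2

def h_pdf(x, beta):
    """Histogram background p.d.f.; beta = (b1, b2, b3), b4 = 1 - b1 - b2 - b3."""
    b = np.append(beta, 1.0 - np.sum(beta))
    k = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, 3)
    return b[k] / 2.0                                   # bin width = 2

def neg_log_L(p):
    mu, sigma, phi, b1, b2, b3, rho = p
    beta = np.array([b1, b2, b3])
    if sigma <= 0 or rho <= 0 or not (0 < phi < 1) or np.any(beta <= 0) or beta.sum() >= 1:
        return np.inf
    lnL1 = np.sum(np.log(phi * norm.pdf(x_sig, mu, sigma) + (1 - phi) * h_pdf(x_sig, beta)))
    lnL2 = np.sum(np.log(h_pdf(x_ref, beta)))
    lnL3 = -rho + n_r * np.log(rho) - gammaln(n_r + 1)
    n_b = n_s * (1 - phi)                               # non-integer: n! -> Gamma(n + 1)
    lnL4 = -r * rho + n_b * np.log(r * rho) - gammaln(n_b + 1)
    return -(lnL1 + lnL2 + lnL3 + lnL4)

start = np.array([0.0, 1.0, 0.5, 0.25, 0.25, 0.25, float(n_r)])
result = minimize(neg_log_L, start, method="Nelder-Mead",
                  options={"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-6})
print(result.x)   # MLE of (mu, sigma, phi, beta1, beta2, beta3, rho)
```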

Linear and Higher Order Parabolic Approximation

In the previous chapter we had adjusted spline functions to measurements with errors. Similarly, we can use them to approximate the probability density. We will consider

³ The fit will favor a natural number. It is recommended to replace n! here by Γ(n + 1).


here only the linear approximation by a polygon, but it is obvious that the method can be extended to higher order parabolic functions. The discontinuity corresponding to the steps between bins is avoided when we transform the histogram into a polygon: we just connect the points corresponding to the histogram values at the centers of the bins. It can be shown that this reduces the MISE considerably, especially for large samples. The optimal bin width in the one-dimensional case now depends on the average second derivative f′′ of the p.d.f.; it is much wider than for a histogram, and the error is smaller [88] than in the corresponding histogram case:

h \approx 1.6 \left( \frac{1}{N \int f''(x)^2\, dx} \right)^{1/5} ,

\mathrm{MISE} \approx 0.5 \left( \frac{\int f''(x)^2\, dx}{N^4} \right)^{1/5} .

In d dimensions the optimal bin width for polygon bins scales with N−1/(d+4) and the mean square error scales with N−4/(d+4).
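A minimal Python sketch of the polygon construction (NumPy assumed; the toy sample, range and bin width are illustrative choices) simply interpolates linearly between the histogram values at the bin centers:

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(size=2000)
N = sample.size

h = 0.8                                          # polygon bins may be wider than histogram bins
edges = np.arange(-4.0, 4.0 + h, h)
counts, edges = np.histogram(sample, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])
heights = counts / (N * h)                       # histogram density values at the bin centers

def polygon_pde(x):
    """Linear interpolation between the histogram values at the bin centers."""
    # outside the outermost centers the estimate is held constant here, a simplification
    return np.interp(x, centers, heights)

print(polygon_pde(np.array([-1.0, 0.0, 1.0])))
```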

12.1.3 Fixed Number and Fixed Volume Methods

To estimate the density at a point x, an obvious procedure is to divide the number k of observations in the neighborhood of x by the volume V which they occupy, f̂(x) = k/V. Either we can fix k and compute the corresponding volume V(x), or we can choose V and count the number of observations contained in that volume. The quadratic uncertainty is σ² = k + bias², hence the former emphasizes a fixed statistical uncertainty while the latter rather aims at small variations of the bias.

The k-nearest neighbor method avoids large fluctuations in regions where the density is low. We obtain a constant statistical error if we estimate the density from the spherical volume V_k(x) taken by the k nearest neighbors of the point x:

\hat f(x) = \frac{k}{V_k(x)} . \qquad (12.3)

Like many other PDE methods, the k-nearest neighbor method is problematic in regions with large curvature of f and at the boundaries of the domain of x.

Instead of fixing the number of observations k in relation (12.3), we can fix the volume V and determine k. The strong variations of the bias of the k-nearest neighbor method are somewhat reduced, but both methods suffer from the same deficiencies: the boundary bias and a loss of precision due to the sharp cut-off caused by fixing either k or V. Furthermore it is not guaranteed that the estimated density is normalized to one, hence a renormalization has to be performed.

The main advantage of fixed number and fixed volume methods is their simplicity.
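A minimal Python sketch of the one-dimensional k-nearest neighbor estimate is given below (NumPy assumed; k and the toy sample are illustrative choices). The interval covered by the k nearest neighbors plays the role of V_k(x); an additional factor 1/N is included here so that the sketch returns an estimate normalized to roughly unit area.

```python
import numpy as np

def knn_pde(sample, k):
    """1-d k-nearest-neighbor density estimate, k / (N * V_k(x)) with V_k the covering interval."""
    data = np.sort(sample)
    N = data.size
    def f_hat(x):
        d = np.abs(data[None, :] - np.atleast_1d(x)[:, None])
        d.sort(axis=1)
        V_k = 2.0 * d[:, k - 1]            # interval covered by the k nearest neighbors
        return k / (N * V_k)
    return f_hat

rng = np.random.default_rng(5)
sample = rng.normal(size=1000)
f_hat = knn_pde(sample, k=50)
print(f_hat(np.array([-1.0, 0.0, 1.0])))
```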

12.1.4 Kernel Methods

We now generalize the fixed volume method and replace (12.3) by

\hat f(x) = \frac{1}{N V} \sum_{i=1}^{N} K(x - x_i) ,


where the kernel K is equal to 1 if xi lies inside the sphere of volume V centered at x and 0 otherwise. Obviously, smooth kernel functions are more attractive than the uniform kernel of the fixed volume method. An obvious candidate for the kernel function K(u) in the one-dimensional case is the Gaussian exp[−u²/(2h²)]. Also very popular is the parabolically shaped Epanechnikov kernel, (1 − c²u²) for |cu| ≤ 1 and zero elsewhere. Here c is a scaling constant to be adjusted to the bandwidth of f. Under very general conditions the Epanechnikov kernel minimizes the asymptotic mean integrated square error AMISE, obtained in the limit where the effective bin width tends to zero, but other kernels perform nearly as well: the AMISE of the Gaussian kernel is only 5 % larger and that of the uniform kernel 8 % larger [85].

The optimal bandwidth of the kernel function obviously depends on the true density. For example, for a Gaussian true density f(x) with variance σ², the optimal bandwidth h of a Gaussian kernel is hG ≈ 1.06σN−1/5 [85], and the corresponding constant c of the Epanechnikov kernel is c ≈ 2.2/(2hG). In practice, we will have to replace the Gaussian σ in this relation by some estimate depending on the structure of the observed data. The AMISE of the kernel PDE converges at the rate N−4/5, while the corresponding rate for the histogram is only N−2/3.
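The following minimal Python sketch (NumPy assumed; the toy sample is an illustrative choice) implements a fixed-bandwidth Gaussian kernel estimate with the normal-reference bandwidth hG ≈ 1.06σN−1/5 quoted above:

```python
import numpy as np

def gaussian_kde_1d(sample):
    """Fixed-bandwidth Gaussian kernel estimate, h = 1.06 * sigma * N^(-1/5)."""
    N = sample.size
    sigma = sample.std(ddof=1)
    h = 1.06 * sigma * N ** (-0.2)
    def f_hat(x):
        u = (np.atleast_1d(x)[:, None] - sample[None, :]) / h
        return np.exp(-0.5 * u * u).sum(axis=1) / (N * h * np.sqrt(2 * np.pi))
    return f_hat

rng = np.random.default_rng(6)
sample = rng.normal(size=2000)
f_hat = gaussian_kde_1d(sample)
print(f_hat(np.array([-1.0, 0.0, 1.0])))   # should be close to the standard normal density
```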

12.1.5 Problems and Discussion

The simple PDE methods sketched above suffer from several problems, some of which are unavoidable:

1. The boundary bias: When the variable x is bounded, say x < a, then f̂(x) is biased downwards (unless f(a) = 0) if the averaging process includes the region x > a where we have no data. When the averaging is restricted to the region x < a, the bias is positive (negative) for a distribution decreasing (increasing) towards the boundary. In both cases the size of the bias can be estimated and corrected for, using so-called boundary kernels.

2. Many smoothing methods do not guarantee normalization of the estimated probability density. While this effect can easily be corrected for by renormalizing f̂, it indicates some problem of the method.

3. Fixed bandwidth methods over-smooth in regions where the density is high and tend to produce fake bumps in regions where the density is low. Variable bandwidth kernels are able to avoid this effect partially. Their bandwidth is chosen inversely proportional to the square root of the density, h(xi) = h0 f(xi)−1/2. Since the true density is not known, f must be replaced by a first estimate, obtained for instance with a fixed bandwidth kernel (a minimal sketch follows after this list).

4. Kernel smoothing corresponds to a convolution of the discrete data distribution with a smearing function and thus unavoidably tends to flatten peaks and to fill up valleys. This is especially pronounced where the distribution shows strong structure, that is, where the second derivative f′′ is large. Convolution, and thus also PDE, implies a loss of some of the information contained in the original data. This defect may be acceptable if we gain sufficiently from the knowledge about f that we put into the smoothing procedure. In the simplest case this is only the fact that the distribution is continuous and differentiable, but in some situations the asymptotic behavior of f may also be given, or we may know that it is unimodal. Then we will try to implement this information in the smoothing method.
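The variable-bandwidth idea of item 3 can be sketched as a two-step procedure in Python (NumPy assumed; the pilot bandwidth rule, the value of h0 and the toy sample are illustrative choices): a fixed-bandwidth pilot estimate provides f at the sample points, which then sets the local bandwidths h(xi) = h0 f(xi)−1/2.

```python
import numpy as np

def adaptive_gaussian_kde(sample, h0):
    """Two-step variable-bandwidth KDE: pilot fixed-bandwidth estimate, then h_i = h0 * f_pilot(x_i)^(-1/2)."""
    N = sample.size
    sqrt2pi = np.sqrt(2 * np.pi)

    def kde(x, centers, widths):
        u = (np.atleast_1d(x)[:, None] - centers[None, :]) / widths[None, :]
        return (np.exp(-0.5 * u * u) / widths[None, :]).sum(axis=1) / (N * sqrt2pi)

    h_pilot = 1.06 * sample.std(ddof=1) * N ** (-0.2)
    f_pilot = kde(sample, sample, np.full(N, h_pilot))     # first estimate at the sample points
    h_i = h0 * f_pilot ** (-0.5)                           # local bandwidths
    return lambda x: kde(x, sample, h_i)

rng = np.random.default_rng(7)
sample = rng.normal(size=1000)
f_hat = adaptive_gaussian_kde(sample, h0=0.1)
print(f_hat(np.array([0.0, 2.5])))
```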


Fig. 12.2. Estimated probability density. Left: nearest neighbor; center: Gaussian kernel; right: polygon.

Some of the remedies for the difficulties mentioned above use estimates of f and its derivatives. Thus iterative procedures seem to be the solution. However, the iteration process usually does not converge and thus has to be supervised and stopped before artifacts appear.

In Fig. 12.2 three simple smoothing methods are compared. A sample of 1000 events has been generated from the function shown as a dashed curve in the figure. A k-nearest neighbor approximation of the p.d.f. is shown in the left-hand graph of Fig. 12.2. The value of k chosen was 100, which is too small to produce enough smoothing but too large to follow the distribution at the left-hand border. The result of the PDE with a Gaussian kernel of fixed width is presented in the central graph, and a polygon approximation is shown in the right-hand graph. All three graphs show the typical defects of simple smoothing methods: broadening of the peak and fake structures in the region where the statistics is low.

As an alternative to the standard smoothing methods, complementary approaches often produce better results. The typical smoothing problems can partially be avoided when the p.d.f. is parametrized and adjusted to the data sample in a likelihood fit. A simple parametrization is the superposition of normal distributions,

f(x) = \sum_i \alpha_i N(x; \mu_i, \Sigma_i) ,

with the free parameters: weights αi, mean values µi and covariance matrices Σi.
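As one concrete possibility, such a fit can be carried out with the EM algorithm. The following minimal Python sketch (NumPy assumed; the two-component toy sample, the starting values and the fixed number of iterations are illustrative choices) adjusts weights, means and widths of a one-dimensional Gaussian mixture:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_mixture_em(x, n_components=2, n_iter=200, seed=0):
    """Maximum-likelihood fit of a 1-d Gaussian mixture by the EM algorithm."""
    rng = np.random.default_rng(seed)
    alpha = np.full(n_components, 1.0 / n_components)
    mu = rng.choice(x, n_components, replace=False)
    sigma = np.full(n_components, x.std(ddof=1))
    for _ in range(n_iter):
        # E step: responsibilities gamma_ik
        dens = alpha[None, :] * normal_pdf(x[:, None], mu[None, :], sigma[None, :])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: update weights, means, widths
        n_k = gamma.sum(axis=0)
        alpha = n_k / x.size
        mu = (gamma * x[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / n_k)
    return alpha, mu, sigma

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(-1.5, 0.5, 600), rng.normal(1.0, 1.0, 400)])
print(fit_mixture_em(x))
```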

If information about the shape of the distribution is available, more specific parametrizations which describe the asymptotic behavior can be applied. Distributions which resemble a Gaussian should be approximated by the Gram-Charlier series (see the last paragraph of Sect. 11.2.2). If the data sample is sufficiently large and the distribution is unimodal with known asymptotic behavior, the construction of the p.d.f. from the moments as described in [74] is quite efficient.

Physicists use PDE mainly for the visualization of data. Here, in one dimension, histogramming is the standard method. When the estimated distribution is used to simulate an experiment, frequency polygons are to be preferred. Whenever a useful parametrization is at hand, PDE should be replaced by an adjustment of the corresponding parameters in a likelihood fit. Only in rare situations does it pay to


construct complicated kernels. For a quick qualitative illustration of a distribution, off-the-shelf programs may do. For most quantitative evaluations of moments and parameters of the unknown distribution we recommend the bootstrap method, which is discussed in the following section.

12.2 Resampling Techniques

12.2.1 Introduction

In the previous section we have discussed how to construct an approximation of a distribution from a sample drawn from it. Knowing the distribution allows us to calculate certain parameters like moments or quantiles. In most cases, however, it is preferable to determine the wanted quantities directly from the sample. A trivial example of this approach is the estimation of the mean value and the variance from a series of measurements, as discussed in Chap. 4 on error calculation, where we used the sample mean x̄ = Σi xi/N and the empirical variance s² = Σi (xi − x̄)²/(N − 1). In a similar way we can also determine higher moments, correlations, quantiles and other statistical parameters. However, the analytical derivation of the corresponding expressions is often not as simple as that of the mean value or the variance. Here the bootstrap concept, which to a large extent has been developed by Efron, helps. A comprehensive presentation of the method is given in Ref. [64], which has served as the basis for this section.

The name bootstrap goes back to the famous book by Rudolf Erich Raspe in which he narrates the adventures of the lying Baron von Münchhausen. Münchhausen had pretended to have saved himself out of a swamp by pulling himself up by his own bootstraps⁴. In statistics, the expression bootstrap is used because from a small sample the quasi-complete distribution is generated. There is not quite as much lying as in Münchhausen's stories.

The bootstrap concept is based upon a simple idea: The sample itself replaces the unknown distribution. The sample is the distribution from which we draw individual observations.

As already mentioned, the bootstrap method permits us, apart from error estimation, to compute p-values for significance tests and the error rate in classifications. It relies, as will be shown in subsequent examples, on the combination of randomly selected observations.
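A minimal Python sketch of the basic recipe (NumPy assumed; the exponential toy sample and the choice of the median as the statistic are illustrative) draws samples of size N with replacement from the observed sample and takes the spread of the recomputed statistic as its error estimate:

```python
import numpy as np

def bootstrap_error(sample, statistic, n_boot=2000, seed=0):
    """Bootstrap estimate of the statistical error of an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    N = sample.size
    replicas = np.array([statistic(rng.choice(sample, size=N, replace=True))
                         for _ in range(n_boot)])
    return replicas.std(ddof=1)

rng = np.random.default_rng(9)
sample = rng.exponential(scale=2.0, size=100)
print("median =", np.median(sample), "+/-", bootstrap_error(sample, np.median))
```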

A subvariant of the bootstrap technique is called jackknife which is mainly used to estimate biases from subsets of the data.

In Chap. 10, where we evaluated the distribution of the energy test statistic in two-sample problems, we have used another resampling technique: we reshuffled the elements of the two partitions by applying random permutations. Whereas in the bootstrap method elements are drawn with replacement, permutations generate samples in which every element occurs only a single time.
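For comparison, a permutation-based two-sample test can be sketched as follows (NumPy assumed; the absolute difference of the sample means is used as a simple stand-in for the energy test statistic of Chap. 10, and the toy samples are illustrative choices):

```python
import numpy as np

def permutation_p_value(a, b, statistic, n_perm=5000, seed=0):
    """Permutation test: reshuffle the pooled sample, re-split, and compare the statistic."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = statistic(a, b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:a.size], pooled[a.size:]) >= observed:
            count += 1
    return count / n_perm

stat = lambda a, b: abs(a.mean() - b.mean())   # simple stand-in for the energy test statistic
rng = np.random.default_rng(10)
a, b = rng.normal(0.0, 1.0, 50), rng.normal(0.4, 1.0, 60)
print("p-value =", permutation_p_value(a, b, stat))
```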

The reason for not using all possible permutations is simply that their number is in most cases excessively large and a finite random sample provides sufficiently precise results. While bootstrap techniques are used mainly to extract parameters

⁴ In the original version he pulls himself up by his hair.
