vstatmp_engl
.pdf5.3 Solution of Integrals 
127 
other polygon in the circle and considering only the area between the polygons.
b) Importance Sampling
If there exists a majorant m(x) for the function y(x) to be integrated,
x 

I = Zxab y(x)dx , 
(5.4) 
with the property that the indeﬁnite integral M(x)
Z x
M(x) = m(x′)dx′
xa
can be inverted, we generate N0 xvalues according to the distribution m(x). For each xi a further random number yi in the interval 0 < y < m(xi) is generated. Again, as for the simulation of distributions, points lying above y(xi) are rejected. The number N of the remaining events provides the integral
I = M(x ) 
N 


N0 . 

b 
b 
5.3.3 Weighting Method
a) Simple Weighting
We generate N random numbers xi in the interval xa < x < xb and average over the
function values:
XN
y = y(xi)/N .
i=1
An estimate for the integral (5.4) is given by
b 
b − a y . 

I = (x 
x ) 


This method corresponds to the usual numerical integration, with the peculiarity that the supporting points on the abscissa are not chosen regularly but are distributed at random. This alone cannot be an advantage, and indeed the Monte Carlo integration in one and two dimensions for a given number of supporting points is less e cient than conventional methods. It is, however, superior to other methods for multidimensional integrations. Already in three dimensions it competes favorably in many cases.
To estimate the accuracy, we apply the usual statistical error estimation. We consider the numbers yi = y(xi) as N stochastic measurements of y. The expected mean squared error of y is then given by (4.3):




)2 = 
1 
X (yi − 

)2 . 


(δy 
y 


N(N − 1) 

The relative errors of 


I 





y and b are the same, 
128 5 Monte Carlo Simulation
1.0 


f(x) 
analytically integrable 

0.5 

difference function 



0.00 
2 
4 


x 
Fig. 5.13. Monte Carlo integration of the di erence between the function to be integrated and an integrable function.
b 
2 














δI 
! = 








2 









δy 



, 






I 

















y 

− 


b 



P 



)2 




= 




(yi 
− 
y 
. 
(5.5) 





N(N 


1) 
y 
2 

The numerator is an estimate of the variance of the y distribution. The accuracy is the better, the smaller the ﬂuctuations of the function around its mean value are.
b) Subtraction method
The accuracy can be improved through a reduction of the ﬂuctuations of the integrand.
If we ﬁnd a function ye(x) which is integrable analytically and does not di er too much from the original integrand y(x) we cast the integral into the form
x 
x 
x 
(y(x) − y(x)) dx . 
Zxab y(x)dx = 
Zxab y(x)dx + Zxab 



e 
e 
We now have to evaluate by Monte Carlo only the second term with relatively small ﬂuctuations (Fig. 5.13).
5.3.4 Reduction to Expected Values
In many cases it makes sense to factorize the integrand y(x) = f(x)y1(x) into a factor f(x) corresponding to a p.d.f. normalized inside the integration interval which
5.4 General Remarks 
129 
is easy to generate, and a second factor y1(x). To be e ective, the method requires that f is close to y. Our integral has now the form of an expected value:
x 
x 
Zxab y(x)dx = 
Zxab f(x)y1(x)dx 
= hy1i .
We generate values xi distributed according to f(x) and obtain from these an estimate for the integral I:




y1(xi) 







b 
I = 
Pi N 
− 
, 






2 

P 







δI 
! 
b 

[y1(xi) − 


1]2 
. 

= 
y 









I 


N(N 

1) 

12 




y 

b 











The estimate is again the better, the less the y1values are ﬂuctuating, i.e. the more similar the functions y and f are. The error estimate is analogous to (5.5).
5.3.5 Stratiﬁed Sampling
In stratiﬁed sampling the domain of integration is partitioned into subdomains. Over each of these we integrate separately. The advantage is that the distribution in each subdomain is more uniform and thus the ﬂuctuations of the random variables are smaller and the statistical error is reduced. This method is somewhat antithetical to the basic idea of the simple Monte Carlo method, since it produces a more uniform (equidistant) distribution of the supporting points and requires some e ort to combine the errors from the di erent contributions. Thus we recommend it only if the integrand shows very strong variations.
5.4 General Remarks
Often we need to solve integrals over di erent domains but always with the same integrand. In these cases the Monte Carlo approach is particularly advantageous. We store all single simulated values (usually called “events”) and are able to select events afterwards according to the chosen domain, and obtain the integral with relatively small computing expense by summation. Similarly a change of event weights is possible without repeating the generation of the events.
Let us illustrate this feature with a mechanical example: If, for instance, we want to obtain the tensor of inertia for a complex mass distribution like a car, we distribute points stochastically within the body and store their coordinates together with the respective mass densities. With these data it is easy to calculate by summations the mass, the center of mass and the moments of inertia with respect to arbitrary axes. If desired, parts of the body can be eliminated simply by rejecting the corresponding points in the sums and di erent materials can be considered by changing the density.
In thermodynamic systems we are often interested in several mean values, like the mean free path length, mean kinetic or potential energy, velocities etc.. Once a
130 5 Monte Carlo Simulation
statistical ensemble has been generated, all these quantities are easily obtained, while with the usual integration methods, one has to repeat each time the full integration.
Even more obvious are these advantages in acceptance calculations. Big experiments in particle physics and other areas have to be simulated as completely and realistically as allowed by the available computing power. The acceptance of a given system of particle detectors for a certain class of events is found in two steps: ﬁrst, a sample of interesting events is generated and the particles produced are traced through the detecting apparatus. The hits in various detectors together with other relevant information (momenta, particle identities) are stored in data banks. In a second step the desired acceptance for a class of events is found by simulating the selection procedure and counting the fraction of events which are retained. Arbitrary changes in the selection procedure are readily implemented without the need to simulate large event samples more than once.
Finally, we want to stress again how easy it is to estimate the errors of Monte Carlo integration. It is almost identical1 to the error estimation for the experimental data. We usually will generate a number of Monte Carlo reactions which is large enough to neglect their statistical error compared to the experimental error. In other words, the number of Monte Carlo events should be large compared to the number of experimental events. Usually a factor of ten is su cient, a higher factor is reasonable if enough computing power is available.
1The Monte Carlo errors are usually described by the binomial distribution, those of the experimental data by the Poisson distribution.
6
Parameter Inference I
6.1 Introduction
We now leave the probability calculus and its simple applications and turn to the ﬁeld of statistics. More precisely we are concerned with inferential statistics.
While the probability calculus, starting from distributions, predicts properties of random samples, in statistics, given a data sample, we look for a theoretical description of the population from which it has been derived by some random process. In the simplest case, the sample consists of independent observations, randomly drawn from a parent population. If not speciﬁed di erently, we assume that the population is a collection of elements which all follow the same discrete or continuous distribution. Frequently, the sample consists of data collected in a measurement sequence.
Usually we either want to check whether our sample is compatible with a speciﬁc theory, or we decide between several theories, or we infer unknown parameters of a given theory.
To introduce the problem, we discuss three simple examples:
1.At a table we ﬁnd eight playing cards: two kings, three ladies, one ten, one eight and one seven. Do the cards belong to a set of Canasta cards or to a set of Skat cards?
2.A college is attended by 523 boys and 490 girls. Are these numbers compatible with the assumption that in average the tendency to attend a college is equal for boys and girls?
3.The lifetimes of ﬁve instable particles of a certain species have been measured. How large is the mean life of that particle and how large is the corresponding uncertainty?
In our ﬁrst example we would favor the Skat game because none of the cards two to six is present which, however, are part of Canasta card sets. Assuming that the cards have been taken at random from a complete card set, we can summarize the available information in the following way: The probability to observe no card with value below seven in eight cards of a Canasta game is LC = (5/13)8 = 4.8 × 10−4 whereas it is LS = 1 for a Skat game. We call these quantities likelihoods1 . The likelihood indicates how well a given hypothesis is supported by a given observation
1The term likelihood was ﬁrst used by the British biologist and statistician Sir Ronald Aylmer Fisher (18901962).
132 6 Parameter Inference I
but the likelihood alone is not su cient for a decision in favor of one or the other hypothesis. Additional considerations may play an important role. When the cards are located in a Swiss youth hostel we would consider the hypothesis Skat more sceptically than when the cards are found in a pub at Hamburg. We therefore would weight our hypotheses with prior probabilities (short: priors) which quantify this additional piece of information. Prior probabilities are often hard to estimate, often they are completely unknown. As a consequence, results depending on priors are model dependent.
We usually will avoid to introduce prior probabilities and stay with likelihoods but sometimes this is not possible. Then the results have to be interpreted conditional to the validity of the applied prior probabilities.
Similar situations as described in our trivial example also occur in empirical sciences. Whenever an observation is more or less compatible with two alternative theories, we cannot simply derive probabilities for the validities of theories based solely on the experimental data. Other criteria like the attractiveness of the theory, the compatibility with previous measurements will enter into our judgement. These additional attributes again can be taken into account by introducing prior probabilities which of course will depend to some extend on subjective prejudices. The cases where well founded quantitative priors are available are rare.
Some years ago, in an experiment a signiﬁcant deviation from the gravitational law had been observed. From the two alternatives: H1: Newton’s law is correct and
H2: A 1/r2 term has to be added to the 1/r potential, the latter was much more compatible with the experimental data. In spite of this experimental evidence, hardly any serious physicist doubted the validity of the classical gravitational law. The reason is that our experience shows that the laws of nature are basically simple. In the mean time, as a consequence of the availability of more precise data, the 1/r2 hypothesis has been rejected.
Nevertheless it is correct and necessary to publish the observed data without weighting them with a prior, i.e. to restrict the presentation to the purely statistical result and to leave the subjective part to the reader of the publication.
The scientiﬁc method requires the possibility to compare and combine results from independent measurements. This is impossible when di erent authors apply varying priors. We will see that it is almost always possible to avoid the introduction of this kind of subjective assumptions into the statistical analysis.
In our second example the situation is even more complex because we are confronted with only one hypothesis and no well speciﬁed alternative. The validity of the alternative, e.g. a deviation from the equality of the distribution of the sexes is hardly measurable since an arbitrarily small deviation from the equality is present in any case. There is no other possibility as to quantify the deviation of the data with the prediction in some proper way. We will accept the hypothesis if the deviation is not abnormally high. We will treat this problem in the chapter goodnessofﬁt tests.
In our third example the number of hypotheses is inﬁnite. To each value of the unknown parameter, i.e. to each di erent mean life, corresponds a di erent prediction. The di culties are very similar to those in case one. If we want to quote probabilities, we are forced to introduce a priori probabilities – here for the parameter under investigation. Again, in most cases no reliable prior information will be available but we can avoid the subjective part by documenting the results in a sensible way. We
6.2 Inference with Given Prior 
133 
will quote the parameter best supported by the data and deﬁne an error interval based on the likelihood of the parameter values.
The following table summarizes the cases which we have discussed.
case1 given: N alternative hypotheses Hi
wanted: relative probabilities for the validity of Hi case 2 given: one hypothesis H0
wanted: a quantitative statement about the validity of H0
case 3: given: one valid hypothesis H(λ) where λ is a single parameter or a set of unknown continuous parameters
wanted: “ best” value of λ and its uncertainty
In practice we often will compare observations with a theory which contains free parameters. In this case we have to infer parameters and to test the compatibility of the hypothesis with the data, i.e. case 2 and case 3 apply.
We now address the di erent problems one after the other.
Remark 1 : In the following chapters we consider parameters as random variables as is common in Bayesian statistics. Sometimes, we assign probabilities to the possible values or intervals of parameters, probabilities which reﬂect our knowledge of these parameters.
Remark 2 : We are primarily interested in the estimation of constants of nature. A di erent problem is the estimation of the individual parameters of an ensemble of events, like those of particle tracks, where the individual tracks have di erent parameter values. An analogue situation is common in social sciences and commercial applications but is of little importance in particle physics and astronomy. It will be treated shortly in the appendix.
6.2 Inference with Given Prior
We now try to derive from a given sample probabilities for hypotheses or parameters. If prior information is available this is possible by means of Bayes’ theorem.
6.2.1 Discrete Hypotheses
In Chap. 1 we had shown that conditional probabilities fulﬁl the following relation (Bayes’ theorem):
P {A ∩ B} = P {AB}P {B} = P {BA}P {A} . 
(6.1) 
The probability P {A ∩ B}, that both the properties A and B apply is equal to the probability P {B}, to ﬁnd property B multiplied by the conditional probability P {AB} to ﬁnd A, when B is realized. This is the ﬁrst part of the relation above. The second part is analogous.
We apply this relation to a discrete random variable k and hypotheses Hi. The index denoting the hypothesis is interpreted as a random variable2.
2In this case this is a categorical variable which denotes a certain class.
134 6 Parameter Inference I


H 
H 
H 


3 

4 

5 





(0.27) 
(0.27) 
(0.27) 










k 
(0.44) 

H2 





3 




















(0.14) 







H1 






k2 (0.32) 

















(0.05) 






k 
(0.24) 















1 

















Fig. 6.1. Quantitative Venn diagram. The areas indicate the probabilities for certain combinations of hypotheses Hi and discrete events of type kj .
We assume that the probability P {kHi} to observe k is given for a ﬁnite number
of alternatively exclusive hypotheses Hi. Then we have 


P {kHi}P {Hi} = P {Hik}P {k} , 


P 
{ 
H 
i 
k 
} 
= 
P {kHi}P {Hi} 
. 
(6.2) 





P 
{ 
k 
} 














Here P {Hi} is the assumed probability for the validity of hypothesis i before the observation happens, it is the a priori probability, in short the prior.
In Fig. 6.1 we illustrate relation (6.2) in form of a so called Venn diagram where in the present example 3 out of the 5 hypotheses have the same prior. Each hypothesis bin is divided into 3 regions with areas proportional to the probabilities to observe k = k1, k = k2 and k = k3, respectively. For example when the observation is k = k2 (shadowed in gray) then the gray areas provide the relative probabilities of the validity of the corresponding hypotheses. In our example hypothesis H3 is the most probable, H1 the most unlikely.
The computation of P {k} which is the marginal distribution of k, i.e. the probability of a certain observation, summed over all hypotheses, yields:
X
P {k} = P {kHi}P {Hi} .
i
As required, P {Hik} is normalized in such a way that the probability that any of the hypotheses is fulﬁlled is equal to one. We get
P 

H k 

= 
P {kHi}P {Hi} 
. 
(6.3) 

{ 
} 
Pj P {kHj}P {Hj} 


i 



In words: The probability for the validity of hypothesis Hi after the measurement k is equal to the prior P {Hi} of Hi multiplied with the probability to observe k if Hi applies and divided by a normalization factor. When we are only interested in the
6.2 Inference with Given Prior 
135 
relative probabilities of two di erent hypotheses Hi and Hj for an observation k, the denominator P {k} cancels:
P {Hik} = P {kHi}P {Hi} .
P {Hj k} P {kHj}P {Hj}
Example 70. Bayes’ theorem: pion or kaon decay?
A muon has been detected. Does it originate from a pion or from a kaon decay? The decay probabilities inside the detector are known and are P {µπ} = 0.02 and P {µK} = 0.10, respectively. The ratio of pions and kaons in the beam is P {π} : P {K} = 3 : 1. With these numbers we obtain:

P {Kµ} 
= 
0.10 × 1 
= 
5 

, 



P {πµ} 
0.02 × 3 




3 




P {Kµ} 
= 
0.10 × 1 

= 0.625 . 

P {Kµ} + P {πµ} 
0.02 × 3 + 0.10 × 1 



The kaon hypothesis is more likely than the pion hypothesis. Its probability is 0.625.
6.2.2 Continuous Parameters
Now we extend our considerations to the case where the hypothesis index is replaced by a continuous parameter θ, i.e. we have an inﬁnite number of hypotheses. Instead of probabilities we obtain probability densities. Bayes’ theorem now reads
f(x, θ) = fx(xθ)πθ (θ) = fθ(θx)πx(x) 
(6.4) 
which is just the relation 3.36 of Sect. 3.5, where fx, fθ are conditional distribution densities and πx(x), πθ(θ) are the marginal distributions of f(x, θ). The joined probability density f(x, θ) of the two random variables x, θ is equal to the conditional probability density fx(xθ) of x, where θ is ﬁxed, multiplied by the probability density πθ(θ), the marginal distribution of θ. For an observation x we obtain analogously to our previous relations
fθ(θx) = fx(xθ)πθ (θ) , πx(x)
and
fθ(θ 
x) = 
fx(xθ)πθ (θ) 
. 
(6.5) 

R−∞∞ fx(xθ)πθ (θ)dθ 

 



In words: For a measurement with the result x, we compute the probability density for the parameter θ from the value of the probability density fx(xθ) for x, multiplied by the probability density (prior) πθ(θ) of θ before the measurement, divided by a normalization integral. Again, the quantity fx(xθ) determines how strongly various parameter values θ are supported by the given observation x and is called – in this context – likelihood of θ.
From the probability density fθ(θx) of the interesting parameter we can derive
ˆ 
and an error interval. An obvious choice is the expectation value 
a best estimate θ 
and the standard deviation. Thus the estimate is a function of the observations3,
ˆ ˆ .
θ = θ(x)
3A function of the observations is called a statistic, to be distinguished from the discipline statistics.
136 
6 
Parameter Inference I 








0.8 









observed value 





0.6 






f(θ 
) 







0.4 







0.2 







0.0 0 
1 
2 
3 
4 





θ 


Fig. 6.2. Probability density for the true decay time. The mean decay time is 1, the observed value is 1.5.
Example 71. Time of a decay with exponential prior
A detector with ﬁnite resolution registers at time t the decay of a K meson. The time resolution corresponds to a Gaussian with variance σ2. We are interested in the time θ at which the decay occurred. The mean lifetime τ of kaons is known. The probability density for the parameter θ before the measurement, the prior, is π(θ) = e−θ/τ /τ, θ ≥ 0. The probability density for t with θ ﬁxed is the Gaussian. Applying (6.5) we obtain the probability density f(θ) = f(θt) of the parameter θ,
e−(t−θ)2/(2σ2)e−θ/τ
f(θ) = R ∞ e−(t−θ)2/(2σ2)e−θ/τ dθ ,
0
which is displayed in Fig. 6.2. As a consequence of the exponential prior it is visibly shifted to the left with respect to the observation.
If the value of the probability density fx(xθ) in (6.5) varies much more rapidly with θ than the prior – this is the case when the observation restricts the parameter drastically – then to a good approximation the prior can be regarded as constant in the interesting region. We then have
fθ(θx) ≈ R ∞ . −∞ fx(xθ)dθ
In this approximation the probability density fθ of the parameter corresponds to the normalized likelihood function.
In practice, fθ often follows to a good approximation a normal distribution. The
ˆ 
where fθ 
is maximal then is the estimate of θ and the values where fθ has 
value θ 
decreased by the factor e1/2 deﬁne a standard deviation error interval and thus ﬁx
the uncertainty of the estimate ˆ.
θ
Тут вы можете оставить комментарий к выбранному абзацу или сообщить об ошибке.
Оставленные комментарии видны всем.