Appendix 1
Some Basic Statistical Methods
A1.1 Introduction
It is assumed that readers of this book already have some knowledge of elementary statistical methods. This appendix is therefore not intended to be a full introduction to these methods. Instead, it is intended to be a quick refresher course for those who may have forgotten some of this material. Nevertheless, this appendix covers the minimum background needed for reading the rest of the book, so that those who have not formally studied statistics before may find it a sufficient starting point.
A1.2 Distributions for Sample Data
Random variation is the raw material of statistics. When observations are taken on an environmental variable, they usually display variation to a greater or lesser extent. For example, Table A1.1 shows the values of 1,2,3,4-tetrachlorobenzene (TcCB), in parts per thousand million, for 47 samples from an uncontaminated site used as a reference for comparison with a possibly contaminated site (Gilbert and Simpson 1992, pp. 6–22). These values vary from 0.22 to 1.33, presumably due to natural variation in different parts of the site, plus some analytical error involved in measuring each sample. The main concern of statistics as a subject is to quantify this type of variation.
Data distributions come in two basic varieties. When the values that can be observed are anything within some range, then the distribution is said to be continuous. Hence the data shown in Table A1.1 are continuous because, in principle, any value could have been observed over a range that extends down to 0.22 or less and up to 1.33 or more. On the other hand, if only certain particular values can be observed, then the distribution is said to be discrete. An example is the number of lesions observed on the lungs of rats at the end of an experiment where they were exposed to a certain toxic substance for a certain period of time. In that case, an individual rat can have a number of
lesions equal to 0 or 1 or 2, and so on. Often the possible values for a discrete variable are counts like this.

Table A1.1
Measurements of TcCB for 47 Samples Taken from Different Locations at an Uncontaminated Site

0.60  0.50  0.39  0.84  0.46  0.39  0.62  0.67  0.69  0.81
0.38  0.79  0.43  0.57  0.74  0.27  0.51  0.35  0.28  0.45
0.42  1.14  0.23  0.72  0.63  0.50  0.29  0.82  0.54  1.13
0.56  1.33  0.56  1.11  0.57  0.89  0.28  1.20  0.76  0.26
0.34  0.52  0.42  0.22  0.33  1.14  0.48

Note: Measurements are in parts per thousand million (parts per billion).
Source: Gilbert and Simpson (1992, pp. 6–22).
There are many standard distributions for continuous data. Here, only the normal distribution (also sometimes called the Gaussian distribution) is considered. This distribution is characterized as being bell-shaped, with most values being near the center of the distribution. There are two parameters to describe the distribution: the mean and the standard deviation, which are often denoted by μ and σ, respectively. There is also a function to describe the distribution in terms of these two parameters, which is referred to as a probability density function (pdf).
An example pdf is shown in Figure A1.1 for the distribution with μ = 5 and σ = 1. A random value from this normal distribution is one for which the probability of obtaining a particular value x is proportional to the height of the function. Thus 5 is the value that is most likely to occur, and values outside the range 2 to 8 will occur rather rarely.
In general, it turns out that, for all normal distributions, about 68% of values will be in the range μ ± σ, about 95% will be in the range μ ± 2σ, and about 99.7% will be in the range μ ± 3σ.
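As a minimal numerical check (a sketch assuming the SciPy library is available, which is not part of the book itself), these coverage probabilities can be computed from the normal cumulative distribution function:

```python
# Check the 68%/95%/99.7% rule for the normal distribution of
# Figure A1.1 (mean 5, standard deviation 1); assumes SciPy.
from scipy.stats import norm

mu, sigma = 5.0, 1.0
for k in (1, 2, 3):
    # P(mu - k*sigma < X < mu + k*sigma) for X ~ N(mu, sigma^2)
    p = norm.cdf(mu + k * sigma, loc=mu, scale=sigma) \
        - norm.cdf(mu - k * sigma, loc=mu, scale=sigma)
    print(f"mu +/- {k} sigma covers {p:.4f}")
# Prints approximately 0.6827, 0.9545, and 0.9973.
```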
The normal distribution with μ = 0 and σ = 1 is of special importance. This is called the standard normal distribution, and variables with this distribution are often called Z-scores. A table of probabilities for this particular distribution is presented in Table A2.1 in Appendix 2, because this is useful for various statistical analyses.
If a large sample of random values is selected independently from a normal distribution to give values x₁, x₂, …, xₙ, then the mean of these values,

$$\bar{x} = (x_1 + x_2 + \cdots + x_n)/n \qquad \text{(A1.1)}$$
will be approximately equal to μ. Thus x̄ is an estimate of μ. This is, in fact, true for all other distributions as well: the sample mean is said to be an "estimator" of the distribution mean for a random sample from any distribution.
The square of the standard deviation of the distribution, σ², is called the variance of the distribution. For large samples, this should be approximately equal to the sample variance, which is defined to be
$$s^2 = [(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2]/(n - 1) \qquad \text{(A1.2)}$$
where the division by n − 1 (rather than n) is made to remove a tendency to underestimate the population variance that would otherwise occur.

Figure A1.1
The probability density function (pdf) for the normal distribution with a mean of 5 and a standard deviation of 1.
The square root of the sample variance, s, is called the sample standard deviation, and this should be close to σ for large samples. Hence, s² estimates σ², and s estimates σ. More generally, s² is an estimator of the distribution variance, and s is an estimator of the distribution standard deviation when calculated from a random sample from any distribution.
The sample mean and variance are often written using the more concise summation notation

$$\bar{x} = \sum_{i=1}^{n} x_i \Big/ n \qquad \text{and} \qquad s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 \Big/ (n - 1)$$

where Σ is the summation operator, indicating that the elements following this sign are to be added up over the range from i = 1 to n. Consequently, these two expressions are exactly equivalent to equations (A1.1) and (A1.2), respectively.
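As an illustration (a minimal Python sketch, not part of the book), the sample mean and variance of the TcCB data in Table A1.1 can be computed directly from equations (A1.1) and (A1.2):

```python
# Sample mean and variance for the 47 TcCB values of Table A1.1,
# computed directly from equations (A1.1) and (A1.2).
tccb = [0.60, 0.50, 0.39, 0.84, 0.46, 0.39, 0.62, 0.67, 0.69, 0.81,
        0.38, 0.79, 0.43, 0.57, 0.74, 0.27, 0.51, 0.35, 0.28, 0.45,
        0.42, 1.14, 0.23, 0.72, 0.63, 0.50, 0.29, 0.82, 0.54, 1.13,
        0.56, 1.33, 0.56, 1.11, 0.57, 0.89, 0.28, 1.20, 0.76, 0.26,
        0.34, 0.52, 0.42, 0.22, 0.33, 1.14, 0.48]

n = len(tccb)                                        # n = 47
xbar = sum(tccb) / n                                 # equation (A1.1)
s2 = sum((x - xbar) ** 2 for x in tccb) / (n - 1)    # equation (A1.2)
print(f"mean = {xbar:.3f}, variance = {s2:.4f}, sd = {s2 ** 0.5:.3f}")
# The mean is about 0.60, as noted later in the text.
```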
The name normal suggests that the normal distribution is what will usually be found for continuous data. This is, however, not the case. Many naturally occurring distributions do appear to be approximately normal, but certainly not all of them. Environmental variables often have a distribution that is skewed to the right rather than being bell-shaped. For example, a histogram for the TcCB values in Table A1.1 is shown in Figure A1.2. Here there is a suggestion that if many more data values were obtained from the site, then the distribution would not quite be symmetrical
about the mean, which is about 0.6. Instead, the right tail would extend further from the mean than the left tail.

Figure A1.2
Histogram of the sample distribution for TcCB for the 47 samples shown in Table A1.1, with the height of the bars proportional to the percentage of data values within the ranges that they cover.
There are also many standard distributions for discrete data. Here, only the binomial distribution will be considered, which owes its importance to its connection with data in the form of proportions. This distribution arises when there is a certain constant probability that an observation will have a property of interest. For example, a series of samples might be taken from random locations at a certain study site, and there is interest in whether the level of a toxic chemical is higher than a specified value. If the probability of an exceedance is p for a randomly chosen location, then the probability of observing exactly x exceedances from a sample of n locations is given by the binomial distribution
$$P(x) = {}^{n}C_{x}\, p^{x} (1 - p)^{n - x} \qquad \text{(A1.3)}$$

where the possible values of x are 0, 1, 2, …, n. In this equation, $^{n}C_{x} = n!/[x!(n - x)!]$ is the number of combinations of n things taken x at a time, with a! = a(a − 1)(a − 2) … (1) being the factorial function.
The mean and variance of the binomial distribution are μ = np and σ² = np(1 − p), respectively. If a large sample of values x₁, x₂, …, xₙ is taken from the distribution, then the sample mean x̄ and the sample variance s², calculated using equations (A1.1) and (A1.2), will be approximately equal to μ and σ², respectively.
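As a sketch (plain Python, not from the book), equation (A1.3) can be evaluated with the built-in math.comb function; the values n = 100 and p = 0.025 anticipate the sea lion bycatch example discussed next:

```python
# Binomial probabilities from equation (A1.3), with nCx computed
# by Python's math.comb; n and p match the bycatch example below.
from math import comb

n, p = 100, 0.025
for x in range(6):
    prob = comb(n, x) * p ** x * (1 - p) ** (n - x)
    print(f"P({x}) = {prob:.4f}")

print("mean      =", n * p)            # mu = np = 2.5
print("variance  =", n * p * (1 - p))  # sigma^2 = np(1 - p) = 2.4375
```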
An example of a binomial distribution comes from the accidental bycatch of New Zealand sea lions Phocarctos hookeri during trawl fishing for squid around the Auckland Islands to the south of New Zealand. Experience shows that, for any individual trawl, the probability of catching a sea lion in the net is fairly constant at about 0.025, and there are about 3500 trawls in a summer fishing season (Manly and Walshe 1999). It is extremely rare for more than one animal to be captured at a time. Suppose that the trawls during
a particular season are considered in groups of 100, and the number of sea lions caught is recorded for trawls 1–100, 101–200, and so on up to 3401–3500. Then there would be 35 observations, each of which is a random value from a binomial distribution with n = 100 and p = 0.025. Table A1.2 shows the type of data that can be expected to be obtained under these conditions, with a histogram illustrating the results in Figure A1.3.

Table A1.2
Simulated Bycatch of New Zealand Sea Lions in 35 Successive Sets of 100 Trawls

1  1  2  1  3  2  1
3  2  2  2  2  3  0
2  0  3  2  1  0  1
4  2  3  3  2  4  1
4  2  2  2  1  2  3

Note: Each observation is a random value from a binomial distribution, with n = 100 and p = 0.025.

Figure A1.3
Histogram of the simulated bycatch data shown in Table A1.2, with the height of each bar reflecting the percentage of observations for the value concerned.
The results shown in Table A1.2 come from a computer simulation of the sea lion bycatch. Such simulations are often valuable for obtaining an idea of the variation to be expected according to an assumed model for data.
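A simulation along these lines is easy to reproduce. The following minimal sketch (assuming the NumPy library, with an arbitrary random seed) generates 35 binomial counts in the same way that Table A1.2 was generated:

```python
# Simulate 35 groups of 100 trawls with a 0.025 capture probability
# per trawl, as for Table A1.2; assumes NumPy. The seed is arbitrary.
import numpy as np

rng = np.random.default_rng(seed=1)
bycatch = rng.binomial(n=100, p=0.025, size=35)
print(bycatch)
print("sample mean     =", bycatch.mean())       # near np = 2.5
print("sample variance =", bycatch.var(ddof=1))  # near np(1 - p) = 2.44
```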
One of the reasons for the importance of the normal distribution is the way that some other distributions become approximately normal under the right circumstances. This applies to the binomial distribution, provided that the sample size n is large enough; a good general rule is that a binomial distribution is similar to a normal distribution provided that np(1 − p) ≥ 5. This result is particularly useful because it implies that, if a sample count follows a binomial distribution with mean np and variance np(1 − p), and np(1 − p) ≥ 5, then the sample proportion x/n can reasonably be treated as coming from a normal distribution with mean p and variance p(1 − p)/n. This is a key result for the analysis of data consisting of observed proportions.
For the bycatch data, np(1 − p) = 100(0.025)(0.975) = 2.44, which is much less than 5, so the condition for a good normal approximation does not hold. Nevertheless, the observed distribution is quite symmetric and bell-shaped.
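The condition is easy to check numerically. In this sketch the group size of 250 trawls is a hypothetical regrouping, used only to show a case where the condition holds:

```python
# Check the np(1 - p) >= 5 condition for the normal approximation.
# n = 100 is the grouping used in the text; n = 250 is hypothetical.
p = 0.025
for n in (100, 250):
    value = n * p * (1 - p)
    status = "holds" if value >= 5 else "fails"
    print(f"n = {n}: np(1 - p) = {value:.2f} -> condition {status}")
# When the condition holds, the proportion x/n is roughly normal
# with mean p and variance p(1 - p)/n.
```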
A1.3 Distributions of Sample Statistics
There are some distributions that arise indirectly as a result of calculating values that summarize samples, like the sample mean and the sample variance. These values that summarize samples are called sample statistics. Hence, it is the distributions of these sample statistics that are of interest. The reason why it is necessary to know about these particular distributions is that they are needed for drawing conclusions from data, for example, using tests of significance and confidence limits.
The first distribution to consider for sample statistics is the t-distribution. This arises when a random sample of size n is taken from a normal distribution with mean μ. If the sample mean x̄ and the sample variance s² are calculated using equations (A1.1) and (A1.2), then the quantity
$$t = (\bar{x} - \mu)\big/(s/\sqrt{n}) \qquad \text{(A1.4)}$$
follows what is called a t-distribution with n − 1 degrees of freedom (df). For large values of n, this distribution approaches a standard normal distribution with mean 0 and standard deviation 1.
It is important to realize exactly what it means to say that the quantity (x̄ − μ)/(s/√n) follows a t-distribution. The meaning is as follows: if the process of taking a random sample of size n from the normal distribution were repeated a large number of times, then the resulting distribution obtained for the values of (x̄ − μ)/(s/√n) would be the t-distribution. One sample of size n therefore yields a single random value from this distribution.
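This repeated-sampling interpretation can be illustrated by simulation (a sketch assuming NumPy and SciPy, with arbitrary parameter values and seed):

```python
# Draw many samples of size n from a normal distribution, compute
# the t statistic of equation (A1.4) for each, and check one tail
# probability against the t-distribution with n - 1 df.
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(seed=2)
mu, sigma, n, reps = 5.0, 1.0, 10, 100_000   # arbitrary illustrative values
samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)
t_values = (xbar - mu) / (s / np.sqrt(n))

c = t_dist.ppf(0.975, df=n - 1)              # upper 2.5% point, 9 df
print("empirical P(t > c) =", (t_values > c).mean())  # close to 0.025
```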
For small values of degrees of freedom (df), the t-distribution is more spread out than the normal distribution with mean 0 and standard deviation 1. This is illustrated in Figure A1.4, which shows the shape of the distribution for several values of df. Table A2.2 in Appendix 2 gives what are called "critical values" for t-distributions, which will be useful throughout this book. The values in this table mark the points that are exceeded with probability 0.05, 0.025, 0.01, or 0.005. Note that because the distribution is symmetric about zero, Prob(t < −c) = Prob(t > c) for any critical value c.
Figure A1.4
The form of the t-distribution with 3, 5, 10, and infinite df. With infinite df, the t-distribution is the standard normal distribution, with a mean of 0 and a standard deviation of 1.

The second distribution to be considered is the chi-squared distribution. If a random sample of size n is taken from a normal distribution with variance σ², and if the sample variance s² is calculated using equation (A1.2), then the quantity

$$\chi^2 = (n - 1)s^2/\sigma^2 \qquad \text{(A1.5)}$$

is a random value from the chi-squared distribution with n − 1 df. The shape of this distribution depends very much on the df, becoming more like a normal distribution as the df increases. Some example shapes are shown in Figure A1.5, and Table A2.3 in Appendix 2 gives critical values for the distributions that are needed for various purposes.

Figure A1.5
Form of the chi-squared distribution with 2, 5, 10, and 40 df.
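For a concrete illustration (hypothetical numbers, assuming SciPy for the critical value):

```python
# Chi-squared statistic from equation (A1.5) for a hypothetical
# sample of size 20 with sample variance 1.6, when sigma^2 = 1.0.
from scipy.stats import chi2

n, s2, sigma2 = 20, 1.6, 1.0            # illustrative numbers only
chi_squared = (n - 1) * s2 / sigma2     # equation (A1.5): 30.4
upper_5pct = chi2.ppf(0.95, df=n - 1)   # upper 5% point, 19 df
print(f"chi-squared = {chi_squared:.2f}, 5% critical value = {upper_5pct:.2f}")
```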
The third and last distribution to be considered is the F-distribution. Suppose that (a) a random sample of size n₁ is taken from a normal distribution with variance σ², and the sample variance s₁² is calculated using equation (A1.2); and then (b) a random sample of size n₂ is independently taken from a second normal distribution with the same variance, and the sample variance s₂² is calculated, again using equation (A1.2). In that case, the ratio of the sample variances,
$$F = s_1^2/s_2^2 \qquad \text{(A1.6)}$$

will be a random value from the F-distribution with n₁ − 1 and n₂ − 1 df. Like the chi-squared distribution, the F-distribution can take a variety of different shapes, depending on the df. Some examples are shown in Figure A1.6, and Table A2.4 in Appendix 2 gives critical values for the distribution.

Figure A1.6
Form of the F-distribution for various values of the two df: df1 and df2.
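Again a hypothetical sketch, assuming SciPy for the F critical value:

```python
# Variance ratio from equation (A1.6) for two hypothetical samples,
# compared with the upper 5% point of the F-distribution.
from scipy.stats import f

n1, n2 = 11, 8                # illustrative sample sizes
s1_sq, s2_sq = 2.5, 1.2       # illustrative sample variances
F = s1_sq / s2_sq             # equation (A1.6): about 2.08
upper_5pct = f.ppf(0.95, dfn=n1 - 1, dfd=n2 - 1)
print(f"F = {F:.2f}, 5% critical value = {upper_5pct:.2f}")
```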
A1.4 Tests of Significance
One of the most widely used tools in statistics is the test of significance, which examines whether a set of data could reasonably have arisen under a certain assumption, called the null hypothesis. One framework for conducting such a test has the following steps:
1. Decide on the null hypothesis to be tested (often a statement that a parameter of a distribution takes a specific value).
2. Decide whether the alternatives to the null hypothesis of interest involve a difference in either direction or only in one particular direction.
3. Choose a suitable test statistic that measures the extent to which the data agree with the null hypothesis.
4. Determine the distribution of the test statistic if the null hypothesis is true.
5. Calculate the test statistic, S, for the observed data.
6. Calculate the probability p (sometimes called the p-value) of obtaining a value as extreme as, or more extreme than, S if the null hypothesis is true, using the distribution identified at step 4, and defining "extreme" taking into account the alternative to the null hypothesis identified at step 2.
7. Conclude that there is evidence that the null hypothesis is not true if p is small enough; otherwise, conclude that there is no real evidence against the null hypothesis.
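The steps can be made concrete with a one-sample t-test (a sketch, not the book's own code; the data and the null-hypothesis mean are hypothetical, and SciPy supplies the t-distribution for steps 4 and 6):

```python
# The seven steps as a one-sample, two-sided t-test.
from math import sqrt
from scipy.stats import t as t_dist

data = [0.62, 0.48, 0.55, 0.71, 0.43, 0.66, 0.58, 0.52]  # hypothetical data
mu0 = 0.5                                # step 1: null hypothesis mean
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
t_stat = (xbar - mu0) / sqrt(s2 / n)     # steps 3 and 5: statistic (A1.4)

# Steps 4 and 6: under the null hypothesis, t has n - 1 df;
# the two-sided p-value uses both tails.
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 1)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # step 7: compare p with 0.05, etc.
```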
At step 7, it is conventional to use the probability levels 0.05, 0.01, and 0.001. Thus if p ≤ 0.05, then the test is considered to provide some evidence against the null hypothesis; if p ≤ 0.01, then the test is considered to provide strong evidence against the null hypothesis; and if p ≤ 0.001, then the test is considered to provide very strong evidence against the null hypothesis. Another way to express this is to say that if p ≤ 0.05, then the result is significant at the 5% level; if p ≤ 0.01, then the result is significant at the 1% level; and if p ≤ 0.001, then the result is significant at the 0.1% level. There is an element of arbitrariness in the choice of the probability levels 0.05, 0.01, and 0.001. They were originally adopted in the era before computers because of the need to specify a limited number of levels when constructing printed reference tables for testing significance.
Some people prefer to specify in advance the level, α, of p that will be considered to be significant at step 7. For example, it might be decided that a result will be significant only if p ≤ 0.01. The test is then said to be at the 1% level of significance, or sometimes the test is said to have a size of 0.01 or 1%. In that case, the above framework changes slightly. There is an additional step before step 1:
0. Choose the significance level α for the test.
The last step then changes to:
7. If p ≤ α, then declare the test result to be significant, giving evidence against the null hypothesis.
Step 2 in the procedure is concerned with deciding whether the test is two-sided or one-sided. This depends on whether there is interest in differences from the null hypothesis in (a) either direction (i.e., the true mean could be either higher or lower than the mean specified by the null hypothesis) or (b) only one direction (e.g., the true mean exceeds the mean specified by the null hypothesis). This then makes a difference to the calculation of the p-value at step 6. In general, what is done is to calculate the probability of a result as extreme as, or more extreme than, that observed in terms of the direction (or directions) of interest. Often test statistics are such that a value of zero indicates no difference from the null hypothesis. In that case, the p-value for a two-sided test will be the probability of being as far from zero as the observed value in either direction, while the p-value for a one-sided test will be the probability of being as far from zero in the single direction that indicates an effect of interest. The example given at the end of this section should clarify what this means in practice.
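For instance (a hypothetical illustration assuming SciPy), a single t statistic gives different p-values under the two kinds of alternative:

```python
# One-sided versus two-sided p-values for the same t statistic.
from scipy.stats import t as t_dist

t_obs, df = 2.1, 9                      # illustrative numbers only
p_one = t_dist.sf(t_obs, df)            # P(t > 2.1): one-sided test
p_two = 2 * t_dist.sf(abs(t_obs), df)   # P(|t| > 2.1): two-sided test
print(f"one-sided p = {p_one:.3f}, two-sided p = {p_two:.3f}")
```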
An important distinction is between parametric and nonparametric tests. In practice, this often comes down to a question of whether the population being sampled has a normal distribution or not. The difference between parametric and nonparametric tests is that parametric tests require more
