                     Tests for Normality

Test                  --Statistic---     -----P-value------

Shapiro-Wilk          W       0.887867   Pr < W       <0.0001
Kolmogorov-Smirnov    D       0.196662   Pr > D       <0.0100
Cramer-von Mises      W-Sq    0.394005   Pr > W-Sq    <0.0050
Anderson-Darling      A-Sq    2.399601   Pr > A-Sq    <0.0050

          Quantiles (Definition 5)

          Quantile        Estimate

          100% Max             138
          99%                  138
          95%                  122
          90%                  101
          75% Q3                75
          50% Median            39
          25% Q1                14
          10%                    8
          5%                     6
          1%                     5
          0% Min                 5

              Extreme Observations

      ----Lowest----        ----Highest---

      Value      Obs        Value      Obs

          5       39          107       38
          5        3          122       19
          6       41          122       59
          6       37          133       35
          8       46          138       26

Fitted Distribution for Hardness

Parameters for Normal Distribution

Parameter    Symbol    Estimate

Mean         Mu        47.18033
Std Dev      Sigma     38.09397
©2002 CRC Press LLC
Goodness-of-Fit Tests for Normal Distribution

Test                  ---Statistic----      -----P-value-----

Kolmogorov-Smirnov    D       0.19666241    Pr > D        <0.010
Cramer-von Mises      W-Sq    0.39400529    Pr > W-Sq     <0.005
Anderson-Darling      A-Sq    2.39960138    Pr > A-Sq     <0.005

Quantiles for Normal Distribution

             --------Quantile-------
Percent      Observed      Estimated

   1.0        5.00000      -41.43949
   5.0        6.00000      -15.47867
  10.0        8.00000       -1.63905
  25.0       14.00000       21.48634
  50.0       39.00000       47.18033
  75.0       75.00000       72.87432
  90.0      101.00000       95.99971
  95.0      122.00000      109.83933
  99.0      138.00000      135.80015

Display 2.3
The quantiles provide information about the tails of the distribution and also include the five-number summary for each variable: the minimum, lower quartile, median, upper quartile, and maximum. Box plots constructed from these summaries are often very useful for comparing distributions and identifying outliers. Examples are given in subsequent chapters.
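If only the five-number summaries are wanted, they can also be requested directly. The following is a minimal sketch using proc means, assuming the water data set and the variables mortal and hardness used elsewhere in this chapter; the maxdec option is included only to tidy the printed values.

```sas
/* Five-number summary: minimum, lower quartile, median, */
/* upper quartile, and maximum for each variable.        */
proc means data=water min q1 median q3 max maxdec=1;
  var mortal hardness;
run;
```

For hardness, the values should correspond to the 0%, 25%, 50%, 75%, and 100% rows of the quantiles table in Display 2.3.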
The listing of extreme values can be useful for identifying outliers, especially when used with an id statement. The following section, entitled “Fitted Distribution for Hardness,” gives details of the distribution fitted to the histogram. Because a normal distribution is fitted in this instance, it largely duplicates the output generated by the normal option on the proc statement.
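As an illustration of the id statement, the following sketch labels the extreme observations by town rather than by observation number alone. It assumes that the water data set contains a variable holding the town name, called town here; that name is an assumption and should be replaced by whatever identifier the data set actually contains.

```sas
proc univariate data=water normal;
  var hardness;
  id town;                      /* assumption: town holds each town's name */
  histogram hardness / normal;  /* fit a normal curve to the histogram */
run;
```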
The numerical information in Display 2.2 and the plots in Displays 2.4 and 2.5 all indicate that mortality is symmetrically, approximately normally, distributed. The formal tests of normality all result in non-significant values of the test statistic. The results in Display 2.3 and the plots in Displays 2.6 and 2.7, however, strongly suggest that calcium concentration (hardness) has a skew distribution; each of the tests for normality has a very small associated P-value.
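The poor fit for hardness can also be seen in the "Quantiles for Normal Distribution" part of Display 2.3, where the fitted normal distribution places its 1%, 5%, and 10% points below zero, which is impossible for a concentration. The estimated quantiles there appear to be the fitted mean plus the fitted standard deviation times the corresponding standard normal quantile. The sketch below reproduces them under that assumption, using the parameter estimates from Display 2.3; the data set name normq is arbitrary.

```sas
data normq;
  mu    = 47.18033;   /* estimated mean from Display 2.3 */
  sigma = 38.09397;   /* estimated standard deviation from Display 2.3 */
  do percent = 1, 5, 10, 25, 50, 75, 90, 95, 99;
    /* probit returns the standard normal quantile for a probability */
    estimated = mu + sigma * probit(percent / 100);
    output;
  end;
run;

proc print data=normq noobs;
  var percent estimated;
  format estimated 10.5;
run;
```

The printed values can be compared with the Estimated column of Display 2.3.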
Display 2.4
Display 2.5
Display 2.6
Display 2.7
The first step in examining the relationship between mortality and water hardness is to look at the scatterplot of the two variables. This can be found using proc gplot with the following instructions:
proc gplot;
  plot mortal*hardness;
run;
The resulting graph is shown in Display 2.8. The plot shows a clear negative association between the two variables, with high levels of calcium concentration tending to occur with low mortality values and vice versa. The correlation between the two variables is easily found using proc corr, with the following instructions:
proc corr data=water pearson spearman;
  var mortal hardness;
run;
Display 2.8
The pearson and spearman options in the proc corr statement request that both types of correlation coefficient be calculated. The default, if neither option is used, is the Pearson coefficient.
The results from these instructions are shown in Display 2.9. The correlation is estimated to be –0.655 using the Pearson coefficient and –0.632 using Spearman’s coefficient. In both cases, the test that the population correlation is zero has an associated P-value of less than 0.0001. There is clearly strong evidence for a non-zero correlation between the two variables.
                              The CORR Procedure

                       2 Variables:  Mortal   Hardness

                               Simple Statistics

Variable    N        Mean     Std Dev      Median     Minimum     Maximum

Mortal     61        1524   187.66875        1555        1096        1987
Hardness   61    47.18033    38.09397    39.00000     5.00000   138.00000

                 Pearson Correlation Coefficients, N = 61
                        Prob > |r| under H0: Rho=0

                               Mortal      Hardness

              Mortal          1.00000      -0.65485
                                             <.0001

              Hardness       -0.65485       1.00000
                               <.0001

                 Spearman Correlation Coefficients, N = 61
                        Prob > |r| under H0: Rho=0

                               Mortal      Hardness

              Mortal          1.00000      -0.63166
                                             <.0001

              Hardness       -0.63166       1.00000
                               <.0001

Display 2.9
One of the questions of interest about these data is whether or not there is a geographical factor in the relationship between mortality and water hardness, in particular whether this relationship differs between the towns in the North and those in the South. To examine this question, a useful first step is to replot the scatter diagram in Display 2.8 with northern and southern towns identified with different symbols. The necessary instructions are:
symbol1 value=dot;
symbol2 value=circle;
proc gplot;
  plot mortal*hardness = location;
run;
The plot statement of the general form plot y * x = z will result in a scatter plot of y by x with a different symbol for each value of z. In this case, location has only two values and the first two plotting symbols used by SAS are 'x' and '+'. The symbol statements change the plotting symbols to give more impact to the scattergram.
The resulting plot is shown in Display 2.10. There appears to be no obvious difference in the form of the relationship between mortality and hardness for the two groups of towns.
Display 2.10
Separate correlations for northern and southern towns can be produced using proc corr with a by statement as follows:
proc sort data=water;
  by location;
run;
proc corr data=water pearson spearman;
  var mortal hardness;
  by location;
run;
The by statement has the effect of producing separate analyses for each subgroup of the data defined by the specified variable, location in this case. However, the data set must first be sorted by that variable.
The results from this series of instructions are shown in Display 2.11. The main items of interest in this display are the correlation coefficients and the results of the tests that the population correlations are zero. The Pearson correlation for towns in the North is –0.369, and for those in the South it is –0.602. Both values are significant beyond the 5% level. The Pearson and Spearman coefficients take very similar values for this example.
------------------------------- location=north -------------------------------

                              The CORR Procedure

                       2 Variables:  Mortal   Hardness

                               Simple Statistics

Variable    N        Mean     Std Dev      Median     Minimum     Maximum

Mortal     35        1634   136.93691        1637        1378        1987
Hardness   35    30.40000    26.13449    17.00000     6.00000    94.00000

                 Pearson Correlation Coefficients, N = 35
                        Prob > |r| under H0: Rho=0

                               Mortal      Hardness

              Mortal          1.00000      -0.36860
                                             0.0293

              Hardness       -0.36860       1.00000
                               0.0293

                 Spearman Correlation Coefficients, N = 35
                        Prob > |r| under H0: Rho=0

                               Mortal      Hardness

              Mortal          1.00000      -0.40421
                                             0.0160

              Hardness       -0.40421       1.00000
                               0.0160

------------------------------- location=south -------------------------------

                              The CORR Procedure

                       2 Variables:  Mortal   Hardness

                               Simple Statistics

Variable    N        Mean     Std Dev      Median     Minimum     Maximum

Mortal     26        1377   140.26918        1364        1096        1627
Hardness   26    69.76923    40.36068    75.50000     5.00000   138.00000

                 Pearson Correlation Coefficients, N = 26
                        Prob > |r| under H0: Rho=0

                               Mortal      Hardness

              Mortal          1.00000      -0.60215
                                             0.0011

              Hardness       -0.60215       1.00000
                               0.0011

                 Spearman Correlation Coefficients, N = 26
                        Prob > |r| under H0: Rho=0

                               Mortal      Hardness

              Mortal          1.00000      -0.59572
                                             0.0013

              Hardness       -0.59572       1.00000
                               0.0013

Display 2.11
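The P-values reported for the Pearson coefficients can be checked by hand. The sketch below assumes the usual test of a zero population correlation, in which t = r sqrt(n – 2) / sqrt(1 – r²) is referred to a t distribution with n – 2 degrees of freedom. The values of n and r are copied from Displays 2.9 and 2.11, and the data set name corrtest is arbitrary.

```sas
data corrtest;
  input group $ n r;
  /* t statistic for H0: rho = 0, with n - 2 degrees of freedom */
  t = r * sqrt(n - 2) / sqrt(1 - r**2);
  /* two-sided P-value from the t distribution */
  p = 2 * (1 - probt(abs(t), n - 2));
  datalines;
all   61 -0.65485
north 35 -0.36860
south 26 -0.60215
;
run;

proc print data=corrtest noobs;
  format t 8.3 p 8.4;
run;
```

If the assumption about the test is right, the printed P-values should agree closely with those shown beneath the Pearson coefficients in the two displays.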
Examination of scatterplots often centres on assessing density patterns such as clusters, gaps, or outliers. However, humans are not particularly good at visually examining point density and some type of density estimate added to the scatterplot is frequently very helpful. Here, plotting a bivariate density estimate for mortality and hardness is useful for gaining more insight into the structure of the data. (Details on how to calculate bivariate densities are given in Silverman [1986].) The following code produces and plots the bivariate density estimate of the two variables:
proc kde data=water out=bivest;
  var mortal hardness;
proc g3d data=bivest;
  plot hardness*mortal=density;
run;
The KDE procedure (proc kde) produces estimates of a univariate or bivariate probability density function using kernel density estimation (see Silverman [1986]). If a single variable is specified in the var statement, a univariate density is estimated; if two are specified, a bivariate density is estimated. The out=bivest option directs the density estimates to a SAS data set. These can then be plotted with the three-dimensional plotting procedure proc g3d. The resulting plot is shown in Display 2.12. The two clear modes in the diagram correspond, at least approximately, to northern and southern towns.
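The univariate case works in the same way. The following sketch estimates and plots the density of hardness alone, using the same form of proc kde as above. It assumes that the output data set (called uniest here, an arbitrary name) contains the variables hardness and density, by analogy with the bivariate output plotted above. Later SAS releases express the same request with a univar statement, so this form may need adapting.

```sas
/* Kernel density estimate for a single variable */
proc kde data=water out=uniest;
  var hardness;
run;

/* Plot the estimated density as a joined line rather than points */
symbol1 value=none interpol=join;
proc gplot data=uniest;
  plot density*hardness;
run;
```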
Display 2.12