Handbook_of_statistical_analysis_using_SAS
.pdfbetween them with the message “Press Forward to see next graph” in the status line. The Page Down and Page Up keys are used for Forward and Backward, respectively.
1.10Some Tips for Preventing and Correcting Errors
When writing programs:
1.One statement per line, where possible.
2.End each step with a run statement.
3.Indent each statement within a step (i.e., each statement between the data or proc statement and the run statement) by a couple of spaces. This is automated in the enhanced editor.
4.Give the full path name for raw data files on the infilestatement.
5.Begin any programs that produce graphics with goptions reset=all; and then set the required options.
Before submitting a program:
1. Check that each statement ends with a semicolon.
2.Check that all opening and closing quotes match. Use the enhanced editor colour coding to double-check.
3.Check any statement that does not begin with a keyword (blue, or navy blue) or a variable name (black).
4.Large blocks of purple may indicate a missing quotation mark.
5.Large areas of green may indicate a missing */ from a comment.
“Collapse” the program to check its overall structure. Hold down the Ctrl and Alt keys and press the numeric keypad minus key. Only the data, proc statements, and global statements should be visible. To expand the program, press the numeric keypad plus key while holding down Ctrl and Alt.
After running a program:
1.Examine the SAS log for warning and error messages.
2.Check for the message: "SAS went to a new line when INPUT statement reached past the end of a line" when using list input.
3.Verify that the number of observations and variables read in is correct.
©2002 CRC Press LLC
4.Print out small data sets to ensure that they have been read correctly.
If there is an error message for a statement that appears to be correct, check whether the semicolon was omitted from the previous statement.
The message that a variable is “uninitialized” or “not found” usually means it has been misspelled.
To correct a missing quote, submit: '; run;or "; run; and then correct the program and resubmit it.
©2002 CRC Press LLC
Chapter 2
Data Description and
Simple Inference:
Mortality and Water
Hardness in the U.K.
2.1 Description of Data
The data to be considered in this chapter were collected in an investigation of environmental causes of diseases, and involve the annual mortality rates per 100,000 for males, averaged over the years from 1958 to 1964, and the calcium concentration (in parts per million) in the drinking water supply for 61 large towns in England and Wales. (The higher the calcium concentration, the harder the water.) The data appear in Table 7 of SDS and have been rearranged for use here as shown in Display 2.1. (Towns at least as far north as Derby are identified in the table by an asterisk.)
The main questions of interest about these data are as follows:
How are mortality and water hardness related?
Is there a geographical factor in the relationship?
©2002 CRC Press LLC
2.2 Methods of Analysis
Initial examination of the data involves graphical techniques such as histograms and normal probability plots to assess the distributional properties of the two variables, to make general patterns in the data more visible, and to detect possible outliers. Scatterplots are used to explore the relationship between mortality and calcium concentration.
Following this initial graphical exploration, some type of correlation coefficient can be computed for mortality and calcium concentration. Pearson’s correlation coefficient is generally used but others, for example, Spearman’s rank correlation, may be more appropriate if the data are not considered to have a bivariate normal distribution. The relationship between the two variables is examined separately for northern and southern towns.
Finally, it is of interest to compare the mean mortality and mean calcium concentration in the north and south of the country by using either a t-test or its nonparametric alternative, the Wilcoxon rank-sum test.
2.3 Analysis Using SAS
Assuming the data is stored in an ASCII file, water.dat, as listed in Display 2.1 (i.e., including the '*'to identify the location of the town and the name of the town), then they can be read in using the following instructions:
Town |
Mortality |
Hardness |
|
|
|
Bath |
1247 |
105 |
* Birkenhead |
1668 |
17 |
Birmingham |
1466 |
5 |
* Blackburn |
1800 |
14 |
* Blackpool |
1609 |
18 |
* Bolton |
1558 |
10 |
* Bootle |
1807 |
15 |
Bournemouth |
1299 |
78 |
* Bradford |
1637 |
10 |
Brighton |
1359 |
84 |
Bristol |
1392 |
73 |
* Burnley |
1755 |
12 |
Cardiff |
1519 |
21 |
Coventry |
1307 |
78 |
Croydon |
1254 |
96 |
* Darlington |
1491 |
20 |
* Derby |
1555 |
39 |
* Doncaster |
1428 |
39 |
|
|
|
©2002 CRC Press LLC
|
|
East Ham |
1318 |
122 |
|
|
|
Exeter |
1260 |
21 |
|
|
* Gateshead |
1723 |
44 |
|
|
|
* Grimsby |
1379 |
94 |
|
|
|
* |
Halifax |
1742 |
8 |
|
|
* Huddersfield |
1574 |
9 |
|
|
|
* |
Hull |
1569 |
91 |
|
|
|
Ipswich |
1096 |
138 |
|
|
* Leeds |
1591 |
16 |
|
|
|
|
Leicester |
1402 |
37 |
|
|
* |
Liverpool |
1772 |
15 |
|
|
* Manchester |
1828 |
8 |
|
|
|
* Middlesbrough |
1704 |
26 |
|
|
|
* Newcastle |
1702 |
44 |
|
|
|
|
Newport |
1581 |
14 |
|
|
|
Northampton |
1309 |
59 |
|
|
|
Norwich |
1259 |
133 |
|
|
* Nottingham |
1427 |
27 |
|
|
|
* Oldham |
1724 |
6 |
|
|
|
|
Oxford |
1175 |
107 |
|
|
|
Plymouth |
1486 |
5 |
|
|
|
Portsmouth |
1456 |
90 |
|
|
* |
Preston |
1696 |
6 |
|
|
|
Reading |
1236 |
101 |
|
|
* Rochdale |
1711 |
13 |
|
|
|
* Rotherham |
1444 |
14 |
|
|
|
* St Helens |
1591 |
49 |
|
|
|
* |
Salford |
1987 |
8 |
|
|
* Sheffield |
1495 |
14 |
|
|
|
|
Southampton |
1369 |
68 |
|
|
|
Southend |
1257 |
50 |
|
|
* Southport |
1587 |
75 |
|
|
|
* South Shields |
1713 |
71 |
|
|
|
* |
Stockport |
1557 |
13 |
|
|
* Stoke |
1640 |
57 |
|
|
|
* Sunderland |
1709 |
71 |
|
|
|
|
Swansea |
1625 |
13 |
|
|
* |
Wallasey |
1625 |
20 |
|
|
|
Walsall |
1527 |
60 |
|
|
|
West Bromwich |
1627 |
53 |
|
|
|
West Ham |
1486 |
122 |
|
|
|
Wolverhampton |
1485 |
81 |
|
|
* York |
1378 |
71 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Display 2.1
©2002 CRC Press LLC
data water;
infile 'water.dat';
input flag $ 1 Town $ 2-18 Mortal 19-22 Hardness 25-27; if flag = '*' then location = 'north';
else location = 'south';
run;
The input statement uses SAS's column input where the exact columns containing the data for each variable are specified. Column input is simpler than list input in this case for three reasons:
There is no space between the asterisk and the town name.
Some town names are longer than eight characters — the default length for character variables.
Some town names contain spaces, which would make list input complicated.
The univariate procedure can be used to examine the distributions of numeric variables. The following simple instructions lead to the results shown in Displays 2.2 and 2.3:
proc univariate data=water normal; var mortal hardness;
histogram mortal hardness /normal; probplot mortal hardness;
run;
The normal option on the proc statement results in a test for the normality of the variables (see below). The var statement specifies which variable(s) are to be included. If the var statement is omitted, the default is all the numeric variables in the data set. The histogram statement produces histograms for both variables and the /normal option requests a normal distribution curve. Curves for various other distributions, including nonparametric kernel density estimates (see Silverman [1986]) can be produced by varying this option. Probability plots are requested with the probplot statement. Normal probability plots are the default. The resulting histograms and plots are shown in Displays 2.4 to 2.7.
Displays 2.2 and 2.3 provide significant information about the distributions of the two variables, mortality and hardness. Much of this is selfexplanatory, for example, Mean, Std Deviation, Variance, and N. The meaning of some of the other statistics printed in these displays are as follows:
Uncorrected SS: Uncorrected sum of squares; simply the sum of squares of the observations
©2002 CRC Press LLC
Corrected SS: |
Corrected sum of squares; simply the sum |
|
of squares of deviations of the observa- |
|
tions from the sample mean |
Coeff Variation: |
Coefficient of variation; the standard devi- |
|
ation divided by the mean and multiplied |
|
by 100 |
Std Error Mean: |
Standard deviation divided by the square |
|
root of the number of observations |
Range: |
Difference between largest and smallest |
|
observation in the sample |
Interquartile Range: |
Difference between the 25% and 75% |
|
quantiles (see values of quantiles given |
|
later in display to confirm) |
Student’s t: |
Student’s t-test value for testing that the |
|
population mean is zero |
Pr>|t|: |
Probability of a greater absolute value for |
|
the t-statistic |
Sign Test: |
Nonparametric test statistic for testing |
|
whether the population median is zero |
Pr>|M|: |
Approximation to the probability of a |
|
greater absolute value for the Sign test |
|
under the hypothesis that the population |
|
median is zero |
Signed Rank: |
Nonparametric test statistic for testing |
|
whether the population mean is zero |
Pr>=|S|: |
Approximation to the probability of a |
|
greater absolute value for the Sign Rank |
|
statistic under the hypothesis that the pop- |
|
ulation mean is zero |
Shapiro-Wilk W: |
Shapiro-Wilk statistic for assessing the nor- |
|
mality of the data and the corresponding |
|
P-value (Shapiro and Wilk [1965]) |
Kolmogorov-Smirnov D: Kolmogorov-Smirnov statistic for assessing the normality of the data and the corresponding P-value (Fisher and Van Belle [1993])
Cramer-von Mises W-sq: Cramer-von Mises statistic for assessing the normality of the data and the associated P-value (Everitt [1998])
Anderson-Darling A-sq: Anderson-Darling statistic for assessing the normality of the data and the associated P-value (Everitt [1998])
©2002 CRC Press LLC
The UNIVARIATE Procedure
Variable: Mortal
Moments
N |
61 |
Sum Weights |
|
61 |
Mean |
1524.14754 |
Sum Observations |
|
92973 |
Std Deviation |
187.668754 |
Variance |
35219.5612 |
|
Skewness |
-0.0844436 |
Kurtosis |
-0 |
.4879484 |
Uncorrected SS |
143817743 |
Corrected SS |
2113173.67 |
|
Coeff Variation |
12.3130307 |
Std Error Mean |
24 |
.0285217 |
|
Basic Statistical Measures |
|
|
Location |
Variability |
|
|
Mean |
1524.148 |
Std Deviation |
187.66875 |
Median |
1555.000 |
Variance |
35220 |
Mode |
1486.000 |
Range |
891.00000 |
|
|
Interquartile Range |
289.00000 |
NOTE: The mode displayed is the smallest of 3 modes with a count of 2.
Tests for Location: Mu0=0
Test |
-Statistic- |
-----P-value------ |
|
||||
Student's t |
t |
63.43077 |
Pr > |t| |
<.0001 |
|||
Sign |
M |
|
30.5 |
Pr >= |M| |
<.0001 |
|
|
Signed Rank |
S |
|
945.5 |
Pr >= |S| |
<.0001 |
|
|
|
Tests for Normality |
|
|
|
|||
Test |
|
--Statistic--- |
-----P-value------ |
||||
Shapiro-Wilk |
|
W |
0.985543 |
Pr < W |
0.6884 |
||
Kolmogorov-Smirnov |
|
D |
0.073488 |
Pr > D |
>0.1500 |
||
Cramer-von Mises |
|
W-Sq |
0.048688 |
Pr > W-Sq |
>0.2500 |
||
Anderson-Darling |
|
A-Sq |
0.337398 |
Pr > A-Sq |
>0.2500 |
©2002 CRC Press LLC
Quantiles (Definition 5)
Quantile |
Estimate |
||
100% Max |
|
1987 |
|
99% |
|
|
1987 |
95% |
|
|
1800 |
90% |
|
|
1742 |
75% Q3 |
|
1668 |
|
50% Median |
|
1555 |
|
25% Q1 |
|
1379 |
|
10% |
|
|
1259 |
5% |
|
|
1247 |
1% |
|
|
1096 |
0% Min |
|
1096 |
|
Extreme Observations |
|||
----Lowest---- |
----Highest--- |
||
Value |
Obs |
Value |
Obs |
1096 |
26 |
1772 |
29 |
1175 |
38 |
1800 |
4 |
1236 |
42 |
1807 |
7 |
1247 |
1 |
1828 |
30 |
1254 |
15 |
1987 |
46 |
Fitted Distribution for Mortal
Parameters for Normal Distribution
Parameter |
Symbol |
Estimate |
Mean |
Mu |
1524.148 |
Std Dev |
Sigma |
187.6688 |
Goodness-of-Fit Tests for Normal Distribution |
|
|||
Test |
---Statistic---- |
-----P-value----- |
||
Kolmogorov-Smirnov |
D |
0.07348799 |
Pr > D |
>0.150 |
Cramer-von Mises |
W-Sq |
0.04868837 |
Pr > W-Sq |
>0.250 |
Anderson-Darling |
A-Sq |
0.33739780 |
Pr > A-Sq |
>0.250 |
©2002 CRC Press LLC
Quantiles for Normal Distribution |
|||
|
|
------Quantile------ |
|
Percent |
Observed |
Estimated |
|
1 |
.0 |
1096.00 |
1087.56 |
5 |
.0 |
1247.00 |
1215.46 |
10 |
.0 |
1259.00 |
1283.64 |
25 |
.0 |
1379.00 |
1397.57 |
50 |
.0 |
1555.00 |
1524.15 |
75 |
.0 |
1668.00 |
1650.73 |
90 |
.0 |
1742.00 |
1764.65 |
95 |
.0 |
1800.00 |
1832.84 |
99 |
.0 |
1987.00 |
1960.73 |
Display 2.2
The UNIVARIATE Procedure |
|
|
|
|
|
Variable: Hardness |
|
|
|
|
Moments |
|
|
|
N |
61 |
Sum Weights |
|
61 |
Mean |
47.1803279 |
Sum Observations |
|
2878 |
Std Deviation |
38.0939664 |
Variance |
1451.15027 |
|
Skewness |
0.69223461 |
Kurtosis |
-0 |
.6657553 |
Uncorrected SS |
222854 |
Corrected SS |
87069.0164 |
|
Coeff Variation |
80.7412074 |
Std Error Mean |
4 |
.8774326 |
|
Basic Statistical Measures |
|
|
|
Location |
Variability |
|
|
|
Mean |
47.18033 |
Std Deviation |
38 |
.09397 |
Median |
39.00000 |
Variance |
|
1451 |
Mode |
14.00000 |
Range |
133 |
.00000 |
|
|
Interquartile Range |
61 |
.00000 |
|
Tests for Location: Mu0=0 |
|
|||
Test |
|
-Statistic- |
|
-----P-value------ |
|
Student's t |
t |
9.673189 |
Pr > |t| |
<.0001 |
|
Sign |
M |
30 |
.5 |
Pr >= |M| |
<.0001 |
Signed Rank |
S |
945 |
.5 |
Pr >= |S| |
<.0001 |
©2002 CRC Press LLC