Handbook_of_statistical_analysis_using_SAS

between them with the message “Press Forward to see next graph” in the status line. The Page Down and Page Up keys are used for Forward and Backward, respectively.

1.10 Some Tips for Preventing and Correcting Errors

When writing programs:

1. One statement per line, where possible.

2. End each step with a run statement.

3. Indent each statement within a step (i.e., each statement between the data or proc statement and the run statement) by a couple of spaces. This is automated in the enhanced editor.

4. Give the full path name for raw data files on the infile statement.

5. Begin any programs that produce graphics with goptions reset=all; and then set the required options.
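
As an illustration only, the tips above might produce a program laid out like the following sketch. The file path, the data set name example, and the variables x and y are all hypothetical and stand in for whatever the real program uses.

```sas
/* Hypothetical example: path, data set, and variable names are assumptions */
goptions reset=all;                     /* tip 5: start graphics from the defaults */

data example;
  infile 'c:\mydata\scores.dat';        /* tip 4: full path to the raw data file */
  input x y;                            /* list input: two numeric values per line */
run;                                    /* tip 2: end the step with run */

proc gplot data=example;
  plot y*x;                             /* tip 3: statements indented within the step */
run;
```

Each statement sits on its own line and ends with a semicolon, so a missing semicolon or run statement is easy to spot when the program is collapsed in the enhanced editor.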

Before submitting a program:

1. Check that each statement ends with a semicolon.

2. Check that all opening and closing quotes match. Use the enhanced editor colour coding to double-check.

3. Check any statement that does not begin with a keyword (blue, or navy blue) or a variable name (black).

4. Large blocks of purple may indicate a missing quotation mark.

5. Large areas of green may indicate a missing */ from a comment.

“Collapse” the program to check its overall structure. Hold down the Ctrl and Alt keys and press the numeric keypad minus key. Only the data, proc statements, and global statements should be visible. To expand the program, press the numeric keypad plus key while holding down Ctrl and Alt.

After running a program:

1. Examine the SAS log for warning and error messages.

2. Check for the message: "SAS went to a new line when INPUT statement reached past the end of a line" when using list input.

3. Verify that the number of observations and variables read in is correct.

©2002 CRC Press LLC

4. Print out small data sets to ensure that they have been read correctly.

If there is an error message for a statement that appears to be correct, check whether the semicolon was omitted from the previous statement.

The message that a variable is “uninitialized” or “not found” usually means it has been misspelled.

To correct a missing quote, submit: '; run; or "; run; and then correct the program and resubmit it.

Chapter 2

Data Description and Simple Inference: Mortality and Water Hardness in the U.K.

2.1 Description of Data

The data to be considered in this chapter were collected in an investigation of environmental causes of diseases, and involve the annual mortality rates per 100,000 for males, averaged over the years from 1958 to 1964, and the calcium concentration (in parts per million) in the drinking water supply for 61 large towns in England and Wales. (The higher the calcium concentration, the harder the water.) The data appear in Table 7 of SDS and have been rearranged for use here as shown in Display 2.1. (Towns at least as far north as Derby are identified in the table by an asterisk.)

The main questions of interest about these data are as follows:

How are mortality and water hardness related?

Is there a geographical factor in the relationship?

2.2 Methods of Analysis

Initial examination of the data involves graphical techniques such as histograms and normal probability plots to assess the distributional properties of the two variables, to make general patterns in the data more visible, and to detect possible outliers. Scatterplots are used to explore the relationship between mortality and calcium concentration.

Following this initial graphical exploration, some type of correlation coefficient can be computed for mortality and calcium concentration. Pearson’s correlation coefficient is generally used but others, for example, Spearman’s rank correlation, may be more appropriate if the data are not considered to have a bivariate normal distribution. The relationship between the two variables is examined separately for northern and southern towns.

Finally, it is of interest to compare the mean mortality and mean calcium concentration in the north and south of the country by using either a t-test or its nonparametric alternative, the Wilcoxon rank-sum test.
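
The analyses outlined above correspond to standard SAS procedures. The following is a sketch only, assuming a data set water with numeric variables mortal and hardness and a character variable location taking the values north and south, as created in Section 2.3. It is not necessarily the exact sequence of steps used in the rest of the chapter.

```sas
/* Pearson and Spearman correlations for all 61 towns */
proc corr data=water pearson spearman;
  var mortal hardness;
run;

/* BY-group processing requires the data to be sorted by location */
proc sort data=water;
  by location;
run;

/* The same correlations, separately for northern and southern towns */
proc corr data=water pearson spearman;
  var mortal hardness;
  by location;
run;

/* Two-sample t-tests comparing north and south */
proc ttest data=water;
  class location;
  var mortal hardness;
run;

/* Wilcoxon rank-sum test as the nonparametric alternative */
proc npar1way data=water wilcoxon;
  class location;
  var mortal hardness;
run;
```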

2.3 Analysis Using SAS

Assuming the data are stored in an ASCII file, water.dat, as listed in Display 2.1 (i.e., including the '*' to identify the location of the town and the name of the town), they can be read in using the following instructions:

  Town           Mortality  Hardness

  Bath                1247       105
* Birkenhead          1668        17
  Birmingham          1466         5
* Blackburn           1800        14
* Blackpool           1609        18
* Bolton              1558        10
* Bootle              1807        15
  Bournemouth         1299        78
* Bradford            1637        10
  Brighton            1359        84
  Bristol             1392        73
* Burnley             1755        12
  Cardiff             1519        21
  Coventry            1307        78
  Croydon             1254        96
* Darlington          1491        20
* Derby               1555        39
* Doncaster           1428        39
  East Ham            1318       122
  Exeter              1260        21
* Gateshead           1723        44
* Grimsby             1379        94
* Halifax             1742         8
* Huddersfield        1574         9
* Hull                1569        91
  Ipswich             1096       138
* Leeds               1591        16
  Leicester           1402        37
* Liverpool           1772        15
* Manchester          1828         8
* Middlesbrough       1704        26
* Newcastle           1702        44
  Newport             1581        14
  Northampton         1309        59
  Norwich             1259       133
* Nottingham          1427        27
* Oldham              1724         6
  Oxford              1175       107
  Plymouth            1486         5
  Portsmouth          1456        90
* Preston             1696         6
  Reading             1236       101
* Rochdale            1711        13
* Rotherham           1444        14
* St Helens           1591        49
* Salford             1987         8
* Sheffield           1495        14
  Southampton         1369        68
  Southend            1257        50
* Southport           1587        75
* South Shields       1713        71
* Stockport           1557        13
* Stoke               1640        57
* Sunderland          1709        71
  Swansea             1625        13
* Wallasey            1625        20
  Walsall             1527        60
  West Bromwich       1627        53
  West Ham            1486       122
  Wolverhampton       1485        81
* York                1378        71

Display 2.1

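data water;
  infile 'water.dat';
  /* column input: flag in column 1, town name in columns 2-18, etc. */
  input flag $ 1 Town $ 2-18 Mortal 19-22 Hardness 25-27;
  /* an asterisk marks towns at least as far north as Derby */
  if flag = '*' then location = 'north';
  else location = 'south';
run;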

The input statement uses SAS's column input where the exact columns containing the data for each variable are specified. Column input is simpler than list input in this case for three reasons:

There is no space between the asterisk and the town name.

Some town names are longer than eight characters — the default length for character variables.

Some town names contain spaces, which would make list input complicated.
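
Following the tips in Section 1.10, it may be worth confirming that the column input worked before going further. A minimal check, assuming the data step above ran and created the data set water:

```sas
/* List the first few observations to confirm the columns line up */
proc print data=water(obs=5);
run;

/* Count the towns in each location derived from the asterisk flag */
proc freq data=water;
  tables location;
run;
```

If the file matches Display 2.1, the log for the data step should report 61 observations and 5 variables (flag, Town, Mortal, Hardness, and location).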

The univariate procedure can be used to examine the distributions of numeric variables. The following simple instructions lead to the results shown in Displays 2.2 and 2.3:

proc univariate data=water normal;
  var mortal hardness;
  histogram mortal hardness / normal;
  probplot mortal hardness;
run;

The normal option on the proc statement results in a test for the normality of the variables (see below). The var statement specifies which variable(s) are to be included. If the var statement is omitted, the default is all the numeric variables in the data set. The histogram statement produces histograms for both variables and the /normal option requests a normal distribution curve. Curves for various other distributions, including nonparametric kernel density estimates (see Silverman [1986]) can be produced by varying this option. Probability plots are requested with the probplot statement. Normal probability plots are the default. The resulting histograms and plots are shown in Displays 2.4 to 2.7.
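
As a hedged illustration of varying the histogram option, the sketch below requests a kernel density estimate in place of the normal curve and adds a reference line to the normal probability plots. It assumes the same water data set; the choice of options is for illustration and is not taken from the analysis in this chapter.

```sas
/* noprint suppresses the tables of statistics so only the plots are produced */
proc univariate data=water noprint;
  var mortal hardness;
  /* overlay a nonparametric kernel density estimate on each histogram */
  histogram mortal hardness / kernel;
  /* normal probability plots with a line based on the estimated mean and SD */
  probplot mortal hardness / normal(mu=est sigma=est);
run;
```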

Displays 2.2 and 2.3 provide considerable information about the distributions of the two variables, mortality and hardness. Much of this is self-explanatory, for example, Mean, Std Deviation, Variance, and N. The meanings of some of the other statistics printed in these displays are as follows:

Uncorrected SS: Uncorrected sum of squares; simply the sum of squares of the observations

Corrected SS: Corrected sum of squares; simply the sum of squares of deviations of the observations from the sample mean

Coeff Variation: Coefficient of variation; the standard deviation divided by the mean and multiplied by 100

Std Error Mean: Standard deviation divided by the square root of the number of observations

Range: Difference between largest and smallest observation in the sample

Interquartile Range: Difference between the 25% and 75% quantiles (see values of quantiles given later in display to confirm)

Student’s t: Student’s t-test value for testing that the population mean is zero

Pr>|t|: Probability of a greater absolute value for the t-statistic

Sign Test: Nonparametric test statistic for testing whether the population median is zero

Pr>=|M|: Approximation to the probability of a greater absolute value for the Sign test under the hypothesis that the population median is zero

Signed Rank: Nonparametric test statistic for testing whether the population mean is zero

Pr>=|S|: Approximation to the probability of a greater absolute value for the Signed Rank statistic under the hypothesis that the population mean is zero

Shapiro-Wilk W: Shapiro-Wilk statistic for assessing the normality of the data and the corresponding P-value (Shapiro and Wilk [1965])

Kolmogorov-Smirnov D: Kolmogorov-Smirnov statistic for assessing the normality of the data and the corresponding P-value (Fisher and Van Belle [1993])

Cramer-von Mises W-sq: Cramer-von Mises statistic for assessing the normality of the data and the associated P-value (Everitt [1998])

Anderson-Darling A-sq: Anderson-Darling statistic for assessing the normality of the data and the associated P-value (Everitt [1998])
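
Several of these definitions can be checked by hand. The sketch below recomputes a few of the derived statistics for Mortal from the values of N, Mean, and Std Deviation printed in Display 2.2; the variable names are arbitrary and the step is an illustration, not part of the original analysis.

```sas
data _null_;
  n    = 61;             /* N from Display 2.2 */
  mean = 1524.14754;     /* Mean */
  sd   = 187.668754;     /* Std Deviation */

  css = (n - 1) * sd**2;                 /* Corrected SS */
  cv  = 100 * sd / mean;                 /* Coeff Variation */
  se  = sd / sqrt(n);                    /* Std Error Mean */
  t   = mean / se;                       /* Student's t for testing mean = 0 */
  p   = 2 * (1 - probt(abs(t), n - 1));  /* two-sided P-value on n-1 df */

  put css= cv= se= t= p=;                /* write the results to the log */
run;
```

The values written to the log should agree, up to rounding, with the Corrected SS, Coeff Variation, Std Error Mean, and Student's t entries in Display 2.2.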

The UNIVARIATE Procedure
Variable: Mortal

Moments

N                         61    Sum Weights                61
Mean              1524.14754    Sum Observations        92973
Std Deviation     187.668754    Variance           35219.5612
Skewness          -0.0844436    Kurtosis           -0.4879484
Uncorrected SS     143817743    Corrected SS       2113173.67
Coeff Variation   12.3130307    Std Error Mean     24.0285217

Basic Statistical Measures

     Location                   Variability
Mean      1524.148    Std Deviation          187.66875
Median    1555.000    Variance                   35220
Mode      1486.000    Range                  891.00000
                      Interquartile Range    289.00000

NOTE: The mode displayed is the smallest of 3 modes with a count of 2.

Tests for Location: Mu0=0

Test           -Statistic-      -----P-value------
Student's t    t    63.43077    Pr > |t|     <.0001
Sign           M        30.5    Pr >= |M|    <.0001
Signed Rank    S       945.5    Pr >= |S|    <.0001

Tests for Normality

Test                  --Statistic---      -----P-value------
Shapiro-Wilk          W       0.985543    Pr < W       0.6884
Kolmogorov-Smirnov    D       0.073488    Pr > D      >0.1500
Cramer-von Mises      W-Sq    0.048688    Pr > W-Sq   >0.2500
Anderson-Darling      A-Sq    0.337398    Pr > A-Sq   >0.2500

Quantiles (Definition 5)

Quantile      Estimate
100% Max          1987
99%               1987
95%               1800
90%               1742
75% Q3            1668
50% Median        1555
25% Q1            1379
10%               1259
5%                1247
1%                1096
0% Min            1096

Extreme Observations

----Lowest----      ----Highest---
Value      Obs      Value      Obs
 1096       26       1772       29
 1175       38       1800        4
 1236       42       1807        7
 1247        1       1828       30
 1254       15       1987       46

Fitted Distribution for Mortal

Parameters for Normal Distribution

Parameter    Symbol    Estimate
Mean         Mu        1524.148
Std Dev      Sigma     187.6688

Goodness-of-Fit Tests for Normal Distribution

Test                  ---Statistic----      -----P-value-----
Kolmogorov-Smirnov    D     0.07348799      Pr > D      >0.150
Cramer-von Mises      W-Sq  0.04868837      Pr > W-Sq   >0.250
Anderson-Darling      A-Sq  0.33739780      Pr > A-Sq   >0.250

Quantiles for Normal Distribution

            ------Quantile------
Percent     Observed   Estimated
    1.0      1096.00     1087.56
    5.0      1247.00     1215.46
   10.0      1259.00     1283.64
   25.0      1379.00     1397.57
   50.0      1555.00     1524.15
   75.0      1668.00     1650.73
   90.0      1742.00     1764.65
   95.0      1800.00     1832.84
   99.0      1987.00     1960.73

Display 2.2

The UNIVARIATE Procedure
Variable: Hardness

Moments

N                         61    Sum Weights                61
Mean              47.1803279    Sum Observations         2878
Std Deviation     38.0939664    Variance           1451.15027
Skewness          0.69223461    Kurtosis           -0.6657553
Uncorrected SS        222854    Corrected SS       87069.0164
Coeff Variation   80.7412074    Std Error Mean      4.8774326

Basic Statistical Measures

     Location                   Variability
Mean      47.18033    Std Deviation           38.09397
Median    39.00000    Variance                    1451
Mode      14.00000    Range                  133.00000
                      Interquartile Range     61.00000

Tests for Location: Mu0=0

Test           -Statistic-      -----P-value------
Student's t    t    9.673189    Pr > |t|     <.0001
Sign           M        30.5    Pr >= |M|    <.0001
Signed Rank    S       945.5    Pr >= |S|    <.0001
