
lowest to highest values. If the residuals show a pattern — i.e., contiguous zones where all of the values are > 0 or < 0 — the data have a curved shape, and the straight line is an inappropriate model that produces a misfit.

This method of checking residuals is commonly used in multivariable analysis. In simple bivariate analysis, however, a much simpler option is to inspect the graph of the original points for Yi vs. Xi. This graph is always easy to plot, and a computer will do it, if asked. If the points show a clear non-rectilinear pattern — as in the middle and right side of Figure 18.13 — the straight line is a poor or potentially unsatisfactory model. If the points show a diverse scatter with no evident pattern, as in Figure 18.16, and if all you want to know is the average trend in their relationship, a straight line can be satisfactory.
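The residual inspection just described can be sketched in a few lines of Python (hypothetical data; the helper names are illustrative). Fitting a straight line to clearly curved data leaves the residuals in contiguous same-sign zones:

```python
def fit_line(x, y):
    """Least-squares line: b = Sxy/Sxx, a = mean(y) - b*mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def sign_runs(residuals):
    """Count contiguous zones in which all residuals have the same sign."""
    signs = [1 if r > 0 else -1 for r in residuals]
    return 1 + sum(1 for s1, s2 in zip(signs, signs[1:]) if s1 != s2)

x = list(range(1, 11))
y_curved = [xi ** 2 for xi in x]          # clearly curved (parabolic) data
a, b = fit_line(x, y_curved)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y_curved)]
# A parabola fitted by a straight line leaves residuals in 3 zones:
# positive at both ends, negative in the middle -- a misfit pattern.
assert sign_runs(resid) == 3
```

A scatter with no curved pattern would instead show residuals whose signs alternate irregularly, with many short runs.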

The bivariate graph of points should always be drawn and inspected, and it should preferably be displayed in the published report. For drawing conclusions about a bi-dimensional relationship, a single graphic picture can be worth more than 1000 “words” of statistical calculations.

19.6.2.3 Anscombe’s “Quartet” — F. J. Anscombe (Professor Emeritus of Statistics at Yale University) constructed15 an ingenious illustration of the problems inherent in doing regressions and interpreting coefficients without carefully examining the graph of the data. He prepared four different sets of bivariate data that all had the same summary values. For each set of data, n = 11, X̄ = 9.0, Ȳ = 7.5, Sxx = 110.0, Syy = 41.25, and Sxy = 55.01. Each set of data also had the same Sr = 13.75, SM = 27.50, r² = .667, and the same regression line: Ŷi = 3 + 0.5Xi. From this striking similarity in all pertinent features of the univariate and bivariate statistical summaries, the graphs for the four sets of data could be expected to look quite similar. Nevertheless, the four data sets had the disparate patterns shown in Figure 19.10. Whenever you think that a statistical summary has told you everything, and that you need not bother looking at the graphical portrait, remember the cacophonous music of “Anscombe’s quartet”!
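These identical summaries are easy to verify. The sketch below uses Anscombe’s published data values15 and recomputes the summary statistics for each of the four sets in plain Python:

```python
# Anscombe's four data sets (Anscombe, 1973). Sets 1-3 share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def summary(x, y):
    """Return (mean x, mean y, slope b, intercept a, r) for one data set."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope (see Appendix A.19.1)
    a = my - b * mx               # intercept (see Appendix A.19.2)
    r = sxy / (sxx * syy) ** 0.5
    return mx, my, b, a, r

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    mx, my, b, a, r = summary(x, y)
    # Every set: X-bar = 9.0, Y-bar ~ 7.5, line ~ Y-hat = 3 + 0.5X, r^2 ~ .667
    assert abs(mx - 9.0) < 1e-9 and abs(my - 7.5) < 0.01
    assert abs(b - 0.5) < 0.01 and abs(a - 3.0) < 0.1
    assert abs(r ** 2 - 0.667) < 0.01
```

The assertions all pass, yet a plot of the four sets shows four utterly different shapes.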

19.6.3 Potential Influence of Outliers

In univariate analysis, outlier points could distort the effectiveness with which a mean and standard deviation represented the associated data. In the conventional (“parametric”) form of bivariate analysis under discussion here, the adverse effects can be particularly pernicious. The outlier can greatly alter the covariance as well as the variances used in the calculations.

For example, consider the eleven points of the D graph in “Anscombe’s Quartet” (Figure 19.10). The first ten points all lie on a vertical straight line that shows no relationship between X and Y. The eleventh point, however, is an outlier. It may represent an error in measurement, a person who does not belong in the group, or someone who is there quite properly. Regardless of the substantive propriety of the outlier, however, it will have the profound statistical impact shown in the summary coefficients and the linear graph.
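A small sketch shows how completely that eleventh point carries the result. Using Anscombe’s published D values15: without the outlier, the remaining ten points have no variation in X at all, so Sxx = 0 and neither a slope nor a correlation coefficient can even be computed.

```python
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]   # Anscombe's data set D
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def sxx_and_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    r = sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else None
    return sxx, r

# With the outlier at x = 19: a "strong" correlation appears.
sxx_all, r_all = sxx_and_r(x4, y4)
assert r_all is not None and r_all > 0.8

# Without the outlier: all X values are 8, Sxx = 0, and both
# b = Sxy/Sxx and r are undefined.
x10, y10 = x4[:7] + x4[8:], y4[:7] + y4[8:]
sxx10, r10 = sxx_and_r(x10, y10)
assert sxx10 == 0 and r10 is None
```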

© 2002 by Chapman & Hall/CRC


FIGURE 19.10

The four performers in “Anscombe’s Quartet.” (For details, see text.) [Figure and legend taken from Chapter Reference 15.]

19.6.3.1 Published Illustration of Problem — If you believe that Anscombe’s imagination was too wild, and that this type of configuration occurs too rarely to be regarded as a serious problem, consider Figure 19.11, which achieved publication in a prominent medical journal.16 The investigators did not succeed in exactly replicating Anscombe’s D pattern, but they certainly came close.

FIGURE 19.11
Relation of the baroreflex sensitivity during ACE inhibition (vertical axis) to the baseline sensitivity (horizontal axis). Both axes are in ms/mmHg; the plotted line is y = 2.80x − 0.46, with r = 0.84 and p < 0.005. [Figure and legend taken from Chapter Reference 16.]

The best way to avoid such problems is to look at the data. If you see a pattern such as D in Figure 19.10 or the graph in Figure 19.11, do not try to summarize the results with conventional regression and correlation coefficients.

19.6.3.2 Another Published Illustration — A particularly dramatic example of the outlier problem appeared earlier in Figure 19.2. The data in the graph (and in the study itself) had been used to conclude that “large science departments ... are more productive than small ones” in British universities. With that premise, government agencies were thinking about closing small science departments. Two physicists at Sussex University, however, pointed out6 that the main effect in Figure 19.2 is produced by the two outlier points on the far right of the graph. These two points happen to be Oxford and Cambridge Universities, which have other atypical attributes. If the Oxford and Cambridge points are removed from the data, the effect of department size on productivity vanishes. Outraged about decisions based on this type of “spurious correlation,” one of the physicists said, “Who’s going to have a policy where you start closing university departments on the basis of a graph that’s really fuzzy … [and] easily destroyed by a very simple commonsense thing [i.e., removing the two outlier points].”

In ordinary bivariate regression, the effects of outliers can be reduced by transferring from a dimensional to an ordinal analysis, as discussed in Chapter 27, using the ranks of the data rather than absolute values. (In multivariable analysis, the effect of outliers can be examined with various “influence functions.” Some of them use the jackknife method of removing one member at a time, recalculating the regression coefficients, and seeing how extensively they vary.)
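The rank-based remedy can be sketched with hypothetical data (the function names are illustrative): ten points with essentially no relationship, plus one extreme outlier. The dimensional (Pearson) coefficient is manufactured almost entirely by the outlier, while the rank-based (Spearman) coefficient stays modest:

```python
# Hypothetical data: nine unrelated points plus one extreme outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
y = [5, 3, 8, 1, 9, 2, 7, 4, 6, 50]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    """Rank values 1..n (this sketch assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

r_dim = pearson(x, y)                  # dimensional analysis
r_ord = pearson(ranks(x), ranks(y))    # ordinal (Spearman) analysis

assert r_dim > 0.9    # the outlier manufactures a "strong" correlation
assert r_ord < 0.5    # the rank-based coefficient stays modest
```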

19.6.4 Causal Implications

Scientifically oriented readers probably need no reminder of the old statistical cliché that correlation does not imply causation. No matter how stochastically and quantitatively significant a correlation coefficient may be, its magnitude does not imply that the association is a causal relationship.

Most thoughtful scientists today are aware of this distinction, but it is still often disregarded during some of the etiologically infatuated reasoning that may occur during analyses of alleged causes of chronic disease. The attendant biases, which are beyond the scope of the discussion here, can distort the true value of correlation coefficients and lead to erroneous conclusions about etiology.

An important point to bear in mind, however, is that diverse types of strong associations can be found during various types of “dredging” from the computerized databases that are now widely available. If you cannot dredge the abundant data to find evidence associating anything that you want to incriminate and any effect that it allegedly produces, either you or your computer programmer is not very talented. For example, the annual rise in incidence rates for AIDS in the U.S. can be strongly correlated with increases in the annual sales of video cassette recorders. Yet no one has proposed — as of now — that the two events are causally related. A few years ago, however, a distinct association was found17 (in a case-control study) between AIDS and antecedent usage of amyl nitrite “poppers.” A chemical etiology was strongly proposed for AIDS before the causal virus was demonstrated, and before the role of amyl nitrite “poppers” was found to be correlated with the sexual activity that transmitted the virus.18

A well-known example of the fallacy that correlation implies causation is shown in Figure 19.12, where the population of Oldenburg during the years 1930–36 was plotted against the number of storks observed flying in the city each year.19 What better evidence could one want to prove that storks bring babies?

19.6.5 Additional “Sins”

Beyond all the problems just cited, many other opportunities are available to abuse the correlation/regression process. Of the many candidate “sins,” only five more will be listed here.

19.6.5.1 Retroactive Demarcations — In certain studies the regression line shows a “dose–response” or “exposure–response” relationship when different levels of response are plotted against increasing magnitudes of a pharmaceutical agent or a “risk factor.” If the regression line is not impressively “significant,” however, the investigator may demarcate the dose–exposure variable into zones and then search for “significance” in pairwise (or other) comparisons of results in the demarcated zones.

If the demarcation is announced beforehand or if it depends on quantiles that are determined inherently by the data, the procedure is scientifically “legitimate.” On the other hand, if the investigator makes arbitrary choices after inspecting the pattern of the data, the retroactive demarcations have dubious scientific credibility. They can sometimes be found, however, in studies where a binary variable for “exposure” is neither defined before the analysis nor stated in the text of the published report. Instead, after doing regression analysis for results of different levels of exposure, the investigators may present an odds ratio for response to a yes/no binary demarcation of “exposure.” Distressing examples of these retroactive demarcations have been noted20 in reports of the alleged relationship between endometrial cancer and postmenopausal use of estrogen.


FIGURE 19.12
A plot of the population of Oldenburg (in thousands) at the end of each year against the number of storks observed in that year, 1930–1936. [Figure and legend taken from Chapter Reference 19.]

19.6.5.2 Comparing Subsequent vs. Baseline Values — If variable X represents a person’s baseline value at time 1 and if another value of X is obtained for each person at a later time 2, the investigator may plot a graph showing the subsequent values as Y and the baseline values as X. If d is each person’s change in values, the graph is really a plot of (X + d) vs. X, and a strong correlation with the baseline values of X is inevitable.

Back in Chapter 7, this type of correlation was the reason for examining change in a single incremental group of (after − before) values, rather than comparing two groups of values. In addition to the previously cited impropriety of the graphical line, this problem occurred in the data reported in Figure 19.11. The investigators checked at least nine other variables (plasma renin, noradrenaline, vasopressin, etc.) that might affect baroreflex sensitivity in the displayed group, but concluded that “the only significant correlation (p < 0.005) was found between baroreflex sensitivity before and during ACE inhibition.”
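The built-in correlation of (X + d) vs. X is easy to sketch with hypothetical values. Even when the changes d are essentially unrelated to the baseline X, the subsequent values X + d correlate strongly with X:

```python
# Hypothetical baseline values X and changes d unrelated to X.
X = [12, 7, 15, 9, 11, 14, 8, 13, 10, 6]
d = [3, -2, 1, -4, 2, -1, 4, -3, 0, 1]
Y = [x + di for x, di in zip(X, d)]        # subsequent value = X + d

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# d itself is essentially uncorrelated with the baseline X ...
assert abs(pearson(X, d)) < 0.3
# ... yet (X + d) vs. X shows a "strong" correlation, built in by
# the shared X component on both axes.
assert pearson(X, Y) > 0.7
```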

A separate biologic correlation is often present, of course, between the amount of change in a variable and the initial level. An example of such a relationship is shown by the decreasing exponential curves that depict survival rates in a cohort, or levels of radioactive decay over time. The relationship between change and initial level is called the “law of initial value,” and its appropriate analytic management is an often-mishandled challenge in biomedical data.21

19.6.5.3 Extrapolation beyond Observed Range — Any regression line — whether rectilinear or curved — is constructed to fit the observed values of points, and the estimates are valid only within that range. For example, in most systems of laboratory measurement, voltages can be rectilinearly related to chemical concentrations only within a limited range of values. If the concentration becomes too high or too low, the laboratory will transfer to a different system of measurement (and a different linear relationship with voltage).

The problem of beyond-range extrapolation is particularly likely to occur when the X-variable is time, and when a future status is predicted from extrapolations beyond the most recent date in the regression line. The annals of errors in the biomedical or social sciences contain many wrong predictions about populations, economic events, global temperatures, etc. — all derived from beyond-range extrapolation of a regression equation.

FIGURE 19.13
Misleading aggregate regression line for individual results in four groups.

19.6.5.4 Inappropriate Combinations — Data from different studies are regularly aggregated for analysis as a single group during meta-analyses or other forms of analytic “pooling.” One requirement for pooling is that the individual results be reasonably homogeneous. Testing for “homogeneity” is often neglected, however, because its definition may be inconsistent for statistical components or nonexistent for biologic distinctions. Figure 19.13 displays a dramatic example of the error that can occur when the directions of individual group regressions are ignored during a combined analysis of the groups. Each of the individual groups shows a distinct negative association, but the aggregate shows a strong trend in the opposite (positive) direction.
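The pooling error can be sketched with hypothetical groups: any four groups whose centers rise together while their within-group trends fall will do.

```python
def slope(points):
    """Least-squares slope b = Sxy/Sxx for a list of (x, y) points."""
    x = [p[0] for p in points]
    y = [p[1] for p in points]
    n = len(points)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

groups = []
for g in range(1, 5):                     # four groups, like A-D
    c = 10 * g                            # group centers drift upward
    groups.append([(c - 1, c + 1), (c, c), (c + 1, c - 1)])

for grp in groups:
    assert slope(grp) == -1.0             # each group: negative trend

pooled = [p for grp in groups for p in grp]
assert slope(pooled) > 0.9                # aggregate: strong positive trend
```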

19.6.5.5 Inappropriate Fits — Almost any collection of data can be fitted with a polynomial or “time series” regression line expressed as Yi = a + bX + cX² + dX³ + …. Even if the data have no resemblance to a linear structure, the computer can find coefficients for a line that offers “best fit” for the collection of scattered points. Such an accomplishment appears in Figure 19.14, which shows total results for repeated measurements of a cohort of 74 patients receiving antiretroviral therapy after acute or recent HIV-1 seroconversion. The authors concluded that the line in Figure 19.14 “shows the rapid decrease in plasma levels of HIV1 RNA over time” and that “after 117 days after infection, an inflection point was reached at which HIV-1 levels stopped decreasing and gradually increased.” You can decide for yourself whether this interpretation is justified for the data fitted by the line.
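A sketch of such machinery: a cubic least-squares fit (solved here through the normal equations) applied to hypothetical pure-noise data still “explains” part of the scatter, even though no real trend exists.

```python
# "Noise" data with no real structure.
x = [float(i) for i in range(10)]
y = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, -2.0, 6.0, -5.0, 3.0]

def polyfit3(x, y):
    """Cubic least squares for y = a + b*x + c*x^2 + d*x^3."""
    # Build the normal equations X'X * coef = X'y.
    m = [[sum(xi ** (i + j) for xi in x) for j in range(4)] for i in range(4)]
    v = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(4)]
    # Gaussian elimination with partial pivoting.
    for col in range(4):
        piv = max(range(col, 4), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, 4):
            f = m[r][col] / m[col][col]
            m[r] = [a - f * b for a, b in zip(m[r], m[col])]
            v[r] -= f * v[col]
    coef = [0.0] * 4
    for r in range(3, -1, -1):
        coef[r] = (v[r] - sum(m[r][c] * coef[c] for c in range(r + 1, 4))) / m[r][r]
    return coef

coef = polyfit3(x, y)
fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in x]
my = sum(y) / len(y)
ss_tot = sum((yi - my) ** 2 for yi in y)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
# The computer dutifully finds a "best-fitting" curve even for noise;
# part of the scatter is always "explained" by the extra coefficients.
assert ss_res < ss_tot
```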

19.7 Other Trends

Despite substantial length, the discussion in this chapter has covered only the bivariate estimates and trends for two variables that are each dimensional. As noted in Chapter 18, however, many more statistical indexes are needed for trends in relationships that are not bi-dimensional. After this long bolus of reading, you will be pleased to know that most of the additional indexes seldom appear in medical literature, and that their discussion, deferred until Chapter 27, will emphasize only a few that you may be likely to encounter. The topics omitted now and included in Chapter 27 are the following: bi-ordinal coefficients, such as Spearman’s rho and Kendall’s tau; the expression of linear trend in an ordinal array of proportions; binary and nominal coefficients, such as φ; and other indexes of trend in r × 2 or r × c two-way tables (where r ≥ 3 and c ≥ 3).


FIGURE 19.14
Scatterplot of plasma HIV-1 RNA levels (× 10³/mL) and median plasma HIV-1 RNA levels in the cohort from the time of seroconversion. [Figure and legend taken from Chapter Reference 22.] (Axes: Plasma HIV-1 RNA Level, log10 copies/mL, vs. Days from Acquisition, 0 to 1200.)

References

1. Feinstein, 1996; 2. Guilford, 1956, pg. 145; 3. Cohen, 1977, pgs. 78-81; 4. Fleiss, 1981; 5. Burnand, 1990; 6. Cherfas, 1990; 7. Knapen, 1989; 8. Tasaki, 1992; 9. Wagner, 1978; 10. Blankenhorn, 1978; 11. Stead, 1978; 12. Allred, 1989; 13. Snedecor, 1956, pg.126; 14. Concato, 1993; 15. Anscombe, 1973; 16. Osterziel, 1990; 17. Mormor, 1982; 18. Vandenbroucke, 1989; 19. Box, 1978; 20. Horwitz, 1986; 21. Bierman, 1976; 22. Schacker, 1998; 23. Altman, 1988; 24. Sechi, 1998; 25. Saad, 1991; 26. Wintemute, 1988; 27. Reaven, 1988; 28. Beckmann, 1988; 29. Kahaleh, 1989; 30. Godfrey, 1985; 31. Dines, 1974.

Appendix

Additional Formulas and Proofs for Assertions

A.19.1 Sr = Σ(Yi − Ŷi)² is a minimum when b = Sxy/Sxx

From Equation [19.1], Ŷi − Ȳ = b(Xi − X̄). Transposing Ȳ and subtracting Yi from both sides, we get Yi − Ŷi = (Yi − Ȳ) − b(Xi − X̄). Because Sr = Σ(Yi − Ŷi)², the squaring and summing produces Sr = Σ(Yi − Ȳ)² − 2bΣ(Xi − X̄)(Yi − Ȳ) + b²Σ(Xi − X̄)². Substituting the appropriate symbols, this becomes

Sr = Syy − 2bSxy + b²Sxx    [A.19.1]

where Syy, Sxy, and Sxx are constant for any individual group of data. To find the minimum value, differentiate Sr with respect to b, set the result to 0, and solve. Thus, δSr/δb = −2Sxy + 2bSxx = 0, and so b = Sxy/Sxx.
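A quick numerical check of this minimization, with arbitrary illustrative data: perturbing b away from Sxy/Sxx in either direction can only increase Sr, because Sr(b + Δ) − Sr(b) = Δ²Sxx > 0.

```python
# Arbitrary illustrative data.
x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.0, 3.0, 5.0, 4.0, 8.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

def Sr(b):
    # Sr = Syy - 2b*Sxy + b^2*Sxx, Formula [A.19.1]
    return syy - 2 * b * sxy + b * b * sxx

b_opt = sxy / sxx
for db in (-0.5, -0.1, 0.1, 0.5):
    assert Sr(b_opt + db) > Sr(b_opt)   # any other slope gives a larger Sr
```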


A.19.2 a = Ȳ − bX̄

Starting with Ŷi = a + bXi, subtract Yi from both sides, then square both sides, and take the sums to produce

Σ(Yi − Ŷi)² = Sr = na² + b²ΣXi² + ΣYi² + 2abnX̄ − 2anȲ − 2bΣXiYi

When differentiated with respect to a, all terms that do not include a will vanish. Differentiating and setting the result to 0 produces δSr/δa = 2an + 2bnX̄ − 2nȲ = 0; and so a = Ȳ − bX̄.

A.19.3 Standardized Regression Equation is (Ŷi − Ȳ)/sy = r(Xi − X̄)/sx

Divide both sides of Equation [19.1] by sxsy to produce (Ŷi − Ȳ)/sxsy = b(Xi − X̄)/sxsy. Transpose and rearrange terms to produce (Ŷi − Ȳ)/sy = (bsx/sy)(Xi − X̄)/sx. Because sy = √(Syy/(n − 1)) and sx = √(Sxx/(n − 1)), sx/sy = √(Sxx/Syy). Because b = Sxy/Sxx, the factor bsx/sy becomes (Sxy/Sxx)√(Sxx/Syy), which is Sxy/√(SyySxx) = r. [With a similar algebraic process, the same result emerges if we start with X̂i − X̄ = b′(Yi − Ȳ), where b′ = Sxy/Syy. The result is (X̂i − X̄)/sx = r(Yi − Ȳ)/sy.]

 

A.19.4 Degrees of Freedom for Ŷi

For parametric inference, the observed values of {Xi, Yi} are regarded as a sample from the “true” regression equation, which is cited (using Greek letters for the parameters) as Y = α + βX. In the calculated equation for Ŷi = a + bXi, a and b are estimates for the two parameters, α and β. Thus, two degrees of freedom are lost from the n degrees available for choosing Ŷi. An alternative explanation is that a and b are calculated after the parametric means are estimated as μX = X̄ and μY = Ȳ. Two degrees of freedom are lost for those two parameters.

A.19.5 Parametric Variances of Regression Line

We need to estimate two variances. One of them is σ²y·x, which represents the average parametric variance of the observed Yi values around the Ŷi calculated from Xi. The other is the sampling variance or “standard error” of the slope, β. Anything else that is needed can be derived from these two estimates.

A.19.5.1 Variance σ²y·x for Ŷi

From Formula [19.8] the “mean square error” is estimated as

σ̂²y·x = s²y·x = Sr/(n − 2)    [A.19.2]

A.19.5.2 Variance of the Slope β

To determine the variance of the slope, the formula b = Sxy/Sxx is rewritten as b = Σ(Xi − X̄)(Yi − Ȳ)/Sxx, which becomes

b = [Σ(Xi − X̄)Yi − ȲΣ(Xi − X̄)]/Sxx

The second term in the numerator vanishes because Ȳ is constant and Σ(Xi − X̄) = 0 by definition. The rest of the expression is now expanded to form

b = (X1 − X̄)Y1/Sxx + (X2 − X̄)Y2/Sxx + … + (Xn − X̄)Yn/Sxx

If each of the (Xi − X̄)/Sxx components is expressed as a constant, ki, we can write

b = k1Y1 + k2Y2 + … + knYn

Using “Var( )” as the symbol for variance, we first determine the effect of a constant on a variance. Thus, if Var(X) = Σ(Xi − X̄)², then Var(kX) = Σ(kXi − kX̄)² = k²Σ(Xi − X̄)² = k²Var(X).

In Appendix A.7.1, we found that the variance of a sum is the sum of the variances. Thus, Var(X + Y + W + …) = Var(X) + Var(Y) + Var(W) + ….

Consequently, in the foregoing expression of b,

Var(b) = k1²Var(Y1) + k2²Var(Y2) + … + kn²Var(Yn)

The average variance of each Yi value around the regression line was previously shown to be s²y·x. On average, therefore,

Var(b) = s²y·x(k1² + k2² + … + kn²)

Because ki = (Xi − X̄)/Sxx and ki² = (Xi − X̄)²/Sxx², the sum is

Σki² = Σ(Xi − X̄)²/Sxx² = Sxx/Sxx² = 1/Sxx

Therefore the variance of the slope is

sb² = Var(b) = s²y·x/Sxx    [A.19.3]

A.19.6 Critical Ratio for t or Z Test on b or r

The critical ratio for a t or Z test on the slope will be

t or Z = (b − β)/sb    [A.19.4]

which becomes b/sb when β is set to 0 under the null hypothesis. From Formula [A.19.2], s²y·x = Sr/(n − 2), and from Formula [19.16], Sr = (1 − r²)Syy. Therefore, s²y·x = (1 − r²)Syy/(n − 2). Substituting in Formula [A.19.3], we get sb² = (1 − r²)Syy/[Sxx(n − 2)]. When the square root of this result and b = Sxy/Sxx are substituted, and when terms are suitably rearranged, the critical ratio becomes

t (or Z) = r√(n − 2)/√(1 − r²)    [A.19.5]

The critical ratio for testing b is thus expressed in values of r. Exactly the same calculation is used for a t or Z test on r when the parametric correlation coefficient, ρ, is set to 0 under the null hypothesis.
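The algebraic identity between b/sb and the r-based expression can be confirmed numerically, with arbitrary illustrative data:

```python
# Arbitrary illustrative data (not perfectly linear, so 1 - r^2 > 0).
x = [1.0, 2.0, 4.0, 5.0, 7.0, 8.0]
y = [2.0, 3.0, 5.0, 4.0, 8.0, 7.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = sxy / sxx
r = sxy / (sxx * syy) ** 0.5
sr = (1 - r * r) * syy                  # Formula [19.16]
s2_yx = sr / (n - 2)                    # Formula [A.19.2]
sb = (s2_yx / sxx) ** 0.5               # Formula [A.19.3]

t_from_b = b / sb                                     # Formula [A.19.4]
t_from_r = r * (n - 2) ** 0.5 / (1 - r * r) ** 0.5    # Formula [A.19.5]
assert abs(t_from_b - t_from_r) < 1e-9
```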

A.19.7 Confidence Intervals for r and b

In the critical ratio of Formula [A.19.5], √[(1 − r²)/(n − 2)] corresponds to the standard error for the observed value of r. Therefore, a confidence interval for the estimated parameter ρ can be placed around r as

ρ̂ = r ± tν,α √[(1 − r²)/(n − 2)]    [A.19.6]

If you want a more formal demonstration of the “standard error” approach here, recall from Formula [A.19.4] that a confidence interval for β can be estimated (using t) as

β̂ = b ± tν,α sb

This becomes

β̂ = Sxy/Sxx ± tν,α √{[(1 − r²)Syy]/[Sxx(n − 2)]}

and then

β̂ = Sxy/Sxx ± tν,α √[(1 − r²)/(n − 2)] √(Syy/Sxx)

Because √(Syy/Sxx) = sy/sx and b = Sxy/Sxx = rsy/sx, the confidence interval becomes

β̂ = (sy/sx)[r ± tν,α √((1 − r²)/(n − 2))]    [A.19.7]

The term in the square root sign is the “standard error” for r.

A.19.8 Confidence Interval for Individual Points

If the goal is to estimate an actual point, Yi, rather than using Ŷi as the mean value of Yi at Xi, the confidence interval is wider, to encompass the variations in Yi − Ŷi. The standard error of an estimated point becomes

√{s²y·x [1 + (1/n) + (Xi − X̄)²/Sxx]}    [A.19.8]

Because the value of 1 in Formula [A.19.8] is so much larger than (1/n) or (Xi − X̄)²/Sxx, the result is not greatly affected by changes in Xi − X̄, and the subsequent wide confidence bands will seem to be relatively straight rather than concave lines.

A.19.9 Confidence Interval for Intercept

Because the intercept is calculated when Xi = 0, the corresponding confidence interval can be determined from Formula [19.23], with Xi set to zero.

A.19.10 Comparison of Two Regression Lines

Because most ordinary regression lines in medical research do not give close fits, two lines are seldom compared stochastically. (Two survival curves may be compared, as discussed later in Chapter 22, but the stochastic strategy does not use regression principles.)

If a stochastic contrast is desired, however, two regression lines can be compared either for the different slopes or, if the lines seem parallel, for the interlinear vertical distances. The latter evaluation is essentially a comparison of the two intercepts.

If you need to do any of these comparisons, they are clearly discussed, with a worked example, in an excellent paper by Altman and Gardner.23


A.19.11 Transformations of r

If X and Y have a joint bivariate Gaussian distribution (which would look like a Gaussian hill), r can be transformed to

Zr = (1/2) ln[(1 + r)/(1 − r)]

The distribution of Zr is approximately Gaussian and its standard error is 1/√(n − 3). For the 1 − α confidence interval we calculate Zr ± (tν,α/√(n − 3)), and then transform the results with suitable exponentiation. A good worked example of the procedure is also shown by Altman and Gardner.23

The data of Table 18.5 obviously do not have a suitable distribution, but can be used for a crude illustration here. In those data, r = .186 and n = 8; and so Zr = (1/2)[ln(1.186/.814)] = .188. The value of √(n − 3) is 2.236, and t6,.05 = 2.447, so that .188 ± 2.447 = −2.259 to 2.635 is the range for Zr. Because

2Zr = ln[(1 + r)/(1 − r)]

we exponentiate to get (1 + r)/(1 − r) = e^(2Zr), which leads to r = (e^(2Zr) − 1)/(e^(2Zr) + 1). For Zr = −2.259, this result becomes r = −.989/1.0109 = −.978; and for Zr = 2.635, the corresponding result is r = 193.42/195.42 = .990. The 95% confidence interval for the observed r = .186 would thus extend throughout the wide range from −.978 to .990 — a result consistent with the diffuse spread in the small data set. Because neither boundary exceeds 1 in absolute value, this range seems more numerically “acceptable” than the range of −.795 to 1.167 previously calculated in Section 19.4.2.1 with Formula [19.22].
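The transformation and its back-transformation (equivalent to the hyperbolic tangent) can be sketched directly. A key property is that the back-transformed boundaries can never escape the interval (−1, +1):

```python
import math

def z_of_r(r):
    """Fisher transformation: Zr = (1/2) ln[(1 + r)/(1 - r)]."""
    return 0.5 * math.log((1 + r) / (1 - r))

def r_of_z(z):
    """Back-transformation: r = (e^(2Zr) - 1)/(e^(2Zr) + 1) = tanh(Zr)."""
    e = math.exp(2 * z)
    return (e - 1) / (e + 1)

r, n = 0.186, 8
z = z_of_r(r)
assert abs(z - 0.188) < 0.001           # the worked value for Table 18.5

# Any Zr boundary, however extreme, maps back inside (-1, +1),
# so the interval for r cannot exceed its legitimate bounds.
for zi in (-2.259, 2.635, -10.0, 10.0):
    assert -1 < r_of_z(zi) < 1
assert abs(r_of_z(z) - r) < 1e-12       # the transformations are inverses
```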

Exercises

To give you time for all the reading, the exercises that follow are mainly challenges in “physical diagnosis,” containing almost no requirements for calculation. (Besides, most regression calculations today are relegated to a computer program.) Nevertheless, because your education would be incomplete if you failed to do at least one set of regression computations yourself, a delightful opportunity is presented in Exercise 19.1.

19.1. A clinical investigator who believes that serum omphalase is closely related to immunoglobulin zeta is disappointed by the nonlinear pattern seen in a graph of the following data:

Immunoglobulin Zeta Level    Serum Omphalase Level
            1                          35
            3                          10
            5                          40
            7                          30
            9                          15
           10                          45
           11                          80
           13                          75
           15                          50
           17                          60
           19                          50

The investigator consults a statistician who also happens to be board-certified in diagnostic graphology. The statistician looks at the graph, and promptly announces, “The relationship here is strong and probably stochastically significant.”
