
18.2.3 Patterns Formed by Constituent Variables

The main statistical complexity of bivariate associations is produced by the diverse patterns formed with different scales for the constituent variables. Because each of the two variables can be cited in four possible scales — binary, dimensional, ordinal, or nominal — indexes of descriptive association will be needed for 16 (= 4 × 4) possible patterns. Additional possibilities will occur according to whether the relationships are concordances, nondependent trends, or dependent trends. The rest of this section, which provides an outlined inventory of the diverse formats, is intended only to let you know about the many different indexes. Fortunately, only a few of them regularly appear in medical research.

The diagram in Figure 18.6 shows 16 possible patterns for co-relationships in scales of an independent and dependent variable. The bi-dimensional pattern in the heavily stippled central zone is the classical and most frequent arrangement; it will be discussed throughout Sections 18.3 and 18.4, and again in Chapter 19. In the three lightly stippled central zones, both variables can be ranked. In seven of the outer zones, around the top and left side of the diagram, at least one of the variables is binary; and in the remaining five zones, at least one of the variables is nominal. These 16 possible patterns will not all occur, however, for concordances or for nondependent trends.

FIGURE 18.6
Possible patterns of co-relationship for two variables, each expressed in four possible scales. [A 4 × 4 grid: rows show the scale of the independent variable and columns the scale of the dependent (outcome) variable, each being binary, dimensional, ordinal, or nominal. Shading marks the zones where at least one variable is binary, where both variables are dimensional, where both variables can be ranked, and where at least one variable is nominal.]

18.2.3.1 Concordance — Because concordance can be measured only when both variables are commensurate — i.e., expressed in exactly the same scales — only four patterns are possible. They arise when the two examined variables are expressed in the same binary-binary, dimensional-dimensional, ordinal-ordinal, or nominal-nominal scales for the same entities. The possible arrangements are shown in Table 18.3. The indexes of concordance for these four patterns of data will be discussed in Chapter 20.

18.2.3.2 Nondependent Trend — The trend in nondependent correlations can be expressed for four patterns in which the scales for the two variables are binary-binary, ..., or nominal-nominal. For example, the dimensional scales for hematocrit and hemoglobin, although different in magnitude, would form a dimensional-dimensional pair. Six additional patterns can occur, however, when the two associated variables have different types of scales. The additional pairs of scales can be binary-dimensional, binary-ordinal, binary-nominal, dimensional-ordinal, dimensional-nominal, and ordinal-nominal, as shown in

©2002 by Chapman & Hall/CRC

TABLE 18.3
Four Patterns of Variables for Assessing Concordance

                                       Scale of Variable A
                            Binary   Dimensional   Ordinal   Nominal
Scale of      Binary          •
Variable B    Dimensional                  •
for Same      Ordinal                                  •
Entity        Nominal                                             •

Table 18.4. The names of some of the indexes, as cited in the footnote to Table 18.4, will be discussed either in Chapter 19, or later in Chapter 27.

TABLE 18.4
Patterns of Variables and Statistical Indexes for a Nondependent Correlation*

                                       Scale of Variable A
                            Binary   Dimensional   Ordinal   Nominal
Scale of      Binary          A
Variable B    Dimensional     B           C
for           Ordinal         D           E             F
Different     Nominal         G           H             I         J
Entity

*Examples of correlation indexes, discussed in either Chapter 19 or later in Chapter 27, for these bivariate scales are: A: φ; B: Biserial coefficient; C: Pearson's r; D: Pearson's r or same as F; E: Jasper's multiserial coefficient; F: Spearman's rho, Kendall's tau, gamma, or Somers' D; G: φ; H: Eta; I: Lambda or Freeman's theta; J: φ.

Because the two variables are associated without a direction, the correlations for each pair will be symmetrical, whether the variables are listed in a dimensional-ordinal or ordinal-dimensional orientation. For example, the correlation between dimensional height and ordinal social class, or between nominal religion and binary sex, would be the same regardless of how the variables are arranged.

18.2.3.3 Dependent Trend — For dependent associations, however, all 16 patterns can occur asymmetrically, as shown in Figure 18.6.

18.2.3.3.1 Binary Constituents. The upper row and left-hand column of Figure 18.6 contain seven patterns in which the independent and/or dependent variable is binary. If independent, the binary variable forms the two groups whose contrasts were previously discussed. If the dependent variable for the two groups is also binary, the contrast (or association) can be shown in a 2 × 2 table, with the results expressed as an index of association or comparison for two proportions. In the other illustrations throughout Chapters 10 to 17, the dependent variables were dimensional values of blood sugar (summarized as means) or ordinal values of improvement. An independent binary variable could be associated with a nominal dependent variable if we compared choice of occupation (doctor, lawyer, homemaker, etc.) in women and men.

A binary dependent variable was associated with a dimensional (or ordinal) independent variable for the “survival curve” constructed in Figure 18.5. Each point on the curve showed the proportion of people who were alive (or dead) at various times after the onset of observation. The binary proportion of survivors could also be associated with an independent variable that is ordinal (e.g., severity of clinical stage) or nominal (e.g., different categories of diagnosis). In the latter two situations, the results will usually be shown with tables (such as Table 18.2) rather than graphs.


18.2.3.3.2 Nominal Constituents. The lower row and right-hand column of Figure 18.6 contain five patterns in which at least one constituent variable is nominal. These associations will require special arrangements, some of which were mentioned earlier.

18.2.3.3.3 Both Constituents Rankable. For the associations shown in the central four stippled zones of Figure 18.6, both variables can be ranked, being either dimensional or ordinal. This type of jointly ranked relationship is what most people have in mind when thinking about "associations." In particular, the dimensional-dimensional (or "bi-dimensional") pairing is the pattern that gave rise to the well-known ideas of regression and correlation.

18.3 Basic Mathematical Strategies for Associations

In any form of statistical analysis, the most pertinent basic question is, “What do I really want to know?” The answer to this question indicates the basic goal of the analysis. The next question might be, “What statistical strategy is used for this purpose?” The answer indicates the basic operational principle used for achieving the goal. The third question is, “What particular index, procedure, or test is used to carry out the strategy?” The answer indicates the particular statistical method applied to execute the basic operational principle.

For example, in Part I of the text, one of the goals was to select a single value that would represent a group of dimensional data. The main operational principle for achieving this goal was to choose an item that was “central” in the group. The mode, median, (arithmetic) mean, and geometric mean were methods available to carry out the principle. The choice of the method depended on what we wanted to use as a “central” index.

In Parts I and II of the text, we discovered that the investigator’s goals and the operational principles of statistical analysis do not always coincide. For example, if the goal is to evaluate numerical stability for a contrast of two groups, the prime statistical strategy is often not aimed directly at stability. Instead, various arrangements of mathematical probability are used to find a P value or a 1 − α confidence interval for the summary indexes of the observed contrast. The parametric sampling and the permutation or resampling methods are individual procedures for getting the desired P values or confidence intervals.

Analogous types of disparity between goals and operational principles can arise in the statistical methods developed for indexes of association. These indexes have achieved the general status of established tradition, widespread acceptance, and ubiquitous usage. Nevertheless, if common sense has not been obliterated during all the mathematical explanations and computations, you may sometimes note that what you get is not necessarily what you want.

18.3.1 Basic Mathematical Principles

To summarize a single group of data, the operational principles use a central index and an index of dispersion. To summarize a contrast of two groups, the main principles rely on increments and ratios. For associations, the mathematical principles employ estimations and covariations. With estimations, one variable is used to estimate (or predict) the value of the other. With covariations, an index of magnitude or strength is determined for the relationship between the two variables.

18.3.1.1 Estimations — For the univariate dimensional data, {Yi}, shown on the left of Figure 18.7, suppose we had to guess the value of any individual Yi that might be chosen from this set. If G is the chosen guess, the individual error will be Yi − G.

From what was learned in univariate statistics (Section 4.6.2), we know that the average absolute error, |Yi − G|, will be smallest if G is chosen to be the median of the data set, and that the average squared error, (Yi − G)², will be smallest if G is the mean. Accordingly, the best guess would be either the mean or the median of the Yi values.
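This guessing claim can be checked numerically. The sketch below uses a grid search over candidate guesses; the eight Y values are the illustrative ones that later appear in Table 18.5, and the function names are invented here, not part of the text.

```python
# Numerical check of the claim from Section 4.6.2: among all constant
# guesses G for a data set {Yi}, the median minimizes total absolute
# error and the mean minimizes total squared error.
from statistics import mean, median

y = [39, 90, 50, 82, 43, 95, 51, 79]  # illustrative {Yi}

def total_abs_error(g):
    return sum(abs(yi - g) for yi in y)

def total_sq_error(g):
    return sum((yi - g) ** 2 for yi in y)

# Grid of candidate guesses from 30.0 to 99.9 in steps of 0.1.
candidates = [k / 10 for k in range(300, 1000)]

best_abs = min(total_abs_error(g) for g in candidates)
best_sq = min(total_sq_error(g) for g in candidates)

# The median (65) achieves the minimal absolute error; the mean
# (66.125) achieves the minimal squared error.
print(total_abs_error(median(y)) <= best_abs + 1e-9)  # → True
print(total_sq_error(mean(y)) <= best_sq + 1e-9)      # → True
```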


Now suppose that values of an associated variable, Xi, are available for each Yi, as shown on the right side of Figure 18.7. The pattern of points suggests that the estimates of Yi could be substantially improved if we made use of the Xi values. For this purpose, we can fit the points with an algebraic model* that

FIGURE 18.7
Display of values, on left, for {Yi} alone, and, on right, for {Yi} with corresponding {Xi}.

expresses Y as a "function" of X. The model used most commonly (for reasons discussed later) is the straight line:

Ŷi = a + bXi    [18.1]

In this expression, the "^" symbol over Ŷi indicates that it is estimated from the corresponding observed value of Xi. The value of a is the intercept of the equation, representing the value of Ŷi when Xi = 0. The value of b is the slope of the line, indicating the number of units of change in Ŷi for each unitary change of Xi. (The calculation of a and b is discussed in Chapter 19.)
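As a sketch of fitting the model in Formula [18.1], the snippet below anticipates the least-squares results b = Sxy/Sxx and a = Ȳ − bX̄, which are derived in Chapter 19, not here; the data are the illustrative (Xi, Yi) pairs of Table 18.5.

```python
# Fitting the straight-line model Y-hat = a + b*X by least squares,
# using formulas anticipated from Chapter 19: b = Sxy/Sxx and
# a = Ybar - b*Xbar. Data are the illustrative pairs of Table 18.5.
x = [1, 2, 3, 4, 6, 7, 8, 8]
y = [39, 90, 50, 82, 43, 95, 51, 79]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx        # slope: units of change in Y-hat per unit of X
a = y_bar - b * x_bar  # intercept: value of Y-hat when X = 0

y_hat = [a + b * xi for xi in x]   # estimated values
print(round(a, 1), round(b, 2))    # → 58.6 1.53
```

Note that the fitted line necessarily passes through the point of means (X̄, Ȳ).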

18.3.1.2 Errors in Estimation — For each Ŷi estimated with Xi, the absolute error of the estimate will be |Yi − Ŷi| and the squared error will be (Yi − Ŷi)². We can express the accomplishment of the algebraic model by comparing the total errors made when we used only the univariate "guesses" from values of Y alone, versus errors in the bivariate estimates of Ŷi, using values of X. The expression for proportionate reduction in error would be calculated as

(Errors with Y alone − Errors using X) / (Errors with Y alone)    [18.2]

If the expression used the sums of squared errors, the "guesstimates" made from the mean of Y alone would be Syy = Σ(Yi − Ȳ)². With Sr as the symbol, the corresponding sum of squared errors for estimates with Ŷi would be Sr = Σ(Yi − Ŷi)². The formula for proportionate reduction in errors would be

(Syy − Sr)/Syy    [18.3]

 

The idea of reducing error or improving accuracy is a fundamental principle in constructing indexes of association, and the proportionate reduction in errors is commonly used as an index of the association.

Errors in estimation can also be reduced for associations that are not dimensional. For example, suppose Y is a binary variable showing that the 5-year survival rate is 33% (= 66/200) for a particular disease. If we had to make individual predictions for each patient from the univariate information alone, the best guess would be to predict that everyone will be dead at 5 years. The prediction will be wrong in 66 patients and right in 134.

Now suppose that the patients were classified, as in Table 18.2, into four ordinal stages of disease, expressed as Variable X. Using results of the additional variable, we can form another set of predictions. For the 80% survival rate in Stage I, the prediction of alive would be correct in 16 patients and wrong

* Many types of "models" — including clusters of categories, algorithmic flow charts, and other mathematical or quasi-mathematical structures — can be used in statistical analysis. Models arranged in the form of an equation will be called algebraic, a term that seems preferable to equational.


in 4. For Stages II, III, and IV, where the survival rates are below the meridian of 50%, we would predict death for everyone. The prediction would be correct, respectively, for 27, 49, and 54 patients and wrong in 23, 21, and 6 patients of those three stages. The total number of predictive errors in the 200 patients would become 4 + 23 + 21 + 6 = 54; and the proportionate reduction in errors would be calculated, from the 66 with Y alone and the 54 using X, as

(66 − 54)/66 = 18%
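The error-counting arithmetic above can be reproduced in a few lines; the stage-by-stage alive/dead counts are taken from the text, and a majority-category rule stands in for the "best guess" predictions.

```python
# Reproducing the error-reduction arithmetic for the survival example.
# Stage-specific alive/dead counts are taken from the text
# (Stage I: 16 alive, 4 dead; II: 23/27; III: 21/49; IV: 6/54).
stages = {
    "I":   (16, 4),
    "II":  (23, 27),
    "III": (21, 49),
    "IV":  (6, 54),
}

alive = sum(a for a, d in stages.values())   # 66 of 200
dead = sum(d for a, d in stages.values())    # 134 of 200

# Univariate prediction: predict the majority category ("dead") for
# everyone, so the errors are the 66 survivors.
errors_y_alone = min(alive, dead)

# Bivariate prediction: within each stage, predict the majority category.
errors_using_x = sum(min(a, d) for a, d in stages.values())

reduction = (errors_y_alone - errors_using_x) / errors_y_alone
print(errors_y_alone, errors_using_x, round(reduction, 2))  # → 66 54 0.18
```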

18.3.1.3 Covariations — The other main principle for indexing an association is to cite the magnitude (or “strength”) of covariation in the relationship between the two variables. As X increases, how vigorously does the gradient rise or fall in Y? For example, in Table 18.2, the survival rate drops from 80% to 10% during three “steps” from Stage I to Stage IV, so that the “average” decrease is 70%/3 = 23% for each change of stage.

The gradient of change is easy to discern when results are expressed in the ordinal-binary arrangement of Table 18.2, but a more general approach is needed to show covariance when both variables are dimensional. The index should denote the simultaneous change of the two variables in a set of paired points, {Xi, Yi}, for each person, i.

Considering each variable alone, the "changes" within X can be cited as the amount by which each item, Xi, deviates from a reference point, which is most commonly chosen to be the arithmetical mean, X̄. The deviation, Xi − X̄, will then indicate the amount by which any individual value, Xi, has changed from the mean. Similarly, for the other variable Y, the deviation Yi − Ȳ would indicate the corresponding change in Yi.

This principle was used to calculate variance and standard deviation as indexes of univariate dispersion. For n items in variable X, the individual deviations from the mean are squared and added to form the group variance, Sxx = Σ(Xi − X̄)², which is then divided by n − 1 (or n) to form the variance. A similar process would produce Syy = Σ(Yi − Ȳ)² as the group variance in variable Y. The square roots of the two variances would be the corresponding standard deviations.

The two individual deviations for the pair of Xi and Yi values at point i will have a bivariate role if they are multiplied. The product, (Xi − X̄)(Yi − Ȳ), is the codeviation that indicates the simultaneous change as the two variables "move" from their respective means to reach point i. The codeviation product will be positive if both deviations are positive, so that Xi becomes greater than X̄, and Yi greater than Ȳ. The product will also be positive if both deviations are negative, with Xi and Yi each becoming less than their corresponding means. If the deviations for Xi − X̄ and Yi − Ȳ have opposite signs, the codeviation product will be negative, indicating that the two variables have moved in opposite directions. The greater the absolute magnitude of each codeviation, the greater the amount of movement in a jointly positive or negative direction.

For "movement" within an individual variable, the univariate deviations were squared and added to form Sxx = Σ(Xi − X̄)(Xi − X̄) and Syy = Σ(Yi − Ȳ)(Yi − Ȳ). For the two variables "moving" together, the individual codeviations are added to form a sum of products that can be called the group covariance, symbolized as

Sxy = Σ(Xi − X̄)(Yi − Ȳ)    [18.4]

The average value, obtained when Sxy is divided by n (or by n − 1 for inferential purposes), is called the covariance. As a quantitative index of co-relationship between two variables, covariance will have a high positive score if the variables generally move together in the same direction, a high negative score if they generally move oppositely, and a score near 0 for no distinct pattern.
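Formula [18.4] can be verified directly with the eight (Xi, Yi) pairs of Table 18.5:

```python
# Direct check of Formula [18.4] with the eight (Xi, Yi) pairs that
# appear in Table 18.5.
x = [1, 2, 3, 4, 6, 7, 8, 8]
y = [39, 90, 50, 82, 43, 95, 51, 79]

n = len(x)
x_bar = sum(x) / n    # 4.875, shown as 4.88 in the table
y_bar = sum(y) / n    # 66.125, shown as 66.13

codeviations = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
s_xy = sum(codeviations)   # group covariance: 81.125, shown as +81.13
covariance = s_xy / n      # 10.140625; n - 1 would serve for inference

print(s_xy, round(covariance, 2))  # → 81.125 10.14
```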

18.3.1.3.1 Illustration of Calculations. To demonstrate the way that covariance indicates co-relationship, consider the illustrative set of data for Xi and Yi of eight persons in Table 18.5. In addition to the basic values of {Xi} and {Yi}, the table shows values for X̄, Ȳ, Xi − X̄, Yi − Ȳ, (Xi − X̄)²,


(Yi − Ȳ)², and (Xi − X̄)(Yi − Ȳ). Figure 18.8 is a "scattergraph" for the eight points listed in Table 18.5. In Figure 18.9, the scattergraph has been redrawn, with the axes of origin placed at the mean values of X and Y, forming four quadrants that locate each point according to the deviation units for each variable.

TABLE 18.5
Deviations, Squared Deviations, and Codeviations for an Illustrative Set of Data

Person     X      Y    Xi − X̄   (Xi − X̄)²   Yi − Ȳ    (Yi − Ȳ)²   (Xi − X̄)(Yi − Ȳ)
A          1     39    −3.88      15.05     −27.13      736.04        +105.25
B          2     90    −2.88       8.29     +23.87      569.78         −68.75
C          3     50    −1.88       3.53     −16.13      260.18         +30.32
D          4     82    −0.88       0.77     +15.87      251.86         −13.97
E          6     43    +1.12       1.25     −23.13      535.00         −25.91
F          7     95    +2.12       4.49     +28.87      833.48         +61.20
G          8     51    +3.12       9.73     −15.13      228.92         −47.21
H          8     79    +3.12       9.73     +12.87      165.64         +40.15
Sum       39    529      0        52.88        0       3580.88         +81.13
Mean    4.88  66.13      0         6.61        0        447.61         +10.14

Note: Divisor for all mean values = 8.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

FIGURE 18.8
"Scattergraph" of data in Table 18.5. [Points A through H plotted with Y running from 10 to 100 and X from 1 to 10.]

FIGURE 18.9
Mean values of X and Y used as axes for scattergraph of data in Table 18.5 and Figure 18.8. [Axes drawn at X̄ = 4.88 and Ȳ = 66.13. Quadrant I: Xi − X̄ = +, Yi − Ȳ = +; Quadrant II: Xi − X̄ = −, Yi − Ȳ = +; Quadrant III: Xi − X̄ = −, Yi − Ȳ = −; Quadrant IV: Xi − X̄ = +, Yi − Ȳ = −. Points F, B, D, and H lie above Ȳ; points C, G, A, and E lie below.]

18.3.1.3.2 Pattern and Quantification of Covariance. The X and Y deviations at each point are both positive in Quadrant I of Figure 18.9 and both negative in Quadrant III. In both of these quadrants the co-deviation product will be positive. Conversely, in Quadrants II and IV, the X and Y deviations go in opposite directions, one being positive and the other negative; and the product of co-deviations in these quadrants will be negative. This distinction is also shown by the values of (Xi − X̄)(Yi − Ȳ) in Table 18.5. For persons A, C, F, and H, whose points lie in Quadrants I or III, each co-deviation is positive. For persons B, D, E, and G, whose points lie in Quadrants II or IV, the products are negative.


To show the general impact of co-deviations, arbitrary collections of illustrative points (not the ones shown in Figures 18.8 and 18.9) have been placed in the appropriate quadrants of Figures 18.10 and 18.11. In Figure 18.10, where all the codeviate points lie in the first and third quadrants, the swarm of points shows a distinct positive relationship between X and Y. In Figure 18.11, where all of the points lie in the second and fourth quadrants, the pattern shows a distinct negative or inverse relationship. (An important feature of terminology for co-relationships is that negative means something going in a direction distinctively opposite to positive. In many medical uses, the word negative refers to "none" or "normal"; but in the absence of a distinct co-relation, the correct word is none, not negative. To avoid possible confusion, however, a negative co-relationship is often called inverse.)

FIGURE 18.10
Positive correlation effect evident from "swarm" of codeviate points in Quadrants I and III of a scattergraph.

FIGURE 18.11
Negative correlation effect evident from "swarm" of codeviate points in Quadrants II and IV of a scattergraph.

As the average value of Σ(Xi − X̄)(Yi − Ȳ), the covariance could quantitatively indicate the positive or negative strength of the relationship. For example, the group of eight points in Figure 18.8 does not show a strong relationship in either direction. Their average codeviance (as shown in Table 18.5) is +10.14. On the other hand, if we consider only the contributions of points A, C, F, and H (in Quadrants I and III), their average co-deviance is +236.93/4 = +59.23; and for just the points B, D, E, and G, the average co-deviance is −155.83/4 = −38.96.

18.3.1.4 Correlation Coefficient — Although helping quantify a co-relationship, the average magnitude of Sxy depends completely on the arbitrary units in which X and Y are expressed. If each value of Y in Table 18.5 were ten times larger (e.g., 390, 900, 500, … rather than 39, 90, 50, …), the values of Y would be ten times larger for the mean and deviations, and the values of Sxy and Sxy/n would also be ten times larger. Nevertheless, the basic relationship between X and Y would remain the same.

18.3.1.4.1 Product of Standardized Deviates. To eliminate this problem, each variable can be expressed in the Z-scores that form standardized deviates, thereby making Sxy free of dimensional units. Thus, if sx is the standard deviation of the X values, the entity (Xi − X̄)/sx is dimension-free, cited in standard-deviation units above or below the mean of X. The counterpart entity, (Yi − Ȳ)/sy, has a similar structure for values of Y. Each product of the standardized-deviate scores would be


[(Xi − X̄)/sx] × [(Yi − Ȳ)/sy]

and the sum of these standardized codeviations would be Σ[(Xi − X̄)(Yi − Ȳ)]/(sx sy), which is Sxy/[(sx)(sy)]. The mean value of this sum (using n as the divisor) would be (Sxy/n)/[(sx)(sy)]. Because sx = √(Sxx/n) and sy = √(Syy/n), some further algebra will show that the new index would be

r = Sxy/√(Sxx Syy)    [18.5]

(The same result would emerge if each divisor were n − 1 rather than n.)

This index is customarily symbolized as r and called the correlation coefficient. It is also sometimes called Pearson's r, to commemorate Karl Pearson, who helped popularize its use for expressing correlation. In older statistical language, values of Xi − X̄ and Yi − Ȳ were called the "first moment around the mean." For this reason, the correlation coefficient is sometimes called the product-moment coefficient.

Some additional algebra, shown in Chapter 19, will demonstrate that r has a maximum positive value of +1 and a maximum negative value of −1. Thus, when the "standardized" dimension-free r has values close to +1 or −1, the two variables have a strong relationship. When r is close to 0, they have little or no relationship.
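The equivalence between Formula [18.5] and the average product of Z-scores can be checked numerically; the sketch below assumes the Table 18.5 data and uses n as the divisor throughout.

```python
# Numerical check that r of Formula [18.5] equals the average product
# of standardized deviates (Z-scores), using the Table 18.5 data with
# n as the divisor throughout.
from math import sqrt

x = [1, 2, 3, 4, 6, 7, 8, 8]
y = [39, 90, 50, 82, 43, 95, 51, 79]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

sx, sy = sqrt(s_xx / n), sqrt(s_yy / n)   # standard deviations (divisor n)
z_products = [((xi - x_bar) / sx) * ((yi - y_bar) / sy)
              for xi, yi in zip(x, y)]

r_from_z = sum(z_products) / n            # mean product of Z-scores
r_from_sums = s_xy / sqrt(s_xx * s_yy)    # Formula [18.5]
print(round(r_from_z, 3), round(r_from_sums, 3))  # → 0.186 0.186
```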

18.3.1.4.2 Example of Calculations. The sums of appropriate columns in Table 18.5 show that Sxx = 52.88, Syy = 3580.88, and Sxy = 81.13. Substituting these values in Formula [18.5], the correlation coefficient for the data is

r = 81.13/√(52.88 × 3580.88) = 81.13/435.15 = 0.186

which indicates the weak positive relationship evident in Figure 18.8. If you do these calculations yourself, some computing formulas can greatly ease the job. Just as Σ(Xi − X̄)² = ΣXi² − nX̄² was best calculated as Sxx = ΣXi² − [(ΣXi)²/n], the best "hand-calculator" formula for the group covariance is

Sxy = ΣXiYi − [(ΣXi)(ΣYi)/n]    [18.6]

For example, as shown in the Sum row of Table 18.5, ΣXi = 39 and ΣYi = 529. The additional items needed for the quick calculations are ΣXi² = 243, ΣYi² = 38561, and ΣXiYi = 2660. These items would lead to Sxx = 243 − [(39)²/8] = 52.88 and Syy = 38561 − [(529)²/8] = 3580.88. The group covariance, Sxy = Σ(Xi − X̄)(Yi − Ȳ), was originally calculated in Table 18.5 by directly adding each codeviation to get +81.13. The quick-calculation Formula [18.6] would produce 2660 − [(39)(529)/8] = 81.13.
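The quick-calculation shortcut can be confirmed against the direct codeviation sum:

```python
# Checking the "hand-calculator" shortcut of Formula [18.6] against
# the direct codeviation sum, with the Table 18.5 data.
x = [1, 2, 3, 4, 6, 7, 8, 8]
y = [39, 90, 50, 82, 43, 95, 51, 79]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                  # 39 and 529
sum_x2 = sum(xi * xi for xi in x)              # 243
sum_y2 = sum(yi * yi for yi in y)              # 38561
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 2660

s_xx = sum_x2 - sum_x ** 2 / n     # 52.875, shown as 52.88
s_yy = sum_y2 - sum_y ** 2 / n     # 3580.875, shown as 3580.88
s_xy = sum_xy - sum_x * sum_y / n  # 81.125, shown as 81.13

# Direct form for comparison:
x_bar, y_bar = sum_x / n, sum_y / n
s_xy_direct = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
print(s_xy == s_xy_direct)  # → True
```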

18.3.2 Choice of Principles

According to the goals, orientation, and patterns of data, indexes of association will be formed with principles of estimation or principles of covariation, and sometimes with both. For example, to demonstrate closeness of agreement in appraising concordance or to make specific predictions, the emphasis is on principles of estimation. To appraise trend in the two variables, the focus is on covariation.

The two indexes of expression — for estimation and covariation — will indicate different attributes of the association. A weak relationship can produce highly accurate estimations, and a strong relationship may have many errors. For example, in Figure 18.7, Y seems to have a weak relationship to X: the values of Y go up only slightly as X increases. Yet the estimates of Y made from X might have perfect accuracy because a straight line would fit the data so well. Conversely, in Table 18.2, the two variables have a strong (although inverse) relationship, but the error rate in predictions is proportionately reduced only 18%.


18.4 Concept and Strategy of Regression

Regression has become the well established name for a process that fits a mathematical “model” to data of a dependent variable. The same term is also regularly used in calling the result a regression line.

18.4.1 Historical Background

Fitting a mathematical model was not the idea, however, when the term regression was originally proposed in 1885 by Francis Galton, who is often regarded (at least in English-speaking countries) as the founder of biometry. While studying familial manifestations of genetics, Galton compared the height of parents and the corresponding height of their children. Fitting a straight line to the plot of points presented here in Figure 18.12, Galton1 noted a phenomenon that he initially called reversion but later

FIGURE 18.12
Format of graph displayed by Francis Galton to demonstrate "regression toward mediocrity." [Figure derived from Chapter Reference 1. Galton's figure is titled "Rate of Regression in Hereditary Stature," with the note "The Deviates of the Children are to those of their Mid-Parents as 2 to 3," and the legends "When Mid-Parents are taller than mediocrity, their Children tend to be shorter than they" and "When Mid-Parents are shorter than mediocrity, their Children tend to be taller than they." Height in inches (65 to 72) is plotted against deviate in inches (−4 to +4).]

termed regression toward mediocrity. The tallest parents tended to have children who were shorter than themselves; and the shortest parents tended to have correspondingly taller children. The extreme values of tall or short height for the parents were associated, in the children, with heights that were closer to the mean for each variable.


18.4.2 Regression to the Mean

The phenomenon Galton noted as reversion to mediocrity is today called regression to the mean and is still regularly encountered, but not in Galton’s format. In subsequent repeated measurements of people who initially have the highest (or lowest) values of blood pressure or serum cholesterol in a group, the originally extreme values often tend to regress toward the mean of the group. The data analyst then has to decide whether the change was due to treatment or whether values that were higher because of random variation had later regressed to the mean.2,3 [An analogous event, occurring in professional sports, is sometimes called the “outstanding rookie’s second-year slump.” The person who may have randomly been the best player among the first-year rookies may regress to the mean in the next year, and is thought to have “slumped.”]
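A toy simulation (all numbers invented) shows the phenomenon: subjects selected for extreme first measurements drift back toward the group mean on remeasurement, even when nothing about them has changed.

```python
# Toy simulation of regression to the mean (all numbers invented).
# People chosen for extreme first measurements tend to measure closer
# to the group mean the second time, with no treatment at all.
import random

random.seed(1)
TRUE_MEAN = 120.0   # hypothetical "true" group mean, e.g., blood pressure

true_values = [random.gauss(TRUE_MEAN, 10) for _ in range(10_000)]

def measure(true_value):
    """One noisy measurement of a person's underlying true value."""
    return true_value + random.gauss(0, 10)

first = [measure(t) for t in true_values]
second = [measure(t) for t in true_values]

# Select the 5% of people with the highest first measurements.
cutoff = sorted(first)[int(0.95 * len(first))]
top = [i for i, v in enumerate(first) if v >= cutoff]

mean_first = sum(first[i] for i in top) / len(top)
mean_second = sum(second[i] for i in top) / len(top)

# The selected group's second-measurement mean falls back toward 120.
print(round(mean_first, 1), round(mean_second, 1))
```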

Galton used the term regression for the straight line that he drew through the bivariate dimensional points for heights of parents and children in Figure 18.12, but the mathematical strategy that fit the line to the data had nothing to do with either the biologic idea of regression to the mean, or the statistical idea that with random variations in measurement, the extreme values on one occasion subsequently become closer to the mean. Nevertheless, the term regression became rapidly accepted and thoroughly entrenched for a totally different idea: the mathematical procedure of fitting a line to a set of bivariate dimensional data.

18.4.3 Straight-Line Models and Slopes

Although many kinds of curved lines can be fitted to a bi-dimensional set of points, the standard approach uses a rectilinear, i.e., straight-line, model, Y = a + bX, which is mathematically arranged to do its best job in fitting the data. The fit may be good or poor, but the slope of the straight line indicates a trend that is constant throughout all zones of the data. Consequently, no matter how well the line fits, the single value of its slope will be misleading if the data have different gradients in different zones. Thus, if we want to know the general trend of the data, the single slope may be a quite satisfactory average, but if we want to know about the trends in different zones, the single slope may produce serious distortions of what is happening.

The problem is illustrated for bi-dimensional data by the three sets of points shown in Figure 18.13. Figure 18.14 shows the results of applying a best-fitting straight-line model to each set of points. As expected from the pattern of points, the left-hand set of data is excellently fit by the straight line. The middle set of data is relatively well fit by the straight line, which will be given credit for a good achievement, according to the customary indexes of fit (discussed in Chapter 19). Nevertheless, the slope of the line will fail to show the three different gradients present in the three zones of data: small gradients at the low and high ends of the X variable and a large gradient in the middle.
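The middle pattern's problem can be shown numerically with invented data: a least-squares slope (computed as Sxy/Sxx, anticipated from Chapter 19) gives one overall gradient that conceals the three different zone gradients.

```python
# Sketch of the middle pattern in Figure 18.13: a single fitted slope
# summarizes the overall trend while hiding different gradients in
# different zones. Data values are invented for illustration.
def ls_slope(pts):
    """Least-squares slope Sxy/Sxx for a list of (x, y) points."""
    n = len(pts)
    xb = sum(px for px, py in pts) / n
    yb = sum(py for px, py in pts) / n
    sxy = sum((px - xb) * (py - yb) for px, py in pts)
    sxx = sum((px - xb) ** 2 for px, py in pts)
    return sxy / sxx

low = [(0, 10), (1, 11), (2, 12)]     # small gradient at low X
middle = [(3, 20), (4, 35), (5, 50)]  # large gradient in the middle
high = [(6, 58), (7, 59), (8, 60)]    # small gradient at high X

overall = ls_slope(low + middle + high)
print(round(ls_slope(low), 1), round(ls_slope(middle), 1),
      round(ls_slope(high), 1), round(overall, 1))  # → 1.0 15.0 1.0 7.8
```

The single overall slope of about 7.8 is a reasonable average trend, but it matches none of the three zone gradients.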

FIGURE 18.13
Patterns for three sets of bi-dimensional points. [Three Y-versus-X panels.]

For the right-hand set of data in Figures 18.13 and 18.14, the straight line will have a slope close to 0, indicating no relationship between X and Y. Nevertheless, X and Y have the strong relationship that is evident from visual inspection of the pattern. The straight line will completely distort this pattern, by
