
“independent” variable, X, was examined unidirectionally for its effect on a “dependent” target, Y. Thus, age might affect a child’s height. Such an association was often called a regression of Y on X, but other targeted relationships were examined when we compared mean weight in two groups, when two variables formed a curve in which survival depended on time, or when stratifications (and adjustments) showed the effect of the variables (or groups) used for the stratification.

In the non-targeted bidirectional associations, the two variables were related to one another without implying that one depended on the other. The nondependent association was often called a correlation between Y and X. Thus, we might examine the correlation between hemoglobin and hematocrit or between hematocrit and serum cholesterol, without suggesting that one variable affected the other. We could also examine concordance for the agreement between two ratings, without choosing one of them as a “gold standard” for accuracy.

28.1.2 Number of Variables and Symbols

The two variables in a bivariate association could easily be symbolized as X and Y, and additional letters such as U, V, and W could be used to denote the multiple extra variables. Many alphabetical letters, however, have already been reserved for other purposes (e.g., N for sample size; P for proportions, rates, or probabilities; R for correlations or rates; Z for standardized scores or variables).

To avoid running out of available letters, the multiple available variables are identified with numerical subscripts in symbols such as Y1, Y2, Y3, …, which will represent the dependent or target variables. The general symbol for any of the independent variables is Xj; and the collection of independent variables (X1, X2, X3, X4, …) in a particular analysis is denoted as {Xj}. These new symbols can be confusing because they differ from what was used previously, when X1, X2, X3, … showed the values of persons 1, 2, 3, … for a single variable, X. In the new arrangement, the subscripts represent variables, not persons. To show both persons and variables, we would need double subscripts such as Xi,j (often written as Xij) to represent the value for person i in variable j. Thus, X2,6 would be the value of variable 6 for person 2; and X39,18 would be the value of variable 18 for person 39. In most circumstances, however, the double subscripts can be avoided without ambiguity, and Xj will represent individual independent variables.
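The double-subscript convention amounts to ordinary row/column indexing of a data matrix. A minimal Python sketch (the data values are invented for illustration):

```python
# Illustrative sketch (values invented): rows are persons i, columns are
# variables j, so the double subscript X_ij is plain row/column indexing.
data = [
    [5.2, 120, 38],   # person 1: X(1,1), X(1,2), X(1,3)
    [4.8, 135, 42],   # person 2: X(2,1), X(2,2), X(2,3)
    [6.1, 110, 35],   # person 3: X(3,1), X(3,2), X(3,3)
]

def x(i, j):
    """Return X_ij using the 1-based subscripts of the text."""
    return data[i - 1][j - 1]

print(x(2, 3))  # the value of variable 3 for person 2 -> 42
```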

The analytic situation becomes multivariable when three or more variables are examined simultaneously. In directional orientation, the multivariable associations can have the relationships illustrated in Figure 28.1. In a many-to-one or uni-targeted relationship, independent variables such as X1 = age, X2 = sex, X3 = smoking, X4 = serum cholesterol, and X5 = hematocrit are related to a single target, Y = blood pressure. Multiple linear regression is an example of this analytic procedure. In a many-to-many or multi-targeted relationship, a set of several independent variables might be related simultaneously to several target variables, such as Y1 = duration of survival, Y2 = functional status, Y3 = costs of care. Canonical analysis is a procedure that might be used for this many-to-many association. In a many-internal nontargeted relationship, the variables are neither dependent nor independent. They are analyzed together for their own interrelationships with one another, not with a target variable. Factor analysis and principal component analysis are procedures applied in this manner.

FIGURE 28.1
Three main types of relationships among multiple variables. On the left (Many-to-One), the five variables X1, X2, X3, X4, and X5 are related to a single external variable, Y1. In the middle (Many-to-Many), they are related to several external variables, Y1, Y2, Y3. On the right (Many-Internal), the five variables are related internally to one another.

28.1.3 Scales of Variables

As noted in Chapter 2, any variable can be expressed in a dimensional or categorical scale; the categorical scales can be binary or polytomous; and the polytomous categories can be ordinally ranked or nominally unranked.

© 2002 by Chapman & Hall/CRC

Age, height, and serum cholesterol are all dimensional variables. Scales such as alive/dead, success/failure, male/female are all binary. Severity of illness can have an ordinally ranked polytomous scale such as none, mild, moderate, severe; and religion has an unranked nominal polytomous scale expressed (alphabetically) as Buddhist, Catholic, Confucian, Hindu, … .

The categorical scales are a major source of complexity, leading to different formats for the analyses. For example, if the target variable is a dimensional blood pressure, the multivariable analytic procedure will usually be multiple linear regression. If the target variable is a binary alive/dead expression, however, the analysis will usually be done with multiple logistic regression. In the many-internal analytic procedures that are done without a target, variables expressed in categories usually receive a procedure called cluster analysis, but dimensional variables usually receive factor analysis or principal component analysis. In many-to-one analyses, categorical independent variables may be examined with recursive partitioning or with conjunctive consolidation rather than with conventional regression procedures. (All of these new names are used merely to illustrate the distinctions and to give you a preview of coming attractions if you eventually go beyond the current text to learn about multivariable analysis.)

28.1.4 Patterns of Combination

Multiple variables can be combined in two fundamentally different patterns: algebraic models and decision-rule clusters.

28.1.4.1 Algebraic Models — With algebraic models, the variables are given coefficients, called weights, and are algebraically combined in the configuration of a geometric pattern, such as a straight or curved line. The linear regression equation Y = a + bX is a rectilinear algebraic model that relates the dependent variable Y to the independent variable X, which is weighted with the coefficient b. The value of a is the intercept when X = 0. The expression is constructed with the assumption that the relationship between Y and X can be “modeled” with a straight line. For curved models, the format of X (or Y) is appropriately changed. For example, Y = a + bX^2, Y = a + b log X, log Y = a + bX, Y = e^(a + bX), and Y = a + bX + cX^2 are algebraic models that form curved lines for the two variables, X and Y.

For multiple variables, the straight-line algebraic models form “hyperplanes” or “multi-dimensional surfaces.” Thus, Y = b0 + b1X1 + b2X2 is a plane in three-dimensional space, showing the dependence of Y on the independent variables, X1 and X2. They are weighted with the respective coefficients, b1 and b2; and b0 is the intercept. The Apgar1 score is a multidimensional non-targeted expression that takes the form

V = b1X1 + b2X2 + b3X3 + b4X4 + b5X5

V is the new variable (i.e., the Apgar score) formed by the combination; each of the five constituent Xj variables is expressed in a rating scale of 0, 1, or 2; and each of the five bj coefficients has a value of 1.
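The unit-weight combination can be sketched in a few lines of Python. Only the 0–1–2 rating scales and the equal coefficients come from the text; the particular ratings below are invented:

```python
# Sketch of the unit-weight Apgar combination V = X1 + X2 + X3 + X4 + X5.
# The component names follow the text; these ratings are invented.
ratings = {
    "heart_rate": 2,
    "respiratory_rate": 1,
    "color": 1,
    "muscle_tone": 2,
    "reflex_responses": 2,
}

assert all(r in (0, 1, 2) for r in ratings.values())  # 0-1-2 rating scale
score = sum(ratings.values())  # every b_j coefficient is 1
print(score)  # -> 8 (the score can range from 0 to 10)
```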

The formation of algebraic models is now the conventional method of statistical analysis for multiple variables or multiple categories, particularly because the availability of modern computers has eased the formidable difficulties of calculating the appropriate bj coefficients from the data. The availability of computers, however, has also allowed development of the different approach discussed next.
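As a hedged illustration of how such bj coefficients are calculated, the following sketch fits the plane Y = b0 + b1X1 + b2X2 by ordinary least squares on simulated data; the variable values and the “true” coefficients are invented:

```python
import numpy as np

# Illustrative sketch: estimating b0, b1, b2 in Y = b0 + b1*X1 + b2*X2
# by least squares. The data and the "true" coefficients are invented.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(0, 0.1, 50)

# Design matrix with a leading column of 1s for the intercept b0.
A = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(b, 2))  # close to the true values [2.0, 0.5, -1.5]
```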

28.1.4.2 Decision-Rule Clusters — Forming decision-rule clusters is a common (although informal) multivariable strategy in medical activities. The name cluster is given to a collection of categories that form a group or subgroup. The collection can be simple or compound. For example, the variable age can be partitioned into two simple categories, old and young. These categories form groups that can be categorically subdivided into simple subgroups such as old men and tall women. Boolean symbols such as old ∩ male can be used to denote categories such as old men. A compound categorical cluster might contain old men and/or tall women, which could be symbolized as (old ∩ male) ∪ (tall ∩ female).

An analytic method that uses groups or subgroups is particularly compatible with the fundamental thought processes of biologists and clinicians, who usually think in categories, and seldom make decisions based on combinations of weighted variables. The term clusters is simply a formal designation for the


collections of categories that form the groupings. The TNM staging system for cancer is an example of decision-rule multi-categorical clusters. The system begins with the ordinal ratings given to three variables: T, which represents features of the primary tumor; N, which represents regional lymph nodes; and M, for distant metastases. In ordinary description, the TNM index gives ratings for each of these variables in a tandem profile such as T2N1M0 or T4N0M2. According to an arbitrary decision rule, these profiles are then assigned to clustered categories that are called stages. Stage I usually indicates a localized cancer. Stage II usually indicates a cancer that has spread to adjacent lymph nodes, but no farther. Stages III, IV, and V refer to increasingly distant degrees of spread.
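A decision rule of this kind can be expressed as a simple branching function. The sketch below is purely illustrative: the thresholds are invented and do not reproduce the actual TNM staging rules, which are set separately for each cancer site.

```python
# Purely illustrative decision rule mapping a TNM profile to a stage.
# The thresholds are invented and do NOT reproduce the real TNM rules.
def stage(t: int, n: int, m: int) -> str:
    if m > 0:
        return "IV"                       # any distant metastasis
    if n > 0:
        return "II" if n == 1 else "III"  # spread to regional nodes
    return "I" if t <= 2 else "II"        # localized primary tumor

print(stage(2, 1, 0))  # profile T2N1M0 -> "II"
print(stage(4, 0, 2))  # profile T4N0M2 -> "IV"
```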

Clustered categories are formed according to diverse decision rules that are determined biologically or statistically (or both), without reliance on the “model” of an algebraic configuration. The decision rule can also be expressed as an algorithm in which the sequential addition of categories leads to specified conclusions. Such algorithms now often appear as “guidelines” for clinical practice.

28.2 Basic Concepts of Non-Targeted Analyses

The multivariable procedures in this chapter contain non-targeted, “many-internal” analytic methods that have such names as factor analysis, principal component analysis, and cluster analysis. In non-targeted analyses, the selected variables are examined for their interrelationship with one another, not with a specific target variable. The goal is usually to reduce the variables into a smaller set of “factors” or other entities that contain the most cogent information.

For example, when Virginia Apgar1 constructed the score by which she is commemorated, she wanted to form a multivariate index that would describe the “severity” of the condition of a newborn baby. The index was not intended to predict an external target, such as survival or need for special therapy. The goal was merely to specify explicitly the components of a judgment about “condition” that was formerly expressed implicitly in such general terms as excellent, good, fair, and poor.

28.2.1 “Intellectual Dissection”

Apgar’s attempt to specify and reconstruct the components of an implicit global “intuition” or “judgment” has become a relatively common medical event. The initial judgment might have been singly or doubly “global.” In a singly global judgment, the phenomenon under discussion is well identified, but no criteria are offered to demarcate the categories of an ordinal rating scale such as excellent, good, fair, or poor. In a doubly global judgment, neither the phenomenon nor the rating scale is clearly identified. For example, if we ask someone “How are you?,” the answer is doubly global when the respondent says, “Good.” The same response would be singly global, however, for the question, “How is the pain in your ankle?”

An indication of the modern non-global approach is that clinicians today seldom talk about such things as a “guarded prognosis.” Instead, a duration of survival may be predicted from the specifications of an algorithm or equation containing multiple variables. In Virginia Apgar’s approach, the entity she wanted to dissect and reconstruct, as the condition of a newborn baby, was a clinical pathophysiologic status, rather than such attributes as cleanliness of skin, size of limbs, or excitement of parents. After reviewing all the pathophysiologic variables that might be observed neonatally, Apgar chose five as being most important: heart rate, respiratory rate, color, muscle tone, and reflex responses. She rated each variable in a three-category ordinal scale of 0, 1, or 2. She then added the five ratings to form a score that ranged from 0 to 10.

Apgar constructed the composite score entirely according to her clinical judgment, without any formal mathematical procedures. The statistical methods that are about to be discussed offer a formal approach to the analogous process.

28.2.2 Face Validity

When Apgar identified and gave rating scales to the selected five variables, she had a clear vision of what should emerge in the composite score. Her years of clinical experience had given her a substantive


knowledge of the particular variables that would be most important for indicating pathophysiologic status; she could readily “dissect” her “intuition” to identify those variables; and she could examine the subsequent composite scores to determine that they were “clinically sensible” in doing the desired task. This type of “sensibility” is often called face validity. It cannot be discerned with any type of statistical calculation and requires a combination of “common sense” plus knowledge of the substantive issue under consideration. For example, the Apgar Score would have excellent face validity as an index of pathophysiologic condition, but not as an index of socioeconomic status.

Face validity is a particularly thorny problem in non-targeted analyses of multivariable data. If a target has been identified, it serves as a direct criterion for evaluating face validity in the analytic accomplishment. Thus, a particular prognostic index can readily be shown (with appropriate data) to be effective or ineffective in its job of predicting survival. In the absence of a target to serve as a criterion, however, the non-targeted analytic methods are constructed exclusively with mathematical principles. The constructions may then be splendidly ingenious acts of mathematics, but ultimately unsuccessful because they lack face validity.

The factors (or other entities) created by composite variables can be constructed in an “inside-out” or “outside-in” manner. The Apgar Score was an inside-out construction, accomplished when Apgar examined her “inside” clinical judgment to identify and recombine the cogent (i.e., “sensible”) variables into a composite result. In contrast, the mathematical methods produce an outside-in approach. The available “outside” information is first processed mathematically, without any criteria for “sensibility”; and face validity of the composite result is evaluated afterward.

This distinction becomes particularly important when non-targeted analyses are used to construct and to offer a purely statistical “validation” for such composite variables as competence, intelligence, health status, satisfaction with care, and quality of life.

28.3 Algebraic Methods

The two most commonly used non-targeted algebraic methods are called factor analysis and principal component analysis. In recent years, these methods have sometimes been called “latent variable” analysis because the results are aimed at forming a separate latent idea or “construct” that is not itself directly observed. For example, Apgar’s construct was the severity of the clinical condition in newborns. Other clinical constructs such as congestive heart failure or hepatic decompensation are not observed directly, but are identified or conceptually derived from such observed entities as dyspnea, edema, or jaundice. The construct called intelligence is the latent variable that was the incentive2 for the early development of factor analysis.

In the customary factor-analytic procedure, data for a set of variables in a group of persons are examined for the intercorrelations among the variables. Those that most strongly correlate with one another have a “shared variance” that is then attributed to their presumptive correlation with the “external” latent variable. For example, students whose scores are correlated as correspondingly high (or low) values in tests of reading, writing, and arithmetic may be regarded as having a high (or low) level of a common factor, which is the latent variable called intelligence.

Figure 28.2, showing data for only two variables, X1 and X2, will give a particularly simple illustration of the ideas. The small solid square is the centroid, which is the common mean of the two variables, located at the coordinate for (X̄1, X̄2). In the original axes for the two variables, the variances around the centroid would be calculated as the group variance for X1, the group variance for X2, and the group covariance for the two variables. To simplify the symbols, the group variances can be written as S11 and S22 for variables X1 and X2, respectively, and the covariance can be designated as S12. Suppose we now prepare a new axis as two lines, V1 = a1X1 + a2X2 and V2 = b1X1 + b2X2, that pass through the centroid as shown in the figure, and suppose we cite the coordinate of each point according to its distance along the new axes. The group variance of the points will be larger along the V1 axis, where the points are widely spread, than along the V2 axis, where the points have smaller distances. (In an ideal situation, all of the points would fit perfectly on the V1 line, and no deviations would remain to produce variance along the V2 axis.) If we wanted to replace the original variables, X1 and X2, by a new factor that


“captured” most of the common variance in their interrelationship, the line constructed as V1 = a1X1 + a2X2 would be an excellent choice, and the residual variance could be ignored in the V2 axis.
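The geometry of Figure 28.2 can be sketched numerically: the V1 and V2 axes correspond to the eigenvectors of the 2 × 2 variance–covariance matrix, and the eigenvalues show how much group variance lies along each axis. The data below are simulated for illustration.

```python
import numpy as np

# Simulated data: X2 tracks X1 closely, so most of the variance should
# lie along the V1 axis through the centroid (as in Figure 28.2).
rng = np.random.default_rng(1)
x1 = rng.normal(0.0, 1.0, 200)
x2 = 0.9 * x1 + rng.normal(0.0, 0.3, 200)
data = np.column_stack([x1, x2])

centroid = data.mean(axis=0)            # the common mean of X1 and X2
cov = np.cov(data, rowvar=False)        # [[S11, S12], [S12, S22]]
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

v1_share = eigvals[-1] / eigvals.sum()  # fraction of variance along V1
print(round(v1_share, 3))               # most of the variance is on V1
```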

FIGURE 28.2
Data for two variables, X1 and X2, with V1 and V2 as new axes passing through the centroid. Each data point is much closer to the V1 axis than to the V2 axis. For further details, see text.

28.3.1 Analytic Goals and Process

The mathematical goal is to combine some or all of the existing variables into a smaller set of new variables, called factors, that best “explain” the relationships among the original variables. The new factors should ideally not be correlated with one another, but should be composed of variables having substantial intercorrelations. For example, suppose we have five variables, X1, X2, …, X5. When finished, the analytic procedure should produce four (or fewer) factors constructed as the following weighted combinations:

V1 = a1X1 + a2X2 + a3X3 + a4X4 + a5X5

V2 = b1X1 + b2X2 + b3X3 + b4X4 + b5X5

V3 = c1X1 + c2X2 + c3X3 + c4X4 + c5X5

V4 = d1X1 + d2X2 + d3X3 + d4X4 + d5X5

The second factor, V2, is constructed to “explain” the variance that remains after the construction of V1. Factor V3 explains what is left after V1 and V2 are formed, and so on. In general, the goal is to get the desired “explanation” with no more than two factors, and the original variables need not appear in each factor. (For the Apgar Score, a large number of possible candidate variables was judgmentally reduced to five, which were then combined into a single factor in which all coefficients had a value of 1. The single Apgarian factor could be cited as V = X1 + X2 + X3 + X4 + X5, with each Xj having a rating scale of 0, 1, or 2.)

The analytic process begins by examining the matrix grid that is outlined with the variables cited in the rows and columns in Table 28.1. The entries in the cells of Table 28.1 can be either correlations or group variances and covariances. When each variable is related to itself, the group variances are S11 for X1X1, S22 for X2X2, etc. For intervariable relationships, the group covariances are S12 for X1X2 (or X2X1), S34 for X3X4 (or X4X3), and so on. The entries in the cells can also be correlation coefficients such as r12 (or r21) and r34 (or r43). In the main left-upper-to-right-lower diagonal, these coefficients will all be 1, when each variable is correlated with itself. Inspection of a matrix of correlation coefficients will immediately offer an idea of which variables are closely or hardly interrelated with one another.

TABLE 28.1
“Skeletal” Outline of Variables in a Correlation or Variance–Covariance Matrix

        X1    X2    X3    X4
  X1
  X2
  X3
  X4
  ⋮
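Such a matrix is easy to compute and inspect. The sketch below (with invented data) builds the correlation form of Table 28.1 for four variables, two pairs of which are strongly interrelated:

```python
import numpy as np

# Invented data: x2 is closely related to x1, and x4 to x3, so the
# correlation matrix should show large r12 and r34 entries and a
# diagonal of 1s (each variable correlated with itself).
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.2, size=100)
x3 = rng.normal(size=100)
x4 = -x3 + rng.normal(scale=0.2, size=100)

r = np.corrcoef([x1, x2, x3, x4])  # rows/columns correspond to X1..X4
print(np.round(r, 2))
```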


28.3.2 Operational Strategies and Nomenclature

The procedures have complex operational strategies and a correspondingly arcane jargon of names. [The names are mentioned here so that they will not be total strangers if you meet them in a published report. A complete explanation of the concepts, however, is beyond the scope of this text. You can read more about them (and hope to find clear presentations somewhere) if you begin giving serious attention to using these methods or their results.] Equations derived from appropriate arrangements of the correlation or variance–covariance matrixes are solved to construct factors having the form of V = a1X1 + a2X2 + a3X3 + a4X4 + … . The solutions to a special set of equations are called eigenvalues (or characteristic roots); and the “best” factors, i.e., those that “explain” the most total variance, have the highest eigenvalues. (The hope is that the eigenvalues will be large for one or two factors, and small for others.) The coefficients (a1, a2, a3, …) indicate the variables that are most prominent in the arrangement of each factor.

After being selected, the factors can be checked for their correlation coefficients with the individual variables. For example, the Apgar scores of a group of babies could be examined for their correlation with the variables of heart rate, reflex responses, etc. These coefficients for a particular factor are usually called factor loadings. They represent the prominence of each variable in each factor (and vice versa). The sum of the squared factor-loading coefficients is called its communality. The larger the communality, the larger is the amount of shared variance that is presumably “explained” by that factor.
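These quantities can be sketched for a simple unit-weight factor. The data are simulated; the factor here is just the sum of the five variables (as in the Apgar example), and the communality is computed as the sum of squared loadings, following the text:

```python
import numpy as np

# Sketch with simulated data: "loadings" computed as correlations
# between each variable and a unit-weight composite factor, and the
# communality as the sum of the squared loadings (as in the text).
rng = np.random.default_rng(3)
variables = rng.normal(size=(5, 2000))   # five variables, 2000 persons
factor = variables.sum(axis=0)           # all coefficients equal to 1

loadings = np.array([np.corrcoef(v, factor)[0, 1] for v in variables])
communality = float(np.sum(loadings ** 2))
print(np.round(loadings, 2), round(communality, 2))
```

For five independent variables of equal variance, each loading should be near 1/√5 ≈ 0.45, so the communality is near 1.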

Because the attached coefficients indicate the impact of the component variables, factors are sometimes named according to the constituents that have high impact. For example, V1 might be called a body-size factor if it has large coefficients for the variables height and weight; and V2 might be called a complexion factor if it has large coefficients for variables indicating the color of skin, hair, and eyes. The body-size factor in the foregoing example would probably have high factor loads for height and weight, but small loads for skin, hair, and eyes.

28.3.3 Rotations

Perhaps the most complex (and controversial) part of the operating procedure occurs after the most prominent (or “principal”) components have been identified. In the basic mathematical reasoning, all of the factors are regarded as orthogonal, i.e., their axes are perpendicular to one another. (This type of orthogonality cannot be visualized for four or more axes, but it is easy to assume mathematically.)

In the next step, the axes are often rotated to produce an optimum arrangement. The criteria of optimality differ with different strategies of rotation, which can keep the axes orthogonal or change them into oblique patterns. Among the many names for these rotational procedures are varimax and oblique Procrustean.

28.3.4 Factor vs. Component Analysis

Factor analysis and principal component analysis are mathematical “siblings” that do essentially the same thing in slightly different ways. In published literature, the investigators often begin with principal component analysis, but then call the results factor analysis after the rotations cited in the preceding section. The operations of the two methods also differ in using “standardized” Z scores, e.g., (X − X̄)/s, or the original units of expression for the variables. The methods may also work with the matrix of correlation coefficients or covariances, and may find the appropriate components (or factors) with least squares computations or with a special method called maximum-likelihood solutions. The main conceptual distinction has been described by Maurice Kendall:3 “In component analysis we begin with the observations and look for components … [going] from the data toward a hypothetical model. In factor analysis, we work the other way round … .”

28.3.5 Published Examples

The published examples cited here were taken from literature that happened to be noted by the author, without a systematic search. The investigators almost never cite reasons for choosing between factor and principal component analysis, and probably use whatever program is conveniently available or preferred by the statistical consultant.


28.3.5.1 Factor Analysis — With factor analysis, an “index of need for health resources” was developed4 for geographic regions of the nation of India. After the original principal components approach, applied to 7 variables, “revealed two factors” that were “not directly meaningful,” the investigators applied varimax rotation to produce two other factors that had “immediate use.” The first factor, called proximate determinants, had high “loadings” for the four variables that indicated rates of homicide, crude deaths, infant mortality, and crude births. The second factor, called sociomedical background, had high loadings for the other three variables, which indicated population rates or proportions of doctors, illiterates, and hospital beds. The percentage of “explained variation” was 67% for the first factor and 16% for the second.

When data for 360 patients with rheumatoid arthritis received an “iterated principal component factor analysis,” the investigators5 decided that the 9 constituents of the Arthritis Impact Measurement Scale could be reduced to a 4-factor model that accounted for 77% of the variance. After a 5th factor was added to allow better assessment of “the areas of clinical interest,” the investigators concluded that the “model” containing “5 distinct components” was “preferable to 9 more highly correlated scales for use as explanatory variables.”

Factor analysis was applied6 when 428 physicians responded to a questionnaire containing 56 items intended to measure attitudes that “influence resource utilization.” After various forms of matrix rotation and statistical checking, the investigators emerged with factors for “four prominent domains [that] closely corresponded with our hypothesized domains a priori. [They] were interpreted as cost-consciousness, discomfort with uncertainty, fear of malpractice, and annoyance with utilization review.”

In other medical applications, factor analysis was used as follows: to reduce 28 candidate variables to four “meaningful factors” that could be used as “outcome measures” in chronic obstructive pulmonary disease7; to convert ten variables in patients with asthma to three factors that “provided a useful summary of asthma severity”8; and to produce 3 factors for 46 sociogeographic variables associated with multiple sclerosis in regions of the United States.9

28.3.5.2 Principal Component Analysis — After examining data from a battery of 24 audiologic tests, Henderson et al.10 used principal component analysis to derive a composite score as the primary outcome variable in a clinical trial comparing the efficacy of three cochlear implant devices for bilateral profound hearing loss. The investigators found that “the first principal component had its largest coefficients associated with the most difficult audiologic tests” and that “changes in the composite score over time were also closely related to subjective impressions of implant performance” by the patients and clinicians.

In an ophthalmologic application,11 principal component analysis was used to analyze 53 variables, which were the “threshold” values at visual field locations checked with the Humphrey automated perimeter. For locations in both eyes of 304 “normal” persons, the data set contained 53 × 2 × 304 = 32,224 threshold measurements. The investigators found (probably to no one’s surprise) that the average threshold value for each person became the first principal component in accounting for variations.

Principal component analyses have often been applied to differentiate symptom patterns in patients with schizophrenia, tested with the PANSS (Positive and Negative Syndrome Scale) inventory. A five-factor model is reported12 to be “an interesting tool when new, selective, psychopharmacological drugs are to be evaluated.” Principal component analyses have also been used to combine laboratory tests and culture conditions in melanomas grafted onto mice,13 to reduce 6 tests of function to three principal components in pulmonary disease,14 and to divide 61 patients receiving coronary bypass grafts into three “subtypes” formed from eight morphologic and electrocardiographic variables.15

28.3.6 Biplot Illustrations

If a good reduction of the variables is provided by the first two principal components, they can be used as the main axes of a biplot graph, which shows the relationships among different members of a group. Each member’s location is plotted according to coordinates for the first and second principal component.


28.4 Cluster Analysis

Cluster analysis, which yields sets of composite categories rather than sums of weighted variables, can have an intuitive appeal to biologists concerned with challenges in taxonomic classification. Appropriate new forms of taxonomy were the basis for such major scientific advances as the Linnaean classification of animals and plants, Darwin’s theory of evolution, Mendeleyev’s periodic table, and the astronomic partition of “dwarf” and “giant” stars. In clinical medicine, the fundamental taxonomy used for thoughts about human ailments is a set of nosographic categories, called the International Classification of Diseases, which is revised every 10 years.

During the latter half of the 20th century, as new forms of technologic and other data became available, diverse investigators wanted to eliminate the “subjective” judgments of traditional taxonomy, and to substitute new systems created by numerical methods that were presumably objective and “stable.” According to an excellent text by Everitt,16 the strategies of cluster analysis were applied in all of these methods but were given different names: numerical taxonomy in biology, Q analysis in psychology, and unsupervised pattern recognition in the field of artificial intelligence.

The goal of cluster analysis is to divide the persons or other entities into an appropriate set of categorical clusters; and the process begins, as in factor or component analysis, with data for multiple variables in members of an observed group.

28.4.1 Basic Structures

The clustering process can be arranged to form either partitions or trees. In a partition, each object is assigned to a single clustered location. In a tree, the hierarchical organization allows objects to belong to a “pedigreed” family of clusters.

Figure 28.3 illustrates a hierarchical family arrangement of different species of horses.16 The hierarchy in this instance was established judgmentally. The process begins with the “parent” group and forms a “family” in successive splits of the “pedigree.” When cluster analysis is done mathematically with a hierarchical method, the direction is reversed, starting with individual members that are successively combined to form the family groups. The branching pattern of construction is often called a dendrogram. Figure 28.4 shows a dendrogram developed17 to classify the distribution of pain in 127 patients with

[Figure 28.3 appears here: an evolutionary tree whose branches carry horse genus and species labels, including Hypohippus, Mesohippus, Archaeohippus, Parahippus, Merychippus, Nannipus, Neohipparion, Calippus, and Pliohippus.]

FIGURE 28.3

Hierarchical family arrangement showing an evolutionary tree for species of horses. (Figure taken from Chapter Reference 16.)


[Figure 28.4 appears here, with its three main clusters marked A, B, and C.]

FIGURE 28.4

Dendrogram showing distribution of temporomandibular pain in 127 patients, and subsequent hierarchical formation of three main groups, marked A, B, and C. (Figure taken from Chapter Reference 17.)

temporomandibular (TM) joint dysfunction who had indicated the locations of their pain graphically in squares of a stylized graph of the face. In the main hierarchical clusters of Figure 28.4, the pain was mainly over the TM joint for group A, involved the ramus (vertical portion) of the mandible in group B, and extended over the zygomatic arch in group C.
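The bottom-up construction just described can be sketched in a few lines. This is an illustrative single-linkage version (one of several joining strategies; see Section 28.4.2) with entirely hypothetical data; the recorded merge sequence is the information a dendrogram displays.

```python
import math

def single_linkage(points):
    """Agglomerative clustering: start with each point as its own cluster,
    then repeatedly merge the two clusters whose closest members are
    nearest (single linkage), recording each merge."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    def dist(a, b):
        return math.dist(points[a], points[b])
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), round(d, 3)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Two tight pairs plus an outlier: the pairs merge first,
# and the outlier joins last.
pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for left, right, d in single_linkage(pts):
    print(left, '+', right, 'at distance', d)
```

Reading the merge list from first to last corresponds to reading a dendrogram from its leaves up to the single “parent” group.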

In nonhierarchical partitions, the members are grouped into clusters that do not overlap and that do not have “familial” relationships. Figure 28.5 illustrates a partition18 of four clusters for 16 members.
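The text does not tie partitions to any particular algorithm. As a hedged illustration, here is a minimal sketch of k-means, one widely used nonhierarchical partitioning method, applied to hypothetical data:

```python
import math, random

def k_means(points, k, iterations=50, seed=0):
    """Partition `points` into k non-overlapping clusters: assign each
    point to its nearest centroid, move each centroid to the mean of
    its assigned points, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for c in range(k):
            if groups[c]:  # keep the old centroid if a group emptied out
                n = len(groups[c])
                centroids[c] = tuple(sum(v) / n for v in zip(*groups[c]))
    return groups

# Hypothetical data: members measured on two variables.
pts = [(0, 0), (1, 0), (0, 1),
       (10, 10), (11, 10),
       (0, 10), (1, 11),
       (10, 0), (10, 1)]
partition = k_means(pts, k=4)
```

Unlike the hierarchical methods, the result is a flat set of clusters with no “familial” relationships among them, as in Figure 28.5.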

28.4.2 Operational Strategies

The main operational strategies in cluster analysis involve the choice of mechanisms for “measuring” the “similarity” or “affinity” of different objects. The different tactics used for this purpose create a complex array of strategies and nomenclatures.

One common approach is to measure the distance between values of the variables. For example, suppose person A has values of 35 and 109, respectively, in variables 1 and 2, and person B has the corresponding values of 27 and 183. With the Pythagorean theorem, the squared Euclidean distance between the two persons is (35 − 27)² + (109 − 183)² = 5540, so that the Euclidean distance itself is √5540 ≈ 74.4. A similar tactic can be used for Euclidean distance in more than two variables. Because the values of the variables may be correlated with one another, multivariable distance is usually measured instead with an entity called Mahalanobis D², which takes account of the covariances.
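The arithmetic for the two persons, plus a Mahalanobis D² computed from a purely hypothetical covariance matrix (the text gives no covariance values), can be sketched as:

```python
import numpy as np

# The two persons from the text: values for variables 1 and 2.
a = np.array([35.0, 109.0])
b = np.array([27.0, 183.0])

# Squared Euclidean distance: (35 - 27)^2 + (109 - 183)^2 = 64 + 5476 = 5540
sq_euclid = float(np.sum((a - b) ** 2))
euclid = sq_euclid ** 0.5  # about 74.4

# Mahalanobis D^2 divides out the variances and covariances of the
# variables; this 2 x 2 covariance matrix is purely hypothetical.
cov = np.array([[ 25.0,  10.0],
                [ 10.0, 900.0]])
diff = a - b
d2 = float(diff @ np.linalg.inv(cov) @ diff)
print(round(sq_euclid, 1), round(euclid, 1), round(d2, 2))
```

Because variable 2 has a much larger (hypothetical) variance, its 74-unit difference contributes far less to D² than it does to the raw Euclidean distance.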

The diverse methods proposed for evaluating and joining distances have produced an extensive nomenclature, which is mentioned here just to give you an idea of the names. The measures of distance include city block metric, genetic distance, and information radius. The strategies16 for joining distances can be called single linkage, nearest neighbor, complete linkage, furthest neighbor, Ward’s information loss, Lance and Williams’ recurrence formula, monothetic or polythetic divisions, and minimizing traces for matrixes showing dispersion of appropriate sums of squared values between and among groups.

28.4.3 Published Examples

Cluster analysis has been medically applied in various attempts to form “typologies” consisting of clustered categories that form subgroups for a particular ailment. Clinicians have formed many such subgroups judgmentally with staging systems for cancer, the insulin-dependent and non-insulin-dependent categories of diabetes mellitus, the various “non-” forms of disease nosology,19 and many other purely clinical subclassifications. With cluster analysis methods, these judgments are replaced by a formal mathematical approach. Cluster analysis has been used to form subgroups for the clinical spectrum of back pain,20 hepatitis,21 chronic affective disorders,22 insulin-dependent diabetes mellitus,23 eating habits of obese patients,24 and the previously illustrated partition of the temporomandibular joint syndrome.17 Other relatively recent medical applications include efforts to categorize the geographic distribution of female cancer patterns in Belgium,25 epidemiologic transitions in cause-specific mortality trends in The Netherlands,26 and the resource utilization of Veterans Administration medical centers.27

In memorable nonmedical applications, cluster analysis was used to help identify an unknown mummy according to its craniofacial morphologic similarity with other Egyptian queens,28 and to achieve a classification of 109 pure malt Scotch whiskies according to features of “nose, colour, body, palate and finish.”29


In one medical application that reported use of cluster analysis, the investigators did not employ the customary arrangement of many internal variables with no dependent variable. After analyzing an array of variables that might be predictors of pre-term delivery, the investigators30 divided each variable into binary categories, and then used the presence or absence of pre-term delivery to group the categories into three clusters of “risk”: younger women, older women who were smokers, and nonsmoking older women. Because the grouping and clustering were done in direct relation to the outcome variable, this study is an example of the targeted multicategorical stratification procedure.

28.4.4 Epidemiologic Clusters

[Figure 28.5 appears here, with its four clusters marked A, B, C, and D.]

FIGURE 28.5

Illustration of a partition that forms four clusters, marked A, B, C, and D. (Figure taken from Chapter Reference 18.)

Beyond these statistical procedures, epidemiologists often use the word cluster in a quite different, nonmathematical way, referring to apparent mini-epidemics of disease. The mini-epidemic becomes noted when the scanners of vital-statistics data find a “spatial cluster,” in which a particular disease seems to have an unexplainably increased occurrence either in calendar time or at a particular geographic location. The usual next step is to search for etiology of the spatial cluster by looking for correlations with suspected hazards such as contaminated water, nuclear power plants, electromagnetic radiation, or “toxic wastes.” Various statistical methods have been and continue to be proposed for analyzing the geographic-temporal patterns of data31 to find (or confirm) these clusters and their alleged causes.

Unfortunately, the investigators seldom consider checking data for the possibility that the “outbreak” is due to increased detection of the disease via increased screening, surveillance, and new diagnostic technologies in the selected spatial region. Another obvious possibility is that the increase is a stochastic variation among the huge plethora of opportunities for “blips” when multitudes of different diseases are recorded for a multitude of geographic regions over a multitude of different points in time. After the needless fears induced by the publicity for diverse “clusters” that turned out to be etiologic “false alarms,” the topic is now being approached with greater caution than formerly. Nevertheless, a media-sensitive and suitably zealous investigator can almost always get publicity for “discoveries” that can warn an already frightened public about yet another “menace” of daily life.32

28.5 Correspondence Analysis

In correspondence analysis, the variables are demarcated into categories (as in cluster analysis), but the operational strategy uses algebraic models, as in principal component analysis. The procedure has been advocated as a powerful method for analyzing multi-categorical contingency tables.33

Although seldom applied in medical activities, the correspondence analysis technique has been used34 “to reduce the dimensions of the raw … (mainly) categorical data” in preparation for a cluster analysis that provided a “four group typology” of nonspecific low back pain, “using both organic and psychiatric symptoms and signs.” In another application,35 where the correspondence technique was combined with a Bayesian strategy, the arrays of symptoms, called indicants, in sets of patients with chest pain and with abdominal pain, were reduced and divided into clusters that were then “validated” for their accuracy in separating “risk groups” for various diagnoses.
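As a rough sketch of the underlying machinery, correspondence analysis is commonly formulated as a singular-value decomposition of the contingency table’s standardized residuals; the 3 × 3 table below (symptom category by diagnosis) is entirely hypothetical.

```python
import numpy as np

def correspondence_coords(table):
    """Principal row and column coordinates from a simple correspondence
    analysis.  Rows and columns that plot near each other co-occur more
    often than independence would predict."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                        # correspondence matrix
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    # Standardized residuals from the independence model r * c
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]      # principal row coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]   # principal column coordinates
    return rows, cols

# Hypothetical contingency table: three symptom categories (rows)
# cross-classified against three diagnoses (columns).
table = [[20,  5,  2],
         [ 4, 18,  6],
         [ 1,  7, 22]]
rows, cols = correspondence_coords(table)
print(rows.shape, cols.shape)
```

Plotting the first two columns of `rows` and `cols` on common axes gives the kind of reduced-dimension display that could then feed a cluster analysis, as in the low-back-pain application cited above.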
