Добавил:
kiopkiopkiop18@yandex.ru t.me/Prokururor I Вовсе не секретарь, но почту проверяю Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Скачиваний:
0
Добавлен:
28.03.2026
Размер:
4.18 Mб
Скачать

THE CORRELATION COEFFICIENT

175

Figure 15.4 (a) A linear association (b) A non-linear association

environmental

work (%)

Proportion exposed to

tobacco smoke at

60

50

40

r =0.80 p<0.0001

30

20

10

0

0

10

20

30

40

50

60

 

Proportion of current smokers (%)

 

Figure 15.5 Scatterplot from a study into the effect of passive smoking on respiratory symptoms. Reprinted courtesy of Elsevier (The Lancet 2001, 358, 2103–9, Fig. 1, p. 2105)

The correlation coefficient

The principal limitation of the scatterplot in assessing association is that it does not provide us with a numeric measure of the strength of the association; for this we have to turn to the correlation coefficient. Two correlation coefficients are widely used: Pearson’s and Spearman’s.

Pearson’s correlation coefficient

Pearson’s product-moment correlation coefficient, denoted ρ (Greek rho), in the population, and r in the sample, measures the strength of the linear association between two variables. Loosely speaking, the correlation coefficient is a measure of the average distance of all of the points from an imaginary straight line drawn through the scatter of points (analogous to the standard deviation measuring the average distance of each value from the mean).

176

CH 15 MEASURING THE ASSOCIATION BETWEEN TWO VARIABLES

Percent Body Fat

UNITED STATES

60

50

40

30

20

10

0

0

10

20

30

40

50

60

bmi

Figure 15.6 Scatterplot of per cent body fat against body mass index from a cross-section study into the relationship between bmi and body fat, in black population samples from Nigeria, Jamaica and the USA. Reproduced from Amer. J. Epid., 145, 620–8, courtesy of Oxford University Press

For Pearson’s correlation coefficient to be appropriately used, both variables must be metric continuous and, if a confidence interval is to be determined, also approximately Normally distributed. The value of Pearson’s correlation coefficient can vary as follows: from 1, indicating a perfect negative association (all the points lie exactly on a straight line); through 0, indicating no association; to +1, indicating perfect positive association (all points exactly on a line). In practice, with real sample data, you will never see values of –1, 0 or +1. Calculation of r by hand is very tedious and prone to error, so we will avoid it here. But it can be done in a flash with a computer statistics program, such as SPSS or Minitab.

Is the correlation coefficient statistically significant in the population?

To assess the statistical significance of a population correlation coefficient and hence decide whether there is a statistically significant association between the two variables, you can either perform a hypothesis test (is the p-value less than 0.05?), or calculate a confidence interval (does it include zero?). For the hypothesis test, the hypotheses are:

H0: ρ = 0

H1: ρ =0

For example, for the data shown in the scatterplot in Figure 15.1, the sample r = 0.49, with a p-value < 0.001. This indicates a statistically significant positive association in the population between incidence rate of Crohn’s disease and ulcerative colitis.

THE CORRELATION COEFFICIENT

177

A useful rule of thumb if you have a value for r but no confidence interval or p-value, is that to be statistically significant, r must be greater than 2/n, where n is the sample size. For example, if n = 100, then r has to be greater than 2/10 = 0.200 to be statistically significant.

An example from practice

Table 15.1 is taken from the same cross-section study as Exercise 15.3, and shows the sample Pearson’s correlation coefficient for the association between bmi and per cent body fat, with blood pressure, and waist and hip measurements, along with an indication of the statistical significance or otherwise of the p-value.

Unfortunately, the authors have not given the actual p-values, but only indicated whether they were less than 0.05 or less than 0.01. This is not good practice; the actual p-values should always be provided. As you can see, the population correlation coefficient between both bmi and per cent body fat, with waist and hip circumference, is positive and statistically significant in every case. However, bmi is more closely associated (higher r values) than body fat, except in Jamaican men. Apart from the association with systolic blood pressure in US males, there is no statistically significant association with either of the blood pressure measurements.

Exercise 15.4 Table 15.2 is from a case-control study of medical record validation (Olson et al. 1997), and shows the value of Pearson’s r, and the 98 per cent confidence intervals, for the correlation between gestational age, as estimated by the mother, and as determined from medical records, for a number of demographic sub-groups (ignore the last column). The cases were the mothers of child leukaemia patients, the matched controls were randomly selected by random telephone calling. Identify: (a) any correlation coefficients not statistically significant; (b) the strongest correlation; (c) the weakest correlation.

Spearman’s rank correlation coefficient

If either (or both) of the variables is ordinal, then Spearman’s rank correlation coefficient (usually denoted ρs in the population and rs in the sample) is appropriate. This is a non-parametric measure. As with Pearson’s correlation coefficient, Spearman’s correlation coefficient varies from –1, through 0, to +1, and its statistical significance can again be assessed with a p-value or a confidence interval. The null hypothesis is that the population correlation coefficient ρs = 0. Spearman’s rs is not quite as bad to calculate by hand as Pearson’s r but bad enough, and once again you would want to do it with the help of a computer program.

An example from practice

Table 15.3 is from the same cross-section study first referred to in Figure 4.3, into the use of the Ontario mammography services. The authors wanted to know whether the variation in

Table 15.1 Correlation coefficients from a cross-section study into the relationship between body mass index (bmi) and body fat, in black population samples from Nigeria, Jamaica, and the USA. The aim of the study was to investigate whether body fat rather than bmi could be used as a measure of obesity.,§ Reproduced from Amer. J. Epid., 145, 620–8, courtesy of Oxford University Press

 

 

 

 

Women

 

 

 

 

 

 

 

Men

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Nigeria

 

Jamaica

 

United States

 

Nigeria

 

Jamaica

 

United States

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Variable

BMI

% fat

 

BMI

% fat

 

BMI

% fat

 

BMI

% fat

 

BMI

% fat

 

BMI

% fat

 

 

 

 

 

 

 

 

 

 

 

 

 

Waist circumference

0.90**

0.77**

0.87**

0.77**

0.91**

0.85**

0.89**

0.79**

0.69**

0.76**

0.93**

0.83**

Hip circumference

0.93**

0.81**

0.91**

0.82**

0.93**

0.87**

0.89**

0.76**

0.64**

0.72**

0.93**

0.82**

Systolic blood pressure

0.24

0.24

0.16

0.15

0.21

0.21

0.09

0.09

0.24

0.24

0.24*

0.23*

 

Diastolic blood pressure

0.16

0.14

0.20

0.16

0.07

0.10

0.31

0.24

0.16

0.11

0.22

0.20

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

*p < 0.05; ** p < 0.01.

Weight (kg)/height (m)2. ‡Data were adjusted for age.

§No significant difference was found between correlation coefficients for body mass index and percentage of body fat.

THE CORRELATION COEFFICIENT

179

the ranked utilisation rates (number of visits per 1000 women) was similar across the age groups. They did this by measuring the strength of the association between the ranked rates for each pair of different age groups. When the association was strong and significant, they concluded that the variation in the usage rate was similar.

The results show that the rs for the association between the ranked usage rates for 30–39 year-olds, and the 40–49 year-olds, across the 33 districts was 0.6496 (first row of table), with a p-value of 0.0005. So this association is positive and statistically significant in these two age group populations. Indeed, the correlation coefficients between all pairs of age groups are statistically significant, with all p-values < 0.05. The authors thus concluded that variation in

Table 15.2 Pearson’s r and 98 per cent confidence intervals for the association between gestational age, as estimated by the mother and from medical records, for a number of demographic sub-groups. Reproduced from Amer. J. Epid., 145, 58–67, courtesy of Oxford University Press

 

Correlation of

 

Kappa

 

gestational age

98% CI*

statistic

 

 

 

 

All gestational ages

0.839

0.817–0.859

0.62

Case/control status

 

 

 

Cases

0.849

0.813–0.878

0.63

Controls

0.835

0.805–0.861

0.61

Education

 

 

 

<High school

0.694

0.553–0.797

0.51

High school

0.833

0.790–0.868

0.63

>High school

0.835

0.804–0.861

0.62

Household income

 

 

 

<$22,000

0.791

0.734–0.837

0.59

$22,000–$ 34,999

0.882

0.849–0.908

0.62

$35,000

0.843

0.800–0.877

0.65

Unknown

0.745

0.641–0.823

0.60

Time (years) from delivery to interview

 

 

 

<2

0.896

0.862–0.921

0.64

2–3.9

0.821

0.784–0.852

0.63

4–5.9

0.828

0.775–0.869

0.61

6–8

0.852

0.734–0.920

0.42

Maternal age (years)

 

 

 

<25

0.822

0.773–0.861

0.64

25–29

0.889

0.862–0.912

0.63

30–34

0.760

0.694–0.813

0.57

35

0.888

0.824–0.930

0.64

Birth order

 

 

 

First born

0.880

0.853–0.903

0.67

Second born

0.815

0.778–0.846

0.57

Third born

0.632

0.416–0.781

0.52

Maternal race

 

 

 

White

0.846

0.822–0.866

0.64

Other

0.782

0.680–0.855

0.42

 

 

 

 

*CI, confidence Interval.

Three categories, <38, 38–41, 42 weeks.

180

CH 15 MEASURING THE ASSOCIATION BETWEEN TWO VARIABLES

Table 15.3 Spearman correlation coefficients from a cross-section study of the use of the Ontario mammography services in relation to age. Each correlation coefficient measures the strength of the association in the variation between the ranked usage rate across the 33 heath districts for each pair of age groups. Reproduced from J. Epid. Comm. Health, 51, 378–82, courtesy of BMJ Publishing Group

Age group (y)

30–39y

40–49y

50–69y

70+y

 

30–39

1.0000 0.6496 (p < 0.0001)

0.5949 (p = 0.0005)

0.5488 (p = 0.0014)

40–49

 

1.0000

0.9021 (p < 0.0001)

0.8985 (p < 0.0001)

50–69

 

 

1.0000

0.9513 (p < 0.0001)

70+

 

 

 

1.0000

 

usage rate was similar for the four age groups across the 33 health districts. However, whether association is the correct way to measure similarity in two sets of values is a question I will return to in the next chapter.

Two other correlation coefficients can only be mentioned briefly. Kendal’s rank-order correlation coefficient, denoted τ (tau), is appropriate in the same circumstances as Spearman’s rs , i.e. with ranked data (which may be ordinal or continuous). Tau is available in SPSS, but not in Minitab. The point-biserial correlation coefficient is appropriate if one variable is metric continuous and the other is truly dichotomous (which means that the variable can take only two values; alive or dead, male or female, etc.). Unfortunately, this latter measure of association is not available in either SPSS or Minitab.

If you plan to use a correlation coefficient you should ensure that the assumptions referred to above are satisfied, in particular that the association is linear - which can be checked by a scatterplot. Moreover, with Pearson’s correlation coefficient you should interpret any results with suspicion if there are outliers present in either data set, since these can distort the results.

Finally it is worth noting again that just because two variables are significantly associated, does not mean that there is a cause–effect relationship between them.