- •Contents
- •Preface to the 2nd Edition
- •Preface to the 1st Edition
- •Introduction
- •Learning Objectives
- •Variables and Data
- •The good, the Bad, and the Ugly – Types of Variable
- •Categorical Variables
- •Metric Variables
- •How can I Tell what Type of Variable I am Dealing with?
- •2 Describing Data with Tables
- •Learning Objectives
- •What is Descriptive Statistics?
- •The Frequency Table
- •3 Describing Data with Charts
- •Learning Objectives
- •Picture it!
- •Charting Nominal and Ordinal Data
- •Charting Discrete Metric Data
- •Charting Continuous Metric Data
- •Charting Cumulative Data
- •4 Describing Data from its Shape
- •Learning Objectives
- •The Shape of Things to Come
- •5 Describing Data with Numeric Summary Values
- •Learning Objectives
- •Numbers R us
- •Summary Measures of Location
- •Summary Measures of Spread
- •Standard Deviation and the Normal Distribution
- •Learning Objectives
- •Hey ho! Hey ho! It’s Off to Work we Go
- •Collecting the Data – Types of Sample
- •Types of Study
- •Confounding
- •Matching
- •Comparing Cohort and Case-Control Designs
- •Getting Stuck in – Experimental Studies
- •7 From Samples to Populations – Making Inferences
- •Learning Objectives
- •Statistical Inference
- •8 Probability, Risk and Odds
- •Learning Objectives
- •Calculating Probability
- •Probability and the Normal Distribution
- •Risk
- •Odds
- •Why you can’t Calculate Risk in a Case-Control Study
- •The Link between Probability and Odds
- •The Risk Ratio
- •The Odds Ratio
- •Number Needed to Treat (NNT)
- •Learning Objectives
- •Estimating a Confidence Interval for the Median of a Single Population
- •10 Estimating the Difference between Two Population Parameters
- •Learning Objectives
- •What’s the Difference?
- •Estimating the Difference between the Means of Two Independent Populations – Using a Method Based on the Two-Sample t Test
- •Estimating the Difference between Two Matched Population Means – Using a Method Based on the Matched-Pairs t Test
- •Estimating the Difference between Two Independent Population Proportions
- •Estimating the Difference between Two Independent Population Medians – The Mann–Whitney Rank-Sums Method
- •Estimating the Difference between Two Matched Population Medians – Wilcoxon Signed-Ranks Method
- •11 Estimating the Ratio of Two Population Parameters
- •Learning Objectives
- •12 Testing Hypotheses about the Difference between Two Population Parameters
- •Learning Objectives
- •The Research Question and the Hypothesis Test
- •A Brief Summary of a Few of the Commonest Tests
- •Some Examples of Hypothesis Tests from Practice
- •Confidence Intervals Versus Hypothesis Testing
- •Nobody’s Perfect – Types of Error
- •The Power of a Test
- •Maximising Power – Calculating Sample Size
- •Rules of Thumb
- •13 Testing Hypotheses About the Ratio of Two Population Parameters
- •Learning Objectives
- •Testing the Risk Ratio
- •Testing the Odds Ratio
- •Learning Objectives
- •15 Measuring the Association between Two Variables
- •Learning Objectives
- •Association
- •The Correlation Coefficient
- •16 Measuring Agreement
- •Learning Objectives
- •To Agree or not Agree: That is the Question
- •Cohen’s Kappa
- •Measuring Agreement with Ordinal Data – Weighted Kappa
- •Measuring the Agreement between Two Metric Continuous Variables
- •17 Straight Line Models: Linear Regression
- •Learning Objectives
- •Health Warning!
- •Relationship and Association
- •The Linear Regression Model
- •Model Building and Variable Selection
- •18 Curvy Models: Logistic Regression
- •Learning Objectives
- •A Second Health Warning!
- •Binary Dependent Variables
- •The Logistic Regression Model
- •19 Measuring Survival
- •Learning Objectives
- •Introduction
- •Calculating Survival Probabilities and the Proportion Surviving: the Kaplan-Meier Table
- •The Kaplan-Meier Chart
- •Determining Median Survival Time
- •Comparing Survival with Two Groups
- •20 Systematic Review and Meta-Analysis
- •Learning Objectives
- •Introduction
- •Systematic Review
- •Publication and other Biases
- •The Funnel Plot
- •Combining the Studies
- •Solutions to Exercises
- •References
- •Index
THE CORRELATION COEFFICIENT |
175 |
Figure 15.4 (a) A linear association (b) A non-linear association
environmental |
work (%) |
Proportion exposed to |
tobacco smoke at |
60
50
40 |
r =0.80 p<0.0001 |
30
20
10
0
0 |
10 |
20 |
30 |
40 |
50 |
60 |
|
Proportion of current smokers (%) |
|
||||
Figure 15.5 Scatterplot from a study into the effect of passive smoking on respiratory symptoms. Reprinted courtesy of Elsevier (The Lancet 2001, 358, 2103–9, Fig. 1, p. 2105)
The correlation coefficient
The principal limitation of the scatterplot in assessing association is that it does not provide us with a numeric measure of the strength of the association; for this we have to turn to the correlation coefficient. Two correlation coefficients are widely used: Pearson’s and Spearman’s.
Pearson’s correlation coefficient
Pearson’s product-moment correlation coefficient, denoted ρ (Greek rho), in the population, and r in the sample, measures the strength of the linear association between two variables. Loosely speaking, the correlation coefficient is a measure of the average distance of all of the points from an imaginary straight line drawn through the scatter of points (analogous to the standard deviation measuring the average distance of each value from the mean).
176 |
CH 15 MEASURING THE ASSOCIATION BETWEEN TWO VARIABLES |
Percent Body Fat
UNITED STATES
60
50
40
30
20
10
0
0 |
10 |
20 |
30 |
40 |
50 |
60 |
bmi
Figure 15.6 Scatterplot of per cent body fat against body mass index from a cross-section study into the relationship between bmi and body fat, in black population samples from Nigeria, Jamaica and the USA. Reproduced from Amer. J. Epid., 145, 620–8, courtesy of Oxford University Press
For Pearson’s correlation coefficient to be appropriately used, both variables must be metric continuous and, if a confidence interval is to be determined, also approximately Normally distributed. The value of Pearson’s correlation coefficient can vary as follows: from −1, indicating a perfect negative association (all the points lie exactly on a straight line); through 0, indicating no association; to +1, indicating perfect positive association (all points exactly on a line). In practice, with real sample data, you will never see values of –1, 0 or +1. Calculation of r by hand is very tedious and prone to error, so we will avoid it here. But it can be done in a flash with a computer statistics program, such as SPSS or Minitab.
Is the correlation coefficient statistically significant in the population?
To assess the statistical significance of a population correlation coefficient and hence decide whether there is a statistically significant association between the two variables, you can either perform a hypothesis test (is the p-value less than 0.05?), or calculate a confidence interval (does it include zero?). For the hypothesis test, the hypotheses are:
H0: ρ = 0
H1: ρ =0
For example, for the data shown in the scatterplot in Figure 15.1, the sample r = 0.49, with a p-value < 0.001. This indicates a statistically significant positive association in the population between incidence rate of Crohn’s disease and ulcerative colitis.
THE CORRELATION COEFFICIENT |
177 |
A useful rule of thumb if you have a value for r but no confidence interval or p-value, is that to be statistically significant, r must be greater than 2/√n, where n is the sample size. For example, if n = 100, then r has to be greater than 2/10 = 0.200 to be statistically significant.
An example from practice
Table 15.1 is taken from the same cross-section study as Exercise 15.3, and shows the sample Pearson’s correlation coefficient for the association between bmi and per cent body fat, with blood pressure, and waist and hip measurements, along with an indication of the statistical significance or otherwise of the p-value.
Unfortunately, the authors have not given the actual p-values, but only indicated whether they were less than 0.05 or less than 0.01. This is not good practice; the actual p-values should always be provided. As you can see, the population correlation coefficient between both bmi and per cent body fat, with waist and hip circumference, is positive and statistically significant in every case. However, bmi is more closely associated (higher r values) than body fat, except in Jamaican men. Apart from the association with systolic blood pressure in US males, there is no statistically significant association with either of the blood pressure measurements.
Exercise 15.4 Table 15.2 is from a case-control study of medical record validation (Olson et al. 1997), and shows the value of Pearson’s r, and the 98 per cent confidence intervals, for the correlation between gestational age, as estimated by the mother, and as determined from medical records, for a number of demographic sub-groups (ignore the last column). The cases were the mothers of child leukaemia patients, the matched controls were randomly selected by random telephone calling. Identify: (a) any correlation coefficients not statistically significant; (b) the strongest correlation; (c) the weakest correlation.
Spearman’s rank correlation coefficient
If either (or both) of the variables is ordinal, then Spearman’s rank correlation coefficient (usually denoted ρs in the population and rs in the sample) is appropriate. This is a non-parametric measure. As with Pearson’s correlation coefficient, Spearman’s correlation coefficient varies from –1, through 0, to +1, and its statistical significance can again be assessed with a p-value or a confidence interval. The null hypothesis is that the population correlation coefficient ρs = 0. Spearman’s rs is not quite as bad to calculate by hand as Pearson’s r but bad enough, and once again you would want to do it with the help of a computer program.
An example from practice
Table 15.3 is from the same cross-section study first referred to in Figure 4.3, into the use of the Ontario mammography services. The authors wanted to know whether the variation in
Table 15.1 Correlation coefficients from a cross-section study into the relationship between body mass index (bmi) and body fat, in black population samples from Nigeria, Jamaica, and the USA. The aim of the study was to investigate whether body fat rather than bmi could be used as a measure of obesity.‡,§ Reproduced from Amer. J. Epid., 145, 620–8, courtesy of Oxford University Press
|
|
|
|
Women |
|
|
|
|
|
|
|
Men |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||
|
Nigeria |
|
Jamaica |
|
United States |
|
Nigeria |
|
Jamaica |
|
United States |
|||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Variable |
BMI |
% fat |
|
BMI |
% fat |
|
BMI |
% fat |
|
BMI |
% fat |
|
BMI |
% fat |
|
BMI |
% fat |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||
Waist circumference |
0.90** |
0.77** |
0.87** |
0.77** |
0.91** |
0.85** |
0.89** |
0.79** |
0.69** |
0.76** |
0.93** |
0.83** |
||||||
Hip circumference |
0.93** |
0.81** |
0.91** |
0.82** |
0.93** |
0.87** |
0.89** |
0.76** |
0.64** |
0.72** |
0.93** |
0.82** |
||||||
Systolic blood pressure |
0.24 |
0.24 |
0.16 |
0.15 |
0.21 |
0.21 |
0.09 |
0.09 |
0.24 |
0.24 |
0.24* |
0.23* |
|
|||||
Diastolic blood pressure |
0.16 |
0.14 |
0.20 |
0.16 |
0.07 |
0.10 |
0.31 |
0.24 |
0.16 |
0.11 |
0.22 |
0.20 |
|
|||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
*p < 0.05; ** p < 0.01.
† Weight (kg)/height (m)2. ‡Data were adjusted for age.
§No significant difference was found between correlation coefficients for body mass index and percentage of body fat.
THE CORRELATION COEFFICIENT |
179 |
the ranked utilisation rates (number of visits per 1000 women) was similar across the age groups. They did this by measuring the strength of the association between the ranked rates for each pair of different age groups. When the association was strong and significant, they concluded that the variation in the usage rate was similar.
The results show that the rs for the association between the ranked usage rates for 30–39 year-olds, and the 40–49 year-olds, across the 33 districts was 0.6496 (first row of table), with a p-value of 0.0005. So this association is positive and statistically significant in these two age group populations. Indeed, the correlation coefficients between all pairs of age groups are statistically significant, with all p-values < 0.05. The authors thus concluded that variation in
Table 15.2 Pearson’s r and 98 per cent confidence intervals for the association between gestational age, as estimated by the mother and from medical records, for a number of demographic sub-groups. Reproduced from Amer. J. Epid., 145, 58–67, courtesy of Oxford University Press
|
Correlation of |
|
Kappa |
|
gestational age |
98% CI* |
statistic† |
|
|
|
|
All gestational ages |
0.839 |
0.817–0.859 |
0.62 |
Case/control status |
|
|
|
Cases |
0.849 |
0.813–0.878 |
0.63 |
Controls |
0.835 |
0.805–0.861 |
0.61 |
Education |
|
|
|
<High school |
0.694 |
0.553–0.797 |
0.51 |
High school |
0.833 |
0.790–0.868 |
0.63 |
>High school |
0.835 |
0.804–0.861 |
0.62 |
Household income |
|
|
|
<$22,000 |
0.791 |
0.734–0.837 |
0.59 |
$22,000–$ 34,999 |
0.882 |
0.849–0.908 |
0.62 |
≥$35,000 |
0.843 |
0.800–0.877 |
0.65 |
Unknown |
0.745 |
0.641–0.823 |
0.60 |
Time (years) from delivery to interview |
|
|
|
<2 |
0.896 |
0.862–0.921 |
0.64 |
2–3.9 |
0.821 |
0.784–0.852 |
0.63 |
4–5.9 |
0.828 |
0.775–0.869 |
0.61 |
6–8 |
0.852 |
0.734–0.920 |
0.42 |
Maternal age (years) |
|
|
|
<25 |
0.822 |
0.773–0.861 |
0.64 |
25–29 |
0.889 |
0.862–0.912 |
0.63 |
30–34 |
0.760 |
0.694–0.813 |
0.57 |
≥35 |
0.888 |
0.824–0.930 |
0.64 |
Birth order |
|
|
|
First born |
0.880 |
0.853–0.903 |
0.67 |
Second born |
0.815 |
0.778–0.846 |
0.57 |
≥Third born |
0.632 |
0.416–0.781 |
0.52 |
Maternal race |
|
|
|
White |
0.846 |
0.822–0.866 |
0.64 |
Other |
0.782 |
0.680–0.855 |
0.42 |
|
|
|
|
*CI, confidence Interval.
† Three categories, <38, 38–41, ≥42 weeks.
180 |
CH 15 MEASURING THE ASSOCIATION BETWEEN TWO VARIABLES |
Table 15.3 Spearman correlation coefficients from a cross-section study of the use of the Ontario mammography services in relation to age. Each correlation coefficient measures the strength of the association in the variation between the ranked usage rate across the 33 heath districts for each pair of age groups. Reproduced from J. Epid. Comm. Health, 51, 378–82, courtesy of BMJ Publishing Group
Age group (y) |
30–39y |
40–49y |
50–69y |
70+y |
|
30–39 |
1.0000 0.6496 (p < 0.0001) |
0.5949 (p = 0.0005) |
0.5488 (p = 0.0014) |
||
40–49 |
|
1.0000 |
0.9021 (p < 0.0001) |
0.8985 (p < 0.0001) |
|
50–69 |
|
|
1.0000 |
0.9513 (p < 0.0001) |
|
70+ |
|
|
|
1.0000 |
|
usage rate was similar for the four age groups across the 33 health districts. However, whether association is the correct way to measure similarity in two sets of values is a question I will return to in the next chapter.
Two other correlation coefficients can only be mentioned briefly. Kendal’s rank-order correlation coefficient, denoted τ (tau), is appropriate in the same circumstances as Spearman’s rs , i.e. with ranked data (which may be ordinal or continuous). Tau is available in SPSS, but not in Minitab. The point-biserial correlation coefficient is appropriate if one variable is metric continuous and the other is truly dichotomous (which means that the variable can take only two values; alive or dead, male or female, etc.). Unfortunately, this latter measure of association is not available in either SPSS or Minitab.
If you plan to use a correlation coefficient you should ensure that the assumptions referred to above are satisfied, in particular that the association is linear - which can be checked by a scatterplot. Moreover, with Pearson’s correlation coefficient you should interpret any results with suspicion if there are outliers present in either data set, since these can distort the results.
Finally it is worth noting again that just because two variables are significantly associated, does not mean that there is a cause–effect relationship between them.
