15 Measuring the association between two variables
Learning objectives
When you have finished this chapter you should be able to:
Explain the meaning of association.
Draw and interpret a scatterplot, and from it assess the linearity, direction and strength of an association.
Distinguish between negative and positive association.
Explain what a correlation coefficient is.
Describe Pearson’s correlation coefficient r, its distributional requirements, and interpret a given value of r.
Describe Spearman’s correlation coefficient rs and interpret a given value of rs.
Describe the circumstances under which Pearson’s r or Spearman’s rs is appropriate.
Association
When we say that two ordinal or metric variables are associated, we mean that they behave in a way that makes them appear ‘connected’: changes in either variable seem to coincide with changes in the other variable. It’s important to note (at this point anyway) that we are not suggesting that change in either variable is causing the change in the other variable, simply that they exhibit this commonality. As you will see, association, if it exists, may be positive (low values of one variable coincide with low values of the other variable, and high values with high values) or negative (low values with high values and vice versa).
Medical Statistics from Scratch, Second Edition David Bowers
© 2008 John Wiley & Sons, Ltd
In this chapter, I want to discuss two alternative methods of detecting an association. The first method relies on a plot of the sample data, called a scatterplot, in which values of one variable are plotted on the vertical axis and values of the other on the horizontal axis. The second approach is numeric, making both comparison and inference possible.
The scatterplot
A scatterplot will enable you to see if there is an association between the variables, and if there is, its strength and direction. But the scatterplot will only provide a qualitative assessment, and thus has obvious limitations. First, it’s not always easy to say which of two sample scatterplots indicates the stronger association; and second, it doesn’t allow us to make inferences about possible associations in the population.
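To make the idea concrete, here is a minimal sketch, in Python using only the standard library, of how paired values might be turned into a rough text scatterplot. The function name ascii_scatter, the grid size and the eight data pairs are all invented for illustration; they are not taken from any study in this book, and in practice you would normally use a proper charting package.

```python
def ascii_scatter(xs, ys, width=30, height=10):
    """Return a text scatterplot: xs on the horizontal axis, ys on the vertical."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Avoid division by zero when every value on an axis is identical.
    x_span = (x_max - x_min) or 1
    y_span = (y_max - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    lines = ["|" + "".join(cells) for cells in grid]
    lines.append("+" + "-" * width)
    return "\n".join(lines)


# Hypothetical paired measurements (illustrative only, not from any study).
x_values = [1, 2, 3, 4, 5, 6, 7, 8]
y_values = [2, 3, 3, 5, 6, 6, 8, 9]
plot = ascii_scatter(x_values, y_values)
print(plot)
```

With these made-up values the stars run from the bottom left to the top right of the grid, which is the pattern you would expect from a positive association.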
An example from practice
As part of a study of the possible association between Crohn’s disease (CD) and ulcerative colitis (UC), researchers in Canada (Blanchard et al. 2001) produced the scatterplot shown in Figure 15.1. It doesn’t matter which variable is plotted on which axis for the scatterplot itself, but in the study of causal relationships between variables (which I will discuss in Chapter 17), the choice of axis becomes more important.
Looking at the scatterplot it’s not difficult to see that something is going on here. The scatter is not just a random cloud of points, but appears to display a pattern – low CD levels seem to be associated with low UC levels, and higher CD levels with high UC levels. You could justly claim that the two variables appear to be positively associated.
As a second example, Figure 15.2 shows a scatterplot taken from a study into the possible relationship between percentage mortality from aortic aneurysm, and the number of aortic aneurysm episodes dealt with per year, in each of 22 hospitals (McKee and Hunter 1995). This scatterplot displays a negative association between the two variables: low values for number of episodes seem to be associated with high values for percentage mortality, and vice versa.
As a final example from practice, Figure 15.3 shows a scatterplot taken from the cross-section study into the possible contribution of calcium channel blockers (prescribed for depression), to the suicide rate in 284 Swedish municipalities (Lindberg et al. 1998), first referred to in Figure 3.10. The scatterplot here is much fuzzier than the two previous plots, and it would be hard to claim, merely from eyeballing it, that there is any notable association between the two variables (although admittedly there is some evidence of a rather weak positive association).
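The direction of an association, as described in the examples above, can also be checked informally by counting. The sketch below counts how many points have both values on the same side of their sample means (low with low, or high with high) and how many have them on opposite sides. It is only an illustration of the definition of positive and negative association, not a method used in the studies cited; the function name direction_of_association and both data sets are hypothetical.

```python
from statistics import mean


def direction_of_association(xs, ys):
    """Count points on the same side vs opposite sides of the two sample means."""
    x_bar, y_bar = mean(xs), mean(ys)
    same = opposite = 0
    for x, y in zip(xs, ys):
        product = (x - x_bar) * (y - y_bar)
        if product > 0:
            same += 1      # low with low, or high with high
        elif product < 0:
            opposite += 1  # low with high, or high with low
        # points lying exactly on a mean are counted in neither group
    if same > opposite:
        label = "positive"
    elif opposite > same:
        label = "negative"
    else:
        label = "no clear direction"
    return same, opposite, label


# Invented values, loosely in the spirit of an episodes-versus-mortality plot.
episodes = [5, 10, 20, 30, 45, 60]
mortality = [70, 65, 50, 45, 30, 20]
negative_result = direction_of_association(episodes, mortality)

# A second invented data set in which high values tend to go with high values.
a = [2, 4, 6, 8, 10]
b = [3, 5, 4, 9, 12]
positive_result = direction_of_association(a, b)

print(negative_result, positive_result)
```

A count like this says something about direction but very little about strength, which is one reason a numeric measure such as a correlation coefficient is needed.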
Figure 15.1 Scatterplot of the age-standardised incidence rates of Crohn’s disease (CD) and ulcerative colitis (UC) by Manitoba postal area, Canada, 1987–1996. Horizontal axis: CD incidence rate per 100,000; vertical axis: UC incidence rate per 100,000; the plot is annotated r = 0.49, p < 0.001. The scatterplot suggests a positive association between the two variables. Reproduced from American Jnl of Epidemiology 2001, 154: 328–33, Fig. 3 p. 331, by permission of OUP
Figure 15.2 A scatterplot of percentage mortality from aortic aneurysm, and number of aortic aneurysm episodes dealt with per year, in 22 hospitals. Horizontal axis: episodes/year; vertical axis: % mortality. The plot suggests a negative association between the two variables. Reproduced from Quality in Health Care, 4, 5–12, courtesy of BMJ Publishing Group
When you set out to investigate a possible association between two variables, a scatterplot is almost always worthwhile, and will often produce an insight into the way the two variables co-behave. In particular, it may reveal whether an association between them is linear. The property of linearity is important in some branches of statistics and we’ll meet it again ourselves in Chapter 17. Put simply, a linear association is one in which the points in the scatterplot seem to cluster around a straight line. The two scatterplots in Figure 15.4 illustrate the difference between a linear and a non-linear association. The scatter in Figure 15.4a seems to be linear, but in Figure 15.4b it shows some curviness.
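A short numerical sketch shows why it is worth checking linearity by eye. It computes Pearson’s correlation coefficient r, using the usual product-moment formula, for two invented data sets: one lying exactly on a straight line and one lying on a symmetric curve. The function name pearson_r and the data are assumptions made purely for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: the same x values paired with a straight line and with a curve.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]  # points lie exactly on a straight line
y_curved = [v ** 2 for v in x]     # points lie on a symmetric U-shaped curve

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
print(r_linear, r_curved)
```

Here r is 1 for the straight-line data but 0 for the curved data, even though in both cases y is completely determined by x. Pearson’s r picks up only linear association, so a scatterplot remains the quickest way to spot a curved pattern that the coefficient would miss.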
Figure 15.3 A scatterplot taken from a cross-section study into the possible contribution of calcium channel blockers (prescribed for depression) to the suicide rate, in 284 Swedish municipalities. Horizontal axis: use of calcium channel blockers (defined daily doses/1000 inhabitants/year); vertical axis: number of suicides per 10 000 inhabitants/year. The plot suggests a weak, if any, relationship between the variables. Reproduced courtesy of BMJ Publishing Group
Exercise 15.1 Draw a scatterplot of Apgar score against birthweight for the 30 maternity-unit-born infants using the data in Table 2.5, and comment on what it shows about any association between the two variables.
Exercise 15.2 The scatterplot in Figure 15.5 is from a study into the effect of passive smoking on respiratory symptoms (Janson et al. 2001). In addition, the ‘best’ straight line has been drawn through the points.¹ Comment on what the scatterplot suggests about the nature and strength of any association between the two variables.
Exercise 15.3 The scatterplot of percentage body fat against body mass index (bmi) in Figure 15.6 is from a cross-section study into the relationship between body mass index and body fat, in black populations in Nigeria, Jamaica and the USA (Luke et al. 1997). The aim of the study was to investigate whether per cent body fat rather than bmi could be used as a measure of obesity. What does the scatterplot tell you about the nature and strength of any association between these two variables?
¹ I’ll have more to say about what constitutes the best straight line in Chapter 17, but loosely speaking, it’s the line which passes as close as possible to all the points.
