146 CH 12 TESTING HYPOTHESES ABOUT THE DIFFERENCE BETWEEN TWO POPULATION PARAMETERS
Where, μM = population mean birthweight of maternity-unit-born infants, and μH = the population mean birthweight of home-born infants.4
With SPSS
Look back at Figure 10.1, which shows the output from SPSS. In addition to the 95 per cent confidence interval, it gives the result of the two-sample t test of the equality of the two population mean birthweights. The test results are given in columns five, six and seven. The column headed ‘Sig. (2-tailed)’ gives the p-value of 0.407. Since this is not less than 0.05, you cannot reject the null hypothesis. You thus conclude that there is no statistically significant difference between the two population mean birthweights.
With Minitab
The Minitab output in Figure 10.2 gives the same p-value as SPSS (0.407), confirming that the two population means are not significantly different.
Some examples of hypothesis tests from practice
Two independent means – the two-sample t test
Table 12.2 shows the baseline characteristics of two independent groups in a randomised controlled trial to compare conventional blood pressure measurement (CBP) and ambulatory blood pressure measurement (ABP) in the treatment of hypertension (Staessen et al. 1997). p-values for the differences in the basic characteristics of the two groups are shown in the last column.
The authors used a variety of tests to assess the difference between several parameters for these independent groups (although these are referred to in the text, this information should have been available somewhere in the table itself). To assess the difference in population mean age, and mean body mass index, they used a two-sample t test. For age, the p-value is 0.03, so you can reject the null hypothesis of equal mean ages and conclude that the difference is statistically significant. The p-value for the difference in mean body mass index is 0.39, so you cannot reject the null hypothesis; there is no statistically significant difference in mean body mass index between the two populations.
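The arithmetic behind a two-sample t test can be sketched from the summary values printed in Table 12.2. This is only an illustration under stated assumptions: it assumes a pooled-variance t test (the source does not say which variant the authors used), and it approximates the t distribution with the Normal distribution, which is reasonable here only because the samples are large. The function name is hypothetical. Because the published means and standard deviations are rounded, the recomputed p-values need not match the table exactly, although the decisions at the 0.05 level agree.

```python
import math
from statistics import NormalDist


def two_sample_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance two-sample t statistic from summary values.

    Returns (t, df, approx_p). The two-sided p-value uses a Normal
    approximation to the t distribution (an assumption that is only
    reasonable when the degrees of freedom are large).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se
    approx_p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, approx_p


# Rounded summary values as printed in Table 12.2 (CBP group first, then ABP group).
t_age, df_age, p_age = two_sample_t_from_summary(51.3, 11.9, 206, 53.8, 10.8, 213)
t_bmi, df_bmi, p_bmi = two_sample_t_from_summary(28.5, 4.8, 206, 28.2, 4.4, 213)

print(f"age: t = {t_age:.2f}, df = {df_age}, approximate p = {p_age:.3f}")
print(f"BMI: t = {t_bmi:.2f}, df = {df_bmi}, approximate p = {p_bmi:.3f}")
```

With these inputs the age difference falls below the 0.05 threshold and the body mass index difference does not, which is the same pair of decisions reached from the published p-values.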
Exercise 12.2 Comment on what the results in Table 12.2 indicate about the difference between the two populations in terms of their mean serum creatinine and serum total cholesterol levels.
Exercise 12.3 Refer back to Table 1.6, showing the basic characteristics of women in the breast cancer and stressful life events case-control study. Comment on what the p-values tell you about the equality or otherwise, between cases and controls, of the means of the seven metric variables (shown with an * – see table footnote).
4 Note that differences in independent percentages can also be tested with the two-sample t test.
Table 12.2 Baseline characteristics of two independent groups, from a randomised controlled trial to compare conventional blood pressure measurement (CBP) and ambulatory blood pressure measurement (ABP) in the treatment of hypertension. Reproduced from JAMA, 278, 1065–72, courtesy of the American Medical Association.

| Characteristics | CBP Group (n = 206) | ABP Group (n = 213) | P |
|---|---|---|---|
| Age, mean (SD), y | 51.3 (11.9) | 53.8 (10.8) | .03 |
| Body mass index, mean (SD), kg/m² | 28.5 (4.8) | 28.2 (4.4) | .39 |
| Women, No. (%) | 102 (49.5) | 124 (58.2) | .07 |
| Receiving oral contraceptives, No. (%) | 14 (13.7) | 10 (8.1) | .17 |
| Receiving hormonal substitution, No. (%) | 19 (18.6) | 19 (15.3) | .51 |
| Previous antihypertensive treatment, No. (%)† | 134 (65.0) | 139 (65.3) | .95 |
| Diuretics, No. (%) | 47 (35.1) | 59 (42.4) | .26 |
| β-Blockers, No. (%) | 65 (48.5) | 80 (57.6) | .17 |
| Calcium channel blockers, No. (%) | 45 (33.6) | 38 (27.3) | .32 |
| Angiotensin-converting enzyme inhibitors, No. (%) | 50 (37.3) | 48 (34.5) | .72 |
| Multiple-drug treatment, No. (%) | 62 (46.3) | 65 (46.8) | .97 |
| Smokers, No. (%) | 42 (20.5) | 35 (16.4) | .29 |
| Alcohol use, No. (%) | 115 (55.8) | 102 (47.9) | .10 |
| Serum creatinine, mean (SD), μmol/L‡ | 85.75 (15.91) | 88.4 (16.80) | .25 |
| Serum total cholesterol, mean (SD), mmol/L‡ | 6.00 (1.03) | 6.10 (1.19) | .32 |

Percentages and values of P computed considering only women receiving antihypertensive drug treatment before their enrollment.
†Defined as antihypertensive drug treatment within 6 months before the screening visit.
‡Divide creatinine by 88.4 and cholesterol by 0.02586 to convert to milligrams per deciliter.
Two matched means – the matched-pairs t test
Table 10.3 provides an example from practice, and shows the p-values for the differences in population mean bone mineral densities between two individually matched groups of depressed and normal women (which we have already discussed in confidence interval terms). As you can see, only at the radius is there no statistically significant difference between the population mean bone mineral densities, indicated by a p-value of 0.25. All the other p-values are less than 0.05. Notice that this confirms the confidence interval results.5
Two independent medians – the Mann-Whitney test
With two independent groups, and when the data is ordinal or skewed metric, the median is the preferred measure of location. In these circumstances, the Mann-Whitney test can be used to test the null hypothesis that the two population medians are the same.
5 Note that differences in matched percentages can also be tested with the matched-pairs t test.
Recall that in Chapter 10, I introduced the Mann-Whitney procedure to calculate confidence intervals for the difference between two independent population median treatment times. These were from a study of the use of ketorolac versus morphine to treat limb injury pain. Table 10.4 contains both 95 per cent confidence intervals and p-values from this study. Only one confidence interval does not include zero, that for the time between receiving analgesia and leaving A&E (4.0 to 39.0). This outcome has a p-value of 0.02, less than 0.05, which confirms that the difference between the two population median times is statistically significant.
However, there is a problem with the time for preparation of the analgesia. Table 10.4 shows this has a 95 per cent confidence interval of (0 to 5.0), which includes zero, implying no significant difference in treatment times. But the p-value is given as 0.0002, which suggests a highly significant difference in population medians. In the accompanying text the authors indicate that this difference is significant and quote the low p-value, so I can only assume a typographical error in the confidence interval.
Interpreting computer output for the Mann-Whitney test
In view of the widespread use of the Mann-Whitney test you might find it helpful to see the output for this procedure from both SPSS and Minitab.
With SPSS
With the Apgar scores in Table 10.1, you can use the Mann-Whitney test to check if the population median Apgar scores for infants born in a maternity unit and those born at home are the same and differ in the sample only by chance. The null hypothesis is that these medians are equal. The output from SPSS is shown in Figure 12.1. The p-value of 0.061 is labelled ‘Asymp. Sig. (2-tailed)’. Since this is not less than 0.05 you cannot reject the null hypothesis of no difference in population median Apgar scores between the two groups.
Test Statistics

| | APGARALL |
|---|---|
| Mann-Whitney U | 325.500 |
| Wilcoxon W | 790.500 |
| Z | –1.876 |
| Asymp. Sig. (2-tailed) | .061 (the p-value) |

Figure 12.1 Output from SPSS for the Mann-Whitney test of the difference between population medians of the two independent Apgar scores (raw data in Table 10.1)
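The ‘Asymp. Sig. (2-tailed)’ value in this output can be checked by hand from the Z statistic, on the assumption (suggested by the label ‘Asymp.’) that it is a two-tailed area under the standard Normal curve. The short sketch below uses only the Z value printed in Figure 12.1; the variable names are illustrative.

```python
from statistics import NormalDist

z = -1.876      # Z value as printed in Figure 12.1
alpha = 0.05    # conventional significance level

# Two-tailed p-value: the area in both tails beyond |z| under the standard Normal curve.
p_two_tailed = 2 * NormalDist().cdf(-abs(z))

print(f"two-tailed p = {p_two_tailed:.3f}")
print("reject H0" if p_two_tailed < alpha else "cannot reject H0")
```

Rounded to three decimal places this gives 0.061, in agreement with the SPSS output, and since this is not less than 0.05 the null hypothesis is not rejected.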
With Minitab
If you refer back to Figure 10.3, you will see the results of Minitab’s Mann-Whitney test three rows from the bottom.6 The p-value is given in the second row up as 0.0616 and since this is not less than 0.05 you cannot reject the null hypothesis. This is confirmed in the bottom row of the table, and enables you to conclude that there is no statistically significant difference between the population median Apgar scores of the two groups of infants.
6 ‘ETA’ is Minitab’s word for the population median.
Two matched medians – the Wilcoxon test
In the same circumstances as for the Mann-Whitney test described above, but with matched populations, the Wilcoxon test is appropriate. Look back at Table 10.5, which was from a matched case-control study into the dietary intake of schizophrenic patients living in the community in Scotland. Here the authors have used the Wilcoxon matched-pairs test to test for differences in the population median daily intakes of a number of substances between ‘All Patients’ and ‘All Controls’. The p-values are in the column headed ‘P’. As you can see, the only p-value not less than 0.05 is that for protein (p-value = 0.07), so this is the only substance whose median daily intake does not differ significantly between the two populations. Once again this confirms the confidence interval results.
Confidence intervals versus hypothesis testing
I said at the beginning of this chapter that where possible, confidence intervals are preferred to hypothesis tests because the confidence intervals are more informative. How so? Have another look at Table 10.4, from the study comparing ketorolac and morphine for limb injury pain. The authors give both 95 per cent confidence intervals and p-values for differences in a number of different treatment times, between two groups of limb injury patients. Let’s take the last of these. For the ‘interval between receiving analgesia and leaving A&E’, the p-value of 0.02 enables us to reject the null hypothesis, and you would conclude that the difference between the two population median treatment times is statistically significant.
The 95 per cent confidence interval of (4.0 to 39.0) minutes tells us not only that the difference between the population medians is statistically significant – because the confidence interval does not contain zero – but, in addition, that the value of this difference in population medians is likely to be somewhere between 4.0 minutes and 39.0 minutes. So the confidence interval does everything that the hypothesis test does – it tells us whether or not the medians differ significantly – but it also gives us extra information on the likely range of values for this difference. Moreover, unlike a p-value, the confidence interval is in clinically meaningful units, which helps with the interpretation. So whenever possible, it is good practice to use confidence intervals in preference to p-values.
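The way a confidence interval for a difference doubles as a hypothesis test can be written down in a few lines. The sketch below is illustrative only: the function name is hypothetical, and the two intervals are the ones quoted in this chapter from Table 10.4 (the second being the interval suspected above of containing a typographical error).

```python
def excludes_zero(lower, upper):
    """True when a confidence interval for a difference lies wholly above or wholly below zero."""
    return lower > 0 or upper < 0


# 95 per cent confidence intervals (minutes) quoted in the text from Table 10.4.
intervals = {
    "receiving analgesia to leaving A&E": (4.0, 39.0),
    "preparation of the analgesia": (0.0, 5.0),
}

for label, (lower, upper) in intervals.items():
    if excludes_zero(lower, upper):
        verdict = "statistically significant at the 5 per cent level"
    else:
        verdict = "not statistically significant at the 5 per cent level"
    print(f"{label}: difference likely between {lower} and {upper} minutes; {verdict}")
```

The interval supplies both the yes-or-no answer and the plausible size of the difference, whereas a p-value supplies only the former.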
Nobody’s perfect – types of error
Suppose you are investigating a new drug for the treatment of hypertension. Your null hypothesis is that the drug has no effect. Let’s suppose that the drug does actually reduce mean systolic blood pressure, but, on average, by only 5 mmHg. However, the hypothesis test you use can only detect a change of 10 mmHg or more. As a consequence, you will not find strong enough evidence to reject the null hypothesis, and you’ll conclude, mistakenly, that the new drug is not effective. But the effect is there, it’s just that your test does not have enough power to detect it.
There are three questions here. First, what exactly is the power of a test and how can we measure it? Second, how can we increase the power of the test we are using? Third, is there a more powerful test that we can use instead? Before I address these questions, a few words on types of error.
Whenever you decide either to reject or not reject a null hypothesis, you could be making a mistake. After all, you are basing your decision on sample evidence. Even if you have done everything right, your sample could still, by chance, not be very representative of the population. Moreover, your test might not be powerful enough to detect an effect if there is one. There are two possible errors:
Type I error: Rejecting a null hypothesis when it is true. Also known as a false positive. In other words, concluding there is an effect when there isn’t. The probability of committing a type I error is denoted α (alpha), and is the same alpha as the significance level of a test.
Type II error: Not rejecting a null hypothesis when it is false. Also known as a false negative. That is, concluding there is no effect when there is. The probability of committing a type II error is denoted β (beta).
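The two error rates can be made concrete with a small simulation. Everything numerical here is an assumption chosen for illustration, not taken from any study: systolic blood pressure is treated as Normally distributed with a known standard deviation of 15 mmHg, there are 30 patients per group, the untreated mean is 150 mmHg, and a two-sided z test is used at α = 0.05. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def z_test_rejects(sample_a, sample_b, sigma, alpha=0.05):
    """Two-sided z test of equal means, assuming a known common SD sigma."""
    se = sigma * math.sqrt(1 / len(sample_a) + 1 / len(sample_b))
    diff = sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b)
    p_value = 2 * NormalDist().cdf(-abs(diff / se))
    return p_value < alpha


def rejection_rate(true_difference, n=30, sigma=15.0, trials=4000, seed=1):
    """Proportion of simulated trials in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        control = [rng.gauss(150.0, sigma) for _ in range(n)]
        treated = [rng.gauss(150.0 - true_difference, sigma) for _ in range(n)]
        if z_test_rejects(control, treated, sigma):
            rejections += 1
    return rejections / trials


# Null hypothesis true (no effect): every rejection is a false positive.
type_1_rate = rejection_rate(true_difference=0.0)
# Null hypothesis false (true reduction of 5 mmHg): every non-rejection is a false negative.
type_2_rate = 1 - rejection_rate(true_difference=5.0)

print(f"estimated type I error rate:  {type_1_rate:.3f}")
print(f"estimated type II error rate: {type_2_rate:.3f}")
```

Under these assumptions the estimated type I error rate should sit close to the chosen α of 0.05, while the type II error rate is far larger, because a trial of this size will usually miss a true 5 mmHg effect.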
Ideally, you would like a test procedure which minimised the probability of a type I error, because in many clinical situations such an error is potentially serious – judging some procedure to be effective when it is not. When you set the significance level of a test to α = 0.05, it’s because you want the probability of a type I error to be no more than 0.05. Nonetheless, if there is a real effect you would certainly like to detect it, so you also want to minimise β, the probability of a type II error, or put another way, you want to make (1 − β) as large as possible.
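To see how (1 − β) depends on the size of the true effect, here is a rough calculation for the hypertension example. It is a sketch under assumptions that are not in the text: a two-sided, two-sample z test with a known common standard deviation of 15 mmHg and 50 patients per group. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def power_two_sided_z(true_difference, sigma, n_per_group, alpha=0.05):
    """Power (1 - beta) of a two-sided, two-sample z test with known common SD."""
    std_normal = NormalDist()
    se = sigma * math.sqrt(2 / n_per_group)       # standard error of the difference in means
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # about 1.96 when alpha = 0.05
    shift = abs(true_difference) / se             # true effect in standard-error units
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Assumed design: SD = 15 mmHg and 50 patients per group (illustrative values only).
power_5 = power_two_sided_z(5.0, sigma=15.0, n_per_group=50)
power_10 = power_two_sided_z(10.0, sigma=15.0, n_per_group=50)

print(f"power to detect a 5 mmHg reduction:  {power_5:.2f}")
print(f"power to detect a 10 mmHg reduction: {power_10:.2f}")
```

With this assumed design the test would detect a 10 mmHg reduction most of the time but would miss a 5 mmHg reduction more often than not, which is the situation described in the hypertension example above.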
Exercise 12.4 Explain, with examples, what is meant in hypothesis testing by: (a) a false positive; (b) a false negative.
