- •Contents
- •Preface to the 2nd Edition
- •Preface to the 1st Edition
- •Introduction
- •Learning Objectives
- •Variables and Data
- •The Good, the Bad, and the Ugly – Types of Variable
- •Categorical Variables
- •Metric Variables
- •How can I Tell what Type of Variable I am Dealing with?
- •2 Describing Data with Tables
- •Learning Objectives
- •What is Descriptive Statistics?
- •The Frequency Table
- •3 Describing Data with Charts
- •Learning Objectives
- •Picture it!
- •Charting Nominal and Ordinal Data
- •Charting Discrete Metric Data
- •Charting Continuous Metric Data
- •Charting Cumulative Data
- •4 Describing Data from its Shape
- •Learning Objectives
- •The Shape of Things to Come
- •5 Describing Data with Numeric Summary Values
- •Learning Objectives
- •Numbers R us
- •Summary Measures of Location
- •Summary Measures of Spread
- •Standard Deviation and the Normal Distribution
- •Learning Objectives
- •Hey ho! Hey ho! It’s Off to Work we Go
- •Collecting the Data – Types of Sample
- •Types of Study
- •Confounding
- •Matching
- •Comparing Cohort and Case-Control Designs
- •Getting Stuck in – Experimental Studies
- •7 From Samples to Populations – Making Inferences
- •Learning Objectives
- •Statistical Inference
- •8 Probability, Risk and Odds
- •Learning Objectives
- •Calculating Probability
- •Probability and the Normal Distribution
- •Risk
- •Odds
- •Why you can’t Calculate Risk in a Case-Control Study
- •The Link between Probability and Odds
- •The Risk Ratio
- •The Odds Ratio
- •Number Needed to Treat (NNT)
- •Learning Objectives
- •Estimating a Confidence Interval for the Median of a Single Population
- •10 Estimating the Difference between Two Population Parameters
- •Learning Objectives
- •What’s the Difference?
- •Estimating the Difference between the Means of Two Independent Populations – Using a Method Based on the Two-Sample t Test
- •Estimating the Difference between Two Matched Population Means – Using a Method Based on the Matched-Pairs t Test
- •Estimating the Difference between Two Independent Population Proportions
- •Estimating the Difference between Two Independent Population Medians – The Mann–Whitney Rank-Sums Method
- •Estimating the Difference between Two Matched Population Medians – Wilcoxon Signed-Ranks Method
- •11 Estimating the Ratio of Two Population Parameters
- •Learning Objectives
- •12 Testing Hypotheses about the Difference between Two Population Parameters
- •Learning Objectives
- •The Research Question and the Hypothesis Test
- •A Brief Summary of a Few of the Commonest Tests
- •Some Examples of Hypothesis Tests from Practice
- •Confidence Intervals Versus Hypothesis Testing
- •Nobody’s Perfect – Types of Error
- •The Power of a Test
- •Maximising Power – Calculating Sample Size
- •Rules of Thumb
- •13 Testing Hypotheses About the Ratio of Two Population Parameters
- •Learning Objectives
- •Testing the Risk Ratio
- •Testing the Odds Ratio
- •Learning Objectives
- •15 Measuring the Association between Two Variables
- •Learning Objectives
- •Association
- •The Correlation Coefficient
- •16 Measuring Agreement
- •Learning Objectives
- •To Agree or not Agree: That is the Question
- •Cohen’s Kappa
- •Measuring Agreement with Ordinal Data – Weighted Kappa
- •Measuring the Agreement between Two Metric Continuous Variables
- •17 Straight Line Models: Linear Regression
- •Learning Objectives
- •Health Warning!
- •Relationship and Association
- •The Linear Regression Model
- •Model Building and Variable Selection
- •18 Curvy Models: Logistic Regression
- •Learning Objectives
- •A Second Health Warning!
- •Binary Dependent Variables
- •The Logistic Regression Model
- •19 Measuring Survival
- •Learning Objectives
- •Introduction
- •Calculating Survival Probabilities and the Proportion Surviving: the Kaplan-Meier Table
- •The Kaplan-Meier Chart
- •Determining Median Survival Time
- •Comparing Survival with Two Groups
- •20 Systematic Review and Meta-Analysis
- •Learning Objectives
- •Introduction
- •Systematic Review
- •Publication and other Biases
- •The Funnel Plot
- •Combining the Studies
- •Solutions to Exercises
- •References
- •Index
The power of a test
We can now return to the three questions above. To answer the first: the power of a test is defined as (1 − β); it measures the test's capacity to reject the null hypothesis when it is false, in other words, to detect an effect if one is present. In practice, β is typically set at 0.2 or 0.1, which gives power values of 0.80 (or 80 per cent) and 0.90 (or 90 per cent) respectively. So if there is an effect, the probability that the test will detect it is 0.80 or 0.90.
The power of a test is a measure of its capacity to reject the null hypothesis when it is false. In other words, its capacity to detect an effect if one is present.
Although you would like to minimise both α and β, unfortunately they are, for a given sample size, linked: you cannot make β smaller without making α larger, and vice versa. Thus when you decide on a value for α, you are also inevitably fixing the value of β. To answer the second question: the only way to reduce both simultaneously (and so increase the power of a test) is to increase the sample size.
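The link between sample size and power can be made concrete with a short simulation. The sketch below (plain Python; the function name, effect size and trial count are illustrative choices, not from the text) estimates the power of a simple two-sample z-test with a known standard deviation by counting how often the null hypothesis is rejected across many simulated experiments in which a real effect is present:

```python
import random

def simulated_power(n, effect, sd=1.0, z_crit=1.96, trials=2000, seed=42):
    """Estimate power by simulation: the fraction of simulated experiments
    in which a two-sided z-test (alpha = 0.05) rejects the null hypothesis,
    when the true difference between the group means is `effect`."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(effect, sd) for _ in range(n)]
        mean_a = sum(a) / n
        mean_b = sum(b) / n
        se = (sd * sd / n + sd * sd / n) ** 0.5  # standard error of the difference
        if abs(mean_b - mean_a) / se > z_crit:
            rejections += 1
    return rejections / trials

# The same effect is detected far more reliably with the larger sample.
print(simulated_power(n=20, effect=0.5))
print(simulated_power(n=80, effect=0.5))
```

Running this shows the estimated power rising sharply as n grows, which is exactly the behaviour the sample size calculations below exploit.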
To answer the third question – is there a more powerful test? Briefly, parametric tests are more powerful than non-parametric tests (see p. 127 for the meaning of these terms). For example, the Mann-Whitney test has 95 per cent of the power of the two-sample t test.7 Similarly, the Wilcoxon matched-pairs test has 95 per cent of the power of the matched-pairs t test. As for the chi-squared test, there is usually no obvious alternative for categorical data, so comparisons of power are less relevant, but it is known to be a powerful test. In general, you should of course use the most powerful test that the type of data, and its distributional shape, will allow.
An example from practice
The following is an extract from the RCT of epidural analgesia in the prevention of stump and phantom pain after amputation, referred to in Table 5.3. The authors of the study outline their thinking on power thus:
The natural history of phantom pain after amputation shows rates of about 70%, and in most patients the pain is not severe. Since epidural treatment is an invasive procedure, we decided that a clinically relevant treatment should reduce the incidence of phantom pain to less than 30% at week 1 and then at 3, 6, and 12 months after amputation. Before the start of the study, we estimated that a sample size of 27 patients per group would be required to detect a between-group difference of 40% in the rate of phantom pain (type I error rate 0.05; type II error rate 0.2; power = 0.8).
7 In view of the restrictions associated with the two-sample t test, the Mann-Whitney test seems an excellent alternative!
152 CH 12 TESTING HYPOTHESES ABOUT THE DIFFERENCE BETWEEN TWO POPULATION PARAMETERS
Exercise 12.5 (a) Explain, with the help of a few clinical examples, why you would normally want to minimise α when testing a hypothesis. (b) α is conventionally set at 0.05 or 0.01. Why, if you want to minimise it, don't you set it at 0.001 or 0.000001, or even 0?
Maximising power – calculating sample size
Generally, the bigger the sample, the more powerful the test.8 The minimum sample size for a given power is determined by both the chosen significance level α and the power required. The sample size calculation can be summarised thus:
- Decide on the minimum size of the effect that would be clinically useful (or otherwise of interest).
- Decide the significance level α, usually 0.05.
- Decide the power required, usually 80 per cent.
- Do the sample size calculation, using appropriate software or the rule of thumb described below.
Minitab has an easy-to-use sample size calculator for the most commonly used tests. Machin et al. (1987) is a comprehensive collection of sample size calculations for a large number of different test situations.
Rules of thumb9
Comparing the means of two independent populations (metric data)
The required sample size n is given by the following expression:
n = (2 × s.d.² × k) / E²
where s.d. is the population standard deviation (assumed equal in both populations). This can be estimated using the sample standard deviations, if these are available from a pilot study, say; otherwise the s.d. will have to be guessed using whatever information is available. E is the minimum change in the mean that would be clinically useful or otherwise interesting. k is a "magic number" which depends on the power and significance levels required, and is obtained from Table 12.3.
8 These sample size calculations also apply if you are calculating confidence intervals. Samples that are too small produce wide confidence intervals, sometimes too wide to enable a real effect to be identified.
9 I am indebted to Andy Vail for this material.
Table 12.3 Table of magic numbers for sample size calculations (columns give the power, (1 − β))

| Significance level, α | 70% | 80% | 90% | 95% |
|---|---|---|---|---|
| 0.05 | 6.2 | 7.8 | 10.5 | 13.0 |
| 0.01 | 9.6 | 11.7 | 14.9 | 17.8 |
For example, suppose you propose to use a case-control study to examine the efficacy of a program of regular exercise, as an alternative to your current drug of choice, in treating moderately hypertensive patients. The minimal difference in mean systolic blood pressures between the cases (given the exercise program), and the controls (given the existing drug), that you think clinically worthwhile is 10 mmHg. You will have to make an intelligent guess as to the standard deviation of systolic blood pressure (assumed the same in both groups – see above). Information on this, and many other measures, is likely to be available from reference sources, from the research literature, from colleagues, etc. Let’s assume systolic blood pressure s.d. = 12 mmHg. If power required is 80 per cent, with a significance level of 0.05, then from Table 12.3, k = 7.8, and the sample size required per group is:
n = (2 × 12² × 7.8) / 10² = 22.5
So you will need at least 23 subjects in each of the two groups (always round up to next highest integer) to detect a difference between the means of 10 mmHg. Note that these sample sizes will also be large enough for two matched populations since these require smaller sample sizes for the same power.
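This rule of thumb translates directly into a few lines of code. Below is a minimal sketch in Python; the function name and the lookup dictionary are my own illustrative choices, but the k values are those of Table 12.3:

```python
import math

# Magic numbers k from Table 12.3, keyed by (significance level, power).
K_TABLE = {
    (0.05, 0.70): 6.2, (0.05, 0.80): 7.8, (0.05, 0.90): 10.5, (0.05, 0.95): 13.0,
    (0.01, 0.70): 9.6, (0.01, 0.80): 11.7, (0.01, 0.90): 14.9, (0.01, 0.95): 17.8,
}

def sample_size_two_means(sd, E, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two independent means:
    n = (2 * sd^2 * k) / E^2, rounded up to the next whole subject."""
    k = K_TABLE[(alpha, power)]
    return math.ceil(2 * sd ** 2 * k / E ** 2)

# The blood pressure example: s.d. = 12 mmHg, minimum difference E = 10 mmHg.
print(sample_size_two_means(sd=12, E=10))  # 23 subjects per group
```

Rounding is always upwards, since fewer than the calculated number of subjects per group would leave the test underpowered.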
Comparing the proportions in two independent populations (binary data)
The required sample size, n, is given by:
n = {[Pa × (1 − Pa)] + [Pb × (1 − Pb)]} × k / (Pa − Pb)²
where Pa is the proportion with treatment a, Pb is the proportion with treatment b, so (Pa − Pb) is the effect size; and k is the magic number from Table 12.3.
For example, suppose the percentage of elderly patients in a large district hospital with pressure sores is currently around 40 per cent, or 0.40. You want to test a new pressure-sore- reducing mattress, and you would like the percentage with pressure sores to decrease to at least 20 per cent, or 0.20. So Pa = 0.40, and (1 − Pa ) = 0.60; Pb = 0.20, and (1 − Pb ) = 0.80; therefore (Pa − Pb ) = (0.40 − 0.20) = 0.20. If power required is 80 per cent and significance
level α = 0.05, then required sample size per group is:
n = {(0.40 × 0.60) + (0.20 × 0.80)} × 7.8 / 0.20² = 78.0
Thus you would need at least 78 subjects in each group, which would also be big enough for matched proportions.
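The proportions formula can be sketched in the same way. Again the function name is my own, and k defaults to 7.8, the Table 12.3 value for 80 per cent power at α = 0.05:

```python
import math

def sample_size_two_proportions(pa, pb, k=7.8):
    """Per-group sample size for comparing two independent proportions:
    n = {pa(1 - pa) + pb(1 - pb)} * k / (pa - pb)^2, rounded up."""
    n = (pa * (1 - pa) + pb * (1 - pb)) * k / (pa - pb) ** 2
    # The tiny tolerance guards against floating-point error pushing the
    # result just above a whole number when the formula lands exactly on one.
    return math.ceil(n - 1e-9)

# The pressure-sore example: 40% with sores currently, 20% hoped for.
print(sample_size_two_proportions(0.40, 0.20))  # 78 subjects per group
```

Swapping in a different k from Table 12.3 answers the same question at other power and significance levels.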
Exercise 12.6 For the examples above, (a) hypertension and (b) pressure sores, what sample sizes would be required if the power and significance level were, respectively: (i) 90 per cent and 0.05; (ii) 90 per cent and 0.01; (iii) 80 per cent and 0.01?
Exercise 12.7 Suppose you are proposing to use a randomised controlled trial to study the effectiveness of St John's Wort as an alternative to an existing drug for the treatment of mild to moderate depression. The percentage of patients reporting an improvement in mood three months after existing drug treatment is 70 per cent. You would be satisfied if the percentage reporting mood improvement after three months of St John's Wort was 80 per cent. How big a sample would you require to detect this improvement if you wanted your test to have (a) 80 per cent power and an α of 0.05, or (b) 90 per cent power and an α of 0.01?