Study Design and Statistical Analysis: A Practical Guide for Clinicians (Katz, 2006)
association between having a case manager and receiving supportive services among HIV-infected persons. Advocates have used these studies as justification for funding case management programs, pointing out that having a case manager results in patients receiving needed services. However, these studies were vulnerable to the criticism of reverse causality, specifically the possibility that receiving services led to getting a case manager (because many service organizations automatically assign case managers to patients who request services).
To resolve this issue colleagues and I used a longitudinal probability sample of HIV-infected persons (HIV Cost and Services Utilization Study, HCSUS).152 We identified two groups: (1) subjects with unmet needs and case managers at baseline and (2) subjects with unmet needs and no case managers at baseline. We found that contact with a case manager at baseline was associated with a higher likelihood that unmet needs were fulfilled by the time of the follow-up visit. By requiring that the case manager be in place prior to the unmet need being fulfilled, we excluded the possibility that receiving services resulted in getting a case manager and thereby strengthened the argument that there was a causal relationship between having a case manager and receiving needed services.
Even with longitudinal studies, reverse causality may be operating if the disease you are studying has a subclinical form. This is why it is important to intensively screen for subclinical disease at the start of a study. For example, in Section 2.3.A I discussed the evidence supporting a relationship between participating in challenging cognitive activities and not developing dementia. But what if effect–cause is operating? Could it be that persons with undiagnosed dementia are less likely to engage in challenging cognitive activities? When such people are observed years later the dementia has progressed and the lack of engagement in challenging cognitive activities is assumed to be one of the reasons. To guard against this possibility, the investigators tested all subjects at baseline for dementia using a standardized instrument that closely correlates with the stages of Alzheimer’s disease.
9.1.G Exclude bias
Of potential threats to causality, bias can be the most difficult to assess because there are so many sources of potential bias. Remember from Section 1.1 that bias is systematic error in the design or execution of a study.153 Selection bias may
152Katz, M.H., Cunningham, W.E., Fleishman, J.A., et al. Effect of case management on unmet needs and utilization of medical care and medications among HIV-infected persons. Ann. Int. Med. 2001; 135: 557–65.
153For more on bias, see Szklo, M., Nieto, F.J. Epidemiology: Beyond the Basics. Gaithersburg, Maryland: Aspen Publication, pp. 125–76; Hulley, S.B., Cummings, S.R., Browner, W.S., Grady, D., Hearst, N., Newman, T.B. Designing Clinical Research (2nd edition). Philadelphia: Lippincott Williams & Wilkins, 2001, pp. 126–8.
160 Statistics and causality
occur in sampling of subjects or assignment to study groups (e.g., sicker persons being steered to a particular treatment group); bias may occur due to subjects with a disease being more likely to remember exposures (recall bias) or due to subjects answering questions the way they think the investigators want them to (i.e., social desirability bias); bias may occur due to interviewers probing more deeply with subjects they think likely to have had an exposure; observer bias occurs when the investigator draws a conclusion about a participant based on collateral information about the patient (e.g., investigator assumes that an AIDS patient is taking zidovudine because the patient has an elevated MCV level).
The best way to minimize bias is through careful study design. However, even if you perform a randomized placebo-controlled trial there are still potential sources of bias (e.g., subjects submitting their pills to a private laboratory to unblind their assignment). As a researcher, all you can do is minimize the sources of bias; test the impact of bias in your study (e.g., if study dropout is high among older persons, test your results in younger persons; if the association holds, then you know it cannot be due solely to bias from dropout among older persons); and honestly report the biases of your study.
9.1.H Strengthening causal associations: putting it all together and getting it wrong!
The association between estrogen use and Alzheimer’s disease provides a perfect example of how to strengthen causal associations and get it wrong!
Five observational studies showed that estrogen use was associated with decreased development of Alzheimer’s disease (prior research).154 Estrogen is known to have positive effects on the brain including reducing beta-amyloid accumulation, enhancing neurotransmitter release and action, and protecting against oxidative damage (biologic plausibility).155 The prospective longitudinal study performed by Tang and colleagues carefully evaluated subjects on enrollment to exclude incipient Alzheimer’s disease (exclude reverse causality). All five of the studies used multivariable analysis to control for possible confounders such as age, education, ethnicity, age at menarche, age at menopause, and apolipoprotein E genotype (exclude confounding). To test for bias due to
154Tang, M.-X., Jacobs, D., Stern, Y., et al. Effect of oestrogen during menopause on risk and age at onset of Alzheimer’s disease. Lancet 1996; 348: 429–32; Baldereschi, M., De Carlo, A., Lepore, V., et al. Estrogen-replacement therapy and Alzheimer’s disease in the Italian longitudinal study on aging. Neurology 1998; 50: 996–1002; Zandi, P.P., Carlson, M.C., Plassman, B.L., et al. Hormone replacement therapy and incidence of Alzheimer disease in older women. J. Am. Med. Assoc. 2002; 288: 2123–9; Paganini-Hill, A., Henderson, V.W. Estrogen deficiency and risk of Alzheimer’s disease in women. Am. J. Epidemiol. 1994; 140: 256–61; Kawas, C., Resnick, S., Morrison, A., et al. A prospective study of estrogen replacement therapy and the risk of developing Alzheimer’s disease: The Baltimore Longitudinal Study of Aging. Neurology 1997; 48: 1517–21.
155Yaffe, K. Hormone therapy and the Brain: Déjà vu all over again? J. Am. Med. Assoc. 2003; 289: 2717–18.
excluding women with Parkinson’s disease or stroke, Tang and colleagues compared hormone use among excluded women to that of women included in the study and found no differences (exclude bias). The protective effect was strong (OR 0.33) in the study by Baldereschi and colleagues (strength of effect). Three studies (Tang and colleagues, Paganini-Hill and Henderson, and Zandi and colleagues) found an association between longer duration of estrogen use and decreased incidence of Alzheimer’s disease (dose–response relationship).
However, when a randomized clinical trial was completed, it showed that estrogen plus progestin therapy actually increased the risk of dementia.156 How could the observational studies have been so wrong? The reason for the discrepancy between the observational data and the randomized controlled trial is unknown. The most likely explanation is confounding due to an unmeasured factor such as healthful life-style behavior.
9.2 Can the results be statistically significant and clinically unimportant?
Absolutely! The reason is that statistical significance is heavily affected by sample size. If you have any doubt, remember the coin toss example (Section 1.1). Having 60% of the tosses land on heads is sufficient evidence to conclude that the coin is unequally weighted if you have 100 tosses, but not if you have only 10 tosses.
Why is sample size such an important determinant of statistical significance? The reason is that you are more likely to correctly characterize a population if you assess a large number of its members than if you assess a small number of members.
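The coin toss argument can be checked numerically. The following is a minimal sketch, assuming a two-sided z-test based on the normal approximation to the binomial (this chapter does not say which test Section 1.1 uses); the function name `two_sided_p_for_heads` is hypothetical.

```python
import math

def two_sided_p_for_heads(heads: int, tosses: int, p_null: float = 0.5) -> float:
    """Two-sided P-value for H0: P(heads) = p_null (normal approximation, assumed test)."""
    expected = tosses * p_null
    sd = math.sqrt(tosses * p_null * (1 - p_null))
    z = (heads - expected) / sd
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# The same 60% heads at three sample sizes.
for tosses in (10, 100, 1000):
    heads = tosses * 6 // 10
    print(tosses, heads, two_sided_p_for_heads(heads, tosses))
```

Under this approximation, 6 heads in 10 tosses is far from significant, while 60 heads in 100 tosses falls just below 0.05. An exact binomial test is somewhat more conservative, so the conclusion at 100 tosses depends on the test chosen.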
However, correctly characterizing a population does not mean that the results are important. For example, Flum and colleagues examined the records of 1,570,361 Medicare patients who underwent cholecystectomy during a 7-year period.157 The investigators compared those patients who underwent an intraoperative cholangiography (IOC) to those who did not. (Performance of IOC is thought to decrease the risk of common bile duct injury.) There were many statistically significant differences between patients who underwent IOC and those who did not (Table 9.3).
In fact, of the 12 comparisons shown in Table 9.3, nine are statistically significant at the P < 0.001 level and two are statistically significant at the P < 0.05 level. But are these differences important? No, most seem trivial. For example, 96.8%
156Shumaker, S.A., Legault, C., Rapp, S.R., et al. Estrogen plus progestin and the incidence of dementia and mild cognitive impairment in postmenopausal women. J. Am. Med. Assoc. 2003; 289: 2651–62.
157Flum, D.R., Dellinger, E.P., Cheadle, A., Chan, L., Koepsell, T. Intraoperative cholangiography and risk of common bile duct injury during cholecystectomy. J. Am. Med. Assoc. 2003; 289: 1639–44.
Table 9.3. Characteristics of patients with and without intraoperative cholangiography (IOC)

| Variables | With IOC (N = 613,706) | Without IOC (N = 956,655) | P-value |
|---|---|---|---|
| Patient-level variables | | | |
| Age, mean (SD), (years) | 71.7 (10.3) | 71.2 (10.7) | <0.001 |
| Sex, (% of female) | 62.6 | 63.2 | <0.001 |
| Race, (% of white/non-Hispanic) | 88.9 | 88.8 | <0.05 |
| Complex biliary tract disease, (%) | 10.9 | 11.0 | <0.05 |
| Comorbidity index, mean (SD) | 0.04 (0.22) | 0.08 (0.24) | <0.001 |
| Surgeon-level variables | | | |
| Age, mean (SD), (years) | 48.1 (9.3) | 48.6 (9.6) | <0.001 |
| Sex, (% of male) | 96.8 | 96.7 | <0.001 |
| Percent performed in the surgeon’s first 20 cholecystectomies | 24.6 | 25.0 | <0.001 |
| Case order, mean # (SD) | 70.5 (61.3) | 66.6 (57.7) | <0.001 |
| General surgeon/surgical specialist | 95.6 | 95.6 | 1.0 |
| Surgeon board certified, (%) | 82.6 | 79.6 | <0.001 |
| Years since surgeon graduated from medical school, mean (SD), (years) | 21.8 (9.6) | 22.3 (9.6) | <0.001 |

Data from Flum, D.R., et al. Intraoperative cholangiography and risk of common bile duct injury during cholecystectomy. J. Am. Med. Assoc. 2003; 289: 1639–44.
of patients who underwent IOC had a male surgeon versus 96.7% of patients who did not have an IOC. Although the difference is a trivial 0.1%, the difference is statistically significant at the P < 0.001 level. What is driving the statistical significance is the large sample size. Almost any difference, no matter how trivial, will be statistically significant if you have 1.5 million subjects!
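To see how sample size alone drives this result, here is a rough sketch that applies a pooled two-proportion z-test to the rounded percentages from Table 9.3. The choice of test and the function name `two_proportion_p` are assumptions for illustration; the test Flum and colleagues actually used is not stated here, and the 1,000-per-group comparison is hypothetical.

```python
import math

def two_proportion_p(p1: float, n1: int, p2: float, n2: int) -> float:
    """Two-sided P-value for a difference in proportions (pooled z-test, assumed)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Surgeon sex (% male), using the rounded values and group sizes from Table 9.3.
p_large = two_proportion_p(0.968, 613_706, 0.967, 956_655)

# The same percentages in a hypothetical study with 1,000 patients per group.
p_small = two_proportion_p(0.968, 1_000, 0.967, 1_000)

print(f"1.5 million patients: P = {p_large:.4f}")
print(f"2,000 patients:       P = {p_small:.2f}")
```

With these inputs the 0.1% difference is significant below the 0.001 level in the full sample but nowhere near significance in the small hypothetical sample, even though the difference itself is identical.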
Besides large sample sizes, very sensitive measures can lead to statistically significant, but clinically unimportant results. For example, a study of Alzheimer’s disease found that patients given the medicine tacrine had statistically significant improvements on a scale very sensitive to cognitive changes (the cognitive scale of the Alzheimer’s Disease Assessment Scale) compared to patients who were given placebo. However, tacrine was not associated with improvements using more global measures of function such as the Mini-Mental State Examination.158 Due to its very limited benefit, tacrine is not widely prescribed for patients with Alzheimer’s disease.
158Qizilbash, N., Birks, J., Lopez Arrieta, J., Lewington S., Szeto, S. Tacrine for Alzheimer’s disease (Cochrane Review). In: The Cochrane Library (Issue 3). 2003, Oxford: Update Software.
Tip
Make sure your effect size is clinically important before undertaking your study.
The best way to avoid a situation of having a statistically significant, but clinically unimportant result is to set an effect size a priori that is clinically important. Although this sounds obvious, much more attention is paid in both study design and study interpretation to the issue of statistical significance than to clinical significance.159
9.3 Can the results be statistically insignificant and clinically important?
Tip
When clinically important differences do not reach statistical significance report the finding, but indicate that the difference did not reach statistical significance.
Also: absolutely! There is nothing sacred about the conventionally used P-value of 0.05. There is no reason to be dramatically more confident of a result that is significant at a P-value of 0.05 than a P-value of 0.06.
One way to avoid judging results based on a single threshold is to focus on the confidence intervals rather than the significance levels. The confidence intervals give you a sense of the range of results compatible with your data (Section 4.3). However, some people make the same mistake with confidence intervals as with P-values. That is, they dismiss any effect where the 95% CI does not exclude 1.0.
On the other hand, there does need to be some widely accepted threshold for deciding when chance is an unlikely explanation for a result. Otherwise, investigators would be tempted to move that threshold around, after the fact, to call their results statistically significant.
When you have a clinically important difference that does not reach statistical significance but is close to the conventional cut-off (e.g., P = 0.07 or the 95% CI includes one but excludes 0.98), report the finding, but indicate to the reader that it did not reach statistical significance.
For example, Kadish and colleagues tested the ability of an implantable cardioverter-defibrillator (ICD) to prevent deaths among patients with severe heart disease.160 They randomized 458 patients with non-ischemic dilated cardiomyopathy, left ventricular dysfunction, and evidence of arrhythmias to receive standard medical therapy alone versus standard medical therapy plus a single-chamber ICD. Using proportional hazards regression, they found that the ICD group was less likely to die (relative hazard 0.65). However, the 95% CI included 1 (0.40–1.06) and the P-value was 0.08.
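The reported confidence interval and P-value tell the same story, which can be checked with a back-of-the-envelope calculation. This sketch assumes the interval is symmetric on the log scale (a Wald-type interval), which is an assumption about how it was computed, not something the text states; the function name `p_from_ratio_ci` is hypothetical.

```python
import math

def p_from_ratio_ci(estimate: float, lower: float, upper: float,
                    z_crit: float = 1.96) -> float:
    """Approximate two-sided P-value from a ratio estimate and its 95% CI."""
    # Standard error of the log ratio, recovered from the width of the interval.
    se_log = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se_log
    return math.erfc(abs(z) / math.sqrt(2))

# Relative hazard of death in the ICD group, as reported in the text.
p = p_from_ratio_ci(0.65, 0.40, 1.06)
print(f"approximate P = {p:.3f}")
```

The result is close to the reported P-value of 0.08. Small discrepancies are expected because the published estimate and limits are rounded.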
Does this mean that ICDs do not save lives? No. What it does mean is that the study was underpowered for this outcome. When the investigators calculated their sample size they assumed that more than 50% of the deaths in the standard-therapy group would occur due to an arrhythmia. However, in the
159Man-Son-Hing, M., Laupacis, A., O’Rourke, K., et al. Determination of the clinical importance of study results. J. Gen. Int. Med. 2002; 17: 469–76.
160Kadish, A., Dyer, A., Daubert, J.P., et al. Prophylactic defibrillator implantation in patients with non-ischemic dilated cardiomyopathy. New Engl. J. Med. 2004; 350: 2151–8.
study, only a third of the deaths in the standard-therapy group were due to an arrhythmia. When the investigators used a more specific marker (Section 7.12) of the efficacy of ICD (sudden death due to an arrhythmia) they found a statistically significant decrease in deaths due to arrhythmias among the ICD recipients (relative hazard 0.20; 95% CI 0.06–0.71; P = 0.006).
On the other hand, some investigators mistakenly assert that their nonsignificant findings should be accepted as truth because if the sample size had been bigger, the P-value would have been statistically significant and the confidence intervals would have excluded 1.0. Although it is true that for a given effect size, a larger sample size will result in a smaller P-value (tossed coin example, Section 1.1) and narrow the confidence intervals, statistical significance testing takes into account the degree of uncertainty in the effect size at a given sample size. A larger sample size will result in less uncertainty but may also result in a different point estimate.
10
Special topics
10.1 What is the difference between the relative risk and the absolute risk?
Absolute risk is more helpful in clinical situations than relative risk.
Relative risks (risk ratios and rate ratios (RR)) identify the risk factors for particular outcomes. However, they cannot tell you how likely an outcome is to occur, only how much more likely the outcome is to occur in one group than the other. Therefore, knowing the relative risk is not very helpful in clinical situations. In contrast, an absolute risk tells you how likely an outcome is to occur.
The difference between the relative risk and absolute risk is particularly great with rare diseases because a person at high relative risk of developing a disease (compared to an unexposed person) may still be very unlikely to develop that disease. For example, the relative risk of developing esophageal cancer is 40–125 times higher among persons with Barrett esophagus. For persons newly diagnosed with Barrett esophagus this must sound like a certainty that they will develop cancer. In fact, the absolute risk of developing cancer if you have Barrett esophagus has been estimated at 0.5% per year (one in two hundred).161 Despite the high relative risk, the absolute risk is low because esophageal cancer is a rare disease.
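The arithmetic behind this contrast is simple: absolute risk in the exposed equals the relative risk multiplied by the baseline risk. The sketch below works backwards from the figures in the text to the baseline risk they imply. It assumes the relative risk and the 0.5% annual risk refer to comparable populations and time frames, which the text does not confirm, and the function name is hypothetical.

```python
def implied_baseline_risk(absolute_risk_exposed: float, relative_risk: float) -> float:
    """Baseline (unexposed) risk implied by an exposed-group risk and a relative risk."""
    return absolute_risk_exposed / relative_risk

ANNUAL_RISK_WITH_BARRETT = 0.005  # 0.5% per year, from the text

for rr in (40, 125):  # range of relative risks quoted in the text
    baseline = implied_baseline_risk(ANNUAL_RISK_WITH_BARRETT, rr)
    print(f"RR {rr}: implied baseline risk is about "
          f"{baseline * 100_000:.1f} per 100,000 persons per year")
```

A very large relative risk applied to a baseline of a few cases per 100,000 still yields a small absolute risk.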
10.2 What other effect measures are available in addition to relative risk and absolute risk?
In addition to relative risk and absolute risk, several related effect measures are available. Each one characterizes the association between a risk factor and an outcome differently. The different measures, along with their meaning, and their uses, are shown in Table 10.1.
161Shaheen, N., Ransohoff, D.F. Gastroesophageal reflux, Barrett esophagus, and esophageal cancer. J. Am. Med. Assoc. 2002; 287: 1972–81.
Table 10.1. Comparison of different measures of effect

| Effect measure | Meaning | Use |
|---|---|---|
| Absolute risk difference (attributable risk) | Incidence of disease that can be attributed to a particular exposure | Understand differences in risk due to differences in exposures |
| Attributable fraction | Proportion of disease due to a particular exposure | Understand importance of a particular factor on disease occurrence |
| Population attributable fraction | Incidence of disease due to a particular exposure in a community | Helpful in targeting public health interventions |
| Number needed to treat | Number of persons needed to be treated to prevent one outcome | Helpful in deciding whether it is worth adopting a clinical intervention |

10.2.A Absolute risk difference
The absolute risk difference is the difference in the incidence between two groups:

\[ \text{absolute risk difference} = \text{incidence among exposed} - \text{incidence among unexposed} \]

Definition
Attributable risk tells you how much of the incidence of a disease can be attributed to a particular exposure.

Assuming that there is a causal relationship between the exposure and the outcome, the absolute risk difference tells you how much of the incidence of the disease is due to (can be attributed to) the exposure. For this reason it is also referred to as the attributable risk or the attributable risk in exposed persons.
In Section 5.9.A I reviewed a study comparing the risk of community-acquired pneumonia among patients exposed to acid suppressing drugs compared to persons not exposed. The investigators found that the incidence of pneumonia in patients exposed to acid suppressing drugs was 2.45 per 100 person years (185/7562 × 100) and the incidence of pneumonia in unexposed patients was 0.55 per 100 person years (5366/970,331 × 100). Therefore, the attributable risk (attributable to acid suppression medication) is 1.9 cases (2.45 − 0.55) per 100 person years.
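The pneumonia example can be reproduced directly from the case counts and person years quoted above. This is a minimal sketch; the helper name `incidence_per_100_person_years` is hypothetical.

```python
def incidence_per_100_person_years(cases: int, person_years: int) -> float:
    """Incidence rate expressed per 100 person years."""
    return cases / person_years * 100

# Counts quoted in the text for the acid suppressing drug study.
exposed = incidence_per_100_person_years(185, 7_562)
unexposed = incidence_per_100_person_years(5_366, 970_331)

# Absolute risk difference (attributable risk) per 100 person years.
risk_difference = exposed - unexposed

print(f"exposed:         {exposed:.2f} per 100 person years")
print(f"unexposed:       {unexposed:.2f} per 100 person years")
print(f"risk difference: {risk_difference:.1f} per 100 person years")
```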
10.2.B Attributable fraction (attributable risk percentage)
The attributable fraction (also known as the attributable risk percentage) tells us the proportion of a disease that is due to a particular exposure, assuming that the exposure causes the disease.162 It is calculated as:

\[ \text{attributable fraction} = \frac{\text{incidence among exposed} - \text{incidence among unexposed}}{\text{incidence among exposed}} \]

Incidence in the formula can be incidence rate or incidence proportion. Continuing with the example of acid suppressing drugs and pneumonia, the attributable fraction would be:

\[ \frac{2.45 - 0.55}{2.45} = 0.78 \]

In other words, 78% of the pneumonias that developed among the patients exposed to acid suppressing drugs can be attributed to those drugs. This may seem very high to you because you are thinking that the attributable fractions for all the causes of pneumonia should add up to 100%. This is incorrect. The attributable fractions can sum to more than 100% because multiple causes can interact and result in disease (e.g., acid suppressing drugs in the setting of exposure to pneumococcus can cause pneumonia).163
This attributable fraction can also be stated in terms of RR, specifically:

\[ \text{attributable fraction} = \frac{\text{RR} - 1.0}{\text{RR}} \]

To check that the two ways of stating the attributable fraction are equivalent, calculate the attributable fraction in terms of the RR. In Section 5.9.A we had calculated that the RR associated with exposure to acid suppressing drugs was 4.5. Therefore, the unadjusted attributable fraction would be:

\[ \frac{4.5 - 1.0}{4.5} = 0.78 \]

One advantage of the formula calculating the attributable fraction from the risk ratio is that the formula can be generalized so that you can approximate the attributable fraction from the odds ratio when it can be considered an approximation of the risk ratio (Section 5.2).
162Some authors define the attributable risk in the way I have defined the attributable fraction. It is best not to get distracted by the confusing nomenclature, and instead focus on the meaning of the comparison you are making.
163In fact, the sum of the attributable fractions is bounded by infinity. For more on this somewhat counter-intuitive idea see Rothman, K.J., Greenland, S. Modern Epidemiology (2nd edition). Philadelphia: Lippincott, Williams & Wilkins, 1998, pp. 12–14.
\[ \text{attributable fraction}^{*} = \frac{\text{OR} - 1.0}{\text{OR}} \]

*Assuming outcome is uncommon (<10–15%)
This is very useful when you have performed logistic regression and have an odds ratio rather than a relative risk for a given exposure.
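The two forms of the attributable fraction are algebraically the same, since (Ie − Iu)/Ie = 1 − 1/RR = (RR − 1)/RR when RR = Ie/Iu. The sketch below checks this with the pneumonia figures; the function names are hypothetical, and the odds ratio version is only an approximation that holds under the uncommon-outcome condition noted above.

```python
def attributable_fraction_from_incidence(inc_exposed: float, inc_unexposed: float) -> float:
    """(incidence among exposed - incidence among unexposed) / incidence among exposed."""
    return (inc_exposed - inc_unexposed) / inc_exposed

def attributable_fraction_from_ratio(ratio: float) -> float:
    """(ratio - 1) / ratio; ratio is an RR, or an OR when the outcome is uncommon."""
    return (ratio - 1.0) / ratio

# Incidences per 100 person years and the rounded RR, both from the text.
af_incidence = attributable_fraction_from_incidence(2.45, 0.55)
af_rounded_rr = attributable_fraction_from_ratio(4.5)

# Using the unrounded ratio shows the two formulas agree exactly.
af_exact_rr = attributable_fraction_from_ratio(2.45 / 0.55)

print(f"{af_incidence:.2f} {af_rounded_rr:.2f} {af_exact_rr:.2f}")
```

The small difference between the incidence-based value and the value from RR = 4.5 comes only from rounding the RR.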
10.2.C Population attributable fraction
Population attributable fraction tells us the proportion of a disease that is due to a particular exposure in a population, assuming that the exposure causes the disease. This metric incorporates the prevalence of the risk factor such that interventions that decrease common risk factors reduce disease more than interventions that eliminate uncommon risk factors. Stated in a different way: if you had two interventions that halved the incidence of a particular disease, the intervention that decreased the more common risk factor would have a more powerful effect in the community than the intervention that eliminated the less common risk factor. The formula for population attributable fraction164 is:
\[ \text{population attributable fraction} = \frac{\text{incidence in population} - \text{incidence in unexposed}}{\text{incidence in population}} \]

As with attributable fraction, incidence can be based on incidence rates or incidence proportions. The above formula can be rewritten mathematically165 to more easily see the impact of the prevalence of the risk factor on the population attributable fraction:
\[ \text{population attributable fraction} = \frac{(\text{prevalence of risk factor in the population}) \times (\text{RR} - 1)}{[(\text{prevalence of risk factor in the population}) \times (\text{RR} - 1)] + 1} \]

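The two formulas for the population attributable fraction can be compared side by side. The sketch below uses hypothetical prevalences, relative risks, and baseline incidence (none of them from the text), and hypothetical function names; it assumes the incidence in the population is the prevalence-weighted average of the incidence in exposed and unexposed persons.

```python
def paf_from_incidence(inc_population: float, inc_unexposed: float) -> float:
    """(incidence in population - incidence in unexposed) / incidence in population."""
    return (inc_population - inc_unexposed) / inc_population

def paf_from_prevalence(prevalence: float, rr: float) -> float:
    """prevalence * (RR - 1) / (prevalence * (RR - 1) + 1)."""
    excess = prevalence * (rr - 1)
    return excess / (excess + 1)

INC_UNEXPOSED = 0.10  # hypothetical baseline incidence

# Hypothetical risk factors: (prevalence in the population, relative risk).
scenarios = {"uncommon, strong": (0.10, 3.0), "common, weak": (0.50, 1.5)}

results = {}
for label, (prevalence, rr) in scenarios.items():
    inc_exposed = rr * INC_UNEXPOSED
    inc_population = prevalence * inc_exposed + (1 - prevalence) * INC_UNEXPOSED
    results[label] = (
        paf_from_incidence(inc_population, INC_UNEXPOSED),
        paf_from_prevalence(prevalence, rr),
    )
    print(label, [round(value, 3) for value in results[label]])
```

In this illustration the common risk factor with the smaller relative risk accounts for a larger share of disease in the population than the uncommon risk factor with the larger relative risk.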
The differences between risk ratios, attributable fraction, and population attributable fraction are illustrated by a population-based study of risk factors for uncontrolled hypertension (Table 10.2).166 You can see that based on the relative risks, having no medical care is a stronger predictor of uncontrolled hypertension than being male. However, because only 10% of the sample had
164For more on attributable risk and population attributable risk see Kelsey, J.L., Whittemore, A.S., Evans, A.S., Douglas Thompson, W. Methods in Observational Epidemiology (2nd edition). Oxford: Oxford University Press, 1996, pp. 37–40.
165To see how: Szklo, M., Nieto, F.J. Epidemiology: Beyond the Basics. Gaithersburg, Maryland: Aspen Publication, pp. 101–5.
166Hyman, D.J., Pavlik, V.N. Characteristics of patients with uncontrolled hypertension in the United States. New Engl. J. Med. 2001; 345: 479–86.
