VI DATA ANALYSIS
30 Experimental Design and Data Analyses in Vision Function Testing
When is a measure that reflects visual function abnormal? When is a change in some aspect of visual function meaningful, and when is it just a random chance variation? When is a therapeutic intervention helpful in slowing the progression of disease? When might an intervention be harmful? These are just some of the areas in which statistical analyses can be extremely helpful and informative. In this chapter, basic issues of experimental design and statistical analysis are discussed. Detailed discussion of a variety of statistical issues is beyond the scope of this chapter. Whenever possible, the reader is referred to primary sources for expanded discussion.
Summarizing the characteristics of a sample using descriptive statistics
The most basic way of summarizing functional data is by using descriptive statistics. Three pieces of information are usually provided in summarizing the characteristics of a randomly drawn sample of data. First, the number of separate and independent observations made on any one variable, or the number of individual subjects, is denoted by N. In eye research, for example, observations made on the right eyes of two different individuals are independent (or uncorrelated) for statistical purposes, and N = 2, but observations made on the left and right eyes of the same individual are not independent. This fact is frequently overlooked in many experimental designs. Most parametric statistical tests (see below) require that the assumption of independence be met. If the assumption of independence is ignored, an increase in the Type I error rate is likely to occur (see below).1
The second piece of information used in describing sample data is the average, or center, of the distribution of scores. The three measures of a distribution's center are the mean, median, and mode. The mean is the most widely used measure of the center and is defined as the sum of all scores divided by the number of scores. The median is the score at the fiftieth percentile, that is, the score that exactly divides the upper half of the scores from the lower half. Finally, the mode is the most frequently occurring score in the distribution. The mode is the crudest of the three averages, since it is not necessarily unique; a distribution of scores might have two or three frequently occurring scores, or modes.
The estimates of the mean, median, and mode are influenced by the nature of the underlying distribution. If a sample of data is normally distributed, the mean, median, and mode will be very close. However, there are other types of sample distributions, called skewed distributions, in which the mean, median, and mode can be very different.
The third descriptive statistic is a measure of how the scores vary around the center of the distribution. Commonly used measures of variability are the range, standard deviation, and variance. (There are several other measures of variability, such as the average deviation and the quartile deviation, that will not be described here.) The range is the difference between the highest score and the lowest score in the distribution. The standard deviation is the most common measure of the variability of a sample. In statistics, it is usually denoted S, although in nonstatistical applications sd or st dev is commonly used. It is defined as the square root of the sum of the squared deviations from the mean divided by N. The variance, which is computed as the square of the standard deviation, also indicates how representative the mean is of the individual scores.
The specific applications and methods of computation of the measures of the average and variability of test scores can be found in most elementary statistics textbooks.
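As a concrete illustration, these descriptive statistics can be computed with Python's standard library. The sketch below uses hypothetical ERG b-wave amplitudes chosen only to illustrate the calculations.

```python
import statistics

# Hypothetical ERG b-wave amplitudes (microvolts), one eye per subject
# so that the observations are independent; N = 8.
amplitudes = [310, 295, 340, 285, 320, 305, 295, 330]

n = len(amplitudes)                              # N, number of independent observations
mean = statistics.mean(amplitudes)               # sum of scores divided by N
median = statistics.median(amplitudes)           # score at the fiftieth percentile
mode = statistics.mode(amplitudes)               # most frequently occurring score (295)
value_range = max(amplitudes) - min(amplitudes)  # highest minus lowest score

# The chapter defines S with an N denominator (statistics.pstdev); the N - 1
# form (statistics.stdev) is the usual unbiased estimate for a sample.
s = statistics.pstdev(amplitudes)
variance = s ** 2                                # variance is the square of S

print(n, mean, median, mode, value_range, round(s, 1), round(variance, 1))
```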
A common application of descriptive statistics is in establishing normative data for some specific testing protocol. Normative data are defined by the nonpathological visual system and provide a guideline for deciding whether a particular electrophysiological or psychophysical parameter lies within normal limits or is outside the normal range and perhaps indicative of underlying pathology. Normal limits are usually defined arbitrarily as, for example, ±2.0 standard deviations (S) from the mean. For example, if the mean of a sample of electroretinogram (ERG) b-waves is 300 µV and the standard deviation of the distribution of b-waves is 50 µV, then the scores from 200 to 400 µV represent the mean ±2.0 S. If the distribution of scores is normally distributed, ±2.0 S will capture approximately 95% of the distribution of normal scores. Of course, the caveat here is that roughly 5% of the normal population of scores would be considered abnormal by chance alone.
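A minimal sketch of this normal-limits calculation, assuming the normative mean and standard deviation just quoted (SciPy supplies the normal-distribution coverage):

```python
from scipy.stats import norm

mean, s = 300.0, 50.0                           # assumed normative b-wave mean and S (µV)
lower, upper = mean - 2.0 * s, mean + 2.0 * s   # ±2.0 S limits: 200 to 400 µV

# Fraction of a normally distributed population captured by ±2.0 S
coverage = norm.cdf(2.0) - norm.cdf(-2.0)
print(f"normal limits: {lower:.0f}-{upper:.0f} µV")
print(f"coverage: {coverage:.3f}")   # about 0.954, so roughly 5% flagged by chance
```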
Estimating population parameters
It is of course not practical to measure an entire population, say, those with normal vision, to estimate the mean and variability of a specific functional parameter, such as the amplitude of the b-wave of the ERG. Instead, if an investigator wants to learn about the mean and standard deviation of the population, called μ and σ, respectively, a representative sample of individuals is drawn from the population of interest, and then this sample is studied. (Greek letters designate population parameters.)
There are two ways to estimate μ, the population mean. One is called a point estimate. For this method, a random sample of individuals is drawn from the population of interest, and the mean of this sample is taken as the best estimate of μ. A second way of estimating μ is called an interval estimate. Instead of computing a single estimate of μ, a range of likely values is computed so that there is a high probability that this range contains the population μ. The range is called a confidence interval, and its boundaries are called confidence limits. The confidence interval can be set at different levels. For example, 95% confidence limits mean that a mean drawn from a sample falls within the specified range with 95% probability. If the mean of a sample falls outside this range, and with a 95% confidence interval there is a 5% chance that this will occur by chance alone, then one concludes, with that level of confidence, that the sample was drawn from a different population from that defining the confidence interval. To set broader limits, an investigator might establish the 99% confidence limits, that is, limits within which 99% of the sample means would fall and outside which 1% would fall. It should be apparent that setting confidence limits allows testing of specific hypotheses about sample data, as, for example, when one asks whether one group of patients generates b-wave amplitudes that are different from those of a normal population of b-waves.
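The sketch below computes a point estimate and a 95% confidence interval for μ from a small hypothetical sample, using Student's t distribution because σ must itself be estimated from the sample:

```python
import numpy as np
from scipy import stats

sample = np.array([310, 295, 340, 285, 320, 305, 295, 330])  # hypothetical b-waves (µV)

point_estimate = sample.mean()   # point estimate of the population mean µ
sem = stats.sem(sample)          # estimated standard error of the mean

# 95% confidence interval for µ; use 0.99 for broader limits
lo, hi = stats.t.interval(0.95, df=len(sample) - 1, loc=point_estimate, scale=sem)
print(f"point estimate: {point_estimate:.1f} µV, 95% CI: ({lo:.1f}, {hi:.1f}) µV")
```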
Investigators frequently report the standard error of the mean in studies testing whether one sample is significantly different from another sample. If one were to draw 100 random samples from a population, then the standard deviation of the 100 sample means is called the standard error of the mean, denoted σx̄. The term is related to the standard deviation of the scores in the entire population of interest (σ) and is defined as σx̄ = σ/√N. The σx̄ has a huge advantage in testing the difference between sample means: even when the population shows extreme skewness and kurtosis, the distribution of sample means is approximately Gaussian (the central limit theorem). However, in the clinic, in which the issue is to determine whether a particular individual's score is abnormal, the σx̄ is not particularly useful.
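This behavior of sample means can be checked by simulation. The sketch below draws many samples from a deliberately skewed (exponential) population and shows that the standard deviation of the sample means matches σ/√N:

```python
import numpy as np

rng = np.random.default_rng(0)
population_sd = 50.0   # σ of an exponential population, which is heavily skewed
n = 25                 # size of each random sample

# Draw 100,000 samples of size N and compute each sample's mean
sample_means = rng.exponential(scale=population_sd, size=(100_000, n)).mean(axis=1)

print(sample_means.std())          # empirical standard error of the mean
print(population_sd / np.sqrt(n))  # theoretical σ/√N = 10.0
```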
When does a parameter estimate indicate pathology?
Of importance to the clinician is whether a particular set of responses lies within or outside normal limits. For example, if an ERG is performed, a clinician would be interested in knowing whether a rod- or cone-mediated response lies within normal limits or whether it falls outside normal limits. Scores that fall outside normal limits are suggestive of underlying pathology.
To answer these questions, statistical methods are employed. Most established clinical laboratories will have normative data derived from a normal population for the parameters of interest. A mean and the boundaries of normal function, or the confidence limits, will be established. Some laboratories set 95% or 99% confidence limits, and if a particular parameter estimate falls outside this range, it is considered abnormal. Other laboratories arbitrarily set these limits at 2.0 or even 3.0 standard deviations from the mean.
While normative data can be found in the literature for some electrophysiological tests (see, e.g., Birch and Anderson2 and Birch, Anderson, and Fish3 for the ERG), it is commonly recommended that laboratories collect their own normative data. The rationale for this point of view is that performance on a particular test depends not only on the characteristic of the observer, but also on how the data were collected. An important point to remember is that the normative data should be representative of the population of interest. Normative data collected in Australia, for example, might not be representative of individuals in North America, so the concept of global norms seems foolish. Variations in stimuli, equipment, and data analysis can affect not only the average performance on a test, but also the variability associated with the test scores. What might be considered within normal limits in one laboratory might be outside in another. However, if a laboratory can ensure that the populations, stimulus conditions, and data collection are consistent with the normative data reported in the literature, it is probably safe to use those data as a guideline.
Basic elements of experimental design and hypothesis testing
There is considerable interest in evaluating the effectiveness of therapeutic interventions for the prevention or even reversal of a progressive eye disease. The intervention might involve administering a drug or some sort of genetic therapy. In the simplest experimental design to evaluate the effectiveness of a treatment, the drug or gene therapy would be administered in a treatment condition and a placebo in a control condition, and the effectiveness of the treatment would be evaluated with some outcome measure such as a change in the amplitude of an ERG signal. In this example, the placebo is assumed to be a sham condition in which no treatment is delivered. However, the control condition need not be the absence of a treatment. In a drug trial, for example, an available alternative drug that has already been accepted as the current standard of care might be used as the control condition. The main point is that one needs an appropriate control group against which the treatment effect is evaluated.
One type of possible experimental design to evaluate a treatment is called the independent groups design. In this instance, a group of subjects are selected at random from the population of interest and then split, with random assignment to either the treatment or control condition. In every respect, the treatment and control groups are treated alike. Ideally, the experiment should be run double-blind: Neither the experimenter nor the subject knows who is receiving the drug.
In the above example, the subjects are randomly assigned to one or the other condition, but it is conceivable that by chance alone, the groups might differ in some irrelevant way. For example, by chance, one group might be older or might contain disproportionately more males than females. This factor alone, and not the particular treatment, could be responsible for any observed differences.
To avoid the influence of irrelevant factors, experimenters will make their groups as comparable as possible. Suppose age is a relevant factor; the experimenter will then equate the two groups for age. If gender were a relevant factor, then the experimenter would balance the groups to avoid gender differences, using either only males or only females or equal numbers of each gender in each group. One possible experimental design is to select pairs of subjects from the population of interest who have been equated on as many relevant variables as possible and then to assign one member of each pair to the treatment condition and the other to the control condition. This design is called a matched-subjects design.
Ideally, the matched subjects should be identical in every important way. If an experiment could be administered to identical twins, then the match would be very good, but this is often impractical. A more feasible approach would be to test the same individual under both conditions. For example, a patient might be tested before and after administration of a drug or gene therapy. This is the ultimate in matching and is generally considered to be the most powerful experimental design if it can be run. This particular design is called a repeated measurements design.
The experimental designs just described are the most basic of designs. More often than not, however, experimental designs are significantly more complicated. For example, an investigator might be interested in examining the effects of different doses of a drug treatment on more than one disease entity (a two-factor design). The experimental design might have X number of dose levels and Y number of disease entities. If X = 4 and Y = 5, then this experimental design would require 20 groups of subjects (4 doses × 5 diseases) if an independent groups design were used. In another experimental design, an experimenter might be interested in the before and after effects of administration of a drug on each subject within each disease entity. This is called a mixed design because each subject acts as his or her own control (repeated measures on the same subject) and subjects are nested within disease groups (independent groups). The particular experimental design that is developed depends in large part on the questions that are being asked. The main goal of any experimental design is to have the most powerful possible design to answer relevant questions and control for irrelevant variables. Of course, the more complicated the design, the more difficult are the computational analysis and interpretation of the statistical findings. It is good practice to design an experiment and simulate potential outcomes, as sketched below, to ensure that the design unambiguously answers the questions of interest.
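A minimal simulation of the simplest independent-groups design, assuming a hypothetical 40 µV treatment effect, shows how a planned analysis can be rehearsed on generated data before any subject is enrolled:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical populations: control b-waves of 300 µV, a treatment effect of
# +40 µV, a common SD of 50 µV, and 20 subjects randomly assigned per group.
control = rng.normal(300.0, 50.0, size=20)
treatment = rng.normal(340.0, 50.0, size=20)

t, p = stats.ttest_ind(treatment, control)   # independent-groups comparison
print(f"t = {t:.2f}, p = {p:.4f}")

# A repeated measurements design, with before/after scores from the same
# subjects, would instead be analyzed with stats.ttest_rel(before, after).
```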
Type I and II errors
When tests of significance evaluating the differences between conditions or subjects are made, two types of errors can occur. These errors are referred to as Type I and Type II errors. For demonstration, assume that there are two groups of subjects, say, patients with X-linked retinitis pigmentosa (XLRP) and patients with autosomal dominant retinitis pigmentosa (ADRP), and the performance measure to be compared is the amplitude of the cone b-wave. If an investigator rejects the hypothesis of no difference in the cone b-wave amplitude between XLRP and ADRP (referred to as the null hypothesis, or H0) when in fact H0 is true, an error is made, and this type of error is called a Type I error. Alternatively, an investigator might fail to reject H0 when, in reality, the hypothesis of no difference is false; that is, there is a real difference in b-wave amplitudes between XLRP and ADRP. This is also an error and is called a Type II error. The main goal of any statistical hypothesis testing is to make correct decisions, that is, to reject H0 when it is false and not to reject H0 when it is true.
The probability of making a Type I error is called by a number of different names, including alpha (α), significance level, or p-value. For example, when testing the difference in the cone b-wave amplitude between XLRP and ADRP, an investigator might find that the significance, or p-value, of the test is 0.05. This means that the difference that is observed in the performance of the two groups has a 5% likelihood of occurring by chance alone. In rejecting the hypothesis of no difference between the groups (H0), therefore, the investigator will have a 5% chance of making a Type I error: rejecting H0 when it is in fact true. Frequently, an investigator will set α at a more stringent level, say, 0.01. This means that an investigator will have only a 1% chance of making a Type I error. However, in setting a more stringent α or p-value for the test, the investigator is making the test less sensitive to real differences that might exist between the groups, and this can lead to a Type II error. The probability of making a Type II error is referred to as beta (β). The probability of making a Type II error and the probability of making a correct rejection of H0 are determined by a number of variables discussed in the section below on the power of a statistical test.
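The meaning of α can itself be demonstrated by simulation: when two groups are drawn from the same population, so that H0 is true by construction, a test at α = 0.05 falsely rejects H0 about 5% of the time. A sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, trials = 0.05, 10_000

false_rejections = 0
for _ in range(trials):
    # Both groups come from the SAME population, so any rejection is a Type I error
    a = rng.normal(300.0, 50.0, size=20)
    b = rng.normal(300.0, 50.0, size=20)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_rejections += 1

print(false_rejections / trials)   # close to 0.05, the nominal Type I error rate
```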
Power
The power of a statistical test is the ability of the test to correctly reject the hypothesis of no difference between samples (the null hypothesis) when in fact there are real differences. The power of a test is determined by a number of variables, including the level of significance that is set for the test, the size of the sample, the magnitude of the population variability, the magnitude of the difference between population means, and whether a one- or two-tailed test is used.
In some experimental designs in vision research, calculation of statistical power is not an important consideration. An example of such a situation would be when there are large differences between groups on some performance measure such as the amplitude of the b-wave of the ERG, as in the case of normal subjects and patients with X-linked retinitis pigmentosa. Another example would be when the populations of interest are so unique and rare that sufficient numbers of subjects cannot be obtained to meet the requirements of power.
However, in some experimental designs, a researcher might decide that it is important to reject the hypothesis of no difference, H0, only if the difference between two groups of patients is greater than some predetermined value. For example, a researcher might decide that a difference of 100 µV in an ERG response is a meaningful difference between a treatment and a control group; smaller differences might be rejected as being of little practical or clinical importance. Thus, the researcher would test the null hypothesis H0 that μ, the difference in population means of the two groups, is ≤100 µV against the alternative hypothesis that μ > 100 µV. The question then is what power the test of H0 has when the alternative hypothesis is true.
As was stated above, the power of the test depends on several factors. First, power increases as α increases; α is an arbitrary value that defines the likelihood of a particular event or occurrence and is traditionally set at 0.05. Power increases as α increases because a larger range of values is included in the rejection region. Second, power will be affected by whether H0 and the alternative hypothesis, H1, are tested with one- or two-tailed tests. In a two-tailed test, there are two critical regions that scores must exceed in order for H0 to be rejected, whereas in a one-tailed test, there is only one critical region, making the one-tailed test a more powerful test to reject H0 in some specific experimental situations.
Finally, the population variance (a measure of variability) and sample size are two additional factors affecting power. In general, a smaller population variance and a larger sample size will yield lower estimates of the standard error of the mean (S.E.). If H0 is false, a smaller S.E. increases the probability of rejecting H0 in favor of the alternative hypothesis; power is increased.
Traditionally, sample size will depend on what specific power is desired as well as the smallest difference between two groups that must be exceeded with that level of power in order to reject H0. Procedures for calculating sample size can be found in a number of standard statistical design references, including Cohen6 and Kirk.9 The reader is referred to Brewer,5 Cohen,7 and Meyer11 for a helpful and in-depth discussion of the practical considerations of power calculations.
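For illustration, the sketch below uses the statsmodels package to solve for the per-group sample size of an independent-groups t-test at several assumed effect sizes (Cohen's d, the difference in means divided by the population SD); the specific numbers are hypothetical.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group N needed for 80% power at α = 0.05 (one-tailed), for several
# assumed effect sizes d = (difference in means) / (population SD)
for d in (0.5, 1.0, 2.0):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative="larger")
    print(f"d = {d}: about {n:.1f} subjects per group")
```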
Assessing physiological change in the clinic
Frequently, clinicians are asked to evaluate patients who are undergoing a treatment for a systemic disease where the treatment may adversely affect vision. Treatment of systemic lupus erythematosus with Plaquenil (hydroxychloroquine) is such an example. The clinician's task is to select a test or a battery of tests that reliably detect the onset of drug toxicity. Finding the appropriate test(s) that can discriminate normal patients from those with a potentially toxic response is typically accomplished by deriving receiver operating characteristic (ROC) curves. An ROC curve is a graphical representation of the trade-off between false-positive and false-negative rates across the criterion values obtainable from a test. Conventionally, the plot shows the false-positive rate (1 − specificity) on the X-axis and the true-positive rate (sensitivity) on the Y-axis, so the ROC curve is equivalently a representation of the trade-off between sensitivity and specificity. Accuracy, which refers to the test's ability to discriminate normal from abnormal, is measured by the area under the ROC curve. With sensitivity and specificity expressed as fractions of unity, a zero-effect ROC curve, which would be consistent with a test that discriminates no better than chance, has an area of 0.5, and a perfect discriminator has an area of 1.0. In a plot of the true-positive rate against the false-positive rate, a zero-effect test would be indicated by a diagonal line with a slope of 1.0 and an origin at 0.0. From the ROC curve, one can select any pass/fail criterion and then compare the performance of the test with a gold standard that is known to discriminate between groups of patients.
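A minimal ROC sketch using scikit-learn, with hypothetical b-wave amplitudes for five normal and five toxic cases; amplitudes are negated so that lower values count as more abnormal:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # 0 = normal, 1 = toxic
bwave = np.array([320, 310, 285, 340, 305,          # hypothetical amplitudes (µV)
                  250, 270, 230, 300, 260])

fpr, tpr, thresholds = roc_curve(labels, -bwave)    # sensitivity vs. 1 - specificity
auc = roc_auc_score(labels, -bwave)                 # area under the ROC curve
print(f"AUC = {auc:.2f}")   # 0.5 = chance-level discrimination, 1.0 = perfect
```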
ROC curves can also be constructed from clinical prediction rules. For example, one might find that changes in the electro-oculogram, an abnormal color vision test, depressed macular function on the multifocal ERG, and changes in the retinal pigment epithelium are predictive of Plaquenil toxicity. ROC curves can be constructed by computing the sensitivity and specificity of increasing numbers of these clinical findings (0–5). The ROC curves would indicate whether this set of clinical findings adequately discriminates normal from toxic responses and what minimum number of clinical findings would have the most predictive power. A more detailed discussion of ROC analysis can be found in Metz10 and Zweig and Campbell.15
The issue here, however, is that of change. How much change in some physiological parameter would need to occur for the researcher to conclude that a treatment has positive or negative consequences? Clearly, a one-time evaluation of visual function is inadequate. Serial testing is required, which usually involves obtaining pretreatment baseline measures and then repeated testing at appropriate intervals during treatment. Electrophysiological and psychophysical test results are notoriously variable and are influenced by many factors. Variability can originate from factors inherent in the subject or patient, such as diurnal rhythms and attention, or from factors external to the patient, such as stimulus variations and the quality of the tester.
Waiting for test scores to fall outside normal limits before acting is not acceptable if termination of treatment and drug washout can still reverse the negative consequences for visual function; such a dramatic change in an individual might indicate that irreparable damage has already occurred. Instead, the questions become: How much change is acceptable? How much change is attributable to the treatment, and how much is random variation in the parameter estimate?
One strategy is to obtain test-retest reliability measurements for the specific test in normal observers. Published databases of this sort are becoming available (see, e.g., Birch, Hood, Locke, Hoffman, and Tzekov4). One can then state, for example, that a 15%, 20%, or 50% variation in a particular physiological or psychophysical parameter lies within the range of random variation. If the change is greater than the prescribed amount, then the treatment can be assumed to have some effect. An alternative strategy would be to obtain test-retest reliability measurements on the particular patient before treatment and then use this estimate of variability to evaluate the treatment for that individual, as sketched below. A within-subject design is generally accepted as being a more powerful experimental design but would require more extensive testing that might be beyond the scope of most clinical laboratories.
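One common way to set such a threshold is the Bland-Altman coefficient of repeatability, sketched below on hypothetical pretreatment test-retest amplitudes; this is an illustrative approach, not the method prescribed by the sources cited above.

```python
import numpy as np

# Hypothetical test-retest b-wave amplitudes (µV), measured before treatment,
# from one patient or from a group of normal observers
test = np.array([300, 320, 280, 310, 295, 330, 305, 290, 315, 300])
retest = np.array([310, 315, 290, 300, 300, 320, 300, 300, 310, 295])

differences = retest - test
# Coefficient of repeatability: ~95% of retest differences are expected to
# fall within ±1.96 SD of the test-retest differences when nothing has changed
repeatability = 1.96 * differences.std(ddof=1)
print(f"changes beyond ±{repeatability:.0f} µV exceed test-retest variation")
```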
An additional question would be whether a single negative finding would be sufficient to conclude a treatment effect. In patients with glaucoma or a progressive retinal disease, it is common to observe visual function over many
test sessions before concluding that progressive change has occurred and determining the amount of that change. In drug therapies, this kind of luxury is not afforded to the clinician, who must decide promptly whether a drug treatment should be discontinued. Correlation with other clinical findings is very important in this context. Finally, to conclude definitively that the drug caused the progressive change in visual function, the trend in visual function must reverse once drug washout has occurred.
Parametric and nonparametric statistical testing
There are two broad classes of statistical tests: parametric and nonparametric tests. Each of these classes of tests relies on certain assumptions. Parametric tests, for example, rely on the assumption (among other assumptions) that test scores obtained from a population are normally distributed. That is, in a distribution of test scores, the assumption is made that there are just as many high and above-average scores as there are low and below-average scores. In comparing two groups of scores using a parametric test, it is assumed that the two groups of scores are normally distributed around a mean and have more or less equal variability. A parametric test, such as an analysis of variance (ANOVA) or a t-test, is inappropriate when this assumption is not met.
Sometimes, however, data sets and experimental designs do not allow this assumption of normality to be made. For example, distributions of test scores may have unequal variability; one set might be normally distributed, but a second set of test scores might have substantially greater numbers of higher than lower scores. In such cases, it is sometimes possible and legitimate to transform the raw test scores into another numeric form, as in the conversion to logarithms, to normalize and equate the distributions and then to continue with an appropriate parametric test.
In other situations, the nature of the test scores themselves prevents the use of a parametric test. For example, in a drug toxicity study, an investigator might simply ask whether there is a positive or negative change in a visual field rather than actually quantifying, for example, the percent change in area of the visual field. Or the outcome measure might require a two-alternative forced choice, such as “Yes, I see it” or “No, I don’t see it.” Or the test scores might be in the form of a rating scale, say, from 1 to 5, as in the scale for rating the density of cataracts. These types of experimental test results would be better suited to nonparametric than to parametric statistical analyses.
In general, nonparametric tests are less powerful than parametric tests. If the assumptions for the use of a parametric test can be met, then a parametric test will be more sensitive in detecting small differences. Nonparametric tests tend to overlook or miss small differences and so produce a Type II error: accepting the hypothesis of no difference between groups when in reality it should not have been accepted. (A broad discussion of the use of nonparametric tests can be found in Siegel.13)
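For ordinal data such as the cataract rating scale mentioned above, a rank-based test is appropriate. A minimal sketch with hypothetical ratings, using the Mann-Whitney U test, one common nonparametric alternative to the independent-groups t-test:

```python
from scipy import stats

# Hypothetical cataract-density ratings on a 1-5 ordinal scale for two groups;
# ordinal scores like these violate the assumptions of a parametric t-test
group_a = [2, 3, 3, 4, 2, 3, 5, 4]
group_b = [1, 2, 2, 3, 1, 2, 3, 2]

u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u}, p = {p:.3f}")
```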
Regression analyses
In some research questions, an investigator might be interested in understanding how a particular physiological or psychophysical parameter changes with time. For example, an issue that has been investigated extensively in the literature is whether the mode of inheritance affects disease progression associated with retinitis pigmentosa. Alternatively, an investigator might be interested in knowing whether the presence or absence of a particular antigen affects disease progression as measured by the changes in visual field area or a physiological parameter, such as the amplitude of the b-wave.
This class of question is best addressed with a regression analysis. The simplest form of a regression analysis is called a linear regression analysis. In this type of regression, it is assumed that a straight line defines the relationship between two variables, such as ERG amplitude and age. By using a least-squares minimization procedure, the best fit straight line is found that describes the relationship between the two variables, and the relationship is defined by an equation of the form
Y = b0 + b1X
where b0 and b1 are constants and X and Y are the variables that are to be related, such as age and ERG amplitude, respectively, in the above example. The constant b1 is called the slope of the line and indicates the rate of change of Y with X. This constant can be used to indicate, for example, how much amplitude loss occurs on average with each year of age. The constant b0 is the Y-intercept, that is, the value of Y when X is equal to zero.
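A minimal least-squares fit of this equation, with hypothetical amplitude-versus-age data:

```python
import numpy as np
from scipy import stats

age = np.array([20, 28, 35, 42, 50, 58, 65, 72])                # X, years (hypothetical)
amplitude = np.array([340, 332, 318, 305, 290, 281, 262, 250])  # Y, µV (hypothetical)

fit = stats.linregress(age, amplitude)   # least-squares line Y = b0 + b1*X
print(f"b0 (Y-intercept) = {fit.intercept:.1f} µV")
print(f"b1 (slope) = {fit.slope:.2f} µV per year")   # average amplitude change per year
```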
In the discussion thus far, the relationship between two variables is assumed to be linear; that is, a straight line is assumed to describe the relationship between two variables. It is conceivable, however, that for some physiological and psychophysical variables, a straight line is not the best way of describing the relationship. For example, a curved line might describe the relationship better. In this instance, the regression equation is best fit by a polynomial equation of the form
Y = b0 + b1X + b2X^2 + … + bpX^p + … + ba−1X^(a−1)
where X and Y are defined as above and b0 through ba−1 are the coefficients. As can be seen, in a nonlinear relationship between two variables, it is impossible to state a single value that defines the rate of change of one variable with the other. One possible remedy is to transform the scores into some form that will better define a linear relationship, as illustrated below.
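Both options are sketched below on hypothetical data: a second-degree polynomial fit, and a logarithmic transformation of Y that restores an approximately linear relationship.

```python
import numpy as np

x = np.array([20, 28, 35, 42, 50, 58, 65, 72], dtype=float)          # hypothetical ages
y = np.array([350, 310, 275, 245, 215, 190, 170, 150], dtype=float)  # hypothetical µV

# Second-degree polynomial Y = b0 + b1*X + b2*X^2, fit by least squares
b2, b1, b0 = np.polyfit(x, y, deg=2)
print(f"Y = {b0:.1f} + {b1:.2f}X + {b2:.4f}X^2")

# Transforming Y to log(Y) can linearize an exponential-looking decline,
# after which an ordinary straight-line fit (and its slope) is meaningful again
slope, intercept = np.polyfit(x, np.log(y), deg=1)
print(f"log(Y) = {intercept:.2f} + {slope:.4f}X")
```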
Defining and modeling the relationship between two variables can be complicated, filled with assumptions and fraught with difficulties. The reader is referred to Myers and Well12 and Darlington8 for discussions of computational and practical issues in linear and nonlinear regression analysis.
Univariate and multivariate statistical analyses
Univariate statistical analysis refers to the situation in which there is only one outcome measure, called a dependent variable. There can be one or more independent variables, each of which is either a variable that is manipulated by an experimenter or some naturally occurring category. For example, in a drug intervention study, the empirical question might involve testing the effects on the b-wave of the ERG (the dependent variable) of different doses of a particular drug (the independent variable). Or the study might be expanded to examine the interaction effects of gender (a second independent variable) and drug dose on ERG amplitudes. Analysis of variance (ANOVA) and t-tests are examples of univariate statistical analyses.
In some experimental designs, however, there may be more than one outcome measure or dependent variable. For example, if an experimental design called for performing an entire ERG protocol that conforms to the standards of the International Society for Clinical Electrophysiology of Vision, there could be as many as five or six dependent variables. Or one might have an ERG measure and a psychophysical measure, such as the area of a visual field, as outcome measures. In these situations, the empirical question might be which ERG measure is a better indicator of drug toxicity or whether an ERG measure does better than a psychophysical measure.
In vision research, data are typically analyzed with multiple univariate analyses. For example, one might use a t-test to test the significance of gender differences on the amplitude of an ERG response. A second t-test might be used to test the significance of gender differences on the area of the visual field. An experimenter might continue with multiple t-tests to examine differences in outcome measures for each independent variable separately.
In general, this analytical approach is acceptable provided that the number of comparisons is small. When the number of comparisons is relatively large, the results of multiple t-tests may be misleading and statistically suspect. Each t-test is accompanied by a margin of error that is dictated by the p-value, or α. If α = 0.05, then there is a 5% chance of a difference being observed in sample means by chance alone. If several independent univariate tests are performed on the same sample of subjects, the error rate increases with the number of univariate tests. In addition, when numerous outcome measures are obtained from the same sample, these outcome measures are also interrelated in complex ways, and the probability of error increases even further.
These increases in error rates are frequently overlooked in clinical vision studies or, alternatively, are dealt with by adopting a more stringent criterion level for significance. For example, one might set α = 0.01 or 0.001, which would reduce the rejection level of the hypothesis of no difference to 1% and 0.1%, respectively. However, such approaches reduce the power of the test to detect real differences that might exist between the samples (see the section on power), thereby increasing the probability of a Type II error. With the use of multivariate statistics, complex interactions between multiple dependent (outcome) measures and independent (experimental) variables can be assessed while keeping the overall error rate at some stated level, say, 5%, independent of the number of comparisons that are made.

Experimental designs in clinical and experimental vision science are becoming increasingly complex, particularly in studies in which one is trying to find the most sensitive measure of change. These experimental designs call for multivariate analyses, and there are numerous statistical analysis packages, such as SPSS, SYSTAT, and SAS, that can make the job of computation less painful. However, while the computational part of multivariate statistics can be made relatively easy, the interpretation of the statistics in terms of understanding relationships between many variables can be very difficult. The reader is referred to Tabachnick and Fidell14 for a comprehensive discussion of multivariate statistical designs and interpretation issues.
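Short of a full multivariate analysis, the inflation of the error rate across several univariate tests can at least be controlled with a familywise correction. A sketch using Holm's procedure from statsmodels, with hypothetical p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from five univariate tests run on the same sample
p_values = [0.04, 0.01, 0.20, 0.03, 0.62]

# Holm's step-down procedure keeps the familywise Type I error rate at 0.05
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(reject)        # which null hypotheses may still be rejected
print(p_adjusted)    # p-values adjusted for the number of comparisons
```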
Summary
Exploring experimental questions involves complicated experimental designs and statistical analyses. Experimental design addresses the issue of how an experiment is to be constructed so that it best answers the questions of interest. Statistical analyses allow one to quantify and summarize the data obtained from an experiment and allow tests of the significance of experimental findings. The major goal of any experimental design and statistical analysis is to provide the most powerful possible mechanism to answer experimental
questions and to test specific hypotheses. It is recommended that an experiment be simulated with generated data to ensure not only that the data can be analyzed appropriately, but also that the interpretation of the analysis will be unambiguous.
REFERENCES
1. Anderson LR, Ager JW: Analysis of variance in small group research. Personality Social Psychol Bull 1978; 4:341–345.
2. Birch DG, Anderson JL: Standardized full-field electroretinography: Normal values and their variation with age. Arch Ophthalmol 1992; 110:1571–1576.
3. Birch DG, Anderson JL, Fish GE: Yearly rates of rod and cone functional loss in retinitis pigmentosa and cone-rod dystrophy. Ophthalmology 1999; 106:258–268.
4. Birch DG, Hood DC, Locke KG, Hoffman DR, Tzekov RT: Quantitative electroretinogram measures of phototransduction in cone and rod photoreceptors: Normal aging, progression with disease, and test-retest variability. Arch Ophthalmol 2002; 120:1045–1051.
5. Brewer JK: Issues of power: Clarification. Am Educ Res J 1974; 11:189–192.
6. Cohen J: Statistical Power Analysis for the Behavioral Sciences. New York, Academic Press, 1969.
7. Cohen J: Statistical power analysis and research results. Am Educ Res J 1973; 10:225–230.
8. Darlington RB: Regression and Linear Models. New York, McGraw-Hill, 1990.
9. Kirk R: Experimental Design: Procedures for the Behavioral Sciences. Pacific Grove, CA, Brooks/Cole, 1982.
10. Metz CE: Basic principles of ROC analysis. Semin Nucl Med 1978; 8:283–289.
11. Meyer DL: Statistical tests and surveys of power: A critique. Am Educ Res J 1974; 11:179–188.
12. Myers JL, Well AD: Research Design and Statistical Analysis. New York, HarperCollins, 1991.
13. Siegel S: Nonparametric Statistics for the Behavioral Sciences. New York, McGraw-Hill, 1956.
14. Tabachnick BG, Fidell LS: Using Multivariate Statistics. New York, Harper and Row, 1989.
15. Zweig MH, Campbell G: Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clin Chem 1993; 39(4):561–577.
