Principles of Medical Statistics (Feinstein, 2002)
different distributions of four sets of data, each with the same number of observations. The lower panel shows that despite the obvious differences, the four sets of data have identical means and standard deviations.
5.3.2.4 Reasons for Survival of Standard Deviation — In view of the potential (and sometimes actual) major distortions produced when 95% inner zones are calculated from X ± 2s , you may wonder why the standard deviation has managed to survive so long and be used so often as a descriptive index.
Aside from the reasons already cited, an additional advantage of the standard deviation was computational. In the era before easy ubiquitous electronic calculation, the standard deviation was easier to determine, particularly for large data sets, than the locations of percentile ranks and values. With easy electronic calculation, this advantage no longer pertains today.
The main apparent remaining advantage of the standard deviation arises from inferential rather than descriptive statistics. The standard deviation is an integral part of the Z tests, t tests, and other activities that will be encountered later when we begin to consider P values, confidence intervals, and other inferential statistics. This inferential advantage, however, may also begin to disappear in the next few decades. Easy electronic calculation has facilitated different approaches — using permutations and other rearrangements of data — that allow inferential statistical decisions to be made without recourse to standard deviations.
Accordingly, if your career is now beginning, the standard deviation may have become obsolete by the time your career has ended.
5.3.3 Gaussian Z-score Demarcations
If a distribution is Gaussian, a cumulative probability will be associated with the Z-score calculated as Zi = (Xi – X)/s . This probability can promptly be converted to the spans of inner and outer zones of data. In fact, the X ± 2s principle discussed in Section 5.3.2 depends on the Gaussian idea that the interval will span about 95% of the data.
Since very few medical distributions are truly Gaussian, however, Z-score probabilities are seldom used descriptively as indexes of spread. The Z-score probabilities, however, will later become prominent and useful for the strategies of statistical inference in Chapter 6.
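The conversion from a Z-score to a cumulative probability can be sketched with Python's standard library. This is an illustrative sketch (the function names are not from the text), and it assumes the distribution really is Gaussian:

```python
import math

def z_score(x, mean, s):
    """Standardized deviate: Z_i = (X_i - mean) / s."""
    return (x - mean) / s

def gaussian_cum_prob(z):
    """Cumulative probability below z in a standard Gaussian distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Proportion of a Gaussian distribution inside the mean +/- 2s inner zone:
inner_zone = gaussian_cum_prob(2.0) - gaussian_cum_prob(-2.0)
print(round(inner_zone, 4))  # -> 0.9545
```

The result shows why the X ± 2s interval is described as spanning "about 95%" of Gaussian data: the exact proportion is closer to 95.45%.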
5.3.4 Multiples of Median
Multiples of the median, called MoMs, are sometimes proposed as a mechanism for replacing Z-scores. The MoM score is calculated as Xi/X̃ when each item of data Xi is divided by the median, X̃. The method is controversial, having both supporters3 and detractors.4 The MoM score produces a distinctive relative magnitude for each item of data, but the span of scores is grossly asymmetrical. The MoMs will range from 0 to 1 for Xi values below the median, but will extend from 1 to infinity for values above the median. Furthermore, unlike a percentile or Z-score, the MoM result is not a standardized location. A MoM score of .7 will indicate that the item’s value is proportionately .7 of the median, but does not show the item’s rank or inner location.
Lest mothers be upset by rejection of something called MoM, however, the technique, although unsatisfactory for denoting a standardized inner location or zone, may be a good way of denoting a “standardized” magnitude for individual items of data.
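The MoM calculation and its asymmetry are easy to demonstrate. A minimal sketch with hypothetical values (the data set below is invented for illustration):

```python
from statistics import median

def mom_scores(data):
    """Multiple of the median for each item: X_i divided by the median."""
    med = median(data)
    return [x / med for x in data]

# Hypothetical lab values; the median here is 100
values = [70, 85, 100, 120, 400]
print(mom_scores(values))  # -> [0.7, 0.85, 1.0, 1.2, 4.0]
```

Note the asymmetry described above: the below-median items are squeezed into the 0-to-1 interval, while the high outlier at 400 produces an unbounded score of 4.0.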
5.4 Indexes of Spread
The main role of an index of spread is to allow the prompt demarcation of either a suitable inner zone for the data or an index of relative spread, which will denote (as discussed in Section 5.5) the “compactness” of the data set.
© 2002 by Chapman & Hall/CRC
5.4.1 Standard Deviation and Inner Percentile Range
For most practical purposes, indexes of spread are calculated and used only for dimensional data. Since the range, standard-deviation zones, and Z-score Gaussian demarcations have major disadvantages for eccentric data, the inner-percentile ranges would seem to be the best routine way to describe spread. Nevertheless, for reasons noted earlier, the standard deviation is the most popular expression today, usually shown in citations such as X ± s or X ± 2s . If the goal is to demarcate a 95% inner zone that will always be both realistic and accurate, however, the ipr95 — extending from the 2.5 to 97.5 percentile values — is the best choice.
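An ipr95 can be computed directly from the percentile values. The sketch below uses one common percentile convention (linear interpolation between closest ranks); texts and packages differ slightly on this definition, and the data set is hypothetical:

```python
def percentile(values, p):
    """Percentile by linear interpolation between closest ranks.
    (One of several conventions; packages differ slightly.)"""
    s = sorted(values)
    idx = (p / 100.0) * (len(s) - 1)
    lo, frac = int(idx), idx - int(idx)
    if lo + 1 < len(s):
        return s[lo] + frac * (s[lo + 1] - s[lo])
    return s[lo]

def ipr95(values):
    """Inner 95-percentile range: the 2.5 to 97.5 percentile values."""
    return percentile(values, 2.5), percentile(values, 97.5)

data = list(range(1, 101))   # hypothetical data set: 1, 2, ..., 100
print(tuple(round(v, 3) for v in ipr95(data)))  # -> (3.475, 97.525)
```

Unlike the X ± 2s interval, this zone can never extend beyond the observed range of the data.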
5.4.2 Ordinal Data
Since ordinal data can be ranked, their range can be reduced to a spread that is expressed (if desired) with an inner zone determined from percentiles. The quartile deviation (or semi-interquartile range) is commonly used for this purpose.
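The quartile deviation is simply half the interquartile distance. A minimal sketch, with hypothetical ordinal codes:

```python
def quartile_deviation(q1, q3):
    """Semi-interquartile range: half the distance between the quartiles."""
    return (q3 - q1) / 2.0

# Hypothetical ordinal codes ranked 1-5, with quartiles at 2 and 4:
print(quartile_deviation(2, 4))  # -> 1.0
```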
Although the mathematical propriety is disputed, ordinal data are sometimes managed as though they were dimensional, with means, standard deviations, and corresponding inner zones being calculated from the arbitrary digits assigned as ordinal codes.
5.4.3 Nominal Data
Since nominal data cannot be ranked, an inner zone cannot be demarcated. Several “indexes of diversity” have been proposed, however, to summarize the proportionate distribution of the nominal categories. Hardly ever appearing in medical literature, these indexes are briefly noted at the end of the chapter in Section 5.10.
5.5 Indexes of Relative Spread
For ranked dimensional or ordinal data, indexes of relative spread have two important roles: they indicate the compactness of the distribution and they will also help (as noted later in Chapter 6) indicate the stability of the central index.
An index of relative spread is calculated by dividing an index of spread, derived from standard deviations or percentiles, by a central index, such as a mean or median. The result indicates the relative density or compactness of the data.
5.5.1 Coefficient of Variation
The best known index of relative spread is the coefficient of variation, often abbreviated as c.v. It is calculated from the ratio of standard deviation and mean as
c.v. = s/X
For example, the c.v. is 80.48/120.1 = .67 for the data in Table 3.1 and .05 (= 5.199/101.572) for Table 4.1. A compact distribution with a tall and narrow shape will have a relatively small standard deviation and therefore a relatively small c.v. For a short, wide shape, the c.v. will be relatively large.
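Since the raw values of those tables are not reproduced here, the contrast between compact and widely spread distributions can be shown with two hypothetical data sets of the same mean:

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """c.v. = s / mean, using the sample standard deviation."""
    return stdev(data) / mean(data)

compact = [98, 100, 102, 101, 99]   # tall, narrow shape (hypothetical)
spread = [40, 80, 100, 120, 160]    # short, wide shape (hypothetical)
print(round(coefficient_of_variation(compact), 3))  # -> 0.016
print(round(coefficient_of_variation(spread), 3))   # -> 0.447
```

Both sets have a mean of 100, but the relative spread differs by a factor of nearly thirty.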
As the external representative of the distribution, the mean cannot do a good job if the distribution is too widely spread, but no specific boundaries have been established for c.v.s that are “too large.” Wallis and Roberts5 have proposed a boundary of ≤.10. Snedecor and Cochran,6 suggesting a zone between .05 and .15, say that higher values “cause the investigator to wonder if an error has been made in calculation, or if some unusual circumstances throw doubt on the validity of the experiment.”
These proposed standards seem theoretically excellent, but they may be too strict for the kinds of variation found in medical data. If the results have doubtful validity when the c.v. exceeds .10 or .15, a great many medical research projects will be brought under an enormous cloud, and hundreds of published papers will have to be either disregarded or recalled for repairs.
5.5.2 Coefficient of Dispersion
Directly analogous to the coefficient of variation, the coefficient of dispersion, sometimes abbreviated
as CD, is determined with absolute deviations from the median, rather than squared deviations from the mean. The average absolute deviation from the median, AAD, is calculated as Σ|Xi − X̃|/n. It is then divided by the median, X̃, to form CD = AAD/X̃. Because of the algebraic “awkwardness” of absolute deviations, this index is seldom used.
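The "awkward" absolute deviations are trivial to handle computationally. A minimal sketch, with a hypothetical data set:

```python
from statistics import median

def coefficient_of_dispersion(data):
    """CD = (average absolute deviation from the median) / median."""
    med = median(data)
    aad = sum(abs(x - med) for x in data) / len(data)
    return aad / med

# Hypothetical values with a median of 100:
print(coefficient_of_dispersion([90, 95, 100, 110, 130]))  # -> 0.11
```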
5.5.3 Percentile-Derived Indexes
Since neither the mean nor the standard deviation may be a good representative of a dimensional data set, the coefficient of variation might be replaced by percentile-derived values for indexes of relative dispersion. An index that corresponds directly to c.v. could be obtained by dividing the quartile deviation by the median.
A formal entity called the quartile coefficient of variation is calculated from the lower quartile value (Q1) and the upper quartile value (Q3) as (Q3 − Q1)/(Q3 + Q1). Since (Q3 − Q1)/2 corresponds to a standard deviation, and (Q3 + Q1)/2 corresponds to a mean, this ratio is a counterpart of the result obtained with the c.v. For example, for the data in Table 4.1, the lower quartile is 99 and the upper quartile is 105. The quartile coefficient of variation is (105 − 99)/(105 + 99) = .029, which corresponds to the c.v. of .05 for those data.
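The arithmetic of the Table 4.1 example can be checked directly:

```python
def quartile_coefficient_of_variation(q1, q3):
    """(Q3 - Q1) / (Q3 + Q1), a percentile counterpart of the c.v."""
    return (q3 - q1) / (q3 + q1)

# Quartiles of 99 and 105 cited for Table 4.1:
print(round(quartile_coefficient_of_variation(99, 105), 3))  # -> 0.029
```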
5.5.4 Other Analytic Roles
Perhaps the most important statistical role of coefficients of variation (or other indexes of relative spread) occurs in subsequent decisions when we evaluate the stability of a central index, or when central indexes are compared for two groups. These decisions will be discussed in Chapter 7 and later in Part II of the text.
5.6 Searching for Outliers
Another reason for demarcating inner zones and indexes of relative spread is to play a fashionable indoor sport called “searching for outliers.” The activity has scientific and statistical components.
5.6.1 Scientific Errors
Scientifically, the outlier may be a wrong result. The measurement system was not working properly; the observer may have used it incorrectly; or the information may have been recorded erroneously. For example, an obvious error has occurred if a person’s age in years is recorded as 250. Outliers that represent obvious errors are usually removed from the data and replaced by blank or unknown values. Sometimes, however, the error can easily be corrected. Miller7 gives an example of a data set {12.11, 12.27, 12.19, 21.21, and 12.18} in which “we can be fairly certain that a transcription error has occurred in the fourth result, which should be 12.21.”
In laboratory work where substances are usually measured dimensionally, diverse statistical approaches are used to find suspicious values. One approach, cited by Miller7 and called Dixon’s Q, forms a ratio,
Q = |suspect value − nearest value| / range
If the value of Q is higher than the value authorized in a special statistical table, the result is deemed suspicious and probably erroneous.
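The ratio can be sketched in a few lines and applied to Miller's transcription-error example from Section 5.6.1 (the decision to treat whichever end has the larger gap as the suspect is an assumption of this sketch; the special table of critical Q values is not reproduced here):

```python
def dixons_q(data):
    """Dixon's Q for the most extreme value:
    Q = |suspect - nearest value| / range."""
    s = sorted(data)
    gap_low = s[1] - s[0]      # gap if the lowest value is the suspect
    gap_high = s[-1] - s[-2]   # gap if the highest value is the suspect
    return max(gap_low, gap_high) / (s[-1] - s[0])

# Miller's example data set:
print(round(dixons_q([12.11, 12.27, 12.19, 21.21, 12.18]), 3))  # -> 0.982
```

A Q this close to 1 would exceed the tabulated critical value for five observations, flagging 21.21 as suspicious.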
5.6.2 Statistical Misfits
Outliers are commonly sought for statistical reasons, not to detect wrong values, but to remove “misfits.” In many statistical procedures that involve fitting the data with a “model,” such as a mean or an algebraic equation, the choice or fit of the model may be substantially impaired by outliers. The problem becomes more prominent later when we reach the algebraic models used for regressions and correlations, but we have already seen some examples in the way that outliers can alter the location of a mean or the magnitude of a standard deviation.
In the algebraic activities discussed later, the outlier is often removed from the data, which are then fitted into a new model. Most scientists, however, are not happy about removing correct items of data simply to improve the “feelings” of a mathematical model. Furthermore, if something is wrong with a system of measurement, why should we be suspicious only of the extreme values? Should not the interior values be viewed just as skeptically even if they do not stand out statistically?
5.6.3 Compromise or Alternative Approaches
The different motives in searching for outliers will often lead to different approaches for management. In one tactic already mentioned (Section 3.8.4), the problem is avoided with “robust” methods that are relatively unaffected by outlier values. Thus, values of the median and inner percentile ranges, which are usually unaffected by outliers, are preferred to means and standard deviations. [Later on, when we reach issues in statistical inference, the “robustness” of “non-parametric” methods will often make them
preferred over “parametric” methods.]
Nevertheless, in examining distributions of univariate data for a single group, we may want some sort of mechanism to make decisions about outliers, regardless of whether the main goal is to find scientifically wrong values or statistical misfits. To escape from outlier effects on means and standard deviations, the best tactics are to examine the data directly or to employ medians and inner-percentile ranges. Both these approaches can be used during the displays and examinations discussed in the next section.
5.7 Displays and Appraisals of Patterns
The stem-leaf plots discussed earlier in Section 3.3.1.2 have been replacing the histograms that were used for many years to display the shape and contents of dimensional data. Histograms are gradually becoming obsolete because they show results for arbitrary intervals of data rather than the actual values, and because the construction requires an artistry beyond the simple digits that can easily be shown with a typewriter or computer for a stem-leaf plot.
5.7.1 One-Way Graph
Data can always be displayed, particularly if the stem-leaf plot is too large, with a “one-way graph.” The vertical array of points on the left of Figure 5.2 is a one-way graph for the data of Table 3.1. In this type of graph, multiple points at the same vertical level of location are spread horizontally close to one another around the central axis. A horizontal line is often drawn through the vertical array to indicate the location of the mean. The horizontal line is sometimes adorned with vertical flanges at each end, so that it appears as |-| rather than as a plain line. The adornment is merely an artist’s esthetic caprice and should be omitted because it can be confusing. The apparently demarcated length of the flange may suggest that something possibly useful is being displayed, such as a standard deviation or standard error.
5.7.2 Box Plot
If you want to see the pattern of dimensional data, the best thing to examine is a stem-leaf plot. To summarize the data, however, the best display is another mechanism: the box plot, which Tukey called a “box-and-whiskers plot.” Based on quantiles rather than “parametric” indexes, the box plot is an excellent way to show the central index, interquartile (50%) zone, symmetry, spread, and outliers of a distribution. The “invention” of the box plot is regularly attributed to John Tukey, who proposed1 it in 1977, but a similar device, called a range bar, was described in 1952 in a text by Mary Spear. 8 Spear’s horizontal “range bar,” reproduced here as Figure 5.3, shows the full range of the data, the upper and lower quartiles that demarcate the interquartile range, and the median.
FIGURE 5.2 Display of data points and corresponding box-and-whiskers plot for data in Table 3.1.

FIGURE 5.3 Antecedent of box plot: the “range bar” proposed in 1952 by Spear, showing the full range from lowest to highest amount, the interquartile range, and the median. [Figure taken from Chapter Reference 8.]
5.7.2.1 Construction of Box — In the customary box plots today, data are displayed vertically with horizontal lines drawn at the level of the median, and at the upper and lower quartiles (which Tukey calls “hinges”). The three horizontal lines are then connected with vertical lines to form the box. The interquartile spread of the box shows the segment containing 50% of the data, or the “H-spread.” The box plot for the data of Table 3.1 is shown on the right side of Figure 5.2, using the same vertical units as the corresponding one-way graph on the left. For the boundaries of the box, the lower quartile is at 17 between the 14th and 15th rank, and the upper quartile is at 28 between the 41st and 42nd rank. The box thus has 28 as its top value, 21 as the median, and 17 at the bottom. The mean of 22.7 is shown with a + sign above the median bar.
The width of the horizontal lines in a box plot is an esthetic choice. They should be wide enough to show things clearly but the basic shape should be a vertical rectangle, rather than a square box. When box plots are compared for two or more groups, the horizontal widths are usually similar, but McGill et al.9 have proposed that the widths vary according to the square root of each group’s size.
5.7.2.2 Construction of “Whiskers” — Two single lines are drawn above and below the box to form the “whiskers” that summarize the rest of the distribution beyond the quartiles. The length of the whiskers will vary according to the goal at which they are aimed.
If intended to show the ipr95, the whiskers will extend up and down to the values at the 97.5 and 2.5 percentiles. For the 2.5 and 97.5 percentiles, calculated with the proportional method, the data of Table
3.1 have the respective values of 12 and 41. Because the upper and lower quadragintile boundaries may not be located symmetrically around either the median or the corresponding quartiles, the whiskers may have unequal lengths. In another approach, the whiskers extend to the smallest and largest observations that are within an H-spread (i.e., interquartile distance) below and above the box.
For many analysts, however, the main goal is to let the whiskers include everything but the outliers. The egregious outliers can often be noted by eye, during examination of either the raw data or the stem-leaf plot. The box-plot summary, however, relies on a demarcating boundary that varies with different statisticians and computer programs. Tukey originally proposed that outliers be demarcated with an inner and outer set of boundaries that he called fences. For the inner fences, the ends of the whiskers are placed at 1.5 H-spreads (i.e., one and a half interquartile distances) above the upper hinge (i.e., upper quartile) and below the lower hinge (i.e., lower quartile). The outer fences are placed correspondingly at 3.0 H-spreads above and below the hinges. With this convention, the “mild” or “inner” outliers are between the inner and outer fences; the “extreme” or “outer” outliers are beyond the outer fence.
[Tukey’s boundaries are used for box-plot displays in the SAS data management system,10 where the inner outliers, marked 0, are located between 1.5 and 3 H-spreads; the more extreme outliers, marked *, occur beyond 3 H-spreads. In the SPSS system,11 however, an “inner outlier” is in the zone between 1.0 and 1.5 H-spreads and is marked with an X; the “outer” (or extreme) outliers are beyond 1.5 H-spreads and are marked E.]
In the data of Table 3.1 and Figure 5.2, the spread between the hinges at the upper and lower quartiles is 28 − 17 = 11. Using the 1.5 H-spread rule, the whiskers would each have a length of 1.5 × 11 = 16.5 units, extending from 0.5 (= 17 − 16.5) to 44.5 (= 28 + 16.5). Because the data in Table 3.1 have no major outliers, the whiskers can be shortened to show the entire range of data from 11 to 43. This shortening gives unequal lengths to the whiskers in Figure 5.2.
On the other hand, if boundaries are determined with the 1 H-spread rule, each whisker would be 11 units long, extending from a low of 6 (= 17 − 11) to a high of 39 (= 28 + 11). The lower value could be reduced, because the whisker need only reach the smallest value of data (11), but the upper whisker would not encompass the data values of 41 and 42, which would then appear to be outliers.
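Tukey's fence rules can be sketched directly from the hinges. The sketch below uses the Table 3.1 hinges of 17 and 28 cited in the text:

```python
def tukey_fences(q1, q3):
    """Inner fences at 1.5 H-spreads and outer fences at 3.0 H-spreads
    beyond the hinges (lower and upper quartiles)."""
    h = q3 - q1  # the H-spread (interquartile distance)
    inner = (q1 - 1.5 * h, q3 + 1.5 * h)
    outer = (q1 - 3.0 * h, q3 + 3.0 * h)
    return inner, outer

# Hinges of 17 and 28 from Table 3.1 (H-spread = 11):
inner, outer = tukey_fences(17, 28)
print(inner)  # -> (0.5, 44.5)
print(outer)  # -> (-16.0, 61.0)
```

Values between the inner and outer fences would be "mild" outliers; values beyond the outer fences would be "extreme" outliers.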
5.7.2.3 Immediate Interpretations — The horizontal line of the median between the two quartiles divides the box plot into an upper and lower half. If the two halves have about the same size, the distribution is symmetrical around the central index. A substantial asymmetry in the box will promptly indicate that the distribution is not Gaussian.
The “top-heavy” asymmetry of the box in Figure 5.2 immediately shows that the distribution is skewed right (toward high values). Because the mean and median have the same location in a Gaussian distribution, the higher value of the mean here is consistent with the right skew. In most distributions, the mean is inside the box; and a location beyond the box will denote a particularly egregious skew.
5.8 “Diagnosis” of Eccentricity
The “diagnosis” of a non-Gaussian or eccentric distribution of data has several important purposes: (1) to warn that the distribution may not be properly represented by the mean and standard deviation; (2) to help identify outliers; (3) to suggest methods of “therapy,” such as transformations, that may either cure the statistical ailment or reduce its severity.
The statistical methods of “diagnosis” are analogous to the tactics used in clinical medicine. Some of the procedures are done with simple “physical examination” of the data. Others use easy, routine ancillary tests that might correspond to a urinalysis or blood count. Yet others involve a more complex technologic “work-up.”
The simple methods of “physical examination” will be recurrently emphasized in this text because they can be valuable “commonsense” procedures, often using nothing more than inspection or a simple in-the-head calculation. These “mental” methods, although seemingly analogous to screening tests, are actually different because a distinctive result is seldom false. Thus, in many of the mental methods to be discussed, a “positive” result is almost always a “true positive,” but a “negative” result should seldom be accepted as truly negative without additional testing.
The methods for diagnosing eccentric distributions can be divided into the “mental” tactics, the simple ancillary test of examining the box plot, and the more elaborate tests available in a formal statistical “work-up.”
5.8.1 “Mental” Methods
Three simple “mental” methods for determining eccentricity are to compare the mean with the standard deviation and with the mid-range index, and to see if X ± 2s exceeds the limits of the range.
5.8.1.1 Mean vs. Standard Deviation — The coefficient of variation (see Section 5.5.1) is formally calculated as the standard deviation divided by the mean. An even simpler approach, however, is to compare the two values. In a symmetrical relatively compact data set, the standard deviation is
substantially smaller than the mean. If the value of s is more than 25% of X , something must be wrong. Either the distribution is eccentric, or it is highly dispersed. Thus, in Table 3.1, where s = 7.693 is about 1/3 of X = 22.732, eccentricity can promptly be suspected. In Table 4.1, however, where s = 5.199 and
X = 101.572, the s vs. X comparison does not evoke suspicions. A non-Gaussian diagnosis for Table 4.1 requires more subtle testing.
5.8.1.2 Mean vs. Mid-Range Index — The mid-range can easily be mentally calculated as half
the distance between the maximum and minimum values of the data set. The formula is mid-range
index = (Xmax + Xmin)/2. If the mid-range differs substantially from the mean, the data have an eccentric distribution. In Table 3.1, the mid-range is (43 + 11)/2 = 27, which can produce a prompt diagnosis of eccentricity, because the mean is X = 22.7. In Table 4.1, however, the mid-range is (127 + 75)/2 = 101,
which is quite close to the mean of 101.6.
5.8.1.3 Range vs. X ± 2s — A third type of mental check is sometimes best done with a calculator to verify the arithmetic. In Table 3.1, the value of X ± 2s is 22.7 ± 2(7.693). With a calculator, the spread is determined as 7.314 to 38.086. Without a calculator, however, you can approximate the data as 23 ± 2(8) and do the arithmetic mentally to get a spread that goes from 7 to 39. Because the actual range in that table is from 11 to 43, the lower value of the X ± 2s spread — with either method of calculation — goes beyond the true limit. Therefore, the distribution is eccentric. In Table 4.1, however, X ± 2s would be 101.6 ± 2(5.2) which can be calculated as going from 91.2 to 112.0, or 92 to 112. Either set of spreads lies within the true range of 75 to 127.
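The three "mental" checks can be bundled into a quick screening sketch. The 25% threshold for the first check comes from the text; the 10% tolerance for the mid-range comparison is an assumed threshold for illustration, since the text leaves that judgment to the analyst:

```python
def eccentricity_checks(mean, s, x_min, x_max, tol=0.10):
    """Three quick screens for an eccentric distribution. A True flag is
    suspicious; an all-False result does NOT prove normality. The 10%
    mid-range tolerance (tol) is an assumption, not from the text."""
    midrange = (x_max + x_min) / 2.0
    return {
        "s_exceeds_25pct_of_mean": s > 0.25 * mean,
        "midrange_far_from_mean": abs(midrange - mean) > tol * mean,
        "mean_pm_2s_beyond_range": (mean - 2 * s < x_min) or (mean + 2 * s > x_max),
    }

# Table 3.1 summary: mean 22.732, s 7.693, range 11 to 43
print(eccentricity_checks(22.732, 7.693, 11, 43))  # all three flags are True
```

For Table 3.1 all three screens flag eccentricity, matching the conclusions reached mentally above.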
5.8.2 Examination of Box Plot
After a box plot has been drawn, its direct inspection is the simplest and easiest method for identifying eccentricity.
In a symmetrical or quasi-Gaussian distribution, the box plot will show three features of “normality”:
(1) the mean and median are relatively close to one another; (2) the two halves of the box, formed above and below by the location of the upper and lower quartiles, should be relatively symmetrical around the median; and (3) no egregious asymmetrical outliers should be present. Obvious violations of any of these three requirements can easily be discerned as an “abnormality” in the box plot, indicating an eccentric distribution. If all three requirements are fulfilled, the data are almost surely Gaussian or quite symmetrical.
The decisions about what is relatively close, symmetrical, or egregious will depend on the analyst’s “statistical judgment.”
5.8.3 Special Tests for Eccentricity
For analysts who prefer more precise “laboratory tests” rather than the simple judgments just discussed, diverse tactics have been devised to offer warnings about non-Gaussian distributions. The methods are automated and the results regularly appear, often without having been solicited, in computer programs that display summaries of univariate data.
The additional results can regularly be ignored, but are there if you want them. They include indexes of “normality” and various graphical plots. The procedures are briefly mentioned here to outline what they do, should you ever decide (or need) to use them. Further explanation of the strategies is usually offered in the associated manuals for the computer programs, and their interpretations can be found in various statistical textbooks.
5.8.3.1 Indexes of “Normality” — Two relatively simple tests produce an index of skewness, reflecting different locations for mean and median, and an index of kurtosis, which denotes whether the distributional shape is leptokurtic (too tall and thin), platykurtic (too short and fat), or mesokurtic (just right). One of these calculations is called Geary’s test of kurtosis.12
Royston has described computational13 and graphical14 methods that offer more sophisticated approaches for evaluating normality.
5.8.3.2 Graphical Plots — A different way of checking for normality is to do simple or complicated graphical plots. In a simple plot, which shows the residual values of Xi – X graphed against Xi, the results of a Gaussian curve should show nonspecific variations, with no evident pattern, around a horizontal line drawn at the mean of X.
The more complicated plots compare the observed values of the data with the corresponding quantile
or standard-deviation points calculated for a Gaussian distribution that has the observed X and s as its determining features.
5.8.3.3 Illustration of Univariate Print-Out — Figure 5.4 shows the univariate print-out produced by the SAS system for the 200 values of hematocrit whose stem-leaf plot was previously shown in Figure 3.2. The upper left corner shows the group size of N = 200, the mean of 41.09, and standard deviation of 5.179 (with s² = 26.825), with c.v. = (s/X × 100) = 12.605 and standard error of the mean (marked “Std Mean”) = .366. The uncorrected sum of squares (USS) for ΣXi² is 343015.8 and the corrected sum of squares (CSS) for Sxx is 5338.22. The other data in the upper left corner show indexes of skewness and kurtosis and results for three other tests [T: mean; M(Sign); and Sgn Rank] that will be discussed later. The middle part of the upper half of the printout shows the various quantile values, and the far right upper section shows the five lowest and highest observed values in the data, together with their individual identification codes (marked “Obs”) among the 200 members.
The stem-leaf plot on the lower left is a condensed version (marked for even-numbered stems only) of what was shown earlier in Figure 3.2. Next to it is the box plot, which is ordinarily the most useful part of the visual display. It shows the essentially similar values for mean and median in this set of data. The two halves of the box here falsely appear asymmetrical because of an artifact of the graphic plotting system, which contained no provision for odd values. Hence, the mean and median, with values of 41, were plotted at values of 40. [This graphic artifact should warn you always to check the actual values of the three quartiles before concluding that a box plot is asymmetrical.]
With an interquartile range of 6 (= 44 − 38), the two whiskers extend a distance of 9 (= 1.5 × 6) units above and below the margins of the box, leaving low outlier values (marked “0”) at values of 28 and 24, with high outliers at 54 and 56.
The normal probability plot in the lower right corner of Figure 5.4 shows + marks to denote where points would be located if the distribution were perfectly Gaussian. Thus, the locations for 0.5, 1, and 1.5 standard deviations below the mean for these data should be respectively at values of 41.09 − (0.5)(5.179) = 38.5; 41.09 − 5.179 = 35.9; and 41.09 − (1.5)(5.179) = 33.3. The asterisks (or “stars”) show the actual locations of the points in the distribution. In this instance, they follow the Gaussian line of + marks fairly closely, but dip below it at the low end and extend above at the high end.
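The Gaussian reference positions used in such a plot can be computed directly from the observed mean and standard deviation. A minimal sketch, reproducing the hematocrit values just cited:

```python
def gaussian_positions(mean, s, z_values):
    """Data values expected at given Z distances from the mean if the
    distribution were Gaussian with the observed mean and s."""
    return [mean + z * s for z in z_values]

# Hematocrit data of Figure 5.4: mean 41.09, s 5.179
expected = gaussian_positions(41.09, 5.179, [-1.5, -1.0, -0.5])
print([round(v, 1) for v in expected])  # -> [33.3, 35.9, 38.5]
```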
Variable=HCT (HEMATOCRIT)

Moments                                   Quantiles (Def=5)               Extremes
N         200       Sum Wgts  200         100% Max  56     99%  53.5      Lowest (Obs)   Highest (Obs)
Mean      41.09     Sum       8218         75% Q3   44     95%  50          24 (193)       52 (146)
Std Dev   5.179307  Variance  26.82523     50% Med  41     90%  48          28 ( 65)       53 ( 78)
Skewness  0.029179  Kurtosis  0.270536     25% Q1   38     10%  34.1        30 ( 64)       53 (185)
USS       343015.8  CSS       5338.22       0% Min  24      5%  33          31 ( 66)       54 ( 11)
CV        12.60479  Std Mean  0.366232                      1%  29          31 ( 50)       56 (115)
T:Mean=0  112.1965  Pr>|T|    0.0001       Range     32
Num ^= 0  200       Num > 0   200          Q3 - Q1    6
M(Sign)   100       Pr>=|M|   0.0001       Mode      42
Sgn Rank  10050     Pr>=|S|   0.0001

[The printout’s stem-leaf plot, box plot, and normal probability plot cannot be reproduced here. The stem-leaf plot is a condensed version of Figure 3.2; the box plot and the normal probability plot are described in the text.]
FIGURE 5.4
Printout of univariate display for group of 200 hematocrit values previously shown in Figure 3.2. The display was created by the PROC UNIVARIATE program of the SAS system. All results in the upper half of the display appear routinely. The stem-leaf, box, and normal probability plots are all requested options.
5.9 Transformations and Other “Therapy”
When worried about the misrepresentation that can occur if an eccentric distribution is externally represented by its mean and standard deviation, data analysts can use various tactics to discern and try to “repair” non-Gaussian distributions of dimensional data.
The usual mechanism of repair, as noted earlier, is to eliminate outliers from the data or to perform “Gaussianizing” transformations. Transformations were particularly popular during the pre-computer era of statistical analysis, but the tactic is used much less often today, for three main reasons.
1. Many new forms of analysis, as noted in Parts II-IV of the text, do not depend on Gaussian (parametric) procedures. The data can be processed according to their ranks (or other arrangements) that do not rely on means and standard deviations. Having been automated in various “packaged” computer programs, the new methods are easy to use and readily available.
2. Many forms of medical data are not expressed in dimensional variables. They are cited in binary, ordinal, or nominal categories for which means and standard deviations are not pertinent.
3. Despite apparent contradictions of logic and violations of the basic mathematical assumptions, the parametric methods are quite “robust” not for description of eccentric data, but for inferences that will be discussed later. Without any transformations or alterations, the “unvarnished” methods often yield reasonably accurate inferential conclusions — i.e., the same conclusions about “statistical significance” that might come from transformed data or from the “gold standard” non-parametric methods.
For all these reasons, analysts today do not transform data as often as formerly. The univariate distributions should still be carefully inspected, however, to determine whether the dimensional data are so eccentric that they require non-parametric rather than parametric procedures.
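As a brief illustrative sketch (not from the text), the following Python fragment shows the kind of “Gaussianizing” repair described above: a right-skewed simulated sample whose skewness coefficient shrinks toward zero after a logarithmic transformation. The simulated data, the seed, and the `skewness` helper are all our own assumptions for illustration.

```python
import math
import random
import statistics

# Hypothetical right-skewed (log-normal) sample of 200 values, analogous in
# size to the hematocrit example; the mean and standard deviation of such a
# sample misrepresent its bulk.
random.seed(1)
skewed = [math.exp(random.gauss(0, 1)) for _ in range(200)]

# The classical "repair": a logarithmic transformation.
logged = [math.log(x) for x in skewed]

def skewness(data):
    """Simple moment-based skewness coefficient (0 for a symmetric sample)."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum((x - m) ** 3 for x in data) / (len(data) * s ** 3)

print(round(skewness(skewed), 2))  # strongly positive: eccentric distribution
print(round(skewness(logged), 2))  # near zero: roughly Gaussian after transform
```

As the surrounding text notes, such transformations are now used less often; rank-based (non-parametric) procedures handle the eccentric raw data directly.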
5.10 Indexes of Diversity
Indexes of spread are seldom determined for categorical data, but the dispersion of frequency counts among the categories can be expressed, if desired, with statistical indexes of diversity.
Maximum diversity in any group occurs when its members are distributed equally among the available categories. Thus, with c categories for a group of N members, the group has maximum diversity if the frequency count, fi, in each category is fi = N/c. The various indexes of diversity differ in the way they express the distribution, but all of them compare the observed distribution score with the maximum possible score.
Although these indexes seldom appear in medical or biological literature, they can sometimes be used to describe the proportionate distribution of a set of categorical descriptions in ecologic studies of birds or animals, or even, in the first example cited here, for clinical “staging” of people.
5.10.1 Permutation Score for Distributions
One expression of diversity uses scores obtained from factorial permutations. The number of permutations of N items is N! = 1 × 2 × 3 × … × N. If divided into c categories containing f1, f2, … , fc members, the total number of permutations will be N!/(f1! f2! … fc!). [We shall meet this same idea later when permutation procedures are discussed.]
Suppose a group of 21 patients contains 4 in Stage I, 6 in Stage II, and 11 in Stage III. The permutation score would be 21!/(4! 6! 11!) = 5.1091 × 1019/(24 × 720 × 39,916,800) = 7.4070360 × 107. If the group were perfectly divided into three parts, the maximum possible score would have been 21!/(7! 7! 7!) = 3.9907 × 108 . The ratio of the actual and maximum scores would be 7.4070360 × 107/3.9907 × 108 = .186.
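The worked example above can be checked with a short Python sketch. The function name `permutation_diversity` is ours, not the book's; the calculation assumes, as in the 21-patient example, that the group size divides evenly into the number of categories.

```python
import math

def permutation_diversity(counts):
    """Ratio of the observed permutation score N!/(f1! f2! ... fc!) to the
    maximum score obtained when the N members are spread evenly (N/c per
    category) over the c categories."""
    n = sum(counts)
    c = len(counts)
    observed = math.factorial(n)
    for f in counts:
        observed //= math.factorial(f)
    # Maximum diversity: equal counts of N/c in each category
    # (assumed here to divide evenly, as in the 21-patient example).
    maximum = math.factorial(n) // (math.factorial(n // c) ** c)
    return observed / maximum

# The 21-patient example: 4 in Stage I, 6 in Stage II, 11 in Stage III.
print(round(permutation_diversity([4, 6, 11]), 3))  # → 0.186
```

A perfectly even split, such as `[7, 7, 7]`, yields the maximum ratio of 1.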
© 2002 by Chapman & Hall/CRC
