
SUMMARY MEASURES OF SPREAD


7th value is 3388g, a difference of 85g, so the 20th percentile is 3303g plus 0.2 of 85g, which is 3303g + 0.2 × 85g = 3303g + 17g = 3320g.
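The interpolation step in this calculation can be sketched in a few lines of Python. The function name `interpolate_between` is an illustrative choice rather than anything from the text; the numbers are the two neighbouring birthweights and the fraction quoted above.

```python
def interpolate_between(lower: float, upper: float, fraction: float) -> float:
    """Return the value lying `fraction` of the way from `lower` up to `upper`."""
    return lower + fraction * (upper - lower)


# Values quoted in the text: 3303 g and 3388 g, with the percentile 0.2 of the way between.
p20 = interpolate_between(3303, 3388, 0.2)
print(f"20th percentile is about {p20:.0f} g")
```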

You might be thinking, this all seems a bit messy, but a computer will perform these calculations effortlessly. As well as percentiles, you might also encounter deciles, which sub-divide the data values into 10, not 100, equal divisions, and quintiles, which sub-divide the values into five equal-sized groups. Collectively, we call percentiles, deciles and quintiles, n-tiles.
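As a rough illustration of letting a computer do the work, the sketch below uses `statistics.quantiles` from Python's standard library (Python 3.8 or later). The sample is hypothetical, and the function's default interpolation rule may differ from the one used by a given statistics package, so small discrepancies between programs are to be expected.

```python
import statistics

# Hypothetical sample of 30 values, used only to show the mechanics.
values = list(range(1, 31))

quintiles = statistics.quantiles(values, n=5)      # 4 cut points, 5 groups
deciles = statistics.quantiles(values, n=10)       # 9 cut points, 10 groups
percentiles = statistics.quantiles(values, n=100)  # 99 cut points, 100 groups

print(len(quintiles), len(deciles), len(percentiles))

# The 20th percentile and the first quintile cut the data at the same point.
print(percentiles[19], quintiles[0])
```

Because the hypothetical values equal their own ranks, the printed 20th percentile also shows the position it occupies in the ordered list.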

Exercise 5.9 Calculate the 25th and 75th percentiles for the ICU per cent mortality values in Table 2.7, and explain your results.

Choosing the most appropriate measure

How do you choose the most appropriate measure of location for some given set of data? The main thing to remember is that the mean cannot be used with ordinal data (because they are not real numbers), and that the median can be used for both ordinal and metric data (particularly when the latter are skewed).

As an illustration of the last point, look again at Figure 3.7 which shows the distribution of the number of measles cases in 37 schools. Not only is this distribution positively skewed, it has a single high-valued outlier. The median number of measles cases is 1.00, but the mean number is 2.91, almost three times as many! The problem is that the long positive tail and the outlier are dragging the mean to the right. In this case, the median value of 1 seems to be more representative of the data than the mean. I have summarised the choices of a measure of location in Table 5.2.
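The pull of a long tail and an outlier on the mean can be seen with a small sketch. The counts below are hypothetical (they are not the measles data in Figure 3.7) and were chosen only to be positively skewed with one high value.

```python
import statistics

# Hypothetical counts of cases in ten schools: mostly small, with one high outlier.
cases = [0, 0, 1, 1, 1, 2, 2, 3, 4, 20]

mean_cases = statistics.mean(cases)
median_cases = statistics.median(cases)
print(f"mean = {mean_cases}, median = {median_cases}")

# Dropping the outlier changes the mean far more than it changes the median.
trimmed = cases[:-1]
print(f"mean = {statistics.mean(trimmed):.2f}, median = {statistics.median(trimmed)}")
```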

Table 5.2 A guide to choosing an appropriate measure of location

                     Summary measure of location
Type of variable     mode    median                                     mean
Nominal              yes     no                                         no
Ordinal              yes     yes                                        no
Metric discrete      yes     yes, if distribution is markedly skewed    yes
Metric continuous    no      yes, if distribution is markedly skewed    yes

Summary measures of spread

As well as a summary measure of location, a summary measure of spread or dispersion can also be very useful. There are three main measures in common use, and once again, as you will see, the type of data influences the choice of an appropriate measure.


CH 5 DESCRIBING DATA WITH NUMERIC SUMMARY VALUES

The range

The range is the distance from the smallest value to the largest. The range is not affected by skewness, but is sensitive to the addition or removal of an outlier value. As an example, the range of the 30 birthweights in Table 2.5 is (2860.0 to 4490.0) g. The range is best written like this, rather than as the single-valued difference, i.e. as 1630 g, in this example, which is much less informative.
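A minimal sketch of reporting the range as a pair of values follows. The list of weights is hypothetical; only the two extremes, 2860 g and 4490 g, come from the text, and `value_range` is an illustrative name.

```python
def value_range(values):
    """Return the range as a (smallest, largest) pair rather than a single difference."""
    return min(values), max(values)


# Hypothetical birthweights in grams; only the two extremes come from the text.
weights = [2860, 3150, 3400, 3650, 3900, 4490]

low, high = value_range(weights)
print(f"range = ({low} to {high}) g, a span of {high - low} g")

# One extra extreme value changes the range, illustrating its sensitivity to outliers.
print(value_range(weights + [5200]))
```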

Exercise 5.10 What are the ranges for age among those infants breast-fed, and those bottle-fed in Table 3.2?

The interquartile range (iqr)

One solution to the problem of the sensitivity of the range to extreme values (outliers) is to chop a quarter (25 per cent) of the values off both ends of the distribution (which removes any troublesome outliers), and then measure the range of the remaining values. This distance is called the interquartile range, or iqr. The interquartile range is not affected either by outliers or skewness, but it does not use all of the information in the data set since it ignores the bottom and top quarter of values.

Calculating interquartile range by hand (avoid if possible!)

To calculate the interquartile range by hand, you need first to determine two values:

The value which cuts off the bottom 25 per cent of values; this is known as the first quartile and denoted Q1.

The value which cuts off the top 25 per cent of values, known as the third quartile and denoted Q3.³

The interquartile range is then written as (Q1 to Q3). With the birthweight data: Q1 = 3396.25 g, and Q3 = 3923.50 g. Therefore: interquartile range = (3396.25 to 3923.50) g. This result tells you that the middle 50 per cent of infants (by weight) weighed between 3396.25 g and 3923.50 g.
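For readers who would rather avoid the hand calculation, here is a hedged sketch using `statistics.quantiles` from Python's standard library. The 11 weights are hypothetical (they are not the Table 2.5 data), and the library's default quantile rule is an assumption that may not match the rule used to obtain the figures above.

```python
import statistics


def interquartile_range(values):
    """Return (Q1, Q3) using the standard library's default quantile rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3


# Hypothetical sample of 11 weights in grams (not the Table 2.5 data).
weights = [2860, 3100, 3250, 3400, 3500, 3650, 3700, 3850, 3900, 4100, 4490]
print(interquartile_range(weights))

# Replacing the largest value with an extreme outlier leaves Q1 and Q3 unchanged.
with_outlier = weights[:-1] + [9999]
print(interquartile_range(with_outlier))
```

The second call illustrates the point made earlier: the interquartile range ignores what happens in the top and bottom quarters of the data.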

An example from practice

Table 5.3 describes the baseline characteristics of 56 patients in an investigation into the use of analgesics in the prevention of stump and phantom pain in lower-limb amputation (Nikolajsen et al. 1997). The ‘blockade’ group of patients were given bupivacaine and morphine; the control (comparison) group were given an identically administered saline placebo.

3 The median is sometimes denoted as Q2.

As you can see, two variables, ‘pain in week before amputation’, and ‘daily opioid consumption at admission (mg)’, were summarised with median and interquartile range values. Pain was measured using a visual analogue scale (VAS⁴), which of course produces ordinal data, so the mean is not appropriate, and the authors have used the median and interquartile range as their summary measures of location and spread.

The median level of pain in the blockade group is 51, with an iqr of (23.8 to 87.8).⁵ This means that 25 per cent of this group had a pain level of less than 23.8, and 25 per cent a pain level greater than 87.8. The middle 50 per cent had a pain level between 23.8 and 87.8. I’ll return to the opioid consumption variable shortly.

Table 5.3 The baseline characteristics of 56 patients in an investigation into the use of analgesics in the prevention of stump and phantom pain in lower-limb amputation. Reproduced from The Lancet, 1994, 344, 1724–26, courtesy of Elsevier

Characteristics of patients                                    Blockade group     Control group
                                                               (n = 27)           (n = 29)
Men/women                                                      15/12              18/11
Mean (SD) age in years                                         72.8 (13.2)        70.8 (11.4)
Diabetes                                                       10                 14
Concurrent treatment because of cardiovascular disease         18                 19
Previous stroke                                                3                  2
Previous contralateral amputation                              7                  3
Median (IQR) pain in week before amputation (VAS, 0–100 mm)    51 (23.8–8–78)     44 (25.3–68)
Median (IQR) daily opioid consumption at admission (mg)        50 (20–68.8)       30 (5–62.5)
Level of amputation
  Below knee                                                   15                 16
  Through knee-joint                                           5                  2
  Above knee                                                   7                  11
Reamputations during follow-up                                 3                  2
Died during follow-up                                          10                 10

Exercise 5.11 Calculate the iqr for the ICU percentage mortality values in Table 2.7. (You have already calculated the 25th and 75th percentiles in Exercise 5.9).

Exercise 5.12 Interpret the median and interquartile range values for pain in the week before amputation, for the control group in Table 5.3.

4 See Chapter 1.

5 The table contains a typographical error, recording 87.8 as ‘8–78’.


Estimating the median and interquartile range from the ogive

As I indicated earlier, you can estimate the median and the interquartile range from the cumulative frequency curve (the ogive). Figure 5.2 shows the ogive for the cumulative birthweight data in Table 3.3.

Figure 5.2 Using the relative cumulative frequency curve (or ogive) of birthweight to estimate the median and interquartile range values (Note that this should be a smooth curve). [Vertical axis: % cumulative frequency, 0 to 100. Horizontal axis: Birthweight (g), 2700 to 4500.]


If you draw horizontal lines from the values 25 per cent, 50 per cent and 75 per cent on the y axis, to the ogive, and then down to the x axis, the points of intersection on the x axis approximate values for Q1, Q2 (the median), and Q3, of 3400 g, 3650 g and 3900 g. Thus, if you happen to have an ogive handy, these approximations can be helpful. I plotted per cent cumulative frequency because it makes it slightly easier to find the percentage values. Notice that you can also use the ogive to answer questions like, ‘What percentage of infants weighed less than, say, 4000 g?’ The answer is that a value of 4000 g on the x axis produces a value of 80 per cent for cumulative frequency on the y axis.
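Reading values off an ogive amounts to linear interpolation between plotted points, which the following sketch imitates. The plotted points are hypothetical: they were chosen to be roughly consistent with the readings quoted above and are not the Table 3.3 data. The helper name `interpolate` is likewise an illustrative choice.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Linearly interpolate y at x from points (xs[i], ys[i]); xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the plotted range")
    i = max(1, bisect_left(xs, x))  # index of the first plotted point at or beyond x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


# Hypothetical ogive points: birthweight (g) against % cumulative frequency.
weights_g = [2700, 3000, 3300, 3600, 3900, 4200, 4500]
cum_pct = [0, 5, 15, 45, 75, 90, 100]

# From the y axis across to the curve and down to the x axis: Q1, median, Q3.
q1, median, q3 = (interpolate(cum_pct, weights_g, p) for p in (25, 50, 75))
print(q1, median, q3)

# From the x axis up to the curve and across to the y axis: % weighing less than 4000 g.
below_4000 = interpolate(weights_g, cum_pct, 4000)
print(below_4000)
```

Swapping the two lists in the call is what switches between the two directions of reading the graph.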

Exercise 5.13 Estimate the median and iqr for total blood cholesterol for the control group from the ogive in Figure 3.12.

The boxplot

Now that we have discussed the median and interquartile range, I can introduce the boxplot as I promised in Chapter 3. If you would rather carry on with the general discussion of measures of spread, skip ahead to the section on standard deviation and come back to the boxplot later. Boxplots provide a graphical summary of the three quartile values, the minimum and maximum values, and any outliers. They are usually plotted with the values on the vertical axis. Like the pie chart, the boxplot can only represent one variable at a time, but a number of boxplots can be set alongside each other.

An example from practice

Figure 5.3 is from the same study as Figure 4.3, into the use of the mammography service in the 33 health districts of Ontario, in which investigators were interested in the variation in the mammography utilisation rate across age groups (Goel et al. 1997). They supplemented their results with the boxplots shown in the figure, for the age groups: (30–39); (40–49); (50–69); and 70+ years. The vertical axis is the mammography utilisation rate (visits per 1000 women), in the 33 health districts. Outliers are denoted by the small open circles.

Figure 5.3 Boxplots of the rate of use of mammography services in 33 health districts in Ontario. Reproduced from J. Epid. Comm. Health, 51, 378–82, courtesy of BMJ Publishing Group. [Vertical axis: Rates per 1000 women, 0 to 300. Horizontal axis: Age groups 30–39, 40–49, 50–69, 70+.]

Let’s look at the third boxplot, that for the women aged 50–69:

The bottom end of the lower ‘whisker’ (the line sticking out of the bottom of the box), corresponds to the minimum value – about 125 visits per 1000 women.

The bottom of the box is the 1st quartile value, Q1. So about 25 per cent of the health districts had a utilisation rate of 175 or fewer visits per 1000 women.

The line across the inside of the box (it won’t always be half-way up), is the median, Q2. So half of the health districts had a utilisation rate of less than about 200 consultations per 1000 women, and half a rate of more than 200. The more asymmetric (skewed) the distributional shape, the further away from the middle of the box the median line will be: a line closer to the top of the box is indicative of negative skew, and a line closer to the bottom of the box is indicative of positive skew.

The top of the box is the third quartile Q3. That is, about a quarter of the health districts had a consultation rate of 225 or more per 1000 women.

The top end of the upper whisker is the ‘maximum’ mammography utilisation rate – about 275 consultations per 1000 women. This is the maximum value that can be considered still to be part of the general mass of the data. Because. . .

. . .there is one outlier. One of the health districts reported a utilisation rate of about 300 per 1000 women.⁶ This is, of course, the actual maximum value in the data.
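The pieces of a boxplot described above can be computed directly. The sketch below is one possible reading: it uses the convention that an outlier lies more than one and a half times the interquartile range beyond Q1 or Q3, and the standard library's default quantile rule. The rates are hypothetical, loosely echoing the 50–69 boxplot rather than reproducing the Ontario data, and `boxplot_summary` is an illustrative name.

```python
import statistics


def boxplot_summary(values):
    """Quartiles, whisker ends and outliers under the 1.5 x iqr convention."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    step = 1.5 * (q3 - q1)  # how far beyond the box a value may lie before it is an outlier
    inside = [v for v in ordered if q1 - step <= v <= q3 + step]
    outliers = [v for v in ordered if v < q1 - step or v > q3 + step]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # most extreme values that are not outliers
        "outliers": outliers,
    }


# Hypothetical utilisation rates (visits per 1000 women) for 11 districts.
rates = [125, 150, 175, 185, 195, 200, 205, 215, 225, 275, 310]
summary = boxplot_summary(rates)
for name, value in summary.items():
    print(name, value)
```

Other programs may draw whiskers or flag outliers differently, so a plot produced elsewhere need not match these numbers exactly.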

Exercise 5.14 Sketch the boxplot for the percentage mortality in ICUs shown in Table 2.7. (Note that you have already calculated the median and iqr values in Exercises 5.6 and 5.11). What can you glean from the boxplot about the shape of the distribution of the ICU percentage mortality rate?

Exercise 5.15 The boxplots in Figure 5.4 are from a study of sperm integrity in adult survivors of childhood cancer compared to a control group of non-cancer individuals (Thomson et al. 2002). What do the two boxplots tell you?

Figure 5.4 Boxplots from a study of sperm integrity in adult survivors of childhood cancer, compared to a control group of non-cancer individuals. Reprinted from The Lancet 2002, 360, 361–6, Fig. 2, p. 364, courtesy of Elsevier. [Vertical axis: DNA damage (%), 0 to 25. Groups: Controls (n=64); Non-azoospermic long-term survivors of childhood cancer (n=23). p=0.06.]

6 Outliers are defined in various ways by different computer programs. Outliers are here defined as any value more than three halves of the interquartile range above the third quartile, or below the first quartile.

Standard deviation

The limitation of the interquartile range as a summary measure of spread is that (like the median) it doesn’t use all of the information in the data, since it omits the top and bottom quarter of values. An alternative approach uses the idea of summarising spread by measuring the mean (average) distance of all the data values from the overall mean of all of the values. The smaller this mean distance is, the narrower the spread of values must be, and vice versa. This idea is the basis for what is known as the standard deviation, or s.d. The following way of calculating the sample standard deviation by hand illustrates this idea:⁷

Subtract the mean of the sample from each of the n values in the sample, to give the difference values.

Square each of these differences.

Add these squared values together (called the sum of squares).

Divide the sum of squares by (n – 1); i.e. divide by 1 less than the sample size.⁸

Take the square root. This is the standard deviation.

One advantage of the standard deviation is that, unlike the interquartile range, it uses all of the information in the data.
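The five hand-calculation steps listed above translate almost line for line into code. This is a sketch for illustration, with hypothetical data values; in practice `statistics.stdev` from Python's standard library does the same job, and the last line checks the two against each other.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation, following the five hand-calculation steps."""
    n = len(values)
    mean = sum(values) / n
    differences = [v - mean for v in values]     # step 1: subtract the mean
    squares = [d * d for d in differences]       # step 2: square each difference
    sum_of_squares = sum(squares)                # step 3: add the squares together
    variance = sum_of_squares / (n - 1)          # step 4: divide by (n - 1)
    return math.sqrt(variance)                   # step 5: take the square root


# Hypothetical data values, used only to exercise the function.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_sd(values), 3))
print(math.isclose(sample_sd(values), statistics.stdev(values)))
```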

7 This is a very tedious procedure. If you have an s.d. key on your calculator use that. Better still, use a computer!

8 If we divide by n, as we normally would do to find a mean, we get a result which is slightly too small. Dividing by (n – 1) adjusts for this. Technically, the s.d. calculated by dividing by n is said to be a biased estimator of the population s.d. See Chapter 7 for the meaning of sample and population.


Exercise 5.16 In Figure 4.6 the authors tell us that the mean cord platelet count is 308 × 10⁹/l, and the standard deviation is 69 × 10⁹/l (notice the two measures have the same units).¹ Explain what this value means.

An example from practice

In Table 5.3, the analgesic/amputation pain study, the authors summarise the age of the patients in the study with the mean and standard deviation. As you can see, the spread of ages in the blockade group is wider than in the control group, 13.2 years around a blockade group’s mean of 72.8 years, compared to 11.4 years around a control group’s mean of 70.8 years.

The authors could also have used the mean and standard deviation for daily opioid consumption (mg), since this is a metric variable, but instead used the median and interquartile range. There are a number of possible reasons for this. First, the data may have been noticeably skewed and/or contained outliers, perhaps making the mean a little too unrepresentative of the general mass of data. Second, the investigators may have specifically wanted a summary measure of central-ness, which the median provides. Third, they may have felt that asking people to recall their opioid consumption last week was likely to lead to fuzzy, imprecise values, and so have preferred to treat them as if they were ordinal.

Exercise 5.17 Calculate and interpret the standard deviation for the ICU percentage mortality values in Table 2.7. (You have already calculated the mean percentage mortality in Exercise 5.7). I would hesitate to do this without a calculator with a standard deviation function.

To sum up summary measures of spread: with ordinal data use either the range or the interquartile range. The standard deviation is not appropriate because of the non-numeric nature of ordinal data. With metric data use either the standard deviation, which uses all of the information in the data, or the interquartile range; choose the latter if the distribution is skewed and/or you have already selected the median as your preferred measure of location. Don’t mix-and-match measures: standard deviation goes with the mean, and iqr with the median. These points are summarised in Table 5.4.

Table 5.4 Choosing an appropriate measure of spread

                     Summary measure of spread
Type of variable     Range    Interquartile range    Standard deviation
Nominal              No       No                     No
Ordinal              Yes      Yes                    No
Metric               Yes      Yes, if skewed         Yes

1 10⁹ means 1 000 000 000.