- Contents
- Preface to the 2nd Edition
- Preface to the 1st Edition
- Introduction
- Learning Objectives
- Variables and Data
- The good, the Bad, and the Ugly – Types of Variable
- Categorical Variables
- Metric Variables
- How can I Tell what Type of Variable I am Dealing with?
- 2 Describing Data with Tables
- Learning Objectives
- What is Descriptive Statistics?
- The Frequency Table
- 3 Describing Data with Charts
- Learning Objectives
- Picture it!
- Charting Nominal and Ordinal Data
- Charting Discrete Metric Data
- Charting Continuous Metric Data
- Charting Cumulative Data
- 4 Describing Data from its Shape
- Learning Objectives
- The Shape of Things to Come
- 5 Describing Data with Numeric Summary Values
- Learning Objectives
- Numbers R us
- Summary Measures of Location
- Summary Measures of Spread
- Standard Deviation and the Normal Distribution
- Learning Objectives
- Hey ho! Hey ho! It’s Off to Work we Go
- Collecting the Data – Types of Sample
- Types of Study
- Confounding
- Matching
- Comparing Cohort and Case-Control Designs
- Getting Stuck in – Experimental Studies
- 7 From Samples to Populations – Making Inferences
- Learning Objectives
- Statistical Inference
- 8 Probability, Risk and Odds
- Learning Objectives
- Calculating Probability
- Probability and the Normal Distribution
- Risk
- Odds
- Why you can’t Calculate Risk in a Case-Control Study
- The Link between Probability and Odds
- The Risk Ratio
- The Odds Ratio
- Number Needed to Treat (NNT)
- Learning Objectives
- Estimating a Confidence Interval for the Median of a Single Population
- 10 Estimating the Difference between Two Population Parameters
- Learning Objectives
- What’s the Difference?
- Estimating the Difference between the Means of Two Independent Populations – Using a Method Based on the Two-Sample t Test
- Estimating the Difference between Two Matched Population Means – Using a Method Based on the Matched-Pairs t Test
- Estimating the Difference between Two Independent Population Proportions
- Estimating the Difference between Two Independent Population Medians – The Mann–Whitney Rank-Sums Method
- Estimating the Difference between Two Matched Population Medians – Wilcoxon Signed-Ranks Method
- 11 Estimating the Ratio of Two Population Parameters
- Learning Objectives
- 12 Testing Hypotheses about the Difference between Two Population Parameters
- Learning Objectives
- The Research Question and the Hypothesis Test
- A Brief Summary of a Few of the Commonest Tests
- Some Examples of Hypothesis Tests from Practice
- Confidence Intervals Versus Hypothesis Testing
- Nobody’s Perfect – Types of Error
- The Power of a Test
- Maximising Power – Calculating Sample Size
- Rules of Thumb
- 13 Testing Hypotheses About the Ratio of Two Population Parameters
- Learning Objectives
- Testing the Risk Ratio
- Testing the Odds Ratio
- Learning Objectives
- 15 Measuring the Association between Two Variables
- Learning Objectives
- Association
- The Correlation Coefficient
- 16 Measuring Agreement
- Learning Objectives
- To Agree or not Agree: That is the Question
- Cohen’s Kappa
- Measuring Agreement with Ordinal Data – Weighted Kappa
- Measuring the Agreement between Two Metric Continuous Variables
- 17 Straight Line Models: Linear Regression
- Learning Objectives
- Health Warning!
- Relationship and Association
- The Linear Regression Model
- Model Building and Variable Selection
- 18 Curvy Models: Logistic Regression
- Learning Objectives
- A Second Health Warning!
- Binary Dependent Variables
- The Logistic Regression Model
- 19 Measuring Survival
- Learning Objectives
- Introduction
- Calculating Survival Probabilities and the Proportion Surviving: the Kaplan-Meier Table
- The Kaplan-Meier Chart
- Determining Median Survival Time
- Comparing Survival with Two Groups
- 20 Systematic Review and Meta-Analysis
- Learning Objectives
- Introduction
- Systematic Review
- Publication and other Biases
- The Funnel Plot
- Combining the Studies
- Solutions to Exercises
- References
- Index
1 First things first – the nature of data
Learning objectives
When you have finished this chapter, you should be able to:
- Explain the difference between nominal, ordinal, metric discrete and metric continuous variables.
- Identify the type of a variable.
- Explain the non-numeric nature of ordinal data.
Variables and data
A variable is something whose value can vary. For example, age, sex and blood type are variables. Data are the values you get when you measure¹ a variable. For example, 32 years (for the variable age), or female (for the variable sex). I have illustrated the idea in Table 1.1.
¹ I am using ‘measure’ in the broadest sense here. We wouldn’t measure the sex or the ethnicity of someone, for example. We would instead usually observe it or ask the person or get the value from a questionnaire. But we would measure their height or their blood pressure. More on this shortly.
Medical Statistics from Scratch, Second Edition David Bowers
© 2008 John Wiley & Sons, Ltd
Table 1.1 Variables and data

| The variables ... | ... and the data. | | |
|---|---|---|---|
| | Mrs Brown | Mr Patel | Ms Manda |
| Age | 32 | 24 | 20 |
| Sex | Female | Male | Female |
| Blood type | O | O | A |
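The distinction between variables and data in Table 1.1 can also be sketched in a few lines of Python. This is only an illustration: the names `patients` and `values_of` are hypothetical, chosen for this example, and the values are simply those shown in Table 1.1.

```python
# Each record holds the data for one person; the keys are the variables.
# The values come from Table 1.1; all names here are illustrative assumptions.
patients = {
    "Mrs Brown": {"age": 32, "sex": "Female", "blood_type": "O"},
    "Mr Patel": {"age": 24, "sex": "Male", "blood_type": "O"},
    "Ms Manda": {"age": 20, "sex": "Female", "blood_type": "A"},
}


def values_of(variable):
    """Return the data (the observed values) for one variable, in table order."""
    return [record[variable] for record in patients.values()]


ages = values_of("age")
sexes = values_of("sex")
blood_types = values_of("blood_type")

print("age:", ages)
print("sex:", sexes)
print("blood type:", blood_types)
```

Here the dictionary keys (`age`, `sex`, `blood_type`) play the role of the variables, and the lists returned by `values_of` are the data.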
The good, the bad, and the ugly – types of variable
There are two major types of variable – categorical variables and metric² variables. Each of these can be further divided into two sub-types, as shown in Figure 1.1, which also summarises their main characteristics.
| Categorical variables | | Metric variables | |
|---|---|---|---|
| Nominal | Ordinal | Discrete | Continuous |
| Values in arbitrary categories | Values in ordered categories | Integer values on proper numeric line or scale | Continuous values on proper numeric line or scale |
| (no units) | (no units) | (counted units) | (measured units) |

Figure 1.1 Types of variable
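The characteristics summarised in Figure 1.1 can be read as a small decision rule. The sketch below is one possible way to express it, not a method given in the text: the function name `variable_type` and its three yes/no parameters are assumptions made for this example.

```python
def variable_type(has_units, ordered=False, counted=False):
    """Classify a variable from the characteristics listed in Figure 1.1.

    has_units: True if the values carry units (metric), False if not (categorical).
    ordered:   for categorical variables, True if the categories can be
               ordered in a meaningful way.
    counted:   for metric variables, True if the values are counted (integers),
               False if they are measured.
    """
    if not has_units:
        return "ordinal" if ordered else "nominal"
    return "metric discrete" if counted else "metric continuous"


# No units, categories in arbitrary order
print(variable_type(has_units=False, ordered=False))  # nominal
# No units, categories in a meaningful order
print(variable_type(has_units=False, ordered=True))   # ordinal
# Counted units
print(variable_type(has_units=True, counted=True))    # metric discrete
# Measured units
print(variable_type(has_units=True, counted=False))   # metric continuous
```

The function only encodes the figure. Deciding the answers to the three questions for a real variable still requires judgement.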
Categorical variables
Nominal categorical variables
Consider the variable blood type. Let’s assume for simplicity that there are only four different blood types: O, A, B, and A/B. Suppose we have a group of 100 patients. We can first determine the blood type of each and then allocate the result to one of the four blood type categories. We might end up with a table like Table 1.2.
² You will also see metric data referred to as interval/ratio data. The computer package SPSS uses the term ‘scale’ data.
Table 1.2 Blood types of 100 patients (fictitious data)

| Blood type | Number of patients (or frequency) |
|---|---|
| O | 65 |
| A | 15 |
| B | 12 |
| A/B | 8 |

By the way, a table like Table 1.2 is called a frequency table, or a contingency table. It shows how the number, or frequency, of the different blood types is distributed across the four categories. So 65 patients have a blood type O, 15 blood type A, and so on. We’ll look at frequency tables in more detail in the next chapter.
The variable ‘blood type’ is a nominal categorical variable. Notice two things about this variable, which are typical of all nominal variables:
- The data do not have any units of measurement.³
- The ordering of the categories is completely arbitrary. In other words, the categories cannot be ordered in any meaningful way.⁴
So we could just as easily write the blood type categories as A/B, B, O, A or B, O, A, A/B, or B, A, A/B, O, or whatever. We can’t say that being in any particular category is better, or shorter, or quicker, or longer, than being in any other category.
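As a minimal sketch of these two points, the Python below rebuilds the counts in Table 1.2 from a list of individual observations and then lists the categories in two different orders. The data are the fictitious values from Table 1.2, and the helper `frequency_table` is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

# Fictitious data matching Table 1.2: one blood type recorded per patient.
observations = ["O"] * 65 + ["A"] * 15 + ["B"] * 12 + ["A/B"] * 8

# Count how many patients fall into each category.
frequency = Counter(observations)


def frequency_table(counts, category_order):
    """Return (category, frequency) rows in the requested category order."""
    return [(category, counts[category]) for category in category_order]


# Two of the many possible orderings of a nominal variable's categories.
table_a = frequency_table(frequency, ["O", "A", "B", "A/B"])
table_b = frequency_table(frequency, ["A/B", "B", "O", "A"])

for row in table_a:
    print(row)

# The rows appear in a different order, but the information is identical.
print(dict(table_a) == dict(table_b))
```

Reordering the rows changes nothing about what the table tells us, which is exactly what ‘arbitrary’ ordering means for a nominal variable.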
Exercise 1.1 Suggest a few other nominal variables.
Ordinal categorical variables
Let’s now consider another variable some of you may be familiar with – the Glasgow Coma Scale, or GCS for short. As the name suggests, this scale measures the degree of brain injury following head trauma. A patient’s Glasgow Coma Scale score is judged by their responsiveness, as observed by a clinician, in three areas: eye opening response, verbal response and motor response. The GCS score can vary from 3 (death or severe injury) to 15 (mild or no injury). In other words, there are 13 possible values or categories of brain injury.
Imagine that we determine the Glasgow Coma Scale scores of the last 90 patients admitted to an Emergency Department with head trauma, and we allocate the score of each patient to one of the 13 categories. The results might look like the frequency table shown in Table 1.3.
³ For example, cm, or seconds, or ccs, or kg, etc.
⁴ We are excluding trivial arrangements such as alphabetic ordering.
Table 1.3 A frequency table showing the (hypothetical) distribution of 90 Glasgow Coma Scale scores

| Glasgow Coma Scale score | Number of patients |
|---|---|
| 3 | 8 |
| 4 | 1 |
| 5 | 6 |
| 6 | 5 |
| 7 | 5 |
| 8 | 7 |
| 9 | 6 |
| 10 | 8 |
| 11 | 8 |
| 12 | 10 |
| 13 | 12 |
| 14 | 9 |
| 15 | 5 |

The Glasgow Coma Scale is an ordinal categorical variable. Notice two things about this variable, which are typical of all ordinal variables:
- The data do not have any units of measurement (so the same as for nominal variables).
- The ordering of the categories is not arbitrary, as it was with nominal variables. It is now possible to order the categories in a meaningful way.
In other words, we can say that a patient in the category ‘15’ has less brain injury than a patient in category ‘14’. Similarly, a patient in the category ‘14’ has less brain injury than a patient in category ‘13’, and so on.
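The following short Python sketch illustrates what a meaningful ordering allows. It uses the hypothetical counts from Table 1.3. The names `gcs_counts` and `less_injured`, and the choice of a score of 8 as the cut-off, are assumptions made purely for this example.

```python
# Hypothetical data from Table 1.3: GCS score -> number of patients.
gcs_counts = {
    3: 8, 4: 1, 5: 6, 6: 5, 7: 5, 8: 7, 9: 6,
    10: 8, 11: 8, 12: 10, 13: 12, 14: 9, 15: 5,
}


def less_injured(score_a, score_b):
    """True if score_a indicates less brain injury than score_b.

    On the Glasgow Coma Scale a higher score means less injury, so this
    uses only the ORDER of the two categories.
    """
    return score_a > score_b


total_patients = sum(gcs_counts.values())

# Because the categories are ordered, a statement such as
# "a score of 8 or below" is meaningful, and we can count those patients.
at_or_below_8 = sum(n for score, n in gcs_counts.items() if score <= 8)

print(total_patients)        # 90
print(at_or_below_8)         # 32
print(less_injured(15, 14))  # True
```

Note that the sketch only ever compares scores. It never adds, subtracts or averages them.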
However, there is one additional and very important feature of these scores (or of any other set of ordinal scores): the difference between any pair of adjacent scores is not necessarily the same as the difference between any other pair of adjacent scores.
For example, the difference in the degree of brain injury between Glasgow Coma Scale scores of 5 and 6 is not necessarily the same as the difference between scores of 6 and 7. Nor can we say that a patient with a score of, say, 6 has exactly twice the degree of brain injury as a patient with a score of 12. The direct consequence is that ordinal data are not real numbers. They cannot be placed on the number line.⁵ The reason is, of course, that the Glasgow Coma Scale data, and
⁵ The number line can be visualised as a horizontal line stretching from minus infinity on the left to plus infinity on the right. Any real number, whether negative or positive, decimal or integer (whole number), can be placed somewhere on this line.
