
Robert I. Kabacoff - R in action
.pdf356 CHAPTER 15 Advanced methods for missing data
These functions return an object that’s the same size as its argument, with each element replaced by TRUE if the element is of the type being tested, and FALSE otherwise. For example, let y <- c(1, 2, 3, NA). Then is.na(y) will return the vector c(FALSE, FALSE, FALSE, TRUE).
The function complete.cases() can be used to identify the rows in a matrix or data frame that don’t contain missing data. It returns a logical vector with TRUE for every row that contains complete cases and FALSE for every row that has one or more missing values.
Let’s apply this to the sleep dataset:
#load the dataset data(sleep, package="VIM")
#list the rows that do not have missing values sleep[complete.cases(sleep),]
#list the rows that have one or more missing values sleep[!complete.cases(sleep),]
Examining the output reveals that 42 cases have complete data and 20 cases have one or more missing values.
Because the logical values TRUE and FALSE are equivalent to the numeric values 1 and 0, the sum() and mean() functions can be used to obtain useful information about missing data. Consider the following:
>sum(is.na(sleep$Dream)) [1] 12
>mean(is.na(sleep$Dream)) [1] 0.19
>mean(!complete.cases(sleep)) [1] 0.32
The results indicate that there are 12 missing values for the variable Dream. Nineteen percent of the cases have a missing value on this variable. In addition, 32 percent of the cases in the dataset contain one or more missing values.
There are two things to keep in mind when identifying missing values. First, the complete.cases() function only identifies NA and NaN as missing. Infinite values (Inf and –Inf) are treated as valid values. Second, you must use missing values functions, like those in this section, to identify the missing values in R data objects. Logical comparisons such as myvar == NA are never true.
Now that you know how to identify missing values programmatically, let’s look at tools that help you explore possible patterns in the occurrence of missing data.
15.3 Exploring missing values patterns
Before deciding how to deal with missing data, you’ll find it useful to determine which variables have missing values, in what amounts, and in what combinations. In this section, we’ll review tabular, graphical, and correlational methods for exploring missing values patterns. Ultimately, you want to understand why the data is missing. The answer will have implications for how you proceed with further analyses.
Exploring missing values patterns |
357 |
15.3.1Tabulating missing values
You’ve already seen a rudimentary approach to identifying missing values. You can use the complete.cases() function from section 15.2 to list cases that are complete, or conversely, list cases that have one or more missing values. As the size of a dataset grows, though, it becomes a less attractive approach. In this case, you can turn to other R functions.
The md.pattern() function in the mice package will produce a tabulation of the missing data patterns in a matrix or data frame. Applying this function to the sleep dataset, you get the following:
>library(mice)
>data(sleep, package="VIM")
>md.pattern(sleep)
|
BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD |
|
|||||||||
42 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
2 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
1 |
1 |
1 |
3 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
1 |
1 |
9 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
2 |
2 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
1 |
1 |
0 |
2 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
1 |
1 |
2 |
2 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
1 |
0 |
0 |
3 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
0 |
0 |
3 |
|
0 |
0 |
0 |
0 |
0 |
4 |
4 |
4 |
12 |
14 38 |
The 1’s and 0’s in the body of the table indicate the missing values patterns, with a 0 indicating a missing value for a given column variable and a 1 indicating a nonmissing value. The first row describes the pattern of “no missing values” (all elements are 1). The second row describes the pattern “no missing values except for Span.” The first column indicates the number of cases in each missing data pattern, and the last column indicates the number of variables with missing values present in each pattern. Here you can see that there are 42 cases without missing data and 2 cases that are missing Span alone. Nine cases are missing both NonD and Dream values. The dataset contains a total of (42 x 0) + (2 x 1) + … + (1 x 3) = 38 missing values. The last row gives the total number of missing values present on each variable.
15.3.2Exploring missing data visually
Although the tabular output from the md.pattern() function is compact, I often find it easier to discern patterns visually. Luckily, the VIM package provides numerous functions for visualizing missing values patterns in datasets. In this section, we’ll review several, including aggr(), matrixplot(), and scattMiss().
The aggr() function plots the number of missing values for each variable alone and for each combination of variables. For example, the code
library("VIM")
aggr(sleep, prop=FALSE, numbers=TRUE)


Exploring missing values patterns |
359 |
Figure 15.3 Matrix plot of actual and missing values by case (row) for the sleep dataset. The matrix is sorted by BodyWgt.
The marginplot() function produces a scatter plot between two variables with information about missing values shown in the plot’s margins. Consider the relationship between amount of dream sleep and the length of a mammal’s gestation. The statement
marginplot(sleep[c("Gest","Dream")], pch=c(20), col=c("darkgray", "red", "blue"))
produces the graph in figure 15.4. The pch and col parameters are optional and provide control over the plotting symbols and colors used.
The body of the graph displays the scatter plot between Gest and Dream (based on complete cases for the two variables). In the left margin, box plots display the distribution of Dream for mammals with (dark gray) and without (red) Gest values. Note that in grayscale, red is the darker shade. Four red dots represent the values of Dream for mammals missing Gest scores. In the bottom margin, the roles of Gest and Dream are reversed. You can see that a negative relationship exists between length of gestation and dream sleep and that dream sleep tends to be higher for mammals that are missing a gestation score. The number of observations with missing values on both variables at the same time is printed in blue at the intersection of both margins (bottom left).
The VIM package has many graphs that can help you understand the role of missing data in a dataset and is well worth exploring. There are functions to produce scatter plots, box plots, histograms, scatter plot matrices, parallel plots, rug plots, and bubble plots that incorporate information about missing values.

Exploring missing values patterns |
361 |
The statement
y <- x[which(sd(x) > 0)]
extracts the variables that have some (but not all) missing values, and
cor(y)
gives you the correlations among these indicator variables:
|
NonD |
Dream |
Sleep |
Span |
Gest |
NonD |
1.000 |
0.907 |
0.486 |
0.015 |
-0.142 |
Dream 0.907 1.000 0.204 0.038 -0.129 |
|||||
Sleep |
0.486 |
0.204 |
1.000 |
-0.069 -0.069 |
|
Span |
0.015 |
0.038 |
-0.069 |
1.000 |
0.198 |
Gest |
-0.142 -0.129 -0.069 |
0.198 |
1.000 |
Here, you can see that Dream and NonD tend to be missing together (r=0.91). To a lesser extent, Sleep and NonD tend to be missing together (r=0.49) and Sleep and Dream tend to be missing together (r=0.20).
Finally, you can look at the relationship between the presence of missing values in a variable and the observed values on other variables:
> cor(sleep, y, |
use="pairwise.complete.obs") |
||||
|
NonD |
Dream |
Sleep |
Span |
Gest |
BodyWgt |
0.227 |
0.223 |
0.0017 |
-0.058 -0.054 |
|
BrainWgt |
0.179 |
0.163 |
0.0079 |
-0.079 -0.073 |
|
NonD |
NA |
NA |
NA |
-0.043 -0.046 |
|
Dream |
-0.189 |
NA |
-0.1890 |
0.117 |
0.228 |
Sleep |
-0.080 |
-0.080 |
NA |
0.096 |
0.040 |
Span |
0.083 |
0.060 |
0.0052 |
NA -0.065 |
|
Gest |
0.202 |
0.051 |
0.1597 |
-0.175 |
NA |
Pred |
0.048 |
-0.068 |
0.2025 |
0.023 |
-0.201 |
Exp |
0.245 |
0.127 |
0.2608 |
-0.193 -0.193 |
|
Danger |
0.065 |
-0.067 |
0.2089 |
-0.067 -0.204 |
|
Warning message: |
|
|
|
|
|
In cor(sleep, y, |
use = "pairwise.complete.obs") : |
||||
the standard deviation is zero |
|
In this correlation matrix, the rows are observed variables, and the columns are indicator variables representing missingness. You can ignore the warning message and NA values in the correlation matrix; they’re artifacts of our approach.
From the first column of the correlation matrix, you can see that nondreaming sleep scores are more likely to be missing for mammals with higher body weight (r=0.227), gestation period (r=0.202), and sleeping exposure (0.245). Other columns are read in a similar fashion. None of the correlations in this table are particularly large or striking, which suggests that the data deviates minimally from MCAR and may be MAR.
Note that you can never rule out the possibility that the data are NMAR because you don’t know what the actual values would have been for data that are missing. For example, you don’t know if there’s a relationship between the amount of dreaming a mammal engages in and the probability of obtaining a missing value on this variable. In the absence of strong external evidence to the contrary, we typically assume that data is either MCAR or MAR.
362 |
CHAPTER 15 Advanced methods for missing data |
15.4 Understanding the sources and impact of missing data
We identify the amount, distribution, and pattern of missing data in order to evaluate
(1) the potential mechanisms producing the missing data and (2) the impact of the missing data on our ability to answer substantive questions. In particular, we want to answer the following questions:
■What percentage of the data is missing?
■Is it concentrated in a few variables, or widely distributed?
■Does it appear to be random?
■Does the covariation of missing data with each other or with observed data suggest a possible mechanism that’s producing the missing values?
Answers to these questions will help determine which statistical methods are most appropriate for analyzing your data. For example, if the missing data are concentrated in a few relatively unimportant variables, you may be able to delete these variables and continue your analyses normally. If there’s a small amount of data (say less than 10 percent) that’s randomly distributed throughout the dataset (MCAR), you may be able to limit your analyses to cases with complete data and still get reliable and valid results. If you can assume that the data are either MCAR or MAR, you may be able to apply multiple imputation methods to arrive at valid conclusions. If the data are NMAR, you can turn to specialized methods, collect new data, or go into an easier and more rewarding profession.
Here are some examples:
■In a recent survey employing paper questionnaires, I found that several items tended to be missing together. It became apparent that these items clustered together because participants didn’t realize that the third page of the questionnaire had a reverse side containing them. In this case, the data could be considered MCAR.
■In another study, an education variable was frequently missing in a global survey of leadership styles. Investigation revealed that European participants were more likely to leave this item blank. It turned out that the categories didn’t make sense for participants in certain countries. In this case, the data was most likely
MAR.
■Finally, I was involved in a study of depression in which older patients were more likely to omit items describing depressed mood when compared with younger patients. Interviews revealed that older patients were loath to admit to such symptoms because doing so violated their values about keeping a “stiff upper lip.” Unfortunately, it was also determined that severely depressed patients were more likely to omit these items due to a sense of hopelessness and difficulties with concentration. In this case, the data had to be considered NMAR.
As you can see, the identification of patterns is only the first step. You need to bring your understanding of the research subject matter and the data collection process to bear in order to determine the source of the missing values.
Rational approaches for dealing with incomplete data |
363 |
Now that we’ve considered the source and impact of missing data, let’s see how standard statistical approaches can be altered to accommodate them. We’ll focus on three approaches that are very popular: a rational approach for recovering data, a traditional approach that involves deleting missing data, and a modern approach that involves the use of simulation. Along the way, we’ll briefly look at methods for specialized situations, and methods that have become obsolete and should be retired. Our goal will remain constant: to answer, as accurately as possible, the substantive questions that led us to collect the data, given the absence of complete information.
15.5 Rational approaches for dealing with incomplete data
In a rational approach, you use mathematical or logical relationships among variables to attempt to fill in or recover the missing values. A few examples will help clarify this approach.
In the sleep dataset, the variable Sleep is the sum of the Dream and NonD variables. If you know a mammal’s scores on any two, you can derive the third. Thus, if there were some observations that were missing only one of the three variables, you could recover the missing information through addition or subtraction.
As a second example, consider research that focuses on work/ life balance differences between generational cohorts (for example, Silents, Early Boomers, Late Boomers, Xers, Millennials), where cohorts are defined by their birth year. Participants are asked both their date of birth and their age. If date of birth is missing, you can recover their birth year (and therefore their generational cohort) by knowing their age and the date they completed the survey.
An example that uses logical relationships to recover missing data comes from a set of leadership studies in which participants were asked if they were a manager (yes/ no) and the number of their direct reports (integer). If they left the manager question blank but indicated that they had one or more direct reports, it would be reasonable to infer that they were a manager.
As a final example, I frequently engage in gender research that compares the leadership styles and effectiveness of men and women. Participants complete surveys that include their name (first and last), gender, and a detailed assessment of their leadership approach and impact. If participants leave the gender question blank, I have to impute the value in order to include them in the research. In one recent study of 66,000 managers, 11,000 (17 percent) had a missing value for gender.
To remedy the situation, I employed the following rational process. First, I crosstabulated first name with gender. Some first names were associated with males, some with females, and some with both. For example, “William” appeared 417 times and was always a male. Conversely, the name “Chris” appeared 237 times but was sometimes a male (86 percent) and sometimes a female (14 percent). If a first name appeared more than 20 times in the dataset and was always associated with males or with females (but never both), I assumed that the name represented a single gender. I used this assumption to create a gender lookup table for gender-specific first names. Using this lookup table for participants with missing gender values, I was able to recover 7,000 cases (63 percent of the missing responses).
Multiple imputation |
365 |
Call:
lm(formula = Dream ~ Span + Gest, data = na.omit(sleep))
Residuals:
Min 1Q Median 3Q Max -2.333 -0.915 -0.221 0.382 4.183
Coefficients: |
|
|
|
|
|
|
Estimate Std. Error t value Pr(>|t|) |
|
|||
(Intercept) |
2.480122 |
0.298476 |
8.31 |
3.7e-10 |
*** |
Span |
-0.000472 |
0.013130 |
-0.04 |
0.971 |
|
Gest |
-0.004394 |
0.002081 |
-2.11 |
0.041 |
* |
--- |
|
|
|
|
|
Signif. codes: 0 ‘***’ |
0.001 ‘**’ |
0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1 |
Residual standard error: 1 on 39 degrees of freedom
Multiple R-squared: 0.167, Adjusted R-squared: 0.125
F-statistic: 3.92 on 2 and 39 DF, p-value: 0.0282
Here you see that mammals with shorter gestation periods have more dream sleep (controlling for life span) and that life span is unrelated to dream sleep when controlling for gestation period. The analysis is based on 42 cases with complete data.
In the previous example, what would have happened if data=na.omit(sleep) had been replaced with data=sleep? Like many R function, lm() uses a limited definition of listwise deletion. Cases with any missing data on the variables fitted by the function (Dream, Span, and Gest in this case) would have been deleted. The analysis would have been based on 44 cases.
Listwise deletion assumes that the data are MCAR (that is, the complete observations are a random subsample of the full dataset). In the current example, we’ve assumed that the 42 mammals used are a random subsample of the 62 mammals collected. To the degree that the MCAR assumption is violated, the resulting regression parameters will be biased. Deleting all observations with missing data can also reduce statistical power by reducing the available sample size. In the current example, listwise deletion reduced the sample size by 32 percent. Next, we’ll consider an approach that employs the entire dataset (including cases with missing data).
15.7 Multiple imputation
Multiple imputation (MI) provides an approach to missing values that’s based on repeated simulations. MI is frequently the method of choice for complex missing values problems. In MI, a set of complete datasets (typically 3 to 10) is generated from an existing dataset containing missing values. Monte Carlo methods are used to fill in the missing data in each of the simulated datasets. Standard statistical methods are applied to each of the simulated datasets, and the outcomes are combined to provide estimated results and confidence intervals that take into account the uncertainty introduced by the missing values. Good implementations are available in R through the Amelia, mice, and mi packages. In this section we’ll focus on the approach provided by the mice (multivariate imputation by chained equations) package.