
Advanced methods for missing data

This chapter covers

Identifying missing data

Visualizing missing data patterns

Complete-case analysis

Multiple imputation of missing data

In previous chapters, we focused on analyzing complete datasets (that is, datasets without missing values). Although doing so helps simplify the presentation of statistical and graphical methods, in the real world, missing data are ubiquitous.

In some ways, the impact of missing data is a subject that most of us want to avoid. Statistics books may not mention it or may limit discussion to a few paragraphs. Statistical packages offer automatic handling of missing data using methods that may not be optimal. Even though most data analyses (at least in social sciences) involve missing data, this topic is rarely mentioned in the methods and results sections of journal articles. Given how often missing values occur, and the degree to which their presence can invalidate study results, it’s fair to say that the subject has received insufficient attention outside of specialized books and courses.


Data can be missing for many reasons. Survey participants may forget to answer one or more questions, refuse to answer sensitive questions, or grow fatigued and fail to complete a long questionnaire. Study participants may miss appointments or drop out of a study prematurely. Recording equipment may fail, internet connections may be lost, or data may be miscoded. Analysts may even plan for some data to be missing. For example, to increase study efficiency or reduce costs, you may choose not to collect all data from all participants. Finally, data may be lost for reasons that you’re never able to ascertain.

Unfortunately, most statistical methods assume that you’re working with complete matrices, vectors, and data frames. In most cases, you have to eliminate missing data before you address the substantive questions that led you to collect the data. You can eliminate missing data by (1) removing cases with missing data or (2) replacing missing data with reasonable substitute values. In either case, the end result is a dataset without missing values.
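The two strategies can be sketched in base R on a small made-up data frame. The variable names and values below are hypothetical, and substituting the column mean is used only because it is the shortest possible illustration of "replacing", not because it is a recommended method.

```r
# Toy data with made-up values (not taken from any study)
df <- data.frame(
  id    = 1:6,
  hours = c(7.5, NA, 6.0, 8.2, NA, 5.9),
  wgt   = c(12.1, 3.4, NA, 50.0, 7.7, 1.2)
)

# (1) Remove every case (row) that contains at least one missing value
df_complete <- na.omit(df)

# (2) Replace each missing value with a substitute value,
#     here the mean of the observed values in the same column
df_imputed <- df
for (v in c("hours", "wgt")) {
  is_missing <- is.na(df_imputed[[v]])
  df_imputed[[v]][is_missing] <- mean(df[[v]], na.rm = TRUE)
}

nrow(df_complete)    # number of cases left after deletion
anyNA(df_imputed)    # whether any missing values remain after replacement
```

Either way the result contains no NA values, but deletion shrinks the dataset whereas replacement keeps every case.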

In this chapter, we’ll look at both traditional and modern approaches for dealing with missing data. We’ll primarily use the VIM and mice packages. The command install.packages(c("VIM", "mice")) will download and install both.

To motivate the discussion, we’ll look at the mammal sleep dataset (sleep) provided in the VIM package (not to be confused with the sleep dataset describing the impact of drugs on sleep provided in the base installation). The data come from a study by Allison and Cicchetti (1976) that examined the relationship between sleep and ecological and constitutional variables for 62 mammal species. The authors were interested in why animals’ sleep requirements vary from species to species. The sleep variables served as the dependent variables, whereas the ecological and constitutional variables served as the independent or predictor variables.

Sleep variables included length of dreaming sleep (Dream), nondreaming sleep (NonD), and their sum (Sleep). The constitutional variables included body weight in kilograms (BodyWgt), brain weight in grams (BrainWgt), life span in years (Span), and gestation time in days (Gest). The ecological variables included degree to which species were preyed upon (Pred), degree of their exposure while sleeping (Exp), and overall danger (Danger) faced. The ecological variables were measured on 5-point rating scales that ranged from 1 (low) to 5 (high).
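As a minimal sketch of getting started, the following loads the mammal sleep data and counts the missing values in each variable. It assumes that VIM is already installed; the requireNamespace() guard is an addition for illustration, so that the code does nothing rather than failing when the package is absent.

```r
# Assumption: the VIM package has already been installed.
if (requireNamespace("VIM", quietly = TRUE)) {
  # Load VIM's mammal sleep data, not the base-R drug dataset of the same name
  data(sleep, package = "VIM")

  print(dim(sleep))               # number of species (rows) and variables (columns)
  print(colSums(is.na(sleep)))    # number of missing values in each variable
}
```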

In their original article, Allison and Cicchetti limited their analyses to the species that had complete data. We’ll go further, analyzing all 62 cases using a multiple imputation approach.

18.1 Steps in dealing with missing data

If you’re new to the study of missing data, you’ll find a bewildering array of approaches, critiques, and methodologies. The classic text in this area is Little and Rubin (2002). Excellent, accessible reviews can be found in Allison (2001); Schafer and Graham (2002); and Schlomer, Bauman, and Card (2010). A comprehensive approach usually includes the following steps:

1. Identify the missing data.

2. Examine the causes of the missing data.

3. Delete the cases containing missing data, or replace (impute) the missing values with reasonable alternative data values.

Unfortunately, identifying missing data is usually the only unambiguous step. Learning why data are missing depends on your understanding of the processes that generated the data. Deciding how to treat missing values will depend on your estimation of which procedures will produce the most reliable and accurate results.

A classification system for missing data

Statisticians typically classify missing data into one of three types. These types are usually described in probabilistic terms, but the underlying ideas are straightforward. We’ll use the measurement of dreaming in the sleep study (where 12 animals have missing values) to illustrate each type in turn:

Missing completely at random—If the presence of missing data on a variable is unrelated to any other observed or unobserved variable, then the data are missing completely at random (MCAR). If there’s no systematic reason why dream sleep is missing for these 12 animals, the data are said to be MCAR. Note that if every variable with missing data is MCAR, you can consider the complete cases to be a simple random sample from the larger dataset.

Missing at random—If the presence of missing data on a variable is related to other observed variables but not to its own unobserved value, the data are missing at random (MAR). For example, if animals with smaller body weights are more likely to have missing values for dream sleep (perhaps because it’s harder to observe smaller animals), and the “missingness” is unrelated to an animal’s time spent dreaming, the data are considered MAR. In this case, the presence or absence of dream sleep data is random, once you control for body weight.

Not missing at random—If the missing data for a variable are neither MCAR nor MAR, the data are not missing at random (NMAR). For example, if animals that spend less time dreaming are also more likely to have a missing dream value (perhaps because it’s harder to measure shorter events), the data are considered NMAR.
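The three mechanisms can be made concrete with a small simulation. This is a sketch on simulated data, not the sleep study: the distributions, the coefficients in the missingness models, and all variable names are assumptions chosen for illustration. Because the data are simulated, the true dream values are known even where they would be recorded as missing, which is never the case with real data.

```r
set.seed(1234)                                   # arbitrary seed for reproducibility
n <- 1000
body_wgt <- rlnorm(n, meanlog = 0, sdlog = 1)    # hypothetical body weights
dream    <- rnorm(n, mean = 2, sd = 0.5)         # hypothetical dreaming-sleep hours

# MCAR: every value has the same 20% chance of being missing
miss_mcar <- runif(n) < 0.20

# MAR: the chance of being missing depends only on the observed body weight
# (lighter animals are more likely to have a missing dream value)
miss_mar <- runif(n) < plogis(-1.5 - 1.0 * log(body_wgt))

# NMAR: the chance of being missing depends on the dream value itself
# (animals that dream less are more likely to have a missing dream value)
miss_nmar <- runif(n) < plogis(-1.5 - 2.0 * (dream - 2))

# What an analyst would actually see under each mechanism
dream_mcar <- replace(dream, miss_mcar, NA)
dream_mar  <- replace(dream, miss_mar,  NA)
dream_nmar <- replace(dream, miss_nmar, NA)

# Compare cases with missing values to cases with observed values.
# dream_diff uses the true values, which is possible only in a simulation.
compare <- function(miss) {
  c(prop_missing = mean(miss),
    wgt_diff     = mean(log(body_wgt[miss])) - mean(log(body_wgt[!miss])),
    dream_diff   = mean(dream[miss]) - mean(dream[!miss]))
}
round(rbind(MCAR = compare(miss_mcar),
            MAR  = compare(miss_mar),
            NMAR = compare(miss_nmar)), 2)
```

In this setup, the missing and observed groups should look alike under MCAR, differ in body weight under MAR, and differ in the dream values themselves under NMAR. Only the first two differences could be detected from observed data.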

Most approaches to missing data assume that the data are either MCAR or MAR. In this case, you can ignore the mechanism producing the missing data and (after replacing or deleting the missing data) model the relationships of interest directly.

Data that are NMAR can be difficult to analyze properly. When data are NMAR, you have to model the mechanisms that produced the missing values, as well as the relationships of interest. (Current approaches to analyzing NMAR data include the use of selection models and pattern mixtures. The analysis of NMAR data can be complex and is beyond the scope of this book.)

[Figure 18.1: flowchart. Identify missing values (is.na(), !complete.cases(), VIM package), then take one of three routes. Delete missing values: casewise (listwise) deletion with na.omit(), or available-case (pairwise) deletion, an option available for some functions. Maximum likelihood estimation: mvnmle package. Impute missing values: single (simple) imputation with the Hmisc package, or multiple imputation with the mi, mice, Amelia, and mitools packages.]

Figure 18.1 Methods for handling incomplete data, along with the R packages that support them

There are many methods for dealing with missing data—and no guarantee that they’ll produce the same results. Figure 18.1 describes an array of methods used for handling incomplete data and the R packages that support them.

A complete review of missing-data methodologies would require a book in itself. In this chapter, we’ll review methods for exploring missing-values patterns and focus on the three most popular methods for dealing with incomplete data (a rational approach, listwise deletion, and multiple imputation). We’ll end the chapter with a brief discussion of other methods, including those that are useful in special circumstances.

18.2 Identifying missing values

To begin, let’s review the material introduced in section 4.5, and expand on it. R represents missing values using the symbol NA (not available) and impossible values using the symbol NaN (not a number). In addition, the symbols Inf and -Inf represent positive infinity and negative infinity, respectively. The functions is.na(), is.nan(), and is.infinite() can be used to identify missing, impossible, and infinite values, respectively. Each returns either TRUE or FALSE. Examples are given in table 18.1.

Table 18.1 Examples of return values for the is.na(), is.nan(), and is.infinite() functions

x             is.na(x)   is.nan(x)   is.infinite(x)
x <- NA       TRUE       FALSE       FALSE
x <- 0 / 0    TRUE       TRUE        FALSE
x <- 1 / 0    FALSE      FALSE       TRUE

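The same three functions work element by element on a vector, which is how they are typically used in practice. The short sketch below uses a made-up vector that mixes an ordinary number with each kind of special value.

```r
# One ordinary value, one missing value, one impossible value, two infinities
x <- c(5, NA, 0 / 0, 1 / 0, -1 / 0)

is.na(x)          # TRUE for both NA and NaN
is.nan(x)         # TRUE for NaN only
is.infinite(x)    # TRUE for Inf and -Inf

sum(is.na(x))     # count of missing or impossible values
```

Note that is.na() flags NaN as well as NA, so use is.nan() when you need to tell the two apart.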