
CHAPTER 18 Advanced methods for missing data

 

[Figure: margin plot with Gest on the horizontal axis and Dream on the vertical axis]

Figure 18.4 Scatter plot between amount of dream sleep and length of gestation, with information about missing data in the margins

reversed. You can see that a negative relationship exists between length of gestation and dream sleep and that dream sleep tends to be higher for mammals that are missing a gestation score. The number of observations missing values on both variables at the same time is printed in blue at the intersection of both margins (bottom left).

The VIM package has many graphs that can help you understand the role of missing data in a dataset and is well worth exploring. There are functions to produce scatter plots, box plots, histograms, scatter plot matrices, parallel plots, rug plots, and bubble plots that incorporate information about missing values.

18.3.3 Using correlations to explore missing values

Before moving on, there’s one more approach worth noting. You can replace the data in a dataset with indicator variables, coded 1 for missing and 0 for present. The resulting matrix is sometimes called a shadow matrix. Correlating these indicator variables with each other and with the original (observed) variables can help you to see which variables tend to be missing together, as well as relationships between a variable’s “missingness” and the values of the other variables.

Consider the following code:

x <- as.data.frame(abs(is.na(sleep)))

The elements of data frame x are 1 if the corresponding element of sleep is missing and 0 otherwise. You can see this by viewing the first few rows of each:

> head(sleep, n=5)
   BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1 6654.000   5712.0   NA    NA   3.3 38.6  645    3   5      3
2    1.000      6.6  6.3   2.0   8.3  4.5   42    3   1      3
3    3.385     44.5   NA    NA  12.5 14.0   60    1   1      1
4    0.920      5.7   NA    NA  16.5   NA   25    5   2      3
5 2547.000   4603.0  2.1   1.8   3.9 69.0  624    3   5      4

> head(x, n=5)
  BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1       0        0    1     1     0    0    0    0   0      0
2       0        0    0     0     0    0    0    0   0      0
3       0        0    1     1     0    0    0    0   0      0
4       0        0    1     1     0    1    0    0   0      0
5       0        0    0     0     0    0    0    0   0      0
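The same idea can be tried on a small example. The following is a minimal sketch, assuming a hypothetical data frame named toy (it is not part of the sleep data); adding 0L to the logical matrix is just one way to get 1/0 codes and gives the same result as the abs() call used above.

```r
# Hypothetical toy data (not the sleep dataset): three variables, some NAs
toy <- data.frame(
  a = c(1.2, NA, 3.1, 4.8, NA),
  b = c(10, 20, NA, 40, 50),
  c = c(5, 6, 7, 8, 9)
)

# is.na() returns a logical matrix; adding 0L converts TRUE/FALSE to 1/0
shadow <- as.data.frame(is.na(toy) + 0L)
print(shadow)

# Column sums of the shadow matrix count the missing values per variable
n_missing <- colSums(shadow)
print(n_missing)
```

Because the shadow matrix keeps the original column names, it can be lined up against the original data one column at a time.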

The statement

y <- x[which(apply(x,2,sum)>0)]

extracts the variables that have at least one missing value (no variable in sleep is missing entirely, so every retained indicator varies), and

cor(y)

gives you the correlations among these indicator variables:

 

        NonD  Dream  Sleep   Span   Gest
NonD   1.000  0.907  0.486  0.015 -0.142
Dream  0.907  1.000  0.204  0.038 -0.129
Sleep  0.486  0.204  1.000 -0.069 -0.069
Span   0.015  0.038 -0.069  1.000  0.198
Gest  -0.142 -0.129 -0.069  0.198  1.000

Here, you can see that Dream and NonD tend to be missing together (r = 0.91). To a lesser extent, Sleep and NonD tend to be missing together (r = 0.49) and Sleep and Dream tend to be missing together (r = 0.20).
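To see how such a co-missingness correlation behaves, here is a small sketch on a hypothetical data frame (toy, with invented columns u, v, w, and z). Unlike the statement above, it also drops indicators that are missing in every row, which is an added assumption and not needed for the sleep data.

```r
# Hypothetical toy data: u and v are often missing in the same rows
toy <- data.frame(
  u = c(NA, 2, NA, 4, 5, NA, 7, 8),
  v = c(NA, 1, NA, 3, 2, 6, 4, 9),
  w = c(1, 2, 3, 4, 5, 6, 7, 8),   # never missing
  z = rep(NA_real_, 8)             # always missing
)

# Shadow matrix: 1 = missing, 0 = present
shadow <- as.data.frame(is.na(toy) + 0L)

# Keep indicators that vary: some, but not all, values missing
n_miss <- colSums(shadow)
keep <- n_miss > 0 & n_miss < nrow(shadow)
y <- shadow[, keep, drop = FALSE]

# Correlations among the retained missingness indicators
r <- cor(y)
print(round(r, 3))
```

A large positive entry means the two variables tend to be missing in the same rows; indicators that never vary are excluded because their correlations would be undefined.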

Finally, you can look at the relationship between missing values in a variable and the observed values on other variables:

> cor(sleep, y, use="pairwise.complete.obs")
           NonD  Dream   Sleep   Span   Gest
BodyWgt   0.227  0.223  0.0017 -0.058 -0.054
BrainWgt  0.179  0.163  0.0079 -0.079 -0.073
NonD         NA     NA      NA -0.043 -0.046
Dream    -0.189     NA -0.1890  0.117  0.228
Sleep    -0.080 -0.080      NA  0.096  0.040
Span      0.083  0.060  0.0052     NA -0.065
Gest      0.202  0.051  0.1597 -0.175     NA
Pred      0.048 -0.068  0.2025  0.023 -0.201
Exp       0.245  0.127  0.2608 -0.193 -0.193
Danger    0.065 -0.067  0.2089 -0.067 -0.204
Warning message:
In cor(sleep, y, use = "pairwise.complete.obs") :
  the standard deviation is zero

 

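In this correlation matrix, the rows are observed variables, and the columns are indicator variables representing missingness. You can ignore the warning message and the NA values in the correlation matrix; they’re artifacts of the approach. Each correlation is computed only over cases in which the row variable is observed, and within those cases a column indicator can be constant (a variable’s own missingness indicator is always 0 where the variable is observed), so its standard deviation is zero and the correlation is undefined.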
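The source of the NA values can be reproduced on a small scale. This is a sketch under stated assumptions: the data frame toy and its columns size and score are hypothetical, and suppressWarnings() is used only to keep the output quiet.

```r
# Hypothetical toy data: score tends to be missing when size is small
toy <- data.frame(
  size  = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(NA, NA, 2.5, 3.0, NA, 4.1, 5.2, 6.0)
)

# Missingness indicator for score (1 = missing, 0 = present)
miss <- as.data.frame(is.na(toy["score"]) + 0L)

# Rows: observed variables; column: missingness indicator for score.
# The warning about a zero standard deviation is suppressed here.
r <- suppressWarnings(cor(toy, miss, use = "pairwise.complete.obs"))
print(round(r, 3))
```

The size row is negative because score is missing mainly for small values of size. The score row is NA because, in the rows where score is observed, its own indicator is always 0.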
