R in Action, Second Edition.pdf

       BodyWgt BrainWgt NonD Dream Sleep  Span Gest  Pred  Exp Danger
NonD     -0.38    -0.37  1.0   0.5   1.0 -0.38 -0.6 -0.32 -0.5  -0.48
Dream    -0.11    -0.11  0.5   1.0   0.7 -0.30 -0.5 -0.45 -0.5  -0.58
Sleep    -0.31    -0.36  1.0   0.7   1.0 -0.41 -0.6 -0.40 -0.6  -0.59
Span      0.30     0.51 -0.4  -0.3  -0.4  1.00  0.6 -0.10  0.4   0.06
Gest      0.65     0.75 -0.6  -0.5  -0.6  0.61  1.0  0.20  0.6   0.38
Pred      0.06     0.03 -0.3  -0.4  -0.4 -0.10  0.2  1.00  0.6   0.92
Exp       0.34     0.37 -0.5  -0.5  -0.6  0.36  0.6  0.62  1.0   0.79
Danger    0.13     0.15 -0.5  -0.6  -0.6  0.06  0.4  0.92  0.8   1.00

In this example, correlations between any two variables use all available observations for those two variables (ignoring the other variables). The correlation between BodyWgt and BrainWgt is based on all 62 mammals (the number of mammals with data on both variables). The correlation between BodyWgt and NonD is based on 42 mammals, and the correlation between Dream and NonD is based on 46 mammals.

Although pairwise deletion appears to use all available data, in fact each calculation is based on a different subset of the data. This can lead to distorted and difficult-to-interpret results. I recommend staying away from this approach.
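To make the "different subset" point concrete, here is a minimal sketch in base R. The data frame and its values are hypothetical (it is not the sleep data used in this chapter); it only illustrates how the number of observations behind each pairwise correlation can be counted and compared with listwise deletion.

```r
# Hypothetical toy data; z has two missing values
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 5),
  z = c(1, NA, 3, NA, 5, 7)
)

# Pairwise deletion: each correlation uses the rows complete for that pair
r_pairwise <- cor(df, use = "pairwise.complete.obs")

# Count the observations behind each pairwise correlation:
# 'present' is 1 where a value is observed and 0 where it is missing,
# so its cross-product counts rows observed on both variables
present    <- 1 - is.na(as.matrix(df))
n_pairwise <- crossprod(present)

# Listwise deletion for comparison: every correlation uses the same rows
r_listwise <- cor(df, use = "complete.obs")

print(n_pairwise)
print(round(r_pairwise, 2))
print(round(r_listwise, 2))
```

Printing n_pairwise next to the correlation matrix is a cheap way to see how unevenly the data support each coefficient.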

18.8.2 Simple (nonstochastic) imputation

In simple imputation, the missing values in a variable are replaced with a single value (for example, mean, median, or mode). Using mean substitution, you could replace missing values on Dream with the value 1.97 and missing values on NonD with the value 8.67 (the means on Dream and NonD, respectively). Note that the substitution is nonstochastic, meaning that random error isn’t introduced (unlike with multiple imputation).

An advantage of simple imputation is that it solves the missing-values problem without reducing the sample size available for analyses. Simple imputation is, well, simple, but it produces biased results for data that aren’t MCAR. If moderate to large amounts of data are missing, simple imputation is likely to underestimate standard errors, distort correlations among variables, and produce incorrect p-values in statistical tests. As with pairwise deletion, I recommend avoiding this approach for most missing-data problems.
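The shrinkage in variability described above can be seen in a few lines of base R. This is only a sketch on a hypothetical vector (the values are made up for illustration), not an analysis of the chapter's data.

```r
# Hypothetical variable with two missing values
x <- c(4, NA, 6, 8, NA, 10)

# Mean substitution: every NA is replaced by the same observed mean
x_mean    <- mean(x, na.rm = TRUE)
x_imputed <- ifelse(is.na(x), x_mean, x)

# The sample size is preserved, but the imputed values add no spread,
# so the variance (and hence the standard error) is understated
var_observed <- var(x, na.rm = TRUE)
var_imputed  <- var(x_imputed)

print(x_imputed)
print(c(observed = var_observed, imputed = var_imputed))
```

Because every imputed value sits exactly at the mean, the sum of squared deviations stays the same while the denominator grows, which is why the variance can only go down.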

18.9 Summary

Most statistical methods assume that the input data are complete and don’t include missing values (such as NA, NaN, or Inf). But most datasets in real-world settings contain missing values. Therefore, you must either delete the missing values or replace them with reasonable substitute values before continuing with the desired analyses. Often, statistical packages provide default methods for handling missing data, but these approaches may not be optimal. Therefore, it’s important that you understand the various approaches available and the ramifications of using each.
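As a reminder of how R itself treats these special values, the following base R sketch runs the standard test functions on a hypothetical vector. Note that is.na() flags NA and NaN but not Inf, so infinite values need a separate check.

```r
# Hypothetical vector containing each kind of special value
x <- c(1, NA, NaN, Inf, 5)

flag_na  <- is.na(x)        # TRUE for NA and NaN, FALSE for Inf
flag_nan <- is.nan(x)       # TRUE only for NaN
flag_inf <- is.infinite(x)  # TRUE only for Inf and -Inf
flag_ok  <- is.finite(x)    # TRUE only for ordinary numbers

# Default behavior: a single NA propagates through the calculation
mean_default <- mean(x)

# Restricting to finite values gives a usable result
mean_finite <- mean(x[flag_ok])

print(data.frame(x, flag_na, flag_nan, flag_inf, flag_ok))
print(c(default = mean_default, finite = mean_finite))
```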

In this chapter, we examined methods for identifying missing values and exploring patterns of missing data. The goal was to understand the mechanisms that led to the missing data and their possible impact on subsequent analyses. We then reviewed three popular methods for dealing with missing data: a rational approach, listwise deletion, and the use of multiple imputation.

Rational approaches can be used to recover missing values when there are redundancies in the data or when external information can be brought to bear on the problem. The listwise deletion of missing data is useful if the data are MCAR and the subsequent sample size reduction doesn’t seriously impact the power of statistical tests. Multiple imputation is rapidly becoming the method of choice for complex missing-data problems when you can assume that the data are MCAR or MAR. Although many analysts may be unfamiliar with multiple imputation strategies, user-contributed packages (mice, mi, and Amelia) make them readily accessible. I believe we’ll see rapid growth in their use over the next few years.
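The three approaches can be sketched side by side. Everything here is an illustrative assumption: the data frames, the column names (part_a, part_b, total, x, y), and the simulated model are hypothetical, and the multiple imputation step runs only if the mice package happens to be installed.

```r
# Hypothetical toy data in which 'total' is redundant with part_a + part_b
df <- data.frame(
  part_a = c(2, 3, NA, 5, 1),
  part_b = c(8, 7, 6, NA, 9),
  total  = c(10, NA, 9, 12, NA)
)

# Rational approach: recover values implied by the redundancy
fix_total <- is.na(df$total) & !is.na(df$part_a) & !is.na(df$part_b)
df$total[fix_total] <- df$part_a[fix_total] + df$part_b[fix_total]

fix_a <- is.na(df$part_a) & !is.na(df$total) & !is.na(df$part_b)
df$part_a[fix_a] <- df$total[fix_a] - df$part_b[fix_a]

# Listwise deletion: keep only the rows with no remaining missing values
complete_df <- df[complete.cases(df), ]
print(complete_df)

# Multiple imputation sketch on simulated data (skipped without mice)
if (requireNamespace("mice", quietly = TRUE)) {
  set.seed(1234)
  sim <- data.frame(x = rnorm(100))
  sim$y <- 2 + 3 * sim$x + rnorm(100)
  sim$y[sample(100, 20)] <- NA   # 20 values removed completely at random

  imp    <- mice::mice(sim, m = 5, seed = 1234, printFlag = FALSE)
  fit    <- with(imp, lm(y ~ x))  # fit the model in each imputed dataset
  pooled <- mice::pool(fit)       # combine the five sets of estimates
  print(summary(pooled))
}
```

In the toy data, the rational step fills three of the five missing values, so listwise deletion then discards one row instead of four.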

I ended the chapter by briefly mentioning R packages that provide specialized approaches for dealing with missing data, and singled out general approaches for handling missing data (pairwise deletion, simple imputation) that should be avoided.

In the next chapter, we’ll explore advanced graphical methods, using the ggplot2 package to create innovative multivariate plots.

Part 5

Expanding your skills

In this final section, we consider advanced topics that will enhance your skills as an R programmer. Chapter 19 completes our discussion of graphics with a presentation of one of R’s most powerful approaches to visualizing data. Based on a comprehensive grammar of graphics, the ggplot2 package provides a set of tools that allow you to visualize complex datasets in new and creative ways. You’ll be able to easily create attractive and informative graphs that would be difficult or impossible to create using R’s base graphics system.

Chapter 20 provides a review of the R language at a deeper level. This includes a discussion of R’s object-oriented programming features, working with environments, and advanced function writing. Tips for writing efficient code and debugging programs are also described. Although chapter 20 is more technical than the other chapters in this book, it provides extensive practical advice for developing more useful programs.

Throughout this book, you’ve used packages to get work done. In chapter 21, you’ll learn to write your own packages. This can help you organize and document your work, create more complex and comprehensive software solutions, and share your creations with others. Sharing a useful package of functions with others can also be a wonderful way to give back to the R community (while spreading your fame far and wide).

Chapter 22 is all about report writing. R provides comprehensive facilities for generating attractive reports dynamically from data. In this last chapter, you’ll learn how to create reports as web pages, PDF documents, and word processor documents (including Microsoft Word documents).

After completing part 5, you’ll have a much deeper appreciation of how R works and the tools it offers for creating more sophisticated graphics, software, and reports.
