
Robert I. Kabacoff - R in action


APPENDIX F Packages used in this book

Table F.1 Contributed packages used in this book (continued)

| Package | Authors | Description | Chapters |
|---|---|---|---|
| psych | William Revelle | Procedures for psychological, psychometric, and personality research | 7, 14 |
| pwr | Stephane Champely | Basic functions for power analysis | 10 |
| qcc | Luca Scrucca | Quality control charts | 13 |
| randomLCA | Ken Beath | Random effects latent class analysis | 14 |
| Rcmdr | John Fox, with contributions from Liviu Andronic, Michael Ash, Theophilius Boye, Stefano Calza, Andy Chang, Philippe Grosjean, Richard Heiberger, G. Jay Kerns, Renaud Lancelot, Matthieu Lesnoff, Uwe Ligges, Samir Messad, Martin Maechler, Robert Muenchen, Duncan Murdoch, Erich Neuwirth, Dan Putler, Brian Ripley, Miroslav Ristic, and Peter Wolf | R Commander, a platform-independent basic-statistics graphical user interface for R, based on the tcltk package | 11 |
| reshape | Hadley Wickham | Flexibly reshape data | 4, 5, 7 |
| rggobi | Duncan Temple Lang, Debby Swayne, Hadley Wickham, and Michael Lawrence | An interface between R and GGobi | 16 |
| rgl | Daniel Adler and Duncan Murdoch | 3D visualization device system (OpenGL) | 11 |
| RJDBC | Simon Urbanek | Provides access to databases through the JDBC interface | 2 |
| rms | Frank E. Harrell, Jr. | Regression modeling strategies - about 225 functions that assist with and streamline regression modeling, testing, estimations, validation, graphics, prediction, and typesetting | 13 |
| robust | Jiahui Wang, Ruben Zamar, Alfio Marazzi, Victor Yohai, Matias Salibian-Barrera, Ricardo Maronna, Eric Zivot, David Rocke, Doug Martin, Martin Maechler, and Kjell Konis | A package of robust methods | 13 |
| RODBC | Brian Ripley and Michael Lapsley | ODBC database access | 2 |
| ROracle | David A. James and Jake Luciani | Oracle database interface for R | 2 |
| rrcov | Valentin Todorov | Robust location and scatter estimation and robust multivariate analysis with high breakdown point | 9 |
| sampling | Yves Tillé and Alina Matei | Functions for drawing and calibrating samples | 4 |
| scatterplot3d | Uwe Ligges | Plots a three dimensional (3D) point cloud | 11 |
| sem | John Fox with contributions from Adam Kramer and Michael Friendly | Structural equation models | 14 |
| SeqKnn | Ki-Yeol Kim and Gwan-Su Yi, CSBio lab., Information and Communications University | Sequential KNN imputation method | 15 |
| sm | Adrian Bowman and Adelchi Azzalini. Ported to R by B. D. Ripley up to version 2.0, version 2.1 by Adrian Bowman and Adelchi Azzalini, version 2.2 by Adrian Bowman. | Smoothing methods for nonparametric regression and density estimation | 6, 9 |
| vcd | David Meyer, Achim Zeileis, and Kurt Hornik | Functions for visualizing categorical data | 1, 6, 7, 11, 12 |
| vegan | Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, R. B. O’Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, and Helene Wagner | Ordination methods, diversity analysis, and other functions for community and vegetation ecologists | 9 |
| VIM | Matthias Templ, Andreas Alfons, and Alexander Kowarik | Visualization and imputation of missing values | 15 |
| xlsx | Adrian A. Dragulescu | Read, write, and format Excel 2007 (xlsx) files | 2 |
| XML | Duncan Temple Lang | Tools for parsing and generating XML within R and S-Plus | 2 |

APPENDIX G Working with large datasets

R holds all of its objects in virtual memory. For most of us, this design decision has led to a zippy interactive experience, but for analysts working with large datasets, it can lead to slow program execution and memory-related errors.

Memory limits depend primarily on the R build (32-bit versus 64-bit) and, for 32-bit Windows, on the OS version involved. Error messages starting with "cannot allocate vector of size" typically indicate a failure to obtain sufficient contiguous memory, while error messages starting with "cannot allocate vector of length" indicate that an address limit has been exceeded. When working with large datasets, try to use a 64-bit build if at all possible. For all builds, the number of elements in a vector is limited to 2,147,483,647 (see ?Memory for more information).
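To get a feel for these numbers, here is a small sketch that uses only base R. The variable names are my own, and the reported object size will vary slightly by build. The element limit shown is the one quoted above; 64-bit builds of R from version 3.0.0 onward also support longer vectors, so treat it as a historical ceiling rather than a current one.

```r
# The largest value an R integer can hold. This is the
# 2,147,483,647 element limit quoted in the text.
max_elements <- .Machine$integer.max
print(max_elements)

# A numeric (double) vector needs 8 bytes per element plus a small header.
n <- 1e6
x <- numeric(n)
print(object.size(x), units = "Mb")

# Approximate size, in gigabytes, of a numeric vector at the element limit.
approx_gb <- max_elements * 8 / 1024^3
print(approx_gb)
```

A back-of-the-envelope calculation like the last one is a quick way to decide whether a dataset has any chance of fitting in memory before you try to read it.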

There are three issues to consider when working with large datasets: (a) efficient programming to speed execution, (b) storing data externally to limit memory issues, and (c) using specialized statistical routines designed to efficiently analyze massive amounts of data. We will briefly consider each.

G.1 Efficient programming

There are a number of programming tips that improve performance when working with large datasets.

Vectorize calculations when possible. Use R’s built-in functions for manipulating vectors, matrices, and lists (for example, sapply, lapply, and mapply) and avoid loops (for and while) when feasible.


Use matrices rather than data frames (they have less overhead).

When using the read.table() family of functions to input external data into data frames, specify the colClasses and nrows options explicitly, set comment.char = "", and specify "NULL" for columns that aren’t needed. This will decrease memory usage and speed up processing considerably. When reading external data into a matrix, use the scan() function instead.

Test programs on subsets of the data, in order to optimize code and remove bugs, before attempting a run on the full dataset.

Delete temporary objects and objects that are no longer needed. The call rm(list=ls()) will remove all objects from memory, providing a clean slate. Specific objects can be removed with rm(object).

Use the function .ls.objects() described in Jeromy Anglim’s blog entry “Memory Management in R: A Few Tips and Tricks” (jeromyanglim.blogspot.com), to list all workspace objects sorted by size (MB). This function will help you find and deal with memory hogs.

Profile your programs to see how much time is being spent in each function. You can accomplish this with the Rprof() and summaryRprof() functions. The system.time() function can also help. The profr and proftools packages provide functions that can help in analyzing profiling output.

The Rcpp package can be used to transfer R objects to C++ functions and back when more optimized subroutines are needed.
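Several of the tips above can be tried out in a few lines of base R. The following is only a sketch: the function name loop_squares, the demonstration file, and the column names are made up for illustration, and the timings you see will depend on your machine.

```r
# --- Vectorize instead of looping ---
n <- 1e5
x <- runif(n)

# Loop version: preallocates the result but works one element at a time.
loop_squares <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}

t_loop <- system.time(s1 <- loop_squares(x))
t_vec  <- system.time(s2 <- x^2)          # vectorized equivalent
stopifnot(isTRUE(all.equal(s1, s2)))      # both give the same answer
print(t_loop["elapsed"])
print(t_vec["elapsed"])

# --- read.table() with explicit column types ---
# A small demonstration file stands in for a large one.
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5,
                     score = c(2.5, 3.1, 4.8, 1.2, 3.3),
                     note = letters[1:5]),
          demo_file, row.names = FALSE)

dat <- read.table(demo_file, header = TRUE, sep = ",",
                  colClasses = c("integer", "numeric", "NULL"),  # skip 'note'
                  nrows = 5, comment.char = "")
str(dat)

# --- Find and remove memory hogs ---
env <- environment()
sizes <- sapply(ls(env), function(nm) object.size(get(nm, envir = env)))
print(sort(sizes, decreasing = TRUE))     # object sizes in bytes
rm(s1, s2)                                # drop objects no longer needed
```

The size listing at the end is a bare-bones stand-in for the idea behind .ls.objects(), not a copy of that function.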

With large datasets, increasing code efficiency will only get you so far. When bumping up against memory limits, you can also store your data externally and use specialized analysis routines.

G.2 Storing data outside of RAM

There are several packages available for storing data outside of R’s main memory. The strategy involves storing data in external databases or in binary flat files on disk, and then accessing portions as they are needed. Several useful packages are described in table G.1.

Table G.1 R packages for accessing large datasets

| Package | Description |
|---|---|
| ff | Provides data structures that are stored on disk but behave as if they were in RAM. |
| bigmemory | Supports the creation, storage, access, and manipulation of massive matrices. Matrices are allocated to shared memory and memory-mapped files. |
| filehash | Implements a simple key-value database where character string keys are associated with data values stored on disk. |
| ncdf, ncdf4 | Provides an interface to Unidata netCDF data files. |
| RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite | Each provides access to external relational database management systems. |

The packages above help overcome R’s memory limits on data storage. However, specialized methods are also needed when attempting to analyze large datasets in a reasonable length of time. Some of the most useful are described in the next section.
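The underlying strategy of section G.2 (keep the data on disk and pull in only a portion at a time) can be sketched without any contributed packages. The example below uses a base R file connection as a stand-in. It is not the interface offered by ff, bigmemory, or filehash, and the file and chunk size are arbitrary choices for illustration.

```r
# Create a demonstration file on disk (a stand-in for one too large for RAM).
big_file <- tempfile(fileext = ".txt")
set.seed(1234)
write.table(data.frame(value = rnorm(1000)), big_file,
            row.names = FALSE, col.names = FALSE)

# Read the file 100 lines at a time, keeping only running totals in memory.
chunk_size <- 100
con <- file(big_file, open = "r")
total <- 0
count <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break           # end of file reached
  values <- as.numeric(lines)
  total <- total + sum(values)
  count <- count + length(values)
}
close(con)

chunked_mean <- total / count
print(chunked_mean)
```

Because only one chunk is held in memory at any moment, the same loop works whether the file has a thousand lines or a billion. The packages in table G.1 wrap this kind of bookkeeping in more convenient data structures.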

G.3 Analytic packages for large datasets

R provides several packages for the analysis of large datasets:

The biglm and speedglm packages fit linear and generalized linear models to large datasets in a memory-efficient manner. They offer lm() and glm() type functionality when dealing with massive datasets.

Several packages offer analytic functions for working with the massive matrices produced by the bigmemory package. The biganalytics package offers k-means clustering, column statistics, and a wrapper to biglm. The bigtabulate package provides table(), split(), and tapply() functionality and the bigalgebra package provides advanced linear algebra functions.

The biglars package offers least-angle regression, lasso, and stepwise regression for datasets that are too large to be held in memory, when used in conjunction with the ff package.

The Brobdingnag package can be used to manipulate large numbers (numbers larger than 2^1024).
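To see how a linear model can be fit without holding all of the data in memory at once, consider the following base R sketch. It accumulates the cross-products X'X and X'y one chunk at a time and then solves the normal equations. I chose this approach for brevity: it is not the algorithm used by biglm or speedglm (biglm, for one, relies on a more numerically stable incremental decomposition), and the simulated data and chunk size are assumptions for illustration.

```r
set.seed(1234)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

# Split the row indices into chunks of 1,000, as if each chunk
# were read from disk in turn.
chunks <- split(seq_len(n), ceiling(seq_len(n) / 1000))

# Accumulate X'X and X'y one chunk at a time.
p <- 3                                    # intercept + two predictors
xtx <- matrix(0, p, p)
xty <- matrix(0, p, 1)
for (idx in chunks) {
  X <- model.matrix(~ x1 + x2, data = dat[idx, ])
  y <- dat$y[idx]
  xtx <- xtx + crossprod(X)               # adds t(X) %*% X
  xty <- xty + crossprod(X, y)            # adds t(X) %*% y
}

# Solve the normal equations for the coefficients.
beta_chunked <- drop(solve(xtx, xty))
names(beta_chunked) <- colnames(X)

# Compare with an ordinary in-memory fit.
beta_full <- coef(lm(y ~ x1 + x2, data = dat))
print(rbind(chunked = beta_chunked, full = beta_full))
```

The memory needed by the loop depends on the chunk size and the number of predictors, not on the total number of rows, which is the property that makes bounded-memory regression possible.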

Working with datasets in the gigabyte to terabyte range can be challenging in any language. For more information on the methods available within R, see the CRAN Task View: High-Performance and Parallel Computing with R (cran.r-project.org/web/views/).

APPENDIX H Updating an R installation

As consumers, we take for granted that we can update a piece of software via a “Check for updates…” option. In chapter 1, I noted that the update.packages() function can be used to download and install the most recent version of a contributed package. Unfortunately, there’s no corresponding function for updating the R installation itself. If you want to update an R installation from version 4.1.0 to 5.1.1, you must get creative. (As I write this, the current version is actually 2.13.0, but I want this book to appear hip and current for years to come.)

Downloading and installing the latest version of R from CRAN (http://cran.r-project.org/bin/) is relatively straightforward. The complicating factor is that customizations (including previously installed contributed packages) will not be included in the new installation. In my current set-up, I have 248 contributed packages installed. I really don’t want to have to write their names down and reinstall them by hand the next time I upgrade my R installation.

There has been much discussion on the web concerning the most elegant and efficient way to update an R installation. The method described below is neither elegant nor efficient, but I find that it works well on a variety of platforms (Windows, Mac, and Linux).

In this approach, the installed.packages() function is used to save a list of packages to a location outside of the R directory tree, and then the list is used with the install.packages() function to download and install the latest contributed packages into the new R installation. Here are the steps:


1 If you have a customized Rprofile.site file (see appendix B), save a copy outside of R.

2 Launch your current version of R and issue the following statements:

oldip <- installed.packages()[,1]
save(oldip, file="path/installedPackages.Rdata")

where path is a directory outside of R.

3 Download and install the newer version of R.

4 If you saved a customized version of the Rprofile.site file in step 1, copy it into the new installation.

5 Launch the new version of R, and issue the following statements:

load("path/installedPackages.Rdata")
newip <- installed.packages()[,1]
for(i in setdiff(oldip, newip))
  install.packages(i)

where path is the location specified in step 2.

6 Delete the old installation (optional).
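Steps 2 and 5 can be rehearsed in a single session before you commit to an upgrade. The sketch below is a variation on the code in the steps, not a replacement for it: tempdir() stands in for path (in practice, use a directory outside of R that survives the upgrade), pkg_file and missing_pkgs are names I made up, and the loop is swapped for a single install.packages() call, which accepts a character vector of package names.

```r
# Step 2 (old installation): save the names of the installed packages.
# tempdir() is only a stand-in for "path".
pkg_file <- file.path(tempdir(), "installedPackages.Rdata")
oldip <- installed.packages()[, "Package"]   # same as column 1
save(oldip, file = pkg_file)

# Step 5 (new installation): reload the list and see what is missing.
rm(oldip)
load(pkg_file)                               # restores 'oldip'
newip <- installed.packages()[, "Package"]
missing_pkgs <- setdiff(oldip, newip)
print(missing_pkgs)

# Install everything that is missing in one call. The guard avoids a
# download attempt when there is nothing to install.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Run within one installation, missing_pkgs comes back empty and nothing is downloaded, which makes this a safe way to check that the saved file can be read back before you remove the old version.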

This approach will install only packages that are available from CRAN. It won’t find packages obtained from other locations. You’ll have to find and download these separately. Luckily, the process will display a list of packages that can’t be installed. During my last installation, globaltest and Biobase couldn’t be found. Since I got them from the Bioconductor site, I was able to install them via the code

source("http://bioconductor.org/biocLite.R")
biocLite("globaltest")
biocLite("Biobase")

Step 6 involves the optional deletion of the old installation. On a Windows machine, more than one version of R can be installed at a time. If desired, uninstall the older version via Start > Control Panel > Uninstall a Program. On Mac and Linux platforms, the new version of R will overwrite the older version. To delete any remnants on a Mac, use the Finder to go to the /Library/Frameworks/R.framework/Versions/ directory and delete the folder representing the older version. On a Linux platform, it’s probably best to leave well enough alone.

Clearly, updating an existing version of R is more involved than is desirable for such a sophisticated piece of software. I’m hopeful that someday this appendix will simply say “Select the Check for Updates… option” to update an R installation.

Index

Symbol

! operator 77
!= operator 77
# symbol 8
%a symbol 81
%A symbol 81
%B symbol 82
%b symbol 82
%d symbol 81
%m symbol 81
%Y symbol 82
%y symbol 82
* operator 75, 178
** operator 75
... option 58, 61
. symbol 178
/ operator 75
: symbol 178
? function 11
?? function 11
^ operator 75, 178, 181
~ symbol 178
+ operator 75, 178
< operator 77
<<- operator 29
<= operator 77
== operator 77
> operator 77
>= operator 77
-1 symbol 178
brackets 29
3D pie charts 127
3D scatter plots 274–278

A

abline( ) function 60, 265
abs( ) function 93
absolute widths 67
acos( ) function 93
acosh( ) function 93
AER package 421
aggr( ) function, VIM package 357
aggregate( ) function 113, 240
aggregating data 112–113
AIC( ) function 179, 208
all subsets regression 210, 213
alpha option 390
alternative= option 255
Amelia package 365, 369, 421
analyses, excluding missing values from 80–81
analysis of covariance (ANCOVA)
  one-way 230–233
    assessing test assumptions 232
    visualizing results 232–233
  overview 222
analysis of variance (ANOVA) 219–245, 252–253
  fitting models 222–225
    aov( ) function 222–223
    order of formula terms 223–225
  MANOVA 239–243
    assessing test assumptions 241–242
    robust 242–243
  one-way 225–230
    assessing test assumptions 229–230
    multiple comparisons 227–229
  one-way ANCOVA 230–233
    assessing test assumptions 232
    visualizing results 232–233
  as regression 243–245
  repeated measures 237–239
  terminology of 220–222
  two-way factorial 234–236
analytic packages, for large datasets 431
ANCOVA. See analysis of covariance
ancova( ) function, HH package 232
AND operator 77
annotating datasets 42
annotations 62–64
ANOVA. See analysis of variance
anova( ) function 179, 208
Anova( ) function, car package 225, 239
aov( ) function 222–223
append option 13
apply( ) function 102–103
apropos( ) function 11
aq.plot( ) function, mvoutlier package 242
arithmetic operators 75
arrayImpute package 370, 421
arrayMissPattern package 370, 421
arrays 26–27
Arthritis dataset 19
as.character( ) function 83
ASCII file 35
as.datatype( ) function 84
as.Date( ) function 81, 88
asin( ) function 93
asinh( ) function 93
aspect option 378
assumptions
  linear model, global validation of 199
  of MANOVA tests, assessing 241–242
  of OLS regression, assessing 188–199
  of one-way ANCOVA tests, assessing 232
  of one-way ANOVA tests, assessing 229–230
asypow package 261
