
Robert I. Kabacoff - R in action
APPENDIX F Packages used in this book
Table F.1 Contributed packages used in this book (continued)
Package | Authors | Description | Chapters
psych | William Revelle | Procedures for psychological, psychometric, and personality research | 7, 14
pwr | Stephane Champely | Basic functions for power analysis | 10
qcc | Luca Scrucca | Quality control charts | 13
randomLCA | Ken Beath | Random effects latent class analysis | 14
Rcmdr | John Fox, with contributions from Liviu Andronic, Michael Ash, Theophilius Boye, Stefano Calza, Andy Chang, Philippe Grosjean, Richard Heiberger, G. Jay Kerns, Renaud Lancelot, Matthieu Lesnoff, Uwe Ligges, Samir Messad, Martin Maechler, Robert Muenchen, Duncan Murdoch, Erich Neuwirth, Dan Putler, Brian Ripley, Miroslav Ristic, and Peter Wolf. | R Commander, a platform-independent basic-statistics graphical user interface for R, based on the tcltk package | 11
reshape | Hadley Wickham | Flexibly reshape data | 4, 5, 7
rggobi | Duncan Temple Lang, Debby Swayne, Hadley Wickham, and Michael Lawrence | An interface between R and GGobi | 16
rgl | Daniel Adler and Duncan Murdoch | 3D visualization device system (OpenGL) | 11
RJDBC | Simon Urbanek | Provides access to databases through the JDBC interface | 2
rms | Frank E. Harrell, Jr. | Regression modeling strategies - about 225 functions that assist with and streamline regression modeling, testing, estimations, validation, graphics, prediction, and typesetting | 13

Table F.1 Contributed packages used in this book (continued)
Package | Authors | Description | Chapters
robust | Jiahui Wang, Ruben Zamar, Alfio Marazzi, Victor Yohai, Matias Salibian-Barrera, Ricardo Maronna, Eric Zivot, David Rocke, Doug Martin, Martin Maechler, and Kjell Konis | A package of robust methods | 13
RODBC | Brian Ripley and Michael Lapsley | ODBC database access | 2
ROracle | David A. James and Jake Luciani | Oracle database interface for R | 2
rrcov | Valentin Todorov | Robust location and scatter estimation and robust multivariate analysis with high breakdown point | 9
sampling | Yves Tillé and Alina Matei | Functions for drawing and calibrating samples | 4
scatterplot3d | Uwe Ligges | Plots a three dimensional (3D) point cloud | 11
sem | John Fox with contributions from Adam Kramer and Michael Friendly | Structural equation models | 14
SeqKnn | Ki-Yeol Kim and Gwan-Su Yi, CSBio lab., Information and Communications University | Sequential KNN imputation method | 15
sm | Adrian Bowman and Adelchi Azzalini. Ported to R by B. D. Ripley up to version 2.0, version 2.1 by Adrian Bowman and Adelchi Azzalini, version 2.2 by Adrian Bowman. | Smoothing methods for nonparametric regression and density estimation | 6, 9
vcd | David Meyer, Achim Zeileis, and Kurt Hornik | Functions for visualizing categorical data | 1, 6, 7, 11, 12
vegan | Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, R. B. O’Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, and Helene Wagner | Ordination methods, diversity analysis, and other functions for community and vegetation ecologists | 9

Table F.1 Contributed packages used in this book (continued)
Package | Authors | Description | Chapters
VIM | Matthias Templ, Andreas Alfons, and Alexander Kowarik | Visualization and imputation of missing values | 15
xlsx | Adrian A. Dragulescu | Read, write, and format Excel 2007 (xlsx) files | 2
XML | Duncan Temple Lang | Tools for parsing and generating XML within R and S-Plus | 2


APPENDIX G Working with large datasets
■Use matrices rather than data frames (they have less overhead).
■When using the read.table() family of functions to input external data into data frames, specify the colClasses and nrows options explicitly, set comment.char = "", and specify "NULL" for columns that aren’t needed. This will decrease memory usage and speed up processing considerably. When reading external data into a matrix, use the scan() function instead.
■Test programs on subsets of the data, in order to optimize code and remove bugs, before attempting a run on the full dataset.
■Delete temporary objects and objects that are no longer needed. The call rm(list=ls()) will remove all objects from memory, providing a clean slate. Specific objects can be removed with rm(object).
■Use the function .ls.objects() described in Jeromy Anglim’s blog entry “Memory Management in R: A Few Tips and Tricks” (jeromyanglim.blogspot.com) to list all workspace objects sorted by size (MB). This function will help you find and deal with memory hogs.
■Profile your programs to see how much time is being spent in each function. You can accomplish this with the Rprof() and summaryRprof() functions. The system.time() function can also help. The profr and proftools packages provide functions that can help in analyzing profiling output.
■The Rcpp package can be used to transfer R objects to C++ functions and back when more optimized subroutines are needed.
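A couple of the tips above are easier to see in a small example. The sketch below is only an illustration: it writes a throwaway CSV file (the column names id, score, and notes are made up), reads it back with colClasses, nrows, and comment.char set, drops an unneeded column with "NULL", and then checks the object's size and times a simple computation.

```r
# Create a small CSV file to stand in for a large external file.
# The column names here are invented for the illustration.
set.seed(1234)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000,
                     score = round(rnorm(1000), 2),
                     notes = "free text we do not need"),
          path, row.names = FALSE)

# Declare the column types up front, skip the third column with "NULL",
# supply the number of rows, and turn off comment scanning.
dat <- read.table(path, header = TRUE, sep = ",",
                  colClasses = c("integer", "numeric", "NULL"),
                  nrows = 1000, comment.char = "")
str(dat)

# How much memory does the result occupy?
print(object.size(dat), units = "Kb")

# How long does a simple computation take?
print(system.time(colMeans(dat)))

# Clean up the temporary file.
unlink(path)
```

On a file this small the savings are negligible. The point is the pattern, which you would apply unchanged to a file with millions of rows.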
With large datasets, increasing code efficiency will only get you so far. When bumping up against memory limits, you can also store your data externally and use specialized analysis routines.
G.2 Storing data outside of RAM
There are several packages available for storing data outside of R’s main memory. The strategy involves storing data in external databases or in binary flat files on disk, and then accessing portions as they are needed. Several useful packages are described in table G.1.
Table G.1 R packages for accessing large datasets
Package | Description
ff | Provides data structures that are stored on disk but behave as if they were in RAM.
bigmemory | Supports the creation, storage, access, and manipulation of massive matrices. Matrices are allocated to shared memory and memory-mapped files.
filehash | Implements a simple key-value database where character string keys are associated with data values stored on disk.

Table G.1 R packages for accessing large datasets (continued)
Package | Description
ncdf, ncdf4 | Provides an interface to Unidata netCDF data files.
RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite | Each provides access to external relational database management systems.
The packages above help overcome R’s memory limits on data storage. However, specialized methods are also needed when attempting to analyze large datasets in a reasonable length of time. Some of the most useful are described below.
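The general strategy in this section (keep the data in a binary flat file on disk and pull in only the portion you need) can be sketched with base R alone. This is not how any of the packages in table G.1 are implemented. The helper read_slice() is a hypothetical function written just for this illustration, and it assumes the file holds nothing but 8-byte doubles.

```r
# Write a numeric vector to a binary flat file (8 bytes per value).
path <- tempfile(fileext = ".bin")
x <- as.numeric(seq_len(1e5))
writeBin(x, path, size = 8)

# Hypothetical helper: read elements first..last without loading
# the whole file into memory.
read_slice <- function(path, first, last, size = 8) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Jump to the byte offset of the first requested element.
  seek(con, where = (first - 1) * size, origin = "start")
  readBin(con, what = "double", n = last - first + 1, size = size)
}

# Only five values are read from disk here.
chunk <- read_slice(path, 50001, 50005)
print(chunk)

# Clean up the temporary file.
unlink(path)
```

Packages such as ff and bigmemory wrap this kind of access in data structures that behave like ordinary R objects, which is why you'd normally reach for them rather than rolling your own.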
G.3 Analytic packages for large datasets
R provides several packages for the analysis of large datasets:
■The biglm and speedglm packages fit linear and generalized linear models to large datasets in a memory-efficient manner. They offer lm() and glm() type functionality when dealing with massive datasets.
■Several packages offer analytic functions for working with the massive matrices produced by the bigmemory package. The biganalytics package offers k-means clustering, column statistics, and a wrapper to biglm. The bigtabulate package provides table(), split(), and tapply() functionality, and the bigalgebra package provides advanced linear algebra functions.
■The biglars package offers least-angle regression, lasso, and stepwise regression for datasets that are too large to be held in memory, when used in conjunction with the ff package.
■The Brobdingnag package can be used to manipulate large numbers (numbers larger than 2^1024).
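To give a feel for the first item in the list, here's a minimal sketch of fitting a linear model in pieces with biglm. It assumes the biglm package is installed (the code is skipped otherwise), and the simulated data and variable names (x1, x2, y) are made up for the illustration. In practice each chunk would be read from a file or database rather than split from a data frame already in memory.

```r
# Simulate a dataset and split it into three chunks, as if each chunk
# had been read from disk separately.
set.seed(1234)
n <- 3000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)
chunks <- split(dat, rep(1:3, each = n / 3))

if (requireNamespace("biglm", quietly = TRUE)) {
  # Fit on the first chunk, then fold in the remaining chunks one at a time.
  fit <- biglm::biglm(y ~ x1 + x2, data = chunks[[1]])
  for (chunk in chunks[-1]) {
    fit <- update(fit, chunk)
  }
  print(coef(fit))

  # For comparison: the ordinary in-memory fit on all rows at once.
  print(coef(lm(y ~ x1 + x2, data = dat)))
}
```

The two sets of coefficients should agree closely, but the chunked fit never needs more than one chunk in memory at a time.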
Working with datasets in the gigabyte to terabyte range can be challenging in any language. For more information on the methods available within R, see the CRAN Task View: High-Performance and Parallel Computing with R (cran.r-project.org/web/views/).

APPENDIX H Updating an R installation
1 If you have a customized Rprofile.site file (see appendix B), save a copy outside of R.
2 Launch your current version of R and issue the following statements:
oldip <- installed.packages()[,1]
save(oldip, file="path/installedPackages.Rdata")
where path is a directory outside of R.
3 Download and install the newer version of R.
4 If you saved a customized version of the Rprofile.site file in step 1, copy it into the new installation.
5 Launch the new version of R, and issue the following statements:
load("path/installedPackages.Rdata")
newip <- installed.packages()[,1]
for(i in setdiff(oldip, newip))
  install.packages(i)
where path is the location specified in step 2.
6 Delete the old installation (optional).
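If you'd like to see what the comparison in step 5 does before running it for real, you can try it on made-up package lists. In the sketch below the two vectors are invented stand-ins for the results of installed.packages()[,1], and the install.packages() call is left commented out so that nothing is actually installed.

```r
# Made-up package lists standing in for installed.packages()[,1]
oldip <- c("car", "psych", "vcd", "globaltest")
newip <- c("psych")

# Packages present in the old installation but not in the new one
todo <- setdiff(oldip, newip)
print(todo)

# install.packages() accepts a character vector, so the loop in step 5
# could also be written as a single call:
# install.packages(todo)
```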
This approach will install only packages that are available from CRAN. It won’t find packages obtained from other locations. You’ll have to find and download these separately. Luckily, the process will display a list of packages that can’t be installed. During my last installation, globaltest and Biobase couldn’t be found. Since I got them from the Bioconductor site, I was able to install them via the code
source("http://bioconductor.org/biocLite.R")
biocLite("globaltest")
biocLite("Biobase")
Step 6 involves the optional deletion of the old installation. On a Windows machine, more than one version of R can be installed at a time. If desired, uninstall the older version via Start > Control Panel > Uninstall a Program. On Mac and Linux platforms, the new version of R will overwrite the older version. To delete any remnants on a Mac, use the Finder to go to the /Library/Frameworks/R.framework/Versions/ directory and delete the folder representing the older version. On a Linux platform, it’s probably best to leave well enough alone.
Clearly, updating an existing version of R is more involved than is desirable for such a sophisticated piece of software. I’m hopeful that someday this appendix will simply say “Select the Check for Updates… option” to update an R installation.
