
Robert I. Kabacoff - R in action


APPENDIX F Packages used in this book

Table F.1 Contributed packages used in this book (continued)

Package	Authors	Description	Chapters
psych	William Revelle	Procedures for psychological, psychometric, and personality research	7, 14
pwr	Stephane Champely	Basic functions for power analysis	10
qcc	Luca Scrucca	Quality control charts	13
randomLCA	Ken Beath	Random effects latent class analysis	14
Rcmdr	John Fox, with contributions from Liviu Andronic, Michael Ash, Theophilius Boye, Stefano Calza, Andy Chang, Philippe Grosjean, Richard Heiberger, G. Jay Kerns, Renaud Lancelot, Matthieu Lesnoff, Uwe Ligges, Samir Messad, Martin Maechler, Robert Muenchen, Duncan Murdoch, Erich Neuwirth, Dan Putler, Brian Ripley, Miroslav Ristic, and Peter Wolf	R Commander, a platform-independent basic-statistics graphical user interface for R, based on the tcltk package	11
reshape	Hadley Wickham	Flexibly reshape data	4, 5, 7
rggobi	Duncan Temple Lang, Debby Swayne, Hadley Wickham, and Michael Lawrence	An interface between R and GGobi	16
rgl	Daniel Adler and Duncan Murdoch	3D visualization device system (OpenGL)	11
RJDBC	Simon Urbanek	Provides access to databases through the JDBC interface	2
rms	Frank E. Harrell, Jr.	Regression modeling strategies: about 225 functions that assist with and streamline regression modeling, testing, estimations, validation, graphics, prediction, and typesetting	13
robust	Jiahui Wang, Ruben Zamar, Alfio Marazzi, Victor Yohai, Matias Salibian-Barrera, Ricardo Maronna, Eric Zivot, David Rocke, Doug Martin, Martin Maechler, and Kjell Konis	A package of robust methods	13
RODBC	Brian Ripley and Michael Lapsley	ODBC database access	2
ROracle	David A. James and Jake Luciani	Oracle database interface for R	2
rrcov	Valentin Todorov	Robust location and scatter estimation and robust multivariate analysis with high breakdown point	9
sampling	Yves Tillé and Alina Matei	Functions for drawing and calibrating samples	4
scatterplot3d	Uwe Ligges	Plots a three dimensional (3D) point cloud	11
sem	John Fox with contributions from Adam Kramer and Michael Friendly	Structural equation models	14
SeqKnn	Ki-Yeol Kim and Gwan-Su Yi, CSBio lab., Information and Communications University	Sequential KNN imputation method	15
sm	Adrian Bowman and Adelchi Azzalini. Ported to R by B. D. Ripley up to version 2.0, version 2.1 by Adrian Bowman and Adelchi Azzalini, version 2.2 by Adrian Bowman.	Smoothing methods for nonparametric regression and density estimation	6, 9
vcd	David Meyer, Achim Zeileis, and Kurt Hornik	Functions for visualizing categorical data	1, 6, 7, 11, 12
vegan	Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, R. B. O’Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, and Helene Wagner	Ordination methods, diversity analysis, and other functions for community and vegetation ecologists	9
VIM	Matthias Templ, Andreas Alfons, and Alexander Kowarik	Visualization and imputation of missing values	15
xlsx	Adrian A. Dragulescu	Read, write, and format Excel 2007 (xlsx) files	2
XML	Duncan Temple Lang	Tools for parsing and generating XML within R and S-Plus	2

appendix G Working with large datasets

R holds all of its objects in virtual memory. For most of us, this design decision has led to a zippy interactive experience, but for analysts working with large datasets, it can lead to slow program execution and memory-related errors.

Memory limits depend primarily on the R build (32-bit versus 64-bit) and, for 32-bit Windows, on the OS version involved. Error messages starting with "cannot allocate vector of size" typically indicate a failure to obtain sufficient contiguous memory, while error messages starting with "cannot allocate vector of length" indicate that an address limit has been exceeded. When working with large datasets, use a 64-bit build if at all possible. For all builds, the number of elements in a vector is limited to 2,147,483,647 (see ?Memory for more information).
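To make these limits concrete, here is a small sketch using only base R. It shows that the element limit quoted above equals .Machine$integer.max and reports how much memory a vector of one million doubles occupies. The variable names are illustrative. Note that R 3.0.0 and later also support "long vectors" on 64-bit builds, so the quoted limit reflects the R versions this appendix describes.

```r
# Largest element count for an ordinary vector: 2^31 - 1
max_len <- .Machine$integer.max
print(max_len)

# Memory used by one million doubles (8 bytes each plus a small header)
x <- numeric(1e6)
x_size <- object.size(x)
print(x_size, units = "Mb")

# Free the object and ask the garbage collector to reclaim the memory
rm(x)
invisible(gc())
```

Scaling the second figure up gives a rough sense of how quickly a wide numeric dataset can exhaust the memory available to a 32-bit build.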

There are three issues to consider when working with large datasets: (a) efficient programming to speed execution, (b) storing data externally to limit memory issues, and (c) using specialized statistical routines designed to efficiently analyze massive amounts of data. We will briefly consider each.

G.1 Efficient programming

There are a number of programming tips that improve performance when working with large datasets.

■ Vectorize calculations when possible. Use R’s built-in functions for manipulating vectors, matrices, and lists (for example, sapply, lapply, and mapply) and avoid loops (for and while) when feasible.

■ Use matrices rather than data frames (they have less overhead).

■ When using the read.table() family of functions to input external data into data frames, specify the colClasses and nrows options explicitly, set comment.char = "", and specify "NULL" for columns that aren’t needed. This will decrease memory usage and speed up processing considerably. When reading external data into a matrix, use the scan() function instead.

■ Test programs on subsets of the data, in order to optimize code and remove bugs, before attempting a run on the full dataset.

■ Delete temporary objects and objects that are no longer needed. The call rm(list=ls()) will remove all objects from memory, providing a clean slate. Specific objects can be removed with rm(object).

■ Use the function .ls.objects() described in Jeromy Anglim’s blog entry “Memory Management in R: A Few Tips and Tricks” (jeromyanglim.blogspot.com), to list all workspace objects sorted by size (MB). This function will help you find and deal with memory hogs.

■ Profile your programs to see how much time is being spent in each function. You can accomplish this with the Rprof() and summaryRprof() functions. The system.time() function can also help. The profr and proftools packages provide functions that can help in analyzing profiling output.

■ The Rcpp package can be used to transfer R objects to C++ functions and back when more optimized subroutines are needed.
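Several of the tips above can be tried in a few lines of base R. The following is a minimal sketch, not a benchmark: it times a loop against the equivalent vectorized expression with system.time(), reads a small temporary file with explicit colClasses, nrows, and comment.char settings (dropping one column with "NULL"), and then removes objects that are no longer needed. The data, file, and variable names are made up for illustration.

```r
# 1. Vectorized calculation versus an explicit loop, timed with system.time()
n <- 1e5
x <- runif(n)

loop_time <- system.time({
  squares_loop <- numeric(n)                      # preallocate the result
  for (i in seq_len(n)) squares_loop[i] <- x[i]^2
})
vec_time <- system.time(squares_vec <- x^2)

same_result <- isTRUE(all.equal(squares_loop, squares_vec))
print(c(loop = loop_time[["elapsed"]], vectorized = vec_time[["elapsed"]]))

# 2. read.table() with explicit column classes, row count, and no comment
#    scanning; the "NULL" class skips the unneeded 'note' column
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3,
                     note = c("a", "b", "c"),
                     value = c(1.5, 2.5, 3.5)),
          tmp, row.names = FALSE)

dat <- read.table(tmp, header = TRUE, sep = ",",
                  colClasses = c("integer", "NULL", "numeric"),
                  nrows = 3, comment.char = "")
str(dat)
unlink(tmp)

# 3. Remove temporary objects once they are no longer needed
rm(squares_loop, x)
```

On real data the gap between the two timings grows with the size of the vector, and the read.table() settings matter most for files with many rows or many unused columns.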

With large datasets, increasing code efficiency will only get you so far. When bumping up against memory limits, you can also store your data externally and use specialized analysis routines.

G.2 Storing data outside of RAM

There are several packages available for storing data outside of R’s main memory. The strategy involves storing data in external databases or in binary flat files on disk, and then accessing portions as they are needed. Several useful packages are described in table G.1.

Table G.1 R packages for accessing large datasets

Package	Description
ff	Provides data structures that are stored on disk but behave as if they were in RAM.
bigmemory	Supports the creation, storage, access, and manipulation of massive matrices. Matrices are allocated to shared memory and memory-mapped files.
filehash	Implements a simple key-value database where character string keys are associated with data values stored on disk.
ncdf, ncdf4	Provides an interface to Unidata netCDF data files.
RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite	Each provides access to external relational database management systems.
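As a hedged illustration of the external-storage strategy, the sketch below assumes the DBI and RSQLite packages are installed (the code skips itself otherwise). It writes the built-in mtcars data frame to an on-disk SQLite file and then pulls back only the rows that match a query. The file location, table name, and query are illustrative choices, and mtcars merely stands in for a table too large to hold in memory.

```r
# Sketch: keep data on disk in SQLite and load only the rows you need.
# Assumes DBI and RSQLite are installed; the example is skipped otherwise.
high_mpg <- NULL
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  db_file <- tempfile(fileext = ".sqlite")        # illustrative location
  con <- DBI::dbConnect(RSQLite::SQLite(), db_file)

  # Store the full dataset on disk
  DBI::dbWriteTable(con, "cars", mtcars)

  # Retrieve only a portion of it into memory
  high_mpg <- DBI::dbGetQuery(
    con, "SELECT mpg, wt, hp FROM cars WHERE mpg > 30")
  print(high_mpg)

  DBI::dbDisconnect(con)
  unlink(db_file)
}
```

The same pattern applies to the other database packages in the table; only the connection call changes.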

The packages above help overcome R’s memory limits on data storage. However, specialized methods are also needed when attempting to analyze large datasets in a reasonable length of time. Some of the most useful are described below.

G.3 Analytic packages for large datasets

R provides several packages for the analysis of large datasets:

■ The biglm and speedglm packages fit linear and generalized linear models to large datasets in a memory efficient manner. This offers lm() and glm() type functionality when dealing with massive datasets.

■ Several packages offer analytic functions for working with the massive matrices produced by the bigmemory package. The biganalytics package offers k-means clustering, column statistics, and a wrapper to biglm. The bigtabulate package provides table(), split(), and tapply() functionality and the bigalgebra package provides advanced linear algebra functions.

■ The biglars package offers least-angle regression, lasso, and stepwise regression for datasets that are too large to be held in memory, when used in conjunction with the ff package.

■ The Brobdingnag package can be used to manipulate large numbers (numbers larger than 2^1024).
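To show what memory-efficient model fitting looks like in practice, here is a small sketch that assumes the biglm package is installed (it is skipped otherwise). The model is fitted on one chunk of data and then updated with a second chunk, so the full dataset never has to be in memory at once. Splitting mtcars into two halves is purely illustrative; the result is compared with an ordinary lm() fit on all rows.

```r
# Sketch: fit a linear model chunk by chunk with biglm.
# Assumes the biglm package is installed; the example is skipped otherwise.
fit_big <- NULL
if (requireNamespace("biglm", quietly = TRUE)) {
  chunk1 <- mtcars[1:16, ]       # first "chunk" of the data
  chunk2 <- mtcars[17:32, ]      # second "chunk", processed later

  fit_big <- biglm::biglm(mpg ~ wt + hp, data = chunk1)
  fit_big <- update(fit_big, chunk2)   # fold in the next chunk

  # Reference fit using all rows at once
  fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
  print(cbind(biglm = coef(fit_big), lm = coef(fit_lm)))
}
```

With real data, each chunk would typically be read from a file or database in turn rather than sliced from a data frame already in memory.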

Working with datasets in the gigabyte to terabyte range can be challenging in any language. For more information on the methods available within R, see the CRAN Task View: High-Performance and Parallel Computing with R (cran.r-project.org/web/views/).

appendix H Updating an R installation

As consumers, we take for granted that we can update a piece of software via a “Check for updates…” option. In chapter 1, I noted that the update.packages() function can be used to download and install the most recent version of a contributed package. Unfortunately, there’s no corresponding function for updating the R installation itself. If you want to update an R installation from version 4.1.0 to 5.1.1, you must get creative. (As I write this, the current version is actually 2.13.0, but I want this book to appear hip and current for years to come).

Downloading and installing the latest version of R from CRAN (http://cran.r-project.org/bin/) is relatively straightforward. The complicating factor is that customizations (including previously installed contributed packages) will not be included in the new installation. In my current set-up, I have 248 contributed packages installed. I really don’t want to have to write their names down and reinstall them by hand the next time I upgrade my R installation.

There has been much discussion on the web concerning the most elegant and efficient way to update an R installation. The method described below is neither elegant nor efficient, but I find that it works well on a variety of platforms (Windows, Mac, and Linux).

In this approach, the installed.packages() function is used to save a list of packages to a location outside of the R directory tree, and then the list is used with the install.packages() function to download and install the latest contributed packages into the new R installation. Here are the steps:


1. If you have a customized Rprofile.site file (see appendix B), save a copy outside of R.

2. Launch your current version of R and issue the following statements

oldip <- installed.packages()[,1]
save(oldip, file="path/installedPackages.Rdata")

where path is a directory outside of R.

3. Download and install the newer version of R.

4. If you saved a customized version of the Rprofile.site file in step 1, copy it into the new installation.

5. Launch the new version of R, and issue the following statements

load("path/installedPackages.Rdata")
newip <- installed.packages()[,1]
for(i in setdiff(oldip, newip))
    install.packages(i)

where path is the location specified in step 2.

6. Delete the old installation (optional).
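The steps above can be sketched as a pair of small helper functions. This is one possible arrangement rather than the author's code: the function names save_package_list() and missing_packages() are hypothetical, and the temporary file stands in for a location outside the R directory tree. The install step is left as a comment so that running the sketch does not download anything.

```r
# Step 2 (old R): save the names of all installed packages to a file
save_package_list <- function(path) {
  oldip <- installed.packages()[, "Package"]
  save(oldip, file = path)
  invisible(oldip)
}

# Step 5 (new R): list saved packages that this installation lacks
missing_packages <- function(path) {
  env <- new.env()
  load(path, envir = env)                  # restores the object named oldip
  newip <- installed.packages()[, "Package"]
  setdiff(env$oldip, newip)
}

# Illustrative round trip within a single R session, using a temporary file
pkg_file <- file.path(tempdir(), "installedPackages.Rdata")
save_package_list(pkg_file)
to_install <- missing_packages(pkg_file)
print(to_install)

# In the new installation you would then run (not executed here):
# install.packages(to_install)
```

Passing the whole character vector to install.packages() in one call is equivalent to the loop in step 5 and lets R resolve dependencies in a single pass.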

This approach will install only packages that are available from CRAN. It won’t find packages obtained from other locations. You’ll have to find and download these separately. Luckily, the process will display a list of packages that can’t be installed. During my last installation, globaltest and Biobase couldn’t be found. Since I got them from the Bioconductor site, I was able to install them via the code

source("http://bioconductor.org/biocLite.R")
biocLite("globaltest")
biocLite("Biobase")

Step 6 involves the optional deletion of the old installation. On a Windows machine, more than one version of R can be installed at a time. If desired, uninstall the older version via Start > Control Panel > Uninstall a Program. On Mac and Linux platforms, the new version of R will overwrite the older version. To delete any remnants on a Mac, use the Finder to go to the /Library/Frameworks/R.framework/Versions/ directory and delete the folder representing the older version. On a Linux platform, it’s probably best to leave well enough alone.

Clearly, updating an existing version of R is more involved than is desirable for such a sophisticated piece of software. I’m hopeful that someday this appendix will simply say “Select the Check for Updates… option” to update an R installation.

index

Symbol

! operator	77
!= operator	77
# symbol	8
%a symbol	81
%A symbol	81
%B symbol	82
%b symbol	82
%d symbol	81
%m symbol	81
%Y symbol	82
%y symbol	82
* operator	75, 178
** operator	75
... option	58, 61
. symbol	178
/ operator	75
: symbol	178
? function	11
?? function	11
^ operator	75, 178, 181
~ symbol	178
+ operator	75, 178
< operator	77
<<- operator	29
<= operator	77
== operator	77
> operator	77
>= operator	77
-1 symbol	178
brackets	29
3D pie charts	127
3D scatter plots	274–278

A

abline( ) function	60, 265
abs( ) function	93
absolute widths	67
acos( ) function	93
acosh( ) function	93
AER package	421
aggr( ) function, VIM package	357
aggregate( ) function	113, 240
aggregating data	112–113
AIC( ) function	179, 208
all subsets regression	210, 213
alpha option	390
alternative= option	255
Amelia package	365, 369, 421
analyses, excluding missing values from	80–81
analysis of covariance (ANCOVA)
	one-way	230–233
		assessing test assumptions	232
		visualizing results	232–233
	overview	222
analysis of variance (ANOVA)	219–245, 252–253
	fitting models	222–225
		aov( ) function	222–223
		order of formula terms	223–225
	MANOVA	239–243
		assessing test assumptions	241–242
		robust	242–243
	one-way	225–230
		assessing test assumptions	229–230
		multiple comparisons	227–229
	one-way ANCOVA	230–233
		assessing test assumptions	232
		visualizing results	232–233
	as regression	243–245
	repeated measures	237–239
	terminology of	220–222
	two-way factorial	234–236
analytic packages, for large datasets	431
ANCOVA. See analysis of covariance
ancova( ) function, HH package	232
AND operator	77
annotating datasets	42
annotations	62–64
ANOVA. See analysis of variance
anova( ) function	179, 208
Anova( ) function, car package	225, 239
aov( ) function	222–223
append option	13
apply( ) function	102–103
apropos( ) function	11
aq.plot( ) function, mvoutlier package	242
arithmetic operators	75
arrayImpute package	370, 421
arrayMissPattern package	370, 421
arrays	26–27
Arthritis dataset	19
as.character( ) function	83
ASCII file	35
as.datatype( ) function	84
as.Date( ) function	81, 88
asin( ) function	93
asinh( ) function	93
aspect option	378
assumptions
	linear model, global validation of	199
	of MANOVA tests, assessing	241–242
	of OLS regression, assessing	188–199
	of one-way ANCOVA tests, assessing	232
	of one-way ANOVA tests, assessing	229–230
asypow package	261
