- •brief contents
- •contents
- •preface
- •acknowledgments
- •about this book
- •What’s new in the second edition
- •Who should read this book
- •Roadmap
- •Advice for data miners
- •Code examples
- •Code conventions
- •Author Online
- •About the author
- •about the cover illustration
- •1 Introduction to R
- •1.2 Obtaining and installing R
- •1.3 Working with R
- •1.3.1 Getting started
- •1.3.2 Getting help
- •1.3.3 The workspace
- •1.3.4 Input and output
- •1.4 Packages
- •1.4.1 What are packages?
- •1.4.2 Installing a package
- •1.4.3 Loading a package
- •1.4.4 Learning about a package
- •1.5 Batch processing
- •1.6 Using output as input: reusing results
- •1.7 Working with large datasets
- •1.8 Working through an example
- •1.9 Summary
- •2 Creating a dataset
- •2.1 Understanding datasets
- •2.2 Data structures
- •2.2.1 Vectors
- •2.2.2 Matrices
- •2.2.3 Arrays
- •2.2.4 Data frames
- •2.2.5 Factors
- •2.2.6 Lists
- •2.3 Data input
- •2.3.1 Entering data from the keyboard
- •2.3.2 Importing data from a delimited text file
- •2.3.3 Importing data from Excel
- •2.3.4 Importing data from XML
- •2.3.5 Importing data from the web
- •2.3.6 Importing data from SPSS
- •2.3.7 Importing data from SAS
- •2.3.8 Importing data from Stata
- •2.3.9 Importing data from NetCDF
- •2.3.10 Importing data from HDF5
- •2.3.11 Accessing database management systems (DBMSs)
- •2.3.12 Importing data via Stat/Transfer
- •2.4 Annotating datasets
- •2.4.1 Variable labels
- •2.4.2 Value labels
- •2.5 Useful functions for working with data objects
- •2.6 Summary
- •3 Getting started with graphs
- •3.1 Working with graphs
- •3.2 A simple example
- •3.3 Graphical parameters
- •3.3.1 Symbols and lines
- •3.3.2 Colors
- •3.3.3 Text characteristics
- •3.3.4 Graph and margin dimensions
- •3.4 Adding text, customized axes, and legends
- •3.4.1 Titles
- •3.4.2 Axes
- •3.4.3 Reference lines
- •3.4.4 Legend
- •3.4.5 Text annotations
- •3.4.6 Math annotations
- •3.5 Combining graphs
- •3.5.1 Creating a figure arrangement with fine control
- •3.6 Summary
- •4 Basic data management
- •4.1 A working example
- •4.2 Creating new variables
- •4.3 Recoding variables
- •4.4 Renaming variables
- •4.5 Missing values
- •4.5.1 Recoding values to missing
- •4.5.2 Excluding missing values from analyses
- •4.6 Date values
- •4.6.1 Converting dates to character variables
- •4.6.2 Going further
- •4.7 Type conversions
- •4.8 Sorting data
- •4.9 Merging datasets
- •4.9.1 Adding columns to a data frame
- •4.9.2 Adding rows to a data frame
- •4.10 Subsetting datasets
- •4.10.1 Selecting (keeping) variables
- •4.10.2 Excluding (dropping) variables
- •4.10.3 Selecting observations
- •4.10.4 The subset() function
- •4.10.5 Random samples
- •4.11 Using SQL statements to manipulate data frames
- •4.12 Summary
- •5 Advanced data management
- •5.2 Numerical and character functions
- •5.2.1 Mathematical functions
- •5.2.2 Statistical functions
- •5.2.3 Probability functions
- •5.2.4 Character functions
- •5.2.5 Other useful functions
- •5.2.6 Applying functions to matrices and data frames
- •5.3 A solution for the data-management challenge
- •5.4 Control flow
- •5.4.1 Repetition and looping
- •5.4.2 Conditional execution
- •5.5 User-written functions
- •5.6 Aggregation and reshaping
- •5.6.1 Transpose
- •5.6.2 Aggregating data
- •5.6.3 The reshape2 package
- •5.7 Summary
- •6 Basic graphs
- •6.1 Bar plots
- •6.1.1 Simple bar plots
- •6.1.2 Stacked and grouped bar plots
- •6.1.3 Mean bar plots
- •6.1.4 Tweaking bar plots
- •6.1.5 Spinograms
- •6.2 Pie charts
- •6.3 Histograms
- •6.4 Kernel density plots
- •6.5 Box plots
- •6.5.1 Using parallel box plots to compare groups
- •6.5.2 Violin plots
- •6.6 Dot plots
- •6.7 Summary
- •7 Basic statistics
- •7.1 Descriptive statistics
- •7.1.1 A menagerie of methods
- •7.1.2 Even more methods
- •7.1.3 Descriptive statistics by group
- •7.1.4 Additional methods by group
- •7.1.5 Visualizing results
- •7.2 Frequency and contingency tables
- •7.2.1 Generating frequency tables
- •7.2.2 Tests of independence
- •7.2.3 Measures of association
- •7.2.4 Visualizing results
- •7.3 Correlations
- •7.3.1 Types of correlations
- •7.3.2 Testing correlations for significance
- •7.3.3 Visualizing correlations
- •7.4 T-tests
- •7.4.3 When there are more than two groups
- •7.5 Nonparametric tests of group differences
- •7.5.1 Comparing two groups
- •7.5.2 Comparing more than two groups
- •7.6 Visualizing group differences
- •7.7 Summary
- •8 Regression
- •8.1 The many faces of regression
- •8.1.1 Scenarios for using OLS regression
- •8.1.2 What you need to know
- •8.2 OLS regression
- •8.2.1 Fitting regression models with lm()
- •8.2.2 Simple linear regression
- •8.2.3 Polynomial regression
- •8.2.4 Multiple linear regression
- •8.2.5 Multiple linear regression with interactions
- •8.3 Regression diagnostics
- •8.3.1 A typical approach
- •8.3.2 An enhanced approach
- •8.3.3 Global validation of linear model assumption
- •8.3.4 Multicollinearity
- •8.4 Unusual observations
- •8.4.1 Outliers
- •8.4.3 Influential observations
- •8.5 Corrective measures
- •8.5.1 Deleting observations
- •8.5.2 Transforming variables
- •8.5.3 Adding or deleting variables
- •8.5.4 Trying a different approach
- •8.6 Selecting the “best” regression model
- •8.6.1 Comparing models
- •8.6.2 Variable selection
- •8.7 Taking the analysis further
- •8.7.1 Cross-validation
- •8.7.2 Relative importance
- •8.8 Summary
- •9 Analysis of variance
- •9.1 A crash course on terminology
- •9.2 Fitting ANOVA models
- •9.2.1 The aov() function
- •9.2.2 The order of formula terms
- •9.3.1 Multiple comparisons
- •9.3.2 Assessing test assumptions
- •9.4 One-way ANCOVA
- •9.4.1 Assessing test assumptions
- •9.4.2 Visualizing the results
- •9.6 Repeated measures ANOVA
- •9.7 Multivariate analysis of variance (MANOVA)
- •9.7.1 Assessing test assumptions
- •9.7.2 Robust MANOVA
- •9.8 ANOVA as regression
- •9.9 Summary
- •10 Power analysis
- •10.1 A quick review of hypothesis testing
- •10.2 Implementing power analysis with the pwr package
- •10.2.1 t-tests
- •10.2.2 ANOVA
- •10.2.3 Correlations
- •10.2.4 Linear models
- •10.2.5 Tests of proportions
- •10.2.7 Choosing an appropriate effect size in novel situations
- •10.3 Creating power analysis plots
- •10.4 Other packages
- •10.5 Summary
- •11 Intermediate graphs
- •11.1 Scatter plots
- •11.1.3 3D scatter plots
- •11.1.4 Spinning 3D scatter plots
- •11.1.5 Bubble plots
- •11.2 Line charts
- •11.3 Corrgrams
- •11.4 Mosaic plots
- •11.5 Summary
- •12 Resampling statistics and bootstrapping
- •12.1 Permutation tests
- •12.2 Permutation tests with the coin package
- •12.2.2 Independence in contingency tables
- •12.2.3 Independence between numeric variables
- •12.2.5 Going further
- •12.3 Permutation tests with the lmPerm package
- •12.3.1 Simple and polynomial regression
- •12.3.2 Multiple regression
- •12.4 Additional comments on permutation tests
- •12.5 Bootstrapping
- •12.6 Bootstrapping with the boot package
- •12.6.1 Bootstrapping a single statistic
- •12.6.2 Bootstrapping several statistics
- •12.7 Summary
- •13 Generalized linear models
- •13.1 Generalized linear models and the glm() function
- •13.1.1 The glm() function
- •13.1.2 Supporting functions
- •13.1.3 Model fit and regression diagnostics
- •13.2 Logistic regression
- •13.2.1 Interpreting the model parameters
- •13.2.2 Assessing the impact of predictors on the probability of an outcome
- •13.2.3 Overdispersion
- •13.2.4 Extensions
- •13.3 Poisson regression
- •13.3.1 Interpreting the model parameters
- •13.3.2 Overdispersion
- •13.3.3 Extensions
- •13.4 Summary
- •14 Principal components and factor analysis
- •14.1 Principal components and factor analysis in R
- •14.2 Principal components
- •14.2.1 Selecting the number of components to extract
- •14.2.2 Extracting principal components
- •14.2.3 Rotating principal components
- •14.2.4 Obtaining principal components scores
- •14.3 Exploratory factor analysis
- •14.3.1 Deciding how many common factors to extract
- •14.3.2 Extracting common factors
- •14.3.3 Rotating factors
- •14.3.4 Factor scores
- •14.4 Other latent variable models
- •14.5 Summary
- •15 Time series
- •15.1 Creating a time-series object in R
- •15.2 Smoothing and seasonal decomposition
- •15.2.1 Smoothing with simple moving averages
- •15.2.2 Seasonal decomposition
- •15.3 Exponential forecasting models
- •15.3.1 Simple exponential smoothing
- •15.3.3 The ets() function and automated forecasting
- •15.4 ARIMA forecasting models
- •15.4.1 Prerequisite concepts
- •15.4.2 ARMA and ARIMA models
- •15.4.3 Automated ARIMA forecasting
- •15.5 Going further
- •15.6 Summary
- •16 Cluster analysis
- •16.1 Common steps in cluster analysis
- •16.2 Calculating distances
- •16.3 Hierarchical cluster analysis
- •16.4 Partitioning cluster analysis
- •16.4.2 Partitioning around medoids
- •16.5 Avoiding nonexistent clusters
- •16.6 Summary
- •17 Classification
- •17.1 Preparing the data
- •17.2 Logistic regression
- •17.3 Decision trees
- •17.3.1 Classical decision trees
- •17.3.2 Conditional inference trees
- •17.4 Random forests
- •17.5 Support vector machines
- •17.5.1 Tuning an SVM
- •17.6 Choosing a best predictive solution
- •17.7 Using the rattle package for data mining
- •17.8 Summary
- •18 Advanced methods for missing data
- •18.1 Steps in dealing with missing data
- •18.2 Identifying missing values
- •18.3 Exploring missing-values patterns
- •18.3.1 Tabulating missing values
- •18.3.2 Exploring missing data visually
- •18.3.3 Using correlations to explore missing values
- •18.4 Understanding the sources and impact of missing data
- •18.5 Rational approaches for dealing with incomplete data
- •18.6 Complete-case analysis (listwise deletion)
- •18.7 Multiple imputation
- •18.8 Other approaches to missing data
- •18.8.1 Pairwise deletion
- •18.8.2 Simple (nonstochastic) imputation
- •18.9 Summary
- •19 Advanced graphics with ggplot2
- •19.1 The four graphics systems in R
- •19.2 An introduction to the ggplot2 package
- •19.3 Specifying the plot type with geoms
- •19.4 Grouping
- •19.5 Faceting
- •19.6 Adding smoothed lines
- •19.7 Modifying the appearance of ggplot2 graphs
- •19.7.1 Axes
- •19.7.2 Legends
- •19.7.3 Scales
- •19.7.4 Themes
- •19.7.5 Multiple graphs per page
- •19.8 Saving graphs
- •19.9 Summary
- •20 Advanced programming
- •20.1 A review of the language
- •20.1.1 Data types
- •20.1.2 Control structures
- •20.1.3 Creating functions
- •20.2 Working with environments
- •20.3 Object-oriented programming
- •20.3.1 Generic functions
- •20.3.2 Limitations of the S3 model
- •20.4 Writing efficient code
- •20.5 Debugging
- •20.5.1 Common sources of errors
- •20.5.2 Debugging tools
- •20.5.3 Session options that support debugging
- •20.6 Going further
- •20.7 Summary
- •21 Creating a package
- •21.1 Nonparametric analysis and the npar package
- •21.1.1 Comparing groups with the npar package
- •21.2 Developing the package
- •21.2.1 Computing the statistics
- •21.2.2 Printing the results
- •21.2.3 Summarizing the results
- •21.2.4 Plotting the results
- •21.2.5 Adding sample data to the package
- •21.3 Creating the package documentation
- •21.4 Building the package
- •21.5 Going further
- •21.6 Summary
- •22 Creating dynamic reports
- •22.1 A template approach to reports
- •22.2 Creating dynamic reports with R and Markdown
- •22.3 Creating dynamic reports with R and LaTeX
- •22.4 Creating dynamic reports with R and Open Document
- •22.5 Creating dynamic reports with R and Microsoft Word
- •22.6 Summary
- •afterword Into the rabbit hole
- •appendix A Graphical user interfaces
- •appendix B Customizing the startup environment
- •appendix C Exporting data from R
- •Delimited text file
- •Excel spreadsheet
- •Statistical applications
- •appendix D Matrix algebra in R
- •appendix E Packages used in this book
- •appendix F Working with large datasets
- •F.1 Efficient programming
- •F.2 Storing data outside of RAM
- •F.3 Analytic packages for out-of-memory data
- •F.4 Comprehensive solutions for working with enormous datasets
- •appendix G Updating an R installation
- •G.1 Automated installation (Windows only)
- •G.2 Manual installation (Windows and Mac OS X)
- •G.3 Updating an R installation (Linux)
- •references
- •index
- •Symbols
- •Numerics
- •23 Advanced graphics with the lattice package
- •23.1 The lattice package
- •23.2 Conditioning variables
- •23.3 Panel functions
- •23.4 Grouping variables
- •23.5 Graphic parameters
- •23.6 Customizing plot strips
- •23.7 Page arrangement
- •23.8 Going further
appendix F Working with large datasets
R holds all of its objects in virtual memory. For most of us, this design decision has led to a zippy interactive experience, but for analysts working with large datasets, it can lead to slow program execution and memory-related errors.
Memory limits depend primarily on the R build (32- versus 64-bit) and the OS version involved. Error messages starting with “cannot allocate vector of size” typically indicate a failure to obtain sufficient contiguous memory, whereas error messages starting with “cannot allocate vector of length” indicate that an address limit has been exceeded. When working with large datasets, use a 64-bit build if at all possible. See ?Memory for more information.
There are three issues to consider when working with large datasets: efficient programming to speed execution, storing data externally to limit memory issues, and using specialized statistical routines designed to efficiently analyze massive amounts of data. First we’ll consider simple solutions for each. Then we’ll turn to more comprehensive (and complex) solutions for working with big data.
F.1 Efficient programming
A number of programming tips can help you improve performance when working with large datasets:
■Vectorize calculations when possible. Use R’s built-in functions for manipulating vectors, matrices, and lists (for example, ifelse, colMeans, and rowSums), and avoid loops (for and while) when feasible.
■Use matrices rather than data frames (they have less overhead).
■When using the read.table() family of functions to input external data into data frames, specify the colClasses and nrows options explicitly, set comment.char = "", and specify "NULL" in colClasses for columns that aren’t needed. This decreases memory usage and speeds up processing considerably. When reading external data into a matrix, use the scan() function instead. (See the sketch following this list.)
■Correctly size objects initially, rather than growing them from smaller objects by appending values.
■Use parallelization for repetitive, independent, and numerically intensive tasks.
■Test programs on a sample of the data, in order to optimize code and remove bugs, before attempting a run on the full dataset.
■Delete temporary objects and objects that are no longer needed. The call rm(list=ls()) removes all objects from memory, providing a clean slate. Specific objects can be removed with rm(object). After removing large objects, a call to gc() will initiate garbage collection, ensuring that the objects are removed from memory.
■Use the function .ls.objects() described in Jeromy Anglim’s blog entry “Memory Management in R: A Few Tips and Tricks” (jeromyanglim.blogspot.com) to list all workspace objects sorted by size (MB). This function will help you find and deal with memory hogs.
■Profile your programs to see how much time is being spent in each function. You can accomplish this with the Rprof() and summaryRprof() functions. The system.time() function can also help. The profr and proftools packages provide functions that can help in analyzing profiling output.
■Use compiled external routines to speed up program execution. You can use the Rcpp package to transfer R objects to C++ functions and back when more optimized subroutines are needed.
Section 20.4 offers examples of vectorization, efficient data input, correctly sizing objects, and parallelization.
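As a minimal, illustrative sketch of several of these tips (not a benchmark), the following R code contrasts a loop with a vectorized calculation, preallocates a result vector, shows an efficient read.table() call, and ends with a basic Rprof() profiling run. The file mydata.csv and its column layout are hypothetical, so that call is left commented out.

```r
# Vectorization: sum of squares via an explicit loop versus a built-in call
set.seed(1234)
x <- rnorm(1e7)

loopSumSq <- function(v) {
  total <- 0
  for (i in seq_along(v)) total <- total + v[i]^2
  total
}
system.time(s1 <- loopSumSq(x))   # interpreted loop: noticeably slower
system.time(s2 <- sum(x^2))       # vectorized: typically far faster
all.equal(s1, s2)                 # same answer either way

# Correctly size objects up front instead of growing them element by element
n <- 1e5
res <- numeric(n)                 # preallocated to its final length
for (i in seq_len(n)) res[i] <- sqrt(i)

# Efficient input (hypothetical file and columns): declare column classes,
# skip unneeded columns with "NULL", and supply nrows and comment.char
# df <- read.table("mydata.csv", header = TRUE, sep = ",",
#                  colClasses = c("numeric", "character", "NULL", "numeric"),
#                  nrows = 500000, comment.char = "")

# Profiling: find where the time goes
Rprof("profile.out")
invisible(loopSumSq(x))
Rprof(NULL)
summaryRprof("profile.out")
```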
With large datasets, increasing code efficiency will only get you so far. When you bump up against memory limits, you can also store your data externally and use specialized analysis routines.
F.2 Storing data outside of RAM
Several packages are available for storing data outside of R’s main memory. The strategy involves storing data in external databases or in binary flat files on disk and then accessing portions as needed. Several useful packages are described in table F.1.
Table F.1 R packages for accessing large datasets

| Package | Description |
| --- | --- |
| bigmemory | Supports the creation, storage, access, and manipulation of massive matrices. Matrices are allocated to shared memory and memory-mapped files. |
| ff | Provides data structures that are stored on disk but behave as if they’re in RAM. |
| filehash | Implements a simple key-value database where character string keys are associated with data values stored on disk. |
| ncdf, ncdf4 | Provide an interface to Unidata netCDF data files. |
| RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite | Each provides access to external relational database management systems. |
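As a minimal sketch of the disk-backed approach, assuming the bigmemory package is installed and using illustrative backing-file names:

```r
library(bigmemory)

# A 1,000,000 x 3 matrix backed by files on disk rather than held in RAM
x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                           backingfile    = "example.bin",
                           descriptorfile = "example.desc")

x[, 1] <- rnorm(1e6)   # data are written through to the backing file
mean(x[, 1])           # extracted portions behave like ordinary vectors
dim(x)                 # and x behaves much like an ordinary matrix
```

Because only the portions you index are brought into RAM, objects like this can be far larger than available memory.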
These packages help overcome R’s memory limits on data storage. But you also need specialized methods when you attempt to analyze large datasets in a reasonable length of time. Some of the most useful are described next.
F.3 Analytic packages for out-of-memory data
R provides several packages for the analysis of large datasets:
■The biglm and speedglm packages fit linear and generalized linear models to large datasets in a memory-efficient manner. This offers lm() and glm() type functionality when dealing with massive datasets.
■Several packages offer analytic functions for working with the massive matrices produced by the bigmemory package. The biganalytics package offers k-means clustering, column statistics, and a wrapper to biglm. The bigrf package can be used to fit classification and regression forests. The bigtabulate package provides table(), split(), and tapply() functionality, and the bigalgebra package provides advanced linear algebra functions.
■The biglars package offers least-angle regression, lasso, and stepwise regression for datasets that are too large to be held in memory, when used in conjunction with the ff package.
■The data.table package provides an enhanced version of data.frame that includes faster aggregation; faster ordered and overlapping range joins; and faster addition, modification, and deletion of columns by group, by reference (without making copies). You can use the data.table structure with large datasets (for example, 100 GB in RAM), and it’s compatible with any R function expecting a data frame. (A brief sketch of biglm and data.table follows this list.)
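As a brief sketch of two of these packages, assuming both are installed and using the built-in mtcars data for simplicity (small here, but the same interfaces scale up):

```r
library(biglm)
library(data.table)

# biglm: lm()-style formula interface with a compact, memory-efficient fit;
# update() can then add further chunks of data for files too big to load at once
fit <- biglm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# data.table: fast aggregation by group on a data.frame-compatible structure
dt <- as.data.table(mtcars)
dt[, .(avg_mpg = mean(mpg), n = .N), by = cyl]
```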
Each of these packages accommodates large datasets for specific purposes and is relatively easy to use. More comprehensive solutions for analyzing data in the terabyte range are described next.
F.4 Comprehensive solutions for working with enormous datasets
At least five projects have been designed to facilitate the use of R with terabyte-class datasets. Three are free and open source (RHIPE, RHadoop, and pbdR), and two are commercial products (Revolution R Enterprise with RevoScaleR and Oracle R Enterprise). Each requires some familiarity with high-performance computing.
The RHIPE package (www.datadr.org/) provides a programming environment that deeply integrates R and Hadoop (a free Java-based software framework for the processing of large datasets in a distributed computing environment). Additional software from the same authors provides “divide and recombine” methods and data visualization for very large datasets.
The RHadoop project offers a collection of R packages for managing and analyzing data with Hadoop. The rmr package provides Hadoop MapReduce functionality from within R, and the rhdfs and rhbase packages support access to HDFS file systems and HBase datastores. A wiki (https://github.com/RevolutionAnalytics/RHadoop/wiki) describes the project and provides tutorials. Note that RHadoop packages must be installed from GitHub rather than CRAN.
The pbdR (Programming with Big Data in R) project enables high-level data parallelism in R through a simple interface to scalable, high-performance libraries (such as MPI, ScaLAPACK, and netCDF4). The pbdR software also supports the single program, multiple data (SPMD) model on large-scale computing clusters. See http://r-pbd.org/ for details.
Revolution R Enterprise (www.revolutionanalytics.com) is a commercial version of R that includes RevoScaleR, a package supporting scalable data analyses and high-performance computing. RevoScaleR uses a binary XDF data file format to optimize streaming data from disk to memory, and it provides a series of big-data algorithms for common statistical analyses. You can perform data-management tasks and obtain summary statistics, cross-tabulations, correlations and covariances, nonparametric statistics, linear and generalized linear regression, stepwise regression, k-means clustering, and classification and regression trees on terabyte-sized datasets. Additionally, Revolution R Enterprise can be integrated with Hadoop (via the RHadoop packages) and IBM Netezza (via a plug-in for IBM PureData System for Analytics). At the time of this writing, students and professors in academic settings can obtain a free software subscription (excluding the IBM components).
Finally, Oracle R Enterprise (www.oracle.com) is a commercial product that makes the R environment available for use with massive datasets stored in Oracle databases and Hadoop. Oracle R Enterprise is part of Oracle Advanced Analytics, and it requires an installation of Oracle Database Enterprise Edition. Virtually all of R’s functionality, including the thousands of contributed packages, can be applied to terabyte-sized data problems using the Oracle R Enterprise interface. This is a relatively expensive but comprehensive solution, and it will appeal primarily to large organizations with deep pockets.
Working with datasets in the gigabyte-to-terabyte range can be challenging in any language. Each of these approaches comes with a significant learning curve. Of the five, RevoScaleR is perhaps the easiest to learn and install. (Important disclaimer: I teach Revolution R courses as an adjunct instructor and may be biased.)
Additional information on the analysis of large datasets is available in the CRAN task view “High-Performance and Parallel Computing with R” (http://cran.r-project.org/web/views). This is an area of rapid change and development, so be sure to check back often.