

and a class is assigned to x and y. Next, mymethod() is applied to each object, and the appropriate method is called. The default method is used for object z because the object has class integer and no mymethod.integer() function has been defined.

An object can be assigned to more than one class (for example, building, residential, and commercial). How does R determine which method to call in such a case? When z is assigned two classes, the first class is used to determine which method to call. In the final example, there is no mymethod.c() function, so the next class in line, a, is used. R searches the class list from left to right, looking for the first available method.
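
The dispatch behavior just described can be reproduced with a short, self-contained sketch (a reconstruction in the spirit of the chapter's example, not the original listing):

> mymethod <- function(x, ...) UseMethod("mymethod")      # generic
> mymethod.a <- function(x) print("Using A")              # method for class "a"
> mymethod.b <- function(x) print("Using B")              # method for class "b"
> mymethod.default <- function(x) print("Using Default")
> z <- 10:15
> mymethod(z)                  # integer vector falls through to the default
[1] "Using Default"
> class(z) <- c("a", "b")
> mymethod(z)                  # the first class wins
[1] "Using A"
> class(z) <- c("c", "a", "b")
> mymethod(z)                  # no mymethod.c(), so class "a" is next in line
[1] "Using A"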

20.3.2 Limitations of the S3 model

The primary limitation of the S3 object model is the fact that any class can be assigned to any object. There are no integrity checks. In this example,

> class(women) <- "lm"

> summary(women)

Error in if (p == 0) { : argument is of length zero

the data frame women is assigned class lm, which is nonsensical and leads to errors. The S4 OOP model is more formal and rigorous, and designed to avoid the difficulties raised by the S3 method's less structured approach. In the S4 approach, classes are defined as abstract objects that have slots containing specific types of information (that is, typed variables). Object and method construction are formally defined, with rules that are enforced. But programming using the S4 model is more complex and less interactive. To learn more about the S4 OOP model, see "A (Not So) Short Introduction to S4" by Christophe Genolini (http://mng.bz/1VkD).
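
To see the difference in miniature, here is a minimal sketch using an invented Residence class (the class name and slots are illustrative only, not from the book):

> setClass("Residence", slots=c(address="character", sqft="numeric"))
> house <- new("Residence", address="12 Main St", sqft=1800)
> house@sqft
[1] 1800
> new("Residence", address="12 Main St", sqft="big")
Error in validObject(.Object) : invalid class "Residence" object: ...

Unlike the S3 example above, assigning a character value to the numeric sqft slot is rejected at construction time.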

20.4 Writing efficient code

There is a saying among programmers: "A power user is someone who spends an hour tweaking their code so that it runs a second faster." R is a sprightly language, and most R users don't have to worry about writing efficient code. The easiest way to make your code run faster is to beef up your hardware (RAM, processor speed, and so on). As a general rule, it's more important to write code that is understandable and easy to maintain than it is to optimize its speed. But when you're working with large datasets or highly repetitive tasks, speed can become an issue.

Several coding techniques can help to make your programs more efficient:

Read in only the data you need.

Use vectorization rather than loops whenever possible.

Create objects of the correct size, rather than resizing repeatedly.

Use parallelization for repetitive, independent tasks.

Let’s look at each one in turn.

EFFICIENT DATA INPUT

When you're reading data from a delimited text file via the read.table() function, you can achieve significant speed gains by specifying which variables are needed and their types. This can be accomplished by including a colClasses parameter. For example, suppose you want to access 3 numeric variables and 2 character variables in a comma-delimited file with 10 variables per line. The numeric variables are in positions 1, 2, and 5, and the character variables are in positions 3 and 7. In this case, the code

my.data.frame <- read.table(mytextfile, header=TRUE, sep=',',
    colClasses=c("numeric", "numeric", "character", "NULL",
                 "numeric", "NULL", "character", "NULL",
                 "NULL", "NULL"))

will run faster than

my.data.frame <- read.table(mytextfile, header=TRUE, sep=',')

Variables associated with a "NULL" colClasses value are skipped. As the number of rows and columns in the text file increases, the speed gain becomes more significant.

VECTORIZATION

Use vectorization rather than loops whenever possible. Here, vectorization means using R functions that are designed to process vectors in a highly optimized manner. Examples in the base installation include ifelse(), colSums(), colMeans(), rowSums(), and rowMeans(). The matrixStats package offers optimized functions for many additional calculations, including counts, sums, products, measures of central tendency and dispersion, quantiles, ranks, and binning. Packages such as plyr, dplyr, reshape2, and data.table also provide functions that are highly optimized.
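
As a quick illustration (mine, not the book's), ifelse() applies a test across an entire vector in a single call, with no explicit loop:

> x <- c(-2, 5, -1, 7)
> ifelse(x > 0, "positive", "negative")
[1] "negative" "positive" "negative" "positive"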

Consider a matrix with 1 million rows and 10 columns. Let’s calculate the column sums using loops and again using the colSums() function. First, create the matrix:

set.seed(1234)

mymatrix <- matrix(rnorm(10000000), ncol=10)

Next, create a function called accum() that uses for loops to obtain the column sums:

accum <- function(x){
    sums <- numeric(ncol(x))
    for (i in 1:ncol(x)){
        for (j in 1:nrow(x)){
            sums[i] <- sums[i] + x[j, i]
        }
    }
    sums                       # return the vector of column sums
}

The system.time() function can be used to determine the amount of CPU and real time needed to run the function:

> system.time(accum(mymatrix))
   user  system elapsed
  25.67    0.01   25.75

Calculating the same sums using the colSums() function produces

> system.time(colSums(mymatrix))
   user  system elapsed
   0.02    0.00    0.02


On my machine, the vectorized function ran more than 1,200 times faster. Your mileage may vary.

CORRECTLY SIZING OBJECTS

It’s more efficient to initialize objects to their required final size and fill in the values than it is to start with a smaller object and grow it by appending values. Let’s say you have a vector x with 100,000 numeric values. You want to obtain a vector y with the squares of these values:

> set.seed(1234)
> k <- 100000
> x <- rnorm(k)

One approach is as follows:

> y <- 0
> system.time(for (i in 1:length(x)) y[i] <- x[i]^2)
   user  system elapsed
  10.03    0.00   10.03

y starts as a one-element vector and grows to be a 100,000-element vector containing the squared values of x. It takes about 10 seconds on my machine.

If you first initialize y to be a vector with 100,000 elements,

> y <- numeric(length=k)
> system.time(for (i in 1:k) y[i] <- x[i]^2)
   user  system elapsed
   0.23    0.00    0.24

the same calculations take less than a second. You avoid the considerable time it takes R to continually resize the object.

If you use vectorization,

> y <- numeric(length=k)
> system.time(y <- x^2)
   user  system elapsed
      0       0       0

the process is even faster. Note that arithmetic operations such as exponentiation, addition, and multiplication are also vectorized functions.

PARALLELIZATION

Parallelization involves chunking up a task, running the chunks simultaneously on two or more cores, and combining the results. The cores might be on the same computer or on different machines in a cluster. Tasks that require the repeated, independent execution of a numerically intensive function are likely to benefit from parallelization. This includes many Monte Carlo methods, such as bootstrapping.

Many packages in R support parallelization (see "CRAN Task View: High-Performance and Parallel Computing with R" by Dirk Eddelbuettel, http://mng.bz/65sT). In this section, you'll use the foreach and doParallel packages to see parallelization on a single computer. The foreach package supports the foreach looping construct (iterating over the elements in a collection) and facilitates executing loops in parallel. The doParallel package provides a parallel back end for the foreach package.

In principal components and factor analysis, a critical step is identifying the appropriate number of components or factors to extract from the data (see section 14.2.1). One approach involves repeatedly performing an eigenanalysis of correlation matrices derived from random data that have the same number of rows and columns as the original data. Listing 20.3 demonstrates the analysis, comparing parallel and non-parallel versions. To execute this code, you'll need to install both packages and know how many cores your computer has.
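
If you're not sure how many cores are available, detectCores() in the parallel package (shipped with base R) reports the count, which you can then pass to registerDoParallel(); the value shown here assumes the four-core machine used for the timings below:

> library(parallel)
> detectCores()
[1] 4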

Listing 20.3 Parallelization with foreach and doParallel

> library(foreach)                                         #b Loads packages and
> library(doParallel)                                      #  registers the number
> registerDoParallel(cores=4)                              #  of cores
> eig <- function(n, p){                                   #c Defines the function
    x <- matrix(rnorm(100000), ncol=100)
    r <- cor(x)
    eigen(r)$values
  }
> n <- 1000000
> p <- 100
> k <- 500
> system.time(
    x <- foreach(i=1:k, .combine=rbind) %do% eig(n, p)     #d Executes normally
  )
   user  system elapsed
  10.97    0.14   11.11
> system.time(
    x <- foreach(i=1:k, .combine=rbind) %dopar% eig(n, p)  #e Executes in parallel
  )
   user  system elapsed
   0.22    0.05    4.24

First the packages are loaded and the number of cores (four on my machine) is registered b. Next, the function for the eigenanalysis is defined c; each call builds a random data matrix of 100,000 values arranged in 100 columns and extracts the eigenvalues of its correlation matrix. The eig() function is executed 500 times using foreach and %do% d. The %do% operator runs the function sequentially, and the .combine=rbind option appends the results to object x as rows. Finally, the function is run in parallel using the %dopar% operator e. In this case, parallel execution was about 2.5 times faster than sequential execution.

In this example, each iteration of the eig() function was numerically intensive, didn't require access to other iterations, and didn't involve disk I/O. This is the type of situation that benefits the most from parallelization. The downside of parallelization is that it can make the code less portable; there is no guarantee that others will have the same hardware configuration that you do.
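
To make the bootstrapping case concrete, here is a minimal sketch (my own illustration, not from the chapter; the statistic and replication count are arbitrary choices). Each replicate resamples the data and recomputes the median, and the replicates are independent, so %dopar% applies directly:

> library(foreach)
> library(doParallel)
> registerDoParallel(cores=4)
> x <- rnorm(10000)
> bootMedian <- function(data) median(sample(data, replace=TRUE))  # one replicate
> res <- foreach(i=1:5000, .combine=c) %dopar% bootMedian(x)
> quantile(res, c(0.025, 0.975))   # 95% bootstrap interval for the median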
