
CHAPTER 20 Advanced programming

creating new applications. In chapter 21, you’ll have an opportunity to put these skills into practice by creating a useful package from start to finish.

20.1 A review of the language

R is an object-oriented, functional, array programming language in which objects are specialized data structures, stored in RAM, and accessed via names or symbols. Names of objects consist of uppercase and lowercase letters, the digits 0–9, the period, and the underscore. Names are case-sensitive and can’t start with a digit, and a period is treated as a simple character without special meaning.

Unlike in languages such as C and C++, you can’t access memory locations directly. Data, functions, and just about everything else that can be stored and named are objects. Additionally, the names and symbols themselves are objects that can be manipulated. All objects are stored in RAM during program execution, which has significant implications for the analysis of massive datasets.

Every object has attributes: meta-information describing the characteristics of the object. Attributes can be listed with the attributes() function and set with the attr() function. A key attribute is an object’s class. R functions use information about an object’s class in order to determine how the object should be handled. The class of an object can be read and set with the class() function. Examples will be given throughout this chapter and the next.
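As a minimal sketch of these functions, the following code reads and sets attributes on a small numeric vector. The attribute name "units" and the class name "measurement" are invented for illustration and aren’t part of R:

```r
# A plain numeric vector starts with no attributes.
v <- c(10, 20, 30)
attributes(v)                    # NULL

# attr() sets (or reads) a single attribute; "units" is a made-up name.
attr(v, "units") <- "cm"
attr(v, "units")                 # "cm"

# class() reads and sets the class attribute.
class(v)                         # "numeric" (the implicit class)
class(v) <- "measurement"        # hypothetical class name
inherits(v, "measurement")       # TRUE
sort(names(attributes(v)))       # "class" "units"
```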

20.1.1 Data types

There are two fundamental data types: atomic vectors and generic vectors. Atomic vectors are arrays that contain a single data type. Generic vectors, also called lists, are collections of atomic vectors. Lists are recursive in that they can also contain other lists. This section considers both types in some detail.

Unlike in many languages, you don’t have to declare an object’s data type or allocate space for it. The type is determined implicitly from the object’s contents, and the size grows or shrinks automatically depending on the type and number of elements the object contains.
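A short sketch of this behavior follows. The variable name x and the values are arbitrary:

```r
x <- 5                 # no declaration; stored as a numeric (double) vector
typeof(x)              # "double"
length(x)              # 1

x[4] <- 8              # assigning past the end grows the vector; gaps become NA
x                      # 5 NA NA 8
length(x)              # 4

x <- x[-1]             # dropping an element shrinks it
length(x)              # 3

x <- c(x, "ten")       # adding a string coerces every element to character
typeof(x)              # "character"
```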

ATOMIC VECTORS

Atomic vectors are arrays that contain a single data type (logical, real, complex, character, or raw). For example, each of the following is a one-dimensional atomic vector:

passed <- c(TRUE, TRUE, FALSE, TRUE)
ages <- c(15, 18, 25, 14, 19)
cmplxNums <- c(1+2i, 0+1i, 39+3i, 12+2i)
names <- c("Bob", "Ted", "Carol", "Alice")

Vectors of type "raw" hold raw bytes and aren’t discussed here.

Many R data types are atomic vectors with special attributes. For example, R doesn’t have a scalar type. A scalar is an atomic vector with a single element. So k <- 2 is a shortcut for k <- c(2).
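You can check this equivalence directly. The following is a small sketch using only base R functions:

```r
k <- 2
identical(k, c(2))     # TRUE: both are numeric vectors of length 1
is.atomic(k)           # TRUE
length(k)              # 1
k[1]                   # 2 (a "scalar" can be indexed like any other vector)
```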


A matrix is an atomic vector that has a dimension attribute, dim, containing two elements (number of rows and number of columns). For example, start with a one-dimensional numeric vector x:

> x <- c(1,2,3,4,5,6,7,8)
> class(x)
[1] "numeric"
> print(x)
[1] 1 2 3 4 5 6 7 8

Then add a dim attribute:

> attr(x, "dim") <- c(2,4)

The object x is now a 2 × 4 matrix of class matrix:

> print(x)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> class(x)
[1] "matrix"
> attributes(x)
$dim
[1] 2 4

Row and column names can be attached by adding a dimnames attribute:

> attr(x, "dimnames") <- list(c("A1", "A2"),
                              c("B1", "B2", "B3", "B4"))
> print(x)
   B1 B2 B3 B4
A1  1  3  5  7
A2  2  4  6  8

Finally, the matrix can be returned to a one-dimensional vector by removing the dim attribute:

> attr(x, "dim") <- NULL
> class(x)
[1] "numeric"
> print(x)
[1] 1 2 3 4 5 6 7 8

An array is an atomic vector with a dim attribute that has three or more elements. Again, you set the dimensions with the dim attribute, and you can attach labels with the dimnames attribute. Like one-dimensional vectors, matrices and arrays can be of type logical, numeric, character, complex, or raw. But you can’t mix types in a single matrix or array.
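The following sketch builds a three-dimensional array in the same way and shows what happens when types are mixed. The dimension labels are invented for illustration:

```r
a <- 1:24
dim(a) <- c(2, 3, 4)             # 2 rows, 3 columns, 4 layers
is.array(a)                      # TRUE
length(dim(a))                   # 3
a[2, 3, 4]                       # 24 (the last element)

# Labels are attached with dimnames (one character vector per dimension).
dimnames(a) <- list(c("r1", "r2"),
                    c("c1", "c2", "c3"),
                    paste0("k", 1:4))
a["r1", "c2", "k1"]              # 3

# Mixing types doesn't produce a mixed matrix; everything is coerced.
m <- matrix(c(1, 2, "three", 4), nrow = 2)
typeof(m)                        # "character"
```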

The attr() function allows you to create arbitrary attributes and associate them with an object. Attributes store additional information about an object and can be used by functions to determine how the object is processed.


There are a number of special functions for setting attributes, including dim(), dimnames(), names(), row.names(), class(), and tsp(). The last of these, tsp(), is used to create time series objects. These special functions have restrictions on the values that can be set. Unless you’re creating custom attributes, it’s always a good idea to use these special functions. Their restrictions and the error messages they produce make coding errors less likely and more obvious.
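As a brief sketch, the matrix from the earlier example can be built with dim() and dimnames(), and an invalid request is rejected with an error. The tryCatch() wrapper and the result variable are only there so the example runs to completion:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
dim(x) <- c(2, 4)                # 8 elements arranged as 2 rows by 4 columns
dimnames(x) <- list(c("A1", "A2"),
                    c("B1", "B2", "B3", "B4"))
is.matrix(x)                     # TRUE

# Requesting 3 x 3 = 9 cells for 8 elements is refused.
result <- tryCatch({
  dim(x) <- c(3, 3)
  "no error"
}, error = function(e) "error caught")
result                           # "error caught"
dim(x)                           # still 2 4
```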

GENERIC VECTORS OR LISTS

Lists are collections of atomic vectors and/or other lists. Data frames are a special type of list, where each atomic vector in the collection has the same length. Consider the iris data frame that comes with the base R installation. It describes four physical measures taken on each of 150 plants, along with their species (setosa, versicolor, or virginica):

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

This data frame is actually a list containing five atomic vectors. It has a names attribute (a character vector of variable names), a row.names attribute (a numeric vector identifying individual plants), and a class attribute with the value "data.frame". Each vector represents a column (variable) in the data frame. This can be easily seen by printing the data frame with the unclass() function and obtaining the attributes with the attributes() function:

unclass(iris)

attributes(iris)

The output is omitted here to save space.
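If you want a more compact confirmation that a data frame is a list, a few one-line checks are enough. This is a sketch using only base R functions:

```r
is.list(iris)                    # TRUE
length(iris)                     # 5 (one component per column)
sapply(iris, length)             # each component has 150 elements
class(unclass(iris))             # "list" once the class attribute is removed
sort(names(attributes(iris)))    # "class" "names" "row.names"
```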

It’s important to understand lists because R functions frequently return them as values. Let’s look at an example using a cluster-analysis technique from chapter 16. Cluster analysis uses a family of methods to identify naturally occurring groupings of observations.

You’ll apply k-means cluster analysis (section 16.3.1) to the iris data. Assume that there are three clusters present in the data, and observe how the observations (rows) become grouped. You’ll ignore the species variable and use only the physical measures of each plant to form the clusters. The required code is

set.seed(1234)

fit <- kmeans(iris[1:4], 3)

What information is contained in the object fit? The help page for kmeans() indicates that the function returns a list with nine components. The str() function displays the object’s structure, and the unclass() function can be used to examine the object’s contents directly. The length() function indicates how many components the object contains, and the names() function provides the names of these components. You can use the attributes() function to examine the attributes of the object. The contents of the object returned by kmeans() are explored here:

> names(fit)
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"

> unclass(fit)
$cluster
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [29] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2
 [57] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2
 [85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3
[113] 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3
[141] 3 3 2 3 3 3 2 3 3 2

$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

$totss
[1] 681.4

$withinss
[1] 15.15 39.82 23.88

$tot.withinss
[1] 78.85

$betweenss
[1] 602.5

$size
[1] 50 62 38

$iter
[1] 2

$ifault
[1] 0

Executing sapply(fit, class) returns the class of each component in the object:

> sapply(fit, class)
     cluster      centers        totss     withinss tot.withinss
   "integer"     "matrix"    "numeric"    "numeric"    "numeric"
   betweenss         size         iter       ifault
   "numeric"    "integer"    "integer"    "integer"

In this example, cluster is an integer vector containing the cluster memberships, and centers is a matrix containing the cluster centroids (means on each variable for each cluster). The size component is an integer vector containing the number of plants in each of the three clusters. To learn about the other components, see the Value section of help(kmeans).
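The other inspection functions mentioned earlier can be applied in the same way. The following sketch refits the model so that it can be run on its own; the consistency checks at the end are additions for illustration, and which cluster receives which number may differ across R versions:

```r
set.seed(1234)
fit <- kmeans(iris[1:4], 3)

class(fit)                       # "kmeans"
length(fit)                      # number of components in the list
str(fit)                         # compact view of every component
is.list(unclass(fit))            # TRUE

# Simple sanity checks on the components
sum(fit$size)                    # 150: every plant belongs to one cluster
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)   # TRUE
```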

INDEXING

Learning to unpack the information in a list is a critical R programming skill. The elements of any data object can be extracted via indexing. Before diving into a list, let’s look at extracting elements from an atomic vector.

Elements are extracted using object[index], where object is the vector and index is an integer vector. If the elements of the atomic vector have been named, index can also be a character vector with these names. Note that in R, indices start with 1, not 0 as in many other languages.

Here is an example, using this approach for an atomic vector without named elements:

> x <- c(20, 30, 40)
> x[3]
[1] 40
> x[c(2,3)]
[1] 30 40

For an atomic vector with named elements, you could use

> x <- c(A=20, B=30, C=40)
> x[c(2,3)]
 B  C
30 40
> x[c("B", "C")]
 B  C
30 40
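Two details of vector indexing are easy to verify for yourself. The sketch below shows that index 0 selects nothing (because counting starts at 1) and that the double-bracket form, used here on a vector, returns a single element without its name:

```r
x <- c(A = 20, B = 30, C = 40)

x[1]                   # the first element, with its name A
length(x[0])           # 0: there is no element at position 0
x[["B"]]               # 30, returned without the name
unname(x[c("B", "C")]) # 30 40
```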

For lists, components (atomic vectors or other lists) can be extracted using object[index], where index is an integer vector. The following uses the fit object from the kmeans example created earlier (the same code appears again a little later, in listing 20.1):

> fit[c(2,7)]
$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

$size
[1] 50 62 38

Note that components are returned as a list.

To get just the elements in the component, use object[[integer]]:

> fit[2]
$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

 

 


> fit[[2]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

In the first case, a list is returned. In the second case, a matrix is returned. The difference can be important, depending on what you do with the results. If you want to pass the results to a function that requires a matrix as input, you would want to use the double-bracket notation.
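The practical consequence can be sketched with a small, made-up list (res and its components are hypothetical, not part of the kmeans example). A function that expects a matrix, such as colMeans(), works on the double-bracket result but fails on the single-bracket result:

```r
res <- list(label = "demo", m = matrix(1:6, nrow = 2))

sub1 <- res[2]                   # single brackets: a list of length 1
sub2 <- res[[2]]                 # double brackets: the matrix itself

is.list(sub1)                    # TRUE
is.matrix(sub1)                  # FALSE
is.matrix(sub2)                  # TRUE

colMeans(sub2)                   # 1.5 3.5 5.5

# colMeans() can't handle the one-component list and raises an error.
outcome <- tryCatch(colMeans(sub1), error = function(e) "error caught")
outcome                          # "error caught"
```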

To extract a single named component, you can use the $ notation. In this case, object[[integer]] and object$name are equivalent:

> fit$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

This also explains why the $ notation works with data frames. Consider the iris data frame. The data frame is a special case of a list, where each variable is represented as a component. This is why iris$Sepal.Length returns the 150-element vector of sepal lengths.
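The equivalence of the notations is easy to confirm on iris itself. This is a short sketch using base R only:

```r
identical(iris$Sepal.Length, iris[["Sepal.Length"]])   # TRUE
identical(iris$Sepal.Length, iris[[1]])                # TRUE
length(iris$Sepal.Length)                              # 150

# Single brackets keep the list (data frame) wrapper.
is.data.frame(iris["Sepal.Length"])                    # TRUE
```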

Notations can be combined to obtain the elements within components. For example,

> fit[[2]][1,]
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246

extracts the second component of fit (a matrix of means) and returns the first row (the means for the first cluster on each of the four variables).
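The same element can usually be reached in more than one way. The sketch below refits the model so that it runs on its own and combines the $ notation with matrix indexing by row number and column name:

```r
set.seed(1234)
fit <- kmeans(iris[1:4], 3)

# Same row, reached by position and by name
identical(fit[[2]][1, ], fit$centers[1, ])   # TRUE

# One value: the mean petal length in the first cluster
fit$centers[1, "Petal.Length"]

# One column: the mean sepal width in each of the three clusters
fit$centers[, "Sepal.Width"]
```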

By extracting the components and elements of lists returned by functions, you can take the results and go further. For example, to plot the cluster centroids with a line graph, you can use the following code.

Listing 20.1 Plotting the centroids from a k-means cluster analysis

> set.seed(1234)
> fit <- kmeans(iris[1:4], 3)
> means <- fit$centers                  # b Obtains the cluster means
> library(reshape2)
> dfm <- melt(means)                    # c Reshapes the data to long form
> names(dfm) <- c("Cluster", "Measurement", "Centimeters")
> dfm$Cluster <- factor(dfm$Cluster)
> head(dfm)
  Cluster  Measurement Centimeters
1       1 Sepal.Length       5.006
2       2 Sepal.Length       5.902
3       3 Sepal.Length       6.850
4       1  Sepal.Width       3.428
5       2  Sepal.Width       2.748
6       3  Sepal.Width       3.074
