R in Action, Second Edition.pdf

Data input


The read.table() function has many options for fine-tuning data imports. See help(read.table) for details.
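A few of those options are easier to appreciate in action. The following is a minimal sketch rather than an example from the text: the inline data, the column names, and the use of -99 as a missing-value code are all assumptions made for illustration.

```r
# Inline data stands in for a file on disk (hypothetical values)
raw <- c("id,name,score",
         "1,Ann,90",
         "2,Bob,NA",
         "3,Cy,-99")

# colClasses fixes the type of each column; na.strings lists the codes
# that should be read as missing values
grades <- read.table(text = raw, header = TRUE, sep = ",",
                     colClasses = c("integer", "character", "numeric"),
                     na.strings = c("NA", "-99"))

str(grades)
```

Setting colClasses explicitly avoids surprises from automatic type guessing, and na.strings lets you map several sentinel codes to NA in one pass.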

Importing data via connections

Many of the examples in this chapter import data from files that exist on your computer. R provides several mechanisms for accessing data via connections as well. For example, the functions file(), gzfile(), bzfile(), xzfile(), unz(), and url() can be used in place of the filename. The file() function allows you to access files, the clipboard, and C-level standard input. The gzfile(), bzfile(), xzfile(), and unz() functions let you read compressed files.

The url() function lets you access internet files through a complete URL that includes http://, ftp://, or file://. For HTTP and FTP, proxies can be specified. For convenience, complete URLs (surrounded by double quotation marks) can usually be used directly in place of filenames as well. See help(file) for details.
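As a rough illustration of using a connection in place of a filename, the sketch below writes a small gzip-compressed file and reads it back through gzfile(). The temporary file and its contents are assumptions made for the example; with real data you would point gzfile() at an existing compressed file.

```r
# Create a small gzip-compressed, comma-delimited file (hypothetical data)
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, open = "wt")
writeLines(c("id,score", "1,90", "2,85"), con)
close(con)

# Pass a connection to read.table() where a filename would normally go;
# read.table() opens the connection and closes it again when it is done
grades <- read.table(gzfile(tmp), header = TRUE, sep = ",")
print(grades)
```

The same pattern should apply to the other connection functions, for example url() for a remote file.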

2.3.3 Importing data from Excel

The best way to read an Excel file is to export it to a comma-delimited file from Excel and import it into R using the method described earlier. Alternatively, you can import Excel worksheets directly using the xlsx package. Be sure to download and install it before you first use it. You’ll also need the xlsxjars and rJava packages and a working installation of Java (http://java.com).

The xlsx package can be used to read, write, and format Excel 97/2000/XP/2003/2007 files. The read.xlsx() function imports a worksheet into a data frame. The simplest format is read.xlsx(file, n) where file is the path to an Excel workbook, n is the number of the worksheet to be imported, and the first line of the worksheet contains the variable names. For example, on a Windows platform, the code

library(xlsx)

workbook <- "c:/myworkbook.xlsx"

mydataframe <- read.xlsx(workbook, 1)

imports the first worksheet from the workbook myworkbook.xlsx stored on the C: drive and saves it as the data frame mydataframe.

The read.xlsx() function has options that allow you to specify specific rows (rowIndex) and columns (colIndex) of the worksheet, along with the class of each column (colClasses). For large worksheets (say, 100,000+ cells), you can also use read.xlsx2(). It performs more of the processing work in Java, resulting in significant performance gains. See help(read.xlsx) for details.

There are other packages that can help you work with Excel files. Alternatives include the XLConnect and openxlsx packages; XLConnect depends on Java, but openxlsx doesn’t. All of these packages can do more than import worksheets; they can create and manipulate Excel files as well. Programmers who need to develop an interface between R and Excel should check out one or more of these packages.
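For readers without a Java installation, a round trip through openxlsx might look like the following sketch. The data frame and temporary file are assumptions made for illustration, and the code only runs if openxlsx is installed.

```r
# Runs only when the openxlsx package is available
if (requireNamespace("openxlsx", quietly = TRUE)) {
  # Hypothetical data used purely for illustration
  sales <- data.frame(region = c("East", "West"), units = c(120, 95))

  path <- tempfile(fileext = ".xlsx")
  openxlsx::write.xlsx(sales, path)                 # create the workbook
  restored <- openxlsx::read.xlsx(path, sheet = 1)  # import first worksheet
  print(restored)
}
```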


CHAPTER 2 Creating a dataset

2.3.4 Importing data from XML

Increasingly, data is provided in the form of files encoded in XML. R has several packages for handling XML files. For example, the XML package written by Duncan Temple Lang allows you to read, write, and manipulate XML files. Coverage of XML is beyond the scope of this text; if you’re interested in accessing XML documents from within R, see the excellent package documentation at www.omegahat.org/RSXML.

2.3.5 Importing data from the web

Data can be obtained from the web via webscraping or the use of application programming interfaces (APIs). Webscraping is used to extract the information embedded in specific web pages, whereas APIs allow you to interact with web services and online data stores.

Typically, webscraping is used to extract data from a web page and save it into an R structure for further analysis. For example, the text on a web page can be downloaded into an R character vector using the readLines() function and manipulated with functions such as grep() and gsub(). For complex web pages, the RCurl and XML packages can be used to extract the information desired. For more information, including examples, see “Webscraping Using readLines and RCurl,” available from the website Programming with R (www.programmingr.com).
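The readLines(), grep(), and gsub() workflow can be sketched as follows. To keep the example self-contained, a few lines of made-up HTML are read through a text connection; for a live page you would pass a URL to readLines() instead. The page content is an assumption made for illustration.

```r
# A tiny stand-in for a downloaded web page (hypothetical content)
html <- c("<html><body>",
          "<h1>Daily report</h1>",
          "<p>Temperature: 21 C</p>",
          "<p>Humidity: 40 %</p>",
          "</body></html>")

# Read the page into a character vector, one element per line
con <- textConnection(html)
page <- readLines(con)
close(con)

# Keep only the lines that contain a paragraph tag
paras <- grep("<p>", page, value = TRUE, fixed = TRUE)

# Strip the HTML tags, then everything except digits and decimal points
text <- gsub("<[^>]+>", "", paras)
values <- as.numeric(gsub("[^0-9.]", "", text))

print(text)
print(values)
```

Regular expressions like these are brittle on real pages, which is why the text points to RCurl and XML for anything complex.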

APIs specify how software components should interact with each other. A number of R packages use this approach to extract data from web-accessible resources. These include data sources in biology, medicine, Earth sciences, physical science, economics and business, finance, literature, marketing, news, and sports.

For example, if you’re interested in social media, you can access Twitter data via twitteR, Facebook data via Rfacebook, and Flickr data via Rflickr. Other packages allow you to access popular web services provided by Google, Amazon, Dropbox, Salesforce, and others. For a comprehensive list of R packages that can help you access web-based resources, see the CRAN Task View on Web Technologies and Services (http://mng.bz/370r).

2.3.6 Importing data from SPSS

IBM SPSS datasets can be imported into R via the read.spss() function in the foreign package. Alternatively, you can use the spss.get() function in the Hmisc package. spss.get() is a wrapper function that automatically sets many parameters of read.spss() for you, making the transfer easier and more consistent with what data analysts expect as a result.

First, download and install the Hmisc package (the foreign package is already installed by default):

install.packages("Hmisc")

Then use the following code to import the data:

library(Hmisc)

mydataframe <- spss.get("mydata.sav", use.value.labels=TRUE)


In this code, mydata.sav is the SPSS data file to be imported, use.value.labels=TRUE tells the function to convert variables with value labels into R factors with those same levels, and mydataframe is the resulting R data frame.
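If you prefer to call read.spss() directly, a sketch along these lines may be useful. The import_spss() helper is a hypothetical name introduced here, not part of the foreign package; it only adds a file check and asks read.spss() for a data frame rather than its default list.

```r
library(foreign)

# Hypothetical helper: import an SPSS file as a data frame
import_spss <- function(path) {
  if (!file.exists(path)) {
    stop("SPSS file not found: ", path)
  }
  # to.data.frame = TRUE returns a data frame instead of a list;
  # use.value.labels = TRUE converts labeled variables to factors
  read.spss(path, use.value.labels = TRUE, to.data.frame = TRUE)
}

# Usage, assuming mydata.sav exists in the working directory:
# mydataframe <- import_spss("mydata.sav")
```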

2.3.7 Importing data from SAS

A number of functions in R are designed to import SAS datasets, including read.ssd() in the foreign package, sas.get() in the Hmisc package, and read.sas7bdat() in the sas7bdat package. If you have SAS installed, sas.get() can be a good option.

Let’s say that you want to import an SAS dataset named clients.sas7bdat that resides in the C:/mydata directory on a Windows machine. The following code imports the data and saves it as an R data frame:

library(Hmisc)

datadir <- "C:/mydata"

sasexe <- "C:/Program Files/SASHome/SASFoundation/9.4/sas.exe"

mydata <- sas.get(libraryName=datadir, member="clients", sasprog=sasexe)

libraryName is a directory containing the SAS dataset, member is the dataset name (excluding the sas7bdat extension), and sasprog is the full path to the SAS executable. Many additional options are available; see help(sas.get) for details.

You can also save the SAS dataset as a comma-delimited text file from within SAS using PROC EXPORT, and you can read the resulting file into R using the method described in section 2.3.2. Here’s an example:

SAS program:

libname datadir "C:\mydata";

proc export data=datadir.clients

outfile="clients.csv"

dbms=csv;

run;

R program:

mydata <- read.table("clients.csv", header=TRUE, sep=",")

The previous two approaches require that you have a fully functional version of SAS installed. If you don’t have access to SAS, the read.sas7bdat() function may be a good alternative. The function can read an SAS dataset in sas7bdat format directly. The code for this example would be

library(sas7bdat)

mydata <- read.sas7bdat("C:/mydata/clients.sas7bdat")

Unlike sas.get(), the read.sas7bdat() function ignores SAS user-defined formats. Additionally, it takes significantly longer to run. Although I’ve had good luck with this package, it’s still considered experimental.

Finally, a commercial product named Stat/Transfer (described in section 2.3.12) does an excellent job of saving SAS datasets (including any existing variable formats) as R data frames. As with read.sas7bdat(), access to an SAS installation isn’t required.


2.3.8 Importing data from Stata

Importing data from Stata to R is straightforward. The necessary code looks like this:

library(foreign)

mydataframe <- read.dta("mydata.dta")

Here, mydata.dta is the Stata dataset, and mydataframe is the resulting R data frame.
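If you don’t have a Stata file at hand, you can try the import on one you create yourself. The sketch below uses write.dta(), also in the foreign package, to produce a file and then reads it back; the data frame and temporary file are assumptions made for illustration.

```r
library(foreign)

# Hypothetical data used purely for illustration
patients <- data.frame(id  = 1:3,
                       age = c(34, 51, 27),
                       sex = factor(c("F", "M", "F")))

# Write a Stata file, then import it again
tmp <- tempfile(fileext = ".dta")
write.dta(patients, tmp)
restored <- read.dta(tmp)

str(restored)
```

By default, read.dta() converts Stata value labels to R factors, so the sex variable should come back as a factor.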

2.3.9 Importing data from NetCDF

Unidata’s Network Common Data Form (NetCDF) open source software contains machine-independent data formats for the creation and distribution of array-oriented scientific data. NetCDF is commonly used to store geophysical data. The ncdf and ncdf4 packages provide high-level R interfaces to NetCDF data files.

The ncdf package provides support for data files created with Unidata’s NetCDF library (version 3 or earlier) and is available for Windows, Mac OS X, and Linux platforms. The ncdf4 package supports version 4 or earlier but isn’t yet available for Windows.

Consider this code:

library(ncdf)

nc <- open.ncdf("mynetCDFfile")   # open.ncdf() belongs to ncdf; nc_open() is the ncdf4 name

myarray <- get.var.ncdf(nc, myvar)

In this example, all the data from the variable myvar, contained in the NetCDF file mynetCDFfile, is read and saved into an R array called myarray.

Note that both the ncdf and ncdf4 packages have received major recent upgrades and may operate differently than previous versions. Additionally, function names in the two packages differ. Read the online help for details.
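To make the naming difference concrete, here is a sketch of the same read using ncdf4, where the corresponding functions are nc_open(), ncvar_get(), and nc_close(). The read_nc_var() wrapper is a hypothetical helper introduced for this example, and it assumes the ncdf4 package is installed.

```r
# Hypothetical helper: read one variable from a NetCDF file with ncdf4
read_nc_var <- function(path, varname) {
  if (!requireNamespace("ncdf4", quietly = TRUE)) {
    stop("the ncdf4 package is not installed")
  }
  nc <- ncdf4::nc_open(path)
  on.exit(ncdf4::nc_close(nc))     # close the file even if the read fails
  ncdf4::ncvar_get(nc, varname)    # varname is given as a character string
}

# Usage, assuming the file and variable from the example above exist:
# myarray <- read_nc_var("mynetCDFfile", "myvar")
```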

2.3.10 Importing data from HDF5

Hierarchical Data Format (HDF5) is a software technology suite for the management of extremely large and complex data collections. The rhdf5 package provides an R interface for HDF5. The package is available on the Bioconductor website rather than CRAN. You can install it with the following code:

source("http://bioconductor.org/biocLite.R")

biocLite("rhdf5")

Like XML, HDF5 is beyond the scope of this book. To learn more, visit the HDF Group website (www.hdfgroup.org). There is an excellent tutorial for the rhdf5 package by Bernd Fischer at http://mng.bz/eg6j.
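For a first taste, a minimal round trip with rhdf5 might look like this sketch. The file, the dataset name measurements, and the matrix are assumptions made for illustration, and the code only runs if rhdf5 is installed.

```r
# Runs only when the rhdf5 package is available
if (requireNamespace("rhdf5", quietly = TRUE)) {
  path <- tempfile(fileext = ".h5")
  rhdf5::h5createFile(path)                  # create an empty HDF5 file

  m <- matrix(1:6, nrow = 2)                 # hypothetical data
  rhdf5::h5write(m, path, "measurements")    # store it as a named dataset

  restored <- rhdf5::h5read(path, "measurements")
  print(rhdf5::h5ls(path))                   # list the file's contents
}
```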

2.3.11 Accessing database management systems (DBMSs)

R can interface with a wide variety of relational database management systems (DBMSs), including Microsoft SQL Server, Microsoft Access, MySQL, Oracle, PostgreSQL, DB2, Sybase, Teradata, and SQLite. Some packages provide access through native database drivers, whereas others offer access via ODBC or JDBC. Using R to access data stored in external DBMSs can be an efficient way to analyze large datasets (see appendix F) and takes advantage of the power of both SQL and R.

THE ODBC INTERFACE

Perhaps the most popular method of accessing a DBMS in R is through the RODBC package, which allows R to connect to any DBMS that has an ODBC driver. This includes all the DBMSs listed earlier.

The first step is to install and configure the appropriate ODBC driver for your platform and database (these drivers aren’t part of R). If the requisite drivers aren’t already installed on your machine, an internet search should provide you with options.

Once the drivers are installed and configured for the database(s) of your choice, install the RODBC package. You can do so by using the install.packages("RODBC") command. The primary functions included with RODBC are listed in table 2.3.

Table 2.3 RODBC functions

Function                                                   Description

odbcConnect(dsn,uid="",pwd="")                             Opens a connection to an ODBC database

sqlFetch(channel,sqltable)                                 Reads a table from an ODBC database into a data frame

sqlQuery(channel,query)                                    Submits a query to an ODBC database and returns the results

sqlSave(channel,mydf,tablename = sqltable,append=FALSE)    Writes or updates (append=TRUE) a data frame to a table in the ODBC database

sqlDrop(channel,sqltable)                                  Removes a table from the ODBC database

close(channel)                                             Closes the connection
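The functions in table 2.3 can be combined into a write-then-read round trip, sketched below. The round_trip() helper, the scratch table name r_scratch, and the DSN passed to it are all hypothetical; the code assumes RODBC is installed and that the DSN has been registered with your ODBC driver manager.

```r
# Hypothetical helper: save a data frame to a scratch table, read it back,
# and drop the table again
round_trip <- function(dsn, mydf, tablename = "r_scratch") {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("the RODBC package is not installed")
  }
  channel <- RODBC::odbcConnect(dsn)
  if (!inherits(channel, "RODBC")) {
    stop("could not connect to DSN: ", dsn)   # a failed connect does not return a channel
  }
  on.exit(close(channel))                     # always release the connection

  RODBC::sqlSave(channel, mydf, tablename = tablename, append = FALSE)
  result <- RODBC::sqlQuery(channel, paste("select * from", tablename))
  RODBC::sqlDrop(channel, tablename)
  result
}

# Usage, assuming a registered DSN named "mydsn":
# copy <- round_trip("mydsn", mtcars)
```

Registering close() with on.exit() means the connection is released even when one of the intermediate calls fails.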

 

 

The RODBC package allows two-way communication between R and an ODBC-connected SQL database. This means you can not only read data from a connected database into R, but also use R to alter the contents of the database itself. Assume that you want to import two tables (Crime and Punishment) from a DBMS into two R data frames called crimedat and pundat, respectively. You can accomplish this with code similar to the following:

library(RODBC)

myconn <- odbcConnect("mydsn", uid="Rob", pwd="aardvark")

crimedat <- sqlFetch(myconn, "Crime")   # the table name is passed as a character string

pundat <- sqlQuery(myconn, "select * from Punishment")

close(myconn)

Here, you load the RODBC package and open a connection to the ODBC database through a registered data source name (mydsn) with a security UID (Rob) and password (aardvark). The connection handle (myconn) is passed to sqlFetch(), which copies the table Crime into the R data frame crimedat. You then run the SQL select statement
