                    X^2 df  P(> X^2)
Likelihood Ratio 13.530  2 0.0011536
Pearson          13.055  2 0.0014626

Phi-Coefficient   : 0.394
Contingency Coeff.: 0.367
Cramer's V        : 0.394
In general, larger magnitudes indicate stronger associations. The vcd package also provides a Kappa() function that calculates Cohen's kappa and weighted kappa for a confusion matrix (for example, the degree of agreement between two judges classifying a set of objects into categories).
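For example, a minimal sketch (the agreement counts below are hypothetical, invented purely for illustration):

library(vcd)
# Hypothetical confusion matrix: two judges classify 94 objects
# into three categories (rows = judge 1, columns = judge 2)
ratings <- matrix(c(20,  5,  0,
                     3, 30,  4,
                     1,  6, 25), nrow=3, byrow=TRUE)
Kappa(ratings)    # reports unweighted and weighted kappa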
7.2.4 Visualizing results
R has mechanisms for visually exploring the relationships among categorical variables that go well beyond those found in most other statistical platforms. You typically use bar charts to visualize frequencies in one dimension (see section 6.1). The vcd package has excellent functions for visualizing relationships among categorical variables in multidimensional datasets using mosaic and association plots (see section 11.4). Finally, correspondence-analysis functions in the ca package allow you to visually explore relationships between rows and columns in contingency tables using various geometric representations (Nenadic and Greenacre, 2007).
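As a quick preview (mosaic plots are covered fully in section 11.4), here is a minimal sketch using the Arthritis data that ships with the vcd package:

library(vcd)
# Mosaic plot of treatment versus improvement, shaded by Pearson
# residuals to highlight departures from independence
mosaic(~ Treatment + Improved, data = Arthritis, shade = TRUE)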
This ends the discussion of contingency tables, until we take up more advanced topics in chapters 11 and 15. Next, let’s look at various types of correlation coefficients.
7.3 Correlations
Correlation coefficients are used to describe relationships among quantitative variables. The sign (±) indicates the direction of the relationship (positive or inverse), and the magnitude indicates its strength (ranging from 0 for no relationship to 1 for a perfectly predictable relationship).
In this section, we’ll look at a variety of correlation coefficients, as well as tests of significance. We’ll use the state.x77 dataset available in the base R installation. It provides data on the population, income, illiteracy rate, life expectancy, murder rate, and high school graduation rate for the 50 US states in 1977. There are also temperature and land-area measures, but we’ll drop them to save space. Use help(state.x77) to learn more about the file. In addition to the base installation, we’ll be using the psych and ggm packages.
7.3.1 Types of correlations
R can produce a variety of correlation coefficients, including Pearson, Spearman, Kendall, partial, polychoric, and polyserial. Let’s look at each in turn.
PEARSON, SPEARMAN, AND KENDALL CORRELATIONS
The Pearson product-moment correlation assesses the degree of linear relationship between two quantitative variables. Spearman’s rank-order correlation coefficient
assesses the degree of relationship between two rank-ordered variables. Kendall’s tau is also a nonparametric measure of rank correlation.
The cor() function produces all three correlation coefficients, whereas the cov() function provides covariances. There are many options, but a simplified format for producing correlations is
cor(x, use= , method= )
The options are described in table 7.2.
Table 7.2 cor/cov options

Option   Description
x        Matrix or data frame.
use      Specifies the handling of missing data. The options are all.obs (assumes no
         missing data; missing data will produce an error), everything (any correlation
         involving a case with missing values will be set to missing), complete.obs
         (listwise deletion), and pairwise.complete.obs (pairwise deletion).
method   Specifies the type of correlation. The options are pearson, spearman,
         and kendall.
The default options are use="everything" and method="pearson". You can see an example in the following listing.
Listing 7.14 Covariances and correlations
> states <- state.x77[,1:6]
> cov(states)
           Population Income Illiteracy Life Exp  Murder  HS Grad
Population   19931684 571230    292.868 -407.842 5663.52 -3551.51
Income         571230 377573   -163.702  280.663 -521.89  3076.77
Illiteracy        293   -164      0.372   -0.482    1.58    -3.24
Life Exp         -408    281     -0.482    1.802   -3.87     6.31
Murder           5664   -522      1.582   -3.869   13.63   -14.55
HS Grad         -3552   3077     -3.235    6.313  -14.55    65.24
> cor(states)
           Population Income Illiteracy Life Exp Murder HS Grad
Population     1.0000  0.208      0.108   -0.068  0.344 -0.0985
Income         0.2082  1.000     -0.437    0.340 -0.230  0.6199
Illiteracy     0.1076 -0.437      1.000   -0.588  0.703 -0.6572
Life Exp      -0.0681  0.340     -0.588    1.000 -0.781  0.5822
Murder         0.3436 -0.230      0.703   -0.781  1.000 -0.4880
HS Grad       -0.0985  0.620     -0.657    0.582 -0.488  1.0000
> cor(states, method="spearman")
           Population Income Illiteracy Life Exp Murder HS Grad
Population      1.000  0.125      0.313   -0.104  0.346  -0.383
Income          0.125  1.000     -0.315    0.324 -0.217   0.510
Illiteracy      0.313 -0.315      1.000   -0.555  0.672  -0.655
Life Exp       -0.104  0.324     -0.555    1.000 -0.780   0.524
Murder          0.346 -0.217      0.672   -0.780  1.000  -0.437
HS Grad        -0.383  0.510     -0.655    0.524 -0.437   1.000
The first call produces the variances and covariances. The second provides Pearson product-moment correlation coefficients, and the third produces Spearman rank-order correlation coefficients. You can see, for example, that a strong positive correlation exists between income and high school graduation rate and that a strong negative correlation exists between illiteracy rates and life expectancy.
Notice that you get square matrices by default (all variables crossed with all other variables). You can also produce nonsquare matrices, as shown in the following example:
> x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")]
> y <- states[,c("Life Exp", "Murder")]
> cor(x,y)
           Life Exp Murder
Population   -0.068  0.344
Income        0.340 -0.230
Illiteracy   -0.588  0.703
HS Grad       0.582 -0.488
This version of the function is particularly useful when you’re interested in the relationships between one set of variables and another. Notice that the results don’t tell you if the correlations differ significantly from 0 (that is, whether there’s sufficient evidence based on the sample data to conclude that the population correlations differ from 0). For that, you need tests of significance (described in section 7.3.2).
PARTIAL CORRELATIONS
A partial correlation is a correlation between two quantitative variables, controlling for one or more other quantitative variables. You can use the pcor() function in the ggm package to provide partial correlation coefficients. The ggm package isn’t installed by default, so be sure to install it on first use. The format is
pcor(u, S)
where u is a vector of numbers, with the first two numbers being the indices of the variables to be correlated, and the remaining numbers being the indices of the conditioning variables (that is, the variables being partialed out). S is the covariance matrix among the variables. An example will help clarify this:
> library(ggm)
> colnames(states)
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"     "HS Grad"
> pcor(c(1,5,2,3,6), cov(states))
[1] 0.346
In this case, 0.346 is the correlation between population (variable 1) and murder rate (variable 5), controlling for the influence of income, illiteracy rate, and high school graduation rate (variables 2, 3, and 6 respectively). The use of partial correlations is common in the social sciences.
OTHER TYPES OF CORRELATIONS
The hetcor() function in the polycor package can compute a heterogeneous correlation matrix containing Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, polychoric
correlations between ordinal variables, and tetrachoric correlations between two dichotomous variables. Polyserial, polychoric, and tetrachoric correlations assume that the ordinal or dichotomous variables are derived from underlying normal distributions. See the documentation that accompanies this package for more information.
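A minimal sketch of hetcor(), assuming the polycor package has been installed (the small data frame here is hypothetical, mixing a numeric score with an ordered rating):

library(polycor)
# Hypothetical data: a numeric test score and an ordered rating
df <- data.frame(
    score  = c(52, 61, 70, 48, 66, 75, 58, 63),
    rating = ordered(c("low", "med", "high", "low",
                       "med", "high", "med", "high"),
                     levels = c("low", "med", "high")))
hetcor(df)    # Pearson for numeric pairs, polyserial for mixed pairs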
7.3.2 Testing correlations for significance
Once you've generated correlation coefficients, how do you test them for statistical significance? The typical null hypothesis is no relationship (that is, the correlation in the population is 0). You can use the cor.test() function to test an individual Pearson, Spearman, or Kendall correlation coefficient. A simplified format is
cor.test(x, y, alternative = , method = )
where x and y are the variables to be correlated, alternative specifies a two-tailed or one-tailed test ("two.sided", "less", or "greater"), and method specifies the type of correlation ("pearson", "kendall", or "spearman") to compute. Use alternative="less" when the research hypothesis is that the population correlation is less than 0. Use alternative="greater" when the research hypothesis is that the population correlation is greater than 0. By default, alternative="two.sided" (the population correlation isn't equal to 0) is assumed. See the following listing for an example.
Listing 7.15 Testing a correlation coefficient for significance
> cor.test(states[,3], states[,5])

        Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 6.85, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.528 0.821
sample estimates:
  cor
0.703
This code tests the null hypothesis that the Pearson correlation between the illiteracy rate and the murder rate is 0. Assuming that the population correlation is 0, you'd expect to see a sample correlation as large as 0.703 less than 1 time out of 10 million (that is, p = 1.258e-08). Given how unlikely this is, you reject the null hypothesis in favor of the research hypothesis, that the population correlation between illiteracy and murder rate is not 0.
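Listing 7.15 used the default two-sided test. If the research hypothesis had been directional (say, that the population correlation between illiteracy and murder rate is greater than 0), the call changes by a single argument. A minimal sketch:

# One-tailed test of the same correlation
cor.test(states[,3], states[,5], alternative="greater", method="pearson")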
Unfortunately, you can test only one correlation at a time using cor.test(). Luckily, the corr.test() function provided in the psych package allows you to go further. The corr.test() function produces correlations and significance levels for matrices of Pearson, Spearman, and Kendall correlations. An example is given in the following listing.
Listing 7.16 Correlation matrix and tests of significance via corr.test()
> library(psych)
> corr.test(states, use="complete")
Call:corr.test(x = states, use = "complete")
Correlation matrix
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.21       0.11    -0.07   0.34   -0.10
Income           0.21   1.00      -0.44     0.34  -0.23    0.62
Illiteracy       0.11  -0.44       1.00    -0.59   0.70   -0.66
Life Exp        -0.07   0.34      -0.59     1.00  -0.78    0.58
Murder           0.34  -0.23       0.70    -0.78   1.00   -0.49
HS Grad         -0.10   0.62      -0.66     0.58  -0.49    1.00
Sample Size
[1] 50
Probability value
           Population Income Illiteracy Life Exp Murder HS Grad
Population       0.00   0.15       0.46     0.64   0.01     0.5
Income           0.15   0.00       0.00     0.02   0.11     0.0
Illiteracy       0.46   0.00       0.00     0.00   0.00     0.0
Life Exp         0.64   0.02       0.00     0.00   0.00     0.0
Murder           0.01   0.11       0.00     0.00   0.00     0.0
HS Grad          0.50   0.00       0.00     0.00   0.00     0.0
The use= options can be "pairwise" or "complete" (for pairwise or listwise deletion of missing values, respectively). The method= option is "pearson" (the default), "spearman", or "kendall". Here you see that the correlation between population size and high school graduation rate (–0.10) is not significantly different from 0 (p = 0.5).
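For instance, a Spearman version of the same analysis with pairwise deletion is a one-line change. A minimal sketch:

# Spearman rank-order correlations with pairwise deletion
corr.test(states, use="pairwise", method="spearman")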
OTHER TESTS OF SIGNIFICANCE
In section 7.3.1, we looked at partial correlations. The pcor.test() function in the ggm package can be used to test the conditional independence of two variables controlling for one or more additional variables, assuming multivariate normality. The format is
pcor.test(r, q, n)
where r is the partial correlation produced by the pcor() function, q is the number of variables being controlled, and n is the sample size.
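Continuing the earlier example, a minimal sketch (assuming ggm is loaded and states is defined as before): the partial correlation of 0.346 between population and murder rate was computed controlling for three variables, with the 50 states as the sample:

library(ggm)
pc <- pcor(c(1,5,2,3,6), cov(states))    # partial correlation = 0.346
pcor.test(pc, 3, 50)    # 3 conditioning variables, n = 50 states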
Finally, note that the r.test() function in the psych package provides a number of other useful significance tests. It can be used to test the following:
■The significance of a correlation coefficient
■The difference between two independent correlations
■The difference between two dependent correlations sharing a single variable
■The difference between two dependent correlations based on completely different variables
See help(r.test) for details.
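For instance, a minimal sketch of the first two tests (the second sample's correlation and size are hypothetical):

library(psych)
# Test that a single correlation (r = .703, n = 50) differs from 0
r.test(n=50, r12=.703)
# Test whether two independent correlations differ; the second
# sample (r = .50, n = 40) is hypothetical
r.test(n=50, r12=.703, r34=.50, n2=40)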