
vocab           0.80  0.23 0.69 0.31

                 PA1  PA2
SS loadings     1.83 1.75
Proportion Var  0.30 0.29
Cumulative Var  0.30 0.60
[… additional output omitted …]
The rotated factor loadings are certainly easier to interpret. Reading and vocabulary load on the first factor; picture completion, block design, and mazes load on the second factor; and the general nonverbal intelligence measure loads on both. This may indicate a verbal intelligence factor and a nonverbal intelligence factor.
By using an orthogonal rotation, you’ve artificially forced the two factors to be uncorrelated. What would you find if you allowed the two factors to correlate? You can try an oblique rotation such as promax (see the next listing).
Listing 14.8 Factor extraction with oblique rotation
> fa.promax <- fa(correlations, nfactors=2, rotate="promax", fm="pa")
> fa.promax
Factor Analysis using method = pa
Call: fa(r = correlations, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings based upon correlation matrix
          PA1   PA2   h2   u2
general  0.36  0.49 0.57 0.43
picture -0.04  0.64 0.38 0.62
blocks  -0.12  0.98 0.83 0.17
maze    -0.01  0.45 0.20 0.80
reading  1.01 -0.11 0.91 0.09
vocab    0.84 -0.02 0.69 0.31
PA1 PA2
SS loadings 1.82 1.76
Proportion Var 0.30 0.29
Cumulative Var 0.30 0.60
With factor correlations of
     PA1  PA2
PA1 1.00 0.57
PA2 0.57 1.00
[… additional output omitted …]
Several differences exist between the orthogonal and oblique solutions. In an orthogonal solution, attention focuses on the factor structure matrix (the correlations of the variables with the factors). In an oblique solution, there are three matrices to consider: the factor structure matrix, the factor pattern matrix, and the factor intercorrelation matrix.
The factor pattern matrix is a matrix of standardized regression coefficients. They give the weights for predicting the variables from the factors. The factor intercorrelation matrix gives the correlations among the factors.
In listing 14.8, the values in the PA1 and PA2 columns constitute the factor pattern matrix. They’re standardized regression coefficients rather than correlations. Examination of the columns of this matrix is still used to name the factors (although there’s some controversy here). Again you’d find a verbal and nonverbal factor.
The factor intercorrelation matrix indicates that the correlation between the two factors is 0.57. This is a hefty correlation. If the factor intercorrelations had been low, you might have gone back to an orthogonal solution to keep things simple.
The factor structure matrix (or factor loading matrix) isn’t provided. But you can easily calculate it using the formula F = P*Phi, where F is the factor loading matrix, P is the factor pattern matrix, and Phi is the factor intercorrelation matrix. A simple function for carrying out the multiplication is as follows:
fsm <- function(oblique) {
  if (class(oblique)[2] == "fa" & is.null(oblique$Phi)) {
    warning("Object doesn't look like oblique EFA")
  } else {
    P <- unclass(oblique$loading)    # factor pattern matrix
    F <- P %*% oblique$Phi           # structure = pattern %*% intercorrelations
    colnames(F) <- c("PA1", "PA2")
    return(F)
  }
}
Applying this to the example, you get
> fsm(fa.promax)
         PA1  PA2
general 0.64 0.69
picture 0.33 0.61
blocks  0.44 0.91
maze    0.25 0.45
reading 0.95 0.47
vocab   0.83 0.46
Now you can review the correlations between the variables and the factors. Comparing them to the factor loading matrix in the orthogonal solution, you see that these columns aren’t as pure. This is because you’ve allowed the underlying factors to be correlated. Although the oblique approach is more complicated, it’s often a more realistic model of the data.
You can graph an orthogonal or oblique solution using the factor.plot() or fa.diagram() function. The code
factor.plot(fa.promax, labels=rownames(fa.promax$loadings))
produces the graph in figure 14.5. The code
fa.diagram(fa.promax, simple=FALSE)
produces the diagram in figure 14.6. If you set simple=TRUE, only the largest loading per item is displayed. The diagram shows the largest loadings for each factor, as well as the correlation between the factors. This type of diagram is helpful when there are several factors.

14.3.4 Factor scores
Compared with PCA, the goal of EFA is much less likely to be the calculation of factor scores. But these scores are easily obtained from the fa() function by including the score=TRUE option (when raw data is available). Additionally, the scoring coefficients (standardized regression weights) are available in the weights element of the object returned.
For the ability.cov dataset, you can obtain the beta weights for calculating the factor score estimates for the two-factor oblique solution using
> fa.promax$weights
         [,1]  [,2]
general 0.080 0.210
picture 0.021 0.090
blocks  0.044 0.695
maze    0.027 0.035
reading 0.739 0.044
vocab   0.176 0.039
Unlike component scores, which are calculated exactly, factor scores can only be estimated. Several methods exist; the fa() function uses the regression approach. To learn more about factor scores, see DiStefano, Zhu, and Mîndrilă (2009).
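These weights would be applied to standardized data. As a minimal sketch (not from the book), assuming a hypothetical raw data matrix rawdata containing the same six variables in the same order, the regression-based score estimates could be computed as
# rawdata is hypothetical: ability.cov itself supplies only a covariance
# matrix, so factor scores can't be computed from it directly
scores <- scale(rawdata) %*% fa.promax$weights   # standardize, then apply weights
colnames(scores) <- c("PA1", "PA2")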
Before moving on, let’s briefly review other R packages that are useful for exploratory factor analysis.
14.3.5 Other EFA-related packages
R contains a number of other contributed packages that are useful for conducting factor analyses. The FactoMineR package provides methods for PCA and EFA, as well as other latent variable models, and offers many options we haven't considered here, including the use of both numeric and categorical variables. The FAiR package estimates factor analysis models using a genetic algorithm, which makes it possible to impose inequality restrictions on model parameters. The GPArotation package offers many additional factor-rotation methods. Finally, the nFactors package offers sophisticated techniques for determining the number of factors underlying data.
14.4 Other latent variable models
EFA is only one of a wide range of latent variable models used in statistics. We’ll end this chapter with a brief description of other models that can be fit within R. These include models that test a priori theories, that can handle mixed data types (numeric and categorical), or that are based solely on categorical multiway tables.
In EFA, you allow the data to determine the number of factors to be extracted and their meaning. But you could start with a theory about how many factors underlie a set of variables, how the variables load on those factors, and how the factors correlate with one another. You could then test this theory against a set of collected data. The approach is called confirmatory factor analysis (CFA).
CFA is a subset of a methodology called structural equation modeling (SEM). SEM allows you to posit not only the number and composition of underlying factors but also
how these factors affect one another. You can think of SEM as a combination of confirmatory factor analyses (for the variables) and regression analyses (for the factors). The resulting output includes statistical tests and fit indices. There are several excellent packages for CFA and SEM in R, including sem, OpenMx, and lavaan.
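For a flavor of the syntax, here's a minimal sketch (not from the book) of how the two-factor verbal/nonverbal theory suggested by the EFA might be tested as a CFA with lavaan; the correlations matrix is the one used earlier in the chapter, and 112 is the sample size reported for the ability.cov data:
library(lavaan)
# hypothetical two-factor model mirroring the EFA results
model <- '
  verbal    =~ reading + vocab + general
  nonverbal =~ picture + blocks + maze
'
fit <- cfa(model, sample.cov=correlations, sample.nobs=112)
summary(fit, fit.measures=TRUE)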
The ltm package can be used to fit latent trait models to the items contained in tests and questionnaires. The methodology is often used to create large-scale standardized tests, such as the Scholastic Aptitude Test (SAT) and the Graduate Record Exam (GRE).
Latent class models (where the underlying factors are assumed to be categorical rather than continuous) can be fit with the FlexMix, lcmm, randomLCA, and poLCA packages. The lcda package performs latent class discriminant analysis, and the lsa package performs latent semantic analysis, a methodology used in natural language processing.
The ca package provides functions for simple and multiple correspondence analysis. These methods allow you to explore the structure of categorical variables in two-way and multiway tables, respectively.
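For example, here's a brief sketch using the smoke dataset that ships with the ca package (a small staff-group by smoking-category frequency table):
library(ca)
data("smoke")          # staff group x smoking category frequencies
fit <- ca(smoke)       # simple correspondence analysis
summary(fit)
plot(fit)              # joint map of row and column categories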
Finally, R contains numerous methods for multidimensional scaling (MDS). MDS is designed to detect underlying dimensions that explain the similarities and distances between a set of measured objects (for example, countries). The cmdscale() function in the base installation performs a classical MDS, while the isoMDS() function in the MASS package performs a nonmetric MDS. The vegan package also contains functions for classical and nonmetric MDS.
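As a quick sketch (not from the book), using the built-in eurodist road distances between 21 European cities:
fit <- cmdscale(eurodist, k=2)       # classical MDS in base R
plot(fit, type="n")
text(fit, labels=labels(eurodist))   # plot city names at fitted coordinates

library(MASS)
fit2 <- isoMDS(eurodist, k=2)        # nonmetric MDS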
14.5 Summary
In this chapter, we reviewed methods for principal components analysis (PCA) and exploratory factor analysis (EFA). PCA is a useful data-reduction method that can replace a large number of correlated variables with a smaller number of uncorrelated variables, simplifying the analysis. EFA encompasses a broad range of methods for identifying latent or unobserved constructs (factors) that may underlie a set of observed or manifest variables.
Whereas the goal of PCA is typically to summarize the data and reduce its dimensionality, EFA can be used as a hypothesis-generating tool, useful when you're trying to understand the relationships among a large number of variables. It's often used in the social sciences for theory development.
Although there are many superficial similarities between the two approaches, important differences exist as well. In this chapter, we considered the models underlying each, methods for selecting the number of components/factors to extract, methods for extracting components/factors and rotating (transforming) them to enhance interpretability, and techniques for obtaining component or factor scores. The steps in a PCA or EFA are summarized in figure 14.7. We ended the chapter with a brief discussion of other latent variable methods available in R.

[Figure 14.7 (decision chart): select the factor model (components vs. common factor); for common factors, choose an extraction method (maximum likelihood, principal axis, weighted least squares, or minimum residual); decide on the number of components/factors (e.g., Kaiser–Harris criterion, theory); rotate the components/factors (orthogonal, such as varimax; oblique, such as promax; or other); interpret the components/factors; and calculate factor scores.]
Figure 14.7 A principal components/exploratory factor analysis decision chart
Because PCA and EFA are based on correlation matrices, it’s important that any missing data be eliminated before proceeding with the analyses. Section 4.5 briefly mentioned simple methods for dealing with missing data. In the next chapter, we’ll consider more sophisticated methods for both understanding and handling missing values.

Data can be missing for many reasons. Survey participants may forget to answer one or more questions, refuse to answer sensitive questions, or grow fatigued and fail to complete a long questionnaire. Study participants may miss appointments or drop out of a study prematurely. Recording equipment may fail, internet connections may be lost, and data may be miscoded. The presence of missing data may even be planned. For example, to increase study efficiency or reduce costs, you may choose not to collect all data from all participants. Finally, data may be lost for reasons that you’re never able to ascertain.
Unfortunately, most statistical methods assume that you’re working with complete matrices, vectors, and data frames. In most cases, you have to eliminate missing data before you address the substantive questions that led you to collect the data. You can eliminate missing data by (1) removing cases with missing data, or (2) replacing missing data with reasonable substitute values. In either case, the end result is a dataset without missing values.
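For example, case removal takes one line (mydata here is a placeholder for your data frame):
newdata <- na.omit(mydata)    # drop every row that contains a missing value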
In this chapter, we’ll look at both traditional and modern approaches for dealing with missing data. We’ll primarily use the VIM and mice packages. The command install.packages(c("VIM", "mice")) will download and install both.
To motivate the discussion, we'll look at the mammal sleep dataset (sleep) provided in the VIM package (not to be confused with the sleep dataset describing the impact of drugs on sleep provided in the base installation). The data come from a study by Allison and Cicchetti (1976) that examined the relationship between sleep, ecological, and constitutional variables for 62 mammal species. The authors were interested in why animals' sleep requirements vary from species to species. The sleep variables served as the dependent variables, whereas the ecological and constitutional variables served as the independent or predictor variables.
Sleep variables included length of dreaming sleep (Dream), nondreaming sleep (NonD), and their sum (Sleep). The constitutional variables included body weight in kilograms (BodyWgt), brain weight in grams (BrainWgt), life span in years (Span), and gestation time in days (Gest). The ecological variables included degree to which species were preyed upon (Pred), degree of their exposure while sleeping (Exp), and overall danger (Danger) faced. The ecological variables were measured on 5-point rating scales that ranged from 1 (low) to 5 (high).
In their original article, Allison and Cicchetti limited their analyses to the species that had complete data. We'll go further, analyzing all 62 cases using a multiple imputation approach.
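A quick sketch for loading and inspecting the data:
library(VIM)
data(sleep, package="VIM")    # mammal sleep data
dim(sleep)                    # 62 rows, 10 variables
head(sleep)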
15.1 Steps in dealing with missing data
Readers new to the study of missing data will find a bewildering array of approaches, critiques, and methodologies. The classic text in this area is Little and Rubin (2002). Excellent, accessible reviews can be found in Allison (2001); Schafer and Graham (2002); and Schlomer, Bauman, and Card (2010). A comprehensive approach will usually include the following steps:

1. Identify the missing data.
2. Examine the causes of the missing data.
3. Delete the cases containing missing data, or replace (impute) the missing values with reasonable alternative data values.
Unfortunately, identifying missing data is usually the only unambiguous step. Learning why data are missing depends on your understanding of the processes that generated the data. Deciding how to treat missing values will depend on your estimation of which procedures will produce the most reliable and accurate results.
A classification system for missing data
Statisticians typically classify missing data into one of three types. These types are usually described in probabilistic terms, but the underlying ideas are straightforward. We'll use the measurement of dreaming in the sleep study (where 12 animals have missing values) to illustrate each type in turn.
(1) Missing completely at random—If the presence of missing data on a variable is unrelated to any other observed or unobserved variable, then the data are missing completely at random (MCAR). If there's no systematic reason why dream sleep is missing for these 12 animals, the data is said to be MCAR. Note that if every variable with missing data is MCAR, you can consider the complete cases to be a simple random sample from the larger dataset.
(2) Missing at random—If the presence of missing data on a variable is related to other observed variables but not to its own unobserved value, the data is missing at random (MAR). For example, if animals with smaller body weights are more likely to have missing values for dream sleep (perhaps because it's harder to observe smaller animals), and the "missingness" is unrelated to an animal's time spent dreaming, the data would be considered MAR. In this case, the presence or absence of dream sleep data would be random, once you controlled for body weight.
(3) Not missing at random—If the missing data for a variable is neither MCAR nor MAR, it is not missing at random (NMAR). For example, if animals that spend less time dreaming are also more likely to have a missing dream value (perhaps because it's harder to measure shorter events), the data would be considered NMAR.
Most approaches to missing data assume that the data is either MCAR or MAR. In this case, you can ignore the mechanism producing the missing data and (after replacing or deleting the missing data) model the relationships of interest directly. Data that’s NMAR can be difficult to analyze properly. When data is NMAR, you have to model the mechanisms that produced the missing values, as well as the relationships of interest. (Current approaches to analyzing NMAR data include the use of selection models and pattern mixtures. The analysis of NMAR data can be quite complex and is beyond the scope of this book.)
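To make the first two mechanisms concrete, here's a toy simulation (not from the book; all names and values are made up) that generates MCAR and MAR missingness in a hypothetical dream-sleep variable:
set.seed(1234)
bodywgt <- rlnorm(62)                       # hypothetical body weights
dream   <- rnorm(62, mean=2)                # hypothetical dream sleep
# MCAR: every value has the same 20% chance of being missing
dream_mcar <- ifelse(runif(62) < 0.2, NA, dream)
# MAR: smaller animals are more likely to be missing,
# but missingness doesn't depend on dream itself
p_miss <- 1 / (1 + bodywgt)
dream_mar <- ifelse(runif(62) < p_miss, NA, dream)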

[Figure 15.1 (flowchart): identify missing values (is.na(), !complete.cases(), the VIM package); then delete missing values, either casewise/listwise (na.omit()) or by available-case (pairwise) analysis (an option available for some functions); estimate by maximum likelihood (the mvnmle package); or impute missing values, using single (simple) imputation (the Hmisc package) or multiple imputation (the mi, mice, Amelia, and mitools packages).]
Figure 15.1 Methods for handling incomplete data, along with the R packages that support them
There are many methods for dealing with missing data—and no guarantee that they’ll produce the same results. Figure 15.1 describes an array of methods used for handling incomplete data and the R packages that support them.
A complete review of missing-data methodologies would require a book in itself. In this chapter, we'll review methods for exploring patterns of missing values and focus on the three most popular methods for dealing with incomplete data (a rational approach, listwise deletion, and multiple imputation). We'll end the chapter with a brief discussion of other methods, including those that are useful in special circumstances.
15.2 Identifying missing values
To begin, let’s review the material introduced in chapter 4, section 4.5, and expand on it. R represents missing values using the symbol NA (not available) and impossible values by the symbol NaN (not a number). In addition, the symbols Inf and -Inf represent positive infinity and negative infinity, respectively. The functions is.na(), is.nan(), and is.infinite() can be used to identify missing, impossible, and infinite values respectively. Each returns either TRUE or FALSE. Examples are given in table 15.1.
Table 15.1 Examples of return values for the is.na(), is.nan(), and is.infinite() functions
x            is.na(x)   is.nan(x)   is.infinite(x)
x <- NA      TRUE       FALSE       FALSE
x <- 0 / 0   TRUE       TRUE        FALSE
x <- 1 / 0   FALSE      FALSE       TRUE
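You can reproduce these results at the console:
x <- NA
is.na(x); is.nan(x); is.infinite(x)    # TRUE  FALSE FALSE
x <- 0/0                               # NaN
is.na(x); is.nan(x); is.infinite(x)    # TRUE  TRUE  FALSE
x <- 1/0                               # Inf
is.na(x); is.nan(x); is.infinite(x)    # FALSE FALSE TRUE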