STRENGTHS AND WEAKNESSES OF NEPSY-II

…starting and stopping points, scores were determined using empirically based findings.

Despite the aforementioned strengths, the NEPSY-II does have several potential weaknesses. As can be seen in Rapid Reference 5.2, a number of subtests were excluded from the lengthy standardization battery as a strategy to facilitate obtaining the normative data. These subtests were excluded partly to shorten the time required for a child to complete the standardization battery, but also because the development team determined that these tasks required no modifications and likely would show little change in normative values. Consequently, these subtests (i.e., Design Fluency, Oromotor Sequences, Repetition of Nonsense Words, Manual Motor Sequences, Route Finding, Imitating Hand Positions) were not included in the standardization version and retained their normative scores from the 1998 version of the NEPSY. Despite these efforts, no data are provided from either the pilot or tryout phases of the NEPSY-II to suggest that the 1998 versions of these tasks are equivalent to the 2007 versions. This creates a potential problem for interpretation and, in many respects, undermines one of the major benefits of a battery of tasks (i.e., all tasks normed on the same population). If these versions are not equivalent in their normative data, then unknown error variance will be introduced when these subtests are compared with the newly normed subtests in a profile of scores. How this may affect interpretation is not known.

Second, the use of 50 children per age band is relatively weak, especially for the preschool years, where there is significantly more potential error variance in test data given the relatively lower reliability of test scores in younger populations. Further, the test developers provide little empirical data to support the use of six-month intervals when splitting the normative data by age; it is unclear whether this choice was in line with empirical data showing developmental changes on the tasks over time.
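To make the sample-size concern concrete, the sketch below shows how much sampling error a normative mean carries with 50 children per age band. It is illustrative only; it assumes the scaled-score metric of mean 10 and SD 3 and a simple random-sampling model, not the actual norm-construction procedures.

    # Illustrative sketch: sampling error of a normative mean with n = 50 per band.
    # Assumes scaled scores with population SD 3 and simple random sampling;
    # real norm construction is more involved.
    import math

    sd = 3.0                              # assumed subtest scaled-score SD
    n = 50                                # children per age band
    se_mean = sd / math.sqrt(n)           # standard error of the band's mean
    ci_95 = 1.96 * se_mean                # half-width of a 95% confidence interval

    print(f"SE of the normative mean: {se_mean:.2f} scaled-score points")
    print(f"95% CI half-width:        {ci_95:.2f} scaled-score points")
    # Roughly 0.42 and 0.83 points, respectively, which is non-trivial when
    # profiles are interpreted at the level of 1-point scaled-score differences.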

PSYCHOMETRIC PROPERTIES

Reliability

The area of reliability is critical in test construction; not only does it determine a test's capability to replicate results, but it also sets the upper limit on validity (Bracken, 1992). A summary of the strengths and weaknesses of the reliability of the NEPSY-II is presented in Rapid Reference 5.3.
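The sense in which reliability caps validity can be illustrated with the classical attenuation bound: an observed validity coefficient cannot exceed the square root of the product of the two measures' reliabilities. The snippet below is a simple numerical illustration with hypothetical reliability values, not figures from the NEPSY-II manual.

    # Classical test theory bound: r_xy <= sqrt(r_xx * r_yy).
    # The reliabilities below are hypothetical, chosen only to illustrate the bound.
    import math

    def max_validity(r_xx: float, r_yy: float) -> float:
        """Upper limit on the correlation between two measures given their reliabilities."""
        return math.sqrt(r_xx * r_yy)

    r_test = 0.80       # hypothetical reliability of a NEPSY-II-like subtest
    r_criterion = 0.90  # hypothetical reliability of an external criterion

    print(f"Maximum observable validity coefficient: {max_validity(r_test, r_criterion):.2f}")
    # With r_xx = .80 and r_yy = .90 the observed correlation cannot exceed ~.85,
    # no matter how strongly the underlying constructs are related.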


Rapid Reference 5.3

Strengths and Weaknesses of NEPSY-II Reliability

Strengths
• Most subtests were adequate to high for internal consistency estimates.
• Good inter-rater reliability for Clocks, Design Copying, Memory for Names, Theory of Mind, Word Generation, Visual Memory Delayed, and Visual Motor Precision, ranging from .93 to .99.
• Test-retest reliability across seven age groups revealed little change in scores, with test intervals ranging from 12 to 51 days (M = 21 days).
• Focused on improving the test floors, with nearly all subtests across age bands having at least a 2 standard deviation limit; more tasks at the 3.0 to 4.5 ages (5–13) do not reach this criterion, while 0–1 from ages 9.0 to 16.9.
• Standard error of measurement ranged from .85 to 2.18 across subtests and ages.
• Reliability of subtests was relatively stable for both typical and impaired samples.

Weaknesses
• Lowest reliability coefficients were achieved on Response Set Total Correct, Inhibition Total Errors, Memory for Designs Spatial, Total Score, and Delay Total.
• Practice effects most noted on Memory for Designs, Memory for Faces, and Inhibition.
• Less focus on improved ceilings, which may place a limit on interpreting neurocognitive strengths; at present, from ages 3 to 5.9 there are 0–5 subtests, from ages 6.0 to 9.9 there are 3–8 subtests, and from 10.0 to 16.9 there are 11–17 subtests.

For the NEPSY-II primary and process scaled scores, most subtests have adequate to high internal consistency estimates, and the standard error of measurement for all of the subtests ranged from 0.85 to 2.18 across the age ranges. Temporal stability of the NEPSY-II subtests across seven age groups (i.e., ages 3 to 4, 5 to 6, 7 to 8, 8 to 9, 9 to 10, 11 to 12, and 13 to 16) revealed little change in scores over an average three-week interval (range 12 to 51 days), suggesting that there was little practice effect across a relatively short time frame. This is critical for a test for which alternate-form reliability is not available, and it provides an evidence-based foundation for using many of the subtests in various types of intervention studies. The largest score differences were noted on the Memory for Designs, Memory for Faces, and Inhibition subtests. In general, the highest reliability coefficients were achieved for the Comprehension of Instructions, Design Copying, Fingertip Tapping, Imitating Hand Positions, List Memory, Memory for Names, Phonological Processing, Picture Puzzles, and Sentence Repetition subtests. The lowest reliability coefficients were noted for the Response Set Total Correct, Inhibition Total Errors, Memory for Designs Spatial and Total Scores, and Memory for Designs Delayed Total Score variables. Given the potential applications of the NEPSY-II, it also is important to note that the reliability of the subtests was relatively stable for both typical and impaired samples.
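As a concrete illustration of what a standard error of measurement in the 0.85 to 2.18 range implies for interpretation, the sketch below builds a 95% confidence band around an observed scaled score. The particular score and SEM values are hypothetical, not drawn from the manual's tables.

    # Confidence band around an observed scaled score, given the subtest's SEM.
    # Score and SEM values are hypothetical; look up the actual SEM for the
    # child's age band and subtest in the test manual.
    def confidence_band(observed: float, sem: float, z: float = 1.96):
        """Return the (lower, upper) bounds of a ~95% confidence band."""
        return observed - z * sem, observed + z * sem

    for sem in (0.85, 2.18):                 # smallest and largest SEMs reported
        lo, hi = confidence_band(observed=7.0, sem=sem)
        print(f"SEM {sem:.2f}: scaled score 7 -> {lo:.1f} to {hi:.1f}")
    # With the largest SEM, a scaled score of 7 is consistent with true scores
    # spanning roughly 2.7 to 11.3, which reaches well into the average range.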

The NEPSY-II also examined inter-rater agreement across all cases ascertained for the standardization and clinical validity studies by employing two independent scorers. For the more objective types of subtests (e.g., Comprehension of Instructions), agreement rates were quite high, ranging from .98 to .99. This also speaks to the integrity of the data included in the standardization phase. Scoring for several of the subtests requires clinical judgment (e.g., Clocks, Design Copying) or implementation of specific scoring rules (e.g., Memory for Names, Word Generation), thus necessitating the generation of inter-rater reliability estimates. For these subtests, inter-rater agreement ranged from .93 (Word Generation) to .99 (Memory for Names, Theory of Mind). These findings suggest that these specific subtests will require more scoring attention from trainers and administrators of the NEPSY-II, but that a high degree of scoring reliability can be achieved on them.
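One simple way to gauge inter-rater agreement of the kind reported above is to correlate two independent scorers' totals across a set of protocols; the manual's exact procedure may differ, and the scores below are invented purely for illustration.

    # Minimal sketch: Pearson correlation between two independent scorers' totals.
    # The protocol scores are fabricated for illustration; this is not NEPSY-II data.
    from statistics import correlation  # Python 3.10+

    scorer_a = [14, 22, 9, 18, 25, 11, 20, 16]
    scorer_b = [15, 22, 8, 18, 24, 11, 21, 16]

    r = correlation(scorer_a, scorer_b)
    print(f"Inter-scorer correlation: {r:.3f}")
    # Values in the .93-.99 range, as reported for the judgment-based subtests,
    # indicate that two trained scorers rank and scale protocols almost identically.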

Finally, although not directly related to reliability, test floors and ceilings can have a direct effect on reliability by potentially restricting the range of available scores. For the NEPSY-II, the test developers devoted significant resources to this issue, particularly to improving the test floors. This is absolutely critical for a test that is designed to uncover neurocognitive weaknesses across different presenting problems and disorders. In this regard, nearly all of the NEPSY-II subtests across the age bands have at least a two standard deviation floor (i.e., the lowest obtainable score falls at least two standard deviations below the normative mean). There is a difference across the age range, however, with all or nearly all of the subtests having at least this floor from ages 9.0 to 16.9, but with 5 to 13 of the subtests not reaching this floor in the youngest (3.0 to 4.5) age bands. In general, the number of subtests having at least a two standard deviation floor increases with age. Conversely, the test developers were not as focused on providing at least a two standard deviation ceiling on the subtests. Although this is consistent with the NEPSY-II's emphasis on detecting weaknesses, it is unfortunate, as it may place a limit on determining the presence of neurocognitive strengths; consequently, the use of neurocognitive strengths to facilitate intervention may be limited. This limitation is apparent in the NEPSY-II normative data: from ages 3.0 to 5.9, only 0 to 5 subtests meet this criterion; from ages 6.0 to 9.9, only 3 to 8 subtests meet it; and from ages 10.0 to 16.9, only 11 to 17 subtests meet it.
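The two-standard-deviation criterion discussed above can be checked mechanically: on the scaled-score metric (mean 10, SD 3), a subtest has an adequate floor if its lowest attainable scaled score is 4 or less, and an adequate ceiling if its highest attainable score is 16 or more. The sketch below applies that rule to hypothetical subtest ranges; it is not a reproduction of the NEPSY-II norm tables.

    # Floor/ceiling check on the scaled-score metric (mean 10, SD 3).
    # A 2-SD floor requires a minimum attainable score <= 4; a 2-SD ceiling
    # requires a maximum attainable score >= 16. Ranges below are hypothetical.
    MEAN, SD = 10, 3

    def floor_ok(min_score: int) -> bool:
        return min_score <= MEAN - 2 * SD

    def ceiling_ok(max_score: int) -> bool:
        return max_score >= MEAN + 2 * SD

    hypothetical_subtests = {
        "Subtest A (age 3)": (2, 13),   # adequate floor, inadequate ceiling
        "Subtest B (age 12)": (1, 19),  # adequate floor and ceiling
    }
    for name, (lo, hi) in hypothetical_subtests.items():
        print(f"{name}: floor ok={floor_ok(lo)}, ceiling ok={ceiling_ok(hi)}")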

Validity

As noted in the test manual, “contemporary definitions of validity describe lines of evidence of validity as opposed to different types of validity” (p. 79). Lines of validity evidence may be the most important aspect of a test such as the NEPSY-II, where interpretation issues are critical to the ultimate clinical utility of the test (American Educational Research Association et al., 1999). According to the Standards for Educational and Psychological Testing (1999), key lines of validity evidence include content, construct, and criterion-related evidence. Strengths and weaknesses of the NEPSY-II for these lines of validity are presented in Rapid Reference 5.4.

Rapid Reference 5.4

Strengths and Weaknesses of NEPSY-II Validity

Strengths
• Content, concurrent, and construct validity issues all adequately addressed.
• Subtests have strong theoretical (Lurian) and evidence-based foundations.
• Subtest intercorrelations fit a multitrait-multimethod model.
• Correlations with intellectual batteries (WISC-IV, DAS) and other cognitive batteries (NEPSY) are moderate to strong.
• Correlations with achievement batteries (e.g., WIAT-II) are moderate to strong.
• Correlations with specific neurocognitive batteries (e.g., DKEFS, CMS, BBCS) are moderate to strong.
• Correlations with the Devereux Scales of Mental Disorders show specific relationships with Autism (Comprehension of Instructions) and Conduct Disorder (Affect Recognition).
• Correlations with an ADHD scale show relationships between the Inhibition subtest and the Focus Cluster.
• Employed 10 special group studies (e.g., ADHD, RD, MD, TBI, ASD, etc.).

Weaknesses
• Although the NEPSY-II is driven by a subtest model, users are still left with the issue of how tests are clustered, especially at different developmental epochs.
• No subtest specificity estimates were provided to reinforce the interpretive strength of the NEPSY-II.
• No relationships with adaptive behavior as measured by the ABAS-II.
• Research criteria not employed for many of the special group studies, leaving them too heterogeneous and likely not generalizable to the larger contemporary research corpus.
• The special groups were not compared to show differential profiles.

Content validity (i.e., do the subtests adequately sample the targeted constructs of interest?) for the NEPSY-II had the benefit of the 1998 NEPSY upon which to base its modifications. The 1998 NEPSY was based on Lurian neuropsychological theory while capitalizing on recent advances in the field of child neuropsychology. For the NEPSY-II, this theoretical foundation remained, but the research that utilized the 1998 version of the NEPSY was reviewed for its relevance to test revision and specific modifications. The pilot and tryout phases of test development further facilitated the examination of specific items within subtests, as well as the subtests proper, with a particular focus on content gaps. Following the standardization phase, additional analyses were conducted to determine the adequacy of content at the item level, content biases, and associated psychometric properties. This process also extended to the examination of children's responses, such that typical and atypical responses were considered with respect to whether the item or subtest was capturing the intended information. Taken together, these procedures produced a battery of tasks that adequately sample the targeted constructs of interest.


Construct validity pertains to the internal structure of a test, particularly with respect to the interrelationships of its subtests or components. This is important given the theoretical neurocognitive domains espoused by the NEPSY-II, as it drives how the components of the test are viewed for interpretation. The NEPSY-II presents an interesting challenge for construct validity in that, while there appears to be an overarching set of neuropsychological domains (e.g., Language, Visuospatial Processing, Social Perception), the test is not designed to provide scores for these domains. As such, and in accordance with the guidance in the Clinical and Interpretive Manual, the administration and interpretation of the NEPSY-II should be guided by the subtests. The 1998 version of the NEPSY did not present any factor analysis data, but two subsequent reports addressed this issue via exploratory (Stinnett, Oehler-Stinnett, Fuqua, & Palmer, 2002) and confirmatory (Mosconi, Nelson, & Hooper, 2008) factor analytic methods. Consistent with its subtest-focused philosophy, the NEPSY-II continues to emphasize the subtests and their interrelationships. The test developers hypothesized a subtest intercorrelation pattern wherein the subtests within a domain would correlate more highly with one another than with subtests in other domains. This multitrait-multimethod model for construct validity was supported in both the normative and clinical samples, with the correlations within many of the domains being higher in the clinical samples. Of note, subtests within the Language domain produced the highest intercorrelations, and many of these subtests also were more highly correlated with verbally based subtests in the other neurocognitive domains. The test developers cite this pattern of correlations as support for the structure of the NEPSY-II.
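The intercorrelation pattern the developers describe, with subtests correlating more strongly within a domain than across domains, can be summarized from any subtest correlation matrix along the lines sketched below. The matrix and domain assignments here are fabricated stand-ins, not NEPSY-II values.

    # Compare average within-domain vs. between-domain subtest correlations.
    # The correlation matrix and domain labels are fabricated for illustration.
    import numpy as np

    subtests = ["Lang1", "Lang2", "Lang3", "VS1", "VS2", "Mem1"]
    domains  = ["Language", "Language", "Language", "Visuospatial", "Visuospatial", "Memory"]
    corr = np.array([
        [1.00, 0.62, 0.58, 0.31, 0.28, 0.35],
        [0.62, 1.00, 0.55, 0.29, 0.30, 0.33],
        [0.58, 0.55, 1.00, 0.27, 0.25, 0.38],
        [0.31, 0.29, 0.27, 1.00, 0.57, 0.30],
        [0.28, 0.30, 0.25, 0.57, 1.00, 0.26],
        [0.35, 0.33, 0.38, 0.30, 0.26, 1.00],
    ])

    within, between = [], []
    n = len(subtests)
    for i in range(n):
        for j in range(i + 1, n):
            (within if domains[i] == domains[j] else between).append(corr[i, j])

    print(f"Mean within-domain r:  {np.mean(within):.2f}")
    print(f"Mean between-domain r: {np.mean(between):.2f}")
    # Within-domain correlations exceeding between-domain correlations is the
    # pattern the developers cite as support for the NEPSY-II's structure.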

Although the NEPSY-II is driven by a subtest model, with the authors and test developers arguing that this is the best approach for interpreting the data, users are still left to wonder how these subtests cluster within and across domains, and across developmental epochs. As noted earlier, there were at least two efforts to examine the factor structure of the 1998 version of the NEPSY (Mosconi et al., 2008; Stinnett et al., 2002), with mixed support being provided. Stinnett and colleagues (2002) conducted an exploratory principal axis factor analysis using the correlation matrix for the 5- to 12-year-old children from the standardization sample (n = 800) and found that it yielded a one-factor solution—a language/comprehension factor—that accounted for only 24.9% of the variance. Results also indicated that numerous subtests cross-loaded on multiple factors in the two-, three-, and four-factor solutions, and that the same 11 to 12 subtests loaded on the first factor in each of these models. These findings suggested that the NEPSY five-domain model was not supported, but Stinnett and colleagues (2002) suggested that confirmatory factor analysis would provide more convincing evidence regarding the test's structural validity.

In that regard, Mosconi and colleagues (2008), using the standardization sample from the 1998 version of the NEPSY, conducted confirmatory factor analyses for ages 5 through 12, as well as for the younger (5 to 8 years, n = 400) and older (9 to 12 years, n = 400) age bands. These latter analyses were important for exploring possible differences in test structure at different developmental epochs. Using four standard fit indices, results indicated that a five-factor model was less than adequate for the entire sample, and it produced negative error variances for the younger and older age groups, making the solutions for the two subgroups statistically inadmissible. A four-factor model without the Executive Function/Attention domain subtests produced satisfactory fit statistics for the entire sample and the younger group, but did not fit the data as well for the older group. In contrast to Stinnett and colleagues' (2002) findings, a one-factor model did not fit well for the full sample. These results indicated that the structure of the 1998 NEPSY was not invariant across development, with the four-factor model best fitting the data for the younger age group and for the entire school-age sample.
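For readers less familiar with the "standard fit indices" referenced in these confirmatory analyses, the sketch below computes two of the most common ones, RMSEA and CFI, from chi-square statistics. The chi-square and degrees-of-freedom values are invented to show the arithmetic; they are not taken from Mosconi et al. (2008).

    # Two common CFA fit indices computed from chi-square statistics.
    # All chi-square and df values below are invented to illustrate the formulas.
    import math

    def rmsea(chi2: float, df: int, n: int) -> float:
        """Root mean square error of approximation (values <= ~.06 suggest good fit)."""
        return math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

    def cfi(chi2: float, df: int, chi2_null: float, df_null: int) -> float:
        """Comparative fit index (values >= ~.95 suggest good fit)."""
        d_model = max(chi2 - df, 0)
        d_null = max(chi2_null - df_null, d_model)
        return 1 - d_model / d_null if d_null > 0 else 1.0

    n = 800   # size of the school-age standardization sample discussed above
    print(f"RMSEA: {rmsea(chi2=265.0, df=98, n=n):.3f}")
    print(f"CFI:   {cfi(chi2=265.0, df=98, chi2_null=2400.0, df_null=120):.3f}")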

It is unfortunate that analyses of this kind were not conducted for the NEPSY-II, or at least were not presented in the NEPSY-II Clinical and Interpretive Manual, as such structural information would support data-reduction strategies for research and would provide some sense of linkage to the theoretical model that underlies the NEPSY-II. In support of the test developers' contentions, however, it is likely that different factor structures would be present across different clinical groups, much as is seen in the pattern of subtest intercorrelations between the clinical samples and the normative sample; this question will require ongoing examination.

Finally, with respect to construct validity, no data are provided regarding subtest specificity estimates. Subtest specificity estimates index the proportion of a subtest's variance that is both reliable and unique to that subtest. For any assessment tool where subtest-level interpretation is possible, subtests with low specificity (i.e., specificity below .25, or not exceeding the subtest's proportion of error variance; McGrew & Murphy, 1995) should not be interpreted as measuring a specific function. Given the strong emphasis on utilization and interpretation of the NEPSY-II at the subtest level, one would expect empirical data to have been provided on each subtest's assessment of unique variance, with specificity estimates examined across each age block encompassed by the NEPSY-II. This would have facilitated interpretation and utilization of the subtests in the way they were intended, but with empirical evidence as to their ability to measure a specific function.
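The specificity logic described here can be made explicit: a subtest's specificity is commonly estimated as its reliability minus the proportion of variance it shares with the other subtests (the squared multiple correlation), and it is considered interpretable when it is at least .25 and exceeds the subtest's error variance. The sketch below uses hypothetical values to show the computation.

    # Subtest specificity = reliability - variance shared with the other subtests.
    # Interpretable (per the criterion cited in the text) if specificity >= .25
    # and specificity > error variance (1 - reliability). Values are hypothetical.
    def specificity(reliability: float, r_squared_other: float):
        spec = reliability - r_squared_other
        error = 1.0 - reliability
        interpretable = spec >= 0.25 and spec > error
        return spec, error, interpretable

    for name, rel, r2 in [("Hypothetical subtest A", 0.88, 0.45),
                          ("Hypothetical subtest B", 0.72, 0.55)]:
        spec, err, ok = specificity(rel, r2)
        print(f"{name}: specificity={spec:.2f}, error={err:.2f}, interpretable={ok}")
    # A: .43 specific vs .12 error -> safe to interpret as a distinct function.
    # B: .17 specific vs .28 error -> should not be interpreted in isolation.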

Criterion-related validity was determined primarily through concurrent validity studies (i.e., examinations of the relationship of the NEPSY-II with other tests measuring similar constructs). The NEPSY-II was concurrently administered with a wide range of measures, including intellectual batteries (e.g., Wechsler Intelligence Scale for Children—Fourth Edition; WISC-IV); achievement batteries (e.g., Wechsler Individual Achievement Test—Second Edition; WIAT-II); specific neuropsychological measures (e.g., Delis-Kaplan Executive Function System; DKEFS); behavior rating scales (e.g., Devereux Scales of Mental Disorders); and adaptive behavior measures (e.g., Adaptive Behavior Assessment System—Second Edition; ABAS-II). Each concurrent administration of the NEPSY-II and another measure included a sufficient number of participants to yield relatively stable validity coefficients.
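"Relatively stable validity coefficients" is largely a matter of sample size: the confidence interval around a correlation narrows as n grows. The sketch below uses the standard Fisher z approximation with illustrative values, not figures from the NEPSY-II validity studies, to show how precision depends on the number of concurrently tested participants.

    # 95% confidence interval for a correlation via the Fisher z transformation.
    # The r and n values are illustrative, not taken from the NEPSY-II studies.
    import math

    def corr_ci(r: float, n: int, z_crit: float = 1.96):
        z = math.atanh(r)                     # Fisher z transform
        se = 1.0 / math.sqrt(n - 3)
        return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

    for n in (30, 60, 120):
        lo, hi = corr_ci(r=0.55, n=n)
        print(f"n={n:3d}: r = .55, 95% CI [{lo:.2f}, {hi:.2f}]")
    # The interval shrinks markedly with larger samples, which is why the size of
    # each concurrent-validity sample matters for the stability of the coefficients.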

As can be seen in Rapid Reference 5.4, the NEPSY-II correlated in a moderate to strong fashion with the intellectual and achievement batteries—the latter being particularly important for children referred for a variety of learning problems. When specific neurocognitive batteries were examined, the NEPSY-II subtests that were most similar to the constructs tapped by each battery generally correlated in a moderate to strong manner.

For example, the NEPSY-II Memory and Learning subtests correlated most highly with selected subtests from the Children's Memory Scale; the NEPSY-II Attention and Executive Function subtests correlated most highly with selected subtests from the DKEFS; and the NEPSY-II Language subtests correlated most highly with the Bracken Basic Concept Scale—Third Edition Receptive and Expressive scales. The NEPSY-II subtests did not correlate with the Children's Communication Checklist. Given the potential use of the NEPSY-II with children with emotional/behavioral disturbance and intellectual disabilities, correlations with several behavioral measures (e.g., Devereux Scales of Mental Disorders, Brown Attention-Deficit Disorder Scales, Adaptive Behavior Assessment System-II) also were examined. For the Devereux, the NEPSY-II Comprehension of Instructions and Affect Recognition subtests showed moderate negative correlations with the Autism and Conduct Disorder scales, respectively. NEPSY-II Affect Recognition also correlated with the Devereux Externalizing Composite score. For the Brown scales, the Focus Cluster correlated moderately and negatively with the NEPSY-II Inhibition-Switching Combined scaled score, reflecting declining inhibitory control with increasing ADHD symptoms. None of the NEPSY-II subtests correlated significantly with the Adaptive Behavior Assessment System-II, perhaps an indication of the NEPSY-II's lack of association with ecological, day-to-day behaviors. These findings support a convergent validity line of evidence for the NEPSY-II; however, it is important to note that these patterns of correlation may change depending on the age of the child, the presenting clinical condition, and the method of assessment (e.g., parent ratings, clinician ratings), and this will require additional examination as the NEPSY-II begins to be employed in a variety of clinical settings.

The special group studies conducted with the NEPSY-II are noteworthy in that the test developers employed 10 different clinical conditions (i.e., Reading Disability, Math Disability, Traumatic Brain Injury, Autism Spectrum Disorder, Attention-Deficit/Hyperactivity Disorder, Language Disorders, Intellectual Disability, Asperger's Disorder, Hearing Impairment, and Emotional Disturbance). Comparison groups were derived from the normative sample and matched on chronological age, gender, race, and parent education level. These studies hold promise for determining the differential sensitivity of the NEPSY-II to the neuropsychological profiles that can be manifested by specific disorders.

Findings from these special group studies generally support the clinical utility of the NEPSY-II in the assessment of children referred for different conditions and disorders. More specifically, the special groups typically differed from the typically developing comparison groups on variables where they would be expected to deviate, as well as on a wide range of other variables. For example, the special group study of children with ADHD showed significant differences on NEPSY-II subtests assessing attention and executive functions, verbal memory, and sensorimotor abilities. Similarly, children with language disorders showed significant differences on NEPSY-II subtests measuring language-related functions, and children with intellectual disabilities and autism spectrum disorders performed more poorly than the comparison groups on nearly all of the NEPSY-II subtests. The separate examination of the newly added Theory of Mind subtest in the Autism Spectrum Disorder versus control and Asperger's Disorder versus control comparisons also provided support for the use of this subtest with these types of clinical referrals.
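Group differences of the kind summarized above are usually expressed as effect sizes. A minimal sketch of that comparison, using a pooled-SD Cohen's d with invented scores for a hypothetical special group and its matched comparison group, follows.

    # Cohen's d (pooled SD) for a special group vs. a matched comparison group.
    # The scaled scores below are fabricated; they are not NEPSY-II study data.
    from statistics import mean, stdev

    special  = [6, 7, 5, 8, 6, 7, 9, 6, 5, 7]        # hypothetical clinical group
    controls = [10, 11, 9, 12, 10, 8, 11, 10, 9, 10]  # hypothetical matched controls

    def cohens_d(a, b):
        sa, sb = stdev(a), stdev(b)
        pooled = (((len(a) - 1) * sa**2 + (len(b) - 1) * sb**2) / (len(a) + len(b) - 2)) ** 0.5
        return (mean(a) - mean(b)) / pooled

    print(f"Cohen's d = {cohens_d(special, controls):.2f}")
    # Large negative values indicate the special group scored well below its
    # matched controls; the small group sizes used in these studies (10 to 55)
    # make such estimates imprecise, a concern raised in the following paragraph.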

While the test developers should be commended on the inclusion of the various clinical disorders and conditions, and the findings generally support the separation of these groups from the typical group, there are several concerns with respect to these clinical studies that require mention. First, many of the studies employed small sample sizes, with group sizes ranging from 10 (Traumatic Brain Injury) to 55 (ADHD). Second, although inclusion and exclusion criteria for participation in these studies are provided in Appendix F of the