The use of multiple measures to identify gifted students has been the most commonly reported method of gifted identification used in schools since at least 2008 (National Association for Gifted Children, 2009). Often, these multiple measures are combined into an identification matrix in order to simplify the identification process for the teachers and administrators involved. An identification matrix is an organizational tool for recording and collating identification data from a variety of different measures (McCabe, 1978). These matrices can be constructed to combine the results from many different types of measures, including test scores, grades, and nominations, across a variety of methods. However, how those multiple measures are selected, combined, and used may result in differences in outcomes between student populations (Lee et al., 2024; Moon, 2017). Even though the same multiple identification measures may be used for all students within a district for the sake of equality (equal treatment), differences in how the combined instruments actually function could result in inequities, especially for already underrepresented populations (Ford et al., 2020). These differences in identification outcomes could exacerbate historical and ongoing systemic injustices in the enrollment of underrepresented groups in gifted and talented education (List & Dykeman, 2021; Peters, 2022), even as the schools using these multiple-measure identification procedures believe they are improving outcomes for the diverse students they serve. While there have been explorations of individual instruments and decision rules for their combination (Lakin, 2018; McBee et al., 2014; Pereira, 2021; Peters & Gentry, 2013), the potential differential impacts of the construction and use of matrices on culturally, linguistically, and economically diverse (CLED) students have only begun to be explored in research on gifted identification (Peters et al., 2025). Our study adds to the body of knowledge in this new and growing area of research through an exploration of the psychometric properties of one district's in-use multiple-measure identification matrix.
Literature Review
In both quantitative and qualitative research fields, there is general agreement that using multiple measurements improves the reliability and validity of the measured variable (Whitley & Kite, 2013). From construct validity to triangulation, multiple measures are used across the full range of research methodologies to improve the chances of accurate and trustworthy findings (FairTest, 2007). As Worrell (2009, p. 243) stated, “outstanding accomplishments by children and adults are multivariate in nature and require multivariate explanations.” The most commonly reported method of gifted identification used in schools is the use of multiple measures (National Association for Gifted Children, 2009). These multiple measurements can differ both in criteria (e.g., ability, creativity, and leadership) and in mode (e.g., observation, performance, and portfolio). Using multiple measures that differ in both criteria and mode is a long-standing method for examining the construct validity of a multifaceted measurement, such as giftedness (Campbell & Fiske, 1959).
There are many reasons for using multiple measures in gifted identification (Rinn et al., 2020). In some states, legislators have passed laws requiring multiple-criteria identification processes after lobbying by gifted researchers, educators, and parents (Krisel & Brown, 1997). In other states, the use of multiple criteria was imposed through legal mandate (Lohman & Renzulli, 2007; Romey, 2006). For example, the Alabama Department of Education entered into a consent decree with the federal Office for Civil Rights in 1999 to adopt a multiple-criteria approach to gifted identification (Romey, 2006). In 2007, the Wisconsin Department of Public Instruction was required by a state circuit judge to create specific rules for its school districts to follow when using multiple measures to identify gifted children (Lohman & Renzulli, 2007).
Others moved to a multiple-criteria and/or multiple-mode identification process with the belief it would be more inclusive in identifying CLED students (National Association for Gifted Children, 2019). CLED students are often underserved in school-based programs, at least in part due to being underidentified for those programs (Long et al., 2023; Peters, 2022). In a systematic literature review, Mun et al. (2020) found that almost half of the reviewed articles provided recommendations for the use of multiple measures to increase the identification rates of CLED students. After the implementation of a new multiple-criteria rule following legislation requiring multiple-criteria identification for gifted students in the state of Georgia in 1995, Krisel and Brown (1997) found more students from underrepresented populations were being identified. In the decade following the implementation of Georgia's multiple-criteria rule, the percentages of traditionally underrepresented students identified in Georgia through the use of multiple criteria continued to increase dramatically, with Black students increasing in identification by 206% and Latinx students by 570% (Stephens, 2009).
Difficulties Identifying Gifted CLED Students
The programming standard for assessment, outcome 2.3.1, from the National Association for Gifted Children (2019) states that, “educators select and use equitable approaches and assessments that minimize bias for referring and identifying students with gifts and talents” (p. 2). Some researchers believe adding measures that better represent the diverse talents and experiences of CLED students will improve our ability to identify those students (Joseph & Ford, 2006). However, instruments may also be selected for use in a multiple-measure system based on their face validity, appearing to capture a broader, unbiased picture of CLED learners than standardized ability/achievement tests while actually introducing bias and lowering the reliability of the overall matrix (McBee, 2006). Some of the instruments suggested for use, at least partially due to a belief that they would be better at identifying CLED students, include teacher rating/behavioral scales, native-language and/or nonverbal instruments, and other nontraditional alternative assessments (Joseph & Ford, 2006).
Teacher ratings are often used as one component in gifted identification (Carman, 2013). Lohman and Renzulli (2007) noted that adding measures such as teacher ratings and behavioral checklists to the ability and achievement tests already in use for gifted identification could help increase the diversity of the population of identified students. However, recent research has shone a harsh light on the quality, reliability, and validity of teacher rating scales and other teacher nomination processes (Hodges et al., 2018). Teacher ratings have been strongly linked to the individual teacher performing the rating (McCoach et al., 2024) and the grade level of the students being assessed (Marsili & Pellegrini, 2022). Additionally, their use as part of the gifted identification process has been viewed as inequitable for Black students (Britten, 2021; Ford, 2010) and as a potential opening for bias in the identification process (McBee, 2006).
Tests designed specifically for native Spanish speakers, such as the Logramos achievement test (Riverside Insights, 2019), or nonverbal tests, such as the Naglieri Nonverbal Ability Test (NNAT; Naglieri, 1997), are often suggested as a way to reduce the verbal load for emergent bilingual (EB) students (Abbott & McQuarrie, 2015; Lakin, 2010) and to be more culture-fair (Naglieri, 2008; Naglieri & Ford, 2003). The Logramos was developed to align with the Iowa Assessments: it was nationally normed to cover “the many diverse characteristics” of the Spanish-speaking bilingual/emergent bilingual student population (Aparicio, n.d.), “parallels the scope and sequence” of the Iowa Assessments (Riverside Insights, 2019, p. 1), and is expected to function similarly to the Iowa Assessments for Spanish-speaking students (Logramos Third Edition, 2014). Nonverbal and other nontraditional identification methods have fared less well in identifying CLED students. Research exploring the effectiveness of nonverbal instruments at identifying gifted CLED students has occasionally found positive results (Naglieri et al., 2004; Naglieri & Ford, 2003; Naglieri & Ronning, 2000) but more often has not found that nonverbal tests close the scoring gap between CLED and non-CLED students (Carman et al., 2020; Giessman et al., 2013; Hodges et al., 2018; Lohman et al., 2008; Lohman & Lakin, 2021).
Identification for gifted programs often involves the use of one or more measures that research has found to identify students from underrepresented and overrepresented groups at different rates (Carman et al., 2020; Lee et al., 2024). It is possible the use of such measures could at least partially explain the continuing underrepresentation of CLED students in gifted identification (Ford et al., 2020). Latinx students (Godinez-Cedillo, 2022; Lewis et al., 2007; Peters et al., 2024), Black students (Ford et al., 2020; Peters et al., 2024; Ricciardi et al., 2020), female students (Petersen, 2013; Ricciardi et al., 2020), twice-exceptional students (Jung & Hay, 2018; Peters & Johnson, 2024), multicultural and low-income students (Lee et al., 2022; Ricciardi et al., 2020), and emergent bilingual students (Abedi, 2002; Peters & Johnson, 2024; Ricciardi et al., 2020) are only a few of the many CLED student groups that have been, and continue to be, underrepresented in gifted identification and underserved in gifted programming (List & Dykeman, 2021).
These instruments may differ in their identification ability not because of the instruments themselves but because of historical and ongoing systemic inequalities, which can negatively affect students’ opportunity and ability to learn and thereby produce group differences on the achievement, ability, and related tests often used in gifted identification (Erwin & Worrell, 2012). Long et al. (2023) recently explored potential competing explanations for underrepresentation in gifted identification among various CLED student groups and found a majority of identification disparities could be traced to differences in students’ early academic abilities, suggesting differences in early opportunities to learn (OTL) may drive the underrepresentation we persistently see. While an extended exploration of the causes of these inequities is beyond the scope of this article, we point to Peters (2022) as an excellent review of many of the factors involved in these persistent issues.
Although using the same multiple measures for every student promotes equality, or equal treatment, it does not necessarily improve equity, or equality of outcomes (Ferlazzo, 2023). After all, if students’ starting lines for a race are in different locations, we should be unsurprised when they achieve different race times, even though they were all measured at the same finish line (Long, 2022). When making high-stakes, test-based decisions, as is the case in gifted identification, it is important to use methods that are equivalently valid for all students, no matter where their starting line. Using multiple measures is generally thought to improve the reliability, validity, and fairness of the gifted identification process and, thus, to increase identification rates for CLED students, but this outcome has not been consistently supported in the literature (Callahan et al., 2012; McBee et al., 2014; Plucker & Callahan, 2014).
Matrices in the Gifted Identification Process
One tool for making decisions using multiple criteria/modes in the gifted identification process is an identification matrix. One of the earliest reports of matrix use in gifted identification comes from a 1978 report by McCabe, who suggested the use of a matrix to record and organize the data created by using multiple measures to identify gifted children. He suggested using a matrix “could help make a broad, comprehensive definition of giftedness a definition which could be practical and workable as well” (McCabe, 1978, p. 6). Almost 30 years later, Lohman and Renzulli (2007, p. 1) remarked that “it is common practice to collect many different kinds of information about students, arrange this information in a matrix, and then combine it in some way to decide which children to admit to the G&T program.” Many districts use an identification matrix to combine multiple measures because it simplifies the identification process for the teachers and administrators involved. However, while the use of multiple identification measures is common across districts, the methods for combining those measures into a single matrix assessment and the instruments chosen for inclusion are not.
There are many ways to build and use an identification matrix, and Moon (2017) discussed two common usage scenarios. The first scenario involves a multistage identification process in which a student proceeds through a series of screenings on a variety of instruments during the identification process. In this scenario, the student must meet or exceed one or more cutoffs to proceed to the next screening stage, where they must meet or exceed at least one additional screening cutoff before being identified as gifted. This more linear screening model has at least one first-stage instrument that the student must pass before proceeding to the rest of the screening instruments. The second scenario envisions the student being assessed on multiple instruments and having the scores from all the instruments considered concurrently using a point-based system with one cutoff for gifted identification. In this model, student achievement on every instrument contributes to the final decision (Moon, 2017).
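To make the contrast between these two combination rules concrete, the following minimal sketch implements both; the measure names and cutoffs are hypothetical illustrations, not values from any instrument or district discussed in this article.

    from typing import Dict, List, Tuple

    def multistage_identify(scores: Dict[str, float],
                            stages: List[Tuple[str, float]]) -> bool:
        """Scenario 1: linear screening. The student must meet or exceed the
        cutoff at each stage to proceed; failing any stage ends the process."""
        for measure, cutoff in stages:
            if scores[measure] < cutoff:
                return False
        return True

    def point_based_identify(points: Dict[str, int], total_cutoff: int) -> bool:
        """Scenario 2: every instrument contributes points, and all scores are
        considered concurrently against a single total cutoff."""
        return sum(points.values()) >= total_cutoff

    # A hypothetical student who misses a first-stage gate under Scenario 1
    # can still qualify under Scenario 2, where every instrument contributes.
    scores = {"ability": 88.0, "achievement": 96.0}
    print(multistage_identify(scores, [("ability", 90.0), ("achievement", 90.0)]))  # False
    print(point_based_identify({"ability": 10, "achievement": 20}, 25))             # True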
There have been both positive and negative findings in research examining the use of identification matrices to promote equity. Pearson (2001) explored the effects of implementing a multicriteria identification matrix on the proportional identification of Black and Latinx students in Alabama. After implementation of the multicriteria matrix, the identification rate for both Black and Latinx students increased by a small percentage in its first year (Pearson, 2001). Romey (2006) later extended Pearson's study of the effects of the implementation in Alabama and found certain instruments were used more frequently as part of the identification matrix in districts that were more successful in reaching proportional identification for culturally diverse students. However, Romey also found other district-level factors related to an increased likelihood of identifying a proportionally representative student group, such as SES, were not included in the matrix calculations. Additionally, Romey called into question the use of matrix components that did not have well-established reliability and validity, as that could affect which students are identified. Lidz and Macrine (2001) proposed adding a dynamic assessment to a multicriteria screening battery in a district that had previously been identifying less than 1% of its CLED students as gifted. The addition of the dynamic assessment, when used with the rest of the criteria, increased the district's identification rate to 5%, resulting in an identified pool of students that more closely matched those groups’ proportions of representation in the school population (Lidz & Macrine, 2001). In that district, creating and implementing the matrix resulted in an over 1000% increase in the rate of CLED identification.
Because multiple types of matrices are used and not all matrices are built the same, the use of an identification matrix could cause differences in outcomes depending upon which matrix is used, how the matrix is created, and the makeup of its component parts (Peters et al., 2025). Research outside the field of gifted education has described the various effects of combining multiple criteria measures into a unidimensional score. Combining multiple criteria into one matrix-based score may be a better predictor of student giftedness than using those multiple criteria alone. Walters (2011) found that a summed score composed of multiple criteria was equivalent to a weighted score composed of the same criteria in predicting recidivism, and that either method (summed or weighted) for combining the multiple criteria resulted in better prediction than using each criterion individually.
However, the method of combining multiple criteria into one score made a significant difference in outcomes in other studies (Timbie & Normand, 2008; Wilson, 2008). Timbie and Normand (2008) found significant differences in the classification of hospitals’ value depending on the combination method used to create the value variable. Similarly, Wilson (2008) explored multiple methods for combining student assessment results and found a large variation in the number of students classified as failing depending upon the combination method used.
Finally, even the most effective method for combining multiple criteria may not be useful if the method is too complicated for practitioners to use. Teixeira-Pinto and Normand (2008) examined the effects of multiple methods of combining multiple best practice indicators to classify hospitals’ performance into two categories (superior/not superior). They found the best fitting method of classification was a complex statistical model-based score rather than a classification method based on simple average measure scores. However, they cautioned that using a model-based score made it difficult for nonstatisticians to understand and use the score.
One potential cause of differences in identification outcomes between demographic groups could be that some of the matrices used for gifted identification are not psychometrically/theoretically sound (Callahan et al., 2013). While the assessments selected to be part of a multicriteria matrix may individually be unbiased, bias could be generated by the method used to select and/or combine those assessments and by the timing of when those assessments are given. Using definitions of giftedness alone to decide which tests are included, how they are combined, and to what degree those tests are represented, without evaluating the psychometric properties of their combination, may result in differences in the abilities of the students who are identified and in how many students are identified (Callahan et al., 2013; McBee et al., 2014). Plucker and Callahan (2014) stated, “simply using more measures is not as important as how those measures are actually used” (p. 395). Including more “authentic” measures to broaden the criteria/method without regard to their reliability or validity will still lower the overall reliability and validity of the matrix (Lohman, 2012), as will oversampling from any one domain because several of the included instruments measure it (Callahan et al., 2013; Lohman, 2012). The method selected for combining scores across assessments can have large effects on both who is selected and how similar (or different) the selected individuals are (McBee et al., 2014). Combining multiple instruments that measure different constructs into one identification matrix increases measurement error, which increases the chance that students are identified as gifted when they should not be or are not identified when they should be (Moon, 2017).
The timing of when students are assessed for inclusion in gifted services can also affect which students are identified (Carman et al., 2018; Hodges et al., 2018). Testing for gifted services tends to be conducted in early elementary school rather than middle school or later (Hodges et al., 2018; Sternberg & Davidson, 2005). Researchers have found significant differences in identification rates among CLED students based on the grade in which they were identified (Hodges et al., 2018; Lohman & Korb, 2006). Marsili and Pellegrini’s (2022) systematic review and meta-analysis across 29 studies found school level to have a significant moderating effect on the relationship between traditional identification measures and nominations, with stronger relationships in elementary school than in middle school. Lohman and Korb (2006) noted that students’ scores on identification instruments may change over time due to multiple factors, including maturation, quality of instruction, and other personal and social factors. When decisions about which instruments to include, how to combine those instruments, and when to assess with them are arbitrary, the result can be undesirable consequences, such as lowered matrix reliability and validity (Lohman & Renzulli, 2007).
Measurement Invariance
The 2014 Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association & National Council on Measurement in Education, 2014) define measurement bias as nonconstruct-related characteristics of a test that may differentially impact the scores of some subgroups. One way researchers assess the impact of nonconstruct-related characteristics on instrument performance is by testing measurement invariance. Measurement invariance examines the extent to which an instrument or subscale performs equivalently (i.e., has equal meaning and interpretation) across examinee groups (French & Finch, 2006). There are multiple methods for determining measurement invariance, including multisample confirmatory factor analysis, which compares the structure of an instrument/construct across groups. The comparison iterates through increasingly restrictive models until a model shows a significant decrease in fit; the more equivalent the model structure is found to be between the tested groups, the greater the degree of measurement invariance the researcher can assume (French & Finch, 2006).
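In standard CFA notation (ours, not taken from the studies cited here), the nested comparison can be sketched as follows: for indicator i observed in group g,

    x_{ig} = \tau_{ig} + \lambda_{ig} \eta_g + \varepsilon_{ig},

where \tau is an intercept, \lambda a factor loading, and \eta the latent factor. The configural (base) model estimates the loadings freely in every group; metric invariance imposes \lambda_{ig} = \lambda_i for all groups; and each added constraint is evaluated with a chi-square difference test,

    \Delta\chi^2 = \chi^2_{constrained} - \chi^2_{free},  \Delta df = df_{constrained} - df_{free},

where a nonsignificant \Delta\chi^2 supports invariance of the constrained parameter.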
Measurement invariance is an important part of validating an instrument for group comparisons, as scores on noninvariant measures may be due to extraneous variables and should not be used for group comparisons due to the increased chance of inappropriate conclusions (Warne, 2023). Peters and Gentry (2013) found the HOPE Scale to be measurement noninvariant for gender and income groups, even though there was no differential item functioning (DIF) found due to gender or income. As a result, they recommended the instrument should not be used to compare across gender and income groups, but rather within those groups instead (Peters & Gentry, 2013). In exploring the same scale for measurement invariance between English language learner (ELL) and English proficient (EP) students, Pereira (2021) also found significant differences in the underlying factor structure for those two demographic groups. He too recommends the HOPE Scale not be used to compare scores between ELL and EP students (Pereira, 2021). Warne (2023) explored four versions of the Wechsler tests for measurement invariance in four developing African nations as compared to American measurement models. While some of the samples did reach strict measurement invariance, other samples did not, leading to the conclusion that, while some American test batteries can produce validly interpretable scores using an international comparison group, other instruments/samples may not support comparisons across national groups.
While it is possible the instruments themselves may be biased, using instruments with little to no measured DIF within an identification matrix can still result in differences in which students are identified, potentially due to systemic inequities experienced by minoritized groups, including lack of OTL (Erwin & Worrell, 2012; Long et al., 2023). Instruments may also be applied equally yet produce inequitable (measurement noninvariant) results when group score differences arise from sources outside the latent construct the instrument measures. Demographic groups can differ in passing percentages or average scores even when no psychometric bias is found (Jonson & Geisinger, 2022). Worrell (2009) noted that mean score differences between groups do not necessarily indicate bias and that, in fact, it would be surprising not to find such score differences given the known gaps in school achievement between demographic groups. As noted in Peters (2022), CLED students face many inequities that could produce real group differences that are environmental and societal in origin. The American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014) note that, while group differences in testing outcomes should trigger additional scrutiny of an instrument for test bias, such differences could reflect real differences between groups on the construct being measured, or a combination of real group differences and test bias. Because one may not be able to rule out all sources of bias, researchers must be careful in their interpretations and continue working to improve test design to help eliminate potential sources of bias while maintaining instrument validity (American Educational Research Association, American Psychological Association & National Council on Measurement in Education, 2014).
We were unable to find any previous articles on the measurement invariance of an entire gifted identification matrix, although there were several studies involving measurement invariance on individual instruments which may be included in some matrices (Lee et al., 2022; Pereira, 2021; Peters & Gentry, 2013; Warne, 2011, 2023) and also research on the effects of combinations of multiple instruments that did not explore measurement invariance (Lakin, 2018; Lee et al., 2024; McBee et al., 2014; Peters et al., 2025). As there has not yet been an exploration of measurement invariance in an identification matrix in the literature, the purpose of our study is to examine the psychometric properties of a real-world, in-use multiple-measure identification matrix in order to determine if there are differences in student identification outcomes by demographic/grade level variables and if those differences in identification outcome might be attributable to measurement noninvariance between different demographic/grade level groups.
Research Questions
Before examining an instrument's measurement invariance across groups within a dataset, it is useful to know if differences in instrument outcome between groups exist. Our first three research questions examine these differences. RQ1: Are there differences in matrix scores/outcome by grade level? RQ2: Are there differences in matrix scores/outcomes by demographic (gender, race/ethnicity, SES, EB, SPED)? RQ3: Are there differences in matrix scores/outcomes by demographic within grade level?
Once we have examined the differences in instrument outcome, we then move to examining the matrix invariance. These parallel research questions are: RQ4: Are there differences in matrix performance (noninvariance) by grade level? RQ5: Are there differences in matrix performance (noninvariance) by demographic (gender, race/ethnicity, SES, EB, SPED)? RQ6: Are there differences in matrix performance (noninvariance) by demographic within grade level?
Method
Participants
All research was conducted under institutional review board approval. Participants were 22,280 kindergarten and fifth grade students (see Table 1 for the demographic characteristics of the sample).
Demographic Characteristics of Sample.
Note. EB=emergent bilingual.
SES level was measured using student FRPL status, which includes parental income and number of family members, in accordance with state guidelines. While FRPL is not a perfect measure of student socioeconomic status, it is a close approximation and has value for use beyond the calculation of household income alone (Domina et al., 2018). EB and SPED status were based on designation by the participating school district.
Matrix Instruments
The identification matrix developed by the participating district is a two-page document that details how to combine points derived from student scores on an achievement test (Iowa Assessments [IOWA]/Logramos), an ability test (Cognitive Abilities Test [CogAT7]), report card, and teacher recommendation into a total matrix score. Students who earn above a set total matrix point cutoff are automatically qualified for the gifted program, while students earning at least 90% of the cutoff score will also qualify if at least 16% of their points come from the CogAT7 and at least 32% of their points come from the IOWA/Logramos.
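A minimal sketch of this decision rule follows; it assumes "of their points" refers to shares of the student's total matrix points, and it leaves the cutoff as a parameter because the district's cutoff value is not reported here.

    def qualifies(total_points: float, cogat_points: float,
                  iowa_points: float, cutoff: float) -> bool:
        """District qualification rule as described above; `cutoff` is the
        district's total matrix point cutoff (value not reported here)."""
        if total_points > cutoff:
            return True  # above the cutoff: automatic qualification
        # Near-miss rule: at least 90% of the cutoff, with minimum shares of
        # the student's points coming from the CogAT7 and the IOWA/Logramos.
        return (total_points >= 0.90 * cutoff
                and cogat_points >= 0.16 * total_points
                and iowa_points >= 0.32 * total_points)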
Iowa Assessments/Logramos
The Iowa Assessments (IOWA; Dunbar & Welch, 2022) are a set of multiple-level achievement tests normed for grades K-12. They can be administered online or on paper and take between 2 and 4 hours depending upon the level selected. The Core Battery contains multiple subtests, including English, language arts, and math, and comprises approximately 145 to 200 questions depending on the level. Reliabilities for the Core subtests are mostly in the .80s and .90s. Concurrent validity of the assessment was examined through a comparison with scores on the CogAT with the same standardization sample. Students in the participating district were administered Core Battery level 5 (kindergarten) or level 11 (fifth grade) midway through the school year. The core test produced both an English/Language Arts (ELA) and a Math score, both of which were included in the matrix calculations in the form of a national percentile rank (NPR).
The Logramos (Riverside Publishing, 2014) is an achievement measure specifically designed for native Spanish speakers. It “parallels the scope and sequence” of the IOWA while using Spanish vocabulary that is commonly used in Spanish-speaking countries (Riverside Insights, 2019, p. 1; Riverside Publishing Company, 2012). It is available for use in grades K-8. The Core Battery, composed of the same subtests as the IOWA, was given to native Spanish-speaking students in the participating district as an alternative to the IOWA. The ELA and Math scores from the Logramos were used in the matrix calculations in the form of an NPR.
NPR scores from the IOWA/Logramos ELA and Math portions contributed separately to the total matrix score, so a student could earn points from both ELA and Math. Scores on the ELA and Math tests contributed on the same scale: scoring at the 70th to 79th percentile earns seven points, 80th to 84th earns 10 points, 85th to 89th earns 13 points, 90th to 94th earns 16 points, and 95th to 99th earns 20 points. If a student earns in the highest percentile range on both exams, the IOWA/Logramos could contribute a maximum of 40 points toward their total matrix score.
Cognitive Abilities Test
The CogAT7 (Lohman, 2011) is an ability test that measures the Verbal, Nonverbal, and Quantitative domains in a group setting. The participating district only administers the nonverbal portion as part of their universal screening. The CogAT produces score reports measured in multiple ways, including a Standard Age Score (SAS), with a mean of 100 and a standard deviation of 16. Split-half reliabilities for the CogAT7 across all grade levels and domains are reported at .80 and higher, with reliabilities increasing as student grade level increases (Warne, 2015). Concurrent validity and confirmatory factor analysis provided evidence of instrument validity. In the participating district, earning an SAS score of 100 and above contributes points towards the total matrix score, with a score between 100 and 103 adding five points, 104 and 108 adding 10 points, 109 and 113 adding 15 points, 114 and 120 adding 20 points, 121 and 125 adding 25 points, and 126 and 130 adding 30 points.
Report Card
Additional matrix points were contributed from the students’ report cards. For kindergarten students, scores in the core content areas (Language Arts, Math, Science, and Social Studies) from their most recent nine-weeks report card were added together, and the resulting score was assessed on a set scale based on the range of total points achieved. For the fifth grade students, all grades from students’ prior-year final report card were averaged to determine each student's score, which was then compared against several ranges to determine how many matrix points to award. For both grades, students who earned an average score below 80% did not earn any points; those between 80% and 84% earned five points, between 85% and 89% earned 10 points, between 90% and 94% earned 15 points, and those in the 95% to 100% range earned 20 points toward their total matrix score.
Teacher Recommendation
Students’ primary teachers were asked to fill out a district-created teacher recommendation form adapted from the Scales for Rating the Behavioral Characteristics of Superior Students (SRBCSS; Renzulli et al., 2002) for each student in kindergarten and fifth grade as part of the district's universal screening process. This form assesses students on their general intellectual, creative, and leadership abilities, with six to seven questions per ability category. Teachers rated each student from Rarely (1) to Consistently most of the time (5) for each question. Point totals were added up for each recommendation, and the matrix points were awarded based on the total number of recommendation points earned, with scores between 60 and 69 contributing four points, 70 and 79 contributing six points, 80 and 89 contributing eight points, and 90 and 100 contributing 10 points toward their total matrix score.
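Pulling the four point tables above together, the following minimal sketch computes a total matrix score. Band boundaries are copied from the text; treatment of values outside the stated bands (e.g., a CogAT7 SAS above 130 or a recommendation total below 60) is not specified in the matrix description, so the sketch awards zero points there as an assumption.

    def band_points(value, bands):
        """Return the points for the first band whose inclusive range contains
        value; values outside every stated band earn zero points (assumption)."""
        for low, high, points in bands:
            if low <= value <= high:
                return points
        return 0

    NPR_BANDS = [(70, 79, 7), (80, 84, 10), (85, 89, 13), (90, 94, 16), (95, 99, 20)]
    SAS_BANDS = [(100, 103, 5), (104, 108, 10), (109, 113, 15),
                 (114, 120, 20), (121, 125, 25), (126, 130, 30)]
    REPORT_BANDS = [(80, 84, 5), (85, 89, 10), (90, 94, 15), (95, 100, 20)]
    REC_BANDS = [(60, 69, 4), (70, 79, 6), (80, 89, 8), (90, 100, 10)]

    def total_matrix_score(ela_npr, math_npr, cogat_sas, report_avg, rec_total):
        """ELA and Math NPRs contribute separately (up to 40 points combined)."""
        return (band_points(ela_npr, NPR_BANDS)
                + band_points(math_npr, NPR_BANDS)
                + band_points(cogat_sas, SAS_BANDS)
                + band_points(report_avg, REPORT_BANDS)
                + band_points(rec_total, REC_BANDS))

Under these tables, the maximum possible total is 100 points: 40 from the IOWA/Logramos, 30 from the CogAT7, 20 from the report card, and 10 from the teacher recommendation (before any additional points the district awards, as discussed under Limitations and Implications).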
Procedures/Data Collection
All participants were administered the identification matrix instruments during the district's regular annual universal screening process for grades K and 5. All students were administered the CogAT7 and either the IOWA or the Logramos, based upon their EB status. Additional measures for the matrix included teacher recommendations and student report cards. Only students whose files included scores on all matrix measures were included in this analysis.
The participating district provided archival data from one full academic year of assessment. Data provided by the district included de-identified student demographic data (including age, grade, EB status, SPED status, FRPL status, federal aggregated ethnicity code, and gender), along with both scores and matrix points from the teacher recommendations, IOWA/Logramos ELA and Math, report card, and CogAT7.
Coding
We created dummy codes for the nominal-level variables, including grade level, gender, ethnicity, FRPL status, EB status, and SPED status. Due to the low numbers of students from Pacific Islander, Native American, and multi-ethnic backgrounds, we grouped these students into an “Other” category for analyses involving race/ethnicity. We coded identification outcome as 0/1, where nonidentified students score 0 and identified students score 1.
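As a small illustration of this coding scheme (column names and values are hypothetical, not the district's data dictionary):

    import pandas as pd

    # Hypothetical records; column names and values are illustrative only.
    df = pd.DataFrame({
        "grade": ["K", "5", "K"],
        "ethnicity": ["Latinx", "Pacific Islander", "White"],
        "identified": [False, True, False],
    })

    # Collapse low-n race/ethnicity groups into "Other" before dummy coding.
    df["ethnicity"] = df["ethnicity"].replace(
        {"Pacific Islander": "Other", "Native American": "Other",
         "Multi-ethnic": "Other"})

    coded = pd.get_dummies(df, columns=["grade", "ethnicity"], drop_first=True)
    coded["identified"] = df["identified"].astype(int)  # 0 = not identified, 1 = identified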
Data Analyses
Our first three research questions examined differences in scores/outcomes by both grade level and demographics using a series of independent-samples t-tests, two-way contingency table tests, and analyses of variance (ANOVAs) in SPSS, depending upon the level of measurement and the number of groups in the independent and dependent variables. Type I error inflation was controlled using Holm's (1979) sequential Bonferroni procedure for all families of t-tests and Games-Howell post hoc tests (Games & Howell, 1976) following ANOVAs.
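The analyses were run in SPSS; the sketch below illustrates the same test battery on simulated data using scipy/statsmodels equivalents (Welch's ANOVA and Games-Howell post hocs, used for the race/ethnicity comparisons reported later, are available in other packages, such as pingouin).

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(42)
    group_a = rng.normal(55, 12, 500)  # simulated total matrix scores, group A
    group_b = rng.normal(52, 10, 500)  # simulated total matrix scores, group B

    # Independent-samples t-test on the total matrix scores.
    t_stat, p_t = stats.ttest_ind(group_a, group_b)

    # Two-way contingency table test on the dichotomous identification outcome.
    table = np.array([[40, 460],   # group A: identified vs. not identified
                      [22, 478]])  # group B: identified vs. not identified
    chi2_stat, p_chi, dof, expected = stats.chi2_contingency(table)

    # Holm's sequential Bonferroni across the family of tests.
    reject, p_adjusted, _, _ = multipletests([p_t, p_chi], alpha=0.05, method="holm")
    print(reject, p_adjusted)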
For the last three research questions, matrix performance (invariance) across different groups (by grade and/or demographics) was explored through multiple hierarchical multigroup invariance analyses in Mplus. A one-factor confirmatory factor analysis (CFA) was conducted to examine the model fit of the matrix for all participants. Acceptable or good model fit would indicate that the matrix fit the sample when all participants were considered together. Multigroup analyses were then conducted to determine if the matrix items fit the latent factor differently by grade level, demographic factors, and demographic factors within grade level. In the base model, the items were freely estimated across groups. In subsequent models, each item loading was constrained to be equal across groups, one at a time, followed by a test of whether model fit significantly changed with the additional constraint. A statistically significant change in the chi-square statistic indicated a between-group difference in the item loading, and a nonsignificant change indicated the item loading was similar between the groups.
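The CFAs themselves were estimated in Mplus; the sketch below reproduces only the nested-model comparison logic, with hypothetical fit values.

    from scipy.stats import chi2

    def loading_is_invariant(chi2_free, df_free, chi2_constrained,
                             df_constrained, alpha=0.05):
        """Chi-square difference test for nested multigroup CFA models.
        A nonsignificant change in fit means the constrained loading is
        similar across groups (invariant); a significant change means not."""
        delta_chi2 = chi2_constrained - chi2_free
        delta_df = df_constrained - df_free
        p_value = chi2.sf(delta_chi2, delta_df)
        return p_value >= alpha

    # Hypothetical fit statistics: constraining one loading raises chi-square
    # by 11.5 for 1 df, a significant worsening (p < .001), so not invariant.
    print(loading_is_invariant(chi2_free=120.4, df_free=10,
                               chi2_constrained=131.9, df_constrained=11))  # False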
Positionality Statement
All three authors have been previously identified as gifted at some point in their K-12 education. The three authors identify as White, with one identifying as male and the two others identifying as female. While one author is an immigrant, all authors are native English speakers. All three authors have children in the US public education system, and one of the children has been identified as gifted while the other child is twice-exceptional. One author is an Educational Psychologist, one a Developmental Psychologist, and the third a Scholarship of Teaching and Learning researcher. Growing up in public school gifted education settings, one of the authors was educated in a quota-based gifted magnet program and was able to experience having diverse gifted classmates from elementary through high school, another of the authors was unable to participate in gifted programming until secondary school due to the use of quota-based gifted programming, and the third author did not enter gifted programming until secondary school because there was no gifted programming offered in their elementary school. With our past experiences in gifted public education and with gifted/2e children, the authors have a personal interest in making sure our gifted identification processes are as equitable as possible, so that all gifted children, including our own, have the opportunity to participate in gifted programs.
Results
Research Questions 1–3
The first three research questions all focused on differences in total matrix scores and identification outcome between student grade level and/or demographic characteristics in our sample. Table 2 presents the means and standard deviations on the total matrix score across all demographic groups in our sample.
RQ1: Are There Differences in Matrix Scores/Outcome by Grade Level?
Means and Standard Deviations of Overall Matrix Score by Demographic Group.
Note. CFI=comparative fit index; EB=emergent bilingual; FRPL=free and reduced price lunch; PI=Pacific Islander; RMSEA=root mean square error of approximation; SES=socioeconomic status; SPED=special education.
We conducted an independent-samples t-test comparing total matrix scores between kindergarten and fifth grade students; kindergarten students scored significantly higher than fifth grade students, with a significantly larger standard deviation.
RQ2: Are There Differences in Matrix Scores/Outcomes by Demographic Groups (Gender, Race/Ethnicity, SES, EB, SPED)?
We conducted multiple independent-samples t-tests comparing total matrix scores between demographic groups; all comparisons were statistically significant. Results are presented in the following table.
Independent-Samples t-Tests Between Student Demographic Groups.
Note. CI=confidence interval; EB=emergent bilingual.
We conducted a one-way ANOVA to determine if there were differences in total matrix scores across racial/ethnic groups. The data did not meet the assumption of homogeneity of variance, and there were large differences in sample sizes among racial/ethnic groups, so we opted to conduct the ANOVAs using the Welch statistic (Welch, 1951) with post hoc Games-Howell comparisons. The Welch statistic is indicated for use when the assumption of homogeneity of variance is violated, as it does not assume equality of population variances (Green & Salkind, 2016). The Games-Howell post hoc test, an extension of the Tukey test, corrects for family-wise Type I error inflation and is designed for use after conducting an ANOVA with unequal variances and differences in sample sizes among the groups (Shingala & Rajyaguru, 2015). The ANOVA was significant, indicating differences in total matrix scores across racial/ethnic groups.
We examined differences between all demographic groups (independent variables) and identification outcomes (dependent variable) using two-way contingency tables. As expected, all demographic groups displayed significant differences in identification outcomes.
RQ3: Are There Differences in Matrix Scores/Outcomes by Demographic within Grade Level?
Independent-Samples t-Tests Between Kindergarten Student Demographic Groups.
Note. CI=confidence interval; EB=emergent bilingual.
We conducted a one-way ANOVA to determine if there were differences in total matrix scores across racial/ethnic groups within each grade level. Neither the kindergarten nor the fifth grade group met the assumption of homogeneity of variance, and both grades exhibited large differences in sample sizes among racial/ethnic groups, so we opted to conduct the ANOVAs using the Welch statistic with post hoc Games-Howell comparisons. The ANOVA for the kindergarten group was significant.
We examined differences between all demographic groups (independent variables) and identification outcomes (dependent variable) using two-way contingency tables for the kindergarten group. As expected, all demographic groups displayed significant differences in identification outcomes.
Independent-Samples t-Tests Between Fifth Grade Student Demographic Groups.
Note. CI=confidence interval; EB=emergent bilingual.
We conducted a one-way ANOVA to determine if there were differences in total matrix scores across racial/ethnic groups for the fifth grade group. The ANOVA was significant.
We examined differences between all demographic groups (independent variables) and identification outcomes (dependent variable) using two-way contingency tables for the fifth grade group. The gender variable did not display significant differences in identification outcome.
Research Questions 4–6
A CFA was conducted to test the fit of a one-factor model to the matrix items. The fit statistics indicated acceptable fit.
RQ4: Are There Differences in Matrix Performance (Invariance) by Grade Level?
Model fit for the ELA score and Math score did not worsen when constrained to be equal between grade levels, which indicates those factor loadings were similar for fifth graders and kindergartners. See Table 6 for factor loadings, 95% confidence intervals (CIs), and associated statistics.
RQ5: Are There Differences in Matrix Performance (Invariance) by Demographic Groups (Gender, Race/Ethnicity, SES, EB, SPED)?
Analysis of Research Question 4: Matrix Invariance by Grade Level.
Factor loadings, 95% CIs, and associated statistics for each demographic comparison are presented in the following table.
Analysis of Research Question 5: Matrix Invariance by Demographics.
The instrument-demographic combinations with the highest change in chi-square values included (1) the ELA and Math measures for EB students, such that ELA factor loadings were higher for non-EB students and Math factor loadings were higher for EB students; (2) for SPED students, factor loadings for teacher recommendations were higher than for non-SPED students; (3) both the CogAT7 and report cards had higher factor loadings for Asian students than non-Asian students; and (4) ELA scores had higher factor loadings for girls and Math scores had higher factor loadings for boys.
RQ6: Are There Differences in Matrix Performance (Invariance) by Demographic within Grade Level?
Factor loadings, 95% CIs, and associated statistics for each demographic-within-grade-level comparison are presented in the following table.
Analysis of Research Question 6: Matrix Invariance by Demographics by Grade Level.
For fifth graders, all five measures had different factor loadings at the .001 significance level when all other items were constrained to be equal for the following demographic student groups: gender, Latinx, White, FRPL, and EB. For kindergarteners, no demographic group exhibited this pattern of results, although EB students had four of five measures demonstrating noninvariance. For kindergarteners, three demographic student groups (Black, Other, and SPED) had a single measure demonstrating noninvariance. For fifth graders, one demographic student group (Black) had no measures showing noninvariance. All other demographic groups had at least three measures showing noninvariance.
Summary of Results
A summary of all significant results for the three matrix invariance questions can be found in Table 9.
Significant Factor Loadings for Research Questions 4–6.
Discussion
Explanation of Findings
Our first three research questions (RQ1–3) explored differences in student total matrix scores and identification outcomes by student demographic/grade level. We found significant differences between all demographic and grade level groups for both total matrix scores and identification outcomes, which supports our inquiry into the measurement invariance of the overall matrix. For our first research question, we found kindergarten students scored significantly higher than fifth grade students on the total matrix score but had a significantly larger standard deviation. This finding is in line with Lohman and Korb (2006), who found that expected scores may decrease over time, at least in part due to regression to the mean, while variance should also decrease if the instruments are scaled using a Rasch (1960) model. There was also a significantly different identification outcome between groups, matching the significant difference in scores. Lohman and Korb (2006) state that “even for highly reliable test scores, approximately half of the students who score in the top 3% of the score distribution in 1 year will not fall in the top 3% of the distribution in the next year” (p. 478). However, the effect sizes for both findings were weak, which may indicate these results were influenced more by the large sample size than by grade level effects.
Our second research question explored differences in total matrix scores and identification outcomes by demographic group, and, once again, we found significant differences for both total matrix score and outcome. This also aligns with previous findings in the gifted identification literature. Hodges et al. (2018) found significant differences in proportional identification rates across student race/ethnicity in a recent meta-analysis of 54 studies. Significant differences in gifted identification based on ethnicity, gender, poverty, and emergent bilingual status were also found by Ricciardi et al. (2020). Most effects were in the weak to small range, again indicating the large sample size may have affected the results. However, a few comparisons showed stronger effects: a moderately strong effect for the difference in total score between students who qualify for FRPL and those who do not, a moderate effect for the difference in total score between students who qualify for SPED services and those who do not, and a moderate effect for race/ethnicity on total matrix scores.
For our third research question, exploring demographic differences in total matrix scores and identification outcomes by grade level, we again found significant differences between all groups at all levels for both total matrix score and identification outcome. Similar to the first two questions’ results, most effect sizes were weak to small, with a few exceptions. Student FRPL status showed a moderately strong effect on total matrix score for both the kindergarten and fifth grade samples. EB and SPED status had moderate effects on total matrix scores for fifth grade students, but weaker effects for kindergarteners. Student race/ethnicity had moderate effects on total matrix scores for both kindergarten and fifth grade students. Significant differences in identification by demographic between grade levels have previously been found by Ricciardi et al. (2020) and Hodges et al. (2018), among others. All effects of group membership on identification outcomes were weak to small across all three research questions.
For the three matrix measurement invariance questions (RQ4–6), there were significant differences in model fit across all three comparisons. When the five matrix components (ELA, Math, CogAT, report card, and teacher recommendations) were set to be equivalent, model fit worsened for different grade levels, demographic groups, and demographic groups within each grade level. This worsening of model fit indicates the matrix does not function equivalently across demographics/grade levels. Combining these diverse measures into one identification matrix did not remove the differential functioning of the individual matrix instrument components, as discussed in Moon (2017). While some matrix components function equivalently for certain comparisons, no matrix component consistently functions equivalently across all comparison groups. This is a similar finding to many prior studies that explored the measurement invariance of component instruments separately (Lee et al., 2022; Pereira, 2021; Peters & Gentry, 2013; Warne, 2011, 2023).
Some patterns did emerge in the strength and significance of factor loadings of matrix components between groups. For readers unfamiliar with factor loadings: the higher (or stronger) the factor loading is for a group, the more strongly that component affects the group, though not necessarily in a positive or negative way, similar to the magnitude of a correlation, which indicates strength without direction. If a matrix component has a significantly stronger loading for one group in comparison to another, that component carries more weight in the overall total score for that group than for the other group, but the weight could be positive or negative in nature. So, a matrix component with a significantly stronger loading for males affects males more than females in determining gifted identification in this sample, but not necessarily in a positive way.
Across all comparisons, neither the CogAT7 nonverbal score nor the report card produced any recognizable pattern of effect for any one demographic/grade level group. This means neither the CogAT7 nonverbal score nor the report card consistently advantaged or disadvantaged any one group. However, the other three matrix components did produce repeated, similar results for various demographic groups across all three comparisons (all demographics, demographics for kindergarten, demographics for fifth grade). For teacher recommendations, across all demographic comparisons, students identified as male, non-White, on FRPL, and/or using SPED services had significantly stronger factor loadings than their comparison demographic groups. Teacher recommendations thus had significantly stronger effects (positive or negative) on gifted identification for members of those groups than for members of other demographic groups. This is in line with McBee (2006), which found teacher nominations to be less likely to identify Black, Hispanic, and low-SES students. The IOWA/Logramos Math score also displayed repeated results for specific demographic groups across all demographic comparisons, with students identified as male, Latinx, non-White, Asian, and/or EB showing significantly stronger factor loadings than their demographic counterparts. Math scores had significantly stronger effects (positive or negative) on gifted identification for members of those groups than for members of other demographic groups. Finally, IOWA/Logramos ELA scores also displayed repeated results for specific demographic groups across all demographic comparisons, although these groups were almost completely diametrically opposite those in the previous findings for Math scores and teacher recommendations. The demographic groups with the significantly stronger loadings across all demographic comparisons on the ELA scores were female, non-Latinx, White, non-Asian, and/or non-EB. ELA scores had significantly stronger effects (positive or negative) on gifted identification for members of those groups than for members of other demographic groups. These results align with the general body of literature in the field, including Lewis et al. (2007), who found an achievement test identified significantly fewer ethnically diverse students than White students; Petersen (2013), who found achievement tests were more likely to identify boys than girls; and Abedi (2002), who found significant differences in performance on achievement tests based on EB status.
Limitations and Implications
There are multiple limitations on both the generalizability and validity of these findings. Similar to Carman et al. (2020), this research used a sample generated by one very large district, which affects both the external and the statistical conclusion validity of our findings. The participating district, while very large, is also very diverse. However, that diversity is spread across more than 100 elementary schools, and each school within the district has a different assortment of CLED students, with a high level of economic, ethnic/racial, and native language self-segregation between schools. Inasmuch as the participating district mirrors the demographic makeup of districts around the country, our results may not be as applicable to districts that are demographically different. Additionally, our results are based on the in-use identification matrix of the participating district. While the results of exploring the measurement invariance of a single district's matrix will not necessarily generalize across districts and matrices, we were able to conduct an exploration of measurement invariance within a single district's identification matrix because all the students within that district took the same measures and were measured through the same matrix. While matrices overall will differ by district, components included, and student body makeup, we believe our exploration of measurement invariance within one district's matrix adds to the overall gifted identification literature and fills a knowledge gap. Districts that use different matrix components or that combine those components using different methods may find different results, although we suspect different combinations of instruments will produce similar results if not combined in statistically appropriate ways. Unfortunately, this is in keeping with the general lack of common identification methods across schools/districts/states/countries. If our educational system (or field) were to agree upon a common definition (and therefore a common identification method) of giftedness, any study of identification methods would have more generalizability. Until we come to a consensus on how to identify the gifted, districts that use multiple-measure matrix identification may wish to explore differing combination rules and/or different measures statistically, to see if changing the components or how they are combined could lead to improved and more equitable identification of gifted CLED students.
The use of large sample sizes can lead to a greater likelihood of finding statistical significance for many statistical analyses. We calculated, interpreted, and presented effect sizes and CIs as a means of counterbalancing potentially inflated statistically significant findings. While almost every comparison was found to be statistically significant, our effect sizes were mostly weak to small, which could indicate artificial inflation of our statistical significance findings due to the size of our sample.
Additionally, our analyses do not reflect the actual implementation of the identification matrix in the participating district. The participating district includes the option to add additional points to the total matrix scores of students who, by virtue of their demographic group membership(s), have historically experienced less OTL than their majority counterparts. While we chose to remove consideration of those extra points from our exploration of matrix functioning because they would most likely have been a confound in our study, that removal also made our findings less reflective of actual practice in this one district. Future studies could explore the functioning of the identification matrix including the OTL points to determine whether the additional points make a difference in who is selected and in how the matrix functions for the various demographic groups.
These patterns of effects, where a particular matrix component more strongly affects one group than another even across grade level, should be explored further, especially where these findings are in line with previous literature. For components that can be affected by training, such as teacher recommendations, these patterns may signal areas for further focus during teacher development sessions. For components that are less affected by training, districts may want to monitor the outcomes of those instruments to identify whether there are patterns of differences in scores among demographic groups in their students and potentially apply targeted remediation/development or move to a within-group comparison model. A model for closing these scoring gaps, whether they are matrix-specific or not, can be found in Plucker and Peters's (2016) Excellence Gaps Intervention Model, which proposes six areas for targeted interventions at the national, district, and classroom levels.
Although using multiple criteria/modes is a widely supported recommendation, the use of those instruments in matrix form offers no silver bullet to the problem of underrepresentation in gifted identification. Similar to previous results by Pereira (2021) and Peters and Gentry (2013), we found significant differences in how this identification matrix functioned by demographic groups, which could have effects on which students are identified if the results of the matrix are used without modification. The multiple significant differences in factor loadings for many of the demographic groups indicate that using this identification matrix to compare demographically different students will leave some students at a disadvantage either through producing biased scores or because of true score differences between the demographic groups (Pereira, 2021; Peters, 2022). The use of multiple criteria/modes is important for capturing a clearer picture of the talents and abilities of the students we are considering for enrichment programs but using those scores in a comparative or competitive way can lead to significant harm for CLED students. While we encourage the continued use of multiple criteria/modes, we strongly discourage districts from using the results of those matrices to compare students from differing demographics. Identifying students with matrix-based results through the use of within-group comparisons rather than in comparison to overall cross-group scores may result in more equitable selection.
In addition to our efforts to create better instruments and to use those instruments in less biased ways, we should focus our efforts not on ever more complex identification models that try to capture every single gifted student, but rather on providing opportunities to learn at an earlier stage (frontloading), expanding access to targeted advanced programming that is responsive to local needs, and providing support to retain the students we’ve identified, among other areas (Plucker et al., 2022). In areas/districts with little funding for enrichment, we might change our framing and simplify instead: determine the resources each school/district has and the enrichment programs it is capable of offering, and then identify students for those specific programs, rather than trying to perfectly and broadly identify students who will then not be well served by the schools in which they are identified (Gubbins et al., 2021).
Using an identification matrix is an easy way to combine scores on a variety of measures that is simple for teachers and administrators to use without advanced training in statistical analysis. We encourage the continued use of identification matrices to combine diverse instruments in a nongatekeeping manner but suggest that the results of those matrices be used to identify students within similar backgrounds, such as through the use of building/local norms (Carman et al., 2020; Peters et al., 2019), rather than district wide, and as only a part of a broader model for equitable identification and service. Districts should be aware that using multiple measures, even in matrix format, will not in itself result in more equitable identification decisions, but may be part of creating a more equitable system for identification overall.
