Abstract
It is common in the educational and psychological sciences to collect data from individuals nested in hierarchical structures, such as students in classrooms. Further, in many instances, observed individual-level data are believed to indicate one or more unobserved latent variables (Bollen, 2002) that operate at both the individual and cluster levels.
There are two general approaches to multilevel measurement modeling: the aggregated approach and the disaggregated approach (Raudenbush & Bryk, 2002). Broadly, the aggregated approach involves fitting a measurement model to a conflation of level-1 and level-2 effects and correcting attenuation of standard errors due to clustering, and the disaggregated approach involves simultaneously and separately estimating a factor model at both level-1 and level-2. The advantages and disadvantages of these approaches are well established (Muthén & Satorra, 1995; Pornprasertmanit et al., 2014; Stapleton, McNeish, & Yang, 2016; Stapleton, Yang, & Hancock, 2016). Currently, however, methodological work has neglected to critically compare the reliability of factor score predictions extracted from multilevel measurement models. As past work has demonstrated, even well-researched factor analytic methods can produce factor score predictions that do not behave as expected or desired (Croon, 2002; Curran et al., 2016; McDonald & Burr, 1967; Skrondal & Laake, 2001). Thus, there is a need to expand our understanding of multilevel measurement modeling frameworks when the goal of the analysis is to obtain reliable level-1 and level-2 factor scores.
Factor scores are numerical predictions that indicate where an individual lies on an underlying continuous scale of some unobserved, or latent, construct such as intelligence, learning gains, or sense of belonging (Bartholomew et al., 2009). For example, a student might be asked a series of questions designed to measure overall learning gains. Items can then be combined to form a single summary score representing that student’s learning, which can offer insight to instructors on how individual students differentially benefited from the implementation of high-impact learning practices. In addition, classroom-level factor scores can be utilized institutionally to determine what components of high-impact learning are maximized in different courses. While factor scores for non-clustered data structures have been extensively studied (Fava & Velicer, 1992; Grice, 2001a, 2001b; McDonald & Burr, 1967; Skrondal & Laake, 2001; Velicer, 1976), a gap remains in the literature at the intersection of multilevel latent variable modeling and factor scoring.
The goal of our paper is to fill this gap by empirically investigating factor score predictions at both level-1 and level-2, extracted from aggregated and disaggregated measurement models, in the context of conditions commonly encountered in the educational and behavioral sciences and conditions relevant to multilevel measurement structures. Specifically, drawing on past research demonstrating the importance of considering sources of measurement non-invariance in single-level factor score predictions (Curran et al., 2016, 2018), we evaluate the reliability of factor score predictions from multiple modeling frameworks in the presence of forms of cross-level non-invariance (Jak, 2019; Jak et al., 2013, 2014; Jak & Jorgensen, 2017). To begin, we briefly review the technical details of confirmatory factor models for multilevel item response data within both approaches. Next, we describe factor-scoring methods conducive to both approaches. This will be followed by a detailed presentation of the subsequent simulation study and salient findings.
The Aggregated Approach to Multilevel Measurement Modeling
Returning to the example of measuring the effectiveness of high-impact learning at both the student- and classroom level, suppose a researcher collects item response data from students nested in classrooms and aims to measure learning outcomes at both levels of analysis. One approach is to begin by specifying and estimating a single-level confirmatory factor analysis model (CFA; e.g., Brown, 2006) with the student item response data. Of critical concern is the fact that the clustering attenuates standard errors and biases test statistics, leading to incorrect inference (Hox et al., 2018; Kamata et al., 2008; Muthén & Satorra, 1995). The aim of the aggregated approach is to correct for bias in standard errors associated with clustered data structures (Muthén, 1985, 1994; Stapleton, McNeish, & Yang, 2016).
Although CFA is well established in the literature, we present model equations to introduce a shared notational system across modeling frameworks and scores. For a CFA of $p$ items measuring $m$ latent factors for individuals $i = 1, \ldots, N$, the model is

$$\mathbf{y}_i = \boldsymbol{\nu} + \boldsymbol{\Lambda}\boldsymbol{\eta}_i + \boldsymbol{\varepsilon}_i, \tag{1}$$

$$\boldsymbol{\eta}_i \sim N(\mathbf{0}, \boldsymbol{\Psi}), \qquad \boldsymbol{\varepsilon}_i \sim N(\mathbf{0}, \boldsymbol{\Theta}). \tag{2}$$

Here, $\mathbf{y}_i$ is a $p \times 1$ vector of observed item responses, $\boldsymbol{\nu}$ is a $p \times 1$ vector of item intercepts, $\boldsymbol{\Lambda}$ is a $p \times m$ matrix of factor loadings, $\boldsymbol{\eta}_i$ is an $m \times 1$ vector of latent factors with $m \times m$ covariance matrix $\boldsymbol{\Psi}$, and $\boldsymbol{\varepsilon}_i$ is a $p \times 1$ vector of item residuals with $p \times p$ covariance matrix $\boldsymbol{\Theta}$.

Equations (1) and (2) imply the following mean and covariance structure:

$$\boldsymbol{\mu} = \boldsymbol{\nu}, \tag{3}$$

$$\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Psi}\boldsymbol{\Lambda}' + \boldsymbol{\Theta}. \tag{4}$$
Standard errors can then be corrected for attenuation due to clustering to aid in accurate model selection and inference. The most common correction procedure involves the application of formulas presented in Liang and Zeger (1986), which are extensions of general robust standard error methods (Eicker, 1967; Huber, 1967; White, 1980) derived for clustered data structures.
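For completeness, the general form of this cluster-robust (sandwich) variance estimator can be written in our notation; this is a standard textbook presentation of the Liang and Zeger (1986) correction, not a formula reproduced from the original article:

$$\widehat{\mathrm{Var}}(\hat{\boldsymbol{\theta}}) = \mathbf{A}^{-1} \left( \sum_{j=1}^{J} \mathbf{s}_j(\hat{\boldsymbol{\theta}})\, \mathbf{s}_j(\hat{\boldsymbol{\theta}})' \right) \mathbf{A}^{-1},$$

where $J$ is the number of clusters, $\mathbf{s}_j(\hat{\boldsymbol{\theta}})$ is the sum of casewise score (gradient) contributions to the log-likelihood within cluster $j$, and $\mathbf{A}$ is the observed information matrix. Ordinary maximum likelihood standard errors arise as the special case in which the middle term reduces to $\mathbf{A}$.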
While this approach addresses standard error attenuation, it ignores a critical aspect of multilevel analysis: aggregated approaches to multilevel modeling produce model parameter estimates of total or aggregated effects (Hox et al., 2018; Muthén, 1991), which are a conflation of unique level-1 (e.g., student) and level-2 (e.g., classroom) effects, weighted by cluster-based intraclass correlations (ICCs; Raudenbush & Bryk, 2002). In the context of the example assessing student and classroom-level learning outcomes associated with high-impact learning practices, a CFA with cluster-robust standard errors would be fit to raw item responses, which by definition represents a conflation of student and classroom characteristics. That is, the aggregated approach is unable to detect if item responses for a given student are due to a student’s individual learning gains or to an advantageous learning environment. Thus, if the goal of the analysis is to understand and distinguish individual-level constructs from cluster-level constructs, conflation of parameter estimates due to aggregation of effects may intrinsically lead to parameter estimate bias.
The Disaggregated Approach to Multilevel Measurement Modeling
To overcome this issue, multilevel confirmatory factor analysis (MCFA) separately and simultaneously estimates level-1 and level-2 effects, decomposing total effects into their constituent level-1 and level-2 components and allowing analysts to draw inferences unique to each level of analysis (Hox et al., 2018; Mehta & Neale, 2005; Stapleton, McNeish, & Yang, 2016). Muthén was the first to introduce a procedure for multilevel factor analysis, Muthén’s maximum likelihood, or MUML (Muthén, 1991, 1994), which separately estimates factor models associated with the level-1 and level-2 covariance structure, respectively. However, as computational power and efficiency improved over time, normal theory maximum likelihood has become the more ubiquitous and recommended method of MCFA (Hox et al., 2018; Yuan & Hayashi, 2005). We present equations below for notational consistency.
For an MCFA of $p$ items administered to individuals $i = 1, \ldots, n_j$ nested in clusters $j = 1, \ldots, J$, item responses are decomposed into between-cluster and within-cluster components:

$$\mathbf{y}_{ij} = \boldsymbol{\nu} + \mathbf{y}_{B_j} + \mathbf{y}_{W_{ij}}, \tag{5}$$

with a factor model specified for each component:

$$\mathbf{y}_{W_{ij}} = \boldsymbol{\Lambda}_W \boldsymbol{\eta}_{W_{ij}} + \boldsymbol{\varepsilon}_{W_{ij}}, \tag{6}$$

$$\mathbf{y}_{B_j} = \boldsymbol{\Lambda}_B \boldsymbol{\eta}_{B_j} + \boldsymbol{\varepsilon}_{B_j}, \tag{7}$$

where $\mathbf{y}_{W_{ij}}$ is the within-cluster (level-1) component of the item responses, $\mathbf{y}_{B_j}$ is the between-cluster (level-2) component (i.e., a vector of cluster-specific random intercepts), $\boldsymbol{\Lambda}_W$ and $\boldsymbol{\Lambda}_B$ are level-specific factor loading matrices, $\boldsymbol{\eta}_{W_{ij}}$ and $\boldsymbol{\eta}_{B_j}$ are level-specific latent factors with covariance matrices $\boldsymbol{\Psi}_W$ and $\boldsymbol{\Psi}_B$, and $\boldsymbol{\varepsilon}_{W_{ij}}$ and $\boldsymbol{\varepsilon}_{B_j}$ are level-specific residuals with covariance matrices $\boldsymbol{\Theta}_W$ and $\boldsymbol{\Theta}_B$.

Equations 5 to 7 imply a mean structure of zero at level-1, since all individual-level variables are centered at group means (Ryu, 2014a), and a level-2 mean structure of

$$\boldsymbol{\mu}_B = \boldsymbol{\nu}. \tag{8}$$

The model-implied covariance structures at level-1 and level-2, respectively, are

$$\boldsymbol{\Sigma}_W = \boldsymbol{\Lambda}_W \boldsymbol{\Psi}_W \boldsymbol{\Lambda}_W' + \boldsymbol{\Theta}_W, \tag{9}$$

$$\boldsymbol{\Sigma}_B = \boldsymbol{\Lambda}_B \boldsymbol{\Psi}_B \boldsymbol{\Lambda}_B' + \boldsymbol{\Theta}_B. \tag{10}$$
Equations 5 to 10 demonstrate key advantages of the MCFA over the CFA with cluster-robust standard errors. First, there are distinct model-implied covariance structures for level 1 and level 2, with separate matrices at each level of analysis. This implies that MCFA can systematically model cross-level measurement non-invariance. Cross-level measurement non-invariance refers to measurement models that differ in function and form across hierarchical levels of analysis. This includes configural non-invariance, wherein the factor structures differ across levels, and metric non-invariance, wherein the factor loadings differ across levels (Jak, 2019; Jak et al., 2013, 2014). Notably, a 2016 review found that 31% of multilevel factor models reported a different number of level-1 and level-2 factors, suggesting that cross-level configural non-invariance is fairly common (Kim et al., 2016). In addition, distinct model-implied covariance structures across levels imply that factor loadings can differ across level 1 and level 2 or can be held constant with equality constraints (Jak, 2019). For example, one item may be strongly predictive of student learning but less strongly predictive of general classroom-level curricular advantages. Finally, differences across clusters are captured and modeled in the random intercept component, which contextualizes item responses to their associated cluster.
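To make the distinction concrete, the sketch below shows how a two-level CFA with and without cross-level loading equality constraints might be specified in R with lavaan (0.6 series). The models in this article were estimated in Mplus; the data frame `dat`, items `y1`–`y4`, and cluster variable `classroom` are hypothetical placeholders.

```r
library(lavaan)

# Two-level CFA with freely estimated level-1 and level-2 loadings
model_free <- '
  level: 1
    fw =~ y1 + y2 + y3 + y4    # within (student) factor
  level: 2
    fb =~ y1 + y2 + y3 + y4    # between (classroom) factor
'

# Cross-level metric invariance: shared labels equate loadings across levels
model_metric <- '
  level: 1
    fw =~ l1*y1 + l2*y2 + l3*y3 + l4*y4
  level: 2
    fb =~ l1*y1 + l2*y2 + l3*y3 + l4*y4
'

fit_free   <- sem(model_free,   data = dat, cluster = "classroom")
fit_metric <- sem(model_metric, data = dat, cluster = "classroom")
```

With the marker-variable identification lavaan applies by default, the first loading is fixed to 1 at each level, so the shared labels constrain the remaining loadings to be equal across levels.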
Because of the MCFA’s ability to decompose level-1 and level-2 effects, it is often the preferred method of multilevel factor analysis (Hox et al., 2018; Pornprasertmanit et al., 2014); however, in many applications, disaggregated modeling approaches are not estimable or are numerically unstable due to the interplay of complex model specification at each level of analysis and limitations in the number of clusters available for analysis (Jak, 2019; Maas & Hox, 2005). In the same way that it would not be recommended to evaluate the validity of a scale on a sample of 50 individuals (MacCallum et al., 1999) and use unstable CFA parameter estimates to obtain factor score estimates (Skrondal & Rabe-Hesketh, 2004), it may be equally problematic to establish the function and form of a level-2 factor structure and extract level-2 factor scores with 50 clusters, due to the impacts of sampling variability. Therefore, the decision between aggregated and disaggregated approaches to multilevel measurement modeling often hinges on the conflict between ideal modeling and practical or viable modeling, given available data, particularly when the goal of the analysis is to assign reliable scores to multilevel constructs. It is to this we now turn.
Factor Scores
Factor scores have existed for nearly a century (M. S. Bartlett, 1937; Thomson, 1935, 1938; Thurstone, 1935) and have many practical uses. In a multilevel context, factor scores allow analysts to estimate a single-number summary of both where an individual stands on a level-1 latent construct and where a cluster stands on a level-2 latent construct.
Mean scores are one of the simplest, and consequently most ubiquitous, methods for predicting latent standings (Bauer & Curran, 2016). In the context of multilevel response data obtained from individuals nested in clusters, individual-level scores are computed by summing all items for a given latent variable and dividing by the total number of items, and cluster-level scores are computed by summing all individual-level scores in a cluster and dividing by the number of individuals in that cluster. While simple to compute and interpret, mean scores inherently assume equal weighting of items (McNeish & Wolf, 2020; Thissen & Wainer, 2001), which may not accurately represent all measurement structures in practice.
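As a concrete illustration, level-1 and level-2 mean scores can be computed in a few lines of base R; the data frame `dat`, item names `y1`–`y4`, and cluster identifier `classroom` are hypothetical placeholders.

```r
items <- paste0("y", 1:4)  # items indicating one latent factor (hypothetical)

# Level-1 mean score: average of an individual's item responses
dat$mean_l1 <- rowMeans(dat[, items])

# Level-2 mean score: average of the level-1 mean scores within each cluster
dat$mean_l2 <- ave(dat$mean_l1, dat$classroom, FUN = mean)
```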
An alternative approach to scoring involves extracting scores from more complex measurement models. In the context of the aggregated approach, specifically CFA, sample estimates of parameters in Equations 1 to 4 represent unbiased estimates of the aggregated total effects. Therefore, sample estimates can be used in standard factor-scoring formulas (M. S. Bartlett, 1937; Thomson, 1935; Thurstone, 1935) to obtain a single total effect score for each level-1 unit; however, as these estimates are based on a conflation of level-1 and level-2 effects, they may not (and likely do not) accurately capture true level-1 and level-2 processes, particularly in the presence of cross-level non-invariance. Importantly, while correcting standard errors for attenuation due to clustering is necessary with clustered data, factor scores, which rely solely on point estimates of model parameters, are not impacted by standard error corrections.
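For reference, one standard form of these factor-scoring formulas, the regression (Thomson) predictor, can be written in the notation of Equations 1 to 4 with sample estimates substituted for population parameters:

$$\hat{\boldsymbol{\eta}}_i = \hat{\boldsymbol{\Psi}} \hat{\boldsymbol{\Lambda}}' \hat{\boldsymbol{\Sigma}}^{-1} \left( \mathbf{y}_i - \hat{\boldsymbol{\nu}} \right),$$

where $\hat{\boldsymbol{\Sigma}} = \hat{\boldsymbol{\Lambda}} \hat{\boldsymbol{\Psi}} \hat{\boldsymbol{\Lambda}}' + \hat{\boldsymbol{\Theta}}$ is the model-implied item covariance matrix.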
Alternatively, factor scores can be extracted directly from the MCFA, which disaggregates level-1 and level-2 effects, allowing for specification and estimation of differences across level-1 and level-2 factor models (Jak, 2019; Jak & Jorgensen, 2017). Further, because MCFA decomposes covariance matrices into within- and between-components and simultaneously and separately estimates level-1 and level-2 factor models, separate level-1 and level-2 factor scores can be directly extracted, representing unique level-1 and level-2 effects, and cross-level non-invariance (or invariance) can be systematically modeled and incorporated into score predictions. At level 2, the random intercept, or latent-mean component of the MCFA, precludes direct computation of factor scores through matrix-based, closed-form equations, because the between-component of items is a per-cluster realization of a random intercept, and thus a latent variable itself. Therefore, empirical Bayes approaches to scoring are typically utilized in conjunction with maximum likelihood MCFA, and both level-1 and level-2 factor score estimates are computed by taking the mean of the posterior distribution of each latent variable, given the observed item responses and model parameter estimates.
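In R, for example, empirical Bayes score predictions can be extracted from a fitted two-level lavaan model such as `fit_free` above; the `level` argument of `lavPredict()` assumes a recent lavaan release in the 0.6 series, and the object names are again hypothetical.

```r
# Empirical Bayes factor score predictions from the two-level model
eta_w <- lavPredict(fit_free, level = 1L)  # level-1 scores: one row per individual
eta_b <- lavPredict(fit_free, level = 2L)  # level-2 scores: one row per cluster
```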
In sum, factor score estimates computed from aggregated approaches to multilevel measurement modeling are likely subject to the same general limitations associated with this modeling approach (i.e., scores may not accurately capture true level-1 and level-2 processes as these are based on a conflation of level-1 and level-2 effects), but the degree of this bias has not been empirically investigated. Further, factor score estimates from disaggregated approaches to multilevel measurement modeling may overcome this limitation, but the extent to which this is advantageous, in the presence of sampling variability, has not been established. Some recent research has explored the use of multilevel factor scores in subsequent analyses, with the goal of using scores to extract unbiased paths between multilevel latent variables (Devlieger & Rosseel, 2020; Kelcey et al., 2021), but these studies do not consider multiple methods to multilevel measurement (i.e., aggregated and disaggregated approaches) prior to factor score extraction. To our knowledge, no methodological research has been conducted to specifically determine the utility of level-1 and level-2 factor scores extracted from aggregated and disaggregated measurement models under conditions commonly encountered in practice and in the presence of different forms of cross-level non-invariance. This is our purpose here.
Simulation Study
Our simulation study was designed to critically evaluate the relation between true scores and factor score predictions extracted from aggregated and disaggregated multilevel measurement models, under conditions commonly encountered in practice. We selected the relation between true scores and factor scores as the primary outcome of interest, as opposed to other metrics such as standard errors of scores, given the marked importance of score estimate accuracy for applied researchers aiming to use scores to understand the nature of level-1 and level-2 effects. The population-generating model and simulation conditions were motivated by prior pilot analyses of a real educational dataset evaluating the effectiveness of course-based research at a large southern research university in the United States (Sathy et al., 2020), as well as additional follow-up analyses as data collection proceeded. This was balanced with the goal of procuring findings generalizable beyond this target dataset. Therefore, some design characteristics were specifically included to mirror additional situations commonly encountered in multilevel factor modeling applications.
Given the goals of analyses, hypotheses emphasized differences in aggregated and disaggregated measurement frameworks and factor score estimates. Hypotheses were further separated by level of analysis (i.e., level-1 and level-2) and by population-generating factor structure (i.e., one factor at level-2 and three factors at level-2). This resulted in a total of three primary hypotheses.
First, we hypothesized that level-1 factor scores extracted from the MCFA would be most closely related to true underlying level-1 scores, compared to regression scores from the CFA and mean scores. We further hypothesized that cluster-mean-centered regression scores would offer a viable alternative to level-1 factor scores, producing comparably reliable factor score estimates, and that imposing cross-level equality of factor loadings, even when this imposition is not supported by the population-generating model, would not meaningfully reduce the reliability of level-1 factor scores. Second, we hypothesized that when cross-level configural invariance is satisfied (i.e., an equivalent number of factors at level-1 and level-2), the MCFA with cross-level equality of factor loadings would produce the most reliable level-2 factor scores, given improved stability of model estimation. Third, we hypothesized that when cross-level configural invariance is not satisfied (i.e., a nonequivalent number of factors at level-1 and level-2), a properly specified MCFA would produce the most reliable level-2 factor scores, and that the mean of cluster-aggregated regression scores would be more reliable than other score types but would not outperform level-2 factor scores. All models and score types are explicated in the following section.
Simulation Design
To test hypotheses, we systematically varied five key components in our simulation design: (a) cross-level configural invariance; (b) number of clusters at level 2; (c) number of items per latent factor; (d) modeling procedure; and (e) scoring technique. Details of these design characteristics, as well as general data generation procedures, are outlined below. Throughout, extensive validation procedures were utilized to confirm the proper generation of data.
Level-2 True Scores
First, we simulated level-2 true scores to correspond to one of two population-generating factor structures: one factor at level-2, and three factors at level-2. With three factors at level 2, factor correlations were set to .20, .35, and .50 (see Figure 2). In addition, we simulated either 250 or 50 level-2 true scores (i.e., clusters), where the former was selected to match the target dataset of interest and the latter was selected to exemplify the lower end of the number of clusters necessary for accurate estimation of multilevel factor models (Maas & Hox, 2005).
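A minimal sketch of this step in R, drawing correlated level-2 true scores from a multivariate normal distribution; the assignment of the three correlations to particular factor pairs is illustrative rather than taken from the original design.

```r
library(MASS)

# Level-2 factor covariance matrix: unit variances, correlations .20/.35/.50
# (assignment of correlations to factor pairs is illustrative)
Psi_b <- matrix(c(1.00, 0.20, 0.35,
                  0.20, 1.00, 0.50,
                  0.35, 0.50, 1.00), nrow = 3, byrow = TRUE)

# Draw 250 clusters' worth of level-2 true scores
set.seed(1)
eta_b <- mvrnorm(n = 250, mu = rep(0, 3), Sigma = Psi_b)
```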
Level-2 Indicators
Next, latent means, or level-2 factor indicators, were simulated. With three latent factors at level 2, four, six, or eight indicators per factor were specified. With one latent factor at level 2, 18 indicators were specified, to align with a consistent level-1 factor structure. Standardized factor loadings ranged from 0.5 to 0.8 in increments of 0.1, and error terms were specified so that raw and standardized factor loadings were equivalent (Figure 1).
Figure 1. Population-generating models at level 2.
Level-1 True Scores
Level-1 true scores were then simulated using level-2 factor indicators and a level-1 factor structure of three factors with factor correlations of .20, .35, and .50 (identical to the factor correlations at level-2 with three factors; Figure 2). This further implied that 50% of simulated datasets did not satisfy configural cross-level invariance and 50% did satisfy this constraint. The number of true scores per cluster was sampled randomly, with replacement, from the empirical distribution of course enrollment from the target dataset, and the same randomly sampled cluster sizes were used in all replications. Observations per cluster ranged from 3 to 175, with a median cluster size of 21, to represent a distribution of courses that were mostly small but contained some large lecture sections. The first 50 cluster sizes were used when only 50 clusters were simulated. The level-1 sample size totaled
Figure 2. Population-generating model at level 1.
Level-1 Indicators
We then simulated level-1 item responses, with four, six, and eight items per factor, setting factor loadings to be proportional to level-2 loadings. Specifically, raw loadings were selected such that item communalities were equal across level 1 and level 2 to create more reasonable comparisons for design factors of interest (given the impact of communalities on factor score predictions is well studied, e.g., Fava & Velicer, 1992) and such that item ICCs were set to .20 for all items. The population data-generating models are depicted in Figures 1 and 2 and a summary of the data-generating processes is outlined in Figure 3.
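To make the ICC specification concrete: an item's ICC is the proportion of its total variance that lies between clusters. Under the variance specifications described in the figure notes below (a between-item variance of 1 and a within-item variance of 4), each item's ICC is

$$\mathrm{ICC} = \frac{\sigma^2_{B}}{\sigma^2_{B} + \sigma^2_{W}} = \frac{1}{1 + 4} = .20.$$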
Figure 3. Summary of data-generating process.
Top panel features the 3-factor level-2 structure and the bottom panel features the 1-factor level-2 structure. Raw loadings were simulated to be equivalent to standardized loadings by setting error variance to $1 - \lambda^2$, where $\lambda$ denotes the standardized loading.
Raw loadings, indicated in Figure 2, were simulated to be twice the standardized loading (e.g., raw loadings of 1.6 correspond to standardized loadings of 0.8) by setting error variance to $4 - \lambda^2$, where $\lambda$ denotes the raw loading, yielding a total item variance of 4 at level 1.
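As a quick sanity check on this variance algebra, a small R illustration using the example values above (not code from the original study):

```r
lambda_raw <- 1.6                         # raw level-1 loading from the example
theta      <- 4 - lambda_raw^2            # implied error variance: 1.44
total_w    <- lambda_raw^2 * 1 + theta    # within-item variance (factor variance 1): 4
lambda_std <- lambda_raw / sqrt(total_w)  # recovered standardized loading: 0.8
icc        <- 1 / (1 + total_w)           # between variance 1 over total 5: .20
```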
Model Estimation
We next estimated a series of measurement models within each simulated dataset, inducing nesting of scoring procedures within replications. For population-generating models satisfying configural cross-level invariance we fit the following measurement models: (1) CFA with cluster-robust standard errors; (2) MCFA freely estimating level-1 and level-2 loadings; and (3) MCFA imposing cross-level factor loading equality constraints, or metric invariance. While (3) did not precisely conform to the population-generating structure, metric invariance was tenable given factor loadings at level-1 and level-2 were proportional. Further, evidence suggests that imposing cross-level invariance can improve factor interpretability and enhance the stability of model estimation (Jak, 2019; Kim & Cao, 2015). Given that factor scores are highly subject to model instability in the single-level case (Skrondal & Rabe-Hesketh, 2004), it is likely that instability in the multilevel measurement models will deleteriously impact factor score predictions, and that this may be remediated by imposing parameter constraints.
For population-generating models not satisfying configural cross-level invariance, only (1) and (2) were estimated. Without satisfying the equivalence of factor structure across levels, it makes little sense to impose equivalence of loadings.
Factor Score Predictions
Factor score predictions were also computed within each simulated dataset. First, we computed level-1 and level-2 mean scores for each latent factor, as described previously. Next, we extracted level-1 factor scores, specifically regression scores (Thomson, 1935; Thurstone, 1935), from the CFA with cluster-robust standard errors, which produces factor scores equivalent to those from the standard CFA but was selected to better align with the clustered data-generating process. We aggregated these to level 2 by summing scores within the same cluster and dividing by the total number of observations per cluster. Notably, three level-2 factor scores were computed even when the data-generating mechanism had only one level-2 factor. This was justified because an analyst using the aggregated approach may not have information about differences in level-1 and level-2 factor structures, in turn enhancing the external validity of results.
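In R, this aggregation is simply a cluster mean of the level-1 regression scores; the variable names below are hypothetical.

```r
# Cluster-aggregated regression score: mean of level-1 regression scores per cluster
dat$reg_l2_agg <- ave(dat$reg_l1, dat$classroom, FUN = mean)
```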
In addition, a series of simple transformations were applied to mean scores and regression scores extracted from the CFA to rescale scores to better match what is accomplished under the full MCFA. These transformations were designed to be as uncomplicated as possible, so they could be easily implemented in practice. The goal in including these scores was to see if a less complex model could produce comparably reliable factor score estimates under one or both of two conditions: (a) Number of available clusters limits stable estimation of the full MCFA; and (b) an analyst is otherwise unable to estimate the full MCFA due to lack of expertise in advanced modeling. Specifically, two manipulations were conducted.
The first involved level-1 regression scores from the CFA and level-1 mean scores. Because the MCFA disaggregates level-1 and level-2 effects and other methods do not, level-1 factor scores from the MCFA indicate a given level-1 unit’s distance from their associated cluster mean, whereas level-1 factor scores from aggregated approaches indicate a given level-1 unit’s distance from the grand mean of all clusters. Therefore, we additionally group-mean-centered (Raudenbush & Bryk, 2002) level-1 mean scores and level-1 factor scores from the CFA so these would better indicate level-1 deviations from level-2 effects.
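Concretely, this centering subtracts each cluster's mean score from the scores of its members (variable names hypothetical):

```r
# Group-mean-center level-1 scores: deviations from the cluster mean
dat$reg_l1_cmc  <- dat$reg_l1  - ave(dat$reg_l1,  dat$classroom, FUN = mean)
dat$mean_l1_cmc <- dat$mean_l1 - ave(dat$mean_l1, dat$classroom, FUN = mean)
```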
The second transformation was specifically designed to accommodate differing factor structures at level 1 and level 2. As noted, because the CFA does not allow an analyst to empirically determine if factor structures differ at level 1 and level 2, we computed three level-2 cluster-aggregated regression scores, even when the population-generating model had only a single latent factor at level 2. This created a natural conflation between the modeling approach (i.e., aggregated versus disaggregated) and the number of items per factor. In practice, researchers may have theoretical reasons to expect a single construct at level 2 even when multiple constructs operate at level 1; therefore, we additionally computed the unweighted mean of the three cluster-aggregated regression scores to serve as a single level-2 score estimate.
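Computationally, this second transformation reduces to a row mean over the three cluster-aggregated scores; the data frame `clus` and its column names are hypothetical placeholders.

```r
# Unweighted mean of the three cluster-aggregated regression scores
clus$reg_l2_avg <- rowMeans(clus[, c("agg_f1", "agg_f2", "agg_f3")])
```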
Lastly, level-1 and level-2 factor scores were extracted from a properly specified MCFA in all conditions. When cross-level configural invariance was established, scores were also extracted from an MCFA with equality constraints on level-1 and level-2 loadings. All measurement models and factor score estimates were obtained using Mplus version 8 (Muthén & Muthén, 1998–2023) and output was compiled using MplusAutomation in R (Hallquist & Wiley, 2018). A summary of modeling procedures and score estimates is presented in Table 1.
Table 1. Measurement Models and Factor Scores
Outcome Computation
For all estimated factor scores, Pearson product-moment correlations between each score estimate and its associated true score value were computed; this correlation is a direct estimate of the reliability index of factor scores (Estabrook & Neale, 2013). When the population-generating model had only one factor but three factor scores were computed (i.e., mean scores and cluster-aggregated regression scores), correlations were computed between the single true score and the factor score with the most indicators. This factor score was selected given it was consistently more strongly associated with the single true score compared to the other estimated factor scores.
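In each replication, these outcomes amount to one Pearson correlation per score type and level; a minimal sketch with hypothetical object names:

```r
# Reliability index: correlation between factor score estimates and true scores
rel_l1 <- cor(dat$eta_hat_l1,  dat$eta_true_l1)   # level-1 scores
rel_l2 <- cor(clus$eta_hat_l2, clus$eta_true_l2)  # level-2 scores
```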
Evaluating Hypotheses
Given reliability was computed using a value with a known and interpretable metric (i.e., the correlation coefficient), graphical analyses were selected as the primary method to determine meaningful differences in factor score reliability.
Results
Convergence
Although past research has suggested 50 clusters as a benchmark for stable estimation of the full MCFA, justifying its inclusion as a simulation condition (Maas & Hox, 2005), convergence issues were noted in some design cells. Specifically, 35 out of 500 (7.00%) replications did not converge when the full MCFA, without cross-level constraints, was fit to a population-generating model with three level-2 factors. In addition, 5 out of 500 (1.00%) replications did not converge when the full MCFA with cross-level constraints was fit to the population-generating model with three level-2 factors. In most cases, convergence issues consisted of Heywood cases (i.e., negative residual variance estimates), with a smaller number of solutions terminating at saddle points. Because so few models failed to converge out of the total 5,000 measurement models fit, these were treated as missing and additional datasets were not simulated to replace non-converged solutions.
Level-1 Factor Scores
Next, we evaluated factor score reliability for all level-1 factor score estimates (see Figure 4). Hypothesis (1) was generally supported in that level-1 mean scores and level-1 regression scores were notably less reliable than level-1 scores from the MCFA and cluster-mean-centered regression scores. Average correlations for mean scores and regression scores were 0.80.
Figure 4. Level-1 factor score and true score correlations.
We also noted findings outside of the key hypotheses. Specifically, while average correlations were not substantially different with different numbers of clusters, the range of correlation coefficients was larger with fewer clusters. This was especially apparent for mean scores and uncentered regression scores, wherein minimum factor score and true score correlations differed by as much as 0.10 when comparing models fitted to 50 clusters as opposed to 250 clusters. Finally, as expected, factor score and true score correlations were generally higher with a larger number of items per factor.
Level-2 Factor Scores With Three Level-2 Population-Generating Factors
Next, we evaluated level-2 factor scores for all data simulated from a population-generating model with three level-2 factors, thus satisfying cross-level configural invariance. In this subset of conditions, there was no conflation between the number of items per factor and the measurement modeling or factor-scoring procedure, and an average of cluster-aggregated regression scores was not computed.
Generally, there were less pronounced differences in average factor score and true score correlations across design conditions compared to level-1 factor scores (Figure 5). Hypothesis (2) predicted that scores from the MCFA with cross-level factor loading constraints would be most reliable, and this was largely supported. Pooling across other simulation conditions, mean scores correlated an average of 0.81 with true scores.
Figure 5. Level-2 factor score and true score correlations with three level-2 population-generating factors.
As expected, reliability differences occurred when comparing the number of items per factor, such that average correlations for factor scores based on eight indicators per factor were 0.04 units higher than those based on four indicators per factor. Further, there were noted differences in the range of factor score reliability when specific scoring techniques were applied to data with fewer level-2 clusters available for analysis. Specifically, factor score reliability was as low as .36 for level-2 factor scores extracted from an MCFA fit to only 50 clusters. This improved markedly when imposing cross-level factor loading equality constraints, wherein minimum reliability increased to .51. Minimum reliability estimates were highest for cluster-aggregated regression scores.
Level-2 Factor Scores With One Level-2 Population-Generating Factor
Finally, we considered level-2 factor scores when a single factor existed at level-2, and thus cross-level configural invariance was not satisfied. Here, since 18 items indicated the single underlying latent factor at level-2, the number of items per factor was no longer fully crossed with the scoring procedure.
Hypothesis (3) stated that the unweighted mean of cluster-aggregated regression scores would outperform mean scores and the separate regression scores, as this score better matched the true population-generating structure; however, this score would not be as reliable as level-2 scores from the MCFA. This was not the case. In fact, the mean of cluster-aggregated regression scores was consistently more reliable than level-2 factor scores from the properly specified MCFA.
Again, the range of reliability was larger with fewer available level-2 clusters, particularly when scores were based on fewer indicators (i.e., mean scores and regression scores). This trend mirrored observations in Figure 5.
Finally, reliability was generally lowest when correlations were computed between true scores and a single level-2 mean score.
Figure 6. Level-2 factor score and true score correlations with one level-2 population-generating factor.
Discussion and Conclusion
The goal of our simulation study was to augment current knowledge about aggregated and disaggregated multilevel measurement model approaches by expanding this work to consider the utility of factor score predictions from these approaches. Further, we aimed to offer factor-scoring options and solutions when a full and properly specified MCFA may not be a viable modeling possibility. Broadly, results suggested that factor score reliability is not deterministically related to modeling procedure in all instances, but is instead driven by the extent to which scores are designed to capture the function and form of true level-1 and level-2 effects and the extent to which the impacts of sampling variability are reasonably considered and mitigated. To explicate these findings, we return to the original research hypotheses.
Level-1 Factor Scores: Hypothesis (1)
Hypothesis (1) stated that level-1 factor scores from the MCFA would be most highly correlated with true scores, but that cluster-mean-centered regression scores would be a viable alternative. Further, when appropriate to impose cross-level factor loading equality constraints (i.e., when configural invariance is minimally satisfied), we hypothesized this imposition would not meaningfully reduce factor score reliability. This was supported in all simulation design cells. First, we found that both cluster-mean-centered regression scores and level-1 factor scores from an MCFA with cross-level factor loading equality constraints were comparably reliable to scores from the MCFA without cross-level factor loading equality constraints. This finding is particularly salient when considering an analytical scenario with a limited number of level-2 clusters. Specifically, in some simulated datasets with only 50 clusters, the properly specified MCFA did not converge; in all cases the CFA with cluster-robust standard errors was estimable, and the MCFA with cross-level factor loading constraints converged more often than the MCFA without these constraints. Therefore, our findings suggest there are available scoring techniques that will produce reliable level-1 factor score predictions, even in instances where an MCFA is not estimable due to limited level-2 clusters and/or increased model complexity.
Level-2 Factor Scores With Three Level-2 Population-Generating Factors: Hypothesis (2)
Hypothesis (2) stated that when cross-level configural invariance is satisfied, the MCFA with cross-level factor loading equality constraints would produce the most reliable level-2 factor scores. This hypothesis was supported. Further, other scoring techniques were comparably reliable for all practical purposes, including mean scores. The most pronounced differences occurred when considering the range of reliability. Specifically, within some samples factor scores were unacceptably unreliable with fewer clusters available for analysis. Notably, from the perspective of sampling variability, cluster-aggregated regression scores were the least likely to be deleteriously impacted by limited available clusters.
Level-2 Factor Scores With One Level-2 Population-Generating Factor: Hypothesis (3)
Hypothesis (3) stated that when level-1 and level-2 factor structures differed, the properly specified MCFA would produce the most reliable level-2 factor scores, followed by the mean of cluster-aggregated regression scores. While factor scores from the MCFA were more reliable than cluster-aggregated regression scores and mean scores, the mean of cluster-aggregated regression scores and the mean of cluster-aggregated mean scores were consistently the most reliable. This demonstrated the overall importance of correctly specifying the level-2 factor structure. When factor score estimates did not accurately represent the true level-2 data-generating factor structure (i.e., three level-2 factor scores were estimated for a single data-generating factor), factor scores were consistently less reliable. Decades of work have highlighted the importance of extracting the correct number of factors (Cattell, 1958), and this is further established by our simulation in a multilevel setting, particularly when cross-level configural invariance is not achieved.
Limitations
A key consideration in deciding between the disaggregated and aggregated approaches to multilevel measurement modeling is to determine which level of analysis is of primary interest (Stapleton, McNeish, & Yang, 2016). Our study was designed to determine optimal methods of factor score extraction when the goal is to obtain reliable estimates of both true level-1 and level-2 effects. It is important to note that factor score reliability was estimated by correlating factor scores with true level-1 scores. That is, reliability was likely lower for mean scores and uncentered regression scores because these are estimates of aggregated total effects, and not disaggregated level-1 effects. Consequently, findings should not be generalized to situations where the goal of the analysis is to obtain reliable estimates of the total effect, in which case uncentered scores may be preferable.
In addition, all datasets were simulated with proportional factor loadings across level-1 and level-2. Because of this proportionality, imposing metric invariance in simulation conditions when configural invariance was supported did not significantly reduce model fit in most cases. That is, while factor loadings were not precisely equal across level 1 and level 2, they were proportional, thus metric invariance was tenable and likely to be accepted by an analyst in practice. Because of this, the benefits of this constraint should not be generalized to scenarios in which metric invariance would not be supported by tests of model fit.
In addition, our simulation did not systematically consider model fit and model building. Throughout model estimation, we confirmed that models achieved adequate fit (Ryu, 2014b; West et al., 2012) on average but did not pursue model modification. This decision was justified given the present focus on scoring; however, additional methodological contributions are necessary to better understand the role of model misfit and misspecification on factor score estimation.
Finally, our simulation exclusively considered score reliability as the sole evaluative metric. In some cases, this was at the expense of score validity. For example, to maximize the external validity of findings, when a CFA with cluster-robust standard errors was fit to a population-generating model with a single level-2 factor, three level-2 factor scores were computed. Importantly, all level-2 factor scores computed in this condition were theoretically invalid. Therefore, even though correlations with true scores were reasonably large on average, these scores represented constructs that did not exist at level 2, and use of them would likely lead to erroneous conclusions. We underscore the conclusion that maximal reliability can only be obtained with proper knowledge of the true population-generating factor structure, and that reliability is likely to diminish as scores become more and more invalid. Therefore, we encourage applied researchers to critically consider underlying theory and ensure properly specified factor models prior to computing factor scores.
Future Directions
In summary, results consistently suggested that simple manipulations to factor scores from the aggregated approach offer score predictions that are comparably reliable to, and in some samples more reliable than, scores extracted from a properly specified MCFA. In addition, imposing cross-level factor loading constraints, when reasonably supported, improved factor score reliability in the disaggregated modeling approach. These findings extend to research scenarios where the goal of the analysis is to descriptively explore latent constructs at multiple levels of analysis. For example, results can be applied to critical evaluations of distributions of learning gains at both the student- and classroom levels. We did not, however, extend our simulation to evaluate the use of factor scores in subsequent models. Similar to the single-level case, this study serves as a necessary precursor (Curran et al., 2016) motivating future work in the use of multilevel factor scores in subsequent models (Curran et al., 2018). Specifically, given the ability of less complicated models (i.e., CFA with cluster-robust standard errors) to recover reliable level-1 and level-2 effects under certain conditions, our findings offer promising pathways forward to consider how and when these scores may be used to understand relations between latent variables at multiple levels of analysis.
Appendix Meta-Models
Level-2 Meta-Model
| Effect | df | F | Generalized η² |
|---|---|---|---|
| Three factors at level 2 | | | |
| Number of clusters | (1, 960) | 16.25*** | .006 |
| Factor-scoring procedure | | | |
| Number of items per factor | | | |
| Number of clusters: Factor-scoring procedure | (1.64, 1,571.86) | 134.42*** | .011 |
| Number of clusters: Number of items per factor | (1.98, 1,903.08) | 3.59** | .002 |
| Factor-scoring procedure: Number of items per factor | (3.53, 3,389.45) | 12.78*** | .002 |
| One factor at level 2 | | | |
| Number of clusters | (1, 998) | 39.74*** | .031 |
| Factor-scoring procedure | | | |
| Number of clusters: Factor-scoring procedure | (1.69, 1,687.96) | 27.18*** | .005 |
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Authors
CHRISTIAN L. L. STRAUSS is a senior lecturer in the Department of Psychology and Human Development at Vanderbilt University, 230 Appleton Place #552, Nashville, TN 37203; e-mail:
PATRICK J. CURRAN is a professor in the Department of Psychology and Neuroscience at the University of North Carolina at Chapel Hill, 253 E. Cameron Avenue, Chapel Hill, NC 27514; e-mail: