Abstract
It is common in the educational and psychological sciences to collect data from individuals nested in hierarchical structures, such as students in classrooms. Further, in many instances, observed individual-level data are believed to indicate one or more unobserved latent variables (Bollen, 2002) that operate at both the individual and cluster levels.
There are two general approaches to multilevel measurement modeling: the aggregated approach and the disaggregated approach (Raudenbush & Bryk, 2002). Broadly, the aggregated approach involves fitting a measurement model to a conflation of level-1 and level-2 effects and correcting attenuation of standard errors due to clustering, and the disaggregated approach involves simultaneously and separately estimating a factor model at both level-1 and level-2. The advantages and disadvantages of these approaches are well established (Muthén & Satorra, 1995; Pornprasertmanit et al., 2014; Stapleton, McNeish, & Yang, 2016; Stapleton, Yang, & Hancock, 2016). Currently, however, methodological work has neglected to critically compare the reliability of factor score predictions extracted from multilevel measurement models. As past work has demonstrated, even well-researched factor analytic methods can produce factor score predictions that do not behave as expected or desired (Croon, 2002; Curran et al., 2016; McDonald & Burr, 1967; Skrondal & Laake, 2001). Thus, there is a need to expand our understanding of multilevel measurement modeling frameworks when the goal of the analysis is to obtain reliable level-1 and level-2 factor scores.
Factor scores are numerical predictions that indicate where an individual lies on an underlying continuous scale of some unobserved, or latent, construct such as intelligence, learning gains, or sense of belonging (Bartholomew et al., 2009). For example, a student might be asked a series of questions designed to measure overall learning gains. Items can then be combined to form a single summary score representing that student’s learning, which can offer insight to instructors on how individual students differentially benefited from the implementation of high-impact learning practices. In addition, classroom-level factor scores can be utilized institutionally to determine what components of high-impact learning are maximized in different courses. While factor scores for non-clustered data structures have been extensively studied (Fava & Velicer, 1992; Grice, 2001a, 2001b; McDonald & Burr, 1967; Skrondal & Laake, 2001; Velicer, 1976), a gap remains in the literature at the intersection of multilevel latent variable modeling and factor scoring.
The goal of our paper is to fill this gap by empirically investigating factor score predictions at both level-1 and level-2, extracted from aggregated and disaggregated measurement models, in the context of conditions commonly encountered in the educational and behavioral sciences and conditions relevant to multilevel measurement structures. Specifically, drawing on past research demonstrating the importance of considering sources of measurement non-invariance in single-level factor score predictions (Curran et al., 2016, 2018), we evaluate the reliability of factor score predictions from multiple modeling frameworks in the presence of forms of cross-level non-invariance (Jak, 2019; Jak et al., 2013, 2014; Jak & Jorgensen, 2017). To begin, we briefly review the technical details of confirmatory factor models for multilevel item response data within both approaches. Next, we describe factor-scoring methods conducive to both approaches. This will be followed by a detailed presentation of the subsequent simulation study and salient findings.
The Aggregated Approach to Multilevel Measurement Modeling
Returning to the example of measuring the effectiveness of high-impact learning at both the student- and classroom level, suppose a researcher collects item response data from students nested in classrooms and aims to measure learning outcomes at both levels of analysis. One approach is to begin by specifying and estimating a single-level confirmatory factor analysis model (CFA; e.g., Brown, 2006) with the student item response data. Of critical concern is the fact that the clustering attenuates standard errors and biases test statistics, leading to incorrect inference (Hox et al., 2018; Kamata et al., 2008; Muthén & Satorra, 1995). The aim of the aggregated approach is to correct for bias in standard errors associated with clustered data structures (Muthén, 1985, 1994; Stapleton, McNeish, & Yang, 2016).
Although CFA is well established in the literature, we present model equations to introduce a shared notational system across modeling frameworks and scores. For a CFA of $p$ items measuring $m$ latent factors for individuals $i = 1, \ldots, N$, the model is

$$\mathbf{y}_i = \boldsymbol{\nu} + \boldsymbol{\Lambda}\boldsymbol{\eta}_i + \boldsymbol{\varepsilon}_i, \tag{1}$$

$$\boldsymbol{\eta}_i \sim N(\mathbf{0}, \boldsymbol{\Psi}), \qquad \boldsymbol{\varepsilon}_i \sim N(\mathbf{0}, \boldsymbol{\Theta}). \tag{2}$$

Here, $\mathbf{y}_i$ is a $p \times 1$ vector of observed item responses, $\boldsymbol{\nu}$ is a $p \times 1$ vector of item intercepts, $\boldsymbol{\Lambda}$ is a $p \times m$ matrix of factor loadings, $\boldsymbol{\eta}_i$ is an $m \times 1$ vector of latent factors with $m \times m$ covariance matrix $\boldsymbol{\Psi}$, and $\boldsymbol{\varepsilon}_i$ is a $p \times 1$ vector of item residuals with $p \times p$ covariance matrix $\boldsymbol{\Theta}$.

Equations (1) and (2) imply the following mean and covariance structure:

$$\boldsymbol{\mu} = \boldsymbol{\nu}, \tag{3}$$

$$\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Psi}\boldsymbol{\Lambda}' + \boldsymbol{\Theta}. \tag{4}$$
Standard errors can then be corrected for attenuation due to clustering to aid in accurate model selection and inference. The most common correction procedure involves the application of formulas presented in Liang and Zeger (1986), which are extensions of general robust standard error methods (Eicker, 1967; Huber, 1967; White, 1980) derived for clustered data structures.
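For completeness, the general form of this cluster-robust (sandwich) variance estimator can be written in our notation; this is a standard textbook presentation of the Liang and Zeger (1986) correction, not a formula reproduced from the original article:

$$\widehat{\mathrm{Var}}(\hat{\boldsymbol{\theta}}) = \mathbf{A}^{-1} \left( \sum_{j=1}^{J} \mathbf{s}_j(\hat{\boldsymbol{\theta}})\, \mathbf{s}_j(\hat{\boldsymbol{\theta}})' \right) \mathbf{A}^{-1},$$

where $J$ is the number of clusters, $\mathbf{s}_j(\hat{\boldsymbol{\theta}})$ is the sum of casewise score (gradient) contributions to the log-likelihood within cluster $j$, and $\mathbf{A}$ is the observed information matrix. Ordinary maximum likelihood standard errors arise as the special case in which the middle term reduces to $\mathbf{A}$.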
While this approach addresses standard error attenuation, it ignores a critical aspect of multilevel analysis: aggregated approaches to multilevel modeling produce model parameter estimates of total or aggregated effects (Hox et al., 2018; Muthén, 1991), which are a conflation of unique level-1 (e.g., student) and level-2 (e.g., classroom) effects, weighted by cluster-based intraclass correlations (ICCs; Raudenbush & Bryk, 2002). In the context of the example assessing student and classroom-level learning outcomes associated with high-impact learning practices, a CFA with cluster-robust standard errors would be fit to raw item responses, which by definition represents a conflation of student and classroom characteristics. That is, the aggregated approach is unable to detect if item responses for a given student are due to a student’s individual learning gains or to an advantageous learning environment. Thus, if the goal of the analysis is to understand and distinguish individual-level constructs from cluster-level constructs, conflation of parameter estimates due to aggregation of effects may intrinsically lead to parameter estimate bias.
The Disaggregated Approach to Multilevel Measurement Modeling
To overcome this issue, multilevel confirmatory factor analysis (MCFA) separately and simultaneously estimates level-1 and level-2 effects, decomposing total effects into their constituent level-1 and level-2 components and allowing analysts to draw inferences unique to each level of analysis (Hox et al., 2018; Mehta & Neale, 2005; Stapleton, McNeish, & Yang, 2016). Muthén was the first to introduce a procedure for multilevel factor analysis, Muthén’s maximum likelihood, or MUML (Muthén, 1991, 1994), which separately estimates factor models associated with the level-1 and level-2 covariance structure, respectively. However, as computational power and efficiency improved over time, normal theory maximum likelihood has become the more ubiquitous and recommended method of MCFA (Hox et al., 2018; Yuan & Hayashi, 2005). We present equations below for notational consistency.
For an MCFA of $p$ items administered to individuals $i = 1, \ldots, n_j$ nested in clusters $j = 1, \ldots, J$, item responses are decomposed into between-cluster and within-cluster components:

$$\mathbf{y}_{ij} = \boldsymbol{\nu} + \mathbf{y}_{B_j} + \mathbf{y}_{W_{ij}}, \tag{5}$$

with a factor model specified for each component:

$$\mathbf{y}_{W_{ij}} = \boldsymbol{\Lambda}_W \boldsymbol{\eta}_{W_{ij}} + \boldsymbol{\varepsilon}_{W_{ij}}, \tag{6}$$

$$\mathbf{y}_{B_j} = \boldsymbol{\Lambda}_B \boldsymbol{\eta}_{B_j} + \boldsymbol{\varepsilon}_{B_j}, \tag{7}$$

where $\mathbf{y}_{W_{ij}}$ is the within-cluster (level-1) component of the item responses, $\mathbf{y}_{B_j}$ is the between-cluster (level-2) component (i.e., a vector of cluster-specific random intercepts), $\boldsymbol{\Lambda}_W$ and $\boldsymbol{\Lambda}_B$ are level-specific factor loading matrices, $\boldsymbol{\eta}_{W_{ij}}$ and $\boldsymbol{\eta}_{B_j}$ are level-specific latent factors with covariance matrices $\boldsymbol{\Psi}_W$ and $\boldsymbol{\Psi}_B$, and $\boldsymbol{\varepsilon}_{W_{ij}}$ and $\boldsymbol{\varepsilon}_{B_j}$ are level-specific residuals with covariance matrices $\boldsymbol{\Theta}_W$ and $\boldsymbol{\Theta}_B$.

Equations 5 to 7 imply a mean structure of zero at level-1, since all individual-level variables are centered at group means (Ryu, 2014a), and a level-2 mean structure of

$$\boldsymbol{\mu}_B = \boldsymbol{\nu}. \tag{8}$$

The model-implied covariance structures at level-1 and level-2, respectively, are

$$\boldsymbol{\Sigma}_W = \boldsymbol{\Lambda}_W \boldsymbol{\Psi}_W \boldsymbol{\Lambda}_W' + \boldsymbol{\Theta}_W, \tag{9}$$

$$\boldsymbol{\Sigma}_B = \boldsymbol{\Lambda}_B \boldsymbol{\Psi}_B \boldsymbol{\Lambda}_B' + \boldsymbol{\Theta}_B. \tag{10}$$
Equations 5 to 10 demonstrate key advantages of the MCFA over the CFA with cluster-robust standard errors. First, there are distinct model-implied covariance structures for level 1 and level 2, with separate matrices at each level of analysis. This implies that MCFA can systematically model cross-level measurement non-invariance. Cross-level measurement non-invariance refers to measurement models that differ in function and form across hierarchical levels of analysis. This includes configural non-invariance, wherein the factor structures differ across levels, and metric non-invariance, wherein the factor loadings differ across levels (Jak, 2019; Jak et al., 2013, 2014). Notably, a 2016 review found that 31% of multilevel factor models reported a different number of level-1 and level-2 factors, suggesting that cross-level configural non-invariance is fairly common (Kim et al., 2016). In addition, distinct model-implied covariance structures across levels imply that factor loadings can differ across level 1 and level 2 or can be held constant with equality constraints (Jak, 2019). For example, one item may be strongly predictive of student learning but less strongly predictive of general classroom-level curricular advantages. Finally, differences across clusters are captured and modeled in the random intercept component, which contextualizes item responses to their associated cluster.
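To make the distinction concrete, the sketch below shows how a two-level CFA with and without cross-level loading equality constraints might be specified in R with lavaan (0.6 series). The models in this article were estimated in Mplus; the data frame `dat`, items `y1`–`y4`, and cluster variable `classroom` are hypothetical placeholders.

```r
library(lavaan)

# Two-level CFA with freely estimated level-1 and level-2 loadings
model_free <- '
  level: 1
    fw =~ y1 + y2 + y3 + y4    # within (student) factor
  level: 2
    fb =~ y1 + y2 + y3 + y4    # between (classroom) factor
'

# Cross-level metric invariance: shared labels equate loadings across levels
model_metric <- '
  level: 1
    fw =~ l1*y1 + l2*y2 + l3*y3 + l4*y4
  level: 2
    fb =~ l1*y1 + l2*y2 + l3*y3 + l4*y4
'

fit_free   <- sem(model_free,   data = dat, cluster = "classroom")
fit_metric <- sem(model_metric, data = dat, cluster = "classroom")
```

With the marker-variable identification lavaan applies by default, the first loading is fixed to 1 at each level, so the shared labels constrain the remaining loadings to be equal across levels.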
Because of the MCFA’s ability to decompose level-1 and level-2 effects, it is often the preferred method of multilevel factor analysis (Hox et al., 2018; Pornprasertmanit et al., 2014); however, in many applications, disaggregated modeling approaches are not estimable or are numerically unstable due to the interplay of complex model specification at each level of analysis and limitations in the number of clusters available for analysis (Jak, 2019; Maas & Hox, 2005). In the same way that it would not be recommended to evaluate the validity of a scale on a sample of 50 individuals (MacCallum et al., 1999) and use unstable CFA parameter estimates to obtain factor score estimates (Skrondal & Rabe-Hesketh, 2004), it may be equally problematic to establish the function and form of a level-2 factor structure and extract level-2 factor scores with 50 clusters, due to the impacts of sampling variability. Therefore, the decision between aggregated and disaggregated approaches to multilevel measurement modeling often hinges on the conflict between ideal modeling and practical or viable modeling, given available data, particularly when the goal of the analysis is to assign reliable scores to multilevel constructs. It is to this we now turn.
Factor Scores
Factor scores have existed for nearly a century (M. S. Bartlett, 1937; Thomson, 1935, 1938; Thurstone, 1935) and have many practical uses. In a multilevel context, factor scores allow analysts to estimate a single-number summary of both where an individual stands on a level-1 latent construct and where a cluster stands on a level-2 latent construct.
Mean scores are one of the simplest, and consequently most ubiquitous, methods for predicting latent standings (Bauer & Curran, 2016). In the context of multilevel response data obtained from individuals nested in clusters, individual-level scores are computed by summing all items for a given latent variable and dividing by the total number of items, and cluster-level scores are computed by summing all individual-level scores in a cluster and dividing by the number of individuals in that cluster. While simple to compute and interpret, mean scores inherently assume equal weighting of items (McNeish & Wolf, 2020; Thissen & Wainer, 2001), which may not accurately represent all measurement structures in practice.
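As a concrete illustration, level-1 and level-2 mean scores can be computed in a few lines of base R; the data frame `dat`, item names `y1`–`y4`, and cluster identifier `classroom` are hypothetical placeholders.

```r
items <- paste0("y", 1:4)  # items indicating one latent factor (hypothetical)

# Level-1 mean score: average of an individual's item responses
dat$mean_l1 <- rowMeans(dat[, items])

# Level-2 mean score: average of the level-1 mean scores within each cluster
dat$mean_l2 <- ave(dat$mean_l1, dat$classroom, FUN = mean)
```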
An alternative approach to scoring involves extracting scores from more complex measurement models. In the context of the aggregated approach, specifically CFA, sample estimates of parameters in Equations 1 to 4 represent unbiased estimates of the aggregated total effects. Therefore, sample estimates can be used in standard factor-scoring formulas (M. S. Bartlett, 1937; Thomson, 1935; Thurstone, 1935) to obtain a single total effect score for each level-1 unit; however, as these estimates are based on a conflation of level-1 and level-2 effects, they may not (and likely do not) accurately capture true level-1 and level-2 processes, particularly in the presence of cross-level non-invariance. Importantly, while correcting standard errors for attenuation due to clustering is necessary with clustered data, factor scores, which rely solely on point estimates of model parameters, are not impacted by standard error corrections.
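For reference, one standard form of these factor-scoring formulas, the regression (Thomson) predictor, can be written in the notation of Equations 1 to 4 with sample estimates substituted for population parameters:

$$\hat{\boldsymbol{\eta}}_i = \hat{\boldsymbol{\Psi}} \hat{\boldsymbol{\Lambda}}' \hat{\boldsymbol{\Sigma}}^{-1} \left( \mathbf{y}_i - \hat{\boldsymbol{\nu}} \right),$$

where $\hat{\boldsymbol{\Sigma}} = \hat{\boldsymbol{\Lambda}} \hat{\boldsymbol{\Psi}} \hat{\boldsymbol{\Lambda}}' + \hat{\boldsymbol{\Theta}}$ is the model-implied item covariance matrix.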
Alternatively, factor scores can be extracted directly from the MCFA, which disaggregates level-1 and level-2 effects, allowing for specification and estimation of differences across level-1 and level-2 factor models (Jak, 2019; Jak & Jorgensen, 2017). Further, because MCFA decomposes covariance matrices into within- and between-components and simultaneously and separately estimates level-1 and level-2 factor models, separate level-1 and level-2 factor scores can be directly extracted, representing unique level-1 and level-2 effects, and cross-level non-invariance (or invariance) can be systematically modeled and incorporated into score predictions. At level 2, the random intercept, or latent-mean component of the MCFA, precludes direct computation of factor scores through matrix-based, closed-form equations, because the between-component of items is a per-cluster realization of a random intercept, and thus a latent variable itself. Therefore, empirical Bayes approaches to scoring are typically utilized in conjunction with maximum likelihood MCFA, and both level-1 and level-2 factor score estimates are computed by taking the mean of the posterior distribution of each latent variable, given the observed item responses and model parameter estimates.
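In R, for example, empirical Bayes score predictions can be extracted from a fitted two-level lavaan model such as `fit_free` above; the `level` argument of `lavPredict()` assumes a recent lavaan release in the 0.6 series, and the object names are again hypothetical.

```r
# Empirical Bayes factor score predictions from the two-level model
eta_w <- lavPredict(fit_free, level = 1L)  # level-1 scores: one row per individual
eta_b <- lavPredict(fit_free, level = 2L)  # level-2 scores: one row per cluster
```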
In sum, factor score estimates computed from aggregated approaches to multilevel measurement modeling are likely subject to the same general limitations associated with this modeling approach (i.e., scores may not accurately capture true level-1 and level-2 processes as these are based on a conflation of level-1 and level-2 effects), but the degree of this bias has not been empirically investigated. Further, factor score estimates from disaggregated approaches to multilevel measurement modeling may overcome this limitation, but the extent to which this is advantageous, in the presence of sampling variability, has not been established. Some recent research has explored the use of multilevel factor scores in subsequent analyses, with the goal of using scores to extract unbiased paths between multilevel latent variables (Devlieger & Rosseel, 2020; Kelcey et al., 2021), but these studies do not consider multiple methods to multilevel measurement (i.e., aggregated and disaggregated approaches) prior to factor score extraction. To our knowledge, no methodological research has been conducted to specifically determine the utility of level-1 and level-2 factor scores extracted from aggregated and disaggregated measurement models under conditions commonly encountered in practice and in the presence of different forms of cross-level non-invariance. This is our purpose here.
Simulation Study
Our simulation study was designed to critically evaluate the relation between true scores and factor score predictions extracted from aggregated and disaggregated multilevel measurement models, under conditions commonly encountered in practice. We selected the relation between true scores and factor scores as the primary outcome of interest, as opposed to other metrics such as standard errors of scores, given the marked importance of score estimate accuracy for applied researchers aiming to use scores to understand the nature of level-1 and level-2 effects. The population-generating model and simulation conditions were motivated by prior pilot analyses of a real educational dataset evaluating the effectiveness of course-based research at a large southern research university in the United States (Sathy et al., 2020), as well as additional follow-up analyses as data collection proceeded. This was balanced with the goal of procuring findings generalizable beyond this target dataset. Therefore, some design characteristics were specifically included to mirror additional situations commonly encountered in multilevel factor modeling applications.
Given the goals of analyses, hypotheses emphasized differences in aggregated and disaggregated measurement frameworks and factor score estimates. Hypotheses were further separated by level of analysis (i.e., level-1 and level-2) and by population-generating factor structure (i.e., one factor at level-2 and three factors at level-2). This resulted in a total of three primary hypotheses.
First, we hypothesized that level-1 factor scores extracted from the MCFA would be most closely related to true underlying level-1 scores, compared to regression scores from the CFA and mean scores. We further hypothesized that cluster-mean-centered regression scores would offer a viable alternative to level-1 factor scores, producing comparably reliable factor score estimates, and that imposing cross-level equality of factor loadings, even when this imposition is not supported by the population-generating model, would not meaningfully reduce the reliability of level-1 factor scores. Second, we hypothesized that when cross-level configural invariance is satisfied (i.e., an equivalent number of factors at level-1 and level-2), the MCFA with cross-level equality of factor loadings would produce the most reliable level-2 factor scores, given improved stability of model estimation. Third, we hypothesized that when cross-level configural invariance is not satisfied (i.e., a nonequivalent number of factors at level-1 and level-2), a properly specified MCFA would produce the most reliable level-2 factor scores, and that the mean of cluster-aggregated regression scores would be more reliable than other score types but would not outperform level-2 factor scores. All models and score types are explicated in the following section.
Simulation Design
To test hypotheses, we systematically varied five key components in our simulation design: (a) cross-level configural invariance; (b) number of clusters at level 2; (c) number of items per latent factor; (d) modeling procedure; and (e) scoring technique. Details of these design characteristics, as well as general data generation procedures, are outlined below. Throughout, extensive validation procedures were utilized to confirm the proper generation of data.
Level-2 True Scores
First, we simulated level-2 true scores to correspond to one of two population-generating factor structures: one factor at level-2, and three factors at level-2. With three factors at level 2, factor correlations were set to .20, .35, and .50 (see Figure 2). In addition, we simulated either 250 or 50 level-2 true scores (i.e., clusters), where the former was selected to match the target dataset of interest and the latter was selected to exemplify the lower end of the number of clusters necessary for accurate estimation of multilevel factor models (Maas & Hox, 2005).
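A minimal sketch of this step in R, drawing correlated level-2 true scores from a multivariate normal distribution; the assignment of the three correlations to particular factor pairs is illustrative rather than taken from the original design.

```r
library(MASS)

# Level-2 factor covariance matrix: unit variances, correlations .20/.35/.50
# (assignment of correlations to factor pairs is illustrative)
Psi_b <- matrix(c(1.00, 0.20, 0.35,
                  0.20, 1.00, 0.50,
                  0.35, 0.50, 1.00), nrow = 3, byrow = TRUE)

# Draw 250 clusters' worth of level-2 true scores
set.seed(1)
eta_b <- mvrnorm(n = 250, mu = rep(0, 3), Sigma = Psi_b)
```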
Level-2 Indicators
Next, latent means, or level-2 factor indicators, were simulated. With three latent factors at level 2, four, six, or eight indicators per factor were specified. With one latent factor at level 2, 18 indicators were specified, to align with a consistent level-1 factor structure. Standardized factor loadings ranged from 0.5 to 0.8 in increments of 0.1, and error terms were specified so that raw and standardized factor loadings were equivalent (Figure 1).
Figure 1. Population-generating models at level 2.
Level-1 True Scores
Level-1 true scores were then simulated using level-2 factor indicators and a level-1 factor structure of three factors with factor correlations of .20, .35, and .50 (identical to the factor correlations at level-2 with three factors; Figure 2). This further implied that 50% of simulated datasets did not satisfy configural cross-level invariance and 50% did satisfy this constraint. The number of true scores per cluster was sampled randomly, with replacement, from the empirical distribution of course enrollment from the target dataset, and the same randomly sampled cluster sizes were used in all replications. Observations per cluster ranged from 3 to 175, with a median cluster size of 21, to represent a distribution of courses that were mostly small but contained some large lecture sections. The first 50 cluster sizes were used when only 50 clusters were simulated. The level-1 sample size totaled
Figure 2. Population-generating model at level 1.
Level-1 Indicators
We then simulated level-1 item responses, with four, six, and eight items per factor, setting factor loadings to be proportional to level-2 loadings. Specifically, raw loadings were selected such that item communalities were equal across level 1 and level 2 to create more reasonable comparisons for design factors of interest (given the impact of communalities on factor score predictions is well studied, e.g., Fava & Velicer, 1992) and such that item ICCs were set to .20 for all items. The population data-generating models are depicted in Figures 1 and 2 and a summary of the data-generating processes is outlined in Figure 3.
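To make the ICC specification concrete: an item's ICC is the proportion of its total variance that lies between clusters. Under the variance specifications described in the figure notes below (a between-item variance of 1 and a within-item variance of 4), each item's ICC is

$$\mathrm{ICC} = \frac{\sigma^2_{B}}{\sigma^2_{B} + \sigma^2_{W}} = \frac{1}{1 + 4} = .20.$$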
Figure 3. Summary of data-generating process.
Top panel features the 3-factor level-2 structure and the bottom panel features the 1-factor level-2 structure. Raw loadings were simulated to be equivalent to standardized loadings by setting error variance to $1 - \lambda^2$, where $\lambda$ denotes the standardized loading.
Raw loadings, indicated in Figure 2, were simulated to be twice the standardized loading (e.g., raw loadings of 1.6 correspond to standardized loadings of 0.8) by setting error variance to $4 - \lambda^2$, where $\lambda$ denotes the raw loading, yielding a total item variance of 4 at level 1.
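As a quick sanity check on this variance algebra, a small R illustration using the example values above (not code from the original study):

```r
lambda_raw <- 1.6                         # raw level-1 loading from the example
theta      <- 4 - lambda_raw^2            # implied error variance: 1.44
total_w    <- lambda_raw^2 * 1 + theta    # within-item variance (factor variance 1): 4
lambda_std <- lambda_raw / sqrt(total_w)  # recovered standardized loading: 0.8
icc        <- 1 / (1 + total_w)           # between variance 1 over total 5: .20
```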
Model Estimation
We next estimated a series of measurement models within each simulated dataset, inducing nesting of scoring procedures within replications. For population-generating models satisfying configural cross-level invariance we fit the following measurement models: (1) CFA with cluster-robust standard errors; (2) MCFA freely estimating level-1 and level-2 loadings; and (3) MCFA imposing cross-level factor loading equality constraints, or metric invariance. While (3) did not precisely conform to the population-generating structure, metric invariance was tenable given factor loadings at level-1 and level-2 were proportional. Further, evidence suggests that imposing cross-level invariance can improve factor interpretability and enhance the stability of model estimation (Jak, 2019; Kim & Cao, 2015). Given that factor scores are highly subject to model instability in the single-level case (Skrondal & Rabe-Hesketh, 2004), it is likely that instability in the multilevel measurement models will deleteriously impact factor score predictions, and that this may be remediated by imposing parameter constraints.
For population-generating models not satisfying configural cross-level invariance, only (1) and (2) were estimated. Without satisfying the equivalence of factor structure across levels, it makes little sense to impose equivalence of loadings.
Factor Score Predictions
Factor score predictions were also computed within each simulated dataset. First, we computed level-1 and level-2 mean scores for each latent factor, as described previously. Next, we extracted level-1 factor scores, specifically regression scores (Thomson, 1935; Thurstone, 1935), from the CFA with cluster-robust standard errors, which produces factor scores equivalent to those from the standard CFA but was selected to better align with the clustered data-generating process. We aggregated these to level 2 by summing scores within the same cluster and dividing by the total number of observations per cluster. Notably, three level-2 factor scores were computed even when the data-generating mechanism had only one level-2 factor. This was justified because an analyst using the aggregated approach may not have information about differences in level-1 and level-2 factor structures, in turn enhancing the external validity of results.
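In R, this aggregation is simply a cluster mean of the level-1 regression scores; the variable names below are hypothetical.

```r
# Cluster-aggregated regression score: mean of level-1 regression scores per cluster
dat$reg_l2_agg <- ave(dat$reg_l1, dat$classroom, FUN = mean)
```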
In addition, a series of simple transformations were applied to mean scores and regression scores extracted from the CFA to rescale scores to better match what is accomplished under the full MCFA. These transformations were designed to be as uncomplicated as possible, so they could be easily implemented in practice. The goal in including these scores was to see if a less complex model could produce comparably reliable factor score estimates under one or both of two conditions: (a) Number of available clusters limits stable estimation of the full MCFA; and (b) an analyst is otherwise unable to estimate the full MCFA due to lack of expertise in advanced modeling. Specifically, two manipulations were conducted.
The first involved level-1 regression scores from the CFA and level-1 mean scores. Because the MCFA disaggregates level-1 and level-2 effects and other methods do not, level-1 factor scores from the MCFA indicate a given level-1 unit’s distance from their associated cluster mean, whereas level-1 factor scores from aggregated approaches indicate a given level-1 unit’s distance from the grand mean of all clusters. Therefore, we additionally group-mean-centered (Raudenbush & Bryk, 2002) level-1 mean scores and level-1 factor scores from the CFA so these would better indicate level-1 deviations from level-2 effects.
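Concretely, this centering subtracts each cluster's mean score from the scores of its members (variable names hypothetical):

```r
# Group-mean-center level-1 scores: deviations from the cluster mean
dat$reg_l1_cmc  <- dat$reg_l1  - ave(dat$reg_l1,  dat$classroom, FUN = mean)
dat$mean_l1_cmc <- dat$mean_l1 - ave(dat$mean_l1, dat$classroom, FUN = mean)
```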
The second transformation was specifically designed to accommodate differing factor structures at level 1 and level 2. As noted, because the CFA does not allow an analyst to empirically determine if factor structures differ at level 1 and level 2, we computed three level-2 cluster-aggregated regression scores, even when the population-generating model had only a single latent factor at level 2. This created a natural conflation between the modeling approach (i.e., aggregated versus disaggregated) and the number of items per factor. In practice, researchers may have theoretical reasons to expect a single construct at level 2 even when multiple constructs operate at level 1; therefore, we additionally computed the unweighted mean of the three cluster-aggregated regression scores to serve as a single level-2 score estimate.
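Computationally, this second transformation reduces to a row mean over the three cluster-aggregated scores; the data frame `clus` and its column names are hypothetical placeholders.

```r
# Unweighted mean of the three cluster-aggregated regression scores
clus$reg_l2_avg <- rowMeans(clus[, c("agg_f1", "agg_f2", "agg_f3")])
```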
Lastly, level-1 and level-2 factor scores were extracted from a properly specified MCFA in all conditions. When cross-level configural invariance was established, scores were also extracted from an MCFA with equality constraints on level-1 and level-2 loadings. All measurement models and factor score estimates were obtained using Mplus version 8 (Muthén & Muthén, 1998–2023) and output was compiled using MplusAutomation in R (Hallquist & Wiley, 2018). A summary of modeling procedures and score estimates is presented in Table 1.
Table 1. Measurement Models and Factor Scores
Outcome Computation
For all estimated factor scores, Pearson product-moment correlations between each score estimate and its associated true score value were computed; this correlation is a direct estimate of the reliability index of factor scores (Estabrook & Neale, 2013). When the population-generating model had only one factor but three factor scores were computed (i.e., mean scores and cluster-aggregated regression scores), correlations were computed between the single true score and the factor score with the most indicators. This factor score was selected given it was consistently more strongly associated with the single true score compared to the other estimated factor scores.
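In each replication, these outcomes amount to one Pearson correlation per score type and level; a minimal sketch with hypothetical object names:

```r
# Reliability index: correlation between factor score estimates and true scores
rel_l1 <- cor(dat$eta_hat_l1,  dat$eta_true_l1)   # level-1 scores
rel_l2 <- cor(clus$eta_hat_l2, clus$eta_true_l2)  # level-2 scores
```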
Evaluating Hypotheses
Given reliability was computed using a value with a known and interpretable metric (i.e., the correlation coefficient), graphical analyses were selected as the primary method to determine meaningful differences in factor score reliability.
Results
Convergence
Although past research has suggested 50 clusters as a benchmark for stable estimation of the full MCFA, justifying its inclusion as a simulation condition (Maas & Hox, 2005), convergence issues were noted in some design cells. Specifically, 35 out of 500 (7.00%) replications did not converge when the full MCFA, without cross-level constraints, was fit to a population-generating model with three level-2 factors. In addition, 5 out of 500 (1.00%) replications did not converge when the full MCFA with cross-level constraints was fit to the population-generating model with three level-2 factors. In most cases, convergence issues consisted of Heywood cases (i.e., negative residual variance estimates), with a smaller number of solutions terminating at saddle points. Because so few models failed to converge out of the total 5,000 measurement models fit, these were treated as missing and additional datasets were not simulated to replace non-converged solutions.
Level-1 Factor Scores
Next, we evaluated factor score reliability for all level-1 factor score estimates (see Figure 4). Hypothesis (1) was generally supported in that level-1 mean scores and level-1 regression scores were notably less reliable than level-1 scores from the MCFA and cluster-mean-centered regression scores. Average correlations for mean scores and regression scores were 0.80.
Figure 4. Level-1 factor score and true score correlations.
We also noted findings outside of the key hypotheses. Specifically, while average correlations were not substantially different with different numbers of clusters, the range of correlation coefficients was larger with fewer clusters. This was especially apparent for mean scores and uncentered regression scores, wherein minimum factor score and true score correlations differed by as much as 0.10 when comparing models fitted to 50 clusters as opposed to 250 clusters. Finally, as expected, factor score and true score correlations were generally higher with a larger number of items per factor.
Level-2 Factor Scores With Three Level-2 Population-Generating Factors
Next, we evaluated level-2 factor scores for all data simulated from a population-generating model with three level-2 factors, thus satisfying cross-level configural invariance. In this subset of conditions, there was no conflation between the number of items per factor and the measurement modeling or factor-scoring procedure, and an average of cluster-aggregated regression scores was not computed.
Generally, there were less pronounced differences in average factor score and true score correlations across design conditions compared to level-1 factor scores (Figure 5). Hypothesis (2) predicted that scores from the MCFA with cross-level factor loading constraints would be most reliable, and this was largely supported. Pooling across other simulation conditions, mean scores correlated an average of 0.81 with true scores.
Figure 5. Level-2 factor score and true score correlations with three level-2 population-generating factors.
As expected, reliability differences occurred when comparing the number of items per factor, such that average correlations for factor scores based on eight indicators per factor were 0.04 units higher than those based on four indicators per factor. Further, there were noted differences in the range of factor score reliability when specific scoring techniques were applied to data with fewer level-2 clusters available for analysis. Specifically, factor score reliability was as low as .36 for level-2 factor scores extracted from an MCFA fit to only 50 clusters. This improved markedly when imposing cross-level factor loading equality constraints, wherein minimum reliability increased to .51. Minimum reliability estimates were highest for cluster-aggregated regression scores.
Level-2 Factor Scores With One Level-2 Population-Generating Factor
Finally, we considered level-2 factor scores when a single factor existed at level-2, and thus cross-level configural invariance was not satisfied. Here, since 18 items indicated the single underlying latent factor at level-2, the number of items per factor was no longer fully crossed with the scoring procedure.
Hypothesis (3) stated that the unweighted mean of cluster-aggregated regression scores would outperform mean scores and the separate regression scores, as this score better matched the true population-generating structure; however, this score would not be as reliable as level-2 scores from the MCFA. This was not the case. In fact, the mean of cluster-aggregated regression scores was consistently more reliable than level-2 factor scores from the properly specified MCFA.
Again, the range of reliability was larger with fewer available level-2 clusters, particularly when scores were based on fewer indicators (i.e., mean scores and regression scores). This trend mirrored observations in Figure 5.
Finally, reliability was generally lowest when correlations were computed between true scores and a single level-2 mean score.
Figure 6. Level-2 factor score and true score correlations with one level-2 population-generating factor.
Discussion and Conclusion
The goal of our simulation study was to augment current knowledge about aggregated and disaggregated multilevel measurement model approaches by expanding this work to consider the utility of factor score predictions from these approaches. Further, we aimed to offer factor-scoring options and solutions when a full and properly specified MCFA may not be a viable modeling possibility. Broadly, results suggested that factor score reliability is not deterministically related to modeling procedure in all instances, but is instead driven by the extent to which scores are designed to capture the function and form of true level-1 and level-2 effects and the extent to which the impacts of sampling variability are reasonably considered and mitigated. To explicate these findings, we return to the original research hypotheses.
Level-1 Factor Scores: Hypothesis (1)
Hypothesis (1) stated that level-1 factor scores from the MCFA would be most highly correlated with true scores, but that cluster-mean-centered regression scores would be a viable alternative. Further, when appropriate to impose cross-level factor loading equality constraints (i.e., when configural invariance is minimally satisfied), we hypothesized this imposition would not meaningfully reduce factor score reliability. This was supported in all simulation design cells. First, we found that both cluster-mean-centered regression scores and level-1 factor scores from an MCFA with cross-level factor loading equality constraints were comparably reliable to scores from the MCFA without cross-level factor loading equality constraints. This finding is particularly salient when considering an analytical scenario with a limited number of level-2 clusters. Specifically, in some simulated datasets with only 50 clusters, the properly specified MCFA did not converge; in all cases the CFA with cluster-robust standard errors was estimable, and the MCFA with cross-level factor loading constraints converged more often than the MCFA without these constraints. Therefore, our findings suggest there are available scoring techniques that will produce reliable level-1 factor score predictions, even in instances where an MCFA is not estimable due to limited level-2 clusters and/or increased model complexity.
Level-2 Factor Scores With Three Level-2 Population-Generating Factors: Hypothesis (2)
Hypothesis (2) stated that when cross-level configural invariance is satisfied, the MCFA with cross-level factor loading equality constraints would produce the most reliable level-2 factor scores. This hypothesis was supported. Further, other scoring techniques were comparably reliable for all practical purposes, including mean scores. The most pronounced differences occurred when considering the range of reliability. Specifically, within some samples factor scores were unacceptably unreliable with fewer clusters available for analysis. Notably, from the perspective of sampling variability, cluster-aggregated regression scores were the least likely to be deleteriously impacted by limited available clusters.
Level-2 Factor Scores With One Level-2 Population-Generating Factor: Hypothesis (3)
Hypothesis (3) stated that when level-1 and level-2 factor structures differed, the properly specified MCFA would produce the most reliable level-2 factor scores, followed by the mean of cluster-aggregated regression scores. While factor scores from the MCFA were more reliable than cluster-aggregated regression scores and mean scores, the mean of cluster-aggregated regression scores and the mean of cluster-aggregated mean scores were consistently the most reliable. This demonstrated the overall importance of correctly specifying the level-2 factor structure. When factor score estimates did not accurately represent the true level-2 data-generating factor structure (i.e., three level-2 factor scores were estimated for a single data-generating factor), factor scores were consistently less reliable. Decades of work have highlighted the importance of extracting the correct number of factors (Cattell, 1958), and this is further established by our simulation in a multilevel setting, particularly when cross-level configural invariance is not achieved.
Limitations
A key consideration in deciding between the disaggregated and aggregated approaches to multilevel measurement modeling is to determine which level of analysis is of primary interest (Stapleton, McNeish, & Yang, 2016). Our study was designed to determine optimal methods of factor score extraction when the goal is to obtain reliable estimates of both true level-1 and level-2 effects. It is important to note that factor score reliability was estimated by correlating factor scores with true level-1 scores. That is, reliability was likely lower for mean scores and uncentered regression scores because these are estimates of aggregated total effects, and not disaggregated level-1 effects. Consequently, findings should not be generalized to situations where the goal of the analysis is to obtain reliable estimates of the total effect, in which case uncentered scores may be preferable.
In addition, all datasets were simulated with proportional factor loadings across level-1 and level-2. Because of this proportionality, imposing metric invariance in simulation conditions when configural invariance was supported did not significantly reduce model fit in most cases. That is, while factor loadings were not precisely equal across level 1 and level 2, they were proportional, thus metric invariance was tenable and likely to be accepted by an analyst in practice. Because of this, the benefits of this constraint should not be generalized to scenarios in which metric invariance would not be supported by tests of model fit.
In addition, our simulation did not systematically consider model fit and model building. Throughout model estimation, we confirmed that models achieved adequate fit (Ryu, 2014b; West et al., 2012) on average but did not pursue model modification. This decision was justified given the present focus on scoring; however, additional methodological contributions are necessary to better understand the role of model misfit and misspecification on factor score estimation.
Finally, our simulation exclusively considered score reliability as the sole evaluative metric. In some cases, this was at the expense of score validity. For example, to maximize the external validity of findings, when a CFA with cluster-robust standard errors was fit to a population-generating model with a single level-2 factor, three level-2 factor scores were computed. Importantly, all level-2 factor scores computed in this condition were theoretically invalid. Therefore, even though correlations with true scores were reasonably large on average, these scores represented constructs that did not exist at level 2, and use of them would likely lead to erroneous conclusions. We underscore the conclusion that maximal reliability can only be obtained with proper knowledge of the true population-generating factor structure, and that reliability is likely to diminish as scores become more and more invalid. Therefore, we encourage applied researchers to critically consider underlying theory and ensure properly specified factor models prior to computing factor scores.
Future Directions
In summary, results consistently suggested that simple manipulations to factor scores from the aggregated approach offer score predictions that are comparably reliable to, and in some samples more reliable than, scores extracted from a properly specified MCFA. In addition, imposing cross-level factor loading constraints, when reasonably supported, improved factor score reliability in the disaggregated modeling approach. These findings extend to research scenarios where the goal of the analysis is to descriptively explore latent constructs at multiple levels of analysis. For example, results can be applied to critical evaluations of distributions of learning gains at both the student- and classroom levels. We did not, however, extend our simulation to evaluate the use of factor scores in subsequent models. Similar to the single-level case, this study serves as a necessary precursor (Curran et al., 2016) motivating future work in the use of multilevel factor scores in subsequent models (Curran et al., 2018). Specifically, given the ability of less complicated models (i.e., CFA with cluster-robust standard errors) to recover reliable level-1 and level-2 effects under certain conditions, our findings offer promising pathways forward to consider how and when these scores may be used to understand relations between latent variables at multiple levels of analysis.
Appendix Meta-Models
Level-2 Meta-Model
| Effect | df | F | Generalized η² |
|---|---|---|---|
| Three factors at level 2 | | | |
| Number of clusters | (1, 960) | 16.25*** | .006 |
| Factor-scoring procedure | | | |
| Number of items per factor | | | |
| Number of clusters: Factor-scoring procedure | (1.64, 1,571.86) | 134.42*** | .011 |
| Number of clusters: Number of items per factor | (1.98, 1,903.08) | 3.59** | .002 |
| Factor-scoring procedure: Number of items per factor | (3.53, 3,389.45) | 12.78*** | .002 |
| One factor at level 2 | | | |
| Number of clusters | (1, 998) | 39.74*** | .031 |
| Factor-scoring procedure | | | |
| Number of clusters: Factor-scoring procedure | (1.69, 1,687.96) | 27.18*** | .005 |
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Authors
CHRISTIAN L. L. STRAUSS is a senior lecturer in the Department of Psychology and Human Development at Vanderbilt University, 230 Appleton Place #552, Nashville, TN 37203; e-mail:
PATRICK J. CURRAN is a professor in the Department of Psychology and Neuroscience at the University of North Carolina at Chapel Hill, 253 E. Cameron Avenue, Chapel Hill, NC 27514; e-mail: