Introduction
The reliability of measurement tools is a foundational concern in psychological research. Inferences about traits, behaviors, and processes depend on the extent to which instruments consistently capture the constructs they are intended to measure. Within psychology, the subfield of psychometrics focuses on assessing the validity and reliability of instruments, often using the framework of classical test theory (CTT). CTT has guided the development of reliability coefficients such as Cronbach’s alpha, split-half reliability, and the Test–Retest Coefficient (TRC). While the structural underpinnings of coefficients like Cronbach’s alpha and the split-half method have been extensively examined and critiqued (Sijtsma, 2008; Webb et al., 2006), the TRC has been predominantly examined on experimental grounds, which leaves its structural robustness underexplored. Despite these gaps, the TRC remains a foundational measure of reliability in behavioral science (Berchtold, 2016; Polit, 2014).
The TRC is based on the assumptions of perfectly stable true scores and the absence of systematic errors, both of which are conditions that are seldom met in real-world applications. Surprisingly, no research has investigated how the TRC performs when these assumptions are violated, which raises important questions about its validity and utility. This study seeks to address these issues by delineating the TRC’s theoretical foundations within the CTT framework and evaluating its reliability and feasibility through simulation.
The paper is organized as follows. First, the logic of the TRC is outlined within the framework of CTT, with a brief emphasis on its assumptions and derivations. Next, a simulation study examines the TRC’s performance under varying conditions, including true score stability and error score dependence. Finally, the implications of these findings are discussed, along with recommendations for the use and interpretation of the TRC in psychometric research.
Reliability in Classical Test Theory
The framework of CTT posits that every observed score X is the sum of a true score T and an error score E:

$$X = T + E \quad (1)$$

The true score T is defined as the expected value of an individual's observed scores across hypothetical repeated administrations:

$$T = \mathbb{E}(X) \quad (2)$$

Furthermore, CTT defines the expected value of the error scores as zero, with errors uncorrelated with true scores:

$$\mathbb{E}(E) = 0, \qquad \rho(T, E) = 0 \quad (3)$$

Within CTT, reliability is defined as the proportion of variance that is caused by the true score, divided by the total observed variance of the measurement (Lord & Novick, 2008):

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \quad (4)$$

The square root of this expression, the index of reliability, equals the correlation between observed and true scores. Because of the earlier assumptions about true and error scores, the TRC, the correlation between scores from two administrations of the same test, simplifies as follows:

$$\rho(X_1, X_2) = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} = \rho_{XX'}$$
Unlike single-administration coefficients, such as McDonald's ω or Cronbach's α, the TRC requires administering the same instrument to the same sample on two separate occasions.
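The equivalence in this ideal case can be checked numerically. The sketch below (Python for illustration; the article's own simulations were run in R, and all names and parameter values here are arbitrary assumptions) simulates fixed true scores with independent errors and recovers the variance ratio as the test–retest correlation:

```python
import math
import random

def pearson(xs, ys):
    # Plain Pearson correlation, kept dependency-free on purpose.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_trc(n=100_000, var_t=4.0, var_e=1.0, seed=1):
    # Fixed true scores; fresh independent error on each administration.
    rng = random.Random(seed)
    t = [rng.gauss(0, math.sqrt(var_t)) for _ in range(n)]
    x1 = [ti + rng.gauss(0, math.sqrt(var_e)) for ti in t]
    x2 = [ti + rng.gauss(0, math.sqrt(var_e)) for ti in t]
    return pearson(x1, x2)

trc = simulate_trc()  # expected: var_t / (var_t + var_e) = 0.80
```

With a large simulated sample, the observed correlation lands very close to the theoretical ratio of 4/(4 + 1) = .80.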
Factors Affecting the Test–Retest Correlation
While the preceding section showed that the TRC can, under ideal conditions, serve as an exact measure of reliability, this equivalence depends on a range of assumptions that are rarely fully, if ever, satisfied in practice. These assumptions stem not only from the theoretical framework of CTT itself but also from the statistical properties of the correlation coefficient—typically Pearson's r—used to compute it.
The first and perhaps most straightforward condition is the requirement of an adequate sample size. As stated above, in CTT, the reliability of a measurement is quantified using the correlation coefficient. The standard error of the Pearson correlation coefficient is a function of both sample size and the magnitude of the correlation itself, meaning that smaller true correlations require even larger samples to achieve stable estimates (Bowley, 1928). When sample sizes are too small for a given correlation magnitude, the sample correlation coefficient becomes an unstable estimate of the true population value. As a result, even a measurement with high true reliability may appear less (or more) reliable due to sampling variability. Notably, simulation studies indicate that even large correlations require substantial sample sizes before their estimates stabilize.
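The joint dependence of precision on n and on the correlation's magnitude can be made concrete with the classical large-sample approximation SE ≈ (1 − r²)/√(n − 1); a minimal sketch (Python, illustrative values):

```python
import math

def se_pearson(r, n):
    # Classical large-sample approximation to the standard error of
    # Pearson's r (cf. Bowley, 1928).
    return (1 - r ** 2) / math.sqrt(n - 1)

# At the same n = 50, a weaker correlation is estimated less precisely:
se_high = se_pearson(0.9, 50)  # approx. 0.027
se_low = se_pearson(0.5, 50)   # approx. 0.107
```

The weaker correlation's standard error is roughly four times larger at the same sample size, which is why modest correlations demand disproportionately large samples.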
A second critical assumption of the TRC concerns the temporal stability of the trait being measured. As outlined in the preceding section, CTT defines the true score as the expected value of observed scores across repeated measurements. By this definition, the true score is assumed to be constant for each individual, and any deviations between test administrations are attributed solely to random error. However, psychological traits often exhibit some degree of intra-individual variation over time, even when labeled as “stable.” A longstanding distinction exists between psychological traits, which are considered relatively enduring dispositions (e.g., cognitive ability and personality), and psychological states, which reflect more transient experiences (Fridhandler, 1986). By their very nature, states fluctuate across time and thus violate the assumption of fixed true scores. Even among stable traits, empirical retest correlations over time commonly fall within a range of r ≈ .6 to .9 (Asendorpf & Wilpers, 1998; Breit et al., 2024; Costa et al., 2012; Rantanen et al., 2007; Scharfen et al., 2018). These estimates are not derived from a single method but span a variety of analytical techniques, including simple test–retest correlations, average inter-measure correlations, latent growth models, and meta-regressions. Regardless of method, they suggest that, while stable traits may approximate temporal invariance, they do not fulfill the strict CTT assumption of unchanging true scores.
A third factor influencing the TRC is the relative contribution of true and error score variances to the total observed variance. As mentioned above, in psychological testing, scores from repeated administrations typically contain both components. Again, under the assumptions of CTT, the true score is defined as fixed for each individual, making any observed fluctuations attributable solely to error. When this condition holds, changes in the TRC reflect differences in error variance: higher error variance reduces the TRC, while lower error variance increases it. This follows from the structure of the correlation formula, where error variance inflates the denominator but does not affect the covariance in the numerator. If, however, true scores also fluctuate over time, they contribute additional variance to the denominator, thereby reducing the TRC even further. The TRC would then become a joint function of both temporal stability and the relative proportions of true and error score variance. This can also be demonstrated formally. Given two observed scores from repeated administrations,

$$X_1 = T_1 + E_1, \qquad X_2 = T_2 + E_2 \quad (5)$$

the TRC is the Pearson correlation between them:

$$\rho(X_1, X_2) = \frac{\mathrm{Cov}(X_1, X_2)}{\sigma_{X_1}\,\sigma_{X_2}} \quad (6)$$
Using the decomposition (5) and the independence assumptions, the covariance reduces to:

$$\mathrm{Cov}(X_1, X_2) = \mathrm{Cov}(T_1, T_2) \quad (7)$$
Assuming equal variance for all components,

$$\sigma_{T_1}^2 = \sigma_{T_2}^2 = \sigma_T^2, \qquad \sigma_{E_1}^2 = \sigma_{E_2}^2 = \sigma_E^2 \quad (8)$$

and expressing the true score covariance in terms of the stability correlation ρ_T between T₁ and T₂,

$$\mathrm{Cov}(T_1, T_2) = \rho_T\,\sigma_T^2 \quad (9)$$
Substituting equations (7) to (9) into equation (6) then yields:

$$\rho(X_1, X_2) = \frac{\rho_T\,\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \quad (10)$$

The expected TRC is thus the product of true score stability and the variance ratio, that is, of stability and reliability.
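This product form (stability times variance ratio) can be verified with a small Monte Carlo sketch (Python, illustrative; the parameter values and function names are arbitrary assumptions):

```python
import math
import random

def pearson(xs, ys):
    # Plain Pearson correlation without external dependencies.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_trc(n=200_000, var_t=3.0, var_e=1.0, rho_t=0.8, seed=3):
    # True scores now drift between occasions with correlation rho_t;
    # errors remain independent across administrations.
    rng = random.Random(seed)
    sd_t, sd_e = math.sqrt(var_t), math.sqrt(var_e)
    x1, x2 = [], []
    for _ in range(n):
        t1 = rng.gauss(0, sd_t)
        t2 = rho_t * t1 + math.sqrt(1 - rho_t ** 2) * rng.gauss(0, sd_t)
        x1.append(t1 + rng.gauss(0, sd_e))
        x2.append(t2 + rng.gauss(0, sd_e))
    return pearson(x1, x2)

trc = simulate_trc()  # expected: 0.8 * 3 / (3 + 1) = 0.60
```

Even though the instrument's variance ratio is .75, the imperfect stability of .8 pulls the observed retest correlation down to roughly .60.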
Beyond their proportional relationship, the absolute size of these variance components also shapes the distribution of observed scores. As total variance increases, the spread of scores becomes wider, making extreme values more likely. This is particularly problematic in small samples, where outliers can disproportionately influence statistical estimates and reduce the stability of the TRC. Broader distributions often exhibit heavier tails, meaning that even a small number of extreme observations can distort reliability estimates. Large sample sizes may still be required to achieve stable results, even when working with relatively light-tailed distributions, especially when population parameters must be estimated from empirical data (Wilcox, 2010). In such contexts, the TRC becomes more volatile, less reflective of true reliability, and more sensitive to random fluctuations.
Lastly, the TRC also depends on the independence of error scores. As discussed above, CTT assumes that error scores are random and independent across measurements. When error scores remain independent, they do not contribute to the correlation between repeated measurements. However, if error scores become dependent, meaning factors from the first measurement influence the second measurement, this dependency introduces covariance between the scores, which increases the resulting TRC. These errors are known as systematic errors because they arise from consistent factors related to the testing situation, sample characteristics, training effects, or other shared influences between the measurements (Polit, 2014). Unlike random error, which only reduces the reliability estimate, systematic error inflates the TRC by confounding the true score signal with extraneous variance that is shared across administrations. This makes it increasingly difficult to disentangle genuine reliability from the consistency of shared influences and makes a clear interpretation of the TRC impossible.
This too can be formally demonstrated. Starting from the classical decomposition of observed scores (equation (5)), the assumption that error terms are uncorrelated across time is now relaxed. The covariance between observed scores then becomes:

$$\mathrm{Cov}(X_1, X_2) = \mathrm{Cov}(T_1, T_2) + \mathrm{Cov}(E_1, E_2) \quad (11)$$
Assuming, as before, equal variances for all components, it again follows that:

$$\sigma_{X_1}^2 = \sigma_{X_2}^2 = \sigma_T^2 + \sigma_E^2 \quad (12)$$
Using the same logic as in equation (9), the true score and error covariances can be expressed as:

$$\mathrm{Cov}(T_1, T_2) = \rho_T\,\sigma_T^2, \qquad \mathrm{Cov}(E_1, E_2) = \rho_E\,\sigma_E^2 \quad (13)$$
Substituting equations (11) to (13) into the formula for the TRC yields:

$$\rho(X_1, X_2) = \frac{\rho_T\,\sigma_T^2 + \rho_E\,\sigma_E^2}{\sigma_T^2 + \sigma_E^2} \quad (14)$$
This demonstrates that, when error dependence is introduced, the TRC becomes a weighted sum of true score stability and error score dependence, with each component scaled by its relative contribution to the total variance. As a consequence, the TRC no longer reflects a pure estimate of reliability but rather a combination of genuine signal and shared systematic error.
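A direct evaluation of equation (14) makes the inflation concrete (illustrative Python; the variance and correlation values are arbitrary assumptions):

```python
def expected_trc(var_t, var_e, rho_t, rho_e):
    # Direct evaluation of equation (14): weighted sum of true score
    # stability (rho_t) and error dependence (rho_e).
    return (rho_t * var_t + rho_e * var_e) / (var_t + var_e)

# Dependent errors push the coefficient up even though nothing about
# the instrument's true reliability has changed:
no_dependence = expected_trc(3.0, 1.0, 0.8, 0.0)    # 0.600
with_dependence = expected_trc(3.0, 1.0, 0.8, 0.5)  # 0.725
```

Holding stability and the variance ratio fixed, an error correlation of .5 alone lifts the expected TRC from .60 to .725.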
Taken together, the TRC is influenced by several factors: sample size, the stability of the measured trait, the variance of the true score, the variance of the error score, and the independence or dependence of errors. In any given study, these assumptions may be violated to varying degrees and may even interact with one another. While numerous experimental investigations of the TRC exist (e.g., Polit, 2014), no study to date has systematically examined how the TRC performs when these factors are manipulated. Therefore, this study simulates data under varying conditions of sample size, true score stability, and the variances of both true and error scores to explore the TRC’s performance under realistic conditions.
Simulation Studies
To systematically investigate the conditions under which the TRC provides stable and accurate estimates of reliability, two separate simulation studies were conducted. Study 1 examined the effects of sample size, variance ratios, and true score stability under the assumption of independent error scores. Study 2 extended this by introducing correlation between error scores to assess how dependent error structures affect the TRC. Simulations were carried out using R (R Core Team, 2024).
For both studies, 1,000 samples were simulated per condition combination. The average correlation across these 1,000 datasets served as an estimate of the TRC's accuracy, while the standard deviation of the correlations was used as a measure of coefficient stability. This setup rests on a key statistical property of the correlation coefficient: for bivariate normal data, the sampling distribution of Pearson's r is approximately normal around the population value, with a dispersion that shrinks as sample size increases.
This property allows the standard deviation of the 1,000 simulated correlations in each condition to serve as an empirical estimate of the coefficient's standard error and, hence, of its stability.
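The logic can be illustrated by comparing the empirical SD of replicated sample correlations against the large-sample approximation (1 − r²)/√(n − 1) (Python sketch; the replication count here is smaller than in the studies and all values are illustrative):

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def corr_sd(rho=0.8, n=100, reps=2000, seed=5):
    # SD of the sample correlation across many replicated samples of
    # bivariate normal data with population correlation rho.
    rng = random.Random(seed)
    rs = []
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            x = rng.gauss(0, 1)
            ys.append(rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
            xs.append(x)
        rs.append(pearson(xs, ys))
    m = sum(rs) / len(rs)
    return math.sqrt(sum((r - m) ** 2 for r in rs) / (len(rs) - 1))

empirical = corr_sd()                            # close to the analytic value
analytic = (1 - 0.8 ** 2) / math.sqrt(100 - 1)   # approx. 0.036
```

Across replications, the empirical spread of the sample correlations closely tracks the analytic standard error, which is what licenses using the SD of simulated TRCs as a stability measure.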
Following Lord and Novick (2008), reliability is conventionally quantified using Pearson's product–moment correlation r; this convention was adopted for both studies.
Both the code and the graphics used to justify the assertion of normality for both studies are available at https://osf.io/x7ptw/.
Study 1: Method
The first study examined how the TRC performs under varying conditions of sample size, variance ratio, and true score stability. Both the true score (T) and the error score (E) variables were drawn from normal distributions with the variances listed below.
To simulate changes in true score stability, the correlation between the two true score variables (T₁ and T₂) was varied across the levels listed below.
The variables manipulated in Study 1 were as follows: (1) Sample size: 2 to 1000 (2) True score variance: 1, 2, 3, 4, 5, 6, 7, 8, and 9 (3) Error score variance: 1, 2, 3, 4, 5, 6, 7, 8, and 9 (4) True score stability: correlation between T₁ and T₂ of 1, .9, .8, .7, and .6
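For orientation, the condition grid can be sketched as follows (Python; the exact sample-size steps between 2 and 1000 are not restated above, so the values used here are placeholders rather than the study's actual grid):

```python
from itertools import product

# Placeholder sample-size steps: the study varied n from 2 to 1000,
# but the specific increments are assumed here for illustration.
sample_sizes = [2, 10, 50, 100, 250, 500, 1000]
true_score_variance = range(1, 10)   # 1..9
error_score_variance = range(1, 10)  # 1..9
stability_levels = [1.0, 0.9, 0.8, 0.7, 0.6]

conditions = list(product(sample_sizes, true_score_variance,
                          error_score_variance, stability_levels))
n_conditions = len(conditions)  # 7 * 9 * 9 * 5 = 2835 cells in this sketch
```

With 1,000 replications per cell, even this reduced grid implies several million simulated datasets, which is why fully crossed designs of this kind are computationally nontrivial.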
Study 1: Results
Figure 1 displays how the mean TRC develops as a function of sample size and variance ratios across two levels of true score stability.

Figure 1. Mean estimated correlations (mean TRC) over sample sizes (0–1000) for four variance ratios (0.6–0.9) at two true score stability levels (τ = 1 vs. τ = 0.8). Gray shaded ribbons represent 95% confidence intervals around each LOESS fit. The dotted line in the lower (τ = 0.8) panels overlays the τ = 1 trajectory for direct comparison.
Table 1. Recommended Minimum Sample Sizes for Different Variance Ratios Across True Score Stability Levels for Good or Excellent Stability
Note. Ratio represents the true score variance divided by the total variance.
In contrast, when true score stability (hereafter also referred to as τ-stability) drops to .7, more than 500 participants are required, even at the same measurement reliability. Similarly, required sample sizes rise further as the variance ratio decreases.
The bottom row of Figure 1 illustrates what happens when τ-stability drops from 1 to .8.
Table 2. Deviation of TRC Estimates from Actual Reliability Across Stability Conditions at n = 1000

Figure 2. Relationship between mean estimated correlation (mean TRC) and true score stability at a fixed sample size of 1000, across four reliability levels.
Lastly, contrary to the assumption stated in section 3, the simulations reveal that the absolute size of the true and error score variances does not influence TRC behavior beyond their relative ratio. That is, neither the degree of bias nor the variability in TRC estimates is affected by how large or small the variances are in absolute terms (cf. Supplemental Table 3). Instead, the TRC’s behavior is fully governed by the variance ratio and trait stability.
Study 2: Method
The simulation for Study 2 mirrored that of Study 1, with the main difference being that error scores were drawn from correlated distributions rather than independent and identical ones, as detailed in Study 1.
As in Study 1, observed scores were calculated by summing the respective true and error scores (cf. Method Study 1). The variables manipulated in Study 2 were as follows: (1) Sample size: 2 to 1000 (2) True score stability: 1, .9, .8, .7, and .6 (3) Error score dependence: correlations of .1, .3, and .5 between E₁ and E₂ (4) Variance ratio (true score to error score): 0.6 to 0.95 in steps of .05
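A single Study 2 condition cell might be sketched as follows (a Python analogue of the R simulations; the function name and the specific parameter values are illustrative assumptions):

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_cell(n=100, rho_t=0.8, rho_e=0.3, var_t=3.0, var_e=1.0,
                  reps=1000, seed=7):
    # One condition cell: correlated true scores AND correlated errors,
    # with the TRC recomputed in each of `reps` replicated samples.
    rng = random.Random(seed)
    sd_t, sd_e = math.sqrt(var_t), math.sqrt(var_e)
    rs = []
    for _ in range(reps):
        x1, x2 = [], []
        for _ in range(n):
            t1 = rng.gauss(0, sd_t)
            t2 = rho_t * t1 + math.sqrt(1 - rho_t ** 2) * rng.gauss(0, sd_t)
            e1 = rng.gauss(0, sd_e)
            e2 = rho_e * e1 + math.sqrt(1 - rho_e ** 2) * rng.gauss(0, sd_e)
            x1.append(t1 + e1)
            x2.append(t2 + e2)
        rs.append(pearson(x1, x2))
    mean_r = sum(rs) / len(rs)
    sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in rs) / (len(rs) - 1))
    return mean_r, sd_r

mean_r, sd_r = simulate_cell()
# Equation (14) predicts (0.8*3 + 0.3*1) / 4 = 0.675 for this cell.
```

The mean correlation across replications approximates the equation (14) prediction, while the SD across replications is the coefficient-stability measure used in both studies.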
Study 2: Results
Figure 3 illustrates how the TRC evolves as a function of sample size across levels of true score stability and error dependency.

Figure 3. Mean estimated correlations (mean TRC) across increasing sample sizes for three levels of reliability (0.9, 0.8, and 0.7). Shaded ribbons represent 95% confidence intervals. Panels depict varying levels of true score stability (τ) and error dependency (ε).
These two trends are further quantified in Figure 4, which summarizes the average distortions in TRC estimates across varying levels of true score stability and error dependency.

Figure 4. Mean estimated correlations (mean TRC) at a fixed sample size of 1000, plotted across levels of true score stability. Lines represent different levels of reliability (0.6–0.9), with more dotted lines indicating lower reliability. Panels vary by error dependency (ε).
Discussion
The present study investigated the performance of the TRC under different conditions of variance ratios, sample sizes, true score stability, and error score dependence. Two simulations were conducted. The first study assessed how the TRC performs across different sample sizes, variance ratios, and levels of true score stability. The second study explored the effects of sample size, variance ratio, true score stability, and error score dependence.
Average Estimation and Bias of the TRC
Multiple findings stand out in relation to the average estimation of the TRC. When the assumptions of fixed true scores and independent errors are met, the TRC is an almost perfect estimator of the ratio of true score to error score variance, even at small sample sizes (cf. Table 1).
When true score stability decreases, the TRC estimate deviates from the variance ratio, as predicted mathematically in equation (10). The expected TRC equals the product of trait stability and measurement reliability. The practical consequence is that highly reliable instruments may appear to perform poorly when the underlying trait shows only moderate stability. For instance, as shown in Table 2, a measurement with a true reliability of .9 and τ-stability of .7 yields a TRC of just .63, a downward bias of .27 relative to its true value. Though the nominal degree of bias is lower for measurements with lower reliability, the point of true score stability at which a measurement falls below conventional standards of acceptability is reached sooner.
When error scores are not independent, interpreting the TRC becomes significantly more difficult. Two key trends emerge under error dependence. First, TRC estimates across different levels of measurement quality begin to converge, reducing the discriminability between instruments with varying degrees of reliability. Second, this homogenizing effect becomes more pronounced as true score stability decreases. Both patterns are consistent with the theoretical predictions developed in equation (14), where the expected TRC is modeled as a weighted sum of true score stability and systematic error, each scaled by its contribution to the total variance.
These patterns introduce two core interpretive problems. First, a measurement with low reliability can appear deceptively strong if enough systematic error is present. For example, as shown in Figure 4, a poor instrument can yield a TRC similar to that of a highly reliable one if τ-stability is low. Second, the same observed TRC can therefore arise from very different combinations of reliability, stability, and systematic error, leaving the coefficient alone unable to distinguish between them.
This poses a fundamental challenge. Without separate estimates of τ-stability and systematic error, low TRC values are uninterpretable as they may reflect low reliability, low stability, or both. Conversely, a high TRC may not reflect strong measurement properties, but merely the compensatory influence of systematic error. This ambiguity is especially severe at lower levels of trait stability, where systematic error exerts disproportionate influence, and thus further distorts the reliability estimate. In such cases, the TRC loses its interpretive value entirely.
Coefficient Stability
As seen in Table 2, the TRC stabilizes remarkably fast when true scores are perfectly stable and error is zero. Under these ideal conditions, fewer than 20 participants are required to achieve good coefficient stability.
These sample size demands stand in stark contrast to both standard recommendations and common research practice. For instance, De Vet et al. (2011) propose 50 participants as a general benchmark, while many studies report test–retest reliability using samples of 30–100. The current results indicate that such sizes are only sufficient when trait stability exceeds .90, reliability is good (≥.80), and precision requirements are modest. Although this suggests that small samples may suffice under ideal conditions, such conditions cannot be known in advance, so sample size planning must anticipate less favorable scenarios. The present results therefore demonstrate that much larger samples are generally required to ensure that reliability estimates are accurate and robust across a range of plausible measurement conditions.
Contrary to intuition, the TRC stabilizes faster when systematic error is introduced. However, this observed stability can be misleading, as it occurs under conditions where the measurement is inherently less reliable. Systematic error furthermore increases the homogeneity of the score distributions, regardless of the instrument's true reliability.
Feasibility
As discussed above, if the assumptions underpinning the TRC are violated, the coefficient no longer refers to reliability in any meaningful psychometric sense. The issue, however, extends beyond bias. It becomes a question of feasibility. Even in principle, can reliability be estimated when true scores are unstable or when systematic error is present?
To address this, it is important to distinguish between bias and identifiability. Bias refers to the degree an estimator diverges from the true parameter it seeks to estimate (Lehmann & Casella, 1998). Identifiability, by contrast, concerns whether the estimator produces unique results that can be clearly attributed to the true parameter. An estimator is identifiable when the number of observed (known) variables is equal to or exceeds the number of unknowns (Lewbel, 2019). With only two observed scores, it is impossible to isolate more than two latent components (Coleman, 1968; Heise, 1969; Rogosa, 2013). The TRC framework implicitly involves four latent components: measurement reliability, occasion-specific error, latent-trait stability, and systematic measurement error. Of these, the latter two are assumed to be one and zero, respectively, thereby rendering the TRC mathematically identifiable under its standard assumptions.
However, when the assumptions of score stability and error independence are not met, a principal problem of underidentification emerges as the number of unknowns exceeds the available information. This issue is illustrated in Figure 2. A measurement with excellent reliability becomes nominally unreliable at a stability level of approximately .65. The same TRC (∼.60) could also result from a measure with acceptable reliability and a stability of ∼ .75 or, as in Study 2, from a measure with poor reliability, a stability of ∼ .80, and a systematic error of .3. Because the researcher lacks direct access to these latent components and is attempting to estimate all of them simultaneously, it is impossible to identify which combination of measurement reliability, occasion-specific error, latent-trait stability, and systematic measurement error produced the observed TRC. In other words, because the TRC relies on both perfectly stable true scores and error independence, violating either assumption does not merely introduce bias, a systematic distortion, but fundamentally renders the model mathematically unidentifiable. The more assumptions that are violated, the more severely the TRC becomes underidentified.
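The underidentification argument can be made concrete by evaluating the expected TRC of equation (14), rewritten in terms of the variance ratio, for the three configurations described above (mapping "excellent," "acceptable," and "poor" reliability to ratios of .90, .80, and .60 is an illustrative assumption, not a value taken from the studies):

```python
def expected_trc(ratio, stability, error_dep=0.0):
    # Equation (14) in variance-ratio form:
    # TRC = ratio * stability + (1 - ratio) * error_dep,
    # where ratio = var_T / (var_T + var_E).
    return ratio * stability + (1 - ratio) * error_dep

# Three hypothetical configurations that two observed scores cannot
# tell apart:
a = expected_trc(0.90, 0.65)        # excellent reliability, low stability
b = expected_trc(0.80, 0.75)        # acceptable reliability, moderate stability
c = expected_trc(0.60, 0.80, 0.30)  # poor reliability plus systematic error
# a, b, and c are all approximately 0.60
```

Three very different measurement situations collapse onto (nearly) the same coefficient, which is precisely what it means for the model to be underidentified.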
This limitation would be less problematic if the underlying assumptions held reliably. However, as discussed in section 3, no psychological trait exhibits perfect temporal stability, and many fall below the threshold (τ ≈ .90) at which reasonably accurate estimation remains attainable. In practice, moreover, test–retest data are typically collected from convenience samples, such as students tested at two points during a semester.
While convenient, this approach is riddled with confounding influences. Testing environments are typically uncontrolled, increasing the likelihood of systematic error through contextual variation or motivational shifts. Moreover, the psychological state of students changes considerably over the course of a semester. Midterms, deadlines, fatigue, and fluctuating personal circumstances all contribute to transient changes in affect, cognition, and behavior, the very domains in which psychological traits are typically measured. These sources of variation are not just noise. They represent systematic, temporally structured influences on measurement. Ironically, the further apart the two testing occasions are spaced, a common practice to reduce memory effects, the more likely it becomes that the latent variable itself has changed (Polit, 2014). This creates a methodological impasse. Testing too close risks recall bias, and testing too far apart risks violating stability. The TRC is thus caught in a catch-22, where its assumptions can only be satisfied under conditions that undermine the study design itself.
Furthermore, stability cannot be verified using the same dataset from which reliability is estimated. With only two time points, observed changes remain ambiguous, potentially reflecting random error, systematic bias, or genuine shifts in the latent trait. The direction and magnitude of this change are likewise unknowable. This epistemic limitation is absolute. No internal diagnostic within the coefficient can reveal which source predominates. The same, of course, applies to systematic error. Changes in testing conditions or psychological context cannot be detected or corrected with only two observations. If such error is present, it is absorbed silently into the observed scores, leaving no empirical trace. As a result, researchers must assume both stability and error independence despite having no empirical basis for doing so. Given the above, both assumptions are rarely justifiable in applied contexts.
In applied psychometric work, these limitations demand caution. The TRC, when used in isolation, is insufficient as a claim about reliability. In practice, it must be contextualized by information such as internal consistency metrics, prior knowledge about trait stability, or design choices that plausibly minimize confounding variance. Even then, the estimate rests on untestable assumptions. Predictions about trait stability from longitudinal studies may offer one approach to addressing these limitations by forecasting potential biases and adjusting guidelines, but such approaches are speculative and offer no guarantee of validity in the current application.
Furthermore, as discussed earlier, the required sample size for accurate TRC estimation increases rapidly as trait stability decreases, such that realistic levels of trait stability demand samples far larger than those commonly collected.
This makes the TRC unusable in most applied settings. Even without systematic error, adequate coefficient stability at moderate τ-stability requires hundreds of participants; at a τ-stability of .7, for example, more than 500 participants are needed even for a highly reliable instrument (cf. Table 1).
The only viable path forward for test–retest contexts lies in abandoning the two-time-point framework and adopting models that can empirically estimate, rather than assume, key components like trait stability and systematic error. Designs incorporating three or more measurement occasions, such as the longitudinal framework proposed by Heise (1969), allow for model identification under realistic conditions. Similarly, latent variable approaches like longitudinal structural equation modeling offer a principled way to separate true change from error, though they require more complex designs and greater methodological rigor. Without such designs, or a shift toward single-administration reliability coefficients, the TRC provides not a meaningful estimate of reliability, but a mathematically ambiguous quantity that gives the illusion of psychometric certainty where none exists.
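As a sketch of why a third occasion helps, Heise's (1969) well-known result expresses reliability and stability separately in terms of the three pairwise retest correlations, assuming constant reliability and lag-1 change (function names and the example correlations are illustrative):

```python
def heise_reliability(r12, r23, r13):
    # Heise (1969): with three waves, constant reliability, and lag-1
    # change, reliability = (r12 * r23) / r13.
    return r12 * r23 / r13

def heise_stability_13(r12, r23, r13):
    # Stability of the latent trait between waves 1 and 3.
    return r13 ** 2 / (r12 * r23)

# Correlations consistent with reliability .80 and a wave-to-wave
# stability of .90 (so wave 1 to wave 3 stability is .81):
r12 = r23 = 0.8 * 0.9        # 0.72
r13 = 0.8 * 0.9 * 0.9        # 0.648
rel = heise_reliability(r12, r23, r13)    # recovers 0.80
stab = heise_stability_13(r12, r23, r13)  # recovers 0.81
```

With three correlations and three unknowns (one reliability, two stabilities), the system is exactly identified, which is what the two-occasion TRC cannot achieve.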
Limitations
Several limitations of the present study should be noted. The present simulations used normally distributed continuous data analyzed with Pearson's r. Distributions with pronounced skew or heavy tails may therefore yield less stable estimates than those reported here.
Likewise, ordinal or binomial data can yield different estimates. However, in psychological measurement, summed scores from multiple items tend to approximate normal distributions by the CLT (Norman, 2010), and ordinal variables with several categories can often be treated as continuous without meaningful distortion (Rhemtulla et al., 2012).
Lastly, in line with classical test theory, the simulations assumed independence between true scores and error scores. Situations in which τ and ε covary, for example, when error is partly determined by trait level (e.g., ceiling or floor effects), were not modeled. Such covariances would introduce additional sources of systematic error and further reduce the identifiability of the TRC. Again, the results therefore represent a relatively favorable case and violations of this assumption would only strengthen the conclusion that the TRC is unidentifiable. Thus, while absolute values may vary across data types, the central conclusion remains robust. The TRC is identifiable only under an idealized scenario that is unattainable in practice.
Conclusion
The TRC is a fundamental reliability index in CTT. It provides a straightforward method for quantifying measurement reliability by comparing two measurement points, making it a widely used tool in psychometrics. The current findings indicate that the TRC offers a robust and accurate estimate of reliability if and only if the true scores are perfectly stable (τ = 1) and the error scores are independent across administrations.
This is not merely a practical shortcoming but a structural one. With only two time points, it is mathematically impossible to separate true reliability from temporal instability and systematic error. As such, TRC-based reliability collapses under common empirical conditions, even when parameters are well specified or large samples are used. Given these limitations, the TRC should be applied only when its assumptions can be justified and when sample sizes are sufficient to enable precise estimation. In most, if not all, applied contexts, these criteria are not met.
While other reliability indices such as Cronbach’s α, McDonald’s ω, or the GLB exist, they do not resolve the limitations outlined above, as they assess internal consistency within a single test administration rather than temporal reliability. For longitudinal assessments, designs incorporating three or more time points, as proposed by Heise (1969), provide a more principled approach to estimating reliability, although they demand a level of methodological rigor not commonly observed in practice. Without such designs, the TRC does not provide a meaningful estimate of reliability but rather a false sense of certainty.
Supplemental Material
Supplemental material - On the Unreliability of Test–Retest Reliability
Supplemental material for On the Unreliability of Test–Retest Reliability by Domenic Groh in Applied Psychological Measurement.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The simulation code and supporting materials are available at https://osf.io/x7ptw/.
