Two-way interaction effects in linear regression occur when the relation between two variables changes depending on the level of a third. Despite their frequent use, interactions are notoriously difficult to estimate accurately and test for statistical significance because of small effect sizes and low reliability. In this study, we used Monte Carlo simulations to establish stability thresholds for two-way interactions between continuous variables across combinations of reliability (0.7–1.0), main effect size (0.1–0.5), collinearity (0.1–0.5), and interaction effect size (0.05–0.2). Stability was defined as the consistency of estimated effect sizes across repeated samples of the same size from the same population and operationalized using modified definitions of the corridor of stability and point of stability from Schönbrodt and Perugini. Results show that the stability of interaction estimates is primarily determined by sample size and predictor reliability. The case representing a realistic psychology field study, in which researchers have limited control over variables, stabilized at n = 3,800, requiring 72% statistical power. At n ≤ 100, 11% to 45% of the estimates were incorrectly signed (i.e., negative when the true effect was positive). Most psychology studies enroll far fewer than 500 participants, and our results indicate many published interactions may be unstable. Analyses involving highly reliable predictors, such as group assignment in experimental designs, may stabilize at lower sample sizes because they attenuate the expected effect size less than variables with more measurement error. Researchers are encouraged to avoid routine tests of two-way interactions unless sample size and reliability are adequate and hypotheses are specified a priori.
Interaction or moderation effects in linear regression occur when the relation between two variables changes depending on the level of a third variable. Interaction analyses are commonplace in the social sciences, in which researchers often seek to understand how the influence of one variable may change in different contexts or conditions. Interaction analyses are used to test a wide range of questions, including but not limited to gene–environment (Plomin et al., 1977), person–situation (Lewin et al., 1936), aptitude–treatment (Cronbach, 1957; Cronbach & Snow, 1981), and society–individual (Blumer, 1986) interactions, among many others. A significant interaction indicates that the relationship between the predictors and the outcome variable cannot be fully explained by the main effects alone because the predictors’ effects are interdependent in a nonadditive way (Jaccard & Turrisi, 2003). Interactions introduce unique complexities that can make them challenging to detect (MacKinnon, 2011), especially in nonexperimental or naturalistic investigations in which researchers have limited control over variables (i.e., field studies; McClelland & Judd, 1993). Researchers conventionally target 80% power when planning tests of interactions, but in practice, many reported interactions are underpowered: power analyses are often omitted or conducted using conventional benchmarks that assume unrealistically large effect sizes for interactions. As a result, the stability of interaction effects, defined as the consistency of their estimated effect size across repeated samples, is frequently uncertain even when statistical significance is obtained.
In the context of null hypothesis significance testing, “power” refers to the probability of correctly detecting an effect if it exists. There are two primary issues with power in interaction effects: their typically small effect sizes (Aguinis et al., 2005; Vize, Sharpe, et al., 2023) and their low reliability (Busemeyer & Jones, 1983). Beyond the normatively small effect sizes of interactions, they can be obscured by restricted ranges of predictors (McClelland & Judd, 1993), the size and collinearity of the main effects (Baranger et al., 2023; Dormann et al., 2013), and unequal cell sizes in categorical moderators (Frazier et al., 2004). The reliability of the interaction term is approximately the product of the reliabilities of its components, which can further attenuate effect size. As a consequence, large sample sizes tend to be necessary to detect interactions (Hyatt et al., 2022; Lakens, 2019; Vize, Baranger, et al., 2023). Even seasoned researchers easily fall into false dichotomies around statistical significance when interpreting interaction effects (McShane & Gal, 2017), yet many studies in the extant literature are not sufficiently powered for interaction analyses.
Publication bias casts doubt on the understanding of published interactions (Sotola & Credé, 2023; Ferguson & Brannick, 2012). The number of significant interaction effects is likely overstated because of the selective reporting of exaggerated effect sizes (Simmons et al., 2011), failure to correct for multiple comparisons (see Pease & Lewis, 2015), and severe underpowering (see Jensen-Campbell et al., 2007; see Ode et al., 2008). Recent z-curve analyses of health-psychology journals confirm this pattern: estimated replication rates fall below 50%, and for every significant interaction reported, nearly two nonsignificant results go unreported (Fremling et al., 2025). Large-scale replication efforts indicate that only 22% of interaction analyses in psychology replicate (Open Science Collaboration, 2015), and similarly poor replicability exists in other fields (Credé & Sotola, 2024; Greenland, 1993). Some have even urged researchers to “mend it” or “end it” regarding the search for moderators (Murphy & Russell, 2017).
One factor to consider when designing interaction analyses is stability, defined as the consistency of estimated effect sizes across repeated samples of the same size from the same population. Larger samples yield tighter clusters of estimates and thus greater stability. Whereas standard measures of spread (e.g., standard errors, standard deviations, and confidence intervals [CIs]) describe uncertainty for a single analytic result, stability is about the analytic design of the study and reflects how reliably a given design reproduces estimates of effects.
“Power” and “stability” both describe the expected performance of a design rather than outcomes from a single data set, but they capture different features of that performance. Power pertains to the performance of decision rules in hypothesis testing: the probability of rejecting a null hypothesis at a chosen alpha when an interaction truly exists. Stability pertains to estimate reliability: the degree to which identically designed studies yield similar estimates of effects. Unlike power or confidence-interval coverage, stability does not depend on alpha or a specific test; it is purely about the spread of estimates generated by a study design across repeated simulations. Interactions are particularly prone to artifacts (Gelman, 2023; McClelland & Judd, 1993; Murphy & Russell, 2017; Rimpler et al., 2025), and conflating power or confidence-interval coverage with stability can lead researchers to draw theoretical conclusions from effects that are statistically significant yet unlikely to replicate.
These replication failures point to a deeper issue with conventional power analyses. Power is the probability of rejecting a null hypothesis at a chosen alpha when an interaction truly exists, but it is computed from assumed effect sizes. When those assumed effects are exaggerated because of publication bias, selective reporting, or unrealistic benchmarks, a study can be “powered” on paper yet still yield misleading results. In addition, finding that an effect differs from zero is not the same as estimating its magnitude accurately. For science, the critical question is not whether an effect exists but whether its size can be estimated with sufficient consistency and accuracy across samples. This is the question stability answers directly.
A well-known framework for quantifying stability was introduced by Schönbrodt and Perugini (2013) in the context of correlations. Instead of focusing on the CIs around a single estimate, they used simulations to determine the sample size at which repeated estimates fell within a predefined range of the true value. They simulated a population correlation and drew 100,000 bootstrapped samples across a range of sample sizes (ns = 20–1,000). In each sample, they estimated the correlation between the two variables and traced the correlation values as a function of sample size to form trajectories. The number of trajectories falling within a fixed-width corridor around the population parameter, called the “corridor of stability” (COS), was used to find the point of stability (POS), the smallest sample size at which a given percentage (80%, 90%, or 95%) of trajectories remained within the COS. For example, a correlation of r = .1 stabilized at n = 252 with 80% of estimates in the COS; at this sample size, there is approximately 38% power to detect the effect. For r = .1, two of their three COS bounds permitted sign errors, in which effects could reverse direction and still be considered stable. Converting their COS bounds to percentages, we found that their narrowest COS allowed these small correlation estimates to deviate by as much as 97% from the expected value; for their widest COS, estimates were permitted to deviate from the expected value by as much as 292% for the smallest effect size they examined (r = .1) and still be regarded as stable. Although modest absolute deviations are informative for larger effects, they translate poorly to small effects. This is particularly problematic for interaction analyses, for which almost all empirically observed effects in the psychological sciences are r ≤ .1 (Murphy & Russell, 2017; Open Science Collaboration, 2015).
Although Schönbrodt and Perugini (2013) evaluated correlations between perfectly measured observed variables, research evaluating correlations between imperfectly measured latent variables has shown that such correlations require even larger sample sizes to stabilize because of unreliability. Kretzschmar and Gignac (2019) conducted Monte Carlo simulations similar to those of Schönbrodt and Perugini and used McDonald’s omega to introduce unreliability into latent-variable correlations: relative to the perfectly reliable case, the requisite sample size for stability more than doubled when omega was simulated at .7. Perfect reliability in psychological instruments is exceptionally rare, and most observed relations among psychological constructs are attenuated by the unreliability of their measures. The conclusions of Schönbrodt and Perugini regarding sample size are important for understanding stability but may not generalize to many applied scenarios. Interaction effects are particularly susceptible to low reliability because their reliability is approximately the product of the reliabilities of the main effects (Busemeyer & Jones, 1983). Coupled with their tendency toward small effect sizes (Vize, Sharpe, et al., 2023), interactions examined with linear regression possess unique complexities that may affect their stability, which warrants additional investigation.
In this study, we used Monte Carlo simulations to evaluate the stability of interaction effects under varying conditions. We aimed to find the sample sizes, main effect sizes, and predictor collinearity values that produce stable interactions. Our approach differs from that of Schönbrodt and Perugini (2013) in four notable ways. First, we considered more potential conditions by including three effect sizes: the interaction effect, the main effects, and the collinearity of the predictors. Second, our definition of stability accounts for effect size directly through the use of percentages instead of translating effect sizes using Fisher’s r to z and adding or subtracting predetermined widths (w = 0.1, 0.15, 0.2) before back-translating with Fisher’s z to r. Third, we tested a wider range of COS widths to allow us to consider conservative and less conservative thresholds for stability. Finally, we did not trace trajectories across sample sizes from the same population and instead resampled entirely at each new sample size because we had a uniform simulated population to draw from. Given that the mean effect size for interactions is below r = .1 in empirical research (Credé & Sotola, 2024; Freese & Peterson, 2017; Open Science Collaboration, 2015), the goal of this simulation study is to illustrate the conditions necessary for stable interactions. We answer the following research question:
Research Question: How do main effect size, intercorrelation, and reliability affect the sample size required for stable interaction-effect estimates?
Method
In this preregistered study, we used Monte Carlo simulations to evaluate the conditions required for interaction effects to stabilize. Simulations were conducted in R (R Core Team, 2024), and the full reproducible code is available in the supplemental materials (https://osf.io/zmvsf/?view_only=46e3f25d45ea4c83a33ffaef111abef9). The R package InteractionPoweR (Baranger & Castillo, 2025) is available for researchers interested in conducting similar analyses. In this article, the regression model applied is
Y = β1X1 + β2X2 + β3X1X2 + ε,

where X1 and X2 represent the main effects, X1X2 is the interaction term, Y is the outcome variable, and ε is normally distributed random error. Simulations were performed using the generate_interaction() function from the InteractionPoweR R package (see Baranger et al., 2023). In brief, variable correlations are input and then adjusted for reliability, and regression coefficients (β1, β2, and β3) are computed. The variables X1 and X2 are simulated by drawing from a multivariate normal distribution, the interaction term is calculated as the product of X1 and X2, and Y is then drawn from a normal distribution. All variables are standardized to have a standard deviation of 1 and a mean of 0, and thus the regression coefficients can be interpreted as standardized effect sizes.
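To make the data-generating process concrete, the following is a minimal, hand-rolled sketch of this model. It is not the InteractionPoweR internals; the function and variable names are ours, and the classical error model for the observed predictors is an assumption consistent with the Reliability section below.

```r
# A minimal sketch of the data-generating model above (not InteractionPoweR's
# internals). Assumes the classical error model:
# observed = sqrt(rel) * true + sqrt(1 - rel) * noise, which keeps Var(x) = 1.
library(MASS)

generate_interaction_data <- function(n, b1, b2, b3, r12, rel_x1 = 1, rel_x2 = 1) {
  # Standardized true scores for the two predictors, correlated at r12
  ts <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, r12, r12, 1), 2, 2))
  t1 <- ts[, 1]; t2 <- ts[, 2]

  # Build Y from the true scores; residual variance is chosen so Var(Y) = 1.
  # For standardized bivariate-normal scores, Var(t1 * t2) = 1 + r12^2 and the
  # product term is uncorrelated with t1 and t2.
  signal_var <- b1^2 + b2^2 + 2 * b1 * b2 * r12 + b3^2 * (1 + r12^2)
  stopifnot(signal_var < 1)
  y <- b1 * t1 + b2 * t2 + b3 * t1 * t2 + rnorm(n, sd = sqrt(1 - signal_var))

  # Observed predictors carry measurement error, which attenuates the estimates
  x1 <- sqrt(rel_x1) * t1 + sqrt(1 - rel_x1) * rnorm(n)
  x2 <- sqrt(rel_x2) * t2 + sqrt(1 - rel_x2) * rnorm(n)
  data.frame(y, x1, x2)
}

set.seed(1)
d <- generate_interaction_data(n = 500, b1 = 0.2, b2 = 0.2, b3 = 0.1,
                               r12 = 0.1, rel_x1 = 0.8, rel_x2 = 0.8)
coef(lm(y ~ x1 * x2, data = d))["x1:x2"]  # one draw of the interaction estimate
```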
Simulation parameters
Reliability
Reliability reflects the degree to which predictor values are affected by random measurement error. We considered four levels of reliability for X1 and X2 (1, 0.9, 0.8, 0.7). The reliability of the interaction term is distinct from that of the predictors (Busemeyer & Jones, 1983) and is calculated as

ρX1X2 = (ρX1 ρX2 + rX1X2²) / (1 + rX1X2²),

where ρX1 and ρX2 are the reliabilities of the predictors and rX1X2 is their correlation.
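A small helper makes this arithmetic concrete; it is a sketch of the Busemeyer and Jones (1983) result for standardized, bivariate-normal predictors, and the function name is ours:

```r
# Reliability of a product term (Busemeyer & Jones, 1983) for standardized,
# bivariate-normal predictors correlated at r12
interaction_reliability <- function(rel_x1, rel_x2, r12) {
  (rel_x1 * rel_x2 + r12^2) / (1 + r12^2)
}

interaction_reliability(0.8, 0.8, 0.1)  # ~0.64, near the product of .8 x .8
```

As the example shows, with weakly correlated predictors, the interaction term’s reliability is close to the product of the component reliabilities.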
Effect size
We evaluated two relevant effect sizes: those of the main effects (β1, β2) and the interaction (β3). The main effects were simulated across a range of sizes from 0.1 to 0.5. Note that not all of these main effect sizes were preregistered; one main effect size was added after the preregistered analysis was conducted so the simulation parameters would more directly emulate additional effect sizes common in the literature.
Both β1 and β2 were simulated with equal effect sizes to minimize potential confounds and isolate the conditions under which the interaction stabilizes. Keeping β1 and β2 equal allows the interaction to reflect a true bidirectional effect rather than a dominant main effect with a secondary modifier. In addition, the main effects were simulated with differing degrees of intercorrelation. Collinearity among the predictors is a common issue in regression analysis, and we considered three values for the predictor correlation: 0.1, 0.3, and 0.5. Note that under assumptions of multivariate normality, the interaction term is independent of X1 and X2.
The expected interaction effect size, E(β3), was calculated after attenuation from the population-level true value using the procedure described in Baranger et al. (2023). Multiple values were selected for β3 to reflect the effect sizes typically observed in the literature (Aguinis et al., 2005; Vize, Baranger, et al., 2023; Vize, Sharpe, et al., 2023). The full set (0.05, 0.10, 0.15, and 0.20) expands on the preregistered values; 0.05 was added after preregistration to better illustrate the low end of common interaction effect sizes. Each combination of parameters attenuates to a unique expected value E(β3). For a complete table of parameter combinations, see the Supplemental Material available online.
Sample sizes
Although the original preregistered approach was to examine a fixed grid of sample sizes in regular increments, we identified a more efficient way to achieve our aims using a search algorithm: data sets are generated at the median of a specified range, which then iteratively narrows until the outcome is identified. In our simulations, the search range extends up to a maximum of n = 350,000, and the change reduced computational costs by several orders of magnitude. In addition, for illustrative purposes, we simulated data at standard sample-size benchmarks of interest using the same data-generation procedure. These additional sample sizes were generated to provide readers with an intuitive sense of the stability of interactions at common benchmarks for sample size. In light of the extent of these changes, for transparency, all original, preregistered analyses are reported in the Supplemental Material.
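A minimal sketch of this bisection-style search follows; prop_in_cos() is a hypothetical helper (not a function from our code or from InteractionPoweR) that simulates 10,000 data sets at size n and returns the share of interaction estimates inside the corridor of stability:

```r
# Bisection search for the point of stability. Assumes the stable proportion is
# (effectively) monotone in n, which holds when replications are numerous.
find_pos <- function(prop_in_cos, p_pos = 0.80, lower = 20, upper = 350000) {
  while (upper - lower > 5) {
    mid <- round((lower + upper) / 2)
    if (prop_in_cos(mid) >= p_pos) {
      upper <- mid   # criterion met: try smaller samples
    } else {
      lower <- mid   # criterion missed: need larger samples
    }
  }
  upper  # smallest examined n meeting the stability criterion
}
```

Because each evaluation halves the remaining range, the search needs only on the order of log2(350,000) evaluations per combination rather than one per grid point.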
The simulation parameters produced 192 distinct combinations of variables, excluding sample size. At each examined sample size, for each combination of parameters, 10,000 data sets were generated.
Evaluating stability
Stability of the interaction is measured using a modified version of the COS and POS from Schönbrodt and Perugini (2013). Specifically, we used percentages to define how similar effect-size estimates must be to the expected effect to be considered stable. Statistical power was calculated at each examined sample size using the power_interaction_r2() analytic power function from the InteractionPoweR package in R (Baranger et al., 2023). The bias of the model was measured by evaluating how frequently the models recapture the true value across simulations.
COS
The COS is an interval in which estimates are considered stable if they fall within its bounds. Its width is determined using percentages, which we denote wCOS. The COS is symmetric and centered on the expected effect after attenuation:

COS = [E(β3)(1 − wCOS), E(β3)(1 + wCOS)].

Because it is centered on the attenuated expectation, the COS is centered on a downwardly biased estimate, similar to CIs, which also do not account for bias.
To demonstrate how the COS is calculated, consider the following example: Suppose we select a wCOS of 50%, and the true effect size is 0.2. In a combination in which attenuation halves the true value (selected for convenience), the expected effect size is 0.1. By applying the above formula, the COS is [0.05, 0.15]. Under these stability requirements, estimates greater than 0.05 and less than 0.15 fall within the COS. If we were to create a COS for the same combination with a wCOS of 100%, the interval for stability would widen to [0, 0.2]. Estimates falling within the smaller COS will necessarily fall within a larger COS, but the reverse does not hold.
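The corridor arithmetic is a one-liner; the function name is ours, and the calls reproduce the example above:

```r
# Corridor of stability from the attenuated expected effect and a % width
cos_bounds <- function(expected, w_cos) {
  c(lower = expected * (1 - w_cos), upper = expected * (1 + w_cos))
}

cos_bounds(0.10, 0.50)  # wCOS =  50% -> [0.05, 0.15]
cos_bounds(0.10, 1.00)  # wCOS = 100% -> [0.00, 0.20]
```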
POS
The POS is the sample size at which a certain percentage of the estimates (80%, 90%, or 95%) of the 10,000 data sets generated for each sample size falls within the COS. We denote this percentage as pPOS. As pPOS increases, the sample-size requirement for stabilization becomes more stringent. Thus, stability in our simulations is operationalized similarly to Schönbrodt and Perugini (2013), diverging only in the definition of the COS to apply more directly to interaction effects.
The following example demonstrates how the COS and POS are calculated and applied in our study. Consider the case in which we select a pPOS of 80%, a wCOS of 50%, and other parameters identical to the previous example (expected effect = 0.1). There are 10,000 estimates at each sample size we consider. The smallest sample size at which 80% of the estimates of β3 are within ±50% of the expected effect (i.e., between 0.05 and 0.15) is identified as the POS. In our simulations, we searched for the sample size at which a POS is identified for each combination for each pair of wCOS and pPOS percentages.
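Combining the corridor with the simulated estimates gives the stability check applied at each candidate sample size; this sketch reuses cos_bounds() from above, with `estimates` standing in for the 10,000 simulated interaction coefficients:

```r
# Share of simulated estimates inside the corridor at one sample size
prop_stable <- function(estimates, expected, w_cos) {
  b <- cos_bounds(expected, w_cos)
  mean(estimates >= b["lower"] & estimates <= b["upper"])
}
# The POS is then the smallest n with prop_stable(...) >= pPOS (e.g., 0.80)
```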
We opted for percentage-based thresholds when quantifying stability so that the corridor scales with the magnitude of the effect. Schönbrodt and Perugini (2013) constructed a COS using Cohen’s benchmarks for width (w = 0.1, 0.15, 0.2) on correlations ranging from r = .1 to r = .7. This approach is suitable for larger effects, which almost never describe interaction effects in field studies in the psychological literature. For a demonstration of the application of their recommended widths, expressed as percentages, to effects more typical of interactions, see Table 1.
Note: The italicized rows are the effect sizes employed in the present study. The bold rows represent effect sizes unique to the present study.
In six of 12 cases with the effect sizes we analyzed in our study, the bounds calculated using the fixed-width approach permit sign errors (i.e., a true positive effect falling within the COS despite being negatively signed). An interaction estimate deviating by 200% from the population-level effect with its sign inverted is best described as unstable (and inaccurate); it will tend to vary substantially across samples and prove difficult to reproduce. We thus sought to establish practical thresholds for stability for interaction effects. By opting for percentages, we offer a parsimonious alternative for calculating the COS with more informative bounds. In addition, our approach benefits from the width of the COS being a function of the size of the expected effect, whereas Schönbrodt and Perugini’s (2013) width is independent of effect size.
Multiple values for wCOS were examined in this study to allow the requirements for stability to range from conservative to less conservative. A wCOS of 10% was also preregistered but is not reported in this article because it was evaluated to be so conservative as to be uninformative to readers. For this analysis, see the Supplemental Material.
True-value recapture rate
The true-value recapture rate is a measure of bias in the estimates of the interaction coefficient. It is the proportion of estimates falling within bounds centered on the true (unattenuated) interaction effect size, β3, and is calculated similarly to the COS:

recapture bounds = [β3(1 − wCOS), β3(1 + wCOS)].
Power
Power for the interaction term is calculated analytically using the InteractionPoweR package in R (Baranger et al., 2023), which accounts for attenuation. This differs from the preregistered approach of recording the proportion of significant estimates during simulations; the change was made to improve computational efficiency.
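For reference, an analytic power calculation for a single parameter combination takes the following form; the argument names assume a recent version of the package and should be verified against its documentation:

```r
# Analytic power for the interaction at one sample size (sketch; argument
# names are assumptions based on InteractionPoweR's documented interface)
library(InteractionPoweR)
power_interaction_r2(
  N        = 3800,  # sample size of interest
  r.x1.y   = 0.2,   # main effect of X1
  r.x2.y   = 0.2,   # main effect of X2
  r.x1x2.y = 0.05,  # interaction effect
  r.x1.x2  = 0.1,   # predictor collinearity
  rel.x1   = 0.8, rel.x2 = 0.8, rel.y = 1
)
```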
Sign errors
Sign errors for the interaction term were calculated as the proportion of negatively signed interaction effect-size estimates to the total number of estimates because our simulated effect sizes (0.05–0.20) are all positive.
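Both diagnostics reduce to simple proportions over the simulated estimates. A minimal sketch follows; `estimates` and `beta3` are illustrative names rather than the variable names in our code, with beta3 the true, unattenuated (and, in our conditions, always positive) interaction effect:

```r
# Share of estimates within +/- (w * beta3) of the TRUE, unattenuated value
recapture_rate <- function(estimates, beta3, w) {
  mean(abs(estimates - beta3) <= w * beta3)
}

# Share of estimates pointing the wrong way (true effects are all positive)
sign_error_rate <- function(estimates) {
  mean(estimates < 0)
}
```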
Statistical analyses
All simulations and analyses were performed using the parameters and evaluation metrics described above. Information presented about the POS is calculated using data simulated with the search algorithm described above in Sample Sizes. When specific sample sizes were examined (see the Results section), the data were simulated separately from the search procedure. The data-generation process remained the same at any specific sample size (10,000 data sets were generated with the specified combination of parameters) regardless of whether the search algorithm was used to identify the sample size for simulation. For additional details on model performance and the properties of the simulated data, see the Supplemental Material.
Results
Stabilization across combinations
The subsequent analyses are reported with wCOS = 50% and pPOS = 80% unless otherwise specified. Of the 192 combinations tested, 145 (76%) were stable at or below a sample size in the low thousands. The mean POS shrinks steadily as the interaction effect size increases from β3 = 0.05 to β3 = 0.20: an effect of 0.20 has POS values that are, on average, 16 times smaller than those of our smallest effect under otherwise identical conditions. Larger interaction effects lead to more stable results in smaller samples; note that our two largest effect sizes are much larger than typical interaction effects.
As the reliability of the predictors increases, the POS decreases, with the mean POS falling steadily as reliability rises from .7 to 1. Each 10% decrement in the reliability of the predictors results in an approximately 20% increase in the sample size required for a stable estimate. If an interaction is estimated from predictors with reliabilities of .7, it requires a sample size 94% larger than the same interaction estimated with perfect reliability. Reliable predictors lead to more stable estimates of the interaction in smaller samples.
Main effect size and collinearity also affect the POS, although to a lesser degree than sample size, interaction effect size, or reliability. Stability was evaluated at four main effect sizes; the mean POS was largest for the smallest main effect size and declined as the main effects grew. Collinearity among the main effects was considered at three levels (0.1, 0.3, and 0.5), and the mean POS rose as collinearity increased. Small, highly intercorrelated main effects result in larger sample sizes being required for stability.
Stability in the average study
A combination of parameters that approximates a field study in psychology (Brysbaert, 2019; Nosek et al., 2022; Open Science Collaboration, 2015) has a point of stability of 3,800 with wCOS = 50% and pPOS = 80%. At sample sizes typical of the field (n ≤ 100), the statistical power to detect the interaction effect is minimal, and a substantial share of the estimates are incorrectly signed! Power remains low, and sign errors remain common, at sample sizes several times larger. Power at the POS is 72%. With the most generous stability requirements (wCOS = 100%, pPOS = 80%), the POS is 940. By contrast, rigorous stability requirements (wCOS = 25%) correspond to impractically large POS values (15,455–36,390, depending on pPOS). When the interaction effect is assumed to be 10 times the average in a psychological-science study with zero measurement error in the predictors, the sample size at which stabilization occurs is 150, which provides approximately 72% power. The sample sizes required to achieve stability are higher than those commonly found in published studies, even under ideal conditions with interaction effects of equal size to the main effects.
f² effect size
In a nonpreregistered analysis, to better illustrate the POS/COS trade-offs, we also calculated Cohen’s f² for each combination of simulation parameters. f² is a commonly used effect size in regression analysis that complements unstandardized coefficients. In a comparison of two nested models, a reduced model (e.g., Y ~ X1 + X2) and a full model (e.g., Y ~ X1 + X2 + X1X2), the f² statistic reflects how much of the residual variance not accounted for by the reduced model (1 − R²reduced) is captured by the full model: f² = (R²full − R²reduced) / (1 − R²full). In this context, the predictor of interest is the interaction. We included f² primarily to aid in visualization (see Fig. 1). It provides a standardized metric that assists in illustrating several features from our simulations: sample size, effect size, and the impact of the COS width and POS percentage on when the interaction can be regarded as stable. Cohen’s (1988) conventions suggest f² values of 0.02, 0.15, and 0.35 correspond to small, medium, and large effects, respectively. In line with our parameter selection, the f² values in this study encompass a range suited to interaction effects, although they are small in absolute terms. Figure 1 shows the relation between interaction effect size (measured using f²) and the sample size for all combinations. Each line represents a pair of wCOS and pPOS values.
Sample size and f² at the point of stability across different COS and POS percentages. Figure 1 traces the sample size required for stability as a function of the f² value for COS widths of 100%, 50%, and 25% with a POS threshold of 80%. All 192 combinations of simulation parameters are represented on each line. In the upper left, the numbers labeled “max” illustrate the sample size required for stability with the smallest f² in our simulations (0.0013). Points 1, 2, and 3 are for the same parameter combination and demonstrate how loosening stability requirements results in stability at smaller sample sizes. Points 2 and 4 lie along the same line and show how smaller effect sizes require larger sample sizes to stabilize, all other things being equal. COS = corridor of stability; POS = point of stability.
For example, consider Points 1 through 3 in Figure 1. All three illustrate the POS for the same f² and vary only by COS width. Point 1 has a POS that is 4 times the POS for Point 2 and 15 times that of Point 3. Point 3 lies along the most generous COS and POS percentages we examined, and Point 2 represents the COS and POS percentages likely to be useful and practical to researchers. Point 1 is prohibitively strict and illustrates how the sample size required for stability increases as the COS narrows. In addition, Points 2 and 4 lie on the same line but with different effect sizes. Relative to Point 2, the f² for Point 4 is smaller and the POS is larger. To stably estimate a small effect, sample-size requirements quickly balloon to be untenably large.
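For readers who wish to compute f² for an interaction in their own data, a minimal sketch under the model described in the Method section follows (the column names y, x1, and x2 are illustrative):

```r
# Cohen's f^2 for the interaction from two nested models (standard formula)
f2_interaction <- function(d) {
  r2_reduced <- summary(lm(y ~ x1 + x2, data = d))$r.squared
  r2_full    <- summary(lm(y ~ x1 + x2 + x1:x2, data = d))$r.squared
  (r2_full - r2_reduced) / (1 - r2_full)
}
```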
Power and stability
In a deviation from our preregistration, we chose to more specifically examine statistical power at the POS for all combinations across the various COS widths because we believe readers might find such an examination useful. With a wCOS of 50% and a pPOS of 80%, estimates of the interaction are stable in all 192 combinations when power is approximately 72%. Holding the width at 50%, achieving stability at pPOS values of 90% and 95% requires correspondingly greater power, and narrowing the COS to 25% raises the required power further still. The most lenient stability requirements correspond with the lowest average power at the POS. Stability and power are closely related, and greater power indicates more stable interaction estimates.
Sign errors and true-value-recapture rates
Sign errors are most frequent at small sample sizes: across all combinations, 11% to 45% of estimates were incorrectly signed at n ≤ 100. Smaller effect sizes tend to have more sign errors at a given sample size; the largest simulated effect (β3 = 0.20) produces far fewer incorrectly signed estimates than the smallest (β3 = 0.05). The mean rate of sign errors declines steadily as sample size grows and approaches zero at the largest sample sizes we examined. Sign errors are strongly negatively related to power overall and are most frequent for interactions with small effect sizes estimated in small samples. When considering the ability of the models to recapture the true value (β3), we found that if the expected effect E(β3) is similar in size and direction to β3, increasing the sample size improves accuracy. However, if E(β3) is biased because of measurement error, a larger sample does not correct the bias.
Case example
For the simulated data for the combination with median parameters, see Figure 2. The data illustrated are not simulated using the search algorithm described in the Method section because the plot would be incomplete. Instead, data are generated using the same procedure across an evenly spaced grid of sample sizes, with 10,000 data sets at each sample size.
Plot of the estimates for the interaction-term coefficient (β3). Figure 2 demonstrates our simulations for the median combination (true interaction effect size = 0.15, main effect size = 0.3, main-effect intercorrelation = 0.3, reliability = 0.8). There are 10,000 estimates of the interaction at each sample size. These estimates are split into 10 groups of even size (deciles), denoted by the horizontal, jagged lines. Darker shading indicates the estimates are tightly clustered, and lighter shading illustrates greater spread in the estimates. The horizontal dotted line is the expected effect, E(β3), which is 0.11.
Stability is achieved at n = 425 with a wCOS of 50% and a pPOS of 80%. Note that in Figure 2, the 80% point of stability is where the 10th and 90th decile lines enter and remain within the corridor’s bounds.
Discussion
Despite the theoretical appeal of interaction effects, the results of this study suggest that estimating two-way interactions using linear regression presents major methodological and statistical challenges. An estimate of an effect is considered stable when it is likely to replicate in magnitude and direction across samples. Our simulations identified sample size, predictor reliability, and interaction effect size as the primary determinants of the stability of interaction estimates. When these parameters were evaluated at levels realistic for a field study, the sample size required for a stable estimate of the interaction coefficient was found to be 3,800.
An important takeaway from this study is that 72% power to detect the interaction coincides with estimates being reasonably stable (wCOS = 50%, pPOS = 80%) because stability is attained at approximately 72% power with this COS width and POS percentage in all combinations. Thus, powering a study to at least 80% to detect an interaction will essentially guarantee stability at a level that will yield accurate results. Most studies fall well short of this mark (Open Science Collaboration, 2015). With a sample size of 500 or below, 49% of our parameter combinations were stable, and it is effectively a coin flip whether an interaction analysis in this sample-size range is replicable. The conditions that produce stable interaction estimates and the conditions of many field studies diverge; without sufficient power from a robust sample size and reliable predictors, it is generally inadvisable to conduct interaction analyses because they are likely to be uninformative, and there is a large probability that the observed magnitude and sign of the interaction will not align with the true population effect.
Although there is no inherent flaw with interaction analyses, there are unique properties of interaction effects that explain the difficulties researchers encounter when testing for them. It has been repeatedly observed that interaction effects are small, with an absence of evidence from any field that there are many large interactions (Credé & Sotola, 2024; Freese & Peterson, 2017; Open Science Collaboration, 2015). Simultaneously, they are uniquely susceptible to attenuation because their reliability is approximately the product of that of the two main effects. Methodologically, the sample-size thresholds for stability we have outlined will likely be impractical for most researchers to meet because of constraints on resources and sociological pressures around publishing. The same pressures drive many interaction analyses. If the main effects in a study are not found to be significant, researchers may turn to testing interactions until a significant (and thus, publishable; Greenwald, 1975) result is found. In such cases, researchers are typically searching for interactions they are grossly underpowered to detect and that would yield highly unstable estimates. These conditions can encourage atheoretical serial testing of interactions, which has led, and will continue to lead, to inflated family-wise error rates and gross overestimations of population effect sizes (Sotola & Credé, 2023).
Power and stability are not the only issues with interactions. Even if a given sample size provides adequate power and stability for an interaction, there remain deeper conceptual and interpretive challenges. As Rohrer and Arslan (2021) emphasized, there are additional issues to consider: scale dependence, the distinction between moderation of slopes and moderation of correlations, and causal identification. Interactions can change in magnitude (or even reverse direction) depending on the measurement scale, leading to contradictory conclusions. Differences in correlations between groups may be mistaken for slope differences or vice versa, obscuring the true nature of the effect. Finally, significant interaction terms do not imply causal interactions unless both variables are appropriately manipulated or strong assumptions are met. Thus, even precise and stable estimates can be misleading without careful attention to these foundational issues.
Stability and power
Statistical power was identified as a strong proxy for stability, although the two constructs are theoretically and operationally distinct. For the 192 parameter combinations, stability (defined as 80% of the estimates falling within a COS with a width of 50%) was achieved with statistical power of approximately 72%. Thus, if the probability of correctly detecting a true effect (power) is high, resampled estimates will tend to cluster tightly around the expected effect (stability). Although this operationalization of stability is less informative at larger effect sizes as a result of its percentage-based approach, it is optimal for application to interactions, for which effect sizes almost universally fall below r = .1. Furthermore, more stringent stability requirements demand greater power: if an interaction effect is to be replicable within 25% of the expected effect, substantially greater power is required. Although the associated sample size for stabilization varies with the size of the expected effect and other parameters, the power at the POS is consistent. There is 12% power to detect an interaction under conditions typical of psychology studies at common sample sizes (see Table 2; also see Aguinis et al., 2005; Vize, Sharpe, et al., 2023), and under such conditions, approximately 85% of the estimated interactions vary by 100% or more from the expected value. The probability of detecting a true effect is affected by sampling variability, and stability is a measure of spread that can be largely attributed to sampling variability. Thus, if an interaction analysis is underpowered, it is also unstable.
Point of Stability (n) for Different β3 Values, COS and POS Percentages, and Reliabilities

Column headers give the pPOS level crossed with the wCOS width.

| True β3 | Reliability | E(β3) | 80%, 25% | 80%, 50% | 80%, 100% | 90%, 25% | 90%, 50% | 90%, 100% | 95%, 25% | 95%, 50% | 95%, 100% |
|---------|-------------|-------|----------|----------|-----------|----------|----------|-----------|----------|----------|-----------|
| 0.05 | 0.7 | 0.035 | 20,700 | 5,015 | 1,240 | 33,590 | 8,420 | 2,080 | 46,430 | 11,630 | 2,910 |
| 0.05 | 0.8 | 0.040 | 15,455 | 3,800 | 940 | 25,280 | 6,300 | 1,585 | 36,390 | 9,060 | 2,290 |
| 0.05 | 0.9 | 0.045 | 12,140 | 2,930 | 760 | 19,925 | 5,080 | 1,250 | 28,130 | 6,860 | 1,760 |
| 0.05 | 1 | 0.050 | 9,590 | 2,435 | 610 | 15,900 | 4,075 | 980 | 21,825 | 5,630 | 1,415 |
| 0.10 | 0.7 | 0.070 | 5,080 | 1,240 | 325 | 8,195 | 2,060 | 520 | 11,725 | 2,930 | 730 |
| 0.10 | 0.8 | 0.080 | 3,800 | 950 | 240 | 6,310 | 1,540 | 400 | 8,790 | 2,200 | 565 |
| 0.10 | 0.9 | 0.090 | 2,955 | 750 | 190 | 5,055 | 1,220 | 310 | 6,985 | 1,810 | 430 |
| 0.10 | 1 | 0.100 | 2,390 | 610 | 150 | 3,970 | 1,000 | 250 | 5,635 | 1,415 | 350 |
| 0.15 | 0.7 | 0.106 | 2,200 | 550 | 140 | 3,570 | 910 | 235 | 5,070 | 1,305 | 330 |
| 0.15 | 0.8 | 0.120 | 1,650 | 425 | 110 | 2,770 | 690 | 175 | 3,890 | 960 | 250 |
| 0.15 | 0.9 | 0.135 | 1,310 | 330 | 85 | 2,150 | 540 | 140 | 3,130 | 765 | 190 |
| 0.15 | 1 | 0.150 | 1,040 | 270 | 70 | 1,720 | 440 | 110 | 2,450 | 625 | 155 |
| 0.20 | 0.7 | 0.141 | 1,200 | 310 | 80 | 2,060 | 500 | 130 | 2,885 | 715 | 180 |
| 0.20 | 0.8 | 0.160 | 920 | 230 | 60 | 1,560 | 390 | 100 | 2,205 | 535 | 140 |
| 0.20 | 0.9 | 0.180 | 720 | 190 | 50 | 1,170 | 305 | 80 | 1,695 | 420 | 110 |
| 0.20 | 1 | 0.200 | 570 | 150 | 40 | 960 | 245 | 60 | 1,350 | 340 | 85 |

Note: Main effect size is 0.2, and the collinearity between the predictors is 0.1. E(β3) is the expected interaction effect size after attenuation. We also calculated POS values for COSs with widths of 10% and 75%; for brevity, they are excluded from this table and are available in the Supplemental Material available online. Negative effect sizes were not evaluated in this study. In response to reviewer feedback, we evaluated one case from this table (β3 = 0.10, reliability = 0.8, pPOS = 80%, wCOS = 50%) with the sign of the interaction effect inverted (β3 = −0.10). We found the POS to be 983, an increase of approximately 3.5% from the positively signed equivalent, which has a POS of 950. POS = point of stability; COS = corridor of stability.
CIs are often misinterpreted as indicators of stability. A confidence interval reflects within-study uncertainty around the attenuated parameter targeted by the estimator under measurement error. It does not evaluate consistency across replications the way the COS and POS do. For example, at n = 100 with ρ = 0.1, the average 95% CI for the interaction term across 10,000 simulations was [–0.11, 0.27] around an attenuated β3 ≈ 0.08, indicating substantial imprecision in a single study. By contrast, the COS and POS are design-level metrics: they quantify the proportion of replicated estimates that fall within a prespecified band around the attenuated population value. CIs summarize the precision of one estimate, whereas the COS and POS summarize the ability of the design to produce stable results. Narrow CIs do not guarantee high stability if the corridor is tighter than the typical sampling variability; conversely, a design can meet a stability criterion even when single-study CIs remain relatively wide.
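To make the contrast concrete, a sketch of the CI side of this comparison follows, reusing the hypothetical generate_interaction_data() helper from the Method sketch; the parameter values are illustrative, not the published simulation settings:

```r
# Average single-study 95% CI for the interaction at n = 100 across replications
set.seed(2)
cis <- replicate(1000, {
  d <- generate_interaction_data(n = 100, b1 = 0.2, b2 = 0.2, b3 = 0.1,
                                 r12 = 0.1, rel_x1 = 0.8, rel_x2 = 0.8)
  confint(lm(y ~ x1 * x2, data = d))["x1:x2", ]
})
rowMeans(cis)  # average lower/upper bounds: wide at this sample size
```

A stability analysis would instead ask what proportion of the point estimates across these replications lands inside a corridor around the attenuated expected effect.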
Recapture rates and sign errors
There is a moderate to high risk of sign errors when the sample size is small, the interaction effect size is small, and the predictors are less reliable. Main effects that are smaller and highly correlated further increase the risk of sign errors. In the least reliable case, defined as the combination of parameters with the highest observed POS, incorrectly signed estimates remain common well beyond typical sample sizes, and unreasonably large samples are necessary to reduce the sign-error rate to a negligible level under these conditions. By contrast, the most reliable case, defined as the combination with the lowest observed POS, has zero sign errors by modest sample sizes, and the median case, defined as the combination with median input parameters, falls between these extremes. Of all the incorrectly signed estimates, only a small fraction are statistically significant, on average. These results suggest incorrectly signed estimates are infrequent in the literature given that publication bias rewards significant results.
Likewise, the true population value (β3) is recaptured more frequently for larger interaction effect sizes with more reliable predictors. As sample size increases, the precision of the estimates of the attenuated effect improves. The difference between β3 and E(β3) becomes more accentuated as the expected effect becomes smaller because of attenuation. As sample size increases, recapture rates tend toward either zero (when the predictors are unreliable) or 100% (when the predictors are reliable). That is, when β3 lies outside the corridor around E(β3), the recapture rate shrinks toward zero as sample size increases after an initial peak. In the aforementioned case representing a realistic field study, the true value is recaptured at a low rate even at the POS. This recapture rate suggests a relatively small amount of “true” signal is detected in most interaction estimates, again attributable to their small sample sizes and susceptibility to the reliability of their predictors. Functionally, this affirms what is already known about correlations: The reliability of the predictors attenuates the observed effect size.
Alternatives
Our results accord with Murphy and Russell’s (2017) recommendation for researchers to end the search for moderators unless major improvements to their detection can be made. Estimating interaction effects necessitates at least four groups because doing so involves examining multiple levels of two independent variables simultaneously. Additive main effects are often mistaken for interactions (Vize, Baranger, et al., 2023); one cannot conclude an interaction exists simply because two variables produce a larger effect jointly than separately. In linear regression, the interaction term models a situation in which variance is explicable beyond the sum of the constituent main effects. Thus, theoretical justification for why two main effects are insufficient to explain the phenomenon is critical when testing for interactions. An atheoretical examination of all possible moderators should always be disclosed as exploratory. Given seven variables, there are 21 possible two-way interactions; at α = .05, one is likely to be significant by chance alone. At the very least, power should be calculated using thresholds consistent with the small anticipated effect sizes of interaction estimates, which are an order of magnitude smaller than the common benchmarks proposed by Cohen (1988). Cohen’s benchmarks are integrated into G*Power, and there is a substantial risk of researchers erroneously underpowering their interaction analyses because of the gulf between these standard thresholds and actual interaction effect sizes. The problematic statistical and theoretical properties of interactions in the social sciences make their analysis with linear regression difficult for most researchers. In a field in which main effects are often insufficiently powered, a post hoc test of an interaction will almost never be sufficiently powered or stable unless it uses consortium-collected data (e.g., the Adolescent Brain Cognitive Development Study).
Assuming a genuine interaction is hypothesized, we suggest that statistical approaches beyond linear regression may reduce the impact of measurement error and improve the signal-to-noise ratio inherent to small effects. Reliability affects both the point at which estimates for the interaction can be regarded as “stable” and the observable effect size because of attenuation of β3. Structural equation modeling (SEM) mitigates unreliability by separating true-score variance from measurement error with explicit latent variables, which can reduce attenuation bias, improve stability, and increase statistical power for the modeled interaction. In genomics, methods have also been developed to infer the presence of interactions even while studies remain too underpowered to detect and characterize more than a handful of them (Zhu et al., 2023). Regardless of the method employed in testing for an interaction between two variables, we stress the importance of a priori hypotheses; if the main effects do not yield results viewed as publishable, an exploratory search for interactions significantly increases the risk of erroneous findings, especially given that the vast majority of studies in psychology are overwhelmingly underpowered for such searches.
Practical recommendations
Researchers planning interaction analyses with two continuous predictors can take several concrete steps to improve the replicability of their research. First and most important, developing an interaction hypothesis and powering for its analysis using realistic interaction effect sizes (r ≤ .10) rather than Cohen’s benchmarks will help avoid the pitfalls of underpowered, exploratory interaction testing. When feasible, SEM can mitigate some of the reliability-related challenges we have identified. Under an SEM framework, stability may be achieved at sample sizes comparable with our simulated cases that had perfect reliability (Hoyle, 2012), although many of the perfectly reliable cases still require enormous sample sizes for stability. Given the substantial sample-size requirements we identify, researchers unable to achieve at least 72% power for the interaction should consider whether their research questions can be addressed through alternative approaches.
Limitations
This study had several limitations. Any attempt to quantify the stability of an estimate in a regression framework across multiple trials requires an arbitrary selection of cut points to delineate “stable” and “unstable,” which applies to our percentage-based COS and POS, even though these had advantages over their existing definitions. Furthermore, the Monte Carlo simulations relied on synthetic data that may not reflect typical social-science data in which peculiarities such as nonnormality, omitted variables, or selection bias can appear. Measurement error similarly is modeled only through predictor reliability, which is but one potential source of noise in the data among many (Loken & Gelman, 2017; Schmidt et al., 2003). Potential concerns about the applicability of linear models to nonlinear patterns remain beyond the scope of this study. Large sample sizes required for stability may be impractical in empirical research, although alternative methods, such as SEM (Hoyle, 2012), could mitigate some of the identified issues with interaction analyses.
Conclusion
Our results strongly suggest that testing two-way interaction effects in psychology field studies using a linear regression framework is largely untenable. Even under the most favorable conditions, achieving stability requires sample sizes in excess of what is common in empirical research. It follows from the results of this study and corroborating evidence from existing research (Murphy & Russell, 2017; Vize, Baranger, et al., 2023; Vize, Sharpe, et al., 2023) that published interaction effects estimated using a linear regression framework should be regarded with skepticism. Researchers should test interactions only if they have a specific interaction hypothesis, namely that the relation between two variables changes depending on the level of a third, and the study design is appropriate (sufficiently large N and reliable measures). Alternative methods to regression, such as SEM, may better address challenges related to reliability, power, and sampling variability in interaction analyses (Cole & Preacher, 2014; Hoyle, 2012; Marsh et al., 2004). Neglecting these factors risks portraying negligible interactions as stable and reliable, leading to irreplicable studies that form the backbone of misguided theories built on the artifacts of misaligned incentives and null hypothesis testing.
Transparency
Action Editor: Pamela Davis-Kean
Editor: David A. Sbarra
Author Contributions
Andrew Castillo: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Project administration; Software; Validation; Visualization; Writing – original draft; Writing – review & editing.
Colin Vize: Conceptualization; Methodology; Writing – original draft; Writing – review & editing.
David A. A. Baranger: Data curation; Formal analysis; Investigation; Methodology; Project administration; Software; Supervision; Validation; Visualization; Writing – review & editing.
Donald R. Lynam: Conceptualization; Investigation; Methodology; Project administration; Resources; Supervision; Validation; Visualization; Writing – original draft; Writing – review & editing.
References
Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect size and power in assessing moderating effects of categorical variables using multiple regression: A 30-year review. Journal of Applied Psychology, 90(1), 94.
Baranger, D. A., Finsaas, M. C., Goldstein, B. L., Vize, C. E., Lynam, D. R., & Olino, T. M. (2023). Tutorial: Power analyses for interaction effects in cross-sectional regressions. Advances in Methods and Practices in Psychological Science, 6(3), Article 25152459231187531. https://doi.org/10.1177/25152459231187531
Blumer, H. (1986). Symbolic interactionism: Perspective and method. University of California Press.
Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables. Journal of Cognition, 2(1), Article 16. https://doi.org/10.5334/joc.72
Busemeyer, J. R., & Jones, L. E. (1983). Analysis of multiplicative combination rules when the causal variables are measured with error. Psychological Bulletin, 93(3), 549.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Erlbaum.
Cole, D. A., & Preacher, K. J. (2014). Manifest variable path analysis: Potentially serious and misleading consequences due to uncorrected measurement error. Psychological Methods, 19(2), 300.
Credé, M., & Sotola, L. K. (2024). All is well that replicates well: The replicability of reported moderation and interaction effects in leading organizational sciences journals. Journal of Applied Psychology, 109(10), 1659–1667.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12(11), 671.
Cronbach, L. J., & Snow, R. E. (1981). Aptitudes and instructional methods: A handbook for research on interactions. Ardent Media.
Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Márquez, J., Gruber, B., Lafourcade, B., Leitão, P., Münkemüller, T., McClean, C., Osborne, P., Reineking, B., Schröder, B., Skidmore, A., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46.
Ferguson, C. J., & Brannick, M. T. (2012). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods, 17(1), 120–128. https://doi.org/10.1037/a0024445
Freese, J., & Peterson, D. (2017). Replication in social science. Annual Review of Sociology, 43(1), 147–165.
Fremling, L., Strauel, C., & Bognar, E. (2025). Z-curve analysis of studies involving moderation published in leading health psychology journals. Health Psychology.
Greenland, S. (1993). Basic problems in interaction assessment. Environmental Health Perspectives, 101(Suppl. 4), 59–66.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82(1), 1.
Hoyle, R. H. (Ed.). (2012). Handbook of structural equation modeling. The Guilford Press.
Hyatt, C. S., Crowe, M. L., West, S. J., Vize, C. E., Carter, N. T., Chester, D. S., & Miller, J. D. (2022). An empirically based power primer for laboratory aggression research. Aggressive Behavior, 48(3), 279–289.
Jaccard, J., & Turrisi, R. (2003). Interaction effects in multiple regression. Sage.
Jensen-Campbell, L. A., Knack, J. M., Waldrip, A. M., & Campbell, S. D. (2007). Do Big Five personality traits associated with self-control influence the regulation of anger and aggression? Journal of Research in Personality, 41(2), 403–424.
Kretzschmar, A., & Gignac, G. E. (2019). At what sample size do latent variable correlations stabilize? Journal of Research in Personality, 80, 17–22.
Lakens, D. (2019). The value of preregistration for psychological science: A conceptual analysis. Japanese Psychological Review, 62(3), 221–230.
Lewin, K., Heider, F., & Heider, G. M. (1936). Principles of topological psychology. McGraw-Hill.
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585.
MacKinnon, D. P. (2011). Integrating mediators and moderators in research design. Research on Social Work Practice, 21(6), 675–681.
Marsh, H. W., Wen, Z., & Hau, K.-T. (2004). Structural equation models of latent interactions: Evaluation of alternative estimation strategies and indicator construction. Psychological Methods, 9(3), 275–300.
McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114(2), 376–390. https://doi.org/10.1037/0033-2909.114.2.376
McShane, B. B., & Gal, D. (2017). Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519), 885–895.
Murphy, K. R., & Russell, C. J. (2017). Mend it or end it: Redirecting the search for interactions in the organizational sciences. Organizational Research Methods, 20(4), 549–573. https://doi.org/10.1177/1094428115625322
Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Dreber, A., Fidler, F., Hilgard, J., Struhl, M. K., Nuijten, M. B., Rohrer, J. M., Romero, F., Scheel, A. M., Scherer, L. D., Schönbrodt, F. D., & Vazire, S. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73(1), 719–748.
Ode, S., Robinson, M. D., & Wilkowski, B. M. (2008). Can one’s temper be cooled? A role for agreeableness in moderating neuroticism’s influence on anger and aggression. Journal of Research in Personality, 42(2), 295–311.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), Article aac4716. https://doi.org/10.1126/science.aac4716
Pease, C. R., & Lewis, G. J. (2015). Personality links to anger: Evidence for trait interaction and differentiation across expression style. Personality and Individual Differences, 74, 159–164.
Plomin, R., DeFries, J. C., & Loehlin, J. C. (1977). Genotype-environment interaction and correlation in the analysis of human behavior. Psychological Bulletin, 84(2), 309–322.
R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Rimpler, A., Kiers, H. A., & van Ravenzwaaij, D. (2025). To interact or not to interact: The pros and cons of including interactions in linear regression models. Behavior Research Methods, 57(3), 92.
Rohrer, J. M., & Arslan, R. C. (2021). Precise answers to vague questions: Issues with interactions. Advances in Methods and Practices in Psychological Science, 4(2), Article 25152459211007368. https://doi.org/10.1177/25152459211007368
Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual-differences constructs. Psychological Methods, 8(2), 206–224.
Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47(5), 609–612.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Sotola, L. K., & Credé, M. (2023). Estimating the replicability of statistically significant moderation effects in personality research using z-curve analysis. Journal of Research in Personality, 107, Article 104435. https://doi.org/10.1016/j.jrp.2023.104435
Vize, C. E., Baranger, D. A., Finsaas, M. C., Goldstein, B. L., Olino, T. M., & Lynam, D. R. (2023). Moderation effects in personality disorder research. Personality Disorders: Theory, Research, and Treatment, 14(1), 118–126. https://doi.org/10.1037/per0000582
Vize, C. E., Sharpe, B. M., Miller, J. D., Lynam, D. R., & Soto, C. J. (2023). Do the Big Five personality traits interact to predict life outcomes? Systematically testing the prevalence, nature, and effect size of trait-by-trait moderation. European Journal of Personality, 37, 605–625.
Zhu, C., Ming, M. J., Cole, J. M., Edge, M. D., Kirkpatrick, M., & Harpak, A. (2023). Amplification is the primary mode of gene-by-sex interaction in complex human traits. Cell Genomics, 3(5), Article 100297. https://doi.org/10.1016/j.xgen.2023.100297