Abstract
More often than not, real-life data may not satisfy the conditions assumed by the statistical models that are used (Blanca et al., 2013; Bono et al., 2017; Micceri, 1989; Sladekova & Field, 2024b). Especially in the context of linear regression, model assumptions are strict and not easily evaluated. Depending on their severity, violations can have dire consequences for the Type I error rate, the power, or even the bias of the parameter estimates (Field & Wilcox, 2017; Wilcox, 2022). Prebuilt packages and functions for even the most recent robust methods can be accessed through statistical programming languages such as R. Yet SPSS, which primarily supports conventional methods, remains a widely used software among applied researchers (Blanca et al., 2018; Masuadi et al., 2021). Hence, the goal of this tutorial is to present applied researchers with alternative inference methods in the scope of linear regression that are also available in IBM SPSS (Version 30). Most importantly, these methods do not require researchers to familiarize themselves with an entirely new type of analysis but simply have the benefit of potentially leading to more valid conclusions given the data at hand.
Linear Regression
Many hypotheses pertain to the relationship between two continuous variables, combined effects of multiple predictors on one outcome, or unique effects of a single variable on an outcome while statistically controlling for linear effects of other covariates. In these cases, linear regression analysis, using the ordinary-least-squares (OLS) method, is generally the method of choice. Mathematically, linear regression can be described by the following model:
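A generic form of this model (the notation here is chosen for illustration) is:

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i,
\qquad i = 1, \dots, n
```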
Here, an outcome variable is expressed as a linear combination of an intercept, one or more weighted predictor variables, and a random error term.
In most applied settings, researchers focus on whether predictors have a nonzero effect on the outcome. This is typically tested using standard t tests of the individual regression coefficients.
One problem with this approach is that when OLS assumptions are violated, the resulting p values and confidence intervals can no longer be trusted: Type I error rates may deviate from the nominal level, and statistical power may suffer.
So what conditions does OLS regression assume to produce results that are valid with respect to those two quality criteria? In OLS regression, the errors are assumed to be independent, normally distributed, and homoskedastic, that is, to have a constant variance across all levels of the predictors.
Violations of homoskedasticity will often lead to larger than anticipated (i.e., inflated) Type I error rates (Astivia & Zumbo, 2019; Cribari-Neto, 2004; Long & Ervin, 2000; Rajh-Weber et al., 2025). Simultaneously, depending on the variance pattern, the statistical power can be lower under heteroskedasticity compared with scenarios in which assumptions are not violated (Hayes & Cai, 2007; Long & Ervin, 2000; Rajh-Weber et al., 2025).
In this tutorial, we focus solely on violations of the normality and homoskedasticity assumptions. This is because in such cases, researchers can easily use alternative methods that remain within the familiar framework of OLS regression while obtaining corrected p values and confidence intervals.
Present Work
Because many applied researchers continue to rely on the functionalities provided by the SPSS software (Blanca et al., 2018; Masuadi et al., 2021), in this tutorial, we focus on robust inference methods available in IBM SPSS (Version 30). Because the performance of an inference method generally varies depending on the type and severity of the assumption violation, it is important to enable researchers to flexibly choose a method fitting a given scenario. Therefore, in this tutorial, we present eight alternative inference methods beyond the classical OLS-regression method that are accessible in SPSS: using either an HC3 or HC4 standard error for inference, or using a pairs or wild bootstrap, each in combination with either a percentile or a bias-corrected and accelerated (BCa) confidence interval or a bootstrap p value.
For best use of this tutorial, we prepared step-by-step instructions (including many screenshots and detailed elaborations of the procedures) that can be found on OSF (https://osf.io/7du4t/) alongside example data and the complete SPSS syntax to replicate the results. This setup allows the online step-by-step guides to be updated for new SPSS releases independently of this tutorial, which focuses on the general procedures. In addition, custom R functions reflecting SPSS’s functionality were created for this tutorial. R code for these custom functions, R code showing how the sample data were created, and an R tutorial file are also provided on OSF.
Example Data
To showcase the different inference methods covered in this tutorial, we simulated one example data set. This hypothetical data set contains 95 data points and four variables. The “id” variable denotes identifiers of some fictitious participants. The other three continuous variables are called “TV,” “reading,” and “focus.” In this made-up example, we are interested in how the hours spent watching TV or reading a book on an average day can predict the focus of a person, measured on some imaginary metric scale with a lower bound of 5.
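With the variables of this example, the regression model takes the form (symbols chosen here for illustration):

```latex
\text{focus}_i = \beta_0 + \beta_1\,\text{TV}_i + \beta_2\,\text{reading}_i + \varepsilon_i
```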
where we want to estimate the intercept parameter as well as the regression weights of TV and reading on focus.
Because we simulated the data ourselves, we have an advantage that we do not have in any real-life experiment: We know the true data-generating process, that is, we know the true values of the regression parameters. In particular, the data were generated such that watching TV has no true effect on the ability to focus.
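The actual generating code is provided on OSF; purely as an illustration of how such a heteroskedastic process might be set up, a sketch could look like this (all coefficients, ranges, and distributions below are hypothetical, not the authors' values):

```python
import random

def simulate_focus_data(n=95, seed=42):
    # Hypothetical data-generating process (the authors' actual code is on OSF):
    # reading has a true positive effect, TV has none, but the error spread
    # grows with TV, which produces funnel-shaped heteroskedasticity.
    rng = random.Random(seed)
    data = []
    for i in range(1, n + 1):
        tv = rng.uniform(0, 6)        # hours of TV per day (hypothetical range)
        reading = rng.uniform(0, 4)   # hours of reading per day
        error = rng.gauss(0, 1 + tv)  # error SD increases with TV
        focus = 20 + 0 * tv + 3 * reading + error  # TV's true weight is zero
        data.append({"id": i, "TV": tv, "reading": reading, "focus": focus})
    return data
```

Crucially, because the error standard deviation depends on TV, a classical OLS analysis of such data would use a misleading single error-variance estimate.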
Initial Analysis and Visualization of the Residuals
For the initial analysis, we recommend saving the model residuals from the original OLS-regression model so that a histogram or P-P plot of the (studentized) residuals can be produced (see Fig. S1 in the supporting information for SPSS provided on OSF). We also recommend inspecting a scatterplot visualizing the relationship between the (standardized) predicted values and the (studentized) residuals (Schützenmeister et al., 2012) and all the partial (residual) plots. Here, one must differentiate between two types of plots: bivariate scatterplots in which each predictor is plotted against the residuals of the model, as recommended by textbooks (see Cohen et al., 2003), and the partial plots of SPSS, in which the partial correlation between each predictor and the outcome is plotted after removing the influence of all other predictors. The former is more in line with the classic predicted-values-versus-residuals plot, in which the spread around a horizontal line intercepting the y-axis at zero is examined.
As previously discussed, the regression assumptions about homoskedasticity and normality pertain to the (usually) unknown errors. Because these errors cannot be observed directly, the model residuals serve as their empirical counterparts when checking the assumptions.

Visual inspection of the model residuals. (a) Histogram of the model residuals, (b) normal P-P plot of model residuals, (c) scatterplot of the predicted values against the model residuals, (d) scatterplot of the values of predictor TV against the model residuals, and (e) scatterplot of the values of predictor reading against the model residuals.
In our example data, the distribution of the residuals does not appear normal. Instead, it looks like there are a lot of data points close to the mean of zero and quite a few others spread out much farther than would be expected under a normal distribution. This is also referred to as “heavy tails,” reflected in the difference between expected and observed probabilities for low and high values in Figure 1b.
Homoskedasticity, or constant variance of the errors for all levels of the set of predictors, can be visualized through multiple scatterplots. In the scatterplot of the predicted values against the residuals, the data points are thus expected to be evenly distributed around zero for all levels of the predicted values (see Fig. 1c). This is not the case in our example because we can see that the data points spread farther apart for small (negative) predicted values and spread less for larger predicted values. Again, in a model with perfectly homoskedastic errors, one would expect to see the data points equally spread out around the zero line for all levels of the predictor in the partial (residual) plots as well. In our example, for the predictor TV, a clear funnel shape is visible (see Fig. 1d). For the predictor reading, no distinct shape can be observed (see Fig. 1e), but the residual variance seems to be larger around average reading values, hinting at an inverse butterfly shape (Sladekova & Field, 2024a). Based on the visual inspection of the sample data alone, we can see that using the classical inference method for the unique effect of TV on focus, maybe even reading on focus, is likely not going to be valid with respect to power and Type I error rate.
Note that significance tests of assumption violations, such as the Breusch-Pagan test for heteroskedasticity or the Shapiro-Wilk test for nonnormality, also exist and are easily accessible in SPSS. However, consistent with many sources (Field & Wilcox, 2017; Long & Ervin, 2000; Sanchis-Segura & Wilcox, 2024), we do not recommend relying solely on significance tests of assumption violations.
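To illustrate what such a test computes, here is a sketch of a studentized (Koenker-type) Breusch-Pagan statistic for a single predictor (pure Python, hypothetical function names; SPSS and R provide ready-made implementations):

```python
import math

def ols_fit(x, y):
    # closed-form OLS for a simple regression y = b0 + b1 * x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b1 * mx, b1

def breusch_pagan(x, y):
    # Koenker's studentized Breusch-Pagan test, single predictor:
    # regress the squared residuals on x; under homoskedasticity,
    # LM = n * R^2 follows a chi-square distribution with 1 df.
    n = len(x)
    b0, b1 = ols_fit(x, y)
    e2 = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]
    mx, me = sum(x) / n, sum(e2) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    see = sum((ei - me) ** 2 for ei in e2)
    sxe = sum((xi - mx) * (ei - me) for xi, ei in zip(x, e2))
    r2 = sxe ** 2 / (sxx * see) if see > 0 else 0.0
    stat = n * r2
    p = 1.0 - math.erf(math.sqrt(stat / 2.0))  # chi-square(1) survival function
    return stat, p
```

A large statistic (small p value) signals that the residual spread varies systematically with the predictor, which is exactly why such tests should complement, not replace, the visual checks described above.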
HC Standard Errors
To perform a significance test for a linear regression model, the uncertainty around the estimated regression coefficient must be quantified through its standard error. The classical inference method in OLS computes a valid standard error only when it can be assumed that among other things (see Berry, 1993), the errors are homoskedastic and normally distributed.
Unlike classical standard errors in OLS regression, HC standard errors do not assume homoskedasticity. Instead of using a single estimate of the error variance for computing the standard errors, HC standard errors use the information contained in the variability of the residuals. In their first version (today known as HC0), the squared residuals of the individual cases were used directly to estimate the variance of the coefficient estimates.
Today, there are multiple versions of HC standard errors, with HC0 through HC4 being the best known; these are readily available in many software programs. HC0, HC1, and HC2 showed increased Type I error rates compared with the newer versions (Long & Ervin, 2000). Therefore, the versions often recommended (Cribari-Neto, 2004; Hayes & Cai, 2007) and demonstrated in this tutorial are HC3 and HC4.
The HC3 and HC4 standard errors are usually preferred because they do not simply use the raw residuals but instead transform them based on each case's leverage, thus accommodating differences in influence. Both HC3 and HC4 standard errors have been found to result in Type I error rates closer to the nominal value (e.g., 5%) even under strong heteroskedasticity (Cribari-Neto, 2004; Long & Ervin, 2000; Rajh-Weber et al., 2025).
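To make the computation concrete for the single-predictor case, the following sketch contrasts the classical and the HC3 standard error of the slope (pure Python, hypothetical helper names; real analyses would rely on SPSS or the R functions on OSF):

```python
def ols_fit(x, y):
    # closed-form OLS for a simple regression y = b0 + b1 * x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b1 * mx, b1

def classical_se_slope(x, y):
    # textbook OLS standard error: one pooled error-variance estimate
    n = len(x)
    b0, b1 = ols_fit(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return (sse / (n - 2) / sxx) ** 0.5

def hc3_se_slope(x, y):
    # HC3 sandwich standard error: each case contributes its own squared
    # residual, inflated by 1 / (1 - leverage)^2
    n = len(x)
    b0, b1 = ols_fit(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    var = 0.0
    for xi, yi in zip(x, y):
        e = yi - (b0 + b1 * xi)            # raw residual of this case
        h = 1 / n + (xi - mx) ** 2 / sxx   # leverage of this case
        c = (xi - mx) / sxx                # weight of this case in the slope
        var += c ** 2 * (e / (1 - h)) ** 2
    return var ** 0.5
```

Under homoskedasticity the two standard errors are very similar; when the residual spread varies with the predictor, they diverge, and the HC3 value is the more trustworthy one.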
How to run the analysis in IBM SPSS (Version 30)
So far, SPSS does not allow for the computation of HC standard errors via the linear-regression window (“Analyze >> Regression >> Linear”). Currently, however, there are two different ways to still obtain the desired standard errors and associated p values.
One way is to specify the regression model via a window typically used for analysis of variance (“Analyze >> General Linear Model >> Univariate”; see Fig. S5 on OSF) and to categorize all continuous predictors as “covariates.” In this menu, the versions available for robust standard errors are HC0 up to HC4. Note that using any HC standard error via the analysis-of-variance window will result in robust inference only for the estimated regression coefficients. The overall null hypothesis that all regression coefficients are simultaneously zero is still tested in the classical, nonrobust way.
If a robust
Bootstrap Methods
Compared with the robust-standard-error methods, the bootstrap methods rely on a different approach that does not assume a specific shape for the distribution of a test statistic. Instead, the theoretical distribution of regression coefficients that would result if samples of size n were repeatedly drawn from the population is approximated empirically by resampling from the observed data.
Once the bootstrap sampling distribution is obtained by either method, it can be used for statistical inference. For instance, the bootstrap sampling distribution can be used to compute confidence-interval limits or p values for the regression coefficients.
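Given any collection of bootstrap coefficient estimates, both quantities can be computed schematically as follows (illustrative helper names; SPSS performs these computations internally):

```python
def percentile_ci(estimates, alpha=0.05):
    # percentile bootstrap CI: cut off alpha/2 of the bootstrap
    # estimates in each tail of the sorted bootstrap distribution
    s = sorted(estimates)
    b = len(s)
    lower = s[int(b * alpha / 2)]
    upper = s[int(b * (1 - alpha / 2)) - 1]
    return lower, upper

def bootstrap_p(estimates, null_value=0.0):
    # two-sided bootstrap p value for H0: parameter == null_value,
    # approximated as twice the smaller tail proportion
    b = len(estimates)
    below = sum(e <= null_value for e in estimates) / b
    above = sum(e >= null_value for e in estimates) / b
    return min(1.0, 2.0 * min(below, above))
```

Because the confidence interval and the p value summarize the same bootstrap distribution in different ways, they usually, but not necessarily, lead to the same binary decision.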
Pairs bootstrap
The pairs bootstrap approximates the theoretical distribution of regression coefficients by drawing many bootstrap samples of size n with replacement from the observed data and reestimating the regression coefficients in each bootstrap sample.
The name “pairs bootstrap” is meant to convey that entire cases, that is, pairs of predictor(s) and outcome, are resampled in the bootstrap process (Flachaire, 2005). In SPSS, this sampling method is known as “simple.” The wild bootstrap, a different resampling method, introduces randomness only through a transformation of the residuals and is discussed in the next section.
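For a single predictor, the resampling logic can be sketched in a few lines (a pure-Python illustration with hypothetical function names; SPSS performs the equivalent computation internally):

```python
import random

def ols_slope(x, y):
    # OLS slope of a simple regression y = b0 + b1 * x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx

def pairs_bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=1):
    # pairs ("simple") bootstrap: resample entire cases with replacement,
    # refit the model each time, take percentile cutoffs of the slopes
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([x[i] for i in idx],
                                [y[i] for i in idx]))
    slopes.sort()
    return (slopes[int(n_boot * alpha / 2)],
            slopes[int(n_boot * (1 - alpha / 2)) - 1])
```

Because whole cases are resampled, any dependence between the predictor values and the error spread is carried into every bootstrap sample, which is what makes the method robust to heteroskedasticity.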
How to run the analysis in IBM SPSS (Version 30)
The pairs-bootstrap method can currently be accessed in SPSS via “Analyze >> Regression >> Linear . . .” (see Fig. S9 on OSF). In this menu, SPSS performs a pairs bootstrap as described above if the keyword “simple” is selected from the bootstrap-sampling options. Moreover, either the percentile or the BCa method can be selected for the bootstrap confidence interval. A bootstrap p value for each coefficient is reported in the output as well.
Wild bootstrap
Some bootstrap methods, such as the wild bootstrap, were specifically developed to counteract problems with heteroskedasticity in regression models (Chernick & LaBudde, 2011; MacKinnon, 2006). In contrast to the pairs bootstrap, the wild bootstrap does not resample cases but only adds some random perturbation to the residuals. In particular, a regression model is fit to the original data, and the residuals (optionally transformed; see below) from this model are saved. Then, in each bootstrap iteration, each residual is multiplied by a random number drawn from a distribution with mean 0 and variance 1. These new residuals are used to compute a new outcome variable, which is then used to fit a new linear-regression model using the original set of predictors. Repeating this procedure results in a distribution of regression coefficients that approximates the respective parameter’s sampling distribution. For further details, see, for example, MacKinnon (2013) or Rajh-Weber et al. (2025).
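In code, the wild-bootstrap loop can be sketched as follows for a single predictor (a pure-Python illustration with hypothetical helper names; it uses leverage-adjusted residuals and Rademacher sign flips, one common choice for the mean-0, variance-1 perturbation):

```python
import random

def ols_fit(x, y):
    # closed-form OLS for a simple regression y = b0 + b1 * x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b1 * mx, b1

def wild_bootstrap_slopes(x, y, n_boot=2000, seed=7):
    # wild bootstrap: keep the design fixed, perturb only the residuals
    rng = random.Random(seed)
    n = len(x)
    b0, b1 = ols_fit(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    fitted = [b0 + b1 * xi for xi in x]
    # leverage-adjusted ("deleted") residuals, e_i / (1 - h_ii)
    resid = [(yi - fi) / (1 - (1 / n + (xi - mx) ** 2 / sxx))
             for xi, yi, fi in zip(x, y, fitted)]
    slopes = []
    for _ in range(n_boot):
        # Rademacher weights: random signs with mean 0 and variance 1
        ystar = [fi + ei * rng.choice((-1.0, 1.0))
                 for fi, ei in zip(fitted, resid)]
        mys = sum(ystar) / n
        slopes.append(sum((xi - mx) * (yi - mys)
                          for xi, yi in zip(x, ystar)) / sxx)
    return slopes
```

Because each case keeps its own residual magnitude, the variance pattern of the original data is preserved in every bootstrap sample, mirroring the logic of the HC standard errors.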
How to run the analysis in IBM SPSS (Version 30)
Running the wild-bootstrap procedure in IBM SPSS (Version 30) is currently slightly more complicated than for the pairs-bootstrap procedure. First, a regression model must be fit to the original data to acquire the residuals required in the next step. Here, we recommend saving the deleted residuals instead of the unstandardized residuals. This is a transformation of the residuals using leverage values that mirrors the HC3 procedure (MacKinnon, 2013).
In a second step, another regression analysis must be performed, now using the previously saved deleted residuals for the wild-bootstrap procedure (see Fig. S10 on OSF). A step-by-step guide on how to implement this in SPSS is provided in the supporting material on OSF. For R users, this two-step process was automated in the custom R functions provided in the additional resources for this tutorial on OSF (lm_wild_p, lm_wild_percentile, and lm_wild_bca).
Comparison of the Results
For the
Comparison of All Considered Methods Regarding
Note: For values formatted in bold, the null hypothesis would be rejected at the .05 significance level. CI = confidence interval; HC = heteroskedasticity-consistent standard error; BCa = bias-corrected and accelerated.
In fact, a
To compare all methods, a common binary conclusion scheme is used in which the null hypothesis that no effect exists in the population is either rejected or not rejected. The null hypothesis is rejected if the p value is smaller than .05 or, equivalently, if the 95% confidence interval does not include zero.
If only the classical-inference approach were considered, the conclusion would be to reject the null hypothesis for the unique effect of TV on focus and of reading on focus in this example. Thus, one would (cautiously) infer that watching TV has a negative effect on the ability to focus, for a constant level of reading, because the 95% confidence interval ranges from −4.92 to −0.65 and would further conclude that reading has a positive effect on the ability to focus, for a constant level of watching TV, because the 95% confidence interval ranges from 0.46 to 5.04. However, as mentioned before when describing the example data, we know for a fact that in truth, watching TV has no effect on the ability to focus because this is how the data were generated. This means that basing our conclusions on the classical method would lead us to falsely reject the null hypothesis for this effect in this example (i.e., a Type I error).
Based on the visual inspection of the scatterplots, we already suspected that there might be some heteroskedasticity present, specifically, heteroskedasticity related to the predictor TV. Indeed, both the HC3 and the HC4 methods would instead lead us to not reject the null hypothesis for the unique effect of TV on focus. For the unique effect of reading on the ability to focus, HC3 and HC4 agree with the classical method to reject the null hypothesis. Likewise, most of the bootstrap methods, except for the wild bootstrap with the BCa confidence interval, would also encourage us to reject the null hypothesis only for the effect of reading and not for watching TV.
Discussion
The goal of this tutorial was to inform applied researchers that robust standard errors and different types of bootstrap methods are viable alternatives to classical inference that are readily available in commercial software such as SPSS.
All the methods presented here were recently tested for a variety of data scenarios, including different combinations of heteroskedasticity and nonnormality. In many instances, the HC standard errors and the wild-bootstrap-resampling method (especially combined with the percentile confidence interval) were shown to perform satisfactorily (Rajh-Weber et al., 2025) regarding both Type I error rate and power. However, apart from computer simulations, the true data-generating processes are hardly ever known in real-life settings. Generally, there are an infinite number of combinations of assumption violations by type and degree, so deducing which method delivers the most valid results for a specific observed data situation is virtually impossible. Here, simulation studies that assess a variety of data scenarios can help with the choice of some methods over others for at least some specific scenarios (Cribari-Neto, 2004; Long & Ervin, 2000; Rajh-Weber et al., 2025).
A limitation of the methods presented here is that they affect only the inference associated with the regression coefficients, not the estimation of the coefficients themselves. These methods are suitable under nonnormality and/or heteroskedasticity, but different estimators should be sought out when dealing with outliers, for example (Wilcox, 2022).
Practical Recommendations
As often recommended in the literature (Field & Wilcox, 2017; Wagenmakers et al., 2021; Wilcox, 2022), we encourage researchers to familiarize themselves with their own data by assessing assumption violations visually (at least in addition to commonly used significance tests) and to compare the results of classical and robust methods.
Even though the desire for definite guidelines in this context is understandable, it is not possible to recommend one method over all others without further insight into the specific research scenario. Still, some general recommendations regarding the methods covered in this tutorial might be summarized as follows: (a) Use an HC standard error if there are signs of or if one wants to protect against heteroskedasticity. Out of the available HC standard errors, HC3 or HC4 should be preferred over HC0 to HC2 (Cribari-Neto, 2004; Hayes & Cai, 2007). HC4 is especially recommended if a few cases exhibit high leverage (Hayes & Cai, 2007), but HC4 is not always better than HC3 (MacKinnon, 2013). (b) Bootstrap methods can provide results that are free of distributional assumptions. Both the pairs and the wild bootstrap have been shown to also work well under heteroskedasticity but should be combined with the percentile instead of the BCa confidence interval (Rajh-Weber et al., 2025). Inference based on bootstrap confidence intervals and inference based on bootstrap p values need not always coincide because the two are computed differently.
For working with commercial software such as SPSS, good practice also means making sure analyses are reproducible by using seeds for bootstrap methods or saving SPSS syntax files and data sets in nonproprietary forms (e.g., txt and csv, respectively). In addition, being transparent about the types of analyses performed and reporting if and how they differed are generally good practice.
Finally, discussing why different methods produced different results can give insight into potential processes that may generate the data. To paraphrase Ly et al. (2020), if different methods agree with each other, the confidence in the conclusions is strengthened; if they clash, one’s confidence may be weakened, but the sensitivity of results on methodological choices may itself convey valuable information for the research problem at hand. “Either way, something useful has been learned” (Ly et al., 2020, p. 160).
Acknowledgements
The example data, the SPSS syntax, and the R code, including all methods as custom functions and an R markdown tutorial, can be found on OSF: https://osf.io/7du4t/. The article has also been uploaded to OSF as a preprint.