Introduction
Large
This type of data structure is common in macro-level research that analyzes cross-national or meso-level (states and provinces) data. For instance, the World Bank’s (2021) World Development Indicators database extends back to 1960 in many cases, as do state-level energy and income data in the United States (Bureau of Economic Analysis 2021; U.S. Energy Information Administration 2021). Although these data are typically annual, more granular monthly, weekly, or daily data are available in some cases (e.g., financial, employment, and industry-level data).
As interest in big data increases and the temporal depth of widely used data sets expands, large
To fill this gap, this article coalesces the literature on large
The article is separated into four sections. The first section outlines the modeling issues to consider with large
Modeling Large N, Large T Panel Data
In this section, I first outline fixed effects estimation of a static model, which serves as the workhorse modeling approach in sociology. I then review dynamic models and the implications of dynamic misspecification. Next, I outline the problems that slope heterogeneity and strong cross-sectional dependence pose for the fixed effects estimator, and I introduce the common correlated effects estimator, which can address these concerns.
The Workhorse: Fixed Effects Estimation of a Static Model
Fixed Effects estimation of a static model (Equation 1) is the most commonly used approach to analyzing large
where
Fixed effects estimation can occur in two ways (Allison 2009). The first way is the mean deviation approach, which computes the case-specific mean of the dependent variable and of each independent variable and subtracts that mean from each observed value:
and regressing
The second way fixed effects can be estimated is by including a dummy variable for each case, which is also known as the least squares dummy variable (LSDV) estimator (Allison 2009). This method allows a fixed effect coefficient to be estimated for each case (the mean deviation approach does not allow for this), but when the model includes an intercept, the coefficient for one unit must be constrained to zero, so that unit's dummy is dropped and the unit serves as the reference category. The LSDV estimator and the mean deviation approach produce equivalent estimates of the slope coefficients.
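The equivalence of the two approaches can be checked numerically. The following is a minimal sketch, assuming a toy balanced panel with made-up numbers and a single regressor; the helper functions `solve`, `ols`, and `demean` are illustrative names written for this example, not part of any package used in the article.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def demean(values):
    """Subtract the unit-specific mean from each observation."""
    m = sum(values) / len(values)
    return [v - m for v in values]


# Hypothetical balanced panel: 3 units observed over 4 periods.
x = {"A": [1.0, 2.0, 3.0, 5.0], "B": [2.0, 2.5, 4.0, 4.5], "C": [0.5, 1.5, 2.0, 3.0]}
y = {"A": [3.1, 4.9, 7.2, 11.0], "B": [9.8, 11.1, 14.0, 15.2], "C": [-1.9, 0.2, 0.9, 3.1]}
units = sorted(x)

# Mean deviation approach: regress demeaned y on demeaned x.
x_dm = [v for u in units for v in demean(x[u])]
y_dm = [v for u in units for v in demean(y[u])]
beta_within = sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)

# LSDV: intercept, x, and a dummy for every unit except the first (the reference).
rows, ys = [], []
for u in units:
    for xv, yv in zip(x[u], y[u]):
        dummies = [1.0 if u == other else 0.0 for other in units[1:]]
        rows.append([1.0, xv] + dummies)
        ys.append(yv)
beta_lsdv = ols(rows, ys)[1]

print(beta_within, beta_lsdv)
```

The two slope estimates agree up to floating-point error, which is the Frisch-Waugh-Lovell result underlying the statement that the approaches are equivalent.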
Fixed effects estimation of a static model has several limitations that can make it perform poorly. First, researchers tend to give little consideration to whether the static model is appropriate—often using it by default. Thus, many models are potentially misspecified due to incorrectly modeling the lag structure (i.e., not including the appropriate number of lags of the dependent and independent variables; De Boef and Keele 2008; Plümper and Troeger 2019). Second, even if the dynamics are properly specified, fixed effects estimation of a dynamic model in the presence of slope heterogeneity produces inconsistent estimates because the slope heterogeneity gets captured in the error term (Pesaran and Smith 1995). Third, failing to appropriately model cross-sectional dependence can lead to inconsistent and biased estimates because the cross-sectional dependence will become part of the error term and is likely correlated with the regressors (Pesaran 2006, 2015).
Dynamic Models and Dynamic Misspecification
Although the static model is commonly used, it is often statistically and theoretically inappropriate (Thombs, Huang, and Fitzgerald 2022). The static model treats any autocorrelation as a nuisance (e.g., measurement error) rather than as part of the data generating process (i.e., as a sign that the estimated model is incorrect). When autocorrelation (and other issues like heteroskedasticity and cross-sectional dependence) arises, researchers often use a robust variance estimator such as clustered robust standard errors or panel corrected standard errors, and/or a generalized least squares transformation such as Prais-Winsten (Beck and Katz 1995). However, if these issues are actually part of the data generating process, such corrections do little to mitigate their impact on estimation because the estimated model is not changed. Panel data are very likely to be autoregressive given that we expect social processes to be a function of the past. It would be unusual for any social theory to argue that history does not matter, which is what a static model assumes.
To see why applying a standard error correction to a static model likely fails to appropriately model the true data generating process, we can start by discussing how standard errors are estimated with ordinary least squares (OLS):
OLS assumes that the variance (
where
Using a Prais-Winsten transformation is another common way to account for autocorrelation. This approach works by first estimating the autocorrelation in the residuals of a static model:
then applying the Cochrane-Orcutt transformation to observations for
Then, to preserve the first time period, the following transformation is done to
The Prais-Winsten transformation is commonly used in tandem with panel corrected standard errors because panel corrected standard errors on their own do not correct for autocorrelation (e.g., Jorgenson and Clark 2012; Kelly 2020; Kollmeyer and Peters 2019).
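The steps of the Prais-Winsten transformation can be sketched for a single time series as follows. This is a minimal illustration under stated assumptions: the autocorrelation parameter is estimated with one common estimator (a no-intercept regression of the residual on its lag), the function names are hypothetical, and in a panel the transformation would be applied within each unit.

```python
import math


def estimate_rho(resid):
    """Slope from regressing e_t on e_{t-1} without an intercept (one common estimator)."""
    num = sum(e * e_lag for e, e_lag in zip(resid[1:], resid[:-1]))
    den = sum(e_lag ** 2 for e_lag in resid[:-1])
    return num / den


def prais_winsten(series, rho):
    """Quasi-difference periods 2..T and rescale period 1 so it is not dropped."""
    first = math.sqrt(1.0 - rho ** 2) * series[0]   # Prais-Winsten treatment of t = 1
    rest = [cur - rho * prev for cur, prev in zip(series[1:], series[:-1])]  # Cochrane-Orcutt step
    return [first] + rest


# Made-up residuals from a static model; they decay by half each period.
resid = [1.0, 0.5, 0.25, 0.125]
rho = estimate_rho(resid)

# The same transformation is applied to the dependent and independent variables.
y_star = prais_winsten([2.0, 4.0, 6.0], rho)
print(rho, y_star)
```

Note that the transformation changes the variables that enter the regression but leaves the static specification itself unchanged.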
The underlying assumption of clustered robust standard errors (or panel corrected standard errors) and the Prais-Winsten transformation is that the estimated model is specified correctly and any heteroskedasticity or correlation in the errors are nuisances to correct for. However, as Plümper and Troeger (2019:19) lucidly argue, “This error structure exists not because nature invented a complex error process that ought to be controlled away, but because of a dynamic misspecification in the underlying data-generating process.” In other words, rather than treating autocorrelation (or other correlations in the errors) as a nuisance to get rid of, the presence of these issues is likely telling us something about our model. Thus, in the case of autocorrelation, researchers can explicitly model it by using a dynamic model. Dynamic models model the memory of a time series, where effects are distributed over time. In contrast to static models, dynamic models estimate both short-run (immediate) effects
The defining feature of a dynamic model is that it includes the lag of the dependent variable. The most basic dynamic model is the lagged dependent variable (LDV) model, which is also referred to as the partial adjustment model or the Koyck model (De Boef and Keele 2008):
In addition to the LDV, lags of the independent variables can be included, which is referred to as an autoregressive distributed lag (ARDL) model (see Table 1 for a summary of the different dynamic models). The number of lagged dependent variables in the model is denoted as
The ARDL model can also be written as the algebraically equivalent error-correction model (ECM), where the contemporaneous values are modeled in first differences and the lags are in levels:
Equations 9 through 11 are algebraically equivalent to one another.
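The equivalence between the ARDL and ECM forms can be verified numerically. The sketch below assumes an ARDL(1,1) without fixed effects or an error term; the symbols `phi` (autoregressive coefficient), `b0` (coefficient on the contemporaneous independent variable), and `b1` (coefficient on its lag) are notation chosen for this example and may differ from the notation in the article's equations.

```python
def simulate_ardl(phi, b0, b1, x, y0):
    """ARDL(1,1) in levels: y_t = phi*y_{t-1} + b0*x_t + b1*x_{t-1}."""
    y = [y0]
    for t in range(1, len(x)):
        y.append(phi * y[-1] + b0 * x[t] + b1 * x[t - 1])
    return y


def simulate_ecm(phi, b0, b1, x, y0):
    """ECM form: dy_t = (phi - 1)*y_{t-1} + b0*dx_t + (b0 + b1)*x_{t-1}."""
    y = [y0]
    for t in range(1, len(x)):
        dy = (phi - 1.0) * y[-1] + b0 * (x[t] - x[t - 1]) + (b0 + b1) * x[t - 1]
        y.append(y[-1] + dy)
    return y


phi, b0, b1 = 0.5, 0.3, 0.2
x = [0.0] + [1.0] * 60          # x rises permanently by one unit in period 1

y_ardl = simulate_ardl(phi, b0, b1, x, y0=0.0)
y_ecm = simulate_ecm(phi, b0, b1, x, y0=0.0)

# Long-run effect implied by both forms: (b0 + b1) / (1 - phi).
long_run = (b0 + b1) / (1.0 - phi)
print(y_ardl[1], y_ardl[-1], long_run)
```

Both recursions trace the same path, and the dependent variable converges to the long-run effect. In the ECM form the coefficient on the lagged dependent variable is `phi - 1`, which is why a highly persistent series shows a value near zero in the ECM and near unity in the ARDL.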
Dynamic Models.
Estimating the Distribution of the Long-Run Effect
Researchers can calculate the distribution of the long-run effect over time by using the model estimates. In the calculation of the long-run effects, the denominator tells us how quickly the outcome variable adjusts over time given a shock to an independent variable (known as the speed of adjustment). In the LDV and ARDL models, an autoregressive coefficient near unity (or, in the ECM, a coefficient on the lagged dependent variable near zero) indicates that the dependent variable is highly persistent, and therefore the long-run effect is distributed over a longer period of time. In contrast, an autoregressive coefficient near zero indicates that most of the effect occurs immediately.
We can manipulate the autoregressive coefficient in Equation 9 to illustrate this point. In a case where the dependent variable is highly persistent,
If

Distribution of the long-run effect over time by autoregressive coefficient.
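The relationship between persistence and the timing of the long-run effect can be worked through with a short calculation. The sketch assumes an LDV model and a permanent one-unit change in the independent variable, in which case the share of the long-run effect realized after n periods (counting the period of the change) is 1 minus the autoregressive coefficient raised to the power n. The function names and the 95 percent threshold are illustrative choices.

```python
def cumulative_share(phi, periods):
    """Share of the long-run effect realized after 1, 2, ..., `periods` periods."""
    return [1.0 - phi ** n for n in range(1, periods + 1)]


def periods_to_reach(phi, share=0.95):
    """Number of periods needed for the cumulative effect to reach `share` of the long-run effect."""
    n = 1
    while 1.0 - phi ** n < share:
        n += 1
    return n


for phi in (0.1, 0.5, 0.9):
    print(phi, cumulative_share(phi, 3), periods_to_reach(phi))
```

With a coefficient of 0.1, 90 percent of the effect occurs in the first period, whereas with a coefficient of 0.9 only 10 percent does and 29 periods are needed to reach 95 percent.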
Slope Heterogeneity
Slope heterogeneity, in which the effect of a variable differs across units (e.g., the effect of a variable in a cross-national study is different in the United States than in Nigeria), is theoretically and statistically relevant for sociologists. In terms of theory testing, many macro-level sociological theories make arguments regarding heterogeneity, so it must be modeled to test those hypotheses appropriately. Statistically, slope heterogeneity makes the fixed effects estimator (and other pooled estimators) problematic to use.
We can see this by considering the following dynamic model with heterogeneous slopes that is estimated by pooled, fixed effects regression:
Regardless of whether the lags are properly specified, fixed effects estimation of a dynamic model in the presence of slope heterogeneity produces inconsistent estimates because the heterogeneity gets captured in the error term (Pesaran and Smith 1995).
In a static model with strictly exogenous regressors, the fixed effects estimator will produce unbiased and consistent (but inefficient) estimates in the presence of slope heterogeneity (Pesaran and Smith 1995). Even if
To model slope heterogeneity, Pesaran and Smith (1995) propose the mean group (MG) estimator. The MG estimator runs a separate time-series regression for each unit and then takes the mean of the unit-specific estimates. MG estimation of the common ARDL(1,1) model is as follows:
where the mean group coefficients are the average of the unit-specific coefficients (Ditzen 2018; Thombs et al. 2022):
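The averaging step of the MG estimator can be sketched as follows. For brevity, the example assumes a static model with one regressor per unit rather than the ARDL(1,1), uses made-up data without noise, and computes the standard error from the dispersion of the unit-specific slopes, which is the usual nonparametric variance estimator for mean group coefficients. The function names are hypothetical.

```python
import math


def unit_slope(x, y):
    """OLS slope (with intercept) from a single unit's time series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den


def mean_group(slopes):
    """MG coefficient and its standard error based on cross-unit dispersion."""
    n = len(slopes)
    mg = sum(slopes) / n
    var = sum((b - mg) ** 2 for b in slopes) / (n * (n - 1))
    return mg, math.sqrt(var)


# Three units with different intercepts and slopes (1, 2, and 3).
x = [1.0, 2.0, 3.0, 4.0]
panel = {
    "A": [0.5 + 1.0 * v for v in x],
    "B": [-1.0 + 2.0 * v for v in x],
    "C": [2.0 + 3.0 * v for v in x],
}

slopes = [unit_slope(x, panel[u]) for u in sorted(panel)]
mg, se = mean_group(slopes)
print(slopes, mg, se)
```

Because each unit receives its own regression, the unit-specific slopes remain available for inspection, which is useful when heterogeneity is itself of theoretical interest.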
Cross-Sectional Dependence
Addressing cross-sectional dependence (cross-sectional units are correlated with one another) with large
Traditionally, cross-sectional dependence is dealt with using time fixed effects. Time fixed effects model the cross-sectional dependence by including
Another approach is to use panel corrected standard errors, which, like robust standard errors, adjust the standard errors but not the estimated coefficients. A fundamental difference between the two is that clustered robust standard errors assume that errors are not correlated across units, whereas panel corrected standard errors allow for contemporaneous correlation across units. Beck and Katz (1995) first developed panel corrected standard errors and showed that they performed favorably compared with feasible generalized least squares, which was commonly used at the time. Assuming there are balanced data, panel corrected standard errors are estimated as follows:
where Ω
where
However, treating the cross-sectional dependence as a nuisance in the error term suffers from the same problems discussed in the previous section regarding autocorrelation. These approaches are sufficient if the cross-sectional dependence is truly part of the error term, but they will not adequately model the dependence if it is truly part of the data generating process. Given these considerations, we can turn to the common correlated effects estimator that offers a way to directly model the cross-sectional dependence.
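To make the logic of panel corrected standard errors concrete, the following sketch computes the sandwich variance for the simplest possible case: a balanced panel, one regressor, and no intercept. It assumes the Beck and Katz (1995) estimate of the contemporaneous covariance between units i and j, namely the sum over time of the product of their residuals divided by T. The function name, the `cross_unit` switch, and the data are illustrative.

```python
def pcse_variance(x, resid, cross_unit=True):
    """Sandwich variance of a single OLS slope in a balanced panel.

    x, resid: one list per unit, each of length T.
    cross_unit=False zeroes the covariances between different units.
    """
    n, t = len(x), len(x[0])
    # Estimated N x N contemporaneous covariance matrix of the errors.
    sigma = [[sum(resid[i][s] * resid[j][s] for s in range(t)) / t
              for j in range(n)] for i in range(n)]
    sxx = sum(v * v for unit in x for v in unit)            # X'X
    middle = 0.0                                            # X' (Sigma kron I_T) X
    for i in range(n):
        for j in range(n):
            if cross_unit or i == j:
                middle += sigma[i][j] * sum(x[i][s] * x[j][s] for s in range(t))
    return middle / sxx ** 2


x = [[1.0, 2.0], [1.0, 2.0]]
resid_same = [[1.0, -1.0], [1.0, -1.0]]     # residuals move together across units

var_pcse = pcse_variance(x, resid_same)
var_no_cross = pcse_variance(x, resid_same, cross_unit=False)
print(var_pcse, var_no_cross)
```

In this toy case, allowing for cross-unit correlation doubles the estimated variance, yet the slope estimate and the residuals are untouched, which illustrates why the correction cannot remove dependence that belongs to the data generating process.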
Common Correlated Effects
Pesaran (2006) advanced the common correlated effects estimator to eliminate the effects of the unobserved common factors by using cross-sectional averages with OLS. This method contrasts with other common approaches that rely on correcting standard errors rather than explicitly modeling the cross-sectional dependence. The common correlated effects estimator allows these factors to affect each unit differently (Pesaran 2006). Assuming a model with heterogeneous slopes with one unobserved common factor, the estimator is based on Equation 17:
where
where
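The idea of using cross-sectional averages as proxies for an unobserved common factor can be illustrated with a small simulation. The sketch below is a static, simplified version of the common correlated effects mean group approach under strong assumptions chosen for clarity: one common factor, a homogeneous slope of 2, unit-specific factor loadings, and no idiosyncratic error in the outcome, so the augmented regressions recover the slope exactly. All names, loadings, and the random seed are hypothetical.

```python
import random


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


rng = random.Random(0)
N, T = 3, 12
beta = 2.0                      # true slope, homogeneous in this toy example
gamma = [0.5, 1.0, 1.5]         # unit-specific loadings of the factor in y
lam = [1.0, -0.5, 0.8]          # unit-specific loadings of the factor in x

f = [rng.gauss(0, 1) for _ in range(T)]                      # unobserved common factor
x = [[rng.gauss(0, 1) + lam[i] * f[t] for t in range(T)] for i in range(N)]
y = [[beta * x[i][t] + gamma[i] * f[t] for t in range(T)] for i in range(N)]

# Cross-sectional averages for each period.
ybar = [sum(y[i][t] for i in range(N)) / N for t in range(T)]
xbar = [sum(x[i][t] for i in range(N)) / N for t in range(T)]

# Naive unit regressions ignore the factor; CCE regressions add the averages.
naive = [ols([[1.0, x[i][t]] for t in range(T)], y[i])[1] for i in range(N)]
cce = [ols([[1.0, x[i][t], ybar[t], xbar[t]] for t in range(T)], y[i])[1] for i in range(N)]

naive_mg = sum(naive) / N
cce_mg = sum(cce) / N
print(naive_mg, cce_mg)
```

Because the factor is correlated with the regressor, the naive unit regressions generally miss the true slope, while the regressions augmented with cross-sectional averages return it. With idiosyncratic errors and heterogeneous slopes the recovery would be approximate rather than exact.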
Monte Carlo Experiments and Evidence
Monte Carlo Experiments
Five Monte Carlo experiments were conducted to compare the robustness of the fixed effects estimator and the common correlated effects estimator to autoregression (or persistence), slope heterogeneity, and cross-sectional dependence.
The data generating process of the experiments is based on the following ARDL (
where
The unit-specific fixed effects are generated as:
The common factor is autoregressive:
and the error terms are independent and identically distributed:
I performed five experiments based on Equation 20, which I summarize in Table 2, Table 3, and Table 4. Experiments 1 through 3 follow an ARDL (1,1) and include a common factor and differ according to the degree of slope heterogeneity. Experiment 1 allows for the most variation in the slope coefficients, whereas Experiment 3 allows for the least, and Experiment 2 falls in the middle. Experiment 4 also follows an ARDL (1,1), but the slope coefficients are homogeneous. Experiment 5 restricts Equation 20 to an ARDL (0,0), that is, a static model, where the slope coefficients are homogeneous and there is no cross-sectional dependence. Experiment 5 is a “best case” scenario for the one-way fixed effects estimator. One thousand simulations are performed for each combination of
Summary of Monte Carlo Experiments.
Coefficients in Monte Carlo Experiments.
Estimators and Models used in the Monte Carlo Experiments.
Monte Carlo Results
The results of the Monte Carlo experiments are summarized for each experiment in the following. The results are reported in tabular format in the supplemental material. The bias, root mean square error (RMSE), and coverage are reported for each estimator. The bias is the average difference between the estimates and the true parameter, expressed in percentage terms. The RMSE measures how far the estimates deviate from the true parameter, which reflects both the bias and the inefficiency of an estimator. The coverage is the proportion of the confidence intervals that include the true parameter.
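The three performance measures can be computed from a set of simulated estimates as in the following sketch. It assumes that bias is expressed as a percentage of the true parameter and that coverage is based on 95 percent normal-approximation confidence intervals; the function name and the numbers are made up for illustration.

```python
import math


def summarize(estimates, std_errors, truth, z=1.96):
    """Return percent bias, RMSE, and confidence interval coverage across simulations."""
    n = len(estimates)
    bias_pct = 100.0 * (sum(estimates) / n - truth) / truth
    rmse = math.sqrt(sum((b - truth) ** 2 for b in estimates) / n)
    covered = sum(1 for b, se in zip(estimates, std_errors)
                  if b - z * se <= truth <= b + z * se)
    return bias_pct, rmse, covered / n


# Four hypothetical simulation draws for a parameter whose true value is 2.
estimates = [1.0, 3.0, 2.0, 2.0]
std_errors = [0.1, 1.0, 0.5, 0.5]
bias_pct, rmse, coverage = summarize(estimates, std_errors, truth=2.0)
print(bias_pct, rmse, coverage)
```

The example shows why the measures are reported together: the estimator here is unbiased on average, yet its RMSE is about 0.71 and one of the four intervals misses the true value.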
Experiment 1: dynamic, high slope heterogeneity, and cross-sectional dependence
As expected, the DCCE performs the best of the estimators in Experiment 1. The dynamic fixed effects estimators and the DCCE perform similarly when
The results for
It is also notable that the dynamic models perform much better than the static models in estimating
Experiment 2: dynamic, medium slope heterogeneity, and cross-sectional dependence
The results for the long-run effect (
The results for
Like Experiment 1, the dynamic models are superior to the static models with regard to estimating
Experiment 3: dynamic, low slope heterogeneity, and cross-sectional dependence
The results for the long-run effect (
The results for
For
Like Experiments 1 and 2, the dynamic models outperform the static models in estimating
Experiment 4: dynamic, no slope heterogeneity, and cross-sectional dependence
Experiment 4 differs from the first three experiments because the slope coefficients are homogeneous. The 2DFE performs the best in terms of estimating the long-run effect (
For
Moving to the static models, the 2FEPCSE/PSAR performs the best of the static models, with biases near 0 for all combinations of
Experiment 5: static, no slope heterogeneity, and no cross-sectional dependence
Experiment 5 is a “best case” scenario for the one-way fixed effects estimator because there is no slope heterogeneity or cross-sectional dependence. The estimator does perform well, with biases near 0, a low RMSE, and high coverage for all combinations of
Summary of the Monte Carlo Experiments by Estimator and Model.
An Example: Drivers of U.S. State-Level Fossil Fuel Consumption
Large
Descriptive Statistics.
Slope Heterogeneity Tests.
The results of the dynamic models are reported in Table 8, and the static models are reported in Table 9 for reference. For the dynamic models, the coefficients behave as expected. The autoregressive coefficient is near unity in the fixed effects models, while it is .504 for the DCCE model. The short-run coefficients and the coefficients on the lags of the independent variables tend to take on equal and opposite signs in the fixed effects models, which is also expected. Pesaran’s test for weak cross-sectional dependence (reported as the CD statistic in Table 8) indicates strong cross-sectional dependence in the 1DFE. The CD statistic is smaller in magnitude with the inclusion of time fixed effects, but it remains significant at the .05 level. The use of PCSEs does little to mitigate the strong cross-sectional dependence, as shown by the CD statistic, which remains the same value.
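For readers unfamiliar with the CD statistic, the following sketch computes it for a balanced panel of residuals as the sum of all pairwise residual correlations scaled by the square root of 2T/(N(N-1)). It assumes the residuals have mean zero within each unit, as they do when each regression includes an intercept; the function names and data are illustrative.

```python
import math


def pairwise_corr(a, b):
    """Correlation of two residual series, assuming each has mean zero."""
    num = sum(u * v for u, v in zip(a, b))
    den = math.sqrt(sum(u * u for u in a) * sum(v * v for v in b))
    return num / den


def cd_statistic(resid):
    """CD statistic for a balanced panel; resid holds one residual list per unit."""
    n, t = len(resid), len(resid[0])
    total = sum(pairwise_corr(resid[i], resid[j])
                for i in range(n) for j in range(i + 1, n))
    return math.sqrt(2.0 * t / (n * (n - 1))) * total


together = [[1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0]]    # perfectly correlated
unrelated = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]   # uncorrelated

cd_together = cd_statistic(together)
cd_unrelated = cd_statistic(unrelated)
print(cd_together, cd_unrelated)
```

Under the null hypothesis the statistic is approximately standard normal in large samples, so values far from zero point to cross-sectional dependence. Because the statistic depends only on the residuals, a correction that changes only the standard errors leaves it unchanged.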
Dynamic Models of Fossil Fuel Consumption, 1960–2018.
Static Models of Fossil Fuel Consumption, 1960–2018.
Turning to the short-run effects, industrial share and population are positively associated with fossil fuel consumption, while renewable energy is negatively associated with fossil fuel consumption across all four models. The models disagree regarding the short-run impacts of personal income and the top 10 percent share. The 1DFE model finds that personal income is positive and statistically significant, whereas the 2DFE models and the DCCE find no statistically significant relationship. The top 10 percent share coefficient is positive and statistically significant for both 2DFE models, but it is not statistically significant in the 1DFE or DCCE model.
The long-run effects are similar across the four models, but there are some notable differences between the estimators. The DCCE finds that personal income is not associated with fossil fuel consumption, but the 1DFE and 2DFE estimate a sizeable long-run association between income and fossil fuel consumption. The discrepancy between the 1/2DFE and DCCE arises primarily from the autoregressive coefficient tending toward unity, which creates highly misleading results, as theory predicts (Pesaran and Smith 1995) and the Monte Carlo experiments showed. Likewise, although population is statistically significant across all of the models, the 1DFE and 2DFE estimates are counterintuitive because they indicate that the short-run effect is greater than the long-run effect. For the 2DFE models, a 1 percent increase in population is associated with a .697 percent increase in fossil fuel consumption at
In the static models (Table 9), the differences between the models are similar to the dynamic case. First differences are used because the 2FE models in levels are nonstationary according to the Pesaran panel unit-root test in the presence of cross-section dependence (available on request). Personal income is positive and statistically significant for the 1FE but not for the 2FE models or the CCE. The top 10 percent share coefficient is statistically significant in both 2FE models but not the other models. The other coefficients are similar across the models. Both 2FE models suffer from strong cross-sectional dependence, and the CD statistic for the 2FEPCSE/PSAR is even larger in absolute magnitude than that of the 1FE model, which suggests that the use of PCSEs does not mitigate strong cross-sectional dependence. A summary of the results across Table 8 and Table 9 is found in Table 10.
Summary of Results: Drivers of U.S. State-Level Fossil Fuel Consumption.
Conclusion and Recommendations for Modeling Large N, Large T Data
There are three key takeaways from the Monte Carlo experiments and the example presented. First, failing to correctly model the lag structure, slope heterogeneity, and cross-sectional dependence produces biased and inconsistent results. Second, the performance of the dynamic two-way fixed effects estimator relative to the common correlated effects estimator depends on
Based on the theoretical discussion earlier in the article and the takeaways from the experiments, I present a decision-making framework in flowchart form in Figure 2. As the framework illustrates, there are theoretical and statistical decisions the researcher must make. Researchers must first decide the appropriate ARDL(

Decision-making framework for modeling large
To conclude, I make several remarks and suggestions regarding the framework:
If there is uncertainty regarding lag specification, the best way to tell whether a dynamic model is appropriate is to include lags of the dependent (and independent) variables in the model. Including them will at once model the dynamic process and correct the issue of autocorrelation (Pickup 2015). If a standard error correction is used instead, it will be misspecified if the data generating process is truly dynamic. Thus, researchers should always start with a dynamic model unless they have a strong theoretical reason to believe otherwise.
Moreover, many sociological theories expect that social processes function differently across spatial and social contexts. Thus, the homogeneity assumptions of the fixed effects estimator often are too restrictive, making the flexibility of heterogeneous panel estimators like the CCE more apt to use in many cases.
Supplemental Material
sj-pdf-4-srd-10.1177_23780231221117645: Supplemental material for A Guide to Analyzing Large N, Large T Panel Data by Ryan P. Thombs in Socius
Research Data
sj-txt-2-srd-10.1177_23780231221117645: Research Data for A Guide to Analyzing Large N, Large T Panel Data by Ryan P. Thombs in Socius
sj-txt-3-srd-10.1177_23780231221117645: Research Data for A Guide to Analyzing Large N, Large T Panel Data by Ryan P. Thombs in Socius
sj-xlsx-1-srd-10.1177_23780231221117645: Research Data for A Guide to Analyzing Large N, Large T Panel Data by Ryan P. Thombs in Socius
