Abstract
Introduction
In public health research, it is acknowledged that both compositional (individual-level) and contextual (neighborhood-level) variables are important for explaining variation in health outcomes.1–7 For example, the role of contextual variables continues to be a key focus of investigators studying potential risk factors for obesity,8–10 where neighborhood variables such as fast-food density and green-space presence are thought to be important factors that contribute to obesity status. Neighborhood- or area-level variables can also have an important role in explaining variation in environmental chemical exposures. One example is found with modeling variation in polychlorinated biphenyls (PCBs) measured in carpet dust using percentage of developed land, population density, and number of industrial facilities within 2 km of residences, where total PCB levels are significantly associated with either the percentage of developed land or the population density. 11
Through the use of geocoding and geographic information systems (GIS), researchers can link an individual residential address to an external database containing numerous area-level or environmental variables, where the number of variables is denoted by
With the abundance of socioeconomic and environmental data available at multiple spatial scales, a natural question arises for researchers who wish to investigate environmental effects: at what spatial scale (geographic areal unit) should each area-level variable be modeled in order to explain a fixed health outcome or environmental exposure of interest? In practice, area-level covariates used as contextual variables in regression models are often all modeled at the same spatial scale, where, generally, a smaller spatial unit (eg, a census block group rather than a county) is thought to better capture heterogeneity in regression relationships. Krieger et al. 4 model area-based socioeconomic measures at three spatial scales (census block group, census tract, and ZIP Code) to study mortality outcomes and cancer incidence and find that the effect estimates for the smaller spatial units (census block group and census tract) are similar, while the effect estimates for the larger spatial unit (ZIP Code) differ and are sometimes in the opposite direction. Krieger et al. 4 conclude that the level of geography is important and recommend the use of SES variables at a smaller spatial scale, namely census block group or census tract.
The selection of spatial scale for environmental variables is a problem typically encountered in modeling groundwater quality. A variable that is often incorporated into statistical models of groundwater quality is area land use, as it is known to be one of the factors that can affect water quality. Barringer et al. 12 find that the use of a circular buffer around a water table well is a simple and effective method for correlating water quality and land use. Regional and national groundwater studies have associated land use near a well with water quality using a fixed circular buffer distance, with 500 m a common choice and 1 km a less common choice.13–17 Some researchers have evaluated the univariate correlation of land use variables with groundwater quality. Ferrari and Ator 18 find correlations between agricultural land use and nitrate concentrations using circular buffers of 400 and 800 m. Kolpin 19 correlates the concentrations of nitrate, alachlor, and atrazine detected in wells with a variety of land use variables using circular buffers ranging in size from 200 m to 2 km. Johnson and Belitz 20 evaluate a range of circular buffer and wedge sizes in a univariate correlation analysis of urban land use and the occurrence of volatile organic compounds (VOCs) in groundwater using Kendall's tau (τ). They find that the values of τ are within 10% of one another for circles and wedges ranging in size from 500 m to 2 km, with statistically significant correlations for all sizes, and conclude that the popular choice of a 500-m circular buffer is adequate for assigning land use variables to a well.
Other researchers have evaluated the buffer distance to select each type of land use variable to be used in a regression model of groundwater quality. Rupert 21 selects the circular buffer size for land use variables to explain the detection of elevated atrazine or desethyl-atrazine (atrazine/DEA) concentrations and elevated concentrations of nitrate using univariate logistic regression. The optimal buffer size is 2 km for agricultural land use variables and 500 m for urban land use variables according to McFadden's ρ², which is a transformation of the log-likelihood statistic that is designed to imitate the r² of linear regression for univariate regression models. 21 The buffer distances of 2 km for agricultural land use variables and 500 m for urban land use variables are then used in multiple logistic regression models of the probability of elevated atrazine/DEA detection and the probability of elevated nitrate detection.
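For orientation, McFadden's measure is usually written as shown below. This is the common textbook form rather than a formula taken from Rupert 21, and the symbols for the fitted-model likelihood and the intercept-only likelihood are notational assumptions.

```latex
\rho^2 = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}}
```

Here \hat{L}_{\text{model}} denotes the maximized likelihood of the fitted logistic regression and \hat{L}_{\text{null}} that of the intercept-only model, so values closer to 1 indicate a larger improvement over the null model.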
An important question is whether all area-level variables should be modeled at the same spatial scale, as recent studies have shown that different area-level covariates are associated with health outcomes at different spatial scales.2,6,7 Root 6 finds that the relationship between area-level SES variables and orofacial cleft risk varies when using different spatial scales to define neighborhoods. The study results indicate that poverty has a stronger association with risk for cleft palate at smaller geographic scales, while unemployment has a stronger association at larger scales, thus providing evidence that neighborhood effects operate at different spatial scales. In addition, Flowerdew et al. 2 demonstrate in a British study of limiting long-term illness (LLTI) that the correlation strength between area-level variables and LLTI can vary depending on the spatial scale. Stronger correlations are present for LLTI and age group at the smaller enumeration district (ED) scale, while stronger correlations exist for LLTI and unemployment at the larger ward scale. In another study, Block et al. 8 examine the relationship between fast-food restaurant density (FFRD) and black and low-income neighborhoods while controlling for various neighborhood variables such as commercial activity and presence of highways, which, in addition to FFRD, are available at two spatial scales (0.5- and 1-mi buffer sizes). They find that, while the results for both buffer analyses are similar, the 1-mi buffer analysis leads to a statistically significant association for median household income, perhaps due to a better capturing of how far people are willing to travel to buy food. In light of these findings, it is important to consider spatial scale for each area-level covariate when studying relationships between environmental variables and a particular outcome variable.
In this paper, we present a novel approach for modeling area-based variables at different spatial scales using four model selection approaches. To demonstrate these methods, we use a nitrate dataset containing numerous geologic and land use variables at different buffer sizes to investigate potential associations with nitrate concentrations in drinking well water. Contamination of drinking water by nitrate is a growing problem in agricultural areas of the United States, as ingested nitrate can lead to the endogenous formation of N-nitroso compounds, which are potent carcinogens. Our methods are not limited to the case example we present, but can be applied to other area-based variables that are related to cancer.
Methods
Statistical Methods
We review four established methods used in model selection: forward stepwise regression, incremental forward stagewise regression, least angle regression (LARS), and the lasso. 24 Next, we introduce our modified versions of these algorithms to select the spatial scale. Lastly, we present an application of our methods to model groundwater nitrate concentrations in Iowa.
Model Selection Approaches
Forward Stepwise Regression
Forward stepwise regression is a common approach used for model selection. A description of the forward stepwise regression algorithm detailed by Berk 22 and Wheeler 23 is as follows:
(1) Initialize all regression coefficients
(2) Of the candidate variables, find the predictor
(3) Let
(4) Compute the residuals
(5) Iterate steps (2)–(4) until there is an inadequate improvement in the performance of the model or until all predictors have been added to the model.
For step (5), we consider there to be an inadequate improvement in the model's performance if the difference in the Akaike information criterion (AIC) between the current model and the proposed model is less than ε, for some ε > 0.
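The AIC-based stopping rule can be illustrated with a minimal sketch. It is not the authors' implementation: choosing the entering predictor by the lowest AIC, the Gaussian AIC written up to an additive constant, and all names (`ols_rss`, `aic`, `forward_stepwise`, `eps`, the toy data) are assumptions made for illustration. The solver assumes the candidate columns are not collinear and that the fit is not exact.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations A b = c, solved by Gaussian elimination with partial pivoting.
    A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(k)] for r in range(k)]
    c = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        c[p], c[piv] = c[piv], c[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for s in range(p, k):
                A[r][s] -= f * A[p][s]
            c[r] -= f * c[p]
    b = [0.0] * k
    for p in reversed(range(k)):
        b[p] = (c[p] - sum(A[p][s] * b[s] for s in range(p + 1, k))) / A[p][p]
    return sum((y[i] - sum(X[i][s] * b[s] for s in range(k))) ** 2 for i in range(n))

def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant (assumed form)."""
    return n * math.log(rss / n) + 2 * n_params

def forward_stepwise(predictors, y, eps=0.01):
    """predictors: dict name -> list of values. Returns names in order of entry."""
    n = len(y)
    selected = []
    current = aic(ols_rss([], y), n, 1)  # intercept-only model
    while len(selected) < len(predictors):
        best_name, best_aic = None, None
        for name in predictors:
            if name in selected:
                continue
            cols = [predictors[s] for s in selected] + [predictors[name]]
            cand = aic(ols_rss(cols, y), n, len(cols) + 1)
            if best_aic is None or cand < best_aic:
                best_name, best_aic = name, cand
        # Stop when the AIC improvement of the best proposal is less than eps.
        if current - best_aic < eps:
            break
        selected.append(best_name)
        current = best_aic
    return selected

# Toy data (hypothetical): y depends on x1 only.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [5, 3, 8, 1, 9, 2, 7, 4]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, -0.05]
y = [3 * a + e for a, e in zip(x1, noise)]
order = forward_stepwise({"x1": x1, "x2": x2}, y)
print(order)
```

In this toy example the second predictor does not lower the AIC enough to pass the threshold, so the procedure stops before all predictors have been added.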
Incremental Forward Stagewise Regression
Incremental forward stagewise regression is another common approach used for model selection. Hastie et al. 24 and Hastie et al. 25 describe the incremental forward stagewise regression algorithm as follows:
(1) Initialize all regression coefficients
(2) Find the predictor
(3) Let
(4) Let
(5) Iterate steps (2)–(4) until none of the predictors are correlated with the residuals
For step (5), we consider none of the predictors to be correlated with the residuals if max|
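A minimal sketch of incremental forward stagewise regression follows, using the usual textbook formulation: standardized predictors, a small fixed step in the direction of the predictor most correlated with the residuals, and a tolerance on the largest absolute inner product. The step size, tolerance, function names, and toy data are illustrative assumptions, not values from the paper.

```python
import math

def standardize(col):
    """Center a column and scale it to unit Euclidean norm."""
    m = sum(col) / len(col)
    centered = [v - m for v in col]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def forward_stagewise(predictors, y, step=0.01, tol=0.02, max_iter=100000):
    """Return coefficients on the standardized scale, keyed by predictor name."""
    names = list(predictors)
    X = {name: standardize(predictors[name]) for name in names}
    mean_y = sum(y) / len(y)
    resid = [v - mean_y for v in y]          # residuals start as centered y
    beta = {name: 0.0 for name in names}     # all coefficients start at zero
    for _ in range(max_iter):
        inner = {nm: sum(a * b for a, b in zip(X[nm], resid)) for nm in names}
        best = max(names, key=lambda nm: abs(inner[nm]))
        if abs(inner[best]) < tol:           # nothing left that tracks the residuals
            break
        delta = step if inner[best] > 0 else -step
        beta[best] += delta                  # small incremental update
        resid = [r - delta * x for r, x in zip(resid, X[best])]
    return beta

# Toy data (hypothetical): y depends on x1 only.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [5, 3, 8, 1, 9, 2, 7, 4]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, -0.05]
y = [3 * a + e for a, e in zip(x1, noise)]
beta = forward_stagewise({"x1": x1, "x2": x2}, y)
print(beta)
```

Because each update changes a coefficient by only a small amount, many iterations are needed before the stopping rule is met, which is the behavior the incremental variant is known for.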
Least Angle Regression
Following the notation of Yuan and Lin, 26 the LARS algorithm is described as follows:
(1) Initialize all regression coefficients
(2) Find the predictor
(3) Let the active set
(4) Let γ be a p-dimensional vector where all values are set equal to 0. Calculate the current least squares direction γ by updating
(5) For every
(6) If
(7) Let
(8) Let
(9) Set
For step (5), for ease of computation, we select
Lasso
The lasso, which stands for least absolute shrinkage and selection operator, 27 is a shrinkage method that is good for dealing with high-dimensional data and correlated covariates by placing a constraint on the magnitude of the regression coefficients.23,24 Hastie et al. 24 define the lasso estimate as follows:
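For reference, the lasso estimate is commonly written in the constrained form below. The symbols N (number of observations), p (number of predictors), and t (the bound on the coefficients) follow the usual textbook presentation and are an assumption about the notation intended here.

```latex
\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
```

Smaller values of t shrink the coefficients more strongly and can set some of them exactly to zero, which is what makes the lasso usable for variable selection.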
Following the notation of Efron et al. 28 , Shi 29 describes the lasso algorithm as follows:
(1) Initialize all regression coefficients
(2) Find the predictor
(3) Add
(4) Compute the following: ĉ =
(5) Find
(6) Let
(7) Find
(8) Let
(9) Set
Notice that step (7) is the lasso modification to the LARS algorithm. At the final iteration, the OLS solution is reached. 28
Modifications of Model Selection Approaches to Select Spatial Scale
Spatial Scale Forward Stepwise Regression
We propose a modified forward stepwise regression algorithm that selects each area-level variable at only one spatial scale in order to build regression models to explain variation in a continuous outcome variable. We use the basic forward stepwise algorithm with adjustments to select the scale for variables available at more than one spatial level. In the algorithm, all variables are considered at all available spatial scales as potential candidates to enter a model. However, due to potentially high correlations present across different scales for a given variable, we constrain the algorithm to select each area-level variable at a single spatial scale.
Our modeling approach uses a 3-D or stacked matrix, where each stack represents a particular level of covariates, including spatial scale. As an example, we might have several individual-level covariates, an area-level covariate available at 1, 2, and 3 mi, and another area-level covariate available at 4 and 6 mi. In this case, the first stack would contain the individual-level variables; the second, third, and fourth stacks would contain the first area-level covariate at the 1-, 2-, and 3-mi levels; and the fifth and sixth stacks would contain the second area-level covariate at the 4- and 6-mi levels, respectively. In cases where values are only present for a covariate at certain levels, that covariate is assigned missing values at all other levels. The spatial scale forward stepwise regression algorithm is as follows:
(1) Construct an
(2) Initialize all regression coefficients
(3) For each stack
(4) Of the
(5) Let
(6) Compute the residuals
(7) Iterate steps (3)–(6) until there is an inadequate improvement in the performance of the model or until all predictors have been added to the model.
For step (4), it is important to note that, once an overall maximum is determined, we select the corresponding variable at the best spatial scale and remove all other versions (or spatial scales) of that variable from further consideration for model selection. In this way, we constrain the algorithm to select each area-based variable at a single spatial scale. For step (7), we consider there to be an inadequate improvement in the model's performance if the difference in the AIC between the current model and the proposed model is less than ε, for some ε > 0.
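The stacked data layout used by this algorithm can be sketched as below, following the example of individual-level covariates, one area-level covariate at 1, 2, and 3 mi, and another at 4 and 6 mi. The function name `build_stacks`, the variable names, and the values are hypothetical, and missing cells are represented with NaN as an assumption about how "missing values" would be stored.

```python
import math

def build_stacks(n_rows, variables, levels, data):
    """
    variables: ordered covariate names (the columns of every stack).
    levels: ordered stack labels (individual level, then each spatial scale).
    data: dict mapping (variable, level) -> list of n_rows values.
    Returns stacks[level_index][row][column]; unavailable cells are NaN.
    """
    stacks = []
    for level in levels:
        stack = []
        for i in range(n_rows):
            row = []
            for var in variables:
                column = data.get((var, level))
                row.append(column[i] if column is not None else math.nan)
            stack.append(row)
        stacks.append(stack)
    return stacks

# Hypothetical example with two observations.
variables = ["well_depth", "pct_sand", "fine_grain"]
levels = ["individual", "1mi", "2mi", "3mi", "4mi", "6mi"]
data = {
    ("well_depth", "individual"): [30.0, 45.0],
    ("pct_sand", "1mi"): [10.0, 12.0],
    ("pct_sand", "2mi"): [11.0, 13.0],
    ("pct_sand", "3mi"): [12.0, 14.0],
    ("fine_grain", "4mi"): [5.0, 6.0],
    ("fine_grain", "6mi"): [5.5, 6.5],
}
stacks = build_stacks(2, variables, levels, data)
print(len(stacks), stacks[0][1])
```

With this layout, each candidate is identified by a (variable, stack) pair, so removing the other scales of a selected variable amounts to dropping the remaining pairs that share its variable name.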
Spatial Scale Incremental Forward Stagewise Regression
We spatially modify the basic forward stagewise algorithm to select each area-level variable at a single spatial scale. We use the same matrix data structure as with the spatial scale forward stepwise algorithm. The spatial scale incremental forward stagewise regression algorithm is as follows:
(1) Construct an
(2) Initialize all regression coefficients
(3) For each stack
(4) Of the
(5) Let
(6) Let
(7) Iterate steps (3)–(6) until none of the predictors are correlated with the residuals
For step (4), once an overall maximum is determined, we select the corresponding variable at the best spatial scale and remove all other versions (or spatial scales) of that variable from further consideration. In this way, we constrain the algorithm to select each area-level variable at a single spatial scale. For step (7), we state that none of the predictors are correlated with the residuals if the overall maximum is less than a specified tolerance, where the tolerance is some small positive number.
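The single-scale constraint can be sketched on top of a basic stagewise loop. This is an illustrative simplification, not the authors' code: candidates are stored in a dictionary keyed by (variable, scale) rather than in a stacked matrix, and the step size, tolerance, names, and toy data are assumptions.

```python
import math

def unit(col):
    """Center a column and scale it to unit Euclidean norm."""
    m = sum(col) / len(col)
    c = [v - m for v in col]
    s = math.sqrt(sum(v * v for v in c))
    return [v / s for v in c]

def ss_forward_stagewise(candidates, y, step=0.01, tol=0.02, max_iter=100000):
    """candidates: dict (variable, scale) -> values; scale is None for single-scale variables."""
    X = {key: unit(col) for key, col in candidates.items()}
    active = list(X)                      # candidates still eligible to be updated
    chosen_scale = {}                     # variable -> scale at which it entered
    mean_y = sum(y) / len(y)
    resid = [v - mean_y for v in y]
    beta = {}
    for _ in range(max_iter):
        inner = {k: sum(a * b for a, b in zip(X[k], resid)) for k in active}
        best = max(active, key=lambda k: abs(inner[k]))
        if abs(inner[best]) < tol:
            break
        var, scale = best
        if var not in chosen_scale:
            # First entry of this variable: lock its scale and drop its other scales.
            chosen_scale[var] = scale
            active = [k for k in active if k[0] != var or k == best]
        delta = step if inner[best] > 0 else -step
        beta[best] = beta.get(best, 0.0) + delta
        resid = [r - delta * x for r, x in zip(resid, X[best])]
    return chosen_scale, beta

# Toy data (hypothetical): the 500-m version of "land" tracks y more closely than the 1-km version.
land_500m = [1, 2, 3, 4, 5, 6, 7, 8]
land_1km = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5]
depth = [5, 3, 8, 1, 9, 2, 7, 4]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, -0.05]
y = [3 * a + e for a, e in zip(land_500m, noise)]
chosen, beta = ss_forward_stagewise(
    {("land", "500m"): land_500m, ("land", "1km"): land_1km, ("depth", None): depth}, y
)
print(chosen)
```

Locking the scale at the first entry is what prevents two highly correlated versions of the same variable from sharing the fit between them.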
Spatial Scale Least Angle Regression
Our approach for the spatial modification of the LARS algorithm involves the use of a block diagonal matrix
Adopting the notation of Yuan and Lin, 26 the spatial scale LARS algorithm is as follows:
(1) Initialize all regression coefficients
(2) Find the predictor
(3) Let the active set
(4) Update
(5) Let γ be a
(6) For every
(7) If |
(8) Let
(9) Let
(10) Set
For steps (4) and (7), once a predictor variable is chosen to enter the model, we select that variable at the best spatial scale and remove all other versions (or spatial scales) of that variable from further consideration from the candidate design matrix
Spatial Scale Lasso
As with the spatial modification of LARS, our modeling approach for the spatial modification of the lasso algorithm involves the use of a
Adopting the notation of Efron et al. 28 , the spatial scale lasso algorithm is as follows:
(1) Initialize all regression coefficients
(2) Find the predictor
(3) Add
(4) Update
(5) Compute the following: ĉ =
(6) Find
(7) Let
(8) Find
(9) Let
(10) Set
Step (8) is the lasso modification. For steps (4) and (8), once a predictor variable is chosen to enter the model, we select that variable at the best spatial scale and remove all other versions (or spatial scales) of that variable from further consideration from the candidate design matrix
Application to Groundwater Nitrate
Study Data
To model the variation in nitrate in drinking well water in Iowa, we used data for private wells sampled from 1984 to 2011 by the following programs: the Iowa Grants to Counties Water Well Program (GTC), the Iowa Private Well Tracking System, the Iowa Statewide Rural Well Water Survey, the Iowa Community Private Well Study, and the U.S. Geological Survey (USGS). We used only those wells with the most accurate locations as determined by GPS measurements, topographic quad maps, and geocoded residence addresses. Seventy-five percent of the well locations were based on geocoded residential street addresses. Nitrate data were reported either as nitrate or nitrite-plus-nitrate as NO3-, and the latter were converted to nitrate-nitrogen (hereafter referred to as “nitrate”). Values below the detection limit were imputed from a log-normal distribution of uncensored data. 30 Same-day samples at the same well location and depth were excluded if their standard deviation was 5 mg/L nitrate-N or more; otherwise the average of such samples was used. Nitrate data were natural-log-transformed prior to modeling. There were 11,931 well measurements in the analysis dataset.
We considered a set of 115 explanatory variables in the statistical analysis (Table 1). Variables were available for characteristics at the individual well location and for characteristics of the surrounding environment over different distance buffers. Variables at the individual well level include longitude, latitude, elevation, well depth, bedrock status, and bedrock depth, among others. A geographic information system was used to calculate the surrounding environmental variables. Most of the environmental variables were calculated using more than one distance buffer to assess the importance of spatial scale. An exception was for counts of animal feeding operations (AFOs) by type (Confined, Open Feedlot, Mixed), which were only calculated at 10 km. The AFO type of the closest AFO was also recorded, along with the number of animals at the nearest AFO (NearAFO_AnimalUnits). Most of the other area-based variables were calculated at distances of 500 m and 1 km. These area-based covariates include average percent sand, average percent clay, average slope length, and mean population density, among others. Only fine-grain thickness (FnGrn_Logs) was calculated at 4- and 6-mi distances. Additional details on the variables are available in Wheeler et al. 30 To account for missing data, we excluded 9.3% of the observations that were missing values for any of the covariates and used 10,824 nitrate measurements in our analysis.
Table 1. Definitions of the variables considered in the spatial scale forward stepwise, forward stagewise, LARS, and lasso models. The horizontal dashed line separates the individual-level variables from the area-based variables available at more than one buffer distance. Any variable that falls below the dashed line has a suffix indicating the associated spatial scale.
Statistical Analysis
We modeled the natural log of nitrate concentrations in well water using our four spatial scale selection algorithms. We built spatial scale forward stepwise regression, spatial scale incremental forward stagewise regression, spatial scale LARS, and spatial scale lasso models to explain variation in log nitrate concentration while allowing any individual-level variable to enter the model and any area-based variable to enter the model at a single spatial scale, considering all available spatial scales. For spatial scale forward stepwise and spatial scale stagewise, we used a stacked data matrix where the first stack contained individual-level variables or area-level variables available at only one spatial scale, the second and third stacks contained area-level variables at the 500-m and 1-km levels, and the fourth and fifth stacks contained area-level variables at the 4- and 6-mi levels, respectively. Given that variables available at multiple spatial scales were limited to enter a model at a single spatial scale, the total number of variables possible for model inclusion was 71 instead of 115. More specifically, the total number of individual-level variables and area-level variables available at one spatial scale possible for model inclusion was 27, and the total number of area-level variables possible was 44.
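The variable counts can be checked with a short calculation, assuming that each of the 44 multi-scale area-level variables is available at exactly two scales (500 m and 1 km, or 4 and 6 mi for fine-grain thickness), which is consistent with the totals reported above.

```latex
\underbrace{27}_{\text{single-scale}} + \underbrace{2 \times 44}_{\text{two scales each}} = 115
\qquad\text{and}\qquad
27 + 44 = 71
```

The first sum is the number of candidate columns considered by the algorithms; the second is the largest number of variables a final model can contain once each multi-scale variable is restricted to one scale.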
For spatial scale forward stagewise, spatial scale LARS, and spatial scale lasso, we fitted OLS regression models with the selected covariates to obtain approximate
Evaluation Metrics
To evaluate the success of our spatial scale algorithms, we examined our methods using three criteria. First, for each of the four algorithms, we checked to see whether different spatial scales were selected and enumerated the number of selected variables that fell into each spatial scale category. Second, we looked at the agreement in sign and spatial scale for significant variables that were selected across various groupings of the algorithms. Third, in order to evaluate the rationale of including different variables at different spatial scales within the same model, for each algorithm we compared AIC measures across three different scenarios: 1) when limiting all selected area-based variables to be at the smallest available spatial scale, 2) when limiting all selected area-based variables to be at the largest available spatial scale, and 3) when using all selected area-based variables at the spatial scales originally selected by the model.
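For the third criterion, the AIC of an OLS model is commonly computed, up to an additive constant, as shown below. This form and the symbols n (number of observations), k (number of estimated parameters), and RSS (residual sum of squares) are assumptions, since the exact convention used by the software is not stated.

```latex
\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k,
\qquad
\Delta_s = \mathrm{AIC}_s - \mathrm{AIC}_3, \quad s \in \{1, 2\}
```

A positive \Delta_s means that scenario s (all area-based variables at the smallest or at the largest available scale) fits worse than scenario 3, which uses the model-selected scales. Because the three scenarios are fitted to the same observations, differences in AIC do not depend on the additive constant.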
Results
Different variables were selected at different spatial scales using the spatial scale forward stepwise, spatial scale incremental forward stagewise, spatial scale LARS, and spatial scale lasso algorithms (Figs. 1–4). In each coefficient path plot, iterations of the respective algorithm are shown as the model-building progresses, where the coefficient estimates at each iteration change as variables enter or leave a model. Black lines represent individual-level variables, red lines indicate area-based variables at the 500-m level, green lines denote area-based variables at the 1-km level, and purple lines represent area-based variables at the 6-mi level. The forward stepwise algorithm converged after 26 iterations (Fig. 1), and the forward stagewise algorithm converged after 1,747 iterations (Fig. 2). Not surprisingly, it took a large number of iterations before the stagewise algorithm converged because of the incremental updating of the beta coefficient estimates. The LARS algorithm converged to the OLS estimates after 71 iterations (Fig. 3). The dotted vertical line in Figure 3 indicates the chosen model that had the minimum OLS-based AIC. The lasso algorithm converged to the OLS estimates after 85 iterations (Fig. 4). It took more iterations for the lasso algorithm to converge than for the LARS due to lasso's ability to add and drop variables. The dotted vertical line in Figure 4 indicates the chosen model that had the minimum OLS-based AIC.

Figure 1. Coefficient paths for spatial scale forward stepwise regression to explain log nitrate concentration in drinking wells in Iowa. The legend indicates the spatial scale at which each variable entered the model.

Figure 2. Coefficient paths for spatial scale incremental forward stagewise regression to explain log nitrate concentration in drinking wells in Iowa. The legend indicates the spatial scale at which each variable entered the model.

Figure 3. Coefficient paths for spatial scale LARS to explain log nitrate concentration in drinking wells in Iowa. The legend indicates the spatial scale at which each variable entered the model. The dotted vertical line indicates the chosen model that had the minimum OLS-based AIC.

Figure 4. Coefficient paths for spatial scale lasso to explain log nitrate concentration in drinking wells in Iowa. The legend indicates the spatial scale at which each variable entered the model. The dotted vertical line indicates the chosen model that had the minimum OLS-based AIC.
The coefficient estimates for each of the covariates selected in each of the algorithms are shown in Table 2, where the horizontal dashed line separates the individual-level variables and the area-based variables that have multiple spatial scales. Across all four algorithms, there were significant positive associations between log nitrate concentration and the following covariates: elevation, number of mixed-only AFOs within a 10-km buffer (Count_10 kmMixed), number of hog facilities within a 10-km buffer (Count_10 kmHogs), distance from well point to nearest sinkhole point (SinkholeDist_m), average transmissivity (AvgTrans), average wind erodibility index within a 500-m buffer (WEI_500 m), and estimated mean annual natural ground-water recharge within a 500-m buffer (Recharge_500 m). There were significant negative associations between log nitrate concentration and the following covariates: latitude, well depth, bedrock depth, bedrock status, average horizontal hydraulic conductivity (AvgK), average soil loss tolerance within a 1-km buffer (T_1 km), percent “not prime farmland” within a 500-m buffer (FarmClass_500 m), mean population density within a 1-km buffer from the U.S. Census 2000 (PopDen00_1 km), and fine-grain thickness at the 6-mi distance (FnGrn_Logs_6 mi).
Table 2. Estimated coefficients from spatial scale (SS) forward stepwise, forward stagewise, LARS, and lasso models. Blank cells indicate variables not selected for a particular model. The horizontal dashed line separates the individual-level variables from the area-based variables considered at multiple spatial scales.
Each of the algorithms selected some variables at the 500-m scale and others at the 1-km scale (Table 3). All four models selected fine-grain thickness (FnGrn_Logs) to enter at the 6-mi level. For the spatial scale forward stepwise model, 26 of the 71 individual- and area-level covariates were selected, with an even split of seven variables at the 500-m level and seven variables at the 1-km level. For the spatial scale forward stagewise model, 39 of the 71 individual- and area-level covariates were selected. Again, there was a fairly even split, with 11 variables selected at the 500-m level and 12 variables selected at the 1-km level. For the spatial scale LARS model, 46 of the 71 individual- and area-level covariates were selected. More variables were chosen to enter at the 1-km level than at the 500-m level, with 17 variables selected at the 1-km level and 14 variables selected at the 500-m level. For the spatial scale lasso model, 42 of the 71 individual- and area-level covariates were selected. There was a fairly even split, with 14 variables selected at the 500-m level and 13 variables selected at the 1-km level.
Table 3. Number of variables selected at each spatial scale for spatial scale (SS) forward stepwise, forward stagewise, LARS, and lasso models. The last row gives the total number of possible variables at each spatial scale.
Overall, there was consistency across the spatial scale algorithms in terms of the coefficient signs and spatial scale for the significant selected variables (Table 4). For the various groupings of the algorithms, the majority of the commonly selected covariates were significant. There were no instances of significant variables having different signs across algorithms, and only two instances of significant variables being selected at different spatial scales across algorithms. Average calcium carbonate (CaCO3) and average risk of concrete corrosion (CorrosionCon) were selected at the 500-m level by spatial scale stepwise and at the 1-km level by spatial scale stagewise, spatial scale LARS, and spatial scale lasso.
Table 4. Number of shared significant variables with the same sign and spatial scale and total number of shared variables for spatial scale (SS) forward stepwise, forward stagewise, LARS, and lasso models. The frequency of shared significant variables with the same sign and spatial scale is given along with the total number of shared variables in parentheses.
In comparing SS-Stepwise with SS-Stagewise, SS-LARS, and SS-Lasso.
For spatial scale stepwise, spatial scale stagewise, spatial scale LARS, and spatial scale lasso, we fitted OLS regression models based upon the selected covariates to obtain AIC measures for the three scenarios described previously (Table 5). For all methods, the model using the model-selected spatial scales (Model 3) resulted in the smallest AIC, indicating a better goodness of fit. Thus, goodness of fit improved when the area-based variables were used at the spatial scales originally selected by each model rather than all at the smallest or all at the largest available scale. Across all scenarios, the spatial scale lasso had the best goodness of fit.
Table 5. OLS-based Akaike information criterion (AIC) comparisons across spatial scale (SS) forward stepwise, forward stagewise, LARS, and lasso models.
Using the final model provided by the spatial scale lasso, 26 of the 42 selected variables were significant (Table 2). Of the significant variables, several variables had larger magnitudes and stood out as being important for explaining the variation in nitrate. There were significant positive associations between log nitrate concentration and the following covariates: distance from well point to nearest sinkhole point (SinkholeDist_m) and estimated mean annual natural ground-water recharge within a 500-m buffer (Recharge_500 m). In addition, there were significant negative associations between log nitrate concentration and the following covariates: bedrock depth, well depth, average calcium carbonate within a 1-km buffer (CaCO3_1 km), and average soil loss tolerance within a 1-km buffer (T_1 km).
Discussion and Conclusions
To consider the problem of spatial scale selection for area-based variables available at more than one spatial scale in a regression model, we modified the forward stepwise, forward stagewise, LARS, and lasso algorithms to select the best spatial scale for each area-level covariate. Our algorithms allow for any number of spatial scales of covariates to be considered and also enable the inclusion of individual-level covariates or covariates with only one possible spatial scale. We constrained the four algorithms to select each area-based variable to enter the model at a single spatial scale to avoid collinearity effects. When applying the algorithms to model groundwater nitrate exposure in Iowa, we found that not all environmental variables were selected at the same spatial scale. For all four spatial scale algorithms, the regression model that used the model-selected spatial scales had the best model fit. Furthermore, there was an overall agreement in coefficient sign and spatial scale for significant variables that were selected across the algorithms. The selection of area-level variables at different spatial units gives evidence for the environmental effects operating at different spatial scales and demonstrates the importance of considering the spatial scale when modeling environmental exposures.
Other researchers have developed approaches to address the problem of spatial scale selection in regression modeling. For example, rather than choosing the best available scale for each area-level covariate, Root et al. 7 use the variance of the outcome variable (eg, disease rates) to select a buffer distance at which to conduct the regression analysis and then use the area-level variables at the selected buffer distance. They propose the Brown–Forsythe (FBF) test of homogeneity of variance to select the optimal neighborhood or buffer size for modeling disease rates. In their approach, Root et al. 7 use circular buffers to create a collection of “neighborhoods” of different sizes around each subject and then use the statistical test to select the ideal buffer distance. This approach assumes that small neighborhoods will have high variances (reflective of an individualistic data structure) and large neighborhoods will have low variances (reflective of a global data structure). The goal is to select an “optimal” neighborhood that adequately captures the global characteristics of the neighborhood environment in which a person lives without being so large as to lose applicability to the individual. 7
Using the FBF test as a method to choose the optimal neighborhood has its merits. First, it is robust to deviations from the normal distribution in the outcome variables, which can occur when disease rates are modeled as normally distributed outcomes. 7 Second, it allows researchers to more specifically define geographic areas that may be more relevant for a particular health outcome, as opposed to using predefined geopolitical spatial scales such as census block groups or counties, which may not adequately capture the proximal environment of an individual.7,33 The approach also has its limitations. First, it may not be suitable for researchers who wish to select neighborhoods other than those defined by using buffers. 7 Second, the buffer-based estimates of neighborhood SES variables carry measurement error (in addition to the measurement error present in the census data) because they assume that people are evenly distributed within a census block group, although this limitation is common to buffering approaches in general. Third, and most importantly, area-level variables are not involved in the selection of the optimal buffer size for calculating disease rates. The buffer distance is selected based on finding a spatial scale with a moderate variance for disease rates, and then the SES variables are modeled at the selected buffer size. Thus, the test does not directly select the spatial scale for area-level covariates.
The strategies for selecting the spatial scale of environmental variables have been primarily univariate, at least in the context of the analysis of groundwater quality. Buffer shapes and sizes for land use variables have typically been selected independently of one another, as well as of variables measured at the well level.20,21 However, the magnitude of the effect measure and the significance of relationships between area-level variables and the outcome could change when other important variables are considered simultaneously.
Our methods provide a novel approach to the problem of spatial scale selection and have several strengths. First, rather than making an assumption about the appropriate spatial scale at which to model area-based variables, our spatial scale algorithms allow the data to drive the selection of spatial scale and permit different spatial scales to be present within a model. Second, our approach to spatial scale selection is multivariate: it permits individual-level variables and area-level variables available at multiple spatial scales to be considered simultaneously for inclusion in a model. Third, because of the potentially high correlations across different spatial scales for a given variable, our algorithms constrain each variable to enter the model at a single spatial scale. That is, if a variable is available at two spatial scales, it can enter the model at only one of the two scales. In contrast, Crowder and South 34 permit a variable to enter a regression model at both available spatial scales, and their results show that these variables have opposite signs, suggesting the possibility of collinearity effects. Fourth, to address correlations across variables, one of our algorithms constrains the regression coefficients in the presence of correlated covariates. Fifth, our methods are scalable and can be extended to accommodate high-dimensional datasets with a large number of covariates at a variety of spatial scales.
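The single-scale constraint described above can be illustrated with a small sketch. This is a simplified greedy forward selection written for illustration and not the spatial scale algorithms themselves. The function name `select_one_scale_per_variable`, the variable labels, and the simulated data are all assumptions. At each step the candidate column most correlated with the current residuals enters the model, and every other scale of the same variable is then removed from the candidate set.

```python
import numpy as np


def select_one_scale_per_variable(X, y, groups, max_steps):
    """Greedy forward selection; each variable may enter at one scale only.

    X       : (n, p) array whose columns are variable-at-scale candidates.
    groups  : length-p list; groups[j] names the variable behind column j.
    Returns the selected column indices in order of entry.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize columns
    resid = y - y.mean()
    chosen, used = [], set()
    for _ in range(max_steps):
        # Only scales of variables not yet in the model remain eligible.
        candidates = [j for j in range(X.shape[1]) if groups[j] not in used]
        if not candidates:
            break
        scores = [abs(Xs[:, j] @ resid) for j in candidates]
        best = candidates[int(np.argmax(scores))]
        chosen.append(best)
        used.add(groups[best])
        # Refit by least squares on the active set and update residuals.
        A = np.column_stack([np.ones(len(y)), Xs[:, chosen]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
    return chosen


# Simulated example: two hypothetical variables, each at three buffer scales.
rng = np.random.default_rng(1)
n = 500
base_a, base_b = rng.normal(size=n), rng.normal(size=n)
a_scales = [base_a + rng.normal(size=n) for _ in range(3)]  # correlated scales
b_scales = [base_b + rng.normal(size=n) for _ in range(3)]
X = np.column_stack(a_scales + b_scales)
groups = ["var_a"] * 3 + ["var_b"] * 3

# The outcome depends on var_a at its second scale and var_b at its third.
y = 3.0 * X[:, 1] - 1.0 * X[:, 5] + rng.normal(scale=0.5, size=n)

chosen = select_one_scale_per_variable(X, y, groups, max_steps=2)
print("selected columns:", chosen)
print("selected variables:", [groups[j] for j in chosen])
```

Because the scales of one variable are correlated with each other, an unconstrained search could admit the same variable at two scales; removing a variable's remaining scales once one has entered avoids that situation by construction.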
While our initial results when applying our algorithms are encouraging, our analysis of groundwater nitrate has limitations. First, because of limited resources we were unable to consider more buffer distances in our analysis. Second, our analysis of nitrate used fixed buffer sizes for area-level variables across the study area, but adaptive buffer sizes based on population density may be more appropriate. Third, we excluded some observations due to missing values for some of the covariates. Regarding limitations of the algorithms, in the case of the spatial scale lasso, the spatial scale of a variable is fixed once it enters the model. That is, even if a variable is dropped, we constrain that variable to reenter the model at the same spatial scale as was originally selected. This is done to ensure that the correlation between the current residuals and the candidate variables does not exceed the maximum correlation achieved between the current residuals and the variables in the active set. Another limitation is the lack of standard errors for the algorithms, which necessitated our use of OLS models to obtain the
In the case study of groundwater nitrate exposure, we used our spatial scale algorithms to select area-based variables available at multiple buffer distances in order to explain variation in groundwater nitrate, a known risk factor for cancer. Our methods can be applied to other research problems, where it is of interest to select environmental or area-based risk factors available at multiple spatial scales that are associated with a health outcome of interest such as cancer.
Author Contributions
Conceived and designed the methodology: LPG, CG, DCW. Analyzed the data: LPG. Wrote the first draft of the manuscript: LPG, DCW. Contributed to the writing of the manuscript: LPG, DCW. Agree with manuscript results and conclusions: LPG, CG, DCW. Jointly developed the structure and arguments for the paper: LPG, CG, DCW. Made critical revisions and approved final version: LPG, CG, DCW. All authors reviewed and approved of the final manuscript.
