Abstract
Introduction
In an era characterized by the proliferation of new data sources and an unprecedented data revolution, the 2030 Agenda for Sustainable Development and the overall goal of leaving no one behind (LNOB) have generated a tremendous increase in the demand for disaggregated data and statistics. In particular, in order to operationalize the overarching requirement of data disaggregation in the development of the Global SDG Indicator framework, the United Nations Statistical Commission postulated that “
In this framework, traditional sample surveys implemented by National Statistical Offices (NSOs) can provide important information on the social, economic and environmental dimensions of target populations, representing the essential data source to produce the official estimates of about 30% of Sustainable Development Goal (SDG) Indicators. However, these data sources alone are not enough to realize the ambitious goal of monitoring SDG Indicators by all relevant disaggregation dimensions and geographical areas. Indeed, despite collecting detailed information at relatively high frequency, most sample surveys are characterized by sample sizes that are either not large enough to guarantee reliable direct estimates for all sub-populations or that do not cover all possible disaggregation domains [1].
Issues of this kind can be addressed at different stages of the statistical production process. They can be tackled at the design stage, by adopting sampling strategies guaranteeing an observed set of sampling units for every disaggregation domain. Although potentially optimal, this approach normally results in an exponential increase of the sample size and of survey costs and complexity [2]. Furthermore, it is important to realize that, in practice, anticipating all possible future uses of survey data is virtually impossible, as “the client will always require more than is specified at the design stage” [3]. Alternatively, data disaggregation can be addressed at the data analysis stage, by adopting indirect estimation approaches that borrow strength from related disaggregation domains and/or time periods, thus resulting in an increase of the effective sample size [4]. Small area estimation (SAE) methods are among the possible indirect estimation approaches that can be adopted to deal with data disaggregation at the analysis stage. SAE techniques allow combining survey data with auxiliary information coming from additional data sources that are not affected by sampling error. Traditionally, SAE has relied on the integration of survey microdata with information from population and agricultural censuses or administrative records through explicit models linking the variable of interest to a set of auxiliary variables. However, with more and more data made available to National Statistical Systems (NSSs) from multiple innovative data sources, relying exclusively on auxiliary variables from traditional statistical sources for the production of small area estimates of SDG indicators is not considered an efficient solution. In this respect, the 2030 Agenda explicitly stresses the need for new and enhanced data integration strategies, including the exploitation of the potential contribution to be made by geospatial information systems and other big data sources.
Within this framework, the present paper makes the case for the adoption of SAE and other indirect estimation methods to produce granular disaggregated estimates of SDG indicators, by integrating survey microdata with auxiliary information retrieved from “innovative” data sources, such as earth observation data. Indeed, relying on suitably implemented indirect estimation techniques such as SAE allows obtaining reliable disaggregated estimates of SDG indicators while managing survey costs and complexity. In particular, the integration of survey microdata with data from non-traditional sources offers the potential of producing timelier and more disaggregated statistics at higher frequencies than what is allowed by traditional data sources alone. The paper is structured as follows. Distinguishing between area-level and unit-level SAE models, Section 2 presents a brief overview of these two approaches along with their relevant notation, context of usability, and potential sources of auxiliary data. This section also discusses the main strengths and elements of caution related to the use of geospatial auxiliary variables. Then, Section 3 presents some of the main fields of SAE application in official statistics, highlighting the few initiatives and examples of SAE for data disaggregation of official SDG indicators. To help fill the gap in SDG-relevant studies, Section 4 presents a practical case study on SDG indicator 2.3.1, measuring the average volume of production per labour unit of small-scale food producers, based on the Fay-Herriot (FH) [5] area-level model and combining the official integrated household survey of Mali with auxiliary variables retrieved from multiple geospatial information systems. The results of the case study highlight that, by implementing the considered SAE approach, the precision of the estimates is improved and predictions for out-of-sample domains can be produced. Finally, the main conclusions and ways forward are presented in Section 5.
Integrating survey data with additional data sources through small area estimation
Sample surveys, which are regarded as cost-effective means to collect detailed information at relatively high frequency over time, have a long history in the field of official statistics, and can be used to produce reliable estimates of parameters referring to total populations or to broad disaggregation domains. In this context, direct domain estimates of target parameters are statistics based solely on domain-specific sample data. Direct estimators are also known as design-based estimators, since they make use of sampling weights to produce inference on the target population [6]. One of the main requirements to achieve reliable disaggregated estimates by direct estimators is the presence of a sufficient domain sample size to yield adequate precision, or, in other words, a small estimated variance. When this condition is not met, we are in the presence of so-called small areas, i.e. disaggregation domains for which too few or no sample observations are available [4]. It should be noted that, in practical statistical applications, it is quite rare to have an overall sample size that is large enough to guarantee a sufficient number of observations for every possible disaggregation domain. Therefore, the use of indirect estimation techniques that “borrow strength” from auxiliary information on the population of interest [7] is often necessary. The range of possible approaches to produce indirect estimators is vast and goes from the implementation of design-based model-assisted approaches, such as the generalized regression estimator ([8, 9]) or the projection estimator ([10, 11]), to model-based approaches such as SAE ([4, 5, 12]). In contrast to direct and model-assisted approaches, SAE model-based methods rely on explicit models and, consequently, the properties of the resulting estimators are assessed under the adopted model assumptions.
In particular, traditional SAE models are mixed models with area-specific random effects accounting for the variability between different areas not explained by auxiliary variables [4].
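To make the notion of a direct, design-based domain estimator concrete, the following minimal Python sketch computes a weighted domain mean using only the sample observations and sampling weights that fall in each domain. It assumes the ratio (Hájek-type) form of the weighted mean; the function name `direct_domain_means` and the toy values are purely illustrative and are not taken from the paper.

```python
import numpy as np

def direct_domain_means(y, weights, domains):
    """Weighted (Hajek-type) direct estimate of the mean of y in each domain.

    Only units sampled in a domain contribute to that domain's estimate,
    which is why small domain samples lead to unstable direct estimates.
    """
    y = np.asarray(y, dtype=float)
    weights = np.asarray(weights, dtype=float)
    domains = list(domains)
    estimates = {}
    for d in sorted(set(domains)):
        mask = np.array([g == d for g in domains])
        estimates[d] = np.sum(weights[mask] * y[mask]) / np.sum(weights[mask])
    return estimates

# Hypothetical toy sample: four units, two domains.
y = [10.0, 20.0, 30.0, 40.0]
w = [1.0, 3.0, 2.0, 2.0]      # sampling weights
dom = ["A", "A", "B", "B"]    # domain membership
est = direct_domain_means(y, w, dom)
print(est)
```

A domain with no sampled units simply does not appear in the result, which illustrates why out-of-sample domains cannot be handled by direct estimation alone.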
Although different in their specification, all SAE approaches share the same notation framework that is here introduced for the clarity of following sections. Let us consider a finite population
Direct HT estimators
Despite their increasing popularity, SAE methods should not be considered the solution to every data disaggregation problem, and there are various considerations that NSOs should make before engaging in the production of indirect estimates. First of all, model-based approaches have stricter data requirements than direct estimation methods, with unit-level models being more data intensive than area-level ones. In this respect, the access to microdata on individual units may be limited by confidentiality concerns that need to be taken into account. Since SAE approaches are model-based, their underlying assumptions need to be carefully validated through adequate diagnostic techniques once the model has been fitted [27]. In addition, the bias of small area estimates needs to be measured to assess the reliability of the estimates. This is generally done by means of the mean square error (MSE), which provides a combined indicator of the precision (variance) and accuracy (bias) of the estimates.
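The way the MSE combines precision and accuracy can be illustrated with the standard decomposition MSE = variance + bias². The sketch below checks this identity numerically on a set of hypothetical replicated estimates of a known true value; the function name and the numbers are illustrative assumptions only.

```python
import numpy as np

def mse_decomposition(estimates, true_value):
    """Return (bias, variance, mse) of replicated estimates of true_value."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    variance = estimates.var()  # population variance (ddof=0)
    mse = np.mean((estimates - true_value) ** 2)
    return bias, variance, mse

# Hypothetical replicated estimates of a parameter whose true value is 10.
bias, variance, mse = mse_decomposition([9.0, 11.0, 12.0, 12.0], true_value=10.0)
print(bias, variance, mse)
```

In real applications the true value is unknown, so the MSE of small area estimates has to be estimated, for example analytically or by resampling.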
Area-level SAE models
The FH model [5], which is by far the most popular area-level SAE approach, is often used for the production of small area estimates in official statistics and research thanks to its intuitive application and interpretation. This approach combines a sampling model, assuming that the unknown parameter
where
The unknown parameters of Eq. (1) to be estimated are the fixed-effects parameters
One of the FH fundamental assumptions of known variances
In contrast to area-level approaches, unit-level SAE models require the availability of unit-level microdata for both the variable of interest
where
The model in Eq. (2.2) contains independent and identically distributed (
Under the EBLUP approach, the SAE estimator can be formalized as a linear combination of the survey regression estimator and a regression-synthetic component:
where
measures the ratio of unexplained between-area variability to the total variability, and gives more importance to the survey regression component of the estimator with increasing domain sample size
Similarly to what was seen for area-level models, various extensions of the basic unit-level approach are available in the literature. In particular, while the model in Eq. (2.2) only supports the estimation of means and totals, approaches relying on nested error linear regression models allow the estimation of non-linear indicators ([24, 25]). These extensions are particularly relevant in the context of the SDG monitoring framework, where many of the indicators are expressed as ratios and proportions. Additional extensions make it possible to include sampling weights in the estimation process [26], address the presence of heteroscedasticity in the error term [20], and produce estimates which are robust to influential outliers [21].
An important prerequisite for the construction of SAE models with satisfactory predictive power is the availability of good quality auxiliary variables that can properly explain the phenomenon under study. Traditionally, this additional information for SAE implementation has been extracted from population and agricultural censuses or administrative records. Census data have the advantage of providing a complete coverage of the target population and can offer valid socio-economic predictors of the variable of interest. However, the low frequency at which censuses are normally implemented limits their use for the production of disaggregated statistics on an annual basis. On the other hand, administrative records, which are often generated as a by-product of government operations, do not suffer from this drawback. However, this second type of data is not produced with the primary purpose of computing official statistics, and, as a consequence, its accuracy, coverage, content, and characteristics need to be carefully assessed before it is used for statistical purposes [28]. The merits and demerits of administrative data in the production of official statistics are extensively discussed in [29]. Some examples of applications of SAE based on administrative records are given in [4, 28, 30].
The huge amount of digital and geospatial information produced by a wide range of tools and technologies nowadays offers good alternative sources of auxiliary variables for SAE production. These rich large-scale datasets, often referred to as big data, generally cover a vast portion of the population within a territory, often reaching nationwide coverage. Potential sources of big data are geospatial information systems, social networks, and records generated by human transactions and interactions. These “new” or “alternative” data sources can complement traditional surveys and censuses to reduce the time and resources needed for data production, hence contributing to filling the SDG data gap. For example, the latest available geospatial technologies can not only provide the auxiliary information to implement SAE or other indirect estimation approaches, but can also help improve the construction of master sampling frames and produce direct estimates of selected SDG indicators (e.g. indicator 15.1.1 on the percentage of forest area on total area, and indicator 15.4.2 measuring changes in the mountain green cover).
Examples of studies relying on the use of big data and geospatial information for the implementation of SAE techniques are presented in [31, 32, 33, 34]. In particular, in [31], the authors discuss the challenges opened by the extension of SAE covariates to include variables generated by big data sources and provide some solutions to address them. Specifically, besides requiring the availability of advanced statistical and IT know-how, the quality of data from these “new” data sources is often uncertain and rarely documented in comprehensive metadata files. In this respect, attention should be paid to the fact that basic SAE approaches are implemented under the assumption that auxiliary variables are measured without error, or, in other words, that they are available for all areas and they come from archives covering the entire population of interest. However, data coming from big data sources are often affected by measurement errors and bias. Various authors (e.g. [22, 35]) have addressed this issue by developing SAE approaches accounting for the presence of measurement errors in the covariates.
When using big data retrieved by earth observation systems, particular attention should be paid to the definition and computation of the covariates included in the model. Indeed, geospatial variables are usually available at the level of the cells of regular grids of different resolutions, and need to be rescaled in order to be attributed either to individual sampling units (for unit-level approaches) or to the irregular polygons representing the estimation domains (for area-level approaches). Hence, when implementing area-level models such as the one summarized by expression (1), the value
Use of small area estimation approaches for data disaggregation of SDG indicators
The empirical literature on SAE is very broad, with applications in many different fields of official statistics such as income and poverty, labour, health and agriculture. However, despite the great emphasis placed on data disaggregation in the context of the SDG monitoring framework and its overarching LNOB pledge, the number of examples of SAE techniques applied to official SDG indicators is still limited.
Since poverty mapping is among the main applications of SAE, several case studies and references are available for the disaggregation of indicators related to Goal 1 on ending poverty. In particular, SAE techniques have been implemented to produce official sub-national estimates of SDG indicators 1.1.1 and 1.2.1 in countries such as Albania, Bolivia, Bulgaria, Cambodia, Chile, Ecuador, Indonesia, Mexico, Morocco and Sri Lanka ([23, 36]). Other applications relevant to income and poverty analysis, yet without a direct link with SDG indicators, can be found, for example, in Tanzania [37] and the United States [38].
Concerning Goal 2, which aims at ending hunger, achieving food security, improving nutrition and promoting sustainable agriculture, applications of SAE relevant to food security and malnutrition were found in Nepal [39], Ethiopia [40], and the United States [41]. However, the only application of indirect estimation techniques specifically targeting an indicator under Goal 2 was developed by the Food and Agriculture Organization of the United Nations (FAO) for indicator 2.1.2 on the prevalence of moderate and severe food insecurity in the population based on the food insecurity experience scale ([1, 11]). Concerning the agricultural component of Goal 2, no evidence of empirical applications of SAE targeting SDG indicators under this goal was found. Indeed, while SAE approaches have extensively been used to produce disaggregated estimates of crop yield and production measures (see [33, 34] for some examples), the use of indirect estimation approaches to produce disaggregated measures of agricultural labour productivity (indicator 2.3.1) or agricultural sustainability (indicator 2.4.1) is not common practice.
Finally, a limited number of applications of SAE to indicators under Goals 4 [42], 5 [43], and 8 [44] were identified.
Sample size information for Mali’s Enquête Agricole de Conjoncture Intégrée aux Conditions de Vie de Ménages (EAC-I) 2017
Target 2.3 of the 2030 Agenda for Sustainable Development aims to double the agricultural productivity and incomes of small-scale food producers by the end of the monitoring period. Progress towards the achievement of this target is monitored by two official SDG indicators, namely indicator 2.3.1, measuring the average value of agricultural production per labour unit, and indicator 2.3.2, estimating the average income from agricultural production activities of small-scale food producers. Indicator 2.3.1, which is the object of the presented case study, provides a measure of average partial factor productivity of agricultural holdings in a given year, and is currently disaggregated by the sex of the holding’s head and the size of the farm (small versus non-small). In particular, small-scale food producers are identified through an official definition developed by the FAO and endorsed by the Inter-Agency and Expert Group on SDG Indicators in September 2018 [45] in order to enhance international comparability. Although disaggregation at the subnational level is not among the mandatory disaggregation dimensions for reporting indicators under target 2.3, local estimates of indicators 2.3.1 and 2.3.2 may prove to be far more relevant than national aggregates for effective monitoring and decision making at the country level.
The typical data sources used to estimate SDG indicators on small-scale food producers are agricultural surveys, or household surveys integrated with modules on households’ agricultural activities. Being based on sample data, the production of reliable estimates of these two indicators at granular subnational level is usually not possible with standard design-based approaches and, consequently, indirect estimation approaches need to be explored. As discussed in Section 3, the literature on small area income estimation is considerably wide, even if not necessarily targeting the estimation of incomes generated through agricultural production activities. In contrast, the body of work on SAE approaches applied to measures of labour productivity is still very limited. In order to fill this gap, this section explores the application of an FH area-level model to produce small area estimates of SDG indicator 2.3.1 at the second administrative level (circles) of Mali, considering the integration of household survey data with area-level auxiliary information retrieved from multiple trustworthy geospatial information systems. Specifically, the presented SAE application is based on microdata from the Enquête Agricole de Conjoncture Intégrée aux Conditions de Vie de Ménages (EAC-I) 2017. The EAC-I is a multi-thematic cross-sectional household survey, implemented under the World Bank Living Standard Measurement Study (LSMS) programme, based on a nationally representative sample of about 8,390 households and with a specific focus on agriculture. In 2017, the sample units were divided into two groups, one of 3,813 households that received the full questionnaire, and one with the remaining households, which received a light version of the same questionnaire.
For the purpose of this application, only the group of households that completed the full questionnaire could be considered, since this included the necessary variables to identify small-scale food producers and compute the targeted indicator. Considering that indicator 2.3.1 has a disaggregation dimension already embedded in its definition, i.e. the size of the farm, the sample that could be used to produce small area estimates for small-scale food producers included only 1,637 households. Table 1 provides a summary of the sample size by region and circle, and the number of out-of-sample circles. In particular, the entire region of Kidal was left out of the sample for security reasons. In addition, the new region of Menaka had not yet been officially announced at the time of the survey and, for this reason, was not included in the sample.
Table 1 provides information on the sampling size by region and circles.
Parameter of interest, selection of SAE approach, and considered geospatial auxiliary variables
Going back to notation introduced in Section 2, the average volume of production per labour unit to be estimated in each small area can be formalized as
Spatio-temporal resolution and sources of geospatial area-level covariates
The accuracy of direct estimates, measured in terms of the estimated coefficient of variation (CV), was assessed against the same accuracy measure produced for EBLUP small area estimates obtained with the FH model in expression (1). This approach was selected in place of a unit-level method in order to produce a case study on SAE based on a model that is simple to implement, only requiring access to area-level direct estimates and auxiliary information. In addition, as seen in Section 2.1, the EBLUP area-level estimator obtained from the model presented in expression (1) can be formulated as a linear combination of area-level direct and synthetic estimates, giving more weight to the former with increasing sample size. Hence, using the FH SAE model can intuitively be seen as a way of improving direct estimates through a synthetic component based on external information. On the other hand, unit-level estimators, such as the one introduced in expression (2.2), do not take into account direct estimates and, as a consequence, the sampling design.
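The linear-combination form of the area-level estimator described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the case study: the random-effect variance `sigma2_u` is treated as known (the EBLUP would plug in an estimate of it), the sampling variances `psi` are taken as given, and all names and toy values are hypothetical and may differ from the notation of expression (1).

```python
import numpy as np

def fh_blup(direct, psi, X, sigma2_u):
    """Fay-Herriot predictor as a weighted mix of direct and synthetic estimates.

    direct   : area-level direct estimates
    psi      : sampling variances of the direct estimates (assumed known)
    X        : area-level covariates, including an intercept column
    sigma2_u : variance of the area random effects (assumed known here)
    """
    direct = np.asarray(direct, dtype=float)
    psi = np.asarray(psi, dtype=float)
    X = np.asarray(X, dtype=float)

    # Weighted least squares for the regression coefficients,
    # with weights 1 / (sigma2_u + psi_d).
    v_inv = 1.0 / (sigma2_u + psi)
    xt_vinv = X.T * v_inv
    beta = np.linalg.solve(xt_vinv @ X, xt_vinv @ direct)

    synthetic = X @ beta          # regression-synthetic component
    gamma = sigma2_u * v_inv      # weight given to the direct estimate
    blup = gamma * direct + (1.0 - gamma) * synthetic
    return blup, synthetic, gamma, beta

# Hypothetical toy inputs for four areas.
direct = [10.0, 20.0, 30.0, 40.0]
psi = [1.0, 4.0, 9.0, 16.0]
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
blup, synthetic, gamma, beta = fh_blup(direct, psi, X, sigma2_u=4.0)
print(gamma)
```

Areas with a small sampling variance, typically those with larger samples, receive a weight close to one on the direct estimate, while for an out-of-sample area only the synthetic component `X @ beta` is available.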
As area-level auxiliary variables for the implementation of the small area estimation model, various geospatial covariates were considered among the vast amount of publicly available candidates according to their potential to be good predictors of the average labour productivity in agriculture. In particular, covariates included in the first stage of selection provided information on the following domains:
Table 2 presents the spatial and temporal resolution of each auxiliary variable along with the related source.
Results of step-wise regression
Values of the considered geospatial predictors were initially available at the level of the cells of regular grids of different resolutions (spanning from 1
Small area estimates of indicator 2.3.1 in Mali disaggregated by circle (second administrative division).
The initial set of potential predictors was then reduced by adopting a stepwise regression, which was implemented using the area-level direct estimates of indicator 2.3.1 as dependent variable and the geospatial covariates as regressors. As a result of the stepwise regression, only 8 auxiliary variables were retained (see Table 3) according to the Lindeman, Merenda and Gold (LMG) factor, which represents a measure of the relative contribution of each predictor to the overall R-squared of the model. It is interesting to notice that most of the covariates considered as important by the selection approach provide information on either the quantity produced or the area harvested of Mali’s major crops. Other variables retained by the stepwise procedure measure the average direct normal irradiation, the average volume fraction of coarse fragments, and the average quantity of organic carbon in the soil.
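For readers unfamiliar with the LMG measure, the sketch below illustrates its usual definition: the contribution of each predictor to R-squared is averaged over all possible orders in which the predictors can enter the regression. It is a brute-force illustration on simulated data, feasible only for a handful of predictors; the function names and the simulated values are assumptions and do not reproduce the paper's selection procedure or data.

```python
import itertools
import numpy as np

def r_squared(y, columns):
    """R-squared of an OLS regression of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / total

def lmg_shares(y, X):
    """Average sequential R-squared contribution of each column of X over all orderings."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    shares = np.zeros(p)
    orderings = list(itertools.permutations(range(p)))
    for order in orderings:
        included = []
        r2_prev = 0.0  # R-squared of the intercept-only model
        for j in order:
            included.append(j)
            r2_new = r_squared(y, [X[:, k] for k in included])
            shares[j] += r2_new - r2_prev
            r2_prev = r2_new
    return shares / len(orderings)

# Simulated toy data: three candidate covariates, the third one irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=30)
shares = lmg_shares(y, X)
full_r2 = r_squared(y, [X[:, k] for k in range(3)])
print(shares, full_r2)
```

By construction the shares add up to the R-squared of the full model, which is what makes them interpretable as relative contributions.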
As seen in Section 2, among the necessary inputs to produce EBLUP small area estimates there are the estimated variances of direct estimates
Boxplot of direct (left) and small area (right) estimates.
Accuracy of direct and small area estimates and assessment of their linear relationship.
Residuals and random effects of SAE model.
The map presented in Fig. 1 displays the obtained small area estimate for each circle of Mali (including out-of-sample circles, which are identified with thicker borders). Values of indicator 2.3.1 range from 906 West African CFA Franc per labour unit in the circle of Kayes to 6,387 in Niono, with the highest values of agricultural labour productivity predicted in northern and central circles. The two boxplots presented in Fig. 2 provide initial evidence that the obtained small area estimates (boxplot on the right) have a much lower variability than the direct ones (boxplot on the left).
The four graphs presented in Fig. 3 allow comparing the accuracy of direct and indirect estimates in terms of their CVs, and assessing the presence or absence of a linear relationship between the two groups of statistics. In particular, the two boxplots in the top-left quadrant of Fig. 3 display the distribution of CVs of direct and model-based estimates and highlight the higher accuracy of small area estimates compared to their design-based counterparts. Indeed, the CVs of small area estimates fall below 20% in three quarters of the cases, while the same threshold is exceeded by more than 50% of direct estimates. Similar evidence is provided by the plot in the top-right corner, where direct and indirect estimates are ordered by increasing values of their CV. This provides a visual indication of the fact that the CV of small area estimates always falls below the same variability measure for direct estimates, except in the very few cases where the domain direct estimates already showed a high accuracy (i.e. CV below 15%).
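The CV-based comparison described above can be reproduced with a few lines of Python once point estimates and their estimated variances (or MSEs, for model-based estimates) are available. The sketch below uses hypothetical numbers, not the Mali results, and the helper names `cv_percent` and `share_below` are illustrative.

```python
import numpy as np

def cv_percent(estimates, variances):
    """Coefficient of variation in percent: 100 * sqrt(variance) / estimate."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    return 100.0 * np.sqrt(variances) / estimates

def share_below(cv, threshold=20.0):
    """Share of domains whose CV is strictly below the threshold."""
    return float(np.mean(np.asarray(cv) < threshold))

# Hypothetical direct estimates and their estimated variances for four domains.
cv_direct = cv_percent([100.0, 200.0, 400.0, 500.0],
                       [900.0, 1600.0, 10000.0, 2500.0])
# Hypothetical small area estimates and their estimated MSEs.
cv_sae = cv_percent([110.0, 205.0, 380.0, 495.0],
                    [400.0, 900.0, 3600.0, 2500.0])

share_direct = share_below(cv_direct)
share_sae = share_below(cv_sae)
print(share_direct, share_sae)
```

Sorting the domains by `cv_direct` and plotting both series would give the kind of ordered comparison shown in the top-right panel of Fig. 3.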
The graph in the bottom-left corner allows assessing the linear relationship between direct and indirect estimates. Generally speaking, especially for domains with sufficient sample size, direct and indirect estimates are expected to be correlated, meaning the two approaches should produce similar estimation results. In the considered case, the graph illustrates a fairly strong linear relationship between the estimates produced with the two approaches, with a correlation equal to 0.88.
After assessing the accuracy of the estimates, an important component of SAE implementation is the validation of the fundamental assumptions underlying the model, i.e. the normality of residuals and random effects. To this end, Fig. 4 presents the QQ plots of both the error term and the random effects, which do not provide any significant evidence of deviation from the normality assumption. This was also confirmed by the Shapiro-Wilk test, which resulted in
Conclusions and way forward
Monitoring the implementation of the 2030 Agenda for Sustainable Development and its overarching pledge to leave no one behind calls for more disaggregated data and SDG indicators than what is available in most countries. In this context, sample surveys are the preferred data source for about 30% of the indicators in the SDG monitoring framework and can offer valuable information to measure the social, economic and environmental dimensions of sustainable development. However, traditional household and agricultural surveys are usually characterized by sample sizes that are either too small to produce precise estimates, or that do not cover all disaggregation domains of interest. Hence, indirect estimation approaches such as SAE techniques can represent a valuable tool for NSOs and international organizations to produce timely and granular disaggregated estimates of SDG indicators, making it possible to contain the cost and complexity otherwise generated by the increase of sample sizes. In particular, with the proliferation of new data sources such as geospatial and big data information systems, SAE models can be implemented by combining survey data with a vast amount of auxiliary information available at no or limited cost and at high frequency. In this respect, the body of literature and the number of case studies on SAE techniques applied to SDG Indicators can still be expanded. After a brief review of the main SAE approaches available along with their principal domains of application, this paper presents a case study based on the Fay-Herriot area-level SAE model to produce subnational estimates of SDG Indicator 2.3.1 on the average volume of production per labour unit obtained by small-scale food producers. This is done by integrating survey data with area-level auxiliary information retrieved from multiple geospatial information systems.
The presented case study shows that the small area estimates of indicator 2.3.1 in Mali’s circles reach greater precision than direct estimates. In addition, by adopting the considered indirect estimation approach, estimates for out-of-sample areas can also be produced.
The FH area-level model was selected in place of a unit-level method in order to provide a simple example of SAE based on an SDG indicator related to agricultural sector development, only requiring access to area-level direct estimates and auxiliary information. In addition, using an indirect estimator such as the area-level EBLUP, which is expressed as a linear combination of area-level direct and synthetic estimates, the SAE approach can intuitively be interpreted as a way of improving direct estimates through a synthetic component based on external information correlated with the phenomenon of interest. Future extensions of this study will compare the results obtained with the area-level model considered here with those produced by a unit-level approach. In that case, both unit-level and sub-area-level (e.g. the enumeration area or the cell of a regular grid) auxiliary variables will be considered as regressors.
