1 Introduction
Population health and health behaviour estimates are commonly derived from survey data to monitor trends and formulate and evaluate policies. However, bias may arise if the survey samples are not representative of the target population. Non-representativeness is of some concern when measures of association such as relative risk are being estimated 1 but of greater concern for population prevalence and quantity estimates,2–4 such as for alcohol consumption. 5 A key aspect influencing the extent to which surveys are representative is the level of non-participation (unit non-response) among individuals included in the sampling frame. For instance, there is likely to be a group of harmful and dependent drinkers who may be disinclined to participate.
Survey weights derived from inverse probability weighting 6 are usually applied in an attempt to correct for such unit non-response (as well as accounting for aspects of sampling design such as the oversampling of certain household types or geographical areas). However, these weights typically rely on a limited range of socio-demographic variables 7 and are based on the assumption that non-participants have equivalent behaviours to participants in the same socio-demographic category, which is unlikely to be the case.
An alternative to the application of survey weights is multiple imputation (MI), 8 which is viable if the assumption that data are missing at random (MAR: the probability of missingness is unrelated to the unobserved data conditional on the observed data) holds. Alanya et al. applied MI to make unit non-response adjustments and compared it to weighting. 9 They found MI to compare favourably, though not consistently so. In another comparison with weighting, MI showed comparable performance in terms of bias but also yielded substantially lower variance estimates. 10 However, these papers made no allowance for the data being missing not at random (MNAR: the probability of missingness is related to the unobserved data). If the data are thought to be MNAR then an alternative approach is required, typically involving sensitivity analyses, and using methodology such as pattern mixture modelling 11 among others.12,13
Application of MI is strengthened if we can infer information on the absent non-participants. Nations without the whole-population registers that exist in the Nordic countries 14 typically lack individual-level data amenable to forming the bases of sampling frames. Thus, in countries such as the UK, individual non-participants cannot readily be identified and their routine health data extracted.
We propose a novel methodology that aims to better address non-participation bias in national health survey data in order to obtain less biased estimates of alcohol consumption.15,16 We consider both MAR and MNAR within a missing data framework, motivated by the possibility of non-participants differing in their alcohol consumption 17 from survey participants with the same socio-demographic variables and health outcome statuses. Our approach involves: (1) exploitation of record-linkage to hospital discharges and mortality; (2) survey–population comparisons which inform the creation of synthetic partial observations for non-participants; and (3) MI to generate refined estimates of weekly consumption of alcohol under assumptions of MAR (weaker than when based on survey data alone) and explorations of MNAR. 18 We illustrate the application using data from the 2003 Scottish Health Survey (SHeS) individually record-linked to administrative health information from the Scottish Morbidity Records (SMR), mortality data from the National Records of Scotland (NRS) and unlinked contemporaneous data for the entire population.
In the next section, we provide the context and motivation for the methodological approach described in section 3. In section 4, we report on the application before discussing the implications in section 5 and concluding in section 6.
2 Motivating example and data
2.1 Aim
We aim to devise and apply methodology to estimate sex-specific adult population mean alcohol consumption from national health survey data accounting for bias induced by non-participation.
2.2 SHeS
SHeS are a series of cross-sectional surveys designed to represent the Scottish population living in private households. 19 Socio-demographic data available in the surveys include sex, age group and Scottish Index of Multiple Deprivation (an area-based measure of deprivation collapsed into five equal population-weighted groups), collectively referred to here as ‘socio-demographic characteristics’. Alcohol consumption is calculated in units (equivalent to 10 ml or 8 g of pure ethanol) per week. Pre-derived survey sampling weights which sum to the achieved sample total have been created to account for the stratified, multi-stage random sample survey design and departures from population estimates by sex and age. 19 We use the 2003 survey which had an adult response level of 60%.
2.3 Linked health outcomes
Baseline data on consenting SHeS participants (91%) have been confidentially linked to routinely-collected nationwide administrative health records available until the end of 2011 providing prospective follow up of around eight years. These include prospective SMR which record hospital discharges (∼90% accurate diagnosis, 99% complete 20) and mortality data using a probabilistic matching algorithm21–24 (Figure 1).
Figure 1. Available data from mid-year population estimates, Scottish Morbidity Records/National Records of Scotland, Scottish Health Survey data sources and desired data on SHeS non-respondents.
2.4 Population data
For the general population, mid-year population estimates – available by sex, age group and area deprivation – for 2003 were used as denominators. 25 Numerator counts of morbidity and mortality events in the population during the eight years of follow-up were combined with mid-year population estimates – also by socio-demographic characteristics – to create an unlinked aggregate-level data set for the population for comparison with the record-linked survey data.
Table 1. Sex- and area deprivation group-specific breakdowns (%) for the general population of Scotland and participantsᵃ in the Scottish Health Survey 2003 aged 20 to 64 years consenting to linkage with inferred estimates for non-participants.
ᵃ Those participants consenting to record-linkage of their data.
3 Methodology
Our approach to addressing non-participation bias in alcohol consumption estimates involved filling in the missing data in the survey in three stages marked as 1, 2 and 3 in Figure 1. The three stages are depicted in Figure 2 and described in detail in sections 3.2 to 3.4, with a worked example given in section 4.1. We compare the results of our approach with the traditional survey-weighted results.
Figure 2. Summary of methodological strategy for addressing survey non-representativeness and refining alcohol consumption estimates. SHeS: Scottish Health Survey; SMR: Scottish Morbidity Record; NRS: National Records of Scotland; MAR: missing at random; MNAR: missing not at random.
3.1 Notation
We use the following notation. Let
Let
3.2 Stage 1: Using record linked data
In stage 1, record linkage of survey data to SMR and NRS data was used to determine the values of
3.3 Stage 2: Creation of synthetic observations for non-participants
In stage 2, we made inference on the non-participants by comparing the national health survey data with corresponding population data to identify deviations from representativeness in terms of
We first assumed that
This assumption is valid if the sampling frame for the survey is representative of the general population.
It follows that the
Three modifications were needed to this general method. First, the method as proposed does not allow for uncertainty in the survey-based estimates of
Second, the calculated numbers of missing participants in each category were generally not integers. To avoid possible bias due to rounding, we applied random rounding which preserves the mean count. For example, if 2.6 missing participants were required in a particular category, then we took 3 missing participants with probability 0.6, and 2 missing participants with probability 0.4. This was performed separately in each imputed data set.
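The random rounding step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `random_round` and the seed are assumptions made for the example. It rounds each non-integer count up with probability equal to its fractional part and down otherwise, so the expected count is unchanged.

```python
import numpy as np

def random_round(x, rng):
    """Round each value down or up at random so that its expectation equals x."""
    x = np.asarray(x, dtype=float)
    lower = np.floor(x)
    frac = x - lower                          # probability of rounding up
    round_up = rng.random(x.shape) < frac     # Bernoulli(frac) draw per cell
    return (lower + round_up).astype(int)

rng = np.random.default_rng(2003)             # arbitrary seed (assumption)

# A cell needing 2.6 missing participants: 3 with probability 0.6, 2 with 0.4
draws = random_round(np.full(100_000, 2.6), rng)
print(draws.mean())                           # close to 2.6
```

Because the draw is repeated independently in each imputed data set, the rounding variability is carried through to the final estimates instead of being fixed by a single deterministic rounding rule.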
Third, estimates of
3.4 Stage 3: Imputing alcohol consumption for non-participants
Once the synthetic observations for the non-participants were created at Stage 2, the unit (person) non-response problem had been converted into an item (variable) non-response problem with the synthetic non-participant observations having data on socio-demographic characteristics and health outcomes but missing data on alcohol consumption. Imputation models for alcohol consumption could then be specified conditional on socio-demographic characteristics and health outcomes. Missing alcohol consumption observations among a small minority of participants (
The imputation approach we used begins by assuming that, given the fully observed data on health outcomes as well as socio-demographic characteristics, non-participation in the SHeS-SMR data set is MAR (note that this is already an improvement on standard methods based on unlinked data, for which MAR would not condition on health outcomes). We then accommodated the possibility of the data being MNAR by allowing the distribution of alcohol consumption to differ in a pre-specified manner between the non-participants and participants (given the fully observed characteristics including health outcomes). Sections 3.4.1 and 3.4.2 outline in turn the MI procedures based on MAR and MNAR.
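To make the distinction concrete, the following sketch shows one stochastic imputation of log-transformed consumption for synthetic non-participants. It is an illustration under stated assumptions, not the authors' imputation model: the toy data, the single binary covariate, the function name `impute_log_consumption`, the use of a bootstrap plus ordinary least squares, and the additive shift `delta` on the log scale are all hypothetical choices made for the example. Setting `delta = 0` corresponds to MAR; a non-zero `delta` moves the non-participants' distribution away from that of comparable participants in a pre-specified way.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed (assumption)

# Hypothetical toy data: one binary covariate (e.g. a health outcome flag)
n_part, n_non = 500, 300
x_part = rng.integers(0, 2, n_part)
log_y_part = 2.0 + 0.8 * x_part + rng.normal(0, 1, n_part)  # observed log units/week
x_non = rng.integers(0, 2, n_non)                           # synthetic non-participants

def impute_log_consumption(x_obs, log_y_obs, x_mis, delta, rng):
    """One stochastic imputation on the log scale; delta = 0 corresponds to MAR."""
    # Bootstrap the participants so that parameter uncertainty is propagated
    idx = rng.integers(0, len(x_obs), len(x_obs))
    X = np.column_stack([np.ones(len(idx)), x_obs[idx]])
    beta, *_ = np.linalg.lstsq(X, log_y_obs[idx], rcond=None)
    resid = log_y_obs[idx] - X @ beta
    sigma = resid.std(ddof=X.shape[1])
    X_mis = np.column_stack([np.ones(len(x_mis)), x_mis])
    # Pattern-mixture style shift: mean differs from the MAR prediction by delta
    return X_mis @ beta + delta + rng.normal(0, sigma, len(x_mis))

mar = impute_log_consumption(x_part, log_y_part, x_non, delta=0.0, rng=rng)
mnar = impute_log_consumption(x_part, log_y_part, x_non, delta=0.3, rng=rng)
```

Repeating the call once per bootstrap data set gives a collection of completed data sets whose estimates can then be combined.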
For both MAR- and MNAR-based approaches, one stochastic imputation was performed for each of the
3.4.1 MI assuming MAR
Under MAR, conditional on socio-demographic characteristics and health outcomes, the distribution of alcohol consumption is independent of participation status. This is assumption
Testing of the log transformation
Our model for
3.4.2 MI assuming MNAR
We sought to change imputations of
We embed the MAR model in a wider class of models containing sensitivity parameters.32,33 The sensitivity parameters describe the difference in the joint distribution of fully observed data on participants and partially observed data on non-participants. Under MAR, the joint distribution
Pattern mixture modelling offers a means to model the joint probability distribution of
Our principal rationale for exploring MNAR concerns differential overall drinking levels, but the possibility remains that
We consider the general specification which accommodates differential modification of the imputation model by
Relative to participants with fully observed alcohol consumption, mean alcohol consumption is modified by
3.4.3 Specifying the parameters governing deviations from MAR
We considered two general approaches to specifying possible values for parameters
We drew on continuum-of-resistance theory which is predicated upon the idea of a latent propensity to not participate.2,3,36,37 Here, invited households who do not initially respond are re-approached one or more times, and the number of interviewer calls is recorded. Later responding participants can be theorised to be increasingly more like non-participants with the greater effort required to recruit them into the survey. We used the number of interviewer calls to a household as our proxy for non-participation propensity, where an individual who responded in three (the median number) or fewer attempts is considered an early-participant, and those that took four or more attempts are considered late-participants. Estimates of
We also considered the scenario in which the consumption deviation from MAR is twice the adjusted difference between early- and late-participants (MNAR2M and MNAR2UB).
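The early/late classification and the two propensity-based scenarios can be sketched as below. The data are invented, the difference is unadjusted (the paper uses an adjusted difference), and expressing it on the log scale is an assumption made for the example; the dictionary keys are simply stems of the scenario labels.

```python
import numpy as np

# Hypothetical records: interviewer calls and log weekly units per participant
calls = np.array([1, 2, 3, 3, 4, 5, 6, 2, 7, 4])
log_units = np.array([2.1, 2.4, 2.0, 2.6, 2.9, 3.1, 2.8, 2.2, 3.3, 2.7])

late = calls >= 4    # four or more attempts: late-participant
early = ~late        # three (the median) or fewer attempts: early-participant

# Unadjusted late-minus-early difference in mean log consumption
diff = log_units[late].mean() - log_units[early].mean()

# Scenario 1 uses the difference itself; scenario 2 doubles it
deltas = {"MNAR1": diff, "MNAR2": 2 * diff}
print(deltas)
```

In a fuller analysis the difference would be estimated from a regression that adjusts for socio-demographic characteristics and health outcomes, with the late-participant indicator as an additional covariate.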
A second form of sensitivity analysis considered a range of deviations from the MAR specification based on subject-matter knowledge. A survey in Scotland specifically sampled harmful and dependent drinking in-patients and out-patients attending alcohol addiction services in two Edinburgh hospitals, finding an estimated mean weekly consumption of 198 (95% CI: 185–211) units. 38 For our purposes we posit this to be a generalisable estimate of consumption among drinkers who have been hospitalised. We therefore considered the MNAR-based sensitivity analysis where the imputation model involves specifying
4 Application
4.1 Non-participant synthetic observations (Stage 2)
The SHeS had an overall survey response level of 60% and a proportion of consent to record linkage in Stage 1 of 0.91 with
As a numerical example, consider the category of
Table 2. Eight-year probabilities of alcohol-related harm in the population, in the Scottish Health Survey participantsᵃ and the synthetic non-participants in 2003 by sex and area deprivation group.
ᵃ Those participants consenting to record-linkage of their data.
Table 3. Eight-year probabilities of all-cause mortality in the general population, in the Scottish Health Survey participantsᵃ and the synthetic non-participants in 2003 by sex and area deprivation group.
ᵃ Those participants consenting to record-linkage of their data.
Table 4. Weekly alcohol consumption estimates in the Scottish Health Survey 2003 participantsᵃ and the ‘full sample’ by sex under various assumptions about the missing data.
ᵃ Those participants consenting to record-linkage of their data.
4.2 MAR-based MI results (Stage 3)
A total of 4903 participants (91.1%) were classed as current drinkers with the remaining 478 participants (8.9%) considered non-drinkers (ex-drinkers or lifetime abstainers). Mean weekly consumption from the survey-weighted estimates was 21.8 units for men and 10.8 units for women. Imputing usual weekly alcohol consumption in Stage 3 under a MAR assumption, using each of the created bootstrap sample data sets, resulted in an estimate of 22.4 units (3% increase) among men and 10.8 units (0% change) among women (MAR results in Table 4).
4.3 MNAR-based MI results
The first scenario, in which the deviation from MAR is equal to the adjusted difference between early- and late-participants, yielded mean weekly consumption of 23.7 units among men and 11.1 units among women (Table 4, MNAR1M). For the second scenario, in which the deviation from MAR is twice the adjusted difference between early- and late-participants, the figures were 24.9 units (14% increase) and 11.5 units (6% increase), respectively (Table 4, MNAR2M). The corresponding results under the assumption that all non-participants were drinkers gave figures of 25.0 units (15% increase) for men and 11.7 units (8% increase) for women in the first scenario (Table 4, MNAR1UB) and 26.2 units (20% increase) for men and 12.0 units (11% increase) for women in the second scenario (Table 4, MNAR2UB).
Among men, adjusted mean consumption under the literature-based scenarios ranged from 24.6 units (13% increase) in the most conservative sensitivity analysis (MNAR3M) to 33.3 units (53% increase) in the most extreme (MNAR5M). Among women, this range was smaller, with corresponding figures of 11.0 units (2% increase) and 11.9 units (10% increase), respectively (Table 4, MNAR3M and MNAR5M). The corresponding results under the assumption that all non-participants were drinkers gave figures ranging from 25.9 units for men and 11.6 units for women (Table 4, MNAR3UB) to 34.7 units for men and 12.5 units for women in the most extreme scenario (Table 4, MNAR5UB).
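The percentage increases quoted in this section are consistent with changes relative to the survey-weighted means of 21.8 units (men) and 10.8 units (women). A short check, with the helper name `pct_increase` introduced purely for illustration:

```python
# Survey-weighted mean weekly units reported in section 4.2
baseline = {"men": 21.8, "women": 10.8}

def pct_increase(estimate, sex):
    """Percentage change relative to the survey-weighted mean, to the nearest integer."""
    return round(100 * (estimate - baseline[sex]) / baseline[sex])

print(pct_increase(22.4, "men"))     # MAR: 3
print(pct_increase(24.9, "men"))     # MNAR2M: 14
print(pct_increase(33.3, "men"))     # MNAR5M: 53
print(pct_increase(34.7, "men"))     # MNAR5UB: 59
print(pct_increase(12.5, "women"))   # MNAR5UB: 16
```

The last two values reproduce the upper ends of the ranges cited in the Discussion.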
5 Discussion
Our approach forms an important additional analytic strategy for addressing non-participation in population-sampled studies. The key innovations of our approach are the incorporation of auxiliary topic-relevant data into unit non-response correction in addition to the conventional socio-demographic data, combined with the creation of synthetic observations for non-participants and the application of pattern-mixture modelling to explore sensitivity to plausible departures from the MAR assumption. Resultant alcohol consumption estimates were sensitive to assumptions regarding both drinking status of non-participants and consumption level differences between participants and non-participants. The refined estimates were between 3% and 59% higher among men and up to 16% higher among women relative to the regular survey-weighted estimates. Given that survey-based alcohol consumption estimates scale up to approximately half those indicated by sales data, 40 our higher estimates appear to be most appropriate.
5.1 Strengths and limitations of this study
The first strength of this work is the utilisation of linked survey records enabling the extension of comparisons of participants and the general population from basic socio-demographic variables to health outcomes. 41 We circumvented the challenges associated with gaining rich data characterising the population, and non-participants in particular, by generating synthetic observations for non-participants. The second strength is the application of the much discussed but little implemented ‘principled sensitivity analysis’ 33 pattern mixture modelling to optimally 42 and transparently specify MNAR models. 43 In cases where no delta values are obviously more realistic than others, Rubin has emphasized the need for easily communicated models18,43 which are particularly valued by policymakers;44,45 we found it useful to impose assumptions in order to fix upon a plausible mechanism, considering specific conceivable scenarios in the context of the
Limitations include the possibility of distortion arising from survey participants not consenting to record linkage, which could explain some of the disparities between health outcomes in the survey samples relative to the general population; however, this affects only 9% of participants, and preliminary analyses suggest minimal differences between these groups (data available on request), indicating that this is unlikely to greatly distort findings. The available alcohol-related harm outcome measures were restricted to the relatively extreme occurrences, hospitalisation and death, with no data on the more frequent occurrences of commonplace harms related to alcohol abuse such as nausea, cognitive impairment and missed working days. This may explain the relatively small changes seen in the MAR estimates despite large survey-population differences in alcohol-related harm. 46 Previous work on refinement of alcohol consumption data in the presence of non-participation used Swedish data 41 and also considered the impact of augmented data on estimates of consumption prevalence, but was based on retrospective alcohol-related hospitalisation data. The retrospective approach offers an alternative which does not rely on the passage of time required for follow-up data. Our methodology alone is unable to address bias arising from participants mis-reporting their alcohol consumption. It is possible to account for such self-reporting bias by incorporating sales data, for instance. 16
5.2 Methodological strategy considerations
The following considers possible alternative approaches in specific steps of the analyses:
As an alternative to our procedure of generating synthetic observations and implementing MI, we could have applied weights or taken a Bayesian-based approach 47 based on health outcome statuses as well as socio-demographic characteristics. It is not clear how to implement MNAR methods with easily communicated models in these approaches.
A possible alternative to creating multiple data sets of synthetic non-participants followed by single stochastic imputation on each is a nested MI procedure where more than one final imputed data set is generated for each first-stage imputed data set. This could be computationally efficient if stage 2 was very slow and stage 3 was relatively fast, which was not the case here, and could help to partition the fraction of missing information between stages 2 and 3, but would require alternative combining rules to Rubin’s. 48
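For reference, the standard (non-nested) Rubin's rules for pooling a scalar estimate across imputed data sets can be sketched as follows. The function name `rubin_pool` and the numbers are hypothetical; a nested MI procedure would need a different variance decomposition, which is not shown here.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool M point estimates and their variances using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                  # pooled point estimate
    within = u.mean()                 # within-imputation variance
    between = q.var(ddof=1)           # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical mean weekly units and variances from five imputed data sets
est, tot_var = rubin_pool([22.1, 22.6, 22.4, 22.3, 22.6],
                          [0.25, 0.24, 0.26, 0.25, 0.25])
print(est, tot_var)
```

The between-imputation term is what captures the extra uncertainty due to the missing data; with a single imputation per synthetic data set, it absorbs the variability from both stage 2 and stage 3 together.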
The assumptions and relative merits of our approaches to determining delta values for the pattern-mixture approach are inherently untestable, and there is an array of alternatives to the propensity- and literature-based scenarios, including:
5.3 Implications
National survey data are crucial resources for quantifying and monitoring trends in health related behaviours with information used for the development, implementation and evaluation of social and public health policy. As such, methodological improvements are of interest to a wide international audience of policy makers and researchers. The development of an effective post-hoc correction procedure for ever-worsening non-response in resource-intensive population-sampled studies offers an enhancement at no additional cost to data collection. This advanced methodology will potentially be applicable to existing and future surveys wherever there is the capacity to record-link surveys with administrative data. Presently, linkage of survey data to routine health records represents a cost-effective means of generating valuable longitudinal data but is performed in very few countries. In exploiting such linkage, our work demonstrates the extended utility of record linkage, providing further impetus for its wider uptake internationally.
Synthetic generation of survey non-participants is not necessary in countries with unique population identifiers and comprehensive linkage (such as the Nordic countries) with the ability to follow-up all individuals regardless of participation status.14,52 However, possible ethical issues related to accessing outcome data of individuals who have chosen not to participate in a survey may mean that even in such countries stage 1 of our approach might be applicable. Regardless, stages 2 and 3 of our proposed methodology would be applicable in these settings. Our approach to the sensitivity analyses was specific to the context in terms of the estimate of interest and level of participation. Different applications will require distinct approaches to be formulated. In considering the most suitable derived estimates (the higher ones, in our case), we were guided by overall estimates obtainable from external alcohol retail data; dependent on the specific context of the wider applications, reference should be made, where possible, to such relevant sources.
The presented application suggests that non-response may contribute to the general under-estimation of alcohol consumption in survey estimates. There is scope for application to other survey-derived information, which can be discrete – cigarette smoking and obesity, for instance – for which only stage 3 of our procedure would need amending. The outcomes of choice in our application were alcohol-related harms and all-cause mortality on the basis of their strong association with alcohol consumption; single or multiple outcomes can be selected and good candidate outcomes for specific applications are those which have the strongest associations with survey items of interest. Further, non-health external data sources such as taxation or education records could be used to provide auxiliary information to correct for non-participation bias in other research areas. Moreover, this paper describes tackling a single incomplete variable; however, the method can be extended to multiple incomplete variables.
5.4 Further work
The current method requires that the informative data for creating the synthetic non-participants are categorical, since we are determining the missing numbers within discrete cells. It may be possible to incorporate continuous data – such as the number of health outcomes experienced – in a further stage by inferring the distribution among the non-participants such that a value could be assigned to each synthetic non-participant as a draw from that distribution and repeated across multiple replications to allow properly for uncertainty. This would be most appropriately performed by way of MNAR imputation to incorporate information about number of alcohol-related harms from the population comparison data.
Sensitivity analyses could potentially be used to address any differential consumption-outcome associations among area deprivation categories, i.e. allowing for the possibility of interaction effects suggested by the greater levels of alcohol-related harm among the more deprived for equivalent levels of consumption 53 or differential consumption-harm relationships by alcohol product type. 54 The application we describe focussed on a quantity estimate but there is growing recognition that non-representativeness can also lead to bias in estimates of associations. 55 We plan to develop, apply and test our methodology for association estimates.
A major alternative to the pattern-mixture approach to MNAR sensitivity analysis is the selection model approach.56,57 Selection modelling expresses departures from MAR as coefficients in a logistic regression model for non-participation on alcohol consumption and other covariates: the sensitivity parameter may therefore be less intuitive than in the pattern-mixture framework,32,35 and hence less easy to relate to subject-matter knowledge. 49 Shared parameter models, 58 in which the measurement of interest and missingness processes are jointly modelled, offer yet another option which can be explored.
6 Conclusions
We offer a means of addressing non-representativeness in survey data that goes beyond the use of conventional inverse probability weights, by developing a methodology which harnesses administrative and record-linked data. The key advantage of our approach is the relaxing of the assumption that socio-demographically equivalent participants and non-participants are alike in other ways: the application of the MAR method to administrative health record-linked data is an improvement on the conventional application of survey weights, and the MNAR methods utilise the best available data to make plausible assumptions about how they might differ.
Supplemental Material
Supplemental Material1 - Supplemental material for Correcting for non-participation bias in health surveys using record-linkage, synthetic observations and pattern mixture modelling
Supplemental material, Supplemental Material1 for Correcting for non-participation bias in health surveys using record-linkage, synthetic observations and pattern mixture modelling by Linsay Gray, Emma Gorman, Ian R White, S Vittal Katikireddi, Gerry McCartney, Lisa Rutherford and Alastair H Leyland in Statistical Methods in Medical Research
Supplemental Material
Supplemental Material2 - Supplemental material for Correcting for non-participation bias in health surveys using record-linkage, synthetic observations and pattern mixture modelling
Supplemental material, Supplemental Material2 for Correcting for non-participation bias in health surveys using record-linkage, synthetic observations and pattern mixture modelling by Linsay Gray, Emma Gorman, Ian R White, S Vittal Katikireddi, Gerry McCartney, Lisa Rutherford and Alastair H Leyland in Statistical Methods in Medical Research
