Introduction
Construct validation of descriptive and causal interpretations derived from measurements in education, health and the social sciences is an on-going and responsive process, requiring the generation of new evidence to support emerging conclusions. ‘Validity is a property of inferences’, not ‘… a property of designs or methods, for the same design may contribute to more or less valid inferences under different circumstances’ (p. 34). 1 Similarly, responses to a measurement instrument may vary with a change in the research context such that ‘… each interpretation of the scores needs to be validated …’ by a ‘… program of research to support the … application of the tool in relation to an increasing range of interpretations …’ (p. 2) 2 (see also Moss 3,4 and references therein). Paralleling the use of the concept in relation to the importance of contextual factors in evaluating health equity interventions, 5 we think of ‘contextual validity’ in healthcare measurement as documenting, describing and understanding the extent to which, and the limits within which, an instrument (questionnaire, rating scale, etc.) will yield consistent and valid interpretations in the varying contexts and purposes for which it is administered.
One important change in the measurement context encountered in the health sciences is the use of a patient self-report questionnaire for baseline (pre-test) and follow-up (post-test) measurement in the evaluation of a health promotion or health education intervention. A phenomenon known as ‘response shift’ is arguably a common occurrence. 6 Response shift entails a possible change in respondent perspective engendered by the educational or social context of the intervention and may produce various qualitatively different changes in the appraisal process during the generation of item responses.7–9 These changes in appraisal and response can threaten the validity of any inference about change in a construct that is measured, for example, by a multi-item composite scale. Furthermore, it has been argued that data derived from measures that require an increasing amount of subjective personal judgement in the generation of a response (so-called perception-based and evaluation-based measures) will be most vulnerable to response shift bias.10,11
An analogous phenomenon to response shift may also be present when comparisons are made across respondent groups. Different life experiences across age groups, males and females, cultural and educational background and so on may engender differing perspectives on the meaning of questionnaire items and consequent frameworks for responding that may, in turn, generate systematic differences in response and consequently factor structure across groups. 12 Hence, the concept of contextual validity is of critical concern in both longitudinal and cross-sectional measurement in healthcare evaluation.
From a measurement perspective, the concept of response shift is closely related to longitudinal measurement invariance, or, more specifically, factorial invariance when measurement invariance is conceptualised within a factor analytic framework. 13 When comparisons between factor or scale-score means and construct interrelationships across time, respondent groups or settings are based on composite scales, it is assumed that the measurement structure is unchanged, that is, each item continues to make an invariant contribution to the target construct.14–16 If invariance assumptions are violated, the validity of these comparisons is threatened.
Health Education Impact Questionnaire
The Health Education Impact Questionnaire (heiQ) is a perception- and evaluation-based measure that was developed 10 years ago to be a user-friendly, relevant and psychometrically sound instrument for the comprehensive evaluation of patient education programs and activities. 17 The present version (Version 3) measures eight constructs by multi-item composite scales. The English-language heiQ has been used in Australia, Canada, Great Britain, New Zealand, Singapore and the United States, translated into 20 other languages and applied across a wide range of evaluation studies from national and regional quality management systems to experimental trials. 18 The heiQ was chosen for this study as its widespread use (particularly in longitudinal studies evaluating the short- and medium-term impact of chronic disease self-management programs), and the comprehensive range of constructs measured justifies careful and on-going construct validation. The heiQ scales, number of items in each of the scales, a brief description of the construct being measured and a sample item for each scale are listed in Table 1. (The heiQ is copyright to Deakin University, Australia. Information on how to access the full questionnaire for research, course evaluation and translation into languages other than English is available on the heiQ website (http://www.deakin.edu.au/health/research/phi/heiQ.php)).
Table 1. heiQ Version 3: scale names, acronyms, number of items and construct descriptions.
heiQ: Health Education Impact Questionnaire.
The heiQ was developed following a grounded approach that included the generation of a program logic model for health education interventions and concept mapping workshops to identify relevant constructs. 17 Based on the results of the workshops, 69 candidate items were written and tested on a construction sample of 591 respondents drawn from potential participants of patient education programs and persons who had recently completed a program. The 69 items were reduced to a 42-item questionnaire measuring eight constructs and again tested on a replication sample of 598 respondents drawn from a broader population of attendees at a general hospital outpatient clinic and community-based self-management programs. Both confirmatory factor analysis (CFA) and item response theory (IRT) were used for item selection and scale refinement. In the revisions leading to Version 3, the number of Likert-type response options was reduced from six to four on advice from users (they are now
The heiQ is scored as eight separate scales using simple summation and dividing the summed score by the number of items such that the total score has the same potential range as an individual item (1–4). Thus, higher scores on all scales except emotional distress (ED) are regarded as a desirable outcome of a health education program. Scores on the ED scale are typically not reversed such that lower scores are regarded as a positive outcome.
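The scoring rule just described can be sketched in a few lines. This is a minimal illustration only: the function name `score_scale` and the example responses are hypothetical, and complete data on every item of the scale is assumed, since the handling of missing responses is not described here.

```python
def score_scale(responses):
    """Mean item response for one scale, with items coded 1 to 4.

    Summing the items and dividing by the number of items keeps the
    scale score on the same 1-4 range as a single item.
    """
    if not responses:
        raise ValueError("a scale needs at least one item response")
    if any(r not in (1, 2, 3, 4) for r in responses):
        raise ValueError("item responses must be integers from 1 to 4")
    return sum(responses) / len(responses)


# Hypothetical responses to a four-item scale.
example_score = score_scale([3, 4, 2, 3])
print(example_score)
```

Because every scale is divided by its own item count, scales with different numbers of items remain directly comparable on the 1–4 metric.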
The general factor structure of the original version of the heiQ was replicated by Nolte and colleagues19,20 who investigated its factorial invariance21–23 in the context of response shift bias across a traditional pre–post design as well as across a post-test compared with a retrospective pre-test (‘then-test’) design. Nolte’s results supported the stability of the factor structure across measurement occasions and questionnaire formats (configural invariance) and the metric and scalar invariance of the heiQ when used in the traditional pre–post design. While, in this design, approximately 10% of items were found to show some form of non-invariance from pre-test to post-test, Nolte 20 concluded that ‘… group level response shifts were not strong enough in any of the datasets to threaten the validity of comparing actual pretest with posttest data …’ (p. 118). However, factorial invariance was less clearly supported when the heiQ was used in the then-test design where approximately one-third of the heiQ items exhibited some form of non-invariance.
Given the wide application of the heiQ and its role in making clinical, program and policy decisions, further validation of its measurement structure using pre-test and post-test data is warranted. Furthermore, conclusions about the differences between scale-score means in longitudinal or cross-sectional designs are only justifiable if invariance of factor loadings and, particularly, item intercepts (or thresholds) is confirmed.24–26 Using a large independent sample, this article presents analyses of the 40 heiQ items retained in Version 3 where the simplified four ordinal response options are used. We seek to add further rigour and validity to the investigation of program impact and group differences when using the heiQ by addressing configural, metric (or ‘weak’) and scalar (or ‘strong’) factorial invariance15,16,27 over time and across important population sub-groups (sex, age, education, language spoken at home and country of birth).
We thus tested the hypotheses that the originally proposed structure of the heiQ was replicated with the revised response options and reduced item number, and that the measurement properties of the scales were sufficiently invariant to justify valid comparison of factor or scale-score means and interrelationships. The initial focus was to test the hypothesis that the specified clusters of items had acceptable unidimensionality, discriminant validity and reliability. Unidimensionality is a fundamental and necessary condition for assigning meaning to constructs measured by composite scales.28–30 It is defined as the existence of a single latent trait (variable) underlying each hypothesised item cluster30,31 and thus as a properly specified independent clusters measurement model having acceptable fit to the data.32,33 Subsequently, we investigated configural, metric and scalar invariance across time and population sub-groups. Configural invariance entails the demonstration of consistent item clusters as identified by the pattern of zero (or near-zero) and non-zero factor loadings across groups or time points while, similarly, metric invariance entails equality of factor loadings and scalar invariance equality of item intercepts (or alternatively item thresholds if the data are ordered categorical and analysed using a weighted least-squares approach).23,25
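The three levels of invariance can be stated compactly. The sketch below assumes the linear (continuous-indicator) form of the common factor model; with ordered categorical items analysed by weighted least squares, item thresholds take the place of the intercepts. The symbols are notation introduced here for illustration.

```latex
% Measurement model for group (or occasion) g
x_g = \tau_g + \Lambda_g \eta_g + \varepsilon_g, \qquad
E(\eta_g) = \kappa_g, \quad E(\varepsilon_g) = 0

% Configural: the same pattern of zero and non-zero elements in every \Lambda_g
% Metric:     \Lambda_1 = \Lambda_2 = \dots = \Lambda
% Scalar:     \Lambda_g = \Lambda \ \text{and} \ \tau_1 = \tau_2 = \dots = \tau

% Consequence of scalar invariance for observed item means
E(x_g) = \tau + \Lambda \kappa_g
\quad\Rightarrow\quad
E(x_1) - E(x_2) = \Lambda (\kappa_1 - \kappa_2)
```

Under scalar invariance, differences in observed item means are attributable to differences in latent means alone, which is why this level is required before factor or scale-score means are compared.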
Methods
Data
A dataset containing responses from all programs that utilised a data management website at both pre-test and post-test for the period July 2007–December 2012 was used. While the majority of respondents were participants in Australian chronic disease self-management programs run in hospitals, community health facilities or complementary care providers, data from a small number of similar programs in Canada were also included. After removal of records from those who made no response to more than 50% of the heiQ items at either pre-test or post-test, 3221 cases were available for analysis.
These data were gathered by a large number of individual healthcare organisations for their own monitoring and evaluation purposes using an ‘opt-in’ consent process. The de-identified data were provided to the heiQ research team specifically for on-going validation studies only. Some archived data were also gathered as part of a pilot health education quality assurance study funded by the Australian Government Department of Health and Ageing. Ethics approval for the use of these data for scale validation purposes was obtained from the University of Melbourne Human Research Ethics Committee.
Statistical approach
In re-examining the factor structure and measurement invariance of the heiQ, both unrestricted (frequently labelled ‘exploratory factor analysis’ – EFA) and restricted factor analyses (CFA) were employed in a complementary manner, taking advantage of the exploratory structural equation modelling (ESEM) routine in Mplus 34 for the unrestricted analyses.
The complementary use of EFA/ESEM and CFA can be very instructive for scale validation. 35 By specifying that each item should load on only one factor and constraining all ‘non-target’ factor loadings to exactly 0 in the form of a strictly specified ‘independent clusters’ model, CFA is frequently problematic for the analysis of multi-item multi-scale questionnaires.36–39 In particular, model fit may not reach acceptable standards and, even if it does, inter-factor correlations may be upwardly biased and lead to spurious conclusions about construct interrelationships. Additionally, particularly in large models, the incremental use of modification indices (MIs) to improve model fit can be confusing and potentially misleading. It is frequently recommended that parameters set to 0 in an initial model should be freely estimated on the basis of large MIs only if this is ‘theoretically justified’, 40 but this is a loosely interpreted caveat. A disadvantage of CFA can therefore be that 0 loadings are ‘forced’ on non-target factors even though associations may exist. Hence, while a finding of arguably acceptable fit for multi-factor CFA models may appear to support the conclusion of independent item clusters, the results may conceal salient evidence that associations that appear as high inter-factor correlations are better interpreted as cross-loadings indicating factorial complexity of items. To address this threat to model validity (and hence a clear conclusion of configural stability and invariance in the present version of the heiQ), a combination of ESEM and CFA was used with the expectation that the results would be consistent, and thus, evidence in support of the hypothesised structure would be strengthened. 35
Model estimation and fit
The mean and variance-adjusted weighted least-squares (WLSMV) estimator, suitable for the analysis of ordered categorical data, was used for all ESEM and CFA. WLSMV provides robust standard errors and a robust mean- and variance-adjusted chi-square fit statistic 40 (designated
As the sample size for this study was large, the primary focus for model acceptance was the extent of model fit (and misfit) indicated by the indices of close fit. Indicative threshold values for these were CFI ⩾ 0.95, TLI ⩾ 0.95 and RMSEA ⩽ 0.06, while a value of ⩽0.08 for the RMSEA was taken to indicate a ‘reasonable’ fit.42–44
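As a small illustration of how these indicative thresholds are applied, the sketch below checks each index separately. The function name `check_fit` is hypothetical; the example values are those reported later in this article for the least well-fitting ESEM configural model (country of birth at pre-test).

```python
def check_fit(cfi, tli, rmsea):
    """Compare close-fit indices with the indicative thresholds used here."""
    return {
        "cfi_close": cfi >= 0.95,           # CFI >= 0.95
        "tli_close": tli >= 0.95,           # TLI >= 0.95
        "rmsea_close": rmsea <= 0.06,       # RMSEA <= 0.06
        "rmsea_reasonable": rmsea <= 0.08,  # RMSEA <= 0.08: 'reasonable' fit
    }


# Country-of-birth ESEM configural model at pre-test.
result = check_fit(cfi=0.951, tli=0.946, rmsea=0.054)
for name, passed in result.items():
    print(f"{name}: {passed}")
```

Reporting each index on its own, rather than a single pass/fail label, keeps marginal cases such as a TLI just below 0.95 visible.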
Factor rotation in ESEM
A wide range of rotation options for ESEM are available in Mplus. A rotation approach that is designed to provide an approximation to Thurstone’s 45 original conception of ‘simple structure’ that allows for possible multi-factorial items is oblique Geomin.35,46 The default epsilon value of 0.01 for four or more factors was used. 46 While an independent clusters solution was hypothesised, oblique Geomin was chosen to provide evidence for the factorial complexity of the items should such complexity be indicated.
Configural, metric and scalar invariance
Factorial invariance of the heiQ was investigated following recent advocacy for a refocus of the usual statistical approach and development of a revised methodology. 47 The investigation of metric and scalar invariance is predicated on the demonstration of satisfactory configural invariance. Given the potential hazards in the use of CFA alone to establish configural invariance discussed above, both ESEM and CFA were used for this stage of the investigation.
Two ESEM analyses, one for the pre-test data and one for the post-test, were conducted to test the hypothesis that an eight-factor model was a satisfactory fit to the correlations between the 40 items of the heiQ, irrespective of the specific configuration of the factors. 48 From six to eight factors were extracted in each analysis. The ESEM analyses were followed by validation of the specific multi-factor configuration by fitting and examining the results of CFA and ESEM (Geomin rotation) models to the pre-test and post-test data separately.
Following the demonstration that the hypothesised eight-factor model was a satisfactory fit to the data, metric and scalar invariance were investigated using full eight-factor CFA and ESEM models and scale-by-scale analyses. Typically, metric and scalar invariance are investigated by fixing factor loadings and item intercepts (or thresholds) to equality across groups or time in a hierarchical manner.14,21 However, Raykov et al. 47 have argued that the structural equation models used in this approach have significant limitations and, in general, do not provide a complete and unconditional statistical assessment of either metric or scalar invariance. As the scalar invariance model is typically not nested within that for metric invariance, a statistical test of whether the additional constraints result in a meaningful reduction in fit is not possible. Furthermore, as the metric invariance model normally requires that the loading of one factor indicator (e.g. questionnaire item) be fixed to 1.0 in each group, a complete test of the equality of factor loadings is not possible 47 (pp. 955–956). Hence, it is recommended that, at present, metric and scalar invariance be investigated using only an unconditional model with a complete set of constraints for both metric and scalar invariance but the minimum constraints necessary for model identification. Multi-factor CFA and ESEM models were fitted across sex, age, education level, country of birth and home-language groups separately using the CONFIGURAL and SCALAR ‘convenience features’ available in Mplus 7.1 (see Version 7.1 Mplus Language Addendum available at http://www.statmodel.com/) to achieve the minimal constraints necessary for identification required by the Raykov and Marcoulides approach. (The Mplus ‘convenience features’ resulted, for the CONFIGURAL model, in the factors in both groups being identified by setting one loading to 1.0, while all other loadings and the factor variances were freely estimated. Also, the scale factors were set to 1.0, while all item thresholds were estimated. For the SCALAR model, factors were similarly identified by setting one factor loading in each scale to 1.0, while the other loadings were constrained to be equal across groups and factor variances were free. Factor means, however, were fixed to zero in the reference group only and were freely estimated in the comparison group. All item thresholds were constrained to be equal across groups, while the scale factors were fixed to 1.0. The Delta parameterisation was used for both analyses.) The chi-square difference test appropriate for WLSMV estimation (provided by the DIFFTEST option; see p. 5 of the Version 7.1 Mplus Language Addendum, available at http://www.statmodel.com/) was used to assess the change in model fit between the configural and scalar models only. For across-occasion measurement invariance, an analogous model was tested in which pre-test factor means were fixed to 0 and factor variances to 1.0, while both were free to vary at post-test. Additionally, the longitudinal character of the model was taken into account by freely estimating correlations between parallel item residuals across time.
Reliability
Reliability was assessed using the Mplus coding for composite scale reliability developed by Raykov.33,49 Composite scale reliability is defined as the ratio of true variance to total variance in a homogeneous cluster of test items and is obtained as a robust maximum likelihood estimate of this ratio. While Cronbach alpha can be seriously biased if the test items are not at least tau-equivalent (i.e. have, in practice, equal factor loadings) and in the presence of correlated residuals, the maximum likelihood estimator of composite scale reliability is, instead, consistent and unbiased. 33 Cronbach alpha is, however, also presented for possible comparison with the results of similar scale validation studies.
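The contrast between composite scale reliability and Cronbach alpha can be illustrated numerically. The sketch below is not the Mplus coding used in this study. It assumes a one-factor model with the factor variance fixed to 1, uncorrelated residuals and a unit-weighted sum score, and all loadings and residual variances are invented for illustration.

```python
def composite_reliability(loadings, residual_variances):
    """True-score variance over total variance of the unit-weighted sum."""
    true_var = sum(loadings) ** 2
    return true_var / (true_var + sum(residual_variances))


def cronbach_alpha_from_model(loadings, residual_variances):
    """Cronbach alpha computed from the model-implied item covariances."""
    k = len(loadings)
    total_var = sum(loadings) ** 2 + sum(residual_variances)
    item_var_sum = sum(l * l + t for l, t in zip(loadings, residual_variances))
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Tau-equivalent items (equal loadings): the two coefficients coincide.
equal_l, equal_t = [0.7] * 4, [0.51] * 4
# Congeneric items (unequal loadings): alpha falls below composite reliability.
mixed_l, mixed_t = [0.9, 0.7, 0.5, 0.3], [0.19, 0.51, 0.75, 0.91]

for label, l, t in [("tau-equivalent", equal_l, equal_t),
                    ("congeneric", mixed_l, mixed_t)]:
    print(label,
          round(composite_reliability(l, t), 3),
          round(cronbach_alpha_from_model(l, t), 3))
```

With uncorrelated residuals, alpha can only equal or underestimate the composite reliability, and the gap grows as the loadings become more unequal. Correlated residuals, which are not modelled in this sketch, can bias alpha in either direction.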
Discriminant validity
Discriminant validity of the heiQ constructs was studied by inspecting the size of the inter-factor correlations in both CFA and ESEM results 50 and by comparing the inter-factor shared variance estimates with the average variance extracted (AVE) by each factor involved.51,52
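The AVE comparison can be sketched as follows. The function names are hypothetical, and the standardised loadings and inter-factor correlations are invented for illustration; they are not heiQ estimates. AVE is taken here as the mean squared standardised loading of a factor's items.

```python
def average_variance_extracted(std_loadings):
    """Mean of the squared standardised loadings of a factor's items."""
    return sum(l * l for l in std_loadings) / len(std_loadings)


def has_discriminant_validity(loadings_a, loadings_b, factor_corr):
    """True when shared variance is below the AVE of both factors."""
    shared_variance = factor_corr ** 2
    return (shared_variance < average_variance_extracted(loadings_a)
            and shared_variance < average_variance_extracted(loadings_b))


# Invented loadings for two four-item factors.
factor_a = [0.7, 0.7, 0.6, 0.6]
factor_b = [0.8, 0.7, 0.7, 0.6]

# A high inter-factor correlation fails the criterion; a modest one passes.
print(has_discriminant_validity(factor_a, factor_b, factor_corr=0.80))
print(has_discriminant_validity(factor_a, factor_b, factor_corr=0.30))
```

Because the criterion depends on the squared correlation, any upward bias in an estimated inter-factor correlation translates directly into a stricter test.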
Results
Replicating the structure and reliability of heiQ Version 3
Model fit statistics for the ESEM analyses to establish the number of factors are shown in the upper part of Table 2. According to the close-fit criteria, all but the six-factor model at pre-test satisfied all thresholds for a good fit. In corresponding ‘scree’ plots of the eigenvalues of the two correlation matrices, there were six eigenvalues >1.0 while the plot for the post-test data showed a clear discontinuity between the eigenvalues of the eighth and ninth factors. It was concluded that eight factors, as hypothesised, would be satisfactory for subsequent investigation of the factorial structure of the data, minimising potential problems with underfactoring. 53
Table 2. Fit statistics for exploratory factor analyses (ESEM) and CFA of pre-test and post-test heiQ data separately.
ESEM: exploratory structural equation modelling; CFA: confirmatory factor analysis; heiQ: Health Education Impact Questionnaire; d.f.: degrees of freedom; CFI: comparative fit index; TLI: Tucker-Lewis index; RMSEA: root mean square error of approximation; CI: confidence interval.
Validation of the specific configuration of these eight factors was then conducted by fitting eight-factor CFA models to the pre-test and post-test data separately and tabulating and examining the standardised factor loadings from the eight-factor ESEM analysis.
Fit statistics for the two eight-factor CFA independent cluster models are shown in the lower part of Table 2. While model fit did not reach the ‘satisfactory fit’ thresholds established above, the statistics suggest a closer fit than is frequently found with similar self-report psychological data. 38 As can be seen in the upper-right triangle of Table 3, however, inter-factor correlations are high in a number of instances (particularly between self-monitoring and insight (SMI) and skill and technique acquisition (STA) at both pre-test and post-test). As these high inter-factor correlations may be the result of the ‘forced’ zero cross-loadings in the CFA, ESEM analyses were examined to further investigate the possibility that some items may be factorially complex, which would call the homogeneity of the scales into question.
Table 3. Factor correlations in the CFA (upper right) and ESEM (lower left) analyses.
Estimated correlations between the SMI and STA factors in the contrasting CFA and ESEM analyses are given in bold. The ESEM analysis yields considerably lower estimates, arguably a result of allowing non-target loadings to be estimated rather than fixed precisely to 0.
CFA: confirmatory factor analysis; ESEM: exploratory structural equation modelling; HDA: health-directed activities; PAEL: positive and active engagement in life; ED: emotional distress; SMI: self-monitoring and insight; CAA: constructive attitudes and approaches; STA: skill and technique acquisition; SIS: social integration and support; HSN: health service navigation.
As anticipated, model fit was considerably improved when cross-loadings were not fixed precisely to 0 in the eight-factor ESEM analyses (see the appropriate rows in the upper part of Table 2). Given the very good fit of the ESEM models, the potential for model improvement by including correlated item residuals was not explored. 50
Table 4 shows the factor pattern for the Geomin obliquely rotated solutions for pre-test and post-test data separately with the order of the factors in the raw output rearranged to correspond to the hypothesised heiQ factors. It can be seen that the unrestricted ESEM analyses resulted in factor patterns that, while showing some evidence of factorial complexity, corresponded well with the hypothesised structure based on the original scales. 17 First, there was clear evidence of at least moderate loadings of all hypothesised factors on their target items and no evidence of any substantial factorial complexity in the constituent items for six of the a priori heiQ factors, namely, health-directed activities (HDA), positive and active engagement in life (PAEL), emotional distress (ED), constructive attitudes and approaches (CAA), skill and technique acquisition (STA) and health service navigation (HSN) (i.e. all hypothesised factor loadings were ⩾0.4 while there were no secondary loadings on the constituent items ⩾0.3). Furthermore, for one additional scale (social integration and support (SIS)), factorial complexity was found for only one item (Item 45) at both pre-test and post-test and all hypothesised loadings were ⩾0.4. The factor pattern for the SMI items was, however, somewhat more complex. For this scale, all but one item appeared factorially complex with one or two loadings ⩾0.3 from non-hypothesised factors. In all but one instance, these non-target loadings were higher than the respective target loading; four of the factorially complex items appeared in post-test heiQ data, while two factorially complex items were found in pre-test heiQ data, with Item 21 being complex in both pre- and post-test data.
Table 4. Standardised factor loadings for two eight-factor ESEM analyses of the pre-test and post-test heiQ data – Geomin rotation (
All loadings ⩾0.3 shown with those not hypothesised in italics; hypothesised loadings <0.3 also shown (underlined). Items and factor loadings are arranged according to the hypothesised structure.
ESEM: exploratory structural equation modelling; heiQ: Health Education Impact Questionnaire; HDA: health-directed activities; PAEL: positive and active engagement in life; ED: emotional distress; SMI: self-monitoring and insight; CAA: constructive attitudes and approaches; STA: skill and technique acquisition; SIS: social integration and support; HSN: health service navigation.
Correlations between the latent variables were typically considerably smaller in the ESEM analyses than those in the CFA (Table 3). As Marsh et al. 54 point out, the ‘… inappropriate imposition of zero factor loadings …’ (p. 472) on non-target items in CFA ‘… usually leads to distorted factors with positively biased factor correlations …’. The median absolute (with ED reflected) inter-factor correlation in the ESEM analyses for the pre-test data was 0.38 (range: 0.06–0.60) and for the post-test 0.41 (range: 0.09–0.65) compared with parallel results for the CFAs of 0.61 (0.29–0.83) and 0.67 (0.30–0.88). The maximum inter-factor correlation of 0.65 in the ESEM results is well below the threshold of 0.80–0.85 that is frequently recommended as indicating poor discriminant validity 55 (p. 131). The inter-factor correlations between the factor pair that was identified as potentially confounded in the ESEM (SMI with STA) were 0.24 at pre-test and 0.39 at post-test in the ESEM analysis compared with 0.83 at pre-test and 0.88 at post-test in the CFA (Table 3, values in bold type). Additionally, the correlations between SMI and ED (−0.34 and −0.37 in the CFA) were very small but marginally significant in the ESEM analysis (at pre-test: −0.06 (95% confidence interval (CI) = −0.10 to −0.01) and at post-test: −0.09 (95% CI = −0.15 to −0.05)).
Discriminant validity was also investigated by calculating the AVE by each of the heiQ factors in both the CFA and ESEM analyses at pre-test and post-test and comparing these values to the appropriate estimates of the shared variance between each pair of constructs. The presence of sufficient discriminant validity between the constructs is demonstrated when the shared inter-factor variance is less than the AVE of each of the factors involved.51,52 By this criterion, in the CFAs, there was evidence of insufficient discriminant validity between SMI and PAEL, CAA, STA and HSN and also PAEL and CAA at both pre-test and post-test. In the ESEM analysis, there was evidence of insufficient discriminant validity between SMI and HSN at pre-test and between SMI and HDA, SMI and STA and SMI and HSN at post-test (full results are available in Supplementary Table 1). Taking the results together and considering the likely over-estimation of inter-factor correlations in the CFAs, the results suggest that the discriminant validity of the SMI construct from HSN and STA in particular may not be fully established.
Composite scale reliability with 95% confidence intervals (italicised) based on robust standard errors and, for comparison with other studies, Cronbach α (in parenthesis) for the scales in Version 3 estimated from the pre-test data are as follows – HDA: 0.83/
Configural invariance
The complementary CFA and ESEM analyses described above replicated the eight-factor structure of the heiQ and the homogeneity of seven of the scales, thus clearly establishing the basis for a detailed investigation of the across-time and across-group invariance of the scales. The factorial identity and homogeneity of the SMI scale, however, was not so clearly established, but it was retained for the invariance analyses to seek further information on its psychometric performance.
To establish configural invariance across time, a 16-factor CFA was conducted with no cross-loadings and with correlated residuals allowed only between identical items at pre-test and post-test. To identify the model, factor variances were set to 1.0 at pre-test and post-test, while factor loadings and item intercepts were freely estimated as were all inter-factor correlations. Fit statistics for this model were as follows:
Similar CFAs (but without allowing estimation of any residual correlations) were conducted across groups formed by sex, age (split at the median, 63 years), education (year 10 or less vs above year 10), country of birth (Australia vs overseas) and language spoken at home (English vs other) for the pre-test and post-test data separately. For these analyses, models were identified using the CONFIGURAL specification in Mplus 7.1 described in section ‘Methods’. The results are shown in Table 5.
Table 5. CFAs of two-group eight-factor models of the heiQ for pre-test and post-test separately with factor loadings and item thresholds freely estimated testing for configural invariance.
CFA: confirmatory factor analysis; heiQ: Health Education Impact Questionnaire; d.f.: degrees of freedom; CFI: comparative fit index; TLI: Tucker-Lewis index; RMSEA: root mean square error of approximation.
All models met the threshold for a good fit indexed by the RMSEA while values for the CFI and TLI either satisfied the threshold for good fit or were very close to it. Given the requirement that the models contained no cross-loadings or correlations between item residuals, these results suggest a quite satisfactory fit of the eight-factor configural model across the selected groups at both pre-test and post-test. It should be noted, however, that the numbers of cases in the country-of-birth and home-language analyses are unbalanced. Chen 57 has pointed out that ‘unequal sample sizes … might affect changes in goodness of fit indices …’ (p. 469). In a Monte Carlo study, Chen 57 showed that in across-group tests of invariance of factor loadings, item intercepts and residual variances, estimated changes in the CFI and RMSEA (among other fit indices) were reduced when sample sizes were unequal and therefore ‘… invariance tests are more likely to fail to detect invariance’ (p. 499). (Parallel eight-factor ESEM configural invariance models were also estimated. With one minor exception, the close-fit indices for these models were within the thresholds established for this study. The least well-fitting model was for country of birth at pre-test where the TLI was marginally below the threshold of 0.95. Close-fit indices for this model were RMSEA = 0.054 (90% CI 0.053–0.055), CFI = 0.951, TLI = 0.946.)
Longitudinal and across-group metric and scalar invariance of the heiQ
To investigate across-time metric and scalar invariance, a 16-factor CFA model was fitted to the pre-test and post-test data combined. In this model, factor loadings and item thresholds were constrained to be equal, while only the residuals of item pairs across pre-test and post-test were allowed to be correlated. All inter-factor correlations were estimated. For identification, factor means were fixed to 0 at pre-test and factor variances to 1.0. Means and variances were freely estimated at post-test. Fit statistics for the 16-factor longitudinal CFA model were as follows:
Inter-item correlations, factor loadings and fit statistics for CFAs of eight separate longitudinal (pre-test and post-test) models for the heiQ scales testing for metric and scalar invariance.
CFA: confirmatory factor analysis; heiQ: Health Education Impact Questionnaire; d.f.: degrees of freedom; CFI: comparative fit index; TLI: Tucker-Lewis index; RMSEA: root mean square error of approximation; HDA: health-directed activities; PAEL: positive and active engagement in life; ED: emotional distress; SMI: self-monitoring and insight; CAA: constructive attitudes and approaches; STA: skill and technique acquisition; SIS: social integration and support; HSN: health service navigation.
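The constraints of the longitudinal metric and scalar invariance models can be written compactly as follows. This is a sketch in conventional notation, not the authors' own formulation. The symbols (λ, τ, κ, φ, θ) are assumptions introduced here, and the latent-response form assumes ordinal items, as implied by the use of item thresholds. It is written for a single scale measured at pre-test (t = 1) and post-test (t = 2).

```latex
% Latent-response measurement model for item i at occasion t (assumed notation)
y^{*}_{it} = \lambda_{it}\,\eta_{t} + \varepsilon_{it},
\qquad
y_{it} = c \iff \tau_{it,c} < y^{*}_{it} \le \tau_{it,c+1}

% Metric and scalar invariance: loadings and thresholds equal across time
\lambda_{i1} = \lambda_{i2}, \qquad \tau_{i1,c} = \tau_{i2,c} \quad \text{for all } i, c

% Identification: pre-test factor standardised, post-test factor free
E(\eta_{1}) = 0,\quad \operatorname{Var}(\eta_{1}) = 1,
\qquad
E(\eta_{2}) = \kappa,\quad \operatorname{Var}(\eta_{2}) = \phi

% Residuals: only same-item pre/post covariances are estimated
\operatorname{Cov}(\varepsilon_{i1}, \varepsilon_{i2}) = \theta_{i},
\qquad
\text{all other residual covariances} = 0
```

Under these constraints, κ can be read as the pre-test to post-test change in the latent mean, expressed in pre-test standard deviation units. That reading depends on the equality constraints on loadings and thresholds holding.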
A final set of CFAs addressed the question of metric and scalar invariance across five salient demographic groupings: sex, age, educational level, country of birth and language spoken at home. These models utilised the SCALAR model command in Mplus 7.1. Model fit statistics are shown in Table 7, together with the results of the Mplus chi-square difference tests comparing the configural and scalar models. Fit was clearly satisfactory (RMSEA < 0.06; CFI and TLI > 0.95) for all models, with the exception of those for age at pre-test, where both the CFI and TLI values were below the recommended thresholds, and (very marginally) those for sex and education at pre-test. The chi-square difference test results were variable, with some not significant (NS) but the majority
CFAs of two-group eight-factor models of the heiQ with factor loadings and item thresholds fixed to be equal across demographic sub-groups testing for metric and scalar invariance.
CFA: confirmatory factor analysis; heiQ: Health Education Impact Questionnaire; CFI: comparative fit index; TLI: Tucker-Lewis index; RMSEA: root mean square error of approximation; d.f.: degrees of freedom.
Discussion and conclusion
While patient self-report questionnaires are often used to investigate change in healthcare interventions, their contextual validity, including cross-sectional and longitudinal measurement invariance, is infrequently investigated. Additionally, many such questionnaires comprise items and scales that entail a high level of personal subjective judgement from respondents. The absence of a clear demonstration of measurement invariance when evaluating change and across-group differences threatens the validity of interpretations and conclusions derived from the use of these scales. In this article, using recently developed factor analytic approaches, we demonstrated measurement invariance of the heiQ. This is an important finding as the heiQ has become widely used to make program and policy decisions – decisions that affect patient care, program implementation and program funding.
Among the principal reasons for the extensive application of the heiQ is that it yields timely and understandable information about the impact of self-management interventions across a variety of chronic conditions. 18 Given this widespread use in different contexts, it is incumbent on the scale developers to provide a framework within which the validity of inferences drawn from the instrument can be supported. While most patient self-report questionnaire development studies provide initial evidence of reliability, factor structure and (possibly) concurrent or predictive validity, on-going research is required to provide rigorous support for the increasing range of inferences drawn from these instruments. 2 When used to assess change across time as well as outcomes across a diverse range of patient groups, the rigorous investigation of their contextual validity is particularly necessary.
Following recent arguments,35,36,38,54 ESEM was used in this article in combination with CFA to substantiate the hypothesised eight-factor structure of the heiQ. Additionally, the CFA method applied here to investigate configural, metric and scalar invariance was recently reviewed and recommended. 47 While there is an extensive literature on invariance testing extending over the past three decades and a consensus that factor analysis provides an appropriate and powerful approach, there remains considerable controversy about the specific CFA (or, indeed, ESEM) models that are most appropriate. The advocated model uses minimal restrictions for identification but full equality constraints for both metric and scalar invariance. 47 If this model yields a satisfactory fit to the data, the way is clear to make valid inferences about possible differences between factor- or scale-score means across groups or time and about possible interrelationships between the invariant construct measures.
The eight-factor structure and configural invariance of the 40-item version of the heiQ were clearly replicated with items consistently aligning well with their hypothesised target construct over the period of a self-management intervention and across salient demographic groups. Furthermore, metric and scalar invariance across time and over demographic groups was well established with the caveats (a) that invariance across age groups may warrant further investigation and (b) that the analyses across country of birth and home language may have reduced sensitivity to detect invariance due to the unbalanced numbers in the compared groups. This finding of metric and scalar invariance is particularly important given that the heiQ items are largely ‘perception-based’ or ‘evaluation-based’ where the amount of personal judgement involved in generating a response is large and, particularly for ‘evaluation-based’ items, the subjectivity of the criteria used to make these judgements is such that comparisons across time and persons may be particularly problematic.
The finding of satisfactory factor structure and psychometric properties for the heiQ in this study also supports the decision to use a simplified four-option response set in later versions of the questionnaire. Both pre-test and post-test factor structures and the reliability of all scales have now been replicated in the analysis of the website data (four response options) and the Nolte 20 study (six response options).
The reliability of the 6-item SMI scale has been found to be consistently lower than that of the other scales17,20 while, in this study, its discriminant validity was less clearly supported. The factorial complexity of the items in this scale, as seen in the ESEM analyses, may be contributing to the lower reliability and weaker discriminant validity; however, as the scale measures a construct that is central to a conception of self-management, we believe it should continue to be used, albeit with caution, while the construct is investigated further. It is interesting to note that the items of the SMI scale that show factorial complexity appear, for the most part, to be related to the STA scale. There is quite possibly a strong, perhaps iterative, causal relationship between the constructs measured by these two scales, with the results of self-monitoring and consequent awareness of progression of a chronic condition leading to the person actively seeking new strategies and skills to improve their condition (note that the most strongly multi-factorial SMI item is Item 21, ‘When I have health problems, I have a clear understanding of what I need to do to control them’, an item that connotes a clear action orientation to addressing the health problem). This possible causal relationship may lead to a confounding of some items of the SMI scale with the STA and other constructs (HDA, HSN) with consequent cross-loadings and lowered discriminant validity, particularly for those respondents who score high on it.
The poorer model fit and low factor loadings of the SMI scale may also suggest that there are two underlying constructs that are being brought together in the scale: (a) self-monitoring and (b) consequent insight and understanding of, for example, triggers of flare-up of the chronic condition. Further research might explore these issues through in-depth qualitative interviews with individuals scoring at different levels on these two scales and the development and psychometric testing of additional items that could identify the separate constructs. However, despite the factorial complexity of some of its constituent items, the SMI scale shows a satisfactory level of across-time measurement invariance; hence, summed scores on the scale are comparable from pre-test to post-test in the study of self-management education interventions.
Despite the caveats associated with the SMI scale, this study supports the high level of interest in the use of the English-language version of the heiQ, particularly as a pre-test/post-test measure in experimental studies, other pre-test/post-test evaluation designs and system-level monitoring and evaluation. Positive psychometric evaluations of French, German and Japanese translations of the heiQ have been reported59–61 and independent studies of translations into Danish, Dutch, Canadian French, Italian and Norwegian are underway, providing support for its use across a wide range of languages, cultures and healthcare systems, and opportunities to establish extensive cross-cultural measurement invariance and contextual validity.
