Abstract
Introduction
Since the days of ethnographic pioneers such as the anthropologist Franz Boas and members of the Chicago School of Urban Sociology, a vast literature has developed on the procedures and underlying philosophies of qualitative research. One relatively recent development is the increasing use of qualitative methods and information, which focus on the natural behavior of people and their perceptions of the social world (Denzin & Lincoln, 2005; Yin, 1994), for evaluation purposes in social policy and health care. The health sector in particular has seen a surge in approaches and writings on
However, the increased importance given to qualitative information in the
In this article, I will explore aspects of validity of qualitative research with the explicit objective of connecting them with aspects of evaluation. Given the nature of the evaluator–stakeholder relationship in evaluations (see Rossi, Lipsey, & Freeman, 2004), and the methodological properties of qualitative research in particular, qualitative information in evaluation can have three different purposes. First, it can contribute to or focus on the instrumental effectiveness of a policy or program. Second, it can focus on the meaning of an intervention for clients and target groups. Third, it can serve the empowerment of participants.
I will argue that the different purposes of qualitative evaluation in social policy and health care can be linked with different scientific paradigms and perspectives and aligned with relevant validity procedures. Such a conceptualization transcends unproductive paradigmatic divisions and provides a framework for researchers and reviewers of qualitative evaluations. The framework presented here aligns the three evaluation purposes with matching paradigms, perspectives, and validity procedures.
Qualitative Research and the Question of Validity
Since roughly the 1970s, increasing criticism of the reliability and objectivity of qualitative research has resulted in a growing interest in establishing more rigorous criteria and methodological standards. This attention has somewhat shifted from standards for the implementation of the study by the researcher to verification strategies for evaluating the credibility of qualitative findings by external reviewers (Morse, Barrett, Mayan, Olson, & Spiers, 2002). Validity is a key concept in this discussion. In the positivistic, rational tradition of science methodology, “validity” can be defined as the degree to which the indicators or variables by which a research concept is made measurable accurately represent that concept. Does, for example, a response scale that measures interactions with members of other ethnic groups indeed refer to intercultural tolerance? Obviously, this rational definition of validity does not work well in qualitative naturalistic research, which does not focus on variables at the interval or ratio level. As a result, in the qualitative methodological literature, “validity” has been relabeled with alternative terms such as authenticity, adequacy, plausibility, and neutrality (see, e.g., Lincoln & Guba, 1985; Maxwell, 1996; Merriam, 1998). Nevertheless, within the academic community, the idea seems dominant that qualitative researchers must demonstrate in one way or another that their research results are valid. Several authors have therefore sought to develop specific research procedures and criteria aimed at increasing the validity of qualitative outcomes.
Probably the most influential is the work of Guba and Lincoln (see Guba & Lincoln, 1981; Lincoln & Guba, 1985), who were among the first to develop specific criteria for qualitative research. They started from the premise that although all research must possess high truth value, the properties of knowledge within the “rational” (or quantitative) paradigm are different from those within the “naturalistic” (or qualitative) paradigm (as cited in Morse, Barrett, Mayan, Olson, & Spiers, 2002). According to Guba and Lincoln, each paradigm requires specific criteria to determine the veracity of the research. Within the rational paradigm, criteria can be formulated in terms of internal validity, external validity, reliability, and objectivity. Within the naturalistic paradigm, it is better to speak of criteria such as “credibility,” “fittingness,” and “confirmability.” Later, Lincoln and Guba (1985) redefined these concepts as “credibility,” “transferability,” and “dependability.” Guba and Lincoln subsequently formulated several procedures aimed at increasing the credibility of qualitative research.
Popular procedures originally conceptualized by Guba and Lincoln are member checking, peer debriefing, prolonged engagement in the field, the audit trail, and the analysis of disconfirming evidence (negative case selection).
Many qualitative researchers still regard these criteria as methodological standards. In the wake of Guba and Lincoln, many authors supplemented or refined their criteria or suggested alternative terminology for similar procedures. Around the turn of the century, Morse et al. (2002, p. 15) concluded that this had resulted in a “plethora” of terms and criteria that often brought more confusion than clarity in establishing the validity of qualitative research. Methodological textbooks still show considerable overlap on this point, and most criteria derive directly from the themes first conceptualized by Guba and Lincoln.
Critique of Validity Standards in Qualitative Research
Despite efforts to advance the debate on validity, some authors reject the desirability of predetermined criteria for qualitative research altogether. Sandelowski and Barroso (2002), for example, distance themselves from the search for general criteria for qualitative research because, in their view, the epistemological range of qualitative methods is too broad to be represented by a uniform set of criteria. Instead, they argue for a more rhetorical approach in which quality must be determined separately for every study. Sandelowski and Barroso (2002) write: “The only site for evaluating research studies—whether they are qualitative or quantitative—is the report itself” (p. 8). In the same vein, Rolfe (2006) points out that qualitative research cannot fall back on a single scientific paradigm. Any attempt to reach consensus on qualitative criteria, according to Rolfe, therefore has little chance of success. There simply is no common body of qualitative theory or methodology that can collectively be described as “qualitative research” (unlike quantitative research, perhaps, which despite the diversity of its applications is based on similar mathematical laws). Rolfe argues his case by showing contradictions and paradoxes in common validity checks. Member checking and peer debriefing, for instance, are problematic because if it is assumed that there is no universal truth but only different, individually constructed truths to which every individual assigns his or her own meaning (in effect the premise of much qualitative research), then we cannot expect the respondents or external evaluators of qualitative studies to arrive at corresponding categories and conclusions (cf. Sandelowski, 1993, p. 3).
Hammersley (2007) is also critical of the attempt to formulate uniform criteria for qualitative research. He points out that several qualitative approaches explicitly reject the idea that the production of knowledge should be the sole immediate goal of research and instead insist on political “action.” Proponents of this approach believe that qualitative research is part of the education and social advancement of people and that this function is rendered useless when education is separated from research (see, e.g., J. Elliott, 1988). Related approaches assign qualitative research a political function by requiring that it be focused on bringing about change of one kind or another: for example, by challenging capitalism, racism, homophobia, or social disadvantage. Hammersley emphasizes that, in addition to traditional epistemological considerations, these approaches introduce alternative considerations for assessing the quality of research. Such alternative criteria would be formulated much more in terms of education, politics, ethics, aesthetics, or even economics (e.g., does the study offer value for money?).
Like Rolfe and Sandelowski, Hammersley ultimately rejects the idea that a final set of universal criteria can be formulated. The obstacles originate not only from political “action” objectives but also from differences in value assumptions. He illustrates this with the example of the growing research on the impact of gender differences on the educational achievement of children (see Hammersley, 2007, pp. 294–295). To accept this as a relevant research topic, argues Hammersley, one must believe in the equality of the sexes (a belief that may not be shared by certain religious groups or sociobiologists). One also has to share the assumption that certain disparities in the classroom affect educational performance, defined in terms of exam success. However, there are people who see gender differences as a predominantly social construct, and there are those who deny that school exams provide a sound indication of educational performance. What Hammersley shows with this example is that research in the social domain is framed by a series of value assumptions that can produce serious differences. The fewer of a research field's underlying assumptions are shared, the more difficult it is to defend the relevance of the research and to reach consensus on its validity criteria. Hammersley (2007) nevertheless believes that certain criteria, in the form of “guidelines,” can play a role in a more rigorous assessment of qualitative research, though he does not clarify what these guidelines should be:

My conclusion is that guidelines for qualitative research are desirable [. . .]. However, the barriers to our being able to produce any set of common guidelines, even among qualitative researchers, are formidable. At the same time, we should not simply accept at face value methodological pluralism, reinforcing it by treating each qualitative approach as having its own unique set of quality criteria. Dialogue on this issue across different approaches, and indeed across the qualitative–quantitative divide is essential for the future of social and educational research. (p. 301)
A Model for Validity in Qualitative Evaluation: Linking Purposes, Paradigms, and Perspectives
One’s stance on the question of validity in qualitative research, then, primarily depends on which scientific paradigm is supported, leading some authors to reject the desirability of predetermined criteria for qualitative research altogether. Yet one could equally argue that different paradigms simply call for different validity procedures, so that criteria can be aligned with the paradigm and purpose at hand rather than rejected wholesale.
Table 1. Validity Procedures Within Qualitative Lens and Paradigm Assumptions.
Based on the three paradigm assumptions, Creswell and Miller identify nine different types of validity procedures (see Table 1). Besides the paradigm assumptions, the procedures are arranged by the different perspectives—Creswell and Miller call these “lenses”—through which the validity of qualitative research can be assessed (see the vertical axis of the table). These lenses constitute the researchers’ own perspective, that of the participants in the research, or that of external reviewers or readers.
Member checking, the audit trail, prolonged engagement, peer debriefing, and disconfirming evidence (negative case selection) are criteria discussed earlier in connection with the work of Guba and Lincoln. Triangulation is a validity procedure whereby researchers base their categories and/or conclusions on different sources of information (see Denzin, 1978). The researcher might examine, for example, whether conclusions derived from interviews are consistent with findings from document analysis and observations. The more the categories and conclusions are confirmed by different data sources, the more valid the results. Reflexivity of the researcher refers to the extent to which researchers make their personal values and beliefs explicit in the research report, in such a way that it is clear to what extent these might have influenced the results. This can be done in the form of a methodological paragraph or comments throughout the report. Thick description involves the detailed description of the setting, the participants, and the themes of the study. The purpose of thick description is that it creates verisimilitude: an account that takes readers as much as possible into the studied world and its main characters. Detail is the key word here. Researchers should describe, for instance, interactions with informants and personal experiences, or provide a detailed description of the emotions of the respondents. Collaboration is a criterion particularly associated with the critical paradigm, meaning that participants should be involved in the study as coresearchers or in less formal relationships.
Creswell and Miller’s work advances the debate on validity in qualitative research in several ways. It elegantly unites different worldviews or paradigms within qualitative research with the key perspectives by which the validity of qualitative research can be assessed: that of the researcher, the respondent, and the external reader. It further explicates the criteria that are essential for each respective paradigm and/or perspective.
The framework of Creswell and Miller provides a basis for a new model for validity in qualitative evaluation. As argued in the introduction of this article, qualitative evaluation can have three different purposes. It can contribute to or focus on the instrumental effectiveness of a policy or program, the meaning of an intervention for clients and target groups, or the empowerment of participants.
Table 2. Validity Procedures of Qualitative Evaluation Aligned to Purposes, Paradigms, and Perspectives.
Naturally, as with Creswell and Miller’s original model, the assessment procedures are partly interchangeable. Member checking and peer debriefing, for example, can be applied in all three paradigms. In this sense, one must keep in mind that the framework is an ideal type. The model nonetheless establishes priorities, indicating which procedures are especially important for which paradigm and evaluation purpose. Each procedure in effect serves as a counterweight to inherent methodological weaknesses of the respective evaluation purposes.
In the case of a qualitative evaluation that primarily focuses on the instrumental effectiveness of a particular policy or program (does it work? what are its working components?), the criteria of triangulation, member checking, and the audit trail are essential. These criteria are the most appropriate for avoiding or detecting spurious (causal) inferences and possible biases, which are in themselves significant potential distortions when assessing the instrumental effectiveness of a program or policy. Triangulation, in particular, reduces chance associations and biases due to the specific methods used, allowing for greater confidence in interpretations (Fielding & Fielding, 1986; Maxwell, 1992). This is crucial when evaluating the effectiveness of any method or policy. Oliver, Aicken, and Arai (2013) used this procedure to help policy makers make better decisions on childhood obesity. By triangulating user involvement data with a mapping study of interventions aimed at reducing child obesity, the investigators concluded that enhancing mental well-being should be a policy objective and that greater involvement of peers and parents in the delivery of obesity interventions would be beneficial.
If the goal is to uncover the meaning of the intervention for clients and target groups, then the research should acknowledge disconfirming evidence (or negative case selection), there must be prolonged engagement in the field (not a snapshot study), and external readers should be able to adequately identify the experiences of respondents through thick description. These criteria counterbalance an overly one-sided report of the experiences of particular individuals (disconfirming evidence) or circumstances (prolonged engagement) and allow for a thorough understanding of the experiences of respondents (thick description). Washington, Demiris, Oliver, Wittenberg-Lyles, and Crumb (2012) used the procedure of prolonged engagement in an analysis of informal hospice caregivers who had participated in a structured problem-solving intervention (using open-ended exit interviews). Through their prolonged participation in the program, the researchers could report how caregivers actively reflected on caregiving, structured problem-solving efforts, partnered with interventionists, resolved problems, and gained confidence and control. The study thereby added depth to the understanding of problem-solving interventions for informal hospice caregivers, which can be used to enhance existing support services.
If the evaluation has an emancipatory intent (empowerment), then reflexivity of the researcher becomes particularly important. It should become clear how personal beliefs or dispositions might have influenced the investigation, as most empowerment-based evaluations (e.g., participatory action research) require a strong involvement of the researcher with his or her research subjects and the theme under study (with the attendant risk of “going native”). Elliott, Fischer, and Rennie (1999, p. 221) argue for “owning one’s perspective,” whereby authors specify their theoretical orientations and personal anticipations, both as known in advance and as they become apparent during the research (see also Choudhuri, Glauser, & Peregoy, 2004; Morrow, 2005). As a hypothetical example of poor practice, Elliott et al. present a case of authors who report an investigation of the process of recovering from childhood sexual abuse but give no indication of who they are and what they brought to the research. The reader is thereby forced to read between the lines to detect the authors’ presuppositions. As good practice, Elliott et al. argue, the authors should have described their theoretical, methodological, or personal orientations as relevant to the research (e.g., feminist, symbolic interactionist, heterosexual); their personal experiences or training relevant to the subject matter (e.g., a therapist who works with sexual abuse survivors); and their initial (or emerging) beliefs about the phenomenon under study (e.g., that recovery from abuse requires forgiveness). From the perspective of the participants, finally, empowerment evaluations must also employ collaboration, meaning that participants should be involved in the evaluation as coresearchers or in less formal relationships.
Let me further illustrate the model with the hypothetical example presented in the introduction (a support program for pregnant teenagers). Suppose a qualitative case study is performed that aims to investigate the working components of the program, drawing on interviews, observations, and document analysis. Given its main purpose—evaluating the effectiveness of the program itself—three procedures are essential. From the evaluator’s perspective, triangulation must be performed (do findings from interviews with teenagers, observations of the execution of the program by practitioners, and document analysis overlap?). From the participant perspective, there must be member checking (do participating teenagers endorse the conclusions and interpretations made by the evaluators?). Finally, an audit trail must be kept so that external reviewers can verify whether the presented findings are supported by the data and whether (causal) inferences about the workings of the program are grounded (e.g., are intended effects—such as engaging the teenagers to remain in school—indeed achieved? On what data are these conclusions based? On what grounds are arguments made?). The same steps can be followed for the other evaluation purposes (meaning and empowerment) by “checking” the procedures down the columns and linking them with the perspectives in the rows.
Note that in the new evaluation model, Creswell and Miller’s original criteria are complemented with other relevant procedures. Triangulation can be enhanced by
Conclusions and Discussion
It is important to note that the framework presented in this article
However, we must keep in mind that the actual application of validity procedures of qualitative inquiry takes time and energy. Whether it concerns member checks, keeping an audit trail, or thick description of the data, respecting validity criteria for qualitative research is easier said than done (causing some researchers to present a “procedural charade” in their reports, see Whittemore, Chase, & Mandle, 2001). In the realm of policy and program evaluation, in particular, it can be difficult to maintain certain standards. A PhD student working with a time frame of several years will generally have the patience and opportunity to apply validity procedures adequately. But for an evaluator or policy researcher who has to make an assessment of the impact of a social measure in, say, 2 months because the political situation calls for it, the situation is different. For him or her, the temptation will be greater to cut corners in the analysis. It is therefore important that funders of qualitative evaluations create the time and space for evaluators to implement validity criteria in earnest.
Finally, apart from the methodological and practical considerations, it would be fruitful to take a step back and study the social, cultural, and institutional aspects of some of these issues (see also Strassheim & Kettunen, 2014). It would be interesting, for instance, to discern why preferences for particular paradigms and purposes of evaluation seem to correspond with different time periods and sectors. In health research, the personal experiences and realities that clients and target groups bring to a particular program (constructivist paradigm) have certainly become more important over the last two decades or so, and this development has served as a supplement (or perhaps counterweight) to the dominance of postpositivist investigations focused on the instrumental effectiveness of programs. In community and social work, the reverse seems to be at work. Historically highly influenced by postmodern and constructivist schools of thought, programs in community and social work are now increasingly fitted into experimental or “quantized” research models reminiscent of the old
Sociological and sociohistorical research can not only shed light on how and why such sectoral paradigm shifts occur; it could also investigate how these shifts influence ideas about what counts as “evidence” for particular social programs or policies, and whether or how this in turn influences ideas about the role of qualitative information in evaluation.
