Abstract
Keywords
Definition of Terms.
In experimental message testing research, precise messages operationalize intended treatment conditions. A poorly constructed message can result in Type I errors (rejecting a true null hypothesis) if the message tests unintended or tangential concepts, and in Type II errors (failing to reject a false null hypothesis) if the constructed message dilutes the treatment condition. Thus, the instrument fidelity imparted by precise message construction is the foundation for sound investigation. Common strategies used in message construction include needs assessment focus groups, participant phrasing discussions, and interviews with target audiences (Willoughby & Brickman, 2020). However, the instrument fidelity of messages is suspect because no well-established, transparent procedures for scientific message construction exist, even though scholars in these fields expend tremendous effort in describing and analyzing participant responses to messages. While there is an abundance of important theoretical discussions of mixed methods standards of validity and data integration (Dellinger & Leech, 2007; Fàbregues & Molina-Azorín, 2017; Fetters et al., 2013), we address a methodological gap in these works regarding such standards in experimental research. Indeed, the current approaches to constructing messages occur in a black box, reliant largely on expert opinion, with the often unstated assumption that message construction accurately represents treatment conditions. With no established procedures to maximize instrument fidelity, many existing studies, including some of our own, engender real threats to validity and reliability.
The vagueness of message construction typically resides in how researchers incorporate data from the formative research phase into the messages that are subsequently tested. The following three studies exemplify message construction in message testing research that lacks instrument fidelity. In the first example (Poehlman et al., 2019), messages were developed and tested that would “resonate with and inspire priority groups to act” on Zika virus prevention in Puerto Rico (p. 900). The research team first conducted qualitative, formative research with women in the Women, Infants, and Children program. They then conducted “environmental scans” to quickly collect information from a variety of publicly available sources. Finally, the team brainstormed concepts for their messaging campaign. Yet,
In an effort to address validity, some empirical studies rely on a textual basis for constructing messages. For example, one study adapted publicly disseminated political emails (McLaughlin et al., 2019). Another excerpted from actual news (McLaughlin, 2020), while another used segments from a television program (Semmler & Loof, 2019). Shanahan et al. (2014) used ideas from public comments, and others found passages from interviews (Leshner et al., 2018). Finally, several studies noted that questions asked during the focus group or other formative research phases were informed by theory (Jordan et al., 2012; Lapka et al., 2008). Despite these efforts, none of the studies specifically or explicitly described
While these studies have advanced our knowledge of the effects of messages (both science-based and narrative-based messages), our procedure takes a first step toward precision in message construction through the deployment of a systematic mixed method to develop message treatments (Figure 1). Specifically, our mixed methods procedure integrates data collection and data analysis (Fetters & Molina-Azorin, 2017) with an exploratory sequential mixed methods design (Fetters & Freshwater, 2015).
The aim of this article is to present our unique contribution to mixed methods research that addresses the methodological gap in message testing research: the need for a formalized procedure guiding the
The remainder of this article is organized as follows. The section
Shining a Light into the Black Box: Why Employ Mixed Methods to Improve Precision in Message Construction?
We asked the question:
Exemplar: Constructing Narrative Messages to Communicate Riverine Flood Hazard Information
In this article, we use narrative messages in a riverine flood hazard context as an exemplar to illustrate how mixed methods can achieve integration in a transparent, systematic manner. The empirical goal of our exemplar was to measure the power of narrative risk communication to influence the audience’s affective response to a message (i.e., the valence and intensity of emotions) and their intended risk mitigation behaviors (Raile et al., 2022; Shanahan et al., 2019a). Specifically, we sought to learn how the narrative mechanism of “character selection” works to persuade in narrative science messages about flooding and whether narrative messages that highlight “hero” characters generate different affective responses and decisions than narrative messages that highlight “victim” characters. Therefore, building strong message treatments with different character sets was of paramount importance to our exemplar.
The Study Area of Our Exemplar
The Yellowstone River basin in Montana, USA, is highly susceptible to flooding hazards. With a classic mountain-snowmelt hydrologic regime, the Yellowstone experiences frequent flooding events. These conditions are made especially acute when combined with higher frequencies of “rain on snow” events. Local hazard preparedness is vital to avoid the hazard-to-disaster trajectory. This iconic river originates in Yellowstone National Park and flows 1,100 km northeast to its confluence with the Missouri River in western North Dakota. The Yellowstone is the longest unimpounded river in the conterminous 48 states and flows through several communities in Montana (Figure 2). Land on the Yellowstone River is held by private landowners and federal and state management agencies. The river is integral to many different communities for agricultural, residential, industrial, and recreational purposes. Without appropriate hazard preparedness, the increased volatility introduced by climate change will significantly increase the vulnerability of individuals and communities (Whitlock et al., 2017).
Why Narrative Messages?
At the heart of conventional risk communication is the assumption that scientific information on the probability and consequences of natural hazards will lead people to engage in risk-reduction behaviors (Ludy & Kondolf, 2012); however, many studies reveal that scientific information in isolation rarely affects hazard preparedness (Wachinger et al., 2013). New information about flood hazards is unlikely to prevent hazard-to-disaster trajectories because an alarming gap persists between scientific predictions of hazards and the general population’s perceptions of risks associated with those very same hazards (Barnes, 2002). In turn, preparedness decisions are often based on subjective factors derived from life experiences and cultural values rather than up-to-date science information (Bubeck et al., 2012). In our broader project, we propose that one way to improve risk communication is to use narrative structure to relay the story of the scientific information.
Why
Testing narrative-based risk communication is not new but has been consistently imprecise because black-box message construction impairs instrument fidelity. Inferring causal mechanism(s) from black-box messages is dubious; in particular, internal validity is compromised if messages lack ecological and construct validity, regardless of how well researchers measure and analyze the responses to those messages. While previous risk communication studies have examined the differences between the impact of technical information and narratively presented hazard information (Barbour et al., 2016; Occa & Suggs, 2016), the narrative treatments lacked validity, as they were ad hoc constructions made up by researchers. Our approach was to reduce threats to validity by capturing narrative elements directly from residents in the study area to catalog in vivo local language used to describe flood hazards, a step referred to as participant enrichment (Collins et al., 2006).
Why NLP?
Natural language processing (NLP) refers to a suite of machine-learning tools that can be applied to infer, describe, and quantify the meaning and nuance in human language transcripts. NLP techniques enable efficient and unbiased processing of large bodies of texts (such as coded interview transcripts or other qualitative source texts) that would be unwieldy or impossible for a human to process manually. These techniques bring out qualities of narratives that are impossible to discern with the naked eye (Flanders, 2005). We assert that NLP strengthens the operationalization of the qualitative and theoretical foundations of our research procedure by enabling us to identify, and rank by relative importance, the words that most precisely capture the treatment conditions from the source texts.
Applications of computational science techniques in the subfields of digital humanities and computational social science are limited (Grubert & Siders, 2016). Yet, computational techniques have tremendous potential to help social scientists make valid causal inferences and develop theory from the assessment of large and unwieldy datasets (Grimmer, 2015). Nascent applications of machine-learning tools such as text mining, sentiment analysis, word frequency analysis, topic modeling, and text clustering (Grubert & Siders, 2016) are promising because they offer ways to preserve “the superior abilities to interpret text holistically provided by humans but [incorporate] the formal rigor, reliability, and reproducibility of computer-assisted methods” (Nelson, 2020, p. 8).
Risk messages need to be precise and ecologically valid to the extent possible. Consequently, utilizing the language of the target population, as identified in source texts (e.g., interviews), in message construction is critical but also presents a substantive research challenge. At its heart, the operational challenge is to objectively identify, classify, and rank the importance of descriptive terms most strongly associated with each treatment condition from numerous source texts while also accounting for the variability in lengths of source texts. We confront this research challenge via judicious integration of NLP into qualitative research.
Detailed Procedure
Purpose of Each Step in the Procedure.
Qualitative Phase I
The procedure begins with Qualitative Phase I, comprising Step 1 Qualitative source text and Step 2 Human coding. Briefly, in Step 1, we compiled our source text by conducting and subsequently transcribing semi-structured interviews with 45 individuals in three flood-prone communities in our study area (Figure 2). In Step 2, we used human coding to bin local language from the semi-structured interviews into character language categories (hero vs. victim) based on an NPF codebook (Shanahan et al., 2018a). Integration occurred as the human-coded hero and victim texts became the foundation of the subsequent quantitative phase. Below, we detail each Step outlined in Figure 1 and Table 2.
Step 1 Qualitative Source Text
We conducted semi-structured interviews to provide the vernacular needed to build narratives in the target audience’s own language; this step was conducted to improve the operationalization of NPF theory and to reduce threats to ecological and construct validity (Table 2). Thus, the raw material for narrative construction came from semi-structured interviews conducted with 45 individuals in three communities along the Yellowstone River in Montana. These three communities–Livingston, Miles City, and Glendive–were chosen because they border the Yellowstone River and had all experienced significant riverine floods recently despite manmade levees intended to protect infrastructure. Regardless of these commonalities, these communities have different relationships with the river, including varying recreational and economic opportunities (Shanahan et al., 2018c; Bergmann et al., 2020). The purposive sampling procedure aimed to achieve a sample with individuals from a range of affected sectors in the communities. The resulting sample included interested citizens, business owners, and residents from along the river. The interviews were distributed across the three communities (
The first section of the interview protocol (Shanahan et al., 2019b) focused on problems, benefits, and risks associated with flooding on the Yellowstone River, as well as sources of information for learning about such flooding. To develop our message treatments, we needed locally derived language describing victim and hero characters. Thus, the second section asked about
The Human Ecology Learning and Problem Solving (HELPS) Lab at Montana State University transcribed nearly all the audio files from the interviews, with researchers completing the remaining few. In total, the 45 interviews resulted in 42 transcripts. Two individuals were interviewed simultaneously. Another two individuals refused audio recording per the informed consent procedures; field notes for these interviews were taken but not used subsequently. We aimed to allow interviews to unfold at a relatively leisurely pace so that interviewees would feel comfortable and would use their own descriptive language. The resulting transcripts ranged in length from about 3,500 words to over 32,000 words, with a median of 9,016 words.
Step 2 Human Coding
We used human coding to assign local language from the 42 semi-structured interviews (Step 1 Qualitative source text) into appropriate narrative elements (e.g., characters) and nodes within those elements (e.g., hero or victim language). More plainly, we manually tagged victim and hero language in all interview transcripts based on NPF theory. This step aimed to fortify the integration of theory into final message construction while simultaneously bolstering the ecological and internal validity of final message treatments (Table 2).
Human coding for characters was an iterative process that began in a deductive manner. Previous NPF codebooks (Shanahan et al., 2013) provided the foundation for the coding. Existing NPF research also provided definitions for the character nodes. According to the NPF, heroes are fixers of problems, whereas victims are entities being harmed (Shanahan et al., 2018b). Four researchers began by independently coding the same transcript in NVivo 11 software (QSR International Pty Ltd., 2015). The main nodes, established deductively from the NPF, were the hero and victim character categories. The specific identities of these characters (i.e., the sub-nodes) emerged inductively from the data (e.g., government floodplain administrator under the hero node or individual homeowner under the victim node). The researchers then convened to compare specific coding actions and categories. Based on this comparison, they revised and consolidated the codebook. Three of these researchers then independently coded a second transcript in full. They met again to refine the node structure and coding scheme. These iterative comparisons were important for ensuring reliability in coding. The researchers then distributed and coded the remaining 40 transcripts based on the refined coding scheme, coding at the sentence level for hero and victim language. A fourth coder subsequently coded a random selection of 20% of each interview to check for inter-coder reliability. Averaged across all interviews, Cohen’s kappa (Cohen, 1960) for hero coding was 0.883 and for victim coding was 0.880; both values fall in the range that Landis and Koch (1977) label almost perfect agreement.
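To make the reliability check concrete, Cohen’s kappa for two coders can be sketched in a few lines of Python. This is an illustration only, not the procedure run in NVivo: the function name `cohens_kappa` and the ten sentence-level codes are hypothetical, and the sketch treats each sentence as one binary decision (1 = coded as hero language, 0 = not).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[lab] * freq_b[lab] for lab in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentence-level hero codes from two coders.
coder_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
coder_2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(coder_1, coder_2)
```

Here the coders agree on 9 of 10 sentences, chance agreement is 0.5, and kappa is therefore 0.8.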
Quantitative Phase I
Quantitative Phase I comprises Step 3 Natural Language Processing and Step 4 Word classification. In this phase, we employed NLP techniques and word classification to distinguish words from interview transcripts that were most strongly associated with each of our treatment conditions. Integration occurred as the individual “hero words” and “victim words,” identified via NLP and word classification, provided our research team with the key terms to use in the narratives to precisely operationalize hero versus victim message treatment conditions.
Step 3 Natural Language Processing
Across all interviews (Step 1 Qualitative source text), the human-coded text associated with hero and victim characters (identified in Step 2 Human coding) was combined into bodies (i.e., corpora) of character-related text: one corpus for hero language and one for victim language. In turn, these corpora were subjected to NLP to identify and rank word choices for each character type. The rationale for integrating computational techniques is twofold. First, we wanted to reduce threats to the internal and construct validity of our final messages (Table 2) by efficiently and objectively discerning the words that most precisely characterized victim or hero treatments to the target audience. Second, the corpora of character-related text were large and unwieldy. Specifically, the hero corpus contained about 35,400 words, while the victim corpus contained about 58,300 words. In what follows, each step used in our computational NLP approach is described.
Assessment of the coded text using NLP techniques required carrying out certain preprocessing procedures. Natural language (i.e., human-generated language) presents a combinatorial problem for computers, which can “view” each unique letter, word, sentence, and paragraph as a feature for consideration. This high dimensionality can dramatically slow down automated content analysis algorithms. Thus, the goals of preprocessing are to reduce the number of features in a narrative without losing relevant information and to reach a vectorized representation for computational text analysis models. All preprocessing steps used the RStudio integrated development environment (RStudio Team, 2019) and the R programming language (R Core Team, 2019), relying heavily on the tm (text mining) package (Meyer et al., 2008).
First, we reorganized the 42 coded, semi-structured interview transcripts (Step 1 Qualitative source text and Step 2 Human coding) into sets of documents by label (i.e., hero and victim) so that each document contained all the coded language from a label found in an interview. For example, document 1 in the hero corpus contained all the hero-coded language elements from interview 1, whereas document 1 in the victim corpus contained all the victim-coded language elements from the same interview. In total, we extracted 472 instances of hero language elements and 748 instances of victim language elements. These language elements ranged from one sentence to a paragraph in length. The aggregation of each set of documents made up the hero and victim corpora, respectively.
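The reorganization described above amounts to grouping coded excerpts first by label and then by interview. A minimal Python sketch follows; the study itself used R, and the tuple layout, the function name `build_corpora`, and the example sentences are all hypothetical placeholders rather than study data.

```python
from collections import defaultdict

# Hypothetical coded excerpts: (interview_id, label, text).
coded_excerpts = [
    (1, "hero", "The city crew reinforced the levee."),
    (1, "victim", "Our basement flooded."),
    (1, "hero", "Neighbors filled sandbags."),
    (2, "victim", "The ranch lost its hay."),
]


def build_corpora(excerpts):
    """Group excerpts so each corpus holds one document per interview."""
    grouped = defaultdict(lambda: defaultdict(list))
    for interview_id, label, text in excerpts:
        grouped[label][interview_id].append(text)
    # Join all excerpts with the same label and interview into one document.
    return {
        label: {iid: " ".join(parts) for iid, parts in docs.items()}
        for label, docs in grouped.items()
    }


corpora = build_corpora(coded_excerpts)
hero_document_1 = corpora["hero"][1]
```

In this toy input, document 1 of the hero corpus contains both hero excerpts from interview 1, while the victim corpus holds one document for each of the two interviews.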
The next set of four preprocessing steps included commonly used approaches in automated content analysis: conversion to lowercase, character scrubbing, stop-word removal, and tokenization. All of these methods reduce the number of features for consideration with minimal semantic loss. Lowercase conversion quickly reduces the number of features considered, as words like “he” and “He” would otherwise be interpreted as unique terms. Character scrubbing retains letters and removes unhelpful symbols, such as punctuation, URL markers, and numbers. Similarly, stop-word removal eliminates many high-frequency terms used in natural language such as “a,” “the,” and “that.” The tm R-package default list of 174 English stop-words and an additional custom list, created by our researchers, were used for selecting words tagged for removal from the documents. The custom list was tailored to reflect interview transcripts and the flood risk domain. This list included terms like “uh,” “uhm,” and “hmm,” which are important social cues in vocal speech but not relevant to the formation of narratives.
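The first three preprocessing steps can be illustrated with standard-library Python. This is a sketch under stated assumptions, not the tm pipeline used in the study: `STOP_WORDS` is a tiny placeholder standing in for the 174-word tm list, `CUSTOM_STOP_WORDS` contains only the three fillers named above, and splitting on whitespace stands in for unigram tokenization.

```python
import re

# Placeholder stop-word list; the study used the tm default list instead.
STOP_WORDS = {"a", "the", "that", "and", "of", "to", "is", "it"}
# Transcript fillers named in the text; the full custom list is not shown there.
CUSTOM_STOP_WORDS = {"uh", "uhm", "hmm"}


def preprocess(text, stop_words=STOP_WORDS | CUSTOM_STOP_WORDS):
    """Lowercase, scrub non-letter characters, and drop stop-words."""
    lowered = text.lower()
    # Replace anything that is not a letter or whitespace with a space.
    scrubbed = re.sub(r"[^a-z\s]", " ", lowered)
    return [token for token in scrubbed.split() if token not in stop_words]


tokens = preprocess("The River rose 3 feet, and, uhm, the river took the barn!")
```

After preprocessing, “River” and “river” collapse to a single term, and the digit, punctuation, and filler are gone, leaving `river, rose, feet, river, took, barn`.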
The final preprocessing step, tokenization, breaks the documents into feature vectors. These vectors are sequences of integers that store the counts for each unique term in every document of the corpus; each integer in a vector represents a count for a term from a document. In order to tokenize, a term length must be determined. For this project, the documents were broken into unigram terms (i.e., one word per term). Bigram (i.e., two words per term) and N-gram (i.e., n words per term) models were explored, but they did not yield useful information. The feature vectors are combined into a term-document matrix, where rows represent the unique terms found in the corpus, columns represent the documents of the corpus, and cells store the term count. This creates a large and sparse matrix from which we can perform automated content analysis.
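A term-document matrix with the orientation described above (rows are terms, columns are documents) can be sketched as follows. The function name `term_document_matrix` and the two toy documents are hypothetical, and the dense list-of-lists layout is chosen for readability; a real corpus would call for a sparse representation.

```python
from collections import Counter


def term_document_matrix(tokenized_docs):
    """Return (terms, matrix) with matrix[i][j] = count of terms[i] in doc j."""
    counts = [Counter(doc) for doc in tokenized_docs]
    # Sorted vocabulary across all documents fixes the row order.
    terms = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


# Two hypothetical documents, already preprocessed into unigram tokens.
docs = [["river", "rose", "river"], ["levee", "river"]]
terms, tdm = term_document_matrix(docs)
# Summing across a row gives the corpus-wide count for that term.
row_totals = {term: sum(row) for term, row in zip(terms, tdm)}
```

Most cells are zero once the vocabulary grows, which is why the matrix is described as large and sparse.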
The performance of algorithms that use a vector approach for storing the word frequencies is linear (i.e.,
Step 4 Word Classification
We classified the words in the hero and victim corpora (Step 3 Natural Language Processing) using automated content analysis. The purpose of this step was to classify the “hero words” and “victim words” to operationalize NPF theory most precisely in the final messages while also reducing threats to ecological, construct, and internal validity (Table 2). We experimented with four different content analysis techniques and found that term frequency calculations proved to be the most informative text analysis techniques for the creation of narratives (see King (2019) for full description of the other three methods). Term frequency measurements on transcripts from the target audience provided the exact vocabulary used to communicate messages about the flood domain. Using the term-document matrices, the term counts for each corpus were calculated by summing across each row. Given that the corpora were of different sizes, term counts were normalized by dividing by the total number of words in each corpus, calculating a relative frequency,
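The normalization of term counts by corpus size can be illustrated with a short Python sketch. It assumes that the corpus total is counted after preprocessing, which the text does not specify; the function name `relative_frequencies` and the toy corpora are hypothetical.

```python
from collections import Counter


def relative_frequencies(tokenized_docs):
    """Term count divided by the total number of tokens in the corpus."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


def ranked(freqs):
    """Terms ordered from most to least frequent, ties broken alphabetically."""
    return sorted(freqs.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical corpora of different sizes (4 tokens versus 8 tokens).
hero_docs = [["help", "levee", "help"], ["plan"]]
victim_docs = [["lost", "home"], ["lost", "flood", "lost", "home", "damage", "fear"]]

hero_rf = relative_frequencies(hero_docs)
victim_rf = relative_frequencies(victim_docs)
top_hero = ranked(hero_rf)[0][0]
top_victim = ranked(victim_rf)[0][0]
```

Because each value is a share of its own corpus, the hero and victim rankings stay comparable even though the corpora differ in size.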
Qualitative Phase II
Qualitative Phase II comprises only one step, Step 5 Algorithmic message construction. Here, we employed a human-generated algorithm—rooted in narrative theory—to construct the narratives using key words discovered through the NLP analysis. Integration occurred again between this phase and the subsequent one, as the algorithmic message construction enabled us to evaluate the instrument fidelity of each segment of each message treatment in Quantitative Phase II.
Step 5 Algorithmic Message Construction
With the hero and victim vocabularies in hand (Step 4 Word classification), we proceeded with algorithmic message construction. Algorithmic message construction reduced threats to reliability and ecological, construct, and internal validity (Table 2). As discussed earlier, the primary goal in the exemplar was to investigate the influence of the narrative mechanism of victim and hero characters on affective responses (Shanahan et al., 2019) and intended risk mitigation behaviors (Raile et al., 2022). As such, we constructed narrative messages with language corresponding to three distinct character mechanisms: victim, hero, and victim-turns-hero. Victim language emphasizes negative outcomes for the audience members and their communities. Hero language emphasizes the entities responsible for fixing flood-related problems, including the audience members. Victim-turns-hero language creates an arc in which the negative outcomes can be overcome by the audience members and their communities.
The secondary goal in the exemplar was to determine whether science information presented in the language of probability or certainty had greater persuasive power. Probability language is the
We constructed each narrative with a common structure; however, we strategically varied the content of each of the four segments (or pieces) that compose a message to enable testing of different treatment combinations (Figure 3; Shanahan et al., 2019). All narrative messages opened with an identical definition of a riverine flood. The second segment in each narrative framed the problem of flooding with a victim, hero, or victim-turns-hero frame. The third segment described science information about flooding using either probability or certainty language. The fourth and final segment described how the characters in the story took action to prepare for a flood hazard with a character mechanism of victim, hero, or victim-turns-hero. Thus, narrative messages for the victim treatment included victim language in both the second (problem framing) and fourth (characters in action) segments of the messages; likewise, narrative messages for the hero treatment included hero language in both of these segments. In contrast, the narrative messages for the victim-turns-hero treatment included a combination of victim and hero language. The full narrative messages with segments identified are presented in S1 Text of Shanahan et al. (2019).
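The four-segment structure lends itself to a simple data model. The Python sketch below is illustrative only: the class name `NarrativeMessage`, the field names, and the placeholder segment labels are hypothetical, and no message text from the study is reproduced. It enumerates only the character-by-science crossings spelled out in this paragraph and makes no claim about how the full set of tested messages was composed.

```python
from dataclasses import dataclass
from itertools import product

CHARACTER_MECHANISMS = ("victim", "hero", "victim-turns-hero")
SCIENCE_LANGUAGE = ("probability", "certainty")


@dataclass(frozen=True)
class NarrativeMessage:
    definition: str            # segment 1: identical in every message
    problem_frame: str         # segment 2: character mechanism
    science_language: str      # segment 3: probability or certainty
    characters_in_action: str  # segment 4: same mechanism as segment 2


def build_message(character, science):
    """Assemble one message; labels stand in for the actual segment text."""
    return NarrativeMessage(
        definition="riverine-flood-definition",
        problem_frame=character,
        science_language=science,
        characters_in_action=character,
    )


messages = [
    build_message(character, science)
    for character, science in product(CHARACTER_MECHANISMS, SCIENCE_LANGUAGE)
]
```

Holding the first segment constant and varying the others one factor at a time is what allows responses to be attributed to a specific segment.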
To improve internal validity, construct validity, and reliability in message construction, we constructed a “word use signature” histogram for each narrative message by plotting the frequency of 
Quantitative Phase II
The final phase of our research is Quantitative Phase II. This phase comprises one step, Step 6 Validity & reliability testing, wherein we evaluated the instrument fidelity of the message treatment conditions by conducting validity and reliability testing. To do so, we returned to the three flood-prone communities (Figure 2) and asked 90 participants to evaluate each narrative message using dial response testing.
Step 6 Validity and Reliability Testing
Step 6 reduced threats to reliability and ecological, construct, and internal validity (Table 2). The exemplar’s full experimental protocol and results and our interpretation of the validity and reliability testing are published in Shanahan et al. (2019). Briefly, the three communities that were the sites of the semi-structured interviews also became the sites for field testing of the eight risk communication messages. The goal, again, of the exemplar was to test the language with audiences from the same places as the individuals who generated the vocabularies via the semi-structured interviews (Shanahan et al., 2019). The testing technology required the construction of videos with audio for all messages. The videos were recorded using Microsoft PowerPoint with white words on dark blue backgrounds and audio overlays. Each slide contained a single sentence from the message to prevent audience members from reading ahead. The narrator attempted to remain as calm and impassive as possible when reading the messages to focus audience members on the content alone.
To obtain a sample of participants to test these eight messages, the researchers ordered a random sample of 500 addresses from Survey Sampling International for each of the three study communities. Postcards went out to these addresses inviting one adult from the household to participate and offering a $50 incentive in return. The research sessions took place in the respective communities on prearranged dates in October and November of 2017. Potential participants could sign up via the website of the HELPS Lab. A second postcard went out to non-respondents 2 weeks later and invited individuals to spread news about the sessions. We also advertised via local newspapers and social media accounts linked to city governments. Our research team conducted four sessions in each community, with the number of participants ranging from 4 to 11 in each session. The final sample included 90 research participants: 36 from Livingston, 22 from Miles City, and 32 from Glendive. The final sample was nearly evenly split in terms of women and men but did skew somewhat older than the general populations of adults in these communities.
The test sessions, which lasted approximately 1 hour each, featured dial response technology and a follow-up focus group and demographic survey. The dial response was used to measure affective response, a dimension of narrative transportation that measures audience engagement (Green & Brock, 2000). The dial response technology, the Perception Analyzer™ from Dialsmith, permits instantaneous and continuous measurement of audience response to either live or recorded messages. Participants hold dials with preloaded data ranges as specified in the software. For this study, response options ranged from 0–100. The middle (vertical) position of the dial indicated 50 and was the neutral score. Participants were instructed to respond throughout the message with regard to how positive or negative the message was making them feel (i.e., their affective response to the message). The facilitator asked participants to start at the neutral position of 50 and indicated that 0 was the most negative score and 100 was the most positive score. Each session included a brief practice with using the dial response technology. The researchers randomized the order of the eight risk communication messages across sessions to guard against message order effects. The software recorded each participant’s dial position once per second.
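Per-second dial readings of this kind can be summarized by averaging across participants and then within message segments. The Python sketch below is a hypothetical illustration, not the analysis reported in Shanahan et al. (2019): the function names, the two six-second traces, and the segment boundaries are invented for the example.

```python
from statistics import mean

NEUTRAL = 50  # midpoint of the 0-100 dial range


def mean_trace(traces):
    """Average dial position across participants at each second."""
    return [mean(readings) for readings in zip(*traces)]


def segment_means(trace, boundaries):
    """Mean dial position within each [start, end) range of seconds."""
    return {name: mean(trace[start:end]) for name, (start, end) in boundaries.items()}


# Hypothetical readings: one list per participant, one value per second.
traces = [
    [50, 50, 40, 30, 60, 70],
    [50, 50, 50, 40, 70, 80],
]
average = mean_trace(traces)
# Hypothetical boundaries marking which seconds belong to which segment.
by_segment = segment_means(average, {"problem": (0, 4), "action": (4, 6)})
# Positive values are above neutral, negative values below it.
deviation = {name: value - NEUTRAL for name, value in by_segment.items()}
```

In this toy case the problem segment averages 45 (5 points below neutral) and the action segment averages 70 (20 points above neutral).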
The results from these sessions were used to test hypotheses about affective responses to character language in narrative science messages and to the type of science language (probability vs. certainty) that described flood hazard risk as part of the persuasion process (Shanahan et al., 2019). From this testing, we learned that participant responses differed among message treatments. Altering the narrative mechanism of character selection in messages consistently resulted in differences in participant responses; participants had slightly negative responses to victim treatments but positive responses to hero and victim-turns-hero treatments. These results largely corresponded with our predictions, thereby suggesting that we had minimized threats to construct validity. In simple terms, we had measured the concepts (hero and victim characters) that we had intended to measure based on theory. Such construct validity would be crucial to internal validity (i.e., establishment of cause and effect) in our later experiment. We found no differences between the probability and certainty versions of the science statements, which both produced negative affective responses across treatments. However, we did find remarkably consistent aggregate responses to the flood definition and science information segments, which provided evidence of reliability in the measurement. In sum, we concluded that our process minimized threats to validity and reliability. Had this testing revealed problems, we would have returned to message construction to evaluate which step might have been problematic.
Having determined that the narrative messages satisfy construct validity and precisely operationalize the treatment conditions, we used them in a mail survey of residents who live along the Yellowstone River, to test whether different narrative science messages have differential effects on affective response and intended risk preparation behavior (Raile et al., 2022). Figure 5 presents a research process display of our sequential mixed methods procedure, providing details of the optimization of instrument fidelity and details of what Onwuegbuzie et al. (2010, p. 58) refer to as crossover analyses, “which involves using one or more analysis types associated with one tradition (e.g., quantitative analyses) to analyze data associated with a different tradition (e.g., qualitative data).”
Discussion
Contribution to the Field of Mixed Methods Research
Our procedure makes a unique contribution to the field of mixed methods research in two ways as we address Onwuegbuzie et al.’s call (2010) for “more publications…that outline explicitly ways of optimizing the development of instruments by mixing qualitative and quantitative techniques” (pp. 57–58). First, we address the black box of experimental message treatment construction with what Onwuegbuzie et al. (2010) refer to as crossover analysis. We do so by blending a constructivist stance (i.e., perspective of residents through interviews, human coding, and message construction) with a positivist stance (i.e., NLP, word classification, and validity and reliability testing). Additionally, we employ a compatible theoretical foundation in the NPF that brings an objective epistemological approach (i.e., objective measures of universal narrative structure such as characters) to bear on a subjective ontology (i.e., social construction of reality through narratives). Second, we offer guidance on a powerful and relatively new tool in textual analysis, that of NLP, for use in mixed methods research. This use of NLP in crossover analyses between inductive and deductive logics optimizes instrument fidelity by linking theory (i.e., NPF) with qualitative data (i.e., interviews, coding, and message construction) and quantitative data (i.e., words identified via NLP). In turn, we sought to validate our message construction through a further quantitative measure, that of affective response to different message treatments.
The novelty of this study lies in harnessing the power of integration through crossover analysis to develop a mixed methods approach that improves instrument fidelity in message testing research. The approach addresses a critical need, a procedure for precisely constructing message treatments, and does so by broadening the use of NLP in the social sciences. Integration is a challenge in mixed methods research (Bryman, 2007; Fetters et al., 2013; Uprichard & Dawney, 2016) but is important to surmount because integration “produces a sum greater than the individual parts” (Fetters & Freshwater, 2015, p. 208). In particular, our work highlights how integration in the Research design dimension improves the Research integrity dimension, that is, precision in message construction (Table 2; see also section Detailed Procedure) (Fetters & Molina-Azorin, 2017). Our procedure for constructing precise messages is a research outcome that resulted from integration in the following dimensions: Rationale; Study purpose, aims, and research questions; Researcher; Team; Data collection; Data analysis; and Interpretation (Fetters & Molina-Azorin, 2017).
The integration in the Rationale and Study purpose, aims, and research questions dimensions emerged from a clear need to open the black box of developing message treatments to overcome the numerous potential threats to validity and reliability that arise from depending on expert opinion to construct treatments. In the Researcher dimension, our research team (i.e., the coauthors on this article) was drawn together to address the challenge of constructing precise message treatments because of experiences that led each member to highly value employing mixed methods procedures; without question, integration in this dimension was the bedrock upon which the integration in all other dimensions was built. For instance, in the Team dimension, each researcher was a domain expert in a field ranging from political science, human geography, economics, and hydrology to computer science. This diversity in team expertise brought incredible creativity and energy but also many challenges. As others have noted (Poth, 2019), our team quickly learned that integration is hard work. Each team member was stretched to learn the key concepts, theory, history, and vernacular of the other disciplines as related to the common research goal and to communicate the nuance and importance in their own discipline using a common language. As a result, frequent meetings were required wherein patience, humility, humor, and excellent team leadership were critically important to advancing the research. Perhaps not surprisingly, some of the most productive and lively conversations arose as the team carefully considered the strengths of existing qualitative and quantitative approaches to select the best mixed method approaches to integrating NLP with NPF in the Data collection, Data analysis, and Interpretation dimensions.
To our knowledge, our team is the first to utilize NLP to enhance message treatment construction. Our efforts were not without challenges. In the Data collection, Data analysis, and Interpretation dimensions, we faced challenges similar to those reported by a multitude of other scholars (Guetterman et al., 2018; Nelson, 2020; O’Halloran et al., 2018; Rohrer et al., 2017). For instance, as we strove to incorporate the most useful NLP techniques into our procedure, we explored several “dead ends” that we originally thought would be quite useful. In addition to the term frequency approach to word classification that we describe in the detailed procedure above, we also attempted three other approaches to identifying words associated with victim and hero characters. These approaches were topic modeling, sentiment analysis, and a formal classification algorithm (King, 2019). Topic modeling (Blei, 2012) refers to the application of quantitative techniques used to find common or unifying themes in a set of documents within a corpus. These techniques can be used either to confirm the existence of known topics within a corpus or to find latent topics not readily apparent, even to a trained domain expert. Sentiment analysis techniques (Cambria et al., 2013; Mäntylä et al., 2018) aim to measure emotions embedded in narratives. Two approaches to measuring sentiment are commonly used. The first approach is a nominal technique that classifies words into bins, where each bin can represent a sentiment (e.g., happy, sad, angry, etc.). The second approach uses ratio-scale measurements to calculate a polarity score for each word. Many techniques exist to assign and adjust polarity of words depending on context. Finally, classification algorithms (Han et al., 2009) are machine-learning based approaches that aim to reduce manual coding of information. By training a model with known data, classification algorithms can then be exercised with new, previously unseen data with the expectation that the model will yield the correct classification.
Briefly, the NLP methods of topic modeling and sentiment analyses generally confirmed researcher interpretation of the linkages amongst words in the corpora but did not provide new or unappreciated information to the research team. The formal text classification algorithm rendered only minimally useful information because the quantifiable aspects of victim and hero language—term frequencies—were higher in the victim documents simply because the victim documents were generally longer than the hero documents. Consequently, the classifier produced skewed results: precision and recall were moderate for the hero corpus (50–70%) but low for the victim corpus (<30%; King, 2019). Despite their limitations, each of these methods helped the research team better understand the corpora. However, only through a combination of intramethod analytics and core integration analytics was our team able to fully appreciate the strengths of the term frequency approach we ultimately employed. In the end, we agree that NLP is most useful when augmented by qualitative analysis (Guetterman et al., 2018) and that NLP offers improved integration of qualitative data into an exploratory sequential mixed methods research design (O’Halloran et al., 2018).
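The length effect described above can be illustrated with a minimal sketch. It assumes a simple lowercase word tokenizer and two invented one-document corpora; the function names and the example sentences are hypothetical and are not drawn from our interview data or from the procedure in King (2019). The sketch only shows why raw term counts favor the longer corpus and how dividing by corpus length (a relative frequency) removes that advantage.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def raw_counts(docs):
    """Count how often each word occurs across all documents in a corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def relative_frequencies(docs):
    """Divide each raw count by the total number of tokens in the corpus."""
    counts = raw_counts(docs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Invented toy corpora: the "victim" corpus is deliberately longer.
victim_docs = [
    "the river flooded our fields and the river took our fences "
    "and we lost cattle and hay that spring too"
]
hero_docs = ["neighbors sandbagged the river bank"]

victim_raw = raw_counts(victim_docs)
hero_raw = raw_counts(hero_docs)
victim_rel = relative_frequencies(victim_docs)
hero_rel = relative_frequencies(hero_docs)

# Raw counts make "river" look like a victim word; relative frequencies do not.
print("raw:", victim_raw["river"], hero_raw["river"])
print("relative:", victim_rel["river"], hero_rel["river"])
```

In this toy case the word "river" has the higher raw count in the longer corpus but the higher relative frequency in the shorter one, which is the kind of reversal that length normalization is meant to expose.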
Considerations, Limitations, and Future Directions
The procedure presented here moves the theoretical discussions of mixed methods standards of validity and integration (Dellinger & Leech, 2007; Fàbregues & Molina-Azorín, 2017; Fetters et al., 2013) into practice. However, a reader might ask whether our approach was worth the considerable effort. Much of our effort was the result of exploring and comparing specific methods, which might not be necessary in subsequent studies. Ultimately, our approach boils down to semi-structured interviewing, human coding, the production of relative word frequencies and their systematic application in message construction, and then some form of testing the validity and reliability properties of the resulting messages. The multiple stages and mixed methods necessitate a team approach, but no single piece is exceedingly difficult on its own. At this point, the validity and reliability of other approaches remain unknown, so comparing the ratio of labor to precision across approaches is impossible. However, moving forward, researchers can be more intentional in evaluating this ratio. Thus, the primary limitation of our research is that we cannot explicitly state whether our procedure is “worth it” for other researchers or “how much better” our procedure is than black-box message construction.
Our future research directions seek to transport our mixed methods process to other domains such as viral spillover (e.g., coronavirus, Ebola) and cyber security. Indeed, the accuracy of risk communication studies in these domains has the potential to save lives and increase security at multiple levels—personal, municipal, state, national. Applications across different field domains will also test the transportability of our mixed methods approach.
Conclusions
Our procedure improves instrument fidelity in message testing research via a novel integration of qualitative and quantitative methods to address a critical research need: bolstering theoretical grounding, validity, and reliability as forms of message precision. We found this procedure to be effective for our purposes and suspect it will prove useful beyond our research domains of narrative communication and hazard preparedness. Our research team looks forward to its use and improvement in future studies.
