Abstract
Keywords
Introduction
The “visual turn” in qualitative methods has seen increasing use and acceptance of visual artifacts, most notably photographs, in social science research (Bell, 2010; Pain, 2012). Two photographic methods are routinely distinguished (Glaw et al., 2017). In autophotography, firstly, participants take photographs (or at least stage them) and researchers analyze these artifacts
It was the perceived “absolute and unqualified objectivity” (Strand, 1917, p. 524) of the camera that appealed to early adopters of visual methods (Pink, 2021). Since the 1960s, however, researchers working within relativist and constructivist paradigms have recognized the camera’s “twin capacities, to subjectivize reality and to objectify it” (Sontag, 2002, p. 178). In this case, a photograph is not considered an objective record of reality, but rather an artifact whose meaning is co-produced by photographer and viewer. The photographer selects their subject, composition, frame, lighting, time exposure, etc. (Berger, 2013; Chaplin, 1994), and in doing so lends their “point of view” (Thompson, 2003, p. 7). In the act of interpretation, moreover, each viewer brings their own “point of view” to an image insofar as the visual itself (i.e.,
Mindful of the challenges of interpreting visual artifacts, researchers working within a relativist or constructivist paradigm have privileged photo elicitation over autophotographic approaches, focusing their analysis on participants’ own interpretations of images (Brown & Collins, 2021; Chapman et al., 2017; Gleeson, 2011; Murray & Nash, 2017). Even with increasing use of visual methods, then, the visual has often been subordinated to the verbal (Chaplin, 1994). This not only risks overlooking a valuable source of knowledge (Chapman et al., 2017), but also ignores similar difficulties with interpreting verbal texts, which can also be understood to be “inherently polysemic” (Knowles & Sweetman, 2004, p. 13). Those scholars who have treated participant-generated photographs
This article builds on attempts to develop an approach to analyzing multimodal data without privileging one semiotic mode over another (Brown & Collins, 2021; Chapman et al., 2017; Gleeson, 2011). Answering calls for greater reflection on and transparency in analysis of visuo-verbal data (e.g., Catalani & Minkler, 2010), we begin by presenting a methodological framework that incorporates systematic investigation of text-image relations into the analytic process by drawing on the foundational work in semiotics of Roland Barthes (1977a, 1977b). We then explore how our approach can be implemented with reference to eight visuo-verbal illness narratives generated as part of a study into experiences of living with an eating disorder during the COVID-19 pandemic. Finally, in our discussion we highlight some of the strengths and limitations of our framework in light of its application and with respect to established approaches to analyzing visuo-verbal data.
Analyzing Visuo-Verbal Data
Attempts to analyze visual data as part of a multimodal dataset have typically begun with the observation that, for all their differences, analyzing visual material has much in common with analyzing verbal material (Ritchie et al., 2014). For Gleeson (2011), for example, interpretation of pictures is akin to that of words, that is, “basically the same process of bringing one set of texts to bear on another in order to make meaning” (p. 314); for Chapman et al. (2017), the processes of coding images and texts are, essentially, “the same” (p. 814); and for Sellers (Braun & Clarke, 2021), a photograph can be treated in the same way as a transcript. Use of the same analytic method (e.g., content, thematic and narrative analysis) across semiotic modes has been advocated (Banks, 2018; Glaw et al., 2017; Rapport et al., 2007). Alternatively, “synergistic” methods may be found through experimentation (Drew & Guillemin, 2014, p. 63).
Some scholars have adopted what we might call a “holistic” approach to visuo-verbal data, whereby the visual and the verbal are merged into a single multimodal unit from the outset (Burles & Thomas, 2014; Lian & Rapport, 2016; Rose, 2016; Wilde et al., 2020). Others insist on analysis of visual and verbal data independently prior to analysis of the multimodal unit. Where codes are generated separately for images and words, these may be combined as part of a gradual shift from descriptive towards interpretive analysis (e.g., moving towards themes in thematic analysis). As Murray and Nash (2017) and Glaw et al. (2017) note, the rationale and procedure for combining codes are not always discussed thoroughly. For some, however, it is an additive process (a combined code need only have been ascribed to data in one mode, e.g., Brown & Collins, 2021); for others, it is a corroborative process (a combined code must have been ascribed to data in multiple modes, e.g., Chapman et al., 2017).
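The difference between the additive and corroborative procedures can be made concrete with a minimal sketch. It assumes, purely for illustration, that the codes ascribed to each mode are held as sets of strings; the function names and the example codes are hypothetical and are not taken from the studies cited.

```python
def combine_additive(image_codes, text_codes):
    """Additive: keep a code if it was ascribed to data in at least one mode."""
    return set(image_codes) | set(text_codes)


def combine_corroborative(image_codes, text_codes):
    """Corroborative: keep a code only if it was ascribed to data in both modes."""
    return set(image_codes) & set(text_codes)


# Hypothetical codes for a single photo-text pair.
image_codes = {"hospital", "smiling"}
text_codes = {"hospital", "relapse"}

additive = combine_additive(image_codes, text_codes)
corroborative = combine_corroborative(image_codes, text_codes)

print(sorted(additive))
print(sorted(corroborative))
```

On this reading, the additive procedure amounts to a set union and the corroborative procedure to a set intersection. Neither operation records how image and text relate to one another.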
Studying the relationship between visual and verbal data is advocated (Rose, 2016; Pink, 2021), but seldom practiced. Among the rare examples of a relational stage built into the analytic process is Oliffe et al.’s (2008) formal “layered” analysis (drawing on Dowdall & Golden, 1989). They examine congruity between photographs and written texts as part of their “Review” phase. Similarly, in her “multiple text analysis,” Keats (2009) calls for intratextual (within case) consideration of the connections, parallels and differences between data in different modes. To the best of our knowledge, however, consideration has not yet been given to the systematic study of semantic relations between visual and verbal data prior to their analysis in combination.
Text-Image Relations in the Work of Roland Barthes
With the aim of developing a more rigorous approach to the study of participant-generated multimodal data, we turn to the foundational work in semiotics (the study of signs and sign systems) of Roland Barthes. Barthes (1977a, 1977b) remains the starting point for scholarly discussion of text-image relations (Bateman, 2014). Although more complex classifications have since been developed by scholars of systemic functional linguistics to address logical gaps in Barthes’s schema (Kong, 2006; Martinec & Salway, 2005; Van Leeuwen, 2005), the original provides the appropriate balance of rigor and flexibility for our present purposes.
Barthes’s simple taxonomy of text-image relations recognizes three logical possibilities: illustration, anchorage, and relay (see Figure 1).
Figure 1. Barthes’s Classification of Text-Image Relations.
What Barthes (1977a, p. 25) refers to as the most “traditional” relation is illustration. Here, an image supports a text, clarifying or “realizing” the written word, as is the case, for example, in Figure 3. Having dominated for centuries, illustration, Barthes (1977b) argues, has been overtaken by anchorage. In anchorage, a text supports an image by “anchoring” or elucidating its meaning, as in the case of photograph captions in newspapers (Barthes’s key example). Barthes (1977b, p. 40) finds anchorage “repressive” insofar as it reduces the polysemy of images, directing the viewer to one particular interpretation among many. For an example taken from the data discussed later in this article, see Figure 4.
The final relation Barthes identifies is relay, in which the visual and verbal are accorded equal status. In cases of relay, words and images function as “fragments of a more general syntagm and the unity of the message is realized at a higher level” (Barthes, 1977b, p. 41). The whole, in other words, exceeds the sum of its parts. Barthes (1977b) suggests that cases of relay are relatively rare in static media, although Bateman (2014) notes that relay is recognized as being much more prevalent today. Comic strips are Barthes’s prime example. See Figure 5 for an example drawn from our data.
A Framework for Analyzing Visuo-Verbal Data: Text-Image Relations Analysis
Drawing on the insights from Barthesian semiotics outlined above and on previous attempts to systematize data analysis across multiple modes (notably Brown & Collins, 2021), we propose a framework for analyzing visuo-verbal data that incorporates systematic investigation of text-image relations. We stress that, as an analytical framework, our Text-Image Relations Analysis is flexible enough to be employed with a variety of analytical approaches (e.g., thematic analysis, grounded theory, IPA) and theoretical and philosophical perspectives. What we offer is simply a practical guide to acknowledging the relationship between data in visual and verbal modes and recognizing its potential for meaning-making.
According to our conception of Text-Image Relations Analysis, images and written texts are initially coded separately. The semantic relationship between image and text is then explored, for example, through the lens of Barthes’s taxonomy (Barthes, 1977a, 1977b). Only then does integrated coding at the level of the multimodal unit take place, taking into account the text-image relations that have been identified. Special attention needs to be paid to cases of apparent incongruity between text and image (i.e., relay). Here, it may not be possible to interpret the multimodal whole in the absence of further data, and codes generated separately for the image and text should not therefore be combined. It is nonetheless important to note the form of text-image relation in case a pattern emerges.
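The sequence described above (separate coding, then classification of the text-image relation, then integrated coding) can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not an implementation of the framework: the class and function names are hypothetical, the relation is a judgment supplied by the researcher rather than something the code computes, and the use of a simple union to combine codes in non-relay cases is an assumption made only for the sketch. The example codes are those reported for Figures 4 and 7 later in the article.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Relation(Enum):
    ILLUSTRATION = "illustration"
    ANCHORAGE = "anchorage"
    RELAY = "relay"


@dataclass
class CodedPair:
    label: str
    image_codes: frozenset   # step 1: codes generated for the image alone
    text_codes: frozenset    # step 1: codes generated for the text alone
    relation: Relation       # step 2: judged by the researcher, not computed


def integrate(pair: CodedPair) -> dict:
    """Step 3: integrated coding that takes the text-image relation into account."""
    combined: Optional[frozenset]
    if pair.relation is Relation.RELAY:
        # Relay: keep the separate codes apart and record only the relation.
        combined = None
    else:
        # Illustration or anchorage: combine codes (union is an assumption here).
        combined = pair.image_codes | pair.text_codes
    return {
        "label": pair.label,
        "relation": pair.relation.value,
        "image_codes": pair.image_codes,
        "text_codes": pair.text_codes,
        "combined_codes": combined,
    }


fig4 = CodedPair("Figure 4", frozenset({"Hospital ward"}),
                 frozenset({"Hospitalization"}), Relation.ANCHORAGE)
fig7 = CodedPair("Figure 7", frozenset({"Depiction of food"}),
                 frozenset({"Needlework as distraction technique"}), Relation.RELAY)

result4 = integrate(fig4)
result7 = integrate(fig7)
```

Keeping the relation as an explicit field on every pair is what later allows pairs to be compared and patterns in text-image relations to be noticed.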
As shown in Figure 2, analysis is an iterative process, in which images are compared and contrasted to images, texts to texts and visuo-verbal pairs to visuo-verbal pairs. The process of comparing and contrasting data elements takes place at two different levels.
Figure 2. Text-Image Relations Analysis.
Applying the Text-Image Relations Analysis Framework
In this section we show how our framework of Text-Image Relations Analysis can be implemented, using data drawn from the
During our analysis we observed that text-image relations varied according to the setting represented by participants. The nine visuo-verbal pairings we present here, then, are organized around three different settings: the hospital ward, the home, and natural (or semi-natural) environments.
Hospital Wards
Three photos (see Figures 3, 4, and 5) submitted by different participants represent experiences of inpatient treatment during the pandemic. Despite a commonality of subject, namely the participant’s hospital bed (Radley & Taylor, 2003), each bears a different relationship to the text that accompanies it. These relationships correspond to the three different text-image relations outlined in Barthes’s taxonomy:



The first photo-text pairing (see Figure 3) provides an example of illustration: the text explains that the photo captures a moment that has recurred, depicting one of “a fair few admissions.”
In Figure 4 the text embeds the photo within a broader narrative. This participant uses the caption-text to explain the reasons for her hospitalization (an “accident due to my low blood pressure”), her thoughts at the time (“all I could think about was … how many calories I was ‘saving’”), and what happened next (“It was this … which led to my decision to move home”). The dominant text-image relation here is anchorage: the text serves to situate the specific moment captured by the photo in time and to pinpoint what is meaningful about this image for the participant (what makes it the eponymous “turning point”).
Compared to the other two figures, Figure 5 presents a more complex case. The image (partially pixelated, with the participant’s permission) depicts the participant sitting up in bed, smiling directly at the camera and making a “thumbs up” sign. Once again, the caption-text serves to anchor the image within a narrative, establishing that this photo was taken while the participant was hospitalized with COVID-19. She points to her eating disorder history as a factor that contributed to the severity of her COVID-19 infection and holds COVID-19 responsible for a subsequent anorexia relapse. There is an incongruity here, however, between image and text. The positivity connoted visually by the participant’s smile and thumbs up contrasts with the seriousness of the verbal narrative in which the image is embedded, one that tells of COVID-19 complications and eating disorder relapse. This is a “demand” image that requires something from the viewer (Kress & Van Leeuwen, 2021), in this case challenging us as readers and viewers to make sense of its relationship to the verbal text. In light of the caption, we returned to the image and asked ourselves whether it depicted the participant’s determination to preserve a positive outlook in the face of illness, or perhaps critiqued a culture of “positive thinking” (Ehrenreich, 2009), which, in the participant’s case, may have done little to help. Image and text combine here, in a relationship of relay, to form a multimodal unit that is suggestive of meaning beyond that connoted by its constituent parts. There may, however, be several contenders for what that “more general syntagm” (Barthes, 1977b, p. 41) might be.
An additive approach to coding visuo-verbal data may prove adequate for cases of illustration and anchorage. To the images in Figures 3 and 4, for example, we initially ascribed a code of “Hospital ward”; with regard to their accompanying texts we opted for “Hospitalization.” When moving to analysis of the multimodal unit, reconciliation of these separate codes to form combined codes did not pose significant problems. Analysis of the third photo-text pairing (Figure 5), however, required more careful consideration. We initially ascribed codes of “Hospital ward,” “Smiling,” and “Making thumbs up” to the image, and “Hospitalization” and “Worsening ED symptoms” to the text. But simply to add these together, or to select those common to text and image, would have been to overlook supplementary meaning generated by the particular ways in which semiotic resources have been combined by the participant. We initially kept the codes that we had generated for the text and photo separate, but we also noted the particular text-image relation (“image more positive than text”) in case a pattern emerged as we compared and contrasted this visuo-verbal pairing to others in the dataset. In this case a pattern did emerge, and the relevant multimodal sets were given the in-vivo code “Behind the smile I was really struggling,” a phrase that the participant who produced Figure 5 used elsewhere in her submitted work.
Illustrating Home
Participants living alone during the pandemic repeatedly linked spending more time at home to exacerbated eating disorder symptoms, including increased body checking (looking in the mirror, pinching parts of the body), calorie counting, bingeing, and self-induced vomiting. Their photos tend to illustrate activities described in the accompanying texts. Some capture activities as they were taking place, for example, mirror selfies depicting body checking. More commonly, photos depict objects that are “indexical” of those activities (Ledin & Machin, 2018), for example, depicting a bin overflowing with food wrappers to represent bingeing, or a toilet to represent purging.
Figure 6 provides an example of an image depicting objects from which we infer an activity. The photo is a straightforward depiction of kitchen scales, packets of oats and a sugar substitute, and a scrap of paper with calculations on it. A straightforward text accompanies it: the participant explains that more time at home has led to more calorie-counting. In terms of text-image relations, this is a clear case of illustration: the photograph captures a single instance of what the text describes as habitual practice.

That participants repeatedly had recourse to illustration (depicting one instance among many described in words) rather than anchorage (with its turning points) was significant. This pattern resonated with descriptions of homelife during the pandemic as monotonous, for example, “Being in lockdown meant that there was very little change from day to day.”
The text-image pair in Figure 7 is among the few representations of homelife submitted by participants that go beyond illustration. In her writing, the participant describes taking up sewing “as a distraction technique” to keep her “hands and brain busy and the guilty thoughts at bay.” At first glance, the photo may appear to be a straightforward illustration, a depiction of the sort of embroidery mentioned in the text. On closer inspection, however, details emerge. Both embroideries are food-related: the carton of fries on the right obviously so, the larger hoop on the left taking the form of a pictorial diary with several food-related items depicted (a hamburger, cake, sweets, a frying pan, etc.). The image in Figure 7, then, directs the viewer to thoughts about food and eating, even as the text reports that the very reason the participant embroiders is to distract herself from such thoughts.

The incongruity between text and image here points to a relationship of relay. There are several ways we might make sense of the multimodal unit here. Perhaps the participant’s attempts to distract herself from obsessive food-related thoughts with needlework were not always successful, and the photo allowed her to express this difficult subject more easily than in words. Perhaps, on the contrary, she found it perfectly easy to disassociate the act of embroidering from the subject of the embroidery itself. Perhaps food was less prominent as a theme in her other needlework. Why she chose to pair text and image in this way might form the basis of a (sensitively handled) discussion in a follow-up interview (Chapman et al., 2017). In the absence of further data, however, it is difficult to privilege one interpretation of the multimodal whole over another. When performing integrated coding, then, we opted to keep the codes we generated for the image (“Depiction of food”) and for the text (“Needlework as distraction technique”) separate. But we also coded the text-image relation (“Relay”), in case similar incongruities recurred elsewhere in our data.
Figure 8 provides a further example of a representation of homelife where the relationship between text and image is one of relay. The photo depicts a window with succulents and a white orchid resting on the sill, with the viewer’s gaze directed through the windowpane towards apartment blocks on the opposite side of the street. We noted the similarity of this photograph to some of the hundreds of thousands of others circulating on social media at the time, when people under “lockdown” restrictions around the globe were encouraged to share the view from their window in a bid “to connect and escape” (e.g., the

This multimodal unit left us wondering whether the participant, in combining text and image in this way, intended to contrast her experience of the pandemic with that of others (as, for example, the participant in Figure 7 contrasts her reason for taking up a new hobby with that of others), or whether this was simply a window through which she looked, uninterested, as she planned her weight-loss regime. Again, in the absence of further data, we kept codes generated for the text and photo separate, but signaled the text-image relation at the multimodal level in case a pattern emerged (e.g., recurring use of photos to convey contrasting experience of others).
Getting Out
Compared to the inside spaces they depicted and described, participants for the most part represented outside spaces much more positively. In their writing they point to natural and semi-natural environments such as woodland, fields and parks as “therapeutic landscapes” (Gesler, 1992). They report being able to escape negative thinking patterns in these places, leaving them feeling “grounded,” “connected,” and more “mindful.” These were relaxing environments in which they felt able to talk to others about their illness and to challenge destructive eating habits. The beauty of nature, moreover, inspired hope and increased motivation for recovery. In the words of one participant: “Nature is a really special form of therapy.”
While this corpus did contain photographs that illustrated routine encounters with nature, and texts that anchored images of natural landscapes within a narrative, a majority of the images depicting outside spaces (12, split across four participants) were related to caption-texts through metaphor, wherein the characteristics or qualities of one object are applied to another (Lakoff & Johnson, 1980; Switzer, 2019). Barthes did not explicitly consider how metaphor might fit into his schema of text-image relations (Bateman, 2014). Insofar as a text that explicitly confers a metaphorical interpretation on an image (or part thereof) reduces its polysemic potential, it can be considered an example of anchorage. In many cases, however, a text will develop a metaphor from an image without explicitly pointing to the image as its source, or without elucidating that metaphor in full. Here we are dealing with a relation of relay: the reader/viewer is left to move back and forth between semiotic modes, looking for ways in which aspects represented in one can be transferred to the other.
In Figure 9 the participant does much of the interpretive work for us. The text indicates two stages of figuration. First, cycling has come to symbolize “freedom” for this participant. Second, now that gyms are closed, he has come to realize that his pre-pandemic exercise regime was “compulsive and rigid.” When cycling, by contrast, he does not feel “locked down” at all, even while living under restrictive social distancing measures. In addition to the

Figure 10 presents a more complex case of metaphor: compared to the previous example, the reader/viewer is left to do more of the interpretive work themselves. The first photo in this participant’s series depicts a woman (either the participant or a friend) standing on a wooded hillside, her back turned as she looks towards a misty horizon and the valley below. The photo’s title (

The text here does not “anchor” a metaphorical interpretation to the image. Rather, the loose, unwritten bond between image and caption prompts the viewer to search for a figurative meaning in the photo based on their knowledge of images’ “canons of use” (Ledin & Machin, 2018, p. 47). Not all viewers will see the same thing. Working in a Norwegian context, we, for example, found the protagonist gazing towards a mysterious horizon to be reminiscent of Romantic and Neo-Romantic landscape painting (Theodor Kittelsen’s
In Figure 10, the caption-text may make no reference to getting outside, but we are at least supplied with a title that anchors the natural landscape depicted in the photo at a literal level. Figure 11, however, leaves us with many more unanswered questions. The photo is a close-up taken near ground level of a colorful little door that has been placed among nettles at the foot of a tree-trunk. The caption-text gives no indication of the circumstances in which this photo was taken: no mention is made of a walk in the woods, let alone whether the door formed part of an art installation (e.g., Dinky Doors, 2021) or was placed there by the participant herself. The text simply metaphorizes one element of the image (

Insofar as the three caption-texts presented in this section all derive a metaphor from their corresponding photos, the text-image relation at work is that of relay. The point here is not that all readers/viewers will develop a metaphor in the same way; others will interpret the text-image pairings differently, including, perhaps, the participants themselves. What matters is that a relation of relay sparks metaphorical thinking in the reader/viewer.
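Once a relation has been recorded for every pair, looking for patterns across settings is a matter of simple tabulation. The sketch below is illustrative only: it assumes each pair is stored as a (setting, relation) tuple, uses the relations reported for Figures 3 to 11 in this article, and the setting labels and variable names are hypothetical.

```python
from collections import Counter

# (setting, dominant text-image relation) for each pair, as reported in the text.
pairs = [
    ("hospital ward", "illustration"),  # Figure 3
    ("hospital ward", "anchorage"),     # Figure 4
    ("hospital ward", "relay"),         # Figure 5
    ("home", "illustration"),           # Figure 6
    ("home", "relay"),                  # Figure 7
    ("home", "relay"),                  # Figure 8
    ("outside", "relay"),               # Figure 9
    ("outside", "relay"),               # Figure 10
    ("outside", "relay"),               # Figure 11
]

# Count how often each relation occurs within each setting.
tally = Counter(pairs)
by_setting = {}
for (setting, relation), count in tally.items():
    by_setting.setdefault(setting, {})[relation] = count

for setting, counts in by_setting.items():
    print(setting, counts)
```

A tabulation of this kind only flags where a relation recurs. Interpreting what the recurrence means remains the researcher's task.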
When coding at the multimodal level, then—conscious of the fragility and provisionality of our interpretations—we refrained from merging codes generated separately for texts and images. What we did do was ascribe an integrated code of “relay” to each pair. This allowed us to notice a pattern in participants’ usage of text-image relations: the natural and semi-natural spaces they photographed provided a source of metaphor through which they understood and expressed pathways to recovery. Participants pointed to nature as beneficial for their recovery. But by taking text-image relations into account in our analysis of the visuo-verbal illness narratives, we were able to uncover one of the reasons
Discussion
According to the old adage, “a picture is worth a thousand words.” The reasons researchers like to work with participant-generated photographs or other visual artifacts alongside their words are well rehearsed. Images have the potential to convey people’s experiences and life-worlds with great precision and emotional power, not least where words prove inadequate (Berger et al., 1972). They capture “small, unintended details” that nonetheless reveal something significant about a participant’s life-world (Shannon et al., 2021, p. 118). In addition, they offer insights into the physical settings of people’s lives, serving to “re-anchor” study participants in physicality (Rugg, 1997, p. 2).
Triangulating visual and verbal data may also give the researcher greater confidence in their findings: “as soon as photographs are used with words, they produce together an effect of certainty … Together the two then become very powerful” (Berger, 2013, p. 66). Words can authorize a single interpretation of an otherwise ambiguous photo, while a photo—“irrefutable as evidence” (Berger, 2013, p. 66)—lends authenticity to the words that accompany it, especially if we believe “the camera cannot lie” (Collier & Collier, 1967/1986, p. 8).
Combining more than one mode, however, can also create uncertainties for the researcher. A picture may be worth a thousand words, but those thousand words are likely to be different for different viewers. Even where the meaning-making potential of photographs is recognized and harnessed, then, the temptation is often to locate meaning ultimately in verbal interpretations rather than in images themselves (as in photo elicitation approaches).
Barthes’s (1977a, 1977b) classification of text-image relations helps us to understand why texts and images can be situated so differently by different researchers. Berger’s (2013, p. 66) “effect of certainty” is fostered by texts and images tied by relations of illustration and anchorage. Analyzing visuo-verbal data, in these cases, may simply be a matter of coding at the multimodal level immediately, or else combining codes developed separately for data in each mode. Where relay is the dominant relation, however, that certainty is challenged, and there is reason to believe we should expect relay to be the dominant relation more often than not (Bateman, 2014). Relay trades on “the secret antipathy between the two modes of expression” (Hunter, 1987, p. 25). Rather than minimizing ambiguity, it opens it up. The reader/viewer is invited to make sense of the relation between image and text, and to interpret the supplementary meaning generated at the level of the multimodal unit.
Thus far, there has been little reflection in scholarship on what to do when we as researchers receive that invitation. When confronted with incongruities between visual and verbal data, Chapman et al. (2017) recommend pursuing further dialogue with participants. Likewise, Rapport et al. (2007) report that they are minded to take a photo elicitation approach in a follow-up study to make sense of one dataset “undercutting” the other. In cases of relay, the temptation is to reassert the primacy of verbal data over the visual. Others hold fast to the principle of according equal status to visual and verbal modes: for Oliffe et al. (2008) it is the researcher’s responsibility to use context “to explain, rather than expose, what appeared to be incongruous details” (p. 534), while Brown and Collins (2021) advise researchers to adopt an open, critical-reflective stance to make sense of “differences, discrepancies, and contradictions” between forms of communication (p. 1287). As we noted in our discussion of the participant’s attempts to distract herself from food through crafts in Figure 7, however, it may be inappropriate (and at the very least feel profoundly uncomfortable) to assert an interpretation that conflicts with a participant’s own words, privileging the voice of the researcher over that of the participant. It may, in other words, be difficult to distinguish between “explanation” and “exposure.”
The key strength of the analytical framework presented here is that it constitutes a “middle way” between recourse to photo elicitation and imposition of researcher interpretation. Text-Image Relations Analysis provides researchers with a tool for analyzing data across modes of expression without privileging the verbal over the visual and without privileging the researcher’s voice over the participant’s. Meaning-making takes place
Confronted with cases of relay, the option a researcher chooses will depend on their ontological and epistemological assumptions about different modes of expression and the ways in which they combine to produce meaning. When working with multiple semiotic modes, it is important to resist the “effect of certainty” data triangulation can bring. We need, rather, to reflect deeply and at length on how we are situating individual modes and their combinations in relation to truth and knowledge. This includes examining the limitations of the speech and written texts that continue to dominate as sources of data in research today (Reavey & Johnson, 2017). Combining visual with verbal data holds considerable, underutilized potential for gaining greater understanding of people’s life-worlds than words or images alone can deliver. Harnessing that potential, however, requires that we accord equal status to words and images, while also acknowledging and carefully delineating the meaning-making roles of participants and researchers in relation both to each semiotic mode and to their combinations (Drew & Guillemin, 2014).
Conclusion
Participant-generated photographs and other visual forms of expression are increasingly used by qualitative researchers alongside verbal research data. Images can complement words, for example by re-anchoring participants’ words in the physical world or by capturing “unintended details.” In addition, the configuration of words and images into a multimodal unit can yield insights into people’s experiences and life-worlds that transcend those afforded by data in individual semiotic modes alone. Analysis across modes poses significant methodological challenges, not least because multimodal data can introduce another layer of ambiguity into a dataset. Photo elicitation methods have therefore tended to predominate, and the potential of visuo-verbal data remains underutilized.
As a step towards harnessing that potential, we propose a framework of Text-Image Relations Analysis for interpreting visuo-verbal research data, drawing on Roland Barthes’s tripartite classification of text-image relations into “illustration,” “anchorage,” and “relay.” Application of our framework to eight multimodal illness narratives generated as part of the
