Introduction
Media communication, defined here as communication intended for a wider audience and not delivered face-to-face, plays a key role in shaping knowledge and attitudes across societal strata, including in the context of health discourses (Seale, 2003b; Strömbäck, 2008). Studying how health-related topics are communicated across various media can therefore provide insights for designing and delivering health interventions, and can inform efforts to identify, understand, and address misinformation (Chen & Wang, 2021; Seale, 2003a; Swire-Thompson & Lazer, 2020). In this context, the expansion of online media spaces, whether building on traditional formats (e.g., online newspapers, online radio/TV station broadcasts) or novel ones (e.g., social networks such as Facebook, X (formerly Twitter), or TikTok), has fundamentally reshaped the opportunities and challenges of health communication and opened a range of new avenues for media research (Kapoor et al., 2018; Moorhead et al., 2013; Quan-Haase & Sloan, 2022; Swire-Thompson & Lazer, 2020). While the richness and breadth of online media platforms provide easier access to larger, and often more in-depth (Quan-Haase & Sloan, 2022), datasets, this “data gold rush” (Felt, 2016) also presents challenges for systematic and manageable media analysis.
A primary challenge in online media analysis is how and from where to systematically extract meaningful data. Large-scale initiatives have provided starting points or archives to simplify complex data extraction and investigation processes, including tools sponsored and developed by the WHO (‘EARS’) (Purnat et al., 2021; WHO, 2021), or by non-profit and for-profit research groups (Meltwater, 2024; NORC, 2025; The GDELT Project, 2022). At the same time, the accessibility of such solutions is limited: some have been discontinued (such as EARS), likely due to technical and financing challenges, while the for-profit nature of others inhibits access in resource-limited contexts. Alternatively, researchers can extract data via media platforms’ Application Programming Interfaces (APIs) and open-source web-scraping tools (e.g., Scrapy (Scrapy Developers, 2025) or BeautifulSoup (Richardson, 2015)). However, these approaches commonly require coding expertise and familiarity with the respective platform and software documentation, and accessible step-by-step guidance remains limited.
Beyond extraction challenges, a second hurdle inherent to online media analysis involves synthesizing and distilling large amounts of data in a way that facilitates in-depth analysis (Chani et al., 2023; Strauss et al., 2024). This gap is particularly pronounced in the context of powerful-yet-work-intensive qualitative efforts to analyze health-related media data. While the broader field of media studies routinely employs a range of qualitative analytic techniques, scholarship in health research is yet to systematically embrace these approaches (Foley et al., 2019; Hallin & Briggs, 2014). Traditional – but often descriptive – content analysis remains heavily employed, often at the expense of more nuanced qualitative examination (Fu et al., 2023).
To address this gap, authors have proposed ways to utilize existing best-practice approaches to investigate health media data. A leading approach is framing analysis, which aims to understand how topics are promoted to the public via the inclusion and exclusion of specific information and interpretations (Entman, 1993; Foley et al., 2019). A more recent approach, referred to as the visual-verbal analysis method, provides guidance for engaging with media data across various modalities (e.g., written texts, audio and video files, or images) (Fazeli et al., 2023). These and other analytic approaches provide valuable insights in terms of how to move forward once a research team has assembled its dataset; they offer comparatively little guidance, however, on how to systematically extract and condense such a dataset in the first place.
This article introduces the five-step FOCUS approach for the extraction and qualitative analysis of large amounts of multimedia data. We developed this approach amid our work on health-related messaging in Filipino TV broadcasts (Wachinger et al., 2023, 2025) where we encountered gaps in the available methodological guidance: To facilitate accessible and adaptable data extraction, we outline a systematic, free-of-charge, and software-based technique for culling YouTube-based data, and provide step-by-step guidance for applying this tool to other research projects and media platforms. To address challenges associated with the analysis of large-but-inherently-qualitative datasets, we then outline distinct steps for condensing and investigating extracted data by combining the strengths of two distinct analytical pathways: content analysis and framing analysis.
Find, Order, Code, Understand, Spotlight – The Five-step FOCUS Approach to Online Media Analysis
The FOCUS approach consists of five distinct steps: First, researchers find the data; second, they order the extracted data; third, they code the full dataset; fourth, they understand qualitative nuances in media communication and message framing; and fifth, they spotlight insights across phases and data modalities. We describe each step below, illustrating it with examples from our own work.
Step 1: Find the Data
Given the amount of multimedia data available online, researchers have to identify media platforms or databases with the highest likelihood to yield data of interest for their study – both in terms of facilitating the search and extraction of data, and in terms of representing media formats that are relevant for the population of interest. For example, many studies focus on X (formerly Twitter) or on English language news magazines – not necessarily or exclusively because these platforms are most commonly used by members of diverse populations, but because they are comparatively easy to access and analyze. However, as media communication can vary considerably across platforms and formats, we encourage undertaking formative research to identify platforms that represent the content consumed by the population of interest – even if this might require compromises with regards to the ease of data extraction. In this context, third-party online platforms have become a promising starting point to search for content that historically has not been archived in an online, systematically accessible way, which is often the case for local language media in low- and middle-income countries (LMICs).
In our work on media communication in the Philippines, a review of the literature and available data highlighted that local language TV stations were among the most relevant sources for health information. However, accessing content from these stations proved challenging, as their websites and archives lacked systematic broadcast search functionalities. We found, however, that major TV stations routinely uploaded content to their official YouTube channels, making YouTube a valuable platform for retrieving past broadcasts.
YouTube allows content creators, including TV stations, to upload videos to their own ‘channels’, but the platform as of this writing does not allow structured searches based on specific pre-identified criteria. To address this challenge, we developed a free-of-cost, software-based approach leveraging the YouTube API, which allows for automated, systematic searches of YouTube content and is freely accessible via a standard Google account (See Supplemental File 1 for guidance on modifying and using this approach).
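To illustrate the general shape of such an API query (without reproducing the exact script provided in Supplemental File 1), the following standard-library-only Python sketch assembles a search request against the documented YouTube Data API v3 search endpoint; the channel ID and query string are placeholders, the API key is obtained via a standard Google account, and the parameter names follow the public API documentation:

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"

def build_search_url(api_key, channel_id, query, order="relevance", max_results=10):
    """Assemble a YouTube Data API v3 search request for one channel.

    `order` may be 'relevance' or 'viewCount', mirroring the two ranking
    criteria used when piloting the search strings.
    """
    params = {
        "part": "snippet",        # return basic video metadata (title, date, etc.)
        "channelId": channel_id,  # restrict results to one TV station's channel
        "q": query,               # the piloted search string
        "type": "video",
        "order": order,
        "maxResults": max_results,
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

# The resulting URL can then be fetched (e.g., with urllib.request.urlopen)
# and the JSON response parsed for video IDs, titles, and statistics.
```

In practice, one such request would be issued per channel and per search string, with the `order` parameter toggled to retrieve both the most relevant and the most watched results.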
Table 1. Search Criteria to be Specified in the API Query
Table 2. Piloting of the Four Different Search Strings
Note: ᵃNumber of results extracted and screened <50, as this search string yielded no results on two channels and only 8 on a third, and only the top 10 results per channel were extracted.
As highlighted in Table 2, the short search string (No. 1) yielded the highest number of relevant results. The highly specific search string (No. 4), which most closely aligned with the biomedical librarian’s recommendations, had a high hit rate in identifying relevant videos but yielded a low absolute number of results. For our final search approach, we therefore decided to use both search strings (No. 1 and No. 4) and to remove potential duplicates in the following step.
Finally, we ran API-based searches via a Python script that directly outputs search results (including information such as video URL, view count, and comment count) into a .csv file; see Supplemental File 1 for a step-by-step guide and example for executing this script. While this step-by-step guidance applies to extracting media data from YouTube, other platforms (including social media platforms such as X, TikTok, or Facebook) offer similar options with varying limitations, clearance requirements, costs, and ease of use (Meta for Developers, 2024; TikTok for Developers, 2025; X Developer Platform, 2024).
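As a minimal illustration of this output step, the following Python sketch writes a list of result records to a .csv file using the standard library; the column names are illustrative of the metadata mentioned above (video URL, view count, comment count) rather than the exact layout produced by the script in Supplemental File 1:

```python
import csv

def write_results_csv(results, path):
    """Write extracted search results to a .csv file.

    `results` is a list of dicts, one per video; missing keys are left
    blank so that partial metadata does not interrupt the export.
    """
    fieldnames = ["video_url", "title", "channel", "view_count", "comment_count"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()   # one header row naming the columns
        writer.writerows(results)
```

The resulting file can be opened directly in spreadsheet software, which supports the screening and coding steps described below.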
Step 2: Order Extracted Data
To order extracted data, and to condense them into the final dataset, we recommend approaches established for systematic literature reviews, including the removal of duplicates before identifying relevant data based on pre-specified inclusion criteria.
In our case, the final search script yielded 340 results, among which we identified 52 duplicates (15% of all search results). Considering the high number of individual searches and potential overlaps (e.g., videos identified by both search strings or classified as both most relevant and most watched), this comparatively low percentage of duplicates emphasizes the importance of combining several search approaches. We then screened the 288 remaining videos following the same criteria employed during search string piloting: We first screened video titles for mentions of vaccines or vaccine-preventable diseases. In cases where titles were ambiguous or did not mention vaccines, we watched the entire video, excluding it only if the content did not match our criteria. Although we are not aware of clear guidance on how the YouTube algorithms define ‘relevance’, results identified via the ‘by relevance’ search strings nevertheless tended to be a better fit for our inclusion criteria: 69 of 108 videos (64%) included in our final sample were exclusively identified by ‘by relevance’ search strings, while 14 videos (13%) were exclusively extracted via the ‘by view count’ search strings; 25 videos (23%) were identified by both search approaches.
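The deduplication logic underlying such overlap reporting can be sketched as follows: video IDs from several searches are merged into one entry per video while recording which search approach retrieved each video, which later allows reporting how many videos were found by one or both approaches. The search labels in this Python sketch are illustrative rather than our exact search identifiers:

```python
def deduplicate(search_results):
    """Merge results from several searches, keeping one entry per video.

    `search_results` maps a search label (e.g., 'relevance', 'view_count')
    to a list of video IDs. The output maps each unique video ID to the
    set of searches that retrieved it.
    """
    merged = {}
    for label, video_ids in search_results.items():
        for vid in video_ids:
            # setdefault creates an empty set on first sighting of a video,
            # so duplicates collapse into a single entry with multiple labels
            merged.setdefault(vid, set()).add(label)
    return merged
```

Counting entries whose label set has one versus two members then yields breakdowns analogous to the exclusive and shared identification percentages reported above.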
A majority of extracted videos did not mention vaccines or vaccine-related topics, and we were not always able to determine the characteristics that led to their identification in the initial searches. Some videos might have been retrieved via the association between vaccination and ‘shots’, a term that may also have surfaced videos on gun violence. Considering these challenges, we encourage investing time and effort in ordering the data; in our experience, not only did this result in a much higher quality of the overall dataset, but the familiarization achieved via data ordering also directly facilitated later analytical steps.
Step 3: Code the Full Dataset
Building on the ordered dataset, we suggest performing an initial content analysis, starting with developing a codebook based on both deductive and inductive approaches. Deductively, we recommend relying on the existing literature and previous experiences in the specific setting or research field. Inductively, researchers can draw on the extensive familiarization with the data achieved in the previous ‘order’ step to identify emerging dynamics and patterns. Individual codes and their levels should then be refined in an iterative process.
In our study, we initially selected several codes based on our previous experiences in the setting and on approaches established in the literature (Ache & Wallace, 2008; Briones et al., 2012; Covolo et al., 2017; Ekram et al., 2019) (e.g., codes related to the presented vaccine attitude, or discussions of vaccine effectiveness and safety). These codes were then supplemented with inductively identified codes, for example regarding types of broadcasts or invoked authorities. Levels of each code were adapted to reflect our specific dataset (for example via differentiating explicit and implied notions of vaccine safety or efficacy).
Given that the organized dataset at this point might remain too comprehensive for full transcription and translation, and to allow for contextualized analysis (including of media excerpts in the local language), developing and piloting the codebook can require more than one coder familiar with the local language(s) and context. Additionally, in line with the aim of the FOCUS approach to allow a meaningful trade-off between researcher resources and a comprehensive analysis of a systematically extracted dataset, we recommend designing a codebook that enables direct coding while reading, watching, or listening to the extracted media (several times if necessary for more complex or ambiguous datapoints).
In our case, three Filipino team members (MDCR, GJ, MLU) piloted the codebook on five purposively selected videos using Microsoft Excel. As part of this process, we aimed to ensure a shared understanding of the coding tree, to clearly define the limited number of included codes and sub-codes, and to provide nominal numerical values for each code (see Supplemental File 2 for the final codebook). The resulting codebook facilitated coding directly while engaging with the data, without the need for transcription and translation of the full dataset.
To identify overarching themes and dynamics present in this coded dataset, we propose drawing on the tenets of content analysis (Krippendorff, 2018; White & Marsh, 2006). Exploring frequencies, thematic patterns, and contextual use of the individual code values, both for the full dataset and for potential sub-groups (e.g., individual media channels, timeframes, topics, etc.), allows for developing a broad understanding of overarching patterns and their occurrence across the full dataset.
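As a minimal sketch of such frequency exploration, the following Python function counts the values of one code across a coded dataset (e.g., rows exported from a coding spreadsheet), optionally split by a sub-group column such as timeframe or channel; all column names are illustrative rather than drawn from our codebook:

```python
from collections import Counter, defaultdict

def code_frequencies(coded_rows, code, group_by=None):
    """Count values of one code across the coded dataset.

    `coded_rows` is a list of dicts, one per coded datapoint (e.g., per
    video). Without `group_by`, returns overall counts; with `group_by`,
    returns one Counter per sub-group (e.g., per timeframe).
    """
    if group_by is None:
        return Counter(row[code] for row in coded_rows)
    grouped = defaultdict(Counter)
    for row in coded_rows:
        grouped[row[group_by]][row[code]] += 1
    return grouped
```

Inspecting these counts side by side (overall and per sub-group) supports the kind of timeframe-specific comparison of vaccine attitudes and characteristics described below.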
In our study, we assessed the presented vaccine-related information and how the content of this information changed over time. For each timeframe, we descriptively assessed the vaccines discussed and vaccine attitudes conveyed in broadcasts, as well as the prominence of specific vaccine characteristics such as benefits, safety, and risks. This process allowed us to understand timeframe-specific and overarching patterns related to the presence and absence of specific information, as well as shifting dynamics of vaccine reporting over time. While this process alone did not engage with the full qualitative depth of the data, it provided valuable insights in terms of how to select a subset of information-rich broadcasts for deeper examination.
Step 4: Understand Qualitative Nuances in Media Communication and Message Framing
To supplement this broad overview of the entire dataset, we propose an in-depth qualitative framing analysis of selected datapoints. Framing analysis originated in anthropology and psychology, but has since been embraced by various disciplines (Foley et al., 2019). With its overarching aim of understanding how media content relays information to the public (Entman, 1993), framing analysis investigates how a particular piece of information is presented (i.e., the frame of the information) (Foley et al., 2019).
The exact criteria for purposively selecting datapoints to be included in this step may vary depending on the research question at hand; we recommend not only focusing on highly sensational media communications but also deliberately including media content that represents routine or archetypical communication. Similarly, the exact number of cases to be selected can depend on various factors, including the resources available, the size of the overall dataset, and the complexity of the topic of interest.
In our case, we purposively selected broadcasts that were especially rich in content and had sparked high viewer engagement (based on the extracted broadcast metadata) across multiple timeframes. We also sought to cover several broadcasting types (e.g., entertainment and news shows) and to capture an array of the vaccines discussed. We identified 16 videos (approx. 12% of the full dataset), which allowed us to maintain a manageable sample size while also ensuring that we covered the general groups, timeframes, and discourses identified in the previous steps.
Once researchers have identified their data subset of interest, in-depth qualitative analysis can begin. In our case, we first transcribed spoken text and non-verbal cues in each video. Additionally, we took screenshots of key moments in each broadcast and extracted YouTube user comments to understand viewer engagement. In our framing analysis, we drew on approaches established in the literature (Entman, 1993; Foley et al., 2019; Reese et al., 2001). Following recommendations by Foley and colleagues (Foley et al., 2019), we did not limit ourselves to assessing overarching frames of entire broadcasts, but instead allowed for different frames and frame components to emerge within one broadcast by coding data fragments (Foley et al., 2019) using NVivo 14. We mapped coded data and broadcast screenshots on a Miro board (https://www.miro.com). This mapping helped us identify communication nuances across stakeholders or messengers, as well as interrelations between frames employed within and across data fragments and timespans. Supplemental File 3 includes insights into components of the Miro-based mapping.
Step 5: Spotlight Insights Across Phases and Data Modalities
Appropriate formats for presenting the derived insights and recommendations will depend on the objectives and intended audiences of a respective research project. Across dissemination channels, however, we recommend that researchers leverage the multimodal nature of modern media communication: The visualization of nonverbal data, such as via screengrabs of broadcasts, can supplement the traditionally heavy focus on text within media analysis. The inclusion of non-textual data can nevertheless spark challenges, such as many academic publishers’ figure limits, limited familiarity with video- or audio-based data, and copyright concerns when moving beyond the extraction of written or verbal quotes.
To navigate such challenges in our own work, we contacted broadcasters for permission to include screengrabs in scientific publications. Despite several attempts, we received no response from the first broadcasting conglomerate and were asked to pay prohibitively high fees by the second. We therefore explored creating representations of the respective scenes using generative artificial intelligence (Stable Diffusion), referencing the hyperlink for, and exact second in, the respective broadcast that was meant to be represented in the figure. While the final published article does not include these generated representations due to figure-count limitations and reviewer concerns, we encourage exploring alternative ways of presenting key data across different modalities to facilitate the inclusion of qualitative nuance beyond textual excerpts.

Figure 1 summarizes all five FOCUS steps, the decisions we made over the course of their application, and key takeaways based on our experiences.

Figure 1. Actions, Example Decisions, and Key Takeaways when Employing the FOCUS Approach. Note: API = Application Programming Interface.
Discussion
Data accessibility issues have often challenged the analysis of mass media content in global health, especially for non-English language media from LMICs. The advent of online platforms has drastically facilitated data access. However, the resulting data gold rush has sparked new challenges for systematically extracting datasets that are meaningful for the research objective but also manageable in size for in-depth (qualitative) investigation. To respond to this challenge, we developed the five-step FOCUS approach for online media analysis in global health research. FOCUS offers a holistic approach that complements but also extends existing frameworks, which predominantly explore data analysis rather than the intersections between extraction and analytic steps (Fazeli et al., 2023; Foley et al., 2019).
With the proposed combination of analytic approaches to iteratively condense large datasets, FOCUS echoes aspects of Andreotta and colleagues’ analytic framework (Andreotta et al., 2019). Drawing on a mixed methods approach, the authors propose data science techniques (such as matrix factorization and topic alignment) to compress full datasets along a specific dimension of relevance (Andreotta et al., 2019). In comparison, FOCUS encourages a more qualitative and active engagement while condensing the data. We see this hands-on emphasis as potentially reducing the degree to which researchers have to rely on an algorithmic selection of datapoints, and as facilitating a familiarity with qualitative nuances in the full dataset that may inform subsequent in-depth analysis. However, the extraction of massive datasets might challenge this hands-on engagement; we therefore encourage researchers to adapt FOCUS to their respective needs, for example by integrating quantitative approaches or adding iterations within the ‘order’ and ‘code’ steps.
We developed FOCUS based on our experiences analyzing vaccination-related TV broadcasts in the Philippines (Wachinger et al., 2025), and our emphasis on global health research is rooted in two core observations: First, existing extraction approaches often narrowly focus on major media platforms or require considerable resources. These characteristics limit existing approaches’ applicability to and accessibility for many global health research projects with limited funds or investigating local, non-English media conversations. Second, health scholarship has repeatedly called for stronger engagement with media-analytic approaches established in other disciplines (Foley et al., 2019; Hallin & Briggs, 2014). FOCUS addresses these challenges by outlining an adaptable, free-of-cost approach for extracting and qualitatively analyzing media data, thereby facilitating rigorous and context-sensitive global health research. However, we believe that FOCUS holds value beyond health contexts, and we invite researchers working across domains (including disciplines, topics, regional contexts, media types, or online platforms) to adapt and extend the approach according to their interests.
Our aim in developing FOCUS was to offer accessible and practical guidance for qualitative online media analysis. As a result, we see potential for FOCUS to inform and be incorporated into ‘small q’ studies that remain rooted in more positivist-empiricist traditions in terms of both methodology and epistemology (Braun & Clarke, 2024; Kidder & Fine, 1987). Yet, we also see FOCUS as more than a stepwise toolkit: At its core, FOCUS encourages the integration of constructivist and interpretivist perspectives, supporting interrogations of how media narratives are socially constructed, context-dependent, and expressive of discursive power relations. Studies that are inherently guided by qualitative paradigms and constructivist epistemologies, or ‘Big Q’ (Braun & Clarke, 2024; Kidder & Fine, 1987), remain underrepresented in the field of health media analysis as compared to more positivist, big-data analyses. FOCUS thereby assumes a hybrid role: While it offers systematic steps for data extraction and a structured entry into content analysis, it also emphasizes the value of exploring qualitative nuance via framing analyses and attention to salient cases. We encourage researchers applying FOCUS to foreground meaning-making processes in (online) media contexts, and to share their experiences regarding FOCUS’s potential to move beyond a ‘small q’ analytic toolkit.
The extraction and analysis of online media, including as proposed by the FOCUS approach, sparks ethical challenges. Available guidelines and regulatory frameworks do not consistently provide clarity on ethical ways to utilize existing online media data, particularly in the case of social media platforms where individuals might publicly share private information that cannot be sufficiently anonymized, and where obtaining informed consent may be challenging (Fuchs, 2018; Karafillakis et al., 2021; Nicholas et al., 2020). We did not primarily develop FOCUS as a social media listening tool, but rather to engage with mass media content available in online spaces (content inherently meant to be publicly available, commonly with contributors’ consent for publication in line with their professional roles as public communicators). However, we see the potential of FOCUS to be applied to extract individual-level data of identifiable private individuals (e.g., when extracting comments of YouTube users, some of whom may have used their real names; in our work, we therefore used comments to contextualize videos within the broader public discourse at the time, but did not include verbatim quotes in the published article). In such instances, we urge researchers to consult the applicable legal and ethical guidelines where available, but also to err on the side of caution: Given the permanence and searchability of online content, the protection of individual private contributors in our view has to take precedence over the need to extract and present individual-level information, regardless of the data’s public availability.
Moving forward, we encourage researchers considering the FOCUS approach to flexibly adapt the steps to their own needs, and to share their own research and experiences when applying this approach – including novel ways and platforms for extracting data, and use-cases within the broader field of health and media research. Furthermore, as multimodal media data becomes increasingly accessible, we echo calls for greater creativity in how (health) researchers engage with media contents (Foley et al., 2019). We encourage researchers to explore how novel analytic perspectives, such as the visual-verbal analysis method (Fazeli et al., 2023), might provide new angles for ‘understanding’ and ‘spotlighting’ multimedia data within the broader procedural frame of the FOCUS approach. Moving beyond FOCUS’s individual steps, we hope to embolden researchers to leverage the opportunities of online platforms for bridging existing gaps in the analysis of health-related mass media reports.
Supplemental Material
Supplemental Material - Systematic Extraction and Qualitative Framing Analysis of Health-Related Online Media Content: Introducing the FOCUS Approach
Supplemental Material for Systematic Extraction and Qualitative Framing Analysis of Health-Related Online Media Content: Introducing the FOCUS Approach by Jonas Wachinger, Mark Donald C. Reñosa, Jerric Rhazel Guevarra, Felix Casel, Georgina Janowski, Ma Leslie Ulmido, Shannon A. McMahon in International Journal of Qualitative Methods
