Introduction
Media communication, defined here as communication intended for a wider audience and not delivered face-to-face, plays a key role in shaping knowledge and attitudes across societal strata, including in the context of health discourses (Seale, 2003b; Strömbäck, 2008). Studying how health-related topics are communicated across various media can therefore provide insights for designing and delivering health interventions, and can inform efforts to identify, understand, and address misinformation (Chen & Wang, 2021; Seale, 2003a; Swire-Thompson & Lazer, 2020). In this context, the expansion of online media spaces, whether building on traditional formats (e.g., online newspapers, online radio/TV station broadcasts) or novel ones (e.g., social networks such as Facebook, X (formerly Twitter), or TikTok), has fundamentally reshaped the opportunities and challenges of health communication and opened a range of new avenues for media research (Kapoor et al., 2018; Moorhead et al., 2013; Quan-Haase & Sloan, 2022; Swire-Thompson & Lazer, 2020). While the richness and breadth of online media platforms provide easier access to larger, and often more in-depth (Quan-Haase & Sloan, 2022), datasets, this “data gold rush” (Felt, 2016) also presents challenges for systematic and manageable media analysis.
A primary challenge in online media analysis is how and from where to systematically extract meaningful data. Large-scale initiatives have provided starting points or archives to simplify complex data extraction and investigation processes, including tools sponsored and developed by the WHO (‘EARS’) (Purnat et al., 2021; WHO, 2021), or by non-profit and for-profit research groups (Meltwater, 2024; NORC, 2025; The GDELT Project, 2022). At the same time, the accessibility of such solutions is limited: some have been discontinued (such as EARS), likely due to technical and financing challenges, while the for-profit nature of others inhibits access in resource-limited contexts. Alternatively, researchers can extract data via media platforms’ Application Programming Interfaces (APIs) and open-source web-scraping tools (e.g., Scrapy (Scrapy Developers, 2025) or BeautifulSoup (Richardson, 2015)). However, these approaches commonly require coding expertise and familiarity with the respective platform and software documentation, and accessible step-by-step guidance remains limited.
Beyond extraction challenges, a second hurdle inherent to online media analysis involves synthesizing and distilling large amounts of data in a way that facilitates in-depth analysis (Chani et al., 2023; Strauss et al., 2024). This gap is particularly pronounced in the context of powerful-yet-work-intensive qualitative efforts to analyze health-related media data. While the broader field of media studies routinely employs a range of qualitative analytic techniques, scholarship in health research is yet to systematically embrace these approaches (Foley et al., 2019; Hallin & Briggs, 2014). Traditional – but often descriptive – content analysis remains heavily employed, often at the expense of more nuanced qualitative examination (Fu et al., 2023).
To address this gap, authors have proposed ways to utilize existing best-practice approaches to investigate health media data. A leading approach is framing analysis, which aims to understand how topics are promoted to the public via the inclusion and exclusion of specific information and interpretations (Entman, 1993; Foley et al., 2019). A more recent approach, referred to as the visual-verbal analysis method, provides guidance for engaging with media data across various modalities (e.g., written texts, audio and video files, or images) (Fazeli et al., 2023). These and other analytic approaches provide valuable insights in terms of how to move forward once a research team has assembled its dataset; they offer comparatively little guidance, however, on how to systematically extract and condense such a dataset in the first place.
This article introduces the five-step FOCUS approach for the extraction and qualitative analysis of large amounts of multimedia data. We developed this approach amid our work on health-related messaging in Filipino TV broadcasts (Wachinger et al., 2023, 2025) where we encountered gaps in the available methodological guidance: To facilitate accessible and adaptable data extraction, we outline a systematic, free-of-charge, and software-based technique for culling YouTube-based data, and provide step-by-step guidance for applying this tool to other research projects and media platforms. To address challenges associated with the analysis of large-but-inherently-qualitative datasets, we then outline distinct steps for condensing and investigating extracted data by combining the strengths of two distinct analytical pathways: content analysis and framing analysis.
Find, Order, Code, Understand, Spotlight – The Five-step FOCUS Approach to Online Media Analysis
The FOCUS approach consists of five distinct steps: First, researchers find the data; second, they order the extracted data; third, they code the full dataset; fourth, they understand qualitative nuances in media communication and message framing; and fifth, they spotlight insights across phases and data modalities. We describe each step below, illustrating it with examples from our own work.
Step 1: Find the Data
Given the amount of multimedia data available online, researchers have to identify media platforms or databases with the highest likelihood to yield data of interest for their study – both in terms of facilitating the search and extraction of data, and in terms of representing media formats that are relevant for the population of interest. For example, many studies focus on X (formerly Twitter) or on English language news magazines – not necessarily or exclusively because these platforms are most commonly used by members of diverse populations, but because they are comparatively easy to access and analyze. However, as media communication can vary considerably across platforms and formats, we encourage undertaking formative research to identify platforms that represent the content consumed by the population of interest – even if this might require compromises with regards to the ease of data extraction. In this context, third-party online platforms have become a promising starting point to search for content that historically has not been archived in an online, systematically accessible way, which is often the case for local language media in low- and middle-income countries (LMICs).
In our work on media communication in the Philippines, a review of the literature and available data highlighted that local language TV stations were among the most relevant sources for health information. However, accessing content from these stations proved challenging, as their websites and archives lacked systematic broadcast search functionalities. We found, however, that major TV stations routinely uploaded content to their official YouTube channels, making YouTube a valuable platform for retrieving past broadcasts.
YouTube allows content creators, including TV stations, to upload videos to their own ‘channels’, but the platform as of this writing does not allow structured searches based on specific pre-identified criteria. To address this challenge, we developed a free-of-cost, software-based approach leveraging the YouTube API, which allows for automated, systematic searches of YouTube content and is freely accessible via a standard Google account (See Supplemental File 1 for guidance on modifying and using this approach).
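To illustrate the general shape of such an API query (without reproducing the exact script provided in Supplemental File 1), the following standard-library-only Python sketch assembles a search request against the documented YouTube Data API v3 search endpoint; the channel ID and query string are placeholders, the API key is obtained via a standard Google account, and the parameter names follow the public API documentation:

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"

def build_search_url(api_key, channel_id, query, order="relevance", max_results=10):
    """Assemble a YouTube Data API v3 search request for one channel.

    `order` may be 'relevance' or 'viewCount', mirroring the two ranking
    criteria used when piloting the search strings.
    """
    params = {
        "part": "snippet",        # return basic video metadata (title, date, etc.)
        "channelId": channel_id,  # restrict results to one TV station's channel
        "q": query,               # the piloted search string
        "type": "video",
        "order": order,
        "maxResults": max_results,
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

# The resulting URL can then be fetched (e.g., with urllib.request.urlopen)
# and the JSON response parsed for video IDs, titles, and statistics.
```

In practice, one such request would be issued per channel and per search string, with the `order` parameter toggled to retrieve both the most relevant and the most watched results.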
Table 1. Search Criteria to be Specified in the API Query
Table 2. Piloting of the Four Different Search Strings
Note: ᵃNumber of results extracted and screened <50, as this search string yielded no results on two channels and only 8 on a third, and only the top 10 results per channel were extracted.
As highlighted in Table 2, the short search string (No. 1) yielded the highest number of relevant results. The highly specific search string (No. 4), which most closely aligned with the biomedical librarian’s recommendations, had a high hit rate in identifying relevant videos but yielded a low absolute number of results. For our final search approach, we therefore decided to use both search strings (No. 1 and No. 4) and to remove potential duplicates in the following step.
Finally, we ran API-based searches via a Python script that directly outputs search results (including information such as video URL, view count, and comment count) into a .csv file; see Supplemental File 1 for a step-by-step guide and example for executing this script. While this step-by-step guidance applies to extracting media data from YouTube, other platforms (including social media platforms such as X, TikTok, or Facebook) offer similar options with varying limitations, clearance requirements, costs, and ease of use (Meta for Developers, 2024; TikTok for Developers, 2025; X Developer Platform, 2024).
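As a minimal illustration of this output step, the following Python sketch writes a list of result records to a .csv file using the standard library; the column names are illustrative of the metadata mentioned above (video URL, view count, comment count) rather than the exact layout produced by the script in Supplemental File 1:

```python
import csv

def write_results_csv(results, path):
    """Write extracted search results to a .csv file.

    `results` is a list of dicts, one per video; missing keys are left
    blank so that partial metadata does not interrupt the export.
    """
    fieldnames = ["video_url", "title", "channel", "view_count", "comment_count"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()   # one header row naming the columns
        writer.writerows(results)
```

The resulting file can be opened directly in spreadsheet software, which supports the screening and coding steps described below.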
Step 2: Order Extracted Data
To order extracted data, and to condense them into the final dataset, we recommend approaches established for systematic literature reviews, including the removal of duplicates before identifying relevant data based on pre-specified inclusion criteria.
In our case, the final search script yielded 340 results, among which we identified 52 duplicates (15% of all search results). Considering the high number of individual searches and potential overlaps (e.g., videos identified by both search strings or classified as both most relevant and most watched), this comparatively low percentage of duplicates emphasizes the importance of combining several search approaches. We then screened the 288 remaining videos following the same criteria employed during search string piloting: We first screened video titles for mentions of vaccines or vaccine-preventable diseases. In cases where titles were ambiguous or did not mention vaccines, we watched the entire video, excluding it only if the content did not match our criteria. Although we are not aware of clear guidance on how the YouTube algorithms define ‘relevance’, results identified via the ‘by relevance’ search strings nevertheless tended to be a better fit for our inclusion criteria: 69 of 108 videos (64%) included in our final sample were exclusively identified by ‘by relevance’ search strings, while 14 videos (13%) were exclusively extracted via the ‘by view count’ search strings; 25 videos (23%) were identified by both search approaches.
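The deduplication logic underlying such overlap reporting can be sketched as follows: video IDs from several searches are merged into one entry per video while recording which search approach retrieved each video, which later allows reporting how many videos were found by one or both approaches. The search labels in this Python sketch are illustrative rather than our exact search identifiers:

```python
def deduplicate(search_results):
    """Merge results from several searches, keeping one entry per video.

    `search_results` maps a search label (e.g., 'relevance', 'view_count')
    to a list of video IDs. The output maps each unique video ID to the
    set of searches that retrieved it.
    """
    merged = {}
    for label, video_ids in search_results.items():
        for vid in video_ids:
            # setdefault creates an empty set on first sighting of a video,
            # so duplicates collapse into a single entry with multiple labels
            merged.setdefault(vid, set()).add(label)
    return merged
```

Counting entries whose label set has one versus two members then yields breakdowns analogous to the exclusive and shared identification percentages reported above.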
A majority of extracted videos did not mention vaccines or vaccine-related topics, and we were not always able to determine the characteristics that led to their identification in the initial searches. Some videos might have been retrieved via the association between vaccination and ‘shots’, a term that may also have surfaced videos on gun violence. Considering these challenges, we encourage investing time and effort in ordering the data; in our experience, not only did this result in a much higher quality of the overall dataset, but the familiarization achieved via data ordering also directly facilitated later analytical steps.
Step 3: Code the Full Dataset
Building on the ordered dataset, we suggest performing an initial content analysis, starting with developing a codebook based on both deductive and inductive approaches. Deductively, we recommend relying on the existing literature and previous experiences in the specific setting or research field. Inductively, researchers can draw on the extensive familiarization with the data achieved in the previous ‘order’ step to identify emerging dynamics and patterns. Individual codes and their levels should then be refined in an iterative process.
In our study, we initially selected several codes based on our previous experiences in the setting and on approaches established in the literature (Ache & Wallace, 2008; Briones et al., 2012; Covolo et al., 2017; Ekram et al., 2019) (e.g., codes related to the presented vaccine attitude, or discussions of vaccine effectiveness and safety). These codes were then supplemented with inductively identified codes, for example regarding types of broadcasts or invoked authorities. Levels of each code were adapted to reflect our specific dataset (for example via differentiating explicit and implied notions of vaccine safety or efficacy).
Given that the organized dataset at this point might remain too comprehensive for full transcription and translation, and to allow for contextualized analysis (including of media excerpts in the local language), developing and piloting the codebook can require more than one coder familiar with the local language(s) and context. Additionally, in line with the aim of the FOCUS approach to allow a meaningful trade-off between researcher resources and a comprehensive analysis of a systematically extracted dataset, we recommend designing a codebook that enables direct coding while reading, watching, or listening to the extracted media (several times if necessary for more complex or ambiguous datapoints).
In our case, three Filipino team members (MDCR, GJ, MLU) piloted the codebook on five purposively selected videos using Microsoft Excel. As part of this process, we aimed to ensure a shared understanding of the coding tree, to clearly define the limited number of included codes and sub-codes, and to provide nominal numerical values for each code (see Supplemental File 2 for the final codebook). The resulting codebook facilitated coding directly while engaging with the data, without the need for transcription and translation of the full dataset.
To identify overarching themes and dynamics present in this coded dataset, we propose drawing on the tenets of content analysis (Krippendorff, 2018; White & Marsh, 2006). Exploring frequencies, thematic patterns, and contextual use of the individual code values, both for the full dataset and for potential sub-groups (e.g., individual media channels, timeframes, topics, etc.), allows for developing a broad understanding of overarching patterns and their occurrence across the full dataset.
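As a minimal sketch of such frequency exploration, the following Python function counts the values of one code across a coded dataset (e.g., rows exported from a coding spreadsheet), optionally split by a sub-group column such as timeframe or channel; all column names are illustrative rather than drawn from our codebook:

```python
from collections import Counter, defaultdict

def code_frequencies(coded_rows, code, group_by=None):
    """Count values of one code across the coded dataset.

    `coded_rows` is a list of dicts, one per coded datapoint (e.g., per
    video). Without `group_by`, returns overall counts; with `group_by`,
    returns one Counter per sub-group (e.g., per timeframe).
    """
    if group_by is None:
        return Counter(row[code] for row in coded_rows)
    grouped = defaultdict(Counter)
    for row in coded_rows:
        grouped[row[group_by]][row[code]] += 1
    return grouped
```

Inspecting these counts side by side (overall and per sub-group) supports the kind of timeframe-specific comparison of vaccine attitudes and characteristics described below.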
In our study, we assessed the presented vaccine-related information and how the content of this information changed over time. For each timeframe, we descriptively assessed the vaccines discussed and vaccine attitudes conveyed in broadcasts, as well as the prominence of specific vaccine characteristics such as benefits, safety, and risks. This process allowed us to understand timeframe-specific and overarching patterns related to the presence and absence of specific information, as well as shifting dynamics of vaccine reporting over time. While this process alone did not engage with the full qualitative depth of the data, it provided valuable insights in terms of how to select a subset of information-rich broadcasts for deeper examination.
Step 4: Understand Qualitative Nuances in Media Communication and Message Framing
To supplement this broad overview of the entire dataset, we propose an in-depth qualitative framing analysis of selected datapoints. Framing analysis originated in anthropology and psychology, but has since been embraced by various disciplines (Foley et al., 2019). With its overarching aim of understanding how media content relays information to the public (Entman, 1993), framing analysis investigates how a particular piece of information is presented (i.e., the frame of the information) (Foley et al., 2019).
The exact criteria for purposively selecting datapoints to be included in this step may vary depending on the research question at hand; we recommend not only focusing on highly sensational media communications but also deliberately including media content that represents routine or archetypical communication. Similarly, the exact number of cases to be selected can depend on various factors, including the resources available, the size of the overall dataset, and the complexity of the topic of interest.
In our case, we purposively selected broadcasts that were especially rich in content and had sparked high viewer engagement (based on the extracted broadcast metadata) across multiple timeframes. We also sought to cover several broadcasting types (e.g., entertainment and news shows) and to capture an array of the vaccines discussed. We identified 16 videos (approx. 12% of the full dataset), which allowed us to maintain a manageable sample size while also ensuring that we covered the general groups, timeframes, and discourses identified in the previous steps.
Once researchers have identified their data subset of interest, in-depth qualitative analysis can begin. In our case, we first transcribed spoken text and non-verbal cues in each video. Additionally, we took screenshots of key moments in each broadcast and extracted YouTube user comments to understand viewer engagement. In our framing analysis, we drew on approaches established in the literature (Entman, 1993; Foley et al., 2019; Reese et al., 2001). Following recommendations by Foley and colleagues (Foley et al., 2019), we did not limit ourselves to assessing overarching frames of entire broadcasts, but instead allowed for different frames and frame components to emerge within one broadcast by coding data fragments (Foley et al., 2019) using NVivo 14. We mapped coded data and broadcast screenshots on a Miro board (https://www.miro.com). This mapping helped us identify communication nuances across stakeholders or messengers, as well as interrelations between frames employed within and across data fragments and timespans. Supplemental File 3 includes insights into components of the Miro-based mapping.
Step 5: Spotlight Insights Across Phases and Data Modalities
Appropriate formats for presenting the derived insights and recommendations will depend on the objectives and intended audiences of a respective research project. Across dissemination channels, however, we recommend that researchers leverage the multimodal nature of modern media communication: The visualization of nonverbal data, such as via screengrabs of broadcasts, can supplement the traditionally heavy focus on text within media analysis. The inclusion of non-textual data can nevertheless spark challenges, such as many academic publishers’ figure limits, limited familiarity with video- or audio-based data, and copyright concerns when moving beyond the extraction of written or verbal quotes.
To navigate such challenges in our own work, we contacted broadcasters for permission to include screengrabs in scientific publications. Despite several attempts, we received no response from the first broadcasting conglomerate and were asked to pay prohibitively high fees by the second. We therefore explored creating representations of the respective scenes using generative artificial intelligence (Stable Diffusion), referencing the hyperlink for, and exact second in, the respective broadcast that was meant to be represented in the figure. While the final published article does not include these generated representations due to figure-count limitations and reviewer concerns, we encourage exploring alternative ways of presenting key data across different modalities to facilitate the inclusion of qualitative nuance beyond textual excerpts.

Figure 1 summarizes all five FOCUS steps, the decisions we made over the course of their application, and key takeaways based on our experiences.

Figure 1. Actions, Example Decisions, and Key Takeaways when Employing the FOCUS Approach. Note: API = Application Programming Interface.
Discussion
Data accessibility issues have often challenged the analysis of mass media content in global health, especially for non-English language media from LMICs. The advent of online platforms has drastically facilitated data access. However, the resulting data gold rush has sparked new challenges for systematically extracting datasets that are meaningful for the research objective but also manageable in size for in-depth (qualitative) investigation. To respond to this challenge, we developed the five-step FOCUS approach for online media analysis in global health research. FOCUS offers a holistic approach that complements but also extends existing frameworks, which predominantly explore data analysis rather than the intersections between extraction and analytic steps (Fazeli et al., 2023; Foley et al., 2019).
With the proposed combination of analytic approaches to iteratively condense large datasets, FOCUS echoes aspects of Andreotta and colleagues’ analytic framework (Andreotta et al., 2019). Drawing on a mixed methods approach, the authors propose data science techniques (such as matrix factorization and topic alignment) to compress full datasets along a specific dimension of relevance (Andreotta et al., 2019). In comparison, FOCUS encourages a more qualitative and active engagement while condensing the data. We see this hands-on emphasis as potentially reducing the degree to which researchers have to rely on an algorithmic selection of datapoints, and as facilitating a familiarity with qualitative nuances in the full dataset that may inform subsequent in-depth analysis. However, the extraction of massive datasets might challenge this hands-on engagement; we therefore encourage researchers to adapt FOCUS to their respective needs, for example by integrating quantitative approaches or adding iterations within the ‘order’ and ‘code’ steps.
We developed FOCUS based on our experiences analyzing vaccination-related TV broadcasts in the Philippines (Wachinger et al., 2025), and our emphasis on global health research is rooted in two core observations: First, existing extraction approaches often narrowly focus on major media platforms or require considerable resources. These characteristics limit existing approaches’ applicability to and accessibility for many global health research projects with limited funds or investigating local, non-English media conversations. Second, health scholarship has repeatedly called for stronger engagement with media-analytic approaches established in other disciplines (Foley et al., 2019; Hallin & Briggs, 2014). FOCUS addresses these challenges by outlining an adaptable, free-of-cost approach for extracting and qualitatively analyzing media data, thereby facilitating rigorous and context-sensitive global health research. However, we believe that FOCUS holds value beyond health contexts, and we invite researchers working across domains (including disciplines, topics, regional contexts, media types, or online platforms) to adapt and extend the approach according to their interests.
Our aim in developing FOCUS was to offer accessible and practical guidance for qualitative online media analysis. As a result, we see potential for FOCUS to inform and be incorporated into ‘small q’ studies that remain rooted in more positivist-empiricist traditions in terms of both methodology and epistemology (Braun & Clarke, 2024; Kidder & Fine, 1987). Yet, we also see FOCUS as more than a stepwise toolkit: At its core, FOCUS encourages the integration of constructivist and interpretivist perspectives, supporting interrogations of how media narratives are socially constructed, context-dependent, and expressive of discursive power relations. Studies that are inherently guided by qualitative paradigms and constructivist epistemologies, or ‘Big Q’ (Braun & Clarke, 2024; Kidder & Fine, 1987), remain underrepresented in the field of health media analysis as compared to more positivist, big-data analyses. FOCUS thereby assumes a hybrid role: While it offers systematic steps for data extraction and a structured entry into content analysis, it also emphasizes the value of exploring qualitative nuance via framing analyses and attention to salient cases. We encourage researchers applying FOCUS to foreground meaning-making processes in (online) media contexts, and to share their experiences regarding FOCUS’s potential to move beyond a ‘small q’ analytic toolkit.
The extraction and analysis of online media, including as proposed by the FOCUS approach, sparks ethical challenges. Available guidelines and regulatory frameworks do not consistently provide clarity on ethical ways to utilize existing online media data, particularly in the case of social media platforms where individuals might publicly share private information that cannot be sufficiently anonymized, and where obtaining informed consent may be challenging (Fuchs, 2018; Karafillakis et al., 2021; Nicholas et al., 2020). We did not primarily develop FOCUS as a social media listening tool, but rather to engage with mass media content available in online spaces (content inherently meant to be publicly available, commonly with contributors’ consent for publication in line with their professional roles as public communicators). However, we see the potential of FOCUS to be applied to extract individual-level data of identifiable private individuals (e.g., when extracting comments of YouTube users, some of whom may have used their real names; in our work, we therefore used comments to contextualize videos within the broader public discourse at the time, but did not include verbatim quotes in the published article). In such instances, we urge researchers to consult the applicable legal and ethical guidelines where available, but also to err on the side of caution: Given the permanence and searchability of online content, the protection of individual private contributors in our view has to take precedence over the need to extract and present individual-level information, regardless of the data’s public availability.
Moving forward, we encourage researchers considering the FOCUS approach to flexibly adapt the steps to their own needs, and to share their own research and experiences when applying this approach – including novel ways and platforms for extracting data, and use-cases within the broader field of health and media research. Furthermore, as multimodal media data becomes increasingly accessible, we echo calls for greater creativity in how (health) researchers engage with media contents (Foley et al., 2019). We encourage researchers to explore how novel analytic perspectives, such as the visual-verbal analysis method (Fazeli et al., 2023), might provide new angles for ‘understanding’ and ‘spotlighting’ multimedia data within the broader procedural frame of the FOCUS approach. Moving beyond FOCUS’s individual steps, we hope to embolden researchers to leverage the opportunities of online platforms for bridging existing gaps in the analysis of health-related mass media reports.
Supplemental Material
Supplemental Material - Systematic Extraction and Qualitative Framing Analysis of Health-Related Online Media Content: Introducing the FOCUS Approach
Supplemental Material for Systematic Extraction and Qualitative Framing Analysis of Health-Related Online Media Content: Introducing the FOCUS Approach by Jonas Wachinger, Mark Donald C. Reñosa, Jerric Rhazel Guevarra, Felix Casel, Georgina Janowski, Ma Leslie Ulmido, Shannon A. McMahon in International Journal of Qualitative Methods
