Abstract
Keywords
Introduction
As of June 20, 2021, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) had been confirmed responsible for approximately 4 million deaths worldwide, making the pandemic one of the most devastating health crises in history. Although preventive measures such as wearing masks, avoiding crowded areas, and improving sanitation reduce the possibility of being infected by or spreading COVID-19, widely distributing SARS-CoV-2 vaccines is crucial to reducing viral transmission, since various vaccines have shown an effectiveness of over 95% in preventing COVID-19 symptoms (Bendau et al., 2021). However, while extraordinary progress in the development of vaccines has been made, the task of comprehensive vaccination administration remains unaccomplished.
In 2019, the World Health Organization recognized vaccine hesitancy as one of the ten most significant threats to global health. Skepticism toward vaccines, driven chiefly by concerns about adverse side effects and the rapid speed of development, increased dramatically during the COVID-19 pandemic. Analyses of COVID-19 vaccine-related discourse on social media have revealed a significant amount of vaccine uncertainty alongside an exponentially decreasing trend in trust among the general public. Various studies also indicate that anti-vaccine content generates higher user engagement rates than most pro-vaccine posts (Puri et al., 2020). Therefore, evaluating and understanding public trust and confidence in COVID-19 vaccines is essential to developing effective communication strategies to maximize uptake. This may also assist in addressing the concerns of vaccine doubters and ensuring public safety during global epidemics.
The propagation and evolution of information during the pandemic can be attributed in large part to social media (Chopra et al., 2021), particularly Twitter, which has been increasingly recognized for its efficacy in disseminating and circulating information globally. Twitter’s micro-blogging capabilities and its base of approximately 200 million regular users allow for a deeper understanding of public sentiment about the COVID-19 vaccine. Moreover, Twitter reported its highest user growth rate during the initial quarantine stage: a 24% increase in daily active users. Additionally, the most widely used hashtag in 2020 was COVID-19 (Kreps et al., 2022).
Considering the complex interplay between social media use, COVID-19, and mental health, integrating social media platforms as part of alternative mental health therapies during lockdowns could offer novel pathways for support, provided their use is carefully curated to foster positive interactions and mitigate the spread of misinformation (Radanliev & De Roure, 2021).
Investigations of Twitter posts can provide insight into real-time shifts and trends in public sentiment throughout the COVID-19 pandemic, a valuable information source in public health and COVID-19-related research. Noteworthy analysis methods often used to study the presence of vaccine skeptics include social network analysis, topic identification, and sentiment classification. Previous studies (Donovan, 2020; Puri et al., 2020) also suggest a link between misinformation on Twitter and vaccine hesitancy and its subsequent effects on public health. Furthermore, these studies confirmed various elements influencing vaccine uptake, including adverse events, socioeconomic inequities, and quantitative apportionment. Therefore, given the significant role of Twitter in shaping vaccine uptake, monitoring tweets to understand public sentiment toward COVID-19 vaccines is essential (Saleh, McDonald et al., 2021).
Furthermore, numerous tweet analyses (Charquero-Ballester et al., 2021; Huangfu et al., 2022) have shown the potential of Twitter discussions to yield an accurate estimate of the number of vaccinated individuals. Research conducted on public perceptions of the H1N1 influenza vaccine in 2009 found that projected vaccination rates could be analyzed through Twitter data (Saleh, McDonald et al., 2021). Similarly, a previous study discovered that Twitter exposure may explain variations in obtaining the human papillomavirus (HPV) vaccine that are not explained by socio-economic factors like education, insurance, or income.
In this study, we aim to apply various content analysis techniques to tweets related to COVID-19 vaccines to gain insight into changes in public opinion on vaccinations over time. We expect that a content analysis of tweets from the early stages of the pandemic will identify critical topics of discussion throughout the vaccine development phase and can thus guide healthcare authorities, public health officials, and decision-makers in designing awareness and intervention strategies for the uptake of COVID-19 vaccines. To guide this investigation, we present the following research questions:
RQ1: What are the predominant sentiments expressed in the discourse about COVID-19 vaccines on Twitter?
RQ2: How do these sentiments evolve over time, particularly in response to key events in the vaccine rollout and pandemic milestones?
RQ3: What themes emerge from the Twitter discourse regarding COVID-19 vaccines, and how do these themes correlate with public sentiment and vaccine uptake?
RQ4: How does misinformation manifest within the discourse, and what impact does it have on public sentiment toward COVID-19 vaccines?
Based on a preliminary review of the literature and the observed trends in social media discourse, we propose the following hypotheses:
• H1: Negative sentiment toward COVID-19 vaccines on Twitter is strongly influenced by misinformation and specific events, such as reports of side effects or changes in vaccine recommendations.
• H2: Positive sentiment is closely associated with periods following the release of efficacy data or endorsements from reputable health organizations.
• H3: The prevalence of specific themes, such as trust in science, concerns about side effects, and conspiracy theories, correlates with shifts in public sentiment and vaccine uptake rates.
• H4: Engagement rates for tweets containing misinformation about COVID-19 vaccines are higher than those for tweets promoting vaccine acceptance, contributing to an overall climate of vaccine hesitancy.
These research questions and hypotheses are designed to structure the study’s exploration of Twitter discourse related to COVID-19 vaccines, providing a foundation for understanding the complex interplay between social media, public sentiment, and public health outcomes.
The significant contributions of this work are summarized as follows:
• The study’s contribution is a multimethod exploratory analysis providing insight into public sentiment and emotions regarding COVID-19 vaccinations through a time series analysis and comparison of vaccine brands using existing NLP techniques.
• The study proposes a new fusion model for sentiment analysis of tweets, which combines the predictions of two traditional supervised learning models (TextBlob and VADER) and four deep learning models (Flair, Transformers, and two additional pre-trained models).
• The work’s novelty lies in using a deep learning model, “roberta-base-emotion,” trained on a multilabel emotion dataset for improved emotion recognition performance.
• We employed a novel approach utilizing zero-shot classification and the “bart-large-mnli” transformer to dynamically categorize topics and classify pro- and anti-vaccine tweets, reducing the reliance on labeled data and providing a time-based analysis of evolving user opinions.
The rest of this paper is organized as follows: A review of the contemporary literature relevant to this study is presented in Section 2. The framework is presented in Section 3. The results of the experimental study are reported and discussed in Section 4. Finally, Section 5 presents our conclusions.
Literature Review
Rising emotions such as anxiety, uncertainty, and fear during the COVID-19 pandemic heightened the public’s negative responses to the crisis and, more specifically, to vaccination. Research findings indicate that psychological states are interrelated with media consumption and sources, as well as individual and contextual variations, but in a diverse and complex manner (Charquero-Ballester et al., 2021; Chu et al., 2022). The emotionally charged controversy over vaccines long predates the COVID-19 pandemic and thrives with the assistance of anti-vaccination groups. These groups manipulate emotions to encourage conspiracy theories and their spread. In fact, according to existing studies (K. Ali et al., 2022; Y. Wang et al., 2019; Yan et al., 2023), anti-vaccine accounts on Twitter were found to express anger at a significantly higher rate than their pro-vaccine counterparts. Theories designed to lower trust in the government and experts were also found to be frequently associated with these groups (K. Ali et al., 2022). These sentiments sustain and deepen vaccine skepticism and adversely affect the task of comprehensive vaccination administration (Yan et al., 2023). As the campaigns of such groups have continued to flourish during the COVID-19 pandemic, halting primary misinformation sources is essential to promoting vaccine assurance.
Vaccine Hesitancy
Distrustful and misinformed attitudes on COVID-19 vaccines only bolster vaccine hesitancy and skepticism, adversely affecting comprehensive inoculation against the pandemic. As such content flourishes under increased social media usage, analysis of public sentiment is rapidly growing as it has the potential to improve vaccine distribution and uptake. More attention was paid to vaccination rejection and hesitation than to vaccine interest (Brannen et al., 2023).
Several studies (Krittanawong et al., 2020; Lenti et al., 2022; Mir et al., 2022; Warner et al., 2022) have found that vaccine misinformation is pervasive on Twitter, with many tweets advocating anti-vaccination discourse, mentioning side effects, and lacking reliable sources. A study in 2020 (Warner et al., 2022) found that personal threats, civil liberties, and conspiracy theories were the most often discussed subjects. Moreover, misinformation tweets were more likely to embrace an anti-vaccine viewpoint, name a specific vaccine, lack citations, and be against vaccination legislation. A severe limitation of the study is the lack of individual characteristics of Twitter users who engaged with such tweets, which are often associated with vaccine hesitancy and acceptance. The conclusions drawn by Krittanawong et al. (2020) are speculative because the dissemination of false information and unreliable data seriously limits the usage of Twitter, particularly among users and in non-academic settings.
Raising public awareness and comprehension is crucial to lessening the anti-vaccine movement’s harmful effects. Research on the effectiveness of interventions to address vaccine hesitancy, however, has focused chiefly on high-income countries. As such, the insufficient information available from low-income nations has created an urgent need for rapid research to assist in the global distribution of vaccines. The study by Q. Wang et al. (2021) provides a detailed analysis of the factors influencing COVID-19 vaccine acceptance, including predictors such as gender, educational status, influenza vaccination history, and trust in the government. The study by Ullah et al. (2021) also has shortcomings, such as a lack of information on whether the populations polled were representative, possible biases in the survey’s methodologies and questions, and a short window for the data cutoff. In addition, a recent single-center study from Wuhan, China, found evidence that people who have allergies, the flu, or asthma are more likely to contract COVID-19 or die from infection-related causes.
Vaccine hesitancy has attracted considerable attention from public health experts, policymakers, and social media platforms in recent years. However, few studies have considered the impact of celebrity endorsement on communication engagement and dissemination. Lenti et al. (2022) provide a detailed account of how users in no-vax communities are more likely to share low-credibility domains and be exposed to misleading information. Abbas et al. (2022) conducted a cross-sectional survey with 100 participants and discovered that most people do not believe the vaccine is safe for pregnant and breastfeeding mothers. CoVaxxy (DeVerna et al., 2021) is a collection of English-language Twitter posts about COVID-19 vaccines and public health outcomes. The authors’ work has significantly contributed to the field by developing an infrastructure hosted on XSEDE Jetstream virtual machines that can collect and process large quantities of Twitter data.
Scannell et al. (2021) examined the effects of exposure to COVID-19 news and information on mental health during a public health emergency. The authors found a weak but statistically significant positive correlation between overall media exposure, psychological discomfort, sadness, anxiety, stress, and preventive actions. They also found that different types of misinformation have different emotional valences on Twitter, with “conspiracy” and “viral feature and number” myths having a larger negative emotional valence than other myths. Limitations of their methodology include reliance on manual coding to detect misinformation, inability to measure readers’ immediate emotional response, and vaccine hesitancy being a significant barrier to vaccine discourse on Twitter.
Mir et al. (2022) also examined the impact of myths and conspiracy theories on vaccines and COVID-19. They found that verified user tweets have a higher impact than unverified user tweets, while tweets that express positive sentiments have the highest impact. However, the authors failed to conclude whether the observed effect is clinically significant. Additionally, the study is based on data annotated by just one annotator, which may undermine the reliability of their conclusions. Finally, the authors’ lack of full disclosure of their data and reporting of mean reliability scores for their technique may make it more difficult to replicate their findings.
According to a study conducted by A. Hussain et al. (2021), vaccine hesitancy and objections were more prevalent than vaccine interest in both the US and UK. Similarly, in a study conducted by Kwok et al. (2021), it was found that fear was the most prevalent negative emotion expressed in tweets. This study also identified that news organizations were the most active in producing positive content about COVID-19 vaccines. Vaccine hesitancy via the spread of misinformation has also been extensively researched by Thelwall et al. (2021), who found that false information dispersed across Twitter increases vaccine hesitancy.
Furthermore, Griffith et al. (2021) researched vaccine sentiments through the Theoretical Domains Framework and identified 15.45% of tweets as vaccine hesitant. Approximately 80% of these posts conveyed concern about COVID-19 vaccine safety due to misinformation and mistrust in the vaccine based mainly on politics and anti-vaccine social media posts. Eibensteiner et al. (2021) produced a similar study that examined public opinion on COVID-19 vaccine safety via Twitter polls. Data drawn from the polls revealed a 30% increase in positive opinions of the COVID-19 vaccine after users were introduced to a patient safety platform.
Sentiment and Emotion Analysis
This section delves into the intricate web of public perceptions and reactions to COVID-19 vaccines as captured through Twitter. The analysis draws from a series of pivotal studies, showcasing the dynamic landscape of sentiment and opinion ranging from support and optimism to hesitancy and opposition. The use of Twitter-based sentiment analysis emerges as a powerful tool in discerning these public sentiments, aiding in efforts to enhance vaccine distribution and uptake.
Studies including those by F. Alderazi et al. (2021), F. M. Alderazi et al. (2022), and Khattak et al. (2020) explored the utility of Twitter in mapping the contours of public opinion, revealing nuanced apprehensions toward the COVID-19 vaccine. Similarly, investigations by Samuel et al. (2020) and Rahman et al. (2021) illuminate the broader societal pulse, for instance fear of the virus and concerns over the reopening of the United States. The analysis by H. Lyu et al. (2022), with data from 20,000 Twitter users, paints a quantitative picture: a majority in support of vaccination, but marked by regional hesitancies and political apprehensions. The work of A. Hussain et al. (2021) extends this analysis to the UK and US, uncovering sentiment trends across hundreds of thousands of social media posts and highlighting the impact of research developments and vaccine criticism on public perception.
Further enriching the discourse, Eibensteiner et al. (2021) employ Twitter polls to gauge safety perceptions of COVID-19 vaccines, revealing a blend of assurance, uncertainty, and skepticism among the global public. The Household Pulse Survey and studies by G. G. M. N. Ali et al. (2021) and Pristiyono et al. (2021) contribute to this nuanced understanding by examining sentiment changes over time and across geographies, employing innovative frameworks and algorithms to distil the essence of public opinion. The large-scale analysis of vaccine-related tweets by Saleh, Lehmann, & Medford (2021) and the exploration of bot-generated content versus genuine engagement by Yousefinaghani et al. (2021) offer insights into the mechanics of sentiment dissemination on Twitter.
In conclusion, studies by Jang et al. (2022) and Monselise et al. (2021) highlight the evolving nature of sentiment over time and the predominant emotions driving the public conversation around COVID-19 vaccinations. Through a meticulous examination of tweets, keywords, and sentiment analysis techniques, these studies collectively underscore the profound influence of social media discourse on vaccine acceptance and hesitancy, illuminating the path toward addressing public concerns and enhancing vaccine uptake.
Topic Modeling
The focus shifts to dissecting the discourse surrounding COVID-19 vaccines through the lens of topic modeling, a nuanced method that, alongside sentiment analysis, unravels the key subjects of discussion across social media landscapes. This segment highlights pivotal research endeavors that leverage the Latent Dirichlet Allocation (LDA) topic model to parse through vast volumes of tweets, uncovering the dominant themes and sentiments entwined with COVID-19 vaccination discussions.
The analysis spearheaded by Kwok et al. (2021) delves into an extensive corpus of tweets emanating from Australian Twitter users, dissecting the public’s emotional and thematic engagement with COVID-19 vaccines from January to October 2020. This exploration reveals a dichotomy where a majority showcases support for vaccination efforts, juxtaposed against a significant fraction voicing skepticism or opposition, underscoring the multifaceted perspectives within the public discourse.
Huangfu et al. (2022) utilized LDA for sentiment-based topic modeling, revealing nuanced public perceptions and attitudes toward vaccines, though their focus on textual Twitter data suggests the potential for broader applicability across social media platforms and geographies. Similarly, J. C. Lyu et al. (2021) observed an increasing trend of positive sentiment regarding COVID-19 vaccines, suggesting a growing acceptance, albeit limited by the reliance on older analytical techniques.
Melton et al. (2021) analyzed sentiments across Reddit communities, finding a predominantly positive outlook that has remained consistent, indicating a broader trend of vaccine acceptance in online communities, despite challenges in detecting sarcasm. The study by Yousef et al. (2022) highlighted a prevalent negative sentiment during the Australian vaccine rollout, emphasizing the localized nature of vaccine sentiments and the importance of considering regional and demographic nuances.
The research of Yousefinaghani et al. (2021) underscores the utility of Twitter for public health agencies in understanding vaccination sentiments, pointing out limitations in language representation and data coverage that could impact the generalizability of the findings. The work of Shim et al. (2021) reflects public anticipation, disappointment, and fear, showing that online discourse does not fully represent the entire population’s views, which are often based on indirect experiences, and it relies on a small tweet sample.
Further, the effort by Hu et al. (2021) to model sentiment toward vaccines in the U.S. using LDA highlights the necessity for complementary data to grasp the full spectrum of public attitudes. Xue et al. (2020) and Abd-Alrazaq et al. (2020) emphasized the challenges posed by limited search terms, demographic representation, and the constraints of social media platforms on data collection and analysis.
In summary, these studies collectively shed light on the complexity of public sentiment toward COVID-19 vaccines, demonstrating the vital role of social media as both a mirror and a mold for public opinion. Through diverse methodological approaches, from LDA topic modeling to sentiment analysis, researchers have navigated the intricate web of global discourse, uncovering both challenges and opportunities for enhancing vaccine uptake and combating misinformation.
Materials and Methods
We propose a tweet-based analysis framework for COVID-19 vaccine sentiment exploration, shown in abstract view in Figure 1.

Figure 1. Abstract view of the proposed COVID-19 sentiment analysis framework.
The proposed framework, shown in Figure 2, conducts seven different analyses ranging from exploratory data analysis (EDA) to analyses of pro- and anti-vaccination tweets. The framework’s first stage is acquiring information from social networking sites (SNS), Twitter in particular. The accumulated data consists of tweets mentioning COVID-19 vaccines and keywords such as Pfizer, Moderna, and AstraZeneca. The collected data is preprocessed (step 2) to eliminate irrelevant content, such as URLs, mentions, and special characters. In addition, stop words are removed, and stemming is performed to convert words to their root form. EDA is done using several approaches, such as hashtag analysis, word cloud analysis, sentiment analysis, emotion analysis, topic modeling, and pro- and anti-vaccination analysis. The framework’s entire source code is available in the GitHub repository (https://github.com/jamilbadama/Tweets-Analysis-Framework-for-Covid-19-and-Covid-19-Vaccines). We describe these components in the following sub-sections.

Figure 2. Detailed view of the proposed COVID-19 sentiment analysis framework.
Data Collection and Resource
We used a tweet-based dataset from Kaggle called “COVID-19 All Vaccines Tweets” (COVID-19 all Vaccines Tweets, n.d.). The tweets were collected using the Python package Tweepy and the Twitter API, and the search terms used were specific to each vaccine, such as Pfizer/BioNTech, Sinopharm, Sinovac, Moderna, Oxford/AstraZeneca, Covaxin, and Sputnik V. The Kaggle dataset was regularly updated with new vaccination-related tweets associated with the pharmaceutical companies. As the dataset was unstructured and raw, data cleaning and pre-processing were required before we could extract useful insights or use it in a machine learning model. We selected tweets from January 2021 to August 2021 from the dataset, a period that corresponds to the peak time of vaccine development. This time frame was chosen as it allowed us to capture the public’s reactions and opinions while the vaccines were still in the development phase, and before they were widely distributed. The tweets were analyzed using text mining and sentiment analysis techniques to gain insights into the public’s perceptions and attitudes toward COVID-19 vaccines.
Data Preprocessing
Text preprocessing is crucial as it prepares raw data for text mining, thus easing the information extraction process. This step eliminates textual noise that is insignificant for understanding the sentiment of tweets, such as punctuation, special characters, numbers, and words with minimal contextual weight. Once the dataset was assembled, we removed duplicate tweets and processed the unique tweets with the following preprocessing steps to further clean the data:
• English language detection: We are primarily interested in English tweets, so non-English tweets are removed.
• Case conversion: All uppercase letters are converted to lowercase.
• Removing Punctuation, Links, Numbers, and Special Characters: Punctuation, links, numerals, and special characters are eliminated from the text.
• Tokenization: Tokens are individual terms or words, and tokenization divides a text string into tokens.
• Stopword Removal: The stopword removal process utilized a customized list based on the standard NLTK library, enriched with domain-specific terms and adjusted to retain contextually significant words, thereby refining the dataset for focused sentiment and thematic analysis.
• Stemming: The NLTK PorterStemmer is used to reduce words to their root form.
After collecting the data, we removed all tweets written in languages other than English and ones that contained URLs or hashtags. Then, we anonymized the tweet and user ID using data masking and replaced all usernames mentioned with the code “user mention.” We used the user geo-location and the user-reported profile location to determine the country in which each tweet was posted. We also created a database of hashtags.
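The cleaning steps listed above can be illustrated with a minimal sketch. It is not the framework's actual implementation: the function names `preprocess` and `deduplicate`, the tiny `STOPWORDS` set, and the choice to drop @mentions outright are assumptions made for illustration, and language detection is omitted. The sketch uses only the Python standard library; stemming is left as an optional callable so that a stemmer such as NLTK's `PorterStemmer().stem` can be supplied.

```python
import re

# Illustrative stopword subset (assumption); the study uses a customized
# list built on the standard NLTK stopwords.
STOPWORDS = {"the", "a", "an", "in", "is", "are", "to", "of", "and",
             "for", "on", "i", "my"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def deduplicate(tweets):
    """Keep the first occurrence of each tweet text, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet not in seen:
            seen.add(tweet)
            unique.append(tweet)
    return unique


def preprocess(tweet, stem=None, stopwords=STOPWORDS):
    """Return the cleaned tokens of a single tweet.

    stem: optional callable mapping a token to its stem
          (for example, nltk.stem.PorterStemmer().stem).
    """
    text = tweet.lower()                  # case conversion
    text = URL_RE.sub(" ", text)          # remove links
    text = MENTION_RE.sub(" ", text)      # remove @mentions (assumption)
    text = NON_LETTER_RE.sub(" ", text)   # remove punctuation, numbers, symbols
    tokens = text.split()                 # simple whitespace tokenization
    tokens = [t for t in tokens if t not in stopwords]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens


if __name__ == "__main__":
    sample = ("I got my second dose of the Pfizer vaccine today! "
              "https://t.co/abc @WHO #Vaccinated 100%")
    print(preprocess(sample))
```

Passing the stemmer as an argument keeps the sketch free of third-party dependencies while leaving the stemming step in the same position as in the list above.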
Exploratory Data Analysis (EDA)
For the EDA, we employed several approaches, such as hashtag analysis, word cloud analysis, sentiment analysis, emotion analysis, topic modeling, and pro- and anti-vaccination analysis. Hashtags can provide insight into the vaccine-related topics that people are discussing. The hashtag analysis entails extracting hashtags from the preprocessed data and analyzing them to determine the most frequently used hashtags in tweets about COVID-19 vaccines. Word cloud analysis visually represents the most frequently used terms in tweets about COVID-19 vaccines. The size of each word in the cloud is proportional to how often it appears in the tweets. This analysis provides insight into the most prevalent themes and topics discussed in relation to vaccines. Sentiment analysis determines whether tweets are positive, negative, or neutral. Ensemble sentiment analysis uses multiple algorithms to enhance the reliability of sentiment analysis results. Emotion analysis involves identifying the emotions conveyed in tweets about COVID-19 vaccines. This analysis provides insights into how individuals feel about vaccines, which can be used to better comprehend their attitudes and behaviors regarding vaccination. Topic modeling identifies the most prominent topics discussed in tweets about COVID-19 vaccines, and the analysis of pro- and anti-vaccination tweets identifies which tweets are pro-vaccination and which are anti-vaccination.
Hashtag Analysis
On social media, the use of hashtags has increased substantially during the last several years. Numerous businesses develop custom hashtags for specific campaigns, making it simpler to organize all relevant postings and discussions. Hashtags are especially important on Instagram and Twitter because they provide a simple method to group related items together.
The hashtag analysis was conducted to gauge the distribution and prominence of hashtags in COVID-19 vaccine-related tweets. Initially, the data were curated by substituting absent hashtag entries with a default “None” label and removing any “\\N” entries to ensure dataset integrity. Subsequently, the dataset was augmented with a “hashtags_count” column, derived from counting the hashtags within each tweet, facilitating the analysis of hashtag usage patterns. Visualization of this distribution, employing a logarithmic scale for enhanced clarity, provided insights into the prevalence of hashtags per tweet. Further, the dataset was split to isolate individual hashtags, enabling the compilation of a unique hashtag set. This approach not only illuminated the diversity of the conversation but also enriched the understanding of the thematic landscape surrounding COVID-19 vaccine discussions on Twitter.
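The counting steps described above can be sketched as follows. The sketch assumes, hypothetically, that each record stores its hashtags as a stringified Python list (for example "['vaccine', 'Pfizer']"), that missing values appear as None, "None", or "\\N", and that hashtags are compared case-insensitively; the helper names `parse_hashtags` and `hashtag_stats` are illustrative and are not taken from the framework's repository.

```python
import ast
from collections import Counter

# Values treated as "no hashtags" (assumption based on the text above).
MISSING = ("", "None", "\\N")


def parse_hashtags(raw):
    """Turn one raw hashtag cell into a list of lowercase hashtags."""
    if raw is None:
        return []
    if isinstance(raw, (list, tuple)):
        tags = raw
    elif raw in MISSING:
        return []
    else:
        try:
            tags = ast.literal_eval(raw)  # parse "['a', 'b']" safely
        except (ValueError, SyntaxError):
            return []
        if not isinstance(tags, (list, tuple)):
            return []
    return [str(tag).lower() for tag in tags]


def hashtag_stats(rows):
    """Return (hashtags_count per tweet, overall hashtag frequencies)."""
    counts_per_tweet = []
    frequency = Counter()
    for row in rows:
        tags = parse_hashtags(row.get("hashtags"))
        counts_per_tweet.append(len(tags))  # the "hashtags_count" value
        frequency.update(tags)
    return counts_per_tweet, frequency


if __name__ == "__main__":
    rows = [
        {"hashtags": "['PfizerBioNTech', 'vaccine']"},
        {"hashtags": None},
        {"hashtags": "['Vaccine']"},
        {"hashtags": "\\N"},
    ]
    counts, freq = hashtag_stats(rows)
    print(counts)
    print(freq.most_common(5))
    print(sorted(freq))  # the unique hashtag set
```

The per-tweet counts correspond to the “hashtags_count” column, and the keys of the frequency counter form the unique hashtag set.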
Word Cloud Analysis
We used the Python wordcloud package (Mueller, 2018) to create a word cloud from the dataset. A word cloud is a common data visualization technique used to display textual information while providing a quick overview of the contents of a text corpus. The words are sized based on their term frequency in the corpus and placed randomly. A word cloud may also summarize the words associated with a hashtag or term on Twitter.
Stop words are frequently used words in the English language that contribute little to the content of a phrase, such as “the,” “a,” “an,” and “in.” In natural language processing tasks, these words are added to a stop list and filtered out. If needed, we may re-add particular stop words to our word cloud. We anticipate that tokens such as “t,” “co,” “HTTPS,” “amp,” and “U” will often appear in tweets, so they are also treated as stop words.
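The term frequencies that determine word sizes can be computed as in the sketch below. The stop list here is a small illustrative subset plus the tweet artifacts mentioned above, and the function name `term_frequencies` is hypothetical. The sketch relies only on the standard library.

```python
import re
from collections import Counter

# Small illustrative stop list (assumption) plus tweet artifacts.
BASE_STOPWORDS = {"the", "a", "an", "in"}
TWEET_ARTIFACTS = {"t", "co", "https", "amp", "u"}


def term_frequencies(tweets, stopwords=BASE_STOPWORDS | TWEET_ARTIFACTS,
                     top_n=None):
    """Count how often each non-stopword term occurs across all tweets."""
    counter = Counter()
    for tweet in tweets:
        for token in re.findall(r"[a-z]+", tweet.lower()):
            if token not in stopwords:
                counter[token] += 1
    # most_common(None) returns every term, ordered by frequency.
    return dict(counter.most_common(top_n))


if __name__ == "__main__":
    tweets = [
        "The vaccine is safe https://t.co/x &amp; effective",
        "Vaccine rollout in the U.S.",
    ]
    print(term_frequencies(tweets))
```

A dictionary of this form can be passed to the wordcloud package through `WordCloud().generate_from_frequencies(freqs)` to render the cloud.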
Ensemble Sentiment Analysis
One of the most well-known Natural Language Processing (NLP) tasks is sentiment analysis. The field of NLP has advanced significantly during the last 5 years, and open-source libraries like TextBlob (Loria, 2015), Flair (Akbik et al., 2019), and Transformers (Wolf et al., 2020) provide sentiment analysis and other NLP capabilities that are ready for use. To classify each tweet, assign a polarity of positive, negative, or neutral, and evaluate the overall sentiment of tweets, a new fusion model is proposed in this study. To strengthen the capacity and robustness of the proposed AI model, we developed ensemble models based on the majority voting strategy (J. Hussain et al., 2018; Khan et al., 2019). The overall architecture of the proposed model is shown in Figure 3.

Figure 3. Detailed view of the ensemble sentiment analysis framework.
Voting is a form of ensemble approach that combines the predictions of many models and selects the one with the most votes. This method is generally used when there are many models with distinct configurations or multiple experts with divergent viewpoints. In either scenario, the voting ensemble approach may provide a more accurate prediction while avoiding overfitting by combining information from numerous sources (Sunitha et al., 2022). This strategy helps reduce the bias and errors in deep learning models that are often caused by missing information in the extracted deep features.
The first layer of the proposed ensemble approach comprises two traditional supervised learning models and four deep learning models, which allows for categorizing various test samples. Furthermore, according to the notion of three-way decisions, data samples falling inside the low-confidence decision zone of one model may fall within the high-confidence decision region of another. Consequently, using several classifiers may boost the overall confidence in the classification system and enhance the precision of the findings.
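The majority voting step can be sketched as below, with each base model contributing one label per tweet. The function name `majority_vote` and the rule for ties (falling back to "neutral") are assumptions made for illustration; the text does not specify how ties among six voters are resolved.

```python
from collections import Counter


def majority_vote(predictions, tie_label="neutral"):
    """Return the label predicted by the most models.

    predictions: iterable of labels such as "positive", "negative",
                 or "neutral", one per base model.
    tie_label:   label returned when the top vote count is shared
                 (assumption; not specified in the text).
    """
    counts = Counter(predictions)
    if not counts:
        raise ValueError("at least one prediction is required")
    ranked = counts.most_common()
    top_label, top_votes = ranked[0]
    # A tie exists when the runner-up has as many votes as the leader.
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return tie_label
    return top_label


if __name__ == "__main__":
    # Hypothetical outputs of six base models for a single tweet.
    votes = ["positive", "positive", "negative",
             "neutral", "positive", "negative"]
    print(majority_vote(votes))
```

With six voters an even split is possible, which is why an explicit tie rule is needed; other choices, such as weighting votes by model confidence, would be equally plausible.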
For sentiment analysis in the traditional machine learning model, we employed TextBlob and the Valence Aware Dictionary and Sentiment Reasoner (VADER) on tweets.
TextBlob is a library in Python that provides a simple API for natural languages processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more. TextBlob uses a combination of machine learning and rule-based approaches to determine the sentiment polarity of a given text. In the context of sentiment analysis, Text-Blob assigns a subjectivity score to each tweet on a scale from 0 (objective) to 1 (subjective). Tweets with a score of 0 are considered objective and convey information, while tweets with a score of 1 are considered subjective and express an opinion or belief. In addition to classifying tweets as objective or subjective, TextBlob also assigns a sentiment polarity score to the text. The polarity score is between -1 and 1, where -1 represents highly negative sentiment, 0 represents neutral sentiment, and 1 represents highly positive sentiment. For example, a tweet with the text “I love this product!” would likely be classified as subjective with a high positive polarity, while a tweet with the text “This product is terrible” would be classified as subjective with a high negative polarity.
VADER is a lexicon and rule-based sentiment analysis tool developed specifically for social media messages. It assigns a sentiment score to each word or phrase in the text based on its pre-defined valence (positive or negative) and the intensity of that valence. VADER then combines the scores of all the text’s words and phrases to determine the message’s overall sentiment. Tweets are classified as positive if they have a VADER score of 0.25 or higher, negative if they have a score of -0.25 or lower, and neutral if they have a score in between. This threshold is used here to classify sentiment in text, but other thresholds may also be used depending on the specific application.
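The thresholding rule can be written down as a small function. This sketch assumes that the "VADER score" is the `compound` value returned by the `vaderSentiment` package and that the cutoffs are symmetric at +0.25 and -0.25; both points are assumptions rather than details confirmed by the text.

```python
def vader_label(compound, threshold=0.25):
    """Classify a VADER compound score with symmetric cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_compound(text):
    """Return VADER's compound score in [-1, 1] for a text."""
    # Third-party package, imported lazily so the rule above is usable alone.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]


print(vader_label(0.6), vader_label(-0.4), vader_label(0.1))
```

Passing a different `threshold` makes it easy to test how sensitive the class distribution is to the chosen cutoff.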
Flair is a modern natural language processing (NLP) framework built on top of PyTorch that offers several pre-trained models for various NLP tasks, including sentiment analysis. Flair’s pre-trained sentiment model allows users to quickly and easily perform sentiment analysis on text without developing a custom algorithm. The model returns the predicted label (positive, negative, or neutral) along with a confidence score, which ranges from 0 to 1, with 1 representing extreme confidence and 0 representing extreme uncertainty. To use the model, the input text is first wrapped in a Sentence object, which tokenizes it, and the model’s predict function is then called on that object to predict the sentiment. This framework is known for its simplicity and flexibility, making it a popular choice for NLP tasks.
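The two-step usage described above (build a `Sentence`, then call `predict`) might look as follows. The model identifier `"en-sentiment"` is an assumption about which Flair sentiment model was used, and `normalize_label` is a hypothetical helper that lowercases the label so that it matches the labels of the other classifiers.

```python
from functools import lru_cache


def normalize_label(value):
    """Lowercase a label such as "POSITIVE" for use in the ensemble."""
    return value.strip().lower()


@lru_cache(maxsize=1)
def load_flair_classifier(model_name="en-sentiment"):
    """Load a pre-trained Flair text classifier once and reuse it."""
    from flair.models import TextClassifier  # third-party, imported lazily

    return TextClassifier.load(model_name)


def flair_sentiment(text):
    """Return (label, confidence) for a text using Flair."""
    from flair.data import Sentence

    sentence = Sentence(text)                  # tokenizes the text
    load_flair_classifier().predict(sentence)  # attaches labels in place
    label = sentence.labels[0]
    return normalize_label(label.value), label.score


print(normalize_label("POSITIVE"))
```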
The Transformers library is another popular toolkit developed by the Hugging Face team that provides APIs and tools for downloading and training state-of-the-art pre-trained models in various modalities, including natural language processing, computer vision, audio, and multimodal. These models have already been trained on large datasets and can be fine-tuned for specific tasks, allowing users to save computing expenses, reduce their carbon footprint, and save time and resources compared to training a model from scratch. The Transformers library offers several pre-trained models for various tasks, such as Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction, and Question Answering. These tasks can be performed using the library’s pipelines, which contain the bulk of the complex logic and provide a simple API for users.
Our study uses three distinct pre-trained models alongside a transformer pipeline for sentiment analysis tasks. These models include “justinqbui/bertweet-covid-vaccine-tweets-finetuned” (justinqbui/bertweet-covid-vaccine-tweets-finetuned · Hugging Face, n.d.) and “cardiffnlp/twitter-roberta-base-sentiment-latest” (cardiffnlp/twitter-roberta-base-sentiment-latest · Hugging Face, n.d.). The “justinqbui/bertweet-covid-vaccine-tweets-finetuned” model is a version of “vinai/bertweet-covid19-base-uncased” (vinai/bertweet-covid19-base-uncased · Hugging Face, n.d.) that was further pre-trained on masked language modeling using a Kaggle dataset that includes tweets up to the beginning of December 2021. While BERTweet was only trained on 23 million tweets up to September 2020, this model was further pre-trained with 300 thousand tweets containing the hashtag #CovidVaccine. The “cardiffnlp/twitter-roberta-base-sentiment-latest” model is a RoBERTa-based model trained on approximately 124 million tweets between January 2018 and December 2021 and fine-tuned for sentiment analysis using the TweetEval benchmark (GitHub - cardiffnlp/tweeteval: Repository for TweetEval, n.d.). This model is suitable for English language text. After obtaining prediction results from the classifiers, we used the ensemble approach described earlier to assign each tweet a final sentiment label (positive, negative, or neutral).
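A transformer pipeline of the kind described above can be sketched as follows. The snippet assumes the `transformers` package and uses the Cardiff NLP model named in the text as the default; `label_of` is a hypothetical helper, and the exact label strings depend on the model configuration.

```python
def build_sentiment_pipeline(
    model_name="cardiffnlp/twitter-roberta-base-sentiment-latest",
):
    """Create a Hugging Face sentiment pipeline for the given model."""
    from transformers import pipeline  # third-party, imported lazily

    return pipeline("sentiment-analysis", model=model_name)


def label_of(prediction):
    """Pick the highest-scoring label from a pipeline result.

    prediction is the list of {"label", "score"} dicts that a pipeline
    returns for a single input string.
    """
    best = max(prediction, key=lambda item: item["score"])
    return best["label"].lower()


# Hypothetical pipeline output for one tweet, used to show the helper.
example = [{"label": "Positive", "score": 0.91}]
print(label_of(example))
```

With the package installed, `label_of(build_sentiment_pipeline()("Got my second dose today!"))` would yield the label that is passed on to the ensemble.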
Emotions Analysis
For emotion recognition, we used a deep learning model developed for multilabel emotion classification, “roberta-base-emotion” (bhadresh-savani/roberta-base-emotion · Hugging Face, n.d.). This model was trained on an emotion dataset that includes various emotions (anger, fear, joy, love, sadness, and surprise). It focuses on learning emotion-specific associations and incorporating their correlations into the training target. As “roberta-base-emotion” has shown good performance in the multilabel emotion classification task, we decided to use it for emotion prediction.
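Emotion prediction with this model could be sketched in the same way as the sentiment pipeline. The use of `top_k=None` to request scores for all six emotions is an assumption that holds for recent versions of `transformers`, and the nesting of the returned list can differ between versions, so the hypothetical helper `top_emotion` accepts both shapes.

```python
EMOTIONS = ("anger", "fear", "joy", "love", "sadness", "surprise")


def build_emotion_pipeline(model_name="bhadresh-savani/roberta-base-emotion"):
    """Create a text-classification pipeline that scores every emotion."""
    from transformers import pipeline  # third-party, imported lazily

    return pipeline("text-classification", model=model_name, top_k=None)


def top_emotion(scores):
    """Return the emotion with the highest score for one tweet."""
    # Some versions wrap the per-label scores in an extra list.
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    return max(scores, key=lambda item: item["score"])["label"]


# Hypothetical scores for a single tweet.
example_scores = [
    {"label": "joy", "score": 0.72},
    {"label": "anger", "score": 0.15},
    {"label": "fear", "score": 0.13},
]
print(top_emotion(example_scores))
```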
Topic Modeling
We used the BERTopic model (Grootendorst, 2022), a transformer and c-TF-IDF-based technique, for extracting topics related to COVID-19 vaccinations. BERTopic generates interpretable dense clusters that retain important words and supports various types of topic modeling, including guided, semi-supervised, and dynamic. In addition, we utilized topic themes (J. C. Lyu et al., 2021) from Topic 1 to Topic 5 to classify each tweet using a zero-shot classifier and to analyze the topics’ performance over time. These themes include opinions and emotions about vaccines as a global issue, vaccine administration, and vaccine development and authorization progress. We tested the accuracy of the zero-shot classifier on a set of sample tweets and found that it accurately predicted the theme label, giving us confidence in using it to predict the theme of each tweet in our dataset.
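To make the c-TF-IDF step more concrete, the sketch below computes class-based term weights for already clustered, already tokenized tweets. It assumes the weighting w = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the count of term t in cluster c, f(t) is its count across all clusters, and A is the average number of words per cluster. This is a simplified reading of the BERTopic formulation and omits the embedding and clustering stages. All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Compute class-based TF-IDF weights.

    clusters maps a cluster id to a list of tokenized documents.
    Returns {cluster_id: {term: weight}}.
    """
    # Merge all documents of a cluster and count its terms.
    tf = {
        cid: Counter(word for doc in docs for word in doc)
        for cid, docs in clusters.items()
    }
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)  # A in the formula
    return {
        cid: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for cid, counts in tf.items()
    }


def top_words(weights, k=5):
    """Return the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


toy_clusters = {
    0: [["second", "dose", "received"], ["dose", "received"]],
    1: [["slot", "covaxin"], ["covaxin", "slot", "age"]],
}
weights = c_tf_idf(toy_clusters)
print(top_words(weights[0], k=2))
```

In practice the BERTopic library performs these steps internally; a call such as `BERTopic().fit_transform(docs)` returns the topic assignment for each document.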
Pro- and Anti-Vaccination Analysis
We employed zero-shot learning to classify each tweet as either anti-vaccine or pro-vaccine. This paradigm can identify unseen classes by establishing a link between seen and unseen classes based on prior knowledge of the unseen classes. Text classification is a natural language processing task in which the model must predict the class of a piece of text. Traditionally, a massive quantity of labeled data is required to train such a model, as unlabeled data cannot be used to make predictions. Combining zero-shot learning with text classification extends natural language processing beyond this limit. For instance, over sixty transformer models for zero-shot classification are available in the Hugging Face Transformers ecosystem.
We used the “bart-large-mnli” (facebook/bart-large-mnli · Hugging Face, n.d.) transformer in the zero-shot classification pipeline, which Facebook researchers built by training the bart-large model on the MNLI dataset. The approach works by treating the sequence to be classified as the NLI premise and deriving a hypothesis from each possible label. For instance, if we wanted to determine whether a sequence belonged to the class “politics,” we might formulate the hypothesis, “this text is about politics.” The probabilities for entailment and contradiction are then translated into label probabilities. We also performed a time-based analysis to show how user opinions have changed over time based on the predicted label.
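The conversion from NLI outputs to label probabilities can be illustrated with a few lines of arithmetic. In this sketch each candidate label is scored independently by a softmax over its contradiction and entailment logits. Whether the authors scored labels independently or normalized across labels is not stated, so this is one possible reading. The hypothesis template and the logits are made up for illustration.

```python
import math


def build_hypotheses(labels, template="This text is {}."):
    """Turn each candidate label into an NLI hypothesis."""
    return {label: template.format(label) for label in labels}


def entailment_probability(contradiction_logit, entailment_logit):
    """Softmax over the two logits; returns P(entailment)."""
    peak = max(contradiction_logit, entailment_logit)
    exp_c = math.exp(contradiction_logit - peak)
    exp_e = math.exp(entailment_logit - peak)
    return exp_e / (exp_c + exp_e)


def classify(nli_logits):
    """nli_logits maps label -> (contradiction_logit, entailment_logit)."""
    probs = {
        label: entailment_probability(c, e) for label, (c, e) in nli_logits.items()
    }
    return max(probs, key=probs.get), probs


print(build_hypotheses(["pro-vaccine", "anti-vaccine"]))
# Hypothetical logits for one tweet (premise) against two hypotheses.
label, probs = classify({"pro-vaccine": (-1.0, 2.0), "anti-vaccine": (1.5, -0.5)})
print(label)
```

The Hugging Face pipeline wraps these steps: `pipeline("zero-shot-classification", model="facebook/bart-large-mnli")` can be called with a text and a `candidate_labels` list.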
Results and Discussion
Analysis of Hashtag Usage in COVID-19 Tweets
The findings indicated that 44.32% of all evaluated tweets contained the hashtag #COVID-19, making it the most popular of all hashtags used (Figure 4). The second most popular hashtag was #Vaccine, which appeared in 6.82% of the evaluated tweets.

Usage of hashtags in tweets related to COVID-19. The data shows the percentage of tweets that contained each hashtag, with #COVID-19 being the most frequently used at 44.32% and other hashtags being used less frequently.
The study reveals that a significant fraction of the tweets included these hashtags. Additionally, the hashtags #Moderna and #Covaxin had notable use percentages of 20.21% and 18.65%, respectively. Other hashtags appeared in the studied tweets, although less often than those listed above.
This information may assist in comprehending the public’s interest in debating various COVID-19-related subjects. It might also assist in identifying the most popular vaccinations and medications mentioned by vaccine vendors.
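The hashtag usage rates reported in this section are shares of tweets that contain a given hashtag. A minimal way to compute such shares is sketched below. The regular expression, the case-insensitive matching, and the sample tweets are assumptions for illustration. Hyphens are kept inside a hashtag so that a tag written as #COVID-19 is counted as one token.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#([\w-]+)")


def hashtag_usage(tweets):
    """Return the percentage of tweets that contain each hashtag."""
    if not tweets:
        return {}
    counts = Counter()
    for tweet in tweets:
        # A set ensures a hashtag is counted once per tweet.
        counts.update({tag.lower() for tag in HASHTAG.findall(tweet)})
    return {tag: 100 * n / len(tweets) for tag, n in counts.items()}


sample_tweets = [
    "Got my jab #COVID-19 #Vaccine",
    "#covid-19 cases are falling",
    "Slots open today #Covaxin",
    "No hashtag here",
]
usage = hashtag_usage(sample_tweets)
print(sorted(usage.items(), key=lambda item: -item[1]))
```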
Understanding the Most Frequent Terms Used in Tweets: WordCloud Analysis
Initially, we endeavored to understand commonly used words by constructing wordclouds, graphical representations in which common words are shown in large font sizes and less frequent ones in smaller font sizes. Figure 5 depicts a wordcloud of all the terms in our normalized tweets, with prominent words including covaxin, moderna, slotscovaxin, age, and others.

Wordcloud of all normalized tweets in our dataset.
Insights into Twitter Sentiments: A Deep Dive into Public Opinion
To determine the sentiment of the tweets, we utilized multiple sentiment classifiers, including TextBlob, VADER, Flair, Transformer (twitter-roberta-base-sentiment-latest), Transformer (bertweet-covid-vaccine-tweets-finetuned), and an Ensemble method. These classifiers were used to classify the tweets into three categories: positive, negative, and neutral, as illustrated in Figure 6. Our results revealed that sentiment distribution varied among the six classifiers. The ensemble method, for instance, was demonstrated to be more accurate and reliable than any single classifier. As such, we used the sentiment results generated by the ensemble method to illustrate the percentage of tweets in each class.

Sentiment analysis results of six classifiers - ensemble.
As shown in Figure 7, approximately 43% of the tweets were classified as positive, 36% were neutral, and 21% were negative. On the basis of the analyzed tweets, it is possible that a substantial proportion of the public has a favorable opinion of the COVID-19 vaccine, while some tweets express neutral or negative sentiment. Negative sentiment may be associated with concerns regarding vaccine research, safety concerns, and responses to governments, political leaders, and manufacturers, whereas positive sentiment may be associated with scientific progress, medical guidance, and optimism. However, the representativeness of tweets to public opinion as a whole and the dependability of sentiment analysis results should be considered.

Sentiment polarity distribution.
The word cloud in Figure 8 illustrates the most used words in tweets that were classified as having negative sentiment. The word cloud is a visual representation of the frequency at which certain words appear in the dataset, where the size of the word represents its frequency. The word cloud for tweets with positive sentiment is composed more broadly with words expressing gratitude such as “thank you,” “thankful,” and “grateful.” This indicates that people are positive and appreciative of the vaccine in a general sense. This is a common trend seen in word cloud representation where negative tweets tend to be more specific and pinpoint certain actors, entities, or aspects of the subject, whereas positive tweets tend to be more general.

Wordclouds of COVID-19 sentiments are classified into negative, positive, and neutral categories.
Figure 9 depicts the number of tweets with positive, neutral, and negative sentiment by month. Throughout the months studied in this paper, vaccine positivity dominated Twitter. In March, mid-April, and July, there were also several notable increases in positive sentiment. Figure 10 illustrates the sentiment of tweets regarding various vaccine suppliers over time, including the date, number of tweets, and sentiment score. Analyzing tweets over time makes it possible to observe patterns in how people respond to different vaccine suppliers. Our analysis revealed that Pfizer initially received a more favorable response than other vaccine suppliers. However, in April and July of 2021, the Moderna vaccine also received positive reactions from Twitter users.
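The month-wise counts behind a timeline of this kind can be produced by grouping labeled tweets by calendar month. The following sketch assumes each tweet is available as a (date, label) pair. The record layout and the sample values are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_sentiment_counts(records):
    """Group (date, label) pairs by month.

    Returns {"YYYY-MM": Counter({label: count})}.
    """
    counts = defaultdict(Counter)
    for day, label in records:
        counts[day.strftime("%Y-%m")][label] += 1
    return dict(counts)


sample_records = [
    (date(2021, 3, 2), "positive"),
    (date(2021, 3, 15), "positive"),
    (date(2021, 3, 20), "negative"),
    (date(2021, 4, 1), "neutral"),
]
monthly = monthly_sentiment_counts(sample_records)
for month in sorted(monthly):
    print(month, dict(monthly[month]))
```

Filtering the records by a supplier keyword before grouping would give per-supplier timelines.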

Timeline displaying varying sentiments on COVID-19 vaccines.

Timeline displaying varying sentiments on COVID-19 vaccine suppliers. The sub-graphs labeled (a) through (f) demonstrate the sentiment timeline for several well-known suppliers of vaccines.
It is worth noting that social media data such as tweets are not always indicative of the general public opinion and can be limited by who is using Twitter and how they discuss the topic. In addition, the interpretation of such data is subject to the researchers’ bias. It is essential to consider other factors that may have contributed to the change in sentiment toward Moderna. For example, there may have been more information about the vaccine’s efficacy and safety that was reported in the news during these months, which could have influenced people’s perceptions of the vaccine. Also, the Moderna vaccine may have been approved in more countries worldwide or became more available, so more people were discussing it. It is always good to supplement this analysis with other data sources and types of data (like surveys, public opinion polls, and expert opinion) to obtain a more comprehensive understanding.
Figure 11 depicts the overall negative sentiment z-score of vaccination suppliers, with AstraZeneca receiving the most negative sentiment, which may have been due to the thrombotic thrombocytopenia findings. The negative publicity, however, may also be a result of AstraZeneca’s supply problems in the European Union.
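A z-score expresses how far each supplier's negative sentiment lies from the mean across suppliers, in units of standard deviation. The sketch below standardizes a set of negative-sentiment shares. The use of the population standard deviation, the supplier names, and the values are all placeholders, because the text does not describe how the z-scores were computed.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a {name: value} mapping to zero mean and unit variance."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation (an assumption)
    if sigma == 0:
        return {name: 0.0 for name in values}
    return {name: (value - mu) / sigma for name, value in values.items()}


# Placeholder shares of negative tweets per supplier (not real data).
negative_share = {"supplier_a": 0.10, "supplier_b": 0.20, "supplier_c": 0.30}
scores = z_scores(negative_share)
print({name: round(score, 2) for name, score in scores.items()})
```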

Vaccine supplier sentiments reflected by z-score.
Emotions Analysis Over Time
Figure 12 depicts the emotions expressed in regard to COVID-19 vaccination during 2021. At the beginning of 2021, joy (59%) was the most frequently expressed emotion, followed by anger (33%), fear (5%), sadness (3%), surprise (1%), and love (0%). As time passed, the number of tweets expressing emotions increased, but the shares of joy and anger fluctuated constantly. After March 2021, anger was consistently the second most prominent emotion in tweets about COVID-19 vaccines. Additionally, anger declined as joy increased. Joyful sentiments peaked in July of 2021, which can be linked to the date by which most vaccine suppliers had successfully delivered effective vaccines.

Timeline displaying several classes of emotions expressed in COVID-19 vaccine-related tweets.
Topic Modeling Results
In this study, the BERTopic modeling method was employed to comprehensively analyze the dataset. The technique yielded a total of eight distinct topics, each of which was represented by five representative words. To further illustrate the results of this analysis, the diagram presented in Figure 13 was created to depict the topic word relative score and demonstrates the minimal number of words necessary to effectively convey the meaning of a topic. The score serves as an indicator of the most important word or phrase within a given topic. This information is significant as it suggests that adding additional words to a topic may not significantly contribute to the overall understanding of the topic. Furthermore, the top terms presented in each topic can be used to assign a more descriptive and efficient name to each topic. This can be beneficial in simplifying the comprehension and interpretation of the data.

Topic word relative scores.
The themes of tweets related to COVID-19 vaccinations and vaccine suppliers change over time as new information is made available and people’s perceptions of the vaccines evolve. Figure 14 illustrates the fluctuations in themes of all tweets in our dataset from January 2021 to August 2021. Between January and February, the most frequently posted tweets belonged to T1 (vaccinations, biopharmaceutical companies, and others). This may be due to increased public interest in the early stages of the vaccine rollout and the approval and distribution of vaccines from companies such as Pfizer and Moderna. As the vaccine rollout continued and more individuals received their vaccinations, there was an increase in tweets referring to T3 (received, second doses).

Frequency of topics between January 2021 and August 2021.
This is evident from the peak in April 2021, which coincided with breakthroughs in vaccine research and an increased number of individuals receiving their second doses.
Furthermore, there was a noticeable increase in tweets related to T5 (bbmp, slotscovaxin) in July 2021. This may be correlated with an increase in COVID-19 vaccinations in India during this time, as “bbmp” and “slotscovaxin” likely refer to the state-run vaccination program. This could suggest that people in India were more active in discussing and seeking information about the vaccine rollout in their country during this period. Furthermore, we employed the zero-shot classifier to analyze the topics’ performance over time, as shown in Figure 15. We utilized the following list of themes: (i) opinions and emotions around vaccines and vaccination, (ii) knowledge about vaccines, (iii) vaccines as a global issue, (iv) vaccine administration, and (v) progress on vaccine development and authorization (J. C. Lyu et al., 2021). Our analysis showed that the most frequently discussed topic was knowledge about vaccines.

Theme analysis over time.
The second most frequently discussed topic was opinions and emotions around vaccines and vaccination, which includes both positive and negative perspectives on the use of vaccines, as well as the various emotions that individuals may experience when deciding whether to vaccinate themselves or their children.
The third most frequently discussed topic was the progress made in developing and authorizing vaccines, including the ongoing research and development of new vaccines and the regulatory processes in place to ensure the safety and efficacy of these vaccines. This topic also includes discussions about the approval and distribution of vaccines in different countries and the challenges faced in getting vaccines to those who need them most.
The fourth most frequently discussed topic was the global aspect of vaccines, including the impact of vaccination on population health, the role of governments and international organizations in promoting vaccination, and the challenges faced by low- and middle-income countries in achieving high vaccination coverage, which highlights the importance of vaccines not just as a personal health decision but as a global public health concern.
Finally, the least frequently discussed topic was vaccine administration, including the logistics of administering vaccines, the role of healthcare professionals in administering vaccines, and the importance of ensuring that vaccines are administered safely and effectively. This topic is essential as it highlights the practical aspects of vaccination and the importance of ensuring that vaccines are delivered to those who need them promptly and effectively.
Overall, our analysis has shown that the topic of vaccines and vaccination is multifaceted and encompasses various perspectives and issues. From opinions and emotions to global concerns and practical considerations, vaccines continue to generate ongoing discussion and debate.
It is important to note that these tweet patterns are inconclusive and may not represent public sentiment. However, by monitoring these themes over time, we can gain insights into how people’s perceptions and discussions about vaccines have evolved. It also helps to identify areas of concern that should be further investigated and addressed.
Pro- and Anti-Vaccination Tweets Analysis Over Time
An analysis was conducted to examine the public sentiment and discussions pertaining to vaccines on social media. Specifically, tweets that were classified as either pro-vaccine or anti-vaccine were analyzed, as illustrated in Figure 16. The objective of this study was to understand the trend in public opinion regarding vaccines on social media platforms.

Analysis of tweets regarding vaccination, including both those that are anti-vaccine and those that are pro-vaccine.
The results of the analysis revealed that the trend of anti-vaccine tweets reached its peak in March 2021 and again at the end of July 2021. This suggests that there was a notable increase in the number of tweets expressing anti-vaccine sentiments during these time periods. It is important to note, however, that it is challenging to determine the origin of these tweets, that is, whether they were generated by individuals or by organized groups.
Contrarily, the trend of pro-vaccine tweets began to rise after February 2021. This indicates that a growing number of Twitter users began to express support for vaccines and the vaccine effort. This rising trend in pro-vaccine tweets suggests that public opinion regarding vaccines became more favorable as the year progressed. This could be attributed to the increasing awareness of the efficacy and safety of vaccines, as well as the ongoing efforts to combat the COVID-19 pandemic.
Conclusions
Our primary contribution is a multimethod exploratory analysis that gives insight into public sentiment and emotions regarding vaccinations during the COVID-19 epidemic. In addition, our time series analysis has allowed for a comparison of public concerns about various vaccine brands. Our investigation revealed that positivity and hope were the predominant emotions throughout the epidemic. Furthermore, as the government adopted control measures and vaccination regulations in response to the epidemic and epidemiological shift, the themes discussed in our dataset also shifted. Overall, our research demonstrates that evaluating social media data might help gain a deeper understanding of public sentiments and concerns about COVID-19 vaccination at the national level, facilitating the formulation of reasonable but efficient regulations.
Limitations and Future Works
Our work also faced several limitations. We employed existing NLP techniques to examine various attributes in this study, such as emotions, sentiment, and topics, and there is no assurance that these current approaches accurately predict the actual attribute. In addition, emotion and sentiment are subjective, making them difficult to model, which may influence our perception and findings. Moreover, because our data was acquired from Twitter using specified keywords, additional online conversations and perspectives were most likely overlooked. In conclusion, the analysis of tweets provides valuable insights into the public sentiment and discussions surrounding vaccines on social media. The results of this study indicate that there are varying opinions on vaccines, but as the year progressed, more individuals began to express their support for vaccines. This information could be crucial for public health officials and policymakers as they work to promote vaccination and combat the COVID-19 pandemic. We utilized a zero-shot classification method to determine pro- and anti-vaccination sentiment, and we intend to improve its accuracy by incorporating human input that is transparent, objective, and dependable. In future research, expanding data collection to include social media networks such as Facebook, Reddit, Instagram, and others, coupled with the exploration of large language models (LLMs) and the development of effective prompts, holds the potential to further enrich the analysis of similar conversations surrounding COVID-19 vaccinations.
