Abstract
1. Introduction
The social network Twitter was created in 2006 and is now widely used, with around 500 million tweets per day covering all kinds of content. As such, Twitter has become a popular source for analysing the expression of sentiment in textual data [1–3]. Because tweets are limited to 280 characters, authors need to express opinions concisely. When tweets originate in specific interest areas, authors commonly resort to technical jargon and polysemous terms whose meaning depends on the tweet’s particular domain context. Consequently, sentiment analysis of domain-specific tweets is currently considered a challenging task [4].
Sentiment analysis, a natural language processing (NLP)–related task, employs several methods and tools to automatically detect relevant information in a text and determine the prevalent sentiment or opinion it expresses [5]. The most common approach to sentiment analysis, or polarity detection, consists of classifying the sentiment towards something as positive, neutral or negative [6]. Because sentiment analysis is highly domain-dependent, domain-specific knowledge is critical for obtaining good classification performance [7]. Most existing research on textual sentiment analysis adopts tools built for generic domains [8,9], which results in lower performance: domain-specific words are not taken into account, and the same word may express different sentiments in different domains. In fact, only very few approaches are specifically designed for financial and stock-market domains [10–12]. The stand-alone usage of a domain-specific dictionary presents its own shortcomings, given that it is created from a specific and relatively small data set. To overcome these shortcomings, a few works attempt to combine general and domain-specific dictionaries to provide a better sentiment classifier [13–15]. Despite the wide usage of general-domain, domain-specific or hybrid lexicons, we did not identify any other research that creates a specific and up-to-date dictionary to evaluate the sentiment of tweets related to stock markets.
This study investigates (a) whether sentiment analysis of stock-market tweets can be improved using a stock-market lexicon automatically constructed from Twitter and (b) how the performance of sentiment analysis of stock-market tweets is affected by the use of hybrid lexicons compared with general approaches. Moreover, we present an enhancement of the current reference financial dictionary, the Loughran–McDonald dictionary (LMcD) [10]. This updated lexicon, LMcD20, contains additional words not previously included in the lexicon and is automatically drawn from a corpus of recent stock market–related tweets. We also extend the existing body of knowledge by proposing SA-MAIS, a hybrid approach to sentiment analysis of domain-specific text that integrates general dictionaries with an up-to-date domain-specific sentiment lexicon. This methodology compares favourably with existing tools: our experiments show that it achieves better results than pre-trained models or the stand-alone usage of generalist or specialised dictionaries. SA-MAIS is available on GitHub, and the data set used to validate SA-MAIS and enhance LMcD is published in IEEE DataPort.
The article is organised as follows: after the introduction, a brief review of the most relevant literature is presented in section 2. Section 3 presents the objectives and the research questions. Section 4 presents the data set and SA-MAIS system’s architecture. Section 5 details the domain-specific dictionary enhancement. Section 6 describes SA-MAIS implementation. Section 7 highlights theoretical and practical implications as well as the limitations of this work. Section 8 discusses possible directions for further research and presents the main conclusions of this work.
2. Literature review
Sentiment analysis, also named opinion analysis [16], uses NLP and text analysis to explore sentiment’s valence in textual data (e.g. documents, tweets and direct messages). In general, sentiment analysis intends to evaluate the objectivity or subjectivity of a text and classifies it as positive or negative, and thus, this type of classification is considered a binary problem [17,18]. Saif et al. [19] created SentiCircle, a semantic sentiment representation of words. It captures the contextual data, based on the occurrence of tweets and updates the sentiment based on the contextual semantics.
Nofsinger [20] concluded that, due to the nature of stocks, the stock market is directly impacted by social mood, and this behaviour helps to predict ‘financial and economic activity’. Sul et al. [21] collected tweets where stock symbols (like AAPL for Apple Inc.) of S&P 500 companies were referred to and classified the sentiment in each tweet as positive or negative. The authors were able to show that their sentiment analysis was related to the firm’s stock returns. They also demonstrated that users with many followers directly impact the same day’s returns, while users with fewer followers impact future returns (10 days returns). In a nutshell, most of the previous works support the claim that public mood and sentiment expressed in word-of-mouth (WOM) impact stock-market prices. Bollen et al. [22] created a system that correlated Dow Jones Industrial Average with Twitter feeds. Chandra Pandey et al. [23] proposed a new clustering method to evaluate the sentiment of tweets. The proposed method outperforms five of the most well-known algorithms. In more recent work, Song et al. [4] describe a technique that combines supervised and unsupervised learning and uses a new text representation model named Word2PLTS for short-text sentiment analysis. This model is based on probabilistic linguistic term sets that fully describe the possibilities for the sentiment polarity of the word.
Hu et al. [24] proposed an approach to summarise the task of manually verifying customer reviews by reviewing the features related to the products in the data set. The authors created a domain-specific lexicon for customer reviews as an attempt to increase the performance of their lexicon. Loughran et al. [10] identified that previous approaches classified financial texts incorrectly. Thus, the authors created a domain-specific dictionary to classify the sentiment of financial texts more accurately. However, Li et al. [25] implemented a ‘generic stock price prediction framework’ using the Harvard IV-4 dictionary and the Loughran–McDonald financial sentiment dictionary [10] for sentiment analysis. The proposed generic framework was tested using 5 years of historical data on prices and news on the Hong Kong Stock Exchange. The authors concluded that models focusing only on positive and negative sentiment classes do not provide good predictions. Junqué et al. [26] used articles from all major Flemish newspapers between 2007 and March 2012. The authors concluded that sentiment analysis using Pattern for Python [27], a Python package with multiple functionalities including natural language processing, underperformed when compared with Bag-of-Words or market technical indicators. More recently, Oliveira et al. [28] noted that there is a lack of financial lexicons adjusted to micro-blogging stock markets. The authors proposed a new automatic procedure to create a lexicon based on the StockTwits data set.
Li et al. [29] combined technical indicators with news articles as an attempt to predict Hong Kong stock prices. Four dictionaries were used to perform sentiment analysis on the news articles, namely, Harvard IV-4 Dictionary, Loughran–McDonald financial sentiment dictionary [10], SentiWordNet 3.0 [30] and SenticNet 5 [31]. The authors concluded that Loughran–McDonald financial sentiment dictionary outperformed the remaining dictionaries.
In relation to the specificity of a tweet’s textual message, several authors use Twitter hashtags (i.e.
In 2018, Devlin et al. [40] proposed a new pre-trained model called BERT (Bidirectional Encoder Representations from Transformers). One of the main advantages of this model is its flexibility, given that a BERT model can be fine-tuned for a specific task. Thus, to create new models derived from BERT, only one output layer needs to be added. Quoc Nguyen et al. [41] proposed a BERT tweet model for sentiment analysis (BERTweet), a model based on BERT trained with more than 40,000 tweets. FinBERT was proposed by Araci [42] based on BERT, with the model specialised in financial news. FinBERT was trained with news from Reuters and the Financial PhraseBank data set. Li et al. [43] used FinBERT to analyse news headlines and compared it with other long short-term memory (LSTM) models. The authors concluded that FinBERT outperforms the other models. The roBERTa Twitter sentiment analyser (Twitter-roBERTa) is a model based on BERT, trained on more than 58 million tweets and fine-tuned with the TweetEval benchmark proposed by Barbieri et al. [44].
3. Objectives
Our goal is to provide a proof of concept through a case study: the evaluation of the sentiment of tweets directly related to stock markets.
According to the existing literature, most of the tools that try to predict the market’s behaviour use either generalist approaches or domain-specific dictionaries to quantify the sentiment of tweets or other important sources of data (e.g. news). In the particular case of stock market–related tweets, to the best of our knowledge, there are no domain-specific dictionaries for sentiment analysis. The following section presents a new simple method for upgrading an existing financial dictionary. Regarding generalist tools, we have chosen to use VADER, TextBlob and Stanza [8], together with the state-of-the-art, specifically trained language models BERTweet [41], Twitter-roBERTa and FinBERT [42]. These models have been part of the most recent studies in sentiment analysis, not only in financial domains [39] but also in generic domains [45].
As such, the present work addresses the following research questions:
To answer these questions, we show that a methodology that seeks to unveil the polarity of a specific domain’s short-textual messages must incorporate up-to-date domain knowledge.
4. Methodology
During the analysis of our data set, it became evident that some words like ‘breakthrough’ and ‘barred’ were not adequately classified by generalist sentiment analysers (GSAs). In the past few years, either domain-specific dictionaries or lexicons have been explored for sentiment analysis. Different applications found that if the data set is specific enough on a particular topic, such as finance, a domain-specific dictionary may improve the results. We address the problem by proposing a new approach, SA-MAIS, a sentiment analyser that differs from previous tools because it combines a generalist tool and a domain-specific dictionary. SA-MAIS system’s architecture is depicted in Figure 1. The methods and definitions created for the system’s implementation are detailed later in section 6.

Figure 1. SA-MAIS system architecture.
The proposed methodology follows the commonly established framework for mining sentiment in tweets: data collection, pre-processing (removal of numbers, emails, hashtags and hyperlinks), performing sentiment classification and validating the model results.
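The pre-processing step listed above can be illustrated with a minimal sketch. The regular expressions and the function name `preprocess_tweet` are assumptions made for illustration; the text only states which kinds of tokens are removed, not how.

```python
import re

# Illustrative patterns (assumptions): the article lists the token types
# that are removed, but not the exact rules used to detect them.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
HASHTAG_RE = re.compile(r"#\w+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def preprocess_tweet(text: str) -> str:
    """Remove hyperlinks, emails, hashtags and numbers, then tidy whitespace."""
    # URLs are removed first so that their fragments are not matched
    # by the later patterns.
    for pattern in (URL_RE, EMAIL_RE, HASHTAG_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    # Collapse the runs of spaces left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess_tweet("Buy $AAPL now! https://t.co/abc #stocks 123 me@x.com"))
```

In this sketch cashtags such as `$AAPL` are deliberately left untouched, since the data set is built around them; whether the original pipeline keeps them is not stated.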
To validate SA-MAIS as a short-text sentiment analyser, we explore and compare six well-known generalist sentiment classifiers: TextBlob, Stanza, VADER, BERTweet, Twitter-roBERTa and FinBERT. The first one, TextBlob, is a pre-trained Python library for NLP that returns two values for sentiment analysis: the text’s polarity and its subjectivity. Subjectivity measures how far the text departs from objective statement and is therefore subject to individual perception and opinion, whereas polarity expresses the tweet’s overall sentiment and is evaluated in the range
BERTweet is a model created by Quoc Nguyen et al. [41] that was trained with SemEval 2017 corpus (around 40,000 tweets). The model classifies the sentiment in three different classes: POS, NEG and NEU, respectively, positive, negative and neutral. Twitter-roBERTa was trained on approximately 58 million tweets and fine-tuned for sentiment analysis with the TweetEval benchmark [44]. The model produces three labels to classify the sentiment, 0 that represents negative sentiment, 1 that represents neutral sentiment and 2 that represents positive sentiment. FinBERT was proposed by Araci [42], and it is specialised in financial news. This model was fine-tuned using the Financial Phrasebank by Malo et al. [47]. The model produces three labels as output: positive, negative and neutral.
At this point, it is important to note that, to the best of our knowledge, (1) there is no manually annotated stock-market tweet data set covering the three possible sentiment classes (positive, neutral and negative) and (2) there are no benchmark data sets with a large enough number of tweets related to the S&P500 and its firms.
To this end, we collected and filtered about 928,000 stock-market tweets posted between 9 April 2020 and 16 July 2020 that mention the stock symbol (cashtag) of one of the 25 highest-volume companies in the S&P500 index, $SPX or #stock. This time window was chosen to reduce possible time or seasonal patterns (e.g. an uptrend of the sentiment) that could affect the experiments and, consequently, the results. Table 1 shows the distribution of the companies and tweets per economic sector.
Table 1. Number of tweets and companies per sector in the data set.
Creating an annotated data set for a domain-specific task is time-consuming and is subject to a high degree of subjectivity on the part of the annotators. For example, Li et al. [48] used two annotated data sets to propose a new sentiment analysis of user reviews using deep learning models; their validation sets contained 2210 and 802 reviews, respectively. Mowlaei et al. [49] proposed a new aspect-based sentiment analyser. To validate this model, the authors used a data set containing 367 positive reviews and 267 negative reviews, making a total of 634 reviews. We did not identify any manually annotated tweet data set specialised in the stock-market domain. Therefore, we manually annotated a random sample of 2100 tweets. Table 2 shows the distribution of the companies and tweets per economic sector for the manually annotated tweets. The tweets were manually classified using the three available sentiment values: positive, neutral and negative. The annotation was performed by two independent annotators, who are experts in stock markets, as an attempt to reduce subjectivity while labelling the data set. Cohen’s kappa [50] statistic was used to quantify the inter-annotator agreement. This measure is in the range of
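As a reminder of how the inter-annotator agreement statistic is obtained, the following minimal sketch computes Cohen’s kappa from two label lists using the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the toy labels are illustrative and are not taken from the annotated data set.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    annotator_1 = ["pos", "pos", "neg", "neu"]  # toy labels, not real data
    annotator_2 = ["pos", "neg", "neg", "neu"]
    print(round(cohen_kappa(annotator_1, annotator_2), 3))
```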
Table 2. Manually annotated data set distribution.
To keep SA-MAIS as up to date as possible based on word frequency, we created another data set with almost 1 million tweets, providing large-scale quality data for the analysis. Both data sets have been made publicly available [51].
5. Lexicon improvements
As previously pointed out, Loughran and McDonald created sentiment lists based on the most probable interpretation of a word in a business context, resulting in two dictionary lists that contain 354 positive and 2329 negative words [10]. From now on, these dictionary lists will be referred to as the LMcD, which is nowadays a reference financial and accounting dictionary [52]. However, it was constructed from financial accounting texts to enhance sentiment analysis in that specific domain.
This section describes the changes introduced into the LMcD lexicon in order to improve its representative power for stock markets’ analysis (section 5.1) and highlights the results of the newly enhanced dictionary (section 5.2).
5.1. LMcD20: domain-specific dictionary enhancement
A deeper exploration of the second stock-tweet data set (the non-annotated data set) highlighted that some specific and frequent stock market–related words (like ‘breakthrough’ and ‘barred’) were not included in the generalised sentiment analysis tools, leading to the need to use domain-specific dictionaries. However, a more in-depth exploration of the LMcD also revealed that some of the words currently used in financial tweets were still not present in the LMcD. We have also noted that words used to express opinions on Twitter are subject to subjective interpretation and vary over time. Therefore, a Twitter domain-specific dictionary cannot be static but must be adjusted over time based on real and up-to-date content.
To solve this problem, we used the more recent stock-tweet data as a means to improve our lexicon. Candidate words not included in LMcD were taken from the 500 most frequent words extracted from the large-scale data set. From these candidates, the authors manually selected the words expressing sentiment: 23 words expressed a positive sentiment (such as ‘buy’ or ‘bull’), and 13 words expressed a negative sentiment (like ‘short’ or ‘bear’). The resulting 36 new finance-related words (Table 3) were added to the LMcD dictionary, thus creating LMcD20.
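The candidate-extraction step described above can be sketched as follows. This is only an illustration under stated assumptions: the tokenisation rule, the function name `candidate_words` and the toy inputs are hypothetical, and the final choice of sentiment-bearing words remains a manual step, as in the text.

```python
import re
from collections import Counter


def candidate_words(tweets, known_words, top_n=500):
    """Return the top_n most frequent words that the lexicon does not contain.

    The returned list is meant for manual review; no polarity is assigned here.
    """
    counts = Counter(
        token
        for tweet in tweets
        # Assumed tokenisation: lower-cased alphabetic runs only.
        for token in re.findall(r"[a-z]+", tweet.lower())
    )
    most_frequent = [word for word, _ in counts.most_common(top_n)]
    return [word for word in most_frequent if word not in known_words]


if __name__ == "__main__":
    toy_tweets = ["Buy buy bull", "loss bear buy"]  # toy data, not the corpus
    toy_lexicon = {"loss"}                          # stands in for LMcD
    print(candidate_words(toy_tweets, toy_lexicon, top_n=10))
```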
Table 3. LMcD20 – positive and negative words added to the Loughran–McDonald dictionary.
5.2. Comparing different lexicons
In terms of model evaluation, that is, validation of the results of positive, negative or neutral classification of words, and due to the slight imbalance between the negative class and the remaining ones, the metric selected to compare the experiments is the weighted average recall
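The metric can be illustrated with a short sketch. It assumes the usual support-weighted definition of weighted average recall, in which each class recall is weighted by the share of true instances of that class (the same quantity scikit-learn reports for `recall_score` with `average="weighted"`). The function name and toy labels are illustrative only.

```python
from collections import Counter


def weighted_average_recall(y_true, y_pred):
    """Support-weighted mean of the per-class recalls."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    support = Counter(y_true)  # number of true instances per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    # recall_c = hits_c / support_c, weighted by support_c / n.
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)


if __name__ == "__main__":
    truth = ["pos", "pos", "neg", "neu"]       # toy labels, not real data
    predicted = ["pos", "neg", "neg", "neu"]
    print(weighted_average_recall(truth, predicted))
```

Under this definition the weights cancel the per-class denominators, so the value coincides with overall accuracy; the per-class recalls are still worth inspecting when classes are imbalanced.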
As an initial experiment, a comparison between domain-specific lexicons was performed. Among domain-specific dictionaries, that of Hu et al. [24] (Sentilex) is one of the most well known and was created for the analysis of customer reviews. The lexicons of Oliveira et al. [11] (stock-market sentiment lexicon (SMSL)) and Loughran and McDonald [10] (LMcD) are examples of financial dictionaries created to be used in specific economic contexts. The lexicon of Mohammad and Turney [53] (NRC) is one of the most well-known domain-specific lexicons incorporating sentiment polarity and emotions in the same lexicon, built in a crowd-sourcing scenario. Loughran and McDonald created the LMcD dictionary in 2011 to evaluate financial reports from companies. LMcD contains two subsets of words: the negative subset, which includes 2329 words, and the positive subset, which contains 354 words [10]. Many words related to financial markets can be found in the dictionary, but unlike tweets, financial reports are very well structured and carefully written.
The SMSL dictionary was created in 2016 with the primary objective of evaluating StockTwits sentiment [11]. This dictionary contained 20,551 uni-grams and bi-grams and was generated automatically based on a StockTwits sample using a statistical approach. Nonetheless, the validation of SMSL was performed using a selection of 5000 StockTwits classified by the authors of the posts but excluded the neutral sentiment. Each
Figure 2 shows the results of sentiment analysis using each domain-specific dictionary. It is possible to observe that SMSL shows the worst WAR (40.0%). One of the reasons for SMSL results may be that it was automatically generated using StockTwits. This platform is different from Twitter and the language employed is much more targeted, very specific and directed mainly to financial markets readers. A second reason is that SMSL uses uni-grams and bi-grams generated automatically with the StockTwits data set. Notice that the two best domain-specific lexicons were combined, namely, Sentilex and LMcD. Based on Figure 2, Sentilex combined with LMcD outperformed the LMcD by 1.4 p.p. Comparing the LMcD against the remaining domain-specific lexicons (NRC and Sentilex), LMcD outperformed both by 8 p.p. and 0.5 p.p., respectively.

Figure 2. Comparing lexicons – weighted average recall (WAR).
In an attempt to improve the overall WAR, LMcD20 was created as previously described in section 5.1. Notably, the stand-alone use of LMcD20 achieved the best performance (65.9%) compared with the remaining dictionaries. In particular, LMcD20 shows an improvement of 11 p.p. over LMcD. There was a decrease in the WAR when combining Sentilex and LMcD20. The main reason for this decrease is that LMcD20 is domain-specific for stock-market tweets, whereas Sentilex is domain-specific for customer reviews.
Although the results of LMcD20 can be considered satisfactory, a careful analysis reveals that a uni-gram dictionary, such as LMcD20, still has limitations in tweet sentiment classification. Therefore, we combined a GSA with this domain-specific dictionary by implementing SA-MAIS, as described in section 6.
6. SA-MAIS: a hybrid method for the analysis of stock-market tweets
This section details the implementation of SA-MAIS. Section 6.1 outlines how the GSAs and domain-specific dictionaries are combined, and section 6.2 highlights the results achieved with this implementation.
6.1. Classifying tweets
To achieve the best performance, SA-MAIS combines a generalist sentiment classifier (such as the TextBlob library, the Stanza toolkit, VADER, BERTweet, Twitter-roBERTa or FinBERT) with a domain-specific lexicon (like SMSL, NRC, Sentilex, LMcD or the enhanced LMcD20 dictionary). The technique relies on using both tools for textual analysis and integrating the resulting classifications using a convex combination.
6.1.1. GSA component
As previously mentioned, TextBlob returns the polarity expressing the tweet’s overall sentiment: a value in the interval
This means that, independently of the tool being used, the sentiment polarity given by the GSA component of SA-MAIS is a value
6.1.2. Domain-specific dictionary component
The domain-specific component of SA-MAIS classifies each tweet by comparing its content with a domain-specific dictionary at the word level. Let
In case none of the tweet words matches the domain-specific dictionary, the output of the domain-specific dictionary is zero.
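A word-level dictionary component of this kind can be sketched as follows. The scoring rule used here (positive matches minus negative matches, divided by the total number of matches) is an assumption made for illustration and is not necessarily the definition given by the article's equations; only the zero output for tweets with no matching word is taken from the text. The function name and word sets are hypothetical.

```python
def dictionary_score(tokens, positive_words, negative_words):
    """Score a tokenised tweet against a polarity lexicon, in [-1, 1]."""
    positives = sum(token in positive_words for token in tokens)
    negatives = sum(token in negative_words for token in tokens)
    matched = positives + negatives
    if matched == 0:
        # No tweet word matches the dictionary: the output is zero.
        return 0.0
    # Assumed rule: net polarity normalised by the number of matched words.
    return (positives - negatives) / matched


if __name__ == "__main__":
    positive = {"buy", "bull"}    # toy subsets, not the full LMcD20 lists
    negative = {"short", "bear"}
    print(dictionary_score(["buy", "the", "bull", "bear"], positive, negative))
```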
6.1.3. SA-MAIS sentiment value integrator
The output of SA-MAIS is a linear convex combination of the two previous components, given by equation (4), where
Finally, equation (5) defines the categorical output of SA-MAIS. Each tweet is classified as negative, neutral or positive, according to the value of
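The integration step can be sketched as a convex combination followed by thresholding. Several details are assumptions for illustration: the parameter name `alpha`, the choice that `alpha` weights the GSA component (so `1 - alpha` weights the dictionary component), and the neutral band `threshold` are hypothetical and may differ from equations (4) and (5).

```python
def sa_mais_score(gsa_score, dict_score, alpha):
    """Convex combination of the two component scores (both assumed in [-1, 1])."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    # Assumption: alpha weights the generalist analyser.
    return alpha * gsa_score + (1.0 - alpha) * dict_score


def to_label(score, threshold=0.05):
    """Map a continuous score to a class; the threshold value is hypothetical."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    combined = sa_mais_score(gsa_score=0.6, dict_score=-0.2, alpha=0.5)
    print(combined, to_label(combined))
```

With this parameterisation, `alpha = 1` reduces the system to the stand-alone generalist analyser and `alpha = 0` to the stand-alone dictionary, which makes it straightforward to compare both baselines and intermediate mixtures in one sweep.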
6.2. Results
The first experiment is meant to establish the baseline and consists of the evaluation of the performance of TextBlob, Stanza, VADER, BERTweet, Twitter-roBERTa and FinBERT as stand-alone analysers on the annotated data set. This means that the experiment is performed by setting the parameter
As shown in Figure 3, VADER clearly outperformed the remaining tools, achieving a WAR of approximately 64.0%, versus 52.0% for BERTweet, the best of the remaining models. Thus, of the compared tools, VADER stands out as the most suitable classifier for stock-market tweet sentiment analysis.

Figure 3. Comparing the performance of the generalist sentiment analysis tools on our data set.
Comparing the classification confusion matrices of TextBlob and VADER (Tables 4 and 5), it is possible to see that TextBlob has a higher failure rate in the positive sentiment class than when predicting a neutral or a negative sentiment class. On the contrary, although VADER shows a more homogeneous confusion matrix, it is possible to see that it fails mainly in classifying negative sentiment tweets. Regarding BERTweet’s confusion matrix (Table 6), the model fails mostly in the positive and negative sentiment classes, classifying many of these tweets as neutral. Therefore, it seems that the model is either missing context or words with domain-specific polarity to achieve a proper classification for the financial stock-market tweets.
Table 4. TextBlob – confusion matrix.
Table 5. VADER – confusion matrix.
Table 6. BERTweet – confusion matrix.
With the previous results in mind, the SA-MAIS integrated approach was evaluated using VADER as the generalist component combined with LMcD20, the lexicon that achieved the best results in the experiments in section 5.2.
As can be observed in Figure 4, the WAR of sentiment classification increased from 65.9% with LMcD20 and 64.0% with VADER to 71.8% with VADER + LMcD20. Thus, there is an overall increase of about 6 p.p. compared with the stand-alone LMcD20 and about 8 p.p. compared with the stand-alone VADER. Notice that the best result was achieved at

Comparing the performance of LMcD20 and VADER for different values of
The overall metric values for SA-MAIS using VADER plus LMcD20 with
Metrics for each sentiment class using VADER + LMcD20 with
To better illustrate what can be a positive, neutral and negative tweet, Table 8 shows an example of a correct prediction of SA-MAIS for each of these sentiment classes.
Table 8. Examples of correct SA-MAIS predictions.
7. Contributions and practical implications
This work proposes SA-MAIS, a hybrid parametric approach for the analysis of sentiment polarity of short-text messages that combines a general sentiment analyser with an up-to-date domain-specific lexicon.
First, the study offers novel insights into domain-specific sentiment analysis of short-text social media. It describes a new methodological approach for timely analysis and shows that (a) a simple yet effective way of incorporating up-to-date vocabulary from domain-specific short text provides added value for classification tasks and that (b) the usage of this enhanced lexicon improves existing general sentiment analysis, providing a more accurate tool for analysis of textual sentiment in a specific technical domain language. In particular, we describe a proof of concept of this methodology by applying it to stock-market tweet analysis and prediction. Although this work focuses on stock markets, the general approach may be applied in different domains, whether financial or not, by substituting the specific dictionary and using tweets from the chosen domain area to enhance the new dictionary.
Second, it contributes to machine learning and text mining research by providing a novel annotated stock market–related corpus to benchmark new approaches and techniques. Third, by comparing the performance of several existing generalist tools, it shows that the latter, on their own, are mostly inadequate for accurate and precise classification of sentiment for stock market–related tweets.
8. Conclusion
A new sentiment analyser method, SA-MAIS, using a framework based on the controlled integration of a GSA and a domain-specific dictionary has been presented. This system combines the well-known GSA VADER with a domain-specific lexicon, LMcD20, updated with the more recent lexical trends. An enhanced version of the LMcD financial lexicon, named LMcD20, that incorporates newer and up-to-date specific finance-related words automatically retrieved from stock-market tweets was also presented.
The SA-MAIS hybrid combination of generalist and domain-specific analyses was comprehensively tested using six popular GSAs (TextBlob, Stanza [8], VADER [9] and the specifically trained state-of-the-art models BERTweet [41], Twitter-roBERTa and FinBERT [42]) together with four existing domain-specific dictionaries: the LMcD financial sentiment dictionary [10], the SMSL [11], Sentilex [24] and NRC [53]. As a proof of concept, after running several experiments, it was possible to conclude that the novel enhanced dictionary LMcD20 shows an increase in WAR results of about six percentage points for the Twitter stock market–related corpus. Furthermore, the SA-MAIS implementation integrating VADER with LMcD20 improves the former results over all the possible classification classes: positive, negative and neutral. These results indicate that SA-MAIS can be used as a tool in more elaborate systems for market evolution prediction, as it outperforms the state of the art in terms of NLP models using deep learning.
Finally, all the experiments were conducted using a novel annotated corpus publicly available at https://github.com/taborda11/SAMAIS. In terms of further directions for research, this study inevitably presents some limitations. Ideally, the 2100 annotated documents data set should be extended, since further manual annotation of tweets would not only allow for the enrichment of the annotated corpus but would be an important asset for future enhancement of LMcD20, eventually leading to an increase in SA-MAIS sentiment classification results. Second, more complex or purposeful dictionaries, possibly representing relations between words and additional linguistic information could be pivotal for improving the results presented here. Third, performing online tests using SA-MAIS could improve this tool’s performance and expand its scope.
