Abstract
Introduction
Fake news is a problem. It is a Big Data problem. We are trying to solve it with small amounts of data.
Those are, in a nutshell, the three main points of our paper. We will not retread the familiar territory covered by many recent papers, reports and news media articles regarding how fast fake news spreads; why the presence of fake news is a problem because so many people get their news through online sources; and how the inability to trust news is a problem for democracy. We will only provide brief literature reviews of how each of those aspects has been addressed in recent literature. Our focus is on providing technological solutions to a problem that has been not necessarily created, but certainly exacerbated, by technology. We provide a comprehensive account of fake news detection as a text classification problem, to be solved using natural language processing (NLP) tools, and show that, in our experiments with two general classes of algorithms, fake news articles are detectable, especially given enough training data. And this need for data leads to our call to arms to the research community, to news media and social media companies: We want your fake news data.
In this introduction, we first define and delimit the problem and its historical roots. Then in the section on Approaches to the fake news problem, we discuss general approaches, from multiple points of view (educating the public, stopping the spread, human and automatic identification). The approach we take concentrates on automatic identification by using the
Let us start, then, with defining the ‘fake news problem’. In the most recent incarnation of this problem, and especially during the 2016 US presidential election, the problem refers to the creation and spread of news articles that favoured or attacked one of the two main candidates, Hillary Clinton and Donald Trump. In a more general sense, the issue is one of disinformation (false information that is purposely spread to deceive people) and misinformation (false or misleading information; Lazer et al., 2018), but it also includes the bias that is inherent in news produced by humans with human biases. Lazer et al. (2018: 1094) define this most recent phenomenon as ‘fabricated information that mimics news media content in form but not in organizational process or intent.’ More comprehensive definitions and classifications of different types of falsehoods and disinformation can be found in Jack (2017), Shu et al. (2017), or Wardle (2016).
Some researchers and media analysts object to the use of the term
Historically, misinformation has been seen as the normal state of affairs, and news sources were routinely considered untrustworthy. Virginia Woolf, in
What makes present-day fake news most alarming is the speed with which it spreads. It is worth noting that concern about the role of technologies in the spread of false, inaccurate or misleading information has quickly followed the invention of such technologies. Darnton (2017) discusses how, in 18th-century London, invented stories or gossip made it into newspapers which had just begun to circulate among a broad public. A
It is, however, undeniable that social media have enabled the speediest form of spread in this long history. Vosoughi et al. (2018) studied the spread of stories through Twitter, and found that false stories diffused significantly farther, faster, deeper and more broadly than true stories, and, within the false stories, political stories had the fastest and broadest spread. Most of the spread was viral, i.e., it was distributed not centrally, but through peer-to-peer diffusion. There are potential psychological explanations for this, as fake news articles tend to be more novel, or more shocking, and we as humans are attracted to such stimuli, perhaps the result of negativity bias, our tendency to pay more attention to negative events (Rozin and Royzman, 2001). That is why sensationalist stories sell newspapers (Glogger et al., 2016; Sachsman, 2017).
The speed and extent of the spread is probably directly tied to political and financial incentives, both online and offline. The political incentives are clear: Voters (and even non-voters) wish for their preferred candidate to win an election, be successful in political initiatives, or pursue and defend their political agenda. More complex is the role of ideologues, conspiracy theorists and hate groups (Marwick and Lewis, 2017). The financial incentives have complicated matters online, as many fake news producers do so simply for financial gain. The widely discussed case of the ‘Macedonian teenagers’ illustrates this. Teenagers in a small Macedonian town created website content in many areas, including health, sports, and politics. They found that US politics provided the most ad revenue and, within US politics, pro-Trump stories were the most profitable, so they diligently set out to post such stories. They plagiarized the content of those stories from American fake news sites (Silverman and Alexander, 2016; Subramanian, 2017).
The effects of fake news in specific situations are being documented. For instance, Allcott and Gentzkow (2017) analysed browsing data, archives of fact-checking websites and online surveys to conclude that social media was an important source of news for many Americans during the 2016 US presidential election. They attempted to quantify the amount of exposure to and engagement with fake news stories, and concluded that those stories were widely shared, and tended to feed confirmation bias: People were more likely to believe stories that favoured their preferred candidate. This is certainly a problem in the specific situation of the election, because the results of a particular election may have been affected by stories that were proven to be false. It is, however, a much more general problem as it promotes the impression that we cannot believe anything we find online, in printed media, or in radio and television broadcasts. That is a problem for democracy, because it erodes trust in public institutions.
Now that we have briefly discussed the problem, its spread and effects, we will move, in the next section, to what approaches have been taken so far in tackling it.
Approaches to the fake news problem
The root causes, the spread and the consequences of fake news are all complex issues. One can take multiple approaches and, indeed, individuals, researchers and organizations have undertaken efforts to address the issue. Lazer et al. (2018) propose interventions along two lines: empowering individuals to evaluate potential fake news they encounter; and structural changes to stop or minimize exposure to such ‘news.’ We would like to break those down a bit further, into: (1) educating the public; (2) analysing and curtailing the spread; (3) performing manual checking; or (4) performing automatic fact-checking and classification. We agree with Lazer et al. (2018) that this is a problem that requires an interdisciplinary approach. In this section, we provide brief descriptions of the possible interventions, but, in the rest of the paper, will focus on the last one: performing automatic text classification to determine whether a news article seems to contain fake or false information.
Educating the public
Education efforts can be enhanced, starting at the school level, with media literacy, and a general education towards empowering a responsible citizenry, raised in civil and democratic values, that is also able to understand the competing pressures of capitalist societies, including the influence of lobby groups, political parties, and the simple financial gain of creating online content that generates advertising revenue for the creator (and of course for the hosting site). We should pause and think for a moment about what the internet would have been like had it not taken the route of using advertising as a form of revenue.
More focused forms of education concentrate on news and sources of news specifically, such as an infographic prepared by the International Federation of Library Associations and Institutions, which encourages readers to examine the source, read beyond a headline, or ensure that the content is not meant to be humorous or satirical. 3 Another excellent initiative is the course designed by Harvard Kennedy School’s Shorenstein Centre, aimed at both journalists and the general public, and providing tools to verify information. 4
Calling for better education, whatever form it takes, is, nonetheless, an easy way out, and one that places undue burden on the individual to acquire such education.
It is not our place to advise governments on how to create and administer education policy. We would like, however, to offer a caveat: the evidence supporting the belief that a higher level of education makes news consumers more resistant to outrageous claims is not conclusive. For instance, Allcott and Gentzkow (2017) found that level of education was not statistically significantly associated with how likely readers were to believe an ideologically aligned story (but people with higher education tended to have more accurate beliefs about news). Furthermore, Greenhill and Oppenheim (2017) found that education, income, age and gender, which they describe as commonly cited factors in receptivity to rumours, did not seem to correlate with how likely people were to believe a rumour.
Analysing and curtailing the spread
Fake news spreads fast. It spreads faster and penetrates social networks to a larger extent than credible news (Mustafaraj and Metaxas, 2017; Vosoughi et al., 2018). This may be due to its novelty, its capacity to generate outrage (which generates attention), or its role in confirming the preexisting biases of the reader. The novelty and outrage may explain why Facebook’s effort of flagging debunked fake stories backfired (Constine, 2018). Users actually shared flagged stories more.
Part of the problem resides in echo chambers or filter bubbles, which means that some people will be exposed to only one point of view, and will find it easier to believe stories that reaffirm that point of view (Bechmann and Nielbo, 2018; Del Vicario et al., 2016; Greenhill and Oppenheim, 2017). This is why, in their agenda for research and action, Lazer et al. (2017) encourage communication online across partisan or ideological lines.
We also know that people tend to remember facts and events that have been repeatedly mentioned, even when the repeated mention is in the context of a retraction or myth debunking (Ecker et al., 2017; Swire et al., 2017). It makes sense, then, to stop false information in its tracks, before it reaches too many people and becomes entrenched in their minds. Research in this area includes linguistic signals of a rumour (Zhao et al., 2015) and models of spread, which help in determining how to contain it, and how many fact-checkers are needed to contain a hoax (Tambuscio et al., 2015). Hoaxy, an open platform to study misinformation and fact-checking on Twitter, is useful in modelling how to disrupt the spread of a rumour (Shao et al., 2018).
Despite popular belief that bots play a crucial role in spreading misinformation, Vosoughi et al. (2018) found that rumours spread with the same speed, depth and reach whether they were originated or retweeted by bots or by humans. Therefore, while identifying bots may be useful, humans are still a major source of misinformation spread.
Manual checking
Manual checking of false statements, rumours and fake news articles online plays a vital role in containing the spread. Two broad classes of efforts can be identified: using fact-checking websites, and performing manual checking on specific social media sites.
Fact-checking websites (e.g., Snopes, Politifact, Emergent) provide verification of claims that they find, or that users submit. They have the advantage of using qualified journalists and other professionals, who are able to research and verify claims and news stories. They do have some downsides, however. The first is that, as with education, the process places the responsibility on the individual. Lazer et al. (2018) also point out that people may not be likely to fact-check a story that aligns with their pre-existing beliefs. Fact-checking could even be counterproductive, as fact-checking a story or a rumour leads to familiarity with it, and familiarity breeds not contempt but acceptance (Berinsky, 2017; Ecker et al., 2017; Pennycook et al., 2018; Swire et al., 2017). Lewandowsky et al. (2012) recommend that, if a myth or rumour is to be debunked, it not be repeated. The correct facts should rather be reported, without mentioning the false information.
Large technology companies and social media sites have responded to social pressure and the common belief that they played a role in abetting, or at least not curtailing, the spread of fake news by announcing that they will hire (more) content moderators. Human monitoring is desirable, because it ensures that claims are accurately verified. It has many potential pitfalls, however, ranging from the possibility that moderators’ bias will be propagated to the mental toll placed on individuals performing the checking (Chen, 2017). Facebook partnered with fact-checking organizations to reduce and contain the impact of fake news. A recent report on the partnership (Ananny, 2018) documents mixed success, stemming from a lack of common goals. Partners also worried that the effort was not transparent, to them or to the wider public. Pavleska et al. (2018) have also documented the problems with fact-checking organizations, including: lack of coordination among each other; excessive reliance on human expertise without, in some cases, a plan for long-term sustainability; or an absence of measures of impact.
The Credibility Coalition 5 is developing a framework for
Automatic checking
There are clear benefits to performing verification automatically: It can be done at scale, and it saves moderators from having to sort through content that is, at best, unpleasant. This form of automatic checking is about the content and claims in the story itself, not about metadata such as source or rate of spread.
Computational fact-checking attempts to find unverified claims in a story or rumour, and check them against reliable sources. Ciampaglia et al. (2015) find factual information by transforming Wikipedia into a network of knowledge graphs. Unverified statements can be checked against this network. A statement known to be true in Wikipedia will be present as an edge of the knowledge graph, or will have its subject and object linked via a short path in the graph. Presumably, untrue statements should not be found as connected in the graph.
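The path-based idea can be illustrated with a toy example. The sketch below is not the system of Ciampaglia et al. (2015), which operates over a graph derived from Wikipedia; it assumes a tiny hand-made undirected graph and plain breadth-first search, and the names `TOY_GRAPH` and `shortest_path_length` are hypothetical. It only shows how a short path between the subject and object of a statement could be read as weak support, and the absence of any path as a lack of support.

```python
from collections import deque

# Hypothetical toy knowledge graph: each entity maps to its neighbours.
TOY_GRAPH = {
    "Marie Curie": {"Warsaw", "Physics"},
    "Warsaw": {"Marie Curie", "Poland"},
    "Poland": {"Warsaw"},
    "Physics": {"Marie Curie"},
    "Canberra": {"Australia"},
    "Australia": {"Canberra"},
}

def shortest_path_length(graph, subject, obj):
    """Return the number of edges on the shortest path, or None if unconnected."""
    if subject not in graph or obj not in graph:
        return None
    queue = deque([(subject, 0)])
    seen = {subject}
    while queue:
        node, dist = queue.popleft()
        if node == obj:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# A directly stated fact is a single edge; a related fact is a short path.
print(shortest_path_length(TOY_GRAPH, "Marie Curie", "Warsaw"))
print(shortest_path_length(TOY_GRAPH, "Marie Curie", "Poland"))
# Entities in disconnected parts of the graph have no path between them.
print(shortest_path_length(TOY_GRAPH, "Marie Curie", "Australia"))
```

In a graph of realistic size almost all entities are connected by some path, so a practical system would need to score paths rather than merely test for their existence.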
Jaradat et al. (2018) have created ClaimRank, a computational system that detects claims that may need verifying (available for both Arabic and English). The claims can then be sent to fact-checking websites (which typically employ humans to do the verification), or to automatic systems. One such system by Mohtarami et al. (2018) finds documents that may be relevant to a given claim, and snippets of evidence. While the system may not be used in a completely automatic way, it can assist human verification experts.
Another form of automatic checking involves assessing the language of the story itself, i.e., finding cues in the language of the story that point to exaggerated claims, overly emotional language or a style that is uncommon in mainstream news sources. This is, in essence, a text classification problem, one commonly addressed by computational linguists using NLP tools. Potthast et al. (2018) describe this type of classification as
We argue that computational linguists are uniquely positioned to determine whether there is a ‘language of fake news.’ We discuss the potential of text classification in the next section.
Text classification for fake news
An intuitive framing of the fake news problem in NLP would be to ask how we can classify news text into fake and legitimate instances. This applies especially to the case of full text – as opposed to tweets or headlines distributed on social media – because text classification relies mainly on the linguistic characteristics of longer text. Deception detection in text has a broad literature in NLP, and fake news articles can be considered a category of deceptive text (Chen et al., 2015; Feng et al., 2012; Pérez-Rosas and Mihalcea, 2015). Methods used for text classification vary from classic machine learning algorithms using a set of pre-defined linguistic features to modern neural network models which mainly rely on pre-trained word vectors and embedded representations resulting from processing large amounts of textual data. In this section, we briefly introduce text classification methods used in the domain of deception detection and, in particular, in fake news detection.
Feature-based approaches
In NLP, the feature-based approach, which involves the extraction and analysis of linguistic cues to identify specific target phenomena (e.g., distinguishing fake product reviews from real ones), has been a very powerful model with relatively interpretable results. Features such as n-grams, subjectivity and polarity markers, lexical semantic classes, and syntactic or discourse-level features have been explored in previous work on deception detection in general and on news classification in particular (Afroz et al., 2012; Conroy et al., 2015; Horne and Adali, 2017; Pérez-Rosas and Mihalcea, 2015; Rashkin et al., 2017; Rubin et al., 2015; Ruchansky et al., 2017; Volkova et al., 2017). These features can be used with a variety of traditional supervised algorithms. Feature-based modelling usually involves feature engineering and a feature selection phase. Comparative experiments in different machine learning applications have also shown that the performance of these classic models plateaus at some point as the training data size increases (Ng, 2011). Thus, in problems where Big Data is available, deep neural network models are preferred, as they usually achieve markedly better results (for a recent overview of NLP trends see Young et al., 2018).
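As a minimal sketch of the feature-based setup, the code below extracts word n-gram counts and feeds them to a multinomial Naive Bayes classifier. Naive Bayes is only one possible choice of traditional supervised algorithm and is an assumption made here for brevity; the four training sentences are invented for illustration, and the names `ngram_features` and `NaiveBayes` are hypothetical.

```python
import math
from collections import Counter

def ngram_features(text, n_max=2):
    """Count word n-grams (unigrams up to n_max-grams) in a lower-cased text."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over n-gram counts."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngram_features(text))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        feats = ngram_features(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # class prior
            for feat, freq in feats.items():
                if feat in self.vocab:  # ignore n-grams never seen in training
                    score += freq * math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy examples, for illustration only.
TRAIN_TEXTS = [
    "shocking secret they do not want you to know",
    "you will not believe this shocking miracle cure",
    "the council approved the budget on tuesday",
    "the minister announced the budget figures on monday",
]
TRAIN_LABELS = ["fake", "fake", "legitimate", "legitimate"]

model = NaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
print(model.predict("a shocking miracle secret"))
print(model.predict("the budget on monday"))
```

With so little data the classifier simply memorizes surface vocabulary, which is one reason the size and diversity of the training set matter so much for this approach.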
Deep learning models
Deep learning has taken over most NLP tasks but usually in domains where large-scale training data is available. In text classification, recurrent neural networks (RNNs), convolutional neural networks (CNNs) and Attention models have been competing with feature-based models (Conneau et al., 2017; Le and Mikolov, 2014; Medvedeva et al., 2017; Yang et al., 2016; Zhang et al., 2015). RNNs are capable of encoding sequential information and are most suitable for modelling short text semantics. CNNs are composed of convolution and pooling layers, which provide an abstraction of the input. These models are employed in specific NLP tasks where the presence or absence of features is a more distinguishing factor than their location or order. For example, presence of specific words and phrases in a product review is usually indicative of it being a positive or negative review. Therefore, CNNs are well suited for the purpose of longer text classification. Neural network models have also been applied in previous work within the domain of misinformation and fake news (Rashkin et al., 2017; Wang, 2017; Yang et al., 2017).
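The claim that convolution and pooling capture the presence of a feature rather than its position can be made concrete with a small sketch. The code below is not a trained model: the two-dimensional word vectors and the single filter are set by hand (both are assumptions made for illustration), whereas a real CNN would learn them from data. It applies one filter of width 2 over a token sequence and then takes the maximum response (max-over-time pooling).

```python
# Hypothetical 2-dimensional word vectors, invented for illustration.
EMBEDDINGS = {
    "not": [1.0, 0.0],
    "good": [0.0, 1.0],
}
UNKNOWN = [0.0, 0.0]

# One hand-set convolution filter of width 2 that responds to "not good".
FILTER = [[1.0, 0.0], [0.0, 1.0]]

def convolve(tokens, filt):
    """Slide the filter over consecutive token windows; one score per window."""
    vectors = [EMBEDDINGS.get(tok, UNKNOWN) for tok in tokens]
    width = len(filt)
    scores = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        score = sum(w * x
                    for row_w, row_x in zip(filt, window)
                    for w, x in zip(row_w, row_x))
        scores.append(score)
    return scores

def max_pool(scores):
    """Max-over-time pooling: keep only the strongest filter response."""
    return max(scores)

early = "not good at all really".split()
late = "the film was really not good".split()

# The phrase occurs at different positions, but the pooled value is the same.
print(max_pool(convolve(early, FILTER)), max_pool(convolve(late, FILTER)))
```

Because pooling discards the position of the strongest response, the representation has the same size and content wherever the phrase occurs, which is what makes this design convenient for long documents.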
All leading machine learning techniques for text classification, including feature-based and neural network models, are heavily data-driven. Therefore, training data is the first requirement to build these models. Quality training data for misinformation detection should consist of a balanced, sufficiently diverse and carefully labelled set of legitimate and fake news articles. While building such a dataset may sound trivial, the following section explains the challenges in gathering a dataset of this kind by referring to the datasets we have found through a review of previous work. Initial experiments suggest that existing data is still insufficient for building a robust misinformation detection system.
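One simple check on a candidate training set is how evenly its labels are distributed. The sketch below assumes labels are given as a plain list of strings; the function names are hypothetical, and the measure covers class balance only, not diversity or label quality.

```python
from collections import Counter

def label_balance(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def imbalance_ratio(labels):
    """Ratio of the most frequent label's count to the least frequent one's."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# Invented example: one fake article for every three legitimate ones.
example_labels = ["fake"] + ["legitimate"] * 3
print(label_balance(example_labels))
print(imbalance_ratio(example_labels))
```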
Data: Where and how to find fake news
The first question we need to answer in addressing fake news detection through text classification is what we consider as a representative instance of fake news. In other domains related to deceptive text, such as fake product review detection, objective criteria can be designed when labelling the fake instances: a review written by someone who has not bought or used the product, or by someone recruited by the product seller for the specific duty of writing a review, would be considered fake. Fake news can also be defined as news articles written by amateurs (rather than journalists) recruited with the express purpose of generating content in favour of or against an entity or policy, to promote a specific idea, or for financial gain such as attracting clicks for ads. Professional journalists can also fabricate stories, for various reasons. One recent case is Claas Relotius, journalist for
The data collection strategy for building a fake news detection system depends on the definition one adopts for the task. In the majority of previous work, instances of fake news were collected from a list of suspicious websites. A relatively large collection of this type is a dataset of roughly 20,000 news articles collected by Rashkin et al. (2017). This data contains texts harvested from eight news publishers categorized into four classes:
In order to build a text classification system that distinguishes false from true content based on linguistic cues, we need news articles assessed individually and labelled with respect to their level of veracity. This type of data collection is labour-intensive, as it involves fact-checking each and every news article. A variety of fact-checking websites perform this analysis on real news. Therefore, one way to collect data on rumours and false news is to take advantage of these websites and to try to automatically scrape information such as the true vs. false headlines and, where possible, their sources. Previous attempts to collect large data in this manner did not focus on the text of the news articles where the rumour was originally distributed; they focused instead on the headlines and the annotations of the fact-checking websites. 7
Table 1. Currently available datasets with texts individually labelled for veracity.
The Liar dataset (Wang, 2017) is the first large dataset collected through reliable annotation, but it contains only short statements, not full news article texts. Another recently published large dataset is FEVER (Thorne et al., 2018), which contains both claims and texts from Wikipedia pages that support or refute them, together with veracity labels for the claims. This dataset, however, has been built to serve the slightly different purpose of stance detection (Hanselowski et al., 2018; Mohtarami et al., 2018); the false claims have been artificially generated; and the documents are not news articles, but Wikipedia pages (as true text instances) and their modified version obtained from crowd-sourcing (as fake or false instances). We provide a summary of these datasets in Table 1.
MisInfoText: A repository of assessed news texts
In order to address the lack of data with reliable labels, we have built a repository of news article texts that have been labelled by fact-checking websites. The entire dataset, plus links to datasets listed in Table 1, is available from our lab GitHub space 9 and from our demo page. 10 This repository contains three data categories:
Links to all publicly available datasets of news that contain (1) the text of news articles, and (2) veracity labels assigned to them, are collected and maintained in our repository. This will facilitate both theoretical and application-based studies on fake news and automatic misinformation detection.
In addition to datasets originally published in previous work, we perform scraping on top of datasets that contain veracity-labelled claims and URLs of their sources, but not necessarily the text of news articles. For example, we have found two datasets of links with veracity labels on the Buzzfeed News repository. These links become useful in finding news article instances that have already been assessed for their factual content. We make available both the original data containing links and veracity labels, and the augmented data that we scrape from the associated news web pages, including body text, title, author and date of publication. The Buzzfeed dataset will be introduced in the rest of the current section.
Finally, we maintain and use a list of potential fact-checking websites to harvest larger amounts of data and provide a scrape-and-clean service on top of them.
In collecting data directly from fact-checking websites such as Snopes, we apply a mixture of automatic and manual procedures. We have so far scraped the entire archive of the Snopes, Politifact and Emergent websites, and then automatically followed the links mentioned by each fact-checking article on these websites to the sources of the discussed rumours. Figure 1 shows a screenshot of the web-scraping service that we have built and made available online for public use. Manual checking is necessary to verify that the text is valid and that it in fact supports the discussed claim. We have done the manual assessment on a subset of the automatically collected Snopes articles. This manual effort and the resulting dataset are described below.
Figure 1. Screenshot of our web service to scrape data from fact-checking websites and links to the original news articles, available at https://fakenews.ngrok.io.
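The link-following step of such a pipeline can be sketched as follows. This is not the code behind the service in Figure 1; it is a minimal illustration, using only the Python standard library, of how the outgoing links on a fact-checking page might be collected. The HTML fragment, the URLs and the names `LinkCollector` and `external_links` are all invented, and no page is actually downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def external_links(html, page_url):
    """Return absolute http(s) links that leave the fact-checker's own domain."""
    collector = LinkCollector()
    collector.feed(html)
    own_domain = urlparse(page_url).netloc
    result = []
    for href in collector.links:
        url = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc != own_domain:
            result.append(url)
    return result

# Invented page fragment and URLs, for illustration only.
SAMPLE_HTML = """
<p>The claim appeared in <a href="https://news.example.org/story-123">this article</a>.
See also our <a href="/about">about page</a>.</p>
"""
SAMPLE_URL = "https://factcheck.example.com/fact-check/some-claim"

print(external_links(SAMPLE_HTML, SAMPLE_URL))
```

A filter of this kind removes only internal navigation links. It cannot tell whether an external link is the source of the claim or merely background material, which is why the manual check remains necessary.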

Buzzfeed dataset
The first source of information that we used to harvest full news articles with veracity labels is from the Buzzfeed media company. Buzzfeed has published a collection of links to Facebook posts, originally compiled for a study around the 2016 US election (Silverman et al., 2016). Each URL in this dataset was given to human experts so they could rate the amount of false information contained in the linked article. The links were collected from nine Facebook pages (three right-wing, three left-wing and three mainstream publishers). 11
We had to follow the Facebook URLs and then the links to the original news articles to obtain the news texts. We scraped the full text of each news article from its original source. The resulting dataset includes a total of 1380 news articles on a focused topic (US election and candidates). Veracity labels come in a four-way classification scheme including 1090
Snopes dataset
The second source of information that we used to harvest full news articles with veracity labels is Snopes, a well-known rumour debunking website run by a team of expert editors. In addition to finding rumours and mentioning distributing websites, Snopes provides elaborate explanations of the rumour and its effects. We scraped the entire archive of fact-checking pages. On each page, Snopes discusses a claim, cites the sources (news articles, forums or social networks where the claim was distributed) and provides a veracity label for the claim. We automatically extracted all links mentioned on a Snopes page, followed the link to each original news article, and extracted the text. The resulting datafile includes roughly 4000 rows, each containing a claim discussed by Snopes annotators, the veracity label assigned to it, and the text of a news article related to the claim. The main challenge in using this data for training/testing a fake news detector is that some of the links on a Snopes page that we collect automatically do not actually point to the discussed news article, i.e., the source of the claim. Many links are to pages that provide contextual information for the fact-checking of the claim. Therefore, not all the texts in our automatically extracted dataset are reliable or simply the ‘supporting’ source of the claim. To come up with a reliable set of veracity-labelled news articles, we randomly selected 312 items and assessed them manually. Two annotators performed independent assessments on the 312 items. A third annotator went through the entire list of items for a final check and to resolve disagreements. Snopes has a fine-grained veracity labelling system. We selected
Topics covered by fact-checkers
In this section, we perform a topic modelling experiment to explore the data we have collected from fact-checking websites and to get a sense of what type of news articles are covered in the available datasets. The issue of topics is important because training datasets that are skewed in terms of topic will result in classifiers that are unable to generalize to different topic distributions. More generally, research to date has not explored what topics are more or less likely to be featured in fake news stories, although it seems that news about politics, the environment and health are prevalent. Vargo et al. (2018) investigate the media landscape, and the interaction between mainstream media, fake news publishers and fact checkers, testing the hypothesis that fake news media and fact checkers have the power to set the agenda of news media, by the types of stories that they respectively cover or fact-check. While their study found that fake news does not set the agenda for mainstream media, it is intricately connected to partisan news, taking cues from partisan sites with regard to what types of topics and stories are covered in fake news. Even more worrisome is the connection between fake news and emerging media, perhaps, Vargo et al. hypothesize, because emerging media is, like fake news sites, predominantly online. These are all issues that need further exploring, and will affect how fake news datasets ought to be built in terms of topic distributions.
Since the Emergent dataset is the largest and most similar dataset to ours (because it was also collected from a fact-checking website), we also include this data in our experiment. The objectives of this experiment are two-fold:
(1) Discover what topics in news are covered by fact-checking websites, and how the distribution varies between true vs. false news stories.
(2) Find the gaps and sources of imbalance in currently available data to provide useful directions for future data collection efforts.
In order to address these points, we need a sufficiently large reference corpus of news text – as training data to the topic model – that is representative of news stories regardless of their content being misinformation or not. For this purpose, we employ a collection of 16,000 texts from the training portion of Rashkin et al.’s (2017) data, which we briefly introduced in the previous section. Projecting our labelled news articles into the topic space constructed based on this diverse data will then reveal the topic distribution in fake news as well as preferences of the fact-checking websites in picking and debunking rumours.
To build the topic model, we preprocessed documents in Rashkin’s training set (by tokenizing, normalizing and removing punctuation and stopwords) and fed the resulting word-document vectors into a Latent Dirichlet Allocation (LDA) model in the Gensim Python library. We tuned the number of topics so that each topic represents a clear category of news that is neither too fine-grained nor too coarse-grained for visual investigation. The number of topics that gave the clearest results was 10. Figure 2 (bottom section) shows the word clouds we obtained from the 10 most important words in each topic, with their weight represented by the font size. In a similar fashion, we preprocessed documents from our two datasets extracted from Buzzfeed and Snopes, as well as the Emergent dataset. We then projected each subset of these datasets (split based on veracity labels) into the pre-trained topic space. Doing so provides us with some interesting observations regarding the distribution of important news topics in the labelled collections (see the top section of Figure 2).
Figure 2. Topic distribution across news text corpora obtained from fact-checking websites: Buzzfeed, Snopes and Emergent.
The Buzzfeed dataset (1380 articles), which is mostly focused on news related to the 2016 US election, comes out as the least diverse dataset. This was to be expected, as this dataset covers the topics of election, personal stories (of the presidential candidates) and other political topics such as stories related to police and the legislation system. The Snopes dataset (145 articles) is more diverse: In addition to political topics, it includes some news on sports, environment and health. Notice that the Buzzfeed top fake news collection (33 articles) has a distribution more similar to that of the Snopes collection; this is because Buzzfeed in fact collected that dataset by looking at the Snopes and Politifact websites. Finally, the Emergent dataset (1612 articles) stands out as the most varied collection. This dataset is also larger, which might indirectly contribute to its topic diversity. While the three datasets put together cover a variety of news stories, it does seem that stories on certain topics, such as the market (economy) and technology, are less well represented in this collection.
By looking closely at each row of the heat map, we also find that some topics are more frequent in false than true news. For example, in the Snopes dataset, the topic of police is found more in false news articles. In the Emergent dataset, the technology and environment topics are more frequent in false news, whereas the opposite pattern is observed for the politics topic. These differences can be indicative of an inherent difference between misinformation and real news, or they might just mean that the studied fact-checking websites are biased towards certain types of stories. Personal stories, in particular, appear frequently across all datasets and all veracity labels. This type of pattern is particularly interesting, as it can indeed be a consistent feature of the rumour type of news, but not necessarily a sign of misinformation.
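The row-by-row comparison described here can be made concrete with a small sketch. It assumes that a topic distribution is already available for every article (for example, from an LDA projection) and averages those distributions per veracity label, which gives one heat-map row per label. The function names, the three topic names and the toy numbers are made up for illustration and are not taken from our data.

```python
from collections import defaultdict


def mean_topic_shares(docs):
    """Average per-document topic distributions for each veracity label.

    docs: iterable of (label, distribution) pairs, where each distribution
    is a list of topic proportions of equal length.
    """
    sums = {}
    counts = defaultdict(int)
    for label, dist in docs:
        if label not in sums:
            sums[label] = [0.0] * len(dist)
        for i, p in enumerate(dist):
            sums[label][i] += p
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def share_gap(rows, topic_names, a="false", b="true"):
    """Mean topic share under label a minus mean topic share under label b."""
    return {name: rows[a][i] - rows[b][i] for i, name in enumerate(topic_names)}


# Made-up toy distributions over three hypothetical topics.
topics = ["politics", "police", "technology"]
toy_docs = [
    ("false", [0.2, 0.6, 0.2]),
    ("false", [0.4, 0.4, 0.2]),
    ("true", [0.7, 0.1, 0.2]),
    ("true", [0.5, 0.3, 0.2]),
]
rows = mean_topic_shares(toy_docs)
gaps = share_gap(rows, topics)
```

A positive gap marks a topic that is more frequent in false than in true articles within a dataset. As noted above, such a gap may reflect the selection preferences of the fact-checking website rather than a property of misinformation itself.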
We used the datasets introduced here for text classification experiments which we do not include in this paper, but which are worth mentioning briefly. Using data that is unbalanced in terms of topics leads to high classification accuracy, even with very simple features (such as tokens or word n-grams), when train and test data are sampled from a similar distribution of news topics. However, reporting such high accuracies is misleading, because what we are looking for is a fake news detection system that can generalize to new topics, i.e., a classifier that detects high-level features that can be considered signs of deception, regardless of a news article’s specific topic. Small data collections do not offer cross-topic generalization, because what the models learn in this situation is the vocabulary differences between fake and real news, and the vocabulary depends strongly on the topics. For example, if we train a classifier on the data depicted in Figure 2, any test instance that comes from the technology topic would likely be classified as ‘false’ (because we have few training instances of this topic in the collection and most of them are from the false class in the Emergent data). Therefore, it is important to collect both fake and real news instances on a variety of topics, to make sure that what our systems learn about deception can be generalized to unseen instances of news across topics.
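One way to measure cross-topic generalization, rather than within-topic accuracy, is to hold out all articles of one topic at a time. The sketch below illustrates such a leave-one-topic-out split. It is not the evaluation protocol of the experiments mentioned above, and it assumes that each article has been assigned a single dominant topic; the dictionary keys and the toy examples are hypothetical.

```python
def leave_one_topic_out(examples):
    """Yield (held_out_topic, train, test) so that train and test share no topic.

    examples: list of dicts with at least a "topic" key (hypothetical schema).
    """
    topics = sorted({ex["topic"] for ex in examples})
    for held_out in topics:
        train = [ex for ex in examples if ex["topic"] != held_out]
        test = [ex for ex in examples if ex["topic"] == held_out]
        yield held_out, train, test


# Made-up toy collection.
toy = [
    {"topic": "politics", "label": "true"},
    {"topic": "politics", "label": "false"},
    {"topic": "technology", "label": "false"},
    {"topic": "health", "label": "true"},
]
splits = list(leave_one_topic_out(toy))
```

A classifier that relies mainly on topic vocabulary would be expected to score well when train and test data are drawn from the same topics, but noticeably worse on held-out topics under a split of this kind.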
A call to arms
Our efforts at collecting data to build a robust fake news classifier have taught us a valuable lesson: Reliably labelled fake news articles are actually hard to come by. Although many fake news publishers exist, we have no assurance that every story on those sites constitutes misinformation. Thus, we need instances of individually labelled stories, labelled by humans with some expertise on the topic of the stories, or at least with some general training in journalism.
We have modestly contributed to this effort, with two datasets from Buzzfeed and Snopes, amounting to a total of 1558 individually labelled articles with veracity scores in a five-way spectrum. While this dataset has allowed us to investigate certain aspects of fake news, such as the types of topics covered, it is certainly not sufficient for modern text classification methods, especially for deep learning models. We need Big Data to solve this problem.
Our call to arms encourages researchers in this field to share datasets, and to work towards a standard for labelling and organizing the data. This is not about who gets a paper published first; it is about addressing an important problem, and finding solutions by working together. Lazer et al. (2017) call for developing datasets that are useful for studying the spread of misinformation, and suggest pressuring social media companies to share important data. We join this chorus, and would like to have access to datasets to study not only spread, but also the fake news articles themselves.
Conclusions
We have discussed the different approaches to the problem of fake news and misinformation, some of them relating to how to educate the public or to how to stop the spread of such pernicious news. We focus on tackling the problem as a text classification problem, i.e., attempting to automatically detect whether a particular news article is fake or not. By ‘fake’ we mean an article that contains unverified or untrue claims, or attempts to disseminate information that is not accurate.
In order to perform automatic classification of news texts, modern NLP and machine learning methods require large amounts of training data. As computational linguistics researchers, we feel, however, that we cannot decide by ourselves which articles are instances of fake or real news. This is why we propose relying on datasets containing articles that have been individually labelled for veracity by experts. We have found, unfortunately, that there are very few such datasets, because individual labelling is a time-consuming task. Nevertheless, one source of such labels is fact-checking websites, which perform this task for the public good. We have scraped, cleaned up and organized individual articles harvested from these sites, together with their labels (true, false, or similar labels). We introduce this dataset, MisInfoText, as a resource for text classification efforts. We also carried out analyses based on topics, and discovered that the datasets are unbalanced with respect to topics, an issue that needs to be addressed for text classification.
More work in this regard is certainly needed, and we encourage the community to organize and contribute their own datasets, so that we can address this problem in a collaborative fashion.
Our future work involves using this dataset and any other that we can find to build robust classifiers. We are experimenting with both ‘classic’ feature-based approaches and deep learning methods.
