Based on my admittedly fortunate experience collaborating with computer scientists on both research and teaching, I can report that the era of the “two cultures” (Snow, 1959) is over.1 Instead of epistemological chasms, I have found modest differences in orientation, of which I shall mention three, reflecting computer scientists’ and social scientists’ respective intellectual traditions. These differences require social scientists to do some extra work to adapt the powerful tools that computer scientists provide to social-science problems, but they offer in return insights that can improve the way we think about explanation more broadly. Because my own “Big Data” comprises texts, I shall limit my observations to computational text analysis (Blei et al., 2003).
First difference: Supervised vs. unsupervised machine learning
Topic modeling and many other text-analysis tools have their roots in machine learning. Typically, in machine-learning problems, one has a class of cases of known type and a class for which the type is unknown (Alpaydin, 2014). One divides the cases of known type into a “training set” and a “testing set”; develops a model on the training set that predicts the testing set effectively; and applies that model to classify the cases for which the type is unknown. An excellent example of this approach is Jockers and Mimno (2013), who used supervised topic models (“supervised” refers to models based on cases of known type) to identify the gender of anonymous or pseudonymous authors of 19th-century novels. Supervised models have had a wide range of practical applications (think, for example, of the Netflix Challenge, where contestants used supervised-learning models to improve Netflix’s ability to recommend to users films that they would enjoy).
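The workflow just described can be sketched in a few lines. The toy corpus, the labels, and the crude word-overlap “model” below are hypothetical stand-ins for a real classifier and real training data; the point is only the division into training and testing sets and the final application to cases of unknown type.

```python
def tokenize(text):
    return text.lower().split()

def train(docs):
    """Build a vocabulary profile for each known label from the training set."""
    profiles = {}
    for text, label in docs:
        profiles.setdefault(label, set()).update(tokenize(text))
    return profiles

def classify(profiles, text):
    """Assign the label whose training vocabulary overlaps the document most."""
    words = set(tokenize(text))
    return max(profiles, key=lambda label: len(words & profiles[label]))

# Hypothetical labeled texts (cases of known type).
labeled = [
    ("the ballroom gown and the wedding letter", "romance"),
    ("a gown for the ball and a letter of love", "romance"),
    ("the pistol and the chase through the fog", "crime"),
    ("a detective with a pistol in the fog", "crime"),
]
training = [labeled[0], labeled[2]]  # fit the model here
testing = [labeled[1], labeled[3]]   # check predictive accuracy here

profiles = train(training)
accuracy = sum(classify(profiles, text) == label
               for text, label in testing) / len(testing)

# Only once accuracy on the testing set is acceptable is the model
# applied to cases whose type is unknown.
unknown = "a chase in the fog with a pistol drawn"
print(accuracy, classify(profiles, unknown))  # 1.0 crime
```

In practice one would use a real learner (say, regularized logistic regression over word counts) and far more data, but the structure of the supervised workflow is the same.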
Arguably, however, the major strides in computational text analysis in recent years—and the ones of greatest use to social scientists—have entailed the development of unsupervised models, which induce categories (such as topics) from the texts themselves rather than learning them from pre-labeled cases.
In other words, the shift from supervised to unsupervised models, especially in some areas of greatest interest to social scientists, requires many of us to move outside our comfort zone in accepting interpretive uncertainty and to develop robust ways to interpret and validate the results of our models.2 There have been promising developments in this respect (e.g. Grimmer and Stewart, 2013), but there is room for more progress. The key may be starting with interpretive methods borrowed from the humanities, but then disciplining the results through statistical validation. For example, a model interpretation, combined with information external to the corpus, may lead one to hypothesize that the prevalence of some topics should be associated with particular classes of authors, or to anticipate temporal patterns in topic prevalence (DiMaggio et al., 2013).
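One simple way to discipline an interpretation statistically, in the spirit just described, is to test whether a topic’s prevalence differs across classes of authors. The prevalence figures and the two author classes below are hypothetical; the permutation test itself is standard and assumes only exchangeability under the null hypothesis.

```python
import random

# Hypothetical per-document prevalence of one topic, by author class.
class_a = [0.30, 0.35, 0.28, 0.40, 0.33]
class_b = [0.12, 0.18, 0.10, 0.15, 0.20]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(class_a) - mean(class_b)

# Permutation test: shuffle the class labels and see how often a
# difference at least as large as the observed one arises by chance.
random.seed(0)
pooled = class_a + class_b
trials, extreme = 2000, 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:len(class_a)]) - mean(pooled[len(class_a):])
    if abs(diff) >= abs(observed):
        extreme += 1
p_value = extreme / trials
print(round(observed, 3), p_value < 0.05)  # 0.182 True
```

A small p-value here does not establish causality; it only confirms that the interpretive hunch about who uses the topic survives a basic statistical check, which is exactly the kind of validation the text recommends.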
Second difference: Machine-learning vs. statistical explanation
Whereas social scientists customarily obsess over causality and rely on formal tests of statistical significance, computer scientists using supervised models focus on predictive performance. The first topic-model presentation I attended used the method to identify public records particularly likely to require redaction, out of a set of records too immense for humans to screen by hand. The only measure that mattered was whether the models improved prediction (which they did).
Even where held-out samples are not available, however, computer scientists seem to devote more attention to designing models and less to statistical validation (at least in the social-scientific sense) of model solutions. I suspect that this (admittedly stylized) difference is reinforced by variation in disciplinary skill sets: Computer scientists can write new algorithms faster than most social scientists can learn them. In the tradition of supervised machine learning, this makes sense because if you create a better model, you will be rewarded instantly with better results. This approach carried over (at least at first) into unsupervised models, as computer scientists wrote new algorithms at an alarming rate. From a computer-science standpoint, this made sense: algorithms tend to solve big problems, rather than nibble at the edges of small ones.
Social scientists, by contrast, tend to learn a method and adapt it to their needs, in large part because their learning costs are much higher.3 Social scientists therefore tend to be more interested than computer scientists in model-testing and curation (though there are exceptions on both sides; see, e.g., Boyd-Graber et al., 2014), attempting to get the most out of the programs they have mastered. From this standpoint, social scientists have good reason to invest in corpus curation, because it may be their most efficient way to boost the quality of their results.
There is another difference, which is likewise consistent with the supervised-learning roots of text modeling but has more to do with underlying approaches to explanation. Many computer scientists (with notable exceptions, e.g. Pearl, 2009) seem less concerned with causality and with model confirmation than are many social scientists. It is not that they care less about getting models right; rather, they understand “getting it right” in a different (and, I am beginning to suspect, more useful) way than do most social scientists, focusing on model plausibility, utility, and descriptive, as opposed to causal, validation. Social scientists accustomed to obsessing over model specification and statistical significance may find this emphasis frustrating, at least at first. (As a novice topic-model user, I literally could not believe that there was no target function I could use as a simple goodness-of-fit criterion to choose among model solutions.) Ultimately, however, the computer-science perspective is liberating, as it forces us to recognize real interpretive uncertainty and to seek out appropriate and substantively relevant forms of validation fitted to specific research goals.
Third difference: Computer scientists trust humans more than social scientists do
From Alan Turing (1950) onward, much work in computer science, especially in Artificial Intelligence, has sought to create algorithms that can replicate the results of human problem solving. The difficulty of this task has generated great respect among many computer scientists for the human brain (which, as computers go, is an impressive piece of hardware). This tradition has influenced Natural Language Processing and, especially, Sentiment Analysis, where human reasoning is routinely described as a “gold standard” against which algorithmic output should be judged. Tasks like the Netflix Challenge induce computer scientists to write programs that try to simulate human evaluation processes; and much research on sentiment analysis employs as training sets human reviews from websites like Yelp or Rotten Tomatoes that combine text and summary evaluations (for a thoughtful review see Liu, 2010).
By contrast, social scientists, at least those who have paid attention to work in cognitive psychology, are deeply suspicious of human judgment. I, for one, harbored the hope (of which my computer science colleagues have disabused me) that computational analysis could free us from dependence on pesky humans, whose judgments are clouded by hard-wired errors of reasoning (Kahneman, 2003); schematic (Nisbett and Wilson, 1977) and ideological priors (Graham et al., 2009); and vulnerability to emotional environments (Shiller, 2015), priming (Gilbert, 1991), stress (Hammond, 2000), poverty (Mullainathan and Shafir, 2013), pride (Tourangeau and Yan, 2007), and prejudice (Hardin and Banaji, 2013).
As anyone who has ever hand-coded text or supervised others in doing so is aware, measures of inter-rater reliability are correlated directly with the
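Agreement between coders of the kind at issue here is conventionally summarized with chance-corrected statistics such as Cohen’s kappa. A minimal sketch (the two annotators’ labels below are hypothetical):

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Chance-corrected agreement between two coders' label sequences."""
    n = len(coder1)
    observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Expected agreement if each coder assigned labels at random
    # according to their own marginal frequencies.
    c1, c2 = Counter(coder1), Counter(coder2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical sentiment codes assigned by two annotators to ten documents.
coder1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
coder2 = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]

kappa = cohens_kappa(coder1, coder2)
print(round(kappa, 2))  # 0.6
```

Note that the two coders agree on eight of ten documents (raw agreement 0.8), yet kappa is only 0.6, because half of that agreement would be expected by chance alone—one illustration of why social scientists are wary of treating human judgment as a gold standard.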
What do social scientists need?
Three priorities follow from these observations, two for topic models and one for sentiment analysis:
Engagement with computational text analysis entails more than adapting new methods to social-science research questions. It also requires social scientists to relax some of our own disciplinary biases, such as our preoccupation with causality, our assumption that there is always a best-fitting solution, and our tendency to bring habits of thought based on causal modeling of population samples to interpretive modeling of complete populations. If we do, the encounter with machine learning may pay off not just by providing tools for text analysis, but also by improving the way we use more conventional methods. To be sure, there are risks in going against the grain. But this is a great time for social scientists to get involved in computational text analysis, because the field is relatively young, the challenges are intellectually captivating, and there is still time to influence the shape that these methods will take as they enter the social sciences.
