Abstract
Keywords
Introduction
In recent years, an increasing number of sociologists have embraced machine learning algorithms to infer latent patterns in text data (e.g., DiMaggio, Nag, and Blei, 2013; Mohr et al., 2013; Rule, Cointet, and Bearman, 2015; Bail, 2016; Nelson, 2020, 2021a,b; Kozlowski, Taddy, and Evans, 2019; Goldenstein and Poschmann, 2019; Bail, Brown, and Mann, 2017; Karell and Freedman, 2019; Wu, Wang, and Evans, 2019; Bohr, 2020; Taylor and Stoltz, 2020; Stoltz and Taylor, 2021; Arseniev-Koehler et al., 2022; Bonikowski, Luo, and Stuhler, 2022; Boutyline, Arseniev-Koehler, and Cornell, 2023; Best and Arseniev-Koehler, 2023). One suite of algorithms, unsupervised topic models (Blei, Ng, and Jordan, 2003; Griffiths and Steyvers, 2004; Blei, 2012), infers linguistic themes based on word co-occurrences. Topic models have been found to resonate well with sociological ideas about how people create meaning and make sense of the social world by linking themes to other concepts and ideas (DiMaggio, Nag, and Blei, 2013; Mohr et al., 2013; Törnberg and Törnberg, 2016; Fligstein, Stuart Brundage, and Schultz, 2017; Nelson, 2020). This article addresses a central limitation of topic models: while they are suited to inductive research that identifies emergent themes from document collections, they fare poorly at identifying, in transparent and replicable ways, specific concepts predefined by the researcher. Topic models, and unsupervised methods more generally, rely on post hoc analysis to make sense of the output in light of sociological theory, opening up an old rift between inductive and deductive research within the discipline. 
As computational text analysis has matured as a methodology in the sociological toolkit, calls have been made for an important next step: to move beyond the implementation of standard models and to strive to apply specialized models that are more transparent, replicable, theory-driven, and interpretable, and thus more attuned to the central demands of social science research (DiMaggio, 2015; Nelson, 2019; Mohr et al., 2020; Pääkkönen and Ylikoski, 2021; Nelson, 2021b; Grimmer, Roberts, and Stewart, 2022; Bonikowski and Nelson, 2022).
We contribute further to this debate and argue for the use of semi-supervised text analysis. We focus on the seeded topic model, a semi-supervised extension of the standard topic model in which the researcher supplies seed words.
The seeding crystallizes topics around predefined words that describe themes of interest. We use the term “topics” to refer to model output, and we use “themes,” “issues,” and “frames” when referring to theoretical concepts. Seed words require researchers to be explicit about how a concept is operationalized, and seeding is one way to constrain the model to search for specific themes of interest. Seeding can also increase the robustness of computational text analysis to language change, an endemic challenge when analyzing text archives of historical timescales (Bearman, 2015; Rule, Cointet, and Bearman, 2015; Voyer et al., 2022; Bonikowski, Luo, and Stuhler, 2022). By identifying associations between a focal topic and other topics with which it frequently co-occurs, the model can detect widely shared interpretations (or frames) associated with the theme in question. These model features provide an attractive complement to the mixed-methods approaches (e.g., DiMaggio, 2015; Karell and Freedman, 2019; Nelson, 2019, 2020) that are currently being discussed as a way of bringing computational text analysis into sociological research.
One strength of the topic model approach is to allow for words’ mixed memberships in topics. Our use of the seeded topic model, however, aims at measuring clearly defined and interpretable topics, which we will achieve by using seed words that we believe to have a single, very clear meaning. Seeding will work less well if one starts from polysemic words, i.e., words with multiple meanings, or if one tries to seed a polysemic topic altogether. While the words associated with the seeded words within a given topic are also allowed to emerge from the data, forced monosemy is a limitation of our approach that will hinder its applicability to certain use cases.
Seeded topic models have been around for a decade and have more recently become available in general-purpose programming languages such as R (Watanabe, Xuan-Hieu, and Watanabe, 2022) and Python (Anoop and Asharaf, 2017). However, strong computational requirements and limitations in the scalability of off-the-shelf implementations (Lu et al., 2011; Jagarlamudi, Daumé, and Udupa, 2012; Fan, Doshi-Velez, and Miratrix, 2019; Eshima, Imai, and Sasaki, 2024; Watanabe and Baturo, 2024) have hampered their application in sociology. We discuss a scalable implementation for big text data (Magnusson et al., 2018) that removes previous bottlenecks and that we hope will make the algorithm attractive to a broader sociological audience. We illustrate the method using an important case study that measures the ways the media have framed immigration in a Swedish newspaper corpus spanning 75 years. The corpus, one of the most extensive ever analyzed in the social sciences, contains 30 million text blocks from more than 100,000 editions of the country’s four national newspapers from the period 1945–2019.
Our study connects to a long tradition of sociological research studying newspaper discourses (e.g., Gamson and Modigliani, 1989; Marx Ferree, 2003; Koopmans and Olzak, 2004; Fiss and Hirsch, 2005; Janssen, Kuipers, and Verboord, 2008; Bail, 2012; Shor et al., 2015). Previous immigration-related research has relied on corpora comprising between a few thousand and 130,000 articles, which have typically been assembled using keyword searches, and which have spanned time frames of between 1 and 14 years (Helbling, 2014; Lawlor and Tolley, 2017; Greussing and Boomgaarden, 2017; Heidenreich et al., 2019; Czymara and van Klingeren, 2022). The largest studies to date have included 850,000 articles in six European languages (Eberl and Galyga, 2021) and 850,000 immigration-related headlines from UK newspapers (Bleich and van der Veen, 2021). Compared to past snapshot corpora, our data are vast and—in combination with a scalable algorithm—permit a fine-grained mapping of the newspaper discourse on immigration over 75 years.
Using the corpus described above, we map how shared interpretations of immigration have evolved over time. We operationalize interpretative media frames as associations between a focal topic and other topics, estimating the co-occurrence patterns of predefined themes (combining “immigration” with, e.g., “the economy,” “culture,” or “security”). Issues that frequently co-occur with the focal topic represent prominent logics for the topic’s interpretation. Through the ways journalists curate and present the news flow, the media frames that we measure in this study establish a shared context of meaning-making (Scheufele, 1999; Fiss and Hirsch, 2005; Chong and Druckman, 2017; Lizardo, 2021), placing events, people, and ideas into a wider context of interpretability (Strauss and Quinn, 1997; DiMaggio, 1997; Cerulo, Leschziner, and Shepherd, 2021; Arseniev-Koehler and Foster, 2022).
Since we estimate changes in cultural associations and delineate periods during which associations measurably differed, our computational approach adds scale to the qualitative analysis of “turning points” in collective meaning-making (Sewell, 1996; Abbott, 1997, 2001; Wagner-Pacifici, 2017). It further gives the casing of timelines a broader empirical foundation than do the narrative accounts usually heralded in the historical social sciences (Griffin, 1992; Bearman, Faris, and Moody, 1999; Ermakoff, 2019).
In the following, we provide a brief primer on frames of interpretation and turning points in media discourse, and we introduce the Swedish case study in relation to earlier large-scale studies of newspaper content. We then turn to the method itself and describe its implementation as a means of estimating predefined topics and their relations to one another over time. We present results for the Swedish newspaper corpus that highlight the interpretability of model outputs. In the concluding section, we discuss our insights into the Swedish media coverage of immigration over the past 75 years, and we ponder the degree to which text measures, drawn for example from the mainstream media as in our case, provide social sensors that can help us learn about trends in contemporary societies.
Frames and Turning Points
Frames concern how information is conveyed in communication, and how specific interpretations are promoted by relating one concept to other concepts, thereby linking new information to existing ideas and previous experiences (Gamson and Modigliani, 1989; Entman, 1993; Scheufele, 2000; Rawlings and Childress, 2021). As such, frames are “interpretive packages” (Gamson and Modigliani, 1989) that evoke particular perspectives and problem definitions through which objects in the social world can be seen and understood (Weaver, 2007; Gamson, 1992; Benford and Snow, 2000). Immigration, for example, might be interpreted through, among other lenses, a security frame or an economic frame. Individuals may have opposing opinions on immigration (e.g., “immigrants provide necessary labor” and “immigrants take our jobs”), but they can still agree to interpret immigration through a similar lens (e.g., the economy). Taken together, frames provide the cognitive contexts that speak to and activate the learned categories of individuals’ cognition (Lizardo, 2017; Wood et al., 2018; Hunzaker and Valentino, 2019; Cerulo, Leschziner, and Shepherd, 2021), and they organize cognition at a higher order of abstraction than do opinions, attitudes, or values (DiMaggio, 1997; Goldberg, 2011; Mohr et al., 2020).
In our application, we focus on how immigration has been framed in national news media, exploring the interpretations of immigration formulated by journalists and editors. In line with the idea that an interpretative frame can be viewed as an associative pattern, we operationalize media frames as associations between a focal theme and other topics. Media frames that frequently co-occur with the focal issue represent prominent logics for the issue’s interpretation. For example, one frame may connect immigration with issues of religion in order to highlight the cultural differences between natives and migrants, while another may connect immigration with party politics in order to promote a politicized perspective on immigration. The composition of salient frames at a certain time point aggregates into what we refer to as the shared interpretation of immigration communicated by the media.
In contests over sovereignty in interpretation (Swidler, 1986; Gamson, 1992; Benford and Snow, 2000), entrepreneurs of meaning—such as governments, political parties, advocacy groups, and media outlets themselves—are keen to obtain ownership of salient issues and to influence their shared interpretations (Andrews and Caren, 2010; Quinsaat, 2014; Tsur, Calacci, and Lazer, 2015; Farrell, 2016; Bail, Brown, and Mann, 2017). But how do publicly available interpretations change? Influential social science theorizing refers to “turning points” that constitute breaks with routine practices of meaning-making (Sewell, 1996; Abbott, 1997; Wagner-Pacifici, 2017). Turning points take shape in “unsettled times” (Swidler, 1986) or “periods of rupture” (Wagner-Pacifici, 2017) in which sequences of events occur that imply thresholds and shifts that are recognizable to contemporaries. In retrospect, we give names to these ruptures because they bring with them a series of occurrences that challenge established interpretations and “durably transforms previous structures and practices” (Sewell, 1996).
We use the concept of turning points that are grounded in, and operative on, publicly available interpretations to partition Sweden’s immigration discourse into recognizable eras. We estimate annual salience shifts in the composition of dominant frames over time to identify breakpoints in the media’s framing of immigration and to parse discursive periods during which meaning-making measurably differed.
The Swedish Newspaper Corpus in Context
The Swedish Newspaper Corpus 1945–2019, digitized by the National Library of Sweden (Börjeson et al., 2023), contains 75 years of journalistic content from the country’s four largest newspapers.
The news articles we study represent a broad mixture of different formats and political orientations (see Table 1). Newspapers divide their content into multiple stand-alone sections, e.g., op-eds, domestic politics, world news, culture, sports, and TV listings. We restrict our analysis to the front sections of each newspaper. We believe these sections contribute most to meaning-making in newspapers. Using the front sections leaves us with 29.3 million documents and 1.6 billion words after removing rare words and documents shorter than 15 words. The corpus consists of text blocks, i.e., units of cohesive text identified in the segmentation procedure during digitization. The segmentation relies on a rule-based approach curated by the Swedish National Library (using the software Zissor with ABBYY as the optical character recognition engine); there are different segmentation rules for each newspaper that are updated when newspaper layouts change (Dannélls, Johansson, and Björk, 2019). We use each text block as a document. Previous research (Hurtado Bodell, Magnusson, and Mützel, 2022) has shown that an article is commonly captured by multiple text blocks and, importantly, that only 16% of text blocks contain content from more than one article. See Supplemental Material Section S1 in the Appendix for more details on corpus creation.
Corpus Description, 1945–2019.
By comparison with earlier computational studies of archival text that have described national conversations based on sets of keyword-selected articles, our corpus comprises the complete front-section content of four national newspapers over 75 years.
While they have been innovative and carefully implemented, previous topic-model studies have relied exclusively on an inductive operationalization of meaningful frames that were detected as topics in articles identified as having a focus on immigration based on a keyword search. The inferred topics, and the sociological concepts they may represent, have been interpreted post hoc, after seeing the model outputs. In this article, we argue that this practice invites researchers to adapt the boundaries of theoretical constructs on the basis of model outputs rather than on what is suggested by theory. Because topics inferred by unsupervised topic models differ each time a model is estimated, this could create a situation in which the conceptualization of a theoretical construct changes with each model run. In our use case, a topic model may capture different aspects of the “immigration discourse” with each re-run. The use of seed words to anchor an immigration topic stabilizes inferences across model estimations. As we explain in the next section, the seeded topic model improves both replicability and interpretability and combines improvements in transparency with a more theoretically informed approach to detecting topics and topical associations.
Methods
For many in the social sciences, computational text analysis comes in two variants: supervised or unsupervised. Supervised methods rest on the researcher’s access to labels for meaning structures in text data, such as categories and a coding scheme, and then extrapolate these labels to unseen texts (Nelson et al., 2021; Chen et al., 2018; Lichtenstein and Rucks-Ahidiana, 2021; Do, Ollion, and Shen, 2022). By contrast, unsupervised methods infer information about language patterns, such as co-occurrences of words in documents, without drawing on predefined categories or coding schemes. A growing number of studies are using unsupervised methods to describe the cultural meanings of sociological concepts—such as class (Kozlowski, Taddy, and Evans, 2019), gender (Garg et al., 2018), race (Nelson, 2021b), stigma (Best and Arseniev-Koehler, 2023), and art (DiMaggio, Nag, and Blei, 2013). Unsupervised methods rely on algorithms that either trace the meaning of individual words—for word embedding models in recent sociological research see Kozlowski, Taddy, and Evans (2019); Nelson et al. (2021); Bonikowski, Luo, and Stuhler (2022); Voyer et al. (2022); Best and Arseniev-Koehler (2023)—or on algorithms that identify thematic structures in ensembles of text—for topic models see, e.g., DiMaggio, Nag, and Blei (2013); Karell and Freedman (2019); Bohr (2020); Greve et al. (2022).
Topic models or, more specifically, models based on Latent Dirichlet Allocation (LDA, Blei, Ng, and Jordan, 2003) represent an important class of unsupervised methods that inductively detect themes by learning the topics that are present in a document and the words that best describe them. LDA represents a generative probabilistic process that treats each document as a bag of words from which each word (token) is randomly drawn from a mixture of topics present in the document. The model then assigns each word in a document to a topic, allowing the same word to belong to various topics to a differing degree. Each topic, in turn, is a low-entropy distribution over words that tend to co-occur. This graded membership property aligns closely with our analytical aim of determining which co-occurring topics are most relevant for describing the shared interpretation (or framing) of an issue.
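As a minimal illustration of this generative process, the following Python sketch samples documents from an LDA model. The vocabulary size, topic count, document lengths, and Dirichlet hyperparameters below are arbitrary placeholders, not the settings used in our analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, D, N = 50, 3, 10, 40  # vocabulary size, topics, documents, words per doc
alpha, beta = 0.1, 0.01     # Dirichlet hyperparameters (illustrative)

# Each topic is a distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=K)          # shape (K, V)

documents = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K, alpha))           # this document's topic mixture
    z = rng.choice(K, size=N, p=theta)                 # topic assignment per token
    w = np.array([rng.choice(V, p=phi[k]) for k in z]) # token drawn from its topic
    documents.append(w)
```

Inference then runs this process in reverse: given only the tokens `w`, the sampler recovers the topic assignments `z` and the distributions `phi` and `theta`.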
As was mentioned above, unsupervised methods quantify what would otherwise be inaccessible, making the interpretive process that is always an important part of text analysis more transparent and systematic. However, unsupervised methods require post hoc operations to connect the model output to meaningful sociological concepts. Word embedding models, such as the one used by Kozlowski, Taddy, and Evans (2019), rely on vector algebra and focus on a set of manually selected keywords in order to identify interpretable dimensions of a concept. In applications that use LDA models, the standard practice employed to achieve interpretability involves qualitatively inspecting each inferred topic and making iterative decisions as to which topics are meaningful and relevant for inclusion in the final analysis (e.g., Törnberg and Törnberg, 2016; Karell and Freedman, 2019; Nelson, 2020; Czymara and van Klingeren, 2022). As a consequence, “sociologists using text as data must make a dizzying number of decisions about what information to extract and how to answer their research question” (Nelson, 2019: 139). While iterative mixed-method approaches such as “computational grounded theory” (Baumer et al., 2017; Nelson, 2020) or “computational hermeneutics” (Mohr, Wagner-Pacifici, and Breiger, 2015) are important for their exploratory potential and their links to existing qualitative methodologies, they remain reliant on making sense of the output after a model is learned (Goldenstein and Poschmann, 2019; Nelson, 2019; Pääkkönen and Ylikoski, 2021). Because the inductive finding of relevant sociological concepts places researchers at risk of also finding seemingly meaningful interpretations where none actually exist, calls have been made for the development and use of intrinsically interpretable models (Hurtado Bodell, Arvidsson, and Magnusson, 2019; Rudin, 2019; Madsen, Reddy, and Chandar, 2021).
Seeded Topic Model
We suggest an extension to the original topic model, the seeded topic model, which places informative priors on researcher-defined seed words to guide topic inference toward predefined themes of interest.
Allowing researchers to seed topics on the basis of existing domain knowledge constitutes an important step toward a more deductive, insight-oriented approach to modeling that is both less reliant on post hoc interpretations of model outputs (as are required in the unsupervised approach) and not restricted to a priori manually annotated categories or manually selected keywords (as are required in the supervised approach). Instead, the seed words help form topics around predefined concepts, names, or ideas, while at the same time utilizing the functionality of LDA to find new associations in text data based on word co-occurrences.
It is important to note that there is a crucial difference between the seed word strategy used here and the use of keyword searches to identify meaningful topics and identify documents that “belong” to or are most salient in relation to specific topics. Keyword search involves a deterministic procedure that requires detailed knowledge of the configuration of topics before models are run. Previous research shows that even domain experts perform poorly in identifying the keywords that are most relevant for capturing specific concepts (King, Lam, and Roberts, 2017). This results in biased text measures and differences in substantive conclusions. In contrast, seed words are only the starting point from which a model proceeds to learn which words go together. The unsupervised part of the algorithm will expand upon the original list of seed words in crystallizing topics of interest. We discuss the model and its implementation in detail in Supplemental Material Sections S2 and S4.
Previous contributions that have introduced seeded topic models using informative priors on preselected seed words (Lu et al., 2011; Jagarlamudi, Daumé, and Udupa, 2012; Fan, Doshi-Velez, and Miratrix, 2019; Eshima, Imai, and Sasaki, 2024; Watanabe and Baturo, 2024) relied on the standard collapsed Gibbs sampler as described in Griffiths and Steyvers (2004), limiting their applicability to large-scale data. By increasing scalability, and by using the model as a method for measuring sociological concepts, our implementation extends the existing methodological literature in important ways. Seeded topic models that are implemented via highly scalable parallelizable sampling (Magnusson et al., 2018) permit the extraction of predefined topics and their associations with other themes from massive text data. Even though we have used this highly specialized algorithm, the model estimation process based on our vast corpus took 4.5 days using a machine with 360 GB RAM and 32 cores. Without the specialized algorithm, our analysis would not have been possible. See Authors’ Note for information about the code and data that reproduces our analysis.
Seeding the Immigration Topic
Seeded topic models rely on Bayesian informative priors to decide which topics the algorithm should identify. In practice, informative priors are placed on the topic-word distribution such that a word used to guide the model has a zero probability of belonging to any other topic than the one for which it is a seed word. The seed words one uses to guide the model should be highly unlikely to occur in contexts outside the topic of interest—in our case, immigration. We use five types of words that are highly unlikely to be used in texts that do not relate to immigration: (i) names of immigration laws, (ii) titles of ministers responsible for immigration, (iii) names of agencies responsible for immigration, (iv) terms referring to related policy areas (e.g., integration policy), and (v) terms referring to different types of immigration (e.g., labor migration). Moving beyond the predefined seed words, the model learns other meaningful words that define the topic of interest. Among these, we find words that relate, for example, to race and ethnicity, such as names and slurs associated with minorities in Sweden (see Supplemental Material Section S7 for details). Our choice of seed words allows us to capture different dimensions of the immigration issue including, for example, discourses on different types of migrants such as refugees, asylum seekers, and labor migrants.
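The prior construction can be sketched as follows. The toy vocabulary and seed set are hypothetical stand-ins for the full corpus vocabulary and the five seed-word categories listed above: each seed word receives prior mass only in the topic it seeds, giving it zero probability of belonging to any other topic.

```python
import numpy as np

# Hypothetical vocabulary and seed set; the real model uses the full corpus
# vocabulary and the seed-word categories described in the text.
vocab = ["aliens_act", "migration_agency", "labor_migration", "economy", "football"]
seed_words = {"aliens_act", "migration_agency", "labor_migration"}
K, seeded_topic, beta = 4, 0, 0.01  # topics, index of the seeded topic, prior mass

# Start from a symmetric prior on the topic-word distributions, then zero out
# each seed word's prior mass in every topic except the one it seeds: the
# sampler can then never assign that word to any other topic.
prior = np.full((K, len(vocab)), beta)
for j, word in enumerate(vocab):
    if word in seed_words:
        prior[:, j] = 0.0
        prior[seeded_topic, j] = beta
```

Non-seed words keep their symmetric prior, so the topic is free to absorb additional vocabulary that co-occurs with the seeds.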
Seeding also allows the model to be infused with a priori knowledge of language change. Conceptually, actors, meanings, and contexts change over time, which implies that no single measure of discourse may be appropriate over long timescales. Lexical shifts and the changing meanings of social categorizations are critical challenges to the computational analysis of historical text (Bail, 2014; Rule, Cointet, and Bearman, 2015; Bonikowski, Luo, and Stuhler, 2022; Voyer et al., 2022). The word “immigrant,” for example, was rarely used prior to the 1970s (“foreigner” was the term of the day), and concepts such as “family reunification” and “unaccompanied minor” first appeared in the 1970s and 1990s, respectively. We implement the semi-supervised seeded topic model using domain knowledge to guide the model estimation over language changes that introduce new words to discuss the same topic. Topic seeding is best equipped to handle this type of language change that, in a standard modeling approach, would lead to the splitting of a theme into various topics. A previous name of the current Migration Agency (
We measure the salience of the immigration topic (Figure 1B) by calculating the proportion of words in all documents that are estimated to belong to the seeded immigration topic each week.
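Assuming access to per-token topic assignments (the data format below is an illustrative assumption, not the output format of our implementation), the weekly salience measure amounts to a simple ratio:

```python
from collections import defaultdict

def weekly_salience(docs, immigration_topic):
    """Share of all tokens per week assigned to the seeded immigration topic.

    `docs` is a list of (week, topic_assignments) pairs, where
    topic_assignments holds the estimated topic id of each token in one
    document (hypothetical format, for illustration).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for week, z in docs:
        hits[week] += sum(1 for t in z if t == immigration_topic)
        totals[week] += len(z)
    return {week: hits[week] / totals[week] for week in totals}
```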
Co-occurring Topics as Interpretative Frames
The seeding strategy also permits us to define a set of additional topics that meaningfully co-occur with immigration and that we wish to flesh out from the media discourse as potential interpretations of immigration. We operationalize prominent media frames via the focal topic’s associations with other frequently co-occurring topics, and we interpret these relationships as culturally shared associations between concepts. This implies that we abstract away from word-level analyses, such as keyword in context, and instead focus on how topics (rather than words) co-occur. In our analysis, it is not crucial whether the word “immigrant” is discussed alongside words such as “workplace” or “murder”; what matters instead is the association of the immigration topic with the economy topic and the crime topic, respectively.
We have predefined co-occurring topics on the basis of existing research on the common themes found in European news reporting on immigration (Korkut et al., 2013; Greussing and Boomgaarden, 2017; Eberl et al., 2018; Heidenreich et al., 2019) and research documenting Sweden’s immigration history (Geddes and Scholten, 2016; Byström and Frohnert, 2017; Krzyżanowski, 2018; Andersson et al., 2010). Based on this research, we expect five dominant frames—“culture,” “economy,” “human rights,” “politics,” and “security”—to co-occur with discussions of immigration. We capture each frame that represents a known interpretation of immigration by seeding several topics (Table 2). We seed multiple topics to capture each frame such that an interpretative frame can be viewed as a “supratopic” covering different dimensions of a related issue. For example, “crime,” which constitutes part of the security frame, is a highly diverse issue that includes a focus on offenses such as burglary, narcotics, murder, and sexual assault, to name only a few. To capture the many different crime-related aspects, we seed four different topics using the same set of seed words (see Supplemental Material Sections S2 and S3 for details). By seeding different topics with the same words we allow the model to crystallize around particular dimensions of a broader theme of interest in separate topics without explicitly having to choose these dimensions a priori. For example, while we know that “crime” is a multi-dimensional theme in our corpus (e.g., news covering different types of crimes at different phases in an investigation will be defined by different vocabularies), we let the model inductively find which type and aspect of crime should form a particular topic. One seeded topic then becomes a drug topic, for example, one becomes a homicide topic, and so on, and these are then combined into the larger topic of crime. 
This procedure allows the model to identify more specialized topics which, depending on the research question, can then be combined into a well-defined larger topic. We set the number of topics to 1,000, allowing for a combination of seeded and unseeded topics in the model.
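The aggregation of several seeded topics into one frame, or “supratopic,” can be sketched as follows; the topic ids and frame composition are hypothetical examples, not our actual topic configuration.

```python
def frame_salience(topic_shares, frame_topics):
    """Aggregate per-topic word shares into per-frame shares.

    `topic_shares` maps topic id -> share of words assigned to that topic;
    `frame_topics` maps a frame name to the ids of its seeded topics
    (e.g., the four crime topics feeding the security frame). All ids
    here are illustrative.
    """
    return {frame: sum(topic_shares.get(t, 0.0) for t in topics)
            for frame, topics in frame_topics.items()}
```

Because each frame is just the sum of its component topics, the model can crystallize specialized topics (drugs, homicide, and so on) while the analysis still operates at the level of the broader frame.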
Seeded Topics Reflecting Frames of Immigration.
Unlike previous research, we quantify interpretative frames using co-occurrence frequencies for different topics that are inferred from the same topic model that simultaneously measures the focal topic of interest. We measure the importance of each frame (Figure 2) in terms of the proportion of words that belong to the respective seeded topics in immigration-rich documents printed in the newspapers (see Supplemental Material Section S5).
Document Inclusion, Sensitivity, and Validation
The analysis includes all documents that we classified as “immigration-rich”: documents in which at least 2.5% of the tokens were estimated to belong to the immigration topic (i.e.,
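The inclusion rule amounts to a simple threshold on the document-level share of immigration tokens; the per-token assignment format in this sketch is an illustrative assumption.

```python
def is_immigration_rich(topic_assignments, immigration_topic, threshold=0.025):
    """Classify a document by the share of its tokens assigned to the
    seeded immigration topic (2.5% threshold, as in the main analysis)."""
    if not topic_assignments:
        return False
    share = (sum(1 for t in topic_assignments if t == immigration_topic)
             / len(topic_assignments))
    return share >= threshold
```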
We report on model diagnostics and sensitivity analyses in Supplemental Material Section S6, including (i) a test for model convergence, as well as model re-runs (ii) using alternative numbers of topics (950 and 1,500), (iii) using each newspaper corpus separately, (iv) using alternative thresholds for document inclusion (1%, 4%, and 5%), and (v) using random subsets of 90%, 80%, and 70% of the original set of seed words.
In Supplemental Material Section S7, we report on validation strategies for topic definition that evaluate the degree to which a seeded topic captures the concept of interest. Those strategies include (i) a comparison of documents classified as being about immigration with a manual annotation of a sample of documents, (ii) an inspection of the tokens that the algorithm learned to belong to the topic, and (iii) an analysis of influential immigration-related events based on high temporal resolution data. The latter analysis tests whether the model picks up on immediate changes in newspapers’ framing following such events. We focus on events for which clear theoretical expectations exist about their likely impact on the salience of a particular seeded frame. An Islamist terrorist attack, for example, may serve to re-frame Islam as a violent ideology, leading to revisions of the current security-related interpretations of immigration (Greenberg, Pyszczynski, and Solomon, 1986; Legewie, 2013; Schmidt-Catran and Czymara, 2020). In this case, we would expect the relative salience of the security-related frame to increase in the weeks following the attack—indicating valid topic seeding.
Parsing Discursive Eras
We use a Bayesian Gaussian change-point model (Barry and Hartigan, 1993; Erdman and Emerson, 2007) to detect shifts over time in the salience of single frames as well as in the relative composition of salient frames. We interpret salience shifts as breakpoints in the media’s framing of immigration. The model assumes that a time series of frame salience can be partitioned into an unknown number of periods, with each period having a constant mean reflecting a “new probability regime” (Abbott, 2001). We estimate two kinds of specifications of the change-point model: (i) A univariate specification that tests for breakpoints in the salience of each of the five seeded frames separately, and (ii) a combined multivariate specification that tests for breakpoints in the relative composition of all five seeded frames. We are particularly interested in the multivariate model results. The composition of salient frames at a certain point in time aggregates into what we refer to as the shared interpretation of immigration communicated by the media. A shared interpretation describes a set of frames that are available to the public at a given point in time to make sense of an issue. The estimates of the change-point model provide an empirical foundation for the parsing of discursive periods (Rule, Cointet, and Bearman, 2015) in which meaning-making measurably differed.
The model, regardless of its specification as univariate or multivariate, estimates the posterior probability that each year constitutes a change point, delimiting sharp differences in the means of the respective time series in adjacent periods. That is, the model estimates the likelihood that a significant shift has occurred in the way the newspapers frame immigration in each of the 75 years included in the data. We use a standard implementation of the model (Erdman and Emerson, 2007), and we set the model’s hyperparameter
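Our analysis relies on the implementation of Erdman and Emerson (2007) in R. The Python sketch below is not that estimator; it is a simplified illustration of the underlying intuition, scoring each candidate year by how sharply the mean of a salience series differs before and after that year.

```python
import numpy as np

def mean_shift_scores(series):
    """For each interior index, a z-like score for a mean shift there.

    A simplified stand-in for the Bayesian change-point model used in the
    paper: high scores flag years where average frame salience before and
    after the candidate year differs sharply relative to its variability.
    """
    y = np.asarray(series, dtype=float)
    n = len(y)
    scores = np.zeros(n)
    for t in range(2, n - 2):  # need at least two points on each side
        left, right = y[:t], y[t:]
        pooled = np.sqrt(left.var(ddof=1) / len(left)
                         + right.var(ddof=1) / len(right))
        scores[t] = abs(right.mean() - left.mean()) / (pooled + 1e-12)
    return scores
```

The full Bayesian model instead averages over all possible partitions of the series, yielding a posterior change-point probability per year rather than a point score.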
Results
Figure 1B traces the relative salience of the seeded immigration topic in Sweden’s newspaper corpus from 1945 to 2019. The blue line represents the annual average salience of immigration and shows how important this issue was in the media. Prior to the first major peak in the number of immigrants in 1970, the level of media attention focused on immigration was low. On average, 0.05% of tokens in the newspapers referred to it. By contrast, from 2015 to 2019, the salience of immigration as a news issue reached 0.37%, a 7.4-fold increase vis-à-vis the first period. Both the actual number of immigrants arriving in Sweden (Figure 1A) and the importance of the immigration topic in newspaper coverage (Figure 1B) reached unprecedented heights in 2015. The year of the European “refugee crisis” represents a clear disruption in terms of the salience of immigration. Salience also spiked during 1969–1970, which were years of high labor migration, and during the armed conflicts in Iraq (1990–1991, 2003–2011) and Bosnia (1992–1995), which resulted in many refugees arriving in Sweden. The linear correlation between the annual number of newly arrived immigrants and the salience of the immigration topic is 0.82 for the entire period examined; this correlation increases to 0.93 from 2010 to 2019. These results show that the attention of the media shifts to immigration in periods of peak influx, particularly if immigrant numbers increase rapidly.
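The arithmetic behind these figures is straightforward to reproduce. The sketch below recomputes the fold-change reported in the text and shows a plain-Python Pearson correlation of the kind used to relate immigrant arrivals to topic salience; the two short series are hypothetical stand-ins for the annual data:

```python
# Reproducing the fold-change arithmetic from the text, plus a Pearson
# correlation between arrivals and topic salience. The two series below
# are hypothetical, for illustration only.
from math import sqrt

fold_change = 0.37 / 0.05  # salience 2015-2019 vs. pre-1970 baseline

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

immigrants = [10, 12, 30, 80, 160]          # hypothetical arrivals (thousands)
topic_share = [0.05, 0.06, 0.10, 0.20, 0.37]  # hypothetical salience (%)
r = pearson(immigrants, topic_share)
```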

(A) Annual number of immigrants (in thousands) arriving in Sweden. (B) Annual average salience of the immigration topic in Sweden’s four major newspapers (blue line). Data points represent the percentage of all words in a given week’s news articles that are estimated to belong to the immigration topic.

(A) The evolution of media frames of immigration. The Y-axis represents the salience proportion of the five seeded topics that frequently and meaningfully co-occur with the “immigration” topic. The salience proportions of these five frames sum to 1 in each year, and trajectories represent 5-year moving averages. The dashed vertical lines indicate the beginning and ending of inferred eras. (B) The likely turning points in the framing of immigration. Colored trajectories represent the univariate posterior distribution of potential change points per media frame. The black trajectory represents the multivariate posterior distribution of potential change points in the composition of frames, which constitutes our measure of the shared interpretation of immigration. The background colors highlight the seven periods implied by the model.
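The two transformations described in the caption, normalizing annual frame saliences so that the five frames sum to 1 each year, and smoothing each trajectory with a 5-year moving average, can be sketched as follows. The raw salience values are hypothetical, and the centered moving average with shrinking edge windows is one plausible implementation, not necessarily the exact smoother used by the authors:

```python
# Sketch of the caption's transformations: per-year normalization of frame
# saliences to proportions, and 5-year moving-average smoothing.
# Raw values are hypothetical, for illustration only.

def to_proportions(frames_by_year):
    """frames_by_year: dict year -> dict frame -> raw salience.
    Returns the same structure with each year's values summing to 1."""
    out = {}
    for year, frames in frames_by_year.items():
        total = sum(frames.values())
        out[year] = {frame: s / total for frame, s in frames.items()}
    return out

def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        seg = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(seg) / len(seg))
    return smoothed

raw = {1945: {"humanitarian": 0.6, "economy": 0.2, "culture": 0.1,
              "security": 0.05, "politics": 0.05}}
props = to_proportions(raw)
```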
Relative topic salience provides an important measure of
Immediately following the war, the media discourse portrayed immigration mainly from a humanitarian perspective (Figure 2A). As this association became less prominent, we find likely univariate change points in the humanitarian interpretation and, to a lesser degree, in the cultural interpretation of immigration during the late 1940s and early 1950s (Figure 2B).
We estimate the first turning point, with a 96% posterior probability in the multivariate model, as occurring in 1955. This year was characterized by a surge of labor migration to Sweden. At the end of the second period, in the mid-1960s, the association between immigration and the economy had caught up with the humanitarian perspective. Both inferred periods 1 and 2 of post-war immigration align with historical accounts that partition Sweden’s immigration history on the basis of immigration flows and policy changes (Geddes and Scholten, 2016; Byström and Frohnert, 2017; Krzyżanowski, 2018; Kupskỳ, 2017; Andersson et al., 2010; Svanberg and Tydén, 1998).
Our model identifies a period of rupture in the mid-1960s—which coincides with the first discussions of multiculturalism (1964) and investigations into the costs of immigration for the expanding welfare state (1965). In the immediate aftermath of these discussions and investigations, the dominant interpretation of immigration became economic, and a cultural framing gained importance. These ruptures, with multivariate change-point probabilities of 95% in 1964 and 70% in 1966, mark the beginning of a long era of relative stability in the associative patterns. Rapid economic growth and the political hegemony of the Social Democratic party resulted in the roll-out of the welfare state, which was extended in 1968 to cover migrant workers, and a newly established migration board was tasked with overseeing their employability. Again, the inferred period is largely in alignment with the narrative presented by historical social science (Byström and Frohnert, 2017; Krzyżanowski, 2018).
We infer turning points in 1974 (70%) and 1986 (77%). Labor migration declined during the economic crises of the 1970s and was increasingly replaced by immigration involving non-European refugees. The univariate breakpoint for culture in 1984 coincides with the arrival of increasing numbers of non-Western refugees, discussions of legislation against ethnic discrimination, and increased efforts focused on integration, including family reunification (Byström and Frohnert, 2017; Andersson et al., 2010).
It was in 1986 that the Swedish Prime Minister, Olof Palme, was murdered. Spearheaded by Palme’s governments (1969–1976, 1982–1986), immigration law had embraced multicultural ideals, affirming diversity and the protection of immigrants’ cultural identities. Despite the turning point identified in 1986, the media framing of immigration remained remarkably stable across periods 4 and 5, and we interpret the interval 1974–1999 as representing Sweden’s famed era of tolerance (Schierup and Ålund, 2011; Rydgren and van der Meiden, 2019), during which an inert mix of economic, humanitarian, and security-related frames shaped the interpretation of migration for almost a generation. This interpretation weathered economic downturns, peaks in immigration, and Sweden’s accession to the EU in 1995, and remained dominant until the end of the 1990s—which is much longer than the historical narrative suggests (Dahlström, 2004; Byström and Frohnert, 2017; Svanberg and Tydén, 1998). At the same time, the turning points we identify in this era are disproportionately driven by an increase in a new, politically polarized understanding of immigration. Notably, this upward trend in the politicization of immigration precedes the electoral success of populist far-right parties and the decline in the Social Democratic consensus that have characterized Swedish policy debates in recent decades (Dahlström, 2004; Byström and Frohnert, 2017).
Our analysis identifies the year 2000 as a consequential turning point (84%) driven by politicization. This was a year of revisions to immigration law, when the EU started to harmonize its immigration policies in the lead-up to the Schengen agreement (2001), which led to an increase in the number of migrant workers arriving in Sweden from the eastern countries of the EU. We find that a further convergence of media frames and, ultimately, their gradual replacement by politics as the dominant lens through which immigration is viewed, coincided with the populist right Sweden Democrats’ entry into parliament in 2010. The Sweden Democrats have since become the country’s second-largest party in national elections. Several years are associated with non-zero change-point probabilities for specific frames, but none of these are particularly pronounced and we do not find them to be sufficiently consequential to register in the model as having altered the interpretation of immigration. Throughout this period, and despite the September 11 attacks and the subsequent US-led “war on terror,” the association between Swedish immigration and security issues remained flat.
The final turning point that we estimate to lie above the 50%-threshold (51%) occurred in 2013. This disruption, which is less clear than those described above, marks the beginning of the most recent discursive era. This period included generous revisions of asylum law. At the same time, the consensual migration politics of past decades, which some have argued cemented an “opinion corridor” of views perceived as socially acceptable (Ekengren Oscarsson, 2013), were increasingly being criticized in society at large. This period reflects a further politicization of the immigration discourse, a surge in a security-related interpretation, and probably also the end of Sweden’s “exceptionalism” (Schierup and Ålund, 2011; Rydgren and van der Meiden, 2019) as regards the country’s tolerant approach to immigration. Our results indicate that this reinterpretation of immigration started well before the 2014 general election (in which the Sweden Democrats doubled their number of seats in parliament) and, most importantly, before the 2015 “refugee crisis.” Neither of these years was sufficiently consequential to register in our change-point model. Strikingly, we instead see that the 2015 “refugee crisis,” which many observers have classified as a watershed in European immigration history, was of little consequence for the ways in which the Swedish media have portrayed immigration.
In Supplemental Material Section S6, we report these results separately per newspaper. We find that the framing of immigration over time varies little between newspapers of different political orientations or between highbrow broadsheets (
Discussion
We have argued that the seeded (or constrained) topic model constitutes a promising semi-supervised method—combining both inductive and deductive reasoning—that provides a more replicable and transparent means of measuring meaning in digital text. Semi-supervised methods can improve transparency and replicability by decreasing the number of idiosyncratic decisions made during model implementation. Importantly, the seeded topic model permits a theoretical grounding of the topic definition procedure, because seed words require researchers to be explicit about how concepts are operationalized, and these constraints ensure that the model will identify the same concepts in each model run. This approach represents an advance in relation to concerns about whether computationally identified patterns can provide replicable and interpretable empirical evidence that is relevant to social science research. The seeding procedure allows researchers to tame the unsupervised nature of the topic model by guiding the model in its detection of topics, but without predetermining the full vocabulary associated with the topics identified. We have demonstrated the applicability of one specific algorithm to the task of identifying predefined, sociologically relevant concepts in texts and inferring the associations that exist between these concepts.
Model performance should be validated to ensure that the seeded topics represent the concepts of interest, and model validation still requires subjective interpretations of topic quality. To be sure, choosing seed words may be an iterative process, based on interpretations of model outputs and allowing previously unknown patterns to arise from the data. Such iterative processes are essential in most research that employs computational text analysis (Grimmer, Roberts, and Stewart, 2022), and as Mohr and colleagues have noted, “there can be no measurement of culture without interpretation” (Mohr et al., 2020: 4). Against this backdrop, we have taken important steps toward a more principled interpretation of topic models. First, identifying both a focal concept and its neighboring topics in a single estimation—instead of first identifying the relevant documents that contain the focal concept and then searching for other concepts within these documents—ensures that the analysis is less reliant on early operationalization decisions. One-step procedures are particularly important for producing reliable measures of meaning-making over long timescales, where measurement may be affected by language change.
Second, seeding facilitates diagnostics of model performance, something that is typically difficult in purely unsupervised settings (Chang et al., 2009; Ying, Montgomery, and Stewart, 2022). The semi-supervised nature of the model allows us to restrict validation efforts to the seeded topics. This is particularly important because there are currently no standards regarding how topic models should best be evaluated when used in sociological research. In the Appendix (Supplemental Material Section S7), we suggest various measures that will assist in inspecting the quality of seeded topics, and we found a high level of correspondence when we compared a manually coded sample of documents with documents inferred by the model to belong to a seeded topic. Additionally, we have checked the sensitivity of our results regarding the number of topics, seed word selection, and different thresholds for document inclusion (Supplemental Material Section S6).
In a supplementary analysis also reported in Supplemental Material Section S7, we provide suggestive evidence that unforeseen and widely recognized events have the capacity to measurably shift the salience of certain media frames. These results illustrate another validation strategy that tests whether the model picks up on shifts in the salience of the frame most closely related to the event in question. The results lend support to the validity of our semi-supervised inference of interpretative frames, and they provide pointers to the immediate response of newspapers to disruptive events. The event-focused analysis of high temporal resolution data also illustrates how—under certain assumptions—latent features of text data can be used as the outcome variable when estimating causal effects (Egami et al., 2022; Gencoglu and Gruber, 2020).
Of course, seeded topic models also have their own limitations. Current applications of the original topic model focus on discovering previously unknown patterns in text data (Grimmer, Roberts, and Stewart, 2022). The seeding of topics places bounds on an open discovery process. One solution (which we followed in our case study) involves allowing for a combination of seeded and unseeded topics in the model such that unexpected signals in the data can still be detected and explored. The applicability of the seeded topic model depends on how well researchers can operationalize a theoretical concept via one or more topics. A seeded topic model can easily identify some concepts, depending on the availability of unique words associated with the theme of interest. Other concepts are nearly impossible to pin down, however. For example, the model will struggle to capture a topic that is mostly defined by polysemic words, i.e., words with different possible meanings. To tackle issues with polysemy, researchers can seed multiple topics with the same words—as we did, for example, for the multifaceted crime topic—and thereby rely on the model to inductively capture their different meanings. While this may solve issues related to polysemy, it also decreases the replicability of the model. Therefore, finding non-polysemic words to crystallize interpretable topics of interest poses an important scope condition and, in some potential use cases, a roadblock to making full use of the seeded topic model. At the same time, however, vague and multifaceted themes that are difficult to identify using a seeded topic model may also present challenges to supervised methods that require human annotation.
Large language models (LLMs), which increasingly find their way into social science publications, also blur the line between supervised and unsupervised learning. LLMs have shown great capacity in a vast array of classification tasks (Do, Ollion, and Shen, 2022; Widmann and Wich, 2023; Bonikowski, Luo, and Stuhler, 2022; Chae and Davidson, 2023; Gilardi, Alizadeh, and Kubli, 2023; Törnberg, 2023), although current models’ performance is still under debate (e.g., Ollion et al., 2024; Bail, 2024), especially in classification tasks that require cross-document reasoning as in topic modeling and when texts pertain to a particular place and time as in historical corpora (Ziems et al., 2024). The development of LLMs proceeds at an extremely fast pace. Decreasing costs will open them up for analyses of very large corpora, and ideas of identifying, in principled ways, concepts predefined by the researcher will hopefully guide some of the modeling advances. If researchers find ways to gain more control over labeling, replicability, and transparency (Grossmann et al., 2023), this transformative brand of text modeling will be in a good position to develop important alternatives to the seeded topic model.
We have applied the seeded topic model to a vast newspaper archive to learn how the issue of immigration has been framed in Swedish newspapers from 1945 to 2019. The storytelling of journalists—their use of interpretative frames to make news events understandable to their audiences—makes newspaper archives a treasure trove for the study of meaning-making over historical timescales. We have operationalized frames as themes that frequently co-occur with the issue of interest, and we have interpreted these relationships as culturally relevant associations between concepts. Hence, we have also studied newspaper coverage as a social sensor of discursive processes (Fiss and Hirsch, 2005; Gamson and Modigliani, 1989) in which broader interpretations of societal developments and events are generated, negotiated, and revised (Swidler, 1986; Bourdieu, 1991; Strauss and Quinn, 1997). Viewing text as a social sensor involves the use of large repositories of digital text to uncover latent observations about the social world and trends in contemporary societies in particular.
Some have argued that media content reflects elite discourses and that a media sensor can capture “common cultural patterns, but it cannot observe what is never articulated” (Bonikowski, 2016). We recognize that media-generated perceptions of current events do not equate to the perceptions of the whole population, especially not with regard to polarized “hot” topics and in the age of social media. We have not measured meaning at the individual level, and we have not delineated different “thought communities,” although they no doubt exist, particularly in a politicized domain such as immigration. One example would be that different segments of society may have different groups in mind when they think about immigrants (Blinder, 2015; Eberl et al., 2018). Still, our case study has demonstrated that vast corpora of the type and scale studied here are likely to contain important evidence of the dominant interpretative frames—in the sense of “common cultural patterns”—that have been used to make sense of societal issues at a certain point in time. We believe that using such sensors may have general implications for sociological research in light of the increasing availability of “found” online data (e.g., Keuschnigg, Lovsjö, and Hedström, 2018; Salganik, 2018; Jarvis, Keuschnigg, and Hedström, 2021).
We have highlighted the induction of different eras of meaning-making as a potential means of analyzing the output of seeded topic models, offering a refined empirical foundation for the parsing of “discursive periods” during which specific interpretations of an issue are widely shared. Historians often define “eras” of social change on the basis of policy shifts (Ermakoff, 2019), and—for immigration history—many have viewed key revisions of immigration law as turning points demarcating different eras (Andersson et al., 2010; Geddes and Scholten, 2016). However, historical narratives that partition the flow of events into coherent, meaningful sequences (Stone, 1979; Sewell, 1996) have been criticized for their lack of explanatory depth and, in particular, for involving a risk that spurious events will be identified as marking the beginning and end of posited periods (Popper, 1957; Griffin, 1992). Our study exemplifies that digital archives offer new opportunities for the identification of turning points and for delineating discursive periods on the basis of the ideas expressed by contemporaries (Bearman, 2015; Rule, Cointet, and Bearman, 2015; Garg et al., 2018).
Our measures of media framing are in close alignment with the type of immigration experienced in post-war Sweden until the mid-1970s. The inferred discursive periods match those implied by historical accounts that have partitioned Sweden’s immigration history on the basis of policy changes (Andersson et al., 2010; Geddes and Scholten, 2016; Kupskỳ, 2017). We found that the texts from the late 1970s and early 1980s best describe the country’s signature era of multiculturalism and tolerance toward immigration. Different frames achieved similar salience, indicating a new pluralism in how immigration has been discussed. Weathering economic downturns and peaks in immigration, this era lasted until the end of the 1990s—and thus much longer than historical accounts have suggested (Dahlström, 2004; Svanberg and Tydén, 1998). At the same time, we found that the media began framing immigration as a political issue as early as the mid-1970s—long before anti-immigration platforms started attracting larger audiences and before the parliamentary consensus on immigration eroded in the mid-to-late 1980s (Byström and Frohnert, 2017). As the political framing of immigration gained momentum, we were once again able to see a more unidimensional discussion of migration—now as a strongly politicized issue.
We have also found that seemingly obvious turning points—such as the economic downturns of the 1970s and 1990s, and the “refugee crisis” of 2015—had few consequences for the frames used by the news media to portray immigration in Sweden. However, the public might frame things differently from the mainstream media, and future research is therefore needed to examine how broader segments of society, e.g., the online public, react to highly publicized events.
To conclude, seeded topic modeling provides a means whereby researchers can rely on sociological knowledge when implementing and validating replicable models that make inferences beyond the words on the page. Semi-supervised approaches of this kind could become an important next step toward further improving the work of social scientists in their computational analysis of social data.
Supplemental Material
Supplemental material, sj-pdf-1-smr-10.1177_00491241241268453 for Seeded Topic Models in Digital Archives: Analyzing Interpretations of Immigration in Swedish Newspapers, 1945–2019 by Miriam Hurtado Bodell, Måns Magnusson and Marc Keuschnigg in Sociological Methods & Research