Abstract
Keywords
Introduction
In recent years, an increasing number of sociologists have embraced machine learning algorithms to infer latent patterns in text data (e.g., DiMaggio, Nag, and Blei, 2013; Mohr et al., 2013; Rule, Cointet, and Bearman, 2015; Bail, 2016; Nelson, 2020, 2021a,b; Kozlowski, Taddy, and Evans, 2019; Goldenstein and Poschmann, 2019; Bail, Brown, and Mann, 2017; Karell and Freedman, 2019; Wu, Wang, and Evans, 2019; Bohr, 2020; Taylor and Stoltz, 2020; Stoltz and Taylor, 2021; Arseniev-Koehler et al., 2022; Bonikowski, Luo, and Stuhler, 2022; Boutyline, Arseniev-Koehler, and Cornell, 2023; Best and Arseniev-Koehler, 2023). One suite of algorithms, unsupervised topic models (Blei, Ng, and Jordan, 2003; Griffiths and Steyvers, 2004; Blei, 2012), infers linguistic themes based on word co-occurrences. Topic models have been found to resonate well with sociological ideas about how people create meaning and make sense of the social world by linking themes to other concepts and ideas (DiMaggio, Nag, and Blei, 2013; Mohr et al., 2013; Törnberg and Törnberg, 2016; Fligstein, Stuart Brundage, and Schultz, 2017; Nelson, 2020). This article addresses a central limitation of topic models: while they are suited to inductive research that identifies emergent themes from document collections, they fare poorly at identifying, in transparent and replicable ways, specific concepts predefined by the researcher. Topic models, and unsupervised methods more generally, rely on post hoc analysis to make sense of the output in light of sociological theory, opening up an old rift between inductive and deductive research within the discipline. 
As computational text analysis has matured as a methodology in the sociological toolkit, calls have been made for an important next step: to move beyond the implementation of standard models and to strive to apply specialized models that are more transparent, replicable, theory-driven, and interpretable, and thus more attuned to the central demands of social science research (DiMaggio, 2015; Nelson, 2019; Mohr et al., 2020; Pääkkönen and Ylikoski, 2021; Nelson, 2021b; Grimmer, Roberts, and Stewart, 2022; Bonikowski and Nelson, 2022).
We contribute further to this debate and argue for the use of semi-supervised text analysis. We focus on the seeded topic model, a semi-supervised extension of the standard topic model in which the researcher supplies seed words.
The seeding crystallizes topics around predefined words that describe themes of interest. We use the term “topics” to refer to model output, and we use “themes,” “issues,” and “frames” when referring to theoretical concepts. Seed words require researchers to be explicit about how a concept is operationalized, and seeding is one way to constrain the model to search for specific themes of interest. Seeding can also increase the robustness of computational text analysis to language change, an endemic challenge when analyzing text archives of historical timescales (Bearman, 2015; Rule, Cointet, and Bearman, 2015; Voyer et al., 2022; Bonikowski, Luo, and Stuhler, 2022). By identifying associations between a focal topic and other topics with which it frequently co-occurs, the model can detect widely shared interpretations (or frames) associated with the theme in question. These model features provide an attractive complement to the mixed-methods approaches (e.g., DiMaggio, 2015; Karell and Freedman, 2019; Nelson, 2019, 2020) that are currently being discussed as a way of bringing computational text analysis into sociological research.
One strength of the topic model approach is to allow for words’ mixed memberships in topics. Our use of the seeded topic model, however, aims at measuring clearly defined and interpretable topics, which we will achieve by using seed words that we believe to have a single, very clear meaning. Seeding will work less well if one starts from polysemic words, i.e., words with multiple meanings, or if one tries to seed a polysemic topic altogether. While the words associated with the seeded words within a given topic are also allowed to emerge from the data, forced monosemy is a limitation of our approach that will hinder its applicability to certain use cases.
Seeded topic models have been around for a decade and have more recently become available in general-purpose programming languages such as R (Watanabe, Xuan-Hieu, and Watanabe, 2022) and Python (Anoop and Asharaf, 2017). However, strong computational requirements and limitations in the scalability of off-the-shelf implementations (Lu et al., 2011; Jagarlamudi, Daumé, and Udupa, 2012; Fan, Doshi-Velez, and Miratrix, 2019; Eshima, Imai, and Sasaki, 2024; Watanabe and Baturo, 2024) have hampered their application in sociology. We discuss a scalable implementation for big text data (Magnusson et al., 2018) that removes previous bottlenecks and that we hope will make the algorithm attractive to a broader sociological audience. We illustrate the method using an important case study that measures the ways the media have framed immigration in a Swedish newspaper corpus spanning 75 years. The corpus, one of the most extensive ever analyzed in the social sciences, contains 30 million text blocks from more than 100,000 editions of the country’s four national newspapers from the period 1945–2019.
Our study connects to a long tradition of sociological research studying newspaper discourses (e.g., Gamson and Modigliani, 1989; Marx Ferree, 2003; Koopmans and Olzak, 2004; Fiss and Hirsch, 2005; Janssen, Kuipers, and Verboord, 2008; Bail, 2012; Shor et al., 2015). Previous immigration-related research has relied on corpora comprising between a few thousand and 130,000 articles, which have typically been assembled using keyword searches, and which have spanned time frames of between 1 and 14 years (Helbling, 2014; Lawlor and Tolley, 2017; Greussing and Boomgaarden, 2017; Heidenreich et al., 2019; Czymara and van Klingeren, 2022). The largest studies to date have included 850,000 articles in six European languages (Eberl and Galyga, 2021) and 850,000 immigration-related headlines from UK newspapers (Bleich and van der Veen, 2021). Compared to past snapshot corpora, our data are vast and—in combination with a scalable algorithm—permit a fine-grained mapping of the newspaper discourse on immigration over 75 years.
Using the corpus described above, we map how shared interpretations of immigration have evolved over time. We operationalize interpretative media frames as associations between a focal topic and other topics, estimating the co-occurrence patterns of predefined themes (combining “immigration” with, e.g., “the economy,” “culture,” or “security”). Issues that frequently co-occur with the focal topic represent prominent logics for the topic’s interpretation. Through the ways journalists curate and present the news flow, the media frames that we measure in this study establish a shared context of meaning-making (Scheufele, 1999; Fiss and Hirsch, 2005; Chong and Druckman, 2017; Lizardo, 2021), placing events, people, and ideas into a wider context of interpretability (Strauss and Quinn, 1997; DiMaggio, 1997; Cerulo, Leschziner, and Shepherd, 2021; Arseniev-Koehler and Foster, 2022).
Since we estimate changes in cultural associations and delineate periods during which associations measurably differed, our computational approach adds scale to the qualitative analysis of “turning points” in collective meaning-making (Sewell, 1996; Abbott, 1997, 2001; Wagner-Pacifici, 2017). It further gives the casing of timelines a broader empirical foundation than do the narrative accounts usually heralded in the historical social sciences (Griffin, 1992; Bearman, Faris, and Moody, 1999; Ermakoff, 2019).
In the following, we provide a brief primer on frames of interpretation and turning points in media discourse, and we introduce the Swedish case study in relation to earlier large-scale studies of newspaper content. We then turn to the method itself and describe its implementation as a means of estimating predefined topics and their relations to one another over time. We present results for the Swedish newspaper corpus that highlight the interpretability of model outputs. In the concluding section, we discuss our insights into the Swedish media coverage of immigration over the past 75 years, and we ponder the degree to which text measures, drawn for example from the mainstream media as in our case, provide social sensors that can help us learn about trends in contemporary societies.
Frames and Turning Points
Frames concern how information is conveyed in communication, and how specific interpretations are promoted by relating one concept to other concepts, thereby linking new information to existing ideas and previous experiences (Gamson and Modigliani, 1989; Entman, 1993; Scheufele, 2000; Rawlings and Childress, 2021). As such, frames are “interpretive packages” (Gamson and Modigliani, 1989) that evoke particular perspectives and problem definitions through which objects in the social world can be seen and understood (Weaver, 2007; Gamson, 1992; Benford and Snow, 2000). Immigration, for example, might be interpreted through, among other lenses, a security frame or an economic frame. Individuals may have opposing opinions on immigration (e.g., “immigrants provide necessary labor” and “immigrants take our jobs”), but they can still agree to interpret immigration through a similar lens (e.g., the economy). Taken together, frames provide the cognitive contexts that speak to and activate the learned categories of individuals’ cognition (Lizardo, 2017; Wood et al., 2018; Hunzaker and Valentino, 2019; Cerulo, Leschziner, and Shepherd, 2021), and they organize cognition at a higher order of abstraction than do opinions, attitudes, or values (DiMaggio, 1997; Goldberg, 2011; Mohr et al., 2020).
In our application, we focus on how immigration has been framed in national news media, exploring the interpretations of immigration formulated by journalists and editors. In line with the idea that an interpretative frame can be viewed as an associative pattern, we operationalize media frames as associations between a focal theme and other topics. Media frames that frequently co-occur with the focal issue represent prominent logics for the issue’s interpretation. For example, one frame may connect immigration with issues of religion in order to highlight the cultural differences between natives and migrants, while another may connect immigration with party politics in order to promote a politicized perspective on immigration. The composition of salient frames at a certain time point aggregates into what we refer to as the shared interpretation of immigration communicated by the media.
In contests over sovereignty in interpretation (Swidler, 1986; Gamson, 1992; Benford and Snow, 2000), entrepreneurs of meaning—such as governments, political parties, advocacy groups, and media outlets themselves—are keen to obtain ownership of salient issues and to influence their shared interpretations (Andrews and Caren, 2010; Quinsaat, 2014; Tsur, Calacci, and Lazer, 2015; Farrell, 2016; Bail, Brown, and Mann, 2017). But how do publicly available interpretations change? Influential social science theorizing refers to “turning points” that constitute breaks with routine practices of meaning-making (Sewell, 1996; Abbott, 1997; Wagner-Pacifici, 2017). Turning points take shape in “unsettled times” (Swidler, 1986) or “periods of rupture” (Wagner-Pacifici, 2017) in which sequences of events occur that imply thresholds and shifts that are recognizable to contemporaries. In retrospect, we give names to these ruptures because they bring with them a series of occurrences that challenge established interpretations and “durably transforms previous structures and practices” (Sewell, 1996).
We use the concept of turning points that are grounded in, and operative on, publicly available interpretations to partition Sweden’s immigration discourse into recognizable eras. We estimate annual salience shifts in the composition of dominant frames over time to identify breakpoints in the media’s framing of immigration and to parse discursive periods during which meaning-making measurably differed.
The Swedish Newspaper Corpus in Context
The Swedish Newspaper Corpus 1945–2019, digitized by the National Library of Sweden (Börjeson et al., 2023), contains 75 years of journalistic content from the country’s four largest newspapers.
The news articles we study represent a broad mixture of different formats and political orientations (see Table 1). Newspapers divide their content into multiple stand-alone sections, e.g., op-eds, domestic politics, world news, culture, sports, and TV listings. We restrict our analysis to the front sections of each newspaper. We believe these sections contribute most to meaning-making in newspapers. Using the front sections leaves us with 29.3 million documents and 1.6 billion words after removing rare words and documents shorter than 15 words. The corpus consists of text blocks, i.e., units of cohesive text identified in the segmentation procedure during digitization. The segmentation relies on a rule-based approach curated by the Swedish National Library (using the software Zissor with ABBYY as the optical character recognition engine); there are different segmentation rules for each newspaper that are updated when newspaper layouts change (Dannélls, Johansson, and Björk, 2019). We use each text block as a document. Previous research (Hurtado Bodell, Magnusson, and Mützel, 2022) has shown that an article is commonly captured by multiple text blocks and, importantly, that only 16% of text blocks contain content from more than one article. See Supplemental Material Section S1 in the Appendix for more details on corpus creation.
Corpus Description, 1945–2019.
By comparison with earlier computational studies of archival text that have described national conversations based on sets of keyword-selected articles, our corpus comprises the complete front-section content of four national newspapers over 75 years.
While they have been innovative and carefully implemented, previous topic-model studies have relied exclusively on an inductive operationalization of meaningful frames that were detected as topics in articles identified as having a focus on immigration based on a keyword search. The inferred topics, and the sociological concepts they may represent, have been interpreted post hoc, after seeing the model outputs. In this article, we argue that this practice invites researchers to adapt the boundaries of theoretical constructs on the basis of model outputs rather than on what is suggested by theory. Because topics inferred by unsupervised topic models differ each time a model is estimated, this could create a situation in which the conceptualization of a theoretical construct changes with each model run. In our use case, a topic model may capture different aspects of the “immigration discourse” with each re-run. The use of seed words to anchor an immigration topic stabilizes inferences across model estimations. As we explain in the next section, the seeded topic model improves both replicability and interpretability and combines improvements in transparency with a more theoretically informed approach to detecting topics and topical associations.
Methods
For many in the social sciences, computational text analysis comes in two variants: supervised or unsupervised. Supervised methods rest on the researcher’s access to labels for meaning structures in text data, such as categories and a coding scheme, and then extrapolate these labels to unseen texts (Nelson et al., 2021; Chen et al., 2018; Lichtenstein and Rucks-Ahidiana, 2021; Do, Ollion, and Shen, 2022). By contrast, unsupervised methods infer information about language patterns, such as co-occurrences of words in documents, without drawing on predefined categories or coding schemes. A growing number of studies are using unsupervised methods to describe the cultural meanings of sociological concepts—such as class (Kozlowski, Taddy, and Evans, 2019), gender (Garg et al., 2018), race (Nelson, 2021b), stigma (Best and Arseniev-Koehler, 2023), and art (DiMaggio, Nag, and Blei, 2013). Unsupervised methods rely on algorithms that either trace the meaning of individual words—for word embedding models in recent sociological research see Kozlowski, Taddy, and Evans (2019); Nelson et al. (2021); Bonikowski, Luo, and Stuhler (2022); Voyer et al. (2022); Best and Arseniev-Koehler (2023)—or on algorithms that identify thematic structures in ensembles of text—for topic models see, e.g., DiMaggio, Nag, and Blei (2013); Karell and Freedman (2019); Bohr (2020); Greve et al. (2022).
Topic models or, more specifically, models based on Latent Dirichlet Allocation (LDA, Blei, Ng, and Jordan, 2003) represent an important class of unsupervised methods that inductively detect themes by learning the topics that are present in a document and the words that best describe them. LDA represents a generative probabilistic process that treats each document as a bag of words from which each word (token) is randomly drawn from a mixture of topics present in the document. The model then assigns each word in a document to a topic, allowing the same word to belong to various topics to a differing degree. Each topic, in turn, is a low-entropy distribution over words that tend to co-occur. This graded membership property aligns closely with our analytical aim of determining which co-occurring topics are most relevant for describing the shared interpretation (or framing) of an issue.
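As a minimal illustration of this generative process, the following Python sketch samples documents from an LDA model. The vocabulary size, topic count, document lengths, and Dirichlet hyperparameters below are arbitrary placeholders, not the settings used in our analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, D, N = 50, 3, 10, 40  # vocabulary size, topics, documents, words per doc
alpha, beta = 0.1, 0.01     # Dirichlet hyperparameters (illustrative)

# Each topic is a distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=K)          # shape (K, V)

documents = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K, alpha))           # this document's topic mixture
    z = rng.choice(K, size=N, p=theta)                 # topic assignment per token
    w = np.array([rng.choice(V, p=phi[k]) for k in z]) # token drawn from its topic
    documents.append(w)
```

Inference then runs this process in reverse: given only the tokens `w`, the sampler recovers the topic assignments `z` and the distributions `phi` and `theta`.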
As was mentioned above, unsupervised methods quantify what would otherwise be inaccessible, making the interpretive process that is always an important part of text analysis more transparent and systematic. However, unsupervised methods require post hoc operations to connect the model output to meaningful sociological concepts. Word embedding models, such as the one used by Kozlowski, Taddy, and Evans (2019), rely on vector algebra and focus on a set of manually selected keywords in order to identify interpretable dimensions of a concept. In applications that use LDA models, the standard practice employed to achieve interpretability involves qualitatively inspecting each inferred topic and making iterative decisions as to which topics are meaningful and relevant for inclusion in the final analysis (e.g., Törnberg and Törnberg, 2016; Karell and Freedman, 2019; Nelson, 2020; Czymara and van Klingeren, 2022). As a consequence, “sociologists using text as data must make a dizzying number of decisions about what information to extract and how to answer their research question” (Nelson, 2019: 139). While iterative mixed-method approaches such as “computational grounded theory” (Baumer et al., 2017; Nelson, 2020) or “computational hermeneutics” (Mohr, Wagner-Pacifici, and Breiger, 2015) are important for their exploratory potential and their links to existing qualitative methodologies, they remain reliant on making sense of the output after a model is learned (Goldenstein and Poschmann, 2019; Nelson, 2019; Pääkkönen and Ylikoski, 2021). Because the inductive finding of relevant sociological concepts places researchers at risk of also finding seemingly meaningful interpretations where none actually exist, calls have been made for the development and use of intrinsically interpretable models (Hurtado Bodell, Arvidsson, and Magnusson, 2019; Rudin, 2019; Madsen, Reddy, and Chandar, 2021).
Seeded Topic Model
We suggest an extension to the original topic model, the seeded topic model, which places informative priors on researcher-defined seed words to guide topic inference toward predefined themes of interest.
Allowing researchers to seed topics on the basis of existing domain knowledge constitutes an important step toward a more deductive, insight-oriented approach to modeling that is both less reliant on post hoc interpretations of model outputs (as are required in the unsupervised approach) and not restricted to a priori manually annotated categories or manually selected keywords (as are required in the supervised approach). Instead, the seed words help form topics around predefined concepts, names, or ideas, while at the same time utilizing the functionality of LDA to find new associations in text data based on word co-occurrences.
It is important to note that there is a crucial difference between the seed word strategy used here and the use of keyword searches to identify meaningful topics and identify documents that “belong” to or are most salient in relation to specific topics. Keyword search involves a deterministic procedure that requires detailed knowledge of the configuration of topics before models are run. Previous research shows that even domain experts perform poorly in identifying the keywords that are most relevant for capturing specific concepts (King, Lam, and Roberts, 2017). This results in biased text measures and differences in substantive conclusions. In contrast, seed words are only the starting point from which a model proceeds to learn which words go together. The unsupervised part of the algorithm will expand upon the original list of seed words in crystallizing topics of interest. We discuss the model and its implementation in detail in Supplemental Material Sections S2 and S4.
Previous contributions that have introduced seeded topic models using informative priors on preselected seed words (Lu et al., 2011; Jagarlamudi, Daumé, and Udupa, 2012; Fan, Doshi-Velez, and Miratrix, 2019; Eshima, Imai, and Sasaki, 2024; Watanabe and Baturo, 2024) relied on the standard collapsed Gibbs sampler as described in Griffiths and Steyvers (2004), limiting their applicability to large-scale data. By increasing scalability, and by using the model as a method for measuring sociological concepts, our implementation extends the existing methodological literature in important ways. Seeded topic models that are implemented via highly scalable parallelizable sampling (Magnusson et al., 2018) permit the extraction of predefined topics and their associations with other themes from massive text data. Even though we have used this highly specialized algorithm, the model estimation process based on our vast corpus took 4.5 days using a machine with 360 GB RAM and 32 cores. Without the specialized algorithm, our analysis would not have been possible. See Authors’ Note for information about the code and data that reproduces our analysis.
Seeding the Immigration Topic
Seeded topic models rely on Bayesian informative priors to decide which topics the algorithm should identify. In practice, informative priors are placed on the topic-word distribution such that a word used to guide the model has a zero probability of belonging to any other topic than the one for which it is a seed word. The seed words one uses to guide the model should be highly unlikely to occur in contexts outside the topic of interest—in our case, immigration. We use five types of words that are highly unlikely to be used in texts that do not relate to immigration: (i) names of immigration laws, (ii) titles of ministers responsible for immigration, (iii) names of agencies responsible for immigration, (iv) terms referring to related policy areas (e.g., integration policy), and (v) terms referring to different types of immigration (e.g., labor migration). Moving beyond the predefined seed words, the model learns other meaningful words that define the topic of interest. Among these, we find words that relate, for example, to race and ethnicity, such as names and slurs associated with minorities in Sweden (see Supplemental Material Section S7 for details). Our choice of seed words allows us to capture different dimensions of the immigration issue including, for example, discourses on different types of migrants such as refugees, asylum seekers, and labor migrants.
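The prior construction can be sketched as follows. The toy vocabulary and seed set are hypothetical stand-ins for the full corpus vocabulary and the five seed-word categories listed above: each seed word receives prior mass only in the topic it seeds, giving it zero probability of belonging to any other topic.

```python
import numpy as np

# Hypothetical vocabulary and seed set; the real model uses the full corpus
# vocabulary and the seed-word categories described in the text.
vocab = ["aliens_act", "migration_agency", "labor_migration", "economy", "football"]
seed_words = {"aliens_act", "migration_agency", "labor_migration"}
K, seeded_topic, beta = 4, 0, 0.01  # topics, index of the seeded topic, prior mass

# Start from a symmetric prior on the topic-word distributions, then zero out
# each seed word's prior mass in every topic except the one it seeds: the
# sampler can then never assign that word to any other topic.
prior = np.full((K, len(vocab)), beta)
for j, word in enumerate(vocab):
    if word in seed_words:
        prior[:, j] = 0.0
        prior[seeded_topic, j] = beta
```

Non-seed words keep their symmetric prior, so the topic is free to absorb additional vocabulary that co-occurs with the seeds.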
Seeding also allows the model to be infused with a priori knowledge of language change. Conceptually, actors, meanings, and contexts change over time, which implies that no single measure of discourse may be appropriate over long timescales. Lexical shifts and the changing meanings of social categorizations are critical challenges to the computational analysis of historical text (Bail, 2014; Rule, Cointet, and Bearman, 2015; Bonikowski, Luo, and Stuhler, 2022; Voyer et al., 2022). The word “immigrant,” for example, was rarely used prior to the 1970s (“foreigner” was the term of the day), and concepts such as “family reunification” and “unaccompanied minor” first appeared in the 1970s and 1990s, respectively. We implement the semi-supervised seeded topic model using domain knowledge to guide the model estimation over language changes that introduce new words to discuss the same topic. Topic seeding is best equipped to handle this type of language change that, in a standard modeling approach, would lead to the splitting of a theme into various topics. A previous name of the current Migration Agency (
We measure the salience of the immigration topic (Figure 1B) by calculating the proportion of words in all documents that are estimated to belong to the seeded immigration topic each week.
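Assuming access to per-token topic assignments (the data format below is an illustrative assumption, not the output format of our implementation), the weekly salience measure amounts to a simple ratio:

```python
from collections import defaultdict

def weekly_salience(docs, immigration_topic):
    """Share of all tokens per week assigned to the seeded immigration topic.

    `docs` is a list of (week, topic_assignments) pairs, where
    topic_assignments holds the estimated topic id of each token in one
    document (hypothetical format, for illustration).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for week, z in docs:
        hits[week] += sum(1 for t in z if t == immigration_topic)
        totals[week] += len(z)
    return {week: hits[week] / totals[week] for week in totals}
```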
Co-occurring Topics as Interpretative Frames
The seeding strategy also permits us to define a set of additional topics that meaningfully co-occur with immigration and that we wish to flesh out from the media discourse as potential interpretations of immigration. We operationalize prominent media frames via the focal topic’s associations with other frequently co-occurring topics, and we interpret these relationships as culturally shared associations between concepts. This implies that we abstract away from word-level analyses, such as keyword in context, and instead focus on how topics (rather than words) co-occur. In our analysis, it is not crucial whether the word “immigrant” is discussed alongside words such as “workplace” or “murder”; what matters instead is the association of the immigration topic with the economy topic and the crime topic, respectively.
We have predefined co-occurring topics on the basis of existing research on the common themes found in European news reporting on immigration (Korkut et al., 2013; Greussing and Boomgaarden, 2017; Eberl et al., 2018; Heidenreich et al., 2019) and research documenting Sweden’s immigration history (Geddes and Scholten, 2016; Byström and Frohnert, 2017; Krzyżanowski, 2018; Andersson et al., 2010). Based on this research, we expect five dominant frames—“culture,” “economy,” “human rights,” “politics,” and “security”—to co-occur with discussions of immigration. We capture each frame that represents a known interpretation of immigration by seeding several topics (Table 2). We seed multiple topics to capture each frame such that an interpretative frame can be viewed as a “supratopic” covering different dimensions of a related issue. For example, “crime,” which constitutes part of the security frame, is a highly diverse issue that includes a focus on offenses such as burglary, narcotics, murder, and sexual assault, to name only a few. To capture the many different crime-related aspects, we seed four different topics using the same set of seed words (see Supplemental Material Sections S2 and S3 for details). By seeding different topics with the same words we allow the model to crystallize around particular dimensions of a broader theme of interest in separate topics without explicitly having to choose these dimensions a priori. For example, while we know that “crime” is a multi-dimensional theme in our corpus (e.g., news covering different types of crimes at different phases in an investigation will be defined by different vocabularies), we let the model inductively find which type and aspect of crime should form a particular topic. One seeded topic then becomes a drug topic, for example, one becomes a homicide topic, and so on, and these are then combined into the larger topic of crime. 
This procedure allows the model to identify more specialized topics which, depending on the research question, can then be combined into a well-defined larger topic. We set the number of topics to 1,000, allowing for a combination of seeded and unseeded topics in the model.
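The aggregation of several seeded topics into one frame, or “supratopic,” can be sketched as follows; the topic ids and frame composition are hypothetical examples, not our actual topic configuration.

```python
def frame_salience(topic_shares, frame_topics):
    """Aggregate per-topic word shares into per-frame shares.

    `topic_shares` maps topic id -> share of words assigned to that topic;
    `frame_topics` maps a frame name to the ids of its seeded topics
    (e.g., the four crime topics feeding the security frame). All ids
    here are illustrative.
    """
    return {frame: sum(topic_shares.get(t, 0.0) for t in topics)
            for frame, topics in frame_topics.items()}
```

Because each frame is just the sum of its component topics, the model can crystallize specialized topics (drugs, homicide, and so on) while the analysis still operates at the level of the broader frame.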
Seeded Topics Reflecting Frames of Immigration.
Unlike previous research, we quantify interpretative frames using co-occurrence frequencies for different topics that are inferred from the same topic model that simultaneously measures the focal topic of interest. We measure the importance of each frame (Figure 2) in terms of the proportion of words that belong to the respective seeded topics in immigration-rich documents printed in the newspapers (see Supplemental Material Section S5).
Document Inclusion, Sensitivity, and Validation
The analysis includes all documents that we classified as “immigration-rich”: documents in which at least 2.5% of the tokens were estimated to belong to the immigration topic (i.e.,
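The inclusion rule amounts to a simple threshold on the document-level share of immigration tokens; the per-token assignment format in this sketch is an illustrative assumption.

```python
def is_immigration_rich(topic_assignments, immigration_topic, threshold=0.025):
    """Classify a document by the share of its tokens assigned to the
    seeded immigration topic (2.5% threshold, as in the main analysis)."""
    if not topic_assignments:
        return False
    share = (sum(1 for t in topic_assignments if t == immigration_topic)
             / len(topic_assignments))
    return share >= threshold
```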
We report on model diagnostics and sensitivity analyses in Supplemental Material Section S6, including (i) a test for model convergence, as well as model re-runs (ii) using alternative numbers of topics (950 and 1,500), (iii) using each newspaper corpus separately, (iv) using alternative thresholds for document inclusion (1%, 4%, and 5%), and (v) using random subsets of 90%, 80%, and 70% of the original set of seed words.
In Supplemental Material Section S7, we report on validation strategies for topic definition that evaluate the degree to which a seeded topic captures the concept of interest. Those strategies include (i) a comparison of documents classified as being about immigration with a manual annotation of a sample of documents, (ii) an inspection of the tokens that the algorithm learned to belong to the topic, and (iii) an analysis of influential immigration-related events based on high temporal resolution data. The latter analysis tests whether the model picks up on immediate changes in newspapers’ framing following such events. We focus on events for which clear theoretical expectations exist about their likely impact on the salience of a particular seeded frame. An Islamist terrorist attack, for example, may serve to re-frame Islam as a violent ideology, leading to revisions of the current security-related interpretations of immigration (Greenberg, Pyszczynski, and Solomon, 1986; Legewie, 2013; Schmidt-Catran and Czymara, 2020). In this case, we would expect the relative salience of the security-related frame to increase in the weeks following the attack—indicating valid topic seeding.
Parsing Discursive Eras
We use a Bayesian Gaussian change-point model (Barry and Hartigan, 1993; Erdman and Emerson, 2007) to detect shifts over time in the salience of single frames as well as in the relative composition of salient frames. We interpret salience shifts as breakpoints in the media’s framing of immigration. The model assumes that a time series of frame salience can be partitioned into an unknown number of periods, with each period having a constant mean reflecting a “new probability regime” (Abbott, 2001). We estimate two kinds of specifications of the change-point model: (i) A univariate specification that tests for breakpoints in the salience of each of the five seeded frames separately, and (ii) a combined multivariate specification that tests for breakpoints in the relative composition of all five seeded frames. We are particularly interested in the multivariate model results. The composition of salient frames at a certain point in time aggregates into what we refer to as the shared interpretation of immigration communicated by the media. A shared interpretation describes a set of frames that are available to the public at a given point in time to make sense of an issue. The estimates of the change-point model provide an empirical foundation for the parsing of discursive periods (Rule, Cointet, and Bearman, 2015) in which meaning-making measurably differed.
The model, regardless of its specification as univariate or multivariate, estimates the posterior probability that each year constitutes a change point, delimiting sharp differences in the means of the respective time series in adjacent periods. That is, the model estimates the likelihood that a significant shift has occurred in the way the newspapers frame immigration in each of the 75 years included in the data. We use a standard implementation of the model (Erdman and Emerson, 2007), and we set the model’s hyperparameter
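Our analysis relies on the implementation of Erdman and Emerson (2007) in R. The Python sketch below is not that estimator; it is a simplified illustration of the underlying intuition, scoring each candidate year by how sharply the mean of a salience series differs before and after that year.

```python
import numpy as np

def mean_shift_scores(series):
    """For each interior index, a z-like score for a mean shift there.

    A simplified stand-in for the Bayesian change-point model used in the
    paper: high scores flag years where average frame salience before and
    after the candidate year differs sharply relative to its variability.
    """
    y = np.asarray(series, dtype=float)
    n = len(y)
    scores = np.zeros(n)
    for t in range(2, n - 2):  # need at least two points on each side
        left, right = y[:t], y[t:]
        pooled = np.sqrt(left.var(ddof=1) / len(left)
                         + right.var(ddof=1) / len(right))
        scores[t] = abs(right.mean() - left.mean()) / (pooled + 1e-12)
    return scores
```

The full Bayesian model instead averages over all possible partitions of the series, yielding a posterior change-point probability per year rather than a point score.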
Results
Figure 1B traces the relative salience of the seeded immigration topic in Sweden’s newspaper corpus from 1945 to 2019. The blue line represents the annual average salience of immigration and shows how important this issue was in the media. Prior to the first major peak in the number of immigrants in 1970, the level of media attention focused on immigration was low. On average, 0.05% of tokens in the newspapers referred to it. By contrast, from 2015 to 2019, the salience of immigration as a news issue reached 0.37%, a 7.4-fold increase vis-à-vis the first period. Both the actual number of immigrants arriving in Sweden (Figure 1A) and the importance of the immigration topic in newspaper coverage (Figure 1B) reached unprecedented heights in 2015. The year of the European “refugee crisis” represents a clear disruption in terms of the salience of immigration. Salience also spiked during 1969–1970, which were years of high labor migration, and during the armed conflicts in Iraq (1990–1991, 2003–2011) and Bosnia (1992–1995), which resulted in many refugees arriving in Sweden. The linear correlation between the annual number of newly arrived immigrants and the salience of the immigration topic is 0.82 for the entire period examined; this correlation increases to 0.93 from 2010 to 2019. These results show that the attention of the media shifts to immigration in periods of peak influx, particularly if immigrant numbers increase rapidly.
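The arithmetic behind these figures is straightforward to reproduce. The sketch below recomputes the fold-change reported in the text and shows a plain-Python Pearson correlation of the kind used to relate immigrant arrivals to topic salience; the two short series are hypothetical stand-ins for the annual data:

```python
# Reproducing the fold-change arithmetic from the text, plus a Pearson
# correlation between arrivals and topic salience. The two series below
# are hypothetical, for illustration only.
from math import sqrt

fold_change = 0.37 / 0.05  # salience 2015-2019 vs. pre-1970 baseline

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

immigrants = [10, 12, 30, 80, 160]          # hypothetical arrivals (thousands)
topic_share = [0.05, 0.06, 0.10, 0.20, 0.37]  # hypothetical salience (%)
r = pearson(immigrants, topic_share)
```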

(A) Annual number of immigrants (in thousands) arriving in Sweden. (B) Annual average salience of the immigration topic in Sweden’s four major newspapers (blue line). Data points represent the percentage of all words in a given week’s news articles that are estimated to belong to the immigration topic.

(A) The evolution of media frames of immigration. The Y-axis represents the salience proportion of the five seeded topics that frequently and meaningfully co-occur with the “immigration” topic. The salience proportions of these five frames sum to 1 in each year, and trajectories represent 5-year moving averages. The dashed vertical lines indicate the beginning and ending of inferred eras. (B) The likely turning points in the framing of immigration. Colored trajectories represent the univariate posterior distribution of potential change points per media frame. The black trajectory represents the multivariate posterior distribution of potential change points in the composition of frames, which constitutes our measure of the shared interpretation of immigration. The background colors highlight the seven periods implied by the model.
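The two transformations described in the caption, normalizing annual frame saliences so that the five frames sum to 1 each year, and smoothing each trajectory with a 5-year moving average, can be sketched as follows. The raw salience values are hypothetical, and the centered moving average with shrinking edge windows is one plausible implementation, not necessarily the exact smoother used by the authors:

```python
# Sketch of the caption's transformations: per-year normalization of frame
# saliences to proportions, and 5-year moving-average smoothing.
# Raw values are hypothetical, for illustration only.

def to_proportions(frames_by_year):
    """frames_by_year: dict year -> dict frame -> raw salience.
    Returns the same structure with each year's values summing to 1."""
    out = {}
    for year, frames in frames_by_year.items():
        total = sum(frames.values())
        out[year] = {frame: s / total for frame, s in frames.items()}
    return out

def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        seg = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(seg) / len(seg))
    return smoothed

raw = {1945: {"humanitarian": 0.6, "economy": 0.2, "culture": 0.1,
              "security": 0.05, "politics": 0.05}}
props = to_proportions(raw)
```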
Relative topic salience provides an important measure of
Immediately following the war, the media discourse portrayed immigration mainly from a humanitarian perspective (Figure 2A). As this association became less prominent, we find likely univariate change points in the humanitarian interpretation and, to a lesser degree, in the cultural interpretation of immigration during the late 1940s and early 1950s (Figure 2B).
We estimate the first turning point, with a 96% posterior probability in the multivariate model, as occurring in 1955. This year was characterized by a surge of labor migration to Sweden. At the end of the second period, in the mid-1960s, the association between immigration and the economy had caught up with the humanitarian perspective. Both inferred periods 1 and 2 of post-war immigration align with historical accounts that partition Sweden’s immigration history on the basis of immigration flows and policy changes (Geddes and Scholten, 2016; Byström and Frohnert, 2017; Krzyżanowski, 2018; Kupskỳ, 2017; Andersson et al., 2010; Svanberg and Tydén, 1998).
Our model identifies a period of rupture in the mid-1960s—which coincides with the first discussions of multiculturalism (1964) and investigations into the costs of immigration for the expanding welfare state (1965). In the immediate aftermath of these discussions and investigations, the dominant interpretation of immigration became economic, and a cultural framing gained importance. These ruptures, with multivariate change-point probabilities of 95% in 1964 and 70% in 1966, mark the beginning of a long era of relative stability in the associative patterns. Rapid economic growth and the political hegemony of the Social Democratic party resulted in the roll-out of the welfare state, which was extended in 1968 to cover migrant workers, and a newly established migration board was tasked with overseeing their employability. Again, the inferred period is largely in alignment with the narrative presented by historical social science (Byström and Frohnert, 2017; Krzyżanowski, 2018).
We infer turning points in 1974 (70%) and 1986 (77%). Labor migration declined during the economic crises of the 1970s and was increasingly replaced by immigration involving non-European refugees. The univariate breakpoint for culture in 1984 coincides with the arrival of increasing numbers of non-Western refugees, discussions of legislation against ethnic discrimination, and increased efforts focused on integration, including family reunification (Byström and Frohnert, 2017; Andersson et al., 2010).
It was in 1986 that the Swedish Prime Minister, Olof Palme, was murdered. Spearheaded by Palme’s governments (1969–1976, 1982–1986), immigration law had embraced multicultural ideals, affirming diversity and the protection of immigrants’ cultural identities. Despite the turning point identified in 1986, the media framing of immigration remained remarkably stable across periods 4 and 5, and we interpret the interval 1974–1999 as representing Sweden’s famed era of tolerance (Schierup and Ålund, 2011; Rydgren and van der Meiden, 2019), during which an inert mix of economic, humanitarian, and security-related frames shaped the interpretation of migration for almost a generation. This interpretation weathered economic downturns, peaks in immigration, and Sweden’s accession to the EU in 1995, and remained dominant until the end of the 1990s—which is much longer than the historical narrative suggests (Dahlström, 2004; Byström and Frohnert, 2017; Svanberg and Tydén, 1998). At the same time, the turning points we identify in this era are disproportionately driven by an increase in a new, politically polarized understanding of immigration. Notably, this upward trend in the politicization of immigration precedes the electoral success of populist far-right parties and the decline in the Social Democratic consensus that have characterized Swedish policy debates in recent decades (Dahlström, 2004; Byström and Frohnert, 2017).
Our analysis identifies the year 2000 as a consequential turning point (84%) driven by politicization. This was a year of revisions to immigration law, when the EU started to harmonize its immigration policies in the lead-up to the Schengen agreement (2001), which led to an increase in the number of migrant workers arriving in Sweden from the eastern countries of the EU. We find that a further convergence of media frames and, ultimately, their gradual replacement by politics as the dominant lens through which immigration is viewed, coincided with the populist right Sweden Democrats’ entry into parliament in 2010. The Sweden Democrats have since become the country’s second-largest party in national elections. Several years are associated with non-zero change-point probabilities for specific frames, but none of these are particularly pronounced and we do not find them to be sufficiently consequential to register in the model as having altered the interpretation of immigration. Throughout this period, and despite the September 11 attacks and the subsequent US-led “war on terror,” the association between Swedish immigration and security issues remained flat.
The final turning point that we estimate to lie above the 50%-threshold (51%) occurred in 2013. This disruption, which is less clear than those described above, marks the beginning of the most recent discursive era. This period included generous revisions of asylum law. At the same time, the consensual migration politics of past decades, which some have argued cemented an “opinion corridor” of views perceived as socially acceptable (Ekengren Oscarsson, 2013), were increasingly being criticized in society at large. This period reflects a further politicization of the immigration discourse, a surge in a security-related interpretation, and probably also the end of Sweden’s “exceptionalism” (Schierup and Ålund, 2011; Rydgren and van der Meiden, 2019) as regards the country’s tolerant approach to immigration. Our results indicate that this reinterpretation of immigration started well before the 2014 general election (in which the Sweden Democrats doubled their number of seats in parliament) and, most importantly, before the 2015 “refugee crisis.” Neither of these years was sufficiently consequential to register in our change-point model. Strikingly, we instead see that the 2015 “refugee crisis,” which many observers have classified as a watershed in European immigration history, was of little consequence for the ways in which the Swedish media have portrayed immigration.
In Supplemental Material Section S6, we report these results separately per newspaper. We find that the framing of immigration over time varies little between newspapers of different political orientations or between highbrow broadsheets (
Discussion
We have argued that the seeded (or constrained) topic model constitutes a promising semi-supervised method—combining both inductive and deductive reasoning—that provides a more replicable and transparent means of measuring meaning in digital text. Semi-supervised methods can improve transparency and replicability by decreasing the number of idiosyncratic decisions made during model implementation. Importantly, the seeded topic model permits a theoretical grounding of the topic definition procedure, because seed words require researchers to be explicit about how concepts are operationalized, and these constraints ensure that the model will identify the same concepts in each model run. This approach represents an advance in relation to concerns about whether computationally identified patterns can provide replicable and interpretable empirical evidence that is relevant to social science research. The seeding procedure allows researchers to tame the unsupervised nature of the topic model by guiding the model in its detection of topics, but without predetermining the full vocabulary associated with the topics identified. We have demonstrated the applicability of one specific algorithm to the task of identifying predefined, sociologically relevant concepts in texts and inferring the associations that exist between these concepts.
Model performance should be validated to ensure that the seeded topics represent the concepts of interest, and model validation still requires subjective interpretations of topic quality. To be sure, choosing seed words may be an iterative process, based on interpretations of model outputs and allowing previously unknown patterns to arise from the data. Such iterative processes are essential in most research that employs computational text analysis (Grimmer, Roberts, and Stewart, 2022), and as Mohr and colleagues have noted, “there can be no measurement of culture without interpretation” (Mohr et al., 2020: 4). Against this backdrop, we have taken important steps toward a more principled interpretation of topic models. First, identifying both a focal concept and its neighboring topics in a single estimation—instead of first identifying the relevant documents that contain the focal concept and then searching for other concepts within these documents—ensures that the analysis is less reliant on early operationalization decisions. One-step procedures are particularly important for producing reliable measures of meaning-making over long timescales, where measurement may be affected by language change.
Second, seeding facilitates diagnostics of model performance, something that is typically difficult in purely unsupervised settings (Chang et al., 2009; Ying, Montgomery, and Stewart, 2022). The semi-supervised nature of the model allows us to restrict validation efforts to the seeded topics. This is particularly important because there are currently no standards regarding how topic models should best be evaluated when used in sociological research. In the Appendix (Supplemental Material Section S7), we suggest various measures that will assist in inspecting the quality of seeded topics, and we found a high level of correspondence when we compared a manually coded sample of documents with documents inferred by the model to belong to a seeded topic. Additionally, we have checked the sensitivity of our results regarding the number of topics, seed word selection, and different thresholds for document inclusion (Supplemental Material Section S6).
In a supplementary analysis also reported in Supplemental Material Section S7, we provide suggestive evidence that unforeseen and widely recognized events have the capacity to measurably shift the salience of certain media frames. These results illustrate another validation strategy that tests whether the model picks up on shifts in the salience of the frame most closely related to the event in question. The results lend support to the validity of our semi-supervised inference of interpretative frames, and they provide pointers to the immediate response of newspapers to disruptive events. The event-focused analysis of high temporal resolution data also illustrates how—under certain assumptions—latent features of text data can be used as the outcome variable when estimating causal effects (Egami et al., 2022; Gencoglu and Gruber, 2020).
Of course, seeded topic models also have their own limitations. Current applications of the original topic model focus on discovering previously unknown patterns in text data (Grimmer, Roberts, and Stewart, 2022). The seeding of topics places bounds on an open discovery process. One solution (which we followed in our case study) involves allowing for a combination of seeded and unseeded topics in the model such that unexpected signals in the data can still be detected and explored. The applicability of the seeded topic model depends on how well researchers can operationalize a theoretical concept via one or more topics. A seeded topic model can easily identify some concepts, depending on the availability of unique words associated with the theme of interest. Other concepts are nearly impossible to pin down, however. For example, the model will struggle to capture a topic that is mostly defined by polysemic words, i.e., words with different possible meanings. To tackle issues with polysemy, researchers can seed multiple topics with the same words—as we did, for example, for the multifaceted crime topic—and thereby rely on the model to inductively capture their different meanings. While this may solve issues related to polysemy, it also decreases the replicability of the model. Therefore, finding non-polysemic words to crystallize interpretable topics of interest poses an important scope condition and, in some potential use cases, a roadblock to making full use of the seeded topic model. At the same time, however, vague and multifaceted themes that are difficult to identify using a seeded topic model may also present challenges to supervised methods that require human annotation.
Large language models (LLMs), which increasingly find their way into social science publications, also blur the line between supervised and unsupervised learning. LLMs have shown great capacity in a vast array of classification tasks (Do, Ollion, and Shen, 2022; Widmann and Wich, 2023; Bonikowski, Luo, and Stuhler, 2022; Chae and Davidson, 2023; Gilardi, Alizadeh, and Kubli, 2023; Törnberg, 2023), although current models’ performance is still under debate (e.g., Ollion et al., 2024; Bail, 2024), especially in classification tasks that require cross-document reasoning as in topic modeling and when texts pertain to a particular place and time as in historical corpora (Ziems et al., 2024). The development of LLMs proceeds at an extremely fast pace. Decreasing costs will open them up for analyses of very large corpora, and ideas of identifying, in principled ways, concepts predefined by the researcher will hopefully guide some of the modeling advances. If researchers find ways to gain more control over labeling, replicability, and transparency (Grossmann et al., 2023), this transformative brand of text modeling will be in a good position to develop important alternatives to the seeded topic model.
We have applied the seeded topic model to a vast newspaper archive to learn how the issue of immigration has been framed in Swedish newspapers from 1945 to 2019. The storytelling of journalists—their use of interpretative frames to make news events understandable to their audiences—makes newspaper archives a treasure trove for the study of meaning-making over historical timescales. We have operationalized frames as themes that frequently co-occur with the issue of interest, and we have interpreted these relationships as culturally relevant associations between concepts. Hence, we have also studied newspaper coverage as a social sensor of discursive processes (Fiss and Hirsch, 2005; Gamson and Modigliani, 1989) in which broader interpretations of societal developments and events are generated, negotiated, and revised (Swidler, 1986; Bourdieu, 1991; Strauss and Quinn, 1997). Viewing text as a social sensor involves the use of large repositories of digital text to uncover latent observations about the social world and trends in contemporary societies in particular.
Some have argued that media content reflects elite discourses and that a media sensor can capture “common cultural patterns, but it cannot observe what is never articulated” (Bonikowski, 2016). We recognize that media-generated perceptions of current events do not equate to the perceptions of the whole population, especially not with regard to polarized “hot” topics and in the age of social media. We have not measured meaning at the individual level, and we have not delineated different “thought communities,” although they no doubt exist, particularly in a politicized domain such as immigration. One example would be that different segments of society may have different groups in mind when they think about immigrants (Blinder, 2015; Eberl et al., 2018). Still, our case study has demonstrated that vast corpora of the type and scale studied here are likely to contain important evidence of the dominant interpretative frames—in the sense of “common cultural patterns”—that have been used to make sense of societal issues at a certain point in time. We believe that using such sensors may have general implications for sociological research in light of the increasing availability of “found” online data (e.g., Keuschnigg, Lovsjö, and Hedström, 2018; Salganik, 2018; Jarvis, Keuschnigg, and Hedström, 2021).
We have highlighted the induction of different eras of meaning-making as a potential means of analyzing the output of seeded topic models, offering a refined empirical foundation for the parsing of “discursive periods” during which specific interpretations of an issue are widely shared. Historians often define “eras” of social change on the basis of policy shifts (Ermakoff, 2019), and—for immigration history—many have viewed key revisions of immigration law as turning points demarcating different eras (Andersson et al., 2010; Geddes and Scholten, 2016). However, historical narratives that partition the flow of events into coherent, meaningful sequences (Stone, 1979; Sewell, 1996) have been criticized for their lack of explanatory depth and, in particular, for involving a risk that spurious events will be identified as marking the beginning and end of posited periods (Popper, 1957; Griffin, 1992). Our study exemplifies that digital archives offer new opportunities for the identification of turning points and for delineating discursive periods on the basis of the ideas expressed by contemporaries (Bearman, 2015; Rule, Cointet, and Bearman, 2015; Garg et al., 2018).
Our measures of media framing are in close alignment with the type of immigration experienced in post-war Sweden until the mid-1970s. The inferred discursive periods match those implied by historical accounts that have partitioned Sweden’s immigration history on the basis of policy changes (Andersson et al., 2010; Geddes and Scholten, 2016; Kupskỳ, 2017). We found that the texts from the late 1970s and early 1980s best describe the country’s signature era of multiculturalism and tolerance toward immigration. Different frames achieved similar salience, indicating a new pluralism in how immigration has been discussed. Weathering economic downturns and peaks in immigration, this era lasted until the end of the 1990s—and thus much longer than historical accounts have suggested (Dahlström, 2004; Svanberg and Tydén, 1998). At the same time, we found that the media began framing immigration as a political issue as early as the mid-1970s—long before anti-immigration platforms started attracting larger audiences and before the parliamentary consensus on immigration eroded in the mid-to-late 1980s (Byström and Frohnert, 2017). As the political framing of immigration gained momentum, we were once again able to see a more unidimensional discussion of migration—now as a strongly politicized issue.
We have also found that seemingly obvious turning points—such as the economic downturns of the 1970s and 1990s, and the “refugee crisis” of 2015—had few consequences for the frames used by the news media to portray immigration in Sweden. However, the public might frame things differently from the mainstream media, and future research is therefore needed to examine how broader segments of society, e.g., the online public, react to highly publicized events.
To conclude, seeded topic modeling provides a means whereby researchers can rely on sociological knowledge when implementing and validating replicable models that make inferences beyond the words on the page. Semi-supervised approaches of this kind could become an important next step toward further improving the work of social scientists in their computational analysis of social data.
Supplemental Material
Supplemental material, sj-pdf-1-smr-10.1177_00491241241268453 for Seeded Topic Models in Digital Archives: Analyzing Interpretations of Immigration in Swedish Newspapers, 1945–2019 by Miriam Hurtado Bodell, Måns Magnusson and Marc Keuschnigg in Sociological Methods & Research