Abstract
Introduction
Hofmann [1] introduced Probabilistic Latent Semantic Analysis (PLSA), which is also known as Probabilistic Latent Semantic Indexing (PLSI) when used in information retrieval and text mining [2]. The basic idea of PLSA is to treat the words in each document as observations from a mixture model where the components of the model are word distributions for latent topics. The selection of the latent topics is controlled by a set of mixing weights such that words in the same document share the same mixing weights. PLSA was initially proposed for text-based applications that do indexing, retrieval, mining, and clustering. Later, its use was expanded to other fields including collaborative filtering [3], computer vision [4], and audio processing [5].
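In symbols (following the standard PLSA formulation; notation ours, since the paper's inline math is not reproduced in this copy), the probability of observing word $w$ in document $d$ is

```latex
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d),
```

where $z$ ranges over the latent topics, $P(w \mid z)$ is the word distribution of topic $z$, and $P(z \mid d)$ holds the document-specific mixing weights.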
PLSA can be viewed as a probabilistic version of the seminal work on latent semantic analysis [6], which revealed the utility of the singular value decomposition of the document-term matrix. PLSA is the precursor of the probabilistic topic models widely used today, including Latent Dirichlet Allocation (LDA) [7]. The basic generative processes of PLSA and LDA are very similar. In PLSA, the topic mixture is conditioned on each document, while in LDA the topic mixture is drawn from a conjugate Dirichlet prior. Theoretically, PLSA is equivalent to MAP-estimated LDA under a uniform prior [8]. The PLSA model makes no assumptions about how the mixture weights are generated, so its generative semantics are not well defined [7]; consequently, there is no natural way to predict a previously unseen document. On the other hand, the LDA model is more complex and cannot be solved by exact inference. Gibbs sampling [9] and variational inference [7] are often used for inference in LDA-type topic models. However, these methods scale poorly to large datasets: variational inference requires dozens of expensive passes over the entire dataset, and Gibbs sampling requires multiple Markov chains [10]. In contrast, parameter estimation and inference in PLSA can be done efficiently with the EM algorithm.
PLSA and LDA are the two most representative topic models, and various empirical comparisons have been conducted between them. Blei et al. [7] show that LDA outperforms PLSA in the perplexity of new documents. On the other hand, Lu et al. [11] conducted a systematic empirical study of PLSA and LDA on three representative IR tasks: document clustering, text categorization, and ad-hoc retrieval. They found that LDA and PLSA tend to perform similarly on these tasks. Furthermore, the performance of LDA on all tasks is quite sensitive to the setting of its hyperparameters, and the optimal setting varies according to how the model is used in a task.
The original PLSA and LDA models, as well as most of their variants, are unsupervised. Many real-world text documents are associated with a response variable, such as the number of stars given to a movie, the number of times a news article was downloaded, or the category of a document. Incorporating such information into latent aspect modeling can guide a topic model towards discovering semantically more salient statistical patterns that are more interesting or relevant to the user's task. Thus, a very important extension of LDA is supervised LDA (sLDA) [12], which jointly models the content and responses of documents in order to find latent topics that best predict the responses.
In this paper, we propose supervised Probabilistic Latent Semantic Analysis (sPLSA), which extends PLSA to learn from the responses of documents: sPLSA is to PLSA what sLDA is to LDA. The major challenge lies in estimating a document's topic distribution, a constrained probability distribution dictated by both the content and the response of the document. We introduce an auxiliary variable to transform the constrained optimization problem into an unconstrained one, which allows us to derive an efficient EM algorithm for estimating the parameters of our model. Compared to sLDA, sPLSA is much more efficient and requires less hyperparameter tuning, while performing similarly on topic modeling and better on response factorization. This makes sPLSA an ideal choice for latent response analysis, such as ranking latent topics by their factorized response values. We utilize the sPLSA model to analyze the controversy of bills from the United States Congress and demonstrate the effectiveness of our model by identifying contentious legislative issues. The contributions of the paper can be summarized as follows.
We propose a novel supervised PLSA model which can efficiently infer latent topics and their factorized response values from the contents and the responses of documents. We derive an efficient EM algorithm to estimate the parameters of the model. We utilize sPLSA and sLDA to analyze the controversy of bills from the United States Congress. We demonstrate the effectiveness of sPLSA over sLDA as part of this analysis.
Probabilistic topic models
In 1999, three papers [1, 2, 13] introduced the model of Probabilistic Latent Semantic Analysis. One variant of the model appeared in 1998 [14], and all of these models were originally discussed in an earlier technical report [15]. PLSA was a probabilistic implementation of latent semantic analysis (LSA), introduced by Deerwester et al. [6]. LSA extended the vector space model and aimed to represent documents in a low-dimensional vector space consisting of common semantic factors. Whereas LSA projects document or word vectors into the latent semantic space, PLSA extracts the aspects related to documents. This aspect model was interpreted as a mixture model containing latent semantic mixtures, with the mixture probabilities estimated by the maximum-likelihood (ML) principle. PLSA did not provide a straightforward way to make inferences about documents not seen in the training data, and its parameterization was susceptible to overfitting. Latent Dirichlet Allocation (LDA) addressed these limitations by proposing a Bayesian probabilistic topic model.
PLSA and LDA established the field of probabilistic topic models, and many extensions of the two basic models have been proposed. Zhai et al. [16] extended PLSA to include a background component that explains non-informative background words, and proposed a cross-collection mixture model to support comparative text mining. Mei and Zhai [17] propose a general contextual text mining model that extends PLSA to incorporate context information. They further regularize PLSA with a harmonic regularizer based on a graph structure in the data [18]. One active area of topic modeling research is relaxing and extending the assumptions of PLSA and LDA to uncover more sophisticated structure in texts. For example, the work by Rosen-Zvi et al. [19] extends LDA to include authorship information. More recently, probabilistic topic models have been proposed for unsupervised many-to-many object matching [20] and cross-lingual tasks [21]. Many other topic models exist; Blei [22] gives an overview of the field of probabilistic topic models.
The original PLSA and LDA and most of their variants are unsupervised models. Blei and McAuliffe [12] proposed supervised LDA (sLDA) to capture a real-valued document rating as a regression response. The generative process of sLDA is similar to that of LDA, but with an additional step: drawing a response variable. The sLDA model is trained by maximizing the joint likelihood of the contents and the responses of documents. The authors tested sLDA on two real-world datasets, movie reviews with ratings and web pages with popularity, and the experimental results demonstrated the advantages of sLDA over regularized regression and over an unsupervised LDA analysis followed by a separate regression. Other extensions include multi-class sLDA [23], which directly captures discrete labels of documents as a classification response; discriminative LDA (DiscLDA) [24], which also performs classification, but with a mechanism different from that of sLDA; and MedLDA [25], which leverages the maximum margin principle for estimating latent topical representations. Recently, Jameel et al. [26] integrated class label information and word order structure into a supervised topic model for document classification. More variants of supervised topic models can be found in a number of applied domains, such as Labeled LDA [27], automatic summarization of changes in dynamic text collections [28], modeling of numerical time series [29], inferring topic hierarchies [30], and query expansion [31]. In computer vision, several supervised topic models have been designed for understanding complex scene images [32, 33]. Mimno and McCallum [34] also proposed a topic model that considers document-level meta-data, for example, the publication date and venue of a paper.
Most of the above supervised topic models are based on LDA; very little work exists on extending PLSA to the supervised setting. One such work used the spoken content of a multimedia document as a query for retrieving similar or relevant documents [35]; the query was used to train the model in a supervised fashion with respect to a query-document similarity objective function. Fergus et al. [36] extended PLSA to include spatial information in a translation- and scale-invariant manner, and utilized this modified PLSA model to learn an object category. Another work added a category-topic distribution to PLSA for human action recognition [37]. However, these models do not associate the topic distribution of the document with the response variable, so the discovered topics may not be indicative of the response. Aliyanto et al. [38] proposed a version of supervised PLSA for estimating technology readiness level, but they assumed that the topic of each word in a document is observed, which is not the case in many real-world applications. In this paper, we follow the way LDA was extended to sLDA by directly associating the documents' topic distributions with the response. The response is at the document level instead of the word level, and is thus more readily accessible. The learned topics depend on both the document's content and its response. To the best of our knowledge, no prior work has extended PLSA in a similar manner.
Recently, with the rise of deep learning, novel topic models based on neural networks have been proposed. Salakhutdinov and Hinton [39] proposed a two-layer restricted Boltzmann machine (RBM), called the replicated softmax, to extract low-level latent topics from a large collection of unstructured documents. Larochelle and Lauly [40] proposed a neural auto-regressive topic model inspired by the replicated softmax model, replacing the RBM with a neural auto-regressive distribution estimator (NADE). Kingma and Welling [41] proposed variational autoencoders, which have since been used to combine topic modeling with neural networks. Cao et al. [42] proposed the neural topic model (NTM) and its supervised extension (sNTM), in which word and document embeddings are combined. Moody [43] proposed lda2vec, a model combining LDA and word embeddings. Dieng et al. [44] integrated global word semantic information, extracted using a probabilistic topic model, into a recurrent neural network based language model. Gupta et al. [45] integrated a neural auto-regressive topic model into an LSTM recurrent neural network. Murakami and Chakraborty [46] investigated the use of word embeddings with NTM to obtain interpretable topics from short texts. Grootendorst [47] proposed BERTopic, which generates document embeddings with pre-trained transformer-based language models and then produces topic representations with a class-based TF-IDF procedure. Two recent surveys [48, 49] provide comprehensive reviews of neural topic models, covering nearly a hundred models and a wide range of applications in natural language understanding such as text generation, summarization, and language modeling. Despite the popularity of deep learning, our work focuses on traditional probabilistic methods because they are often easier to implement and more efficient to train, which may make them more suitable in resource-constrained environments where only limited computation and storage are available.
Nevertheless, we will explore combining the proposed model with neural networks in future work.
Controversy analysis of legislative bills
Legislative voting is a major area of research, much of it focused on ideal point estimation of the ideological positions of legislators, primarily for the purpose of predicting their voting patterns. An early work in this area presented a spatial model of legislative voting [50]. Londregan [51] estimated the preferred positions of legislators by modeling the legislative agenda. Cox and Poole [52] used a spatial model to assess the role of partisanship in influencing the votes of legislators. Variational methods were applied to predict votes [53]. Thomas et al. [54] modeled voting behavior from congressional debate transcripts. Gerrish and Blei [55] demonstrated roll call predictive models which link legislative text with legislative sentiment. They [56] further derived approximate posterior inference algorithms based on variational methods to predict the positions of legislators. Fang et al. [57] analyzed public statements from legislators to build a contrastive opinion model of the legislators. Gu et al. [58] conducted ideal point estimations of legislators on the latent topics of voted documents.
Some of the work cited above utilized topic models. For example, Gerrish and Blei [55] extended LDA to build a generative model of votes and bills called the ideal point topic model. The model infers two bill-related latent variables: one explains bills that all legislators will vote for or against, while the other explains bills that do not have unanimous approval or disapproval. In addition, the model infers a latent variable for the legislators' ideal points. As another example, Fang et al. [57] present the cross-perspective topic model, which unifies two identically extended LDA models to contrast the opinion words of a bipolar legislative body. The opinion words reflect the subjective positions of the polar entities on various topics. The model discriminates between opinion words and topic words by treating them as two separate observed variables.
In the broader field of controversy analysis, much work has been done on detecting contradictions in textual data. One of the early works studied the dynamics of conflicting opinions in texts by visually inspecting graphs [59]. Tsytsarau et al. [60] further investigated two types of contradictions, namely, “overlapping contradicting opinions” and “change of sentiment”. Many supervised learning approaches have been proposed for classifying texts into one of two opposing opinions using annotated controversial corpora, including sentences [61], documents [62], and document collections [61]. Some recent work addresses the task of identifying controversial content on Wikipedia [63, 64, 65] and on social media [66, 67, 68].
Notations
Graphical model representation of (a) PLSA and (b) sPLSA.
Assume the corpus
Generative process
Similar to many other topic models, sPLSA assumes that a document consists of multiple topics. Therefore, there is a distribution
The essential difference between PLSA and sPLSA lies in the modeling of the response variable
For each word
Choose a topic.
Choose a word.
Draw a response.
Here the response comes from a Gaussian linear model. The mean is the inner product of topic distribution
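Concretely (symbols are ours, as the original equations are not preserved in this copy), the response $y_d$ of document $d$ is drawn as

```latex
y_d \mid \boldsymbol{\theta}_d \;\sim\; \mathcal{N}\!\left(\boldsymbol{\eta}^{\top} \boldsymbol{\theta}_d,\; \sigma^{2}\right),
```

where $\boldsymbol{\theta}_d$ is the document's topic distribution, $\boldsymbol{\eta}$ is the vector of per-topic regression coefficients, and $\sigma^{2}$ is the noise variance.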
Figure 1 illustrates the graphical model representation of PLSA and sPLSA, respectively.
It is worth noting that our approach for modeling
The likelihood function in supervised PLSA consists of two parts. The first part is the likelihood for observing all the words in the corpus,
where
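For the standard PLSA aspect model, this first part takes the familiar form (notation ours)

```latex
\mathcal{L}_{1} = \sum_{d} \sum_{w} n(d, w) \,\log \sum_{z} P(w \mid z)\, P(z \mid d),
```

where $n(d, w)$ denotes the number of occurrences of word $w$ in document $d$.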
The second part of the likelihood function comes from the likelihood of the response variable. As shown in the generative process, we assume a linear model with Gaussian noise for modeling the response
where
where
We assume a Gaussian prior on the coefficients
Equations (2) and (5) share
where
Now that we have established the unified likelihood, we can use it to derive formulas for iteratively updating the parameters
The iterative updates of the parameter estimation process.
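For orientation, the E- and M-steps of plain (unsupervised) PLSA can be sketched as follows; variable names are ours, and the sketch omits the response-related updates that sPLSA adds.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Minimal EM for the PLSA aspect model.

    counts: (n_docs, n_words) document-term count matrix.
    Returns p_w_z (n_topics, n_words) and p_z_d (n_docs, n_topics).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: posterior P(z | d, w), shape (n_docs, n_topics, n_words)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # Expected counts n(d, w) * P(z | d, w)
        expected = counts[:, None, :] * joint
        # M-step: re-estimate the topic-word and document-topic distributions
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d
```

Each iteration monotonically increases the likelihood, which is the property the paper contrasts with the burn-in behavior of Gibbs sampling.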
The values of
It can be seen that the above objective function is strictly convex in
This solution is equivalent to Ridge Regression or Tikhonov regularization [69].
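A minimal sketch of this closed-form ridge update, with `theta` the matrix of per-document topic proportions, `y` the responses, and `lam` the regularization strength (names are ours):

```python
import numpy as np

def ridge_update(theta, y, lam):
    """Closed-form ridge solution: eta = (Theta^T Theta + lam*I)^{-1} Theta^T y."""
    k = theta.shape[1]
    return np.linalg.solve(theta.T @ theta + lam * np.eye(k), theta.T @ y)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is the standard numerically stable choice for this normal-equations form.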
The values of
In the M-step, we maximize the expected complete data log-likelihood as follows:
with the constraint of
The values of
Since the second and third terms in the above lower bound are constants with respect to
This means we use the following objective instead of the unified likelihood to update
The above objective is a concave function with respect to
The constraint must be met because each
where
Irrespective of the value of
Furthermore, we can reduce the number of
when
As a result, we can express
This results in
and to the following when
Finally, we can express
The above representation of
We use the gradient ascent algorithm to maximize the objective function
where:
and:
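Since the original gradient expressions are not preserved in this copy, the sketch below assumes a softmax form for the auxiliary-variable reparameterization (an assumption, not necessarily the paper's exact construction) and shows how the gradient with respect to the unconstrained variables follows from the chain rule:

```python
import numpy as np

def softmax(gamma):
    """Map unconstrained auxiliary variables to a valid probability distribution."""
    e = np.exp(gamma - gamma.max())  # subtract max for numerical stability
    return e / e.sum()

def grad_wrt_gamma(grad_theta, theta):
    """Chain rule through the softmax Jacobian d theta_i / d gamma_j = theta_i (delta_ij - theta_j):
    d f / d gamma_j = theta_j * (g_j - g . theta), where g = d f / d theta."""
    return theta * (grad_theta - grad_theta @ theta)
```

With this reparameterization, gradient ascent can be run directly on the auxiliary variables, and the simplex constraint is satisfied automatically.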
After we update each
After the parameter estimation is completed, we do the following to infer the latent topics and their factorized response values:
We infer the latent topics from the topic-word distribution.
We infer the factorized response for each latent topic.
In this section, we discuss the dataset we used to test sPLSA, present experimental results, and compare our model to the baselines.
The dataset and source code for our experiments can be found at
We tested sPLSA using bills which were placed for a vote in the United States Congress. The objective of our test is to generate the latent topics of the bills and then rank them by controversy. We do this by first assigning a controversy score to each bill and then inferring the factorized controversy score of each topic using sPLSA. We assign a controversy score to each bill using the spread of the number of yes and no votes. The formula we use is as follows:
where
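The exact formula of Eq. (27) is not reproduced in this copy; purely as an illustration of a spread-based score of the kind described, one could use:

```python
def controversy_score(yes, no):
    """Illustrative spread-based controversy score (NOT necessarily the paper's
    exact Eq. 27): 1.0 when the votes split evenly, 0.0 when unanimous."""
    return 1.0 - abs(yes - no) / (yes + no)
```

Under this illustrative definition, a 60-40 vote scores 0.8 while a unanimous vote scores 0.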
We selected congressional bills and their controversy scores as our dataset in order to demonstrate applying sPLSA to a real-world problem. Specifically, we want to identify contentious issues in the United States Congress by generating their latent topics. By inferring their relative controversy using sPLSA, we can rank the topics by controversy and identify the contentious issues by selecting the most controversial topics.
We collected bills starting from the
We were able to collect the votes and content of 6,403 bills. 5,531 bills were from the House of Representatives and 872 bills were from the Senate. 6,160 bills had more yes votes than no votes, and 243 bills had more no votes than yes votes. Figure 3 shows the distribution of the bills’ controversy score.
Histogram of the distribution of the response variable calculated using Eq. (27).
We did the following preprocessing of the bills to create our dataset:
Removed words containing characters outside the English alphabet.
Removed words fewer than 4 characters in length.
Removed common English words using Mallet's stop-word list.
Removed domain-specific words using a custom stop-word list. The stop-word list has 157 words, and we created it by analyzing the word frequency of the bills; it mostly consists of legal terms.
Selected the 15,000 most frequent words as the vocabulary of our corpus.
We then created the dataset as a bag-of-words representation of each bill.
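The preprocessing and bag-of-words construction described above can be sketched as follows (the filters mirror the listed steps; the actual stop-word lists and tokenizer are not reproduced here):

```python
from collections import Counter

def preprocess(tokens, stopwords):
    """Keep alphabetic tokens of length >= 4 that are not stop words."""
    return [t.lower() for t in tokens
            if t.isalpha() and len(t) >= 4 and t.lower() not in stopwords]

def bag_of_words(tokens, vocabulary):
    """Count occurrences of in-vocabulary tokens for one bill."""
    vocab = set(vocabulary)
    return Counter(t for t in tokens if t in vocab)
```

Each bill is then represented by its counter of in-vocabulary word counts.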
We randomly partition our dataset as follows: 80% for training, 10% for validation, and 10% for testing. We initialize
Evaluation metrics
We test a trained model by folding-in the test dataset similar to the way specified in [1]. This is essentially the same as training the model with the test dataset except
For For
The higher the correlation between
The reason why we can correlate each
The sparsity of 
Table 2 shows the top 5 words for the
The top 5 words for the first, median, second-to-last, and last most controversial topics, as well as the factorized response values of the topics.
Our baseline is an sLDA model. The response variable for the model is the controversy score. We used the 'slda.em' function in the R "lda" package.
Comparison of the Pearson correlation between
sPLSA is designed for topic discovery and latent response inference. This comes at the expense of its prediction performance. Theoretically, we can use sPLSA in a semi-supervised setting where we mix both labeled and unlabeled data, and then try to predict the labels for the unlabeled data. In such a scenario, we update
We ran sPLSA and sLDA on a MacBook Pro laptop with a 2 GHz processor and 16 GB RAM on the training dataset for various values of
Gibbs sampling converges much more slowly than the EM algorithm because the topics tend to depend on one another. This prolongs the burn-in period of the Gibbs sampling process, during which a stationary distribution has not yet been reached; a stationary distribution must be reached before the actual sampling can take place. During the burn-in period, the Gibbs sampling process can diverge at times. On the other hand, EM has no equivalent of a burn-in period, and every iteration of the algorithm is guaranteed to monotonically increase the likelihood.
Impact of
We trained the model with
The perplexity and Pearson correlation values on the validation dataset for different values of
The perplexity and Pearson correlation values on the validation dataset for different values of
The prediction RMSE values of sPLSA and sLDA at various values of 
Comparison of the training time of sPLSA and sLDA for various values of 
We trained the model for each combination of
The perplexity for various combinations of 
From the figure, we can generally see that as
Figure 8 shows the values of the Pearson correlation between
The Pearson correlation between 
From the figure, we can generally see that the correlation increases steeply from
Tables 5, 6, and 7 show the top words for the topics generated by PLSA, sLDA, and sPLSA. In general, we can see that very similar topics are generated by all three models. For example, topic 7 for PLSA, topic 2 for sLDA, and topic 6 for sPLSA are about education. This illustrates that the perplexity trade-off we made in selecting
Top 5 words for the topics of PLSA when
Top 5 words for the topics of PLSA when
Top 5 words for the topics of sLDA when
Top 5 words for the topics of sPLSA when
For each topic listed in Table 2 where
Sample bill for the most controversial topic
Sample bill for the most controversial topic
Sample bill for the second most controversial topic
Sample bill for the most moderately controversial topic
Sample bill for the second least controversial topic
Sample bill for the least controversial topic
In this paper, we introduce sPLSA, an extension of PLSA: sPLSA is to PLSA what sLDA is to LDA. Similar to sLDA, sPLSA models a response variable associated with the documents in order to factorize the responses on a per-topic basis. We discuss the advantages sPLSA has over sLDA for latent response analysis, such as ranking topics by their factorized responses, and in execution efficiency. We also discuss the advantage sLDA has over sPLSA in predicting the responses of documents. We experimentally demonstrate sPLSA on a real-world problem by performing a latent controversy analysis of topics inferred from the bills of the United States Congress.
This work is an initial step in a promising research direction. The presented model assumes the response comes from a Gaussian linear model. This assumption can be relaxed by extending the distribution of the response to a generalized linear model (GLM) [70], which allows response variables with error distributions other than the Gaussian. In future work, we plan to extend sPLSA to other types of response variables, including the multinomial, Poisson, gamma, Weibull, and inverse Gaussian distributions. This will allow us to apply sPLSA to latent topic analysis on a more diverse set of problems. Last but not least, we will explore combining the proposed model with neural networks, leveraging their capability for modeling nonlinearity, and extend the work to the realm of neural topic models [71].
