Sage Journals: Discover world-class research

Abstract

Probabilistic topic models, which frequently represent topics as multinomial distributions over words, have been extensively used for discovering latent topics in text corpora. However, because topic models are entirely unsupervised, they may lead to topics that are not understandable in applications. Recently, several knowledge-based topic models have been proposed which primarily use word-level domain knowledge in the model to enhance the topic coherence and ignore the rich information carried by entities (e.g, persons, locations, organizations, etc.) associated with the documents. Additionally, there exists a vast amount of prior knowledge (background knowledge) represented as Linked Open Data (LOD) datasets and other ontologies, which can be incorporated into the topic models to produce coherent topics. In this paper, we introduce a novel regularization entity-based topic model (RETM), which integrates an ontology with an entity-based topic model (EntLDA) to increase the coherence of the identified topics through the topic modeling process. Our experimental results demonstrate the effectiveness of the proposed model in improving the coherence of topics.

Keywords

Statistical learning topic modeling topic coherence Semantic Web ontologies

Get full access to this article

View all access options for this article.

References

Allahyari and

Kochut, Automatic topic labeling using ontology-based topic models, in: 14th International Conference on Machine Learning and Applications (ICMLA), IEEE, 2015.

Allahyari and

Kochut, Discovering coherent topics with entity topic models, in: 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), IEEE, 2016, pp. 26–33. doi:10.1109/WI.2016.0015.

Andrzejewski,

Zhu and

Craven, Incorporating domain knowledge into topic modeling via Dirichlet forest priors, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 25–32.

Andrzejewski,

Zhu,

Craven and

Recht, A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic, in: IJCAI Proceedings – International Joint Conference on Artificial Intelligence, Vol. 2, 2011, pp. 1171–1177.

Bizer,

Heath and

Berners-Lee, Linked data-the story so far, in: Semantic Services, Interoperability and Web Applications: Emerging Concepts, 2009, pp. 205–227. doi:10.4018/978-1-60960-593-3.ch008.

Bizer,

Lehmann,

Kobilarov,

Auer,

Becker,

Cyganiak and

Hellmann, DBpedia – A crystallization point for the Web of Data, Web Semantics: Science, Services and Agents on the World Wide Web 7(3) (2009), 154–165. doi:10.1016/j.websem.2009.07.002.

D.M.

Blei,

A.Y.

Ng and

M.I.

Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

J.L.

Boyd-Graber,

D.M.

Blei and

Zhu, A topic model for word sense disambiguation, in: EMNLP-CoNLL, 2007, pp. 1024–1033.

Chang,

Gerrish,

Wang,

J.L.

Boyd-Graber and

D.M.

Blei, Reading tea leaves: How humans interpret topic models, in: Advances in Neural Information Processing Systems, 2009, pp. 288–296.

10.

Chemudugunta,

Holloway,

Smyth and

Steyvers, Modeling documents by combining semantic concepts with unsupervised statistical learning, in: The Semantic Web – ISWC 2008, Springer, 2008, pp. 229–244. doi:10.1007/978-3-540-88564-1_15.

11.

Chemudugunta,

Smyth and

Steyvers, Combining concept hierarchies and statistical topic models, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, ACM, 2008, pp. 1469–1470.

12.

Chen,

Mukherjee,

Liu,

Hsu,

Castellanos and

Ghosh, Discovering coherent topics using general knowledge, in: Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, ACM, 2013, pp. 209–218.

13.

Chen,

Mukherjee,

Liu,

Hsu,

Castellanos and

Ghosh, Leveraging multi-domain prior knowledge in topic models, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, AAAI Press, 2013, pp. 2071–2077.

14.

Deng,

Han,

Zhao,

Yu and

C.X.

Lin, Probabilistic topic models with biased propagation on heterogeneous information networks, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 1271–1279. doi:10.1145/2020408.2020600.

15.

T.L.

Griffiths and

Steyvers, Finding scientific topics, Proceedings of the National academy of Sciences of the United States of America 101(Suppl. 1) (2004), 5228–5235. doi:10.1073/pnas.0307752101.

16.

Han and

Sun, An entity–topic model for entity linking, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 2012, pp. 105–115.

17.

Hingmire and

Chakraborti, Topic labeled text classification: A weakly supervised approach, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, ACM, 2014, pp. 385–394.

18.

Hofmann

et al., Probabilistic latent semantic analysis, in: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, 1999, pp. 289–296.

19.

Hu,

Boyd-Graber,

Satinoff and

Smith, Interactive topic modeling, Machine Learning 95(3) (2014), 423–469. doi:10.1007/s10994-013-5413-0.

20.

Jagarlamudi,

DauméIII and

Udupa, Incorporating lexical priors into topic models, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2012, pp. 204–213.

21.

Li,

Cardie and

Li, TopicSpam: A topic-model based approach for spam detection, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2013, pp. 217–221.

22.

Lu,

Ott,

Cardie and

B.K.

Tsou, Multi-aspect sentiment analysis with topic models, in: 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW), IEEE, 2011, pp. 81–88. doi:10.1109/ICDMW.2011.125.

23.

Ma,

Zhou,

Liu,

M.R.

Lyu and

King, Recommender systems with social regularization, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 287–296. doi:10.1145/1935826.1935877.

24.

Mei,

Cai,

Zhang and

Zhai, Topic modeling with network regularization, in: Proceedings of the 17th International Conference on World Wide Web, ACM, 2008, pp. 101–110. doi:10.1145/1367497.1367512.

25.

G.A.

Miller, WordNet: A lexical database for English, Communications of the ACM 38(11) (1995), 39–41. doi:10.1145/219717.219748.

26.

Mimno,

H.M.

Wallach,

Talley,

Leenders and

McCallum, Optimizing semantic coherence in topic models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 262–272.

27.

Nadeau and

Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30(1) (2007), 3–26. doi:10.1075/li.30.1.03nad.

28.

Newman,

Chemudugunta and

Smyth, Statistical entity–topic models, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 680–686. doi:10.1145/1150402.1150487.

29.

Newman,

Noh,

Talley,

Karimi and

Baldwin, Evaluating topic models for digital libraries, in: Proceedings of the 10th Annual Joint Conference on Digital Libraries, ACM, 2010, pp. 215–224. doi:10.1145/1816123.1816156.

30.

Petterson,

Buntine,

S.M.

Narayanamurthy,

T.S.

Caetano and

A.J.

Smola, Word features for latent Dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010, pp. 1921–1929.

31.

Pouriyeh,

Allahyari,

Kochut,

Cheng and

H.R.

Arabnia, ES-LDA: Entity summarization using knowledge-based topic modeling, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 316–325.

32.

Pouriyeh,

Allahyari,

Kochut,

Cheng and

H.R.

Arabnia, Combining word embedding and knowledge-based topic modeling for entity summarization, in: 2018 IEEE 12th International Conference on Semantic Computing (ICSC), IEEE, 2018, pp. 252–255. doi:10.1109/ICSC.2018.00044.

33.

Pouriyeh,

Allahyaril,

Cheng,

H.R.

Arabnia,

Kochut and

Atzori, R-LDA: Profiling RDF datasets using knowledge-based topic modeling, in: 2019 IEEE 13th International Conference on Semantic Computing (ICSC), IEEE, 2019, pp. 146–149. doi:10.1109/ICOSC.2019.8665510.

34.

C.P.

Robert and

Casella, Monte Carlo Statistical Methods, Vol. 319, Citeseer, 2004. doi:10.1007/978-1-4757-4145-2.

35.

Rosen-Zvi,

Griffiths,

Steyvers and

Smyth, The author–topic model for authors and documents, in: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, AUAI Press, 2004, pp. 487–494.

36.

Tang,

Leung,

Luo,

Chen and

Gong, Towards ontology learning from folksonomies, in: IJCAI, Vol. 9, 2009, pp. 2089–2094.

37.

Wei and

W.B.

Croft, LDA-based document models for ad-hoc retrieval, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006, pp. 178–185.

38.

Witten and

Milne, An effective, low-cost measure of semantic relatedness obtained from Wikipedia links, in: Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, AAAI Press, 2008, pp. 25–30.

39.

Yao,

Haghighi,

Riedel and

McCallum, Structured relation discovery using generative models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 1456–1466.

40.

Zhu,

Ghahramani,

Lafferty et al., Semi-supervised learning using Gaussian fields and harmonic functions, in: ICML, Vol. 3, 2003, pp. 912–919.