Sage Journals: Discover world-class research

Abstract

Text classification is one of the most important areas of machine learning theory. It enables a range of tasks, including email spam filtering and context identification. Classification theory proposes a number of techniques based on different technologies and tools. Classification systems are typically divided into single-label and multi-label categorization systems, according to the number of categories they assign to each classified document. In this paper, we present work in the area of single-label classification that resulted in a statistical classifier based on the Naive Bayes assumption of statistical independence of word occurrence across a document. Our algorithm takes cross-category word occurrence into account when deciding the class of a random document. Moreover, instead of estimating word co-occurrence when assigning a class, we estimate the contribution of each word to a document belonging to a class. This approach outperforms other statistical classifiers, such as the Naive Bayes Classifier and language models, as shown by our results.
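For orientation, the Naive Bayes independence assumption mentioned in the abstract can be sketched as a plain multinomial Naive Bayes baseline. This is not the classifier proposed in the paper, whose cross-category word-contribution scoring is not specified in the abstract. The function names `train` and `classify`, the add-one (Laplace) smoothing, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Collect per-class statistics from (tokens, label) pairs.

    Returns the number of documents per class, the word counts per class,
    and the overall vocabulary.
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(tokens, class_docs, word_counts, vocab):
    """Assign the single most probable class to a tokenized document."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        # Log prior of the class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Independence assumption: each word contributes its own
            # log-likelihood term. Add-one smoothing (an assumption here)
            # keeps unseen words from zeroing out the product.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, for illustration only.
training = [
    (["cheap", "pills", "offer"], "spam"),
    (["meeting", "agenda", "notes"], "ham"),
    (["cheap", "offer", "now"], "spam"),
]
model = train(training)
predicted = classify(["cheap", "offer"], *model)
```

Because each word adds an independent term to the class score, the decision is single-label by construction: the class with the highest total score is returned.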

Keywords

language models; Naive Bayes Classifier; single-label document classification/categorization; statistics

