
Abstract

Cross-Lingual Information Retrieval (CLIR) enables a user to query in a language different from that of the target documents. CLIR incorporates a machine translation technique, such as Statistical Machine Translation (SMT) or Neural Machine Translation (NMT), which uses either a dictionary or a parallel corpus for training. A Hindi word may have multiple variants owing to the morphological richness of the language, and these morphological variants may or may not be present in the dictionary or parallel corpus. Variants that are absent from the dictionary or parallel corpus are not translated by state-of-the-art SMT or NMT techniques. Conventional Information Retrieval (IR) techniques eliminate stop-words to improve IR effectiveness, but some stop-words are significant, and retaining them may improve IR effectiveness. This paper proposes a translation induction algorithm that incorporates a refined stop-word list and solutions for morphological variants, and that translates words based on their contextual words. In an experimental analysis of Hindi-English CLIR, the proposed algorithm is compared with manual dictionary, probabilistic dictionary, SMT, and NMT based translation techniques, and it outperforms these other CLIR approaches.
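To make two of the ideas in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a toy dictionary of romanized Hindi words, a hypothetical stop-word list from which one significant word ("bina", "without") is exempted, and a simple longest-common-prefix fallback standing in for the handling of morphological variants. All names, entries, and the `min_prefix` threshold are illustrative assumptions, and context-based translation selection is left out.

```python
# Toy illustration only: the entries, stop-word sets, and the prefix
# heuristic below are assumptions, not taken from the paper.

STOP_WORDS = {"ka", "ki", "ke", "hai", "mein", "bina"}
SIGNIFICANT_STOP_WORDS = {"bina"}  # assumed to carry meaning ("without")
REFINED_STOP_WORDS = STOP_WORDS - SIGNIFICANT_STOP_WORDS

TOY_DICTIONARY = {"ladka": "boy", "kitab": "book", "bina": "without"}


def common_prefix_len(a, b):
    """Return the number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lookup(word, dictionary, min_prefix=4):
    """Exact lookup, then fall back to the entry sharing the longest prefix."""
    if word in dictionary:
        return dictionary[word]
    best = max(dictionary,
               key=lambda entry: common_prefix_len(word, entry),
               default=None)
    if best is not None and common_prefix_len(word, best) >= min_prefix:
        return dictionary[best]
    return None  # no sufficiently similar entry


def translate_query(query, dictionary=TOY_DICTIONARY):
    """Drop refined stop-words and translate the remaining tokens."""
    translated = []
    for token in query.lower().split():
        if token in REFINED_STOP_WORDS:
            continue
        target = lookup(token, dictionary)
        # Untranslatable tokens are passed through unchanged.
        translated.append(target if target is not None else token)
    return translated


print(translate_query("ladkon ki kitaben"))
print(translate_query("kitab ke bina ladka"))
```

In this sketch the inflected forms "ladkon" and "kitaben" are absent from the toy dictionary and are resolved through their shared prefixes with "ladka" and "kitab", while "bina" survives stop-word removal because it is on the assumed significant list. A real system would need a linguistically informed variant matcher rather than a raw prefix threshold.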

Keywords

Cross-lingual information retrieval; refined stop-words; morphological variants solutions; statistical machine translation; neural machine translation

