Abstract
Cross-Lingual Information Retrieval (CLIR) enables a user to issue a query in a language different from that of the target documents. CLIR incorporates a machine translation technique, such as Statistical Machine Translation (SMT) or Neural Machine Translation (NMT), which uses either a dictionary or a parallel corpus for training. A Hindi word may have multiple variants owing to the morphological richness of the language, and these morphological variants may or may not be present in the dictionary or parallel corpus. Variants that are absent from the dictionary or parallel corpus are not translated by state-of-the-art SMT or NMT techniques. Conventional Information Retrieval (IR) techniques eliminate stop-words to improve IR effectiveness, but some stop-words are significant, and their presence may improve IR effectiveness. This paper proposes a translation induction algorithm that incorporates a refined stop-word list, provides solutions for morphological variants, and translates words based on their contextual words. In an experimental analysis of Hindi-English CLIR, the proposed algorithm is compared with translation techniques based on a manual dictionary, a probabilistic dictionary, SMT, and NMT, and it outperforms these other CLIR approaches.
