Abstract
Word Sense Disambiguation (WSD), one of the most challenging problems in machine translation, can be cast as a classification problem. In this paper, we use K-Nearest-Neighbor (KNN), one of the most popular classification methods, together with several knowledge-based resources to design a WSD scheme. The success of KNN depends tightly on two factors: the features used to represent the context in which an ambiguous word occurs, and the distance/similarity measure used to compare context vectors. Accordingly, the present study focuses on both. For the first, we extract three sets of features: syntactic, lexical, and semantic features. To produce enriched and useful corpora, we apply several preprocessing steps, and we carry out feature selection as well as a feature weighting policy to fine-tune the classifier. For the second, we try several distance/similarity metrics (rather than a single one) to find the most suitable, and we propose a feature-weighted variant of each metric. Moreover, to show that the proposed schemes are not language-dependent, we apply them to two data sets: English and Persian corpora. The evaluation results for the feature selection and feature weighting strategies show that the semantic and syntactic features have a significant effect on the classification ability of the system. The results also compare favorably with the state of the art.
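The idea of combining KNN with per-feature weights, as described above, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weight values, the set of metrics, and the exact weighted formulas are assumptions for demonstration, since the abstract does not specify them.

```python
import numpy as np

def weighted_distance(x, y, w, metric="euclidean"):
    """Distance between two context vectors x and y under per-feature
    weights w (hypothetical weighting; the paper's exact weighted
    formulas are not given in the abstract)."""
    x, y, w = np.asarray(x, float), np.asarray(y, float), np.asarray(w, float)
    if metric == "euclidean":
        return float(np.sqrt(np.sum(w * (x - y) ** 2)))
    if metric == "manhattan":
        return float(np.sum(w * np.abs(x - y)))
    if metric == "cosine":
        # 1 - weighted cosine similarity, treated as a distance
        num = np.sum(w * x * y)
        den = np.sqrt(np.sum(w * x * x)) * np.sqrt(np.sum(w * y * y))
        return float(1.0 - num / den) if den else 1.0
    raise ValueError(f"unknown metric: {metric}")

def knn_classify(query, vectors, labels, weights, k=3, metric="euclidean"):
    """Assign the query context the majority sense label among its
    k nearest training contexts under the weighted metric."""
    dists = [weighted_distance(query, v, weights, metric) for v in vectors]
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

In use, each training vector would encode the syntactic, lexical, and semantic features of one occurrence of the ambiguous word, with its attested sense as the label; the weight vector would then let feature selection and weighting down-rank less informative features without removing them outright.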
