Abstract
This paper studies and analyzes the SEA dataset, grouping the data and digitizing its features to obtain corresponding labels. The K-Nearest Neighbors (KNN) algorithm is then applied to the dataset, and experiments show that the Manhattan distance is the best-performing distance metric for this problem. Frequent words used by individual users are explicitly selected as the features for computation, and grid search is employed to find the optimal hyperparameters, from which the final model is built. The Naive Bayes algorithm is then applied to the same dataset, comparing the strengths and weaknesses of its different variants as well as of various feature extraction methods; the results show that Bernoulli Naive Bayes achieves higher accuracy than Multinomial Naive Bayes.
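The pipeline described above can be sketched as follows. This is an illustrative example only, not the paper's actual code: scikit-learn is assumed, synthetic data stands in for the SEA per-user word-frequency features, and the hyperparameter grid (values of k, Manhattan vs. Euclidean distance) is a plausible guess rather than the grid reported in the paper.

```python
# Illustrative sketch (not the authors' code): grid search over KNN
# hyperparameters, including the Manhattan metric (p=1), followed by a
# comparison of Bernoulli vs. Multinomial Naive Bayes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Synthetic stand-in for digitized per-user word-frequency features.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X = np.abs(X)  # Multinomial NB requires non-negative feature values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grid search over k and the distance metric: p=1 is Manhattan, p=2 Euclidean.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [3, 5, 7], "p": [1, 2]}, cv=5)
grid.fit(X_tr, y_tr)
print("best KNN params:", grid.best_params_)
print("KNN test accuracy:", grid.score(X_te, y_te))

# Bernoulli NB binarizes each feature (word present/absent);
# Multinomial NB models the raw frequency counts.
for nb in (BernoulliNB(), MultinomialNB()):
    nb.fit(X_tr, y_tr)
    print(type(nb).__name__, "test accuracy:", nb.score(X_te, y_te))
```

Note the modeling difference that drives the paper's comparison: Bernoulli Naive Bayes uses only word presence/absence per user, while Multinomial Naive Bayes uses the frequency counts themselves.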
