Abstract

The k-NN algorithm is an instance-based learning algorithm that is widely used in data mining applications. The core engine of the k-NN algorithm is its distance/similarity function, and the algorithm's performance varies with the choice of that function. The traditional distance/similarity functions used in k-NN do not handle mix-mode words well, that is, cases where one string contains multiple substrings or words. For example, the same field may appear as the one-word string “Name”, the two-word string “Employee Name”, or a longer string such as “Name of Employee”. Different distance/similarity functions run into this ambiguity, which makes it difficult to find the perfect match between words. To improve perfect-match calculation in the traditional k-NN algorithm, a new similarity distance metric, named word-distance (w-distance), is developed. A perfect match helps to identify the exact required value. The proposed w-distance is a hybrid of distance and similarity because it handles the dissimilarity and similarity features of strings at the same time. The simulation results showed that the k-NN algorithm performs better with w-distance than with the Euclidean distance or the cosine similarity.

Keywords

k-NN algorithm, distance/similarity metric, text match, data mining, cosine similarity
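
As a minimal sketch of the ambiguity described in the abstract, the following Python compares the three example strings under a bag-of-words cosine similarity and a Euclidean distance. The whitespace tokenizer, the function names, and the use of raw term counts are assumptions made for illustration. This is not the article's w-distance, whose definition is not given here.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Return lowercase token counts (assumed tokenizer: split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between the term-count vectors of two strings."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between the term-count vectors of two strings."""
    va, vb = bag_of_words(a), bag_of_words(b)
    vocabulary = set(va) | set(vb)
    return math.sqrt(sum((va[t] - vb[t]) ** 2 for t in vocabulary))


if __name__ == "__main__":
    query = "Name"
    for candidate in ["Name", "Employee Name", "Name of Employee"]:
        print(
            f"{query!r} vs {candidate!r}: "
            f"cosine={cosine_similarity(query, candidate):.4f}, "
            f"euclidean={euclidean_distance(query, candidate):.4f}"
        )
```

Under these assumptions, “Name” scores a cosine similarity of about 0.71 against “Employee Name” and about 0.58 against “Name of Employee”, although all three strings may refer to the same field. Neither measure reports a perfect match unless the token sets are identical.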
