Abstract
The k-nearest neighbors (kNN) algorithm is one of the most popular and simplest lazy learners. However, as the training dataset grows larger, the algorithm suffers from the following drawbacks: large storage requirements, slow classification speed, and high sensitivity to noise. To overcome these drawbacks, we reduce the size of the training data by selecting only the necessary prototypes before classification. This study proposes an extended prototype selection technique based on the geometric median (GM). We compare the proposed method with seven state-of-the-art prototype selection methods and with 1NN as the baseline model, using 25 datasets from the KEEL and UCI dataset repositories. The proposed method runs at least 3.5 times faster than the baseline model at the cost of slightly reduced accuracy. In addition, the classification accuracy and kappa value of the proposed method are comparable to those of all the state-of-the-art prototype selection methods considered.
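The geometric median underlying the proposed technique is the point minimizing the sum of Euclidean distances to a set of points; unlike the mean, it has no closed form and is typically computed iteratively. As a hedged illustration only (this sketch is not the authors' implementation), Weiszfeld's classic fixed-point iteration can compute it:

```python
import numpy as np

def geometric_median(points, tol=1e-6, max_iter=200):
    """Approximate the geometric median of an (n, d) array of points
    using Weiszfeld's iterative reweighting algorithm (illustrative
    sketch; not the paper's implementation)."""
    y = points.mean(axis=0)  # initialize at the centroid
    for _ in range(max_iter):
        d = np.linalg.norm(points - y, axis=1)
        # Guard against division by zero when the estimate
        # coincides with a data point.
        mask = d > 1e-12
        if not mask.any():
            return y
        w = 1.0 / d[mask]  # inverse-distance weights
        y_new = (points[mask] * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# For a symmetric point set, the geometric median is the center:
square = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
gm = geometric_median(square)  # close to [1.0, 1.0]
```

The geometric median's robustness to outliers (compared with the arithmetic mean) is what makes it attractive for choosing representative prototypes from noisy training data.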
