Abstract
In supervised classification, a class imbalance problem arises when one class has fewer objects than the other. One of the most common solutions to class imbalance is oversampling, and SMOTE is the best-known and most-referenced oversampling method. However, SMOTE creates synthetic objects randomly, so it produces a different result each time it is applied; in practice, the user must run SMOTE several times and choose the best of the generated balanced datasets. For this reason, in this paper we present SMOTE-D, a deterministic version of SMOTE, and propose new deterministic SMOTE-D-based versions of some of the most recent and successful SMOTE-based methods. Our experiments show that all the proposed deterministic methods produce results as good as those of the random methods while needing to be applied only once. This is important from a practical point of view: our proposals save time by avoiding the repeated applications SMOTE requires, and they yield a single, reproducible result.
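To illustrate the non-determinism the abstract refers to, the following is a minimal toy sketch of SMOTE-style interpolation: each synthetic object is placed at a random fraction along the segment between a minority object and a neighbor. This is an illustrative simplification, not the paper's SMOTE-D; the function name and its parameters are assumptions for the example.

```python
import random

def smote_like_interpolate(minority, n_new=3, seed=None):
    """Toy SMOTE-style oversampling (illustrative only, not SMOTE-D).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and its nearest neighbor, at a random fraction.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # nearest neighbor of a by squared Euclidean distance (excluding a)
        b = min((p for p in minority if p is not a),
                key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))
        gap = rng.random()  # random interpolation fraction -> non-determinism
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
run1 = smote_like_interpolate(minority)   # unseeded runs generally differ,
run2 = smote_like_interpolate(minority)   # so users rerun and pick the best
```

Because the interpolation fractions are drawn at random, unseeded runs generally yield different balanced datasets; a deterministic variant, by construction, returns the same result on every application, which is the practical advantage the paper claims for SMOTE-D.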
