Abstract
Outlier detection is an important problem in data mining and machine learning. This paper proposes an information-entropy-based k-nearest-neighborhood relevant outlier factor algorithm that combines Shannon information theory with a triangle pruning strategy. The algorithm accounts for data points whose k-nearest neighbors lie on the edge of the range within a designated radius, and it considers the neighborhood influence on each point to address the problems of information concealment and submergence. Information entropy is used to compute weights that distinguish the importance of each attribute. Based on these attribute weights, the improved pruning strategy removes some inliers to obtain an outlier candidate dataset, reducing the computational complexity of the subsequent steps. Finally, using the weighted distances between objects in the candidate dataset and those in the original dataset, the algorithm computes the dissimilarity between each object and its k-nearest neighbors; the $r$ points with the highest dissimilarity are reported as outliers. Experimental results show that, compared to existing methods, the proposed approach improves the pruning and detection rates while maintaining the coverage rate.
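The entropy-weighted k-NN scoring described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the common entropy-weight normalization $w_j = (1 - H_j) / \sum_k (1 - H_k)$ over histogram-discretized attributes, uses the mean weighted k-NN distance as the dissimilarity score, and omits the triangle pruning step. All function names and parameters are hypothetical.

```python
import numpy as np

def entropy_weights(X, bins=10):
    """Assumed entropy-weight scheme: discretize each attribute into
    histogram bins, compute its normalized Shannon entropy H_j, and
    set w_j proportional to (1 - H_j), so more informative (less
    uniform) attributes receive larger weights."""
    n, d = X.shape
    H = np.empty(d)
    for j in range(d):
        counts, _ = np.histogram(X[:, j], bins=bins)
        p = counts[counts > 0] / n
        H[j] = -np.sum(p * np.log(p)) / np.log(bins)  # normalized to [0, 1]
    w = 1.0 - H
    if w.sum() == 0:                     # all attributes equally uninformative
        return np.full(d, 1.0 / d)
    return w / w.sum()

def knn_dissimilarity(X, k=5, weights=None):
    """Dissimilarity of each point: mean weighted Euclidean distance
    to its k nearest neighbors (a standard k-NN outlier score)."""
    w = np.ones(X.shape[1]) if weights is None else weights
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((w * diff ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)          # exclude self-distance
    knn = np.sort(D, axis=1)[:, :k]
    return knn.mean(axis=1)

def top_r_outliers(X, k=5, r=1):
    """Report the indices of the r points with the highest dissimilarity."""
    w = entropy_weights(X)
    scores = knn_dissimilarity(X, k=k, weights=w)
    return np.argsort(scores)[::-1][:r]
```

For example, on a tight Gaussian cluster plus one distant point, `top_r_outliers(X, k=5, r=1)` returns the index of the distant point, since its mean weighted distance to its 5 nearest neighbors dominates every in-cluster score.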