Abstract

Data stream mining seeks to extract useful information from quickly arriving, infinitely sized, and evolving data streams. Although these challenges have been addressed throughout the literature, none of them can be considered “solved.” We contribute to closing this gap for the task of data stream clustering by proposing two modifications to the well-known ClusTree data stream clustering algorithm: pruning unused branches and detecting concept drift. Our experimental results show how difficult these aspects of data stream mining are to tackle and how sensitive stream mining algorithms are to parameter values. We conclude that further research is required to better equip stream learners for the data stream clustering task.

Keywords

Concept drift, data streams, on-line learning
