Abstract
Introduction
Clustering partitions data into clusters according to similarity measures and is one of the most important tasks in data analysis, with applications such as pattern discovery, pattern recognition, data summarization, and image processing.1
These fields now include many branches, namely partition-based methods, model-based methods,2 hierarchical methods,3 density-based methods,4–6 graph-based methods, grid-based methods,7 and clustering ensemble methods.8
Partition-based methods are the simplest and most important part of cluster analysis, and many algorithms have been proposed to facilitate this process, such as k-means. Because k-means is simple yet effective, it has been widely adopted in practice.

An example of clustering that begins with a bad initialization.
Many works attempt to resolve the sensitivity of k-means to initialization. In this article, we propose k*-means, an optimized clustering method built on three optimization principles.
While this work is based on a conference article,20 the scope of the proposed work has been significantly extended.
In this article, we now integrate these three optimized principles into a framework (section “Framework”). The framework helps us to better understand the proposed method.
We now add and elaborate the proofs of the proposed principles and strategies, including those for “top-n nearest clusters merging.”
We give the pseudo-code of the “top-n nearest clusters merging” method (algorithm 1).
We extend the experimental evaluation, focusing on the effectiveness (section “Effectiveness evaluation”) and efficiency (section “Efficiency evaluation”) of the proposed algorithm.
We now evaluate and discuss the contribution of the three optimization principles to efficiency (section “Impact of optimization principles”). Experiments show that the three proposed optimizations reduce CPU time significantly.
We also discuss the selection of the parameter values (section “Determining of parameters”).
Related work
Initialization methods for k-means
Forgey21 first partitions the data into k groups at random and takes the centroids of these groups as the initial centers. As mentioned above, such random initialization is sensitive to the initial partition.
Su and Dy23 propose two methods, PCA-Part and Var-Part, to solve the initialization problem. PCA-Part first regards all the data as one cluster, and then cuts it into two partitions by the hyperplane that passes through the cluster centroid and is orthogonal to the principal eigenvector of the cluster covariance matrix. After that, the method repeatedly selects the cluster with the greatest SSE and splits it in the same way, until k clusters are obtained.
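The PCA-Part splitting step described above can be sketched as follows. This is a minimal illustration, and the function name `pca_split` is ours rather than from the cited work.

```python
import numpy as np

def pca_split(points):
    """Split one cluster into two halves by the hyperplane that passes
    through the centroid and is orthogonal to the principal eigenvector
    of the cluster covariance matrix (the PCA-Part step)."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    principal = eigvecs[:, -1]               # direction of largest variance
    side = centered @ principal              # signed offset along that axis
    return points[side <= 0], points[side > 0]
```

On two well-separated blobs, the hyperplane falls between them, so each half of the split recovers one blob.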
A recent study named MinMax k-means alleviates the initialization problem by minimizing the maximum intra-cluster variance instead of their sum.
Another category of methods tries to eliminate the dependence on initialization altogether, such as global k-means.
Optimized variants of k-means on efficiency
Kanungo et al.28 implement an efficient filtering algorithm that stores the points in a kd-tree. The algorithm prunes candidate centers while traversing the tree, which avoids many distance computations.
In this section, we present our proposed k*-means algorithm and its three optimization principles.
Framework
Because
Supposing cluster
However, some additional steps need to be implemented to decrease the computational cost.
Top-n nearest clusters merging
As shown in algorithm 1, we first obtain the initial clusters and then merge the top-n nearest clusters.
Definition 1 (edge)
Given two clusters
By Definition 1, we only need to find the top-n nearest clusters according to their edges.
However, top-
Proof
Suppose the top-
Consider the cases of reducing the maximum number of clusters: the first case is that the
Although top-

An example of cluster merging: (a) before merging and (b) after merging.
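One merging round can be sketched roughly as follows. This is an illustrative sketch only: it assumes an edge is measured by the distance between cluster centroids (our reading of Definition 1, which is abbreviated above), and the function name is ours.

```python
import numpy as np

def merge_top_n(clusters, n):
    """One round of top-n nearest clusters merging (a sketch).
    The n shortest edges whose endpoints are still unmerged are
    merged greedily; untouched clusters are kept as-is."""
    centroids = [np.mean(c, axis=0) for c in clusters]
    edges = []
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            edges.append((np.linalg.norm(centroids[i] - centroids[j]), i, j))
    edges.sort()                         # shortest edges first
    used, merged = set(), []
    for _, i, j in edges:
        if n == 0:
            break
        if i in used or j in used:
            continue                     # endpoint already consumed by a merge
        merged.append(np.vstack([clusters[i], clusters[j]]))
        used.update((i, j))
        n -= 1
    merged.extend(c for k, c in enumerate(clusters) if k not in used)
    return merged
```

For example, with four clusters forming two close pairs, one round with n = 2 merges each pair and returns two clusters.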
Lemma 1
Given clusters
where
Proof
Without loss of generality, let
First, since
Then we get equations (5) and (6)
Note that
Second,
And then, we get equation (8) by equation (7)
By equations (5) and (6), we also get
Hence, we deduce equation (10) by equations (8) and (9)
Because all points in
By equations (5) and (6) we have
Substituting equation (12) into equation (11), we obtain equation (13)
Therefore, we have equation (4) by equations (10) and (13), and we are done.
Lemma 1 implies that the feature values of a merged cluster can be computed directly from those of its sub-clusters. Therefore, our top-n nearest clusters merging can be performed efficiently, without rescanning the points.
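Since the exact equations of Lemma 1 are not reproduced above, the sketch below uses the standard incremental formulas for merging cluster summaries (point count, linear sum, and SSE), which capture the same idea: the merged cluster's features follow directly from the sub-clusters' features.

```python
import numpy as np

def features(points):
    """Per-cluster features: point count, linear sum, and SSE
    (sum of squared distances of the points to the centroid)."""
    n = len(points)
    s = points.sum(axis=0)
    c = s / n
    sse = float(((points - c) ** 2).sum())
    return n, s, sse

def merge_features(f1, f2):
    """Merged cluster's features computed without touching the points.
    These are the standard incremental formulas (an assumption on our
    part): the merged SSE gains a cross term proportional to the
    squared gap between the two centroids."""
    n1, s1, e1 = f1
    n2, s2, e2 = f2
    c1, c2 = s1 / n1, s2 / n2
    n = n1 + n2
    s = s1 + s2
    e = e1 + e2 + (n1 * n2 / n) * float(((c1 - c2) ** 2).sum())
    return n, s, e
```

Merging features this way agrees with recomputing the features of the union of the two point sets.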
Optimized update principle
From extensive experiments on UCI data (see details in section “Experimental setup and methodologies”), we observe that the number of moved points declines sharply during the iterations.

The number of moved points in each iteration.
To understand our optimized update principle, we first state Lemma 2 and Lemma 3.
Lemma 2
Given a cluster
Proof
Lemma 2 can be easily proved by Lemma 1. Consider
and
Lemma 3
Given a cluster
Proof
First, we show proof of equation (17).
Since
By equation (5), we can easily get
Next, we give proof of equation (18). Let
Because
Therefore, we have
According to equation (5), then we get
We further deduce that
by equation (17).
Because
Lemma 2 and Lemma 3 depict the updating method of the cluster feature values.
Armed with Lemma 2 and Lemma 3, we propose the optimized update method to reduce computation costs, as shown in algorithm 3. In each iteration, we maintain the copies of feature values incrementally according to the moved points. The optimized update method obviously saves much computation when the number of moved points is low. However, in the beginning iterations, the number of moved points is still large.
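As an illustration of the incremental-maintenance idea (not the paper's exact algorithm 3), the sketch below updates cluster summaries when a single point moves between clusters, so that centroids never need to be recomputed from all points; the dict-based layout is our own.

```python
import numpy as np

def move_point(x, src, dst):
    """Incrementally update two cluster summaries when point x moves
    from src to dst. Each summary holds the point count "n" and the
    linear sum "sum"; the centroid is derived on demand."""
    src["n"] -= 1
    src["sum"] = src["sum"] - x
    dst["n"] += 1
    dst["sum"] = dst["sum"] + x

def centroid(summary):
    """Centroid derived from the maintained count and linear sum."""
    return summary["sum"] / summary["n"]
```

When only a few points move per iteration, this costs O(moved points) instead of a full pass over the data.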
Cluster pruning strategy
To reduce the computational costs, we now introduce an optimization principle that minimizes the number of distance comparisons, termed the cluster pruning strategy.
In
Next, we introduce our pruning optimization method, applied to each point at the cluster level, based on Definition 2 and Lemma 4.
Definition 2 (radius)
Given a cluster, its radius is defined as the distance from the cluster centroid to the farthest point in the cluster.
Lemma 4
Given two clusters
Proof
Suppose there are two clusters,

Pruning distance between two clusters.
Now, we consider the farthest point in
If
Since
Therefore,
Lemma 4 guides our pruning rule for the assignment step.
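Because the statement of Lemma 4 is abbreviated above, the sketch below uses a closely related, standard triangle-inequality rule as a stand-in: a center c can be skipped for point x whenever d(c_best, c) >= 2·d(x, c_best), since the triangle inequality then guarantees d(x, c) >= d(x, c_best).

```python
import numpy as np

def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping centers ruled
    out by the triangle inequality. This illustrates cluster-level
    pruning; it is not the paper's exact Lemma 4."""
    # Precompute all center-to-center distances once.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = []
    for x in points:
        best = 0
        d_best = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if cc[best, j] >= 2 * d_best:
                continue  # pruned: center j cannot beat the current best
            d = np.linalg.norm(x - centers[j])
            if d < d_best:
                best, d_best = j, d
        labels.append(best)
    return np.array(labels)
```

The pruning never changes the result: the assignment matches a brute-force nearest-center search while computing fewer distances.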
It is essential to show the usage of
k*-means algorithm
Next, we present our extended k*-means algorithm.
In
Experiments
Experimental setup and methodologies
All algorithms are implemented in Java, and all experiments are performed on a platform equipped with a 2.4 GHz Intel Core i3 CPU, 4 GB of memory, and the Windows 7 operating system.
Datasets
We use 11 datasets, including 4 synthetic datasets and 7 real-world datasets from the UCI Machine Learning Repository,31 to evaluate the effectiveness and efficiency of our algorithm.
Descriptions of synthetic and real-world datasets.
As shown in Figure 5(a), D1 includes 30 clusters composed of 3851 points in total. Each cluster contains fewer than 200 Gaussian-distributed points, and the cluster sizes are balanced. We denote this small dataset as Smalldata.

Demonstrations of four synthetic datasets: (a) D1, (b) D2, (c) D3, and (d) D4.
D2 (Figure 5(b)) includes five clusters with clear gaps; the number of points reaches 18,000 but is unbalanced across clusters. We name this unbalanced dataset Unbaldata.
D3 (Figure 5(c)) contains 41,065 points in 12 clusters. The clusters are also unbalanced but have clear shapes. This dataset is denoted as LargeUnbal.
The fourth dataset, shown in Figure 5(d), comprises 120,000 data points in 50 clusters with arbitrary shapes. Moreover, some clusters have no obvious gaps between them, and the cluster sizes are also unbalanced. We denote this big dataset with arbitrary-shape clusters as Bigdata.
Ecoli31 is derived from amino acid sequences and includes 8 classes comprising 336 proteins. The data has eight attributes (including one label), and the class sizes differ, so it is an unbalanced dataset.
Pendigits31 is collected from 44 writers and contains 10,992 instances represented as fixed-length feature vectors in 16-dimensional space.
Optdigits31 consists of optical recognition of handwritten digits, obtained by extracting normalized bitmaps of handwritten digits from preprinted forms. Each digit is represented as an 8 × 8 matrix, where every element is an integer. The dataset includes 5620 instances in 64-dimensional space.
Dermatology31 contains 366 patients with 34 attributes, partitioned into 6 types of erythemato-squamous disease. The dataset is also unbalanced.
Folio1 & Folio231 are leaf images taken from different plants on the farm of the University of Mauritius and nearby locations. We first extract SIFT descriptors from these images, and then use a bag of 500 visual words to represent the data. Two subsets of the data were generated: Folio1 (300 leaves from 15 different types of plants, balanced) and Folio2 (300 leaves from 20 different types of plants, unbalanced).
CNAE-931 contains 1080 documents of free-text business descriptions of Brazilian companies. These documents are categorized into a subset of nine categories. Each document is represented as a vector in which the weight of each word is its frequency in the document.
Experimental methodologies
For effectiveness evaluation of our algorithm, we use three metrics: normalized mutual information (NMI), Silhouette coefficient (SC), and the sum of squared errors (SSE).
Therefore, the NMI score ranges from 0 to 1.0, and a higher value indicates better clustering quality.
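Equation (20) is not reproduced above, so the sketch below computes NMI under one common convention, normalizing the mutual information by the geometric mean of the two entropies; other normalizations exist.

```python
import numpy as np
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings, using the
    geometric-mean normalization (an assumption about equation (20)).
    Scores lie in [0, 1]; 1 means the partitions agree exactly."""
    n = len(labels_a)
    pa = Counter(labels_a)                  # cluster sizes of labeling a
    pb = Counter(labels_b)                  # cluster sizes of labeling b
    pab = Counter(zip(labels_a, labels_b))  # joint counts
    mi = sum((c / n) * np.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), c in pab.items())
    ha = -sum((c / n) * np.log(c / n) for c in pa.values())
    hb = -sum((c / n) * np.log(c / n) for c in pb.values())
    return mi / np.sqrt(ha * hb)
```

Identical partitions (even with permuted label names) score 1.0, and independent partitions score 0.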
As shown in equation (20), computing the NMI score requires labeled class information. However, annotated data are difficult to obtain in many real-world applications. Therefore, we also adopt another widely used metric, SC, which describes both the similarity among points within a cluster and the differences among clusters, without requiring cluster labels. Let
The average SC over all points measures the overall clustering quality.
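For reference, SC can be computed as in the following sketch: for each point, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to the points of any other cluster, and s = (b - a)/max(a, b).

```python
import numpy as np

def silhouette(points, labels):
    """Mean Silhouette coefficient over all points; a plain O(n^2)
    sketch suitable for small data."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        same[i] = False                 # exclude the point itself
        if not same.any():              # singleton cluster: s is taken as 0
            scores.append(0.0)
            continue
        a = dists[i][same].mean()
        b = min(dists[i][labels == lj].mean()
                for lj in set(labels.tolist()) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

On two tight, well-separated clusters the mean SC approaches 1.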
For efficiency evaluation, we measure the common metric of CPU time. All experiments are conducted 10 times on each dataset, and we report the average CPU time. Moreover, we discuss the selection of the parameters.
Alternative algorithm
We compare our algorithm against
Effectiveness evaluation
First, we evaluate the clustering quality of the algorithms.
Clustering quality on NMI.
NMI: normalized mutual information.
Clustering quality on SC.
SC: Silhouette coefficient.
Clustering quality on SSE.
SSE: sum of squared errors.
NMI
The results of NMI on the 11 datasets are shown in Table 2; our algorithm achieves competitive scores.
For
SC
As shown in Table 3, again,
SSE
Table 4 shows the SSE results.
In summary, the above comprehensive experiments confirm the effectiveness of our proposed algorithm.
Efficiency evaluation
Comparison of efficiency with other algorithms
Next, we evaluate the efficiency of our algorithm.

CPU time of algorithms on 10 datasets: (a) Group 1, (b) Group 2, (c) Group 3, and (d) Group 4.
Impact of optimization principles
We next analyze the contribution of the three optimization principles (“top-n nearest clusters merging,” “optimized update principle,” and “cluster pruning strategy”) to efficiency.

CPU costs in different parts of the algorithm.
Determining of parameters
We conduct experiments to explore the selection of the parameters.

Impact of
Figure 9 shows the impact of

Impact of
Finally, we evaluate the effect of parameter

Impact of

Impact of
Conclusion
In this work, we propose a novel optimized hierarchical clustering method incorporating three optimization principles, namely “top-n nearest clusters merging,” the “optimized update principle,” and the “cluster pruning strategy.”
An interesting direction for future work is to leverage modern distributed multi-core clusters of machines to further improve the scalability of our algorithm.
