Text clustering is an important yet relatively overlooked field of text mining. Many applications require automatic text organisation, in which an information retrieval system relies on organised search results. Spherical k-means is a successful adaptation of the classic k-means algorithm to text clustering. However, conventional methods for accelerating k-means may not apply to spherical k-means because of the different nature of text document data. The proposed work introduces an iterative feature filtering technique that reduces the data size during clustering, producing more feature-relevant clusters in less time than classic spherical k-means. The novelty of the proposed method is that feature assessment is distinct from the objective function of clustering and is derived from the cluster structure. Experimental results on popular text corpora show that the proposed scheme improves computation speed without sacrificing cluster quality and compares favourably with recent works in this domain.
Electronic documents save time and facilitate collaborative work across the globe, beyond geographical or departmental boundaries. The growing number of electronic documents poses an increasing challenge for information systems to effectively organise, manage, and retrieve the information contained in extensive collections of texts according to users' information needs. Repositories such as digital libraries store documents so that they can later be searched and retrieved. Business enterprises view social media documents as a potential source for analysing trends and gaining commercially useful information. Document or text clustering algorithms provide an effective mechanism for such applications by organising large collections into a small number of meaningful clusters, which improves retrieval performance in both speed and relevance. A few applications of document clustering and their intended purposes are summarised in Table 1. While document classification has received much attention, document clustering is becoming increasingly important as a precursor to classification, topic detection, and the grouping of huge collections of documents. Document clustering applies natural language processing to document collections in the same manner as any other data: the documents are represented numerically. A bag-of-words [1] is a common approach in which every document is viewed as a set of words, independent of their positions in the document and of each other. Thus, a document clustering algorithm deals with a set of documents called a corpus, where each document is in turn a set of words or terms.
Table 1. Types of text documents for clustering with their benefits and applications.

Sl. no. | Types of documents | Purpose | Advantage | Application
1 | News articles and blogs | Detection of duplicate content, identification of plagiarism, and detection of related news/articles | Better presentation of content | Media monitoring and analysis
2 | Customer reviews | Recommendation of content and products | Improved relevance of recommendations | Recommendation systems
3 | Microblogs and tweets | Detection of not predefined and unforeseen subjects in surveys and claims | Proactive management and effective response to users | Feedback analysis and opinion mining
4 | Search results | Grouping of results and suggestion of related information | Fast and efficient search and improved user experience | Information retrieval systems
5 | Research articles and digital books/records | Structuring according to implicit subjects and topic detection | Aids navigation and faster search | Digital libraries
A vector-space model (VSM) [2] represents documents as non-negative vectors in the term space formed by the union of the words of all documents, with numeric values that generally correspond to word frequencies. A toy example of bag-of-words gives the reader an idea of the size of such a representation for real-life documents. Figure 1 shows a few short documents related to simple reading exercises for children, and Table 2 is a simple document-term matrix containing the frequency of terms occurring in the documents. The corpus is tokenized to obtain the main terms; stop words and parts of speech like articles, pronouns, and prepositions are removed, and stemming is done to retain only each word's root. The challenge of document clustering lies in the variety of documents in any corpus: documents vary in length, in the number of unique words, and in the frequency of these words. Different approaches exist to weight the term frequency statistics contained in the document-by-term matrix. This weighting aims to capture the relative importance of different terms, thereby improving performance in common tasks such as classification, clustering, and ad hoc retrieval. Two popular approaches are term-frequency inverse document frequency (TF-IDF) [3] and BM25 [4]. Every such representation gives a length-biased, sparse, high-dimensional space. While normalisation to unit length can handle the biasing effect of length, the other aspects make this vector representation of a corpus high dimensional and very sparse. There is no general solution for the 'curse of dimensionality' and the sparsity of any vector feature space of documents; these problems have to be handled within a clustering algorithm.
Table 2. Document-term matrix containing the frequency of terms in the toy example of Figure 1.

      | black | cat | jump | rat | hat | sat | mat | ran | see
Doc 1 |   2   |  1  |  1   |  1  |  0  |  0  |  0  |  0  |  0
Doc 2 |   0   |  1  |  0   |  0  |  1  |  1  |  1  |  0  |  0
Doc 3 |   0   |  0  |  0   |  2  |  1  |  0  |  1  |  2  |  0
Doc 4 |   0   |  1  |  0   |  1  |  0  |  1  |  1  |  0  |  0
Doc 5 |   1   |  1  |  0   |  1  |  1  |  0  |  1  |  0  |  1
Figure 1. Toy example of the corpus with five small documents.
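As an illustration of the representation, the following is a minimal Python sketch that builds such a document-term matrix. The two documents and the stop-word list below are stand-ins (the actual documents of Figure 1 are not reproduced here), and the crude whitespace tokenizer omits stemming.

```python
from collections import Counter

def build_doc_term_matrix(corpus, stop_words):
    """Tokenize each document, drop stop words, and count term frequencies."""
    tokenized = [[w for w in doc.lower().split() if w not in stop_words]
                 for doc in corpus]
    # vocabulary: union of the words of all documents, in sorted order
    vocab = sorted({w for doc in tokenized for w in doc})
    matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
    return vocab, matrix

corpus = ["the black cat sat", "a cat in a hat"]   # stand-in documents
stop_words = {"the", "a", "in"}
vocab, dtm = build_doc_term_matrix(corpus, stop_words)
```

Each row of `dtm` is the raw term-frequency vector of one document; TF-IDF or BM25 weighting would be applied on top of these counts.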
Hornik et al. [5] have categorised clustering approaches for text data into three categories. First, a suitable dissimilarity/similarity measure between the texts is chosen, and a partition-style clustering is performed based on it; cosine similarity is a popular similarity measure. String kernels [6] have become very popular in this context, as the corresponding dissimilarities can be computed relatively efficiently and deployed in modern kernel-based learning techniques such as kernel k-means clustering [7]. Second, a generative approach is adopted using probabilistic models for texts, such as topic models, for uncovering the latent semantic structure based on a hierarchical Bayesian analysis of the texts [8,9]. Finally, one can use a suitable representation of the corpus (the collection of text documents) to convert it into a numeric form [10] and apply traditional approaches for clustering multivariate numeric data. This article proposes a hybrid manner in which feature selection can reduce dimensionality and increase the performance of a clustering algorithm.
The article is organised into five sections. Section 1 introduces the problem; the terminology and notation, a brief review of literature relevant to our work, and material on the initialization and feature selection of spherical k-means (SKM) are discussed in Section 2. The proposed work is presented in Section 3, and its implementation and results are discussed in Section 4. The conclusion of the proposed work is presented in Section 5.
2. Related work
In this section, we explore k-means and its variants. The k-means algorithm is a popular method for clustering a set of data vectors [11]. The classical version of k-means uses squared Euclidean distance to measure the dissimilarity of two objects. However, this distance measure is often inappropriate for document clustering, as observed in Lodhi et al.'s study [6]. A variant of k-means used for document clustering, known as SKM, was proposed by Dhillon et al. [12]. SKM retains the benefits of k-means (scalability, flexibility, and simplicity) while making it fast for high-dimensional text data by using cosine similarity for comparison. However fast clustering algorithms may be designed, the high dimensionality of text data remains a major hurdle in speeding up the process. To date, feature selection with unsupervised learning in the literature has been performed using either a filter approach [13], which removes features of lesser significance before the clustering process, or a wrapper approach [14], which determines feature ranks through the clustering process.
Throughout this article, we consider the term-frequency VSM of document representation. In this model, a document is a vector in the term space of the document-term matrix. After pre-processing, we obtain, say, $d$ unique words from the corpus of $n$ documents, and each document is represented as a $d$-dimensional vector $x_i = (x_{i1}, x_{i2}, \dots, x_{id})$, where $x_{ij} \geq 0$. Thus, the corpus is a document-term matrix of dimensions $n \times d$. Each vector is normalised to unit length, that is $\lVert x_i \rVert = 1$, and the values are computed using equation (1)

$$x_{ij} = \frac{f_{ij}}{\sqrt{\sum_{l=1}^{d} f_{il}^{2}}}, \tag{1}$$

with $f_{ij}$ denoting the number of times word $j$ appears in document $i$. This has the effect that documents are now points on the unit sphere in the non-negative orthant of the term space $\mathbb{R}^{d}$, and the inner product of any two vectors is a natural measure of similarity between them. Hence, cosine similarity is used here, computed as equation (2)

$$\cos(x_i, x_j) = \frac{x_i^{\top} x_j}{\lVert x_i \rVert \, \lVert x_j \rVert} = x_i^{\top} x_j. \tag{2}$$

The simplification is due to the unit length of each document, and the sparsity of the vectors makes this inner product computation very fast. Clustering is viewed as a partitioning of the set of documents into $k$ disjoint clusters $\pi_1, \dots, \pi_k$ such that $\bigcup_{j=1}^{k} \pi_j = \{x_1, \dots, x_n\}$ and $\pi_i \cap \pi_j = \emptyset$ if $i \neq j$. Each cluster can be represented through its concept vector $c_j$, obtained as a unit vector in the direction of the centroid vector of the cluster. The mean vector or centroid of documents in any cluster is computed as the mean of all the documents contained in the cluster using equation (3)

$$m_j = \frac{1}{n_j} \sum_{x \in \pi_j} x, \qquad c_j = \frac{m_j}{\lVert m_j \rVert}, \tag{3}$$

where $n_j$ is the number of documents in cluster $\pi_j$. A point to be noted carefully here is that the mean vector $m_j$ need not be of unit length. The unit vector $c_j$ in the centroid direction is called the concept vector and is the closest in cosine similarity to all document vectors in cluster $\pi_j$. Thus, the objective function of the optimization problem of clustering can be stated as: maximise $\sum_{j=1}^{k} \sum_{x \in \pi_j} x^{\top} c_j$ such that $\bigcup_{j=1}^{k} \pi_j = \{x_1, \dots, x_n\}$ and $\pi_i \cap \pi_j = \emptyset$ if $i \neq j$.
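The normalisation, similarity, and concept-vector computations can be sketched in a few lines of Python (dense lists stand in for the sparse document vectors of a real corpus):

```python
import math

def normalize(v):
    """Scale a term-frequency vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """For unit vectors, cosine similarity reduces to the inner product."""
    return sum(a * b for a, b in zip(u, v))

def concept_vector(cluster):
    """Unit vector in the direction of the cluster centroid."""
    d = len(cluster[0])
    mean = [sum(doc[j] for doc in cluster) / len(cluster) for j in range(d)]
    return normalize(mean)

# two unit-length document vectors from raw term frequencies
x = normalize([2, 1, 1, 1, 0])
y = normalize([0, 1, 0, 0, 1])
```

Note that the mean of unit vectors is generally not of unit length, which is why `concept_vector` re-normalizes it.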
Besides these, we need two more metrics: the first is an intrinsic measure called the density of a cluster, and the second is an extrinsic measure called adherence among clusters. For a better understanding, we require the definitions from Baraldi et al. [15]. The principal dimension of a corpus corresponds to the term having the highest normalised frequency in the corpus; let $p$ denote it. The density of a cluster is the ratio of its population to the length of the principal dimension spanned by the cluster, computed as equation (4)

$$\delta(\pi_j) = \frac{n_j}{\max_{x \in \pi_j} x_{p} - \min_{x \in \pi_j} x_{p}}. \tag{4}$$
Thus, density is also a measure of the compactness of a cluster. If a cluster contains documents belonging to a variety of subtopics, the range spanned by the cluster is large and the density low, indicating the diversity of the cluster. If a cluster contains documents related to a single topic or a limited set of topics, the span is smaller and the density higher.
2.1. Spherical k-means and variants
The k-means is a popular clustering algorithm that has been adapted to a variety of data and applications. The basic working of k-means is to collect objects that are more similar to each other into a cluster. It begins with k initial centroids called seeds and then iteratively proceeds through two major steps. The first step is to assign each data object to a cluster by measuring its distance from all centroids and giving it the label of the nearest centroid. The second step is to update the centroids of all clusters by taking the mean of all points assigned to each cluster. These two steps are repeated until a stopping criterion is met. Any variant of k-means has to define the following in order to have a working version of the algorithm:
Initialization method that defines how the seeds will be chosen or computed. The seeds can either be picked from among the objects or computed as virtual objects from the object space.
Dissimilarity/similarity measure computation to calculate the ‘nearness’ of objects from centroids.
A stopping criterion to decide when to stop the iterations. This can be a threshold on centroid movement, below which the centroids are considered unchanged, or a maximum number of allowed iterations.
Besides this, there is other scope for changes in the k-means algorithm; for example, centroids can be updated as soon as a point is assigned to a cluster, called online update, as opposed to the batch update in classic k-means, where centroids are updated only after all points have been assigned to their respective clusters.
Dhillon and Modha [12] proposed spherical k-means (SKM), using a VSM whose numerical values were derived from word frequencies and inverse document frequency. The normalised VSM of the corpus makes the vectors fall on the surface of a unit hypersphere in the object space, hence the name spherical k-means. Cosine similarity is used to compute similarities among documents. The algorithm is reproduced from Dhillon and Modha [12] in Figure 2. Geometrically, the SKM algorithm partitions the high-dimensional space into Voronoi or Dirichlet regions separated by hyperplanes passing through the origin. Each cluster is represented by a centroid that provides a compact summary of the cluster. The similarity of a document with the centroid (representative or concept vector) decides whether the document becomes a member of the cluster. The concept vectors are iteratively updated as the mean direction of all cluster members, and the membership decisions are also revised at every iteration until stability is attained. SKM inherits the obvious drawbacks of k-means: the tendency to get trapped in local optima, dependency on centroid initialization, and the possibility of empty clusters.
Figure 2. Listing of the spherical k-means algorithm, reproduced from Hornik et al.'s study [5].
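A minimal Python sketch of the SKM loop is given below. Random seeding, dense toy vectors, and the stopping test are illustrative simplifications; a real implementation would use sparse vectors and a proper initialization method.

```python
import math
import random

def spherical_kmeans(docs, k, max_iter=50, seed=0):
    """Minimal SKM sketch: `docs` are unit-length dense vectors.
    Assignment maximizes the inner product (cosine similarity for unit
    vectors); concept vectors are re-normalized cluster means."""
    rng = random.Random(seed)
    concepts = [list(c) for c in rng.sample(docs, k)]   # random seeding
    labels = [-1] * len(docs)
    for _ in range(max_iter):
        # assignment step: nearest concept vector by inner product
        new_labels = [
            max(range(k),
                key=lambda j: sum(a * b for a, b in zip(doc, concepts[j])))
            for doc in docs
        ]
        if new_labels == labels:          # stability reached
            break
        labels = new_labels
        # update step: mean direction of each cluster's members
        for j in range(k):
            members = [doc for doc, lab in zip(docs, labels) if lab == j]
            if members:                   # empty clusters keep old concept
                mean = [sum(col) / len(members) for col in zip(*members)]
                norm = math.sqrt(sum(x * x for x in mean)) or 1.0
                concepts[j] = [x / norm for x in mean]
    return labels, concepts
```

On well-separated data the loop typically stabilises in a handful of iterations, but, as discussed above, the result depends on the seeding.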
Dhillon et al. [16] suggested how to make SKM computationally efficient. The primary conclusion of their work is that even for very large collections, clustering can be achieved in time linear in the size of the collection. They suggested a multi-threaded technique to accelerate the entire process of fetching documents from disc, creating the vector representation, and then clustering. A different representation of the text documents is also suggested: every document is a $d$-dimensional vector $x_i$ whose $j$th entry, for $1 \leq j \leq d$, is the product of three components, $x_{ij} = t_{ij}\, g_j\, s_i$, where $t_{ij}$ is a term weighting component that depends only on the frequency of term $j$ in the document, $g_j$ is a global weighting component that depends only on the number of documents containing term $j$, and $s_i$ is the normalisation component for the $i$th document. This representation tries to capture the importance of a term both in the document and in the collection, thus enhancing its discrimination effect.
SKM suffers from a major drawback of producing empty clusters, or clusters with very few members, when it gets stuck in a local optimum. Dhillon et al. [16] proposed a cluster refinement algorithm. They design a local search procedure called 'first variation' to refine clusters produced by SKM. It runs as an added step after cluster assignment and perturbs clusters that appear 'too stable' over a few iterations. It reshapes the clusters by moving one point from a cluster to another in each step, thereby increasing the value of the objective function; a sequence of first variation steps moves many points. However, this comes at increased computational complexity and cannot theoretically guarantee that all local optima can be escaped. Empty clusters and computational effort are jointly addressed in the scheme by Kim et al. [17]. They proposed an initialization process that gives dispersed initial points and is 1000 times faster than previous works. Viewing the updating of concept vectors as gradient ascent, they proposed to update them along the gradient direction. The algorithm is named improved spherical k-means (ISKM). The update equation is of the form $c^{(t+1)} = c^{(t)} + \eta\, \nabla Q(c^{(t)})$, where $\eta$ is the learning rate. They employed two different learning-rate schedules: the first is static, and the second decreases gradually, similar to an annealing schedule. To safeguard against producing empty clusters, a list of the farthest points is maintained, and these are assigned to the empty clusters. Another practical suggestion is to defer normalising the centroids after every update until the magnitude of the vector becomes too large; this avoids a major computation bottleneck of ISKM. Banerjee and Ghosh [18] proposed reducing the number of empty clusters in spherical k-means, specifically using the technique of frequency-sensitive competitive learning.
The similarities to the concept vectors are computed with weights directly proportional to the number of vectors assigned to each cluster. Thus, a cluster with many members appears to be at a greater distance than it actually is; in other words, clusters with many members appear to shrink when similarity is computed from a query vector. This proposal aims to produce a balanced cluster structure, which is helpful for organisational applications that require the documents to be arranged in almost equal-sized clusters.
Thus, it is observed that SKM has two significant aspects for improvement: runtime, and the handling of empty or very small clusters.
2.2. Initialization in Spherical k-means
Any k-means-based algorithm suffers from the effects of bad seeding. The initial concept vectors should ideally be well distributed over the object space so as to capture the cluster structure without bias. Nevertheless, this is contradictory, since clustering itself is meant to discover the distribution of objects in the space: the structure is an output and cannot be given as an input. This has led to much research dedicated to improving the initialization of k-means. K-means++ [19] is considered a very effective initialization technique. It successively selects each centroid as a point farthest from the already selected centroids. However, it has a very high computational complexity. The initialization intuition of Kim et al. [17] can be summarised as:
1. Initialise the set of centroids C with a point randomly selected from the document set R.
2. Select a point x randomly from R as the next centroid.
3. Remove from R every point y for which x and y are too similar (their similarity exceeds a threshold).
4. Repeat Steps 2 and 3 until k centroid points are selected or R becomes empty.
5. If the number of selected centroids is less than k, choose the remaining points from the data randomly.
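The steps above can be sketched as follows. The function name, the threshold value, and the fallback behaviour are illustrative assumptions, not taken from Kim et al. [17]; documents are assumed to be unit-length vectors so that the inner product is the cosine similarity.

```python
import random

def dispersed_seeds(docs, k, threshold=0.7, seed=0):
    """Pick k dispersed seeds: after each random pick, drop the remaining
    documents whose cosine similarity to the pick exceeds `threshold`."""
    rng = random.Random(seed)
    remaining = list(docs)
    seeds = []
    while remaining and len(seeds) < k:
        x = remaining.pop(rng.randrange(len(remaining)))
        seeds.append(x)
        # keep only documents sufficiently dissimilar to the new seed
        remaining = [y for y in remaining
                     if sum(a * b for a, b in zip(x, y)) <= threshold]
    while len(seeds) < k:      # fewer than k survived: fall back to random picks
        seeds.append(rng.choice(docs))
    return seeds
```

By construction, every seed chosen from the shrinking pool has similarity at most `threshold` to all earlier seeds, which keeps the seeds dispersed.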
There have been few efforts to develop an initialization method dedicated to SKM rather than adopting methods designed for k-means. The reader is referred to the discussion in previous studies [5,12], which concludes that selecting a concept vector as the mean of the entire corpus and perturbing it in k different directions gives effective initialization for clustering. Duwairi and Abu-Rahmeh [20] present a deterministic initialization technique that places the concept vectors uniformly in all directions within the unit hypersphere. Li et al. [21] observed that this would be effective only if the documents themselves were uniformly distributed over the positive orthant of the unit hypersphere, and hence proposed a median-based deterministic initialization. It simply computes median values in each dimension to obtain vectors used as seeds. In our previous work [22], two probabilistic initialization methods were suggested, named single-pass seed selector (SPSS) and memory-efficient seed selector (MESS). SPSS is a fast method that first assigns weights to all points based on their distance from a chosen pivot; points are then picked as a random sample with the weights as probabilities, and two thresholds are used, one to decide whether a point is chosen and another to decide whether to update the pivot. MESS performs exactly k iterations, picking one seed at a time. At every round, the search space is reduced by removing objects that are already very near the previously picked seeds. This ensures that every successive seed concept vector is not too close to the previously selected seeds, thus maintaining the underlying idea of k-means++ while reducing space complexity. Recently, Hassani et al. [23] proposed a deterministic initialization for spherical k-means; additionally, they introduced a new feature agglomeration method based on matrix factorization, which separates the terms into groups and creates a new feature vector best suited for clustering. A novel seeding algorithm with penalties was proposed for spherical k-means clustering by Ji et al. [24].
2.3. Feature selection in text clustering
Li et al. [21] have used a statistic-based feature selection measure that quantifies the relevance between a feature and a class as a numeric value. Since it cannot be directly applied to unsupervised learning, it is embedded into an expectation–maximisation framework for text clustering. Instead of removing any feature, the weight of a feature is reduced if it is found irrelevant and can later be increased if its relevance increases. Thus, feature ranking in this method improves cluster quality, but there is no reduction in computation cost. Bora et al. [25] proposed Heuristic Frequent Term-Based Clustering (HFTC), which saves computational effort through feature selection by reducing the terms and documents during the iterations of clustering. A frequent itemset with some predefined minimum support is searched for among all remaining documents to be clustered and tested for overlap with the remaining terms. The set having minimum overlap with the remaining terms is picked as the best set, and the documents that support this best set are clustered together. This cluster is removed, and further clustering is performed only on the remaining documents. Thus, the features holding importance for a few documents are identified and then removed along with those documents. Under the changed situation, a new set of terms may become the best set. The method can be converted to hierarchical clustering by further clustering the primary cluster with reduced dimensionality and the corresponding best set. This method saves effort and strengthens the idea that clusters in text data pertain to topics described by a few keywords. Roffo et al. [26] have proposed a feature-ranking system that associates a measure of importance with a feature through the importance of its neighbours in a feature affinity graph. The affinity of any two terms or features is measured using popular metrics such as the Fisher criterion for relevance and mutual information for redundancy.
When the graph is represented as an adjacency matrix with edge weights, its eigenvectors directly indicate the more important nodes (terms). Thus, it is a method that combines relevance and redundancy measures with graph theory. The computation is time-consuming, and the space complexity of the graph must also be taken into account.
Hu et al. [27] suggest user intervention during clustering to pick the features most preferred by the user. This can be very effective if the user has very good knowledge of the corpus at hand, which is not possible in many real-life situations. Zhao et al. [28] have proposed an entropy-based feature selection method for data clustering, but it has not been applied to text clustering; entropy calculations can be costly for a high-dimensional space like that of text documents. Zhou et al. [29] have suggested selecting features such that a balanced clustering structure is obtained, instead of focusing only on ranking-based feature selection. Balanced structures are required by text organising applications where, besides the similarity constraint, there is a constraint on cluster size. Very recently, Chaudhuri et al. [30] proposed a rank-based feature selection that can be used with heterogeneous attributes and high-dimensional data; it has not been applied to document clustering yet and is, again, a filter-style approach. Liu et al. [31] handle empty clusters by replacing the centroid of an empty cluster with the nearest document vector in k-means-based clustering of Chinese documents, using a Euclidean distance structure for feature selection.
3. Proposed framework
3.1. Feature filtering
Our proposed work removes those words that occur in almost all documents and thus carry no discriminatory information from a clustering point of view. Applying this to the entire corpus is computationally intensive, given the huge size of the document-term matrix. We utilise the observation that centroids are representatives of the members of their respective clusters: if a term has a consistent presence in the majority of documents, it will most probably have such a presence among the centroids. Thus, analysing the feature distribution of the centroid matrix is equivalent to analysing the entire corpus. A term with approximately the same value in all centroids occurs in most of the documents and does not play a significant role in identifying the documents of any one particular cluster. On the contrary, terms that occur in only one centroid are most probably the key concept words of the documents of that cluster. Standard deviation is the most straightforward tool to check the uniformity of occurrence of a feature across all vectors.
We propose a novel approach that iteratively filters features within the clustering framework. It cannot be called a pure wrapper approach, because the filter step does not consider the clustering labels for feature weighting or ranking; instead, the feature selection is very similar to a simple statistical filter. However, a major effect is obtained by performing filtering in every round of clustering: during each iteration of SKM clustering, a few features are removed. Hence, we call the proposed method spherical k-means with iterative feature filtering (SKIFF) (Figure 3).
Figure 3. Listing of the proposed spherical k-means with iterative feature filtering.
The considerations behind our proposal are as follows:
Easy availability of functions for simple statistical computations.
The very low computational complexity of statistical computations, as the sub-operations of such formulae can be parallelized over the data, making them sub-linear in time.
Statistical properties apply to both cases: mutually independent features and dependent features.
Statistical properties like mean and sum do not vary much in interpretation whether applied to the entire data or to its summary (like centroids).
The centroid matrix is of size $k \times d$, having $k$ vectors of $d$ dimensions. Since $k$ is much smaller than $n$, the total number of document vectors, computation on the centroid matrix is much less time-consuming. Any feature with low variance or standard deviation within the centroid matrix has very low discriminating power. Removing such features is helpful, as document vectors will then have similarities with centroid vectors that overlap in the more significant words. Nevertheless, the centroids change during the iterations of clustering; hence, instead of performing feature filtering only once, it is performed every time the centroids change.
Let the document-term matrix be $D^{(t)}$ at the $t$th iteration, and the centroid matrix be $C^{(t)}$. The standard deviation of each dimension of the centroid matrix is computed, and a fraction $\rho$ of the features with the lowest standard deviation is discarded. Thus, the number of dimensions decreases geometrically, making the size of the current document-term matrix at any iteration much smaller than the original, and it decreases further as the algorithm progresses to the next round. So the cost of clustering per iteration decreases significantly. Listing 1 shows the proposed SKIFF algorithm. It takes as input the corpus to be clustered, the number of clusters desired, and the value of the rejection ratio $\rho$. This value, between 0 and 1, indicates the fraction of features to be removed from the current document-term matrix. The output labels each document and assigns it to a unique cluster; hence, the corpus gets partitioned into clusters. For post-analysis and topic identification purposes, we also take the centroids of the clusters as auxiliary output. The user is free to use any initialization method with SKIFF clustering.
Besides the obvious advantage of reducing computational effort through dimension reduction, we achieve one more benefit: such feature filtering avoids the wrong assignment of documents to a centroid on the basis of common terms, so that clustering is driven by the more significant words.
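One filtering round of the scheme described above can be sketched as follows; this is a simplified dense-matrix version, and the function name and tie-breaking order are illustrative.

```python
import math

def filter_features(doc_term, centroids, rho):
    """Drop the fraction `rho` of features with the lowest standard
    deviation across the centroid matrix; return the reduced document-term
    matrix, reduced centroids, and the indices of the kept features."""
    d = len(centroids[0])

    def std(j):
        col = [c[j] for c in centroids]
        mean = sum(col) / len(col)
        return math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))

    keep_count = d - int(rho * d)
    # keep the features with the highest spread among centroids
    keep = sorted(sorted(range(d), key=std, reverse=True)[:keep_count])
    reduced_docs = [[row[j] for j in keep] for row in doc_term]
    reduced_cents = [[row[j] for j in keep] for row in centroids]
    return reduced_docs, reduced_cents, keep
```

Calling this once per iteration, on the current (already reduced) matrices, yields the geometric shrinkage of the feature space.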
3.2. Handling empty clusters
Text documents are not uniformly distributed over the term space, and if any initial centroid lies in a region with no actual documents in its vicinity, no document gets assigned to that cluster, and eventually we obtain fewer clusters than desired. A cluster with no documents assigned to it is empty and of no use. This problem is often addressed only through good initialization or by retrying clustering with different initializations. Since textual data are more prone to empty clusters, we suggest a technique to handle the problem here. Our simple strategy works within the iterative structure: detect an empty cluster before updating the concept vectors, then select the document with the least similarity to its own centroid and assign it to the empty cluster. This document automatically becomes the centroid of that cluster for the next iteration. The advantage of such a process is that perturbing the edge points of a cluster suits heuristics like SKM, helping them escape local optima. It is better than a random assignment to an empty cluster, because randomness cannot guarantee its effect on the heuristic search. The point farthest from its centroid is a border point of its cluster and may well belong to another cluster. If multiple empty clusters are present, more such points can be assigned. Since the similarity of such a point with its centroid is already very low, its removal does not much affect the cluster it leaves, and no update to the other clusters is required.
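The empty-cluster strategy can be sketched as follows, assuming (as in the iterative structure above) that each document's similarity to its assigned centroid is already stored; the function name is illustrative.

```python
def fix_empty_clusters(labels, sims, k):
    """For each empty cluster, move in the document least similar to its
    currently assigned centroid (sims[i] holds that cosine similarity).
    `sims` is updated in place; the moved document acts as the empty
    cluster's centroid in the next iteration."""
    labels = list(labels)
    counts = [labels.count(j) for j in range(k)]
    for j in range(k):
        if counts[j] == 0:
            # document with minimum similarity to its own centroid
            i = min(range(len(labels)), key=lambda m: sims[m])
            counts[labels[i]] -= 1
            labels[i] = j
            counts[j] = 1
            sims[i] = 1.0   # it is now its own centroid
    return labels
```

Because the similarities are already stored, detecting and repairing an empty cluster costs only a linear scan.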
3.3. Computational complexity
The time complexity of SKM is linear in the size of the data, that is $O(nkdT)$, where $T$ is the total number of iterations. Since $d$ is of much higher order for text documents than $n$ or $k$, the deciding factor of the runtime is the value of $d$. In our proposed algorithm SKIFF, we have focused on reducing the value of $d$ at every round of clustering. Let $d_t$ be the number of dimensions at the $t$th iteration; then $d_t = (1-\rho)\, d_{t-1}$. This gives a geometrically decreasing series with ratio $(1-\rho)$, making the time complexity of SKIFF $O(nkd/\rho)$. Handling empty clusters does not incur extra cost, because each point's similarity with its centroid is already stored, and thus a point of minimum similarity with its concept vector can be picked in $O(n)$ time. This can be reduced further by maintaining the minimum over the iterations through a marker. Comparing the space complexity of SKM and SKIFF, at every iteration the current document-term matrix is smaller than the original, consuming less memory. Thus, instead of trading off time against space, SKIFF reduces the space requirement of clustering too.
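The geometric decrease in dimensionality can be made explicit. Writing $d_t$ for the number of features at iteration $t$, $\rho$ for the rejection ratio, and $d$ for the original dimensionality, the per-iteration assignment cost $nkd_t$ sums over all iterations as

```latex
d_t = (1-\rho)\, d_{t-1} = (1-\rho)^{t}\, d,
\qquad
\sum_{t=0}^{T-1} n\,k\,d_t
  \;=\; n\,k\,d \sum_{t=0}^{T-1} (1-\rho)^{t}
  \;<\; \frac{n\,k\,d}{\rho},
```

so the total cost over all iterations is bounded by $O(nkd/\rho)$, independent of the number of iterations $T$.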
4. Results and discussion
4.1. Experimental setup
For the experiments, we implemented the proposed method SKIFF and classic SKM with different initialization methods: medians of attributes (MoA) [32], single-pass seed selector (SPSS) [22], approximate k-means++ (AK), and random picking of vectors from the data (RS). Empty clusters are handled in all implementations through the technique suggested above. All implementations were done in MATLAB 2012b on an Intel Core i3 CPU at 3.4 GHz with 4 GB RAM and 64-bit Windows 8.1. The text corpora used for the experiments are Classic3, Classic4, KOS, NIPS, and the Yahoo series; their characteristics can be seen in Table 3. The pre-processed bag-of-words versions of these data sets are available at the UCI repository [33], except Yahoo, which is available at [34]. Each programme is run over the same data 50 times, and the results are averaged for comparison.
Table 3. Description of data sets used in experiments.

| Corpus | Contents | Number of documents | Number of unique terms | Number of clusters | Remarks |
|---|---|---|---|---|---|
| Classic3 | Abstracts from medical journals, information retrieval papers, and aeronautical papers | 3893 | | 3 (known) | Well-defined clusters |
| Classic4 | Classic3 plus CACM collection | 7095 | 5896 | 4 (known) | Well-defined clusters |
| KOS | Blog entries | 3430 | 6906 | 5 | Small documents with less vocabulary |
| NIPS | Full papers of NIPS 2003 conference | 1500 | 12,419 | 6 | Large vocabulary and overlapping clusters |
| Yahoo20 | Yahoo news articles | 2340 | 21,839 | 20 | One large cluster with several small clusters |
4.2. Performance metrics
Performance is measured primarily through runtime, as the proposal is aimed at saving clustering time. In addition, cluster quality is measured through the objective function value and the Adjusted Rand Index (ARI) to check that the proposal has not sacrificed output quality in order to save time. The objective function is the sum of similarities of documents with the centroids of the clusters to which they are assigned, that is $Q = \sum_{j=1}^{k} \sum_{x \in \pi_j} \cos(x, c_j)$, where $\pi_j$ is the set of documents assigned to cluster $j$ and $c_j$ its concept vector; it should be high in value because SKM and SKIFF solve a maximisation problem. ARI can be computed where actual cluster labels are known, that is, for the Classic3 and Classic4 data sets. For specific applications, identifying a major cluster among a collection of documents is required for organisation purposes, and the documents in this cluster should be very similar to each other; hence, we also record the density of the largest cluster output by the algorithms. Maximum adherence among all output clusters measures the extent of overlap among the topics, while minimum adherence depicts how well separated the two most different topics among all documents are.
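The objective function can be recomputed from the output labels alone. The following is a minimal Python sketch of the metric as described, not the paper's implementation; it assumes unit-length document vectors, for which the concept vector is the normalised cluster mean and cosine similarity is a dot product.

```python
import numpy as np

def skm_objective(X, labels, k):
    """Objective Q: sum over documents of the cosine similarity with the
    concept vector of the cluster to which they are assigned."""
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if members.shape[0] == 0:
            continue  # an empty cluster contributes nothing
        mean = members.sum(axis=0)
        concept = mean / np.linalg.norm(mean)  # unit-length concept vector
        total += float((members @ concept).sum())
    return total
```

For ARI on Classic3 and Classic4, where ground-truth labels exist, a standard implementation such as `sklearn.metrics.adjusted_rand_score` can be applied to the output labels.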
4.3. Result analysis
The proposed algorithm SKIFF is aimed primarily at saving time. Table 4 shows the runtime of SKM and SKIFF against each other with different initializations. While for smaller corpora the gain in speed is not much in evidence, for large corpora like Classic4 and Yahoo, where the value of d is very high, the gain in runtime is apparent. Several features belong to very few documents; hence, their removal makes clustering stabilise faster. We measured ARI to observe the effect of the gain in speed on quality, as given in Table 5.
Table 4. Runtime (in seconds) of SKM and SKIFF with different initializations.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 15.2 | 9.14 | 8.33 | 6.3 | 5.31 | 4.53 | 9.43 | 5.32 |
| Classic4 | 51.19 | 21.1 | 16.79 | 11.33 | 12.29 | 10.15 | 16.81 | 11.33 |
| KOS | 2.41 | 1.42 | 2.57 | 0.91 | 2.18 | 1.98 | 0.99 | 1.98 |
| NIPS | 6.54 | 4.78 | 6.29 | 4.49 | 4.33 | 3.66 | 4.42 | 3.53 |
| Yahoo | 53.28 | 3.89 | 26.47 | 3.89 | 30.54 | 3.83 | 34.1 | 3.95 |
SKM: spherical k-means; SKIFF: spherical k-means with iterative feature filtering; MoA: medians of attributes; AK: approximate k-means.
Table 5. Adjusted Rand Index to observe the effect of gain in speed on quality for SKM and SKIFF with different initializations.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 0.9393 | 0.9379 | 0.9354 | 0.936 | 0.9369 | 0.9369 | 0.9354 | 0.8825 |
| Classic4 | 0.4233 | 0.3714 | 0.4049 | 0.3945 | 0.4014 | 0.4014 | 0.3107 | 0.3053 |
Here, SKIFF is on par with SKM, indicating that the quality of categorisation or grouping remains reliable even after removing a few dimensions. This supports our claim that filtering features based on properties of the concept vectors alone is an effective and time-saving method. Computing the objective function as the sum of similarities of document vectors with their respective concept vectors, the values are given in Table 6. It can be noticed that the objective function value achieved by SKIFF is comparable to, and in several cases higher than, that of SKM, which explains the faster convergence. This counters the doubt that fast convergence may be towards a poorer local optimum; instead, SKIFF is often able to produce a better solution than SKM. The working objective function of SKIFF is dynamic, in contrast to the static objective function of almost all clustering heuristics. However, the objective function values reported in Table 6 correspond to the SKM objective, computed using the cluster labels produced by SKIFF.
Table 6. Objective function value produced by SKM and SKIFF.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 1080.609 | 1080.623 | 1080.617 | 1080.62 | 1080.522 | 1080.62 | 1080.565 | 1076.55 |
| Classic4 | 2049.109 | 2027.944 | 2058.952 | 2058.326 | 2054.137 | 2054.726 | 2034.872 | 2034.284 |
| KOS | 1443.181 | 1441.643 | 1444.633 | 1389.132 | 1478.003 | 1477.942 | 1473.435 | 1475.308 |
| NIPS | 733.629 | 734.871 | 738.352 | 738.734 | 733.381 | 733.564 | 735.989 | 736.298 |
| Yahoo | 1028.577 | 1064.123 | 1207.467 | 1132.951 | 1187.103 | 1109.162 | 1204.392 | 1139.172 |
Measuring the intrinsic quality through the density of the largest cluster, we can see in Table 7 that it is higher in the SKIFF output than in SKM, demonstrating that SKIFF has both increased speed and improved output quality. It indicates that the largest cluster output by SKIFF contains only the most similar documents and does not admit less-relevant documents, which further means that the clusters have a more balanced structure. The proposed algorithm produces a much denser largest cluster than SKM for the Classic4 and Yahoo corpora, which are large and for which many algorithms tend to produce a highly imbalanced clustering.
Table 7. Density of largest cluster produced by SKM and SKIFF.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 0.248908 | 0.248908 | 0.248231 | 0.248231 | 0.248231 | 0.248231 | 0.248231 | 0.248739 |
| Classic4 | 0.086472 | 0.106665 | 0.073259 | 0.071329 | 0.058961 | 0.058961 | 0.059179 | 0.059817 |
| KOS | 0.85923 | 0.850388 | 0.692885 | 0.924786 | 0.90725 | 0.822683 | 0.328579 | 0.940088 |
| NIPS | 1.963997 | 1.963997 | 1.604517 | 1.604517 | 1.410701 | 1.413289 | 1.472381 | 1.472381 |
| Yahoo | 0.238628 | 1.123185 | 0.504927 | 0.578659 | 0.865754 | 0.936483 | 0.947097 | 0.580744 |
Adherence between the clusters is computed, and the maximum and minimum over all pairs are recorded in Tables 8 and 9, respectively. A lower value is preferred for this metric. Here no clear winner between SKIFF and SKM can be identified, and the methods have almost equal adherence values in many cases. For some data sets, a specific initialization gives better results with SKIFF; for others, with SKM. This indicates that while winning on runtime, SKIFF does not sacrifice cluster quality and can give a cluster structure on par with the output obtained by SKM.
Table 8. Maximum adherence among clusters output by SKM and SKIFF.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 0.2485 | 0.2485 | 0.2493 | 0.2497 | 0.2502 | 0.2502 | 0.2491 | 0.3242 |
| Classic4 | 0.3618 | 0.3683 | 0.3958 | 0.4001 | 0.4898 | 0.4898 | 0.4559 | 0.4443 |
| KOS | 0.6149 | 0.6065 | 0.6199 | 0.6221 | 0.6528 | 0.6433 | 0.6481 | 0.6294 |
| NIPS | 0.6986 | 0.6992 | 0.6051 | 0.6057 | 0.6323 | 0.6323 | 0.6358 | 0.6358 |
| Yahoo | 1.9027 | 1.9027 | 0.9239 | 0.8335 | 0.8032 | 1.9027 | 0.8448 | 0.8112 |
Table 9. Minimum adherence among clusters output by SKM and SKIFF.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 0.1888 | 0.1888 | 0.1893 | 0.1893 | 0.1893 | 0.1893 | 0.1894 | 0.1904 |
| Classic4 | 0.0697 | 0.0697 | 0.2064 | 0.2069 | 0.2154 | 0.2154 | 0.2163 | 0.2163 |
| KOS | 0.2679 | 0.2682 | 0.2794 | 0.336 | 0.2794 | 0.2989 | 0.2851 | 0.2864 |
| NIPS | 0.0207 | 0.0207 | 0.2073 | 0.2078 | 0.1439 | 0.1439 | 0.4357 | 0.4357 |
| Yahoo | 0.1059 | 0.3034 | 0.2423 | 0.2988 | 0.155 | 0.2862 | 0.1904 | 0.2002 |
5. Conclusion
Research on reducing the computational effort of SKM or any k-means-based clustering method has so far focused on reducing either the number of iterations or the number of objects processed per iteration. Such optimizations work well for low-dimensional data. However, for text data, where the number of dimensions is much larger than the number of instances, a technique that reduces the number of processed dimensions is preferable. Instead of reducing dimensions through conventional filtering or wrapper-style self-learning filtering, we have proposed filtering irrelevant features by successive reduction of dimensions within the iterative structure, which performs better than previously proposed methods. The proposed clustering method SKIFF maintains cluster quality with less computational effort. The runtime comparison of SKIFF with SKM shows the effectiveness of filtering for clustering large corpora like Yahoo and Classic4. SKIFF works well with initialization methods like SPSS when well-separated clusters are desired; in applications where a clustering close to a particular notion of ground truth is required, SKIFF works well with the MoA and AK seeding methods. Although we have applied a simple technique to handle empty clusters, the clustering algorithm still gets trapped in local optima at times, which may impact the overall efficiency of the method. This article does not address the fundamental issues of k-means clustering, and the results with the SPSS and RS seeding methods are not significant. In future work, we may include nature-inspired algorithms to deal with the convergence issue and to find the global optimum in k-means clustering. Thus, the major finding of this article is that clustering and feature selection can reinforce each other, giving good clustering results at a faster speed; one does not need to sacrifice clustering quality to reduce computational cost.
Footnotes
Author's note
Manoj Kumar is also affiliated to MEU Research Unit, Faculty of Information Technology, Middle East University, Amman, Jordan.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD
Manoj Kumar
References
1. Zhang Y, Jin R, Zhou ZH. Understanding bag-of-words model: a statistical framework. Int J Mach Learn Cybernet 2010; 1(1–4): 43–52.
2. Salton G, Wong A, Yang CS. A vector space model for automatic indexing. Commun ACM 1975; 18(11): 613–620.
3. Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Inform Process Manage 1988; 24(5): 513–523.
Schütze H, Manning CD, Raghavan P. Introduction to information retrieval, vol. 39. Cambridge: Cambridge University Press, 2008, pp. 234–265.
11. Forgy E. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 1965; 21: 768–780.
12. Dhillon IS, Modha DS. Concept decompositions for large sparse text data using clustering. Mach Learn 2001; 42(1): 143–175.
13. Graham MW, Miller DJ. Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection. IEEE Trans Signal Process 2006; 54(4): 1289–1303.
14. Saeed MY, Awais M, Talib R, et al. Unstructured text documents summarization with multi-stage clustering. IEEE Access 2020; 8: 212838–212854.
15. Baraldi A, Bruzzone L, Blonda P. Quality assessment of classification and cluster maps without ground truth knowledge. IEEE Trans Geosci Remote Sens 2005; 43(4): 857–873.
Kim H, Kim HK, Cho S. Improving spherical k-means for document clustering: fast initialization, sparse centroid projection, and efficient cluster labeling. Expert Syst Appl 2020; 150: 113288.
18. Banerjee A, Ghosh J. Frequency-sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres. IEEE Trans Neural Netw 2004; 15(3): 702–719.
19. Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. Stanford, CA: Stanford University Press, 2006.
20. Duwairi R, Abu-Rahmeh M. A novel approach for initializing the spherical K-means clustering algorithm. Simul Model Pract Theory 2015; 54: 49–63.
21. Li Y, Luo C, Chung SM. Text clustering with feature selection by using statistical data. IEEE Trans Knowl Data Eng 2008; 20(5): 641–652.
22. Sharma I, Sharma H. Recognizing patterns in text data through effective initialization of spherical K-means. In: 2018 second international conference on electronics, communication and aerospace technology (ICECA), Coimbatore, India, 29–31 March 2018, pp. 327–331. New York: IEEE.
23. Hassani A, Iranmanesh A, Mansouri N. Text mining using nonnegative matrix factorization and latent semantic analysis. Neural Comput Appl 2021; 33(20): 13745–13766.
24. Ji S, Xu D, Guo L, et al. The seeding algorithm for spherical k-means clustering with penalties. J Comb Optim 2022; 44: 1977–1994.
Zhao J, Liang JM, Dong ZN, et al. Accelerating information entropy-based feature selection using rough set theory with classified nested equivalence classes. Pattern Recogn 2020; 107: 107517.
29. Zhou P, Chen J, Fan M, et al. Unsupervised feature selection for balanced clustering. Knowl Based Syst 2020; 193: 105417.
30. Chaudhuri A, Samanta D, Sarma M. Two-stage approach to feature set optimization for unsupervised dataset with heterogeneous attributes. Expert Syst Appl 2021; 172: 114563.
31. Liu W, Liu M, Huang M. Study on Chinese text clustering algorithm based on K-mean and evaluation method on effect of clustering for software-intensive system. In: 2020 international conference on computer engineering and application (ICCEA), Guangzhou, China, 18–20 March 2020, pp. 513–519. New York: IEEE.
32. Vouros A, Langdell S, Croucher M, et al. An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations. Mach Learn 2021; 110: 1975–2003.