Text clustering is an important yet relatively overlooked field of text mining. Many applications require automatic text organisation, in which an information retrieval system relies on organised search results. Spherical k-means is a successful adaptation of the classic k-means algorithm to text clustering. However, conventional methods for accelerating k-means may not apply to spherical k-means because of the different nature of text document data. The proposed work introduces an iterative feature filtering technique that reduces the data size during clustering, producing more feature-relevant clusters in less time than classic spherical k-means. The novelty of the proposed method is that feature assessment is distinct from the objective function of clustering and is derived from the cluster structure. Experimental results on popular text corpora show that the proposed scheme improves computation speed without sacrificing cluster quality and compares favourably with recent works in this domain.
Electronic documents save time and facilitate collaborative work across the globe, beyond geographical or departmental boundaries. The growing number of electronic documents poses an increasing challenge for information systems to effectively organise, manage, and retrieve the information contained in extensive collections of texts according to users' information needs. Repositories such as digital libraries store documents so that they can later be searched and retrieved. Business enterprises view social media documents as a potential source for analysing trends and gaining commercially useful information. Document or text clustering algorithms provide an effective mechanism for such applications by organising large collections into a small number of meaningful clusters, which improves retrieval performance in both speed and relevance. A few applications of document clustering and their intended purposes are summarised in Table 1. While document classification has received much attention, document clustering is becoming increasingly important as a precursor to classification, topic detection, and the grouping of huge collections of documents. Document clustering applies natural language processing to document collections in the same manner as any other data: the documents are represented numerically. A bag-of-words [1] is a common approach in which every document is viewed as a set of words, independent of their positions in the document and of each other. Thus, a document clustering algorithm deals with a set of documents called a corpus, where each document is in turn a set of words or terms.
Table 1. Types of text documents for clustering with their benefits and applications.

Sl. no. | Types of documents | Purpose | Advantage | Application
1 | News articles and blogs | Detection of duplicate content, identification of plagiarism, and detection of related news/articles | Better presentation of content | Media monitoring and analysis
2 | Customer reviews | Recommendation of content and products | Improved relevance of recommendations | Recommendation systems
3 | Microblogs and tweets | Detection of not predefined and unforeseen subjects in surveys and claims | Proactive management and effective response to users | Feedback analysis and opinion mining
4 | Search results | Grouping of results and suggestion of related information | Fast and efficient search and improved user experience | Information retrieval systems
5 | Research articles and digital books/records | Structuring according to implicit subjects and topic detection | Aids navigation and faster search | Digital libraries
A vector-space model (VSM) [2] represents documents as non-negative vectors in the term space formed by the union of the words of all documents, with numeric values that generally correspond to word frequencies. A toy example of bag-of-words gives the reader an idea of the size of such a representation for real-life documents. Figure 1 shows a few short documents related to simple reading exercises for children, and Table 2 is a simple document-term matrix containing the frequency of terms occurring in the documents. The corpus is tokenized to obtain the main terms; stop words and parts of speech like articles, pronouns, and prepositions are removed, and stemming is done to retain only each word's root. The challenge of document clustering lies in the variety of documents in any corpus: documents vary in length, in the number of unique words, and in the frequency of these words. Different approaches exist to weight the term frequency statistics contained in the document-by-term matrix. This weighting aims to capture the relative importance of different terms, thereby improving performance in common tasks such as classification, clustering, and ad hoc retrieval. Two popular approaches are term-frequency inverse document frequency (TF-IDF) [3] and BM25 [4]. Every such representation gives a length-biased, sparse, high-dimensional space. While normalisation to unit length can handle the biasing effect of length, the other aspects make this vector representation of a corpus high dimensional and very sparse. There is no general solution for the 'curse of dimensionality' and the sparsity of any vector feature space of documents; these problems have to be handled within a clustering algorithm.
Table 2. Document-term matrix containing the frequency of terms in the toy example of Figure 1.

      | black | cat | jump | rat | hat | sat | mat | ran | see
Doc 1 |   2   |  1  |  1   |  1  |  0  |  0  |  0  |  0  |  0
Doc 2 |   0   |  1  |  0   |  0  |  1  |  1  |  1  |  0  |  0
Doc 3 |   0   |  0  |  0   |  2  |  1  |  0  |  1  |  2  |  0
Doc 4 |   0   |  1  |  0   |  1  |  0  |  1  |  1  |  0  |  0
Doc 5 |   1   |  1  |  0   |  1  |  1  |  0  |  1  |  0  |  1
Figure 1. Toy example of the corpus with five small documents.
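As an illustration of the representation, the following is a minimal Python sketch that builds such a document-term matrix. The two documents and the stop-word list below are stand-ins (the actual documents of Figure 1 are not reproduced here), and the crude whitespace tokenizer omits stemming.

```python
from collections import Counter

def build_doc_term_matrix(corpus, stop_words):
    """Tokenize each document, drop stop words, and count term frequencies."""
    tokenized = [[w for w in doc.lower().split() if w not in stop_words]
                 for doc in corpus]
    # vocabulary: union of the words of all documents, in sorted order
    vocab = sorted({w for doc in tokenized for w in doc})
    matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
    return vocab, matrix

corpus = ["the black cat sat", "a cat in a hat"]   # stand-in documents
stop_words = {"the", "a", "in"}
vocab, dtm = build_doc_term_matrix(corpus, stop_words)
```

Each row of `dtm` is the raw term-frequency vector of one document; TF-IDF or BM25 weighting would be applied on top of these counts.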
Hornik et al. [5] have categorised clustering approaches for text data into three categories. First, a suitable dissimilarity/similarity measure between the texts is chosen, and a partition-style clustering is performed based on it; cosine similarity is a popular similarity measure. String kernels [6] have become very popular in this context, as the corresponding dissimilarities can be computed relatively efficiently and deployed in modern kernel-based learning techniques such as kernel k-means clustering [7]. Second, a generative approach is adopted using probabilistic models for texts, such as topic models, for uncovering the latent semantic structure based on a hierarchical Bayesian analysis of the texts [8,9]. Finally, one can use a suitable representation of the corpus (the collection of text documents) to convert it into a numeric form [10] and apply traditional approaches for clustering multivariate numeric data. This article proposes a hybrid manner in which feature selection can reduce dimensionality and increase the performance of a clustering algorithm.
The article is organised into five sections. Section 1 introduces the problem; the terminology and notation, a brief review of literature relevant to our work, and material on the initialization and feature selection of spherical k-means (SKM) are discussed in Section 2. The proposed work is presented in Section 3, and its implementation and results are discussed in Section 4. The conclusion of the proposed work is presented in Section 5.
2. Related work
In this section, we explore k-means and its variants. The k-means algorithm is a popular method for clustering a set of data vectors [11]. The classical version of k-means uses squared Euclidean distance to measure the dissimilarity of two objects. However, this distance measure is often inappropriate for document clustering, as observed in Lodhi et al.'s study [6]. A variant of k-means used for document clustering, known as SKM, was proposed by Dhillon et al. [12]. SKM retains the benefits of k-means (scalability, flexibility, and simplicity) while making it fast for high-dimensional text data by using cosine similarity for comparison. However fast clustering algorithms may be designed, the high dimensionality of text data remains a major hurdle in speeding up the process. To date, feature selection with unsupervised learning in the literature has been performed using either a filter approach [13], which removes features of lesser significance before the clustering process, or a wrapper approach [14], which determines feature ranks through the clustering process.
Throughout this article, we consider the term-frequency VSM of document representation. In this model, a document is a vector in the term space of the document-term matrix. After pre-processing, we obtain, say, $d$ unique words from the corpus of $n$ documents, and each document is represented as a $d$-dimensional vector $x_i = (x_{i1}, x_{i2}, \dots, x_{id})$, where $x_{ij} \geq 0$. Thus, the corpus is a document-term matrix of dimensions $n \times d$. Each vector is normalised to unit length, that is $\lVert x_i \rVert = 1$, and the values are computed using equation (1)

$$x_{ij} = \frac{f_{ij}}{\sqrt{\sum_{l=1}^{d} f_{il}^{2}}}, \tag{1}$$

with $f_{ij}$ denoting the number of times word $j$ appears in document $i$. This has the effect that documents are now points on the unit sphere in the non-negative orthant of the term space $\mathbb{R}^{d}$, and the inner product of any two vectors is a natural measure of similarity between them. Hence, cosine similarity is used here, computed as equation (2)

$$\cos(x_i, x_j) = \frac{x_i^{\top} x_j}{\lVert x_i \rVert \, \lVert x_j \rVert} = x_i^{\top} x_j. \tag{2}$$

The simplification is due to the unit length of each document, and the sparsity of the vectors makes this inner product computation very fast. Clustering is viewed as a partitioning of the set of documents into $k$ disjoint clusters $\pi_1, \dots, \pi_k$ such that $\bigcup_{j=1}^{k} \pi_j = \{x_1, \dots, x_n\}$ and $\pi_i \cap \pi_j = \emptyset$ if $i \neq j$. Each cluster can be represented through its concept vector $c_j$, obtained as a unit vector in the direction of the centroid vector of the cluster. The mean vector or centroid of documents in any cluster is computed as the mean of all the documents contained in the cluster using equation (3)

$$m_j = \frac{1}{n_j} \sum_{x \in \pi_j} x, \qquad c_j = \frac{m_j}{\lVert m_j \rVert}, \tag{3}$$

where $n_j$ is the number of documents in cluster $\pi_j$. A point to be noted carefully here is that the mean vector $m_j$ need not be of unit length. The unit vector $c_j$ in the centroid direction is called the concept vector and is the closest in cosine similarity to all document vectors in cluster $\pi_j$. Thus, the objective function of the optimization problem of clustering can be stated as: maximise $\sum_{j=1}^{k} \sum_{x \in \pi_j} x^{\top} c_j$ such that $\bigcup_{j=1}^{k} \pi_j = \{x_1, \dots, x_n\}$ and $\pi_i \cap \pi_j = \emptyset$ if $i \neq j$.
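The normalisation, similarity, and concept-vector computations can be sketched in a few lines of Python (dense lists stand in for the sparse document vectors of a real corpus):

```python
import math

def normalize(v):
    """Scale a term-frequency vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """For unit vectors, cosine similarity reduces to the inner product."""
    return sum(a * b for a, b in zip(u, v))

def concept_vector(cluster):
    """Unit vector in the direction of the cluster centroid."""
    d = len(cluster[0])
    mean = [sum(doc[j] for doc in cluster) / len(cluster) for j in range(d)]
    return normalize(mean)

# two unit-length document vectors from raw term frequencies
x = normalize([2, 1, 1, 1, 0])
y = normalize([0, 1, 0, 0, 1])
```

Note that the mean of unit vectors is generally not of unit length, which is why `concept_vector` re-normalizes it.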
Besides these, we need two more metrics: the first is an intrinsic measure called the density of a cluster, and the second is an extrinsic measure called adherence among clusters. For a better understanding, we require the definitions from Baraldi et al. [15]. The principal dimension of a corpus corresponds to the term having the highest normalised frequency in the corpus; let $p$ denote it. The density of a cluster is the ratio of its population to the length of the principal dimension spanned by the cluster, computed as equation (4)

$$\delta(\pi_j) = \frac{n_j}{\max_{x \in \pi_j} x_{p} - \min_{x \in \pi_j} x_{p}}. \tag{4}$$
Thus, density is also a measure of the compactness of a cluster. If a cluster contains documents belonging to a variety of subtopics, the range spanned by the cluster is large and the density low, indicating the diversity of the cluster. If a cluster contains documents related to a single topic or a limited set of topics, the span is smaller and the density higher.
2.1. Spherical k-means and variants
The k-means is a popular clustering algorithm that has been adapted to a variety of data and applications. The basic working of k-means is to collect objects that are more similar to each other into a cluster. It begins with k initial centroids called seeds and then iteratively proceeds through two major steps. The first step is to assign each data object to a cluster by measuring its distance from all centroids and giving it the label of the nearest centroid. The second step is to update the centroids of all clusters by taking the mean of all points assigned to each cluster. These two steps are repeated until a stopping criterion is met. Any variant of k-means has to define the following in order to have a working version of the algorithm:
Initialization method that defines how the seeds will be chosen or computed. The seeds can either be picked from among the objects or computed as virtual objects from the object space.
Dissimilarity/similarity measure computation to calculate the ‘nearness’ of objects from centroids.
A stopping criterion to decide when to stop the iterations. This can be a threshold on centroid movement, below which the centroids are considered unchanged, or a maximum number of allowed iterations.
Besides this, there is other scope for changes in the k-means algorithm; for example, centroids can be updated as soon as a point is assigned to a cluster, called online update, as opposed to the batch update in classic k-means, where centroids are updated only after all points have been assigned to their respective clusters.
Dhillon and Modha [12] proposed spherical k-means (SKM), using a VSM whose numerical values were derived from word frequencies and inverse document frequency. The normalised VSM of the corpus makes the vectors fall on the surface of a unit hypersphere in the object space, hence the name spherical k-means. Cosine similarity is used to compute similarities among documents. The algorithm is reproduced from Dhillon and Modha [12] in Figure 2. Geometrically, the SKM algorithm partitions the high-dimensional space into Voronoi or Dirichlet regions separated by hyperplanes passing through the origin. Each cluster is represented by a centroid that provides a compact summary of the cluster. The similarity of a document with the centroid (representative or concept vector) decides whether the document becomes a member of the cluster. The concept vectors are iteratively updated as the mean direction of all cluster members, and the membership decisions are also revised at every iteration until stability is attained. SKM inherits the obvious drawbacks of k-means: the tendency to get trapped in local optima, dependency on centroid initialization, and the possibility of empty clusters.
Figure 2. Listing of the spherical k-means algorithm, reproduced from Hornik et al.'s study [5].
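A minimal Python sketch of the SKM loop is given below. Random seeding, dense toy vectors, and the stopping test are illustrative simplifications; a real implementation would use sparse vectors and a proper initialization method.

```python
import math
import random

def spherical_kmeans(docs, k, max_iter=50, seed=0):
    """Minimal SKM sketch: `docs` are unit-length dense vectors.
    Assignment maximizes the inner product (cosine similarity for unit
    vectors); concept vectors are re-normalized cluster means."""
    rng = random.Random(seed)
    concepts = [list(c) for c in rng.sample(docs, k)]   # random seeding
    labels = [-1] * len(docs)
    for _ in range(max_iter):
        # assignment step: nearest concept vector by inner product
        new_labels = [
            max(range(k),
                key=lambda j: sum(a * b for a, b in zip(doc, concepts[j])))
            for doc in docs
        ]
        if new_labels == labels:          # stability reached
            break
        labels = new_labels
        # update step: mean direction of each cluster's members
        for j in range(k):
            members = [doc for doc, lab in zip(docs, labels) if lab == j]
            if members:                   # empty clusters keep old concept
                mean = [sum(col) / len(members) for col in zip(*members)]
                norm = math.sqrt(sum(x * x for x in mean)) or 1.0
                concepts[j] = [x / norm for x in mean]
    return labels, concepts
```

On well-separated data the loop typically stabilises in a handful of iterations, but, as discussed above, the result depends on the seeding.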
Dhillon et al. [16] suggested how to make SKM computationally efficient. The primary conclusion of their work is that even for very large collections, clustering can be achieved in time linear in the size of the collection. They suggested a multi-threaded technique to accelerate the entire process of fetching documents from disc, creating the vector representation, and then clustering. A different representation of the text documents is also suggested: every document is a $d$-dimensional vector $x_i$ whose $j$th entry, for $1 \leq j \leq d$, is the product of three components, $x_{ij} = t_{ij}\, g_j\, s_i$, where $t_{ij}$ is a term weighting component that depends only on the frequency of term $j$ in the document, $g_j$ is a global weighting component that depends only on the number of documents containing term $j$, and $s_i$ is the normalisation component for the $i$th document. This representation tries to capture the importance of a term both in the document and in the collection, thus enhancing its discrimination effect.
SKM suffers from a major drawback of producing empty clusters, or clusters with very few members, when it gets stuck in a local optimum. Dhillon et al. [16] proposed a cluster refinement algorithm. They design a local search procedure called 'first variation' to refine clusters produced by SKM. It runs as an added step after cluster assignment and perturbs clusters that appear 'too stable' over a few iterations. It reshapes the clusters by moving one point from a cluster to another in each step, thereby increasing the value of the objective function; a sequence of first variation steps moves many points. However, this comes at increased computational complexity and cannot theoretically guarantee that all local optima can be escaped. Empty clusters and computational effort are jointly addressed in the scheme by Kim et al. [17]. They proposed an initialization process that gives dispersed initial points and is 1000 times faster than previous works. Viewing the updating of concept vectors as gradient ascent, they proposed to update them along the gradient direction. The algorithm is named improved spherical k-means (ISKM). The update equation is of the form $c^{(t+1)} = c^{(t)} + \eta\, \nabla Q(c^{(t)})$, where $\eta$ is the learning rate. They employed two different learning-rate schedules: the first is static, and the second decreases gradually, similar to an annealing schedule. To safeguard against producing empty clusters, a list of the farthest points is maintained, and these are assigned to the empty clusters. Another practical suggestion is to defer normalising the centroids after every update until the magnitude of the vector becomes too large; this avoids a major computation bottleneck of ISKM. Banerjee and Ghosh [18] proposed reducing the number of empty clusters in spherical k-means, specifically using the technique of frequency-sensitive competitive learning.
The similarities to the concept vectors are computed with weights directly proportional to the number of vectors assigned to each cluster. Thus, a cluster with many members appears to be at a greater distance than it actually is; in other words, clusters with many members appear to shrink when similarity is computed from a query vector. This proposal aims to produce a balanced cluster structure, which is helpful for organisational applications that require the documents to be arranged in almost equal-sized clusters.
Thus, it is observed that SKM has two significant aspects for improvement: runtime, and the handling of empty or very small clusters.
2.2. Initialization in Spherical k-means
Any k-means-based algorithm suffers from the effects of bad seeding. The initial concept vectors should ideally be well distributed over the object space so as to capture the cluster structure without bias. Nevertheless, this is contradictory, since clustering itself is meant to discover the distribution of objects in the space: the structure is an output and cannot be given as an input. This has led to much research dedicated to improving the initialization of k-means. K-means++ [19] is considered a very effective initialization technique. It successively selects each centroid as a point farthest from the already selected centroids. However, it has a very high computational complexity. The initialization intuition of Kim et al. [17] can be summarised as:
1. Initialise the set of centroids C with a point randomly selected from the document set R.
2. Select a point x randomly from R as the next centroid.
3. Remove from R every point y for which x and y are too similar (their similarity exceeds a threshold).
4. Repeat Steps 2 and 3 until k centroid points are selected or R becomes empty.
5. If the number of selected centroids is less than k, choose the remaining points from the data randomly.
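The steps above can be sketched as follows. The function name, the threshold value, and the fallback behaviour are illustrative assumptions, not taken from Kim et al. [17]; documents are assumed to be unit-length vectors so that the inner product is the cosine similarity.

```python
import random

def dispersed_seeds(docs, k, threshold=0.7, seed=0):
    """Pick k dispersed seeds: after each random pick, drop the remaining
    documents whose cosine similarity to the pick exceeds `threshold`."""
    rng = random.Random(seed)
    remaining = list(docs)
    seeds = []
    while remaining and len(seeds) < k:
        x = remaining.pop(rng.randrange(len(remaining)))
        seeds.append(x)
        # keep only documents sufficiently dissimilar to the new seed
        remaining = [y for y in remaining
                     if sum(a * b for a, b in zip(x, y)) <= threshold]
    while len(seeds) < k:      # fewer than k survived: fall back to random picks
        seeds.append(rng.choice(docs))
    return seeds
```

By construction, every seed chosen from the shrinking pool has similarity at most `threshold` to all earlier seeds, which keeps the seeds dispersed.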
There have been few efforts to develop an initialization method dedicated to SKM rather than adopting methods designed for k-means. The reader is referred to the discussion in previous studies [5,12], which concludes that selecting a concept vector as the mean of the entire corpus and perturbing it in k different directions gives effective initialization for clustering. Duwairi and Abu-Rahmeh [20] present a deterministic initialization technique that places the concept vectors uniformly in all directions within the unit hypersphere. Li et al. [21] observed that this would be effective only if the documents themselves were uniformly distributed over the positive orthant of the unit hypersphere, and hence proposed a median-based deterministic initialization. It simply computes median values in each dimension to obtain vectors used as seeds. In our previous work [22], two probabilistic initialization methods were suggested, named single-pass seed selector (SPSS) and memory-efficient seed selector (MESS). SPSS is a fast method that first assigns weights to all points based on their distance from a chosen pivot; points are then picked as a random sample with the weights as probabilities, and two thresholds are used, one to decide whether a point is chosen and another to decide whether to update the pivot. MESS performs exactly k iterations, picking one seed at a time. At every round, the search space is reduced by removing objects that are already very near the previously picked seeds. This ensures that every successive seed concept vector is not too close to the previously selected seeds, thus maintaining the underlying idea of k-means++ while reducing space complexity. Recently, Hassani et al. [23] proposed a deterministic initialization for spherical k-means; additionally, they introduced a new feature agglomeration method based on matrix factorization, which separates the terms into groups and creates a new feature vector best suited for clustering. A novel seeding algorithm with penalties was proposed for spherical k-means clustering by Ji et al. [24].
2.3. Feature selection in text clustering
Li et al. [21] have used a statistic-based feature selection measure that quantifies the relevance between a feature and a class as a numeric value. Since it cannot be directly applied to unsupervised learning, it is embedded into an expectation–maximisation framework for text clustering. Instead of removing any feature, the weight of a feature is reduced if it is found irrelevant and can later be increased if its relevance increases. Thus, feature ranking in this method improves cluster quality, but there is no reduction in computation cost. Bora et al. [25] proposed Heuristic Frequent Term-Based Clustering (HFTC), which saves computational effort through feature selection by reducing the terms and documents during the iterations of clustering. A frequent itemset with some predefined minimum support is searched for among all remaining documents to be clustered and tested for overlap with the remaining terms. The set having minimum overlap with the remaining terms is picked as the best set, and the documents that support this best set are clustered together. This cluster is removed, and further clustering is performed only on the remaining documents. Thus, the features holding importance for a few documents are identified and then removed along with those documents. Under the changed situation, a new set of terms may become the best set. The method can be converted to hierarchical clustering by further clustering the primary cluster with reduced dimensionality and the corresponding best set. This method saves effort and strengthens the idea that clusters in text data pertain to topics described by a few keywords. Roffo et al. [26] have proposed a feature-ranking system that associates a measure of importance with a feature through the importance of its neighbours in a feature affinity graph. The affinity of any two terms or features is measured using popular metrics such as the Fisher criterion for relevance and mutual information for redundancy.
When the graph is represented as an adjacency matrix with edge weights, its eigenvectors directly indicate the more important nodes (terms). Thus, it is a method that combines relevance and redundancy measures with graph theory. The computation is time-consuming, and the space complexity of the graph must also be taken into account.
Hu et al. [27] suggest user intervention during clustering to pick the features most preferred by the user. This can be very effective if the user has very good knowledge of the corpus at hand, which is not possible in many real-life situations. Zhao et al. [28] have proposed an entropy-based feature selection method for data clustering, but it has not been applied to text clustering; entropy calculations can be costly for a high-dimensional space like that of text documents. Zhou et al. [29] have suggested selecting features such that a balanced clustering structure is obtained, instead of focusing only on ranking-based feature selection. Balanced structures are required by text organising applications where, besides the similarity constraint, there is a constraint on cluster size. Very recently, Chaudhuri et al. [30] proposed a rank-based feature selection that can be used with heterogeneous attributes and high-dimensional data; it has not been applied to document clustering yet and is, again, a filter-style approach. Liu et al. [31] handle empty clusters by replacing the centroid of an empty cluster with the nearest document vector in k-means-based clustering of Chinese documents, using a Euclidean distance structure for feature selection.
3. Proposed framework
3.1. Feature filtering
Our proposed work removes those words that occur in almost all documents and thus carry no discriminatory information from a clustering point of view. Applying this to the entire corpus is computationally intensive, given the huge size of the document-term matrix. We utilise the observation that centroids are representatives of the members of their respective clusters: if a term has a consistent presence in the majority of documents, it will most probably have such a presence among the centroids. Thus, analysing the feature distribution of the centroid matrix is equivalent to analysing the entire corpus. A term with approximately the same value in all centroids occurs in most of the documents and does not play a significant role in identifying the documents of any one particular cluster. On the contrary, terms that occur in only one centroid are most probably the key concept words of the documents of that cluster. Standard deviation is the most straightforward tool to check the uniformity of occurrence of a feature across all vectors.
We propose a novel approach that iteratively filters features within the clustering framework. It cannot be called a pure wrapper approach, because the filter step does not consider the clustering labels for feature weighting or ranking; instead, the feature selection is very similar to a simple statistical filter. However, a major effect is obtained by performing filtering in every round of clustering: during each iteration of SKM clustering, a few features are removed. Hence, we call the proposed method spherical k-means with iterative feature filtering (SKIFF) (Figure 3).
Figure 3. Listing of the proposed spherical k-means with iterative feature filtering.
The considerations behind our proposal are as follows:
Easy availability of functions for simple statistical computations.
The very low computational complexity of statistical computations, as the sub-operations of such formulae can be parallelized over the data, making them sub-linear in time.
Statistical properties apply to both cases: mutually independent features and dependent features.
Statistical properties like mean and sum do not vary much in interpretation whether applied to the entire data or to its summary (like centroids).
The centroid matrix is of size $k \times d$, having $k$ vectors of $d$ dimensions. Since $k$ is much smaller than $n$, the total number of document vectors, computation on the centroid matrix is much less time-consuming. Any feature with low variance or standard deviation within the centroid matrix has very low discriminating power. Removing such features is helpful, as document vectors will then have similarities with centroid vectors that overlap in the more significant words. Nevertheless, the centroids change during the iterations of clustering; hence, instead of performing feature filtering only once, it is performed every time the centroids change.
Let the document-term matrix be $D^{(t)}$ at the $t$th iteration, and the centroid matrix be $C^{(t)}$. The standard deviation of each dimension of the centroid matrix is computed, and a fraction $\rho$ of the features with the lowest standard deviation is discarded. Thus, the number of dimensions decreases geometrically, making the size of the current document-term matrix at any iteration much smaller than the original, and it decreases further as the algorithm progresses to the next round. So the cost of clustering per iteration decreases significantly. Listing 1 shows the proposed SKIFF algorithm. It takes as input the corpus to be clustered, the number of clusters desired, and the value of the rejection ratio $\rho$. This value, between 0 and 1, indicates the fraction of features to be removed from the current document-term matrix. The output labels each document and assigns it to a unique cluster; hence, the corpus gets partitioned into clusters. For post-analysis and topic identification purposes, we also take the centroids of the clusters as auxiliary output. The user is free to use any initialization method with SKIFF clustering.
Besides the obvious advantage of reducing computational effort through dimension reduction, we achieve one more benefit: such feature filtering avoids the wrong assignment of documents to a centroid on the basis of common terms, so that clustering is driven by the more significant words.
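One filtering round of the scheme described above can be sketched as follows; this is a simplified dense-matrix version, and the function name and tie-breaking order are illustrative.

```python
import math

def filter_features(doc_term, centroids, rho):
    """Drop the fraction `rho` of features with the lowest standard
    deviation across the centroid matrix; return the reduced document-term
    matrix, reduced centroids, and the indices of the kept features."""
    d = len(centroids[0])

    def std(j):
        col = [c[j] for c in centroids]
        mean = sum(col) / len(col)
        return math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))

    keep_count = d - int(rho * d)
    # keep the features with the highest spread among centroids
    keep = sorted(sorted(range(d), key=std, reverse=True)[:keep_count])
    reduced_docs = [[row[j] for j in keep] for row in doc_term]
    reduced_cents = [[row[j] for j in keep] for row in centroids]
    return reduced_docs, reduced_cents, keep
```

Calling this once per iteration, on the current (already reduced) matrices, yields the geometric shrinkage of the feature space.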
3.2. Handling empty clusters
Text documents are not uniformly distributed over the term space, and if any initial centroid lies in a region with no actual documents in its vicinity, no document gets assigned to that cluster, and eventually we obtain fewer clusters than desired. A cluster with no documents assigned to it is empty and of no use. This problem is often addressed only through good initialization or by retrying clustering with different initializations. Since textual data are more prone to empty clusters, we suggest a technique to handle the problem here. Our simple strategy works within the iterative structure: detect an empty cluster before updating the concept vectors, then select the document with the least similarity to its own centroid and assign it to the empty cluster. This document automatically becomes the centroid of that cluster for the next iteration. The advantage of such a process is that perturbing the edge points of a cluster suits heuristics like SKM, helping them escape local optima. It is better than a random assignment to an empty cluster, because randomness cannot guarantee its effect on the heuristic search. The point farthest from its centroid is a border point of its cluster and may well belong to another cluster. If multiple empty clusters are present, more such points can be assigned. Since the similarity of such a point with its centroid is already very low, its removal does not much affect the cluster it leaves, and no update to the other clusters is required.
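The empty-cluster strategy can be sketched as follows, assuming (as in the iterative structure above) that each document's similarity to its assigned centroid is already stored; the function name is illustrative.

```python
def fix_empty_clusters(labels, sims, k):
    """For each empty cluster, move in the document least similar to its
    currently assigned centroid (sims[i] holds that cosine similarity).
    `sims` is updated in place; the moved document acts as the empty
    cluster's centroid in the next iteration."""
    labels = list(labels)
    counts = [labels.count(j) for j in range(k)]
    for j in range(k):
        if counts[j] == 0:
            # document with minimum similarity to its own centroid
            i = min(range(len(labels)), key=lambda m: sims[m])
            counts[labels[i]] -= 1
            labels[i] = j
            counts[j] = 1
            sims[i] = 1.0   # it is now its own centroid
    return labels
```

Because the similarities are already stored, detecting and repairing an empty cluster costs only a linear scan.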
3.3. Computational complexity
The time complexity of SKM is linear in the size of the data, that is $O(nkdT)$, where $T$ is the total number of iterations. Since $d$ is of much higher order for text documents than $n$ or $k$, the deciding factor of the runtime is the value of $d$. In our proposed algorithm SKIFF, we have focused on reducing the value of $d$ at every round of clustering. Let $d_t$ be the number of dimensions at the $t$th iteration; then $d_t = (1-\rho)\, d_{t-1}$. This gives a geometrically decreasing series with ratio $(1-\rho)$, making the time complexity of SKIFF $O(nkd/\rho)$. Handling empty clusters does not incur extra cost, because each point's similarity with its centroid is already stored, and thus a point of minimum similarity with its concept vector can be picked in $O(n)$ time. This can be reduced further by maintaining the minimum over the iterations through a marker. Comparing the space complexity of SKM and SKIFF, at every iteration the current document-term matrix is smaller than the original, consuming less memory. Thus, instead of trading off time against space, SKIFF reduces the space requirement of clustering too.
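The geometric decrease in dimensionality can be made explicit. Writing $d_t$ for the number of features at iteration $t$, $\rho$ for the rejection ratio, and $d$ for the original dimensionality, the per-iteration assignment cost $nkd_t$ sums over all iterations as

```latex
d_t = (1-\rho)\, d_{t-1} = (1-\rho)^{t}\, d,
\qquad
\sum_{t=0}^{T-1} n\,k\,d_t
  \;=\; n\,k\,d \sum_{t=0}^{T-1} (1-\rho)^{t}
  \;<\; \frac{n\,k\,d}{\rho},
```

so the total cost over all iterations is bounded by $O(nkd/\rho)$, independent of the number of iterations $T$.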
4. Results and discussion
4.1. Experimental setup
For the experiments, we implemented the proposed method SKIFF and classic SKM with different initialization methods: medians of attributes (MoA) [32], single-pass seed selector (SPSS) [22], approximate k-means++ (AK), and random picking of vectors from the data (RS). Empty clusters are handled in all implementations through the technique suggested above. All implementations were done in MATLAB 2012b on an Intel Core i3 CPU at 3.4 GHz with 4 GB RAM and 64-bit Windows 8.1. The text corpora used for the experiments are Classic3, Classic4, KOS, NIPS, and the Yahoo series; their characteristics can be seen in Table 3. The pre-processed bag-of-words versions of these data sets are available at the UCI repository [33], except Yahoo, which is available at [34]. Each programme is run over the same data 50 times, and the results are averaged for comparison.
Table 3. Description of data sets used in experiments.

| Corpus | Contents | Number of documents | Number of unique terms | Number of clusters | Remarks |
|---|---|---|---|---|---|
| Classic3 | Abstracts from medical journals, information retrieval papers, and aeronautical papers | 3893 | | 3 (known) | Well-defined clusters |
| Classic4 | Classic3 plus CACM collection | 7095 | 5896 | 4 (known) | Well-defined clusters |
| KOS | Blog entries | 3430 | 6906 | 5 | Small documents with less vocabulary |
| NIPS | Full papers of NIPS 2003 conference | 1500 | 12,419 | 6 | Large vocabulary and overlapping clusters |
| Yahoo20 | Yahoo news articles | 2340 | 21,839 | 20 | One large cluster with several small clusters |
4.2. Performance metrics
Performance is measured primarily through runtime, as the proposal is aimed at saving clustering time. In addition, cluster quality is measured through the objective function value and the Adjusted Rand Index (ARI) to check that the proposal has not sacrificed output quality in order to save time. The objective function is the sum of similarities of documents with the centroids of the clusters to which they are assigned, that is $Q = \sum_{j=1}^{k} \sum_{x \in \pi_j} \cos(x, c_j)$, where $\pi_j$ is the set of documents assigned to cluster $j$ and $c_j$ its concept vector; it should be high in value because SKM and SKIFF solve a maximisation problem. ARI can be computed where actual cluster labels are known, that is, for the Classic3 and Classic4 data sets. For specific applications, identifying a major cluster among a collection of documents is required for organisation purposes, and the documents in this cluster should be very similar to each other; hence, we also record the density of the largest cluster output by the algorithms. Maximum adherence among all output clusters measures the extent of overlap among the topics, while minimum adherence depicts how well separated the two most different topics among all documents are.
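The objective function can be recomputed from the output labels alone. The following is a minimal Python sketch of the metric as described, not the paper's implementation; it assumes unit-length document vectors, for which the concept vector is the normalised cluster mean and cosine similarity is a dot product.

```python
import numpy as np

def skm_objective(X, labels, k):
    """Objective Q: sum over documents of the cosine similarity with the
    concept vector of the cluster to which they are assigned."""
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if members.shape[0] == 0:
            continue  # an empty cluster contributes nothing
        mean = members.sum(axis=0)
        concept = mean / np.linalg.norm(mean)  # unit-length concept vector
        total += float((members @ concept).sum())
    return total
```

For ARI on Classic3 and Classic4, where ground-truth labels exist, a standard implementation such as `sklearn.metrics.adjusted_rand_score` can be applied to the output labels.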
4.3. Result analysis
The proposed algorithm SKIFF is aimed primarily at saving time. Table 4 shows the runtime of SKM and SKIFF against each other with different initializations. While for smaller corpora the gain in speed is not much in evidence, for large corpora like Classic4 and Yahoo, where the value of d is very high, the gain in runtime is apparent. Several features belong to very few documents; hence, their removal makes clustering stabilise faster. We measured ARI to observe the effect of the gain in speed on quality, as given in Table 5.
Table 4. Runtime (in seconds) of SKM and SKIFF with different initializations.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 15.2 | 9.14 | 8.33 | 6.3 | 5.31 | 4.53 | 9.43 | 5.32 |
| Classic4 | 51.19 | 21.1 | 16.79 | 11.33 | 12.29 | 10.15 | 16.81 | 11.33 |
| KOS | 2.41 | 1.42 | 2.57 | 0.91 | 2.18 | 1.98 | 0.99 | 1.98 |
| NIPS | 6.54 | 4.78 | 6.29 | 4.49 | 4.33 | 3.66 | 4.42 | 3.53 |
| Yahoo | 53.28 | 3.89 | 26.47 | 3.89 | 30.54 | 3.83 | 34.1 | 3.95 |
SKM: spherical k-means; SKIFF: spherical k-means with iterative feature filtering; MoA: medians of attributes; AK: approximate k-means.
Table 5. Adjusted Rand Index to observe the effect of gain in speed on quality for SKM and SKIFF with different initializations.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 0.9393 | 0.9379 | 0.9354 | 0.936 | 0.9369 | 0.9369 | 0.9354 | 0.8825 |
| Classic4 | 0.4233 | 0.3714 | 0.4049 | 0.3945 | 0.4014 | 0.4014 | 0.3107 | 0.3053 |
Here, SKIFF is on par with SKM, indicating that the quality of categorisation or grouping remains reliable even after removing a few dimensions. This supports our claim that filtering features based on properties of the concept vectors alone is an effective and time-saving method. Computing the objective function as the sum of similarities of document vectors with their respective concept vectors, the values are given in Table 6. It can be noticed that the objective function value achieved by SKIFF is comparable to, and in several cases higher than, that of SKM, which explains the faster convergence. This counters the doubt that fast convergence may be towards a poorer local optimum; instead, SKIFF is often able to produce a better solution than SKM. The working objective function of SKIFF is dynamic, in contrast to the static objective function of almost all clustering heuristics. However, the objective function values reported in Table 6 correspond to the SKM objective, computed using the cluster labels produced by SKIFF.
Table 6. Objective function value produced by SKM and SKIFF.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 1080.609 | 1080.623 | 1080.617 | 1080.62 | 1080.522 | 1080.62 | 1080.565 | 1076.55 |
| Classic4 | 2049.109 | 2027.944 | 2058.952 | 2058.326 | 2054.137 | 2054.726 | 2034.872 | 2034.284 |
| KOS | 1443.181 | 1441.643 | 1444.633 | 1389.132 | 1478.003 | 1477.942 | 1473.435 | 1475.308 |
| NIPS | 733.629 | 734.871 | 738.352 | 738.734 | 733.381 | 733.564 | 735.989 | 736.298 |
| Yahoo | 1028.577 | 1064.123 | 1207.467 | 1132.951 | 1187.103 | 1109.162 | 1204.392 | 1139.172 |
Measuring the intrinsic quality through the density of the largest cluster, we can see in Table 7 that it is higher in the SKIFF output than in SKM, demonstrating that SKIFF has both increased speed and improved output quality. It indicates that the largest cluster output by SKIFF contains only the most similar documents and does not admit less-relevant documents, which further means that the clusters have a more balanced structure. The proposed algorithm produces a much denser largest cluster than SKM for the Classic4 and Yahoo corpora, which are large and for which many algorithms tend to produce a highly imbalanced clustering.
Table 7. Density of largest cluster produced by SKM and SKIFF.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 0.248908 | 0.248908 | 0.248231 | 0.248231 | 0.248231 | 0.248231 | 0.248231 | 0.248739 |
| Classic4 | 0.086472 | 0.106665 | 0.073259 | 0.071329 | 0.058961 | 0.058961 | 0.059179 | 0.059817 |
| KOS | 0.85923 | 0.850388 | 0.692885 | 0.924786 | 0.90725 | 0.822683 | 0.328579 | 0.940088 |
| NIPS | 1.963997 | 1.963997 | 1.604517 | 1.604517 | 1.410701 | 1.413289 | 1.472381 | 1.472381 |
| Yahoo | 0.238628 | 1.123185 | 0.504927 | 0.578659 | 0.865754 | 0.936483 | 0.947097 | 0.580744 |
Adherence between the clusters is computed, and the maximum and minimum over all pairs are recorded in Tables 8 and 9, respectively. A lower value is preferred for this metric. Here no clear winner between SKIFF and SKM can be identified, and the methods have almost equal adherence values in many cases. For some data sets, a specific initialization gives better results with SKIFF; for others, with SKM. This indicates that while winning on runtime, SKIFF does not sacrifice cluster quality and can give a cluster structure on par with the output obtained by SKM.
Table 8. Maximum adherence among clusters output by SKM and SKIFF.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 0.2485 | 0.2485 | 0.2493 | 0.2497 | 0.2502 | 0.2502 | 0.2491 | 0.3242 |
| Classic4 | 0.3618 | 0.3683 | 0.3958 | 0.4001 | 0.4898 | 0.4898 | 0.4559 | 0.4443 |
| KOS | 0.6149 | 0.6065 | 0.6199 | 0.6221 | 0.6528 | 0.6433 | 0.6481 | 0.6294 |
| NIPS | 0.6986 | 0.6992 | 0.6051 | 0.6057 | 0.6323 | 0.6323 | 0.6358 | 0.6358 |
| Yahoo | 1.9027 | 1.9027 | 0.9239 | 0.8335 | 0.8032 | 1.9027 | 0.8448 | 0.8112 |
Table 9. Minimum adherence among clusters output by SKM and SKIFF.

| Data set | SKM (MoA) | SKIFF (MoA) | SKM (AK) | SKIFF (AK) | SKM (SPSS) | SKIFF (SPSS) | SKM (RS) | SKIFF (RS) |
|---|---|---|---|---|---|---|---|---|
| Classic3 | 0.1888 | 0.1888 | 0.1893 | 0.1893 | 0.1893 | 0.1893 | 0.1894 | 0.1904 |
| Classic4 | 0.0697 | 0.0697 | 0.2064 | 0.2069 | 0.2154 | 0.2154 | 0.2163 | 0.2163 |
| KOS | 0.2679 | 0.2682 | 0.2794 | 0.336 | 0.2794 | 0.2989 | 0.2851 | 0.2864 |
| NIPS | 0.0207 | 0.0207 | 0.2073 | 0.2078 | 0.1439 | 0.1439 | 0.4357 | 0.4357 |
| Yahoo | 0.1059 | 0.3034 | 0.2423 | 0.2988 | 0.155 | 0.2862 | 0.1904 | 0.2002 |
5. Conclusion
Research on reducing the computational effort of SKM or any k-means-based clustering method has so far focused on reducing either the number of iterations or the number of objects processed per iteration. Such optimizations work well for low-dimensional data. However, for text data, where the number of dimensions is much larger than the number of instances, a technique that reduces the number of processed dimensions is preferable. Instead of reducing dimensions through conventional filtering or wrapper-style self-learning filtering, we have proposed filtering irrelevant features by successive reduction of dimensions within the iterative structure, which performs better than previously proposed methods. The proposed clustering method SKIFF maintains cluster quality with less computational effort. The runtime comparison of SKIFF with SKM shows the effectiveness of filtering for clustering large corpora like Yahoo and Classic4. SKIFF works well with initialization methods like SPSS when well-separated clusters are desired; in applications where a clustering close to a particular notion of ground truth is required, SKIFF works well with the MoA and AK seeding methods. Although we have applied a simple technique to handle empty clusters, the clustering algorithm still gets trapped in local optima at times, which may impact the overall efficiency of the method. This article does not address the fundamental issues of k-means clustering, and the results with the SPSS and RS seeding methods are not significant. In future work, we may include nature-inspired algorithms to deal with the convergence issue and to find the global optimum in k-means clustering. Thus, the major finding of this article is that clustering and feature selection can reinforce each other, giving good clustering results at a faster speed; one does not need to sacrifice clustering quality to reduce computational cost.
Footnotes
Author's note
Manoj Kumar is also affiliated to MEU Research Unit, Faculty of Information Technology, Middle East University, Amman, Jordan.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD
Manoj Kumar
References
1. Zhang Y, Jin R, Zhou ZH. Understanding bag-of-words model: a statistical framework. Int J Mach Learn Cybernet 2010; 1(1–4): 43–52.
2. Salton G, Wong A, Yang CS. A vector space model for automatic indexing. Commun ACM 1975; 18(11): 613–620.
3. Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Inform Process Manage 1988; 24(5): 513–523.
Schütze H, Manning CD, Raghavan P. Introduction to information retrieval, vol. 39. Cambridge: Cambridge University Press, 2008, pp. 234–265.
11. Forgy E. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 1965; 21: 768–780.
12. Dhillon IS, Modha DS. Concept decompositions for large sparse text data using clustering. Mach Learn 2001; 42(1): 143–175.
13. Graham MW, Miller DJ. Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection. IEEE Trans Signal Process 2006; 54(4): 1289–1303.
14. Saeed MY, Awais M, Talib R, et al. Unstructured text documents summarization with multi-stage clustering. IEEE Access 2020; 8: 212838–212854.
15. Baraldi A, Bruzzone L, Blonda P. Quality assessment of classification and cluster maps without ground truth knowledge. IEEE Trans Geosci Remote Sens 2005; 43(4): 857–873.
Kim H, Kim HK, Cho S. Improving spherical k-means for document clustering: fast initialization, sparse centroid projection, and efficient cluster labeling. Expert Syst Appl 2020; 150: 113288.
18. Banerjee A, Ghosh J. Frequency-sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres. IEEE Trans Neural Netw 2004; 15(3): 702–719.
19. Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. Stanford, CA: Stanford University Press, 2006.
20. Duwairi R, Abu-Rahmeh M. A novel approach for initializing the spherical K-means clustering algorithm. Simul Model Pract Theory 2015; 54: 49–63.
21. Li Y, Luo C, Chung SM. Text clustering with feature selection by using statistical data. IEEE Trans Knowl Data Eng 2008; 20(5): 641–652.
22. Sharma I, Sharma H. Recognizing patterns in text data through effective initialization of spherical K-means. In: 2018 second international conference on electronics, communication and aerospace technology (ICECA), Coimbatore, India, 29–31 March 2018, pp. 327–331. New York: IEEE.
23. Hassani A, Iranmanesh A, Mansouri N. Text mining using nonnegative matrix factorization and latent semantic analysis. Neural Comput Appl 2021; 33(20): 13745–13766.
24. Ji S, Xu D, Guo L, et al. The seeding algorithm for spherical k-means clustering with penalties. J Comb Optim 2022; 44: 1977–1994.
Zhao J, Liang JM, Dong ZN, et al. Accelerating information entropy-based feature selection using rough set theory with classified nested equivalence classes. Pattern Recogn 2020; 107: 107517.
29. Zhou P, Chen J, Fan M, et al. Unsupervised feature selection for balanced clustering. Knowl Based Syst 2020; 193: 105417.
30. Chaudhuri A, Samanta D, Sarma M. Two-stage approach to feature set optimization for unsupervised dataset with heterogeneous attributes. Expert Syst Appl 2021; 172: 114563.
31. Liu W, Liu M, Huang M. Study on Chinese text clustering algorithm based on K-mean and evaluation method on effect of clustering for software-intensive system. In: 2020 international conference on computer engineering and application (ICCEA), Guangzhou, China, 18–20 March 2020, pp. 513–519. New York: IEEE.
32. Vouros A, Langdell S, Croucher M, et al. An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations. Mach Learn 2021; 110: 1975–2003.