Abstract
Introduction
Clustering is an unsupervised learning method that divides unlabelled instances into several clusters according to the similarity between objects, without any prior information, so that instances in the same cluster are highly similar while instances in different clusters differ substantially [9, 43]. Ensemble clustering utilises a consensus function to unify multiple partitions of the same dataset into one clustering result. It usually constructs a base clustering pool by repeatedly running a single clustering approach or by executing multiple clustering algorithms; the consensus function is then built through voting methods, hypergraph partitioning, or evidence accumulation to obtain a better clustering result [18, 25]. Many existing studies have confirmed that ensemble clustering usually improves on the result of a single clustering algorithm [1, 24].
Background
Established clustering algorithms are mainly based on the theories of model, grid, density, partition, and hierarchy [2, 8]. Different types of clustering algorithms suit data of different types, distributions, and scales. In particular, with the development of deep learning [27], the performance of various clustering methods has been further improved. For example, in [28], a deep-learning feature extractor for time-series data is designed for relation extraction, and the clustering performance improves significantly. Nonetheless, given the unknown data distribution in practical problems, it is difficult to determine which clustering algorithm will produce better clustering results. Conventional solutions often try different methods and choose the algorithm that performs best. Ensemble clustering aims to establish a general scheme that combines the advantages of multiple clustering algorithms to form an optimal clustering result. This is especially feasible now that distributed computing technology is mature, allowing adaptation to unknown and complex data.
Related studies on ensemble clustering are mainly divided into three categories: pair-wise co-occurrence based, graph partitioning based, and median partition based algorithms [18]. The first type constructs a co-occurrence matrix by counting how many times each pair of instances is assigned to the same cluster across the base clusterings. Two instances with high co-occurrence should be classified into the same cluster in the final clustering [19]. The similarity function constructed from the co-occurrence matrix can be used in any similarity matrix based clustering algorithm to acquire the final clustering result, such as hierarchical clustering and spectral clustering [7, 30]. The idea of the co-occurrence matrix was first proposed in [5]; correspondingly, a method of evidence accumulation clustering (EAC) based on this theory was proposed for the ensemble clustering problem. Subsequent research has made various improvements, such as using the techniques of normalised edges and matrix completion [29, 45]. In graph partitioning, a graph model and consensus function are usually constructed to partition the graph into multiple parts representing the final clusters. The primary purpose of graph partitioning is to achieve
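For concreteness, the pair-wise co-occurrence idea described above can be sketched as follows: the co-association entry for a pair of instances is the fraction of base clusterings in which the pair shares a cluster. This is a minimal illustration of the EAC-style construction, not the paper's exact formulation:

```python
def co_association(base_clusterings):
    """Average pair-wise co-occurrence over the base clustering pool.

    base_clusterings: list of label sequences, one per base clustering.
    Entry (i, j) of the returned matrix is the fraction of base
    clusterings that assign instances i and j to the same cluster.
    """
    n = len(base_clusterings[0])
    T = len(base_clusterings)
    return [[sum(1.0 for labels in base_clusterings
                 if labels[i] == labels[j]) / T
             for j in range(n)] for i in range(n)]
```

The resulting matrix can then be fed to any similarity-based clustering algorithm, as the text notes.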
Motivations
Ensemble clustering mainly involves two steps. The first is to generate a base clustering pool, for example by running the same clustering algorithm multiple times with different parameters, running various clustering algorithms, or performing clustering in subspaces. The second is to select a consensus function, mainly based on theories such as the co-occurrence matrix, graph partitioning, and information entropy. An overview of ensemble clustering algorithms is depicted in Fig. 1.

The outline of ensemble clustering.
Among the numerous types of ensemble clustering solutions, the pair-wise co-occurrence based algorithms are conceptually simple, easy to implement, and have played a major role in the ensemble clustering field. Nevertheless, these algorithms treat all clusters in the base clusterings equally, ignoring the differences between clusters [35]. Some attempts have been made to weight clusters in order to distinguish the effect of different clusters, such as weighting schemes based on information entropy [14] and random walks [16]. The authors used related theories to distinguish different clusters and mine implicit relationships between instances, and the corresponding experiments proved that distinguishing different clusters is effective. However, these approaches perform ensemble clustering using only the labels of the base clusterings, without involving the original features, which may lose vital information implied in the data features.

The motivation of the proposed method.
Compared with algorithms that consider base clustering results only, effectively combining base clusterings with the original features helps further improve the performance of ensemble clustering. Fuzzy-rough sets offer a high degree of flexibility in enabling the vagueness and imprecision present in real-valued data to be simultaneously modelled effectively [12, 38]. The idea of upper and lower approximation can well depict the membership of instances to each category, which is helpful for measuring cluster reliability in ensemble clustering. The motivation of the proposed method is described in Fig. 2, where the solid black arrow and blue box illustrate the difference from similar works: the original features are involved while different cluster reliabilities are distinguished.
To distinguish the validity of different clusters and exploit the role of features, in this paper, the fuzzy-rough lower approximation is used to induce cluster reliability in all base clusterings. A novel fuzzy-rough induced spectral ensemble clustering (FREC) algorithm is proposed to enhance the performance of pair-wise co-occurrence based ensemble clustering. The contribution of the paper is threefold: (1) proposing the novel idea of cluster reliability through the fuzzy-rough lower approximation of each instance, enabling the distinction of diverse cluster significance during clustering; (2) developing a new adjacency matrix based on cluster reliability to effectively enhance the effect of diverse base clusterings and improve clustering performance; (3) establishing a consensus function and spectral ensemble clustering algorithm, whose superiority is confirmed through a comparative study and analysis on various benchmark datasets.
The experiments compare FREC against eleven state-of-the-art clustering algorithms on ten benchmark datasets, as well as against a parallel algorithm that ignores the differences between clusters in the base clusterings. The results show that FREC achieves significantly better clustering performance, and its advantage grows as the ensemble size increases.
The remainder of the paper is structured as follows. The preliminaries of rough sets and fuzzy-rough sets are introduced in Section 2. The FREC algorithm is introduced in detail in Section 3. In Section 4, the experimental results are given and analysed. Finally, a summary is presented in Section 5.
Preliminaries
This section reviews the mathematical concepts concerning rough sets and fuzzy-rough sets, which are relevant to the cluster reliability developed in this paper.
Rough set
Rough set theory [13, 49] provides a methodology that can be employed to extract knowledge from a domain in a concise way: it minimises information loss whilst reducing the amount of information involved. Central to rough set theory is the concept of indiscernibility. Let
Obviously,
where ⊗ is defined as follows for sets
For any object
Fuzzy-rough sets [6, 38] encapsulate the related but distinct concepts of vagueness (for fuzzy sets) and indiscernibility (for rough sets), both of which occur as a result of uncertainty in knowledge. Compared to rough sets, fuzzy-rough sets offer a high degree of flexibility in enabling the vagueness and imprecision present in real-valued data to be simultaneously modelled effectively. In fuzzy-rough sets, the fuzzy lower and upper approximations to approximate a fuzzy concept
Here,
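The approximation operators elided above conventionally take the implicator/t-norm form of Radzikowska and Kerre, where $\mathcal{I}$ is a fuzzy implicator, $\mathcal{T}$ a t-norm, $\mu_{R_P}$ the fuzzy similarity relation over the universe $U$, and $\mu_X$ the fuzzy concept being approximated (stated here under the assumption that the paper follows this standard definition):

```latex
\mu_{\underline{R_P}X}(x) = \inf_{y \in U} \mathcal{I}\big(\mu_{R_P}(x, y),\, \mu_X(y)\big),
\qquad
\mu_{\overline{R_P}X}(x) = \sup_{y \in U} \mathcal{T}\big(\mu_{R_P}(x, y),\, \mu_X(y)\big)
```

Intuitively, the lower approximation measures how certainly $x$ belongs to $X$ given everything similar to it, and the upper approximation how possibly it does.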
The unacceptable degree of clusters
The validity of a cluster can be well judged by considering the unacceptable degree (UD) of clusters in a base clustering. In multiple base clusterings, if the assignment of a cluster in one base clustering is consistently agreed by other base clusterings, this cluster should play a more critical role in the final consensus clustering. At the same time, if the assignment of a cluster is constantly negated by other base clusterings, the cluster should play a minor role.
For illustration purposes, some formalised descriptions are first introduced below. Let
Since for every fuzzy implicator
As proved in [31], if
Equation (12) implies that the lower approximation of
For different base clusterings, the cluster assignments are distinct, but the data locations are fixed; that is, multiple base clusterings act on the same dataset. For a specific cluster in one base clustering, another base clustering either approves or denies its assignment.
In this paper, a novel concept of UD is proposed to measure cluster reliability. Here, two exemplar artificial datasets

Two exemplar base clusterings β1 and β2 in dataset
The first case is relatively simple, as shown in Fig. 3, including two exemplar base clusterings β1 and β2 in
For the three clusters (
Another situation is shown in Fig. 4. For a particular cluster in β3, if its objects are split into several clusters in another base clustering β4, the cluster can be considered not accepted by β4. Here,

Two exemplar base clusterings β3 and β4 in dataset
More specifically, for a specific cluster in a base clustering, consider the distribution of its objects in another base clustering: if these objects have a larger lower approximation to the clusters they are relocated to in the other base clustering, the objects fit the other base clustering's allocation better, which in turn indicates the extent to which the other base clustering does not accept the assignment of the given cluster.
Objects from
For the example cluster
To illustrate the concepts involved, the objects of the exemplar clusters
Two exemplar clusters in Fig. 4(a)
Relocated cluster of the objects in
Take
Through further calculations, the lower approximations of all objects in β3 to the base clustering β4 are shown in Table 3.
Lower approximation of
Then, the UD of
Considering the ground distribution of
Further, the global UD
The unacceptable degree computing (UDC) algorithm is outlined in Algorithm 1. Given the inputs
Unacceptable Degree Computing (UDC)
Γ, set of unacceptable degree of all clusters,
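The core quantity that UDC evaluates repeatedly is the fuzzy-rough lower approximation of each object to a crisp cluster. The following is a minimal sketch of that computation, using the Łukasiewicz implicator (the implicator adopted in the experiments); the averaged per-attribute similarity relation is an assumption for illustration, as the paper's exact relation definition is elided here:

```python
def fuzzy_similarity(X):
    """Averaged per-attribute fuzzy similarity 1 - |x_a - y_a| / range_a.
    X is a list of real-valued feature vectors (assumed relation form)."""
    n, m = len(X), len(X[0])
    rng = [max(row[a] for row in X) - min(row[a] for row in X) or 1.0
           for a in range(m)]
    return [[sum(1.0 - abs(X[i][a] - X[j][a]) / rng[a] for a in range(m)) / m
             for j in range(n)] for i in range(n)]

def lower_approximation(sim, member):
    """Fuzzy-rough lower approximation of a crisp cluster under the
    Lukasiewicz implicator I(a, b) = min(1, 1 - a + b):
    lower(x) = inf over y of I(sim(x, y), member(y))."""
    n = len(sim)
    b = [1.0 if m else 0.0 for m in member]
    return [min(min(1.0, 1.0 - sim[i][j] + b[j]) for j in range(n))
            for i in range(n)]
```

An object far from every cluster member receives a lower approximation near 0; an object surrounded only by members receives a value near 1, matching the certainty reading of the lower approximation.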

Matrices

Redefined Matrices
The co-association matrix is obtained by averaging a series of co-occurrence matrices; it represents the frequency with which two objects co-occur across multiple base clusterings. Each base clustering β
Given the exemplar clusters in Fig. 4 and corresponding coordinates in Table 1, the matrices
Then, the elements
A higher UD for a cluster indicates that other base clusterings are more likely to disapprove of the cluster's allocation scheme, in which case the cluster's role should be weakened; otherwise, its role should be reinforced. Therefore, the reliability of a cluster can be described as a decreasing function of its UD. In this paper, the reliability of the
Similar to Equations (16), (18) and (19), the redefined co-occurrence matrix
Again, for the example of
Then, the elements
The matrices
Co-Association Matrix Construction (CMC)
Γ, set of unacceptable degree of all clusters,
The co-association matrix construction (CMC) algorithm is detailed in Algorithm 2. First, the initialisation step is performed. Lines 2 to 17 constitute the overall process, including the main loop that builds the co-occurrence and co-association matrices. Note that only the upper triangular parts of all matrices are calculated, due to symmetry. In Line 15, the lower triangular matrix of
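A rough sketch of the weighting idea behind CMC follows: each same-cluster pair contributes the reliability of the shared cluster rather than a plain 1, averaged over the pool. The exponential decay used for the reliability map is one plausible decreasing function, an illustrative stand-in for the paper's elided definition, as are the matrix details:

```python
import math

def reliability(ud, theta=1.0):
    """Illustrative decreasing map from unacceptable degree to reliability
    (assumed form; the paper defines its own decreasing function)."""
    return math.exp(-ud / theta)

def weighted_co_association(base_clusterings, cluster_reliability):
    """Reliability-weighted co-association matrix.

    base_clusterings: list of label sequences, one per base clustering.
    cluster_reliability: one dict per base clustering, mapping a
    cluster label to its reliability in (0, 1].
    """
    n = len(base_clusterings[0])
    T = len(base_clusterings)
    ca = [[0.0] * n for _ in range(n)]
    for labels, rel in zip(base_clusterings, cluster_reliability):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    # weight the co-occurrence by the shared cluster's reliability
                    ca[i][j] += rel[labels[i]] / T
    return ca
```

With all reliabilities equal to 1, this reduces to the plain co-association matrix, which is exactly the behaviour the weighting is meant to generalise.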
A mapping from a set of clusterings to a single final clustering is called a consensus function. Given the superior performance of spectral clustering on data with complex shapes and intersections, in this paper, the optimised co-association matrix is used in a spectral method to acquire the consensus result.
Given a graph model
The Laplacian matrix
Normalisation makes the diagonal entries of the Laplacian matrix all ones and scales the off-diagonal entries correspondingly. In this case, the normalised Laplacian matrix
The components of the eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian can be used for meaningful clustering [42]. In Equation (28), the eigenvectors corresponding to the first
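As a hedged sketch of this step (the similarity matrix fed in would be the paper's redefined co-association matrix, whose equations are elided here), the spectral embedding from a similarity matrix W can be computed with the standard symmetric normalised Laplacian:

```python
import numpy as np

def spectral_embedding(W, k):
    """Rows of the first k eigenvectors of the symmetric normalised
    Laplacian L_sym = I - D^{-1/2} W D^{-1/2}; the consensus step would
    then cluster these rows (e.g. with k-means)."""
    d = W.sum(axis=1)                           # node degrees
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return eigvecs[:, :k]                       # k smallest -> embedding rows
```

For a graph with k well-separated components, the embedded rows of instances in the same component coincide, which is why k-means on these rows recovers the clusters.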
As summarised in Algorithm 3,
Consensus Spectral Clustering (CSC)
According to the descriptions in the above three subsections, the overall fuzzy-rough induced spectral ensemble clustering is depicted in Algorithm 4. Given a dataset
Fuzzy-Rough Induced Spectral Ensemble Clustering (FREC)
This section presents the experimental evaluation of FREC and other algorithms on ten popular datasets from the UCI repository. For convenience, datasets
Benchmark datasets used for evaluation
In the experimental investigation, all datasets are normalised first. The homogeneity score (HS) and normalised mutual information (NMI) are used to evaluate the performance of each clustering method [10, 15]. The base clustering pool

HS results with the ensemble size.
Ten state-of-the-art ensemble clustering algorithms, namely, locally weighted evidence accumulation (LWEA) [14], locally weighted graph partitioning (LWGP) [14], probability trajectory accumulation (PTA) [16], probability trajectory based graph partitioning (PTGP) [16], ensemble clustering by propagating cluster-wise similarities with hierarchical consensus function (ECPCS-HC) [18], ensemble clustering by propagating cluster-wise similarities with meta-cluster-based consensus function (ECPCS-MC) [18], evidence accumulation clustering (EAC) [5], weighted evidence accumulation clustering (WEAC) [15], graph partitioning with multi-granularity link analysis (GPMGLA) [15], and spectral ensemble clustering (SEC) [26], are selected for comparison with the ensemble performance of FREC. Moreover, two other non-ensemble state-of-the-art clustering methods, deep temporal clustering representation (DTCR) [37] and robust temporal feature network (RTFN) [50], are also used to compare the performance of the newly proposed method. For FREC, the Łukasiewicz implicator is used; LWEA, LWGP: θ = 0.4; PTA, PTGP: ECPCS-HC, ECPCS-MC: WEAC, GPMGLA: α = 0.5, β = 2; SEC: μ = 1; DTCR: RTFN: CNNchannel = 128, kernelsize = 11,

NMI results with the increased ensemble size.

NMI results of FREC versus the parallel EAC and base clustering.
Figures 7 and 8 show the experimental results for different ensemble sizes, covering the two indexes HS and NMI on the ten benchmark datasets. The contrastive algorithms are distinguished by colours, line styles and markers, as detailed in the legend. For the proposed FREC, as the ensemble size increases, the results on nearly all datasets show an upward trend regardless of the evaluation criterion, which is in line with the objective of ensemble methods. More specifically, considering HS, FREC shows significantly superior performance on datasets
Under the evaluation index NMI, the results are similar. For
Comparison to the parallel EAC and base clustering
The algorithm EAC, i.e., the original co-association based ensemble scheme that does not consider cluster reliability, consistently performs poorly in both HS and NMI. As exemplified in Fig. 7(b), 7(d), 7(h) and 7(i), EAC delivers the worst results regardless of the ensemble size when compared with the other ten ensemble algorithms. To more comprehensively assess the effect of using the cluster reliability induced by the fuzzy-rough set, this part compares FREC and the parallel EAC in detail in the form of a histogram.
Since all ensemble strategies achieve more satisfactory results at larger ensemble sizes, FREC and EAC use a pool containing 100 base clusterings. Moreover, both algorithms are run 100 times to obtain average results. In addition, the algorithms in the base clustering pool (
HS results of FREC versus other ensemble algorithms
NMI results of FREC versus other ensemble algorithms
To avoid repeating similar results, NMI is used to report the experimental evaluation. As shown in Fig. 9, FREC consistently achieves a better clustering effect than the parallel EAC and the base clusterings on all datasets. Especially for
In order to comprehensively evaluate the performance of the proposed algorithm, experimental comparisons are carried out against the other eleven state-of-the-art methods. The results are summarised in Tables 5 and 6. Note that the results of EAC have been analysed in Section 4.3 and will not be repeated in this part.
Recalling the reported results, ensemble algorithms work best with larger ensemble sizes. Thus, all comparison ensemble methods use the results with an ensemble size of 100, and the average and best results for each dataset are shown in the columns Ave and Best, respectively. For clarity, the best results are highlighted in bold, and the second-best results are underlined to make the tables easier to follow.
Considering the metric HS, FREC achieves the best average performance relative to the other eleven algorithms on most datasets, including
Turning to the NMI evaluation, the clustering results are highly analogous to those for HS. For datasets
In general, the average and best results of FREC are equal in most cases, which means that FREC is relatively stable and its results are less subject to chance. At the same time, whether considering average or best results, FREC always achieves the best or second-best effect, which illustrates the soundness of the research in this paper.
Statistical analysis
Paired
Time complexity analysis
As shown in Algorithm 4, the computing cost of the proposed FREC mainly includes three parts: (1) For UDC, this function mainly consists of three loops with a time complexity of
Time complexity comparison of different algorithms (seconds)
To compare the running time gap with other methods, the running time (in seconds) of each algorithm is reported in Table 7. The experiments use an i7-12700 CPU with 24 GB of memory. It can be seen that, after incorporating the data features, the running time of FREC is significantly higher than that of the other comparison methods, and the gap widens as the number of instances increases. In comparison, methods such as LWEA, SEC, and RTFN achieve better running times. These results show that the time efficiency of FREC is relatively poor and requires further optimisation in subsequent work.
This paper explores the role of cluster reliability, measured using fuzzy-rough sets, in co-occurrence based ensemble clustering, and proposes a fuzzy-rough ensemble approach. The experimental results indicate that the reliability induced by the fuzzy-rough lower approximation is effective and can be reasonably employed in ensemble clustering tasks. Compared with other ensemble algorithms that ignore attributes and employ only base clustering results, FREC demonstrates the advantage of incorporating feature information. Meanwhile, compared with the parallel version and the base clusterings, FREC again shows its superiority.
Nonetheless, the timing experiments show that FREC is relatively slow, mainly due to the algorithm's high time complexity: for datasets with many instances, calculating the lower approximation for each object takes considerable time. In future work, the idea of the KD-tree [44] can be introduced to further improve the running speed of the algorithm. In addition, in Equation (23), directly assigning 0 to the adjacency matrix when two instances do not belong to the same cluster may not be the best choice; further mining the implicit connections between instances of different clusters could help improve clustering performance.
Whilst promising, further work remains. The performance of the ensemble strategy with multi-density cluster designs is worth further exploration. In addition, the implied relationship between objects in the same base clustering but different clusters is a valuable route of investigation. Moreover, a more comprehensive comparison of ensemble methods over diverse real-world datasets, such as mammographic risk assessment [46, 47], would lay the foundation for a broader series of future research issues.
