Abstract
Keywords
Introduction
Cluster analysis, also known as clustering, is a type of unsupervised learning technique used to classify a set of individuals into groups such that individuals in the same group (called a cluster) are more similar to each other than to those in other groups. Such techniques have been widely used in biomedical studies for disaggregating heterogeneous diseases and identifying disease subtypes that may respond to different treatments and inform clinical decisions. Examples include using cluster analysis to identify gastric cancer subtypes to provide novel insights into tumor biology and inform clinical management,1 to identify subtypes of obstructive sleep apnea in children with obesity to distinguish high-risk children for targeted interventions and personalized treatment plans,2 and to identify cardiogenic shock survivor subtypes at intensive care unit discharge that exhibit distinct late host-response patterns and are associated with poor long-term health outcomes.3
Popular clustering methods include
With recent advancements in data science, cluster analysis faces new challenges such as high dimensionality, multi-modality and computational complexity. A typical example is clustering multi-view data. The analytical process of integrating information from different datasets (also known as data views) describing the same set of individuals is known as multi-view learning.12 This process is also known as data integration, which refers to the use of multiple sources of data to provide a better understanding of a phenomenon, such as a disease. Such datasets may be of different types, from different sources, with different data structures and following different distributions. These datasets are often large in size, high-dimensional, sparse, incomplete, heterogeneous and noisy. A significant amount of methodological work has been carried out in the field of multi-view learning and data integration,12,13 and new integrative clustering algorithms dealing with multiple views have been developed.14 These methods include joint latent variable models,15,16 similarity network fusion,17 the joint and individual variation explained approach18 and graphical models.19 In particular, several Bayesian approaches for integrative clustering have been developed, including Bayesian joint latent variable models,20 Bayesian correlated clustering21 and Bayesian consensus clustering.22 Notably, model-based clustering approaches with feature selection via fast Bayesian variational inference for continuous data have emerged.23–25
Despite recent advancements, many integrative clustering methods remain limited to a single data type (e.g. continuous), do not allow feature selection or are computationally demanding, while computationally efficient approaches that simultaneously perform clustering and feature selection on mixed-type data (e.g. continuous, categorical, and count) are still underdeveloped. To this end, the main contribution of the present study is three-fold: (a) we developed a fast model-based variational Bayesian clustering approach, iClusterVB, for integrative cluster analysis and feature selection in high-dimensional settings for mixed-type multi-view data, (b) we evaluated the performance of iClusterVB compared to existing methods, and demonstrated its utility using simulated data and real data examples under various scenarios, and (c) we developed a user-friendly R package, iClusterVB, to facilitate the application of the proposed approach.
The rest of the article is organized as follows: in Section 2, we provide a methodological background of the proposed iClusterVB approach for clustering mixed-type multi-view data, permitting feature selection. In Section 3, we describe a variational Bayesian inference to approximate the posterior distribution of the proposed model. In Section 4, we describe simulation studies to evaluate the performance of iClusterVB and compare its performance to several competing integrative clustering methods. In Section 5, we apply the iClusterVB to three real data examples to demonstrate its utility. Finally, in Section 6, we discuss our findings and conclude our study.
Model-based clustering with feature selection
In this section, we first provide an overview of the finite mixture model as a model-based approach for cluster analysis. We then introduce the iClusterVB to perform clustering and feature selection within the framework of the finite mixture model (Figure 1).

Schematic diagram for integrative analysis of multi-view data via variational Bayesian clustering.
Let
Furthermore, we assume that conditional on the cluster membership
In this subsection, we extend the finite mixture model defined previously to allow feature selection. In cluster analysis with a large number of features, it is often assumed that only a small portion of features are informative, while the remaining features are random noise and do not contribute to the clustering process. Appropriately removing these noise features reduces the complexity of the final model, enhances its stability and robustness in identifying the true cluster structure, and eases interpretation. To this end, we developed a model-based clustering approach with feature selection for mixed data types, building on a previous study,24 which identifies features that are relevant to the overall mixture distribution. The methodological background is briefly described in the subsection below.
Let
Distributions for different types of features.
Given the model defined in (5), consider the assignment of cluster membership for an individual
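Although the display referenced as (5) did not survive extraction, the assignment step for a generic finite mixture takes the standard form below; the symbols here ($\pi_k$ for the mixing weights, $f_k$ for the component density of cluster $k$, and $z_i$ for the membership of individual $i$) are illustrative notation rather than the paper's exact display:

```latex
\Pr(z_i = k \mid \mathbf{x}_i)
  = \frac{\pi_k \, f_k(\mathbf{x}_i \mid \boldsymbol{\theta}_k)}
         {\sum_{l=1}^{K} \pi_l \, f_l(\mathbf{x}_i \mid \boldsymbol{\theta}_l)},
  \qquad k = 1, \ldots, K.
```

Each individual is then typically assigned to the cluster with the largest posterior membership probability.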
Variational Bayesian inference for posterior approximation
We developed a variational Bayesian inference algorithm to estimate the proposed model. In this section, we first describe the prior specifications for the model parameters, followed by the variational inference algorithm used in the proposed iClusterVB approach.
Specification of prior distributions
We used the following prior distributions for the model parameters. To reduce the model complexity and to ease computation, we considered imposing prior distributions only on the mixing weights
In the present study, we used weakly informative priors for all parameters of interest; therefore, the influence of the prior distributions on the posterior distributions of the model parameters was minimized. For the prior distribution of
Mean field variational Bayesian inference for iClusterVB
Under the Bayesian framework, estimation of the posterior distributions is commonly performed via Markov chain Monte Carlo (MCMC). However, MCMC can be infeasible with a large number of observations: it may require massive computing resources, converge too slowly to be practically useful, or even approximate an entirely wrong posterior. Variational Bayesian inference is a promising alternative to MCMC for fast Bayesian inference.27
The key idea of variational inference is to cast the problem of finding the true posterior distribution into an optimization problem. Specifically, for data
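The displayed equations for this argument were lost in extraction; in the standard formulation, with $\mathbf{y}$ denoting the observed data and $\boldsymbol{\theta}$ the latent quantities (illustrative notation), the decomposition underlying variational inference is:

```latex
\log p(\mathbf{y})
  = \underbrace{\mathbb{E}_{q}\big[\log p(\mathbf{y},\boldsymbol{\theta})\big]
    - \mathbb{E}_{q}\big[\log q(\boldsymbol{\theta})\big]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\big(q(\boldsymbol{\theta}) \,\big\|\, p(\boldsymbol{\theta} \mid \mathbf{y})\big).
```

Because the KL divergence is non-negative and $\log p(\mathbf{y})$ does not depend on $q$, maximizing the evidence lower bound (ELBO) over the variational family is equivalent to minimizing the KL divergence between $q$ and the true posterior.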
Based on the principles of mean-field variational inference, we derived the updates of the variational parameters for the proposed iClusterVB. We first initialized the variational distributions, then iterated each factor in turn and updated its current estimate with its optimal solution given the current estimates for the other factors. Algorithm 1 provides a pseudo-code sketching the computing process of the iClusterVB based on the CAVI algorithm. Notably, the algorithm involves calculating the terms of
The convergence of the algorithm is monitored by inspecting the change in the evidence lower bound (ELBO). To avoid local maxima and the potential influence of the initial values, we ran the model with 10 different sets of initial values and chose the run with the largest ELBO as the final result.
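As a concrete (and deliberately simplified) illustration of this scheme, the sketch below runs CAVI for a one-dimensional Gaussian mixture with unit component variances, monitors the ELBO for convergence, and keeps the highest-ELBO fit across random restarts. It is a minimal Python analogue of the procedure, not the iClusterVB implementation; all function names and hyperparameters are illustrative.

```python
import numpy as np

def cavi_gmm(x, K, prior_var=100.0, n_iter=200, seed=0):
    """CAVI for a 1-D Gaussian mixture with unit component variances.

    Toy model (not the paper's multi-view model):
        mu_k ~ N(0, prior_var),  c_i ~ Cat(1/K),  x_i | c_i = k ~ N(mu_k, 1).
    Mean-field family: q(mu_k) = N(m_k, s2_k), q(c_i) = Cat(phi_i).
    """
    rng = np.random.default_rng(seed)
    m = rng.choice(x, size=K, replace=False)  # init means at random data points
    s2 = np.ones(K)
    elbo_old = -np.inf
    for _ in range(n_iter):
        # Update q(c_i): phi_ik proportional to exp(E[mu_k] x_i - E[mu_k^2]/2)
        logits = np.outer(x, m) - 0.5 * (s2 + m**2)
        logits -= logits.max(axis=1, keepdims=True)
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Update q(mu_k) given the current responsibilities
        nk = phi.sum(axis=0)
        s2 = 1.0 / (1.0 / prior_var + nk)
        m = s2 * (phi * x[:, None]).sum(axis=0)
        # ELBO up to an additive constant, used only to monitor convergence
        elbo = np.sum(-0.5 * (m**2 + s2) / prior_var + 0.5 * np.log(s2))
        elbo += np.sum(phi * (np.outer(x, m) - 0.5 * (s2 + m**2)))
        elbo -= np.sum(phi * np.log(phi + 1e-12))
        if abs(elbo - elbo_old) < 1e-8:
            break
        elbo_old = elbo
    return m, phi, elbo

def best_of_restarts(x, K, n_restarts=10):
    """Rerun CAVI from several initial values; keep the highest-ELBO fit."""
    fits = [cavi_gmm(x, K, seed=s) for s in range(n_restarts)]
    return max(fits, key=lambda f: f[2])
```

Each coordinate update holds the other variational factors fixed at their current estimates, mirroring the iteration described above; the multiple-restart wrapper corresponds to the strategy of choosing the run with the largest ELBO.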
A notable advantage of the variational Bayes approach is the automatic determination of the optimal number of clusters. One can deliberately over-fit the model by setting a large number of clusters, and the algorithm will converge to a model composed of the dominant clusters, with the redundant clusters removed. This automates the process of determining the number of clusters and avoids refitting the model multiple times with different values of
Under a Dirichlet prior with hyperparameter
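This pruning step can be expressed as a simple threshold on the estimated mixing weights. The sketch below is a minimal illustration; the cut-off value is a placeholder rather than the paper's setting.

```python
import numpy as np

def effective_clusters(mixing_weights, cutoff=0.05):
    """Count the clusters whose estimated mixing weight exceeds a cut-off.

    In an over-fitted mixture, redundant components shrink toward zero
    weight; the remaining dominant components give the selected number
    of clusters. `cutoff` here is illustrative only.
    """
    w = np.asarray(mixing_weights, dtype=float)
    keep = w >= cutoff
    return int(keep.sum()), np.flatnonzero(keep)
```

For example, a fit with estimated weights (0.4, 0.35, 0.2, 0.03, 0.02) would retain three dominant clusters under a 0.05 cut-off.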
Simulation studies
In this section, we performed two simulation studies to evaluate the performance of the iClusterVB. In Simulation I, we compared the clustering and feature selection performance of iClusterVB with several other clustering approaches that are applicable to mixed data types. In Simulation II, we evaluated the performance of iClusterVB under violations of model assumptions, specifically when the independence between features did not hold or when the normality assumption was violated.
Simulation setup
In Simulation I, two different scenarios were considered. In Scenario 1, clusters were well-separated and, therefore, were not difficult to identify. In Scenario 2, clusters were poorly separated, which may pose challenges in identifying the optimal number of clusters and the cluster membership. For each scenario, we generated
For Scenario 1 (well-separated clusters), data views were generated as follows. For data view 1 (continuous features), the relevant features were generated from a normal distribution, with
For Scenario 2 (poorly separated clusters), data views were generated as follows. For data view 1 (continuous features), the relevant features were generated from a normal distribution, with
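Since the exact simulation parameters did not survive extraction, the sketch below generates mixed-type multi-view data of the same general shape: cluster-specific distributions for a small set of relevant features and a shared noise distribution for the rest. All dimensions, means, probabilities, and rates are illustrative, not the paper's values.

```python
import numpy as np

def simulate_multiview(n=240, K=4, p=100, p_rel=20, seed=1):
    """Simulate mixed-type multi-view data with cluster structure.

    Returns equal-sized cluster labels z and three views: continuous
    (normal), binary (Bernoulli), and count (Poisson). Only the first
    p_rel features in each view are informative; the rest are noise.
    """
    rng = np.random.default_rng(seed)
    z = np.repeat(np.arange(K), n // K)      # equal cluster sizes
    mu = np.linspace(-3, 3, K)               # cluster means (continuous view)
    pr = np.linspace(0.1, 0.9, K)            # cluster probabilities (binary view)
    lam = np.linspace(2, 10, K)              # cluster rates (count view)
    cont = rng.normal(0, 1, (n, p))          # noise features share one distribution
    binv = rng.binomial(1, 0.5, (n, p)).astype(float)
    cnt = rng.poisson(5, (n, p)).astype(float)
    for k in range(K):
        idx = z == k                          # overwrite relevant features per cluster
        cont[idx, :p_rel] = rng.normal(mu[k], 1, (idx.sum(), p_rel))
        binv[idx, :p_rel] = rng.binomial(1, pr[k], (idx.sum(), p_rel))
        cnt[idx, :p_rel] = rng.poisson(lam[k], (idx.sum(), p_rel))
    return z, {"continuous": cont, "binary": binv, "count": cnt}
```

The same template covers both scenarios: well-separated clusters correspond to widely spaced cluster parameters, and poorly separated clusters to parameters drawn closer together.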
In Simulation II, two continuous data views were generated to evaluate the robustness of iClusterVB under deviations from normality and independence. Data were simulated from a skew-normal (SN) distribution, with 500 features per view generated using the same location and scale parameters as in data views 1 and 2 of Scenarios 1 and 2 in Simulation I. The shape parameter
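Skew-normal draws can be generated without specialized libraries via Azzalini's representation; the sketch below is illustrative (the shape-parameter values used in the paper were lost in extraction, so the default here is a placeholder).

```python
import numpy as np

def rvs_skew_normal(alpha, loc, scale, size, rng):
    """Sample from a skew-normal SN(loc, scale, alpha) via Azzalini's
    representation: X = loc + scale * (delta*|Z0| + sqrt(1 - delta^2)*Z1),
    with delta = alpha / sqrt(1 + alpha^2) and Z0, Z1 iid N(0, 1).
    alpha = 0 recovers the normal distribution; larger |alpha| gives
    stronger skewness."""
    delta = alpha / np.sqrt(1.0 + alpha**2)
    z0 = np.abs(rng.normal(size=size))
    z1 = rng.normal(size=size)
    return loc + scale * (delta * z0 + np.sqrt(1.0 - delta**2) * z1)

def simulate_skewed_view(n, p, locs, alpha=5.0, scale=1.0, seed=2):
    """An n x p view with cluster-specific skew-normal features
    (equal cluster sizes; locs gives one location per cluster)."""
    rng = np.random.default_rng(seed)
    K = len(locs)
    z = np.repeat(np.arange(K), n // K)
    X = np.empty((n, p))
    for k, loc in enumerate(locs):
        X[z == k] = rvs_skew_normal(alpha, loc, scale, ((z == k).sum(), p), rng)
    return z, X
```

Varying `alpha` across runs reproduces the design of increasing departures from normality while keeping the location and scale parameters fixed.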
Implementation and performance metrics
The proposed iClusterVB was implemented via our newly developed
To measure the model performance, we computed the following performance measures: (a) accuracy of the cluster number (ACN), calculated as the percentage of times the model correctly identified the true number of clusters
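For reference, the feature selection metrics can be computed directly from the indicator of selected features and the ground-truth indicator; the helper below is a minimal sketch (the function name is ours, not from an existing package).

```python
import numpy as np

def feature_selection_metrics(selected, truth):
    """TPR, TNR, and overall feature selection accuracy (FSA).

    `selected` and `truth` are boolean arrays over features: True marks
    a feature flagged as informative (by the model / by construction).
    """
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tpr = (selected & truth).sum() / truth.sum()        # informative features kept
    tnr = (~selected & ~truth).sum() / (~truth).sum()   # noise features discarded
    fsa = (selected == truth).mean()                    # overall agreement
    return tpr, tnr, fsa
```

For example, if two of four features are truly informative and the model selects only the first of them, this gives TPR = 0.5, TNR = 1.0, and FSA = 0.75.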
Simulation results
The results of Simulation I comparing the methods are presented in Table 2. Across both scenarios, iClusterVB demonstrated superior clustering and feature selection performance, as well as computational efficiency, compared with existing methods. For clustering performance, iClusterVB achieved near-perfect recovery of the true number of clusters (ACN
Results from Simulation I comparing the performance of iClusterVB with existing approaches for equal cluster size ( ) under Scenarios 1 and 2.
Results are presented as mean (SD) over 50 replications.
Abbreviations for methods: iClusterVB: integrative clustering based on variational Bayes using the iClusterVB package; iCluster+: Bayesian joint latent variable model for integrative clustering using the iClusterPlus package; iClusterBayes: Bayesian integrative clustering based on the MCMC algorithm using the iClusterPlus package; VarSelLCM: model-based clustering with variable selection based on integrated complete-data likelihood using the VarSelLCM package; MOFA: multi-omics factor analysis using the MOFA2 package (a model-based approach, via the mclust package, was used for clustering); CIMLR: cancer integration via multikernel learning via the CIMLR package; SNF: similarity network fusion, via the SNFtool package.
Abbreviations for performance indices: ACN: accuracy of cluster number; ARI: adjusted Rand index; TNR: true negative rate; TPR: true positive rate; FSA: overall feature selection accuracy. ARI, FSA, TPR, and TNR were calculated under the true number of clusters ( ).
The number of clusters for iClusterVB was determined by using a cut-off.
FSA, TPR, and TNR were not reported for MOFA, CIMLR, and SNF, respectively.
In addition to model performance, iClusterVB was faster than other model-based methods. Its runtime was 92 s in the well-separated scenario and 75 s in the poorly separated scenario, compared with substantially longer runtimes for iClusterBayes (around 531 s) and iCluster+ (483–615 s). Although MOFA, CIMLR, and SNF were computationally faster, their ACNs were markedly lower, and these methods do not simultaneously perform feature selection. Importantly, when the number of clusters is treated as unknown and multiple candidate models must be fitted to determine the optimal solution, the computational burden of the alternative methods increases further. In contrast, iClusterVB automatically determines the number of clusters, thereby avoiding this issue and further enhancing its efficiency. Overall, these findings suggest that iClusterVB achieves excellent clustering and feature selection performance as well as computational efficiency.
In addition, we evaluated whether the number of clusters determined by iClusterVB is sensitive to the choice of a cut-off
The results of Simulation II for Scenarios 1 and 2 are presented in Tables 3 and 4, respectively. In Scenario 1 (Table 3), iClusterVB consistently achieved near-perfect clustering and feature selection performance across skew-normal distributions with varying levels of skewness (
Results
In this section, we demonstrated the utility of the proposed iClusterVB approach by applying it to three real data examples: glioblastoma multiforme, acute myeloid leukemia, and lung cancer.
Glioblastoma multiforme data
The first example was from a glioblastoma multiforme (GBM) dataset. GBM is the most common and aggressive type of primary brain tumor in adults. It belongs to a group of tumors called gliomas, which originate from glial cells in the brain. GBM is characterized by its rapid growth, infiltrative nature, and high likelihood of recurrence. 34 The Cancer Genome Atlas (TCGA) Research Network 35 was established to generate a comprehensive catalog of genomic abnormalities driving tumorigenesis, and it provides a detailed view of the genomic changes in a large GBM cohort. Earlier studies have shown that GBM is heterogeneous and tumor subtypes have differential responses to therapies. 36 Therefore, appropriately identifying subtypes of GBM patients is crucial for understanding tumor heterogeneity, tailoring personalized treatment strategies, and accurately predicting prognosis. The current analysis included GBM patients with measurements on three views, namely, DNA copy number (continuous), messenger RNA (mRNA) expression (continuous) and somatic mutation (binary). Briefly, a somatic mutation is a genetic alteration that occurs in a cell of the body that is not passed down to offspring. Somatic mutations can disrupt the normal control mechanisms that regulate cell growth, division, and death, leading to uncontrolled proliferation and the formation of tumors. DNA copy number refers to the number of copies of a particular DNA sequence within the genome of an organism or a cell. Tumor cells often exhibit abnormal DNA copy number profiles, including amplifications (increased copies) and deletions (decreased copies) of specific genomic regions. mRNA expression refers to the process by which genes in a cell are transcribed into messenger RNA (mRNA) molecules, which then serve as templates for protein synthesis. Alterations in mRNA expression patterns are commonly observed in cancer cells compared to normal cells and can contribute to the dysregulation of cellular processes.
A total of 84 GBM patients were included in the current analysis. This dataset is available in the
Given a small sample size in this study, the maximum number of clusters was set to be

Analysis results for glioblastoma multiforme (GBM) dataset using the iClusterVB. (a) Heatmap for DNA copy number data based on all 5512 features. The row represents all genetic features and the column represents patients ordered by clusters. (b) Heatmap for mRNA expression data by clusters derived from iClusterVB based on all 1740 features. The row represents all genetic features and the column represents patients ordered by clusters. (c) Heatmap for DNA copy number based on 1296 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (d) Heatmap for mRNA expression data by clusters derived from iClusterVB based on 41 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (e) Probability of feature inclusion of DNA copy number. The dashed reference line in red indicates a probability of 0.5. (f) Probability of feature inclusion of mRNA expression. The dashed reference line in red indicates a probability of 0.5. (g) K-M curves for survival probabilities by clusters derived from iClusterVB.
Furthermore, Cluster 1 had significantly better survival than Cluster 2 (
The second example was from an acute myeloid leukemia (AML) dataset. AML is a malignant disease caused by the abnormal proliferation and differentiation of myeloid stem cells in the bone marrow. This process results in the arrest of cell development within the myeloid lineage, leading to an abnormal accumulation of blasts, or immature cells. Notably, the heterogeneity of AML at the genetic and molecular levels poses significant challenges for diagnosis, prognosis, and treatment.

The AML datasets were downloaded using the cBioPortal for Cancer Genomics tool.42 Two data views were included in the present study, namely the gene expression data (continuous) and the mutation data (presence vs. absence of mutation). We preprocessed the data following the procedures of a previous study.43 Specifically, for gene expression data, the 500 genes with the highest rank-based coefficients of variation and standard deviations across the samples were used (Figure 3(a)). For mutation data, mutations that appeared in at least two different individuals were chosen, resulting in 156 genes (Figure 3(b)). Finally, only samples that had both gene expression data and mutation data were included, resulting in 170 samples. This dataset is included in the

Analysis results for acute myeloid leukemia (AML) data using the iClusterVB. (a) Heatmap for gene expression based on all 500 features. The row represents all genetic features and the column represents patients ordered by clusters. (b) Heatmap for gene mutation data by clusters derived from iClusterVB based on all 156 features. The row represents all genetic features and the column represents patients ordered by clusters. (c) Heatmap for gene expression based on 326 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (d) Heatmap for gene mutation data by clusters derived from iClusterVB based on one feature selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (e) Probability of feature inclusion of gene expression. The dashed reference line in red indicates a probability of 0.5. (f) Probability of feature inclusion of gene mutation. The dashed reference line in red indicates a probability of 0.5. (g) Kaplan–Meier (K-M) curves for survival probabilities by clusters derived from iClusterVB.
We applied the iClusterVB to identify AML subtypes with the maximum number of clusters
The third example was from the lung cancer data.45
This dataset includes three data views on a continuous scale, namely gene expression data for 12,042 genes (Figure 4(a)), DNA methylation information for 23,074 loci (Figure 4(b)), and miRNA expression profiles for 352 miRNAs (Figure 4(c)), collected from 106 patients with lung cancer. No further preprocessing was performed. The original datasets were sourced from The Cancer Genome Atlas repository and can be downloaded from https://github.com/evelinag/clusternomics. The goal of this analysis was to identify lung cancer subtypes based on these three data views. Therefore, we ran iClusterVB with a maximum number of clusters

Analysis results for the lung cancer dataset using the iClusterVB. (a) Heatmap for gene expression data based on all 12,042 features. The row represents all genetic features and the column represents patients ordered by clusters. (b) Heatmap for methylation data by clusters derived from iClusterVB based on all 23,074 features. The row represents all genetic features and the column represents patients ordered by clusters. (c) Heatmap for miRNA expression data by clusters derived from iClusterVB based on all 352 features. The row represents all genetic features and the column represents patients ordered by clusters. (d) Heatmap for gene expression data by clusters derived from iClusterVB based on 438 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (e) Heatmap for methylation data by clusters derived from iClusterVB based on 14,038 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (f) Heatmap for miRNA data by clusters derived from iClusterVB based on 20 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (g) Probability of feature inclusion of gene expression. The dashed reference line in red indicates a probability of 0.5. (h) Probability of feature inclusion of methylation feature. The dashed reference line in red indicates a probability of 0.5. (i) Probability of feature inclusion of miRNA expression feature. The dashed reference line in red indicates a probability of 0.5. (j) Kaplan–Meier (K-M) curves for survival probabilities by clusters derived from iClusterVB.
Among the 106 patients included in the current analysis, iClusterVB identified three non-empty clusters, with sizes of 12 (11%), 33 (31%), and 61 (58%), suggesting heterogeneous subgroups among the patients with lung cancer. Additionally, the model identified 438 informative features out of 12,042 in the gene expression data (Figure 4(d)), 14,038 out of 23,074 in the DNA methylation data (Figure 4(e)), and 20 out of 352 in the miRNA expression data (Figure 4(f)), each surpassing a posterior inclusion probability threshold of 0.5, which clearly showed distinct gene expression, methylation and miRNA expression profiles among the patients. In particular, among the 20 identified miRNA expression features, the five with the highest posterior feature inclusion probability were
The survival probabilities differed significantly across the three clusters (
Recent advances in engineering have enabled researchers to collect a large number of features from different data views on the same set of samples. Integrative analysis offers the opportunity to uncover biological mechanisms across different data views. The iClusterVB accommodates mixed-type (e.g. continuous, categorical, and count) features, allows for feature selection, quantifies feature importance through posterior feature inclusion probabilities and does not require refitting the model to determine the optimal number of clusters. In particular, the posterior feature inclusion probabilities can be used as a measure for feature ranking. While we use omics data as an illustrative example, the iClusterVB may also be applied to clustering other types of data, such as electronic health records. 48
Although existing integrative clustering methods are powerful, they are not designed to simultaneously and fully address the major practical challenges in real data applications, including selecting the appropriate number of clusters, identifying the features that drive clustering, quantifying uncertainty in feature selection, and ensuring computational efficiency. For example, MOFA, CIMLR, and SNF are computationally efficient methods that allow integrative clustering of data views of different types; however, they offer neither built-in feature selection nor quantification of feature selection uncertainty. Of note, while MOFA produces interpretable feature loadings, it does not perform explicit feature selection or provide uncertainty quantification of selected features in the context of integrative clustering. While iClusterBayes and iCluster+ are effective methods for integrative clustering and feature selection, they are computationally demanding, particularly when applied to data with a large sample size. VarSelLCM is a computationally efficient method for integrative clustering and feature selection; however, its feature selection performance was sub-optimal in our simulation studies. The proposed iClusterVB offers an efficient and effective alternative for multi-view data of different types (e.g. continuous, categorical, and count). It quantifies the uncertainty of feature selection through posterior inclusion probabilities and provides a faster alternative to other model-based methods such as iClusterBayes and iCluster+. Furthermore, iClusterVB allows automatic selection of the number of clusters, without re-fitting the model to identify the optimal number. In our simulation studies, iClusterVB demonstrated strong performance in determining the true number of clusters, identifying cluster membership, and selecting important features, while remaining computationally efficient.
To the best of our knowledge, existing R packages that implement clustering via the variational Bayes approach are scarce. Notably,
While iClusterVB offers significant advantages for clustering and feature selection in high-dimensional datasets, it has notable limitations. One key limitation is its reliance on the assumption of normality for continuous features, which may not hold in certain applications. Our simulation studies suggest that when the clusters are well-separated, iClusterVB is robust and able to recover the true clusters and relevant features even when the features are correlated or the distributional assumption is violated. However, if the clusters are poorly separated, iClusterVB may fail to identify the correct number of clusters under strong correlation between features or when the distributional assumption is violated. Additionally, compared to MCMC and Gibbs sampling techniques, VB algorithms are prone to underestimating uncertainty.27 However, if the primary interest is clustering rather than parameter estimation, this limitation may be less relevant. Finally, it should be noted that the algorithm can be sensitive to the initial cluster allocation; thus, running the model with multiple sets of initial values is recommended.
Missing data and zero inflation are common challenges in multi-omics data integration. Missing values can arise from poor tissue quality, technical limitations, or subject dropout. Many existing integrative clustering methods require complete data, leading researchers to discard omics features or samples with missing values. This approach may introduce bias and reduce statistical power. To address this, several methods have been developed that explicitly model missing data, such as joint-imputation strategies and optimization-masking techniques. 50 Another frequent issue is zero inflation, which refers to an excess of zeros in the data. This type of data often results from biological entities being present at levels below detection thresholds. Statistical models to account for zero inflation in omics data are an active area of development, and methods such as Bayesian model selection have been proposed to model this type of data. 51 Future versions of iClusterVB aim to incorporate efficient strategies for handling both missing data and zero inflation to improve its robustness and flexibility.
Additionally, future studies comparing iClusterVB to other integrative clustering approaches would be of great interest. 52 Also, the iClusterVB could be further improved using sparse Bayesian variational inference via coreset53,54 and stochastic variational inference 55 for analyzing data views with massive sample size and ultra-high dimensional features. Moreover, as an alternative to the feature selection property in iClusterVB, a prior distribution such as spike-and-slab and shrinkage priors 56 may be considered to induce sparsity. Finally, the variational Bayes method for analyzing longitudinal data 57 is emerging, and extending the iClusterVB to clustering longitudinal data would be of interest, for example, in the context of multivariate,58–60 multi-source, 61 or high-dimensional longitudinal data modeling. 62
Conclusion
The iClusterVB is a flexible approach for fast integrative clustering and feature selection in high-dimensional multi-view data. Its key strengths include clustering of mixed-type features, feature selection and automatic determination of the optimal number of clusters, making it a useful tool for addressing practical challenges in integrative clustering.
Software
The newly developed R package, iClusterVB, implements the proposed approach.
Supplemental Material
Supplemental material, sj-pdf-1-smm-10.1177_09622802251406584 for A fast integrative clustering and feature selection approach for high-dimensional multiview data by Abdalkarim Alnajjar, Helen Bian and Zihang Lu in Statistical Methods in Medical Research
Footnotes
Funding
Declaration of conflicting interests
Data availability statement
Supplemental material
References