Sage Journals: Discover world-class research

Abstract

Cluster analysis has been widely used in biomedical studies for disaggregating heterogeneous diseases and identifying disease subtypes that may inform clinical decisions. In the era of advanced data science and engineering, cluster analysis faces new challenges due to high dimensionality, multimodality and computational complexity. In the present study, we propose a fast integrative clustering approach based on variational Bayesian inference, called iClusterVB. The iClusterVB enables the integration of multiple datasets into the clustering process while performing feature selection in high-dimensional settings for mixed data types, including continuous, categorical, and count data. Simulation studies are performed to compare the performance of iClusterVB with six competing methods and highlight its advantages. Additionally, iClusterVB is applied to three real-life studies to demonstrate its utility in identifying important features and cancer subtypes that are associated with distinct survival probabilities. A user-friendly R package iClusterVB and a tutorial are developed to implement the proposed approach.

Keywords

Feature selection finite mixture model high-dimensional data multi-view data integrative clustering variational inference

1. Introduction

Cluster analysis, also known as clustering, is a type of unsupervised learning technique used to classify a set of individuals into groups such that individuals in the same group (called a cluster) are more similar to each other than to those in other groups. Such a technique has been widely used in biomedical studies for disaggregating heterogeneous diseases and identifying disease subtypes that may respond to different treatments and inform clinical decisions. Examples include using cluster analysis to identify gastric cancer subtypes to provide novel insights into tumor biology and inform clinical management,¹ to identify subtypes of obstructive sleep apnea in children with obesity to distinguish high-risk children for targeted interventions and personalized treatment plans,² to identify cardiogenic shock survivor subtypes at intensive care unit discharge with distinct late host-response patterns and are associated with poor long-term health outcomes.³ Popular clustering methods include $k$ -means clustering algorithm⁴ and hierarchical clustering algorithm,⁵ which are based on a distance metric to measure the dissimilarity or similarity between individual data points. Alternatively, model-based clustering via finite mixture model^6,7 is a powerful tool as it allows the full probabilistic formulation of the data-generating process, explicit parametrizations and uncertainty quantification which are essential in many medical research studies. Reviews and comparisons of commonly used clustering methods have been published previously.^8–11

With recent advancements in data science, cluster analysis faces new challenges such as high dimensionality, multi-modality and computational complexity. A typical example is clustering multi-view data. The analytical process of integrating information from different datasets (also known as data views) describing the same set of individuals is known as multi-view learning.¹² This process is also known as data integration, which refers to the use of multiple sources of data to provide a better understanding of a phenomenon, such as a disease. Such datasets may be of different types, from different sources, with different data structures and following different distributions. These datasets are often large in size, high-dimensional, sparse, incomplete, heterogeneous and noisy. A significant amount of methodological work has been carried out in the field of multi-view learning and data integration,^12,13 and new integrative clustering algorithms dealing with multiple views have been developed.¹⁴ These methods include joint latent variable models,^15,16 similarity network fusion,¹⁷ joint and individual variation explained approach¹⁸ and graphical models.¹⁹ In particular, several Bayesian approaches for integrative clustering have been developed, including Bayesian joint latent variable models,²⁰ Bayesian correlated clustering²¹ and Bayesian consensus clustering.²² Notably, model-based clustering approaches with feature selection via fast Bayesian variational inference for continuous data have emerged.^23–25

Despite recent advancements, many integrative clustering methods remain limited to a single data type (e.g. continuous), do not allow feature selection or are computationally demanding, while computationally efficient approaches that simultaneously perform clustering and feature selection on mixed-type data (e.g. continuous, categorical, and count) are still underdeveloped. To this end, the main contribution of the present study is three-fold: (a) we developed a fast model-based variational Bayesian clustering approach, iClusterVB, for integrative cluster analysis and feature selection in high-dimensional settings for mixed-type multi-view data, (b) we evaluated the performance of iClusterVB compared to existing methods, and demonstrated its utility using simulated data and real data examples under various scenarios, and (c) we developed a user-friendly R package, called iClusterVB, to facilitate the implementation of this newly developed approach.

The rest of the article is organized as follows: in Section 2, we provide a methodological background of the proposed iClusterVB approach for clustering mixed-type multi-view data, permitting feature selection. In Section 3, we describe a variational Bayesian inference to approximate the posterior distribution of the proposed model. In Section 4, we describe simulation studies to evaluate the performance of iClusterVB and compare its performance to several competing integrative clustering methods. In Section 5, we apply the iClusterVB to three real data examples to demonstrate its utility. Finally, in Section 6, we discuss our findings and conclude our study.

2. Model-based clustering with feature selection

In this section, we first provide an overview of the finite mixture model as a model-based approach for cluster analysis. We then introduce the iClusterVB to perform clustering and feature selection within the framework of the finite mixture model (Figure 1).

Figure 1.

Schematic diagram for integrative analysis of multi-view data via variational Bayesian clustering.

2.1. Finite mixture model

Let $X = (x_{1}, \dots, x_{N})^{⊤}$ denote a matrix including all observed data points, where $x_{i} = (x_{i 1}, \dots, x_{i p})^{⊤}$ denotes a $p$ -dimensional vector of features (variables) of interest, for individual $i$ , where $i = 1, \dots, N$ . Note that $x_{i j}$ can be data of different types, such as continuous, categorical or count. In particular, if $x_{i j}$ is a categorical variable, we assume $x_{i j}$ have $M_{j}$ categories, where $M_{j} > 1$ . Assuming the population of interest is heterogeneous and consists of several subpopulations, a finite mixture model²⁶ with $K$ clusters (i.e. subpopulations) can be written as follows: $P (x_{i} | π, Θ) = \sum_{k = 1}^{K} π_{k} P (x_{i} | ϕ_{k})$ (1)where $π = (π_{1}, \dots, π_{K})$ are the cluster weights ( $π_{k} \geq 0$ , $\sum_{k = 1}^{K} π_{k} = 1$ ), and $Θ = (ϕ_{1}, \dots, ϕ_{K})$ , where $ϕ_{k}$ are cluster-specific parameters characterizing the distribution of cluster $k$ , for $k = 1, \dots, K$ . Notably, the cluster weights can also be written as $π_{k} = P (c_{i} = k)$ , where $k = 1, \dots, K$ , and $c_{i} = 1, \dots, K$ is the cluster membership of individual $i$ .

Furthermore, we assume that conditional on the cluster membership $k$ , all features involved in the cluster analysis are independent, that is, $P (x_{i} | ϕ_{k}) = \prod_{j = 1}^{p} P (x_{i j} | ϕ_{k})$ . This assumption has been used in other Bayesian clustering approaches to reduce model complexity.²⁰ For computational convenience, the model in (1) can be re-parametrized by introducing a vector of latent variables $z_{i} = (z_{i 1}, \dots, z_{i K})^{⊤}$ denoting the cluster membership, where $z_{i k} = 1$ if individual $i$ belongs to cluster $k$ , 0 otherwise. Then the model can be further written as follows: $P (x_{i} | z, Θ) = \prod_{k = 1}^{K} \prod_{j = 1}^{p} P (x_{i j} | ϕ_{k})^{z_{i k}}$ (2)where the conditional distribution of $z_{i}$ is a multinomial distribution with parameters $π$ , that is, $z_{i} | π \sim multinomial (1; π)$ . We assume that $P (x_{i j} | ϕ_{k})$ is a distribution from the exponential family. Specifically, we focus on the following three types of data that are commonly seen in biomedical studies, namely continuous, categorical features (with two or more levels), and discrete (count) features.

2.2. Feature selection for finite mixture model

In this subsection, we extended the finite mixture model defined previously to allow feature selection. In cluster analysis with a large number of features, it is often assumed that only a small portion of features are informative, while the other features are random noise and do not contribute to the clustering process. Appropriately removing these noise features reduces the complexity of the final model, enhances its stability and robustness to identify the true cluster structure, and eases interpretation. To this end, we developed a model-based clustering approach with feature selection for mixed data type building on a previous study,²⁴ which identifies features that are relevant to the overall mixture distribution. The methodological background for this is briefly described in the subsection below.

Let $P (x_{i j} | ϑ_{k j})$ represent the distribution of a relevant feature $j$ for cluster $k$ , where $ϑ_{k j}$ are the parameters characterizing the distribution and $P (x_{i j} | η_{j})$ be the distribution of an irrelevant feature $j$ , where $η_{j}$ are the parameters characterizing the distribution, which does not depend on the cluster membership $k$ . Then the finite mixture model defined in (2) can be written as follows: $P (x_{i} | z, Θ) = \prod_{k = 1}^{K} \prod_{j = 1}^{p} [ω_{j} P (x_{i j} | ϑ_{k j}) + (1 - ω_{j}) P (x_{i j} | η_{j})]^{z_{i k}}$ (3)where $Θ = (ω, ϑ, η)$ , where $ω = (ω_{1}, \dots, ω_{p})^{⊤}$ and $ω_{j}$ denotes the probability of feature $j$ contributing to defining the clusters (i.e. relevant feature), for $j = 1, \dots, p$ . For computational convenience, we introduce a latent variable vector $γ_{i} = (γ_{i 1}, \dots, γ_{i p})^{⊤}$ , where $γ_{i j} \in {0, 1}$ . If $γ_{i j} = 1$ , it suggests that feature $j$ is a relevant feature and $0$ otherwise. Then the model (3) can be further written as follows: $P (x_{i} | z, γ, Θ) = \prod_{k = 1}^{K} \prod_{j = 1}^{p} [P (x_{i j} | ϑ_{k j})^{γ_{i j}} P (x_{i j} | η_{j})^{(1 - γ_{i j})}]^{z_{i k}}$ (4)The conditional distribution of latent variable $γ_{i} = (γ_{i 1}, \dots, γ_{i p})^{⊤}$ is an independent Bernoulli distribution, that is, $γ_{i} | ω = \prod_{j = 1}^{p} ω_{j}^{γ_{i j}} (1 - ω_{j})^{(1 - γ_{i j})}$ . The marginal distribution of model (4) (i.e. integrating out the latent variables $z$ and $γ$ ) becomes $P (x_{i} | π, Θ) = \sum_{k = 1}^{K} π_{k} \prod_{j = 1}^{p} [ω_{j} P (x_{i j} | ϑ_{k j}) + (1 - ω_{j}) P (x_{i j} | η_{j})]$ (5)where $Θ = (ω, ϑ, η)$ . The specification of $P (x_{i j} | ϑ_{k j})$ and $P (x_{i j} | η_{j})$ for different types of data are summarized in Table 1. Specifically, to model continuous features, $P (x_{i j} | ϑ_{k j})$ and $P (x_{i j} | η_{j})$ are assumed to be Gaussian distributions with $ϑ_{k j} = (μ_{k j}, σ_{k j}^{2})$ denoting the mean and variance of the Gaussian distribution for features relevant to the clustering and $η_{j} = (ι_{j}, ϵ_{j}^{2})$ denoting the mean and variance of the Gaussian distribution for features irrelevant to the clustering. Similarly, for categorical features $P (x_{i j} | ϑ_{k j})$ and $P (x_{i j} | η_{j})$ are assumed to be multinomial distributions with $ϑ_{k j} = (θ_{k j 1}, \dots, θ_{k j M_{j}})$ and $η_{j} = (ζ_{j 1}, \dots, ζ_{j M_{j}})$ , respectively. For count features, $ϑ_{k j} = (λ_{k j})$ and $η_{j} = (ν_{j})$ , respectively.

Table 1.

Distributions for different types of features.

	$P (x_{i j} \| ϑ_{k j})$	$P (x_{i j} \| η_{j})$
Continuous features	$N (μ_{k j}, σ_{k j}^{2})$	$N (ι_{j}, ϵ_{j}^{2})$
Categorical features	$Multinomial (1; θ_{k j 1}, \dots, θ_{k j M_{j}})$	$Multinomial (1; ζ_{j 1}, \dots, ζ_{j M_{j}})$
Count features	$Poisson (λ_{k j})$	$Poisson (ν_{j})$

2.3. Cluster assignment

Given the model defined in (5), consider the assignment of cluster membership for an individual $i$ , with data $x_{i}$ . The probability that this individual belongs to cluster $k$ can be estimated by $P (z_{i k} = 1 | x_{i}, \hat{Θ}, \hat{π}) = \frac{{\hat{π}}_{k} P (x_{i} | {\hat{Θ}}_{k})}{\sum_{l = 1}^{K} {\hat{π}}_{l} P (x_{i} | {\hat{Θ}}_{l})}$ , where $\hat{Θ} = ({\hat{Θ}}_{1}, \dots, {\hat{Θ}}_{K})$ and $\hat{π} = ({\hat{π}}_{1}, \dots, {\hat{π}}_{K})$ are estimates of model parameters. Then the individual is assigned to ${\hat{z}}_{i k}$ , where ${\hat{z}}_{i k} = \arg max_{1 \leq k \leq K} P (z_{i k} = 1 | x_{i}, \hat{Θ}, \hat{π})$ . In the next section, we describe the parameter estimation process.

3. Variational Bayesian inference for posterior approximation

We developed a variational Bayesian inference algorithm to estimate the proposed model. In this section, we first describe the prior specifications for the model parameters, followed by the variational inference algorithm used in the proposed iClusterVB approach.

3.1. Specification of prior distributions

We used the following prior distributions for the model parameters. To reduce the model complexity and to ease computation, we considered imposing prior distributions only on the mixing weights $π$ and the ‘‘relevant” distribution $P (x_{i j} | ϑ_{k j})$ , for $k = 1, \dots, K$ and $j = 1, \dots, p$ . Also, to reduce computational complexity, the updates for parameters associated with the ‘‘irrelevant” distribution $P (x_{i j} | η_{j})$ were carried out based on maximum likelihood estimates.²⁴ Specifically, for the cluster weight $π$ , we used prior distribution $(π_{1}, \dots, π_{K}) | α_{0} \sim Dirichlet (α_{01}, \dots, α_{0 K})$ , where $α_{0} = (α_{01}, \dots, α_{0 K})$ is the parameter for Dirichlet distribution. For continuous features that follow the Gaussian distribution, we used prior distributions $μ_{k j} | μ_{0}, s_{0}^{2} \sim N (μ_{0}, s_{0}^{2}) and σ_{k j}^{2} | a_{0}, b_{0} \sim IG (a_{0}, b_{0})$ , where $μ_{0}$ and $s_{0}^{2}$ are hyper-parameters of mean and variance for Gaussian distribution, and $a_{0}$ and $b_{0}$ are the hyper-parameters of shape and scale parameters of an inverse gamma distribution, for $k = 1, \dots, K$ and $j = 1, \dots, p$ . For categorical features that follow multinomial distribution, we used prior distributions $θ_{k j} | κ_{k j 1}, \dots, κ_{k j M_{j}} \sim Dirichlet (κ_{k j 1}, \dots, κ_{k j M_{j}})$ , where we set $κ_{k j 1} = \dots = κ_{k j M_{j}} = κ_{0}$ , which are hyper-parameters of the Dirichlet distribution, where $M_{j}$ denotes the number of categories for the $j$ th feature. For count features that follow a Poisson distribution, we used prior distributions $λ_{k j} | c_{0}, d_{0} \sim Gamma (c_{0}, d_{0})$ , where $Gamma (c_{0}, d_{0})$ denotes a gamma distribution, with $c_{0}$ and $d_{0}$ being the shape and scale hyperparameters, respectively.

In the present study, we used a weakly informative prior for all parameters of interest and therefore, the influence of prior distribution on the posterior distribution of the model parameters was minimized. For prior distribution of $π$ , we set $α_{01} = \dots = α_{0 K} = α_{0} = 0.001$ . For $μ_{k j}$ , we set $μ_{0} = 0$ and $s_{0}^{2} = 100$ . For $σ_{k j}^{2}$ , we set $a_{0} = 1$ and $b_{0} = 1$ . For $θ_{k j}$ , we set $κ_{k j 1} = \dots = κ_{k j M_{j}} = 1$ . For $λ_{k j}$ , we set $c_{0} = 1$ and $d_{0} = 1$ .

3.2. Mean field variational Bayesian inference for iClusterVB

Under the Bayesian framework, estimation of the posterior distributions is commonly performed via Markov Chain Monte Carlo (MCMC). However, MCMC can be infeasible with a large number of observations, requiring massive computing resources, converging too slowly to be practically useful, and might approximate the entirely wrong posterior. Variational Bayesian inference is a promising alternative to MCMC for fast Bayesian inference.²⁷

The key idea of variational inference is to cast the problem of finding the true posterior distribution into an optimization problem. Specifically, for data $X$ and model parameters $Θ$ , suppose the interest is to compute the posterior distribution $P (Θ | X)$ . In variational inference, we specify a family $G$ of densities over the model parameters (also called latent variables). The variational inference seeks the approximation to the full posterior by $P (Θ | X) \approx Q (Θ)$ , where $Q (Θ) \in G$ is a candidate approximation to the exact conditional. The goal is to find the closest candidate distribution to the exact conditional, as measured by the Kullback-Leibler (KL) divergence (also known as relative entropy): $Q^{*} (Θ) = \underset{Q (Θ) \in G}{arg min KL} (Q (Θ) | | P (Θ | X))$ . The KL divergence, denoted $KL (Q | | P)$ , is a type of statistical distance, measuring how one probability distribution $Q$ is different from another probability distribution $P$ (i.e. reference distribution). For variational inference, the KL divergence is $\begin{aligned} KL (Q (Θ) | | P (Θ | X)) & = E (\log (Q (Θ))) - E (\log (P (Θ | X))) \\ = - [E (\log (P (Θ, X))) - E (\log (Q (Θ)))] + \log P (X) \end{aligned}$ (6)The $E (\log (P (Θ, X))) - E (\log (Q (Θ)))$ is called evidence lower bound (ELBO). Thus, equation (6) can be further written as $KL (Q (Θ) | | P (Θ | X)) = - ELBO + \log P (X)$ . Therefore, maximizing the ELBO is equivalent to minimizing the KL divergence. The mean field variational inference²⁸ assumes that the variational distribution $Q (Θ)$ can be factorized into: $Q (Θ) = \prod_{t = 1}^{T} Q_{t} (Θ_{t}) for some partition {Θ_{1}, \dots, Θ_{T}} of Θ$ . Coordinate ascent mean-field variational inference (CAVI) is an algorithm that iteratively optimizes each factor of the mean-field variational density, while holding others fixed. The algorithm can be declared converged once the change in ELBO falls under some small threshold.

Based on the principles of mean-field variational inference, we derived the updates of the variational parameters for the proposed iClusterVB. We first initialized the variational distributions, then iterated each factor in turn and updated its current estimate with its optimal solution given the current estimates for the other factors. Algorithm 1 provides a pseudo-code sketching the computing process of the iClusterVB based on the CAVI algorithm. Notably, the algorithm involves calculating the terms of $E [\log P (x_{i j} | ϑ_{k j})]$ and $E [\log P (x_{i j} | η_{j})]$ , which is provided in Section A of the Supplemental Material. A sketch of the derivation of the ELBO and a detailed updating process for the variational parameters are also provided in the Supplemental Material.

The convergence of the algorithm is monitored through the inspection of the change of the evidence lower bound. To avoid local maxima and the potential influence of the initial values, we ran the model with 10 different sets of initial values and chose the one with the largest value of ELBO obtained as the final result.

3.3. Determining the number of clusters

A notable advantage of the variational Bayes approach is the determination of the optimal number of clusters. One can deliberately over-fit the model by setting a large number of clusters, and the algorithm will converge to yield a model that is composed of dominant clusters and the redundant clusters are removed. This automates the process of determining the number of clusters and avoids refitting the model multiple times with different values of $K$ . A previous study²³ showed that compared to the commonly-used EM algorithm with the Bayesian information criterion (BIC) method, the variational Bayes algorithm is more accurate in estimating the number of clusters.

Under a Dirichlet prior with hyperparameter $α_{0} < 1$ , if the model is over-fitted (i.e. setting the number of clusters to $K_{max}$ , which is larger than the true number of clusters), redundant clusters are removed during iterations with an estimate of $π_{k}$ nearly close to zero.²⁹ We employed a pre-defined cut-off $χ$ to determine whether the cluster $k$ is ‘‘empty” or not. If the estimated cluster proportion ${\hat{π}}_{k} < χ$ , then this cluster is considered a redundant cluster and therefore is removed. We considered $χ$ = 1%, 5%, and 10% in the simulation study to evaluate whether the number of clusters is sensitive to the choice of $χ$ under different scenarios.

4. Simulation studies

In this section, we performed two simulation studies to evaluate the performance of the iClusterVB. In Simulation I, we compared the clustering and feature selection performance of iClusterVB with several other clustering approaches that are applicable to mixed data types. In Simulation II, we evaluated the performance of iClusterVB under violations of model assumptions, specifically when the independence between features did not hold or when the normality assumption was violated.

4.1. Simulation setup

In Simulation I, two different scenarios were considered. In Scenario 1, clusters were well-separated and, therefore, were not difficult to identify. In Scenario 2, clusters were poorly separated, which may pose challenges in identifying the optimal number of clusters and the cluster membership. For each scenario, we generated $R = 4$ data views with mixed data types for a total of $N = 240$ individuals. The first two data views were continuous features and the rest of two data views were count and binary features. The number of clusters was set at $K = 4$ and the cluster proportions were set at $π_{1} = 0.25, π_{2} = 0.25, π_{3} = 0.25, and π_{4} = 0.25$ . The total number of features was set to be $p = 2000$ , with each of the four data views having 500 features. For each data view, 10% (i.e. 50) were relevant features that contributed to the clustering, and the rest were noise features that did not contribute to the clustering.

For Scenario 1 (well-separated clusters), data views were generated as follows. For data view 1 (continuous features), the relevant features were generated from a normal distribution, with $N (16, 1)$ for Cluster 1, $N (12, 1)$ for Cluster 2, $N (8, 1)$ for Cluster 3, and $N (4, 1)$ for Cluster 4, respectively. The noise features were generated from $N (0, 1)$ . For data view 2 (continuous features), the relevant features were also generated from a normal distribution, with $N (- 4, 1)$ for Cluster 1, $N (- 8, 1)$ for Cluster 2, $N (- 12, 1)$ for Cluster 3, and $N (- 16, 1)$ for Cluster 4, respectively. The noise features were generated from $N (0, 1)$ . For data view 3 (count features), the relevant features were generated from a Poisson distribution, with $Poisson (12)$ for Cluster 1, $Poisson (8)$ for Cluster 2, $Poisson (4)$ for Cluster 3, and $Poisson (1)$ for Cluster 4, respectively. The noise features were generated from $Poisson (2)$ . For data view 4 (binary features), the relevant features were generated from a Bernoulli distribution, with $Bernoulli (0.2)$ for Cluster 1, $Bernoulli (0.3)$ for Cluster 2, $Bernoulli (0.2)$ for Cluster 3, and $Bernoulli (0.3)$ for Cluster 4, respectively. The noise features were generated from $Bernoulli (0.1)$ . A heatmap of the 4 data views generated from Scenario 1 is shown in Supplemental Figure S1A to D.

For Scenario 2 (poorly separated clusters), data views were generated as follows. For data view 1 (continuous features), the relevant features were generated from a normal distribution, with $N (3, 1)$ for Cluster 1, $N (2, 1)$ for Cluster 2, $N (1, 1)$ for Cluster 3, and $N (- 1, 1)$ for Cluster 4, respectively. The noise features were generated from $N (0, 1)$ . For data view 2 (continuous features), the relevant features were also generated from a normal distribution, with $N (1, 1)$ for Cluster 1, $N (- 1, 1)$ for Cluster 2, $N (- 2, 1)$ for Cluster 3, and $N (- 3, 1)$ for Cluster 4, respectively. The noise features were generated from $N (0, 1)$ . For data view 3 (count features), the relevant features were generated from a Poisson distribution, with $Poisson (4)$ for Cluster 1, $Poisson (3)$ for Cluster 2, $Poisson (1)$ for Cluster 3, and $Poisson (1)$ for Cluster 4, respectively. The noise features were generated from $Poisson (2)$ . For data view 4 (binary features), the relevant features were generated from a Bernoulli distribution, with $Bernoulli (0.2)$ for Cluster 1, $Bernoulli (0.3)$ for Cluster 2, $Bernoulli (0.2)$ for Cluster 3, and $Bernoulli (0.3)$ for Cluster 4, respectively. The noise features were generated from $Bernoulli (0.1)$ . A heatmap of the four data views generated from Scenario 2 is shown in Supplemental Figure S1E to H.

In Simulation II, two continuous data views were generated to evaluate the robustness of iClusterVB under deviations from normality and independence. Data were simulated from a skew-normal (SN) distribution, with 500 features per view generated using the same location and scale parameters as in data views 1 and 2 of Scenarios 1 and 2 in Simulation I. The shape parameter $ζ$ , which controls skewness, was set to 5, 10, and 20, corresponding to mild, moderate, and strong skewness (Supplemental Figure S2 A to C). To further assess robustness, correlation $r$ was introduced among the 50 relevant features, with values of 0, 0.3, 0.5, 0.7, and 0.9 representing a gradient from no correlation to strong correlation. An additional setting considered data generated from a $t$ -distribution with 5 degrees of freedom (Supplemental Figure S2D) in place of the skew-normal distribution.

4.2. Implementation and performance metrics

The proposed iClusterVB was implemented via our newly developed iClusterVB package in R. The performance of the iClusterVB approach was compared with the following six commonly used and state-of-the-art integrative clustering approaches. These approaches are (a) iClusterBayes: Bayesian integrative clustering with variable selection based on the MCMC algorithm,²⁰ which was implemented via iClusterPlus package in R. Each model was run for 2000 iterations. The default setting was used for all other specifications. (b) iCluster+: Bayesian joint latent variable model for integrative clustering, which was also implemented via iClusterPlus package. Given that a large computation time is required for this approach, 200 iterations (default setting) were run for each model. The default setting was used for all other specifications. (c) VarSelLCM: model-based clustering with variable selection based on integrated complete-data likelihood,³⁰ which was implemented via the VarSelLCM package. (d) MOFA: multi-omics factor analysis,³¹ which was implemented via the MOFA2 package. Of note, since MOFA does not inherently perform clustering, we first estimated the latent factor scores and then applied model-based clustering⁷ via the mclust package to identify clusters. (e) CIMLR: cancer integration via multikernel learning,³² which was implemented via the CIMLR package, (f) SNF: similarity network fusion¹⁷ implemented via the SNFtool package.

To measure the model performance, we computed the following performance measures: (a) accuracy of the cluster number (ACN), calculated as the percentage of times the model correctly identified the true number of clusters $K$ over all simulated datasets. The number of clusters for iClusterBayes, iCluster+, VarSelLCM, and MOFA was estimated using Bayesian information criteria (BIC). The number of clusters for CIMLR was estimated using the gap statistic, while the number of clusters for SNF was estimated using the silhouette score. (b) adjusted rand index (ARI): agreement between the true clustering (partition) and estimated clustering (partition). The ARI³³ ranges from $- 1$ to 1, and the higher the ARI the better the model performance in recovering the true cluster membership, (c) overall feature selection accuracy (FSA): calculated as the sum of the number of relevant features and the number of noise features correctly identified by the model, divided by the total number of features, (d) true positive rate (TPR): calculated as the number of relevant features correctly identified by the model divided by the total number of relevant features, (e) true negative rate (TNR): calculated as the number of noise features correctly identified by the model divided by the total number of noise features. To facilitate a fair comparison between methods, the ARI, FSA, TPR, and TNR were calculated under the true number of clusters $K$ . Since MOFA, CIMLR, and SNF do not incorporate built-in feature selection, we only reported clustering performance metrics and did not assess feature selection performance for these methods. The computational time (in seconds) under the true $K$ is also recorded. A total of 50 datasets were generated. The mean and standard deviation (SD) of these performance metrics over all datasets were recorded.

4.3. Simulation results

The results of Simulation I comparing between methods are presented in Table 2. Across both scenarios, iClusterVB demonstrated superior clustering and feature selection performance as well as computational efficiency compared with existing methods. For the clustering performance, iClusterVB achieved near-perfect recovery of the true number of clusters (ACN $=$ 0.98) and the highest ARI (mean (SD) $=$ 1(0)) in the well-separated clusters scenario. Even under the poorly separated scenario, iClusterVB maintained strong clustering performance (ACN $=$ 0.92; ARI $=$ 1 (0)), while alternative methods showed poor ACN (except that VarSelLCM achieved a perfect ACN) and slightly lower ARI values (except that MOFA achieved a perfect ARI). In terms of feature selection, iClusterVB also demonstrated strong performance across the well-separated clusters scenario (FSA $=$ 0.962 (0.004), TPR $=$ 0.808 (0.016), TNR $=$ 0.979 (0.004)) and the poorly separated clusters scenario (FSA $=$ 0.962 (0.004), TPR $=$ 0.809 (0.018), TNR $=$ 0.979 (0.004)). In comparison, iClusterBayes maintained reasonably high FSA but with reduced TPRs, while iCluster+ yielded substantially lower FSA driven by poor TNR. Conversely, VarSelLCM consistently achieved high TNRs but with low TPRs across both scenarios. As expected, MOFA, CIMLR, and SNF do not perform feature selection and were therefore not evaluated for these metrics.

Table 2.
Results $^{a}$ from Simulation I comparing the performance of iClusterVB with existing approaches for equal cluster size ( $π_{1} = π_{2} = π_{3} = π_{4} = 0.25$ ) under Scenarios 1 and 2.

Methods $^{b}$ ACN $^{c}$ ARI FSA TPR TNR Run time (seconds)

Scenario 1: Well-separated clusters

iClusterVB $^{d}$ 0.98 1 (0) 0.962 (0.004) 0.808 (0.016) 0.979 (0.004) 91.671 (43.579)

iClusterBayes 0 0.574 (0.084) 0.972 (0.002) 0.759 (0.008) 0.996 (0.002) 531.214 (14.698)

iCluster+ 0 1 (0.002) 0.271 (0.007) 0.925 (0.017) 0.198 (0.009) 482.756 (161.123)

VarSelLCM 1 0.863 (0.175) 0.944 (0.037) 0.437 (0.375) 1 (0) 25.064 (8.594)

MOFA 0.02 0.845 (0.108) – $^{e}$ – $^{e}$ – $^{e}$ 45.372 (6.957)

CIMLR 0 0.999 (0.003) – $^{e}$ – $^{e}$ – $^{e}$ 1.598 (0.373)

SNF 0 1(0) – $^{e}$ – $^{e}$ – $^{e}$ 65.507 (22.135)

Scenario 2: Poorly separated clusters

iClusterVB $^{d}$ 0.92 1 (0) 0.962 (0.004) 0.809 (0.018) 0.979 (0.004) 74.829 (22.971)

iClusterBayes 0 0.998 (0.005) 0.974 (0.002) 0.771 (0.01) 0.996 (0.002) 530.231 (37.538)

iCluster+ 0 0.872 (0.111) 0.269 (0.008) 0.945 (0.017) 0.194 (0.008) 614.856 (97.371)

VarSelLCM 1 0.943 (0.142) 0.974 (0.011) 0.742 (0.107) 1 (0) 38.987 (6.559)

MOFA 0.02 1 (0) – $^{e}$ – $^{e}$ – $^{e}$ 9.158 (6.881)

CIMLR 0 0.862 (0.069) – $^{e}$ – $^{e}$ – $^{e}$ 1.751 (0.32)

SNF 0 0.838 (0.099) – $^{e}$ – $^{e}$ – $^{e}$ 49.723 (18.235)

$^{a}$ Results are presented as mean (SD) over 50 replications.

$^{b}$ Abbreviation for methods: iClusterVB: integrative clustering based on variational Bayes using the iClusterVB package; iCluster+: Bayesian joint latent variable model for integrative clustering using the iClusterPlus package; iClusterBayes: Bayesian integrative clustering based on the MCMC algorithm using the iClusterPlus package; VarSelLCM: model-based clustering with variable selection based on integrated complete-data likelihood using the VarSelLCM package. MOFA: multi-omics factor analysis using the MOFA2 package. A model-based approach, via the mclust package, was used for clustering. CIMLR: cancer integration via multikernel learning via the CIMLR package. SNF: similarity network fusion, via the SNFtool package.

$^{c}$ Abbreviation for performance indices: ACN: accuracy of cluster number; ARI: adjusted rand index; TNR: true negative rate; TPR: true positive rate; FSA: overall feature selection accuracy. ARI, FSA, TPR, and TNR were calculated under the true number of clusters ( $K = 4$ ).

$^{d}$ The number of clusters for iClusterVB was determined by using a cut-off $χ = 0.01$ .

$^{e}$ FSA, TPR, and TNR were not reported for MOFA, CIMLR, and SNF, respectively.

Methods $^{b}$	ACN $^{c}$	ARI	FSA	TPR	TNR	Run time (seconds)
Scenario 1: Well-separated clusters
iClusterVB $^{d}$	0.98	1 (0)	0.962 (0.004)	0.808 (0.016)	0.979 (0.004)	91.671 (43.579)
iClusterBayes	0	0.574 (0.084)	0.972 (0.002)	0.759 (0.008)	0.996 (0.002)	531.214 (14.698)
iCluster+	0	1 (0.002)	0.271 (0.007)	0.925 (0.017)	0.198 (0.009)	482.756 (161.123)
VarSelLCM	1	0.863 (0.175)	0.944 (0.037)	0.437 (0.375)	1 (0)	25.064 (8.594)
MOFA	0.02	0.845 (0.108)	– $^{e}$	– $^{e}$	– $^{e}$	45.372 (6.957)
CIMLR	0	0.999 (0.003)	– $^{e}$	– $^{e}$	– $^{e}$	1.598 (0.373)
SNF	0	1(0)	– $^{e}$	– $^{e}$	– $^{e}$	65.507 (22.135)
Scenario 2: Poorly separated clusters
iClusterVB $^{d}$	0.92	1 (0)	0.962 (0.004)	0.809 (0.018)	0.979 (0.004)	74.829 (22.971)
iClusterBayes	0	0.998 (0.005)	0.974 (0.002)	0.771 (0.01)	0.996 (0.002)	530.231 (37.538)
iCluster+	0	0.872 (0.111)	0.269 (0.008)	0.945 (0.017)	0.194 (0.008)	614.856 (97.371)
VarSelLCM	1	0.943 (0.142)	0.974 (0.011)	0.742 (0.107)	1 (0)	38.987 (6.559)
MOFA	0.02	1 (0)	– $^{e}$	– $^{e}$	– $^{e}$	9.158 (6.881)
CIMLR	0	0.862 (0.069)	– $^{e}$	– $^{e}$	– $^{e}$	1.751 (0.32)
SNF	0	0.838 (0.099)	– $^{e}$	– $^{e}$	– $^{e}$	49.723 (18.235)

In addition to model performance, iClusterVB was faster than other model-based methods. Its runtime was 92 s in the well-separated scenario and 75 s in the poorly separated scenario, compared to substantially longer run times for iClusterBayes (around 531 s), and iCluster+ (483–615 s). Although MOFA, CIMLR, and SNF were computationally faster, their ACNs were markedly lower and do not simultaneously perform feature selection. Importantly, when the number of clusters is treated as unknown and multiple candidate models must be fitted to determine the optimal solution, the computational burden of alternative methods would further increase. In contrast, iClusterVB automatically determines the number of clusters, therefore, avoiding this issue and further enhancing its efficiency. Overall, these findings suggest that iClusterVB yields an excellent performance in clustering and feature selection as well as computational efficiency.

In addition, we evaluated whether the number of clusters determined by iClusterVB is sensitive to the choice of a cut-off $χ$ , when the cluster proportions are balanced ( $π_{1} = π_{2} = π_{3} = π_{4} = 0.25$ ) and unbalanced ( $π_{1} = 0.1, π_{2} = 0.2, π_{3} = 0.3, π_{4} = 0.4$ ), in Scenarios 1 and 2. The results showed that iClusterVB achieved high accuracy in correctly identifying the true number of clusters across different settings (Supplemental Table S1). For example, when the cluster proportions were balanced, the ACN remained consistently high (0.98 in Scenario 1 and 0.92 in Scenario 2) across all cut-offs. In contrast, when the cluster proportions were unbalanced, performance was slightly reduced, particularly in Scenario 2, where ACN dropped to around 0.72–0.74. Overall, iClusterVB performed reliably in balanced settings and still achieved reasonable accuracy under unbalanced cluster proportion settings.

The results of Simulation II for Scenarios 1 and 2 are presented in Tables 3 and 4, respectively. In Scenario I (Table 3), iClusterVB consistently achieved near-perfect clustering and feature selection performance across skew-normal distributions with varying levels of skewness ( $ζ$ = 5, 10, 20) and correlations ( $r$ = 0–0.9) among features, with ARI, FSA, TPR, and TNR all equal or close to 1. Runtime increased moderately with stronger skewness and higher correlations. Under the t-distribution (df = 5), performance remained high but slightly reduced, with ACN around 0.82–0.92 and ARI equal 1 (0), while runtime was noticeably longer at higher correlations. In Scenario 2 of Simulation II (Table 4), iClusterVB still maintained high clustering and feature selection performance, with ACN, ARI, FSA, TPR, and TNR close to 1 across different skewness levels ( $ζ$ = 5, 10, 20) and t-distribution settings, though runtimes were generally higher than in Scenario 1. Overall, iClusterVB demonstrated strong robustness across both well- and poorly separated cluster scenarios.

Table 3.

Results $^{a}$ from Simulation II, Scenario 1 (well-separated clusters), evaluating the performance of iClusterVB across different feature correlations and underlying distributional settings.

Correlation ( $r$ )	ACN01 $^{b}$	ACN05	ACN10	ARI	FSA	TPR	TNR	Run time (seconds)
Skew normal distribution ( $ζ$ = 5)
0	1	1	1	1 (0)	0.9921 (0.0026)	1 (0)	0.9912 (0.0029)	25.07 (2.28)
0.3	0.96	0.96	0.96	1 (0)	0.9922 (0.0029)	1 (0)	0.9914 (0.0032)	26.33 (1.33)
0.5	1	1	1	1 (0)	0.9921 (0.0031)	1 (0)	0.9912 (0.0035)	28.6 (0.94)
0.7	0.96	0.96	0.96	1 (0)	0.9918 (0.0033)	1 (0)	0.9909 (0.0037)	32.14 (1.95)
0.9	1	1	1	1 (0)	0.9929 (0.0024)	1 (0)	0.9921 (0.0026)	36.36 (1.8)
Skew normal distribution ( $ζ$ = 10)
0	0.98	0.98	0.98	1 (0)	0.9924 (0.0024)	1 (0)	0.9915 (0.0027)	42.18 (2.28)
0.3	0.98	0.98	0.98	1 (0)	0.9926 (0.0031)	1 (0)	0.9918 (0.0035)	46.43 (1.99)
0.5	0.98	0.98	0.98	1 (0)	0.9924 (0.0023)	1 (0)	0.9915 (0.0026)	51.35 (2.49)
0.7	1	1	1	1 (0)	0.9922 (0.0032)	1 (0)	0.9913 (0.0035)	54.4 (2.33)
0.9	1	1	1	1 (0)	0.9925 (0.0029)	1 (0)	0.9917 (0.0033)	57.95 (2.11)
Skew normal distribution ( $ζ$ = 20)
0	0.94	0.94	0.94	1 (0)	0.9929 (0.0023)	1 (0)	0.9921 (0.0025)	63.29 (2.58)
0.3	0.98	0.98	0.98	1 (0)	0.9927 (0.0035)	1 (0)	0.9918 (0.0039)	67.08 (2.87)
0.5	1	1	1	1 (0)	0.9923 (0.0024)	1 (0)	0.9914 (0.0027)	70.7 (2.79)
0.7	0.98	0.98	0.98	1 (0)	0.9919 (0.0026)	1 (0)	0.991 (0.0029)	72.45 (2.37)
0.9	0.98	0.98	0.98	1 (0)	0.9927 (0.0023)	1 (0)	0.9919 (0.0026)	77.05 (4.35)
t-distribution (df $=$ 5)
0	0.88	0.88	0.9	1 (0)	0.9742 (0.0043)	1 (0)	0.9714 (0.0048)	25.54 (2.44)
0.3	0.9	0.9	0.92	1 (0)	0.9738 (0.0046)	1 (0)	0.9709 (0.0051)	33.63 (14.92)
0.5	0.82	0.82	0.84	1 (0)	0.9741 (0.0051)	1 (0)	0.9712 (0.0057)	62.03 (12.36)
0.7	0.84	0.84	0.84	1 (0)	0.9736 (0.0042)	1 (0)	0.9706 (0.0047)	70.72 (2.52)
0.9	0.86	0.86	0.86	1 (0)	0.9741 (0.0043)	1 (0)	0.9713 (0.0048)	77.27 (4.27)

$^{a}$ Results are presented as mean (SD) over 50 replications.

$^{b}$ Abbreviation for performance indices: ACN01 denotes the accuracy of the cluster numbers using the cutoff $χ = 1 %$ , ACN05 denotes the accuracy of the cluster numbers using the cutoff $χ = 5 %$ , ACN10 denotes the accuracy of the cluster numbers using the cutoff $χ = 10 %$ . ARI: adjusted rand index; TNR: true negative rate; TPR: true positive rate; FSA: overall featureselection accuracy. ARI, FSA, TPR, and TNR were calculated under the true number of clusters ( $K = 4$ ).

Table 4.

Results $^{a}$ from Simulation II, Scenario 2 (poorly separated clusters), evaluating the performance of iClusterVB across different feature correlations and underlying distributional settings.

Correlation ( $r$ )	${ACN01}^{b}$	ACN05	ACN10	ARI	FSA	TPR	TNR	Run time (seconds)
Skew normal distribution ( $ζ$ = 5)
0	0.76	0.76	0.78	0.9885 (0.0571)	0.9924 (0.0028)	1 (0)	0.9916 (0.0032)	46.44 (17.87)
0.3	0.9	0.9	0.92	1 (0)	0.9918 (0.0027)	1 (0)	0.9909 (0.0031)	32.13 (4.35)
0.5	0.82	0.82	0.82	1 (0)	0.992 (0.0027)	1 (0)	0.9912 (0.003)	33.99 (3.54)
0.7	1	1	1	1 (0)	0.9918 (0.003)	1 (0)	0.9909 (0.0033)	37.85 (1.68)
0.9	0.98	0.98	0.98	1 (0)	0.9924 (0.0024)	1 (0)	0.9916 (0.0027)	40.04 (1.39)
Skew normal distribution ( $ζ$ = 10)
0	0.72	0.72	0.76	0.9827 (0.0692)	0.992 (0.0027)	1 (0)	0.9911 (0.003)	47.16 (2.64)
0.3	0.94	0.94	0.94	1 (0)	0.9925 (0.0032)	1 (0)	0.9916 (0.0035)	51.27 (2.25)
0.5	0.86	0.86	0.88	1 (0)	0.9923 (0.0024)	1 (0)	0.9914 (0.0026)	55.32 (2.88)
0.7	0.88	0.9	0.9	1 (0)	0.9923 (0.0028)	1 (0)	0.9915 (0.0031)	60.26 (2.2)
0.9	0.84	0.86	0.88	1 (0)	0.9929 (0.0031)	1 (0)	0.9921 (0.0034)	61.42 (2.38)
Skew normal distribution ( $ζ$ = 20)
0	0.72	0.72	0.72	0.9942 (0.0408)	0.993 (0.0027)	1 (0)	0.9922 (0.003)	67.33 (2.34)
0.3	0.76	0.76	0.8	1 (0)	0.9927 (0.0032)	1 (0)	0.9919 (0.0036)	71.86 (2.84)
0.5	0.86	0.86	0.86	1 (0)	0.9922 (0.0026)	1 (0)	0.9914 (0.0028)	74.64 (2.05)
0.7	0.94	0.94	0.94	1 (0)	0.9916 (0.0026)	1 (0)	0.9907 (0.0029)	80 (2.51)
0.9	0.92	0.92	0.92	1 (0)	0.9928 (0.0022)	1 (0)	0.992 (0.0024)	82.58 (4.13)
t-distribution (df $=$ 5)
0	0.9	0.9	0.98	0.9927 (0.0408)	0.9733 (0.0058)	1 (0)	0.9703 (0.0064)	62.58 (15.83)
0.3	0.98	0.98	0.98	0.9929 (0.0407)	0.9734 (0.0051)	1 (0)	0.9704 (0.0057)	71.51 (3.11)
0.5	0.88	0.88	0.96	0.9993 (0.0027)	0.9742 (0.0048)	1 (0)	0.9714 (0.0053)	76.26 (3.18)
0.7	0.82	0.82	0.9	0.9984 (0.0045)	0.9746 (0.0046)	1 (0)	0.9718 (0.0051)	81.37 (3.3)
0.9	0.88	0.88	0.9	0.9987 (0.0037)	0.9738 (0.0045)	1 (0)	0.9709(0.005)	68.97 (21.62)

$^{a}$ Results are presented as mean (SD) over 50 replications.

5. Application to real data

In this section, we demonstrated the utility of the proposed iClusterVB approach by applying it to three real-life data examples. These three real data examples were data on glioblastoma multiforme, acute myeloid leukemia and lung cancer, respectively.

5.1. Glioblastoma multiforme data

The first example was from a glioblastoma multiforme (GBM) dataset. GBM is the most common and aggressive type of primary brain tumor in adults. It belongs to a group of tumors called gliomas, which originate from glial cells in the brain. GBM is characterized by its rapid growth, infiltrative nature, and high likelihood of recurrence.³⁴ The Cancer Genome Atlas (TCGA) Research Network³⁵ was established to generate a comprehensive catalog of genomic abnormalities driving tumorigenesis, and it provides a detailed view of the genomic changes in a large GBM cohort. Earlier studies have shown that GBM is heterogeneous and tumor subtypes have differential responses to therapies.³⁶ Therefore, appropriately identifying subtypes of GBM patients is crucial for understanding tumor heterogeneity, tailoring personalized treatment strategies, and accurately predicting prognosis. The current analysis included GBM patients with measurements on three views, namely, DNA copy number (continuous), messenger RNA (mRNA) expression (continuous) and somatic mutation (binary). Briefly, a somatic mutation is a genetic alteration that occurs in a cell of the body that is not passed down to offspring. Somatic mutations can disrupt the normal control mechanisms that regulate cell growth, division, and death, leading to uncontrolled proliferation and the formation of tumors. DNA copy number refers to the number of copies of a particular DNA sequence within the genome of an organism or a cell. Tumor cells often exhibit abnormal DNA copy number profiles, including amplifications (increased copies) and deletions (decreased copies) of specific genomic regions. mRNA expression refers to the process by which genes in a cell are transcribed into messenger RNA (mRNA) molecules, which then serve as templates for protein synthesis. Alterations in mRNA expression patterns are commonly observed in cancer cells compared to normal cells and can contribute to the dysregulation of cellular processes.

A total of 84 GBM patients were included in the current analysis. This dataset is available in the iClusterPlus package in R. DNA copy number alterations were measured on 5512 genes (continuous), while the mRNA expression was measured on 1740 genes (continuous). The somatic mutation data includes the mutation status (mutation vs. no mutation) of 120 genes. We preprocessed the data following a previous analysis,²⁰ by filtering out genes with mutation rates <2% before the analysis.

Given a small sample size in this study, the maximum number of clusters was set to be $K_{max} = 6$ . The iClusterVB identified two clusters, with a size of 19 (23%) and 65 (77%) patients, respectively. The difference between clusters was evident in the heatmaps of the data views (Figure 2). Notably, the heatmaps for all features (both relevant and noise features) for each data view suggested different DNA copy number (Figure 2(a)) and mRNA expression (Figure 2(b)) patterns between clusters. These distinct patterns between clusters become more evident after filtering out noise features using a feature inclusion probability of 0.5 as a threshold (Figure 2(c) and (d)). No genetic feature was selected from the somatic mutation data view and therefore, the heatmap was not shown. Of note, iClusterVB feature selection resulted in 41 mRNA expression features meeting the posterior feature inclusion probability of 0.5. To narrow this down for interpretation, we looked at the four features that met the posterior feature inclusion probability of 0.9: BLVRB, LOH11CR2A, DHRS3, and MT1E. All four features are known to play a role in cancer development. BLVRB has been identified as a driver of hepatic cancer cell proliferation³⁷ and as a leading prognostic marker for lethal prostate cancer.³⁸ LOH11CR2A,³⁹ DHRS3⁴⁰, and MT1E⁴¹ have been proposed as candidate tumor suppressor genes, with evidence suggesting that their dysregulated expression contributes to cancer progression. The posterior feature inclusion probabilities for each data view are shown in Figure 2(e) and (f).

Figure 2.

Analysis results for glioblastoma multiforme (GBM) dataset using the iClusterVB. (a) Heatmap for DNA copy number data based on all 5512 features. The row represents all genetic features and the column represents patients ordered by clusters. (b) Heatmap for mRNA expression data by clusters derived from iClusterVB based on all 1740 features. The row represents all genetic features and the column represents patients ordered by clusters. (c) Heatmap for DNA copy number based on 1296 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (d) Heatmap for mRNA expression data by clusters derived from iClusterVB based on 41 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (e) Probability of feature inclusion of DNA copy number. The dashed reference line in red indicates a probability of 0.5. (f) Probability of feature inclusion of mRNA expression. The dashed reference line in red indicates a probability of 0.5. (g) K-M curves for survival probabilities by clusters derived from iClusterVB. P value was derived from log-rank tests. Somatic mutation data were not shown as no features were selected by the model.

Furthermore, Cluster 1 had significantly better survival than Cluster 2 (p-value $=$ 0.0074 based on log-rank test; Figure 2(g)), suggesting that these clusters represented subtypes that were biologically distinct. Compared to a four-cluster solution in a previous study using iClusterBayes²⁰ with iSubtypes 2, 3, and 4 sharing similar survival probabilities, the clusters identified by iClusterVB suggested a more statistically significant difference between the two subtypes. This suggests that iClusterVB is capable not only of identifying biologically relevant genes that drive the separation of these clusters, but also of clustering patients into groups with meaningful survival differences. Additionally, the iClusterVB took 117 s and represented a faster alternative to iClusterBayes and iCluster+.²⁰

5.2. Acute myeloid leukemia (AML) data

The second example was from an AML dataset. AML is a malignant disease caused by the abnormal proliferation and differentiation of myeloid stem cells in the bone marrow. This process results in the arrest of cell development within the myeloid lineage, leading to an abnormal accumulation of blasts, or immature cells. Notably, the heterogeneity of AML at the genetic and molecular levels poses significant challenges for diagnosis, prognosis, and treatment.

The AML datasets were downloaded using the cBioPortal for Cancer Genomics tool.⁴² Two data views were included in the present study, namely the gene expression data (continuous) and the mutation data (presence vs. absence of mutation). We preprocessed the data following the procedures of a previous study.⁴³ Specifically, for gene expression data, the 500 genes chosen based on having the highest ranked-based coefficients of variation and standard deviation across the samples were used (Figure 3(a)). For mutation data, mutations that appeared in at least two different individuals were chosen, resulting in 156 genes (Figure 3(b)). Finally, only samples that had gene expression data and mutation data were included, resulting in 170 samples. This dataset is included in the iClusterVB package.

Figure 3.

Analysis results for acute myeloid leukemia (AML) data using the iClusterVB. (a) Heatmap for gene expression based on all 500 features. The row represents all genetic features and the column represents patients ordered by clusters. (b) Heatmap for gene mutation data by clusters derived from iClusterVB based on all 156 features. The row represents all genetic features and the column represents patients ordered by clusters. (c) Heatmap for gene expression based on 326 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (d) Heatmap for gene mutation data by clusters derived from iClusterVB based on one feature selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (e) Probability of feature inclusion of gene expression. The dashed reference line in red indicates a probability of 0.5. (f) Probability of feature inclusion of gene mutation. The dashed reference line in red indicates a probability of 0.5. (g) Kaplan–Meier (K-M) curves for survival probabilities by clusters derived from iClusterVB. P value was derived from log-rank tests.

We applied the iClusterVB to identify AML subtypes with the maximum number of clusters $K_{max} = 10$ . The iClusterVB identified five non-empty clusters, with a size of 16 (9%), 66 (39%), 24 (14%), 39 (23%), and 25 (15%), respectively. The clustering revealed varying degrees of feature relevance across the data views: 326 out of 500 gene expression features (Figure 3(c)) and one out of 156 mutation features exceeded a posterior inclusion probability threshold of 0.5 (Figure 3(d)), indicating their significance in contributing to the clustering. Indeed, the one identified mutation feature, NPM1, is one of the most frequent driver mutations for AML.⁴⁴ Based on the selected features from each data view, there was a clear separation of gene expression between clusters and a noticeable difference in mutation profiles. The posterior feature inclusion probabilities for each data view are shown in Figure 3(e) and (f). The differences between clusters were also apparent in the survival probability (Figure 3(g), $p = 0.0027$ ; log-rank test). Cluster 1 had the highest survival probability among the five clusters, whereas Cluster 5 had the lowest survival probability.

5.3. Lung cancer data

The third example was from the lung cancer data.⁴⁵ This dataset includes three data views in continuous scale, namely the gene expression data for 12,042 genes (Figure 4(a)), DNA methylation information for 23,074 loci (Figure 4(b)), and miRNA expression profiles for 352 miRNAs (Figure 4(c)), collected from 106 patients with lung cancer. No further preprocessing was performed. The original datasets were sourced from The Cancer Genome Atlas repository, and can be downloaded using the link: https://github.com/evelinag/clusternomics. The goal of this analysis was to identify lung cancer subtypes based on these three data views. Therefore, we ran iClusterVB with a maximum number of clusters $K_{max} = 10$ .

Figure 4.

Analysis results for the lung cancer dataset using the iClusterVB. (a) Heatmap for gene expression data based on all 12,042 features. The row represents all genetic features and the column represents patients ordered by clusters. (b) Heatmap for methylation data by clusters derived from iClusterVB based on all 23,074 features. The row represents all genetic features and the column represents patients ordered by clusters. (c) Heatmap for miRNA expression data by clusters derived from iClusterVB based on all 352 features. The row represents all genetic features and the column represents patients ordered by clusters. (d) Heatmap for gene expression data by clusters derived from iClusterVB based on 438 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (e) Heatmap for methylation data by clusters derived from iClusterVB based on 14,038 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (f) Heatmap for miRNA data by clusters derived from iClusterVB based on 20 features selected by iClusterVB (posterior feature inclusion probability larger than 0.5). The row represents all genetic features and the column represents patients ordered by clusters. (g) Probability of feature inclusion of gene expression. The dashed reference line in red indicates a probability of 0.5. (h) Probability of feature inclusion of methylation feature. The dashed reference line in red indicates a probability of 0.5. (i) Probability of feature inclusion of miRNA expression feature. The dashed reference line in red indicates a probability of 0.5. (j) Kaplan–Meier (K-M) curves for survival probabilities by clusters derived from iClusterVB. P value was derived from log-rank tests.

Among 106 patients included in the current analysis, the iClusterVB identified three non-empty clusters, with a size of 12 (11%), 33 (31%), and 61 (58%), respectively, suggesting heterogeneous subgroups within the patients with lung cancer. Additionally, the model identified 438 informative features out of 12,042 in gene expression data (Figure 4(d)), 14,038 out of 23,074 in DNA methylation data (Figure 4(e)), and 20 out of 352 in miRNA expression data (Figure 4(f)), each surpassing a posterior inclusion probability threshold of 0.5, which clearly showed distinct gene expression, methylation and mutation profiles among the patients. In particular, among the 20 identified miRNA expression features, the five with the highest posterior feature inclusion probability were hsa-mir-154, hsa-mir-3613, hsa-mir-412, hsa-mir-487a, and hsa-mir-495. Indeed, all five are known to be involved in either lung cancer development and progression specifically,⁴⁶ or cancer development and progression generally.⁴⁷ Genes in the first $\sim$ 60% of Cluster 1 (rows) mostly displayed lower expression levels compared to Clusters 2 and 3, while the remaining genes mostly displayed higher expression levels compared to Clusters 2 and 3 (Figure 4(d)). Genes in Clusters 2 and 3 did not show discernible patterns. The DNA methylation heatmap also demonstrated clear clustering patterns (Figure 4(e)). Cluster 1 showed a mixture of high and low methylation levels, whereas Cluster 2 exhibited mostly high methylation levels and Cluster 3 exhibited mostly low methylation levels. For miRNA expression (Figure 4(f)), the clustering suggests predominantly low expression levels in Clusters 1 and 3, and occasionally high miRNA expression levels in Cluster 2. The posterior feature inclusion probabilities for each data view are shown in Figure 4(g) to (i), respectively.

The survival probabilities differed significantly across the three clusters ( $p = 0.0092$ ; logrank test), suggesting that the subtypes identified by the iClusterVB model have distinct survival outcomes (Figure 4(j)). Patients in Cluster 1 had the best survival probability. This cluster, which exhibited distinct gene expression and DNA methylation profiles, appeared to have a more favorable prognosis. In contrast, patients in Clusters 2 and 3 showed poorer survival outcomes, with the majority not surviving past 2000 days. These survival patterns illustrated the potential of molecular profiling in understanding lung cancer heterogeneity and its impact on patient prognosis.

6. Discussion

Recent advances in engineering have enabled researchers to collect a large number of features from different data views on the same set of samples. Integrative analysis offers the opportunity to uncover biological mechanisms across different data views. The iClusterVB accommodates mixed-type (e.g. continuous, categorical, and count) features, allows for feature selection, quantifies feature importance through posterior feature inclusion probabilities and does not require refitting the model to determine the optimal number of clusters. In particular, the posterior feature inclusion probabilities can be used as a measure for feature ranking. While we use omics data as an illustrative example, the iClusterVB may also be applied to clustering other types of data, such as electronic health records.⁴⁸

Although existing integrative clustering methods are powerful, they are not designed to simultaneously and fully address the major practical challenges in real data applications, including selecting the appropriate number of clusters, identifying the features that drive clustering, quantifying uncertainty in feature selection, and ensuring computational efficiency. For example, MOFA, CIMLR, and SNF are computationally efficient methods that allow integrative clustering of data views of different types, however, they do not offer built-in feature selection to identify features nor quantify the feature selection uncertainty. Of note, while MOFA produces interpretable feature loadings, it does not perform explicit feature selection or provide uncertainty quantification of selected features in the context of integrative clustering. While iClusterBayes and iCluster+ are effective methods for integrative clustering and feature selection, they are computationally demanding, particularly when applied to data of a large sample size. VarSelLCM is a computationally efficient method for integrative clustering and feature selection. However, its feature selection performance was sub-optimal based on our simulation studies. The proposed iClusterVB offers an efficient and effective alternative method for multi-view data of different types (e.g. continuous, categorical, and count). It quantifies the uncertainty of feature selection through posterior inclusion probabilities and provides a faster alternative to other model-based methods such as iClusterBayes and iCluster+. Furthermore, iClusterVB allows automatic selection of the number of clusters, which does not require re-fitting the model to identify the optimal number of clusters. In our simulation study, we demonstrated that the iClusterVB yielded strong performance in determining the true number of clusters, identifying the cluster membership, selecting the important features and efficient computation.

To the best of our knowledge, existing R packages that allow implementing clustering via the variational Bayes approach are sparse. Notably, BayesLCA⁴⁹ can be used for clustering binary variables but cannot apply to other types of data (e.g. continuous) nor perform feature selection. To achieve computational efficiency, the iClusterVB package is written in C++ and seamlessly integrated into R by using the Rcpp and RcppArmadillo packages. The newly developed R package can be used for continuous, categorical and count data, under Gaussian, multinomial and Poisson distributions, respectively. A tutorial on implementing iClusterVB using the iClusterVB package is provided in Section C of the Supplemental Material.

While iClusterVB offers significant advantages for clustering and feature selection in high-dimensional datasets, it has notable limitations. One key limitation is its reliance on the assumption of normality for continuous features, which may not hold in certain applications. Our simulation studies suggest that when the clusters are well-separated, iClusterVB is robust and able to recover the true clusters and relevant features even when the features are correlated or the distribution assumption is violated. However, if the cluster is poorly separated, iClusterVB may fail to identify the correct number of clusters under a strong correlation between features, or when the distribution assumption is violated. Additionally, compared to MCMC and Gibbs sampling techniques, VB algorithms are prone to underestimating uncertainty.²⁷ However, if the primary interest is clustering rather than parameter estimation, this limitation may be less relevant. Finally, it should be noted that the algorithm could be sensitive to the initial cluster allocation. Thus, running the model with multiple sets of initial values is recommended.

Missing data and zero inflation are common challenges in multi-omics data integration. Missing values can arise from poor tissue quality, technical limitations, or subject dropout. Many existing integrative clustering methods require complete data, leading researchers to discard omics features or samples with missing values. This approach may introduce bias and reduce statistical power. To address this, several methods have been developed that explicitly model missing data, such as joint-imputation strategies and optimization-masking techniques.⁵⁰ Another frequent issue is zero inflation, which refers to an excess of zeros in the data. This type of data often results from biological entities being present at levels below detection thresholds. Statistical models to account for zero inflation in omics data are an active area of development, and methods such as Bayesian model selection have been proposed to model this type of data.⁵¹ Future versions of iClusterVB aim to incorporate efficient strategies for handling both missing data and zero inflation to improve its robustness and flexibility.

Additionally, future studies comparing iClusterVB to other integrative clustering approaches would be of great interest.⁵² Also, the iClusterVB could be further improved using sparse Bayesian variational inference via coreset^53,54 and stochastic variational inference⁵⁵ for analyzing data views with massive sample size and ultra-high dimensional features. Moreover, as an alternative to the feature selection property in iClusterVB, a prior distribution such as spike-and-slab and shrinkage priors⁵⁶ may be considered to induce sparsity. Finally, the variational Bayes method for analyzing longitudinal data⁵⁷ is emerging, and extending the iClusterVB to clustering longitudinal data would be of interest, for example, in the context of multivariate,^58–60 multi-source,⁶¹ or high-dimensional longitudinal data modeling.⁶²

7. Conclusion

The iClusterVB is a flexible approach for fast integrative clustering and feature selection in high-dimensional multi-view data. Its key strengths include clustering of mixed-type features, feature selection and automatic determination of the optimal number of clusters, making it a useful tool for addressing practical challenges in integrative clustering.

8. Software

The newly developed iClusterVB package is available from The Comprehensive R Archive Network and the user may install the package using install.packages(“iClusterVB”). A tutorial on how to use the iClusterVB package is available in Section C of the Supplemental Material.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802251406584 - Supplemental material for A fast integrative clustering and feature selection approach for high-dimensional multiview data

Supplemental material, sj-pdf-1-smm-10.1177_09622802251406584 for A fast integrative clustering and feature selection approach for high-dimensional multiview data by Abdalkarim Alnajjar, Helen Bian and Zihang Lu in Statistical Methods in Medical Research

Footnotes

ORCID iDs

Abdalkarim Alnajjar

Zihang Lu

Funding

The authors disclosed receipt of the following financial support for the research,authorship,and/or publication of this article: Abdalkarim Alnajjar was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC): Canada Graduate Scholarships — Master's program award. Zihang Lu is supported by a Discovery Grant funded by NSERC. The funders had no role in study design,data collection and analysis,decision to publish,or preparation of the manuscript.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research,authorship,and/or publication of this article.

Data availability statement

Glioblastoma multiforme dataset: The Glioblastoma multiforme data used in the current study are available through the iClusterPlus package in R . Acute myeloid leukemia dataset: The acute myeloid leukemia dataset used in the current study is available through the iClusterVB package and can be loaded using data(‘‘laml") . Lung cancer dataset: The lung cancer dataset used in the illustration of the iClusterVB algorithm was sourced from a previous study 45 and can be accessed through the following link:

Supplemental material

Supplemental material for this article is available online.

References

Wang

. Identification of gastric cancer subtypes based on pathway clustering. npj Precision Oncol 2021; 5: 46.

Gatt

Ahmadiankalati

Voutsas

, et al. Identification of obstructive sleep apnea in children with obesity: a cluster analysis approach. Pediatr Pulmonol 2023; 59: 81–88.

Soussi

Ahmadiankalati

Jentzer

, et al. Clinical phenotypes of cardiogenic shock survivors: insights into late host responses and long-term outcomes. ESC Heart Fail 2023; 11: 1242–1248.

MacQueen

. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, 1967. Oakland, CA, USA, pp.281–297. https://api.semanticscholar.org/CorpusID:6278891.

Ward Jr

. Hierarchical grouping to optimize an objective function. J Am Stat Assoc 1963; 58: 236–244.

Banfield

Raftery

. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993; 49: 803–821.DOI: 10.2307/2532201

Fraley

Raftery

. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 2002; 97: 611–631.

Wunsch

. Survey of clustering algorithms. IEEE Trans Neural Netw 2005; 16: 645–678.

Rodriguez

Comin

Casanova

, et al. Clustering algorithms: a comparative approach. PLoS ONE 2019; 14: e0210236.

10.

Gao

Dwyer

Zhu

, et al. An overview of clustering methods with guidelines for application in mental health research. Psychiatry Res 2023; 237: 115265.

11.

Jaeger

Banks

. Cluster analysis: a modern statistical review. Wiley Interdiscipl Rev: Comput Stat 2023; 15: e1597.

12.

Zitnik

Nguyen

Wang

, et al. Machine learning for integrating data in biology and medicine: principles, practice, and opportunities. Informat Fusion 2019; 50: 71–91.

13.

Richardson

Tseng

Sun

. Statistical methods in integrative genomics. Annu Rev Stat Appl 2016; 3: 181–209.

14.

Zhang

Zhou

, et al. Integrative clustering methods for multi-omics data. Wiley Interdiscipl Rev: Comput Stat 2022; 14: e1553.

15.

Shen

Olshen

Ladanyi

. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 2009; 25: 2906–2912.

16.

Shen

Wang

. Sparse integrative clustering of multiple omics data sets. Ann Appl Stat 2013; 7: 269–294.

17.

Wang

Mezlini

Demir

, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 2014; 11: 333.

18.

Lock

Hoadley

Marron

, et al. Joint and individual variation explained for integrated analysis of multiple data types. Ann Appl Stat 2013; 7: 523.

19.

Zhang

Ouyang

Zhao

. A statistical framework for data integration through graphical models with application to cancer genomics. Ann Appl Stat 2017; 11: 161.

20.

Shen

Guo

, et al. A fully bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics 2018; 19: 71–86.

21.

Kirk

Griffin

Savage

, et al. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics 2012; 28: 3290–3297.

22.

Lock

Dunson

. Bayesian consensus clustering. Bioinformatics 2013; 29: 2610–2616.

23.

Teschendorff

Wang

Barbosa-Morais

, et al. A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics 2005; 21: 3025–3033.

24.

Constantinopoulos

Titsias

Likas

. Bayesian feature and model selection for Gaussian mixture models. IEEE Trans Pattern Anal Mach Intell 2006; 28: 1013–1018.

25.

Dong

Hua

. Simultaneous localized feature selection and model detection for Gaussian mixtures. IEEE Trans Pattern Anal Mach Intell 2008; 31: 953–960.

26.

McLachlan

Lee

Rathnayake

. Finite mixture models. Ann Rev Stat Appl 2019; 6(1): 355–378.

27.

Blei

Kucukelbir

McAuliffe

. Variational inference: a review for statisticians. J Am Stat Assoc 2017; 112: 859–877.

28.

Bishop

Nasrabadi

. Pattern recognition and machine learning. Vol. 4. Springer, 2006.

29.

Corduneanu

Bishop

. Variational Bayesian model selection for mixture distributions. In: Artificial intelligence and statistics, Vol. 2001. Morgan Kaufmann Waltham, MA, pp.27–34.

30.

Marbac

Sedki

. Variable selection for model-based clustering using the integrated complete-data likelihood. Stat Comput 2017; 27: 1049–1063.

31.

Argelaguet

Velten

Arnol

, et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol 2018; 14: e8124.

32.

Ramazzotti

Lal

Wang

, et al. Multi-omic tumor data reveal diversity of molecular mechanisms that correlate with survival. Nat Commun 2018; 9: 1–14.

33.

Hubert

Arabie

. Comparing partitions. J Classif 1985; 2: 193–218.

34.

Klockow

Zhang

, et al. Glioblastoma multiforme (GBM): an overview of current therapies and mechanisms of resistance. Pharmacol Res 2021; 171: 105780.

35.

TCGA. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008; 455: 1061–1068.

36.

Verhaak

Hoadley

Purdom

, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1. Cancer Cell 2010; 17: 98–110.

37.

Huan

Bao

Chen

, et al. MicroRNA-127-5p targets the biliverdin reductase b/nuclear factor-

κ

B pathway to suppress cell growth in hepatocellular carcinoma cells. Cancer Sci 2016; 107: 258–266.

38.

Ramberg

Richardsen

de Souza

, et al. Proteomic analyses identify major vault protein as a prognostic biomarker for fatal prostate cancer. Carcinogenesis 2021; 42: 685–693.

39.

Anghel

Correa-Rocha

Budinska

, et al. Breast cancer suppressor candidate-1 (BCSC-1) is a melanoma tumor suppressor that down regulates MITF. Pigment Cell Melanoma Res 2012; 25: 482–487.

40.

Sumei

Xiangyun

Fenrong

, et al. Hypermethylation of DHRS3 as a novel tumor suppressor involved in tumor growth and prognosis in gastric cancer. Front Cell Dev Biol 2021; 9: 624871.

41.

Faller

Rafferty

Hegarty

, et al. Metallothionein 1E is methylated in malignant melanoma and increases sensitivity to cisplatin-induced apoptosis. Melanoma Res 2010; 20: 392–400.

42.

Cerami

Gao

Dogrusoz

, et al. The CBIO cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2012; 2: 401–404.

43.

Zainul Abidin

Westhead

. Flexible model-based clustering of mixed binary and continuous data: application to genetic regulation and cancer. Nucleic Acids Res 2017; 45: e53–e53.

44.

DiNardo

Cortes

. Mutations in AML: prognostic and therapeutic implications. Hematology 2016; 2016: 348–355.

45.

Gabasova

Reid

Wernisch

. Clusternomics: integrative context-dependent clustering for heterogeneous datasets. PLoS Comput Biol 2017; 13: e1005781.

46.

Nazarizadeh

Mohammadi

Alian

, et al. MicroRNA-154: a novel candidate for diagnosis and therapy of human cancers. Onco Targets Ther 2020; 13: 6603–6615.

47.

Ferreira Martins

da Silva Oliveira

Braga Bona

, et al. The emerging role of miRNAs and their clinical implication in biliary tract cancer. Gastroenterol Res Pract 2016; 2016: 9797410.

48.

Hubbard

Huang

Harton

, et al. A bayesian latent class approach for EHR-based phenotyping. Stat Med 2019; 38: 74–87.

49.

White

Murphy

. Bayeslca: an R package for Bayesian latent class analysis. J Stat Softw 2014; 61: 1–28.

50.

Flores

Claborne

Weller

, et al. Missing data in multi-omics integration: recent advances through artificial intelligence. Front Artif Intell 2023; 6: 1098308.

51.

Choi

Chen

Skelly

, et al. Bayesian model selection reveals biological origins of zero inflation in single-cell transcriptomics. Genome Biol 2020; 21: 183.

52.

Pierre-Jean

Deleuze

Le Floch

, et al. Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Brief Bioinformat 2020; 21: 2011–2030.

53.

Feldman

Faulkner

Krause

. Scalable training of mixture models via coresets. In: Proceedings of the 25th International conference on neural information processing systems, 2011. NIPS’11, Red Hook, NY, USA: Curran Associates Inc, pp.2142–2150. ISBN 9781618395993.

54.

Campbell

Beronov

. Sparse variational inference: Bayesian coresets from scratch. Red Hook, NY, USA: Curran Associates Inc., 2019.

55.

Hoffman

Blei

Wang

, et al. Stochastic variational inference. J Mach Learn Res 2013; 14: 1303–1347.

56.

Lou

. Bayesian approaches to variable selection: a comparative study from practical perspectives. Int J Biostat 2021; 18: 83–108.

57.

Hughes

García-Fiñana

Wand

. Fast approximate inference for multivariate longitudinal data. Biostatistics 2021; 24: 117–192.

58.

Lou

. Bayesian approaches to variable selection in mixture models with application to disease clustering. J Appl Stat 2021; 50: 1–21.

59.

Lou

. Bayesian consensus clustering for multivariate longitudinal data. Stat Med 2022; 41: 108–127.

60.

Ahmadiankalati

Tan

. Joint clustering multiple longitudinal features: a comparison of methods and software packages with practical guidance. Stat Med 2023; 49: 1–28.

61.

Subbarao

Lou

. A bayesian latent class model for integrating multi-source longitudinal data: application to the child cohort study. J R Stat Soc Ser C: Appl Stat 2023; 73: 398–419.

62.

Chandra

. A sparse factor model for clustering high-dimensional longitudinal data. Stat Med 2024; 43: 3633–3648.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

2.42 MB