Sage Journals: Discover world-class research

Abstract

Background

Multiple myeloma (MM) is a malignancy characterized by abnormal plasma cell proliferation. While bortezomib has improved outcomes, significant individual variability persists. Accurate early prediction of patient progression is crucial for optimizing therapeutic intensity and improving long-term survival. Developing an automated, multimodal prediction model can provide clinicians with a robust tool for personalized prognosis, thereby reducing the burden of ineffective treatments on patients.

Methods

We enrolled 207 newly diagnosed MM (NDMM) patients treated with bortezomib. Based on 2-year outcomes, patients were categorized into progression and non-progression groups. Bone marrow smear images, electrophoresis images, and baseline clinical data were used to train a multimodal ensemble learning model. Neural networks were employed for image feature extraction—ResNet and MobileNet for bone marrow smears; VGG16 and DenseNet for electrophoresis images. Clinical features were selected using LASSO and modeled with Random Forest and Logistic Regression. The best-performing models from each modality were integrated using a soft voting ensemble strategy.

Results

The ensemble model outperformed all single-modality models (area under the curve (AUC): 0.8180, Accuracy: 0.7000). Among single modalities, electrophoresis image-based models performed best—VGG16 achieved the highest accuracy (AUC: 0.8082, Accuracy: 0.7000), and DenseNet showed the highest AUC (0.8088, Accuracy: 0.6200). ResNet was optimal for bone marrow smears (AUC: 0.7295, Accuracy: 0.5800), while Logistic Regression led clinical data performance (AUC: 0.6779, Accuracy: 0.6800).

Conclusion

This multimodal ensemble model effectively predicts MM progression by integrating diverse diagnostic data. By enabling earlier identification of high-risk patients, this model serves as a practical decision-support tool for clinicians to tailor personalized treatment strategies.

Keywords

Multiple myeloma, bortezomib, ensemble learning, multi-modal

Introduction

Multiple myeloma (MM) is characterized by the uncontrolled proliferation of plasma cells in the bone marrow, which secrete large quantities of non-functional monoclonal antibodies (M protein), thereby damaging related organs or tissues.¹ MM is the second most common hematologic malignancy, and over the past three decades, its incidence has doubled while mortality has increased by 1.5 times, posing a considerable societal burden in China.² Historically, treatment options for MM were limited, with a median overall survival (OS) of only 2–3 years.³ In recent years, the introduction of novel agents such as bortezomib, carfilzomib, and lenalidomide has extended the OS of MM patients to 7–10 years or more,⁴ though most cases of MM remain incurable. As a potent proteasome inhibitor, bortezomib disrupts protein homeostasis by blocking the 26S proteasome.⁵ Since MM cells are highly sensitive to the accumulation of misfolded proteins due to their intensive M-protein synthesis, this inhibition triggers cellular stress and apoptosis, effectively delaying disease progression.¹ However, the biological heterogeneity of MM leads to varied responses to treatment.

Despite the availability of a wide range of treatment options, new challenges have emerged. Studies have shown that in standard-risk newly diagnosed multiple myeloma (NDMM) patients, the triplet regimen based on carfilzomib (carfilzomib + lenalidomide + dexamethasone, KRD) offers no progression-free survival benefit compared to VRD (bortezomib + lenalidomide + dexamethasone).⁶ Traditional staging systems are unable to guide first-line risk-adapted treatment decisions for individual patients, and there is a growing consensus that incorporating response prediction into frontline treatment planning is essential.^7,8

In recent years, the integration of artificial intelligence (AI) with medical imaging, laboratory testing, and pathology has significantly improved the prediction of clinical disease progression, offering great potential for precision medicine.^9,10 In addition to oncology, diverse predictive modeling techniques have shown robust performance in other medical domains, such as the use of Vision Transformers for pneumonia detection and secure pattern recognition in multimodal cardiac monitoring systems.^11,12 However, most studies have been based on gene expression data,^13,14 which require substantial cost and time, and are currently limited to clinical trials conducted in research centers, making widespread application in routine clinical practice challenging. Model-based real-world evidence is gaining increasing attention as an alternative to traditional randomized clinical trials.^4,15,16

Park et al.⁴ developed a machine learning model that utilized baseline data obtained during the diagnostic process to predict OS or treatment response in transplant-ineligible NDMM patients receiving either VMP (bortezomib + melphalan + prednisone) or RD (lenalidomide + dexamethasone), enabling treatment-specific risk stratification. Therefore, constructing machine learning models based on clinical data at diagnosis can assist NDMM patients in predicting disease progression under different treatment regimens; however, existing models still lack comprehensive evaluation.

The heterogeneity of MM depends on numerous factors, and baseline data, serum protein electrophoresis, and bone marrow smears from newly diagnosed patients can reflect different characteristics of MM to varying degrees. Deep learning, which simulates the human brain by constructing deep neural networks, has been widely applied in the field of image recognition. However, its application in MM progression prediction is still in its early stages.

In this study, we propose a predictive model based on multimodal ensemble learning, aiming to improve the accuracy and robustness of MM progression prediction and to better guide clinical decision-making for bortezomib-based treatment. This model integrates bone marrow smears, electrophoresis images, and baseline clinical data collected prior to treatment, leveraging the strengths of each modality through multimodal data fusion (Figure 1). For image feature extraction, we employed several advanced neural network models, such as ResNet and MobileNet for bone marrow smears, and VGG16 and DenseNet for electrophoresis images. These deep neural networks can automatically extract rich and discriminative features from raw images, capture complex structures and subtle variations, and ensure the quality and diversity of feature representations. For ensemble learning, we adopted the Soft Voting technique to integrate the best-performing models from different modalities, fully leveraging the strengths of each model while minimizing the risks of bias and overfitting associated with single models, thereby enhancing overall predictive performance. Through multimodal data fusion and an optimized ensemble learning strategy, our model demonstrates excellent performance in handling complex data and provides an efficient and reliable tool for predicting MM progression.

Our study makes the following primary contributions:

First, we developed a multimodal framework that integrates bone marrow smears, electrophoresis images, and clinical data. This approach captures a more comprehensive disease profile than traditional single-modality methods.

Second, we implemented an optimized ensemble learning strategy using multiple deep neural networks. This architecture significantly improves the accuracy and robustness of progression prediction, achieving a superior area under the curve (AUC) of 0.8180.

Finally, by utilizing routine diagnostic data, our model offers a practical, cost-effective decision-support tool for clinicians to personalize treatment strategies for NDMM patients in real-world settings.

The remainder of this paper is organized as follows. Section 2 describes the patient enrollment, variables, and multimodal data preprocessing, followed by a detailed explanation of the progression prediction models based on deep learning, machine learning, and ensemble strategies. Section 3 presents the experimental setup, demographic statistics, and a comprehensive validation of the models across different modalities. Section 4 provides a discussion on the clinical implications and findings of the study. Finally, Section 5 concludes the paper.

Methods

Patient enrollment

This was a retrospective, non-interventional, single-center observational study that enrolled 247 cases with relatively complete information from the database of our research group. These patients were treated at the Department of Hematology, Beijing Chaoyang Hospital, Capital Medical University, between January 2017 and December 2023. The study was approved by the Ethics Committee of Beijing Chaoyang Hospital, Capital Medical University (Ethics Approval Number: 2024-ke-850). Our study was conducted in accordance with the Declaration of Helsinki.

Inclusion Criteria: (1) Confirmed diagnosis of MM; (2) Received at least two cycles of bortezomib-based induction therapy; (3) Availability of complete baseline clinical data, immunofixation electrophoresis images, and bone marrow smear images; (4) Documented 2-year follow-up data. Exclusion Criteria: (1) Presence of other concurrent primary malignancies; (2) Smoldering MM; (3) Inadequate image quality that precluded feature extraction. After applying these criteria, 40 patients were excluded (mostly due to loss to follow-up), resulting in a final cohort of 207 patients.

Variables

The clinical characteristics of patients at the time of diagnosis included age, sex, WBC, HGB, PLT, ALB, GLB, LDH, Cr, calcium calibration, β2-microglobulin (β2-MG), M protein type, M protein levels, 24-h urinary light chain, and bone marrow plasma cell percentage. MM staging systems included the Durie-Salmon (DS) staging, International Staging System (ISS), and the Revised International Staging System (R-ISS). Cytogenetic risk factors (1q21, 17p, t(14;16), t(4;14), t(11;14)) were determined using FISH on bone marrow cells.

Two senior morphological experts annotated the myeloma cells in bone marrow images of each NDMM patient. The “Segment Anything” model was then used to segment the myeloma cells and construct the dataset.

According to the 2016 IMWG response criteria, the time to first disease progression during the 2-year follow-up was recorded. Patients were categorized into a progression group (PD) if disease progression occurred within this period, and a non-progression group (Non-PD) if it did not.

Multimodal data preprocessing

Image Preprocessing: To ensure consistent feature extraction across different modalities, we implemented a standardized preprocessing pipeline for both bone marrow smears and electrophoresis images. First, all raw images were resized to a uniform resolution of 224 × 224 pixels to meet the input requirements of the deep learning backbones (ResNet, MobileNet, VGG16, and DenseNet). We then applied median filtering (kernel size = 3) and Gaussian filtering (sigma = 1.0) to remove impulse noise and artifacts while preserving the morphological details of plasma cells and protein bands. To accelerate convergence during training, pixel intensity normalization was performed, scaling values to a range of [0, 1] (or standardized using the mean and standard deviation of the ImageNet dataset).
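As an illustration of two of these steps, the following minimal sketch applies a 3 × 3 median filter and pixel normalization in plain Python. It is not the authors' code: the function names are hypothetical, edge handling by pixel replication is an assumption, and resizing and Gaussian filtering are omitted for brevity. The ImageNet mean and standard deviation are the values commonly used with pretrained backbones.

```python
from statistics import median

# ImageNet channel statistics (RGB) commonly used with pretrained backbones.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def median_filter_3x3(channel):
    """Apply a 3x3 median filter to one channel (list of rows).

    Edge pixels are handled by replicating the border (an assumption).
    """
    h, w = len(channel), len(channel[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                channel[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(median(window))
        out.append(row)
    return out


def normalize_pixel(rgb, use_imagenet_stats=False):
    """Scale an 8-bit RGB pixel to [0, 1]; optionally standardize per channel."""
    scaled = [v / 255.0 for v in rgb]
    if use_imagenet_stats:
        return [(s - m) / sd for s, m, sd in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]
    return scaled


# A tiny 3x3 channel with one impulse-noise pixel (255) in the centre.
noisy = [[10, 12, 11], [13, 255, 12], [11, 10, 12]]
denoised = median_filter_3x3(noisy)
```

In this toy input the isolated bright pixel is replaced by the median of its neighbourhood, which is the behaviour that makes median filtering suitable for impulse noise.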

Data Augmentation: To address the potential imbalance between progression and non-progression groups and to prevent overfitting, offline and online data augmentation techniques were employed. These included random horizontal and vertical flipping, rotations within a range of 30 degrees, and random scaling (0.8–1.2). Furthermore, brightness and contrast adjustments were applied to simulate variations in staining and lighting conditions across different batches. These techniques effectively expanded the diversity of our training set, enhancing the model's generalization ability across heterogeneous clinical images.
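The augmentation step can be sketched as random parameter sampling followed by simple array operations. The example below is illustrative only: the flip probability of 0.5, the symmetric rotation range, and the brightness range are assumptions not stated in the text, and the rotation and scaling parameters are only sampled here (applying them would normally be delegated to an image library).

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters."""
    return {
        "hflip": rng.random() < 0.5,             # assumed flip probability
        "vflip": rng.random() < 0.5,             # assumed flip probability
        "angle_deg": rng.uniform(-30.0, 30.0),   # assumed symmetric rotation range
        "scale": rng.uniform(0.8, 1.2),          # scaling range given in the text
        "brightness": rng.uniform(0.9, 1.1),     # hypothetical brightness range
    }


def apply_flips_and_brightness(image, params):
    """Apply flips and a brightness factor to a grayscale image in [0, 1]."""
    out = [row[:] for row in image]
    if params["hflip"]:
        out = [row[::-1] for row in out]   # mirror left-right
    if params["vflip"]:
        out = out[::-1]                    # mirror top-bottom
    # Multiply by the brightness factor and clip back into [0, 1].
    return [[min(1.0, max(0.0, v * params["brightness"])) for v in row] for row in out]


rng = random.Random(0)
params = sample_augmentation(rng)
augmented = apply_flips_and_brightness([[0.1, 0.2], [0.3, 0.4]], params)
```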

MM disease progression prediction model construction

Deep learning-based progression prediction model

In this study, we introduced various deep learning models to extract high-dimensional features from bone marrow and electrophoresis images for predicting the progression of MM. These models include VGG16, AlexNet, DenseNet, MobileNet, and ResNet. VGG16 is known for its 16-layer deep convolutional structure, which effectively extracts multi-level features from images¹⁷; AlexNet, an early pioneering deep convolutional network, stacks convolutional and fully connected layers¹⁸; DenseNet promotes feature transmission and reuse through densely connected layers, improving the model's learning efficiency¹⁹; MobileNet uses depthwise separable convolutions, significantly reducing computational load²⁰; ResNet introduces residual connections to effectively alleviate the vanishing gradient problem in deep networks.²¹

We conducted experiments on these models and found that ResNet and MobileNet achieved AUC values of 0.7295 and 0.6989, respectively, on the bone marrow smear dataset, outperforming the other three models. On the electrophoresis image dataset, VGG16 and DenseNet achieved AUC values of 0.8082 and 0.8088, respectively, outperforming the other three models. Therefore, for extracting features from bone marrow smear images, we selected ResNet and MobileNet, while for electrophoresis images, we selected VGG16 and DenseNet for feature extraction. The selected models from each modality serve as sub-models for further ensemble learning.

Machine learning-based progression prediction model

In constructing the MM progression prediction model, we selected various algorithms to compare their performance on the specific task, including Random Forest, K-Nearest Neighbors (KNN), Gradient Boosting, Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), XGBoost, Logistic Regression, and Decision Trees. Random Forest improves model accuracy and robustness by building multiple decision trees and aggregating their results. KNN is simple and intuitive, suitable for small datasets or as a baseline model. Gradient Boosting provides a powerful approach for handling regression and classification problems by incrementally adding weak models to minimize the loss function. AdaBoost enhances model performance by adjusting sample weights to prioritize learning from previously misclassified samples. SVM is a supervised learning algorithm used for classification and regression tasks; its main idea is to find an optimal hyperplane that separates data points from different classes while maximizing the margin between them. XGBoost is widely popular for its efficiency and accuracy, improving prediction performance by combining multiple decision trees. We trained and evaluated these machine learning models, among which Logistic Regression and Random Forest achieved AUC values of 0.6779 and 0.6506, respectively, outperforming the other models. Therefore, we selected these two models as sub-models for further ensemble learning.

Ensemble learning-based progression prediction model using bone marrow, electrophoresis, and clinical data

In the ensemble learning disease progression prediction task based on bone marrow smears, electrophoresis images, and clinical baseline data, we selected the two best-performing models from each modality for ensemble learning. For bone marrow image processing, we chose ResNet and MobileNet; for electrophoresis images, we selected VGG16 and DenseNet; and for clinical baseline data, we used Random Forest and Logistic Regression. We adopted a Soft Voting strategy to integrate these sub-models from different modalities, which takes into account the prediction confidence of each model. Specifically, each classifier assigns a probability to each class, and the final combined prediction is the class with the highest total probability. The Soft Voting strategy effectively combines the results from multiple classifiers, is not limited by data modality, and often outperforms the prediction performance of individual models during testing.
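The Soft Voting rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: equal weights, a tie broken in favour of the progression class, and the probability values below are all hypothetical. For a binary task, choosing the class with the highest total probability is equivalent to thresholding the mean predicted probability of progression at 0.5.

```python
def soft_vote(prob_lists, weights=None, threshold=0.5):
    """Fuse per-model P(progression) by (weighted) averaging.

    prob_lists: one list of probabilities per sub-model, aligned by patient.
    Returns the fused probabilities and the resulting 0/1 labels.
    """
    if weights is None:
        weights = [1.0] * len(prob_lists)   # equal weights (assumption)
    total_w = sum(weights)
    n_samples = len(prob_lists[0])
    fused = [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total_w
        for i in range(n_samples)
    ]
    labels = [1 if p >= threshold else 0 for p in fused]
    return fused, labels


# Hypothetical P(progression) for three patients from the six sub-models.
model_probs = {
    "ResNet (smear)": [0.70, 0.40, 0.60],
    "MobileNet (smear)": [0.60, 0.30, 0.40],
    "VGG16 (electrophoresis)": [0.80, 0.20, 0.55],
    "DenseNet (electrophoresis)": [0.90, 0.10, 0.45],
    "Random Forest (clinical)": [0.40, 0.60, 0.30],
    "Logistic Regression (clinical)": [0.50, 0.50, 0.30],
}
fused, labels = soft_vote(list(model_probs.values()))
```

Because the fusion operates on predicted probabilities only, sub-models trained on images and on tabular data can be combined without sharing a feature space.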

Statistical analysis

Evaluation metrics play a crucial role in assessing the performance of machine learning models. These measures are essential for objectively evaluating model performance and guiding their development and improvement. In this experiment, the following metrics are used to measure the effectiveness of each model. The relationship between True Positives (TP), False Negatives (FN), True Negatives (TN), and False Positives (FP) is used to obtain metrics such as ACC, Precision, Recall, F1, and AUC.

Accuracy: The percentage of correctly predicted results out of the total samples. $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Precision: The probability that a sample predicted as positive is actually positive. $\mathrm{Precision} = \frac{TP}{TP + FP}$

Recall: The probability that a sample, which is actually positive, is predicted as positive. $\mathrm{Recall} = \frac{TP}{TP + FN}$

F1: The harmonic mean of Precision and Recall, balancing the performance of both. $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
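As a worked example, the sketch below computes the four metrics from confusion-matrix counts, with Recall taken as TP / (TP + FN). The counts used (TP = 9, FP = 1, FN = 14, TN = 26) are not reported in the paper; they are back-calculated as one set of integers consistent with the ensemble row of Table 5, and the fact that they sum to 50 samples is an inference rather than a reported figure.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # guard: no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # guard: no positive samples
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Inferred counts (assumption), chosen to be consistent with the ensemble row of Table 5.
m = classification_metrics(tp=9, fp=1, fn=14, tn=26)
rounded = {k: round(v, 4) for k, v in m.items()}
```

With these counts the rounded values are 0.7000, 0.9000, 0.3913, and 0.5455. The zero-division guards also explain rows such as SVM in Table 4, where no sample is predicted positive and Precision, Recall, and F1 are all reported as 0.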

The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate as the classification threshold is varied, and is used to evaluate the predictive ability of a model. AUC represents the area under the ROC curve and is also an important criterion used in this paper to assess model performance.
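AUC can also be read as the probability that a randomly chosen progression case receives a higher score than a randomly chosen non-progression case. The sketch below computes it directly from that pairwise definition; the labels and scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, y_score):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = progression) and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.90, 0.40, 0.35, 0.80, 0.70, 0.10]
auc = roc_auc(y_true, y_score)   # 7 of the 9 positive-negative pairs are ordered correctly
```

Unlike accuracy, this quantity does not depend on a decision threshold, which is why a model can rank first by AUC while having a low accuracy or Recall at the default threshold.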

Results

Experimental setup

In the experiment, we conducted a detailed setup to evaluate the performance of our proposed model (SOTA) in the MM progression prediction task. First, we divided the dataset into a training set and a testing set, with 80% used for training and 20% for testing, ensuring that the model is adequately trained while also being accurately evaluated. The experiment was conducted on a Windows 10 operating system, with hardware configuration including a computer equipped with an AMD Ryzen 9 5900X processor (32 GB RAM) and an NVIDIA GeForce RTX™ 3080 GPU (10 GB RAM). The experiment used Python 3.9 programming language and PyTorch deep learning framework version 1.8.1 for model construction and training.
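An 80/20 split of this kind can be sketched as follows. The source does not state whether the split was stratified by outcome or which random seed was used, so both are assumptions here; the function name is hypothetical, and the class sizes are taken from Table 1 (85 PD, 122 Non-PD).

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test while preserving class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)   # per-class test size
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 1 = progression (PD), 0 = non-progression (Non-PD); class sizes from Table 1.
labels = [1] * 85 + [0] * 122
train_idx, test_idx = stratified_split(labels)
n_test_pd = sum(labels[i] for i in test_idx)
```

Stratifying keeps the PD fraction in the test set close to that of the full cohort, which matters for threshold-dependent metrics such as Precision and Recall.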

Patient demographic data statistics

Table 1 shows the differences between the MM progression group and the non-progression group in terms of demographic and clinical baseline characteristics, including age, gender, staging, biochemical indicators, and immunological markers. The results indicate that positive expression of 1q21, ISS stage III, lower HGB, higher LDH, Cr, and Calibration of Ca, and age (>65) are associated with MM progression. However, patient gender, bone marrow plasma cell ratio, and other indicators are not significantly related to MM progression.

Table 1.

Demographic and clinical characteristics of all patients.

Variables	PD (n = 85)	Non-PD (n = 122)	p-value*
Age (years)			.022*
≤65	49 (57.65%)	89 (72.95%)
>65	36 (42.35%)	33 (27.05%)
Sex			.088
Male	52 (61.18%)	60 (49.18%)
Female	33 (38.82%)	62 (50.82%)
DS staging			.583
I/II	9 (10.59%)	16 (13.11%)
III	76 (89.41%)	106 (86.89%)
ISS staging			.014*
I/II	34 (40.00%)	70 (57.38%)
III	51 (60.00%)	52 (42.62%)
R-ISS staging			.215
I/II	66 (77.65%)	103 (84.43%)
III	19 (22.35%)	19 (15.57%)
WBC (×10^9/L)	5.13	5.08	.342
HGB (g/L)	83	95	.004*
PLT (×10^9/L)	173	187	.050
ALB (g/L)	35.4	36.4	.185
GLB (g/L)	34.5	46.0	.290
LDH (U/L)	181	163	.018*
Cr (μmol/L)	88.3	73.8	.024*
Calibration of Ca (mmol/L)	2.336	2.327	.032*
β₂-MG (mg/L)	6.02	4.57	.087
M protein type			.511
IgG	34 (40.00%)	60 (49.18%)
IgA	18 (21.18%)	24 (19.67%)
IgD	7 (8.23%)	6 (4.92%)
κ light chain	12 (14.12%)	19 (15.57%)
λ light chain	11 (12.94%)	12 (9.84%)
Non-secretory	3 (3.53%)	1 (0.82%)
M protein (g/dL)	1.4	2.4	.274
24 h urine light chain (g)	2.05	1.13	.817
Plasma cells of bone marrow (%)	44	41	.311
1q21			.012*
Negative	35 (41.18%)	72 (59.02%)
Positive	50 (58.82%)	50 (40.98%)
17p			.205
Negative	77 (90.59%)	116 (95.08%)
Positive	8 (9.41%)	6 (4.92%)
t(14;16) IGH/MAF			.374
Negative	80 (94.12%)	119 (97.54%)
Positive	5 (5.88%)	3 (2.46%)
t(4;14) IGH/FGFR3			.505
Negative	68 (80.00%)	102 (83.61%)
Positive	17 (20.00%)	20 (16.39%)
t(11;14) IGH/CCND1			.614
Negative	60 (70.59%)	90 (73.77%)
Positive	25 (29.41%)	32 (26.23%)

Note: Categorical data are presented as counts and percentages; numerical data are reported as medians. p-values for categorical data were calculated with chi-squared tests, and p-values for numerical data with Student's t test.

*p < .05 indicates a significant difference between the datasets.

Abbreviations: PD: progressive disease; WBC: white blood cell count; HGB: hemoglobin; PLT: platelet; ALB: albumin; GLB: globulin; LDH: lactic dehydrogenase; Cr: creatinine; Calibration of Ca: calibration of calcium; β2-MG: β2-microglobulin.
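To illustrate how the categorical p-values in Table 1 are obtained, the sketch below runs a Pearson chi-squared test on the age row (≤65: 49 PD and 89 Non-PD; >65: 36 PD and 33 Non-PD). Whether a continuity correction was applied is not stated in the source; the uncorrected form used here is an assumption, and it gives p ≈ 0.022 for this row, in line with the table.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test without continuity correction for [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: age <=65 and >65; columns: PD and Non-PD (counts from Table 1).
stat, p = chi2_2x2(49, 89, 36, 33)
```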

Progression prediction model validation

Bone marrow smear-based deep learning progression prediction model

To achieve optimal performance, we optimized the parameters of all models and used Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) optimization algorithms. The learning rate was set to 0.00002, and slightly adjusted momentum parameters (beta1 = 0.85, beta2 = 0.995) were used, with 60 training epochs. To accelerate model convergence, we adopted a learning rate decay strategy, reducing the learning rate to 0.1 times its current value at the end of each epoch. The batch size was set to 64, meaning that 64 random samples were selected from the training set for forward and backward propagation in each iteration. Additionally, to further enhance the model's generalization ability and training stability, we introduced L2 regularization and Dropout techniques. This combination of optimization parameters and techniques allowed us to effectively improve the performance of the network model during training.
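The learning rate schedule described above can be written as a simple step decay. In this sketch the function name and the `decay_every` parameter are hypothetical; the source describes decay at the end of each epoch, which corresponds to `decay_every = 1`, with an initial rate of 0.00002 and a factor of 0.1.

```python
def step_decay_lr(initial_lr, epoch, gamma=0.1, decay_every=1):
    """Learning rate in effect during `epoch` (0-based) under step decay.

    The rate is multiplied by `gamma` once every `decay_every` completed epochs.
    """
    return initial_lr * gamma ** (epoch // decay_every)


# Rates for the first four epochs with the settings given in the text.
schedule = [step_decay_lr(2e-5, epoch) for epoch in range(4)]
```

Making the interval a parameter allows the same function to express coarser schedules, for example a tenfold reduction every 20 epochs with `decay_every=20`.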

In this study, we compared the performance of various deep learning models in predicting the progression of MM using bone marrow smears. ResNet achieved the highest AUC (0.7295), with an accuracy of 0.5800, F1 score of 0.6316, Precision of 0.5294, and Recall of 0.7826. DenseNet had an AUC of 0.6942 and the highest accuracy (0.6149), with a Precision of 0.6140 and a Recall of 0.8434, indicating strong performance in identifying positive samples. MobileNet had an AUC of 0.6989 and an accuracy of 0.6000; its Precision was the highest (0.7143), but its Recall (0.2174) and F1 score (0.3333) were the lowest. VGG16 and AlexNet had the lowest AUCs (0.6893 and 0.6354); their Recall values close to 1 (0.9639 and 0.9880) combined with Precision below 0.58 indicate that they assigned nearly all samples to the progression class, with AlexNet in particular nearly failing to discriminate between the classes. Overall, ResNet ranked first by AUC, DenseNet combined high Recall with the highest accuracy, MobileNet was the most conservative classifier, and VGG16 and AlexNet discriminated poorly. Specific results can be found in Table 2 and Figure 2(a).

Figure 1.

The workflow of this study.

Figure 2.

ROC curves of various test models: (a) ROC for the bone marrow smear group, (b) ROC for the electrophoresis image group, and (c) ROC for the clinical baseline indicator group.

Table 2.

Test results of deep learning models based on bone marrow smears.

Model	AUC	Accuracy	F1	Precision	Recall
VGG16	0.6893	0.5811	0.7207	0.5755	0.9639
AlexNet	0.6354	0.5541	0.7130	0.5578	0.9880
DenseNet	0.6942	0.6149	0.7107	0.6140	0.8434
MobileNet	0.6989	0.6000	0.3333	0.7143	0.2174
ResNet	0.7295	0.5800	0.6316	0.5294	0.7826

Progression prediction model based on electrophoresis images and deep learning

In the experiment using electrophoresis images for progression prediction, we employed the same training configuration as for bone marrow smears. DenseNet achieved the highest AUC (0.8088) and the highest Precision (0.8333), but its accuracy (0.6200) and Recall (0.2174) were the lowest of the five models. VGG16 was the most balanced, with an AUC of 0.8082, the highest accuracy (0.7000), an F1 score of 0.6341, a Precision of 0.7222, and a Recall of 0.5652. MobileNet had the lowest AUC (0.7633) but the highest Recall (0.7826) and F1 score (0.6667). Overall, DenseNet ranked first by AUC, while VGG16 offered the best balance across metrics. Specific results can be found in Table 3 and Figure 2(b).

Table 3.

Test results of deep learning models based on electrophoresis images.

Model	AUC	Accuracy	F1	Precision	Recall
VGG16	0.8082	0.7000	0.6341	0.7222	0.5652
AlexNet	0.7971	0.6600	0.4848	0.8000	0.3478
DenseNet	0.8088	0.6200	0.3448	0.8333	0.2174
MobileNet	0.7633	0.6400	0.6667	0.5806	0.7826
ResNet	0.7987	0.6800	0.6190	0.6842	0.5652

Progression prediction model based on clinical baseline indicators and machine learning

In the experiment using clinical indicators for progression prediction, the performance of traditional machine learning models varied. Logistic regression achieved the highest AUC (0.6779), accuracy (0.6800), F1 score (0.5294), and Precision (0.8182), although its Recall was only 0.3913. Random forest achieved an AUC of 0.6506. KNN, AdaBoost, and gradient boosting had similar performance, with AUCs ranging from 0.55 to 0.65. Their F1 scores and Precision were relatively consistent, but their Recall remained below 0.50. SVM had the lowest AUC of just 0.4010 and did not identify any positive samples (Recall of 0.0000). Overall, logistic regression performed best in classifying clinical data, with random forest also demonstrating decent results. Detailed results are shown in Table 4 and Figure 2(c).

Table 4.

Test results of machine learning models based on clinical indicators.

Model	AUC	Accuracy	F1	Precision	Recall
Random Forest	0.6506	0.5600	0.4211	0.5333	0.3478
KNN	0.6127	0.6000	0.5000	0.5882	0.4348
SVM	0.4010	0.5400	0.0000	0.0000	0.0000
AdaBoost	0.5572	0.5200	0.4000	0.4706	0.3478
Logistic Regression	0.6779	0.6800	0.5294	0.8182	0.3913
Gradient Boosting	0.6039	0.6000	0.5238	0.5789	0.4783
Decision Tree	0.5507	0.5600	0.4762	0.5263	0.4348

Multimodal ensemble learning model based on bone marrow smear, electrophoresis images, and clinical baseline indicators

In this study, we applied a Soft Voting strategy to integrate the top-performing models from three modalities (bone marrow smears, electrophoresis images, and clinical baseline indicators). The ensemble model achieved an AUC of 0.8180, an accuracy of 0.7000, an F1 score of 0.5455, and a Recall of 0.3913. Its Precision of 0.9000 was the highest among all models in Table 5, and its AUC exceeded that of every individual sub-model, while its accuracy matched that of the best single model (VGG16).

Overall, by leveraging the complementary strengths of different modalities, the ensemble model demonstrated superior comprehensive performance in predicting the progression of MM. Detailed results are shown in Table 5 and Figure 3.

Figure 3.

Test results of the ensemble learning progression prediction model: (a) DCA curve, (b) calibration curve, (c) ROC curve.

Table 5.

Test results of the ensemble learning progression prediction model.

Model	Accuracy	Recall	Precision	F1	AUC
VGG16 (electrophoresis images)	0.7000	0.5652	0.7222	0.6341	0.8082
DenseNet (electrophoresis images)	0.6200	0.2174	0.8333	0.3448	0.8088
Random Forest (clinical indicators)	0.5600	0.3478	0.5333	0.4211	0.6506
Logistic Regression (clinical indicators)	0.6800	0.3913	0.8182	0.5294	0.6779
MobileNet (bone marrow smears)	0.6000	0.2174	0.7143	0.3333	0.6989
ResNet (bone marrow smears)	0.5800	0.7826	0.5294	0.6316	0.7295
Ensemble Model (multimodal data)	0.7000	0.3913	0.9000	0.5455	0.8180

Discussion

A real-world retrospective study conducted in Spain²² revealed that age, ECOG performance status, ISS stage, serum LDH, GFR, cytogenetics, and treatment regimen significantly affect OS. First-line treatment exhibited high heterogeneity among elderly and high-risk patients (e.g. ECOG PS ≥ 2, ISS stage III, severely impaired GFR, elevated LDH levels, and high-risk cytogenetics). In our study, positive 1q21 expression, ISS stage III, elevated LDH, Cr, and calibration of Ca, reduced HGB, and age (>65) were significantly associated with MM progression, which is largely consistent with the above-mentioned study.

The progression of NDMM is influenced by various factors, such as age, performance status, ASCT, and comorbidities.^23–25 Integrating clinical baseline characteristics, biochemical markers, and gene expression profiles is currently the main strategy for applying machine learning to personalize treatment for MM patients and prolong survival. Orgueira et al.¹³ utilized machine learning to analyze multiple variables, including 46 genes, to predict OS in MM patients receiving six first-line treatment regimens. Patients treated with the optimal drug combination identified by the model had longer OS than those receiving other regimens. Ubels et al.²⁶ designed Simulated Treatment Learning Signatures (STLsig) to predict treatment benefits in NDMM. STLsig identified two gene complexes that could jointly predict favorable responses to bortezomib.

Kubasch et al.²⁷ demonstrated that a machine learning model trained with gradient boosting classification could predict early relapse in NDMM with 73% accuracy using four features, such as the best response in the first year after first-line treatment. Most machine learning-based studies on MM disease progression rely on gene expression data, which requires substantial costs and time, making it difficult to apply widely in clinical practice.²⁸ Furthermore, an excessive number of input variables used to train models can lead to the risk of data overfitting.²⁹ This indicates that the technical demand for machine learning-based research on MM disease progression continues to grow, but it has not yet been fully addressed.

MM exhibits high heterogeneity, and deep learning algorithms can automatically uncover complex nonlinear relationships from heterogeneous data sources, performing excellently in predictive tasks. Deep learning simulates the human brain through deep neural networks for analysis and is widely used in image recognition; however, its application in MM progression prediction is still in its early stages.^30,31 Significant progress has also been made in the application of ensemble learning techniques in the medical AI field.^32,33 Therefore, this study combines images reflecting MM heterogeneity (bone marrow smear, electrophoresis), clinical baseline features, and AI to provide an opportunity to address the aforementioned issues in clinical decision-making.

This study utilizes data from three modalities (bone marrow smears, electrophoresis images, and clinical baseline indicators) and constructs an ensemble learning framework to predict the progression of MM based on bortezomib treatment, using various machine learning and deep learning models. In the single-modality experiments, the VGG16 and DenseNet models for electrophoresis images exhibited the strongest classification performance (AUC of 0.8082 and 0.8088). The ResNet model for bone marrow smears achieved an AUC of 0.7295, demonstrating good classification ability. In the clinical indicator model, logistic regression achieved an AUC of 0.6779, outperforming random forest and other traditional machine learning methods. In the multi-modal ensemble model, the final AUC was improved to 0.8180, with accuracy reaching 0.7000 and Precision as high as 0.9000 through the Soft Voting strategy that combines the optimal models from different modalities. This indicates that the ensemble model outperforms the single-modality models in AUC and Precision while matching the best single-modality accuracy (VGG16, 0.7000).

We attribute the strong performance of our model in predicting MM progression to two factors: the model architecture and the characteristics of the task. In terms of architecture, we adopt a multimodal ensemble learning framework that takes the best model from each of the three modalities (bone marrow smears, electrophoresis images, and baseline clinical indicators) and merges their predictions with the Soft Voting strategy. Bone marrow smears reflect morphological changes in cells, electrophoresis images provide protein expression information, and clinical indicators describe the patient's basic characteristics. Our method exploits the distinct features of each modality, which improves the model's handling of complex data and its prediction accuracy. In addition, the Soft Voting strategy mitigates the bias of individual models and boosts overall performance. This architecture is consistent with the methodological frameworks championed by S. Jafarzadeh et al.³⁴ and S. Anari et al., who have demonstrated that ensemble systems significantly enhance model robustness and generalization in complex oncological tasks. We believe the multimodal ensemble approach outperforms single-modality methods because it closes the “information gap” inherent in single-modality assessments. Whereas previous studies often relied solely on clinical markers (which may lack predictive granularity) or gene expression data (which are static and costly), our method captures the dynamic morphological features of plasma cells and the quantitative protein variations in electrophoresis. By integrating these complementary information streams, the ensemble model can identify subtle progression signals that might otherwise be filtered out as noise in a single-modality network. Specifically, the high precision (0.9000) of our model compared with baseline classifiers suggests that fusing image-based deep features with clinical categorical variables creates a synergistic effect, allowing a more robust characterization of MM heterogeneity.
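
Soft voting, as described above, averages the progression probabilities produced by the best model of each modality and thresholds the result. The following is a minimal sketch, assuming equal weights and a 0.5 threshold (neither is stated here); the model names and probabilities are hypothetical.

```python
def soft_vote(prob_by_model, weights=None, threshold=0.5):
    """Combine per-model progression probabilities by weighted averaging.

    prob_by_model: dict mapping a model name to a list of P(progression),
                   one value per patient, in the same patient order.
    weights:       optional dict of per-model weights (equal if None).
    Returns (averaged probabilities, binary predictions).
    """
    if not prob_by_model:
        raise ValueError("at least one model is required")
    names = list(prob_by_model)
    n_patients = len(prob_by_model[names[0]])
    if any(len(prob_by_model[name]) != n_patients for name in names):
        raise ValueError("all models must score the same patients")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    averaged = [
        sum(weights[name] * prob_by_model[name][i] for name in names) / total
        for i in range(n_patients)
    ]
    predictions = [1 if p >= threshold else 0 for p in averaged]
    return averaged, predictions


# Hypothetical outputs of the three modality-specific models for three patients.
probs = {
    "smear_model": [0.80, 0.40, 0.30],
    "electrophoresis_model": [0.70, 0.65, 0.20],
    "clinical_model": [0.60, 0.30, 0.55],
}
averaged, predictions = soft_vote(probs)
print(predictions)
```

In this toy case the second patient would be flagged by the electrophoresis model alone (0.65) and the third by the clinical model alone (0.55), but neither is flagged after averaging, which illustrates how soft voting can damp the bias of a single model.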

In clinical applications, the interpretability of the model is crucial. Settouti et al.³⁵ applied explainable AI to identify the optimal chemotherapy regimen for MM, using SHAP to provide a global view of feature contributions. To enhance the clinical interpretability of our model, we employed Gradient-weighted Class Activation Mapping (Grad-CAM) for visual analysis. Grad-CAM generates heatmaps showing the image regions the model focuses on when making predictions (Figure 4), providing an intuitive explanation of its decision-making process. In our study, Grad-CAM revealed how the model focused on key regions of the bone marrow smears and electrophoresis images that are associated with specific disease features. For example, it displayed which parts of a bone marrow smear image contributed most to the prediction, helping clinicians understand why the model labeled certain samples as high-risk or low-risk. Grad-CAM visualization thus translates the model's complex decision-making process into more understandable clinical information, which may increase clinicians’ trust in and acceptance of its predictions and make the model more practical in real-world medical settings.

Figure 4.

Grad-CAM visualization results: (a) activation heatmap of the electrophoresis image, (b) activation heatmap of the bone marrow smear.
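
Grad-CAM heatmaps such as those in Figure 4 are obtained by weighting each feature map of a convolutional layer by the spatial mean of the class-score gradient, summing the weighted maps, and applying a ReLU. The following minimal sketch shows only that core step in plain Python, assuming the activations and gradients have already been extracted from the network (in practice through the deep learning framework's hooks); the tiny arrays are invented for illustration, and the final scaling to [0, 1] is a common display convention rather than part of the method itself.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap scaled to [0, 1].

    activations: list of K feature maps, each an H x W list of floats.
    gradients:   gradients of the target class score with respect to
                 those feature maps, in the same K x H x W layout.
    """
    if not activations or len(activations) != len(gradients):
        raise ValueError("activations and gradients must have the same, non-zero K")
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average pooling of each gradient map.
    alphas = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    # Weighted sum of feature maps followed by ReLU.
    cam = [
        [
            max(0.0, sum(a * fmap[i][j] for a, fmap in zip(alphas, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Scale for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two invented 2 x 2 feature maps and their gradients.
acts = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
grads = [
    [[0.5, 0.5], [0.5, 0.5]],      # positive evidence for the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # negative evidence, removed by the ReLU
]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

The resulting low-resolution map is normally upsampled to the input image size and overlaid on the smear or electrophoresis image.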

Failure Case Analysis via Grad-CAM: To provide a transparent and balanced evaluation of our model, we analyzed cases where the ensemble prediction did not align with the ground truth. As illustrated in Figure 5, in some misclassified electrophoresis images, the Grad-CAM heatmaps revealed that the model occasionally focused on the peripheral regions of the gel rather than the specific M-protein bands, likely due to baseline noise or suboptimal image contrast. Similarly, in failed bone marrow smear classifications, the attention was sometimes localized on cell debris rather than the diagnostic plasma cells. These findings suggest that while our ensemble approach mitigates the weakness of single models, the performance is still bounded by the quality of raw visual features. Future improvements will focus on implementing more robust attention mechanisms to suppress background noise and integrating molecular-level features to handle highly heterogeneous cases.

Figure 5.

Grad-CAM visualization of representative failure cases. (a) Misclassified electrophoresis images showing attention focused on gel edges or background noise rather than M-protein bands. (b) Failed bone marrow smear cases where the model localized on cellular debris or staining artifacts instead of diagnostic plasma cells.

The optimal selection of the initial treatment regimen is crucial for the prognosis of NDMM.²⁴ To date, the choice of MM treatment regimen has largely depended on the judgment of clinicians. Given the high heterogeneity of MM and the increasing complexity of treatment regimens, this process is becoming cumbersome and inefficient. AI can simplify this task and make clinical work more efficient. We therefore developed an integrated algorithm based on deep learning and machine learning that uses multimodal data (patient bone marrow smear images, immunofixation electrophoresis images, and baseline characteristics). The model predicts whether NDMM patients receiving bortezomib-based first-line treatment will progress within 2 years, giving a preliminary classification as progression (PD) or non-progression (Non-PD). From a clinical perspective, this tool provides an objective, standardized risk-stratification framework that can help clinicians identify high-risk patients earlier. For patients predicted to progress, clinicians might consider increasing the frequency of follow-up or adjusting therapeutic strategies earlier. Moreover, because our model relies on routine diagnostic images rather than expensive molecular testing, it offers a cost-effective route to precision medicine, especially in hospitals where advanced genetic resources are limited. This reduces the overall economic burden and helps prevent potential over-treatment, laying a foundation for more accessible, individualized chemotherapy for MM.

This study has some limitations. First, the validation dataset is small, so larger datasets are needed to confirm the results. Second, the model does not recommend specific treatment options for NDMM patients, so clinicians cannot use it alone to develop a precise treatment plan for each patient. In the future, treatment options could be further categorized, with efficacy assessed after first-line induction therapy or ASCT, which would better guide personalized MM therapy. Third, external validation is still needed. Finally, the real-world clinical data we used contain noise and missing features, which are unavoidable in such data; nevertheless, the machine learning model developed in this study achieved acceptable classification performance on these noisy data.

Conclusion

This study predicts the progression of MM under bortezomib treatment using a multimodal ensemble learning model. Clinicians can use the model's predictions to inform treatment strategies, which may reduce patient suffering and unnecessary economic burden and improve quality of life. The model shows promise for predicting MM progression and lays a foundation for future research on personalized treatment of MM.

Footnotes

Author note

AI Disclosure: The authors used ChatGPT in order to check grammar mistakes. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the final version of the manuscript.

ORCID iDs

Sha Li

Boyang Zang

Yantian Zhao

Hong Zong

Hong Huo

Chuanying Geng

Ethics approval and informed consent

All procedures involving human participants were approved by the Beijing Chao-Yang Hospital Ethics Committee, which waived the requirement of informed consent from subjects enrolled in this study (Ethics Approval Number: 2024-ke-850).

Consent for publication

All authors have approved the manuscript for publication.

Authors’ contributions

Sha Li designed the research, collected the data and images, analyzed the results, and wrote and edited the manuscript. Boyang Zang constructed the model and analyzed the results. Jing Jia designed the research. Yantian Zhao, Hong Zong, and Hong Huo collected images. Chuanying Geng designed the research, collected the data, and reviewed the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The data analyzed in the current study are available from the corresponding author upon reasonable request.

References

1. Malard, Neri, Bahlis, et al. Multiple myeloma. Nat Rev Dis Primers 2024; 10: 45.

2. Liu, et al. Burden of multiple myeloma in China: an analysis of the global burden of disease, injuries, and risk factors study 2019. Chin Med J (Engl) 2023; 136: 2834–2838.

3. Cowan, Green, Kwok, et al. Diagnosis and management of multiple myeloma: a review. JAMA 2022; 327: 464–477.

4. Park, Lee, Byun, et al. ML-based sequential analysis to assist selection between VMP and RD for newly diagnosed multiple myeloma. NPJ Precis Oncol 2023; 7: 46.

5. Sogbein, Paul, Umar, et al. Bortezomib in cancer therapy: mechanisms, side effects, and future proteasome inhibitors. Life Sci 2024; 358: 123125.

6. Kumar, Jacobus, Cohen, et al. Carfilzomib or bortezomib in combination with lenalidomide and dexamethasone for patients with newly diagnosed multiple myeloma without intention for immediate autologous stem-cell transplantation (ENDURANCE): a multicentre, open-label, phase 3, randomised, controlled trial. Lancet Oncol 2020; 21: 1317–1330.

7. Corre, Montes, Martin, et al. Early relapse after autologous transplant for myeloma is associated with poor survival regardless of cytogenetic risk. Haematologica 2020; 105: e480–e483.

8. Liang, Zhou, et al. Dissecting the high-risk property of 1q gain/amplification in patients with newly diagnosed multiple myeloma. Am J Cancer Res 2025; 15: 501–516.

9. Gonzalez, Nejat, Saha, et al. Performance of externally validated machine learning models based on histopathology images for the diagnosis, classification, prognosis, or treatment outcome prediction in female breast cancer: a systematic review. J Pathol Inform 2023; 15: 100348.

10. Dong, Chen, Zhu, et al. Artificial intelligence in skeletal metastasis imaging. Comput Struct Biotechnol J 2023; 23: 157–164.

11. Khaniki MAL, Mirzaeibonehkhater, Manthouri. Enhancing pneumonia detection using vision transformer with dynamic mapping re-attention mechanism. In: 2023 13th international conference on computer and knowledge engineering (ICCKE), 2023, pp.144–149. Piscataway, NJ: IEEE.

12. Ahmadi, Zhang, Tran. Multiheart: secure and robust heartbeat pattern recognition in multimodal cardiac monitoring system. Electronics (Basel) 2025; 14: 3149.

13. Mosquera Orgueira, González Pérez, Díaz Arias JÁ, et al. Survival prediction and treatment optimization of multiple myeloma patients using machine-learning models based on clinical and gene expression data. Leukemia 2021; 35: 2924–2935.

14. Maura, Rajanna, Ziccheddu, et al. Genomic classification and individualized prognosis in multiple myeloma. J Clin Oncol 2024; 42: 1229–1240.

15. Grieb, Schmierer, Kim, et al. A digital twin model for evidence-based clinical decision support in multiple myeloma treatment. Front Digit Health 2023; 5: 1324453.

16. Squara, Luu, Pérol, et al. Personalized reimbursement model (PRM) program: a real-world data platform of cancer drugs use to improve and personalize drug pricing and reimbursement in France. PLoS One 2022; 17: e0267242.

17. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations (ICLR), Vol. 3, 2015, pp.1–14.

18. Krizhevsky, Sutskever, Hinton. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2012; 25: 1106–1114.

19. Huang, Liu, Van Der Maaten, et al. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.4700–4708. Piscataway, NJ: IEEE.

20. Howard. Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 2017; 1: 1–9.

21. Zhang, Ren, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.770–778. Piscataway, NJ: IEEE.

22. Cejalvo, Bustamante, González, et al. Treatment patterns and outcomes in real-world transplant-ineligible patients newly diagnosed with multiple myeloma. Ann Hematol 2021; 100: 1769–1778.

23. Goel, Usmani, Kumar. Current approaches to management of newly diagnosed multiple myeloma. Am J Hematol 2022; 97: S3–S25.

24. Mosquera Orgueira, González Pérez, Diaz Arias, et al. Unsupervised machine learning improves risk stratification in newly diagnosed multiple myeloma: an analysis of the Spanish Myeloma Group. Blood Cancer J 2022; 12: 76.

25. Thurlapati, Wesson, Davis, et al. Impact of cytogenetic abnormalities, induction and maintenance regimens on outcomes after high-dose chemotherapy and autologous stem cell transplantation in patients with newly diagnosed multiple myeloma: a decade-long real-world experience. J Hematol 2023; 12: 243–254.

26. Ubels, Sonneveld, van Vliet, et al. Gene networks constructed through simulated treatment learning can predict proteasome inhibitor benefit in multiple myeloma. Clin Cancer Res 2020; 26: 5952–5961.

27. Kubasch, Grieb, Oeser, et al. Predicting early relapse for patients with multiple myeloma through machine learning. Blood 2021; 138: 2953.

28. Ren, et al. A machine learning model to predict survival and therapeutic responses in multiple myeloma. Int J Mol Sci 2023; 24: 6683.

29. de Reus, Kuijten, Saha, et al. External validation of a machine learning prediction model for massive blood loss during surgery for spinal metastases: a multi-institutional study using 880 patients. Spine J 2025; 25: 1386–1399.

30. Morita, Karashima, Terao, et al. 3D CNN-based deep learning model-based explanatory prognostication in patients with multiple myeloma using whole-body MRI. J Med Syst 2024; 48: 30.

31. Chen, Zhang, Cao, et al. Detection of circulating plasma cells in peripheral blood using deep learning-based morphological analysis. Cancer 2024; 130: 1884–1893.

32. Wang, Dai, Gong, et al. Development of a novel combined nomogram model integrating deep learning-pathomics, radiomics and immunoscore to predict postoperative outcome of colorectal cancer lung metastasis patients. J Hematol Oncol 2022; 15: 11.

33. Boehm, Aherne, Ellenson, et al. Multimodal data integration using machine learning improves risk stratification of high-grade serous ovarian cancer. Nat Cancer 2022; 3: 723–733.

34. Ranjbarzadeh, Bagherian Kasgari, Jafarzadeh Ghoushchi, et al. Brain tumor segmentation based on deep learning and an attention mechanism using MRI multi-modalities brain images. Sci Rep 2021; 11: 10930.

35. Settouti, Saidi. Preliminary analysis of explainable machine learning methods for multiple myeloma chemotherapy treatment recognition. Evol Intell 2024; 17: 513–533.

Research on predicting the progression of multiple myeloma treated with bortezomib based on multimodal ensemble learning
