Abstract
Introduction
Multiple myeloma (MM) is characterized by the uncontrolled proliferation of plasma cells in the bone marrow, which secrete large quantities of non-functional monoclonal antibodies (M protein), thereby damaging related organs or tissues. 1 MM is the second most common hematologic malignancy, and over the past three decades, its incidence has doubled while mortality has increased by 1.5 times, posing a considerable societal burden in China. 2 Historically, treatment options for MM were limited, with a median overall survival (OS) of only 2–3 years. 3 In recent years, the introduction of novel agents such as bortezomib, carfilzomib, and lenalidomide has extended the OS of MM patients to over 7–10 years. 4 As a potent proteasome inhibitor, bortezomib disrupts protein homeostasis by blocking the 26S proteasome. 5 Since MM cells are highly sensitive to the accumulation of misfolded proteins due to their intensive M-protein synthesis, this inhibition triggers cellular stress and apoptosis, effectively delaying disease progression. 1 However, the biological heterogeneity of MM leads to varied treatment responses, and most cases ultimately remain incurable.
Despite the availability of a wide range of treatment options, new challenges have emerged. Studies have shown that in standard-risk newly diagnosed multiple myeloma (NDMM) patients, the triplet regimen based on carfilzomib (carfilzomib + lenalidomide + dexamethasone, KRD) offers no progression-free survival benefit compared to VRD (bortezomib + lenalidomide + dexamethasone). 6 Traditional staging systems are unable to guide first-line risk-adapted treatment decisions for individual patients, and there is a growing consensus that incorporating response prediction into frontline treatment planning is essential.7,8
In recent years, the integration of artificial intelligence (AI) with medical imaging, laboratory testing, and pathology has significantly improved the prediction of clinical disease progression, offering great potential for precision medicine.9,10 In addition to oncology, diverse predictive modeling techniques have shown robust performance in other medical domains, such as the use of Vision Transformers for pneumonia detection and secure pattern recognition in multimodal cardiac monitoring systems.11,12 However, most studies have been based on gene expression data,13,14 which require substantial cost and time, and are currently limited to clinical trials conducted in research centers, making widespread application in routine clinical practice challenging. Model-based real-world evidence is gaining increasing attention as an alternative to traditional randomized clinical trials.4,15,16
Park et al. 4 developed a machine learning model that utilized baseline data obtained during the diagnostic process to predict OS or treatment response in transplant-ineligible NDMM patients receiving either VMP (bortezomib + melphalan + prednisone) or RD (lenalidomide + dexamethasone), enabling treatment-specific risk stratification. Therefore, machine learning models built on clinical data available at diagnosis can help predict disease progression in NDMM patients under different treatment regimens; however, existing models still lack comprehensive evaluation.
The heterogeneity of MM depends on numerous factors, and baseline data, serum protein electrophoresis, and bone marrow smears from newly diagnosed patients can reflect different characteristics of MM to varying degrees. Deep learning, which simulates the human brain by constructing deep neural networks, has been widely applied in the field of image recognition. However, its application in MM progression prediction is still in its early stages.
In this study, we propose a predictive model based on multimodal ensemble learning, aiming to improve the accuracy and robustness of MM progression prediction and to better guide clinical decision-making for bortezomib-based treatment. This model integrates bone marrow smears, electrophoresis images, and baseline clinical data collected prior to treatment, leveraging the strengths of each modality through multimodal data fusion (Figure 1). For image feature extraction, we employed several advanced neural network models, such as ResNet and MobileNet for bone marrow smears, and VGG16 and DenseNet for electrophoresis images. These deep neural networks can automatically extract rich and discriminative features from raw images, capture complex structures and subtle variations, and ensure the quality and diversity of feature representations. For ensemble learning, we adopted the Soft Voting technique to integrate the best-performing models from different modalities, fully leveraging the strengths of each model while minimizing the risks of bias and overfitting associated with single models, thereby enhancing overall predictive performance. Through multimodal data fusion and an optimized ensemble learning strategy, our model demonstrates excellent performance in handling complex data and provides an efficient and reliable tool for predicting MM progression.
Our study makes the following primary contributions:
First, we developed a multimodal framework that integrates bone marrow smears, electrophoresis images, and clinical data. This approach captures a more comprehensive disease profile than traditional single-modality methods.
Second, we implemented an optimized ensemble learning strategy using multiple deep neural networks. This architecture significantly improves the accuracy and robustness of progression prediction, achieving a superior area under the curve (AUC) of 0.8180.
Finally, by utilizing routine diagnostic data, our model offers a practical, cost-effective decision-support tool for clinicians to personalize treatment strategies for NDMM patients in real-world settings.
The remainder of this paper is organized as follows. Section 2 describes the patient enrollment, variables, and multimodal data preprocessing, followed by a detailed explanation of the progression prediction models based on deep learning, machine learning, and ensemble strategies. Section 3 presents the experimental setup, demographic statistics, and a comprehensive validation of the models across different modalities. Section 4 provides a discussion on the clinical implications and findings of the study. Finally, Section 5 concludes the paper.
Methods
Patient enrollment
This was a retrospective, non-interventional, single-center observational study that enrolled 247 patients with relatively complete information from the database of our research group. These patients were treated at the Department of Hematology, Beijing Chaoyang Hospital, Capital Medical University, between January 2017 and December 2023. The study was approved by the Ethics Committee of Beijing Chaoyang Hospital, Capital Medical University (Ethics Approval Number: 2024-ke-850) and was conducted in accordance with the Declaration of Helsinki.
Inclusion Criteria: (1) Confirmed diagnosis of MM; (2) Received at least two cycles of bortezomib-based induction therapy; (3) Availability of complete baseline clinical data, immunofixation electrophoresis images, and bone marrow smear images; (4) Documented 2-year follow-up data. Exclusion Criteria: (1) Presence of other concurrent primary malignancies; (2) Smoldering MM; (3) Inadequate image quality that precluded feature extraction. After applying these criteria, 40 patients were excluded (mostly due to loss to follow-up), resulting in a final cohort of 207 patients.
Variables
The clinical characteristics of patients at the time of diagnosis included age, sex, WBC, HGB, PLT, ALB, GLB, LDH, Cr, calcium calibration, β2-microglobulin (β2-MG), M protein type, M protein levels, 24-h urinary light chain, and bone marrow plasma cell percentage. MM staging systems included the Durie-Salmon (DS) staging, International Staging System (ISS), and the Revised International Staging System (R-ISS). Cytogenetic risk factors (1q21, 17p, t(14;16), t(4;14), t(11;14)) were determined using FISH on bone marrow cells.
Two senior morphological experts annotated the myeloma cells in bone marrow images of each NDMM patient. The “Segment Anything” model was then used to segment the myeloma cells and construct the dataset.
According to the 2016 IMWG response criteria, the time to first disease progression during a 5-year follow-up was recorded. Patients were categorized into a Progression group (PD) if disease progression occurred, and a Non-progression group (Non-PD) if it did not.
Multimodal data preprocessing
Image Preprocessing: To ensure consistent feature extraction across different modalities, we implemented a standardized preprocessing pipeline for both bone marrow smears and electrophoresis images. First, all raw images were resized to a uniform resolution of 224 × 224 pixels to meet the input requirements of the deep learning backbones (ResNet, MobileNet, VGG16, and DenseNet). We then applied median filtering (kernel size = 3) and Gaussian filtering (sigma = 1.0) to remove impulse noise and artifacts while preserving the morphological details of plasma cells and protein bands. To accelerate convergence during training, pixel intensity normalization was performed, scaling values to a range of [0, 1] (or standardized using the mean and standard deviation of the ImageNet dataset).
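As a rough illustration of the two intensity-normalization options mentioned above, the following sketch scales one 8-bit RGB pixel to [0, 1] and, optionally, standardizes it with the commonly published ImageNet channel statistics. The function name, the example pixel, and the choice to standardize after scaling are assumptions; the text does not state which option was applied to which backbone.

```python
# Commonly published ImageNet channel statistics (RGB order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, standardize=False):
    """Scale an 8-bit RGB pixel to [0, 1]; optionally standardize per channel."""
    scaled = [channel / 255.0 for channel in rgb]
    if not standardize:
        return scaled
    return [(v - m) / s for v, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]


pixel = (255, 128, 0)  # hypothetical pixel value, for illustration only
scaled = normalize_pixel(pixel)
standardized = normalize_pixel(pixel, standardize=True)
```

In a full pipeline the same operation would be applied to every pixel of the resized 224 × 224 image after filtering.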
Data Augmentation: To address the potential imbalance between progression and non-progression groups and to prevent overfitting, offline and online data augmentation techniques were employed. These included random horizontal and vertical flipping, rotations within a range of 30°, and random scaling (0.8–1.2). Furthermore, brightness and contrast adjustments were applied to simulate variations in staining and lighting conditions across different batches. These techniques expanded the diversity of our training set, enhancing the model's generalization ability across heterogeneous clinical images.
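The augmentation policy described above can be sketched as a random draw of parameters per training image. This is only a minimal illustration: the flip probability of 0.5, the symmetric rotation range, and the brightness and contrast ranges are assumptions, since the text gives only the flip types, the rotation range, and the scaling interval.

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters for a single training image."""
    return {
        "hflip": rng.random() < 0.5,            # random horizontal flip (assumed p = 0.5)
        "vflip": rng.random() < 0.5,            # random vertical flip (assumed p = 0.5)
        "angle_deg": rng.uniform(-30.0, 30.0),  # assumed symmetric rotation range
        "scale": rng.uniform(0.8, 1.2),         # random scaling interval from the text
        "brightness": rng.uniform(0.9, 1.1),    # hypothetical jitter range
        "contrast": rng.uniform(0.9, 1.1),      # hypothetical jitter range
    }


rng = random.Random(0)  # fixed seed so the draws are reproducible
params = [sample_augmentation(rng) for _ in range(1000)]
```

Drawing fresh parameters each time an image is loaded corresponds to online augmentation; saving the transformed copies to disk beforehand corresponds to offline augmentation.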
MM disease progression prediction model construction
Deep learning-based progression prediction model
In this study, we used several deep learning models to extract high-dimensional features from bone marrow and electrophoresis images for predicting the progression of MM. These models include VGG16, AlexNet, DenseNet, MobileNet, and ResNet. VGG16 is known for its 16-layer deep convolutional structure, which effectively extracts multi-level features from images 17 ; AlexNet, a pioneering deep convolutional network, stacks convolutional and fully connected layers 18 ; DenseNet promotes feature transmission and reuse through densely connected layers, improving the model's learning efficiency 19 ; MobileNet uses depthwise separable convolutions, significantly reducing computational load 20 ; ResNet introduces residual connections to effectively alleviate the vanishing gradient problem in deep networks. 21
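To make the computational saving of depthwise separable convolutions concrete, the short calculation below compares weight counts for a standard convolution and its depthwise separable counterpart. The layer shape is illustrative and not taken from the models used in this study; biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


k, c_in, c_out = 3, 128, 256  # illustrative layer shape, not from the paper
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
ratio = separable / standard  # equals 1 / c_out + 1 / k**2
```

For this layer the separable version needs 33,920 weights instead of 294,912, roughly 11.5% of the original.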
We conducted experiments on these models and found that ResNet and MobileNet achieved AUC values of 0.7295 and 0.6989, respectively, on the bone marrow smear dataset, outperforming the other three models. On the electrophoresis image dataset, VGG16 and DenseNet achieved AUC values of 0.8082 and 0.8088, respectively, outperforming the other three models. Therefore, for extracting features from bone marrow smear images, we selected ResNet and MobileNet, while for electrophoresis images, we selected VGG16 and DenseNet for feature extraction. The selected models from each modality serve as sub-models for further ensemble learning.
Machine learning-based progression prediction model
In constructing the MM progression prediction model, we compared several algorithms on this task, including Random Forest, K-Nearest Neighbors (KNN), Gradient Boosting, Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), XGBoost, Logistic Regression, and Decision Trees. Random Forest improves model accuracy and robustness by building multiple decision trees and aggregating their results. KNN is simple and intuitive, suitable for small datasets or as a baseline model. Gradient Boosting handles regression and classification problems by incrementally adding weak models to minimize the loss function. AdaBoost enhances model performance by adjusting sample weights to prioritize learning from previously misclassified samples. SVM is a supervised learning algorithm for classification and regression that seeks an optimal hyperplane separating data points of different classes while maximizing the margin between them. XGBoost is widely used for its efficiency and accuracy, improving prediction performance by combining multiple decision trees. We trained and evaluated each of these models; Logistic Regression and Random Forest achieved AUC values of 0.6779 and 0.6506, respectively, outperforming the other models. Therefore, we selected these two models as sub-models for further ensemble learning.
Ensemble learning-based progression prediction model using bone marrow, electrophoresis, and clinical data
For ensemble-based disease progression prediction using bone marrow smears, electrophoresis images, and clinical baseline data, we selected the two best-performing models from each modality. For bone marrow images, we chose ResNet and MobileNet; for electrophoresis images, we selected VGG16 and DenseNet; and for clinical baseline data, we used Random Forest and Logistic Regression. We adopted a Soft Voting strategy to integrate these sub-models from different modalities, which takes into account the prediction confidence of each model. Specifically, each classifier assigns a probability to each class, and the final combined prediction is the class with the highest total probability. The Soft Voting strategy combines the results from multiple classifiers, is not limited by data modality, and often outperforms individual models during testing.
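The Soft Voting rule described above can be sketched in a few lines. This is a minimal illustration assuming equal weights for all six sub-models and a 0.5 decision threshold; the per-model probabilities are hypothetical and the function name is not from the study's code.

```python
def soft_vote(prob_pd_by_model, threshold=0.5):
    """Average each sub-model's predicted probability of progression (PD).

    For a binary task, averaging P(PD) and thresholding at 0.5 selects the
    class with the highest summed probability (ties go to PD here).
    """
    mean_prob = sum(prob_pd_by_model.values()) / len(prob_pd_by_model)
    label = "PD" if mean_prob >= threshold else "Non-PD"
    return mean_prob, label


# Hypothetical outputs of the six sub-models for one patient.
patient = {
    "ResNet": 0.62,
    "MobileNet": 0.55,
    "VGG16": 0.71,
    "DenseNet": 0.68,
    "RandomForest": 0.40,
    "LogisticRegression": 0.46,
}
mean_prob, label = soft_vote(patient)
```

Here the two clinical-data models lean toward Non-PD, but the four image models are more confident, so the averaged probability (0.57) yields a PD prediction. Hard voting would count only the votes and ignore this difference in confidence.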
Statistical analysis
Evaluation metrics play a crucial role in assessing the performance of machine learning models. These measures are essential for objectively evaluating model performance and guiding their development and improvement. In this experiment, the following metrics are used to measure the effectiveness of each model. The relationship between True Positives (TP), False Negatives (FN), True Negatives (TN), and False Positives (FP) is used to obtain metrics such as ACC, Precision, Recall, F1, and AUC.
Accuracy: The percentage of correctly predicted results out of the total samples.
Precision: The probability that a sample predicted as positive is actually positive.
Recall: The probability that a sample, which is actually positive, is predicted as positive.
F1: The harmonic mean of Precision and Recall, balancing the performance of both.
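The four definitions above translate directly into code. The sketch below computes them from confusion-matrix counts; the counts are hypothetical (the paper does not report a confusion matrix) and serve only to show the arithmetic.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 50-sample test set (not reported in the paper).
metrics = classification_metrics(tp=9, fp=1, fn=14, tn=26)
```

With these counts, precision is 0.90 while recall is about 0.39, so F1 (about 0.55) sits well below precision even though accuracy is 0.70. This shows why the metrics should be read together rather than in isolation.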
The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate as the decision threshold varies and is used to evaluate the predictive ability of a model. AUC represents the area under the ROC curve and is an important criterion used in this paper to assess model performance.
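One way to compute AUC without drawing the curve is its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses that equivalence on toy data; the labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = progression (PD), 0 = non-progression (Non-PD).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

In this toy example eight of the nine positive-negative pairs are ordered correctly, giving an AUC of 8/9. Unlike accuracy or F1, the value does not depend on a particular decision threshold.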
Results
Experimental setup
To evaluate the proposed model (SOTA) on the MM progression prediction task, we divided the dataset into a training set (80%) and a testing set (20%). Experiments were run on Windows 10 with an AMD Ryzen 9 5900X processor (32 GB RAM) and an NVIDIA GeForce RTX™ 3080 GPU (10 GB memory). Models were built and trained using Python 3.9 and PyTorch 1.8.1.
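An 80/20 split of this kind can be sketched as follows. The text does not say whether the split was stratified by outcome or which random seed was used, so both are assumptions here, as are the class counts in the example.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly equal class ratios in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical label vector: 1 = PD, 0 = Non-PD (class counts are made up).
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels)
```

Splitting at the patient level, so that all images from one patient fall on the same side of the split, would additionally guard against leakage between training and testing data.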
Patient demographic data statistics
Table 1 shows the differences between the MM progression group and the non-progression group in terms of demographic and clinical baseline characteristics, including age, gender, staging, biochemical indicators, and immunological markers. The results indicate that positive expression of 1q21, ISS stage III, high levels of HGB, LDH, Cr, Calibration of Ca, and age (>65) are associated with MM progression. However, patient gender, bone marrow plasma cell ratio and other indicators are not significantly related to MM progression.
Table 1. Demographic and clinical characteristics of all patients.
Abbreviations: PD: progressive disease; WBC: white blood cell count; HGB: hemoglobin; PLT: platelet; ALB: albumin; GLB: globulin; LDH: lactic dehydrogenase; Cr: creatinine; Calibration of Ca: calibration of calcium; β2-MG: β2-microglobulin.
Progression prediction model validation
Bone marrow smear-based deep learning progression prediction model
To achieve optimal performance, we optimized the parameters of all models and used Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) optimization algorithms. The learning rate was set to 0.00002, and slightly adjusted momentum parameters (beta1 = 0.85, beta2 = 0.995) were used, with 60 training epochs. To accelerate model convergence, we adopted a learning rate decay strategy, reducing the learning rate to 0.1 times its current value at the end of each epoch. The batch size was set to 64, meaning that 64 random samples were selected from the training set for forward and backward propagation in each iteration. Additionally, to further enhance the model's generalization ability and training stability, we introduced L2 regularization and Dropout techniques. This combination of optimization parameters and techniques allowed us to effectively improve the performance of the network model during training.
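Read literally, the decay rule above multiplies the learning rate by 0.1 after every epoch. The short calculation below follows that reading to show how quickly the rate shrinks; the function name is hypothetical and the sketch makes no claim about the actual training code.

```python
def decayed_learning_rate(initial_lr, decay_factor, epoch):
    """Learning rate in effect during `epoch` (0-based) under per-epoch decay."""
    return initial_lr * decay_factor ** epoch


initial_lr = 2e-5   # 0.00002, as stated in the text
decay_factor = 0.1  # multiply by 0.1 at the end of each epoch
schedule = [decayed_learning_rate(initial_lr, decay_factor, e) for e in range(4)]
```

Under this reading the rate falls from 2e-5 to about 2e-8 by the fourth epoch, so most of the effective parameter updates would occur in the first few of the 60 epochs.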
We compared the performance of several deep learning models in predicting MM progression from bone marrow smears. ResNet achieved the highest AUC (0.7295), with an accuracy of 0.5800, F1 score of 0.6316, Precision of 0.5294, and Recall of 0.7826. DenseNet had an AUC of 0.6924, a Precision of 0.6140, and a Recall as high as 0.8434, indicating strong performance in identifying positive samples. MobileNet showed balanced performance, with an AUC of 0.6989, an accuracy of 0.6000, and relatively balanced F1 score and Precision, although its Recall was slightly lower than that of DenseNet. In contrast, VGG16 and AlexNet performed worse; AlexNet in particular nearly failed to separate the two classes. Overall, ResNet had the best AUC, DenseNet excelled in Recall, MobileNet showed more balanced classification ability, and VGG16 and AlexNet performed poorly. Specific results can be found in Table 2 and Figure 2(a).

Figure 1. The workflow of this study.

Figure 2. ROC curves of various test models: (a) ROC for the bone marrow smear group, (b) ROC for the electrophoresis image group, and (c) ROC for the clinical baseline indicator group.
Table 2. Test results of deep learning models based on bone marrow smears.
Progression prediction model based on electrophoresis images and deep learning
In the experiment using electrophoresis images for progression prediction, we employed the same training configuration as for bone marrow smears. DenseNet performed the best, with an AUC of 0.8088 and an accuracy of 0.6200. VGG16 followed closely, with an AUC of 0.8082, an F1 score of 0.6341, and relatively balanced Precision and Recall. MobileNet performed slightly worse, with an AUC of 0.7633, showing stable performance but not surpassing DenseNet or VGG16. Overall, DenseNet demonstrated the best overall performance, and VGG16 showed a more balanced performance. Specific results can be found in Table 3 and Figure 2(b).
Table 3. Test results of deep learning models based on electrophoresis images.
Progression prediction model based on clinical baseline indicators and machine learning
In the experiment using clinical indicators for progression prediction, the performance of traditional machine learning models varied. Logistic regression performed the best across all metrics, with an AUC of 0.6779 and an accuracy of 0.6800, indicating a certain advantage in overall classification performance. Random forest achieved an AUC of 0.6506. KNN, AdaBoost, and gradient boosting had similar performance, with AUCs ranging from 0.55 to 0.65. Their F1 scores and precision were relatively consistent, but they were slightly lacking in recall. SVM had the lowest AUC of just 0.4010, failing to effectively identify positive samples. Overall, logistic regression performed best in classifying clinical data, with random forest also demonstrating decent results. Detailed results are shown in Table 4 and Figure 2(c).
Table 4. Test results of machine learning models based on clinical indicators.
Multimodal ensemble learning model based on bone marrow smear, electrophoresis images, and clinical baseline indicators
We applied a Soft Voting strategy to integrate the top-performing models from three modalities (bone marrow smears, electrophoresis images, and clinical baseline indicators). The ensemble model achieved an AUC of 0.8180, an accuracy of 0.7000, an F1 score of 0.5455, and a particularly high Precision of 0.9000. Its AUC and accuracy exceed the single-modality values reported above (e.g. AUC 0.8088 for DenseNet on electrophoresis images; accuracy 0.6800 for logistic regression on clinical data), although its F1 score is lower than that of some individual models (e.g. 0.6316 for ResNet on bone marrow smears).
Overall, by leveraging the complementary strengths of different modalities, the ensemble model demonstrated superior comprehensive performance in predicting the progression of MM. Detailed results are shown in Table 5 and Figure 3.

Figure 3. Test results of the ensemble learning progression prediction model: (a) DCA curve, (b) calibration curve, (c) ROC curve.
Table 5. Test results of the ensemble learning progression prediction model.
Discussion
A real-world retrospective study conducted in Spain 22 revealed that age, ECOG performance status, ISS stage, serum LDH, GFR, cytogenetics, and treatment regimen significantly affect OS. First-line treatment exhibited high heterogeneity among elderly and high-risk patients (e.g. ECOG PS ≥ 2, ISS stage III, severely impaired GFR, elevated LDH levels, and high-risk cytogenetics). In our study, positive 1q21 expression, ISS stage III, elevated LDH, Cr, calibration of Ca, HGB, and age (>65) were significantly associated with MM progression, which is largely consistent with the above-mentioned study.
The progression of NDMM is influenced by various factors, such as age, performance status, ASCT, and comorbidities.23–25 Integrating clinical baseline characteristics, biochemical markers, and gene expression profiles is currently the main strategy for applying machine learning to personalize treatment for MM patients and prolong survival. Orgueira et al. 13 utilized machine learning to analyze multiple variables, including 46 genes, to predict OS in MM patients receiving six first-line treatment regimens. Patients treated with the optimal drug combination identified by the model had longer OS than those receiving other regimens. Ubels et al. 26 designed Simulated Treatment Learning Signatures (STLsig) to predict treatment benefits in NDMM. STLsig identified two gene complexes that could jointly predict favorable responses to bortezomib.
Kubasch et al. 27 demonstrated that a machine learning model trained with gradient boosting classification could predict early relapse in NDMM with 73% accuracy using four features, such as the best response in the first year after first-line treatment. Most machine learning-based studies on MM disease progression rely on gene expression data, which requires substantial costs and time, making it difficult to apply widely in clinical practice. 28 Furthermore, an excessive number of input variables used to train models can lead to the risk of data overfitting. 29 This indicates that the technical demand for machine learning-based research on MM disease progression continues to grow, but it has not yet been fully addressed.
MM exhibits high heterogeneity, and deep learning algorithms can automatically uncover complex nonlinear relationships from heterogeneous data sources, performing excellently in predictive tasks. Deep learning simulates the human brain through deep neural networks for analysis and is widely used in image recognition; however, its application in MM progression prediction is still in its early stages.30,31 Significant progress has also been made in the application of ensemble learning techniques in the medical AI field.32,33 Therefore, this study combines images reflecting MM heterogeneity (bone marrow smear, electrophoresis), clinical baseline features, and AI to provide an opportunity to address the aforementioned issues in clinical decision-making.
This study utilizes data from three modalities—bone marrow smears, electrophoresis images, and clinical baseline indicators—and constructs an ensemble learning framework to predict the progression of MM based on bortezomib treatment, using various machine learning and deep learning models. In the single-modality experiments, the VGG16 and DenseNet models for electrophoresis images exhibited outstanding classification performance. The ResNet model for bone marrow smears achieved an AUC of 0.7295, demonstrating good classification ability. In the clinical indicator model, logistic regression achieved an AUC of 0.6779, outperforming random forest and other traditional machine learning methods. In the multi-modal ensemble model, the final AUC was improved to 0.8180, with accuracy reaching 0.7000 and Precision as high as 0.9000 through the Soft Voting strategy that combines the optimal models from different modalities. This indicates that the ensemble model significantly outperforms the single-modality models, particularly in terms of accuracy and precision.
Our model performs best in the prediction of MM progression, primarily due to the dual advantages of the model architecture and the task characteristics. From the perspective of model architecture, we adopt a multi-modal ensemble learning framework that combines the best models from three modalities—bone marrow smears, electrophoresis images, and clinical baseline indicators—by using the Soft Voting strategy to merge the predictions from each modality. Bone marrow smears reflect morphological changes of cells, electrophoresis images provide protein expression information, and clinical indicators show the patient's basic characteristics. Our method fully utilizes the unique features of each modality, enhancing the model's comprehensive understanding of complex data and improving prediction accuracy. Additionally, the Soft Voting strategy effectively mitigates the bias of individual models and boosts overall performance. This architecture is consistent with the methodological frameworks championed by scholars such as S. Jafarzadeh et al. 34 and S. Anari et al., who have demonstrated that ensemble systems significantly enhance model robustness and generalization in complex oncological tasks. The reason our multimodal ensemble approach outperforms competing methods lies in its ability to resolve the “information gap” inherent in single-modality assessments. While previous studies often relied solely on clinical markers (which may lack predictive granularity) or gene expression data (which is static and costly), our method captures the dynamic morphological features of plasma cells and the quantitative protein variations in electrophoresis. By integrating these complementary information streams, the ensemble model can identify subtle progression signals that are otherwise filtered out as noise in a single-modality network. 
Specifically, the high precision (0.9000) of our model compared to baseline classifiers suggests that the fusion of image-based deep features and clinical categorical variables creates a synergistic effect, allowing for a more robust characterization of MM heterogeneity. Therefore, our ensemble model demonstrates excellent performance in predicting MM progression.
In clinical applications, the interpretability of the model is crucial. Settouti et al. 35 applied explainable AI to identify the optimal chemotherapy regimen for MM, where SHAP provided a global perspective on feature contributions. To enhance the clinical interpretability of our model, we employed Gradient-weighted Class Activation Mapping (Grad-CAM) for visual analysis. Grad-CAM generates heatmaps that show the image areas the model focuses on when making predictions (Figure 4), providing an intuitive explanation of the model's decision-making process. In our study, Grad-CAM effectively revealed how the model focused on key regions in the bone marrow smears and electrophoresis images, which are associated with specific disease features. For example, Grad-CAM could display which parts of the bone marrow smear images contributed most to the prediction results, helping clinicians understand why the model labeled certain samples as high-risk or low-risk. Through Grad-CAM visualization, we can transform the complex decision-making process of the model into more understandable clinical information, increasing clinicians’ trust in and acceptance of the model's predictions, making the model more practical and reliable in real-world medical scenarios.
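For reference, the standard Grad-CAM formulation can be written as below. The notation is not defined in this paper and is introduced here only for illustration: y^c is the model's score for class c before the softmax, A^k is the k-th feature map of the chosen convolutional layer, and Z is the number of spatial positions in that map.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}},
\qquad
L_{\text{Grad-CAM}}^{c} = \operatorname{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)
```

The weights α average the gradients over each feature map, and the ReLU keeps only regions that push the score for class c upward. The resulting coarse map is then upsampled to the input resolution and overlaid on the image as a heatmap.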

Figure 4. Grad-CAM visualization results: (a) activation heatmap area of the electrophoresis image, (b) activation heatmap area of the bone marrow smear.
Failure Case Analysis via Grad-CAM: To provide a transparent and balanced evaluation of our model, we analyzed cases where the ensemble prediction did not align with the ground truth. As illustrated in Figure 5, in some misclassified electrophoresis images, the Grad-CAM heatmaps revealed that the model occasionally focused on the peripheral regions of the gel rather than the specific M-protein bands, likely due to baseline noise or suboptimal image contrast. Similarly, in failed bone marrow smear classifications, the attention was sometimes localized on cell debris rather than the diagnostic plasma cells. These findings suggest that while our ensemble approach mitigates the weakness of single models, the performance is still bounded by the quality of raw visual features. Future improvements will focus on implementing more robust attention mechanisms to suppress background noise and integrating molecular-level features to handle highly heterogeneous cases.

Figure 5. Grad-CAM visualization of representative failure cases. (a) Misclassified electrophoresis images showing attention focused on gel edges or background noise rather than M-protein bands. (b) Failed bone marrow smear cases where the model localized on cellular debris or staining artifacts instead of diagnostic plasma cells.
The optimal selection of the initial treatment regimen is crucial for the prognosis of NDMM. 24 To date, the choice of MM treatment regimens largely depends on the judgment of clinicians. Due to the high heterogeneity of MM and the increasing complexity of treatment regimens, this process becomes cumbersome and inefficient. AI can simplify this task, making clinical work more efficient. Therefore, we developed an integrated algorithm based on deep learning and machine learning, utilizing multimodal data (patient bone marrow smear images, immunofixation electrophoresis, and baseline characteristics). The model predicts whether NDMM patients receiving bortezomib-based first-line treatment will progress within 2 years, giving a preliminary classification as PD or Non-PD. From a clinical perspective, this tool provides an objective, standardized risk-stratification framework that can assist clinicians in identifying high-risk patients earlier. For those predicted to progress, clinicians might consider increasing the frequency of follow-up or adjusting therapeutic strategies at an earlier stage. Moreover, because our model relies on routine diagnostic images rather than expensive molecular testing, it offers a cost-effective solution for precision medicine, especially in hospitals where advanced genetic resources are limited. This may reduce the overall economic burden and help prevent potential over-treatment, supporting more accessible, individualized chemotherapy for MM.
This study has some limitations. A potential limitation is the small validation dataset, so larger-scale datasets are needed to confirm the results. The model does not provide more specific treatment options for NDMM patients, and clinicians are unable to develop precise treatment plans for each patient. In the future, treatment options could be further categorized, with efficacy assessments conducted after first-line induction therapy or ASCT, which would better guide personalized MM therapy. External validation will also be needed in later stages. The real clinical data we used contain noise and missing features, which are unavoidable characteristics of real-world clinical data. The machine learning model developed in this study achieved acceptable classification performance on noisy data.
Conclusion
This study predicts the progression of MM based on bortezomib treatment using a multimodal data integration learning model. Clinicians can combine the model's prediction results to adopt appropriate strategies, reducing patient suffering and unnecessary economic burdens, and improving quality of life. This model holds great potential in predicting MM progression and lays the foundation for future research on personalized treatment for MM.
Footnotes
Author note
AI Disclosure: The authors used ChatGPT in order to check grammar mistakes. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the final version of the manuscript.
Ethics approval and informed consent
All procedures involving human participants were approved by the Beijing Chao-Yang Hospital Ethics Committee, which waived the requirement of informed consent from subjects enrolled in this study (Ethics Approval Number: 2024-ke-850).
Consent for publication
All authors have approved the manuscript for publication.
Authors’ contributions
Sha Li designed the research, collected the data and images, analyzed the results, and wrote and edited the manuscript. Boyang Zang constructed the model and analyzed the results. Jing Jia designed the research. Yantian Zhao, Hong Zong, and Hong Huo collected images. Chuanying Geng designed the research, collected the data, and reviewed the manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The data analyzed in the current study are available from the corresponding author upon reasonable request.
