Abstract
Introduction
Breast cancer is the most commonly diagnosed malignancy and the leading cause of cancer-related death among women worldwide. 1 In 2022 alone, approximately 2.3 million new cases were reported, with over 670,000 deaths, according to the World Health Organization. 2 Early-stage breast cancer typically lacks obvious symptoms, making timely detection extremely challenging. As the disease progresses, symptoms such as palpable lumps, breast pain, or nipple discharge may appear, 3 often indicating advanced stages that are more difficult to manage. These facts underscore the urgent need for reliable and efficient methods for early diagnosis and treatment planning.
Imaging techniques such as mammography, ultrasound, and magnetic resonance imaging (MRI) have become indispensable in the early detection of breast tumors. 4 However, their effectiveness often hinges on the experience and subjective judgment of radiologists, leading to variability and inconsistency in diagnoses. As a result, there has been a growing interest in leveraging artificial intelligence (AI) to assist with breast cancer diagnosis, particularly in the automation of lesion detection and segmentation.
Computer-aided detection systems have emerged to reduce diagnostic variability and enhance efficiency by automatically analyzing breast imaging data, especially mammograms. 5 Convolutional neural networks (CNNs) such as U-Net, 6 U-Net++, 7 and U-Net 3+ 8 have significantly advanced segmentation accuracy in this context, enabling precise delineation of lesion boundaries—a critical step in breast cancer diagnosis. While CNNs have shown versatility in medical image analysis, 9 their application to breast lesion segmentation faces unique challenges.
However, despite their effectiveness, CNN-based segmentation methods primarily rely on stacking convolutional layers to expand the receptive field, which can increase parameter complexity and reduce computational efficiency. Furthermore, their local operation nature often fails to capture global anatomical context, leading to over- or undersegmentation—particularly in dense breast tissues with indistinct lesion boundaries, where preserving subtle structural details is crucial. To address these issues, transformer-based segmentation models have gained traction due to their ability to capture long-range dependencies. Hybrid models such as HCT-Net 10 and ScribFormer 11 combine CNNs and transformers to improve global context modeling, although simple feature fusion across scales often fails to deliver consistent representations.
Another major limitation in current research is the disconnection between lesion detection and segmentation. Most studies address these tasks separately, focusing either on identifying lesion locations or on delineating lesion boundaries. However, in clinical practice, both tasks are crucial: detection locates the region of interest, while segmentation provides morphological details such as size and shape. Bridging this gap is essential for developing comprehensive and clinically applicable AI systems.
In this study, we propose DVF-YOLO-Seg, a two-stage detection–segmentation framework designed to enhance the precision of breast lesion detection and segmentation. While similar works, such as YOLOv8+SAM 12 and YOLOv9+SAM, 13 also employ a detection–segmentation pipeline, our approach introduces several distinct innovations: DVF-YOLO-Seg integrates its modules in a synergistic manner rather than simply stacking independent components.
First, we incorporate the DualConv 14 module into the YOLOv10 15 detection model to improve the detection stage. This module combines 3 × 3 and 1 × 1 grouped convolutions, enabling better extraction of multi-scale features, which significantly enhances the detection of small and irregular lesions. Varifocal Loss (VFL) 16 is then applied to mitigate class imbalance issues, refining bounding box precision for small lesions.
The bounding boxes produced by the detection stage are then passed to the segmentation stage as spatial prompts, where a Visual Reference Prompt Segment Anything Model (VRP-SAM) 17 further refines the segmentation output. Unlike the original segment anything model (SAM), 18 which uses standard bounding boxes, VRP-SAM leverages visual reference prompts to guide the segmentation process, ensuring higher robustness to low-quality bounding box predictions. This significantly enhances segmentation accuracy, especially when the predicted bounding boxes are imprecise.
Additionally, we introduce a perturbation-voting strategy that mitigates the impact of detection errors and improves segmentation stability. By incorporating multiple variations of the bounding box predictions and performing majority voting, we further reduce the risk of errors due to inaccurate detections, making the model more reliable in clinical scenarios.
The main contributions of this work are as follows:
1. A two-stage detection–segmentation framework is developed, where lesion bounding boxes generated by an optimized YOLOv10 are refined through VRP-SAM for high-precision segmentation.
2. YOLOv10 is enhanced with DualConv and VFL to strengthen feature extraction and weight distribution, increasing robustness in dense tissue and complex backgrounds.
3. Clinical applicability and accuracy are validated through subjective evaluations by physicians from two hospitals, demonstrating the model’s real-world effectiveness.
Related work
Detection of medical images
Breast cancer detection has undergone a paradigm shift, moving from traditional geometric and texture-based feature engineering to advanced deep learning models that integrate multimodal data and multitask learning. Among these, CNNs have gained widespread adoption due to their ability to automatically learn hierarchical features from medical images. However, their heavy reliance on large-scale annotated datasets remains a critical bottleneck, limiting their generalizability in real-world clinical settings where labeled data are often scarce and costly to obtain.
To address this, Zhang et al. 19 incorporated a Bayesian framework into YOLOv4 to quantify prediction uncertainty and enhance model robustness in data-scarce environments. Despite these improvements, Das et al. 20 pointed out that traditional CNNs still face a high risk of overfitting when labeled data are limited, highlighting the need for more adaptive architectures. In response to small dataset challenges, Prinzi et al. 21 proposed a YOLOv5-based model for mammographic detection, leveraging transfer learning to boost performance, even for difficult lesions such as asymmetries.
As a representative of single-stage detectors, the YOLO family has become a dominant solution for real-time object detection in clinical workflows, owing to its unified design for feature extraction and lesion localization. For example, Aly et al. 22 applied YOLO to early breast lesion detection, significantly improving diagnostic efficiency. Building on this, Su et al. 23 enhanced YOLOv5 with transformer modules to capture long-range contextual dependencies in mammograms. While these advancements have improved performance, challenges still exist in detecting small lesions with high precision and handling complex tissues.
To meet the demands of real-time deployment in low-resource settings, current research has shifted toward lightweight architectures and multi-input fusion. For instance, combining weighted least squares regression with multi-view fusion enabled efficient thermal imaging-based detection on mobile devices. 24 Meanwhile, three-dimensional mammography has emerged to overcome the limitations of two-dimensional (2D) images. Umamaheswari and Mohanbabu 25 introduced a hybrid segmentation model for volumetric images, which combines adaptive thresholding, region-growing, and cat swarm optimization techniques.
Beyond imaging, novel multimodal approaches have been explored to improve breast cancer detection. BiaCanDet 26 integrates bioimpedance signals with deep learning using spatial–temporal attention to distinguish malignant tissues. Ensemble learning has also been employed, combining CNN architectures such as EfficientNet, AlexNet, ResNet, and DenseNet with customized scaling and feature fusion to further improve performance. 27 Recent advancements have also explored the potential of large language models (LLMs) in breast cancer diagnosis and treatment. Ghorbian et al. 28 demonstrated that LLMs, by analyzing vast amounts of medical data, can significantly enhance diagnostic accuracy, treatment decisions, and clinical workflows.
Segmentation of medical images
Early approaches to breast lesion segmentation primarily relied on handcrafted features, focusing on shape, boundary texture, and intensity. While intuitive, these methods suffered from poor generalizability, as manually designed features often failed to adapt to variations in lesion morphology, such as irregular boundaries, and differences in imaging conditions, including noise and contrast. The emergence of CNN-based models addressed this gap by enabling automatic hierarchical feature learning, significantly advancing the field. For example, Wang et al. 29 proposed the MLNnet to segment clustered microcalcifications in mammograms, with a particular focus on mitigating domain shifts caused by variations in patient pose, breast density, and imaging acquisition protocols. Following this, AAPFC-BUSnet 30 further enhanced segmentation accuracy by integrating multi-scale features through deformable convolutions to adapt to irregular lesion shapes and adaptive self-attention mechanisms to prioritize clinically relevant regions.
Despite these advances, CNN-based models are constrained by their local receptive fields, which limit their ability to capture fine morphological details—especially for lesions with ambiguous boundaries or subtle structural distortions. To overcome this limitation, Li et al. 31 developed a dual-stream framework: a locality-preserving learner focuses on fine-grained boundary details, while a conditional map learner enhances global context, jointly improving segmentation precision and diagnostic prediction. Similarly, MF-Net 32 addressed multi-scale challenges by combining transformer modules for global feature modeling with multi-path extraction for local detail preservation. For low-boundary-clarity scenarios, Jiang et al. 33 introduced a semi-supervised approach that leverages adaptive patch enhancement to highlight indistinct boundaries and contrastive learning to learn discriminative features from unlabeled data, effectively boosting performance in data-scarce settings.
Recently, the SAM revolutionized generic segmentation with a prompt-driven vision transformer framework, supporting flexible inputs such as points, boxes, text, and masks. Its successor, SAM2, 34 further improved usability by reducing interaction latency. However, SAM’s pretraining on natural images leaves it devoid of domain-specific knowledge, limiting its effectiveness in breast lesion segmentation.
To address this domain mismatch, several medical-oriented variants have been proposed. PATH-SAM2 35 integrates a Kolmogorov-Arnold classification module with a UNI encoder pretrained on histopathological data, enhancing adaptability to tissue-level features. MedSAM 36 and SAMMI 37 achieve better medical alignment via large-scale fine-tuning on medical images, though their heavy computational overhead hinders deployment in resource-constrained clinical settings. To reduce latency, SAMed 38 introduced low-rank adaptation (LoRA)-based adapters in self-attention layers, while Zhong et al. 39 combined parallel CNN branches with LoRA for efficient multi-scale feature sampling. For ultrasound-specific tasks, SAMUS 40 optimizes SAM with lightweight CNNs for low-cost feature extraction and location adapters to handle ultrasound speckle noise, boosting efficiency and accuracy. Though tailored for ultrasound, its core idea—using lightweight adaptations to enhance modality-specific robustness—offers insights for adapting to low-contrast or dense tissues in X-ray images.
Despite these innovations, a critical bottleneck remains in real-world clinical applications: prompt quality. Ambiguous or low-quality prompts often degrade segmentation performance, particularly for subtle lesions. VRP-SAM 17 directly addresses this limitation by introducing a visual reference prompt encoder, which extracts semantic guidance from high-quality reference images, such as well-annotated lesion examples. This strategy enhances segmentation robustness under noisy prompting conditions and consistently outperforms geometric prompt baselines, marking a key advance in bridging the gap between generic and clinical segmentation.
Multi-stage learning
Multi-stage learning strategies have emerged as a critical advancement to overcome the limitations of single-stage models in capturing complex breast lesion structures, particularly in cases with irregular shapes, indistinct boundaries, or overlapping regions. While single-stage approaches remain effective for initial analysis, they often struggle to maintain a balance between localization accuracy and segmentation precision, highlighting the need for sequential refinement of detection and segmentation outputs.
Yaqub et al. 41 introduced a multi-stage framework that integrates atrous spatial pyramid attention with a cross-scale U-Net, enabling adaptive feature extraction and improving segmentation performance across varying lesion morphologies. In a detection-guided design, Yan et al. 42 combined YOLOv3 for lesion candidate identification with U-Net++ for boundary refinement, demonstrating the benefit of modular specialization through stage-wise optimization. Extending this paradigm, Khatua et al. 43 incorporated YOLOv8 with the SAM for instance segmentation, showing that integrating foundation models into detection–segmentation pipelines can leverage their generalization strengths in medical contexts. These works collectively illustrate that coupling robust detection backbones such as YOLO with advanced segmentation networks such as U-Net variants or SAM can effectively decouple localization and boundary modeling, thereby addressing limitations of single-stage designs.
In clinical practice, mammography and ultrasound often involve the presence of multiple coexisting lesions with diverse characteristics, including varying sizes and subtle or overlapping contours. Accurate segmentation under such complex conditions remains a persistent challenge. Although multi-stage frameworks have shown promise, many existing methods lack robustness when the initial detection results are imprecise and rarely incorporate mechanisms tailored to the nuances of breast imaging, such as detecting subtle early-stage lesions. Enhancements, including adaptive prompting strategies, cross-stage feature refinement, and domain-specific representation learning, are essential to improving both accuracy and reliability in clinical scenarios.
Motivated by these limitations, we propose a dual-stage detection–segmentation framework that combines an enhanced YOLOv10 detector with a visual reference prompt-guided SAM architecture. This approach improves robustness to low-quality prompts, enhances domain adaptability, and advances segmentation precision in challenging breast lesion cases, thereby bridging the gap between technical accuracy and clinical utility.
Methods
Data source and preprocessing
This study employed the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM), 44 a widely used public dataset for breast cancer research. It contains 753 cases with calcifications and 891 cases with masses. In this study, we focused specifically on mass lesions with corresponding bounding box annotations and segmentation masks.
To enhance lesion visibility and improve contrast in dense breast tissue, a standardized preprocessing pipeline was applied. First, contrast-limited adaptive histogram equalization (CLAHE) was used to improve local contrast. The CLAHE algorithm was configured with a clip limit of 2.0 and a tile grid size of 8 × 8, settings commonly adopted in mammogram enhancement tasks to strike a balance between local contrast and noise suppression.
Next, Gaussian filtering was employed to suppress background noise while preserving important lesion boundaries. A 5 × 5 kernel was used, which provides effective smoothing without excessively blurring fine structures.
All mammograms were resized to 640 × 640 pixels to match the input requirements of the YOLOv10-based detection model. Bounding box annotations were converted to the YOLO format, with normalized center coordinates, width, and height. The refined bounding boxes were subsequently used as visual prompts for the VRP-SAM segmentation module, enabling accurate lesion mask generation.
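The CLAHE and Gaussian steps above map directly onto OpenCV calls (`cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))` and `cv2.GaussianBlur(img, (5, 5), 0)`), while the bounding box conversion to YOLO format can be sketched in plain Python as below. This is an illustrative sketch, not the released code; function names are our own.

```python
# Sketch of the box conversion described above: pixel-space boxes
# (x1, y1, x2, y2) are mapped to YOLO format, i.e. normalized
# (center_x, center_y, width, height), assuming 640 x 640 inputs.

def to_yolo_format(x1, y1, x2, y2, img_w=640, img_h=640):
    """Convert a pixel-space box to normalized YOLO (cx, cy, w, h)."""
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h

def from_yolo_format(cx, cy, w, h, img_w=640, img_h=640):
    """Inverse mapping, used when passing boxes on to the segmentation stage."""
    x1 = (cx - w / 2.0) * img_w
    y1 = (cy - h / 2.0) * img_h
    x2 = (cx + w / 2.0) * img_w
    y2 = (cy + h / 2.0) * img_h
    return x1, y1, x2, y2
```

The round trip is lossless, which matters because the same boxes serve both as detection targets and as segmentation prompts.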
Figure 1 illustrates the full preprocessing pipeline, including CLAHE enhancement, Gaussian smoothing, bounding box generation, and prompt-based segmentation. To ensure consistency with model input requirements, all visualizations in Figure 1 are based on resized 640 × 640 images, which correspond to the standard input size for YOLOv10.

Preprocessing and prompt generation pipeline for breast lesion segmentation. (a) Original CBIS-DDSM mammogram image. (b) CLAHE-enhanced image to improve local contrast, particularly in dense breast tissue. (c) Gaussian-filtered image to suppress background noise while preserving lesion boundaries. (d) Lesion area detection using the YOLOv10-based model with refined bounding boxes. (e) High-precision lesion segmentation performed by the VRP-SAM model, guided by visual reference prompts.
Overall architecture of DVF-YOLO-Seg
Detection module
In the detection stage, we adopted an enhanced YOLOv10 architecture tailored for breast imaging. As shown in Figure 2, the overall framework integrates two key components: the DualConv module and VFL, both critical for improving lesion detection in challenging scenarios. To better extract lesion features under dense and noisy backgrounds, we incorporated the DualConv module into both the backbone and neck of YOLOv10; its integration position is detailed in Figure 3, and Figure 4 illustrates the module’s internal structure. DualConv employs a grouped convolution strategy, combining 3 × 3 kernels for spatial context modeling with 1 × 1 kernels for reducing channel redundancy. This dual-path design significantly improves the model’s capacity to detect small and irregular lesions without increasing the computational burden, and it reduces misidentification of normal tissue, a common issue in low-contrast or cluttered surroundings.

DVF-YOLO-Seg framework diagram. The upper part shows the detection stage, using an enhanced version of YOLOv10 (including DualConv and Varifocal Loss) to identify breast lesions and generate bounding box hints. The lower part is the segmentation stage, using the bounding box hints from the detection stage, the Visual Reference Prompt Segment Anything Model (VRP-SAM) module performs accurate segmentation of the lesion area.

The architecture of the improved YOLOv10 detection module. The model incorporates DualConv modules in the backbone and neck to enhance feature extraction. Varifocal Loss is used in the classification head to address class imbalance. The red bounding box represents the initial lesion detection. A perturbation mechanism (shown as

Structure diagram of DualConv group convolution technology. M represents the input channels (input feature map depth), N is the number of filters (output channels), and G denotes the number of groups in the group and dual convolutions. The convolution kernel sizes are 3 × 3 and 1 × 1.
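To make the grouping concrete, here is a minimal numpy sketch of the dual-path idea (an illustrative reading, not the authors' implementation): each output filter convolves a 3 × 3 kernel over its M/G-channel input group and a 1 × 1 kernel over the input channels, and the two paths are summed. Treating the 1 × 1 path as spanning all M channels is a simplifying assumption of this sketch.

```python
import numpy as np

def conv2d_single(x, w, pad):
    """'Same'-size 2D convolution of a multi-channel map with one filter.
    x: (C, H, W) input; w: (C, k, k) kernel; returns an (H, W) map."""
    C, H, W = x.shape
    k = w.shape[-1]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * w)
    return out

def dual_conv(x, w3, w1, groups):
    """DualConv-style layer: every output filter sums a 3x3 convolution
    over its input-channel group with a 1x1 convolution over all channels.
    x: (M, H, W); w3: (N, M // groups, 3, 3); w1: (N, M, 1, 1)."""
    M, H, W = x.shape
    N = w3.shape[0]
    gc = M // groups            # input channels per group
    per_group = N // groups     # filters assigned to each group
    out = np.zeros((N, H, W))
    for n in range(N):
        g = n // per_group      # which input-channel group this filter reads
        out[n] = (conv2d_single(x[g * gc:(g + 1) * gc], w3[n], pad=1)
                  + conv2d_single(x, w1[n], pad=0))
    return out
```

Because the 3 × 3 path touches only M/G channels per filter, the parameter count stays close to that of a plain grouped convolution while the 1 × 1 path preserves cross-group channel mixing.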
To address the severe class imbalance commonly found in mammographic datasets—where negative background samples dominate and tumor areas are rare—we integrated the VFL into the detection head. Unlike conventional binary losses, VFL dynamically adjusts the learning weight of each sample based on its classification confidence and bounding box quality. The loss is defined as:

$$\mathrm{VFL}(p, q) = \begin{cases} -q\bigl(q \log p + (1 - q)\log(1 - p)\bigr), & q > 0 \\ -\alpha p^{\gamma} \log(1 - p), & q = 0 \end{cases}$$

In this context, $p$ denotes the predicted IoU-aware classification score (IACS) and $q$ is the target score: for positive samples, $q$ is the IoU between the predicted bounding box and its ground truth, while for negative samples $q = 0$.
To facilitate this joint prediction, VFL adopts a star-shaped bounding box feature representation, in which each bounding box is associated with nine fixed sampling points (center, midpoints of edges, and corners). These points are used to extract geometric and contextual cues via deformable convolutions, enabling more accurate learning of IACS values. As illustrated in Figure 5, these sampling points are visualized as yellow circles, surrounding the initial red bounding box. This representation strengthens the correlation between classification and localization, improving ranking and final detection accuracy, especially for small or ambiguous lesions.

Visualization of the Varifocal Loss mechanism and IoU-aware classification. The yellow circles represent nine fixed sampling points arranged in a star-shaped pattern, including the box center, edge midpoints, and corners. These points are used to extract geometric and contextual features through deformable convolution. The red bounding box shows the initial prediction, while the refined blue box represents the final accurate detection of the breast mass region.
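The nine star-shaped sampling points can be enumerated directly from the box corners; the small helper below is illustrative, not taken from the paper's code.

```python
def star_points(x1, y1, x2, y2):
    """Nine star-shaped sampling points of a box: the four corners,
    the four edge midpoints, and the center."""
    xs = (x1, (x1 + x2) / 2.0, x2)
    ys = (y1, (y1 + y2) / 2.0, y2)
    return [(x, y) for y in ys for x in xs]
```

In the full model, features gathered at these nine locations via deformable convolution feed the IACS prediction, coupling localization quality to the classification score.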
The VFL function introduces two key hyperparameters: the scaling factor α, which balances the overall contribution of negative samples, and the focusing parameter γ, which down-weights easy negatives so that training concentrates on hard examples. Notably, positive samples are not attenuated, preserving the scarce lesion signal.
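A compact numpy sketch of the published VFL formulation follows; the defaults α = 0.75 and γ = 2.0 follow the original VarifocalNet paper and should be treated as assumptions here.

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-8):
    """Varifocal Loss for a single prediction.

    p : predicted IoU-aware classification score in (0, 1)
    q : target score (IoU with ground truth for positives, 0 for negatives)

    Positives are weighted by their target IoU q (no down-weighting),
    while negatives are scaled by alpha * p**gamma so that easy
    background samples contribute little to the gradient.
    """
    p = np.clip(p, eps, 1.0 - eps)
    if q > 0:  # positive sample: IoU-weighted binary cross-entropy
        return -q * (q * np.log(p) + (1.0 - q) * np.log(1.0 - p))
    return -alpha * p ** gamma * np.log(1.0 - p)  # negative sample
```

Note how a confident false positive (large p with q = 0) is penalized far more heavily than an easy negative, which is exactly the asymmetry that helps with the rare-lesion imbalance described above.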
Segmentation module
Building on the detection-stage outputs, in the second stage, we adopted the VRP-SAM. This model introduces a VRP encoder to extend the original SAM framework. The VRP encoder projects various annotation types, such as points, boxes, and masks, into interpretable embeddings for the SAM decoder. In our study, only box prompts are used, aligning with clinical practice where bounding boxes are readily available from radiologist annotations, ensuring practical applicability.
As shown in Figure 6, the VRP encoder comprises two main modules: the feature enhancer and the prompt generator. The feature enhancer first extracts annotation-based prototype features from a reference image and its annotation, enriching the features of both the reference and the target image with lesion-specific semantics.

The Visual Reference Prompt Segment Anything Model (VRP-SAM) framework is based on box prompts from YOLOv10. Only box prompts are used in this study; other prompt types supported by VRP-SAM are excluded.
These prototype features encode the semantic appearance of the annotated lesion and provide the guidance signal for subsequent prompt generation.
The prompt generator initializes a set of learnable queries, which first interact with the enhanced reference features to absorb lesion-specific semantics.
These reference-aware queries then attend to the features of the target image, transferring the reference semantics onto the corresponding lesion region in the image to be segmented.
The resulting prompt embeddings are passed to the SAM mask decoder, which produces the final segmentation mask.
In our framework, we use the bounding box prompts generated by the improved YOLOv10 as input to the prompt generator, guiding the segmentation process. Since breast lesions are typically small and may be obscured by surrounding tissue, providing accurate spatial priors is crucial. The bounding boxes help localize the lesion and direct attention to its boundary, allowing the model to generate more precise and context-aware visual prompts.
To enhance segmentation robustness under varying prompt conditions, we introduce a perturbation-based prompt refinement strategy. For each lesion, the original bounding box predicted by YOLOv10 is randomly expanded or contracted by 1–4 pixels along each axis. This process is repeated several times, producing a set of slightly different prompt boxes for the same lesion.
Each perturbed box is individually fed into VRP-SAM to produce a segmentation mask, yielding a set of candidate masks $\{M_1, M_2, \ldots, M_K\}$.
The final segmentation result $M^{*}$ is obtained by pixel-wise majority voting: a pixel is labeled as lesion only if it is positive in at least half of the candidate masks, i.e. $M^{*}(x) = \mathbb{1}\left[\frac{1}{K}\sum_{k=1}^{K} M_k(x) \ge 0.5\right]$.
This aggregation strategy mitigates the influence of inaccurate prompts and promotes consistent predictions. The integration of YOLOv10-generated box prompts, reference-aware attention, and perturbation-based refinement enables the VRP-SAM to segment lesions with improved accuracy and robustness, particularly for small or ambiguous targets.
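The perturbation-voting strategy can be sketched as follows; `segment_fn` is a hypothetical stand-in for a VRP-SAM inference call, and the perturbation magnitudes follow the 1–4 pixel range described above.

```python
import numpy as np

def perturb_box(box, rng, max_shift=4):
    """Randomly expand or contract each side of (x1, y1, x2, y2) by 1-4 px."""
    x1, y1, x2, y2 = box
    d = rng.integers(1, max_shift + 1, size=4)  # shift magnitudes per side
    s = rng.choice([-1, 1], size=4)             # expand or contract each side
    return (x1 + s[0] * d[0], y1 + s[1] * d[1],
            x2 + s[2] * d[2], y2 + s[3] * d[3])

def vote_segmentation(image, box, segment_fn, k=5, seed=0):
    """Run segment_fn (e.g. a VRP-SAM call) on k perturbed boxes and
    majority-vote the resulting binary masks pixel-wise."""
    rng = np.random.default_rng(seed)
    masks = [segment_fn(image, perturb_box(box, rng)) for _ in range(k)]
    return (np.mean(masks, axis=0) >= 0.5).astype(np.uint8)
```

Because a pixel must appear in at least half of the candidate masks, a single outlier prompt cannot flip the final mask, which is the stability property the voting scheme is meant to provide.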
Implementation details
To ensure optimal training and evaluation performance, specific hyperparameter settings were applied to each component of the proposed DVF-YOLO-Seg framework. These hyperparameters were carefully selected to improve convergence stability, reduce overfitting, and maintain computational efficiency throughout the training process. All experiments were conducted on an NVIDIA RTX 4090D GPU, using CUDA 11.3 on Ubuntu 20.04. The model was implemented in PyTorch 1.11.0 with Python 3.8. Table 1 summarizes the key hyperparameters used for the detection and segmentation modules, along with input preprocessing and learning configurations.
Summary of hyperparameter settings used for training DVF-YOLO-Seg.
GPU: graphical processing unit; OS: operating system; CLAHE: contrast limited adaptive histogram equalization.
Performance metrics
To comprehensively evaluate the detection and segmentation performance of the DVF-YOLO-Seg framework, we employed multiple evaluation metrics, including precision (P), recall (R), F1-score, mean average precision (mAP), and dice similarity coefficient (DSC). These metrics are derived from the confusion matrix, which contains true positives (TP), false positives (FP), and false negatives (FN). Below are the definitions and formulas for each metric:
Precision: Measures the proportion of true positive samples among all predicted positive samples: $P = \frac{TP}{TP + FP}$
Recall: Measures the proportion of actual positive samples that were correctly identified: $R = \frac{TP}{TP + FN}$
F1-score: The harmonic mean of precision and recall, balancing both metrics, especially useful in class-imbalanced tasks: $F1 = \frac{2 \times P \times R}{P + R}$
mAP: Used to assess detection performance, particularly for object localization: $\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i$, where $AP_i$ is the average precision (the area under the precision–recall curve) for class $i$ and $N$ is the number of classes.
DSC: Measures the overlap between the predicted segmentation mask and the ground truth: $\mathrm{DSC} = \frac{2|A \cap B|}{|A| + |B|} = \frac{2 \times TP}{2 \times TP + FP + FN}$, where $A$ and $B$ denote the predicted and ground-truth masks.
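The mask-level metrics above can be computed directly from binary masks; a small numpy sketch follows (note that for binary masks the pixel-wise F1-score and DSC coincide, which serves as a useful sanity check).

```python
import numpy as np

def mask_metrics(pred, gt):
    """Precision, recall, F1, and Dice from two binary masks of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)    # lesion pixels correctly predicted
    fp = np.sum(pred & ~gt)   # background pixels predicted as lesion
    fn = np.sum(~pred & gt)   # lesion pixels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, f1, dice
```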
In addition to these accuracy-related metrics, we also measured floating-point operations (FLOPs) and latency for the detection model in Table 2. FLOPs quantifies the number of floating-point operations needed for a single inference, while latency quantifies the inference time for a single image. These efficiency metrics provide insight into the computational cost and real-time performance of the models.
Detection performance and efficiency comparison of YOLO models.
mAP: mean average precision; FLOPs: floating-point operations.
Results
Performance comparison with mainstream segmentation models
To benchmark the overall segmentation effectiveness of DVF-YOLO-Seg, we compared it against a variety of state-of-the-art models across three prompt strategies: prompt-free, point-prompt, and box-prompt. This comparison aims to highlight how our two-stage framework, combined with optimized prompt design, outperforms existing approaches in clinical lesion segmentation.
As shown in Table 3 and Figure 7, DVF-YOLO-Seg achieved the best performance across all metrics, with a precision of 0.797, a recall of 0.815, a dice coefficient of 0.802, and an F1-score of 0.806, which surpasses both prompt-free and point-prompt models. This improvement highlights the benefit of the two-stage architecture and the integration of optimized box prompts for guiding segmentation.

Line chart comparing the performance of the compared models on the evaluation metrics.
Performance comparison of different methods.
Prompt-free models such as Swin UNETR, CoTr, and nnUNet show consistent but moderate performance, with dice scores between 0.725 and 0.755 and F1-scores from 0.744 to 0.757. While these methods benefit from end-to-end simplicity, their lack of prompt guidance limits sensitivity to small or low-contrast lesions, which are common in dense breast tissue. Point-prompt methods such as MedSAM and MobileSAM achieve higher recall values but at the cost of lower precision (0.732–0.766), indicating a tendency to over-segment, especially in ambiguous regions.
Box-prompt models, including YOLOv8+SAM and YOLOv9+SAM, show an improved balance between precision and recall, yet still trail DVF-YOLO-Seg across all reported metrics.
The 79.7% precision of DVF-YOLO-Seg reflects a deliberate tradeoff that prioritizes coverage over boundary tightness. In breast lesion segmentation, especially in screening or early diagnosis scenarios, the primary clinical objective is to avoid missing any suspicious lesion—including primary tumors, micro-metastases, and multifocal lesions. Missing such regions, even those <5 mm, can lead to delayed diagnosis, increased recurrence risk, and reduced survival.
In our model, slightly lower precision is a result of favoring broader inclusion in uncertain or low-contrast regions, thereby improving recall (81.5%) and dice coefficient (80.2%), which are crucial for ensuring lesion completeness. The model achieves a significantly higher F1-score (80.6%), indicating a strong balance between identifying relevant lesions and minimizing false positives. This strategy is particularly vital for identifying subtle lesions in dense or heterogeneous tissue, where under-segmentation may cause clinically significant misses.
Moreover, subjective evaluation results (see section “Subjective and clinical evaluation”) support the clinical acceptability of this approach. Radiologists found that the segmentation results provide reliable lesion coverage, even if boundary precision is not perfect. This balance ensures robust detection while minimizing the risk of missed diagnoses, aligning with the real-world demands of breast cancer screening and treatment planning.
Performance comparison of the first phase detection model
To assess the effectiveness of our detection module enhancements, we systematically evaluated the improved YOLOv10 against several widely adopted YOLO variants. The results, summarized in Table 2, demonstrate that our model achieves the optimal balance of precision, recall, and mAP. It also maintains superior robustness in identifying challenging mass lesions, particularly small or ambiguous ones.
Compared to the standard YOLOv10, our model—incorporating DualConv and VFL—yields notable gains. It shows a 1.3% increase in precision, a 1.3% increase in recall, and a 0.8% gain in mAP. These improvements come with only a marginal increase in computational cost. The consistency of these gains across training epochs is confirmed by the line graphs in Figure 8. Detection examples in Figure 9 and multi-scale heatmaps in Figure 10 provide visual validation, with the latter illustrating that our model focuses more strongly on lesion regions.

Visualization of the variations in different model evaluation indicators over the number of rounds. (a) Line graph depicting the change in P over the number of rounds; (b) line graph illustrating the change in mean average precision (mAP) changing over the number of rounds.

Example of detection results. The example results use (a) YOLOv5 model; (b) YOLOv7 model; (c) YOLOv8 model; (d) YOLOv10 model; (e) YOLOv9 model; (f) improved YOLOv10 model.

Original images and their heatmaps.
Against other mainstream YOLO variants, our model outperforms YOLOv9 in all key metrics. YOLOv8 achieves the highest precision at 0.852, but its recall of 0.832 and mAP of 0.841 are lower than those of our model, indicating a weaker ability to capture subtle lesions. YOLOX, despite similar latency, lags behind with a precision of 0.801, a recall of 0.809, and an mAP of 0.800, confirming the superiority of our design under comparable constraints.
In terms of efficiency, our model maintains 48.6G FLOPs and 11.3 ms latency on 640 × 640 inputs. This is on par with YOLOv8 and significantly better than heavier architectures such as YOLOv7 and DAMO-YOLO, both of which exceed 50G FLOPs; YOLOv7 also incurs a latency of around 12.2 ms. Such efficiency ensures real-time applicability in clinical settings. In contrast, YOLOv5, though fast, lacks sensitivity in complex backgrounds. Our model strikes a stronger balance between precision, recall, and inference speed.
Performance comparison of the second-stage segmentation model
To comprehensively evaluate the effectiveness of the second-stage segmentation module, we compared the performance of SAM and VRP-SAM under various configurations of the first-stage detection model. As shown in Table 4, when using the original YOLOv10 without enhancements, SAM achieves a dice score of 0.690 and an F1-score of 0.700, while VRP-SAM outperforms it with scores of 0.718 and 0.723, respectively. This baseline comparison highlights the advantage of reference-guided prompts, even without upstream improvements.
Encoder ablation experiment.
SAM: segment anything model; VRP-SAM: visual reference prompt SAM.
Further improvements are observed when individual enhancements are applied. Specifically, using DualConv alone improves VRP-SAM’s dice to 0.733 and F1-score to 0.734, while applying VFL alone leads to greater gains, with a dice of 0.776 and an F1-score of 0.777. These results confirm that both modules independently contribute to more robust lesion localization and segmentation.
When both DualConv and VFL are integrated into the YOLOv10 detection stage, the performance of VRP-SAM reaches its peak: precision of 0.797, recall of 0.815, dice of 0.802, and F1-score of 0.806. This configuration consistently outperforms all others across multiple metrics, validating the cumulative benefit of architectural and loss function improvements in the detection stage and their downstream impact on segmentation quality.
Figure 11 visually demonstrates the model’s overall superiority, while Figure 12 provides a detailed comparison of four key metrics—precision, recall, dice score, and F1-score—across different configurations. The bar chart clearly highlights the incremental improvements achieved by each module and underscores the comprehensive advantage of the final integrated design.

Segmentation result example. This figure shows segmentation examples of different models, where (a) is the original image; (b) EfficientDet; (c) Swin UNETR; (d) MobileNet; (e) nnUNet; (f) MedSAM; (g) SAMed; (h) SAMUS; (i) our model DVF-YOLO-Seg; and (j) ground truth.

Segmentation performance under different configurations of detection modules and prompt encoders.

Detection box accuracy evaluation score graph.
Ablation study on model modules and prompt strategies
To dissect the individual contributions of each module in our framework, we conducted targeted ablation experiments. These include comparative analyses of backbone convolution types, loss function designs, and prompt refinement strategies, offering insights into how each component influences final detection and segmentation performance.
As shown in Table 5, the proposed DualConv module achieves the best performance among the tested alternatives. Specifically, it yields a precision of 0.861, a recall of 0.825, an mAP of 0.844, and an F1-score of 0.843, outperforming PConv, SCConv, and DynamicConv. Despite a slight drop in recall compared to the baseline, DualConv delivers a substantial gain in precision (+2.7%), indicating that it significantly reduces false positives at only a marginal cost in sensitivity. This confirms the effectiveness of its hybrid kernel design in improving spatial feature discrimination.
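For readers unfamiliar with the module, a hedged PyTorch sketch of a DualConv-style block is shown below. It assumes the commonly described design of parallel grouped 3 × 3 and pointwise 1 × 1 branches whose outputs are summed; the group count is illustrative and not the exact configuration used in our network:

```python
import torch
import torch.nn as nn

class DualConv(nn.Module):
    """Illustrative DualConv-style block: a grouped 3x3 branch for spatial
    features plus a full 1x1 branch for channel mixing, summed element-wise."""
    def __init__(self, in_ch, out_ch, groups=2, stride=1):
        super().__init__()
        # Grouped 3x3: cheap spatial filtering (padding=1 preserves size).
        self.conv3 = nn.Conv2d(in_ch, out_ch, 3, stride, 1,
                               groups=groups, bias=False)
        # Pointwise 1x1: dense cross-channel interaction.
        self.conv1 = nn.Conv2d(in_ch, out_ch, 1, stride, 0, bias=False)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)
```

With matching strides, both branches produce identically shaped feature maps, so the summation needs no extra alignment.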
Convolution modules ablation experiment.
mAP: mean average precision.
PConv and SCConv both show lower overall scores, with F1-scores of 0.796 and 0.822, respectively, which demonstrates that DualConv not only improves feature extraction but also yields a better tradeoff between sensitivity and specificity in dense breast tissue environments.
Table 6 further analyzes the effect of combining DualConv and VFL. Using only VFL yields a balanced F1-score of 0.840, while DualConv alone results in a higher F1-score of 0.843. When both modules are integrated, the model achieves the highest performance across all metrics, including an F1-score of 0.857. This confirms that the two modules are complementary: DualConv boosts spatial discrimination, while VFL addresses class imbalance, especially for small or infrequent lesions.
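VFL follows the varifocal loss formulation, which weights positive samples by an IoU-aware target q and down-weights easy negatives with a focal term. A NumPy sketch is given below; the α and γ values are the commonly used defaults and are assumed here rather than taken from our training configuration:

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal loss over predicted scores p and IoU-aware targets q.
    q is 0 for negatives and the box IoU for positives."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pos = q > 0
    loss = np.empty_like(p)
    # Positives: BCE against the soft target q, weighted by q itself,
    # so high-quality boxes dominate the gradient.
    loss[pos] = -q[pos] * (q[pos] * np.log(p[pos])
                           + (1 - q[pos]) * np.log(1 - p[pos]))
    # Negatives: focal down-weighting suppresses abundant easy background.
    loss[~pos] = -alpha * p[~pos] ** gamma * np.log(1 - p[~pos])
    return loss.mean()
```

The q-weighting on positives is what helps rare, small lesions: a scarce positive with a reasonable IoU still contributes a meaningful gradient, while the sea of easy negatives is damped by the pᵞ factor.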
Loss function ablation experiment.
mAP: mean average precision.
To assess the robustness of segmentation under different prompt qualities, we additionally conducted ablation experiments on the prompt refinement strategy. As shown in Table 7, we compared three settings: no perturbation, perturbation without majority voting, and full strategy (perturbation + voting).
Effect of perturbation and voting on segmentation performance.
The results demonstrate that the bounding box perturbation alone improves segmentation robustness, leading to a 1.4% increase in dice. When combined with majority voting, the performance further improves, achieving an additional 3% gain in dice and a total F1-score improvement of 3.1% compared to the unrefined baseline. This confirms that the prompt refinement strategy effectively mitigates the sensitivity of segmentation to box localization errors, particularly for small or low-contrast lesions, enhancing both accuracy and consistency.
Subjective and clinical evaluation
This study aims to provide valuable insights for the optimization of technology and clinical application in the medical imaging field by subjectively evaluating the performance of various tumor detection and segmentation models. The experimental evaluation was conducted by five experienced doctors from two hospitals: the Affiliated Cancer Hospital of Xinjiang Medical University and the First Affiliated Hospital of Xinjiang Medical University. The doctors spanned three age groups (25–35, 36–45, and 46–55), and all had more than six years of professional experience in radiology, imaging, and surgery, ensuring the diversity and objectivity of the evaluation results.
To ensure fairness, the doctors were unaware of which model the segmentation results corresponded to, assigning scores solely based on the quality of the segmentation. The evaluation considered both detection accuracy and segmentation quality. For the detection stage, the doctors assessed whether the model’s bounding boxes accurately covered the main tumor regions. For segmentation, they evaluated how well the segmented boundaries matched the true tumor boundaries, with a focus on clarity, smoothness, and avoidance of oversegmentation or undersegmentation. They also took into account the model’s ability to accurately locate the tumor and avoid any risk of misdiagnosis or missed lesions. Each aspect was rated on a 1–5 scale, where 1 represented poor performance and 5 represented perfect accuracy.
Detection box accuracy evaluation
The core task of tumor detection is to assess the accuracy of the detection boxes in covering the main regions of the tumor. The experimental results show significant performance differences among the detection models. According to the doctors' ratings, the improved YOLOv10 outperforms all other models, with an average score of 4.2, demonstrating exceptionally high detection reliability and stability.
In contrast, YOLOv8 and YOLOv9 scored 3.8 and 3, respectively, showing slight deficiencies in accuracy when tumors had complex shapes and falling short of the improved YOLOv10. YOLOv5 and YOLOv7 performed poorly, scoring 3 and 2, respectively, and their detection box accuracy needs further optimization. The worst performer was the original YOLOv10, with an average score of 1.4: in 60% of the evaluations, its detection boxes were judged to deviate significantly, creating a high risk of misdiagnosis and missed diagnoses that could result in severe diagnostic errors.
Although Table 2 reports a relatively high mAP of 0.845 for the original YOLOv10, the subjective rating in Figure 13 shows a much lower average score (1.4). This discrepancy arises from the difference between objective metrics and clinical expectations: mAP quantitatively evaluates the overlap between predicted and ground truth boxes across a range of IoU thresholds, while the subjective rating reflects physicians’ perception of diagnostic usability. In clinical contexts, even a bounding box with decent IoU might be considered inadequate if it fails to fully capture suspicious areas or is misaligned in dense tissue. This highlights the importance of complementing objective metrics with subjective evaluations to assess real-world clinical performance more accurately.
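The IoU underlying mAP can be computed for a pair of axis-aligned boxes with the standard formula below, included here for clarity; mAP then averages precision over recall levels at one or more IoU thresholds:

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection rectangle (empty if boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0
```

A box shifted by a quarter of its width still retains an IoU above 0.5 against the ground truth, which is why a prediction can pass common mAP thresholds yet miss tissue a radiologist considers suspicious.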

Segmentation quality evaluation heatmap. Shows the performance scores of various models on three metrics. The color gradient ranges from blue (lower score) to red (higher score), visually showing the difference in model performance.
Segmentation quality evaluation
Overlap between segmentation template and tumor boundary: The evaluation of tumor segmentation models initially depends on the overlap between the segmentation template and the actual tumor boundaries. According to subjective ratings, DVF-YOLO-Seg aligned well with the tumor boundary, with an average score of 2.87, providing a reliable reference for clinicians regarding mass size and shape. nnUNet and MobileNet scored 2.87 and 2.93, respectively; both performed well in boundary alignment and are suitable for simpler clinical scenarios. However, EfficientDet and SAMed received lower scores of 2.27 and 2.47, respectively. These models struggled with complex cases or blurred boundaries, failing to cover the entire mass boundary and producing errors large enough to affect the assessment of the lesion’s extent.
Clarity of segmentation boundaries: In the evaluation of segmentation boundary clarity, DVF-YOLO-Seg, nnUNet, and MobileNet performed excellently, with average scores of 3.07, 3.2, and 2.93, respectively. These models demonstrated relatively clear segmentation boundaries with smooth transitions, effectively reducing blurry areas and helping doctors accurately distinguish the tumor from surrounding tissue, thereby improving diagnostic accuracy. In contrast, EfficientDet and SAMed exhibited poorer boundary clarity, scoring 2.4 and 2.33, respectively. These models presented blurred and irregular boundaries, making tumor recognition difficult and negatively impacting diagnostic precision.
Tumor localization accuracy: Regarding tumor localization accuracy, SAMUS, DVF-YOLO-Seg, and MobileNet performed excellently, with average scores ranging from 3.2 to 3.4. These models accurately localized tumors, enabling doctors to quickly focus on the lesion area for further analysis and diagnosis. For example, SAMUS scored 3.2, demonstrating minimal localization errors and accurately identifying the tumor region, providing essential support for diagnosis. In contrast, EfficientDet and Swin UNETR performed poorly in terms of localization accuracy, scoring 2.6 and 2.8, respectively. These models were prone to misjudging lesion location, affecting subsequent treatment decisions.
Comprehensive evaluation and misdiagnosis risk
Considering segmentation precision, boundary clarity, and localization accuracy, DVF-YOLO-Seg ranked first with a comprehensive score of 3.11, offering balanced performance that effectively meets the basic requirements for clinical diagnosis while minimizing the risk of misdiagnosis and missed diagnoses. In contrast, EfficientDet had a lower comprehensive score of 2.42, underperforming in multiple dimensions. This model requires further optimization to improve segmentation precision, boundary clarity, and localization accuracy, thereby reducing misdiagnosis and missed diagnosis risks.
This subjective evaluation experiment provides a comprehensive assessment of various detection and segmentation models, highlighting their strengths and weaknesses in terms of performance. Among the detection models, the improved YOLOv10 exhibited the best detection box coverage accuracy, while DVF-YOLO-Seg performed most consistently across all evaluation metrics in segmentation, making it suitable for real-world clinical auxiliary diagnosis.
However, this study still has limitations, such as the small number of evaluating doctors, which may introduce some bias into the results. Future research should expand the sample to include more doctors from diverse backgrounds and incorporate additional objective evaluation metrics for a more comprehensive and in-depth assessment of the models. Furthermore, models with suboptimal performance should be improved through algorithmic adjustments and parameter optimization, promoting the widespread and continuous development of detection and segmentation technologies in the medical field and providing more precise support for clinical diagnosis. The results are shown in Figures 13 and 14.

Typical failure cases of DVF-YOLO-Seg in breast lesion detection and segmentation. The figure shows three typical failure types, one in each row: (a) row represents small or low-contrast lesions; (b) row represents lesions close to dense tissue or image boundaries; (c) row represents irregular or multifocal lesions. From left to right, the columns represent: original image, detection results, segmentation output, and zoomed area. Red box: predicted detection result; green area: model segmentation result; red outline: ground truth.
Small-lesion analysis
In addition to evaluating the overall lesion detection and segmentation performance, we conducted a detailed analysis focused on small lesions (those with a diameter <5 mm), as these are often the most challenging to detect because of their subtle appearance, low contrast, and frequent overlap with dense tissue. Detecting such small lesions is critical for early diagnosis, as they are often the first sign of malignancy; their low visibility and the high noise level of mammography images, however, make them particularly difficult for conventional models.
In the analysis of small lesions, we focused on evaluating the segmentation performance of the model under various detection configurations. For this analysis, we selected a subset of 46 small lesions from the CBIS-DDSM dataset, each with a diameter <5 mm. These lesions were carefully chosen to represent a range of challenging detection cases, including both well-defined and ambiguous lesions. The model first detects lesions using different YOLOv10 configurations (baseline, with DualConv, with VFL, and both combined), generating bounding boxes. These bounding boxes then serve as input for the segmentation stage, where the VRP-SAM model is used to segment the detected lesion areas. Performance metrics, including dice score, F1-score, precision, and recall, were calculated based on the segmentation results, with a particular focus on recall to assess the model’s ability to correctly identify small lesions.
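One way to derive such a subset is to estimate each lesion's equivalent-circle diameter from its mask area and the pixel spacing. The sketch below assumes roughly circular lesions and a known isotropic spacing, which may not match the exact selection procedure applied to the CBIS-DDSM annotations:

```python
import numpy as np

def lesion_diameter_mm(mask, spacing_mm):
    """Equivalent-circle diameter of a binary lesion mask, in millimeters.
    Assumes isotropic pixel spacing and a roughly circular lesion."""
    area_mm2 = mask.astype(bool).sum() * spacing_mm ** 2
    return 2.0 * np.sqrt(area_mm2 / np.pi)

def select_small_lesions(masks, spacing_mm, max_d=5.0):
    """Keep only masks whose equivalent diameter is below max_d mm."""
    return [m for m in masks if lesion_diameter_mm(m, spacing_mm) < max_d]
```

Diameter estimates of this kind are sensitive to segmentation noise at the boundary, so in practice the cut-off is usually applied to the ground-truth annotation rather than to model output.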
As shown in Table 8, the introduction of DualConv and VFL progressively improved the performance on small lesions. For the baseline YOLOv10 model, the dice score was 0.675, and the recall score was 0.700. The addition of DualConv improved these metrics, with a dice score of 0.711 and a recall score of 0.724. The VFL further boosted the performance, increasing the dice score to 0.728 and the recall score to 0.745. When both DualConv and VFL were integrated, the model achieved the highest performance, with a dice score of 0.750 and a recall score of 0.783, indicating that the combination of these two modules provided a substantial improvement in detecting and segmenting small lesions.
Segmentation performance on small lesions under different detection configurations.
The results indicate that the DualConv module significantly enhances the model’s ability to detect small lesions by improving the extraction of multi-scale features, which is essential for capturing subtle lesion boundaries. On the other hand, VFL addresses the class imbalance between small and large lesions, helping the model better prioritize the detection of small lesions during training. The combined use of these two modules resulted in the highest performance, improving both the recall and dice score, which are critical for detecting small, clinically significant lesions.
Failure case analysis
Although DVF-YOLO-Seg demonstrates superior performance in breast lesion detection and segmentation across most metrics and clinical scenarios, several failure cases were observed that reveal potential limitations. These failure cases can be broadly categorized into three types, as illustrated in Figure 15: (1) extremely small or low-contrast lesions, (2) lesions located near dense glandular tissue or image boundaries, and (3) irregularly shaped or multifocal lesions.
In the first category, the model occasionally fails to detect micro-lesions that are <5 mm in diameter, especially when embedded in dense fibroglandular backgrounds. These lesions tend to have subtle boundaries and low signal-to-noise ratios, which may lead to missed detections or inaccurate segmentation masks. In clinical practice, overlooking such small but potentially malignant foci could delay diagnosis or necessitate repeat imaging.
Secondly, lesions adjacent to high-density tissue or those close to the image boundary are prone to inaccurate box localization or boundary leakage. In some cases, the prompt encoder generates suboptimal guidance due to ambiguous detection results, leading to undersegmentation. This reflects the model’s sensitivity to localization noise and its reliance on upstream detection accuracy.
Lastly, complex lesions with irregular or spiculated margins challenge the model’s ability to fully capture lesion extent. While the prompt refinement strategy mitigates some of this error, segmentation outputs occasionally deviate from the actual lesion shape, affecting the assessment of morphology-related biomarkers.
These failure cases primarily stem from three key limitations of the model: insufficient detection capabilities in low-contrast regions, where small lesions are easily obscured by dense tissue or noise; poor localization accuracy for lesions adjacent to tissue boundaries, leading to positioning errors; and inadequate segmentation robustness for complex, irregularly shaped or multifocal lesions, resulting in imprecise boundary delineation. Addressing these limitations requires further refinement of the detection stage—especially for low-contrast or border-adjacent targets—and the integration of potential post-processing techniques such as shape-aware correction or adaptive prompt adjustment. Future work may also incorporate temporal or multi-view consistency (e.g., using both mediolateral oblique and craniocaudal views) to enhance robustness in such challenging scenarios.
Discussion
In this study, we proposed DVF-YOLO-Seg, a two-stage framework designed for accurate breast lesion mass detection and segmentation. The detection stage enhances YOLOv10 by incorporating the DualConv module, which enables improved multi-scale feature extraction and strengthens spatial and channel representations. This enhancement is particularly beneficial in scenarios with small, low-contrast lesions or heterogeneous tissue backgrounds. Additionally, the use of VFL effectively mitigates class imbalance by dynamically reweighting samples according to their quality and rarity, further improving model precision and sensitivity.
The segmentation stage employs a visual reference prompt-based SAM, where bounding box predictions from the detection stage act as prompts to guide fine-grained segmentation. This approach allows the segmentation model to benefit from spatial priors and contextual cues, resulting in more accurate delineation of lesion boundaries, even in cases with irregular shapes or blurred margins.
Beyond quantitative evaluation, we also conducted a subjective clinical assessment with radiologists from two hospitals. Their evaluations confirmed that the proposed framework significantly improves lesion localization and segmentation clarity, particularly for subtle lesions, highlighting its potential utility in real-world diagnostic workflows. The model’s 11.3 ms inference time is well-suited for real-time clinical use.
It is worth noting that the precision of the proposed DVF-YOLO-Seg framework, while moderate at 79.7%, reflects a deliberate design strategy prioritizing lesion completeness. Given the clinical context of breast cancer screening—where missing subtle lesions can lead to delayed diagnosis—our model favors broader inclusion to improve recall and ensure morphological coverage. This trade-off is further supported by high F1-score and dice metrics, and validated through subjective evaluations and failure case analysis. Collectively, these components contribute to the strong performance of DVF-YOLO-Seg across multiple evaluation metrics, validating its effectiveness in addressing core challenges in breast image analysis.
Limitations and future work
Despite the promising results, DVF-YOLO-Seg has several limitations. First, the model occasionally fails to detect extremely small lesions (<5 mm) or those located near dense glandular tissue or image boundaries. As illustrated in Section “Failure case analysis,” such cases often lead to inaccurate bounding boxes or suboptimal segmentation masks due to weak visual cues or spatial interference. Second, the effectiveness of the segmentation stage is inherently tied to the accuracy of detection prompts. Errors in the upstream detection stage may propagate and degrade segmentation quality.
Moreover, our current experiments are limited to 2D full-field digital mammograms from a single dataset (CBIS-DDSM), without incorporating multi-view consistency or multimodal information. This may limit the model’s generalizability to other imaging conditions, populations, or modalities such as ultrasound or MRI. Furthermore, while the model performs well in controlled environments, its relatively high computational cost may hinder deployment in resource-constrained clinical settings.
Building upon insights from the small-lesion analysis and failure case exploration, several research directions are envisioned to further enhance the capabilities of the proposed framework. To improve the detection of micro-lesions <5 mm, future designs will incorporate super-resolution feature enhancement modules capable of amplifying subtle lesion signals. This will be complemented by lesion-aware attention mechanisms aimed at suppressing interference from dense glandular regions and improving focus on diagnostically significant features.
Improved spatial context modeling will also be prioritized, particularly for lesions adjacent to image boundaries or embedded within heterogeneous tissue structures. Incorporating multi-view consistency—by aligning mediolateral oblique and craniocaudal projections—may offer richer contextual understanding and reduce localization errors. In addition, embedding shape priors into the segmentation architecture is expected to facilitate more accurate delineation of lesions with irregular or spiculated margins.
Enhancing the model’s generalizability is another critical goal. This includes validating performance on external datasets such as MIAS and extending applicability to multimodal imaging scenarios, including ultrasound and MRI. Cross-modal representation learning will be explored to effectively integrate heterogeneous image characteristics.
To improve deployment in clinical environments, efforts will also focus on lightweight optimization. This involves compressing the segmentation module through knowledge distillation and replacing standard convolutions with more efficient dynamic convolutional operations to reduce computational complexity without sacrificing accuracy.
