Abstract
Introduction
Breast cancer is the most commonly diagnosed malignancy and the leading cause of cancer-related death among women worldwide. 1 In 2022 alone, approximately 2.3 million new cases were reported, with over 670,000 deaths, according to the World Health Organization. 2 Early-stage breast cancer typically lacks obvious symptoms, making timely detection extremely challenging. As the disease progresses, symptoms such as palpable lumps, breast pain, or nipple discharge may appear, 3 often indicating advanced stages that are more difficult to manage. These facts underscore the urgent need for reliable and efficient methods for early diagnosis and treatment planning.
Imaging techniques such as mammography, ultrasound, and magnetic resonance imaging (MRI) have become indispensable in the early detection of breast tumors. 4 However, their effectiveness often hinges on the experience and subjective judgment of radiologists, leading to variability and inconsistency in diagnoses. As a result, there has been a growing interest in leveraging artificial intelligence (AI) to assist with breast cancer diagnosis, particularly in the automation of lesion detection and segmentation.
Computer-aided detection systems have emerged to reduce diagnostic variability and enhance efficiency by automatically analyzing breast imaging data, especially mammograms. 5 Convolutional neural networks (CNNs) such as U-Net, 6 U-Net++, 7 and U-Net 3+ 8 have significantly advanced segmentation accuracy in this context, enabling precise delineation of lesion boundaries—a critical step in breast cancer diagnosis. While CNNs have shown versatility in medical image analysis, 9 their application to breast lesion segmentation faces unique challenges.
However, despite their effectiveness, CNN-based segmentation methods primarily rely on stacking convolutional layers to expand the receptive field, which can increase parameter complexity and reduce computational efficiency. Furthermore, their local operation nature often fails to capture global anatomical context, leading to over- or undersegmentation—particularly in dense breast tissues with indistinct lesion boundaries, where preserving subtle structural details is crucial. To address these issues, transformer-based segmentation models have gained traction due to their ability to capture long-range dependencies. Hybrid models such as HCT-Net 10 and ScribFormer 11 combine CNNs and transformers to improve global context modeling, although simple feature fusion across scales often fails to deliver consistent representations.
Another major limitation in current research is the disconnection between lesion detection and segmentation. Most studies address these tasks separately, focusing either on identifying lesion locations or on delineating lesion boundaries. However, in clinical practice, both tasks are crucial: detection locates the region of interest, while segmentation provides morphological details such as size and shape. Bridging this gap is essential for developing comprehensive and clinically applicable AI systems.
In this study, we propose DVF-YOLO-Seg, a two-stage detection–segmentation framework designed to enhance the precision of breast lesion detection and segmentation. While similar works, such as YOLOv8+SAM 12 and YOLOv9+SAM, 13 also employ a detection–segmentation pipeline, our approach introduces several distinct innovations: DVF-YOLO-Seg integrates its modules in a synergistic manner rather than simply stacking independent components.
First, we incorporate the DualConv 14 module into the YOLOv10 15 detection model to improve the detection stage. This module combines 3 × 3 and 1 × 1 grouped convolutions, enabling better extraction of multi-scale features, which significantly enhances the detection of small and irregular lesions. Varifocal Loss (VFL) 16 is then applied to mitigate class imbalance issues, refining bounding box precision for small lesions.
The bounding boxes produced by the detection stage are then passed to the segmentation stage as spatial prompts, where a Visual Reference Prompt Segment Anything Model (VRP-SAM) 17 further refines the segmentation output. Unlike the original segment anything model (SAM), 18 which uses standard bounding boxes, VRP-SAM leverages visual reference prompts to guide the segmentation process, ensuring higher robustness to low-quality bounding box predictions. This significantly enhances segmentation accuracy, especially when the predicted bounding boxes are imprecise.
Additionally, we introduce a perturbation-voting strategy that mitigates the impact of detection errors and improves segmentation stability. By incorporating multiple variations of the bounding box predictions and performing majority voting, we further reduce the risk of errors due to inaccurate detections, making the model more reliable in clinical scenarios.
The main contributions of this work are as follows:
1. A two-stage detection–segmentation framework is developed, where lesion bounding boxes generated by an optimized YOLOv10 are refined through VRP-SAM for high-precision segmentation.
2. YOLOv10 is enhanced with DualConv and VFL to strengthen feature extraction and weight distribution, increasing robustness in dense tissue and complex backgrounds.
3. Clinical applicability and accuracy are validated through subjective evaluations by physicians from two hospitals, demonstrating the model’s real-world effectiveness.
Related work
Detection of medical images
Breast cancer detection has undergone a paradigm shift, moving from traditional geometric and texture-based feature engineering to advanced deep learning models that integrate multimodal data and multitask learning. Among these, CNNs have gained widespread adoption due to their ability to automatically learn hierarchical features from medical images. However, their heavy reliance on large-scale annotated datasets remains a critical bottleneck, limiting their generalizability in real-world clinical settings where labeled data are often scarce and costly to obtain.
To address this, Zhang et al. 19 incorporated a Bayesian framework into YOLOv4 to quantify prediction uncertainty and enhance model robustness in data-scarce environments. Despite these improvements, Das et al. 20 pointed out that traditional CNNs still face a high risk of overfitting when labeled data are limited, highlighting the need for more adaptive architectures. In response to small dataset challenges, Prinzi et al. 21 proposed a YOLOv5-based model for mammographic detection, leveraging transfer learning to boost performance, even for difficult lesions such as asymmetries.
As a representative of single-stage detectors, the YOLO family has become a dominant solution for real-time object detection in clinical workflows, owing to its unified design for feature extraction and lesion localization. For example, Aly et al. 22 applied YOLO to early breast lesion detection, significantly improving diagnostic efficiency. Building on this, Su et al. 23 enhanced YOLOv5 with transformer modules to capture long-range contextual dependencies in mammograms. While these advancements have improved performance, challenges still exist in detecting small lesions with high precision and handling complex tissues.
To meet the demands of real-time deployment in low-resource settings, current research has shifted toward lightweight architectures and multi-input fusion. For instance, combining weighted least squares regression with multi-view fusion enabled efficient thermal imaging-based detection on mobile devices. 24 Meanwhile, three-dimensional mammography has emerged to overcome the limitations of two-dimensional (2D) images. Umamaheswari and Mohanbabu 25 introduced a hybrid segmentation model for volumetric images, which combines adaptive thresholding, region-growing, and cat swarm optimization techniques.
Beyond imaging, novel multimodal approaches have been explored to improve breast cancer detection. BiaCanDet 26 integrates bioimpedance signals with deep learning using spatial–temporal attention to distinguish malignant tissues. Ensemble learning has also been employed, combining CNN architectures such as EfficientNet, AlexNet, ResNet, and DenseNet with customized scaling and feature fusion to further improve performance. 27 Recent advancements have also explored the potential of large language models (LLMs) in breast cancer diagnosis and treatment. Ghorbian et al. 28 demonstrated that LLMs, by analyzing vast amounts of medical data, can significantly enhance diagnostic accuracy, treatment decisions, and clinical workflows.
Segmentation of medical images
Early approaches to breast lesion segmentation primarily relied on handcrafted features, focusing on shape, boundary texture, and intensity. While intuitive, these methods suffered from poor generalizability, as manually designed features often failed to adapt to variations in lesion morphology, such as irregular boundaries, and differences in imaging conditions, including noise and contrast. The emergence of CNN-based models addressed this gap by enabling automatic hierarchical feature learning, significantly advancing the field. For example, Wang et al. 29 proposed the MLNnet to segment clustered microcalcifications in mammograms, with a particular focus on mitigating domain shifts caused by variations in patient pose, breast density, and imaging acquisition protocols. Following this, AAPFC-BUSnet 30 further enhanced segmentation accuracy by integrating multi-scale features through deformable convolutions to adapt to irregular lesion shapes and adaptive self-attention mechanisms to prioritize clinically relevant regions.
Despite these advances, CNN-based models are constrained by their local receptive fields, which limit their ability to capture fine morphological details—especially for lesions with ambiguous boundaries or subtle structural distortions. To overcome this limitation, Li et al. 31 developed a dual-stream framework: a locality-preserving learner focuses on fine-grained boundary details, while a conditional map learner enhances global context, jointly improving segmentation precision and diagnostic prediction. Similarly, MF-Net 32 addressed multi-scale challenges by combining transformer modules for global feature modeling with multi-path extraction for local detail preservation. For low-boundary-clarity scenarios, Jiang et al. 33 introduced a semi-supervised approach that leverages adaptive patch enhancement to highlight indistinct boundaries and contrastive learning to learn discriminative features from unlabeled data, effectively boosting performance in data-scarce settings.
Recently, the SAM revolutionized generic segmentation with a prompt-driven vision transformer framework, supporting flexible inputs such as points, boxes, text, and masks. Its successor, SAM2, 34 further improved usability by reducing interaction latency. However, SAM’s pretraining on natural images leaves it devoid of domain-specific knowledge, limiting its effectiveness in breast lesion segmentation.
To address this domain mismatch, several medical-oriented variants have been proposed. PATH-SAM2 35 integrates a Kolmogorov-Arnold classification module with a UNI encoder pretrained on histopathological data, enhancing adaptability to tissue-level features. MedSAM 36 and SAMMI 37 achieve better medical alignment via large-scale fine-tuning on medical images, though their heavy computational overhead hinders deployment in resource-constrained clinical settings. To reduce latency, SAMed 38 introduced low-rank adaptation (LoRA)-based adapters in self-attention layers, while Zhong et al. 39 combined parallel CNN branches with LoRA for efficient multi-scale feature sampling. For ultrasound-specific tasks, SAMUS 40 optimizes SAM with lightweight CNNs for low-cost feature extraction and location adapters to handle ultrasound speckle noise, boosting efficiency and accuracy. Though tailored for ultrasound, its core idea—using lightweight adaptations to enhance modality-specific robustness—offers insights for adapting to low-contrast or dense tissues in X-ray images.
Despite these innovations, a critical bottleneck remains in real-world clinical applications: prompt quality. Ambiguous or low-quality prompts often degrade segmentation performance, particularly for subtle lesions. VRP-SAM 17 directly addresses this limitation by introducing a visual reference prompt encoder, which extracts semantic guidance from high-quality reference images, such as well-annotated lesion examples. This strategy enhances segmentation robustness under noisy prompting conditions and consistently outperforms geometric prompt baselines, marking a key advance in bridging the gap between generic and clinical segmentation.
Multi-stage learning
Multi-stage learning strategies have emerged as a critical advancement to overcome the limitations of single-stage models in capturing complex breast lesion structures, particularly in cases with irregular shapes, indistinct boundaries, or overlapping regions. While single-stage approaches remain effective for initial analysis, they often struggle to maintain a balance between localization accuracy and segmentation precision, highlighting the need for sequential refinement of detection and segmentation outputs.
Yaqub et al. 41 introduced a multi-stage framework that integrates atrous spatial pyramid attention with a cross-scale U-Net, enabling adaptive feature extraction and improving segmentation performance across varying lesion morphologies. In a detection-guided design, Yan et al. 42 combined YOLOv3 for lesion candidate identification with U-Net++ for boundary refinement, demonstrating the benefit of modular specialization through stage-wise optimization. Extending this paradigm, Khatua et al. 43 incorporated YOLOv8 with the SAM for instance segmentation, showing that integrating foundation models into detection–segmentation pipelines can leverage their generalization strengths in medical contexts. These works collectively illustrate that coupling robust detection backbones such as YOLO with advanced segmentation networks such as U-Net variants or SAM can effectively decouple localization and boundary modeling, thereby addressing limitations of single-stage designs.
In clinical practice, mammography and ultrasound often involve the presence of multiple coexisting lesions with diverse characteristics, including varying sizes and subtle or overlapping contours. Accurate segmentation under such complex conditions remains a persistent challenge. Although multi-stage frameworks have shown promise, many existing methods lack robustness when the initial detection results are imprecise and rarely incorporate mechanisms tailored to the nuances of breast imaging, such as detecting subtle early-stage lesions. Enhancements, including adaptive prompting strategies, cross-stage feature refinement, and domain-specific representation learning, are essential to improving both accuracy and reliability in clinical scenarios.
Motivated by these limitations, we propose a dual-stage detection–segmentation framework that combines an enhanced YOLOv10 detector with a visual reference prompt-guided SAM architecture. This approach improves robustness to low-quality prompts, enhances domain adaptability, and advances segmentation precision in challenging breast lesion cases, thereby bridging the gap between technical accuracy and clinical utility.
Methods
Data source and preprocessing
This study employed the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM), 44 a widely used public dataset for breast cancer research. It contains 753 cases with calcifications and 891 cases with masses. In this study, we focused specifically on mass lesions with corresponding bounding box annotations and segmentation masks.
To enhance lesion visibility and improve contrast in dense breast tissue, a standardized preprocessing pipeline was applied. First, contrast-limited adaptive histogram equalization (CLAHE) was used to improve local contrast. The CLAHE algorithm was configured with a clip limit of 2.0 and a tile grid size of 8 × 8, settings commonly adopted in mammogram enhancement tasks to strike a balance between local contrast and noise suppression.
Next, Gaussian filtering was employed to suppress background noise while preserving important lesion boundaries. A 5 × 5 kernel was used, which provides effective smoothing without excessively blurring fine structures.
All mammograms were resized to 640 × 640 pixels to match the input requirements of the YOLOv10-based detection model. Bounding box annotations were converted to the YOLO format, with normalized center coordinates, width, and height. The refined bounding boxes were subsequently used as visual prompts for the VRP-SAM segmentation module, enabling accurate lesion mask generation.
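The CLAHE and Gaussian steps above map directly onto OpenCV calls (`cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))` and `cv2.GaussianBlur(img, (5, 5), 0)`), while the bounding box conversion to YOLO format can be sketched in plain Python as below. This is an illustrative sketch, not the released code; function names are our own.

```python
# Sketch of the box conversion described above: pixel-space boxes
# (x1, y1, x2, y2) are mapped to YOLO format, i.e. normalized
# (center_x, center_y, width, height), assuming 640 x 640 inputs.

def to_yolo_format(x1, y1, x2, y2, img_w=640, img_h=640):
    """Convert a pixel-space box to normalized YOLO (cx, cy, w, h)."""
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h

def from_yolo_format(cx, cy, w, h, img_w=640, img_h=640):
    """Inverse mapping, used when passing boxes on to the segmentation stage."""
    x1 = (cx - w / 2.0) * img_w
    y1 = (cy - h / 2.0) * img_h
    x2 = (cx + w / 2.0) * img_w
    y2 = (cy + h / 2.0) * img_h
    return x1, y1, x2, y2
```

The round trip is lossless, which matters because the same boxes serve both as detection targets and as segmentation prompts.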
Figure 1 illustrates the full preprocessing pipeline, including CLAHE enhancement, Gaussian smoothing, bounding box generation, and prompt-based segmentation. To ensure consistency with model input requirements, all visualizations in Figure 1 are based on resized 640 × 640 images, which correspond to the standard input size for YOLOv10.

Preprocessing and prompt generation pipeline for breast lesion segmentation. (a) Original CBIS-DDSM mammogram image. (b) CLAHE-enhanced image to improve local contrast, particularly in dense breast tissue. (c) Gaussian-filtered image to suppress background noise while preserving lesion boundaries. (d) Lesion area detection using the YOLOv10-based model with refined bounding boxes. (e) High-precision lesion segmentation performed by the VRP-SAM model, guided by visual reference prompts.
Overall architecture of DVF-YOLO-Seg
Detection module
In the detection stage, we adopted an enhanced YOLOv10 architecture tailored for breast imaging. As shown in Figure 2, the overall framework integrates two key components: the DualConv module and VFL, both critical for improving lesion detection in challenging scenarios. To better extract lesion features under dense and noisy backgrounds, we incorporated the DualConv module into both the backbone and neck of YOLOv10; its integration position is detailed in Figure 3, and Figure 4 illustrates the module’s internal structure. DualConv employs a grouped convolution strategy, combining 3 × 3 kernels for spatial context modeling with 1 × 1 kernels for reducing channel redundancy. This dual-path design significantly improves the model’s capacity to detect small and irregular lesions without increasing the computational burden, and it reduces misidentification of normal tissue, a common issue in low-contrast or cluttered surroundings.

DVF-YOLO-Seg framework diagram. The upper part shows the detection stage, using an enhanced version of YOLOv10 (including DualConv and Varifocal Loss) to identify breast lesions and generate bounding box hints. The lower part is the segmentation stage, using the bounding box hints from the detection stage, the Visual Reference Prompt Segment Anything Model (VRP-SAM) module performs accurate segmentation of the lesion area.

The architecture of the improved YOLOv10 detection module. The model incorporates DualConv modules in the backbone and neck to enhance feature extraction. Varifocal Loss is used in the classification head to address class imbalance. The red bounding box represents the initial lesion detection. A perturbation mechanism (shown as

Structure diagram of DualConv group convolution technology. M represents the input channels (input feature map depth), N is the number of filters (output channels), and G denotes the number of groups in the group and dual convolutions. The convolution kernel sizes are 3 × 3 and 1 × 1.
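To make the grouping concrete, here is a minimal numpy sketch of the dual-path idea (an illustrative reading, not the authors' implementation): each output filter convolves a 3 × 3 kernel over its M/G-channel input group and a 1 × 1 kernel over the input channels, and the two paths are summed. Treating the 1 × 1 path as spanning all M channels is a simplifying assumption of this sketch.

```python
import numpy as np

def conv2d_single(x, w, pad):
    """'Same'-size 2D convolution of a multi-channel map with one filter.
    x: (C, H, W) input; w: (C, k, k) kernel; returns an (H, W) map."""
    C, H, W = x.shape
    k = w.shape[-1]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * w)
    return out

def dual_conv(x, w3, w1, groups):
    """DualConv-style layer: every output filter sums a 3x3 convolution
    over its input-channel group with a 1x1 convolution over all channels.
    x: (M, H, W); w3: (N, M // groups, 3, 3); w1: (N, M, 1, 1)."""
    M, H, W = x.shape
    N = w3.shape[0]
    gc = M // groups            # input channels per group
    per_group = N // groups     # filters assigned to each group
    out = np.zeros((N, H, W))
    for n in range(N):
        g = n // per_group      # which input-channel group this filter reads
        out[n] = (conv2d_single(x[g * gc:(g + 1) * gc], w3[n], pad=1)
                  + conv2d_single(x, w1[n], pad=0))
    return out
```

Because the 3 × 3 path touches only M/G channels per filter, the parameter count stays close to that of a plain grouped convolution while the 1 × 1 path preserves cross-group channel mixing.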
To address the severe class imbalance commonly found in mammographic datasets—where negative background samples dominate and tumor areas are rare—we integrated the VFL into the detection head. Unlike conventional binary losses, VFL dynamically adjusts the learning weight of each sample based on its classification confidence and bounding box quality. The loss is defined as:

$$\mathrm{VFL}(p, q) = \begin{cases} -q\bigl(q \log p + (1 - q)\log(1 - p)\bigr), & q > 0 \\ -\alpha p^{\gamma} \log(1 - p), & q = 0 \end{cases}$$

In this context, $p$ denotes the predicted IoU-aware classification score (IACS) and $q$ is the target score: for positive samples, $q$ is the IoU between the predicted bounding box and its ground truth, while for negative samples $q = 0$.
To facilitate this joint prediction, VFL adopts a star-shaped bounding box feature representation, in which each bounding box is associated with nine fixed sampling points (center, midpoints of edges, and corners). These points are used to extract geometric and contextual cues via deformable convolutions, enabling more accurate learning of IACS values. As illustrated in Figure 5, these sampling points are visualized as yellow circles, surrounding the initial red bounding box. This representation strengthens the correlation between classification and localization, improving ranking and final detection accuracy, especially for small or ambiguous lesions.

Visualization of the Varifocal Loss mechanism and IoU-aware classification. The yellow circles represent nine fixed sampling points arranged in a star-shaped pattern, including the box center, edge midpoints, and corners. These points are used to extract geometric and contextual features through deformable convolution. The red bounding box shows the initial prediction, while the refined blue box represents the final accurate detection of the breast mass region.
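The nine star-shaped sampling points can be enumerated directly from the box corners; the small helper below is illustrative, not taken from the paper's code.

```python
def star_points(x1, y1, x2, y2):
    """Nine star-shaped sampling points of a box: the four corners,
    the four edge midpoints, and the center."""
    xs = (x1, (x1 + x2) / 2.0, x2)
    ys = (y1, (y1 + y2) / 2.0, y2)
    return [(x, y) for y in ys for x in xs]
```

In the full model, features gathered at these nine locations via deformable convolution feed the IACS prediction, coupling localization quality to the classification score.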
The VFL function introduces two key hyperparameters: the scaling factor α, which balances the overall contribution of negative samples, and the focusing parameter γ, which down-weights easy negatives so that training concentrates on hard examples. Notably, positive samples are not attenuated, preserving the scarce lesion signal.
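A compact numpy sketch of the published VFL formulation follows; the defaults α = 0.75 and γ = 2.0 follow the original VarifocalNet paper and should be treated as assumptions here.

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-8):
    """Varifocal Loss for a single prediction.

    p : predicted IoU-aware classification score in (0, 1)
    q : target score (IoU with ground truth for positives, 0 for negatives)

    Positives are weighted by their target IoU q (no down-weighting),
    while negatives are scaled by alpha * p**gamma so that easy
    background samples contribute little to the gradient.
    """
    p = np.clip(p, eps, 1.0 - eps)
    if q > 0:  # positive sample: IoU-weighted binary cross-entropy
        return -q * (q * np.log(p) + (1.0 - q) * np.log(1.0 - p))
    return -alpha * p ** gamma * np.log(1.0 - p)  # negative sample
```

Note how a confident false positive (large p with q = 0) is penalized far more heavily than an easy negative, which is exactly the asymmetry that helps with the rare-lesion imbalance described above.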
Segmentation module
Building on the detection-stage outputs, in the second stage, we adopted the VRP-SAM. This model introduces a VRP encoder to extend the original SAM framework. The VRP encoder projects various annotation types, such as points, boxes, and masks, into interpretable embeddings for the SAM decoder. In our study, only box prompts are used, aligning with clinical practice where bounding boxes are readily available from radiologist annotations, ensuring practical applicability.
As shown in Figure 6, the VRP encoder comprises two main modules: the feature enhancer and the prompt generator. The feature enhancer first extracts annotation-based prototype features from a reference image and its annotation, enriching the features of both the reference and the target image with lesion-specific semantics.

The Visual Reference Prompt Segment Anything Model (VRP-SAM) framework is based on box prompts from YOLOv10. Only box prompts are used in this study; other prompt types supported by VRP-SAM are excluded.
These prototype features encode the semantic appearance of the annotated lesion and provide the guidance signal for subsequent prompt generation.
The prompt generator initializes a set of learnable queries, which first interact with the enhanced reference features to absorb lesion-specific semantics.
These reference-aware queries then attend to the features of the target image, transferring the reference semantics onto the corresponding lesion region in the image to be segmented.
The resulting prompt embeddings are passed to the SAM mask decoder, which produces the final segmentation mask.
In our framework, we use the bounding box prompts generated by the improved YOLOv10 as input to the prompt generator, guiding the segmentation process. Since breast lesions are typically small and may be obscured by surrounding tissue, providing accurate spatial priors is crucial. The bounding boxes help localize the lesion and direct attention to its boundary, allowing the model to generate more precise and context-aware visual prompts.
To enhance segmentation robustness under varying prompt conditions, we introduce a perturbation-based prompt refinement strategy. For each lesion, the original bounding box predicted by YOLOv10 is randomly expanded or contracted by 1–4 pixels along each axis. This process is repeated several times, producing a set of slightly different prompt boxes for the same lesion.
Each perturbed box is individually fed into VRP-SAM to produce a segmentation mask, yielding a set of candidate masks $\{M_1, M_2, \ldots, M_K\}$.
The final segmentation result $M^{*}$ is obtained by pixel-wise majority voting: a pixel is labeled as lesion only if it is positive in at least half of the candidate masks, i.e. $M^{*}(x) = \mathbb{1}\left[\frac{1}{K}\sum_{k=1}^{K} M_k(x) \ge 0.5\right]$.
This aggregation strategy mitigates the influence of inaccurate prompts and promotes consistent predictions. The integration of YOLOv10-generated box prompts, reference-aware attention, and perturbation-based refinement enables the VRP-SAM to segment lesions with improved accuracy and robustness, particularly for small or ambiguous targets.
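The perturbation-voting strategy can be sketched as follows; `segment_fn` is a hypothetical stand-in for a VRP-SAM inference call, and the perturbation magnitudes follow the 1–4 pixel range described above.

```python
import numpy as np

def perturb_box(box, rng, max_shift=4):
    """Randomly expand or contract each side of (x1, y1, x2, y2) by 1-4 px."""
    x1, y1, x2, y2 = box
    d = rng.integers(1, max_shift + 1, size=4)  # shift magnitudes per side
    s = rng.choice([-1, 1], size=4)             # expand or contract each side
    return (x1 + s[0] * d[0], y1 + s[1] * d[1],
            x2 + s[2] * d[2], y2 + s[3] * d[3])

def vote_segmentation(image, box, segment_fn, k=5, seed=0):
    """Run segment_fn (e.g. a VRP-SAM call) on k perturbed boxes and
    majority-vote the resulting binary masks pixel-wise."""
    rng = np.random.default_rng(seed)
    masks = [segment_fn(image, perturb_box(box, rng)) for _ in range(k)]
    return (np.mean(masks, axis=0) >= 0.5).astype(np.uint8)
```

Because a pixel must appear in at least half of the candidate masks, a single outlier prompt cannot flip the final mask, which is the stability property the voting scheme is meant to provide.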
Implementation details
To ensure optimal training and evaluation performance, specific hyperparameter settings were applied to each component of the proposed DVF-YOLO-Seg framework. These hyperparameters were carefully selected to improve convergence stability, reduce overfitting, and maintain computational efficiency throughout the training process. All experiments were conducted on an NVIDIA RTX 4090D GPU, using CUDA 11.3 on Ubuntu 20.04. The model was implemented in PyTorch 1.11.0 with Python 3.8. Table 1 summarizes the key hyperparameters used for the detection and segmentation modules, along with input preprocessing and learning configurations.
Summary of hyperparameter settings used for training DVF-YOLO-Seg.
GPU: graphical processing unit; OS: operating system; CLAHE: contrast limited adaptive histogram equalization.
Performance metrics
To comprehensively evaluate the detection and segmentation performance of the DVF-YOLO-Seg framework, we employed multiple evaluation metrics, including precision (P), recall (R), F1-score, mean average precision (mAP), and dice similarity coefficient (DSC). These metrics are derived from the confusion matrix, which contains true positives (TP), false positives (FP), and false negatives (FN). Below are the definitions and formulas for each metric:
Precision: Measures the proportion of true positive samples among all predicted positive samples: $P = \frac{TP}{TP + FP}$
Recall: Measures the proportion of actual positive samples that were correctly identified: $R = \frac{TP}{TP + FN}$
F1-score: The harmonic mean of precision and recall, balancing both metrics, especially useful in class-imbalanced tasks: $F1 = \frac{2 \times P \times R}{P + R}$
mAP: Used to assess detection performance, particularly for object localization: $\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i$, where $AP_i$ is the average precision (the area under the precision–recall curve) for class $i$ and $N$ is the number of classes.
DSC: Measures the overlap between the predicted segmentation mask and the ground truth: $\mathrm{DSC} = \frac{2|A \cap B|}{|A| + |B|} = \frac{2 \times TP}{2 \times TP + FP + FN}$, where $A$ and $B$ denote the predicted and ground-truth masks.
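The mask-level metrics above can be computed directly from binary masks; a small numpy sketch follows (note that for binary masks the pixel-wise F1-score and DSC coincide, which serves as a useful sanity check).

```python
import numpy as np

def mask_metrics(pred, gt):
    """Precision, recall, F1, and Dice from two binary masks of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)    # lesion pixels correctly predicted
    fp = np.sum(pred & ~gt)   # background pixels predicted as lesion
    fn = np.sum(~pred & gt)   # lesion pixels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, f1, dice
```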
In addition to these accuracy-related metrics, we also measured floating-point operations (FLOPs) and latency for the detection model in Table 2. FLOPs quantifies the number of floating-point operations needed for a single inference, while latency quantifies the inference time for a single image. These efficiency metrics provide insight into the computational cost and real-time performance of the models.
Detection performance and efficiency comparison of YOLO models.
mAP: mean average precision; FLOPs: floating-point operations.
Results
Performance comparison with mainstream segmentation models
To benchmark the overall segmentation effectiveness of DVF-YOLO-Seg, we compared it against a variety of state-of-the-art models across three prompt strategies: prompt-free, point-prompt, and box-prompt. This comparison aims to highlight how our two-stage framework, combined with optimized prompt design, outperforms existing approaches in clinical lesion segmentation.
As shown in Table 3 and Figure 7, DVF-YOLO-Seg achieved the best performance across all metrics, with a precision of 0.797, a recall of 0.815, a dice coefficient of 0.802, and an F1-score of 0.806, which surpasses both prompt-free and point-prompt models. This improvement highlights the benefit of the two-stage architecture and the integration of optimized box prompts for guiding segmentation.

Line chart comparing the performance of the compared models on the evaluation metrics.
Performance comparison of different methods.
Prompt-free models such as Swin UNETR, CoTr, and nnUNet show consistent but moderate performance, with dice scores between 0.725 and 0.755 and F1-scores from 0.744 to 0.757. While these methods benefit from end-to-end simplicity, their lack of prompt guidance limits sensitivity to small or low-contrast lesions, which are common in dense breast tissue. Point-prompt methods such as MedSAM and MobileSAM achieve higher recall values but at the cost of lower precision (0.732–0.766), indicating a tendency to over-segment, especially in ambiguous regions.
Box-prompt models, including YOLOv8+SAM and YOLOv9+SAM, show an improved balance between precision and recall, yet still trail DVF-YOLO-Seg across all reported metrics.
The 79.7% precision of DVF-YOLO-Seg reflects a deliberate tradeoff that prioritizes coverage over boundary tightness. In breast lesion segmentation, especially in screening or early diagnosis scenarios, the primary clinical objective is to avoid missing any suspicious lesion—including primary tumors, micro-metastases, and multifocal lesions. Missing such regions, even those <5 mm, can lead to delayed diagnosis, increased recurrence risk, and reduced survival.
In our model, slightly lower precision is a result of favoring broader inclusion in uncertain or low-contrast regions, thereby improving recall (81.5%) and dice coefficient (80.2%), which are crucial for ensuring lesion completeness. The model achieves a significantly higher F1-score (80.6%), indicating a strong balance between identifying relevant lesions and minimizing false positives. This strategy is particularly vital for identifying subtle lesions in dense or heterogeneous tissue, where under-segmentation may cause clinically significant misses.
Moreover, subjective evaluation results (see section “Subjective and clinical evaluation”) support the clinical acceptability of this approach. Radiologists found that the segmentation results provide reliable lesion coverage, even if boundary precision is not perfect. This balance ensures robust detection while minimizing the risk of missed diagnoses, aligning with the real-world demands of breast cancer screening and treatment planning.
Performance comparison of the first phase detection model
To assess the effectiveness of our detection module enhancements, we systematically evaluated the improved YOLOv10 against several widely adopted YOLO variants. The results, summarized in Table 2, demonstrate that our model achieves the optimal balance of precision, recall, and mAP. It also maintains superior robustness in identifying challenging mass lesions, particularly small or ambiguous ones.
Compared to the standard YOLOv10, our model—incorporating DualConv and VFL—yields notable gains. It shows a 1.3% increase in precision, a 1.3% increase in recall, and a 0.8% gain in mAP. These improvements come with only a marginal increase in computational cost. The consistency of these gains across training epochs is confirmed by the line graphs in Figure 8. Detection examples in Figure 9 and multi-scale heatmaps in Figure 10 provide visual validation, with the latter illustrating that our model focuses more strongly on lesion regions.

Visualization of the variations in different model evaluation indicators over the number of rounds. (a) Line graph depicting the change in P over the number of rounds; (b) line graph illustrating the change in mean average precision (mAP) changing over the number of rounds.

Example of detection results. The example results use (a) YOLOv5 model; (b) YOLOv7 model; (c) YOLOv8 model; (d) YOLOv10 model; (e) YOLOv9 model; (f) improved YOLOv10 model.

Original images and their heatmaps.
Against other mainstream YOLO variants, our model outperforms YOLOv9 in all key metrics. YOLOv8 achieves the highest precision at 0.852, but its recall of 0.832 and mAP of 0.841 are lower than those of our model, indicating a weaker ability to capture subtle lesions. YOLOX, despite similar latency, lags behind with a precision of 0.801, a recall of 0.809, and an mAP of 0.800, confirming the superiority of our design under comparable constraints.
In terms of efficiency, our model maintains 48.6G FLOPs and 11.3 ms latency on 640 × 640 inputs. This is on par with YOLOv8 and significantly better than heavier architectures such as YOLOv7 and DAMO-YOLO, both of which exceed 50G FLOPs; YOLOv7 also incurs a latency of around 12.2 ms. Such efficiency ensures real-time applicability in clinical settings. In contrast, YOLOv5, though fast, lacks sensitivity in complex backgrounds. Our model strikes a stronger balance between precision, recall, and inference speed.
Performance comparison of the second-stage segmentation model
To comprehensively evaluate the effectiveness of the second-stage segmentation module, we compared the performance of SAM and VRP-SAM under various configurations of the first-stage detection model. As shown in Table 4, when using the original YOLOv10 without enhancements, SAM achieves a dice score of 0.690 and an F1-score of 0.700, while VRP-SAM outperforms it with scores of 0.718 and 0.723, respectively. This baseline comparison highlights the advantage of reference-guided prompts, even without upstream improvements.
Encoder ablation experiment.
SAM: segment anything model; VRP-SAM: visual reference prompt SAM.
Further improvements are observed when individual enhancements are applied. Specifically, using DualConv alone improves VRP-SAM’s dice to 0.733 and F1-score to 0.734, while applying VFL alone leads to greater gains, with a dice of 0.776 and an F1-score of 0.777. These results confirm that both modules independently contribute to more robust lesion localization and segmentation.
When both DualConv and VFL are integrated into the YOLOv10 detection stage, the performance of VRP-SAM reaches its peak: precision of 0.797, recall of 0.815, dice of 0.802, and F1-score of 0.806. This configuration consistently outperforms all others across multiple metrics, validating the cumulative benefit of architectural and loss function improvements in the detection stage and their downstream impact on segmentation quality.
Figure 11 visually demonstrates the model’s overall superiority, while Figure 12 provides a detailed comparison of four key metrics—precision, recall, dice score, and F1-score—across different configurations. The bar chart clearly highlights the incremental improvements achieved by each module and underscores the comprehensive advantage of the final integrated design.

Segmentation result example. This figure shows segmentation examples of different models, where (a) is the original image; (b) EfficientDet; (c) Swin UNETR; (d) MobileNet; (e) nnUNet; (f) MedSAM; (g) SAMed; (h) SAMUS; (i) our model DVF-YOLO-Seg; and (j) ground truth.

Segmentation performance under different configurations of detection modules and prompt encoders.

Detection box accuracy evaluation score graph.
Ablation study on model modules and prompt strategies
To dissect the individual contributions of each module in our framework, we conducted targeted ablation experiments. These include comparative analyses of backbone convolution types, loss function designs, and prompt refinement strategies, offering insights into how each component influences final detection and segmentation performance.
As shown in Table 5, the proposed DualConv module achieves the best performance among the tested alternatives. Specifically, it yields a precision of 0.861, a recall of 0.825, an mAP of 0.844, and an F1-score of 0.843, outperforming PConv, SCConv, and DynamicConv. Despite a slight drop in recall compared to the baseline, DualConv delivers a substantial gain in precision (+2.7%), indicating that it significantly reduces false positives at only a marginal cost in sensitivity. This confirms the effectiveness of its hybrid kernel design in improving spatial feature discrimination.
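For readers unfamiliar with the module, a hedged PyTorch sketch of a DualConv-style block is shown below. It assumes the commonly described design of parallel grouped 3 × 3 and pointwise 1 × 1 branches whose outputs are summed; the group count is illustrative and not the exact configuration used in our network:

```python
import torch
import torch.nn as nn

class DualConv(nn.Module):
    """Illustrative DualConv-style block: a grouped 3x3 branch for spatial
    features plus a full 1x1 branch for channel mixing, summed element-wise."""
    def __init__(self, in_ch, out_ch, groups=2, stride=1):
        super().__init__()
        # Grouped 3x3: cheap spatial filtering (padding=1 preserves size).
        self.conv3 = nn.Conv2d(in_ch, out_ch, 3, stride, 1,
                               groups=groups, bias=False)
        # Pointwise 1x1: dense cross-channel interaction.
        self.conv1 = nn.Conv2d(in_ch, out_ch, 1, stride, 0, bias=False)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)
```

With matching strides, both branches produce identically shaped feature maps, so the summation needs no extra alignment.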
Convolution modules ablation experiment.
mAP: mean average precision.
PConv and SCConv both show lower overall scores, with F1-scores of 0.796 and 0.822, respectively, which demonstrates that DualConv not only improves feature extraction but also yields a better tradeoff between sensitivity and specificity in dense breast tissue environments.
Table 6 further analyzes the effect of combining DualConv and VFL. Using only VFL yields a balanced F1-score of 0.840, while DualConv alone results in a higher F1-score of 0.843. When both modules are integrated, the model achieves the highest performance across all metrics, including an F1-score of 0.857. This confirms that the two modules are complementary: DualConv boosts spatial discrimination, while VFL addresses class imbalance, especially for small or infrequent lesions.
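VFL follows the varifocal loss formulation, which weights positive samples by an IoU-aware target q and down-weights easy negatives with a focal term. A NumPy sketch is given below; the α and γ values are the commonly used defaults and are assumed here rather than taken from our training configuration:

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal loss over predicted scores p and IoU-aware targets q.
    q is 0 for negatives and the box IoU for positives."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pos = q > 0
    loss = np.empty_like(p)
    # Positives: BCE against the soft target q, weighted by q itself,
    # so high-quality boxes dominate the gradient.
    loss[pos] = -q[pos] * (q[pos] * np.log(p[pos])
                           + (1 - q[pos]) * np.log(1 - p[pos]))
    # Negatives: focal down-weighting suppresses abundant easy background.
    loss[~pos] = -alpha * p[~pos] ** gamma * np.log(1 - p[~pos])
    return loss.mean()
```

The q-weighting on positives is what helps rare, small lesions: a scarce positive with a reasonable IoU still contributes a meaningful gradient, while the sea of easy negatives is damped by the pᵞ factor.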
Loss function ablation experiment.
mAP: mean average precision.
To assess the robustness of segmentation under different prompt qualities, we additionally conducted ablation experiments on the prompt refinement strategy. As shown in Table 7, we compared three settings: no perturbation, perturbation without majority voting, and full strategy (perturbation + voting).
Effect of perturbation and voting on segmentation performance.
The results demonstrate that the bounding box perturbation alone improves segmentation robustness, leading to a 1.4% increase in dice. When combined with majority voting, the performance further improves, achieving an additional 3% gain in dice and a total F1-score improvement of 3.1% compared to the unrefined baseline. This confirms that the prompt refinement strategy effectively mitigates the sensitivity of segmentation to box localization errors, particularly for small or low-contrast lesions, enhancing both accuracy and consistency.
Subjective and clinical evaluation
This study aims to provide valuable insights for the optimization of technology and clinical application in the medical imaging field by subjectively evaluating the performance of various tumor detection and segmentation models. The experimental evaluation was conducted by five experienced doctors from two hospitals: the Affiliated Cancer Hospital of Xinjiang Medical University and the First Affiliated Hospital of Xinjiang Medical University. The doctors spanned three age groups (25–35, 36–45, and 46–55), and all had more than six years of professional experience in radiology, imaging, and surgery, ensuring the diversity and objectivity of the evaluation results.
To ensure fairness, the doctors were unaware of which model the segmentation results corresponded to, assigning scores solely based on the quality of the segmentation. The evaluation considered both detection accuracy and segmentation quality. For the detection stage, the doctors assessed whether the model’s bounding boxes accurately covered the main tumor regions. For segmentation, they evaluated how well the segmented boundaries matched the true tumor boundaries, with a focus on clarity, smoothness, and avoidance of oversegmentation or undersegmentation. They also took into account the model’s ability to accurately locate the tumor and avoid any risk of misdiagnosis or missed lesions. Each aspect was rated on a 1–5 scale, where 1 represented poor performance and 5 represented perfect accuracy.
Detection box accuracy evaluation
The core task of tumor detection is to assess the accuracy of the detection boxes in covering the main regions of the tumor. The experimental results show significant performance differences among the detection models. According to the doctors' ratings, the improved YOLOv10 outperforms all other models, with an average score of 4.2, demonstrating exceptionally high detection reliability and stability.
In contrast, YOLOv8 and YOLOv9 scored 3.8 and 3, respectively, showing slight deficiencies in accuracy when tumors had complex shapes and falling short of the improved YOLOv10. YOLOv5 and YOLOv7 performed poorly, scoring 3 and 2, respectively, and their detection box accuracy needs further optimization. The worst performer was the original YOLOv10, with an average score of 1.4: in 60% of the evaluations, its detection boxes were judged to deviate significantly, creating a high risk of misdiagnosis and missed diagnoses that could result in severe diagnostic errors.
Although Table 2 reports a relatively high mAP of 0.845 for the original YOLOv10, the subjective rating in Figure 13 shows a much lower average score (1.4). This discrepancy arises from the difference between objective metrics and clinical expectations: mAP quantitatively evaluates the overlap between predicted and ground truth boxes across a range of IoU thresholds, while the subjective rating reflects physicians’ perception of diagnostic usability. In clinical contexts, even a bounding box with decent IoU might be considered inadequate if it fails to fully capture suspicious areas or is misaligned in dense tissue. This highlights the importance of complementing objective metrics with subjective evaluations to assess real-world clinical performance more accurately.
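The IoU underlying mAP can be computed for a pair of axis-aligned boxes with the standard formula below, included here for clarity; mAP then averages precision over recall levels at one or more IoU thresholds:

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection rectangle (empty if boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0
```

A box shifted by a quarter of its width still retains an IoU above 0.5 against the ground truth, which is why a prediction can pass common mAP thresholds yet miss tissue a radiologist considers suspicious.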

Segmentation quality evaluation heatmap. Shows the performance scores of various models on three metrics. The color gradient ranges from blue (lower score) to red (higher score), visually showing the difference in model performance.
Segmentation quality evaluation
Overlap between segmentation template and tumor boundary: The evaluation of tumor segmentation models initially depends on the overlap between the segmentation template and the actual tumor boundaries. According to subjective ratings, DVF-YOLO-Seg aligned well with the tumor boundary, with an average score of 2.87, providing a reliable reference for clinicians regarding mass size and shape. nnUNet and MobileNet scored 2.87 and 2.93, respectively; both performed well in boundary alignment and are suitable for simpler clinical scenarios. However, EfficientDet and SAMed received lower scores of 2.27 and 2.47, respectively. These models struggled with complex cases or blurred boundaries, failing to cover the entire mass boundary and producing errors large enough to affect the assessment of the lesion’s extent.
Clarity of segmentation boundaries: In the evaluation of segmentation boundary clarity, DVF-YOLO-Seg, nnUNet, and MobileNet performed excellently, with average scores of 3.07, 3.2, and 2.93, respectively. These models demonstrated relatively clear segmentation boundaries with smooth transitions, effectively reducing blurry areas and helping doctors accurately distinguish the tumor from surrounding tissue, thereby improving diagnostic accuracy. In contrast, EfficientDet and SAMed exhibited poorer boundary clarity, scoring 2.4 and 2.33, respectively. These models presented blurred and irregular boundaries, making tumor recognition difficult and negatively impacting diagnostic precision.
Tumor localization accuracy: Regarding tumor localization accuracy, SAMUS, DVF-YOLO-Seg, and MobileNet performed excellently, with average scores ranging from 3.2 to 3.4. These models accurately localized tumors, enabling doctors to quickly focus on the lesion area for further analysis and diagnosis. For example, SAMUS scored 3.2, demonstrating minimal localization errors and accurately identifying the tumor region, providing essential support for diagnosis. In contrast, EfficientDet and Swin UNETR performed poorly in terms of localization accuracy, scoring 2.6 and 2.8, respectively. These models were prone to misjudging lesion location, affecting subsequent treatment decisions.
Comprehensive evaluation and misdiagnosis risk
Considering segmentation precision, boundary clarity, and localization accuracy, DVF-YOLO-Seg ranked first with a comprehensive score of 3.11, offering balanced performance that effectively meets the basic requirements for clinical diagnosis while minimizing the risk of misdiagnosis and missed diagnoses. In contrast, EfficientDet had a lower comprehensive score of 2.42, underperforming in multiple dimensions. This model requires further optimization to improve segmentation precision, boundary clarity, and localization accuracy, thereby reducing misdiagnosis and missed diagnosis risks.
This subjective evaluation experiment provides a comprehensive assessment of various detection and segmentation models, highlighting their strengths and weaknesses in terms of performance. Among the detection models, the improved YOLOv10 exhibited the best detection box coverage accuracy, while DVF-YOLO-Seg performed most consistently across all evaluation metrics in segmentation, making it suitable for real-world clinical auxiliary diagnosis.
However, this study still has limitations, such as the small number of evaluating doctors, which may introduce some bias into the results. Future research should expand the sample to include more doctors from diverse backgrounds and incorporate additional objective evaluation metrics for a more comprehensive and in-depth assessment of the models. Furthermore, models with suboptimal performance should be improved through algorithmic adjustments and parameter optimization, promoting the widespread and continuous development of detection and segmentation technologies in the medical field and providing more precise support for clinical diagnosis. The results are shown in Figures 13 and 14.

Typical failure cases of DVF-YOLO-Seg in breast lesion detection and segmentation. The figure shows three typical failure types, one in each row: (a) row represents small or low-contrast lesions; (b) row represents lesions close to dense tissue or image boundaries; (c) row represents irregular or multifocal lesions. From left to right, the columns represent: original image, detection results, segmentation output, and zoomed area. Red box: predicted detection result; green area: model segmentation result; red outline: ground truth.
Small-lesion analysis
In addition to evaluating the overall lesion detection and segmentation performance, we conducted a detailed analysis focused on small lesions (those with a diameter <5 mm), as these are often the most challenging to detect because of their subtle appearance, low contrast, and frequent overlap with dense tissue. Detecting such small lesions is critical for early diagnosis, as they are often the first sign of malignancy; their low visibility and the high noise level of mammography images, however, make them particularly difficult for conventional models.
In the analysis of small lesions, we focused on evaluating the segmentation performance of the model under various detection configurations. For this analysis, we selected a subset of 46 small lesions from the CBIS-DDSM dataset, each with a diameter <5 mm. These lesions were carefully chosen to represent a range of challenging detection cases, including both well-defined and ambiguous lesions. The model first detects lesions using different YOLOv10 configurations (baseline, with DualConv, with VFL, and both combined), generating bounding boxes. These bounding boxes then serve as input for the segmentation stage, where the VRP-SAM model is used to segment the detected lesion areas. Performance metrics, including dice score, F1-score, precision, and recall, were calculated based on the segmentation results, with a particular focus on recall to assess the model’s ability to correctly identify small lesions.
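One way to derive such a subset is to estimate each lesion's equivalent-circle diameter from its mask area and the pixel spacing. The sketch below assumes roughly circular lesions and a known isotropic spacing, which may not match the exact selection procedure applied to the CBIS-DDSM annotations:

```python
import numpy as np

def lesion_diameter_mm(mask, spacing_mm):
    """Equivalent-circle diameter of a binary lesion mask, in millimeters.
    Assumes isotropic pixel spacing and a roughly circular lesion."""
    area_mm2 = mask.astype(bool).sum() * spacing_mm ** 2
    return 2.0 * np.sqrt(area_mm2 / np.pi)

def select_small_lesions(masks, spacing_mm, max_d=5.0):
    """Keep only masks whose equivalent diameter is below max_d mm."""
    return [m for m in masks if lesion_diameter_mm(m, spacing_mm) < max_d]
```

Diameter estimates of this kind are sensitive to segmentation noise at the boundary, so in practice the cut-off is usually applied to the ground-truth annotation rather than to model output.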
As shown in Table 8, the introduction of DualConv and VFL progressively improved the performance on small lesions. For the baseline YOLOv10 model, the dice score was 0.675, and the recall score was 0.700. The addition of DualConv improved these metrics, with a dice score of 0.711 and a recall score of 0.724. The VFL further boosted the performance, increasing the dice score to 0.728 and the recall score to 0.745. When both DualConv and VFL were integrated, the model achieved the highest performance, with a dice score of 0.750 and a recall score of 0.783, indicating that the combination of these two modules provided a substantial improvement in detecting and segmenting small lesions.
Segmentation performance on small lesions under different detection configurations.
The results indicate that the DualConv module significantly enhances the model’s ability to detect small lesions by improving the extraction of multi-scale features, which is essential for capturing subtle lesion boundaries. On the other hand, VFL addresses the class imbalance between small and large lesions, helping the model better prioritize the detection of small lesions during training. The combined use of these two modules resulted in the highest performance, improving both the recall and dice score, which are critical for detecting small, clinically significant lesions.
Failure case analysis
Although DVF-YOLO-Seg demonstrates superior performance in breast lesion detection and segmentation across most metrics and clinical scenarios, several failure cases were observed that reveal potential limitations. These failure cases can be broadly categorized into three types, as illustrated in Figure 15: (1) extremely small or low-contrast lesions, (2) lesions located near dense glandular tissue or image boundaries, and (3) irregularly shaped or multifocal lesions.
In the first category, the model occasionally fails to detect micro-lesions that are <5 mm in diameter, especially when embedded in dense fibroglandular backgrounds. These lesions tend to have subtle boundaries and low signal-to-noise ratios, which may lead to missed detections or inaccurate segmentation masks. In clinical practice, overlooking such small but potentially malignant foci could delay diagnosis or necessitate repeat imaging.
Secondly, lesions adjacent to high-density tissue or those close to the image boundary are prone to inaccurate box localization or boundary leakage. In some cases, the prompt encoder generates suboptimal guidance due to ambiguous detection results, leading to undersegmentation. This reflects the model’s sensitivity to localization noise and its reliance on upstream detection accuracy.
Lastly, complex lesions with irregular or spiculated margins challenge the model’s ability to fully capture lesion extent. While the prompt refinement strategy mitigates some of this error, segmentation outputs occasionally deviate from the actual lesion shape, affecting the assessment of morphology-related biomarkers.
These failure cases primarily stem from three key limitations of the model: insufficient detection capabilities in low-contrast regions, where small lesions are easily obscured by dense tissue or noise; poor localization accuracy for lesions adjacent to tissue boundaries, leading to positioning errors; and inadequate segmentation robustness for complex, irregularly shaped or multifocal lesions, resulting in imprecise boundary delineation. Addressing these limitations requires further refinement of the detection stage—especially for low-contrast or border-adjacent targets—and the integration of potential post-processing techniques such as shape-aware correction or adaptive prompt adjustment. Future work may also incorporate temporal or multi-view consistency (e.g., using both mediolateral oblique and craniocaudal views) to enhance robustness in such challenging scenarios.
Discussion
In this study, we proposed DVF-YOLO-Seg, a two-stage framework designed for accurate breast lesion mass detection and segmentation. The detection stage enhances YOLOv10 by incorporating the DualConv module, which enables improved multi-scale feature extraction and strengthens spatial and channel representations. This enhancement is particularly beneficial in scenarios with small, low-contrast lesions or heterogeneous tissue backgrounds. Additionally, the use of VFL effectively mitigates class imbalance by dynamically reweighting samples according to their quality and rarity, further improving model precision and sensitivity.
The segmentation stage employs a visual reference prompt-based SAM, where bounding box predictions from the detection stage act as prompts to guide fine-grained segmentation. This approach allows the segmentation model to benefit from spatial priors and contextual cues, resulting in more accurate delineation of lesion boundaries, even in cases with irregular shapes or blurred margins.
Beyond quantitative evaluation, we also conducted a subjective clinical assessment with radiologists from two hospitals. Their evaluations confirmed that the proposed framework significantly improves lesion localization and segmentation clarity, particularly for subtle lesions, highlighting its potential utility in real-world diagnostic workflows. The model’s 11.3 ms inference time is well-suited for real-time clinical use.
It is worth noting that the precision of the proposed DVF-YOLO-Seg framework, while moderate at 79.7%, reflects a deliberate design strategy prioritizing lesion completeness. Given the clinical context of breast cancer screening—where missing subtle lesions can lead to delayed diagnosis—our model favors broader inclusion to improve recall and ensure morphological coverage. This trade-off is further supported by high F1-score and dice metrics, and validated through subjective evaluations and failure case analysis. Collectively, these components contribute to the strong performance of DVF-YOLO-Seg across multiple evaluation metrics, validating its effectiveness in addressing core challenges in breast image analysis.
Limitations and future work
Despite the promising results, DVF-YOLO-Seg has several limitations. First, the model occasionally fails to detect extremely small lesions (<5 mm) or those located near dense glandular tissue or image boundaries. As illustrated in Section “Failure case analysis,” such cases often lead to inaccurate bounding boxes or suboptimal segmentation masks due to weak visual cues or spatial interference. Second, the effectiveness of the segmentation stage is inherently tied to the accuracy of detection prompts. Errors in the upstream detection stage may propagate and degrade segmentation quality.
Moreover, our current experiments are limited to 2D full-field digital mammograms from a single dataset (CBIS-DDSM), without incorporating multi-view consistency or multimodal information. This may limit the model’s generalizability to other imaging conditions, populations, or modalities such as ultrasound or MRI. Furthermore, while the model performs well in controlled environments, its relatively high computational cost may hinder deployment in resource-constrained clinical settings.
Building upon insights from the small-lesion analysis and failure case exploration, several research directions are envisioned to further enhance the capabilities of the proposed framework. To improve the detection of micro-lesions <5 mm, future designs will incorporate super-resolution feature enhancement modules capable of amplifying subtle lesion signals. This will be complemented by lesion-aware attention mechanisms aimed at suppressing interference from dense glandular regions and improving focus on diagnostically significant features.
Improved spatial context modeling will also be prioritized, particularly for lesions adjacent to image boundaries or embedded within heterogeneous tissue structures. Incorporating multi-view consistency—by aligning mediolateral oblique and craniocaudal projections—may offer richer contextual understanding and reduce localization errors. In addition, embedding shape priors into the segmentation architecture is expected to facilitate more accurate delineation of lesions with irregular or spiculated margins.
Enhancing the model’s generalizability is another critical goal. This includes validating performance on external datasets such as MIAS and extending applicability to multimodal imaging scenarios, including ultrasound and MRI. Cross-modal representation learning will be explored to effectively integrate heterogeneous image characteristics.
To improve deployment in clinical environments, efforts will also focus on lightweight optimization. This involves compressing the segmentation module through knowledge distillation and replacing standard convolutions with more efficient dynamic convolutional operations to reduce computational complexity without sacrificing accuracy.
