Sage Journals: Discover world-class research

Abstract

Background and objective

Contrast-enhanced spectral mammography (CESM) is a new imaging modality that integrates digital mammography and the use of iodinated contrast agents in a dual-energy acquisition protocol, facilitating the differentiation between benign and malignant breast lesions. However, its clinical use is constrained by the potential risks associated with iodinated contrast agents and the increased radiation exposure.

Method

The proposed strategy adapts a self-calibrated pixel attention (SCPA) mechanism to enhance the ability of conditional image-to-image translation models to synthesize contrast uptake in breast lesions. To assess the generalizability and feasibility of our approach under realistic data variations, we used two independent CESM datasets. The performance of the model was evaluated through quantitative and qualitative analyses, considering both whole-image performance and contrast-uptake regions. The results were compared with state-of-the-art conditional generative models. Furthermore, the clinical reliability of the images synthesized by the proposed strategy was assessed through a qualitative analysis performed by an expert breast radiologist.

Results

The proposed strategy outperformed state-of-the-art approaches in the synthesis of both whole images and contrast-enhanced regions. In particular, the SCPA-conditional generative adversarial network exhibited superior preservation of the morphology of the lesion while maintaining the overall quality of the generated images. Moreover, the clinical feasibility evaluation showed acceptable realism and technical quality suitable for radiologist interpretation, although in some cases, residual synthesis errors were observed that could affect patient management.

Conclusion

The integration of attention mechanisms substantially improves the synthesis of contrast uptake in regions of interest (ROIs) within CESM studies with limited and heterogeneous datasets. These findings demonstrate the feasibility of generating contrast enhancement from non-contrast mammography images, highlighting the potential of this approach to reduce the need for contrast administration while preserving clinically relevant diagnostic information. This work should therefore be regarded as a feasibility study that establishes a clinically aligned ROI-focused framework for CESM virtual contrast synthesis and provides a foundation for future work.

Keywords

Contrast-enhanced spectral mammography (CESM)image-to-image translation self-calibrated pixel-attention breast cancer medical imaging

Introduction

Breast cancer remains a leading cause of cancer-related mortality worldwide, with approximately 2.3 million new cases diagnosed annually and nearly 600,000 deaths.¹ While full-field digital mammography (FFDM) continues to be the screening standard, its sensitivity is significantly compromised in women with dense breast tissue, where overlapping fibroglandular structures can obscure underlying lesions.^2,3 Contrast-enhanced spectral mammography (CESM) addresses this by leveraging tumor neovascularization via iodinated contrast agents, achieving diagnostic performance comparable to breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) while offering lower costs.^4,5

Technically, CESM utilizes a dual-energy protocol, capturing paired low-energy (LE) (25–34 kVp) and high-energy (45–49 kVp) images. Acquisitions above the iodine K-edge (33 keV) maximize attenuation relative to breast tissue.⁶ A weighted logarithmic subtraction then generates a recombined (RC) image, suppressing background parenchyma to highlight contrast-enhanced abnormalities. Figure 1 shows a representative example of CESM images from the left breast of a patient with high breast density, in which the underlying and obscured lesion is highlighted in the RC image.

Figure 1.

An illustrative example of CESM images of the left breast from a 44-year-old patient with high breast density. Low-energy images are shown in the CC (panel A) and MLO (panel C) projections. RC images, in the same CC and MLO projections, are shown in panels (B) and (D), respectively. The red rectangle indicates mass lesions, which are obscured in the low-energy images, while they are highlighted in the RC counterpart.

Despite these clinical advantages, CESM implementation is constrained by the risks of iodinated media—such as nephrotoxicity—and increased ionizing radiation exposure.^6–8 Moreover, dual-energy acquisition entails an increased radiation dose, which may be particularly relevant for radiation-sensitive populations, such as carriers of BRCA1 gene mutations.⁷ In addition, the requirement for specialized equipment and contrast administration workflows can limit accessibility, especially in resource-constrained settings.^7,8 These safety, dose, and accessibility considerations motivate interest in virtual-contrast approaches that aim to reproduce the diagnostic information contained in RC images while reducing or avoiding intravenous contrast administration.

Image-to-image translation approaches have been investigated in DCE-MRI and contrast-enhanced computed tomography (CT), showing that synthetic contrast-enhanced images can be generated from non-contrast inputs.^9–13 However, in CESM, synthesizing RC images from LE projections has been only lightly explored.^14–19 One of the first approaches to address this problem was proposed by Gao et al.,¹⁵ who introduced a U-Net-based architecture (RiedNet) incorporating Inception–Residual blocks to synthesize RC image patches that were then overlapped to reconstruct the whole RC image. Jiang et al.¹⁷ evaluated a supervised cycle generative adversarial network (CycleGAN) framework for synthesizing RC images, in which information from the CC and mediolateral oblique (MLO) projections was fused to achieve coherent image synthesis across views. More recently, Rofena et al.¹⁶ evaluated three representative state-of-the-art models, including an autoencoder, Pix2Pix for supervised conditional synthesis, and CycleGAN for unsupervised synthesis, reporting CycleGAN as the best-performing model in both quantitative and qualitative evaluations. Across these studies, the primary objective has been to generate globally realistic RC images, with performance predominantly assessed using whole-image similarity metrics, without explicitly modelling or evaluating clinically relevant contrast-uptake regions. More recently, building on their previous work, Rofena et al.¹⁸ proposed integrating segmentation maps of contrast-agent uptake regions into a CycleGAN-based model, incorporating these maps into the overall loss through additional region-focused terms that localize the loss computation to the regions of interest (ROIs). However, this approach relies on the availability of reliably annotated uptake regions, which may be difficult to obtain in routine clinical practice, and the study did not systematically analyze performance within these regions, but rather in terms of global image quality.

On the other hand, data augmentation strategies based on image-to-image translation models have been proposed to improve performance in detection and classification tasks that use CESM images. Gao et al.¹⁴ introduced a shallow four-layer convolutional neural network (CNN) to generate RC image patches containing suspicious lesions from LE image patches, aiming to enlarge the dataset and improve the classification of benign and malignant lesions. Similarly, Amin et al.¹⁹ evaluated several generative models, including a conventional generative adversarial network (GAN) that uses a random noise vector as input, as well as CycleGAN and UNIT, the latter two being unsupervised approaches. In these studies, the synthesized RC images were used to augment training data for downstream detection and classification models rather than as virtual replacements for contrast-enhanced acquisitions.

Taken together, these approaches demonstrate the potential of artificial intelligence (AI) models to generate realistic and high-quality images in CESM. Nevertheless, current CESM virtual-contrast literature remains centered on standard convolutional and GAN-based architectures that are inherently limited by fixed-size kernels, which restrict the receptive field and hinder the integration of long-range dependencies,^20,21 underscoring the need for alternative architectures that can more effectively focus on clinically relevant contrast-uptake regions.

Accordingly, in this study, we propose integrating a self-calibrated pixel attention (SCPA) block into widely used conditional generative architectures (U-Net and Pix2Pix) for CESM RC image synthesis. We hypothesize that pixel-level attention prioritizes the synthesis of high-value features in contrast-uptake zones, preserving the morphological and functional fidelity required for radiological interpretation.²² Within an incremental architectural novelty, this work proposes a clinically aligned framework that shifts the emphasis from whole-image similarity toward regional fidelity under realistic CESM data constraints. Our contributions include: (i) integrating SCPA modules to prioritize synthesis accuracy in contrast-uptake regions; (ii) evaluating regional fidelity and ROI-level interpretability alongside traditional global metrics on two independent CESM datasets; (iii) establishing a multi-level validation approach combining ROI-focused metrics and attention-map analysis together with a radiologist-led feasibility assessment to bridge the gap between technical output and clinical utility.

Materials and methods

The methodology is structured into three stages: data preprocessing, model training, and multi-level performance evaluation (Figure 2). The objective is to synthesize the contrast agent response in RC images by leveraging non-contrast LE inputs, while preserving the morphological and functional information required for radiological interpretation. In the first stage, the LE (input) and RC (target) images were preprocessed to ensure data uniformity. This process included windowing adjustment, breast cropping, spatial resizing of the images to optimize computational cost, and pixel intensity normalization. The second stage involved the training of the proposed strategy, incorporating the attention mechanisms into the U-Net and Pix2Pix models. Furthermore, training strategies based on supervised learning were implemented, specifically designed to preserve the structural and clinical fidelity of the generated images relative to the real images. Finally, in the third stage, a multi-level evaluation strategy was implemented to compare the proposed approach with the baseline models through whole-image metrics, ROI-focused analyses, and an expert breast radiologist assessment, with particular emphasis on regions exhibiting high contrast uptake.

Figure 2.

Overview of the proposed methodology for the generation and evaluation of RC images in CESM. The workflow consists of three main stages: (1) preprocessing of the dataset to standardize the images; (2) training stage for the proposed strategy and baseline models; and (3) quantitative and qualitative evaluation stage, focusing on the complete image and the regions with contrast agent uptake.

Self-calibrated pixel-attention block

Visual attention layers in deep learning models allow them to focus on the most relevant regions of an image during the learning and prediction processes. This enhances the network’s ability to prioritize important areas in the input, optimizing its focus on key features and enabling a more accurate contextual interpretation of the data.

Among the first prominent implementations of attention mechanisms in vision is self-attention, which enables models to capture long-range relationships between different parts of an image by calculating the attention of each pixel with respect to all others.²³ However, self-attention poses significant challenges, such as its high computational and memory cost, especially when applied to high-resolution images due to its quadratic complexity with respect to the input size.

This limits its scalability in practical applications, leading to the development of more efficient variants such as Spatial-Attention²⁴ and Channel-Attention.²⁵ While Spatial-Attention assigns attention weights to specific regions within the image space, Channel-Attention focuses on learning the relative importance of each feature channel, enabling the model to prioritize the most relevant channels for a specific task. Moreover, the combination of these mechanisms, as seen in the convolutional block attention module,²⁶ integrates both perspectives, enhancing the ability of the model to jointly capture spatial and channel relationships, thereby optimizing performance in computer vision tasks.

The self-calibrated pixel-attention (SCPA) block, introduced in Zhao et al.,²⁷ is an attention block designed to improve pixel-level representation through local and contextual recalibration. This approach combines a pixel-attention layer with a self-calibrated convolutional structure. The self-calibration operation adaptively expands the receptive fields of the convolutional layers, improving their ability to capture more complex spatial relationships. Meanwhile, the pixel-attention layer assigns weights to different regions of the input feature maps, enabling the model to focus on the most relevant areas. Thus, unlike traditional mechanisms such as Spatial or Channel Attention, SCPA directly integrates both local and global context at the pixel level. This enables a more precise representation, complementing the self-calibration operation by providing a more comprehensive approach to capturing key features in complex images.

In addition, the Breast Imaging-Reporting and Data System (BI-RADS) for CESM highlights the importance of morphological characteristics and local intensity patterns in regions with contrast uptake, which radiologists consider crucial to characterizing and interpreting these lesions.²⁸ In this context, SCPA offers a principled mechanism to emphasize diagnostically relevant enhancement patterns within contrast-uptake regions, aligning the learned representations with clinical reading criteria.

Network architectures

We integrated the SCPA block into two benchmark architectures: a U-Net and a Pix2Pix-based conditional GAN (cGAN), hereafter referred to as SCPA-UNet and SCPA-cGAN, respectively.

SCPA-UNet

Figure 3 illustrates the SCPA architecture, which builds upon a U-Net structure.²⁹ In this design, SCPA blocks are integrated into both the encoder and the decoder to explicitly guide the model in the extraction and refinement of diagnostically relevant features. In the encoder, the blocks enhance the network's ability to capture critical patterns from LE images; in the decoder, they prioritize the reconstruction of contrast-related structures. This dual-integration strategy enables the model to balance fine-grained local details and leverage global contextual cues, supporting coherent synthesis of RC images in regions with and without contrast uptake.

Figure 3.

General structure of the self-calibrated pixel attention (SCPA)-UNet architecture, in which SCPA blocks are integrated into both the encoder and the decoder of a U-Net-based framework. The objective is to extract relevant features and incorporate global contextual information through the pixel attention layer at each stage of the model.

SCPA-cGAN

The SCPA-cGAN utilizes the same attention-augmented generator structure (Figure 4). A PatchGAN discriminator³⁰ was used to assess image realism at a local level. Notably, SCPA blocks were intentionally not incorporated into the discriminator to avoid an architectural imbalance, which could otherwise lead to training collapse by allowing the discriminator to overpower the generator and provide non-informative gradients.

Figure 4.

General structure of the SCPA-cGAN architecture, where SCPA blocks are integrated into the generator model, while the discriminator follows a PatchGAN scheme.

Architectural specifications

To ensure a fair comparison between models, both SCPA-UNet and SCPA-cGAN share a consistent block configuration:

Encoder/downsampling: Convolutional layers followed by instance normalization (momentum = 0.8) and LeakyReLU activation (slope = 0.2).

Decoder/upsampling: Transposed convolutions, instance normalization (momentum = 0.8), and LeakyReLU.

Skip connections: Feature maps are concatenated and processed through an additional convolutional layer and LeakyReLU.

Loss functions

Consequently, based on the implementation of the two proposed architectures, specific loss functions were defined for the optimization of each model. For the SCPA-UNet model, the mean absolute error (MAE) loss function was employed. This function computes the average of the absolute differences between the pixel values of the real RC image and those of the generated RC image, providing a direct measure of the overall discrepancy between the two images. Equation (1) defines the MAE function, where N denotes the number of paired LE and RC images in the training dataset. The term y_i represents the real RC image, and ${\hat{y}}_{i}$ denotes the generated RC image. $L_{MAE} (G) = λ_{pixel} \frac{1}{N} \sum_{i = 1}^{N} ‖ y_{i} - {\hat{y}}_{i 1} ‖$ (1)

For the SCPA-cGAN model, the cGAN loss function was employed, as defined in equation (2). The term $L_{L 1} (G)$ evaluates the pixel-level discrepancy between the generated and real images; for this term, the MAE loss function, previously described in equation (1), was used and weighted with a scaling factor of $λ = 100$ , following the recommendations of the Pix2Pix framework.³¹

Meanwhile, the adversarial component $L_{LS - cGAN} (G, D)$ was implemented using the least squares generative adversarial network formulation,³² as presented in equation (3). This choice is supported by previous reports demonstrating that such a formulation improves training stability and promotes the generation of higher-quality results. $L_{total} (G, D) = min_{G} max_{D} L_{LS - GAN} (G, D) + λ L_{L 1} (G)$ (2) $L_{LS - cGAN} (G, D) = \frac{1}{2} E_{x, y} [{(D (x, y) - 1)}^{2}] + \frac{1}{2} E_{x, z} [{(D (x, \hat{y}))}^{2}]$ (3)

Data collection and image preprocessing

To evaluate the robustness of the proposed strategy against population-specific variability and heterogeneous acquisition protocols, we utilized two independent datasets from Colombia (SURA-CESM) and Italy (UCBM-CESM). These cohorts introduce critical variations in hardware, contrast agent types, and X-ray acquisition settings. Thus, we sought to simulate realistic heterogeneity in CESM practice and to test the feasibility of the proposed approach beyond a single-center setting.

SURA-CESM dataset

A dataset of CESM images was collected from a cohort of 89 patients (ages 25–71) treated at Ayudas Diagnósticas-Sura, Medellín, Colombia, between January 2020 and May 2022. In this study, an inclusion criterion was applied to select patients imaged with the same contrast agent and the same acquisition device to reduce variability associated with the imaging protocol. As a result, a total of 69 patients were included in the analysis. The contrast agent used was iopromide (300 mg iodine concentration), and the acquisition device was a Selenia Dimensions system. In total, 356 paired LE and RC images were available for analysis. The dataset exhibits inherent resolution variability: 157 pairs at 3328 × 2560 pixels and 199 pairs at 4096 × 3328 pixels, all captured with a 12-bit depth (0–4095 intensity range) prior to any preprocessing.

UCBM-CESM dataset

The UCBM-CESM dataset comprises 569 paired CESM images acquired from 204 patients (aged 31–90 years) at the Fondazione Policlinico Universitario Campus Bio-Medico in Rome between September 2021 and October 2022, using a GE Healthcare Senographe Pristina FFDM system.¹⁶ For each patient, medical report data were extracted, including breast density classification according to the BI-RADS system. Breast density categories were distributed as follows: 37 patients corresponded to category A, 72 to category B, 73 to category C, and 22 to category D. Furthermore, malignant lesions were confirmed in 42 patients and benign lesions in 22 patients, all verified by biopsy. The remaining patients showed no evidence of tumor abnormalities.

Images processing

In this study, a robust preprocessing pipeline for CESM images was implemented, designed to standardize the input images, reduce areas of irrelevant information, and ensure the preservation of relevant anatomical structures.

One of the key challenges in designing this pipeline was managing the wide dynamic range inherent to mammograms. To address this heterogeneity and enhance the visibility of clinically significant details, the windowing technique was one of the key challenges in designing this pipeline, which was managing the wide dynamic range inherent to mammograms. To address this heterogeneity and enhance the visibility of clinically significant details, the windowing technique was applied. This method adjusts brightness and contrast levels to optimize the perception of anatomical structures within the breast. The window width and window level values required for this adjustment were extracted directly from the DICOM file metadata, ensuring consistent and image-specific parametrization tailored to the unique characteristics of each mammogram.

Mammography images often include large areas of black background that are irrelevant for identifying anatomical or pathological features. To improve focus on the relevant areas, each image was cropped to include only the breast using a bounding box. Moreover, due to the high spatial resolution of these images, the training and validation of the models involve high computational costs. To address this issue, the images were resized while preserving the original aspect ratio to ensure that the anatomical structures remained intact. After calculating the average aspect ratio of the cropped images, an approximate value of 1.23 was obtained for the SURA-CESM dataset and 2.01 for the UCBM-CESM dataset. As a result, final dimensions of 640 × 512 pixels were defined for SURA-CESM and 640 × 320 pixels for UCBM-CESM. This approach aims to eliminate inherent background biases that could influence the model while simultaneously reducing computational costs in image processing and synthesis.

Additionally, to evaluate the most effective pixel intensity scaling method, we implemented an evaluation framework that included various normalization techniques: Min–Max 0–1, Min–Max −1 to 1, Z-score, and the original intensity of the images. For this purpose, we adjusted the internal activation functions and the output activation function of the model according to each technique. This evaluation framework demonstrated that Min–Max 0–1 normalization yielded the best results in synthesizing the characteristics of regions with contrast uptake. However, as this is not the primary focus of the current study, the results of this evaluation are not discussed further.

Performance evaluation

Quantitative evaluation

Quantitative evaluation was performed by calculating four metrics that measure image quality using different criteria between generated RC images $\hat{y}$ and real RC images y. These metrics include the mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity (SSIM) index measure, and a perceptual radiologic metric (PercepRad).

The MSE calculates the pixel-to-pixel error between the generated images $\hat{y}$ and the target images y. A lower MSE value indicates a smaller pixel-by-pixel intensity error. This metric allows for the analysis of error in terms of pixel intensity and is represented in equation (4), where N represents the total number of pixels in the image. $L_{MSE} (y, \hat{y}) = \frac{1}{N} \sum_{i = 1}^{N} (y - \hat{y})^{2}$ (4)

We computed PSNR to measure the value of noise that affects the quality of the generated images $\hat{y}$ relative to the real images y. The value of this metric is expressed in decibels (dB), with higher values indicating better quality of the image generated. This is represented in equation (5). $PSNR (y, \hat{y}) = 10 \cdot \log_{10} (\frac{max {(y)}^{2}}{MSE (y, \hat{y})})$ (5)

The SSIM metric³³ calculates the similarity between two images in terms of luminance, contrast, and structure. This is represented in equation (6), where $μ_{\hat{y}}$ and $μ_{y}$ represent the mean pixel intensity, $σ_{\hat{y}}$ and $σ_{y}$ represent the standard deviation of pixel intensity, and $σ_{\hat{y} y}$ represents the covariance between the two images. The values of this metric range from 0 to 1, with higher values indicating greater similarity between the two images. $SSIM (y, \hat{y}) = \frac{(2 μ_{y} μ_{\hat{y}} + c_{1}) (2 σ_{y \hat{y}} + c_{2})}{(μ_{y}^{2} + μ_{\hat{y}}^{2} + c_{1}) (σ_{y}^{2} + σ_{\hat{y}}^{2} + c_{2})}$ (6)

The PercepRad metric is a perceptual metric that calculates the similarity between two images based on human perception. Unlike other perceptual metrics in the literature that use deep neural networks pre-trained on natural image datasets, the deep neural network model used by this metric was pre-trained using RadImageNet,³⁴ a dataset consisting of medical images such as CT, MRI, and ultrasound. The input images y and $\hat{y}$ are passed through the pre-trained deep neural network ResNet-50, which extracts feature representations. The difference between these feature representations of the two images is then calculated, as shown in equation (7), where $H_{n} W_{n}$ are the dimensions of the features in layer n of the model, and $D^{n} (y) - D^{n} (\hat{y})$ are the features extracted from layer n of the images y and $\hat{y}$ , respectively. Since this metric calculates the distance between the extracted features of the images, a lower value indicates greater perceptual radiological similarity between the images. $PercepRad (y, \hat{y}) = \sum_{n}^{N} \frac{1}{H_{n} W_{n}} | D^{n} (y) - D^{n} (\hat{y}) |_{2}^{2}$ (7)

We utilized these quantitative metrics in two evaluation strategies: The first strategy involves evaluating the complete image to analyze the accuracy in synthesizing global characteristics, such as the shape and structure of the breast and the average intensity. The second evaluation strategy focuses on analyzing the synthesis of the contrast agent response in local regions with contrast uptake. To achieve this strategy, we extracted regions exhibiting contrast-enhanced lesion characteristics from the RC images in the test set, resulting in a total of 31 regions. Additionally, an equal number of regions showing enhancement in the RC images but not corresponding to lesions or radiological findings (i.e. corresponding to the enhancement of fibroglandular tissue), were also extracted.

Statistical analysis

To determine the statistical significance of the performance differences between the proposed strategy and its baseline counterpart, a two-stage statistical analysis was conducted. First, a normality test was performed to verify the distribution of the data, followed by mean comparison tests using either parametric or non-parametric approaches, as appropriate.

The null hypothesis (H0) stated that there is no significant difference in performance between the proposed models and their baseline counterparts in terms of the evaluation metrics considered. Conversely, the alternative hypothesis (H1) suggested that the proposed models achieve significantly superior performance compared with their baselines.

In the first stage of the analysis, we assessed whether the differences in evaluation metrics between each pair of models followed a normal distribution. To this end, the Shapiro–Wilk test³⁵ was employed, given its robustness for small and moderate sample sizes. If the test yielded a p-value greater than the significance level ( $α = 0.05$ ), the data were assumed to follow a normal distribution.

When the normality assumption was satisfied (p > 0.05), a paired-sample t-test³⁶ was applied to determine whether the mean difference between the proposed model and its baseline was statistically significant. Otherwise, if the data did not follow a normal distribution (p ≤ 0.05), the non-parametric Wilcoxon signed-rank test³⁷ was used. This test is suitable when the normality assumption is violated and enables the evaluation of the statistical significance of the differences in performance metrics between the proposed models and their counterparts.

The outcomes of the statistical tests were interpreted with respect to the stated hypotheses. When the p-value obtained from the comparison test (t-test or Wilcoxon, as appropriate) was lower than the significance threshold (p < 0.05), the null hypothesis (H0) was rejected and the alternative hypothesis (H1) was accepted, indicating that the proposed model achieved significantly superior performance compared to the baseline model. Conversely, when the p-value was greater than or equal to the significance level (p ≥ 0.05), the null hypothesis was not rejected, indicating that no statistically significant differences were found between the proposed model and its counterpart. This interpretation allowed us to identify which models achieved significant improvements and in which specific metrics, thus providing a solid quantitative basis for evaluating the impact of the SCPA models in comparison with their conventional counterparts.

Qualitative evaluation

A two-stage qualitative evaluation was performed by a radiologist with 15 years of experience interpreting breast images (caseload of 500 studies per year) to assess the clinical feasibility of using the synthesized RC CESM images to make clinical decisions.

In the first stage, the realism and clinical quality of the images were evaluated. For this purpose, the test set from both datasets was split at the patient level, assigning real RC images from 17 patients, as well as RC images generated by the SCPA-UNet and SCPA-cGAN models, also from 17 patients each. In this phase, only RC images in the cranio-caudal (CC) and MLO projections of both breasts were presented. For the evaluation, two questions were formulated: (i) whether the presented image corresponded to an acquisition with conventional devices or, alternatively, was computationally generated, and (ii) whether the presented image had sufficient technical quality to support a reliable diagnosis.

In the second stage, the consistency of contrast enhancement was evaluated. For this purpose, the test set from both datasets was again split at the patient level, using images generated by the SCPA-UNet and SCPA-cGAN models, corresponding to 25 patients each. In this phase, the complete studies of each patient were presented, including the LE images together with the real RC image for each projection and breast, in addition to the RC image generated by the corresponding model. For the evaluation, two questions were formulated on a 5-point scale: (i) whether the generated uptake was consistent with the real uptake and the visual characteristics of the image, and (ii) whether the generated uptake allowed for a clinical interpretation similar to that of the real uptake.

Experimental results

Experimental settings

To evaluate the performance of the proposed strategy—the incorporation of attention layers through the SCPA module into conventional conditional generative models—the SCPA block was integrated into a U-Net²⁹ architecture and into the generator of the Pix2Pix model.³¹ This implementation was conducted to assess the contribution of the SCPA blocks in synthesizing the contrast agent response in regions with contrast uptake. The strategy was then compared against its respective baseline models (U-Net and Pix2Pix), as well as against the RiedNet architecture,¹⁵ which was previously proposed for domain translation in medical imaging.

RiedNet is a deep learning architecture specifically developed for domain translation in medical imaging, with a particular focus on CESM images. Its design combines residual and inception blocks integrated into a symmetrical structure based on the U-Net model. This configuration enables RiedNet to preserve key morphological details during domain transformation, maximizing its effectiveness in medical image synthesis tasks and ensuring the quality and relevance of the generated images.

Pix2Pix is a cGAN model designed for supervised image-to-image translation across different domains, using paired datasets. This supervised model learns a direct mapping between an input domain and an output domain by employing not only the adversarial loss characteristic of GANs but also a similarity-based loss to ensure that the generated images maintain high structural fidelity with respect to the target images.

Both the baseline models and the proposed strategy underwent heuristic testing with 150, 200, and 250 training epochs within the experimental framework. It was observed that 200 epochs were sufficient for model convergence, using a batch size of 8. We employed the Adam optimizer³⁸ with a learning rate ( $α$ ) of 0.0002 for the U-Net-based models, and learning rates of 0.0002 for the generator and 0.00005 for the discriminator in the cGAN models, with parameters $β_{1} = 0.9$ and $β_{2} = 0.999$ . The models were implemented in PyTorch and trained on an NVIDIA GeForce RTX 2070 GPU. The image datasets were split at the patient level, with a random allocation of 80% for training, 5% for validation, and 15% for testing.

Quantitative comparison

Tables 1 and 2 present the quantitative evaluation results of the proposed strategy compared with the baseline models on the test RC images. Table 1 reports the results for complete images, whereas Table 2 shows the results for the extracted regions with contrast enhancement across both datasets. For each metric, the values represent the mean performance across all test images, accompanied by the corresponding standard deviations. The U-Net-based models (highlighted in blue) were compared among themselves, as were the cGAN-based models (highlighted in gray). Within both groups, the best-performing model for each metric is highlighted in bold.

Table 1.

Quantitative results of pixel-level and perceptual similarity assessment on whole images across both datasets.

Dataset	Model	MSE↓	PSNR ↑	SSIM↑	PercepRad↓
SURA-CESM	UNet	0.0019 ± 0.0	27.799 ± 2.03	0.747 ± 0.04	1.209 ± 0.15
	RiedNet	0.0021 ± 0.0	27.197 ± 1.98	0.746 ± 0.05	1.200 ± 0.15
	SCPA-UNet	0.0017 ± 0.0	28.303 ± 2.15	0.757 ± 0.04	1.173 ± 0.15
	Pix2Pix	0.0022 ± 0.0	27.004 ± 1.86	0.685 ± 0.05	0.590 ± 0.13
	SCPA-cGAN	0.0020 ± 0.0	27.303 ± 1.72	0.697 ± 0.04	0.651 ± 0.13
UCBM-CESM	UNet	0.0007 ± 0.0	32.968 ± 3.36	0.907 ± 0.03	0.712 ± 0.21
	RiedNet	0.0007 ± 0.0	32.858 ± 2.9	0.907 ± 0.03	0.757 ± 0.23
	SCAP-UNet	0.0006 ± 0.0	33.626 ± 2.92	0.908 ± 0.03	0.715 ± 0.23
	Pix2Pix	0.0007 ± 0.0	32.451 ± 2.86	0.871 ± 0.04	0.495 ± 0.14
	SCPA-cGAN	0.0006 ± 0.0	33.072 ± 2.81	0.869 ± 0.04	0.476 ± 0.16

MSE: mean squared error; PSNR: peak signal-to-noise ratio; SSIM: structural similarity; PercepRed: perceptual radiologic metric; SCPA: self-calibrated pixel attention; cGAN: conventional generative adversarial network.

The values for each metric represent the mean performance over all test images, accompanied by the corresponding standard deviations. The best-performing model for each metric is highlighted in bold.

Table 2.

Quantitative results of pixel-level and perceptual similarity assessment on regions with contrast uptake across both datasets.

Dataset	Model	MSE↓	PSNR↑	SSIM↑	PercepRad↓
SURA-CESM	UNet	0.0071 ± 0.0	22.241 ± 2.43	0.586 ± 0.09	1.416 ± 0.26
	RiedNet	0.0084 ± 0.01	21.531 ± 2.75	0.555 ± 0.09	1.602 ± 0.19
	SCAP-UNet	0.0062 ± 0.0	22.973 ± 2.72	0.610 ± 0.09	1.462 ± 0.25
	Pix2Pix	0.0073 ± 0.0	22.077 ± 2.44	0.535 ± 0.08	0.720 ± 0.28
	SCPA-cGAN	0.0066 ± 0.0	22.494 ± 2.39	0.554 ± 0.08	0.775 ± 0.28
UCBM-CESM	UNet	0.0071 ± 0.01	24.178 ± 4.74	0.720 ± 0.12	1.023 ± 0.26
	RiedNet	0.0071 ± 0.01	24.289 ± 4.69	0.723 ± 0.12	1.068 ± 0.25
	SCAP-UNet	0.0065 ± 0.01	24.541 ± 4.55	0.727 ± 0.12	0.962 ± 0.28
	Pix2Pix	0.0073 ± 0.01	23.759 ± 4.24	0.671 ± 0.11	0.729 ± 0.25
	SCPA-cGAN	0.0061 ± 0.01	24.261 ± 4.19	0.686 ± 0.11	0.667 ± 0.26

The values for each metric represent the mean performance across all test images, accompanied by their respective standard deviations. The best-performing model for each metric is highlighted in bold.

As presented in Table 1, in the SURA-CESM dataset, the proposed SCPA-UNet model achieves the best overall performance, yielding a reduction in MSE compared with U-Net and RiedNet. This improvement corresponds to a relative decrease of 10.5% with respect to U-Net and 19% with respect to RiedNet. Consistently, the PSNR increases to 28.30 dB, outperforming U-Net by 1.8% and RiedNet by 4.1%. In terms of SSIM, SCPA-UNet reaches an SSIM of 0.757, representing a relative improvement of 1.3% over U-Net and 1.5% over RiedNet. Finally, for the perceptual metric (PercepRad), SCPA-UNet reduces the value to 1.173, which corresponds to a 3% improvement compared with both reference models. In contrast, the proposed SCPA-cGAN model achieved competitive performance compared with Pix2Pix. A relative improvement of 1.5% was observed in SSIM, while PSNR showed a gain of 1.1%. However, Pix2Pix achieved better performance than SCPA-cGAN on the PercepRad metric, with SCPA-cGAN showing a relative decrease of 10.3%.

In the UCBM-CESM dataset, the improvements were even more pronounced. The MSE of SCPA-UNet (0.0006) represented a relative reduction of 14.3% compared with both U-Net and RiedNet (0.0007 each). The PSNR increased to 33.63 dB, outperforming U-Net by 1.8% and RiedNet by 2.4%. The SSIM reached the highest value, with a relative gain of 0.1% over U-Net and 2.6% over RiedNet. Regarding PercepRad, SCPA-UNet showed comparable performance to U-Net but superior performance to RiedNet. The behavior of the cGAN models was consistent, with SCPA-cGAN reducing MSE by 14.3% compared with Pix2Pix and increasing PSNR to 33.07 dB, with a relative gain of 1.7%. Although SSIM was slightly lower (0.869 vs. 0.871, a difference of −0.2%), the perceptual metric again revealed a clear advantage, corresponding to a relative improvement of 3.8%.

Table 2 presents the results obtained when restricting the analysis to the ROIs, which enables a more precise evaluation of the models’ ability to preserve diagnostically relevant structures. In the SURA-CESM dataset, the SCPA-UNet model achieved the best performance within the U-Net family. In terms of MSE, the value decreased to 0.0062, representing a relative reduction of 12.7% compared with U-Net, and 25.9% compared with RiedNet. Consistently, the PSNR increased to 22.97 dB, surpassing U-Net by 3.3% and RiedNet by 6.7%. For SSIM, a value of 0.610 was achieved, corresponding to relative improvements of 4.1% and 9.9%, respectively. Regarding the perceptual metric, the score was slightly higher than U-Net (1.416; a relative deterioration of 3.2%) but lower than RiedNet (1.602; a relative improvement of 8.7%). For the adversarial models, SCPA-cGAN showed notable improvements over Pix2Pix. The MSE achieved a relative reduction of 9.6%, while the PSNR increased from 22.08 to 22.49 dB (a gain of 1.9%). In terms of SSIM, an improvement of 3.6% was observed. Regarding the perceptual metric, Pix2Pix outperformed SCPA-cGAN by 7.6%, possibly because this metric is more sensitive to local variations in fine texture, which may favor models that emphasize high-frequency details.

In the UCBM-CESM dataset, the SCPA-UNet model achieved the lowest MSE (0.0065), representing a relative improvement of 8.5% compared with both U-Net (0.0071) and RiedNet (0.0071). The PSNR increased to 24.54 dB, with gains of 1.5% over U-Net and 1.1% over RiedNet. In SSIM, it reached the highest value (0.727), improving by 0.7% compared with U-Net and 0.6% compared with RiedNet. For the PercepRad metric, a value of 0.962 was obtained, representing a relative improvement of 6.0% over U-Net and 9.9% compared with RiedNet. For the adversarial models, SCPA-cGAN exhibited similar behavior. The MSE decreased to 0.0061, representing a 16.4% improvement over Pix2Pix. PSNR also increased to 24.26 dB (+2.1%), and SSIM improved by 2.2%. In terms of PercepRad, SCPA-cGAN achieved a value of 0.667, a relative improvement of 8.5% compared with Pix2Pix.

The results of the statistical test are presented in Tables 3 and 4. If the data followed a normal distribution, a paired Student's t-test (*) was used. Otherwise, the Wilcoxon signed-rank test (+) for paired samples was employed. The results are reported using symbols representing different levels of statistical significance:

(+++) or () indicate significant differences with a p-value < 0.001.

(++) or () indicate significant differences with a p-value < 0.01.

(+) or () indicate significant differences with a p-value < 0.05.

(−) indicates that the difference is not significant.

Table 3.

Statistical significance of the evaluation metrics (PSNR, SSIM, and PercepRad) between the proposed strategy and its baseline counterparts on the SURA-CESM dataset.

Comparison	Analysis	PSNR	SSIM	PercepRad
SCPA-UNet versus U-Net	Full image	+++	+++	**
	ROIs	***	+++	***
SCPA-UNet versus RiedNet	Full image	***	+++	+++
	ROIs	+++	***	+++
SCPA-cGAN versus Pix2Pix	Full image	+++	***	***
	ROIs	***	+++	+++

PSNR: peak signal-to-noise ratio; SSIM: structural similarity; PercepRed: perceptual radiologic metric; SCPA: self-calibrated pixel attention; cGAN: conventional generative adversarial network; ROI: region of interest.

The symbols used are as follows: (+) indicates the Wilcoxon test, (*) indicates the paired Student's t-test, (+++ or ***) denote statistical significance at p < 0.001, (++ or **) at p < 0.01, (+ or *) at p < 0.05, and (−) indicates that the difference is not significant.

Table 4.

Statistical significance of the evaluation metrics (PSNR, SSIM, and PercepRad) between the proposed strategy and its baseline counterparts on the UCBM-CESM dataset.

Comparison	Analysis	PSNR	SSIM	PercepRad
SCPA-UNet versus U-Net	Full image	+	−	−
SCPA-UNet versus U-Net	ROIs	−	+	**
SCPA-UNet versus RiedNet	Full image	+++	−	+++
SCPA-UNet versus RiedNet	ROIs	++	−	+++
SCPA-cGAN versus Pix2Pix	Full image	−	++	+
SCPA-cGAN versus Pix2Pix	ROIs	*	*	*

In the SURA-CESM dataset, the SCPA-UNet model showed consistent improvements over U-Net and RiedNet, achieving significant reductions in MSE and increases in PSNR and SSIM, both at the whole-image level and within ROIs. Statistical tests confirmed the high significance (p < 0.001) of these improvements for PSNR and SSIM across all comparisons, as well as for PercepRad in most cases. This indicates that the integration of the SCPA block into the U-Net architecture not only improves the mean values of the metrics but does so robustly and consistently across different evaluation scenarios. For the adversarial models, SCPA-cGAN also outperformed Pix2Pix with significant differences in PSNR and SSIM (p < 0.001 in both whole images and ROIs), demonstrating that SCPA attention enhances perceived realism in addition to structural fidelity.

In the UCBM-CESM dataset, the results present a more nuanced scenario. Although the SCPA-UNet model achieved the best mean values in MSE, PSNR, and SSIM, the differences with respect to the U-Net were not always statistically significant. In particular, the significance analysis revealed that improvements over U-Net in PSNR and SSIM were marginal (p < 0.05 or not significant in certain cases), whereas the differences with RiedNet did reach high significance (p < 0.001). This result suggests that, in this dataset, the impact of the SCPA block is more clearly evidenced when compared with optimized architectures such as RiedNet, rather than with U-Net. For SCPA-cGAN, the improvements over Pix2Pix in PSNR and SSIM were statistically significant in the ROI analysis (p < 0.05 to p < 0.01), underscoring the importance of evaluating relevant regions to capture differences that may not always be apparent at the global level.

Qualitative comparison

Figures 5 and 6 present the results of three test cases, comparing the real RC images with those generated by the baseline models and the proposed strategy. In these figures, the first, third, and fifth rows show the test images alongside their corresponding synthesized images, while the second, fourth, and sixth rows display the difference maps between each generated image and the real RC image. The first and second columns depict the LE images used as input to the models, together with the real RC images. The remaining columns present the RC images generated by the baseline models and by the proposed strategy, respectively.

Figure 5.

Visual comparison of generated images in the SURA-CESM dataset. Top rows show low-energy, reference recombined, and generated images from baseline (U-Net, RiedNet, and Pix2Pix) and proposed (SCPA-UNet and SCPA-cGAN) models. Bottom rows display absolute error maps, with higher values (yellow) indicating larger discrepancies. Red boxes highlight regions of interest used to assess structural detail preservation and contrast uptake performance.

Figure 6.

Visual comparison of generated images in the UCBM-CESM dataset. Top rows show low-energy, reference recombined, and generated images from baseline (U-Net, RiedNet, and Pix2Pix) and proposed (SCPA-UNet and SCPA-cGAN) models. Bottom rows display absolute error maps, with higher values (yellow) indicating larger discrepancies. Red boxes highlight regions of interest used to assess structural detail preservation and contrast uptake performance.

Three representative cases (A, B, and C) with different levels of contrast uptake were selected. In the first case (A), an image was included in which a region with contrast uptake is visible in both the LE and RC images. The second case (B) corresponds to a patient with high breast density, in which the lesion is masked in the LE image; however, in the RC image, the fibroglandular tissue is suppressed, and the lesion is highlighted more clearly. Finally, in the third case (C), a patient with high breast density is presented in the LE image, but with no evidence of contrast uptake in the RC image.

These cases were selected to evaluate the ability of the models to accurately represent the contrast uptake response of lesions, regardless of their visibility in the LE image. In addition, they allow for analysis of the models’ ability to avoid generating false positives with contrast uptake.

In both datasets, it can be observed that the U-Net, RiedNet, and SCPA-UNet models generate reconstructions with a certain degree of smoothing, both in the global image and in the regions with contrast uptake. This effect is expected, since the optimization of these models is based exclusively on minimizing pixel-level discrepancies. Nevertheless, among them, SCPA-UNet stands out for its ability to reconstruct the ROIs with greater accuracy.

In contrast, the Pix2Pix and SCPA-cGAN models do not exhibit the smoothing effect mentioned above. This is because their optimization is not based exclusively on a pixel-level loss function but also incorporates a component that optimizes the realism of the generated images. Nevertheless, SCPA-cGAN stands out in the synthesis of morphological features and in the reproduction of lesion enhancement patterns, showing lower discrepancies, as evidenced in the difference maps.

On the other hand, Figure 6 presents three cases with different contrast uptake characteristics from the UCBM-CESM dataset. As illustrated in this figure, the breast features differ from those in the SURA-CESM dataset. In UCBM-CESM, a lower presence of normal breast tissue characteristics is observed in the RC images, which may be attributable to differences in the acquisition protocols between the two datasets.

Despite this difficulty, the proposed strategy, SCPA-UNet and SCPA-cGAN models show greater accuracy in reproducing the lesion characteristics, as can be observed in the difference maps. In contrast, the baseline models U-Net, RiedNet, and Pix2Pix fail to detect this lesion, highlighting the limitations of these approaches in synthesizing regions with contrast uptake.

On the other hand, Case C in both datasets corresponds to a scenario where the LE image shows high breast density but no contrast uptake in the RC image. We observed that the models generate certain discrepancies in the synthesis of normal breast tissue, introducing regions of uptake that are not present in the real image. This highlights the technical challenge that such cases pose for generative models. Overall, these qualitative examples illustrate both the ability of SCPA-based models to recover subtle uptake patterns and their current limitations in very dense or non-enhancing cases, in line with the feasibility scope of this study.

Attention score analysis

Figure 10 shows the attention scores corresponding to pixel-attention in the final layer of the proposed strategy. The first three columns of the figure present the LE images, the real RC images, and the images generated by the model, respectively. The next four columns display the attention maps corresponding to the different attention layers of the model. Each row represents a set of images, including an enlarged crop of an ROI (marked in red and black), highlighting areas with contrast agent uptake.

Attention scores 0 and 1 indicate that the model focuses on the background tissue of the breast and the black background of the image. This suggests that the model directs its attention to the larger areas within the image, identifying global features. In the marked regions, it is observed that the model concentrates on the surrounding tissue of the areas with contrast agent response.

Attention scores 2 and 3 show a distribution of attention similar to Attention Map 1. However, Attention Score 2 shows a notable concentration of attention in the fibroglandular tissue with contrast agent response. This suggests that the model has learned to identify specific and relevant features associated with potential anomalies in the fibroglandular tissue. This behavior indicates that the fibroglandular tissue with contrast agent response contains critical information that the model considers essential for generating accurate and detailed synthetic images. Although Attention Score 3 presents a pattern similar to Attention Score 1, it shows that the highest attention is concentrated in the brighter regions of the RC image. This suggests that it focuses on the hyperintense areas compared to the parenchymal background.

The analysis of the attention scores demonstrates that the model has learned to identify and emphasize important features in regions with contrast agent response, generating more accurate and detailed images in both global and local characteristics. This is essential for the detection and analysis of anomalies in CESM studies.

Clinical feasibility evaluation

In the first phase, the radiologist was asked to distinguish between real and synthetic RC images, as well as to evaluate their technical quality. Figure 7 presents the confusion matrices corresponding to the SCPA-UNet and SCPA-cGAN models, respectively. For SCPA-UNet, 10 out of 17 patients were correctly identified as synthetic, while 3 were misclassified as real and 4 were evaluated as uncertain. For the real RC images, 11 out of 17 were correctly classified, 1 was considered synthetic, and 4 were evaluated as uncertain. These results suggest that, although the model generates images with realistic characteristics, the radiologist was able to discriminate between real and synthetic acquisitions in most cases, albeit with a non-negligible level of uncertainty. According to the expert radiologist, this uncertainty is associated with the smoothing effect observed in the images, a phenomenon expected since the loss function employed in U-Net-based models tends to induce such an effect.

Figure 7.

Confusion matrix for realism assessment between images generated by the SCPA-UNet (left) and SCPA-cGAN (right).

In the case of SCPA-cGAN, a similar pattern was observed: 10 out of 17 synthetic cases were correctly classified, while 6 were incorrectly considered real and only 1 was evaluated as uncertain. Similarly, 11 out of 17 real cases were correctly classified, 1 was labeled as synthetic, and 4 were evaluated as uncertain. Compared with SCPA-UNet, SCPA-cGAN produced images that were more frequently perceived as real, suggesting higher perceptual realism, albeit accompanied by slightly lower discriminability. This behavior can be explained by the fact that, as a generative model, adversarial loss contributes significantly to the realism of the images. Nevertheless, the expert radiologist noted the presence of noise-like artifacts (white pixels), which in some cases motivated the classification of an image as computationally generated. This phenomenon could be attributed to the limited receptive field of the PatchGAN architecture used in the SCPA-cGAN discriminator.

In the second phase, the evaluation focused on determining whether the synthesized uptake patterns were consistent with the true enhancement and allowed for appropriate clinical interpretation. Figure 8 illustrates the radiologist's assessment in 50 patients. Overall, 80% of the synthesized studies were rated as either moderately consistent or highly consistent, with 32% (16 out of 50) classified as “highly consistent” and 28% (14 out of 50) as “moderately consistent.” Only 20% of the cases were rated as inconsistent (8%) or highly inconsistent (12%), reflecting that most generated images exhibited a plausible distribution of enhancement.

Figure 8.

Bar chart of the consistency level of contrast uptake synthesis in the images generated by the proposed strategy.

The analysis of the 5-point similarity scale (Figure 9) reinforced this observation: 15 patients (30%) received the highest similarity score (level 5), 12 patients (24%) were rated at level 3, and 9 patients (18%) at level 4, whereas only 14 patients (28%) were classified in the lowest similarity levels (1 and 2). These findings demonstrate that, in most cases, the synthesized images were visually consistent with the true enhancement and amenable to clinical interpretation.

Figure 9.

Bar chart of the similarity level of the synthesized uptake with respect to clinical interpretability.

These results show that, although the radiologist was able to discriminate between real and synthetic images in most cases, either due to the smoothing effect present in the images generated by the SCPA-UNet model or the appearance of small white-pixel artifacts in the images generated by SCPA-cGAN, the consistency of contrast uptake was largely preserved, particularly at the higher levels of similarity. Moreover, the radiologist noted that these limitations did not affect medical interpretation, supporting the potential of SCPA-based models as a promising alternative for synthesizing contrast uptake in CESM.

Discussion

This study evaluates the impact of integrating SCPA into conditional generative models for virtual contrast synthesis in CESM images. By shifting the focus from global image quality to regional fidelity, the proposed strategy prioritizes the preservation of diagnostically relevant features in contrast-uptake zones. In this feasibility setting, we deliberately build on established U-Net- and Pix2Pix-based baselines reported in the current CESM virtual-contrast state-of-the-art.

The superior performance of SCPA-based models, particularly in ROI-focused metrics, indicates that attention mechanisms help to better model localized iodine attenuation and lesion-level signal. As shown in Table 2 and in the statistical analysis summarized in Tables 3 and 4, SCPA-integrated models achieved significant reductions in MSE and improvements in PSNR and SSIM within contrast-uptake zones compared with baseline architectures. The ability of the SCPA block to generate 3D attention maps allows the model to recalibrate features spatially and across channels, effectively suppressing background noise. This behavior is visually evident in Figure 10, where the attention maps prioritize high-frequency details in lesions while largely ignoring unenhanced parenchymal tissue.

Figure 10.

Distribution of attention maps generated by the proposed SCPA models. The figure displays the low-energy images, the real RC images, and the generated images, along with the attention maps corresponding to the final attention layer in the SCPA architecture. These maps highlight the areas where the model focuses its attention.

The third row in Figure 5 is of particular interest: the LE image displays extremely dense fibroglandular tissue, whereas the RC image reveals a lesion with high contrast uptake. This scenario is highly challenging for image-to-image translation models, as dense tissue can mask potential lesions. In this case, the RiedNet and Pix2Pix models failed to detect the hidden lesion, and the U-Net model could not accurately synthesize its shape and enhancement pattern. In contrast, the proposed strategy demonstrated superior performance in synthesizing the characteristics of this hidden lesion. This suggests that SCPA layers allow the model to focus on regions with high contrast uptake during training, even when these regions are obscured by dense fibroglandular tissue, which is particularly relevant in patients with high breast density.

Table 1 presents the whole-image evaluation results and shows that the proposed strategy achieves the best overall performance in most metrics. However, the differences with respect to the reference models are not statistically significant at this level on the UCBM-CESM dataset, suggesting that global image characteristics (such as overall breast shape, mean and standard deviation of intensity, and global texture) are reasonably well synthesized by baseline architectures. Taken together with the ROI-focused metrics (Table 2), these findings indicate that the added value of SCPA is more evident when analyzing diagnostically relevant regions than when averaging over the entire image, which is expected given that contrast-uptake regions usually represent only a small fraction of the total breast area. Moreover, the radiological assessment showed that the proposed strategy consistently synthesizes contrast-agent uptake while maintaining an adequate correspondence with the enhancement patterns observed in real images (Figures 8 and 9). Most synthesized studies were rated as moderately or highly consistent, and the majority of images were judged to have sufficient technical quality for diagnostic use, even though the radiologist was often able to distinguish real from synthetic images (Figure 7). In other words, the expert reader could often recognize synthetic images, but most studies were considered visually consistent enough to support clinical interpretation, which demonstrates the feasibility of moving toward virtual contrast generation in CESM studies.

Despite these promising results, this study has several limitations that should be considered. Deep learning models achieve better performance when trained on large datasets containing sufficient samples to represent the biological variability of the phenomenon; therefore, the present study is constrained by a relatively limited number of cases and by the retrospective nature of data collection. Although the use of two independent datasets with different acquisition characteristics (SURA-CESM and UCBM-CESM) provides initial evidence of robustness, both remain modest compared with datasets commonly used in other imaging domains, which restricts clinical generalizability. In addition, the preliminary assessment performed by a single highly experienced breast radiologist should be regarded as exploratory and may be subject to observer bias.

Accordingly, future studies should expand the dataset size and ensure an adequate representation of the inherent variability in breast images, including breast composition, parenchymal distribution, and tumor morphology, as well as acquisition protocol-related factors such as contrast-agent dose and technical imaging parameters. Moreover, multi-reader studies involving radiologists with different levels of experience are required to more objectively assess interobserver consistency and the potential impact on diagnostic interpretation. From an architectural perspective, this study provides a modest modification by integrating SCPA into established conditional generative models used for CESM synthesis. Future work should explore newer conditional translation families, including conditional diffusion models, transformer-based generators (e.g. TransUNet and Swin-UNet), and multiscale strategies (e.g. Laplacian-pyramid GANs or StyleGAN-inspired variants), which could better capture fine detail, long-range dependencies, and global–local consistency of images. A systematic comparison against these more complex architectures therefore lies beyond the scope of the present study, and it could be better addressed in future work once larger multi-center CESM datasets become available.

In parallel, future work should address integration into clinical workflows. Radiology departments require AI tools that can interface with PACS/RIS systems, operate with minimal latency, and provide output that is interpretable to clinicians. In this regard, the attention maps generated by our models could represent a valuable interpretability feature, supporting radiologists’ trust in AI-generated images and aligning with ongoing efforts to promote explainability in medical AI.³⁹ In addition, regulatory approval will be indispensable. AI-based systems intended for diagnostic purposes are categorized as Software as a Medical Device and therefore require validation through regulatory frameworks such as Food and Drug Administration clearance in the United States or CE marking in Europe. These processes involve rigorous testing across diverse populations and clinical settings, as well as demonstrating safety, efficacy, and reproducibility, and have proven to be a major bottleneck for the translation of AI tools into clinical practice.³⁹ Therefore, in the near term, AI-generated CESM images are most realistically positioned as adjunctive decision-support tools in human-in-the-loop settings, helping radiologists to explore virtual contrast scenarios while preserving clinical responsibility and patient safety.

Conclusions

This study establishes a robust framework for the synthesis of virtual contrast in CESM by integrating SCPA into conditional generative architectures. Our findings demonstrate that prioritizing regional fidelity over global similarity metrics significantly improves the morphological and functional accuracy of synthesized lesions, addressing a fundamental limitation of conventional CNN-based image-to-image translation models in this setting.

Across two independent CESM datasets, SCPA-based models achieved better quantitative performance, particularly in ROI-focused metrics, and were evaluated by an expert breast radiologist who determined that the generated images provided realism and technical quality for clinical interpretation in most cases. At the same time, the persistence of occasional artifacts and subtle discrepancies in enhancement patterns, together with the limited and heterogeneous nature of the available datasets, means that these results should be interpreted as evidence of technical and clinical feasibility rather than as validation for immediate diagnostic replacement of contrast-enhanced acquisitions.

Future adoption in clinical workflows will depend on large-scale, multi-reader trials and the integration of explainable AI features to ensure diagnostic safety. Ultimately, this research bridges the gap between generative AI theory and oncological application, moving closer to a safer, more accessible paradigm for breast cancer diagnosis.

Footnotes

Acknowledgements

None.

ORCID iDs

Kevin Osorno-Castillo

Rubén D. Fonnegra

María Liliana Hernández

Carlos Mera-Banguero

Gloria M. Díaz

Ethical approval

This study represents one of the research directions within project P20213 approved by the Ethics Committee of the Instituto Tecnológico Metropolitano (Act No. 02, June 2020). Images used for the training and validation models were retrospectively obtained from patients who came for clinical studies as part of the health programs. Written informed consent was obtained from all participants as part of the clinical practice of “Ayuda Diagnósticas-Sura.”

Author contributorship

KOC, RDF, CMB, MLH, and GMD contributed to the conceptualization of the proposed model and the experimental design. The methodology was implemented by KOC and RDF, while the validation of the results was carried out by GMD, MLH, and RDF. The formal analysis was performed by KOC, who was also responsible for the visualization of the results. The original draft was written by KOC, and the review and editing were performed by GMD, CMB, RDF, and MLH. The study was supervised by GMD, who also managed the project administration and secured the acquisition of funding. All authors have read and approved the final version of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Instituto Tecnológico Metropolitano (research grant P20213), the Institución Universitaria Pascual Bravo (research grant CE-007-2020), Imágenes Diagnósticas SURA, and the Agencia de Educación Postsecundaria de Medellín - SAPIENCIA.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Bray

Laversanne

Sung

, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024; 74: 229–263.

Reece

Neal

EFG

Nguyen

, et al. Delayed or failure to follow-up abnormal breast cancer screening mammograms in primary care: a systematic review. BMC Cancer 2021; 21: 73.

Hussein

Abbas

Keshavarzi

, et al. Supplemental breast cancer screening in women with dense breasts and negative mammography: a systematic review and meta-analysis. Radiology 2023; 306: e221785.

Caschera

Lazzara

Piergallini

, et al. Contrast agents in diagnostic imaging: present and future. Pharmacol Res 2016; 110: 65–75.

Gelardi

Ragaini

Sollini

, et al. Contrast-enhanced mammography versus breast magnetic resonance imaging: a systematic review and meta-analysis. Diagnostics 2022; 12: 1890.

Jochelson

Lobbes

MBI

. Contrast-enhanced mammography: state of the art. Radiology 2021; 299: 36–48.

Coffey

Jochelson

. Contrast-enhanced mammography in breast cancer screening. Eur J Radiol 2022; 156: 110513.

Sogani

Mango

Keating

, et al. Contrast-enhanced mammography: past, present, and future. Clin Imaging 2021; 69: 269–279.

Mallio

Radbruch

Deike-Hofmann

, et al. Artificial intelligence to reduce or eliminate the need for gadolinium-based contrast agents in brain and cardiac MRI: a literature review. Invest Radiol 2023; 58: 746–753.

10.

Cañaveral

Mera-Banguero

Fonnegra

. Síntesis de imagen medica postcontraste en estudios de DCE-MRI de mama usando aprendizaje profundo. TecnoLógicas 2024; 27: e3052.

11.

Fonnegra

Hernández

Caicedo

, et al. Synthesizing late-stage contrast enhancement in breast MRI: a comprehensive pipeline leveraging temporal contrast enhancement dynamics. Comput Biol Med 2025; 196: 110660.

12.

Chun

Chang

, et al. Synthetic contrast-enhanced computed tomography generation using a deep convolutional neural network for cardiac substructure delineation in breast cancer radiation therapy: a feasibility study. Radiat Oncol 2022; 17: 1–12.

13.

Yang

Liu

Zhan

, et al. OA-GAN: organ-aware generative adversarial network for synthesizing contrast-enhanced medical images. Biomed Phys Eng Express 2024; 10: 035012.

14.

Gao

, et al. SD-CNN: a shallow-deep CNN for improved breast cancer diagnosis. Comput Med Imag Graph 2018; 70: 53–62.

15.

Gao

Chu

, et al. Deep residual inception encoder–decoder network for medical imaging synthesis. IEEE J Biomed Health Inf 2020; 24: 39–49.

16.

Rofena

Guarrasi

Sarli

, et al. A deep learning approach for virtual contrast enhancement in contrast-enhanced spectral mammography. Comput Med Imag Graph 2024; 116: 102398.

17.

Jiang

Zheng

Jia

, et al. Synthesis of contrast-enhanced spectral mammograms from low-energy mammograms using cGAN-based synthesis network. In: Medical image computing and computer assisted intervention – MICCAI, Strasbourg, France, 27 September–1 October 2021, pp.68–77. Cham: Springer.

18.

Rofena

Manchia

Piccolo

, et al. Lesion-aware generative artificial intelligence for virtual contrast-enhanced mammography in breast cancer. In: 2025 IEEE 38th international symposium on computer-based medical systems (CBMS), Madrid, Spain, 18–20 June 2025, pp.141–146. Los Alamitos, CA, USA: IEEE Computer Society.

19.

Amin

Rushdi

Kamal

, et al. A deep learning approach for contrast-agent-free breast lesion detection and classification using adversarial synthesis of contrast-enhanced mammograms. Image Vis Comput 2025; 162: 105692.

20.

Zhao

Hou

Pan

, et al. Attention-based generative adversarial network in medical imaging: a narrative review. Comput Biol Med 2022; 149: 105948.

21.

Papanastasiou

Dikaios

Huang

, et al. Is attention all you need in medical image analysis? A review. IEEE J Biomed Health Inf 2024; 28: 1398–1411.

22.

Guo

Liu

, et al. Attention mechanisms in computer vision: a survey. Comp Visual Media 2022; 8: 331–368.

23.

Zhang

Goodfellow

Metaxas

, et al. Self-attention generative adversarial networks. In: Proceedings of the 36th international conference on machine learning, Long Beach, USA, 9–15 June 2019, pp.7354–7363. Long Beach, CA, USA: PMLR.

24.

Shen

Sun

. Squeeze-and-excitation networks. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018, pp.7132–7141. Los Alamitos, CA, USA: EEE Computer Society.

25.

Wang

Zhu

, et al. ECA-Net: efficient channel attention for deep convolutional neural networks. In: IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020, pp.11531–11539. Los Alamitos, CA, USA: IEEE Computer Society.

26.

Woo

Park

Lee

, et al. CBAM: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018, pp.3–19. Cham, Switzerland: Springer..

27.

Zhao

Kong

, et al. Efficient image super-resolution using pixel attention. In: European conference on computer vision (ECCV) workshops, Glasgow, UK, 23–28 August 2020, pp.56–72. Cham, Switzerland: Springer.

28.

Lee

Phillips

Sung

. ACR BI-RADS® contrast enhanced mammography (CEM): supplement to ACR BI-RADS® mammography 2013 atlas. American College of Radiology, 2022, https://www.acr.org/-/media/ACR/Files/RADS/BI-RADS/BIRADS CEM 2022.pdf.

29.

Ronneberger

Fischer

Brox

. U-Net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention – MICCAI, Munich, Germany, 5–9 October 2015, pp.234–241. Cham, Switzerland: Springer.

30.

Wand

. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: 14th European conference on computer vision, Amsterdam, The Netherlands, 11–14 October 2016, pp.702–716. Cham, Switzerland: Springer.

31.

Isola

Zhu

Zhou

, et al. Image-to-image translation with conditional adversarial networks. In: IEEE conference on computer vision and pattern recognition (CVPR), 24–26 July 2017, pp.5967–5976. New York, NY, USA: IEEE.

32.

Mao

Xie

, et al. Least squares generative adversarial networks. In: IEEE international conference on computer vision (ICCV), Venice, Italy, 22–29 October 2017, pp. 2813–2821. New York, NY, USA: IEEE.

33.

Wang

Bovik

Sheikh

, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 2004; 13: 600–612.

34.

Mei

Liu

Robson

, et al. RadImageNet: an open radiologic deep learning research dataset for effective transfer learning. Radiol Artif Intell 2022; 4: e210315.

35.

Shapiro

Wilk

. An analysis of variance test for normality (complete samples). Biometrika 1965; 52: 591–611.

36.

Kim

. T test as a parametric statistic. Korean J Anesthesiol 2015; 68: 540–546.

37.

Rey

Neuhäuser

. Wilcoxon-signed-rank test. In: Wiley encyclopedia of clinical trials. NJ, USA: John Wiley & Sons, 2007, pp.1–3.

38.

Kingma

. Adam: a method for stochastic optimization. In: International conference on learning representations (ICLR), San Diego, USA, 7–9 May 2015, p.13. Ithaca, NY, USA: arXiv.

39.

Sadeghi

Alizadehsani

Cifci

, et al. A review of explainable artificial intelligence in healthcare. Comput Electr Eng 2024; 118: 109370.

Self-calibrated pixel-attention autoencoder: A strategy for contrast-enhanced spectral mammography image-to-image translation