Abstract
Keywords
Introduction
Breast cancer remains a leading cause of cancer-related mortality worldwide, with approximately 2.3 million new cases diagnosed annually and nearly 600,000 deaths. 1 While full-field digital mammography (FFDM) continues to be the screening standard, its sensitivity is significantly compromised in women with dense breast tissue, where overlapping fibroglandular structures can obscure underlying lesions.2,3 Contrast-enhanced spectral mammography (CESM) addresses this by leveraging tumor neovascularization via iodinated contrast agents, achieving diagnostic performance comparable to breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) while offering lower costs.4,5
Technically, CESM utilizes a dual-energy protocol, capturing paired low-energy (LE; 25–34 kVp) and high-energy (HE; 45–49 kVp) images. Acquiring the high-energy image above the iodine K-edge (33 keV) maximizes iodine attenuation relative to breast tissue. 6 A weighted logarithmic subtraction then generates a recombined (RC) image, suppressing background parenchyma to highlight contrast-enhanced abnormalities. Figure 1 shows a representative example of CESM images from the left breast of a patient with high breast density, in which an underlying, otherwise obscured lesion is highlighted in the RC image.

An illustrative example of CESM images of the left breast from a 44-year-old patient with high breast density. Low-energy images are shown in the CC (panel A) and MLO (panel C) projections. RC images, in the same CC and MLO projections, are shown in panels (B) and (D), respectively. The red rectangle indicates mass lesions that are obscured in the low-energy images but highlighted in the RC counterparts.
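The weighted logarithmic subtraction described above can be sketched in a few lines. The following is a minimal NumPy illustration only: the weighting factor `w` and the toy intensities are assumptions for demonstration, not the calibrated values used by clinical CESM systems.

```python
import numpy as np

def recombine(low_energy, high_energy, w=0.7, eps=1e-6):
    """Toy weighted logarithmic subtraction for dual-energy CESM.

    `w` is an illustrative weighting factor; clinical systems calibrate it
    per device so that non-enhancing breast tissue cancels out.
    """
    return np.log(high_energy + eps) - w * np.log(low_energy + eps)

# Iodine raises high-energy attenuation relative to low-energy, so an
# enhancing pixel survives the subtraction while uniform background
# tissue is suppressed toward a constant level.
le = np.full((2, 2), 100.0)
he = np.array([[50.0, 80.0], [50.0, 50.0]])  # one "enhancing" pixel
rc = recombine(le, he)
```

Here `rc[0, 1]` stands out against the suppressed background, mimicking how the RC image highlights contrast uptake.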
Despite these clinical advantages, CESM implementation is constrained by the risks of iodinated media—such as nephrotoxicity—and increased ionizing radiation exposure.6–8 Moreover, dual-energy acquisition entails an increased radiation dose, which may be particularly relevant for radiation-sensitive populations, such as carriers of BRCA1 gene mutations. 7 In addition, the requirement for specialized equipment and contrast administration workflows can limit accessibility, especially in resource-constrained settings.7,8 These safety, dose, and accessibility considerations motivate interest in virtual-contrast approaches that aim to reproduce the diagnostic information contained in RC images while reducing or avoiding intravenous contrast administration.
Image-to-image translation approaches have been investigated in DCE-MRI and contrast-enhanced computed tomography (CT), showing that synthetic contrast-enhanced images can be generated from non-contrast inputs.9–13 However, in CESM, synthesizing RC images from LE projections has been only lightly explored.14–19 One of the first approaches to address this problem was proposed by Gao et al., 15 who introduced a U-Net-based architecture (RiedNet) incorporating Inception–Residual blocks to synthesize RC image patches that were then overlapped to reconstruct the whole RC image. Jiang et al. 17 evaluated a supervised cycle generative adversarial network (CycleGAN) framework for synthesizing RC images, in which information from the CC and mediolateral oblique (MLO) projections was fused to achieve coherent image synthesis across views. More recently, Rofena et al. 16 evaluated three representative state-of-the-art models, including an autoencoder, Pix2Pix for supervised conditional synthesis, and CycleGAN for unsupervised synthesis, reporting CycleGAN as the best-performing model in both quantitative and qualitative evaluations. Across these studies, the primary objective has been to generate globally realistic RC images, with performance predominantly assessed using whole-image similarity metrics, without explicitly modelling or evaluating clinically relevant contrast-uptake regions. More recently, building on their previous work, Rofena et al. 18 proposed integrating segmentation maps of contrast-agent uptake regions into a CycleGAN-based model, incorporating these maps into the overall loss through additional region-focused terms that localize the loss computation to the regions of interest (ROIs). However, this approach relies on the availability of reliably annotated uptake regions, which may be difficult to obtain in routine clinical practice, and the study did not systematically analyze performance within these regions, but rather in terms of global image quality.
On the other hand, data augmentation strategies based on image-to-image translation models have been proposed to improve performance in detection and classification tasks that use CESM images. Gao et al. 14 introduced a shallow four-layer convolutional neural network (CNN) to generate RC image patches containing suspicious lesions from LE image patches, aiming to enlarge the dataset and improve the classification of benign and malignant lesions. Similarly, Amin et al. 19 evaluated several generative models, including a conventional generative adversarial network (GAN) that uses a random noise vector as input, as well as CycleGAN and UNIT, the latter two being unsupervised approaches. In these studies, the synthesized RC images were used to augment training data for downstream detection and classification models rather than as virtual replacements for contrast-enhanced acquisitions.
Taken together, these approaches demonstrate the potential of artificial intelligence (AI) models to generate realistic and high-quality images in CESM. Nevertheless, current CESM virtual-contrast literature remains centered on standard convolutional and GAN-based architectures that are inherently limited by fixed-size kernels, which restrict the receptive field and hinder the integration of long-range dependencies,20,21 underscoring the need for alternative architectures that can more effectively focus on clinically relevant contrast-uptake regions.
Accordingly, in this study, we propose integrating a self-calibrated pixel attention (SCPA) block into widely used conditional generative architectures (U-Net and Pix2Pix) for CESM RC image synthesis. We hypothesize that pixel-level attention prioritizes the synthesis of high-value features in contrast-uptake zones, preserving the morphological and functional fidelity required for radiological interpretation. 22 While the architectural novelty is incremental, this work proposes a clinically aligned framework that shifts the emphasis from whole-image similarity toward regional fidelity under realistic CESM data constraints. Our contributions include: (i) integrating SCPA modules to prioritize synthesis accuracy in contrast-uptake regions; (ii) evaluating regional fidelity and ROI-level interpretability alongside traditional global metrics on two independent CESM datasets; and (iii) establishing a multi-level validation approach combining ROI-focused metrics and attention-map analysis together with a radiologist-led feasibility assessment to bridge the gap between technical output and clinical utility.
Materials and methods
The methodology is structured into three stages: data preprocessing, model training, and multi-level performance evaluation (Figure 2). The objective is to synthesize the contrast agent response in RC images by leveraging non-contrast LE inputs, while preserving the morphological and functional information required for radiological interpretation. In the first stage, the LE (input) and RC (target) images were preprocessed to ensure data uniformity. This process included windowing adjustment, breast cropping, spatial resizing of the images to optimize computational cost, and pixel intensity normalization. The second stage involved the training of the proposed strategy, incorporating the attention mechanisms into the U-Net and Pix2Pix models. Furthermore, training strategies based on supervised learning were implemented, specifically designed to preserve the structural and clinical fidelity of the generated images relative to the real images. Finally, in the third stage, a multi-level evaluation strategy was implemented to compare the proposed approach with the baseline models through whole-image metrics, ROI-focused analyses, and an expert breast radiologist assessment, with particular emphasis on regions exhibiting high contrast uptake.

Overview of the proposed methodology for the generation and evaluation of RC images in CESM. The workflow consists of three main stages: (1) preprocessing of the dataset to standardize the images; (2) training stage for the proposed strategy and baseline models; and (3) quantitative and qualitative evaluation stage, focusing on the complete image and the regions with contrast agent uptake.
Self-calibrated pixel-attention block
Visual attention layers in deep learning models allow them to focus on the most relevant regions of an image during the learning and prediction processes. This enhances the network’s ability to prioritize important areas in the input, optimizing its focus on key features and enabling a more accurate contextual interpretation of the data.
Among the first prominent implementations of attention mechanisms in vision is self-attention, which enables models to capture long-range relationships between different parts of an image by calculating the attention of each pixel with respect to all others. 23 However, self-attention poses significant challenges, such as its high computational and memory cost, especially when applied to high-resolution images due to its quadratic complexity with respect to the input size.
This limits its scalability in practical applications, leading to the development of more efficient variants such as Spatial-Attention 24 and Channel-Attention. 25 While Spatial-Attention assigns attention weights to specific regions within the image space, Channel-Attention focuses on learning the relative importance of each feature channel, enabling the model to prioritize the most relevant channels for a specific task. Moreover, the combination of these mechanisms, as seen in the convolutional block attention module, 26 integrates both perspectives, enhancing the ability of the model to jointly capture spatial and channel relationships, thereby optimizing performance in computer vision tasks.
The self-calibrated pixel-attention (SCPA) block, introduced in Zhao et al., 27 is an attention block designed to improve pixel-level representation through local and contextual recalibration. This approach combines a pixel-attention layer with a self-calibrated convolutional structure. The self-calibration operation adaptively expands the receptive fields of the convolutional layers, improving their ability to capture more complex spatial relationships. Meanwhile, the pixel-attention layer assigns weights to different regions of the input feature maps, enabling the model to focus on the most relevant areas. Thus, unlike traditional mechanisms such as Spatial or Channel Attention, SCPA directly integrates both local and global context at the pixel level. This enables a more precise representation, complementing the self-calibration operation by providing a more comprehensive approach to capturing key features in complex images.
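As a rough sketch of the pixel-attention idea at the heart of SCPA, consider the following NumPy example. This is a simplification under two stated assumptions: the self-calibrated convolution branch is omitted, and the 1×1 convolution is reduced to a per-channel linear map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pixel_attention(features, w, b=0.0):
    """Pixel attention: a 1x1 convolution (here, per-channel weights `w`
    plus bias `b`) produces a sigmoid mask with one weight per pixel,
    which re-scales every spatial location of the feature map.

    features: (C, H, W) feature map.
    """
    logits = np.tensordot(w, features, axes=([0], [0])) + b  # (H, W)
    mask = sigmoid(logits)                                   # values in (0, 1)
    return features * mask[None, :, :]                       # broadcast over C

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 8))
out = pixel_attention(feats, w=rng.normal(size=4))
```

Because the mask lies in (0, 1), each pixel is attenuated according to its learned relevance; in the full SCPA block, this gate is paired with a self-calibrated convolution that widens the receptive field.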
In addition, the Breast Imaging-Reporting and Data System (BI-RADS) for CESM highlights the importance of morphological characteristics and local intensity patterns in regions with contrast uptake, which radiologists consider crucial to characterizing and interpreting these lesions. 28 In this context, SCPA offers a principled mechanism to emphasize diagnostically relevant enhancement patterns within contrast-uptake regions, aligning the learned representations with clinical reading criteria.
Network architectures
We integrated the SCPA block into two benchmark architectures: a U-Net and a Pix2Pix-based conditional GAN (cGAN), hereafter referred to as SCPA-UNet and SCPA-cGAN, respectively.
SCPA-UNet
Figure 3 illustrates the SCPA-UNet architecture, which builds upon a U-Net structure. 29 In this design, SCPA blocks are integrated into both the encoder and the decoder to explicitly guide the model in the extraction and refinement of diagnostically relevant features. In the encoder, the blocks enhance the network's ability to capture critical patterns from LE images; in the decoder, they prioritize the reconstruction of contrast-related structures. This dual-integration strategy enables the model to balance fine-grained local details with global contextual cues, supporting coherent synthesis of RC images in regions with and without contrast uptake.
General structure of the self-calibrated pixel attention (SCPA)-UNet architecture, in which SCPA blocks are integrated into both the encoder and the decoder of a U-Net-based framework. The objective is to extract relevant features and incorporate global contextual information through the pixel attention layer at each stage of the model.
SCPA-cGAN
The SCPA-cGAN utilizes the same attention-augmented generator structure (Figure 4). A PatchGAN discriminator 30 was used to assess image realism at a local level. Notably, SCPA blocks were intentionally not incorporated into the discriminator to avoid an architectural imbalance, which could otherwise lead to training collapse by allowing the discriminator to overpower the generator and provide non-informative gradients.
General structure of the SCPA-cGAN architecture, where SCPA blocks are integrated into the generator model, while the discriminator follows a PatchGAN scheme.
Architectural specifications
To ensure a fair comparison between models, both SCPA-UNet and SCPA-cGAN share a consistent block configuration:
Loss functions
Based on the implementation of the two proposed architectures, specific loss functions were defined for the optimization of each model. For the SCPA-UNet model, the mean absolute error (MAE) loss function was employed. This function computes the average of the absolute differences between the pixel values of the real RC image and those of the generated RC image, providing a direct measure of the overall discrepancy between the two images. Equation (1) defines the MAE function,

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| RC_i - \widehat{RC}_i \right|, \qquad (1)$$

where $RC_i$ and $\widehat{RC}_i$ denote the $i$-th pixel of the real and generated RC images, respectively, and $N$ is the total number of pixels.
For the SCPA-cGAN model, the cGAN loss function was employed, as defined in equation (2),

$$\mathcal{L} = \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{MAE}(G). \qquad (2)$$

The term $\mathcal{L}_{MAE}(G)$ is the pixel-wise MAE of equation (1), weighted by $\lambda$, which enforces structural fidelity between the generated and real RC images. Meanwhile, the adversarial component $\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right]$ drives the generator $G$ to synthesize RC images that the discriminator $D$ cannot distinguish from real acquisitions.
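To fix ideas, a toy NumPy sketch of a Pix2Pix-style generator objective is given below, combining a non-saturating adversarial term with the MAE term. The weighting `lam = 100` follows the original Pix2Pix paper and is an assumption here, as are the toy data.

```python
import numpy as np

def mae_loss(real, fake):
    """Mean absolute error between real and generated RC images."""
    return np.mean(np.abs(real - fake))

def generator_loss(d_fake, real, fake, lam=100.0, eps=1e-12):
    """Sketch of a Pix2Pix-style generator objective.

    d_fake: discriminator probabilities for generated patches, in (0, 1).
    The adversarial term pushes D(fake) toward 1, while the MAE term
    keeps the synthesis pixel-wise close to the real RC image.
    """
    adversarial = -np.mean(np.log(d_fake + eps))
    return adversarial + lam * mae_loss(real, fake)

real = np.zeros((4, 4))
fake = np.full((4, 4), 0.1)
loss = generator_loss(d_fake=np.array([0.8]), real=real, fake=fake)
```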
Data collection and image preprocessing
To evaluate the robustness of the proposed strategy against population-specific variability and heterogeneous acquisition protocols, we utilized two independent datasets from Colombia (SURA-CESM) and Italy (UCBM-CESM). These cohorts introduce critical variations in hardware, contrast agent types, and X-ray acquisition settings. Thus, we sought to simulate realistic heterogeneity in CESM practice and to test the feasibility of the proposed approach beyond a single-center setting.
SURA-CESM dataset
A dataset of CESM images was collected from a cohort of 89 patients (ages 25–71) treated at Ayudas Diagnósticas-Sura, Medellín, Colombia, between January 2020 and May 2022. In this study, an inclusion criterion was applied to select patients imaged with the same contrast agent and the same acquisition device to reduce variability associated with the imaging protocol. As a result, a total of 69 patients were included in the analysis. The contrast agent used was iopromide (300 mg iodine concentration), and the acquisition device was a Selenia Dimensions system. In total, 356 paired LE and RC images were available for analysis. The dataset exhibits inherent resolution variability: 157 pairs at 3328 × 2560 pixels and 199 pairs at 4096 × 3328 pixels, all captured with a 12-bit depth (0–4095 intensity range) prior to any preprocessing.
UCBM-CESM dataset
The UCBM-CESM dataset comprises 569 paired CESM images acquired from 204 patients (aged 31–90 years) at the Fondazione Policlinico Universitario Campus Bio-Medico in Rome between September 2021 and October 2022, using a GE Healthcare Senographe Pristina FFDM system. 16 For each patient, medical report data were extracted, including breast density classification according to the BI-RADS system. Breast density categories were distributed as follows: 37 patients corresponded to category A, 72 to category B, 73 to category C, and 22 to category D. Furthermore, malignant lesions were confirmed in 42 patients and benign lesions in 22 patients, all verified by biopsy. The remaining patients showed no evidence of tumor abnormalities.
Image preprocessing
In this study, a robust preprocessing pipeline for CESM images was implemented, designed to standardize the input images, reduce areas of irrelevant information, and ensure the preservation of relevant anatomical structures.
One of the key challenges in designing this pipeline was managing the wide dynamic range inherent to mammograms. To address this heterogeneity and enhance the visibility of clinically significant details, a windowing technique was applied. This method adjusts brightness and contrast levels to optimize the perception of anatomical structures within the breast. The window width and window level values required for this adjustment were extracted directly from the DICOM file metadata, ensuring consistent, image-specific parametrization tailored to the characteristics of each mammogram.
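A generic DICOM-style linear windowing function can be sketched as follows; this is an illustrative linear mapping, not necessarily the exact implementation used in the study.

```python
import numpy as np

def apply_window(img, window_center, window_width):
    """Linear windowing: map [C - W/2, C + W/2] to [0, 1], clipping
    values outside the window. In DICOM, C and W correspond to the
    WindowCenter (0028,1050) and WindowWidth (0028,1051) tags.
    """
    lo = window_center - window_width / 2.0
    hi = window_center + window_width / 2.0
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

# 12-bit mammogram intensities windowed around mid-range
img = np.array([0, 1000, 2000, 3000, 4095], dtype=float)
out = apply_window(img, window_center=2000, window_width=2000)
# values at/below 1000 map to 0, 2000 maps to 0.5, values at/above 3000 map to 1
```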
Mammography images often include large areas of black background that are irrelevant for identifying anatomical or pathological features. To improve focus on the relevant areas, each image was cropped to include only the breast using a bounding box. Moreover, due to the high spatial resolution of these images, the training and validation of the models involve high computational costs. To address this issue, the images were resized while preserving the original aspect ratio to ensure that the anatomical structures remained intact. After calculating the average aspect ratio of the cropped images, an approximate value of 1.23 was obtained for the SURA-CESM dataset and 2.01 for the UCBM-CESM dataset. As a result, final dimensions of 640 × 512 pixels were defined for SURA-CESM and 640 × 320 pixels for UCBM-CESM. This approach aims to eliminate inherent background biases that could influence the model while simultaneously reducing computational costs in image processing and synthesis.
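The breast-cropping step can be sketched as a threshold-based bounding box; the threshold and the near-zero-background assumption are illustrative, as the study does not detail the exact procedure.

```python
import numpy as np

def crop_to_breast(img, threshold=0.0):
    """Crop to the bounding box of non-background pixels, assuming the
    background is at or below `threshold` after windowing."""
    mask = img > threshold
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

img = np.zeros((10, 10))
img[2:7, 3:6] = 1.0                # "breast" surrounded by black background
crop = crop_to_breast(img)
# crop.shape == (5, 3); its aspect ratio (height/width) then guides the
# dataset-wide resize (e.g. 640 x 512 for SURA-CESM, 640 x 320 for UCBM-CESM)
```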
Additionally, to evaluate the most effective pixel intensity scaling method, we implemented an evaluation framework that included various normalization techniques: Min–Max 0–1, Min–Max −1 to 1,
Performance evaluation
Quantitative evaluation
Quantitative evaluation was performed by calculating four metrics that compare the generated RC images with their real counterparts under different criteria: the mean squared error (MSE), the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM), and the PercepRad perceptual metric.
The MSE calculates the pixel-to-pixel error between the generated and real images, averaging the squared intensity differences over all pixels.
We computed the PSNR to measure the level of noise that affects the quality of the generated images relative to the real ones; higher values indicate closer agreement.
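For images normalized to [0, 1], these two pixel-level metrics can be written compactly; this is the standard formulation, shown here for reference.

```python
import numpy as np

def mse(real, fake):
    """Mean squared pixel-to-pixel error."""
    return np.mean((real - fake) ** 2)

def psnr(real, fake, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher values indicate that the
    generated image is closer to the real one."""
    return 10.0 * np.log10(max_val ** 2 / mse(real, fake))

real = np.zeros((8, 8))
fake = np.full((8, 8), 0.01)       # uniform 0.01 error -> MSE = 1e-4
```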
The SSIM metric 33 calculates the similarity between two images in terms of luminance, contrast, and structure. This is represented in equation (6), in which the local means, variances, and covariance of the two images are combined with small stabilization constants.
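A simplified, single-window version of SSIM is sketched below for reference; the standard implementation applies this formula over a sliding Gaussian window, and the constants follow the usual K1 = 0.01, K2 = 0.03 choice (assumptions here).

```python
import numpy as np

def global_ssim(x, y, max_val=1.0):
    """Single-window SSIM combining luminance, contrast, and structure.

    C1 and C2 stabilize the divisions; the standard metric averages this
    quantity over local windows rather than computing it globally.
    """
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

rng = np.random.default_rng(1)
img = rng.random((16, 16))
```

Identical images yield a score of 1, and any luminance, contrast, or structural discrepancy lowers the score.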
The PercepRad metric is a perceptual metric that calculates the similarity between two images based on human perception. Unlike other perceptual metrics in the literature that use deep neural networks pre-trained on natural image datasets, the network used by this metric was pre-trained on RadImageNet, 34 a dataset consisting of medical images such as CT, MRI, and ultrasound. The input images are mapped to deep feature representations, and the distance between these representations quantifies their perceptual dissimilarity.
We applied these quantitative metrics in two evaluation strategies. The first evaluates the complete image to analyze the accuracy in synthesizing global characteristics, such as the shape and structure of the breast and the average intensity. The second focuses on the synthesis of the contrast agent response in local regions with contrast uptake. To implement this strategy, we extracted regions exhibiting contrast-enhanced lesion characteristics from the RC images in the test set, yielding a total of 31 regions. Additionally, an equal number of regions showing enhancement in the RC images but not corresponding to lesions or radiological findings (i.e. enhancement of fibroglandular tissue) were also extracted.
Statistical analysis
To determine the statistical significance of the performance differences between the proposed strategy and its baseline counterpart, a two-stage statistical analysis was conducted. First, a normality test was performed to verify the distribution of the data, followed by mean comparison tests using either parametric or non-parametric approaches, as appropriate.
The null hypothesis (H0) stated that there is no significant difference in performance between the proposed models and their baseline counterparts in terms of the evaluation metrics considered. Conversely, the alternative hypothesis (H1) suggested that the proposed models achieve significantly superior performance compared with their baselines.
In the first stage of the analysis, we assessed whether the differences in evaluation metrics between each pair of models followed a normal distribution. To this end, the Shapiro–Wilk test 35 was employed, given its robustness for small and moderate sample sizes. If the test yielded a p-value below 0.05, the normality assumption was rejected and a non-parametric test was subsequently applied.
When the normality assumption was satisfied (p > 0.05), a paired Student's t-test was used for the mean comparison; otherwise, the non-parametric Wilcoxon signed-rank test was applied.
The outcomes of the statistical tests were interpreted with respect to the stated hypotheses. When the p-value fell below the chosen significance level, H0 was rejected in favor of H1, indicating a statistically significant performance difference.
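Assuming SciPy is available, the two-stage procedure can be sketched as follows; the synthetic PSNR-like samples are purely illustrative.

```python
import numpy as np
from scipy import stats

def compare_models(metric_a, metric_b, alpha=0.05):
    """Two-stage paired comparison: Shapiro-Wilk on the differences, then
    a paired Student's t-test (if normal) or the Wilcoxon signed-rank
    test (otherwise). Returns the test used and its p-value."""
    diffs = np.asarray(metric_a) - np.asarray(metric_b)
    if stats.shapiro(diffs).pvalue > alpha:
        return "paired t-test", stats.ttest_rel(metric_a, metric_b).pvalue
    return "wilcoxon", stats.wilcoxon(metric_a, metric_b).pvalue

rng = np.random.default_rng(42)
baseline = rng.normal(28.0, 0.5, size=30)            # e.g. baseline PSNR values
proposed = baseline + rng.normal(0.5, 0.2, size=30)  # consistently higher PSNR
test_name, p_value = compare_models(proposed, baseline)
```

With a consistent positive shift of this size, either branch reports a significant difference, mirroring how H0 is rejected in favor of H1.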
Qualitative evaluation
A two-stage qualitative evaluation was performed by a radiologist with 15 years of experience interpreting breast images (caseload of 500 studies per year) to assess the clinical feasibility of using the synthesized RC CESM images for clinical decision-making.
In the first stage, the realism and clinical quality of the images were evaluated. For this purpose, the test set from both datasets was split at the patient level, assigning real RC images from 17 patients, as well as RC images generated by the SCPA-UNet and SCPA-cGAN models, also from 17 patients each. In this phase, only RC images in the cranio-caudal (CC) and MLO projections of both breasts were presented. For the evaluation, two questions were formulated: (i) whether the presented image corresponded to an acquisition with conventional devices or, alternatively, was computationally generated, and (ii) whether the presented image had sufficient technical quality to support a reliable diagnosis.
In the second stage, the consistency of contrast enhancement was evaluated. For this purpose, the test set from both datasets was again split at the patient level, using images generated by the SCPA-UNet and SCPA-cGAN models, corresponding to 25 patients each. In this phase, the complete studies of each patient were presented, including the LE images together with the real RC image for each projection and breast, in addition to the RC image generated by the corresponding model. For the evaluation, two questions were formulated on a 5-point scale: (i) whether the generated uptake was consistent with the real uptake and the visual characteristics of the image, and (ii) whether the generated uptake allowed for a clinical interpretation similar to that of the real uptake.
Experimental results
Experimental settings
To evaluate the performance of the proposed strategy—the incorporation of attention layers through the SCPA module into conventional conditional generative models—the SCPA block was integrated into a U-Net 29 architecture and into the generator of the Pix2Pix model. 31 This implementation was conducted to assess the contribution of the SCPA blocks in synthesizing the contrast agent response in regions with contrast uptake. The strategy was then compared against its respective baseline models (U-Net and Pix2Pix), as well as against the RiedNet architecture, 15 which was previously proposed for domain translation in medical imaging.
RiedNet is a deep learning architecture specifically developed for domain translation in medical imaging, with a particular focus on CESM images. Its design combines residual and inception blocks integrated into a symmetrical structure based on the U-Net model. This configuration enables RiedNet to preserve key morphological details during domain transformation, maximizing its effectiveness in medical image synthesis tasks and ensuring the quality and relevance of the generated images.
Pix2Pix is a cGAN model designed for supervised image-to-image translation across different domains, using paired datasets. This supervised model learns a direct mapping between an input domain and an output domain by employing not only the adversarial loss characteristic of GANs but also a similarity-based loss to ensure that the generated images maintain high structural fidelity with respect to the target images.
Both the baseline models and the proposed strategy underwent heuristic testing with 150, 200, and 250 training epochs within the experimental framework. It was observed that 200 epochs were sufficient for model convergence, using a batch size of 8. We employed the Adam optimizer 38 with a learning rate (
Quantitative comparison
Tables 1 and 2 present the quantitative evaluation results of the proposed strategy compared with the baseline models on the test RC images. Table 1 reports the results for complete images, whereas Table 2 shows the results for the extracted regions with contrast enhancement across both datasets. For each metric, the values represent the mean performance across all test images, accompanied by the corresponding standard deviations. The U-Net-based models (highlighted in blue) were compared among themselves, as were the cGAN-based models (highlighted in gray). Within both groups, the best-performing model for each metric is highlighted in bold.
Quantitative results of pixel-level and perceptual similarity assessment on whole images across both datasets.
MSE: mean squared error; PSNR: peak signal-to-noise ratio; SSIM: structural similarity; PercepRad: perceptual radiologic metric; SCPA: self-calibrated pixel attention; cGAN: conditional generative adversarial network.
The values for each metric represent the mean performance over all test images, accompanied by the corresponding standard deviations. The best-performing model for each metric is highlighted in bold.
Quantitative results of pixel-level and perceptual similarity assessment on regions with contrast uptake across both datasets.
MSE: mean squared error; PSNR: peak signal-to-noise ratio; SSIM: structural similarity; PercepRad: perceptual radiologic metric; SCPA: self-calibrated pixel attention; cGAN: conditional generative adversarial network.
The values for each metric represent the mean performance across all test images, accompanied by their respective standard deviations. The best-performing model for each metric is highlighted in bold.
As presented in Table 1, in the SURA-CESM dataset, the proposed SCPA-UNet model achieves the best overall performance, yielding a reduction in MSE compared with U-Net and RiedNet. This improvement corresponds to a relative decrease of 10.5% with respect to U-Net and 19% with respect to RiedNet. Consistently, the PSNR increases to 28.30 dB, outperforming U-Net by 1.8% and RiedNet by 4.1%. In terms of SSIM, SCPA-UNet reaches 0.757, representing a relative improvement of 1.3% over U-Net and 1.5% over RiedNet. Finally, for the perceptual metric (PercepRad), SCPA-UNet reduces the value to 1.173, which corresponds to a 3% improvement compared with both reference models. Meanwhile, the proposed SCPA-cGAN model achieved competitive performance compared with Pix2Pix. A relative improvement of 1.5% was observed in SSIM, while PSNR showed a gain of 1.1%. However, Pix2Pix achieved better performance than SCPA-cGAN on the PercepRad metric, with SCPA-cGAN showing a relative deterioration of 10.3%.
In the UCBM-CESM dataset, the improvements were even more pronounced. The MSE of SCPA-UNet (0.0006) represented a relative reduction of 14.3% compared with both U-Net and RiedNet (0.0007 each). The PSNR increased to 33.63 dB, outperforming U-Net by 1.8% and RiedNet by 2.4%. The SSIM reached the highest value, with a relative gain of 0.1% over U-Net and 2.6% over RiedNet. Regarding PercepRad, SCPA-UNet showed comparable performance to U-Net but superior performance to RiedNet. The behavior of the cGAN models was consistent, with SCPA-cGAN reducing MSE by 14.3% compared with Pix2Pix and increasing PSNR to 33.07 dB, with a relative gain of 1.7%. Although SSIM was slightly lower (0.869 vs. 0.871, a difference of −0.2%), the perceptual metric again revealed a clear advantage, corresponding to a relative improvement of 3.8%.
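For reference, the relative improvements quoted in this section follow the usual definition of relative change; for example, the MSE drop from 0.0007 to 0.0006 corresponds to roughly 14.3%.

```python
def relative_change(new, ref):
    """Relative change in percent with respect to the reference value."""
    return 100.0 * (new - ref) / ref

# MSE drop reported for the UCBM-CESM whole-image comparison
drop = relative_change(0.0006, 0.0007)   # about -14.3 (i.e. a 14.3% reduction)
```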
Table 2 presents the results obtained when restricting the analysis to the ROIs, which enables a more precise evaluation of the models’ ability to preserve diagnostically relevant structures. In the SURA-CESM dataset, the SCPA-UNet model achieved the best performance within the U-Net family. In terms of MSE, the value decreased to 0.0062, representing a relative reduction of 12.7% compared with U-Net, and 25.9% compared with RiedNet. Consistently, the PSNR increased to 22.97 dB, surpassing U-Net by 3.3% and RiedNet by 6.7%. For SSIM, a value of 0.610 was achieved, corresponding to relative improvements of 4.1% and 9.9%, respectively. Regarding the perceptual metric, the score was slightly higher than U-Net (1.416; a relative deterioration of 3.2%) but lower than RiedNet (1.602; a relative improvement of 8.7%). For the adversarial models, SCPA-cGAN showed notable improvements over Pix2Pix. The MSE achieved a relative reduction of 9.6%, while the PSNR increased from 22.08 to 22.49 dB (a gain of 1.9%). In terms of SSIM, an improvement of 3.6% was observed. Regarding the perceptual metric, Pix2Pix outperformed SCPA-cGAN by 7.6%, possibly because this metric is more sensitive to local variations in fine texture, which may favor models that emphasize high-frequency details.
In the UCBM-CESM dataset, the SCPA-UNet model achieved the lowest MSE (0.0065), representing a relative improvement of 8.5% compared with both U-Net (0.0071) and RiedNet (0.0071). The PSNR increased to 24.54 dB, with gains of 1.5% over U-Net and 1.1% over RiedNet. In SSIM, it reached the highest value (0.727), improving by 0.7% compared with U-Net and 0.6% compared with RiedNet. For the PercepRad metric, a value of 0.962 was obtained, representing a relative improvement of 6.0% over U-Net and 9.9% compared with RiedNet. For the adversarial models, SCPA-cGAN exhibited similar behavior. The MSE decreased to 0.0061, representing a 16.4% improvement over Pix2Pix. PSNR also increased to 24.26 dB (+2.1%), and SSIM improved by 2.2%. In terms of PercepRad, SCPA-cGAN achieved a value of 0.667, a relative improvement of 8.5% compared with Pix2Pix.
The results of the statistical tests are presented in Tables 3 and 4. If the data followed a normal distribution, a paired Student's t-test was used; otherwise, the Wilcoxon signed-rank test was applied. In the tables, triple symbols (+++ or ***) indicate significant differences at p < 0.001, double symbols (++ or **) at p < 0.01, single symbols (+ or *) at p < 0.05, and (−) indicates that the difference is not significant.
Statistical significance of the evaluation metrics (PSNR, SSIM, and PercepRad) between the proposed strategy and its baseline counterparts on the SURA-CESM dataset.
PSNR: peak signal-to-noise ratio; SSIM: structural similarity; PercepRad: perceptual radiologic metric; SCPA: self-calibrated pixel attention; cGAN: conditional generative adversarial network; ROI: region of interest.
The symbols used are as follows: (+) indicates the Wilcoxon test, (*) indicates the paired Student's t-test; (+++ or ***) denote statistical significance at p < 0.001, (++ or **) at p < 0.01, (+ or *) at p < 0.05, and (−) indicates that the difference is not significant.
Statistical significance of the evaluation metrics (PSNR, SSIM, and PercepRad) between the proposed strategy and its baseline counterparts on the UCBM-CESM dataset.
PSNR: peak signal-to-noise ratio; SSIM: structural similarity; PercepRad: perceptual radiologic metric; SCPA: self-calibrated pixel attention; cGAN: conditional generative adversarial network; ROI: region of interest.
The symbols used are as follows: (+) indicates the Wilcoxon test and (*) indicates the paired Student's t-test; (+++ or ***) denote statistical significance at p < 0.001, (++ or **) at p < 0.01, and (+ or *) at p < 0.05, while (−) indicates a non-significant difference.
In the SURA-CESM dataset, the SCPA-UNet model showed consistent improvements over U-Net and RiedNet, achieving significant reductions in MSE and increases in PSNR and SSIM, both at the whole-image level and within ROIs. Statistical tests confirmed that these improvements were highly significant.
In the UCBM-CESM dataset, the results present a more nuanced scenario. Although the SCPA-UNet model achieved the best mean values in MSE, PSNR, and SSIM, the differences with respect to the U-Net were not always statistically significant. In particular, the significance analysis revealed that the improvements over U-Net in PSNR and SSIM were only marginal and did not consistently reach statistical significance.
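The test-selection procedure described above can be sketched as follows. This is a minimal illustration assuming SciPy; the Shapiro-Wilk test stands in for whichever normality check the study used (the excerpt does not name it), and the per-image metric values are hypothetical.

```python
from scipy import stats

def paired_significance(x, y, alpha=0.05):
    """Choose the paired test: Student's t when the paired differences pass
    a Shapiro-Wilk normality check, Wilcoxon signed-rank otherwise.
    Returns the test name and its two-sided p-value."""
    diffs = [a - b for a, b in zip(x, y)]
    if stats.shapiro(diffs).pvalue > alpha:
        return "paired t-test", stats.ttest_rel(x, y).pvalue
    return "Wilcoxon", stats.wilcoxon(x, y).pvalue

# Hypothetical per-image PSNR values for two models on the same test set:
proposed = [24.1, 25.3, 23.8, 24.7, 25.0, 24.4, 23.9, 24.8, 25.2, 24.5]
baseline = [23.0, 24.1, 22.9, 23.6, 24.2, 23.3, 23.0, 23.7, 24.0, 23.4]
name, p = paired_significance(proposed, baseline)
```

Because the comparison is paired (each model is evaluated on the same images), the paired variants of both tests are the appropriate choice.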
Qualitative comparison
Figures 5 and 6 each present three test cases, comparing the real RC images with those generated by the baseline models and the proposed strategy. In these figures, the first, third, and fifth rows show the test images alongside their corresponding synthesized images, while the second, fourth, and sixth rows display the difference maps between each generated image and the real RC image. The first and second columns depict the LE images used as input to the models together with the real RC images, and the remaining columns present the RC images generated by the baseline models and by the proposed strategy, respectively.

Visual comparison of generated images in the SURA-CESM dataset. Top rows show low-energy, reference recombined, and generated images from baseline (U-Net, RiedNet, and Pix2Pix) and proposed (SCPA-UNet and SCPA-cGAN) models. Bottom rows display absolute error maps, with higher values (yellow) indicating larger discrepancies. Red boxes highlight regions of interest used to assess structural detail preservation and contrast uptake performance.

Visual comparison of generated images in the UCBM-CESM dataset. Top rows show low-energy, reference recombined, and generated images from baseline (U-Net, RiedNet, and Pix2Pix) and proposed (SCPA-UNet and SCPA-cGAN) models. Bottom rows display absolute error maps, with higher values (yellow) indicating larger discrepancies. Red boxes highlight regions of interest used to assess structural detail preservation and contrast uptake performance.
Three representative cases (A, B, and C) with different levels of contrast uptake were selected. In the first case (A), a region with contrast uptake is visible in both the LE and RC images. The second case (B) corresponds to a patient with high breast density, in which the lesion is masked in the LE image; in the RC image, the fibroglandular tissue is suppressed and the lesion is highlighted more clearly. Finally, the third case (C) shows high breast density in the LE image but no evidence of contrast uptake in the RC image.
These cases were selected to evaluate the ability of the models to accurately represent the contrast uptake response of lesions, regardless of their visibility in the LE image. In addition, they allow for analysis of the models’ ability to avoid generating false positives with contrast uptake.
In both datasets, it can be observed that the U-Net, RiedNet, and SCPA-UNet models generate reconstructions with a certain degree of smoothing, both in the global image and in the regions with contrast uptake. This effect is expected, since the optimization of these models is based exclusively on minimizing pixel-level discrepancies. Nevertheless, among them, SCPA-UNet stands out for its ability to reconstruct the ROIs with greater accuracy.
In contrast, the Pix2Pix and SCPA-cGAN models do not exhibit the smoothing effect mentioned above. This is because their optimization is not based exclusively on a pixel-level loss function but also incorporates a component that optimizes the realism of the generated images. Nevertheless, SCPA-cGAN stands out in the synthesis of morphological features and in the reproduction of lesion enhancement patterns, showing lower discrepancies, as evidenced in the difference maps.
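This trade-off can be made concrete with the standard Pix2Pix-style generator objective, in which a pixel-wise L1 term (which rewards averaged, hence smoothed, predictions) is combined with an adversarial term that pushes outputs toward realistic images. The sketch below is illustrative only; the weighting lam=100 is the default from the original Pix2Pix formulation, not a value reported in this study.

```python
import numpy as np

def generator_loss(disc_scores_on_fake, fake_img, real_img, lam=100.0):
    """Pix2Pix-style generator objective: an adversarial term (binary
    cross-entropy pushing the discriminator's scores on fakes toward
    'real') plus a lambda-weighted pixel-wise L1 term.  Pure pixel
    losses reward the 'average' of plausible outputs, which is what
    produces the smoothing seen in the U-Net-family results."""
    adv = -np.mean(np.log(disc_scores_on_fake + 1e-8))  # fool the discriminator
    l1 = np.mean(np.abs(real_img - fake_img))           # pixel fidelity
    return adv + lam * l1
```

Models trained with the L1 (or L2) term alone, such as U-Net, RiedNet, and SCPA-UNet, have no incentive to reproduce realistic high-frequency texture, whereas the adversarial term restores it at the cost of possible artifacts.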
Figure 6 presents three cases with different contrast-uptake characteristics from the UCBM-CESM dataset. As illustrated in this figure, the breast features differ from those in the SURA-CESM dataset: the UCBM-CESM RC images retain fewer features of normal breast tissue, which may be attributable to differences in the acquisition protocols between the two datasets.
Despite this difficulty, the proposed SCPA-UNet and SCPA-cGAN models reproduce the lesion characteristics more accurately, as can be observed in the difference maps. In contrast, the baseline U-Net, RiedNet, and Pix2Pix models fail to synthesize this lesion, highlighting the limitations of these approaches in reproducing regions with contrast uptake.
Case C in both datasets corresponds to a scenario where the LE image shows high breast density but the RC image shows no contrast uptake. We observed that the models introduce certain discrepancies when synthesizing normal breast tissue, generating regions of uptake that are not present in the real image. This highlights the technical challenge that such cases pose for generative models. Overall, these qualitative examples illustrate both the ability of SCPA-based models to recover subtle uptake patterns and their current limitations in very dense or non-enhancing cases, in line with the feasibility scope of this study.
Attention score analysis
Figure 10 shows the attention scores corresponding to pixel-attention in the final layer of the proposed strategy. The first three columns of the figure present the LE images, the real RC images, and the images generated by the model, respectively. The next four columns display the attention maps corresponding to the different attention layers of the model. Each row represents a set of images, including an enlarged crop of an ROI (marked in red and black), highlighting areas with contrast agent uptake.
Attention maps 0 and 1 indicate that the model attends to the background tissue of the breast and to the black background of the image, suggesting that it directs its attention to the largest areas within the image and identifies global features. In the marked regions, the model concentrates on the tissue surrounding the areas with contrast-agent response.
Attention maps 2 and 3 show a distribution of attention similar to attention map 1. However, attention map 2 exhibits a notable concentration on the fibroglandular tissue with contrast-agent response, suggesting that the model has learned to identify specific features associated with potential anomalies in this tissue and considers them essential for generating accurate and detailed synthetic images. Although attention map 3 follows a pattern similar to attention map 1, its highest attention is concentrated in the brighter regions of the RC image, indicating a focus on hyperintense areas relative to the parenchymal background.
The analysis of the attention scores demonstrates that the model has learned to identify and emphasize important features in regions with contrast agent response, generating more accurate and detailed images in both global and local characteristics. This is essential for the detection and analysis of anomalies in CESM studies.
Clinical feasibility evaluation
In the first phase, the radiologist was asked to distinguish between real and synthetic RC images, as well as to evaluate their technical quality. Figure 7 presents the confusion matrices corresponding to the SCPA-UNet and SCPA-cGAN models, respectively. For SCPA-UNet, 10 out of 17 patients were correctly identified as synthetic, while 3 were misclassified as real and 4 were evaluated as uncertain. For the real RC images, 11 out of 17 were correctly classified, 1 was considered synthetic, and 4 were evaluated as uncertain. These results suggest that, although the model generates images with realistic characteristics, the radiologist was able to discriminate between real and synthetic acquisitions in most cases, albeit with a non-negligible level of uncertainty. According to the expert radiologist, this uncertainty is associated with the smoothing effect observed in the images, a phenomenon expected since the loss function employed in U-Net-based models tends to induce such an effect.
Confusion matrix for realism assessment between images generated by the SCPA-UNet (left) and SCPA-cGAN (right).
In the case of SCPA-cGAN, a similar pattern was observed: 10 out of 17 synthetic cases were correctly classified, while 6 were incorrectly considered real and only 1 was evaluated as uncertain. Similarly, 11 out of 17 real cases were correctly classified, 1 was labeled as synthetic, and 4 were evaluated as uncertain. Compared with SCPA-UNet, SCPA-cGAN produced images that were more frequently perceived as real, suggesting higher perceptual realism, albeit accompanied by slightly lower discriminability. This behavior can be explained by the fact that, as a generative model, adversarial loss contributes significantly to the realism of the images. Nevertheless, the expert radiologist noted the presence of noise-like artifacts (white pixels), which in some cases motivated the classification of an image as computationally generated. This phenomenon could be attributed to the limited receptive field of the PatchGAN architecture used in the SCPA-cGAN discriminator.
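The receptive-field remark can be quantified. For a stack of convolutions, the input-pixel extent seen by one discriminator output unit follows from walking backward through the (kernel, stride) pairs. The configuration below is the standard 70 x 70 PatchGAN from the original Pix2Pix work, used here as an assumed stand-in since the excerpt does not detail the SCPA-cGAN discriminator.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit of a conv
    stack, given (kernel_size, stride) per layer, computed by walking
    backward from a single output unit."""
    rf = 1
    for k, s in reversed(layers):
        rf = rf * s + (k - s)
    return rf

# Standard Pix2Pix PatchGAN: 4x4 convolutions with strides 2, 2, 2, 1, 1.
print(receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]))  # -> 70
```

Each output unit therefore judges realism over only a local patch, which helps explain why isolated white-pixel artifacts can survive adversarial training: they barely perturb any single patch score.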
In the second phase, the evaluation focused on determining whether the synthesized uptake patterns were consistent with the true enhancement and allowed for appropriate clinical interpretation. Figure 8 illustrates the radiologist's assessment in 50 patients. Overall, 80% of the synthesized studies were rated as either moderately consistent or highly consistent, with 32% (16 out of 50) classified as “highly consistent” and 28% (14 out of 50) as “moderately consistent.” Only 20% of the cases were rated as inconsistent (8%) or highly inconsistent (12%), reflecting that most generated images exhibited a plausible distribution of enhancement.
Bar chart of the consistency level of contrast uptake synthesis in the images generated by the proposed strategy.
The analysis of the 5-point similarity scale (Figure 9) reinforced this observation: 15 patients (30%) received the highest similarity score (level 5), 12 patients (24%) were rated at level 3, and 9 patients (18%) at level 4, whereas only 14 patients (28%) were classified in the lowest similarity levels (1 and 2). These findings demonstrate that, in most cases, the synthesized images were visually consistent with the true enhancement and amenable to clinical interpretation.
Bar chart of the similarity level of the synthesized uptake with respect to clinical interpretability.
These results show that the radiologist was able to discriminate between real and synthetic images in most cases, owing either to the smoothing effect in the images generated by SCPA-UNet or to the small white-pixel artifacts in the images generated by SCPA-cGAN. Even so, the consistency of contrast uptake was largely preserved, particularly at the higher similarity levels. Moreover, the radiologist noted that these limitations did not affect medical interpretation, supporting the potential of SCPA-based models as a promising alternative for synthesizing contrast uptake in CESM.
Discussion
This study evaluates the impact of integrating SCPA into conditional generative models for virtual contrast synthesis in CESM images. By shifting the focus from global image quality to regional fidelity, the proposed strategy prioritizes the preservation of diagnostically relevant features in contrast-uptake zones. In this feasibility setting, we deliberately build on established U-Net- and Pix2Pix-based baselines reported in the current CESM virtual-contrast state-of-the-art.
The superior performance of SCPA-based models, particularly in ROI-focused metrics, indicates that attention mechanisms help to better model localized iodine attenuation and lesion-level signal. As shown in Table 2 and in the statistical analysis summarized in Tables 3 and 4, SCPA-integrated models achieved significant reductions in MSE and improvements in PSNR and SSIM within contrast-uptake zones compared with baseline architectures. The ability of the SCPA block to generate 3D attention maps allows the model to recalibrate features spatially and across channels, effectively suppressing background noise. This behavior is visually evident in Figure 10, where the attention maps prioritize high-frequency details in lesions while largely ignoring unenhanced parenchymal tissue.

Distribution of attention maps generated by the proposed SCPA models. The figure displays the low-energy images, the real RC images, and the generated images, along with the attention maps corresponding to the final attention layer in the SCPA architecture. These maps highlight the areas where the model focuses its attention.
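The spatial recalibration behavior visible in these maps can be illustrated with a deliberately simplified pixel-attention gate. This sketch is not the SCPA block itself (which also includes a self-calibration branch and learned convolutions); it shows only the gating principle, with random values standing in for learned weights and features.

```python
import numpy as np

def pixel_attention(features, w):
    """Minimal pixel-attention gate: a 1x1-conv-like projection collapses
    channels into one score per spatial location, a sigmoid maps scores
    to [0, 1], and the resulting map rescales every channel, letting the
    network amplify uptake regions and suppress background.
    Shapes: features (C, H, W), w (C,)."""
    scores = np.tensordot(w, features, axes=1)   # (H, W) attention logits
    attn = 1.0 / (1.0 + np.exp(-scores))         # sigmoid gate in [0, 1]
    return features * attn[None, :, :], attn     # recalibrated features, map

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))         # toy feature tensor
out, attn = pixel_attention(feats, rng.standard_normal(8))
```

Visualizing `attn` for a trained model yields maps such as those in Figure 10, where high values coincide with contrast-uptake regions.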
The third row in Figure 5 is of particular interest: the LE image displays extremely dense fibroglandular tissue, whereas the RC image reveals a lesion with high contrast uptake. This scenario is highly challenging for image-to-image translation models, as dense tissue can mask potential lesions. In this case, the RiedNet and Pix2Pix models failed to detect the hidden lesion, and the U-Net model could not accurately synthesize its shape and enhancement pattern. In contrast, the proposed strategy demonstrated superior performance in synthesizing the characteristics of this hidden lesion. This suggests that SCPA layers allow the model to focus on regions with high contrast uptake during training, even when these regions are obscured by dense fibroglandular tissue, which is particularly relevant in patients with high breast density.
Table 1 presents the whole-image evaluation results and shows that the proposed strategy achieves the best overall performance in most metrics. However, the differences with respect to the reference models are not statistically significant at this level on the UCBM-CESM dataset, suggesting that global image characteristics (such as overall breast shape, mean and standard deviation of intensity, and global texture) are reasonably well synthesized by baseline architectures. Taken together with the ROI-focused metrics (Table 2), these findings indicate that the added value of SCPA is more evident when analyzing diagnostically relevant regions than when averaging over the entire image, which is expected given that contrast-uptake regions usually represent only a small fraction of the total breast area. Moreover, the radiological assessment showed that the proposed strategy consistently synthesizes contrast-agent uptake while maintaining an adequate correspondence with the enhancement patterns observed in real images (Figures 8 and 9). Most synthesized studies were rated as moderately or highly consistent, and the majority of images were judged to have sufficient technical quality for diagnostic use, even though the radiologist was often able to distinguish real from synthetic images (Figure 7). In other words, the expert reader could often recognize synthetic images, but most studies were considered visually consistent enough to support clinical interpretation, which demonstrates the feasibility of moving toward virtual contrast generation in CESM studies.
Despite these promising results, this study has several limitations that should be considered. Deep learning models achieve better performance when trained on large datasets containing sufficient samples to represent the biological variability of the phenomenon; therefore, the present study is constrained by a relatively limited number of cases and by the retrospective nature of data collection. Although the use of two independent datasets with different acquisition characteristics (SURA-CESM and UCBM-CESM) provides initial evidence of robustness, both remain modest compared with datasets commonly used in other imaging domains, which restricts clinical generalizability. In addition, the preliminary assessment performed by a single highly experienced breast radiologist should be regarded as exploratory and may be subject to observer bias.
Accordingly, future studies should expand the dataset size and ensure an adequate representation of the inherent variability in breast images, including breast composition, parenchymal distribution, and tumor morphology, as well as acquisition protocol-related factors such as contrast-agent dose and technical imaging parameters. Moreover, multi-reader studies involving radiologists with different levels of experience are required to more objectively assess interobserver consistency and the potential impact on diagnostic interpretation. From an architectural perspective, this study provides a modest modification by integrating SCPA into established conditional generative models used for CESM synthesis. Future work should explore newer conditional translation families, including conditional diffusion models, transformer-based generators (e.g. TransUNet and Swin-UNet), and multiscale strategies (e.g. Laplacian-pyramid GANs or StyleGAN-inspired variants), which could better capture fine detail, long-range dependencies, and global–local consistency of images. A systematic comparison against these more complex architectures therefore lies beyond the scope of the present study, and it could be better addressed in future work once larger multi-center CESM datasets become available.
In parallel, future work should address integration into clinical workflows. Radiology departments require AI tools that can interface with PACS/RIS systems, operate with minimal latency, and provide output that is interpretable to clinicians. In this regard, the attention maps generated by our models could represent a valuable interpretability feature, supporting radiologists’ trust in AI-generated images and aligning with ongoing efforts to promote explainability in medical AI. 39 In addition, regulatory approval will be indispensable. AI-based systems intended for diagnostic purposes are categorized as Software as a Medical Device and therefore require validation through regulatory frameworks such as Food and Drug Administration clearance in the United States or CE marking in Europe. These processes involve rigorous testing across diverse populations and clinical settings, as well as demonstrating safety, efficacy, and reproducibility, and have proven to be a major bottleneck for the translation of AI tools into clinical practice. 39 Therefore, in the near term, AI-generated CESM images are most realistically positioned as adjunctive decision-support tools in human-in-the-loop settings, helping radiologists to explore virtual contrast scenarios while preserving clinical responsibility and patient safety.
Conclusions
This study establishes a robust framework for the synthesis of virtual contrast in CESM by integrating SCPA into conditional generative architectures. Our findings demonstrate that prioritizing regional fidelity over global similarity metrics significantly improves the morphological and functional accuracy of synthesized lesions, addressing a fundamental limitation of conventional CNN-based image-to-image translation models in this setting.
Across two independent CESM datasets, SCPA-based models achieved better quantitative performance, particularly in ROI-focused metrics, and an expert breast radiologist judged that the generated images offered sufficient realism and technical quality for clinical interpretation in most cases. At the same time, the persistence of occasional artifacts and subtle discrepancies in enhancement patterns, together with the limited and heterogeneous nature of the available datasets, means that these results should be interpreted as evidence of technical and clinical feasibility rather than as validation for immediate diagnostic replacement of contrast-enhanced acquisitions.
Future adoption in clinical workflows will depend on large-scale, multi-reader trials and the integration of explainable AI features to ensure diagnostic safety. Ultimately, this research bridges the gap between generative AI theory and oncological application, moving closer to a safer, more accessible paradigm for breast cancer diagnosis.
Footnotes
Acknowledgements
None.
ORCID iDs
Ethical approval
This study represents one of the research directions within project P20213 approved by the Ethics Committee of the Instituto Tecnológico Metropolitano (Act No. 02, June 2020). Images used for the training and validation models were retrospectively obtained from patients who came for clinical studies as part of the health programs. Written informed consent was obtained from all participants as part of the clinical practice of “Ayuda Diagnósticas-Sura.”
Author contributorship
KOC, RDF, CMB, MLH, and GMD contributed to the conceptualization of the proposed model and the experimental design. The methodology was implemented by KOC and RDF, while the validation of the results was carried out by GMD, MLH, and RDF. The formal analysis was performed by KOC, who was also responsible for the visualization of the results. The original draft was written by KOC, and the review and editing were performed by GMD, CMB, RDF, and MLH. The study was supervised by GMD, who also managed the project administration and secured the acquisition of funding. All authors have read and approved the final version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Instituto Tecnológico Metropolitano (research grant P20213), the Institución Universitaria Pascual Bravo (research grant CE-007-2020), Imágenes Diagnósticas SURA, and the Agencia de Educación Postsecundaria de Medellín - SAPIENCIA.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
