Abstract
Introduction
Plasmids are key drivers of bacterial evolution: they mediate horizontal gene transfer 1 and play crucial roles in virulence and antibiotic resistance, 2 posing significant clinical challenges. In this context, plasmid copy number is an important evolutionary parameter that influences gene dosage, expression levels, fitness costs, and stability. 3 Indeed, high-copy-number plasmids can amplify resistance or virulence phenotypes, 4 while low-copy-number plasmids may exhibit tighter host control and long-term persistence with minimal burden. 5,6 The widespread adoption of sequence-based methods, in both research and clinical settings, has led to the development of several tools 7-9 to identify plasmids from both short and long reads, but the estimation of plasmid copy number (PCN) remains underexplored, and no standardized tool exists to perform it. Consequently, few genomic studies report PCN alongside plasmid identification unless it is measured by standard quantitative polymerase chain reaction (qPCR), despite its importance for understanding plasmid stability, host burden, and gene dosage effects. 3 To address this gap, we developed Plasmid Copy Number Estimator (PCNE), a tool designed to estimate PCNs directly from short-read whole-genome sequencing (WGS) data. The PCNE automates PCN estimation by integrating established bioinformatic tools into a cohesive, multi-step pipeline, making it accessible to a wide range of users. It requires standard inputs: a draft or complete genome assembly and paired-end short reads. The PCNE aims to complement well-established plasmid identification tools by providing a standardized yet versatile approach to PCN estimation.
Methods
Plasmid copy number estimator overview
The PCNE accepts 2 primary input configurations, referred to as mode 1 and mode 2. Mode 1 requires separate chromosome and plasmid FASTA files, while mode 2 requires the whole assembled genome in FASTA format and a file listing the plasmid contigs, with the option to also provide a list of chromosomal contigs. These 2 configurations ensure straightforward integration with the common outputs of plasmid identification tools. In both cases, paired-end short reads are required.
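The mode 2 input can be pictured as a partition of the assembly by contig name. The sketch below is illustrative only and is not taken from the PCNE source: the function names `read_fasta` and `split_contigs` are hypothetical, and treating every unlisted contig as chromosomal when no chromosome list is supplied is an assumption made for this example.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {contig_id: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]      # contig ID is the first token of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {contig: "".join(parts) for contig, parts in records.items()}


def split_contigs(records, plasmid_ids, chromosome_ids=None):
    """Split an assembly into chromosomal and plasmid contigs (hypothetical helper)."""
    plasmid_ids = set(plasmid_ids)
    missing = plasmid_ids - set(records)
    if missing:
        raise ValueError(f"plasmid contigs not found in assembly: {sorted(missing)}")
    if chromosome_ids is None:
        # Assumption for this sketch: unlisted contigs are chromosomal.
        chromosome_ids = [c for c in records if c not in plasmid_ids]
    chromosome = {c: records[c] for c in chromosome_ids}
    plasmids = {c: records[c] for c in records if c in plasmid_ids}
    return chromosome, plasmids


fasta_text = """>contig_1 chromosome
ACGTACGT
ACGT
>contig_2
GGCC
>contig_3
ATAT
"""
records = read_fasta(fasta_text)
chromosome, plasmids = split_contigs(records, plasmid_ids=["contig_2"])
print(sorted(chromosome), sorted(plasmids))
```

Passing an explicit `chromosome_ids` list mirrors the optional chromosomal contig list and excludes any contig that belongs to neither group.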
Normalization and bias correction
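The PCNE provides 1 normalization method and an optional correction feature to ensure precise PCN estimates. The default normalization is whole-chromosome based: it calculates the ratio between the median plasmid coverage depth ($\tilde{D}_{\mathrm{plasmid}}$) and the median chromosomal coverage depth ($\tilde{D}_{\mathrm{chr}}$):

$$\mathrm{PCN} = \frac{\tilde{D}_{\mathrm{plasmid}}}{\tilde{D}_{\mathrm{chr}}}$$

where $\tilde{D}_{\mathrm{plasmid}}$ is the median coverage depth of the plasmid and $\tilde{D}_{\mathrm{chr}}$ is the median coverage depth of the chromosome.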
This approach was selected as the default due to its simplicity and robustness, as it utilizes all available chromosomal data to establish a comprehensive coverage baseline.
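As a minimal sketch of the whole-chromosome normalization, the ratio of medians can be computed as follows. The function name `estimate_pcn` and the depth values are hypothetical and chosen only for illustration; they are not PCNE code or data from this study.

```python
from statistics import median


def estimate_pcn(plasmid_depths, chromosome_depths):
    """Return the ratio of median plasmid depth to median chromosomal depth."""
    baseline = median(chromosome_depths)
    if baseline <= 0:
        raise ValueError("chromosomal median depth must be positive")
    return median(plasmid_depths) / baseline


# Illustrative per-window depths (made-up numbers).
chromosome_depths = [48, 50, 52, 49, 51]
plasmid_depths = [148, 152, 150]
pcn = estimate_pcn(plasmid_depths, chromosome_depths)
print(pcn)
```

Because both terms are medians, a few windows with aberrant coverage (for example, collapsed repeats) have little effect on the estimate.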
GC content variations are known to introduce systematic biases in sequencing coverage on short-read sequencers, 10 especially when GC content exceeds the 45% to 65% range. 11 This systematic bias can distort coverage depth and, consequently, affect the accuracy of PCN estimates. To mitigate this bias, PCNE implements an optional GC-bias correction module. When applied, it performs a Locally Estimated Scatterplot Smoothing (LOESS) regression to model the relationship between GC content and the observed mean depth of chromosomal windows (Supplemental Figure S1). Unless otherwise specified (--gc-frac), the LOESS span parameter α (also called the smoothing fraction) is selected automatically by k-fold cross-validation with k = 5. The best α is the one with the lowest Mean Squared Error (MSE). The LOESS model is then used to predict the expected depth from GC content, and this expected depth is used to correct the observed depth of all windowed data, using the following equation:
with
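The GC-bias correction described in this section can be sketched in plain Python. This is a simplified stand-in and not the PCNE implementation: `loess_predict` is a from-scratch local linear fit with tricube weights, the candidate span values are arbitrary, and the rescaling `observed * median(chromosomal depth) / expected` is an assumed form of the correction, chosen because it keeps depths on their original scale. All function names and data are hypothetical.

```python
import random
from statistics import median


def loess_predict(x_train, y_train, x_new, frac):
    """Local linear fit with tricube weights, evaluated at x_new."""
    n = len(x_train)
    k = max(2, min(n, round(frac * n)))           # neighbours inside the span
    dists = sorted(abs(x - x_new) for x in x_train)
    h = dists[k - 1] * 1.0001 + 1e-12             # bandwidth, padded to avoid all-zero weights
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(x_train, y_train):
        u = abs(x - x_new) / h
        if u >= 1.0:
            continue
        w = (1.0 - u ** 3) ** 3                   # tricube weight
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:                        # degenerate case: weighted mean
        return swy / sw
    slope = (sw * swxy - swx * swy) / denom
    return (swy - slope * swx) / sw + slope * x_new


def cv_mse(x, y, frac, k=5, seed=0):
    """Mean squared prediction error of the fit under k-fold cross-validation."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    errors = []
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        xt = [x[i] for i in idx if i not in held]
        yt = [y[i] for i in idx if i not in held]
        errors += [(y[i] - loess_predict(xt, yt, x[i], frac)) ** 2 for i in fold]
    return sum(errors) / len(errors)


def choose_frac(x, y, candidates=(0.2, 0.3, 0.5, 0.75), k=5):
    """Pick the span with the lowest cross-validated MSE (candidates are arbitrary)."""
    return min(candidates, key=lambda f: cv_mse(x, y, f, k))


def gc_correct(gc, depth, chrom_gc, chrom_depth, frac):
    """Assumed correction: observed * median(chromosomal depth) / expected depth."""
    baseline = median(chrom_depth)
    corrected = []
    for g, d in zip(gc, depth):
        expected = loess_predict(chrom_gc, chrom_depth, g, frac)
        corrected.append(d * baseline / expected if expected > 0 else d)
    return corrected


# Synthetic chromosome whose depth drops away from 50% GC (made-up data).
chrom_gc = [0.30 + 0.01 * i for i in range(41)]
chrom_depth = [50.0 - 200.0 * (g - 0.5) ** 2 for g in chrom_gc]
frac = choose_frac(chrom_gc, chrom_depth)

# One plasmid window at 35% GC carrying 2 copies: 2 * 45.5 = 91.0 observed depth.
plasmid_corr = gc_correct([0.35], [91.0], chrom_gc, chrom_depth, frac)
chrom_corr = gc_correct(chrom_gc, chrom_depth, chrom_gc, chrom_depth, frac)
uncorrected_pcn = 91.0 / median(chrom_depth)
pcn = median(plasmid_corr) / median(chrom_corr)
print(frac, uncorrected_pcn, pcn)
```

In this toy case the uncorrected ratio underestimates the 2 simulated copies because the plasmid window sits in a low-coverage GC range, while the corrected ratio moves back toward 2.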
Plasmid copy number estimator workflow
The initial step involves the processing of short-read sequencing data. BWA 12 is used to index the reference FASTA file and align the reads. The resulting SAM file is then converted to BAM format, sorted, and indexed using samtools. 13 Users can apply optional filtering based on read mapping quality (--mapping-quality) and SAM flags (--filter). The reference FASTA file is then divided into windows (default: 1000 bp), and GC content and coverage depth are calculated for each window. If no GC correction is applied, the plasmid and chromosomal depths are calculated directly from these windows. When GC correction (--gc-correction) is applied, the LOESS regression is fitted on the chromosomal windows to obtain the expected depth as a function of GC content, and the expected depth is then used to adjust the observed depth of all windows. In both cases, the final step estimates the copy number as the ratio of the plasmid's median depth to the chromosomal median depth. The results are formatted and written to a tab-separated values (.tsv) file.
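The windowing step can be illustrated with a short sketch that computes GC content and mean depth per window. The function name `window_stats`, the per-base depth input, and the handling of a shorter final window are assumptions for illustration and do not describe how PCNE itself is implemented.

```python
def window_stats(sequence, per_base_depth, window=1000):
    """Return (start, gc_fraction, mean_depth) for consecutive windows of a contig."""
    if len(sequence) != len(per_base_depth):
        raise ValueError("sequence and depth vector must have the same length")
    stats = []
    for start in range(0, len(sequence), window):
        seq = sequence[start:start + window].upper()
        depth = per_base_depth[start:start + window]
        gc = (seq.count("G") + seq.count("C")) / len(seq)   # fraction of G and C bases
        stats.append((start, gc, sum(depth) / len(depth)))
    return stats


# Toy contig of 8 bases split into 2 windows of 4 bases (made-up data).
toy_sequence = "GCGCATAT"
toy_depth = [10, 10, 10, 10, 20, 20, 20, 20]
toy_windows = window_stats(toy_sequence, toy_depth, window=4)
print(toy_windows)
```

The per-window GC fractions and depths are the inputs that the optional LOESS regression and the final median ratio would operate on.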
The complete workflow is summarized in Figure 1.

Plasmid copy number estimator workflow. (A) General workflow illustrating how PCNE integrates into a standard sequencing analysis. This panel was created with BioRender.com. (B) Detailed view of all steps performed by PCNE, including optional filters and the optional sub-workflow. This panel was created with Draw.io.
Plasmid copy number estimator validation
To evaluate the accuracy and versatility of PCNE, we conducted analyses on simulated sequencing data with known PCNs and on real data for which the PCN ground truth had been estimated via qPCR.
Simulated data
Reference sequences for
Real data
To validate PCNE on a real data set, raw reads from Xiang et al (Emerg Microbes Infect. 2024) were retrieved from SRA (SRP473989), with data deposited under BioProject number PRJNA1044738. Raw reads were trimmed using fastp (v1.0.1) 16 (-q 30) and checked for quality using FastQC (v0.12.1) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Filtered reads were then assembled using Shovill (v1.1.0) (https://github.com/tseemann/shovill) (--assembler spades), and plasmids were identified using Platon (v1.7) 9 and MOB-suite (v3.1.9; mob_recon). 8 Since MOB-suite failed to identify any plasmid, it was excluded from further analysis. The PCNE was then run (--single-plasmid) on the generated output, and the reliability of the outcome was evaluated using Pearson correlation, with qPCR results as reference.
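The comparison against qPCR can be sketched as a plain Pearson correlation between the two sets of estimates. The helper name `pearson_r` and the paired values below are hypothetical placeholders, not data from this study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Placeholder values only: sequencing-based estimates vs a qPCR reference.
sequencing_pcn = [1.0, 2.0, 3.0, 4.0]
qpcr_pcn = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(sequencing_pcn, qpcr_pcn)
print(round(r, 3))
```

Pearson correlation captures whether the two methods rank and scale isolates consistently, which is why it is paired with an absolute error metric when accuracy is reported.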
Computational performance
The PCNE was evaluated in terms of runtime (minutes) and peak memory usage (gigabytes) per run. To test the performance, 2 distinct groups were compared: the 36 samples from the simulated data set and the 14 samples from the real data set. Within each group, we compared the standard workflow with the GC-bias correction workflow. All analyses were performed on a high-performance computing (HPC) cluster, running 1 job per isolate and allocating 5 CPUs and 32 GB of memory per job.
Case study
We also tested PCNE in a real-life case study, where no qPCR data were available, to show its practical utility in a common scenario. To do this, we selected an outbreak study, involving hypervirulent and multidrug-resistant strains of
Results and Discussion
Simulated data set
The PCNE’s performance was first validated using simulated short-read data where the true PCN was known. We tested the PCNE across data sets with chromosomal coverage of 50× and PCNs of 1×, 2×, and 5×. Quantitative analysis revealed that PCNE consistently produced estimates close to the true simulated values (Figure 2), with an overall mean absolute error (MAE) of 0.017 ± 0.02. GC correction made no major difference: the overall MAE was 0.018 with GC correction and 0.016 without. Estimates from reads simulated with GC bias were the least accurate (MAE = 0.028 and root mean squared error [RMSE] = 0.03), while those from reads without GC bias yielded an MAE of 0.006 and an RMSE of 0.008, with no difference between runs with and without GC correction. A detailed table is available as Supplemental Table S1.
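For reference, the 2 error metrics reported here can be computed as in the following sketch. The function names and the example values are illustrative only and do not reproduce Supplemental Table S1.

```python
from math import sqrt


def mae(estimated, true):
    """Mean absolute error between estimated and true copy numbers."""
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)


def rmse(estimated, true):
    """Root mean squared error between estimated and true copy numbers."""
    return sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true))


# Made-up estimates for simulated PCNs of 1, 2, and 5.
true_pcn = [1.0, 2.0, 5.0]
estimated_pcn = [1.02, 1.97, 5.05]
print(round(mae(estimated_pcn, true_pcn), 3), round(rmse(estimated_pcn, true_pcn), 3))
```

RMSE penalizes large individual deviations more heavily than MAE, so a gap between the two indicates that a few estimates dominate the error.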

PCNE results on simulated data set.
To simulate a standard workflow, the simulated reads were assembled and classified using Platon and MOB-suite, separately. The PCNE was able to directly use the outputs from both tools to provide copy number estimates, confirming its easy integration into different analysis pipelines. Results were concordant for both tools, with an overall MAE of 0.08 ± 0.06 and an RMSE of 0.1 ± 0.01, and are summarized in Supplemental Table S2.
Real data set
All assemblies generated were of high quality. Platon correctly identified 1 AMR plasmid per isolate, as expected. The PCNE was run with and without GC correction and correctly identified all isolates for which an increase in copy number was expected, with a Pearson’s correlation of 0.88 (Figure 3). Regarding the accuracy of the estimates, the overall MAE was 0.798 ± 1.06 (0.15 for single-copy isolates and 1.28 for multicopy isolates). The reduced accuracy is attributable to 2 isolates, EM4N3 and EM4N4, for which the estimates were lower than expected. Given the high accuracy in all other simulated and real cases, a possible explanation lies in the qPCR results taken as reference and the WGS sequences used. Indeed, as the authors reported, copy number was measured over several days, during which it varied. We hypothesize that, for the 2 discordant isolates, the sequence record does not correspond to the time point for which the qPCR data were reported in the main text of the reference manuscript, but to one of the time points reported in their Supplementary Material. Despite this discrepancy, which we acknowledge as a limitation of this test, PCNE correctly identified the significant trend of increased copy number in isolates EM4N1-EM4N4, as well as the stability of isolates EM2N1 and EM2N3 and of the control ECNX52. A detailed table is available as Supplemental Table S3.

PCNE results on a real data set.
Computational performance
The overall runtime was low for all data sets, with a mean of 01:36 ± 01:04 minutes. The simulated data set was processed faster and with lower memory usage than the real-world data set, with a mean runtime (minutes) of 01:00 ± 00:15, compared to 02:59 ± 00:36 (Figure 4A). Memory usage was more variable, depending mainly on chromosomal coverage depth and GC correction. The simulated data set required substantially less memory than the real-world data set, with a mean memory usage (GB) of 0.90 ± 0.49, compared to 10.1 ± 5.8 (Figure 4B). As expected, GC-correction runs required the longest runtime and the highest amount of memory for both data sets. All data are summarized in Supplemental Table S4.

PCNE computational performance. (A) PCNE runtime, real data vs simulated data. (B) PCNE peak memory usage, real data vs simulated data.
Case study
All assemblies generated were of high quality. MOB-suite identified the same 5 plasmids in all isolates, confirming the close relatedness of the isolates reported in the original study. The application of PCNE, however, expanded this genomic characterization by revealing heterogeneity in the PCNs. Our analysis identified 2 distinct quantitative profiles: isolates KP_01, KP_02, and KP_05 maintained the blaKPC-2-carrying plasmids at a low copy number (1.1-1.3 copies), while isolates KP_03 and KP_04 exhibited a 2- to 3-fold amplification of these same plasmids (2.3-3.1 copies). The presence of the primary virulence plasmid carrying the aerobactin siderophore genes (
Limitations
A limitation of all coverage-based copy number estimation methods, including PCNE, and of qPCR-based approaches is the dynamics of chromosomal replication. During active replication, bacteria possess multiple replication forks, leading to a gradient of DNA copy number across the chromosome: regions near the origin of replication (oriC) are present in higher copies than regions near the terminus (ter). 19 This phenomenon introduces a systematic bias, as the chromosomal coverage used for normalization is not uniform. The PCNE attempts to mitigate the effects of this gradient by using the median coverage of all chromosomal windows, rather than the mean, as the baseline for normalization. The median is inherently more robust to the high-coverage outliers near the oriC and the low-coverage outliers near the ter. However, this statistical approach does not fully correct for the underlying biological bias, which remains a caveat for any sequencing data derived from a replicating cell population. Another limitation of the current PCNE version is its dependency on the limited accuracy of upstream assembly and classification tools for short-read data. This limitation is partially overcome by the possibility of providing hybrid assemblies. However, we acknowledge that hybrid sequencing is rarely available and not cost-effective; for this reason, future work will focus on extending PCNE to long-read-only sequencing data, with the aim of overcoming the current major limitations.
Conclusion
Plasmid Copy Number Estimator was developed to enhance and standardize current short-read-based plasmid analysis workflows via an easy-to-use, versatile, and efficient pipeline. The validation on simulated data demonstrates that PCNE provides reliable results across all its configurations. Its flexible input modes allow for seamless integration with the outputs of widely used plasmid identification tools such as Platon and MOB-suite. This was also confirmed by the test on real data sets, for which PCNE results were concordant. Furthermore, in our case study application, PCNE successfully uncovered significant PCN heterogeneity within a clonal outbreak, providing the quantitative data necessary to decouple plasmid dosage from the observed phenotypic resistance and virulence.
Supplemental Material
sj-docx-1-bbi-10.1177_11779322251410037 – Supplemental material for PCNE: A Tool for Plasmid Copy Number Estimation
Supplemental material, sj-docx-1-bbi-10.1177_11779322251410037 for PCNE: A Tool for Plasmid Copy Number Estimation by Riccardo Bollini and Valeria Cento in Bioinformatics and Biology Insights
sj-xlsx-2-bbi-10.1177_11779322251410037 – Supplemental material for PCNE: A Tool for Plasmid Copy Number Estimation
Footnotes
Author Contributions
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was supported by EU funding within the NextGenerationEU-MUR PNRR Extended Partnership initiative on Emerging Infectious Diseases (project no. PE00000007, INF-ACT).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The source and documentation for PCNE are available at https://github.com/riccabolla/PCNE. All the scripts to reproduce data in this paper are available at https://github.com/riccabolla/PCNE_paper_script/. The chromosome and the plasmid sequence used for the validation are deposited on the NCBI GenBank under accession numbers CP003200 and CP003223, respectively. All the data used in this manuscript are available at
.
Supplemental Material
Supplemental material for this article is available online.
References
