Abstract
Introduction
Single-cell RNA sequencing (scRNA-seq) has provided unprecedented power to interrogate gene expression transcriptome-wide in single cells. 1 Coupled with advances in computational biology to analyze scRNA-seq data,2 -4 biological samples can be resolved to identify heterogeneity among the single cells in a pooled sample. 5 Analyzing a scRNA-seq data set or integrating several data sets can be challenging because the data tend to be extremely high dimensional and very sparse. However, several methods have been recently developed to aid in the computation. 6 For example, batch correction and harmonization of scRNA-seq data sets help to minimize systematic differences so that the data can be compared on the gene expression level. 7 Unfortunately, there is a lack of computational approaches that serve to compare scRNA-seq data sets according to similarity and differences in phenotypic heterogeneity.
We developed the scCompare computational pipeline to facilitate the mapping of phenotypic labels from 1 scRNA-seq data set to another. The aims of the pipeline are to establish comparability and to potentially discover unique cell types. In scCompare, a mapping scRNA-seq data set with known cell type identities is processed with the standard pipeline steps including the identification of Leiden clusters and projection of the single cells into uniform manifold approximation and projection (UMAP) space. Given phenotypic annotations of the single cells in the clusters, cell type–specific prototype signatures are generated based on the average gene expression of each cluster. Using those cells assigned to each cluster, distributions of the correlations of gene signatures between the cells and the corresponding prototype signature are determined. Statistical thresholds for inclusion or exclusion are derived from the resulting distribution and used to evaluate each single cell from a test scRNA-seq data set to assign a phenotypic label. Single cells that fall outside of the distributions are labeled as unmapped. We evaluated scCompare on benchmark scRNA-seq data sets from peripheral blood mononuclear cells (PBMCs) and compared the performance to single-cell variational inference (scVI), a state-of-the-art computational tool for general scRNA-seq analyses including label transfer. 8 In addition, we used publicly available scRNA-seq data sets from atlases and experimental protocols that differentiate human induced pluripotent stems cells (hiPSCs) into cardiomyocytes (CMs) to test the utility of scCompare to detect similarities and differences in single-cell populations.
Materials and Methods
Data sets
Human Protein Atlas (proteinatlas.org)
The Human Protein Atlas (HPA) scRNA-seq data were collected as previously described. 9 Briefly, scRNA-seq read count data from 81 cell types from 31 different data sets (Supplemental file 1) were downloaded. The scRNA-seq data are based largely on the Chromium single-cell gene expression platform from 10× Genomics (version 2 or 3), constituting a single-cell suspension from tissues without pre-enrichment of cell types. This includes only studies with >4000 cells and 20 million read counts, and only data sets whose pseudo-bulk transcriptomic expression profile is highly correlated with the transcriptomic expression profile of the corresponding HPA tissue bulk samples. An exception was made for the eye (~12.6 million reads allowed), the rectum (2638 cells allowed), and the heart muscle (plate-based scRNA-seq platform) to include these additional cell types in the analysis.
Tabula Sapiens
The Tabula Sapiens (TS) scRNA-seq data were collected as previously described. 10 Briefly, 24 tissues in total were collected from 2 cohorts of donors (Supplemental file 2). This allowed biological replicates for almost all tissues. More details of the samples are available from the metadata posted on Figshare (https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219). The scRNA-seq raw read count data from the 10× Genomics pipeline were downloaded (TS cell atlas) for analysis.
Peripheral blood mononuclear cells
scRNA-seq data from approximately 3000 single PBMCs (3k data set) from a donor were contributed by 10× Genomics as a public resource. The raw read count data were downloaded for analysis. 11 In addition, scRNA-seq data from approximately 68 000 single PBMCs (68k data set) from a donor were generated using the 10× Genomics platform. 12 The raw read count data were used for analysis.
Cardiomyocytes
hiPSCs were differentiated to CMs in 90 days using 2 different protocols as previously described. 13 Briefly, in protocol 1, CMs were differentiated using small molecules with CHIR99021 and IWP2. In protocol 2, CMs were differentiated using cytokines Activin A, BMP4, and XAV939 plus small molecules with CHIR99021. Samples were collected on days 0 (D0), 12 (D12) and 24 (D24) for protocol 1 and days 0 (D0), 14 (D14), and 26 (D26) for protocol 2. The difference in the days of collection is due to the lag in the differentiation initiation day between the 2 protocols. scRNA-seq raw read count data were generated using the SPLiT-seq 14 (split-pool barcoding) library preparation methodology and from sequencing on an Illumina NextSeq.
Data filtering, preprocessing, and clustering
Single-cell analysis in Python (scanpy)
4
v1.9.2 was used exclusively for filtering, preprocessing, clustering the data, and marker gene identification. For data processed in the R environment, the SeuratDisk converts Seurat objects to AnnData objects via the h5Seurat file format specification. The unique molecular identifiers (UMIs) in each data set were annotated using the human Ensembl gene model.
15
The data were filtered such that transcripts not present in at least 3 cells were excluded from further analysis. Cell barcode identifiers were filtered to only include those with a minimum of 2000 non-zero transcripts, those having ⩾5% mitochondrial transcripts, and those having ⩽10% ribosomal transcripts. The count data were normalized to counts per million (CPM) for each cell by dividing the expression of each gene by the total count of its respective cell, multiplying the result by a million, and then applying a log base 2 transformation with an offset of 1. The normalized data were scaled across single cells to a mean expression = 0 and variance = 1. Highly variable genes were selected using the variance-stabilizing transformation option in scanpy followed by principal component analysis (PCA) to reduce the dimension of the data. Using the Kneedle heuristic
16
to determine the number of principal components (PCs) according to the point of maximum curvature of the explained variance and
scCompare core functionalities: generate bulk signatures, statistical thresholding, and mapping phenotypic labels
After phenotypic labels were derived, either provided from external ground truth information sources or arrived at via unsupervised clustering, the scCompare pipeline was initiated. Normalized transcript count measurements and phenotypic labels from the training (mapping) data were used to build phenotypic label-specific prototype signatures that were based on the average expression of each phenotypic label using only highly variable genes. For each phenotypic label, distributions of the correlations of each cell’s highly variable genes to the phenotypic label’s prototype were generated. The correlation of each cell’s gene signature to each prototype signature was calculated, and the cell was initially assigned the phenotypic label to which it has the highest correlation.
The median absolute deviation (MAD), a measure of dispersion in data, uses the median as a statistic of central tendency and is robust against outliers.
17
If
We primarily used 5*MAD below the median as the statistical cutoff for the distribution of the Pearson correlation coefficients for a given phenotypic annotation to exclude phenotypic label assignment in the single cells in the testing data set. As an alternative approach to determining the statistical cutoff for distributions that might be highly skewed, we employed the Fisher transformation to convert the Pearson correlation coefficients to
where ln is the natural logarithm, and
where
Marker gene identification
We looked for marker genes for each CM differentiation protocol (small molecule and cytokine-driven) at matched timepoints. Specifically, D12 and D24 of the small molecules in protocol 1 were compared with D14 and D26 of the small molecules and a cytokine in protocol 2, respectively. Marker genes were identified using scanpy’s rank_genes_groups method. A
Performance metrics for the phenotypic mapping
Accuracy is the proportion of the mapping of the correct phenotypic labels to the test data set over all levels.
Specificity is the probability of not falsely mapping the correct phenotypic label to the test data set.
Recall (sensitivity) is the probability of mapping the correct phenotypic label to the test data set.
Results
scCompare overview
scCompare is a pipeline computational process using a training (mapping) scRNA-seq data set with known phenotypic annotations of single cells to label cells from a test data set of scRNA-seq data. The primary goal is to compare the phenotypic characteristics of the single cells (Figure 1). The mapping data set is processed to cluster the single cells for projection into UMAP 2-dimensional space. Cell type-specific prototype signatures are created using phenotypic labels (either provided by external sources or arrived at by unsupervised clustering), by calculating the average gene expression of all cells within each label. Then, distributions are obtained by comparing each cell’s gene signature to the prototype of its respective label. Next, the gene signature from each single cell in the test data set is evaluated against statistical thresholds for each label distribution to assign a phenotypic annotation. Single cells that fall outside of the distributions are labeled as “unmapped.” Finally, a comparison of the proportion of assigned cluster between data sets is visualized in a scatter plot, and

Compare workflow. Semi-supervised clustering is performed to group the mapping data set into phenotypically relevant subsets. Bulk signatures derived from the prototype gene expression signature for each cluster are generated, and distributions of within-cluster Pearson correlations of each member cell to the cluster’s prototype signature are formed. These provide statistical cutoffs for cluster inclusivity. The cells from the test data set are correlated with each prototype signature, and the Pearson correlation is compared with the statistical cutoffs. Cells that pass the threshold are considered mapped and are labeled with the cluster’s phenotype; otherwise, they are labeled as unmapped. Finally, fractions of mapped cells are compared between the mapping and testing data sets to facilitate comparability of phenotypic representations.
Comparison of scCompare to scVI to map phenotypic labels
The scVI 8 is a scRNA-seq analysis tool that uses an optimization routine and deep neural networks to join information across cells and genes that are similar, and to estimate the distributions that represent the observed expression values. An advantage of scVI is that it uses a neural network-based approach to construct latent-space representations from the matrix of read counts and batch information to account for experimental and technical variation in the data, and then estimate biological differences between cells. A couple of disadvantages of scVI is that it uses a non-deterministic algorithm leading to alternative results with different initializations, and for genes with few cells, the prior and the inductive bias of the neural network may not fit the data optimally. The expected read count data (batch-corrected and normalized) are used in scVI for comparison with scCompare.
To assess the ability of scCompare and scVI to align biological annotations between scRNA-seq data sets, we randomly sampled the 3k PBMCs scRNA-seq data 50 times in training and testing splits (80:20) and used the top 2000 highly variable genes for signature construction. Performance is based on the accuracy, precision, and sensitivity of scCompare and scVI mapping the phenotypic labels to the test data sets. As shown in Table 1, the accuracy of predictions was similar, but scCompare matches or outperforms scVI in precision and sensitivity for most cell types, most strikingly with the dendritic cells and megakaryocytes. However, scVI scored slightly higher for CD4 T cells, FCGR3A monocytes, and natural killer (NK) cells in precision for the former and sensitivity for the latter two.
Performance comparison between scCompare and scVI using the 3k PBMCs scRNA-seq data set.
Performance parameters based on the average of 50 iterations of 80%:20% splits of the data training to test and using 2000 highly variable genes. Bold indicates that scCompare most strikingly outperforms scVI in precision and recall with the dendritic cells and megakaryocytes.
We also evaluated the ability of scCompare and scVI to map known transcriptomic signatures from the 3k PBMCs scRNA-seq data set to the 68k PBMCs scRNA-seq data set (Figure 2). As shown in Figure 2B, the cell type identities from the 3k data set were mapped equivalently to the 68k data set except for the megakaryocytes in the case of scVI where they were labeled as NK cells. The gene representations of the clusters of single cells for the phenotypic labels are comparably similar except for the megakaryocytes (Figure 2B). Furthermore, scCompare identified a previously unannotated cell type, plasmacytoid dendritic cell. 18 The application of the MAD-based statistical cutoff allowed for discovery of potentially novel cell types, although not all unmapped cells would be novel (not accounted for in the mapping data set). To determine whether a group of unmapped cells was novel, 2 parameters are assessed: (1) a differentially expressed gene-based metric and (2) a UMAP Euclidean distance-based metric (Figure 2C, Table S1, Figure S1, and Figure S2). The cell cluster identified as plasmacytoid dendritic cells was observed to have a gene signature divergent from that of dendritic cells (the pre-cutoff assigned phenotype) and was separated from the mapped dendritic cells in UMAP space. Furthermore, differential gene expression analysis revealed the expression of MZB1 in this cluster, a marker gene for plasmacytoid dendritic cells. 18 This not only demonstrates scCompare’s usefulness in mapping phenotypes between data sets, but it also provides the added ability to discover novel cell types not accounted for in the mapping data.

scCompare and scVI 3k PBMCs data set-derived models applied to the 68k PBMCs data set and the discovery of an unannotated cell type. (A) 3k PBMCs data set: (i) UMAP of 3k PBMCs data set, (ii) correlation dendrogram of 3k PBMCs phenotypes, and (iii) 3k PBMCs gene expression dotplot showing expression of key phenotypic markers. (B) Results of mapping 3k PBMCs identities onto the 68k PBMCs data set using scCompare and scVI. (i-ii) UMAPs showing phenotypes assigned by scCompare (i) and scVI (ii), (iii) gene expression dotplots showing consistency of key gene expression between the 3k PBMCs data set and the assigned phenotypes in the 68k data set (top—scCompare, bottom—scVI). scVI does not predict any megakaryocytes in the 68k PBMCs data set despite evidence for their presence (small PPBP-expressing subcluster (A iii)). The intensity scale represents the mean expression of a gene within a cell type. The size of the dots represents the fraction of the cells within a cell type. (C) Statistical cutoff-assisted discovery of plasmacytoid dendritic cells. Cells not meeting statistical cutoff were assessed for novel phenotype status using differential gene expression list analysis and UMAP Euclidean distance metrics (Figure S1). After assessment and remapping, a cluster of plasmacytoid dendritic cells was discovered. This phenotype is confirmed by the expression of MZB1. Further characterization can be found in Figure S1.
Mapping between scRNA-seq cell atlases
To demonstrate a broader applicability of the scCompare pipeline, we performed cell atlas mapping using HPA as the training data set and TS as the test data set. These atlases contain samples obtained from primary human adult tissue from multiple individuals across many tissues. These samples have also been extensively annotated with various levels of descriptive labels including cell phenotype, organ-of-origin, sex, and other information, typifying the breadth and depth of heterogeneity present within human biology. To demonstrate scCompare’s performance on atlas-to-atlas mapping, phenotypic label alignment between atlases was necessary for a ground truth comparison. First, for each data set, we combined phenotypic and organ-of-origin annotations and excluded groupings (phenotype; organ-of-origin, for example: T cells; salivary gland) that did not contain at least 300 cells or were not present in both HPA and TS. We then removed groupings that denoted phenotypes that were too broadly defined, likely containing more than 1 distinct phenotype. For example, stromal cells may be comprised of fibroblasts, adipocytes, pericytes, vascular cells, etc. We then removed all cells that were annotated as stem cells as it is unclear how to align phenotypes of unknown maturation state across atlases. Finally, after filtering, resulting labels were hand-aligned by ensuring overlap between the annotated phenotype and the organ-of-origin (Table S2) and each resulting grouping subset to 300 cells to streamline processing.
We obtained bulk signatures corresponding to each of the 22 annotation groups in HPA sampled at available tissue level. Bulk signatures were captured using the top 2000 highly variable genes. These bulk signatures were then applied to the TS data set comprising 50 corresponding sub-categories sampled from 21 different tissue locations throughout the body. No statistical cutoff was applied as we were testing direct 1:1 mapping fidelity (Table 2).
scCompare performance in finer categorized label transfer task.
Abbreviation: N/A, not applicable.
Figure 3 depicts a summary of the classification performance based on differential expression analysis, along with visualizations showing the alignment of classification mapping between HPA and TS cell atlases. Differential gene expression analysis was conducted on HPA, using the same parameters as in the cardiac differentiation comparison. The top differentially expressed genes from HPA are displayed for both atlases, allowing for a direct comparison (Figure 3A). In this comparison, cells from the TS atlas were not ground truth labels, but rather were categorized according to their phenotypic labels assigned by scCompare. The results show a broad agreement in the expression patterns of the differentially expressed genes derived from HPA. Key marker genes for each phenotype are highlighted in the legend to provide additional biological context and phenotypic validation. Circos plots display a visual representation of classification accuracy along a variety of axes including tissue of origin, cell type, and cell type sub-categorizations (Figure 3B). The vast majority of cells fall into the correct corresponding phenotypic grouping with more than 90% overall classification accuracy. The class weighted and unweighted prediction accuracy scores were 90.2% and 88.1%, respectively. Cell type confusion did occur between closely related subtypes of cells, such as different subtypes of immune cells and epithelial cells. This can likely be attributed to shared lineage and functional categories. For instance, confusion among T lymphocytes (T cells), B lymphocytes (B cells), and NK cells was observed (Table 2). These cell types are of shared lineage and descend from a shared ancestor, the common lymphoid progenitor. Similarly, there is confusion among enterocytes and other epithelial cells from the large and small intestines. In the case of club cells, their primary class confusion occurred with type 2 alveolar cells. Their confusion can likely be attributed to their common functions, as both cell types are secretory epithelia featuring specialized adaptations to the respiratory tract. Evaluation of the associated clustermap (Figure S3) highlights the core similarities in scCompare phenotypic signatures that were likely the root cause of high misclassification rates for the previously mentioned cell types.

Visualizations of HPA and TS atlas mapping experimental results. (A) Differential expression analysis of HPA cells by phenotypic label with corresponding expression of scCompare-classified TS cells displayed using a dotplot. Each row denotes the similarity in expression between HPA-phenotypically labeled cells and corresponding TS-mapped cells post-scCompare analysis. Key markers for each phenotypic label that arose from the differential expression analysis are annotated here, macrophages: AIF1 (allograft inflammatory factor 1), B cells: MS4A1 (membrane spanning 4-domains A1, CD20), T cells: CD3E (CD3e molecule), NK cells: NKG7 (natural killer cell granule protein 7), Müller glia cells: GFAP (glial fibrillary acidic protein), cardiomyocytes: TNNI3 (troponin I3, cardiac type), hepatocytes: ALB (albumin), endothelial cells: VWF (von Willebrand factor), fibroblasts: COL1A2 (collagen type I alpha 2 chain), Smooth muscle cells: ACTA2 (actin, alpha 2, smooth muscle, aorta), plasma cells: JCHAIN (joining chain of multimeric IgA and IgM), salivary duct cells: KRT7/KRT19 (keratin 7/keratin 19), basal prostatic cells: PIP (prolactin-induced protein), respiratory ciliated cells: PIFO (primary cilia formation protein), club cells: SFTPA2 (surfactant protein A2), alveolar cells type 2: SFTPC (surfactant protein C), basal keratinocytes: KRT14 (keratin 14), ductal cells: KRT8/KRT19 (keratin 8/keratin 19), exocrine glandular cells: CTRB2 (chymotrypsinogen B2), distal enterocytes: FABP1 (fatty acid binding protein 1), intestinal goblet cells: MUC2 (mucin 2), and proximal enterocytes: ALDOB (aldolase B, fructose-bisphosphate). Highlighting key expression of these genes helps to validate HPA phenotype and TS mapping fidelity using canonical gene expression. (B) Atlas mapping experimental results are subdivided to further interrogate performance. Endothelial cells (i) display high classification accuracy despite a variety of sampling locations. In contrast, T cells (ii) display a higher rate of misclassification, which has a clear bias toward some sub-types showing higher rates of misclassification (eg, cd8 + alpha-beta t cells, generic t cells). This is likely due to the broader phenotypic heterogeneity that is captured within the T-cell phenotype. (iii) Lung tissue higher rates of misclassification among club cells are shown (possibly due to the commonalities between respiratory club cells and type 2 alveolar cells, as both as secretory epithelia with similar tissue-specific gene expression profiles). (iv) Cell types sampled from both the large and small intestines whereby an interesting pattern is depicted showing the confusion rate for enterocytes of the small intestine higher misclassification rates and unidirectional, as enterocytes of the large intestine are correctly classified at a higher rate.
Comparison of cardiomyocyte differentiation protocols
To demonstrate the utility of scCompare to compare scRNA-seq data sets from a study design, we leveraged the data from an experiment that evaluated 2 different protocols to differentiate hiPSCs to CMs.
13
Differentiation protocol 1 used small molecules whereas protocol 2 used a cytokine in addition to the small molecules. The differentiation of the hiPSCs was carried out for 90 days. As represented in Figure 4A, protocol 1 samples were taken for scRNA-seq at D0, D12, and D24, whereas protocol 2 samples were taken for scRNA-seq at D0, D14, and D26. The difference in the days of collection was due to the lag in the differentiation initiation day between the 2 protocols. We selected the scRNA-seq data from protocol 1 as the training (mapping) data set and the scRNA-seq data from protocol 2 as the testing data set. scRNA-seq analysis of the training data set generated 13 Leiden clusters (Figure 4B). The UMAP representation of the clusters shows well-formed clusters and an abundance of heterogeneity in the data which is expected in a “developmental” (differentiation) setting compared with “adult”/mature tissues. Differential expression analysis of the Leiden clusters 0 to 9 revealed marker genes

scCompare recapitulates published mapping of phenotypes between 2 directed differentiation protocols. (A) Sample schematic indicating the types of differentiation protocols and timepoints (dashed box) analyzed. (B) UMAP of Leiden clustering of “mapping” data set, protocol 1 D12 and D24, and violin plots displaying marker gene expression profiles for each Leiden cluster, with cell type indicated above. (C) Correlation dendrogram of highly variable genes and Leiden clusters in the “mapping” data set, protocol 1, and UMAP with colors indicating cell type. (D) UMAP of “testing” data set, protocol 2 D14 and D26, after mapping using the annotated protocol 1 (training) cell types, and UMAP of Pearson correlation coefficients of each labeling. (E) Scatter plot showing the relative ratios (natural log) of annotated cell types between protocol 1 and protocol 2. An
Figure 4C depicts a correlation heat map of the highly variable genes expressed by the single cells in the Leiden clusters and revealed a high degree of similarity between the CM cell types (clusters 0-9 and PCM cluster 11). The other clusters have moderate to low correlation to each other. The cell type annotations in the protocol 1 (mapping) data showed good uniformity of the CM clusters and relatively good individual clustering of the other cell types. However, after scCompare mapping of the phenotypic identities to the protocol 2 (test) data, there appeared to be heterogeneity in the CM cluster, very few ectodermal, endothelial-like, and smooth muscle-like cells, and less representation of endodermal cells (Figure 4D). The aforementioned reduction in cell type mapping in protocol 2 is represented in Figure 4E numerically as proportions and graphically as a scatter plot of the cluster fraction of cells with the annotated cell type in protocol 1.
The

Dotplot visualization of marker genes between protocols 1 and 2. Differentially expressed genes between clusters of cells in (A) protocol 1 versus protocol 2 for timepoint 1 (D12/D14) and (B) protocol 1 versus protocol 2 for timepoint 2 (D24/D26). The legend denotes the size of the dots as the percentage of the cells in the groups expressing the genes, and the color bar represents the level of mean expression of the genes within the groups.
scCompare number of genes and cluster resolution parameter assessment
A final goal of our analysis was to assess the scCompare pipeline for its sensitivity to 2 key user-defined parameter selections. To assess this, we used a subset of 10 000 cells obtained from the HPA data set and separated these into 5000 mapping and test cells. The scCompare pipeline was repeatedly performed on this data set using an increasing size selection of highly variable genes and clustering resolutions (Figure 6). Highly variable gene selection was bounded between 50 and 10 000. Clustering resolution was bounded between 0.05 and 2.5, which corresponded to 2 to 40 distinct clusters. Figure 6A colors each individual iteration by weighted-F1, a statistical measure of prediction accuracy bounded by 0 and 1. With respect to clustering resolution, the results demonstrate the impact of overclustering on performance, as greater resolutions tend to produce ambiguity between classes resulting in misclassification.

scCompare grid search exploring the effects of parameter optimization on results using the HPA scRNA-seq data set. (A) Each point (a subset of 10 000 cells) represents an individual run of scCompare, with its highly variable genes assigning position in X and Leiden clustering resolution assigning its position in Y. The color of each point is assigned by its classification performance as measured by class-size weighted F1 score. Panels (B) and (C) present a different view of this data, where the points are positioned by map and test misclassification rates in X and Y, respectively. The identify line acts as a visual guide, where points deviating from this line have unequal percentages of map and test set misclassification rates. The plots are colored by (B) number of Leiden clusters and (C) number of highly variable genes, respectively. Panel (D) shares the same axis as (A), but points are instead colored by the absolute difference in misclassification rate between map and test sets.
When assessing the impact of gene signature length, the results suggest that beyond a certain minimum threshold, most signature length selections tend to perform well. Figure 6B and C illustrates the same data set evaluating each individual run by map and test misclassification rates, with colors provided by input variable. These plots best capture the trend toward higher input values of clustering resolutions and signature lengths tending to produce higher test misclassification than map misclassification, an outcome that could be considered as overfitting. Figure 6D provides another view of overfitting, where points are colored by the absolute difference in map and test misclassification rates. Higher values tend to occur at the highest clustering resolutions and gene signature lengths. As a note, Figure 6D masks cases in which the misclassification rate is equally poor in test and mapping data sets, as is the case in the low gene signature length (Figure 6A). In general, our results suggest that aside from excessive clustering resolutions and gene signature lengths, our pipeline produces robust results.
Discussion
Advances in next-generation sequencing (NGS) allow for high-resolution gene expression profiling transcriptome-wide at the single-cell level. The ability to assess heterogeneity within samples provides a unique insight into the complexity of biology. Over the years, the size and scope of scRNA-seq experiments increased as the technology became more readily available. At the same time, sophisticated computational tools to analyze the data were developed. However, there is a paucity of methods to compare scRNA-seq data sets at the phenotypic level.
We developed scCompare as an analytical pipeline to compare clusters of single-cell identities to another data set for assessment of similarities and differences in phenotypic characterization (Figure 1). scCompare leverages the workflow in scanpy to filter, preprocess, cluster the data, and identify marker genes making the ease of entry extremely useful for a wide range of users in the community. For those who prefer the R environment, SeuratDisk converts Seurat objects to AnnData objects via the h5Seurat file format specification and is portable into scCompare. The novelty of scCompare lies in the following. Using labels in a scRNA-seq data set, bulk signatures for each label are generated by correlating highly variable genes within each label to a gene expression prototype. Based on statistical cutoffs for each distribution of bulk signatures, labels from a test scRNA-seq data set are compared with the threshold for each distribution of the labels’ bulk signatures to assign phenotypic characterization or label them as unmapped. The unmapped cells provide opportunity for further investigation of similarities and differences in representation of cell identities and/or biological discovery.
In comparison with scVI, a deep learning probabilistic model for perusing unexplored biological diversity for scRNA-seq data, scCompare matched or outperformed the tool in higher accuracy and specificity for most of the cell types (Table 1 and Figure 2). More importantly, in contrast with scVI, scCompare has a flexible utility in that it can compare atlases of scRNA-seq data (Table 2) (Figure 3) or, in the case of CM differentiation protocols, reveal an unmapped cluster of single cells (Figure 4). scCompare also allows for the identification of “unmapped” cells based on their dissimilarity to computed signatures by setting statistical cutoffs. We demonstrated the utility of this feature in our comparison of CM differentiation protocols, which revealed an unmapped group of cells which may have some biological relevance (Figures 4 and 5).
It is clear that scRNA-seq data are large, complex and requires a fair amount of bioinformatics expertise, computational savviness, and biological intuition to mine the data effectively. scCompare provides a fairly straightforward analysis pipeline for novices to use. In fact, our assessment of 2 key parameters used in the scCompare pipeline gives guidance to users on how to optimize analysis results (Figure 6). However, there are a few caveats to consider when using scCompare to glean biological insight from a comparison of 2 scRNA-seq data sets: (1) the mapping data set requires identities of the cells or biological expertise to phenotypically label them; (2) often times the mapping data set has a hierarchical structure (Figure S3) to the cell identities, and as such, the results will likely differ based on the level of phenotypic annotation used; thus, it is important to evaluate the correlation dendrogram produced during the scCompare pipeline run to identify phenotypic signatures that are highly correlated as they may lead to misclassification of the test data set (Figure S3); (3) the statistical cutoffs are user-defined and can be adjusted to refine the phenotypic characterization in the test data set; (4) cells may be misclassified or unlabeled if they have significantly higher sparsity than the cells used to generate the signatures; and (5) it is plausible that the cell cycle may be confounding and cells may map to cell cycle signatures on the sole basis of that phenotypic property, despite, for example, originating from a totally different germ layer.
Given the utility and practicality of scCompare, we envision that the tool will be of value to the basic science research community, biotechnology, and pharmaceutical industries, when needing to compare large scRNA-seq data sets.
Supplemental Material
sj-pdf-3-bbi-10.1177_11779322241280866 – Supplemental material for A Strategy to Compare Single-Cell RNA Sequencing Data Sets Provides Phenotypic Insight into Cellular Heterogeneity Underlying Biological Similarities and Differences Between Samples
Supplemental material, sj-pdf-3-bbi-10.1177_11779322241280866 for A Strategy to Compare Single-Cell RNA Sequencing Data Sets Provides Phenotypic Insight into Cellular Heterogeneity Underlying Biological Similarities and Differences Between Samples by Dan C Wilkinson, Elizabeth Tallman, Mishal Ashraf, Tatiana Gelaf Romer, Jeehoon Lee, Benjamin Burnett and Pierre R Bushel in Bioinformatics and Biology Insights
Supplemental Material
sj-xlsx-1-bbi-10.1177_11779322241280866 – Supplemental material for A Strategy to Compare Single-Cell RNA Sequencing Data Sets Provides Phenotypic Insight into Cellular Heterogeneity Underlying Biological Similarities and Differences Between Samples
Supplemental material, sj-xlsx-1-bbi-10.1177_11779322241280866 for A Strategy to Compare Single-Cell RNA Sequencing Data Sets Provides Phenotypic Insight into Cellular Heterogeneity Underlying Biological Similarities and Differences Between Samples by Dan C Wilkinson, Elizabeth Tallman, Mishal Ashraf, Tatiana Gelaf Romer, Jeehoon Lee, Benjamin Burnett and Pierre R Bushel in Bioinformatics and Biology Insights
Supplemental Material
sj-xlsx-2-bbi-10.1177_11779322241280866 – Supplemental material for A Strategy to Compare Single-Cell RNA Sequencing Data Sets Provides Phenotypic Insight into Cellular Heterogeneity Underlying Biological Similarities and Differences Between Samples
Supplemental material, sj-xlsx-2-bbi-10.1177_11779322241280866 for A Strategy to Compare Single-Cell RNA Sequencing Data Sets Provides Phenotypic Insight into Cellular Heterogeneity Underlying Biological Similarities and Differences Between Samples by Dan C Wilkinson, Elizabeth Tallman, Mishal Ashraf, Tatiana Gelaf Romer, Jeehoon Lee, Benjamin Burnett and Pierre R Bushel in Bioinformatics and Biology Insights
Footnotes
Author Contributions
Declaration of conflicting interests:
Funding:
Data and Programming Code Availability Statements
Supplemental Material
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
