Abstract
1. Introduction
The advantages offered by restriction maps for genome analysis are realized by the Optical Mapping System (OM) because it develops and analyzes datasets of individually mapped DNA molecules (Schwartz et al., 1993; Dimalanta et al., 2004; Teague et al., 2010). (A restriction map constructed from a single DNA molecule is called an “Rmap.”) Briefly, large genomic DNA molecules (≈0.5 Mb) are restriction digested after microfluidic deposition onto positively charged glass surfaces. The microfluidic device uses a combination of fluid flow and interaction with a charged surface to unravel and straighten normally coiled DNA molecules. Because deposited molecules are under tension, like a stretched rubber band, restriction cleavage triggers newly formed molecule ends to relax, producing visible gaps and creating strings of discrete DNA fragments that are imaged by automated fluorescence microscopy after staining with a fluorochrome dye. An Rmap acts like a barcode that assigns genomic location to the DNA molecule. Responsive to the unique noise characteristics of OM measurements, computational analysis assembles putative overlapping Rmaps from the many thousands of analyzed molecules into a genome-wide restriction map in ways that closely parallel assembly of shotgun sequence reads. Early in the development of OM, microbial genomes were primarily analyzed as approachable model systems for learning more about the challenges of dealing with larger, more complex genomes. Accordingly, advances in OM have enabled insightful analysis of human (Teague et al., 2010; Kidd et al., 2008; Antonacci et al., 2010) and plant genomes (Zhou et al., 2007; Schnable et al., 2009; Zhou et al., 2009).
Whether or not, and where a given Rmap overlaps another restriction map presents a fundamental inference problem in OM. We are interested in the centrally important question of how to align an Rmap to the genome-wide map created
Assembly is the computational process in which the Rmaps become positioned relative to each other, organized at the genomic scale, and summarized by a consensus map. For relatively small genomes, the Gentig algorithm produces
The optical mapping system was designed and developed to comprehensively reveal human and cancer structural variation: structural variation describes those genomic polymorphisms and mutations ≥1 kb (Scherer et al., 2007). The main strategy for such identification is to assemble the Rmaps and then to detect aberrations by a screen of the assembly against the reference genome (Teague et al., 2010). The detection of copy number variants is a special case that is possible without assembly and thus has some advantages (Sarkar, 2006). Briefly, a copy number gain is indicated if an abnormally large number of Rmaps align at a given locus, while a loss is indicated when too few Rmaps align. Because very long Rmaps (≈0.5 Mb) are tallied in place of probes or short sequence reads, findings are complementary to traditional copy number variant (CNV) analysis (Sebat et al., 2004). There is a statistical bottleneck created by this assembly-free strategy, however, because even in the absence of copy number variation there is variation across the genome in the probability of alignment. As a result, the successfully aligned Rmaps represent a non-uniform thinning of the originals, and a baseline against which copy number variants may be measured is similarly non-uniform. A solution is possible via normalization if we can compute the probability of alignment for each Rmap. We show how a certain self-alignment score provides a fast approximation to the probability of alignment, thus facilitating a normalization of optical map coverage.
Our methods are developed and tested on Rmaps from GM07535, a normal human lymphoblastoid cell line, one of the first applications of OM to the human genome (Lim, 2004). The dataset consists of 206,796 Rmaps, the subset exceeding 0.3 Mb from a larger panel. The Rmaps were aligned against an
2. Methods
2.1. Map significance
Denote an Rmap by
Our approach avoids specifying a probability model for the generation of Rmaps. An accurate model might need to be overly complex, while misspecifying the model could lead to additional errors. Instead, we calibrate
We make the nonparametric assumption on

Lack of auto-correlation in
The assumptions taken are precisely those justifying permutation analysis to calibrate
Our computational experiments utilize two scoring functions: a custom score implemented in the SOMA software package (Kohn, 2003), and a likelihood-ratio (LR) score (see Appendix 5.1) (Valouev et al., 2006). Figure 2 exposes the strong dependence of the best spurious SOMA score on

Dependence of spurious scores on Rmap. For each Rmap, the optimal scores for ungapped global alignment against two independent permutations of the reference are plotted against each other. The scores are highly correlated, indicating a significant Rmap-specific component in the distribution of the best spurious score. The two-dimensional (2-D) histogram uses hexagonal binning (Carr et al., 1987).

Likelihood ratio (LR) scores for ungapped global alignment (Valouev et al., 2006). Optimal scores for GM07535 Rmaps aligned against two independent permutations of the
under

Variance of errors. The
Then estimate the scale parameter

Parametric models for
A further difficulty is in finding a suitable reference distribution for either standardized optimal alignment score. Since standardization has removed the dominant Rmap effects, we can use the empirical distribution of standardized scores computed across all Rmaps on a single genome shuffling. (This step makes the inclusion of
A
2.2. Alignment probability
Even for Rmaps derived from a normal genome, significant alignments are not distributed uniformly along the genome, owing to fluctuations in the local characteristics of the normal restriction map. Figure 6 illustrates the non-uniform alignment process. Knowing the null probability that an Rmap will align is necessary to normalize coverage in order to call significant copy number variants in a test genome. To address this problem we consider the logistic approximation

Schematic representation of “thinning” in Rmap alignments. The horizontal axis represents the underlying genome, with vertical lines indicating restriction sites.
where
Being the perfect match score, the self-alignment score
3. Results
3.1. Map significance
There is no gold-standard available for comparison, but the two approaches provide similar filters on the GM07535 Rmap collection.
At the nominal specificity 99.9% (i.e., significance level 0.1%), 73.42% of the Rmaps had their correct alignment declared as significant. Of these, 0.53% (0.39% of all Rmaps) had at least one spurious alignment declared to be significant in addition to the correct one. 0.27% of the Rmaps had only spurious significant alignments. The remaining (26.31%) did not align. The rate of false positives is somewhat larger than the nominal rate, but this is not surprising given the presence of large segmental duplications in the genome and the homology between the X and Y chromosomes.
On the 50,000 simulated Rmap set, the modified-constant procedure (
The regression approach and the modified-constant approach give comparable yields of aligned Rmaps, with the regression approach having the additional advantage of allowing calibration of error rates. A more difficult issue has to do with the quality of the aligned Rmaps. This comes to our central finding about how the regression approach improves characteristics of Rmap assemblies.

Comparison of significance strategies through the iterative assembly procedure. Chromosome 2 is assembled using the
A simple measure of the success of an alignment strategy is the the proportion of Rmaps passing the alignment step that are included in the ultimate assembly. The higher the better, as the set of aligned maps exhibit a high level of internal consistency when successfuly assembled. By this retention ratio, the regression cutoffs perform better than the modified-constant cutoffs with a comparable number of input Rmaps (Fig. 7, upper panel). Other quality measures that consider bases covered by the assembly, gaps, and unaligned Rmaps consistently favor alignment cutoffs by the regression approach rather than the modified-constant approach (Fig. 7, lower panel).
3.2. Alignment probability
Even if all declared alignments are correct, the set of inferred locations is a subset of the true locations because not all Rmaps successfully align. The probability that an Rmap successfully aligns depends in part on the origin of the Rmap. Understanding this dependence is necessary to normalize observed coverage; for example, increased coverage in a region could be due to increased copy number of the underlying genome, but could also be due to increased alignment probability of Rmaps from that region.
The location-specific alignment rate can be estimated using Monte Carlo simulation of noisy Rmaps from a normal reference map followed by alignment, thus replicating the pipeline actual Rmaps go through. The most time consuming step in this process is alignment. As an alternative that bypasses alignment, we considered the logistic regression model (2), where the probability of an Rmap being aligned is modeled as a function of the self-alignment score

Comparison of empirical and predicted alignment rates. The probability that an Rmap will be successfully aligned depends on the origin of the Rmap. The (relative) fluctuation in the alignment rate as a function of location is an important quantity, but its estimation requires alignment of many simulated maps, which is computationally expensive. Here we assess the performance of an approximate method. The data are roughly 10,000 simulated Rmaps from human chromosome 14. The first curve is the kernel density estimate of locations obtained from alignments declared significant; this density can be used as a relative alignment rate. The second curve is the density of the true locations of the same simulated Rmaps, but with weights given by model (2). The alignment-free method provides an accurate approximation.
The self-alignment score

Optimal scores with the real and a permuted reference map are plotted against

Analogue of Figure 9 for real Rmaps. Optimal scores with the real and a permuted reference map are plotted for each GM07535 Rmap against alignment score with itself. Possible explanations for the differences are explored in the text.
4. Discussion
Alignment is a fundamental, but not fully solved problem in optical mapping. Prior work has focused primarily on the score functions for use in dynamic programming algorithms. Here we have proposed a framework to study the distribution of spurious optimal scores, from any given score function, in order to reduce alignment errors and improve assembly of large genomes. We have also noted the utility of the self-alignment score of an Rmap in providing an
The methods presented are not restricted to a specific score function. Figure 3 plots the best spurious ungapped global alignment score against two permuted references using the likelihood ratio (LR) score proposed by Valouev et al. (2006). The correlation is weaker than with the SOMA score, but an Rmap specific cutoff is still more appropriate than a constant cutoff. We apply the direct approach as before with
Our choice of null hypothesis deserves comment. Formally, we test the hypothesis that the observed Rmap is independent of the reference map. However, it is not unlikely for an Rmap, especially a short noisy one, to originate from somewhere in the reference but have its optimal alignment somewhere else; the null hypothesis of independence is not true in such a case, yet we would not want to declare the optimal alignment significant. The hypothesis we really want to test is that the optimal alignment is a spurious alignment. Unfortunately, there are problems with this approach. Even when the optimal alignment correctly identifies the origin of a Rmap, the alignment itself may not be completely correct; so, when is an alignment sufficiently different from the true alignment to be considered spurious? Should alignments to incorrect but homologous regions be considered spurious? These issues are avoided by formulating the problem as a test of independence, as is common in alignment literature (Doolittle, 1981; Karlin and Altschul, 1990; Mitrophanov and Borodovsky, 2006). The case of a short noisy Rmap described above is not problematic in practice, as the optimal score (as well as the correct alignment score) will rarely exceed the significance threshold obtained under the null hypothesis of independence, even though that null is not strictly true.
Valouev et al. (2006) suggest a model-based approach to determine significance that is similar to ours. They postulate that the fragment lengths in the reference genome
Estimating the mean spurious score
Our use of a limited number of permutations is essential but somewhat unusual, in the sense that the test statistics include Monte Carlo variation. This raises the question: how many permutations are sufficient and how do they affect the inference? In our examples, six permutations of the reference define the test: four to estimate

Variability in test statistics due to permutations. Two separate sets of permutations are used to derive the test statistics
In this article, we have addressed the question of significance of Rmap alignments to a reference map. Significance of alignments are determined by their scores. Our primary goal was to obtain the null distribution, with as few assumptions as possible, of the optimal alignment score of an Rmap given any score function. We achieved this using alignments to permutations of the reference map, and developed conditional permutation tests for significance with control over error rates. This approach was further simplified to obtain simple Rmap specific score cutoffs that have been validated using simulation and through use in iterative assembly. We have outlined ways to use this approach to compare different score functions. Our investigations have also provided new insight into the nature of optical map data and led to a Rmap-specific summary score that may help simplify certain aspects of optical map analysis.
5. Appendix
5.1. Score functions for alignment
Let
indicating a correspondence between the cut sites
this search can be performed efficiently using variants of the Needleman-Wunsch and Smith-Waterman dynamic programming algorithms. Non-additive score functions may be appropriate in certain situations, but have not been investigated.
The sensitivity with which alignments can detect locations of Rmaps depends primarily on the score function used. Different scores are appropriate for different types of alignments. A natural approach to derive score functions is to base it on model-based likelihood ratio tests (Altschul, 1991). Such scores have most recently been derived by Valouev et al. (2006) for alignment of two Rmaps (both being subject to noise), as well as for Rmaps against an noise-free reference map. The model they use is in essence similar to the one described in Sarkar (2006), but excludes desorption and scale errors. We refer the reader to the original article for details.
Another score function for Rmap to reference map alignment has been developed as part of the SOMA software suite (Kohn, 2003). Although this score is largely heuristic, it has been used quite extensively and successfully. Since there is no published reference, we give some details here. The score of the full alignment is determined by the score of each chunk
where
5.2. Iterative assembly
Gentig produces highly accurate assemblies of bacterial genomes. However, it does not scale to genomes of mammalian size because the algorithm is quadratic and there is no obvious way to parallelize it. One solution is a heuristic assembler which uses pairwise Smith-Waterman alignment (Kohn, 2003; Valouev et al., 2006) to subdivide the assembly problem in many smaller problems and uses Gentig as the low-level assembly engine (Mullikin and Ning, 2003). The computational work for both the alignment and assembly steps is distributed over a large network of clustered workstations (Litzkow et al., 1988).
The algorithm is iterative, and the output of each step of the iteration is an approximate map of the genome. In the subsequent step, this approximation is used as the reference map against which all the Rmaps in the dataset are aligned. Then the Rmaps are clustered according to the location of their alignments to the reference, and each cluster is assembled locally. The consensus maps from these assemblies give rise to the reference map for the next step.
The algorithm emerged from the following reasoning. Within species, we expect a high degree of genomic conservation punctuated by structural variants that are commonly spanned by long Rmaps. Consequently, in a region of the genome where the differences between the target genome and the reference are minor, Rmaps tend to align (because the dynamic programming scoring function is designed to tolerate optical mapping errors and minor differences are typically in the domain of that error model). The data, then, can drive the approximation in the right direction within that region by assembling the Rmaps that align there. In a non-conserved region of the genome, Rmaps tend not to align to the
The quality of an assembly is a function of the extent of the coverage of the genome by the consensus maps and of the accuracy of the consensus maps. That accuracy, in turn, is a function of the depth of the Rmap coverage within the contigs. To achieve a high quality assembly, we stringently control the input to and output from the local assembly phase at each step in the iteration. The goal of these controls is to correctly place Rmaps within the contigs, even at the expense of diminishing the depth or extent of coverage because an incorrectly placed Rmap could introduce errors in the consensus. In subsequent steps of the iteration, these errors could be reinforced and propagated.
We implemented the controls as follows for the experiment described in Section 3.1:
• Input to local assembly ○ Pairwise alignment threshold: We used two regression cutoffs with nominal specificities 99.9% and 99.0%. An alignment was accepted if the score exceeded (9.711 − 0.216 • Output from local assembly ○ Minimum number of Rmaps in a contig: Contigs must be supported by at least five Rmaps in order to be propagated to the next iteration. ○ Consensus map trimming: The ends of consensus maps were trimmed so that the first and last fragments were supported by at least four Rmaps.
5.3. Simulation model
Rmaps were simulated from the
Origins were selected uniformly from the reference genome. The total length of the Rmaps followed a left-truncated exponential distribution with a minimum size of 300 kb and average size of 440 kb. Small fragments are rarer in Rmaps than in the reference. To model this, fragments less than 400 bp were merged with neighboring fragments, and remaining fragments were dropped with probability 1 −
reflects the image processing error in measuring length and
is a Rmap-specific “rubber-banding” factor that reflects local uncertainty in the estimated scale factor. Whether true cut sites are identified (success) or not (failure) is modeled as independent Bernoulli trials, with probability
Footnotes
Acknowledgments
We thank Christina Kendziorski, Michael Waterman, and Anton Valouev for helpful discussions. This work was supported in part by funding from the National Institutes of Health grants R01-CA64364 to M.A.N. and R01-HG00225 to D.C.S.
Disclosure Statement
No competing financial interests exist.
