Abstract
INTRODUCTION
Huntington’s disease (HD) is caused by the expansion of a CAG repeat in exon one of the
Indeed, MPS has also recently proven useful for the detection of microsatellite somatic mosaicism in tumour samples [28, 29]. At high read-depth, MPS data should allow precise and accurate quantification of somatic repeat length variants, whilst also providing information about genetic variants within and around the repeat. However, massively parallel sequencing is not commonly used to sequence, size and quantify the somatic mosaicism associated with trinucleotide expansions. The main reason for this is that sequencing reads generated by commonly used MPS platforms (<150 nt) are too short for the accurate sizing of the repeat, which requires repeat-spanning sequencing reads. However, long-read sequencing platforms have recently been successfully applied to the sequencing of trinucleotide repeat expansions. Using PacBio single-molecule real-time (SMRT) long-read sequencing, Cumming et al. [30] have demonstrated the usefulness of bulk-PCR sequencing to sequence myotonic dystrophy type 1 (DM1)-associated
Although MPS approaches allow some of the limitations of electrophoresis to be overcome by providing information about genetic variants within and around the repeat, they might be limited in their ability to detect very large repeat expansions. However, no formal comparison of electrophoresis and MPS approaches for the quantification of somatic mosaicism of trinucleotide expansions has previously been presented. In this study, we have applied bulk-PCR sequencing approaches using Illumina MiSeq and PacBio SMRT long-read sequencing to assess their usefulness for the estimation of the modal allele, and the quantification of somatic expansions, associated with small (∼50 to ∼100 CAG) and large (∼100 to ∼500 CAG)
MATERIALS AND METHODS
Samples
In order to evaluate the ability of bulk-PCR sequencing approaches to sequence different numbers of CAGs, we used previously described [33] DNA samples from an allelic series of R6/2 transgenic mice [34] carrying a single copy of a human
Characteristics of the R6/2 DNA samples analysed
One row corresponds to one mouse analysed. *: previously estimated by capillary electrophoresis [33]. +: sample analysed. –: sample not available.
MiSeq library preparation for four R6/2 mice inheriting HTT transgenes with ∼55 or ∼110 CAGs
The
Preparation of bulk-PCR products for PacBio SMRT sequencing
Bulk-PCR products of the
Samples were processed in two batches, depending on the number of CAGs in the modal allele, with the aim of sequencing the two batches on two separate PacBio RSII SMRT cells. This precaution was taken because fragment loading on PacBio RSII SMRT cells was known to be biased towards the loading of smaller molecules. The first batch corresponded to the samples with ∼55 and ∼110 CAGs, and the second batch corresponded to samples with ∼255 and ∼470 CAGs. One DNA sample, from the tail at weaning of the 20-week-old mouse with 258 CAGs (Table 1), was included in both batches to allow evaluation of potential inter-SMRT cell heterogeneity in sequencing quality.
The first batch of PCR products corresponded to the R6/2 samples with the shorter modal alleles (four or five tissues for each of the four mice with ∼55 or ∼110 CAGs, Table 1). For this first batch, five PCRs were performed (as described above) for each sample (i.e., tissue from a particular mouse). After amplification, the PCR products were pooled to obtain one pool of ∼70
The second batch of PCR products corresponded to the R6/2 samples with the longer modal alleles (four or five tissues for each of the four mice with ∼255 and ∼470 CAGs, Table 1). Five PCRs were performed (as described above) for each sample (i.e., tissue from a particular mouse) with ∼255 CAGs. After amplification, the PCR products were pooled to obtain one pool of ∼70
Eight to 60 PCRs were performed (as described above) for each sample with ∼470 CAGs depending on PCR yield (i.e., more PCRs were performed for samples associated with lower PCR yield). After amplification, the PCR products were pooled to obtain one pool per sample. Each of these pools was purified using a 0.6X AMPure® XP clean-up procedure [38] with a final elution volume equal to 1/2 the volume of beads used. This lower amount of 0.6X AMPure® XP beads, relative to the amount used for the samples with ∼55, ∼110 and ∼255 CAGs, was used for the samples with ∼470 CAGs in an attempt to remove smaller fragments that would, if present, be preferentially sequenced. The quality and quantity of each of these PCR product pools were then assessed on a Bioanalyzer (Agilent) as described above. The pools of ∼470 CAGs PCR products were then combined with the pool of ∼255 CAGs PCR products to form an equimolar pool (same number of molecules per sample based on Bioanalyzer-estimated molarity). The equimolar pool was then concentrated using a 1.6X AMPure® XP clean-up procedure to obtain a solution containing ≥500 ng of PCR product at ≥13 ng
The 500 ng PCR product pools from batches one and two were sent separately to the Earlham Institute (Norwich, UK) for PacBio RSII library preparation and sequencing on one SMRT cell per batch. Magbead loading, 150,000 zero-mode waveguides per SMRT cell and the C4-P6 chemistry were used for the PacBio RSII SMRT sequencing. Circular consensus sequencing (CCS) reads (the consensus sequence resulting from the alignment between subreads obtained from a single DNA molecule [40]) were produced from the raw PacBio subreads using the SMRT Portal RS_ReadsOfInsert protocol (settings used: Minimum Full Passes = 2; Minimum Predicted Accuracy = 90%; Minimum Length of Reads of Insert (In Bases) = 500; Maximum Length of Reads of Insert (In Bases) = 9,000). Demultiplexing of the PacBio reads was carried out as part of the same protocol (Minimum Barcode Score = 23, which is equivalent to 99.5% calling accuracy [39]) to obtain a fastq file containing CCS reads for each sample.
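As a rough check on the thresholds quoted above, the sketch below converts between a score and an accuracy, assuming that the barcode score is Phred-scaled (Q = −10 log10 of the error probability). That scaling is an assumption made here: it is consistent with the stated equivalence between a score of 23 and ∼99.5% calling accuracy, but the function names are illustrative and are not part of SMRT Portal.

```python
import math


def phred_to_accuracy(score: float) -> float:
    """Convert a Phred-scaled score Q to an accuracy, 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-score / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert an accuracy (0 <= accuracy < 1) to a Phred-scaled score."""
    return -10.0 * math.log10(1.0 - accuracy)


# Minimum Barcode Score used for demultiplexing (assumed Phred-scaled).
barcode_accuracy = phred_to_accuracy(23)

# Minimum Predicted Accuracy used for CCS read generation, as a score.
ccs_min_score = accuracy_to_phred(0.90)

print(f"barcode score 23 -> accuracy {barcode_accuracy:.4f}")
print(f"accuracy 0.90 -> score {ccs_min_score:.1f}")
```

Under this assumption, a barcode score of 23 corresponds to an accuracy of about 0.995, and the 90% minimum predicted accuracy corresponds to a score of about 10.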
Estimation of the percentage of on-target and full-length reads for each experiment
The percentage of on-target and full-length reads for each experiment (i.e., a particular number of CAGs sequenced on a particular sequencing platform) was estimated for one representative sample per experiment (the cerebellum of the older mice) by sequentially aligning the sequencing reads to reference sequences corresponding to the 5’-flank plus CAGs, the 3’-flank plus CAGs, or only to a CAG repeat (see Supplementary File 1 for more details). Reads aligned to both flanks were considered full-length. Reads aligned to only one of the flanks, or only to the pure CAG repeat reference sequence, were considered on-target but not full-length. Reads that did not align to either flank or to the pure CAG repeat reference sequence were considered off-target. Assuming that off-target reads would most likely arise from non-specific PCR of mouse DNA, the most likely source of the off-target reads was determined using Blastn [41] against all
Genotyping of HTT alleles by aligning the sequencing reads to synthetic reference sequences
The sequencing reads obtained were processed on the Galaxy instance of the University of Glasgow (https://heighliner.cvr.gla.ac.uk) [43] using an alignment-based approach. Before alignment, Illumina sequencing adapters were trimmed from the 3’-end of the single-end forward (R1) MiSeq reads. Both types of reads (single-end forward (R1) MiSeq reads or PacBio CCS reads) were then aligned using BWA-MEM [44] to multiple synthetic reference sequences each containing a different number of CAGs. To facilitate alignment of each sequencing read to the reference sequence with the same number of CAG repeats, BWA-MEM alignment parameters were modified to use a mismatch cost markedly lower than gap-related costs [45]. This gives greater weight to the alignment of each read to the reference sequence containing the most similar number of CAGs and less weight to base-base mismatches not related to CAG length variation. The default BWA-MEM parameters were used, except for three parameters that were set as follows: penalty for a mismatch = 1; gap open penalties = 2,2; and gap extension penalties = 2,2. Synthetic reference sequences were designed to include sequences flanking the
PCR-capillary electrophoresis for the two R6/2 mice carrying ∼55 HTT CAG repeats
The
Small pool-PCR
Previous small-pool PCR (SP-PCR) experiments using the striatum from the 117-week-old R6/2 mouse with ∼55 CAGs showed that a small percentage of somatic CAG expansions are very large (>90 CAGs) [33]. It is not clear if these very large somatic CAG expansions can also be detected by capillary electrophoresis or either of the parallel-sequencing approaches in combination with bulk-PCR. To investigate this, we used SP-PCR to quantify these very large CAG expansions in the striatum sample of the R6/2 mouse with ∼55 CAGs. The SP-PCR quantification involved a combination of single-molecule PCRs (to derive the overall length distribution and precisely quantify the amount of input DNA) and PCRs using higher template concentrations (to estimate the frequency of the rarer large repeat length increases) [48]. A concentration range experiment between 5 and 50 pg of template DNA per PCR was first conducted to establish the correct quantity of template DNA to achieve single-molecule PCRs and 17.5 pg per PCR was selected. Overall, 288 PCRs with 17.5 pg (single molecule level), 132 with 150 pg, and 44 with 250 pg of genomic DNA template as starting material, were performed as previously described [33]. The PCR products obtained were resolved by agarose gel electrophoresis, Southern blotted and hybridised as previously described [33]. Individual bands (>250 CAG) were identified and sized by comparing against the 1 Kb Plus DNA Ladder (Invitrogen) using the CLIQS 1D gel analysis software (TotalLabs, UK). Assuming the number of bands is proportional to the amount of template, the 17.5 pg data were used to calculate expected bands/lane at 150 pg under assumptions of a Poisson distribution [48] and the frequency of very long expansions in the 150 pg PCR products was determined.
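The Poisson reasoning described above can be sketched as follows. This is a minimal illustration rather than the published analysis [48]: it assumes that the mean number of amplifiable molecules per reaction is estimated from the fraction of reactions yielding no product, and every count used in the example is a hypothetical placeholder, not data from this study.

```python
import math


def molecules_per_reaction(n_reactions: int, n_negative: int) -> float:
    """Poisson estimate of the mean number of amplifiable molecules per
    reaction, from the fraction of reactions with no product:
    P(0) = exp(-lambda), hence lambda = -ln(n_negative / n_reactions)."""
    if not 0 < n_negative <= n_reactions:
        raise ValueError("need 0 < n_negative <= n_reactions")
    return -math.log(n_negative / n_reactions)


def scale_to_input(lam: float, pg_from: float, pg_to: float) -> float:
    """Scale molecules per reaction linearly with the amount of template."""
    return lam * pg_to / pg_from


def expansion_frequency(n_large_bands: int, lam: float, n_reactions: int) -> float:
    """Large-expansion bands divided by the expected number of input molecules."""
    return n_large_bands / (lam * n_reactions)


# Hypothetical counts, for illustration only.
lam_low = molecules_per_reaction(n_reactions=288, n_negative=106)
lam_high = scale_to_input(lam_low, pg_from=17.5, pg_to=150.0)
freq = expansion_frequency(n_large_bands=7, lam=lam_high, n_reactions=132)

print(f"molecules per 17.5 pg reaction: {lam_low:.2f}")
print(f"expected molecules per 150 pg reaction: {lam_high:.2f}")
print(f"frequency of large expansions: {freq:.4%}")
```

The linear scaling step encodes the stated assumption that the number of bands is proportional to the amount of template.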
Data availability
The sequencing reads for this study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB41395 (https://www.ebi.ac.uk/ena/browser/view/PRJEB41395).
RESULTS
Qualitative assessment of the usefulness of MiSeq and PacBio SMRT sequencing to quantify somatic mosaicism in R6/2 mice with ∼55 CAGs and comparison with capillary electrophoresis
For each organ analysed, regardless of the method used to size the number of CAGs in the PCR products (capillary electrophoresis, MiSeq or PacBio sequencing), we could identify a mode of the CAG frequency distribution at ∼55 CAGs (Fig. 1). These estimates are in the same range as those previously estimated by SP-PCR Southern blot and bulk-PCR capillary electrophoresis (Table 1) [33]. The percentage of MiSeq reads that uniquely aligned to a synthetic reference sequence (i.e., reads not discarded post alignment) was very high for both young and old mice with ∼55 CAGs (96.5% and 96.15% on average for the 6-week-old and the 117-week-old mouse respectively). These percentages were slightly lower for both mice for the PacBio CCS reads (80.91% and 81.20% on average for the 6-week-old and the 117-week-old mouse respectively). These lower percentages for the PacBio CCS reads might be explained by the fact that sequencing errors are more frequent in PacBio CCS reads (Fig. 2C and Supplementary Figure 1) than MiSeq reads (Fig. 2A and Supplementary Figure 1), and the fact that more reference sequences were considered for the PacBio CCS reads (200 and 123 reference sequences were considered for the PacBio CCS and MiSeq reads respectively – see above). Indeed, a higher number of sequencing errors and of reference sequences considered both increase the likelihood for a sequencing read to be aligned equally well to two or more reference sequences and therefore to be discarded post-alignment. It was very clear from all the MiSeq forward alignments, as well as from all the PacBio CCS reads alignments, that all the mice with ∼55 CAGs carried an

Qualitative assessment of somatic mosaicism comparing CAG frequency distributions obtained by capillary electrophoresis, MiSeq or PacBio SMRT sequencing of bulk-PCR products obtained for different tissues of one 6-week-old and one 117-week-old R6/2 mouse with ∼55 CAGs. Capillary electrophoresis data in black, MiSeq sequencing data in white and PacBio SMRT sequencing data in grey.

Representative sequence alignments of the 400 nt MiSeq reads (A and B), PacBio CCS reads (C and D) and PacBio subreads (E and F) uniquely aligned (i.e., reads not discarded post alignment) to a synthetic reference sequence with 115 CAGs. Alignments shown correspond to 30 sequencing reads obtained from the tail at weaning of the 20-week-old mouse with ∼110 CAGs. The part of the alignment shown corresponds to the four nucleotides in the immediate 5’–flank of the
Comparison of three bulk-PCR approaches with SP-PCR on the 117-week-old striatum sample with ∼55 CAGs
As previously shown [33], SP-PCR revealed a high frequency of somatic expansions in the striatum of the 117-week-old R6/2 mouse with ∼55 CAGs (Fig. 3A). Most of these somatic expansions can be seen on the autoradiographs as bands between 55 CAGs (i.e., size of the modal allele) and 70 CAGs (Fig. 3A). SP-PCR also detected frequent somatic expansions with 70 to 80 CAGs, and rarer somatic expansions ≥80 CAGs (Fig. 3A). The frequency of somatic expansions ≥70 CAGs detected by SP-PCR was estimated by genotyping ∼1,300 molecules across 464 SP-PCRs. This allowed a quantitative comparison between SP-PCR and the three bulk-PCR approaches presented here (capillary electrophoresis, MiSeq and PacBio SMRT) that demonstrates that large expansions are better detected by SP-PCR. The percentage of somatic expansions with 70 to 80 CAGs was similar for the three bulk-PCR approaches (∼4.5%, Figs. 1 and 3B) and lower than the one estimated by SP-PCR (∼6%, Fig. 3B). Similar percentages of somatic expansions with 80 to 89 CAGs were detected by bulk or SP-PCR (∼1%, Fig. 3B). No somatic expansions ≥90 CAGs were detected by bulk-PCR PacBio SMRT sequencing (Fig. 3B, C). The percentage of somatic expansions with 90 to 99 CAGs estimated by SP-PCR (0.38%, Fig. 3B, C) was one order of magnitude higher than that estimated by bulk-PCR capillary electrophoresis and MiSeq (0.02% and 0.04% respectively, Fig. 3B, C). No somatic expansions ≥100 CAGs were detected by capillary electrophoresis and the percentage of such somatic expansions estimated by SP-PCR (0.61%, Fig. 3B, C) was one order of magnitude higher than that estimated by bulk-PCR MiSeq (0.04%, Fig. 3B). Read depth must be considered when directly comparing the results obtained by bulk-PCR PacBio SMRT sequencing and bulk-PCR MiSeq sequencing. For example, the much lower number of PacBio CCS reads obtained for the samples with ∼55 CAGs (Fig. 1) is a very likely explanation why somatic expansions with >90 CAGs could be detected with 23,064 MiSeq reads for the striatum with 55 CAGs, but not with 444 CCS PacBio reads (Fig. 3B, C).
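The read-depth argument above can be made concrete with a simple sampling calculation. The sketch below assumes that reads are independent draws from the PCR product pool and that the fraction of products with ≥90 CAGs is the same for both platforms. It takes that fraction to be the sum of the two MiSeq estimates quoted above (0.04% for 90 to 99 CAGs and 0.04% for ≥100 CAGs). It ignores any platform-specific loading or alignment bias, so it is an illustration rather than a model of either platform.

```python
def expected_reads(fraction: float, n_reads: int) -> float:
    """Expected number of reads carrying the variant of interest."""
    return fraction * n_reads


def prob_at_least_one(fraction: float, n_reads: int) -> float:
    """Probability of observing at least one such read (binomial sampling)."""
    return 1.0 - (1.0 - fraction) ** n_reads


# Assumed fraction of PCR products with >=90 CAGs (sum of the MiSeq estimates).
fraction_large = 0.0004 + 0.0004

for platform, depth in (("PacBio CCS", 444), ("MiSeq", 23_064)):
    mean = expected_reads(fraction_large, depth)
    p_detect = prob_at_least_one(fraction_large, depth)
    print(f"{platform}: {depth} reads, expected {mean:.2f}, P(>=1) = {p_detect:.3f}")
```

Under these assumptions, fewer than one read with ≥90 CAGs is expected among 444 CCS reads, with roughly a 30% chance of observing at least one, whereas about 18 such reads are expected among 23,064 MiSeq reads.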

SP-PCR can detect very large
Qualitative assessment of the usefulness of MiSeq and PacBio SMRT sequencing to quantify somatic mosaicism in R6/2 mice with ∼110 CAGs
The percentage of MiSeq reads uniquely aligned to a synthetic reference sequence (i.e., reads not discarded post alignment) for samples from mice with ∼110 CAGs was much lower (24.99% and 22.10% on average for the 4-week-old and the 20-week-old mouse respectively) than that observed for the mice with ∼55 repeats. This is most likely due to the high frequency of sequencing errors at the end of the CAG repeat in 400 nt MiSeq reads containing ∼110 CAGs (Fig. 2B). These sequencing errors are probably caused by the fact that the base calling accuracy drops sharply at the end of the MiSeq reads, with the sharp drop in base calling accuracy happening downstream of the CAG repeat for reads with <60 CAGs and within the end of the CAG repeat for reads with ≥60 CAGs (Supplementary Figure 3). The percentage of uniquely aligned PacBio CCS reads (i.e., reads not discarded post alignment) was much higher than that of MiSeq reads and similar for samples from both mice with ∼110 CAGs (71.77% and 66.50% on average for the 4-week-old and the 20-week-old mouse respectively for the PacBio CCS reads). The CAG frequency distribution obtained with PacBio CCS reads for the liver of the 20-week-old mouse was bimodal with a mode at ∼117 CAGs (like the progenitor allele identified in the tail at weaning) and a mode at ∼130 CAGs (Fig. 4). We could identify a mode of the CAG frequency distributions between 110 and 120 CAGs for all the other organs analysed with both MiSeq and PacBio data (Fig. 4). These estimates are in the same range as the ones previously estimated by SP-PCR Southern blot and bulk-PCR capillary electrophoresis (Table 1) [33]. However, it must be noted that the mode of the CAG distributions obtained using the MiSeq reads was ∼5 CAGs smaller than the ones obtained using the PacBio CCS reads.
Given the amount of somatic mosaicism in the samples analysed (as illustrated by the CAG frequency distributions obtained using the PacBio CCS reads), we would have expected a large proportion of the MiSeq reads to align to reference sequences with ≥120 CAGs. In particular, all reads containing ≥123 CAGs should have aligned to the reference sequence containing 123 CAGs, the theoretical maximum number of CAGs that could have been sequenced with the PCR primer pair used and a 400 nt MiSeq read. However, only a small proportion of the MiSeq reads aligned to reference sequences with ≥120 CAGs. Together with the low percentage of MiSeq reads aligned (∼23%) and the high frequency of sequencing errors at the end of the 400 nt MiSeq reads, this illustrates that the CAG length frequency distributions obtained for the samples from mice with ∼110 CAGs using MiSeq (Fig. 4) cannot be relied upon for the estimation of modal allele sizes or the quantification of somatic expansions. The maximum number of CAGs that can reliably be sequenced using 400 nt MiSeq reads probably lies at ∼115 CAGs.
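The 123-CAG ceiling quoted above follows from simple read-length arithmetic. The sketch below assumes that a read only has to cover the sequence upstream of the repeat (primer plus 5’-flank) and the repeat itself, with no requirement to reach the 3’-flank. The upstream length is not stated here, so the code simply asks which upstream lengths are compatible with a ceiling of 123 CAGs in a 400 nt read.

```python
def max_repeats(read_length: int, upstream_nt: int, unit: int = 3) -> int:
    """Largest whole number of repeat units that fits in a read after the
    upstream (primer plus 5'-flank) sequence has been read."""
    return (read_length - upstream_nt) // unit


READ_LENGTH = 400  # nt, MiSeq forward read length used in this study
CEILING = 123      # CAGs, theoretical maximum quoted in the text

# Upstream lengths (nt) that reproduce the quoted ceiling under this assumption.
compatible = [n for n in range(READ_LENGTH) if max_repeats(READ_LENGTH, n) == CEILING]
print(compatible)
```

Under this assumption the ceiling corresponds to 29 to 31 nt of non-repeat sequence at the start of the read, leaving 369 nt for the repeat.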

CAG frequency distributions obtained by MiSeq or PacBio SMRT sequencing of bulk-PCR products obtained for different tissues of one 4-week-old and one 20-week-old R6/2 mouse with ∼110 CAGs. MiSeq sequencing data in white and PacBio SMRT sequencing data in grey. The dotted line on the MiSeq sequencing data panels indicates 123 CAGs, which is the theoretical maximum number of CAGs that could have been sequenced using the PCR primer pair (31329/33934) and a 400 nt MiSeq read.
It was very clear from all the PacBio CCS reads alignments that all mice with ∼110 CAGs carried a typical
As previously illustrated for the mice with ∼55 CAGs, a qualitative assessment of somatic mosaicism in the different tissues of the two mice analysed should be possible by comparing the CAG frequency distributions obtained. The CAG frequency distributions obtained for different tissues in the 20-week-old mouse should be interpreted relative to the progenitor allele which corresponds to the modal number of CAGs in the tail at weaning (Fig. 4). The CAG frequency distributions obtained with the PacBio CCS reads reflect the expected age-dependent and tissue-specific nature of
Qualitative assessment of the usefulness of PacBio sequencing to quantify somatic mosaicism in R6/2 mice with ∼255 CAGs
PCR products for one sample, the cerebellum of the 6-week-old mouse with ∼255 CAGs, were generated and sequenced independently in each of the two PacBio RSII runs performed. For that sample, the percentage of PacBio CCS reads that could be aligned to synthetic reference sequences was higher in PacBio run one than in PacBio run two (28.45% and 4.96%, respectively). This is likely a consequence of the fact that the sequencing quality from the second PacBio run was lower (Supplementary File 2). The percentage of PacBio CCS reads aligned to a synthetic reference sequence was similar for samples from both mice with ∼255 CAGs (4.76% and 4.65% on average for the 6-week-old and the 20-week-old mouse respectively). No clear mode could be identified in the CAG frequency distribution obtained for the liver of the 20-week-old mouse. For all the other organs analysed, we could identify a mode of the CAG frequency distribution of ∼270 CAGs (Fig. 5). These estimates are in the same range as the ones previously estimated by SP-PCR Southern blot and bulk-PCR capillary electrophoresis (Table 1) [33]. It was very clear from all the PacBio CCS read alignments that both mice with ∼255 CAGs carried a typical

CAG frequency distributions obtained by PacBio SMRT sequencing of bulk-PCR products obtained for different tissues of one 6-week-old and one 20-week-old R6/2 mouse with ∼255 CAGs. The tail at weaning data for the 20-week-old mouse is not shown because only two reads with 266 and 274 CAGs were obtained post-alignment and post-discard.
Qualitative assessment of the usefulness of PacBio sequencing to quantify somatic mosaicism in R6/2 mice with ∼470 CAGs
The percentage of PacBio CCS reads aligned to a synthetic reference sequence was similar for samples from both mice with ∼470 CAGs (5.85% and 7.07% on average for the 6-week-old and the 116-week-old mouse respectively). Strikingly, most of the aligned PacBio CCS reads contained between 100 and 200 CAGs (Fig. 6). This is in stark contrast with the CAG repeat distributions previously obtained on the same DNA samples by both SP-PCR and bulk-PCR capillary electrophoresis, which had modes at ∼500 CAGs (Table 1) [33]. To investigate if this distribution was caused by the data processing post-sequencing, we compared the length distribution of the PCR products before sequencing with the length distribution of the unprocessed PacBio subreads and CCS reads. This revealed that most of the subreads and CCS reads were much shorter than expected based on the size of the PCR products produced for PacBio SMRT sequencing of these samples. Indeed, the estimated modal allele length of the PCR products was ∼577 CAGs (Supplementary Figure 4A) while the mode of the subreads and CCS read length distribution was between 150 and 200 CAGs (Supplementary Figure 4B, C). The skew towards lower (<300) CAG lengths observed in the frequency distributions obtained from the alignment of CCS reads (Fig. 6) is thus not caused by the data processing post-sequencing, but corresponds to loading and/or sequencing bias towards smaller fragments on the PacBio RSII sequencing platform. It must be noted, however, that some PacBio CCS reads contained >450 CAGs (Fig. 6). To confirm that PacBio SMRT sequencing was indeed useful to sequence

CAG frequency distributions obtained by PacBio SMRT sequencing of bulk-PCR products obtained for different tissues of one 6-week-old and one 116-week-old R6/2 mouse with ∼470 CAGs.
The PacBio sequencing data illustrates the effect of the number of inherited CAG repeats on somatic expansion
Somatic mosaicism of the
DISCUSSION
Bulk-PCR sequencing
In the present study, we have applied bulk-PCR sequencing approaches to sequence
It is unclear whether larger CAG repeat length and/or inter-run variability inherent to the PacBio technology is/are responsible for the lower sequencing quality associated with PacBio run 2 relative to PacBio run 1 (Supplementary File 2). This difference in sequencing quality between the two runs warrants the future use of a sequencing control, similar to the Illumina PhiX control [50], for PacBio sequencing. Nevertheless, we have shown that the PacBio platform can be used to sequence
Comparison of three bulk-PCR approaches with SP-PCR
The CAG frequency distributions obtained using bulk-PCR followed by capillary electrophoresis or MiSeq sequencing for the samples with ∼55 CAGs were very similar for the smaller and more frequent somatic expansions (<85 CAGs, Fig. 1). However, compared to the SP-PCR data, large expansions >90 CAG repeats were underrepresented using the capillary electrophoresis and MiSeq approaches, and undetectable in the PacBio data (Fig. 3). A major factor in driving this disparity is the reduced PCR efficiency in amplifying large alleles compared to smaller ones. This yields a relatively lower number of end-products per input molecule for larger alleles. In SP-PCR, the reduced amplification of larger alleles is at least partially compensated for by greater hybridisation efficiency to a repeat unit probe, and by the spatial resolution offered by the low number of input molecules, multiple reactions and gel electrophoresis. That means that the products of single input molecules can still be readily detected by SP-PCR independent of their size and relative amplification efficiency (at least up to ∼1,000 repeats). In the capillary electrophoresis approach, each PCR product contains only a single fluorescent moiety incorporated into one of the primers independent of the size of the molecule. Thus, large fragments that amplify less efficiently yield a lower signal. When such larger alleles are relatively rare, the signal from such molecules becomes lost in the inevitable background fluorescence observed using this approach. The sequencing-based approaches do not yield any inherent background and, as in the MiSeq data, with high enough read depth (typically ∼45,000 sequencing reads per sample), rare large expansions can be detected. They are, however, detected at a lower frequency than their relative frequency among input molecules, owing to the amplification bias and probably also to a higher frequency of sequencing errors, which likely reduces alignment efficiency. It seems reasonable to assume that such large rare expansions would be similarly detectable with PacBio, assuming read depth was high enough.
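The effect of amplification bias described above can be illustrated with a toy exponential model of PCR, in which each input molecule yields (1 + E)^n products after n cycles at per-cycle efficiency E. The efficiencies and cycle number below are hypothetical values chosen only for illustration. They were not measured in this study, and the model ignores plateau effects, PCR slippage and sequencing or alignment losses.

```python
def relative_yield(eff_large: float, eff_small: float, cycles: int) -> float:
    """Products per input molecule for a large allele relative to a small one,
    under exponential amplification with per-cycle efficiencies in [0, 1]."""
    return ((1.0 + eff_large) / (1.0 + eff_small)) ** cycles


def observed_fraction(true_fraction: float, rel_yield: float) -> float:
    """Fraction of end-products derived from large alleles, given their
    fraction among input molecules and their relative yield."""
    large = true_fraction * rel_yield
    return large / (large + (1.0 - true_fraction))


# Hypothetical parameters, for illustration only.
ratio = relative_yield(eff_large=0.80, eff_small=0.95, cycles=28)
observed = observed_fraction(true_fraction=0.0038, rel_yield=ratio)

print(f"relative yield of large alleles: {ratio:.3f}")
print(f"observed fraction of large alleles: {observed:.4%}")
```

With these illustrative values, a modest difference in per-cycle efficiency compounds into a roughly tenfold under-representation of the larger allele among the end-products, the same order of magnitude as the difference reported between SP-PCR and the bulk-PCR approaches.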
PacBio sequencing produced similar modal allele estimates to the ones obtained with MiSeq and capillary electrophoresis but showed wider CAG length frequency distributions. This may be due to the high frequency of indels in the PacBio subreads which may, in turn, lead to an inaccurate number of CAGs in the CCS reads. Previous PacBio analyses of a DM1 patient carrying a single variant CCG repeat within their expanded CTG array revealed that this variant was not detected in 17% of reads, suggesting a possible issue with CCS read generation [30]. Additional analyses are required to establish the absolute utility of the CCS pipeline in handling simple sequence repeats correctly. Nonetheless, like MiSeq and capillary electrophoresis, PacBio sequencing can detect small frequent somatic expansions and capture some of the age-dependent and tissue-specific nature of the somatic mosaicism of the
Main characteristics of different methods for the preparation of libraries for MiSeq and PacBio for the sequencing of CAG repeats and the quantification of somatic mosaicism
Observations described in this study are indicated in bold, the other information corresponds to expectations for approaches not used in this study based on observations described in this study (S), on what has been described on other trinucleotide loci in the literature (L) and/or on the manufacturer information available for each sequencing platform (M). *: Max modal allele size for which the modal allele size can be estimated by sequencing and for which somatic mosaicism will reliably be quantified. **: based on [63]. †: assuming a minimum of ∼5,000 reads per sample and a maximum of 384 samples per sequencing run. ‡: assuming 20 SP-PCRs per sample and 250 reads per SP-PCRs. #: assuming ∼20 reads per single molecule and the genotyping of 5,000 single molecules.
The three bulk-PCR approaches (capillary electrophoresis, MiSeq and PacBio SMRT) share the same bulk-PCR pitfalls, i.e., lack of detection of very large and rare expansions that can be detected by SP-PCR (Fig. 3). The use of bulk-PCR makes the sequencing library preparation robust and straightforward, which allows the processing of a high volume of samples. This is particularly true for the MiSeq platform, for which the number of reads produced is much higher than on the PacBio platforms (Table 2). However, bulk-PCR artefacts make it difficult to estimate the size of larger progenitor alleles, hampering our ability to quantify somatic expansions very accurately and making it very difficult to quantify somatic contractions (Table 2).
Future directions in the field for the use of parallel and single-molecule sequencing
Although strongly biased towards expansion, somatic instability is thought to involve a combination of small frequent expansions and contractions [54]. Contractions of the
Some of the limitations of bulk-PCR may be overcome by using recombinase polymerase amplification, an isothermal replacement to PCR that has been shown to produce fewer
Such an amplification-free approach for sequencing of repeat-expansion loci has been developed by Tsai et al. [57]. This approach, named “no-amp targeted sequencing”, utilises the capture and enrichment of the region(s) of interest using the CRISPR/Cas9 system (Fig. 7A). In trinucleotide expansion studies, no-amp targeted sequencing has so far been used in combination with PacBio or Oxford Nanopore Technologies (ONT) long-read single-molecule sequencing [31, 57–59]. This is because the size of the repeat expansion of interest was expected to be very large [31, 59] but also because the captured fragments are typically several kilobases long [49, 57]. Hafford-Tear et al. [31] have demonstrated that no-amp targeted PacBio SMRT sequencing can capture somatic mosaicism by showing that the variance in the number of repeats increases with the modal number of CAG•CTG in the third intron of
No-amp targeted ONT (Oxford Nanopore Technologies) sequencing has so far only been applied to cell line DNA to sequence the
The no-amp targeted long-read sequencing approaches published so far [31, 57–59] are of great utility to sequence and size very large repeat expansions (≫3,000 repeats at some loci in some tissues) that are refractory to PCR and can only be assessed by relatively crude Southern blot analysis of restriction digested genomic DNA, or that can only be detected by SP-PCR. However, they have several major limitations that make them unsuitable, in their current form, for high-throughput analysis of low levels of variation and for modifier studies that require the analysis of large cohorts. Indeed, they require micrograms of DNA (Table 2) and produce low numbers of reads (∼5,000 on-target PacBio CCS reads per RSII SMRT cell [49] and <1,000 on-target ONT reads per MinION flow cell [58]) at a high cost: assuming a best-case scenario of 5% on-target reads and the production of 4,000,000 PacBio CCS reads on one Sequel II SMRT cell, the sequencing cost would be ∼$100 per sample (Table 2) [63] with a requirement of 5,000 reads per sample to quantify somatic mosaicism. MiSeq sequencing-based methods, which offer higher throughput, remain more cost-effective for routinely quantifying somatic mosaicism in primary tissue samples from individuals having inherited HD-causing alleles with <100 CAGs (sequencing cost would be ∼$8 per sample (Table 2) [63] if producing 40,000 MiSeq reads per sample and processing 380 samples per MiSeq run).
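The throughput figures behind this cost comparison can be worked through explicitly. The sketch below uses only the numbers quoted above. The per-run costs it prints are back-calculated here from the per-sample costs and are not stated in the text, so they should be read as implied values under these assumptions rather than as quoted prices.

```python
import math


def samples_per_run(total_reads: int, on_target_fraction: float,
                    reads_per_sample: int) -> int:
    """Number of samples one run can support at a given read requirement."""
    on_target_reads = round(total_reads * on_target_fraction)
    return math.floor(on_target_reads / reads_per_sample)


# No-amp targeted PacBio sequencing on one Sequel II SMRT cell (best case).
pacbio_samples = samples_per_run(4_000_000, 0.05, 5_000)
pacbio_implied_run_cost = 100 * pacbio_samples  # USD, back-calculated

# Bulk-PCR MiSeq sequencing.
miseq_samples = 380
miseq_reads_needed = 40_000 * miseq_samples
miseq_implied_run_cost = 8 * miseq_samples  # USD, back-calculated

print(f"PacBio: {pacbio_samples} samples per SMRT cell, "
      f"implied cost per cell ~${pacbio_implied_run_cost}")
print(f"MiSeq: {miseq_reads_needed} reads for {miseq_samples} samples, "
      f"implied cost per run ~${miseq_implied_run_cost}")
```

On these figures, one Sequel II SMRT cell would support about 40 samples at 5,000 on-target reads each, whereas one MiSeq run would cover 380 samples at eight times the per-sample read depth.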
Another potential route for quantifying somatic mosaicism that reduces some of the limitations of bulk-PCR is to use single-molecule DNA barcoding in combination with PCR. Through this approach, one can trace the sequence reads back to a single input molecule which allows the identification of PCR and sequencing errors. In the context of a repeated sequence, such an approach can be used to correct for PCR slippage and identify the repeat size in the original molecule (Fig. 7B) [64]. Besides, this approach should also, at least partially, correct for the PCR amplification efficiency problem that generates relatively fewer yields per input molecule for larger alleles. Several methods have been developed to achieve single-molecule barcoding, including molecular inversion probe (MIP) capture using probes with degenerate tags and one- to three-cycle PCR using barcoded primers at low concentrations [65, 66]. Such methods, combined with high-throughput sequencing, allow multiplexing across loci in large cohorts while providing single-molecule level sequence data. Single-molecule barcoding has been mostly explored in human tumour sequencing analysis and was shown to perform very well in detecting single-base somatic variants at an allele frequency of <0.5% [65, 66]. With a primary focus on single nucleotide variants, variation in microsatellite regions has been somewhat less well explored to date, but with some evidence showing that, for example, MIP capture is less efficient for repeat regions when compared to single base variation in multi-loci capture assays [67]. However, single-molecule barcoding by MIP capture was recently shown to be a sensitive method in detecting repeat length variants, including repeat contractions, across multiple microsatellite loci in

Method summary for somatic mosaicism quantification at the level of a single molecule in HD. A) Generalised schematics for CRISPR/Cas9-mediated targeted enrichment of
CONFLICT OF INTEREST
V.C.W. is a scientific advisory board member of Triplet Therapeutics, a company developing new therapeutic approaches to address triplet repeat disorders such as HD and myotonic dystrophy and of LoQus23 Therapeutics, and has provided paid consulting services to Alnylam. Her financial interests in Triplet Therapeutics were reviewed and are managed by Massachusetts General Hospital and Partners HealthCare in accordance with their conflict of interest policies. S.K. is employed by CHDI Management, Inc., as an advisor to the CHDI Foundation. D.G.M. has been a scientific consultant and/or received honoraria or stock options from Biogen Idec, AMO Pharma, Charles River, Vertex Pharmaceuticals, Triplet Therapeutics, LoQus23, and Small Molecule RNA and has had research contracts with AMO Pharma and Vertex Pharmaceuticals.
