Abstract
Introduction
New sequencing technologies and decreasing sequencing costs are leading to a rapid increase in the availability of DNA sequence data. This is true both within and across species, permitting increasing numbers of genome-wide scans for positive selection incorporating a more diverse range of species.1–4 A general aim of these analyses is to identify genes that evolved adaptively, either across a clade or on specific lineages, and to understand the biological processes targeted during periods of phenotypic change. The
The performance of these models can be strongly influenced by gene misannotation, alignment error, and sequence quality.6–11 Unfortunately, with the adoption of next-generation sequencing methods, 12 the likelihood of certain errors, including sequencing errors and misalignment caused by splicing variants, has vastly increased.13–15 The inclusion of alignments containing nonhomologous data caused by these effects can drastically inflate false-positive rates in PAML, and may also influence false-negative rates.6,7 Alignment quality is therefore of major importance in the accurate inference of positive selection.
The alignment program chosen can have a significant effect on the reliability of PAML analyses, with some, such as PRANK, 16 outperforming others (ClustalW, 17 MAFFT, 18 ProbCons, 19 and T-Coffee 20).6,7 Postalignment filtering provides an additional step to improve alignment quality and has been implemented in two main ways. Column-based programs, such as G-Blocks 21 or Noisy, 22 examine the degree of conservation at each position in the alignment, removing contiguous stretches of sequence that are not conserved across species. G-Blocks' original purpose was not to filter alignments for tests of positive selection, but instead to remove unreliable sequence data for phylogenetic studies. 21 As such, although this approach may have some benefits with low-quality alignments, the columnwise nature of the method can remove high proportions of data, greatly reducing power. 7 In addition, G-Blocks will fail to remove sequencing errors that affect just one species in a large multiple-sequence alignment: where a sequencing error is present at a site in one species, G-Blocks and other column-based filtering methods will not mask the data if that site is conserved across the rest of the alignment. Branch-specific analyses of evolutionary rates will therefore be vulnerable to this source of error.
An alternative approach is to use a measure of alignment confidence to filter the data set. Such measures can be obtained from some alignment programs16,20 or through additional programs such as GUIDANCE 23 or ALISCORE. 24 These filters can effectively reduce false positives when alignment confidence is low. However, for well-supported alignments, adding these filters provides little benefit beyond using the top-performing alignment program. 7
The merit of implementing existing filters is therefore open to debate. 7 This is particularly true when sequence divergence is low, leading to alignments with high confidence. In these cases, short stretches of sequencing errors or longer stretches of nonhomologous sequence caused, for example, by splicing variation or misannotation, can have an overly dominant effect on tests for positive selection. To this end, we have developed a Sliding Window Alignment Masker for PAML (SWAMP). This script provides an additional preprocessing step designed to mask these problematic sections of sequence.
Implementation
SWAMP analyses DNA sequences in a phylogenetic context, identifying regions with a high concentration of nonsynonymous substitutions along a branch, over a short sequence window. The method utilizes the summary of nonsynonymous codon substitutions along branches within a phylogeny obtained by running a one-ratio model (model = 0, NSsites = 0) in
Description of key SWAMP parameters.

Illustrative schematic of SWAMP masking. (A) Unmasked alignment with two potential data errors: short stretches of divergent sequence possibly caused by sequencing errors, and longer stretches possibly caused by exon splicing. (B) Application of two-step filtering masks data errors; a shorter window of 3 AA with a maximum of 2 substitutions effectively masks possible sequence error (orange), and a longer window of 15 AA with a maximum of 10 substitutions effectively masks longer alignment errors (blue). (C) The ‘Interscan’ option further masks the data if short stretches of sequence are surrounded by masked data and the sum of the masked data on either side of the sequence exceeds the length of the interceding unmasked data (purple).
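The windowed masking shown in the schematic can be illustrated with a minimal Python sketch. This is not the SWAMP source code: the function names, the use of "NNN" as the masking codon, and the rule that a window is masked once its substitution count reaches the threshold are all assumptions made for illustration.

```python
def mask_windows(subst_positions, seq_len, window, threshold):
    """Flag codons lying in any window with >= threshold substitutions.

    subst_positions: 0-based codon indices carrying a nonsynonymous
    substitution on the branch of interest (hypothetical input format).
    Returns a list of booleans, True where a codon should be masked.
    """
    hits = [False] * seq_len
    for p in subst_positions:
        hits[p] = True
    masked = [False] * seq_len
    # Slide the window one codon at a time along the sequence.
    for start in range(0, max(seq_len - window + 1, 1)):
        end = min(start + window, seq_len)
        # Assumed rule: mask the whole window once the count reaches the threshold.
        if sum(hits[start:end]) >= threshold:
            for i in range(start, end):
                masked[i] = True
    return masked


def apply_mask(codons, masked, mask_codon="NNN"):
    """Replace flagged codons with a masking codon (assumed here to be NNN)."""
    return [mask_codon if m else c for c, m in zip(codons, masked)]


# Two-step filtering as in the schematic: a short and a long window.
positions = {2, 3, 40, 42, 44, 45, 47, 48, 50, 51, 52, 53}
short_pass = mask_windows(positions, 60, window=3, threshold=2)
long_pass = mask_windows(positions, 60, window=15, threshold=10)
combined = [a or b for a, b in zip(short_pass, long_pass)]
```

Taking the union of the two passes mirrors the two-step scheme in the schematic, where the short window targets isolated sequencing errors and the long window targets extended nonhomologous stretches.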
A “branchnames” file, provided by the user, defines which row(s) of sequences (in most cases this will be species) will be affected by substitution patterns along each branch in the phylogeny. Generally, this relates species data to their ancestral lineages. The use of this “branchnames” file provides a further advantage in that a user may vary the masking parameters across specific branches or sequences, allowing different masking regimens to be applied to different parts of the phylogeny through multiple SWAMP iterations. This is achieved by the user listing only a subset of sequences and/or branches in the “branchnames” file in each of multiple iterations of SWAMP, while supplying different thresholds and window sizes for each run. An example of this is provided in the SWAMP documentation along with step-by-step instructions on how to implement the program. This branch-specific masking may be useful in a number of contexts, for example, if the data for one of a number of species is more likely to contain errors than others, where assembly quality varies, or if significant variation in branch length demands a flexible approach to alignment masking.
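The relationship between branches and the sequence rows they affect can be sketched as follows. The in-memory dictionary used here is a hypothetical stand-in and does not reflect the actual “branchnames” file format, which is described in the SWAMP documentation; the branch and species labels are likewise invented for illustration.

```python
def rows_to_mask(branch_to_rows, branch_hits):
    """Collect, per sequence row, the codon positions flagged on any
    branch that affects that row.

    branch_to_rows: dict mapping a branch label to the rows it affects
                    (hypothetical representation of the branchnames file).
    branch_hits:    dict mapping a branch label to the set of codon
                    positions flagged for masking on that branch.
    Branches absent from branch_to_rows are ignored, which is how a
    subset of the phylogeny can be masked in a single iteration.
    """
    per_row = {}
    for branch, positions in branch_hits.items():
        for row in branch_to_rows.get(branch, []):
            per_row.setdefault(row, set()).update(positions)
    return per_row


# One iteration restricted to the human lineage and its ancestor.
branch_to_rows = {
    "human": ["human"],
    "human_chimp_ancestor": ["human", "chimp"],
}
branch_hits = {
    "human": {3, 4},
    "human_chimp_ancestor": {10},
    "gorilla": {1},  # not listed above, so left untouched in this run
}
result = rows_to_mask(branch_to_rows, branch_hits)
```

A second iteration with a different dictionary and different window parameters would then apply a separate masking regimen to another part of the tree.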
In some cases, initial masking can leave small “islands” of sequence data flanked by masked sequence. These potentially problematic stretches of sequence in close proximity to masked sections can be masked with the optional “interscan” function (Table 1). This function masks regions based on their length in comparison to neighboring masked regions (Fig. 1B and C). This ensures longer stretches of nonhomologous data that by chance share some similarity are still masked from the alignment.
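The island rule described above can be sketched in Python as follows. This is an illustrative reading of the rule, not the SWAMP implementation: the function name is hypothetical, and the sketch assumes a single pass that compares each unmasked island with the lengths of its flanking masked runs as they stood before the pass.

```python
def interscan(masked):
    """Mask unmasked islands that are shorter than the combined length
    of the masked runs on either side.

    masked: list of booleans, True where a codon is already masked.
    Returns a new list; the input is not modified.
    """
    # Split the sequence into alternating runs of (is_masked, start, end).
    runs = []
    i, n = 0, len(masked)
    while i < n:
        j = i
        while j < n and masked[j] == masked[i]:
            j += 1
        runs.append((masked[i], i, j))
        i = j

    out = list(masked)
    # Only interior runs can be flanked on both sides; because runs
    # alternate, an interior unmasked run always has masked neighbours.
    for k in range(1, len(runs) - 1):
        is_masked, start, end = runs[k]
        if is_masked:
            continue
        left = runs[k - 1][2] - runs[k - 1][1]
        right = runs[k + 1][2] - runs[k + 1][1]
        if left + right > end - start:
            for idx in range(start, end):
                out[idx] = True
    return out


before = [True, True, False, False, False, True, True] + [False] * 6
after = interscan(before)
```

In this example the three-codon island is flanked by four masked codons in total and is therefore masked, while the trailing unmasked run is left alone because it is flanked on one side only.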
Finally, SWAMP also notifies the user if the total length of the sequence falls below a defined minimum. These cases are likely to be incomplete sequences, which may then be excluded from downstream analyses if desired.
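A length check of this kind might look like the sketch below. The function names, the choice to count only codons that are neither gaps nor masked, and the dictionary representation of the alignment are assumptions for illustration rather than a description of SWAMP's internals.

```python
def informative_length(codons, mask_codon="NNN", gap_codon="---"):
    """Count codons that are neither masked nor alignment gaps."""
    return sum(1 for c in codons if c not in (mask_codon, gap_codon))


def short_sequences(alignment, min_codons):
    """Return the names of sequences whose informative length falls
    below min_codons.

    alignment: dict mapping sequence name to a list of codon strings
               (hypothetical in-memory representation).
    """
    return [
        name
        for name, codons in alignment.items()
        if informative_length(codons) < min_codons
    ]


alignment = {
    "human": ["ATG", "GCT", "GAA", "TTT"],
    "gorilla": ["ATG", "NNN", "NNN", "---"],
}
flagged = short_sequences(alignment, min_codons=2)
```

The flagged names can then be reported to the user, who decides whether to exclude those sequences from downstream analyses.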
SWAMP is not a computationally expensive filtering approach. For example, across a data set of >6,000 four-way 1:1 orthologs, described below, using a threshold of 10 and window size of 15, SWAMP ran in 93.3 seconds on a Mac with a 2.93-GHz i7 and 16-GB RAM.
Results and Discussion
Effects of SWAMP on Branch-Site Tests for Adaptive Evolution
For developmental purposes, and to provide some guidance on initial parameters, we utilized a primate data set, consisting of 6,379 orthologs from
Protein-coding genes for

Effects of SWAMP filtering on branch-site tests for positive selection. Branch-site tests were conducted on the terminal
The effect of varying the window size is illustrated in Figure 3. Here we focus on the terminal
Comparison with the Gorilla Genome Analysis
A similar approach was previously implemented in a genome-wide evolutionary analysis of protein-coding genes across African Great Apes 4 using an unpublished forerunner of SWAMP. This analysis masked a 1:1 orthologous gene set for human, chimpanzee, gorilla, orangutan, macaque, and marmoset using a window size of 15 codons and a threshold of 10 nonsynonymous substitutions per window. Of 11,538 gene alignments, 1,156 (10.1%) were masked in at least one window. This mirrors our results in which, of 6,379 alignments, 1,022 (16.0%) were masked under the threshold of 5 nonsynonymous substitutions in 15 codons and 429 (6.7%) were masked under the threshold of 10 nonsynonymous substitutions in 15 codons. Notably, this analysis also found far fewer genes partially masked in humans than in other primates with lower quality genomes. The masking performed in the analysis of the gorilla genome 4 affected downstream PAML analyses in a similar way to that described above (G. E. Jordon and S. H. Montgomery, personal observation).

Effects of masking at different window sizes. In (A) and (B), the LR statistics of genes significant for the branch-site test on the terminal
Comparison with Column-Based Masking
A major difference between the approach taken in SWAMP and currently available postalignment filtering methods is the orientation of data analysis. Existing methods tend to filter alignments based on conservation within a column of a multiple-sequence alignment (i.e., at a codon or nucleotide across species), whereas SWAMP analyzes data within rows. This is advantageous as sequencing or alignment errors may not sufficiently reduce similarity at conserved sites to be filtered by column-based approaches. To demonstrate the effects of this difference, we filtered our alignments using G-Blocks 21 under default parameters. The filtering resulted in the removal of data from 185 (2.9%) of 6,379 alignments. In contrast to SWAMP masking, the downstream PAML results based on these G-Blocks-masked alignments are almost identical to those obtained using unmasked data (Supplementary Fig. 1). Of course, across more divergent data sets that include regions where the alignment is problematic, users may find column-based filtering more useful; indeed, this was the intended use of G-Blocks. 21 However, given that PAML is optimized for data sets that are unsaturated at synonymous sites, and therefore relatively well conserved, we expect the phylogenetic row-based approach of SWAMP to be preferable in the majority of cases.
Usefulness of SWAMP and Potential Caveats
These results demonstrate SWAMP's utility on data from genomes of lower quality than those of the gold standard model organisms (Fig. 2). SWAMP provides a flexible framework to mask large data sets, removing stretches of low-quality alignment, probable sequencing errors, and nonhomologous data that could otherwise inflate false-positive rates in tests for adaptive evolution.
Effective masking with the approach taken by SWAMP will most likely produce conservative results as a minority of masked sequence may reflect genuine divergence concentrated in a short stretch of sequence. This could conceivably occur in proteins with key functional domains coded by a contiguous stretch of sequence. While this may result in some false negatives, a conservative approach is preferable in genome-wide studies, particularly when used to generate candidates for functional analyses. However, if this is a concern for a user, we recommend testing for enrichment of protein domain types within the masked genes and extending the window size during filtering.
A further caveat is that users must currently optimize their masking parameters manually. This can be done based on the genome-wide average rate of nonsynonymous substitutions per codon, or simply by adjusting the parameters until manual inspection of the most significant genes from downstream analyses no longer reveals spurious alignments. The increased confidence in downstream analyses and the reduction in manual filtering of results should offset this investment in time.
It is generally accepted that short sequences and those that contain internal stop codons should be removed from genome-wide scans for positive selection. Sequences that contain repetitive elements could also be masked, for example, with RepeatMasker.28–30 By implementing SWAMP in conjunction with optimal alignment programs and these established masking steps, researchers can increase their confidence in conclusions drawn from evolutionary and phylogenetic analyses performed in PAML or other analysis suites. SWAMP provides a useful addition to methods of postalignment filtering, improving the reliability and reproducibility of genome-wide analyses using PAML.
Conclusions
SWAMP effectively masks regions with high rates of nonsynonymous substitutions concentrated in short runs of sequence, a pattern typical of sequence or alignment errors, preventing their inclusion in downstream evolutionary analyses. This removes sequence that violates the assumptions of the phylogenetic model implemented in the software package PAML and that could otherwise give a false signal of positive selection. SWAMP masks short stretches of erroneous sequence that may not be detected by existing masking/filtering methods but will be prevalent in low-coverage genomes. Its branch- and sequence-specific operation allows different masking regimens to be applied to selected parts of the phylogeny. Although specifically designed for implementation with PAML, SWAMP will be useful as a preprocessing step for any analysis that needs to limit the influence of sequence error and misannotation. In addition to the reduction in false-positive rates achieved through SWAMP preprocessing, reporting the SWAMP parameters used in the methods of future publications will improve the reproducibility of genome-wide analyses for positive selection.
Availability and Requirements
Author Contributions
Conceived and designed the experiments: PWH, GEJ, SHM. Developed and tested the software: PWH, GEJ. Wrote the first draft of the manuscript: PWH, SHM. Contributed to the writing of the manuscript: PWH, GEJ, SHM. Revised the manuscript: PWH, SHM. All authors reviewed and approved of the final manuscript.
