Abstract
Introduction
Transposable elements (TEs) are DNA segments capable of changing their position in the genome. They are a major component of genomes of many organisms, comprising 66%–69% of the human genome 1 and even up to 85% of genetic material of certain plants (eg, maize). 2
Recently, genomes of many species have been fully or partially sequenced and several bioinformatic tools have been created facilitating a more systematic identification and annotation of numerous TE families. On the basis of the mechanism of transposition, TEs were divided into two groups, ie, class I (retrotransposons) and class II (DNA transposons).3,4 Retrotransposons transpose via an RNA intermediate (copy-and-paste mechanism), each transposition event leading to an increase of their copy number. DNA transposons change their chromosomal localization by physical excision of the element from the donor site and reintegration in the acceptor site. This group contains two subclasses, depending on the number of DNA strands cut during transposition. Subclass 1 is mobilized in accordance with the cut-and-paste mechanism, where both strands are cut at each end resulting in the complete excision of the TE. Most Subclass 1 DNA transposons carry Terminal Inverted Repeats (TIR) and create Target Site Duplication (TSD) upon insertion. Subclass 2 elements transpose on the basis of the rolling circle mechanism, where only one strand is cut. Classes and subclasses are further divided into orders, superfamilies, and families.
The effect of TE proliferation on genome evolution is noticeable.
5
Moreover, the importance of accurate TE annotation and masking for the structural characterization of newly sequenced genomes drives the interest in developing new methods for TE detection and analysis.
6
The exhaustive list of tools and resources for TE analysis compiled by Bergman Lab (http://bergmanlab.smith.man.ac.uk/) contains about 120 items. A large number of them are designed for particular TE families (eg,
MUST 18 allows the user to search for all Miniature Inverted-repeat TEs (MITEs) that satisfy given criteria corresponding to minimum and maximum length of TIR, TSD and size of MITEs (up to 1000 bp). TRANSPO, 20 in addition to the functionality of MUST, enables the user to specify the sequence of TIRs and maximum number of errors allowed in TIRs. STAN 19 is the most flexible tool. It finds all sequences containing a given pattern specified in the SVG grammar. However, STAN uses fixed (non-parameterized) definition of inverted repeats which makes it less suitable for searching class II transposons.
All of the tools mentioned above have been developed to facilitate identification of MITEs, ie, elements that have TIRs but lack any coding capacity. Our TE discovery tool, TIRfinder, (which also captures specific structural features) offers functionality that goes beyond the proposed methods. It combines an efficient approach based on suffix trees that allows for de novo TE detection, with the possibility of a deep analysis of a specific TE family, based on its structural characteristics. In particular, while searching for all putative TEs, TIRfinder allows the user to specify TIR and TSD patterns as a sequence of A, C, T, G or symbols from extended IUPAC nomenclature.
21
This provides the ability to define TIR/TSD patterns as consensus sequences which combine conserved and non-conserved positions. A detailed description of the tool has not been provided previously, however the prototype, stand-alone version of TIRfinder was used for mining TEs in
Recently, deep sequencing projects have increased dramatically, raising the possibility of the repetitive DNA characterization of not completely sequenced organisms. Notice that our approach based on a suffix tree requires a large contiguous genomic sequence. Therefore for analysis of non-assembled raw reads produced by next generation sequencing, recently developed tools which utilize a clustering approach 24 should be applied.
Methods of TE Detection and Analysis
TIRfinder analysis workflow
TIRfinder allows efficient searches of the DNA for structured set of motifs that define TEs to be conducted. Then it performs a BLAST search to find transposase or any other TE-related open reading frames (ORFs), aiming at the detection of autonomous copies, as well as elements directly derived from them. The method works in three stages: structural analysis, functional analysis and MITE analysis (see Fig. 1).

The control flow through different phases of the proposed TEs detection method.
In the first step, TIRfinder scans the input sequence for all putative TEs, based on the given structural characteristic of a particular TE family, ie, patterns describing TIR, TSD, the number of allowed mismatches and size limits of desired TEs. The algorithmic details of the structural analysis are described further in this section.
Next, all elements identified in the structural analysis stage are analyzed to check whether they contain the ORF coding for a protein specified by the user, most often it would be a transposase. For this purpose, they are aligned with the protein sequence (using an appropriate version of the BLAST algorithm) 25 and elements with statistically significant similarity are marked as putative autonomous. Each autonomous element is then used to search for so called Derivatives. These are non-autonomous elements that have emerged from an autonomous elements during the course of evolution, often by internal deletions disrupting the ORFs. To classify the TE as a derivative of a given autonomous element, we require that both pairs of subterminal regions share a significant level of similarity which is defined by the user (eg, BLAST E-value < e-10).
Finally, the remaining set of putative TEs (ie, those not classified as autonomous TEs and their derivatives) is analyzed to find short elements which cluster together based on their sequence similarity. The level of similarity is defined by the user. Clustering is performed using BLAST-clust algorithm. 25 The TEs assigned to clusters are labeled as MITEs.
Structural analysis
Our application follows a structure-based approach, ie, it relies on the detection of specific models of TE architecture consisting of a pair of TIRs (inverted repeats) that are flanked by TSDs (direct repeats).
For detecting TEs we use suffix trees which are very efficient data structures, commonly used in computer science over last four decades.26 They consists of a root, nodes and labeled edges representing one or more characters from input sequence. All suffixes of the input sequence can be obtained by traversing the suffix tree from the root to leaves. The core idea is that all suffixes that share the common prefix are hanged off on the common node, which reduces the total number of nodes and memory usage.
The algorithm implemented in TIRfinder takes as an input a DNA sequence, a mask corresponding to combined TSD and TIR patterns, and other parameters (ie, the number of mismatches), see Figure 2 for the pseudocode of the algorithm and Figure 3 for graphical explanation of the mask notion.

TIRfinder algorithm.

TIRfinder–-structural analysis. (A) Explanation of TIR and TSD mask concept. (B) Overview of TEs detection phase.
First, the DNA sequence is divided into a set of smaller fragments of the same length, corresponding to the maximal TE size. A suffix tree is built for every two consecutive fragments, with overlaps of one fragment length (see Fig. 3B). This is to ensure that we do not miss any match. The algorithm for each fragment independently searches all matches for the mask. In this step, all potential 3′-ends of a TIR should be found. Then we determine if the complementary (5′-end) part of the repeat exists, which is the most time-consuming part of the algorithm. In order to do this efficiently the suffix tree is used (it is built in linear time with respect to a length of the fragment).27 For each match (TIR + TSD) found in the previous step, the reverse complement of TIR followed by TSD is searched in the suffix tree (see Fig. 2).
The principal advantage of our approach is memory efficiency: the genomic DNA sequence is split into smaller fragments and the suffix tree data structure needs only a linear amount of space. Thus, we do not impose any limits to the total length of the sequence and the stand-alone version of TIRfinder can be run even on standard PC computers.
Usage
The user must first select the genomic sequence to be searched for TEs. Several plant genomes are currently available in the TIRfinder website (
Next, the user has to define several parameters such as a minimal and maximal distance between TIRs and patterns of inverted and direct repeats. Patterns of the particular TIR or TSD can be determined by means of a finite string composed of extended IUPAC nucleotide alphabet. 21
Note that the user has to define only one side of the repeat. The second part will be computed as a reverse complement in the case of TIR or simply duplicated in the case of TSD. Furthermore, there is a possibility to specify a maximal number of mismatches between TSDs and TIRs flanking each copy of identified TEs. TIRfinder reads a DNA sequence given in a FASTA file to search for results and saves them in a simple text format (see Appendix for the detailed description). Then the file can be further processed in order to obtain more detailed and accurate data.
Carefully prepared input data, such as TIR and TSD masks, are extremely important in order to get satisfactory results. One possible way is to take TIR sequences from individual copies of TEs representing the desired family and their corresponding flanking TSDs, and create consensus for TSD and TIR sequences. Afterwards, the user can set up other parameters of TIRfinder, which allow for the control of similarity between mined elements and the mask (MaskMismatches) or between corresponding 5′- and 3′- TIRs and TSDs of found elements (SeqMismatches) see Figure 3A. The ability to manipulate these parameters makes it easier to search for known, well conserved elements as well as to mine new (sub) families of TEs de novo.
Subsequently, the user may specify parameters for the identification of TE copies carrying ORFs, which for simplicity was dubbed functional analysis. First, the protein sequence (in FASTA format) is provided to detect ORFs (a significance threshold is required). Then, long non-autonomous copies–-direct derivatives are identified as elements sharing similarity of subterminal regions (length parameter in bp and alignment significance threshold are provided by the user) with one of the previously found autonomous TEs. If the ORF sequence is unknown to the user, the functional analysis step can be omitted in the course of analysis.
Finally the user defines constraints for MITE analysis, ie,: minimal and maximal length of MITEs, maximal number of MITE clusters and the level of in-cluster similarity.
TIRfinder implementation
Current release of TIRfinder is an open source web application built with Java, Perl, Apache and other tools (ie, BLAST, BLAST-clust). Previous, stand-alone release of TIRfinder, used for case studies reported in,22,23 is available at http://sourceforge.net/projects/tirfinder/.
The program is organized into web tier and background processes. Web tier includes a set of Perl scripts responsible for communication with user, execution of subsequent tasks, and results presentation. Time consuming operations (ie, structural analysis, BLAST alignment, clustering) are performed as background processes, which prevents the user interface from freezing during calculations.
The application is hosted on one of the University of Warsaw servers with 16-cores and 64GB of RAM.
Case Studies
As an example of TIRfinder application, we searched the genome of
Pogo-like TEs found by TIRfinder vs. annotated in Repbase.
% of TIRfinder output masked by Repbase data.
numbers of

Pogo-like TEs landscape detected by TIRfinder in
Moreover, we performed similar search of

The example of TIRfinder outcomes: search for
Discussion
Efficient methods for computational analysis of TEs are of great interest in the light of the TE contribution to genome structure and evolution. We developed TIRfinder to provide users with a fully functional web server that allows structure-based detection and clustering of class II TEs carrying TIRs.
There are several unique features of TIRfinder, in comparison with other tools designed for mining TEs carrying TIRs. It provides exibility in defining TIR/TSD masks, allowing for a range of actions, from de novo identification of TEs belonging to a given superfamily in newly annotated genomes
22
to extensive mining of all copies belonging to a particular family (eg,
Although our tool is highly efficient for well-determined TSD/TIR mask, the complexity of the method may significantly increase with relaxing the mask constraints (eg, allowing large number of mismatches between TIR/TSD). Moreover, TIRfinder helps identifying elements with coding capacity, however some additional downstream processing, eg, ORF prediction, should be performed for all copies classified as autonomous, if a more detailed functional characteristics is required.
Author Contributions
TG, MS, JP developed the software and performed tests. AG, KW and DG supervised all the software developments, and carried case studies. TG, MS, DG, and AG wrote the manuscript. All authors reviewed and approved of the final manuscript.
Funding
Author(s) disclose no funding sources.
Competing Interests
Author(s) disclose no potential conflicts of interest.
Disclosures and Ethics
As a requirement of publication author(s) have provided to the publisher signed confirmation of compliance with legal and ethical obligations including but not limited to the following: authorship and contributorship, conflicts of interest, privacy and confidentiality and (where applicable) protection of human and animal research subjects. The authors have read and confirmed their agreement with the ICMJE authorship and conflict of interest criteria. The authors have also confirmed that this article is unique and not under consideration or published in any other publication, and that they have permission from rights holders to reproduce any copyrighted material. Any disclosures are made in this section. The external blind peer reviewers report no conflicts of interest.
