Abstract
Introduction
Eukaryotic genomes are known to be densely populated by repetitive elements, mainly microsatellites and transposable elements (TEs). When characterized in a plant species, these repetitive elements generate information that can be applied for different purposes in a plant breeding program. For instance, microsatellites can be applied as molecular markers for mapping quantitative trait loci (QTL) and for paternity tests, 1 whereas transposons can be applied in studies of gene regulation and epigenetics, as well as in genetic engineering and gene therapy. 2
Transposable elements are classified into 2 main classes, based on the molecular mechanism that mediates their transposition. The elements that use a “copy-and-paste” mechanism belong to class I, and those that use a “cut-and-paste” mechanism belong to class II. 3 The increasing diversity of TEs identified in different taxa, mainly in plants, prompted the development of a unified TE classification system. 4
Transposable elements may account for more than 50% of the total content of some genomes. 5 This amount can be even higher, up to 70%, in the genomes of some grasses. 6 Although most TE groups are ancestral and present in basically all the kingdoms, these elements differ significantly from each other, reaching thousands of different families in the plant kingdom alone. 7 It is known that waves of expansion and contraction in TE numbers can result in dramatic differences between genomes. 8
The repetitive pattern and structural signatures typically found in TEs make them natural candidates for a large-scale bioinformatics analysis. There are 2 computational approaches for the identification and annotation of TEs; the first method is based on structural features (de novo), and the second is the search for similarities in databases (homology based). 9 Although there are many tools for annotation of TEs, 10 this is still an open field of research in the area of bioinformatics. 11
A detailed description of repeats can be useful in refining genome assembly and annotation (especially in complex genomes like those of plants). Moreover, it provides information on genome variability and on how genomes diversified over the evolutionary process. Recent insertions of TE families can help to better understand the evolutionary mechanisms involved in species differentiation. 12 In addition, the epigenetic silencing mechanism may help in understanding the regulation of transposition activity in plants. 13
The
Date palm (
The Brazilian breeding program on
This study provides a characterization and comparison of the TEs and microsatellites present in the genomes of the American and African oil palms, as well as the date palm. This analysis can provide insights into the repetitive content of these species and the application of these regions to explore the genetic variability within and among palm species. A comparative analysis based on a scaffold assembly of these genomes was performed, allowing the distribution of TEs on the chromosomes of
Materials and Methods
A pipeline for the analysis of repetitive elements (repeats) was developed and is detailed below. It includes free software typically used in repeat analysis, such as Tandem Repeats Finder (TRF), RepeatModeler, and RepeatMasker. Local scripts, written in Perl and Python, were developed to automate the data transformation between steps of the analysis. This pipeline is being optimized to improve speed through parallelism techniques (Fork, Perl), as well as through normalization of the multithread parameters of the software (L.S. Brito et al, 2016 unpublished data).
DNA sequence data
The chromosomes and/or scaffolds from 4 genome drafts were used in this study: (1)
Identification and classification of microsatellites
The content of microsatellites in oil and date palm genomes was studied. The TRF software was applied to identify microsatellite repeats, 19 using the following parameters: match 2, mismatch 7, delta 7, PM 80, PI 10, minscore 50, maxperiod 500, -f (flanking sequence), -d (data file), and -m (masked sequence file). To summarize the results obtained, the Tandem Repeats Analysis Program (TRAP) software 20 was applied, using the following parameters: -id = 70 (minimum match percentage), -tbf = html + csv (table format), -sort = size (sort field), -rr (flag—create redundancy report), and -trf (flag—create trf-like file).
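The TRF settings listed above can be written down as a command line. The following is a minimal sketch, not the authors' script: it assumes the standard TRF positional order (Match, Mismatch, Delta, PM, PI, Minscore, MaxPeriod) followed by option flags, and the binary name `trf` and the input file name are hypothetical placeholders.

```python
# Parameter values as reported in the text, in the positional order
# expected by the TRF command line (an assumption about how TRF was called).
TRF_POSITIONAL = [
    ("match", 2),
    ("mismatch", 7),
    ("delta", 7),
    ("pm", 80),
    ("pi", 10),
    ("minscore", 50),
    ("maxperiod", 500),
]

# Flags reported in the text: flanking sequence, data file, masked sequence file.
TRF_FLAGS = ["-f", "-d", "-m"]


def build_trf_command(fasta_path, trf_binary="trf"):
    """Return the TRF invocation as an argument list suitable for subprocess.run.

    `trf_binary` and `fasta_path` are placeholders; adjust them to the local setup.
    """
    positional = [str(value) for _, value in TRF_POSITIONAL]
    return [trf_binary, str(fasta_path), *positional, *TRF_FLAGS]


if __name__ == "__main__":
    # "genome_scaffolds.fa" is a hypothetical file name used only for illustration.
    print(" ".join(build_trf_command("genome_scaffolds.fa")))
```

Keeping the parameters in one ordered list makes it easier to apply identical settings to every genome, which matters when repeat content is compared across assemblies.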
Identification of repetitive elements
The first step was performed with the RepeatModeler software (default settings), which combines RECON, 21 RepeatScout, 22 RepeatMasker, TRF, 19 and RMBlast in a pipeline for the de novo identification of TEs. Long terminal repeat (LTR) retrotransposons were identified using the LTR_FINDER software, 23 applying default parameters. All the repeats greater than 100 bp were included in the TE library.
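The length criterion for building the TE library (repeats greater than 100 bp) can be illustrated with a small FASTA filter. This is a sketch under the assumption that the candidate repeats are available as FASTA records; the function names are hypothetical and are not part of RepeatModeler or LTR_FINDER.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=100):
    """Keep only records whose sequence is strictly longer than min_len bp."""
    return [(header, seq) for header, seq in records if len(seq) > min_len]


if __name__ == "__main__":
    # Toy records for illustration only; real input would be the de novo repeat FASTA.
    toy = [">rep1", "A" * 60, "C" * 60, ">rep2", "G" * 100, ">rep3", "T" * 101]
    for header, seq in filter_by_length(read_fasta(toy)):
        print(header, len(seq))
```

The comparison is strict (`>`), matching the wording "greater than 100 bp"; a repeat of exactly 100 bp is excluded.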
Classification of repetitive elements
The resulting TE library was classified using Blastn (e-value ≤ 1e−5, identity ≥ 70%, and minimum size alignment ≥ 80 bp) against Repbase and the public database MIPS Repeat database, which integrates other databases (TRansposable Elements Platform [TREP], TIRG
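The three Blastn thresholds given for the classification step (e-value ≤ 1e−5, identity ≥ 70%, alignment length ≥ 80 bp) can be applied to search results as sketched below. The sketch assumes the hits were exported in the 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the text does not state which output format was used, and the function names are hypothetical.

```python
def passes_thresholds(fields, max_evalue=1e-5, min_identity=70.0, min_length=80):
    """Check one tabular BLAST hit against the thresholds reported in the text."""
    identity = float(fields[2])   # percent identity
    length = int(fields[3])       # alignment length in bp
    evalue = float(fields[10])    # expectation value
    return evalue <= max_evalue and identity >= min_identity and length >= min_length


def filter_hits(lines):
    """Return the hits (as lists of fields) that satisfy all three thresholds."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        if passes_thresholds(fields):
            kept.append(fields)
    return kept


if __name__ == "__main__":
    # Invented hits for illustration; the identifiers are not real database entries.
    toy_hits = [
        "te_001\tref_A\t85.0\t120\t18\t0\t1\t120\t1\t120\t1e-30\t150",
        "te_002\tref_B\t65.0\t200\t70\t0\t1\t200\t1\t200\t1e-20\t110",
        "te_003\tref_C\t90.0\t60\t6\t0\t1\t60\t1\t60\t1e-10\t80",
    ]
    for hit in filter_hits(toy_hits):
        print(hit[0], hit[1])
```

Only the first toy hit passes: the second fails the identity threshold and the third fails the alignment length threshold.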
Annotation of repetitive elements
The RepeatMasker software was applied with a custom library (a combination of repeats from Repbase, MIPS [Munich Information Center for Protein Sequences], and the de novo TE library) to search for TE coordinates. This software was also used to generate a version of the sequence with repeat regions masked. The tool “one code to find them all,” 24 a Perl script that parses the RepeatMasker output file, was used to organize, summarize, and produce statistics about the RepeatMasker results.
The data generated by “one code to find them all” were used to measure divergence between copies of TEs, by relating the divergence of each copy (in relation to its reference element) to the proportion of the length of the reconstructed copy compared with the reference element. 24
Results
Large proportions of the 4 genomes studied are repeat sequences: 50.96% of the
Repeat content in oil and date palm genomes.
Abbreviations: chr., chromosomes;
Pipeline results, except for referenced items: (a) Al-Mssallem et al, 18 (b) Singh et al, 14 and (c) National Center for Biotechnology Information Assembly (www.ncbi.nlm.nih.gov/assembly/).
Bold value indicates proportion of TEs in the genome, rather than percentage among TEs.
Total repeat content on
Abbreviation: TE, transposable element.
Detailed classification of transposable elements identified on
Abbreviations: LINE, long interspersed nuclear element; LTR, long terminal repeats; RC, rolling circle; SINE, short interspersed nuclear element; TEs, transposable elements; tRNA, transfer RNA.
Bold value indicates that RC/Helitron stand out from the other Class II-DNA because they have an exclusive transposition mechanism called rolling circle (RC).
Identification of repeats
Long terminal repeat retrotransposons are the TEs predominantly identified in

Distribution of transposable elements in the genome of
In

Frequency (%) of the most common simple sequence repeat (SSR) motifs in the genome of
In total, 155 726 loci of microsatellites (between mononucleotide and hexanucleotide) were identified in the genomes of oil and date palms. For

Comparison of simple sequence repeat (SSR) amount among oil and date palm genomes. Amount of mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide in
The composition of TEs was very similar among the 4 genomes studied. For the 4 sets of scaffolds used (
Distribution and classification of TEs on the chromosomes of African oil palm
A total of 212 722 TE copies were identified, with a total size of 174 195 kb, representing 26.47% of the sequence. Among the 16 chromosomes of the African oil palm, chromosomes 6 and 15 are the ones presenting the highest repeat coverage (Table 2 and additional data given in Table S6); however, the distribution of TE classes was to a certain degree similar in all chromosomes (Figure 4).

Chromosomal distribution of TEs in
Figure 5 shows the most representative TE families in each chromosome. The most represented LINE families are L1 and L1-Tx1, whereas the 2 most represented DNA transposon families are CMC-EnSpm and hAT-Ac. For the LTR retrotransposons, Copia and Gypsy were the most frequent superfamilies, with Copia being the most abundant on all chromosomes. The distribution of the main families of TEs per chromosome was also examined. The repeats have been classified and are described below.

Chromosomal distribution of the most represented transposable elements (TEs) in
Among all the class I retrotransposons identified in the African oil palm chromosomes, 25 558 copies have been classified as LTR elements, totaling 44 226 kb. The 4 main superfamilies are Caulimovirus, Copia, Gypsy, and ERV1 (Table 3). The chromosomes with the largest representation of these elements were 6 and 9 (28.94% and 27.77%, respectively).
In contrast, only 609 and 148 copies have been classified as belonging to the LINE and SINE families, respectively, totaling 372 and 20 kb. The 5 main LINE families are L1, L1-Tx1, L2, RTE-BovB, and Tad1 (Table 3). Chromosomes 2 and 5 are the ones with the greatest abundance of these elements (0.29% and 0.34%, respectively). The SINE/transfer RNA family accounted for 95.95% of the SINE elements found (Table 3).
A total of 15 254 copies have been classified as class II (DNA transposons) on the African oil palm chromosomes, totaling 16 983 kb. CMC-EnSpm is the most frequent family, with a total of 7544 copies and 18 165 fragments, totaling 10 825 kb (Table 3). CMC-EnSpm is widely dispersed among the 16 chromosomes, with the lowest percentage on chromosome 14 and the highest on chromosome 15. Besides this family, 8 other families were identified: Academ (27 kb), Crypton (2 kb), Dada (240 kb), hAT (families Ac, Blackjack, Charlie, Tag1, and Tip100, totaling 2873 kb), Mule-MuDR (2879 kb), PIF-Harbinger (44 kb), Sola (94 bp), and the rolling-circle transposons, Helitron (461 kb) (Table 3).
The majority of TE copies (80.30%) were grouped as unclassified and subdivided into 2 groups: unspecified (43.92%) and unknown (36.37%). Altogether, they account for 170 814 copies, totaling 112 131 kb (Table 3).
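The share of each TE group among all identified copies follows directly from the copy counts reported in this section. The sketch below reproduces that arithmetic; the counts are taken from the text, and the dictionary keys and function name are hypothetical labels chosen for illustration.

```python
# Copy counts as reported in the text for the African oil palm chromosomes.
TOTAL_TE_COPIES = 212_722
COPIES_BY_GROUP = {
    "LTR": 25_558,
    "LINE": 609,
    "SINE": 148,
    "Class II (DNA transposons)": 15_254,
    "Unclassified": 170_814,
}


def percent_of_total(copies, total=TOTAL_TE_COPIES):
    """Return the share of `copies` in `total` as a percentage rounded to 2 decimals."""
    return round(100.0 * copies / total, 2)


if __name__ == "__main__":
    for group, copies in COPIES_BY_GROUP.items():
        print(f"{group}: {percent_of_total(copies)}%")
```

For the unclassified group, 170 814 of 212 722 copies gives the 80.30% stated in the text.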
Divergence of TEs
Length ratios (reconstructed copy length relative to the reference element) close to “1” (full-length elements) combined with divergence close to “0” could indicate recent insertion events of TEs in the genome. Figure 6 shows DNA transposon and LTR retrotransposon superfamilies as potential recent insertions (with some full-length elements), whereas LINE-like elements present low divergence but different sizes. Each point represents a TE copy.
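The criterion described above (length ratio near 1 and divergence near 0) can be expressed as a simple per-copy check. This is a minimal sketch: the record fields, and in particular the cutoffs of 0.9 for the ratio and 5% for divergence, are illustrative assumptions and are not thresholds stated in this study.

```python
from dataclasses import dataclass


@dataclass
class TECopy:
    """One reconstructed TE copy (hypothetical record layout for illustration)."""
    family: str
    copy_length: int        # length of the reconstructed copy, in bp
    reference_length: int   # length of the reference element, in bp
    divergence: float       # percent divergence from the reference


def length_ratio(copy):
    """Ratio of reconstructed copy length to reference length (1.0 = full length)."""
    return copy.copy_length / copy.reference_length


def is_candidate_recent(copy, min_ratio=0.9, max_divergence=5.0):
    """Flag copies that are nearly full length and show low divergence.

    The default cutoffs are assumptions chosen only to make the example concrete.
    """
    return length_ratio(copy) >= min_ratio and copy.divergence <= max_divergence


if __name__ == "__main__":
    # Invented copies for illustration only.
    copies = [
        TECopy("LTR/Copia", 4900, 5000, 1.2),
        TECopy("LINE/L1", 800, 6000, 2.0),
        TECopy("DNA/CMC-EnSpm", 7900, 8000, 18.5),
    ]
    for c in copies:
        print(c.family, round(length_ratio(c), 2), is_candidate_recent(c))
```

In this toy set, only the Copia copy is flagged: the L1 copy has low divergence but is truncated, which mirrors the pattern described for LINE-like elements.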

The plot of the divergence of transposable element (TEs) in
Discussion
Microsatellites and TEs present in the oil and date palm genomes were identified and analyzed using a pipeline for de novo and homology-based identification of repetitive elements. This report is the first with a detailed analysis of repeats in the whole genome of oil palm. Here, the not yet published genome of an Amazonian oil palm genotype belonging to the
Oil and date palm genomes are mainly composed of repeats
A large portion of these 4 genomes available and studied is composed of TEs (50.96%—
The difference in the repetitive content of the 2 American oil palm genomes reflects the discrepancy in the assembly stage of these genome projects.
The content of TEs found in the African oil palm genome scaffolds (39.41%) was different from that described by Singh et al 14 (57%). Nonetheless, the amount (in percentage) of LTR retrotransposons found by Beulé et al 16 is very close to the results found in this work. This study shows that, on average, 26.47% of
TEs have a great influence on gene expression and genome evolution in plants. 30 Because exactly the same analysis was applied to these 4 different data sets, the quantitative and qualitative differences observed in the TE profiles of the African and the American oil palm genome sequences may be evidence of different mechanisms of transposition and regulation of such elements in the 2 species.
Diversity of microsatellites
This study has identified 155 726 microsatellite loci, which are potential molecular markers of
It was found that dinucleotide repeats are the most frequent in the genomes studied, corroborating what is found in other plant species (48%-67%), in different data sets. 34 Within the dinucleotide class, the most frequently identified motif was AT. Due to the lower stability of A/T bonds, the mutation rate in this genome is probably high, 35 which ultimately increases the level of polymorphism. These observations are consistent with studies in apple, 36
There was a clear difference in dinucleotide content between
Our result corroborates those found in
Regarding the microsatellite content in the evaluated genomes, there is considerable variation (between 1.65% and 2.24%) among them. This level of variation is expected to be found within species that are phylogenetically close, such as oil palm (
Our results on the characterization of microsatellites in the genome of
Recent studies have implemented the genome-wide strategy for the development of microsatellite markers in plants. 40,41 The advantage of this approach is that it yields a large number of markers distributed evenly throughout the genome, which is ideal for genetic mapping studies. The construction and deployment of a microsatellite database for the scientific community would have a high impact on the genetic studies of oil palm because this type of marker is highly informative and has a wide range of applications.
Using the tools of TRF and TRAP software, included in our pipeline, oil palm genome was systematically searched for microsatellites to develop genetic markers. This approach saves both cost and time. This result showed that in addition to SSRs developed from traditional genetic library screening 42 and other methods, oil palm genome sequence is a rich resource for the rapid identification and development of microsatellites.
Abundance of the different classes of TEs
Few differences in TE classes were found among the 4 genomes used in this study. Retrotransposons are the most abundant TEs in
In class I, there was a much greater presence of LTR elements compared with the LINE and SINE families. The 2 superfamilies that stood out among the LTR families were Copia and Gypsy, which appears to be typical of monocot genomes. 45 The LINE and SINE ratio was low because such elements appear to be more abundant in animal genomes than in plant genomes. 4
Class II TEs are poorly represented in oil palm genomes, and the most abundant superfamilies of DNA transposons in American and African oil palms, as well as date palm, are the CMC-EnSpm and hAT elements. Members of the hAT superfamily are found in many monocotyledons, such as those of the Ac-Ds family in maize. 46
An interesting fact was the high proportion of elements not classified in
In conclusion, to the best of the authors’ knowledge, this is the first detailed description of all genome repeats for American and African oil palms, as well as date palm. The genomes analyzed show high diversity and abundance of TEs and microsatellites. The identified repeats are potential genetic markers for these species and will be used for the assembly and full annotation of these complex plant genomes. Moreover, the SSRs that are being developed and validated will be used as framework markers to allow the bridging of other marker types, such as SNPs, and relevant information (eg, structure) between breeding populations. In addition, the complexity of this analysis stimulated us to produce a pipeline to improve efficiency in full TE and tandem repeat analyses, which is under optimization and documentation (LS Brito et al, unpublished).
References
