Abstract
Introduction
Microsatellites, or simple sequence repeats (SSRs), are short tandem repeats of 1–6 bp motifs, which have increasingly been used in population genetics analysis, crop breeding, and conservation genetics. SSRs are widespread in both protein-coding and noncoding regions in plants.1 Because of their abundance, high levels of allelic variation, and codominant inheritance, SSRs are considered effective genetic markers for genetic diversity analysis, gene tagging, and conservation biology in plants.2,3 SSRs are generally categorized into two groups based on their origin: genomic SSRs, derived from genomic sequences, and expressed sequence tag (EST)-SSRs, derived from transcribed RNA sequences.4 The de novo development of genomic SSRs is time consuming and labor intensive, involving the screening of a genomic DNA library with specific SSR probes to isolate the microsatellite sequences.4 In contrast, development of EST-SSR markers has many advantages, including low cost, technical simplicity, the ability to capture functional diversity in natural populations or germplasm collections, and high transferability to related species.4 The NCBI dbEST database collects cDNA sequences from a number of organisms. However, dbEST does not include the vast majority of known species and has particularly poor coverage of nonmodel organisms, especially those with low economic value. The lack of EST data makes it difficult to develop effective EST-SSR markers.
Next-generation sequencing (NGS) is an important tool for generating large quantities of genomic data from nonmodel organisms in a cost-effective manner. In particular, genome-scale transcriptome analysis by RNA-seq enables the identification of genes that are strongly differentially expressed in response to environmental changes, determination of the genetic basis of critical phenotypes, and mapping of genomic diversity in nonmodel organisms. RNA-seq can accelerate the development of SSR markers, which is particularly useful for genetic analysis in nonmodel species. Nonmodel plant species that have benefited from RNA-seq based SSR marker development include
Methods
Plant material
Summary of assembly and annotation results for V. baillonii.
RNA extraction and sequencing
Total RNA was extracted from the samples using a CTAB procedure.9 A260/A280 ratios of the RNA samples dissolved in 10 mM Tris (pH 7.6) ranged from 1.9 to 2.1. The integrity of the RNA samples was examined with an Agilent 2100 Bioanalyzer; their RIN (RNA integrity number) values ranged from 8.6 to 10.0, with no sign of degradation. RNA from each replicate was pooled in equal volumes to obtain enough RNA for RNA-Seq.
A total of 20 μg RNA was used for RNA sequencing, and the mRNA was fragmented into small pieces using divalent cations at an elevated temperature. The cDNA library was constructed from poly-A-enriched RNA, and 200–300 bp fragments were chosen for paired-end sequencing based on Illumina protocols (San Diego, CA, USA). Double-stranded cDNA was synthesized with random hexamer primers. The short fragments were purified with the QIAquick PCR Purification kit (Qiagen Inc.). The purified DNA libraries were first amplified via PCR and then sequenced on an Illumina HiSeq™ 2000 platform.
De novo assembly
Raw reads were first filtered by removing the adaptors and reads with >eight ambiguous bases and >50% of the bases with a quality score ≤5 using in-house
SSR locus search, primer acquisition, and validation
The unigenes were searched for SSR loci with MicroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa).11 Criteria included a minimum of five repeats for simple motifs and three repeats for complex or imperfect repeats, a motif length of 2–10 bp, and, for compound SSRs, a maximum interruption distance of 100 bp between different SSRs. To facilitate SSR detection, only 1–6-nucleotide motifs were considered, and the minimum number of repeat units was defined as 10 for mono-, 6 for di-, and 5 for tri-, tetra-, penta-, and hexanucleotides. Primer pairs for each unique SSR were designed using Primer 3.0,12 with target microsatellites containing at least five repeats and yielding PCR products of 80–500 bp. One hundred fifty-one primer pairs were synthesized and used for validation (Supplementary Table 2). Screened primer pairs giving good amplification were subsequently used to characterize polymorphism among 40 individuals from six populations (Supplementary Table 1). PCR was performed in a 25-μL volume containing 10–40 ng plant DNA. The PCR conditions were as follows: initial denaturation at 94 °C for 4 minutes; 35 cycles of 94 °C for 1 minute 30 seconds, annealing at 45 °C to 60 °C for 50 seconds, and 72 °C for 50 seconds; and a final extension at 72 °C for 7 minutes. The PCR products were purified before sequencing to remove excess primers and deoxynucleotide triphosphates using a TIANquick Midi Purification Kit (Tiangen Biotechnology Co. Ltd.). Sequencing reactions were then performed using the ABI Prism Sequencing Ready Reaction Kit with the same primers as in the PCRs and analyzed on an ABI 3730 genetic analyzer (Applied Biosystems).
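The repeat-count thresholds stated above (10 for mono-, 6 for di-, and 5 for tri- to hexanucleotide motifs) can be illustrated with a minimal Python sketch. This is not MISA itself: the function name `find_ssrs` and the regular-expression approach are assumptions made for illustration, and the sketch finds only perfect SSRs, ignoring compound and imperfect repeats.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Hypothetical helper for illustration only; start is 0-based.
    """
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A motif of length k followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. "AA" (mono-) or "ATAT" (di-), to avoid double counting.
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGC" + "A" * 12 + "CGT" + "AG" * 7 + "TTC" + "ACG" * 4
    for start, motif, count in find_ssrs(example):
        print(start, motif, count)
```

In the example sequence, the (ACG)4 tract is not reported because it falls below the five-repeat minimum for trinucleotide motifs.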
Functional annotation for unigenes containing SSRs
Functional annotation of SSR-containing coding sequences was conducted using the program Blast2GO.13 All the SSR-containing unigenes were searched against the NCBI NR protein database using BLASTx, with an E-value threshold of 1e-6. Each contig was assigned a gene name according to its best BLASTx hit. The distributions of functional categories were plotted with the program WEGO.14
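The best-hit assignment described above can be sketched as follows. This is an illustration only, not the Blast2GO workflow: it assumes the BLASTx results were exported in the standard 12-column tabular layout (BLAST+ `-outfmt 6`), and the function name `best_hits` and the tie-breaking rule (lowest E-value, then highest bit score) are assumptions.

```python
def best_hits(lines, max_evalue=1e-6):
    """Map each query to its best hit as (subject, evalue, bitscore).

    Assumes tab-separated columns: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit does not pass the E-value threshold
        current = best.get(query)
        # Prefer the lower E-value; break ties with the higher bit score.
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    rows = [
        "c1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200",
        "c1\tprotB\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t300",
        "c2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t1e-3\t40",
    ]
    print(best_hits(rows))
```

The contig and protein identifiers in the usage example are made up; contig `c2` is dropped because its only hit does not pass the threshold.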
Population genetic analyses
We used POPGENE v1.32 15 to calculate the number of alleles and the expected and observed heterozygosity. Five primer pairs were selected for phylogenetic analysis using 23 individuals from four populations. We used MEGA6 16 to construct a dendrogram using the unweighted pair-group method with arithmetic mean (UPGMA).
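For readers unfamiliar with these summary statistics, the sketch below computes them for a single locus from diploid genotypes. It is a simplified illustration, not the cited software: expected heterozygosity is taken as one minus the sum of squared allele frequencies, without any sample-size correction that a dedicated package may apply, and the function name `locus_stats` and the allele sizes in the example are hypothetical.

```python
from collections import Counter


def locus_stats(genotypes):
    """Return (number_of_alleles, observed_het, expected_het) for one locus.

    genotypes: list of (allele1, allele2) tuples, one per diploid individual.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("at least one genotype is required")
    allele_counts = Counter(a for g in genotypes for a in g)
    total = 2 * n  # two allele copies per individual
    # Observed heterozygosity: fraction of individuals with two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / n
    # Expected heterozygosity: 1 - sum of squared allele frequencies.
    he = 1.0 - sum((c / total) ** 2 for c in allele_counts.values())
    return len(allele_counts), ho, he


if __name__ == "__main__":
    # Made-up fragment sizes (bp) for four individuals at one SSR locus.
    sample = [(150, 150), (150, 154), (154, 154), (150, 154)]
    print(locus_stats(sample))
```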
Results and Discussion
De novo assembly
A total of 28,483,317 high-quality RNA-Seq reads passed our stringent quality assessment and filtering. We used the Trinity assembler 10 to assemble these reads, resulting in 133,019 contigs. The length of the contigs in our assembly ranged from 201 to 13,830 bp, with an average of 1263 bp and a median of 850 bp. The N50 value of our assembly is 2104 bp. N50 is an important measure of assembly quality, defined as the contig length such that contigs of that length or longer contain at least 50% of the bases in the assembly. The N50 for
Length distributions for all 133,019 transcriptome contigs for
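The N50 definition given above can be made concrete with a short Python sketch. The function name `n50` is an illustrative choice, and the example lengths are made up rather than taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of all bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("at least one contig length is required")


if __name__ == "__main__":
    # Total is 32 bases; the two longest contigs (8 + 8) reach 16, so N50 = 8.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because long contigs dominate the running total, N50 generally exceeds the median contig length, which is consistent with the values reported for this assembly.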
Frequency and distribution of different types of SSR markers
MISA was used to analyze the 133,019 contigs, identifying 40,885 putative SSRs (Table 1). The number of SSRs obtained in this study was lower than that in known model plants, including
Counts of various SSR types with different repeat motifs in V. baillonii.
Of the two possible types of mononucleotide repeats, the most abundant was (A/T)n (97.1%), as in most plants,20,21 whereas (G/C)n contributed 1.90% of total SSRs, which was higher than the 0.05% reported in tree peony.21 For the dinucleotide repeat category, different species have unique motif frequency distributions. For example, AG/CT repeats were more frequent in
Trinucleotide repeats AGC/CGT, AGG/CCT, and CCG/CGG were observed more frequently in all the monocot species, whereas A/T-rich repeats, such as AAC/GTT, AAG/CTT, and AAT/ATT, were preferred in dicots. Similar to the results of Sonah et al.,20 A/T-rich repeats were the dominant trinucleotide SSRs in
SSR loci were divided into two groups based on the size of SSR tracts and possible informative genetic markers: Class I, or hypervariable markers, consisted of SSRs ≥20 bp, and Class II, or potentially variable markers, consisted of SSRs ≥12 bp and <20 bp. Class I microsatellites are generally highly polymorphic and more informative because of the large size and long repeats, which was first observed in human beings,23 and subsequently confirmed by studies in a number of other organisms, including rice.22,24 Class II microsatellites are less variable, representing mutations that have accumulated recently because of sporadic SSR expansion. In
SSR-containing coding sequences annotation
A large number of the contigs in our assembly (28,912, 70.7%) had more than one hit in the NCBI NR database with an E-value of 1e-6. This percentage was higher than
GO classification of SSRs in coding regions.
The
Validation of SSR assays and UPGMA analysis
Fourteen SSR primers, sizes, and summary statistics across four populations in V. baillonii.

UPGMA dendrogram constructed based on eight genotypes from four representative populations and five SSR markers developed in this study. Two clusters were identified, generally corresponding to two geographic locations: the Mekong River and the Jinsha River (YN, Yunnan Province; SC, Sichuan Province).
Plant genomes are very complex and contain large amounts of repetitive DNA, including microsatellites, which has immediate practical implications for the success of SSR marker development. The 14 SSRs identified in this study represent high-quality and polymorphic genomic loci, which will allow us to further explore the genetic diversity and genetic structure in wild populations of
Conclusions
In this study, we used RNA-seq to determine a de novo transcriptome for
Author Contributions
Carried out the laboratory experiments and statistical analysis: LW. Participated in the sample collections and the statistical analysis: ZKW. Assisted with bioinformatics tools: JBC. Guided the appropriateness of the tools: CYL, WLZ. Performed bioinformatics analysis and participated in writing: LYW. Conceived the idea, guided the study, performed bioinformatics analysis, and participated in writing: LHM. All the authors participated in the editing of the manuscript, reviewed, and approved the final manuscript.
Supplementary Materials
Supplementary Table 1
Locations of populations of sampled
Supplementary Table 2
Primer pairs synthesized and used for validation based on SSR sequences from
Supplementary Table 3
The frequency of repeat motif for eight genotypes observed in
