Abstract
Introduction
MicroRNAs (miRNA) are small noncoding RNAs with an average length of 22 nucleotides (NTs).1,2 MiRNAs are believed to play important roles in gene regulation by targeting the untranslated regions of messenger RNAs, which leads to cleavage or translational repression.3–5 MiRNA sequences are encoded in the genome and are transcribed by RNA polymerases. 6 The primary microRNA (pri-miRNA) transcript folds itself back to a typical hairpin secondary structure. The ribonuclease Drosha cleaves the pri-miRNA in the nucleus, resulting in the precursor miRNA (pre-miRNA). The length of the pre-miRNA differs from species to species, but has an approximate stem-loop length of 60–70 NTs. 1 Exportin-5 transports the pre-miRNA from the nucleus to the cytoplasm, where it is cleaved into about 22 NT duplexes (5’ and 3’). The mature miRNAs as well as the pre-miRNAs are conserved among several species.1,6 The typical hairpin secondary structure is important due to the fact that it acts as a structural motif for Exportin-5, 7 and also for Dicer to cleave the pre-miRNA into 5’ and 3’ mature miRNA. 8
MiRNAs play an important role in human disease pathways 9 and, due to the availability of high-throughput sequencing, it is advantageous to use computational methods to detect potential pre-miRNA sequences. In particular, studies have identified dysregulation of miRNAs as having a role in human cancers. 10 Creating genome-wide miRNA expression profiles is an important step in uncovering such dysregulation cases in cancer subtypes. For example, renal clear-cell carcinoma accounts for approximately 90% of cases of kidney cancer in adults. 11 Due to the lack of reliable biomarkers indicating early stages of the disease, many patients develop metastases, leading to poor prognoses. 12 Survival rates significantly improve if these cancers are detected early. 13 Recently, a study that used miRNA sequencing identified miRNAs differentially expressed in fresh-frozen, clear-cell renal cell carcinoma (ccRCC) versus nontumoral renal cortex cells. 14 Computational miRNA prediction methods can reduce the high number of possible sequences that have to be biologically validated when analyzing miRNA profiles of cancer subtypes.
Previously published methods for predicting novel pre-miRNA sequences used different combinations of features and classifier algorithms, like Triplet Support Vector Machine (SVM), 15 MiPred, 16 , and PmirP, 17 among several other methods.
Xue et al.
15
first proposed a method based on local contiguous structure composition centered on the middle nucleotide of each extracted structure triplet. These 32 nucleotide-structure triplets were used to train a SVM. MiPred implemented a random forest (RF) classifier using the same nucleotide-structure triplets as mentioned above.
16
However, the authors added important feature characteristics of pre-miRNAs: the minimum free energy (MFE) of the secondary structure and the
Comparison of features in different selection methods.
In this study, we focus on the classification of real and pseudo pre-miRNAs using a new combination of features including a new variant of the nucleotide-structure triplets described in Xue et al. 15 We use two machine-learning algorithms and validate the algorithms on completely unseen data, in contrast to most of the previous work. In addition, we test the performance of a minimal classifier, using only the most highly predictive features, and compare it to classifiers from previous studies.
Methods
Data set
All miRNAs that are biologically validated and published are stored in miRBase.18–21 We downloaded the human pre-miRNA sequences from the release 18 in March 2012, which comprises 1,527 sequences. Only human pre-miRNAs, whose secondary structures contain one single loop, were considered, resulting in 1,478 sequences, which were used as a positive label set for classification. While a negative label set is required for a classifier training, no such data were available in public as negative results are seldom reported for novel miRNA discovery. We had to create a negative label set by ourselves. Using the process described in Xue et al. 15 Jiang et al. 16 and Zhao et al. 17 protein-coding sequences (CDS) of human RefSeq genes without alternative splicing sites were downloaded from the UCSC Genome browser. 22 We joined the CDS and extracted nonoverlapping segments, keeping the same length distribution of the current human real pre-miRNAs. To ensure that these pseudo pre-miRNA sequences had similar characteristics to the true miRNAs, the pseudo pre-miRNAs were filtered. The following criteria were used in the filter: the secondary structure contained only one single loop, the MFE of the secondary structure was at most -4.30, and the minimum number of base pairings at the stem was 14. We used these numbers because -4.30 is the maximum value of MFE in the true pre-miRNA set, and 14 is the minimum number of paired nucleotides in the stem part of the human real pre-miRNAs. In total, 21,836 pseudo sequences were generated and used as the negative sample set.
To train and validate the classifiers, we generated two data sets: a training and an external holdout set. The training set consisted of 1,183 true, positive-labeled pre-miRNAs and 17,469 negative samples from the set of pseudo pre-miRNAs. For validation, we used the remaining 295 positive and 4,367 negative instances. We also generated a validation set by holding out the top 30 differentially expressed miRNAs in the ccRCC miRNA expression data set (National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) study ID: GSE37616 14 ). These 30 were removed from the training data to test the ability to predict verified and biologically meaningful known miRNAs as if they were undiscovered.
Feature collection
We collected 115 different features related to the pre-miRNA sequence and its secondary structure. This is a superset of all features used in previous studies plus unused, novel features such as right-structure triplets. The features can be categorized into four classes: (1) MFE, (2) sequence, (3) secondary structure, and (4) a combination of sequence and secondary structure information. These classes and their feature values are described below in more detail.
Minimum free energy
The MFE is predicted by the Vienna RNAfold package.
23
Due to the fact that the energy value decreases with an increasing length of a pre-miRNA sequence, the MFE is also normalized by the length of the hairpin. To significantly distinguish random sequences from real ones based on the MFE value, a Monte Carlo randomization test (previously described in Bonnet et al.
24
) was performed. For one given sequence, 1,000 random sequences were generated with the same dinucleotide distribution. The random doublet-preserving permutation algorithm
25
was used for the random sequence generation. The work described in Workman and Krogh
26
shows that random sequences with the same dinucleotide distribution are more likely to have similar MFEs than mononucleotide shuffled ones. A permutation score is calculated by
Sequence information
The pre-miRNA contains sequence information about the 5’ and 3’ mature miRNAs. Given this information, knowing the full length of the pre-miRNA sequence becomes important. Based on the nucleotides in the pre-miRNA sequence, the GC ratio is also calculated as the proportion of bases G and C to all four bases (A, C, G, and T/U).
Nucleotide pairs
The secondary structures of the pre-miRNAs were predicted using the Vienna RNAfold package.23,27 The structures are represented in bracket notation, which contains only two statuses for a nucleotide: paired and unpaired. Open and closed parentheses, “(”and”)”, are used for a nucleotide pair between a nucleotide on the 5’ end and a nucleotide on the 3’ end, respectively. Dots represent unpaired nucleotides. We did not distinguish between a nucleotide on the 5’ or 3’ end in this study, so we used “(” for both cases. A typical secondary structure of a pre-miRNA contains a stem and a single loop as displayed in Figure 1-a. Pre-miRNAs containing multiple loops in their secondary structures were not considered. The bracket notation gives information about the number of paired and unpaired nucleotides, as well as the ratio between them. The number of nucleotides on the 5’ and 3’ stem arm can be different due to bulges caused by unpaired nucleotides. This fact was used to normalize the number of nucleotide pairs by the longer stem arm. The loop part is normalized using the length of the pre-miRNA hairpin.

The secondary structure of a pre-miRNA can change if one nucleotide is different. This Figure illustrates (
It is important to encode the secondary structure with the sequence information because, as illustrated in Figure 1, the change of only one nucleotide in a pre-miRNA sequence can result in a different secondary structure. Due to this fact, the structure sequence in bracket notation is divided into overlapping triplets, considering each nucleotide in each triplet. For each base, there are 8 (23) possible triplet structures: “…”, “.(”, “.(.”, “.((”, “(.”, “(.(”, “((.” and “(((”. With the left, middle, or right nucleotide of each triplet, there are 96 (4 (bases) x 8 (triplets) x 3 (different nucleotides)) possible nucleotide-structure combinations, which we list as “A…_l”, “A…_m”, “A…_r”, …, “U(((_l”, “U(((_m”, “U(((_r”, where “l,” “m,” and “r” represent left, middle, and right, respectively.
We only consider the 5’ and 3’ stem arm to extract the triplets from the secondary structure of a pre-miRNA, as shown in Figure 2. The numbers of different nucleotide-structure triplets are counted and then normalized by the number of the extracted triplets per nucleotide type (left, middle, or right) to generate a 96-dimensional feature vector.

An overview of the workflow to extract different nucleotide-structure triplets.
Another characteristic of RNAs is the possible pairing of the bases G and U. These GU-wobble pairs are very common in RNAs, including miRNAs. 28 The normalized number of GU-wobble pairs (the total number normalized by the number of pairs) is also important because the number of GU-wobble pairs increases with the number of pairs in the hairpin.
Feature selection
We applied the RELIEF, 29 which is a feature selection method with tolerance and low (linear) computation cost, to rank features in terms of their contribution to the classification performance of the machine-learning model.
RELIEF uses a randomized mechanism to draw instances
Implementation
We used in-house Perl scripts to compile the real and pseudo miRNA sequences, extract the comprehensive set of features, and filter the sequences. We used the Weka 30 Java implementation of the RELIEF feature extraction algorithm. In-house R scripts were used for the creation of training and validation sets for each method, as well as the classifier training and evaluation. We used in-house-developed R functions and the R packages “randomForest” 31 and “ROCR” for implementation.
Results
Selected features
Figure 3 shows RELIEF scores for selected features. We selected the top 30 features among the initial 115 using RELIEF scores. The number 30, chosen as the RELIEF curve, had a kink at 30 and this made our feature set size similar to those of comparing methods. We used 10-fold cross-validation to estimate the performance of the trained model. Consistent cross-validation performance shows a generalizability and lack of over-fitting with the model. We applied logistic regression (LR) and RF models to compare the performance of our feature set with other previously published feature sets. For a fair and comprehensive comparison, we used optimal parameters within each machine-learning method for each feature set (Table 2). Based on the slope change pattern in the RELIEF curve, we then selected the top 7 features to examine the performance of a “minimal” classifier as compared to more comprehensive feature sets.

Feature contribution to outcome prediction according to the RELIEF score.
Parameters used for RF for each feature set.
Performance comparison of feature sets
The performance of the two classifier models with 10-fold cross-validation is illustrated in Table 3 and Figure 4. For classifier evaluation, we used the area under the ROC curve (AUC) as it is a well-known aggregated discrimination performance measure that does not commit to a particular threshold. 32 Our feature set combination shows consistently higher AUC values than the feature sets of Xue et al. 15 and comparable AUC values to those published by Jiang et al. 16 and Zhao et al. 17 in both types of classifiers (LR and RF). We also used a holdout set of 20% of the data for external validation. The AUC values resulting from the use of the classifiers on this unseen data were slightly higher (Table 4) than they were in the training set, indicating low likelihood of over-fitting. Figure 4 shows the estimated AUC values for each feature set applied to the LR and RF models. Given the similarity in discrimination among the last three methods–Jiang, Zhao, and the proposed method–we investigated to identify the common features across all three methods that are most critical for prediction. Based on the RELIEF score, we selected the top seven common features and trained LR and RF. The “minimal” classifiers reached the same performance level as the ones with features more than seven, with AUC values of 0.9824 and 0.9796 for LR and RF, respectively, on the validation set. In addition, we confirmed the robustness of the RELIEF selections by sequentially adding features onto a RF classifier based on the RELIEF scores (Fig. 6).
Average performance of different feature sets and classifier models in 10-fold cross-validation.

Performance of two classifier models on the validation set: (
Performance of different feature sets and classifier models on the external validation data set.

AUC values versus number of features sequentially added based on RELIEF score for RF classifiers in a 10-fold cross-validation.
Real data miRNAs
Our feature set and that of Jiang et al. 16 had equal sensitivity (0.9) at a threshold of 0.5, while those of Xue et al. 15 and Zhao et al. 17 had sensitivities of 0.067 and 0.633, respectively, on the holdout true-positive miRNA set from the renal cancer study. In a histogram of prediction probabilities (Fig. 5), our feature set exhibited a skewed distribution at high-valued output estimates, indicating that, when the classifiers that use our feature set output a high score, there is high likelihood that this is a true miRNA. Of interest are a few real miRNAs that had low scores (below 0.5) for all feature sets. None of the feature sets were able to predict hsa-miR-660, hsa-miR-15a, and hsa-miR–532. miRBase 19 stem-loop diagrams show that all three of these miRNAs have noncanonical two-loop structures, which cause all the feature sets to fail.

Histogram of estimates (ie, prediction probabilities) for the positive samples from the renal cancer study.
Discussion
We compared our minimal feature set with information about the sequence and structure of a pre-miRNA with three other published feature sets. An important characteristic that defines real pre-miRNA sequences is the MFE. Adding this energy value to a feature set increased the prediction performance, regardless of the classifier model. This was verified through the similar performance of all methods that use MFE, including ours, Jiang, 16 and Zhao. 17
Previously, other methods have relied on nucleotide triplets to contain the sequence and structure information used to accurately predict pre-miRNAs. A single-nucleotide change can have a significant impact on the secondary structure, so these features were useful for capturing these characteristics of miRNAs. MFE values were added to improve classifier performance, based on studies on the MFE of noncoding RNAs which determined that pre-miRNAs have lower free energy than randomized sequences, while other ncRNAs do not. 24 After comprehensive evaluation, we find that with MFE and base pairing counts, the fine-grain information from nucleotide triplet features is no longer needed. Since the sequence is used to determine the secondary structure in the first place through the Vienna RNAfold package, the resulting MFE and aggregate base pairing information already contain this critical information used to classify pre-miRNAs. This discovery allows us to create the minimal classifier for pre-miRNA prediction using biologically explainable, structure-wide features. The increased ease in implementation of this classifier makes it a good baseline for the creation of more sophisticated methods to predict noncanonical miRNAs and other ncRNAs.
A limitation of this work is in the scope of questions addressed. We analyzed a comprehensive set of miRNA features to identify the importance of top features and compared the results among similar feature sets. We also tested the ability of the sets to identify biologically relevant renal cancer miRNAs as if they were undiscovered in order to assess the ability of the feature sets to predict truly novel miRNAs. However, our method may still have limited ability to accurately predict noncanonical miRNAs.
The two paths for extension of this work are applying more advanced machine-learning methods on our feature set and applying our feature set to identify novel miRNAs in additional data sets. Issues such as creating the artificial negative data sets, class imbalance between the negative and the positive training data, and addition of features and methods to identify noncanonical miRNAs were outside of the scope of this work, yet may be useful in improving the accuracy of miRNA prediction methods.
Conclusion
We compared feature sets in classifier performance for the task of predicting novel pre-miRNAs. A minimal set of seven is sufficient to attain the same classification performance as more comprehensive feature sets. Further validation using other data sets will help determine whether the number and type of features is generalizable to precursor miRNAs found in different types of samples.
Author Contributions
PS and JK conceived and designed the experiments. PS, EL, and JK analyzed the data. PS and EL wrote the first draft of the manuscript. PS, EL, JK, XJ, and LOM contributed to the writing of the manuscript. PS, EL, and JK jointly developed the structure and arguments for the paper. PS, EL, JK, XJ, and LOM agreed with manuscript results and conclusions. JK and LOM made critical revisions. All authors reviewed and approved of the final manuscript. We thank Tyler Bath for proofreading the manuscript.
