Abstract
Introduction
MicroRNAs (miRNAs) are small endogenous single-stranded noncoding RNAs containing about 22 nucleotides, and they usually regulate the gene expression at the posttranscriptional level by binding to the 3′-untranslated region of related messenger RNAs (mRNAs).1-3 In 1993, the first miRNA
Several recent studies have revealed that miRNAs are highly relevant to the development of complex human diseases, including various cancers, diabetes, acquired immune deficiency syndrome, and neurological disorders. 16 For example, the expression level of miRNA-141 is increased in breast cancer patients. 17 In addition, miRNA-145 is downregulated in atypical meningiomas and functions negatively by regulating the proliferation and motility of meningioma cells. 18 Compared with healthy individuals, the expression level of miRNA-106a is significantly higher in glioblastoma patients. 19 Drawing on such studies, the Human microRNA Disease Database (HMDD) 3.0 has collected 32 281 experimentally supported miRNA-disease association entries from 17 412 papers, covering 1102 miRNA genes and 850 diseases. 20 Several studies also indicate that more than one-third of genes are regulated by miRNAs, 21 which further supports the associations between miRNAs and diseases. Based on these results, miRNAs are considered novel potential biomarkers or diagnostic tools for diseases.22,23 Therefore, exploring the relationships between miRNAs and diseases is meaningful for the prognosis, diagnosis, treatment, and prevention of complex human diseases.24-26
Nevertheless, traditional experimental methods for identifying miRNA-disease associations are costly and time-consuming. Because previous biological studies have provided massive and reliable miRNA data and related data, 20 researchers began to develop in silico methods to predict miRNA-disease associations, which make the follow-up biological validation experiments much more convenient and effective. 27 Currently, most computational approaches are based on networks, including miRNA association networks, disease phenotype networks, 20 miRNA-disease networks, 28 gene co-expression networks, 29 and protein-protein interaction (PPI) networks. 30 The basic assumption of most computational methods is that functionally similar miRNAs are more likely to be associated with phenotypically identical or similar diseases and vice versa. 31 Therefore, the key to judging whether an miRNA is related to a specific disease is the similarity computation, which is based on known miRNA-disease relationships and external information such as gene ontology, PPIs, and gene expression. In recent years, with the development of machine learning, prediction approaches based on machine learning have also been proposed. Here, we discuss the previous approaches from 2 aspects: network similarity methods and machine learning methods.
Network similarity methods, according to the information involved in similarity computation, can be grouped into 2 categories: 32 local network similarity methods24,33 and global network similarity methods.31,34 Local network similarity–based methods only consider the directed edge information contained in the involved networks, which ignore the global structure of these networks. For example, Jiang et al 24 proposed a Boolean network method that uses hypergeometric distribution to identify the miRNA-disease associations based on an miRNA-miRNA network, a disease-disease network, and an miRNA-disease relationship network. Xuan et al 33 proposed a
The machine learning–based prediction methods usually face 2 challenges: first, the current data sets include only positive samples and no negative samples; second, extracting the feature vectors of miRNA-disease pairs is nontrivial. Despite these limitations, machine learning methods can still yield high-quality prediction models. The first machine learning–based method for miRNA-disease association prediction was proposed by Xu et al, which extracted features from miRNA-disease network data and trained a support vector machine (SVM). 24 After that, Chen and Yan 35 proposed the model of regularized least squares for miRNA-disease association (RLSMDA), which is a global and semi-supervised learning method. Niu et al 36 integrated random walk and binary regression to identify novel miRNA-disease associations, yielding a supervised learning method based on global similarity. Although the existing computational methods have already achieved good performance, there is still room for improvement.
In recent years, many researchers have applied deep neural networks to bioinformatics problems and obtained promising results. 37 For instance, Peng et al 38 identified miRNA-disease associations with a learning-based framework, MDA-CNN, which is based on convolutional neural networks, and Luo et al 39 predicted disease-gene associations by multimodal deep belief network (DBN) learning. It has been shown that DBNs can perform both unsupervised learning, by automatically learning high-level abstract features, and supervised learning, by using backpropagation to fine-tune the weights obtained from the unsupervised learning with a few labeled data. 40 The shortcoming of DBNs is that they are time-consuming when handling a large database, but they show great performance in extracting features from regular data and in performing supervised training with just a few labeled data. These properties make DBNs suitable for miRNA-disease association prediction, where only a few labeled data are available and the database is not very large.
In this study, we present a DBN-based matrix factorization model, DBN-MF, for miRNA-disease association prediction. The main idea is to factorize the miRNA-disease adjacency matrix into 2 matrices with DBNs: one represents the features of all the miRNAs, whereas the other represents the features of all the diseases. An association score for each miRNA-disease pair is then calculated by a classifier consisting of 2 DBNs and a cosine score function. The results of our computational experiments show that DBN-MF outperforms the state-of-the-art approaches.
Materials and Methods
Restricted Boltzmann machines
A restricted Boltzmann machine (RBM) is a stochastic neural network that only has 2 layers, a visible layer at the bottom and a hidden layer at the top. 41 The basic structure of an RBM is shown in Figure 1 and contains
where
where

The basic structure of RBM. RBM indicates restricted Boltzmann machine.
The probability distribution of the input data
The purpose of RBM training is to obtain the parameters set
where
where
DBNs
DBN is a probabilistic neural network proposed by Hinton in 2006. 44 A DBN model includes 1 input layer

The basic structure of the DBN model. DBN indicates deep belief network; RBM, restricted Boltzmann machine.
The probability distribution of the DBN model
where
The key to the DBN model is training the parameters. We train the RBMs one by one, from the bottom layer to the top, and obtain the parameters of each RBM with the contrastive divergence algorithm. After all the hidden layers have been trained, the output of the last layer represents the features extracted by the DBN.
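The greedy layer-by-layer training described above can be illustrated with a minimal sketch. It assumes binary inputs, sigmoid units, and one step of contrastive divergence (CD-1) per sample; the function names, learning rate, layer sizes, and toy data are all hypothetical and are not taken from the DBN-MF implementation. Plain Python lists are used only to keep the example dependency-free.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    # W[i][j] connects visible unit i to hidden unit j; c holds hidden biases.
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    # b holds visible biases.
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b))]


def train_rbm(data, n_hidden, epochs=10, lr=0.1, seed=0):
    """Train one RBM with CD-1 (an assumed, simplified training schedule)."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    b = [0.0] * n_visible
    c = [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            # Positive phase: hidden activations driven by the data.
            ph0 = hidden_probs(v0, W, c)
            h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
            # Negative phase: one Gibbs step gives the reconstruction.
            pv1 = visible_probs(h0, W, b)
            ph1 = hidden_probs(pv1, W, c)
            # CD-1 update: data statistics minus reconstruction statistics.
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - pv1[i] * ph1[j])
                b[i] += lr * (v0[i] - pv1[i])
            for j in range(n_hidden):
                c[j] += lr * (ph0[j] - ph1[j])
    return W, b, c


def train_dbn(data, layer_sizes, **rbm_kwargs):
    """Stack RBMs: each layer is trained on the output of the layer below."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden, **rbm_kwargs)
        layers.append((W, b, c))
        x = [hidden_probs(v, W, c) for v in x]
    return layers, x  # x holds the top-layer features of every input row


# Toy binary rows standing in for raw miRNA (or disease) feature vectors.
toy_rows = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0],
            [0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
toy_layers, toy_features = train_dbn(toy_rows, [3, 2], epochs=5)
```

In this sketch the second RBM is trained on the hidden probabilities of the first, which is one common way to stack RBMs; other choices, such as training on sampled binary states, are equally plausible.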
DBN-MF model
Problem statement
Suppose there are
In this study, we try to factorize matrix
Model-based methods45,46 usually assume that there is an underlying model that can predict the association score as follows:
where
Therefore, the key question becomes how to define the function
The process of DBN-MF model
The framework of the DBN-MF model is shown in Figure 3.

The flow chart of the DBN-MF model. DBN-MF indicates deep belief network–based matrix factorization.
Then, a cost function is used to measure the difference between the predicted score and the real label, and backpropagation is applied to update the parameters according to the cost function.
The cost function is also an important component of deep learning. The squared loss function is one common and simple cost function, yet it cannot perform well with implicit data that the target value
which we use in this study.
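The scoring step of the model, in which 2 feature extractors feed a cosine score function, can be sketched as follows. This is only an illustration under stated assumptions: the raw feature of an miRNA is taken to be its row of the adjacency matrix and the raw feature of a disease its column, each extractor is a stack of sigmoid layers, and the weights below are hand-picked toy values rather than trained parameters. The names `encode`, `cosine_score`, and `association_score` are hypothetical, and the cost function is deliberately left out.

```python
import math


def encode(x, layers):
    """Pass a raw feature vector through a stack of sigmoid layers.

    Each layer is a (W, bias) pair with W[i][j] linking input i to unit j.
    """
    for W, bias in layers:
        x = [1.0 / (1.0 + math.exp(-(bias[j] + sum(x[i] * W[i][j]
                                                     for i in range(len(x))))))
             for j in range(len(bias))]
    return x


def cosine_score(p, q, eps=1e-12):
    """Cosine similarity between two latent vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / max(norm, eps)


def association_score(mirna_raw, disease_raw, mirna_layers, disease_layers):
    """Score one miRNA-disease pair from the two encoders' outputs."""
    return cosine_score(encode(mirna_raw, mirna_layers),
                        encode(disease_raw, disease_layers))


# Toy adjacency matrix: 3 miRNAs (rows) x 2 diseases (columns).
adjacency = [[1, 0], [0, 1], [1, 1]]
mirna_raw = [float(v) for v in adjacency[0]]                # row of miRNA 0
disease_raw = [float(row[1]) for row in adjacency]          # column of disease 1

# Hypothetical single-layer encoders mapping both inputs to 2 latent units.
mirna_layers = [([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.0])]
disease_layers = [([[0.2, 0.1], [-0.3, 0.4], [0.6, -0.1]], [0.0, 0.0])]

toy_score = association_score(mirna_raw, disease_raw, mirna_layers, disease_layers)
```

Because sigmoid outputs are strictly positive, the cosine score in this sketch always falls between 0 and 1, which makes it directly comparable with a 0/1 association label.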
Experiments and Results
Data sources
To evaluate the effectiveness of DBN-MF, we apply it to the HMDD 50 database. HMDD is a manually collected database of human miRNA-disease associations with experimentally supported evidence. HMDD V2.0 was published in 2013 and includes 5441 positive associations between 501 miRNAs and 383 diseases after combining the miRNAs from different stages, such as hsa-let-7a-1 and hsa-let-7a-2. In 2018, a new version, HMDD V3.0, was published that contains 2-fold more entries than HMDD V2.0. After the same combining operation, HMDD V3.0 contains 17 198 positive associations between 1065 miRNAs and 894 diseases. As there are no confirmed negative samples, we randomly choose a negative set of the same size as the positive set from all nonpositive (unknown) associations for the supervised training.
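The negative sampling step can be sketched as below. The sketch assumes the associations are stored as a 0/1 adjacency matrix (rows for miRNAs, columns for diseases) and that unknown pairs are drawn uniformly without replacement; the function name `sample_negatives` and the toy matrix are hypothetical.

```python
import random


def sample_negatives(adjacency, seed=0):
    """Return all positive pairs and an equally sized random set of unknown pairs."""
    positives = [(i, j) for i, row in enumerate(adjacency)
                 for j, value in enumerate(row) if value == 1]
    unknown = [(i, j) for i, row in enumerate(adjacency)
               for j, value in enumerate(row) if value == 0]
    if len(unknown) < len(positives):
        raise ValueError("not enough unknown pairs to match the positive set")
    rng = random.Random(seed)
    # Sampling without replacement keeps the negative pairs distinct.
    negatives = rng.sample(unknown, len(positives))
    return positives, negatives


# Toy matrix: 3 miRNAs x 4 diseases with 4 known associations.
toy_adjacency = [[1, 0, 0, 1],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]]
toy_positives, toy_negatives = sample_negatives(toy_adjacency, seed=1)
```

Note that the sampled "negatives" are only unlabeled pairs, so some of them may be true associations that have not yet been confirmed.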
Evaluation methods
In this article, 10-fold cross-validation (10-fold CV) is used to evaluate the performance of DBN-MF. The 10-fold CV randomly divides the known positive associations and the same number of unknown samples into 10 folds; each fold is taken in turn as the test set, and the remaining 9 folds form the training set. We do not use leave-one-out cross-validation (LOOCV), because the database is big enough for 10-fold CV and the computational model is based on a deep neural network, which would be time-consuming with LOOCV.
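A minimal sketch of the fold construction is given below. It assumes the positive and sampled unknown pairs have already been pooled into one list of labeled samples and that the folds are formed by a single random shuffle; the function name `k_fold_splits` and the toy data are hypothetical.

```python
import random


def k_fold_splits(samples, k=10, seed=0):
    """Yield (train, test) lists; each fold serves as the test set exactly once."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Striding by k gives folds whose sizes differ by at most one sample.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Toy pool of 20 labeled samples: (pair index, label).
toy_samples = [(n, 1) for n in range(10)] + [(n, 0) for n in range(10, 20)]
toy_splits = list(k_fold_splits(toy_samples, k=10, seed=0))
```

This simple split does not stratify by label, so the ratio of positive to unknown samples can vary slightly from fold to fold.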
To evaluate the result of the 10-fold CV from different aspects, the area under receiver operating characteristics (ROC) curve (AUC), the area under
Hyperparameters
In this study, several hyperparameters affect the performance of the prediction. Because the supervised learning fine-tunes the parameters of unsupervised learning, the number of hidden layers
Another 3 hyperparameters that determine whether the model is well trained are learning rate (
Comparison with other algorithms
Comparison with the methods that integrated different kinds of evidence
The DBN-MF model predicts miRNA-disease associations based only on the miRNA-disease adjacency matrix. However, most prediction methods integrate different kinds of data, such as gene co-expression networks, PPI networks, and disease phenotype networks, to obtain more information. In this section, we compare the performance of DBN-MF in predicting miRNA-disease associations with 5 competing approaches: CIPHER, 52 Boolean network method, 24 Shi, 34 PBMDA, 53 and MDA-CNN. 38 These 5 methods are all based on heterogeneous networks. CIPHER is a network-based regression model that extracts the relationships between phenotypes and genotypes, Boolean network method is a local similarity–based method, Shi is a random walk–based global similarity method, PBMDA is a path-based method that constructs a heterogeneous network, and MDA-CNN is a machine learning–based method. All these methods are tested on HMDD V2.0 data by a 10-fold CV evaluation method. Table 1 shows the AUC, AUPR,
The comparison between DBN-MF and other 5 methods on AUC, AUPR,
Abbreviations: AUC, area under the curve; AUPR, area under
In Table 1, the bolded number is the largest in each column. According to the experimental results shown in Table 1, DBN-MF achieves the best performance on AUC, AUPR,
Comparison with the methods based on the same information
In section “Comparison with the methods that integrated different kinds of evidence,” we compared the performance of DBN-MF with other prediction methods based on heterogeneous networks. In this section, we compare DBN-MF with random walk and binary regression–based miRNA-disease association prediction (RWBRMDA), 35 a method that also predicts miRNA-disease associations using only the miRNA-disease association matrix. RWBRMDA was proposed in 2019; it integrates random walk and binary regression to identify novel miRNA-disease associations and is a supervised learning method based on global similarity. We run DBN-MF and RWBRMDA on both HMDD V2.0 and HMDD V3.0, and the ROC and precision-recall curves (PRC) of the prediction results are shown in Figure 4.

The comparison between DBN-MF and RWBRMDA on data HMDD V2.0 and HMDD V3.0. (A) The ROC of DBN-MF and RWBRMDA. (B) The P-R curve of DBN-MF and RWBRMDA. DBN-MF indicates deep belief network–based matrix factorization; HMDD, Human microRNA Disease Database; ROC, receiver operating characteristics; RWBRMDA, random walk and binary regression–based miRNA-disease association.
Figure 4A shows that DBN-MF achieves an AUC value of 0.92 on HMDD V2.0 (old data) and 0.94 on HMDD V3.0 (new data), both of which are higher than the corresponding AUC values of RWBRMDA. In addition, both DBN-MF and RWBRMDA perform better on the new data than on the old data. Figure 4B shows the AUPR values of the 2 methods on both versions of the database, and the trend is the same as for the AUC values: DBN-MF achieves higher values than RWBRMDA, and both methods perform better on the new version than on the old version. In short, 2 conclusions can be drawn from Figure 4. First, DBN-MF is superior to RWBRMDA when both predict miRNA-disease associations from the same information. Second, a bigger database can help the DBN-MF model improve its prediction ability.
Effects of DBN-MF components
To evaluate the contribution of each step of DBN-MF, we compare DBN-MF with a variant of it, DBN-SVM. In DBN-SVM, the first step is the same as in DBN-MF and uses DBNs to extract the features of miRNAs and diseases. DBN-SVM then trains an SVM-based classifier on the features extracted in the first step. Each miRNA-disease pair is considered a sample, and the features of the miRNA and the disease extracted in the first step are combined to represent the sample. Figure 5 shows the AUC and AUPR values of DBN-MF and DBN-SVM on HMDD V2.0 and HMDD V3.0.

The comparison between DBN-MF and DBN-SVM on the data HMDD V2.0 and HMDD V3.0. (A) The ROC of DBN-MF and DBN-SVM. (B) The P-R curve of DBN-MF and DBN-SVM. DBN-MF indicates deep belief network–based matrix factorization; HMDD, Human microRNA Disease Database; SVM, support vector machine; ROC, receiver operating characteristics.
According to Figure 5A, DBN-MF achieves a higher AUC value than DBN-SVM on both HMDD V2.0 and HMDD V3.0. Figure 5B shows that, in terms of AUPR, DBN-MF predicts better than DBN-SVM on the old data, whereas the 2 methods perform equally well on the new data. In addition, DBN-SVM performs much better than the RWBRMDA method on both the new and the old data sets. Overall, DBN-SVM can also predict miRNA-disease associations effectively and outperforms RWBRMDA, but it is still not as good as DBN-MF, especially when the database is not so big. These results demonstrate that both the DBN part and the backpropagation part play important roles in the good prediction performance of DBN-MF, and that backpropagation is especially crucial when the data are not big enough. In addition, the results on the old and new databases further illustrate that a bigger database leads to better performance in miRNA-disease association prediction than a smaller one.
Case study
To further demonstrate the ability of DBN-MF to identify novel miRNA-disease associations, DBN-MF is applied to HMDD V2.0 to predict all the unknown associations. Three other databases (HMDD V3.0, dbDEMC, 54 and miRCancer 55 ) are used to verify the novel associations predicted by DBN-MF on HMDD V2.0, and we also search the literature to confirm the newly predicted associations. In the prediction on HMDD V2.0, 5441 positive associations and 5441 unknown associations are chosen as training samples. DBN-MF trains a classifier on these 10 882 samples, and the well-trained classifier is used to predict the association score for all the unknown associations. For a certain disease
Lung cancer is one of the most common cancers and has a high mortality rate because it is difficult to diagnose at an early stage. 56 Nevertheless, miRNAs can act as biomarkers that help diagnose cancers at an early stage. Table 2 shows the top 20 candidate miRNAs associated with lung cancer, as predicted by the DBN-MF model based on the HMDD V2.0 data set. Of these 20 miRNAs, 19 have been verified to be associated with lung cancer according to HMDD V3.0, dbDEMC, miRCancer, or previous literature. Furthermore, for the unconfirmed miRNA hsa-mir-208b, a previous study 57 showed that hsa-mir-208b was significantly upregulated in all moderate pulmonary hypertension subjects, and pulmonary hypertension is common in lung cancer patients, which suggests that hsa-mir-208b may also be associated with lung cancer. The results in Table 2 demonstrate the effectiveness of our DBN-MF model in predicting novel associations between miRNAs and lung cancer.
The prediction results of the top 20 new miRNA-disease associations of lung cancer.
Abbreviations: HMDD, Human microRNA Disease Database; miRNA, microRNA.
Pancreatic neoplasm is another high-incidence disease that causes a large number of deaths every year. To further demonstrate the performance of DBN-MF, we analyze the top 20 novel associations between miRNAs and pancreatic neoplasm that are predicted by the DBN-MF model based on HMDD V2.0. The results are shown in Table 3, in which 18 of the 20 novel associations have been confirmed in dbDEMC or HMDD V3.0, and only 2 predicted associations, hsa-mir-499a and hsa-mir-372, are not confirmed. The prediction results in Table 3 further illustrate the validity and feasibility of our prediction model.
The prediction results of the top 20 new miRNA-disease associations of pancreatic neoplasms.
Abbreviations: HMDD, Human microRNA Disease Database; miRNA, microRNA.
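The candidate ranking used in the case study can be sketched as follows. The sketch assumes that a matrix of predicted association scores is available, that pairs already known in the training data are excluded, and that ties are broken by miRNA index; the function name `top_candidates` and the toy scores are hypothetical and do not reproduce any value from Tables 2 and 3.

```python
def top_candidates(scores, known_pairs, disease, k=20):
    """Rank miRNAs with no known link to `disease` by predicted score.

    scores[m][d] is the predicted score for miRNA m and disease d;
    known_pairs is a set of (miRNA index, disease index) tuples.
    """
    candidates = [(scores[m][disease], m) for m in range(len(scores))
                  if (m, disease) not in known_pairs]
    # Highest score first; the miRNA index breaks ties deterministically.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [m for _, m in candidates[:k]]


# Toy example: 4 miRNAs, 1 disease, miRNA 0 already known to be associated.
toy_scores = [[0.9], [0.2], [0.7], [0.5]]
toy_known = {(0, 0)}
toy_top = top_candidates(toy_scores, toy_known, disease=0, k=2)
```

The returned indices would then be mapped back to miRNA names and checked against independent databases or the literature, as done for the 2 diseases above.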
Conclusions
MiRNAs have been shown to be associated with a variety of diseases and can serve as biomarkers of diseases. Identifying miRNA-disease associations helps in understanding the underlying pathogenesis of diseases and in providing proper treatment. As more and more miRNA-related and disease-related databases have been created from biological experiments, researchers have begun to focus on predicting miRNA-disease associations by computational methods. In this study, we have proposed a DBN-based matrix factorization model named DBN-MF to identify the underlying miRNA-disease associations. First, DBNs were trained in an unsupervised manner on the raw features of miRNAs and diseases, respectively, to obtain the extracted features. Second, a classifier built on the 2 pretrained DBNs was trained in a supervised manner to fine-tune the parameters of the model. Finally, the well-trained model was used to predict the association score for each unknown miRNA-disease pair. We compared the DBN-MF model with previous computational methods on HMDD V2.0 and HMDD V3.0. The experimental results showed that DBN-MF achieved much better AUC and AUPR than the previous methods, both those based on the same information and those based on multiple types of evidence. The results on HMDD V3.0 were better than those on HMDD V2.0, which demonstrated that a more comprehensive database can help improve the performance of the prediction method. The case study further illustrated the effectiveness of DBN-MF.
The excellent performance of DBN-MF is attributed to several important factors. First, the model took full advantage of the valid and updated miRNA-disease association data verified by biological experiments. Even though it did not integrate multiple types of data, the association data were sufficient to train a good model. Second, the unsupervised training of DBNs can learn the latent features of miRNAs and diseases very well, and the DBNs are trained on all the miRNAs or diseases, so the model captures global information. Finally, backpropagation has a strong ability to learn the underlying complex associations between miRNAs and diseases from the labeled data. In summary, the excellent performance of this model is attributed to the nonlinear features of diseases and miRNAs that our proposed deep networks learned in the process of matrix factorization. This advantage reveals information that traditional linear matrix factorization methods cannot learn.
Although DBN-MF shows great performance in predicting novel miRNA-disease associations, it also has some limitations. For example, the calculation of DBN-MF is based on known miRNA-disease associations, so it cannot predict novel associations for new diseases or miRNAs that have no known associations. In the future, we will further improve our model by extracting the features of miRNAs and diseases from more diverse types of information, such as the genes targeted by miRNAs, the GO terms of miRNAs, the protein-protein network, and the disease phenotype. Specifically, we could integrate the miRNA and disease features extracted by DBN-MF with the features extracted from other types of data. For example, the semantic features of miRNAs can be obtained with a manifold learning method from miRNA target gene information in the database mirTarBase 58 and the gene ontology annotations in the database GO.59,60 The semantic features of diseases can be obtained with a manifold learning method from the directed acyclic graph (DAG) constructed from the MeSH descriptors (https://www.nlm.nih.gov/). Each miRNA or disease can then be represented by integrating its DBN-MF features with its semantic features. Finally, an association score for each miRNA-disease pair is obtained from the integrated features of miRNAs and diseases. We may also use iFeature, 61 iLearn, 62 BioSeq-Analysis2.0, 63 or BioSeq-Analysis 64 to extract the features of miRNAs to improve our method.
