Ma, Yingjun; Ma, Yuanyuan

Introduction

MicroRNAs (miRNAs) are a group of small noncoding RNAs that play important roles in many biological processes [1]. They can inhibit or promote gene expression, thereby affecting protein synthesis. As a result, dysregulation of miRNAs is associated with various biological processes and diseases [2–4]. The identification of disease-related miRNAs is highly significant for studying disease pathogenesis and for drug development. In the past, biological experiments were used for this purpose, but such methods are not only time-consuming and laborious but also inadequate for large-scale detection of miRNA-disease associations [5]. With the development of high-throughput sequencing technology, many databases for miRNA and disease research have been established. Notably, DIANA-TarBase [6], miRTarBase [7], and miRWalk [8] offer a vast collection of miRNA-gene associations, while miRBase [9] and MiREDiBase [10] furnish sequence data and miRNA editing sites, respectively. The Comparative Toxicogenomics Database (CTD) [11] gathers a vast array of biological entities and associated information, including diseases, genes, phenotypes, and chemical compounds. These databases have significantly broadened our comprehension of miRNA functions and their regulatory mechanisms, serving as a foundation for constructing computational models to predict potential miRNA-disease associations. In recent years, numerous models for predicting miRNA-disease associations have been proposed. Tang et al. [12] developed a multi-channel graph convolutional network, which utilized a GCN encoder to capture features under different views and augmented the learned prediction representation with multi-channel attention. Ma et al.
[13] proposed a graph autoencoder model to address the over-smoothing problem of GNN methods; it employs a graph encoder to splice aggregate feature embeddings and self-feature embeddings, and adopts a bilinear decoder for link prediction. In previous research, to obtain higher-quality similarity networks, we introduced kernel neighborhood similarity into multi-network bidirectional propagation, which effectively integrates multi-network information to improve prediction performance [14]. Li et al. [5] integrated GCN, CNN and the Squeeze-and-Excitation Network (GCSENet) to devise a novel prediction model for miRNA-disease associations. The model utilizes GCN to gather features from the miRNA-disease-gene heterogeneous network, performs convolutional operations via CNN, and determines the importance of each feature channel with SENet's squeeze-and-excitation blocks. Most of the aforementioned techniques concentrate on predicting a binary association between miRNA and disease. Nevertheless, mounting evidence indicates that the malfunction of miRNAs triggers disease through various mechanisms [15,16]. On the one hand, only certain miRNAs cause a given disease. For instance, targeted deletion of the heart- and muscle-specific miR-1-2 results in defects in cardiac morphogenesis, such as ventricular septal defects, and high mortality before and after birth [17]. On the other hand, the same miRNA can cause the same disease through several distinct pathways. For example, miR-146a can directly target SMAD4, thereby regulating cell proliferation and apoptosis and playing a role in the onset and development of gastric cancer [18]. The ectopic expression of miR-146a also inhibits the migration and invasion of gastric cancer cells and down-regulates the expression of EGFR and IRAK1 [19].
Therefore, identifying miRNA-disease associations together with their association types will deepen our understanding of the pathogenesis of diseases related to miRNA dysregulation. In recent years, researchers have placed greater emphasis on identifying different types of miRNA-disease associations. Chen et al. [20] first proposed a restricted Boltzmann machine model (RBMMMDA) for predicting associations among miRNAs, diseases, and association types. However, RBMMMDA neglects auxiliary information, which limits its predictive power. Inspired by the successful application of tensor decomposition to high-order biological relationships [21], Huang et al. [22] first introduced the CANDECOMP/PARAFAC (CP) decomposition technique to multi-type miRNA-disease association prediction, with biological similarity serving as a constraint to enhance prediction accuracy. Subsequently, Wang et al. [23] developed the NMCMDA method, which uses end-to-end data-driven learning for the prediction of multi-category miRNA-disease associations. To address the tendency of tensor decomposition models to fall into local minima and produce false-negative samples, Dong et al. [24] developed a novel multi-type miRNA-disease association prediction model by integrating hypergraph learning and tensor weighting into non-negative tensor decomposition. Because miRNA-disease-type datasets are intrinsically incomplete and noisy, Yu et al. [25] combined tensor decomposition and label propagation, employing robust principal component analysis on tensors to obtain low-rank prediction tensors and using label propagation to transfer information. While these methods have been successful in multi-type disease-related miRNA prediction tasks, limitations remain.
Firstly, calculating miRNA similarities from known miRNA-disease associations makes the model rely on those known associations. Secondly, both CP decomposition and non-negative tensor decomposition are linear models, which hinders their ability to identify complex nonlinear relationships among miRNAs, diseases and association types. Finally, the above models include numerous hyperparameters requiring adjustment, which affects both computational efficiency and generalization ability. To address these challenges, we propose a novel computational model called Kernel Bayesian Logistic Tensor Decomposition with Automatic Rank Determination (KBLTDARD) for predicting different types of miRNA-disease associations. Firstly, to reduce dependence on known associations and enhance network precision, we construct the functional similarity of miRNAs from data sources beyond the miRNA-disease-type associations, and fuse multiple similarities to obtain a more accurate miRNA (disease) similarity. Secondly, to ensure nonlinear learning ability and avoid tedious hyperparameter tuning, we build a hierarchical probabilistic model to formulate logistic tensor decomposition, and apply a fully Bayesian treatment with sparsity-inducing priors over multiple latent variables, enabling automatic search of hyperparameters. Finally, we develop a highly efficient deterministic Bayesian inference algorithm to ensure solution efficiency. Experimental results indicate that KBLTDARD outperforms other state-of-the-art methods in predicting multiple types of miRNA-disease associations with high accuracy.

Materials and methods

Method review

In this section, we propose a new computational model called KBLTDARD for predicting miRNA-disease-type associations, which mainly consists of three steps (shown in Fig 1). Firstly, multiple similarities of miRNAs (diseases) are calculated and fused to obtain a more accurate miRNA (disease) similarity network (step 1 of Fig 1).
Secondly, a Bayesian framework of logistic tensor decomposition is established, and the auxiliary information of miRNAs (or diseases) and the prior probabilities of latent variables are introduced to construct the probabilistic graphical model of KBLTDARD (step 2 of Fig 1). Finally, a Bayesian variational inference framework for KBLTDARD is established to predict potential miRNA-disease-type associations (step 3 of Fig 1).

Fig 1. The workflow of our proposed KBLTDARD model. https://doi.org/10.1371/journal.pcbi.1012287.g001

Dataset

The Human MiRNA Disease Database (HMDD) contains extensive data on experimentally validated human miRNA-disease associations [15,26]. Many computational models use HMDD to establish benchmark data sets for evaluation [22–25,27]. To facilitate comparison, we utilize two widely used multi-type miRNA-disease data sets (HMDD v2.0 and HMDD v3.2) established by Huang et al. [22] as benchmarks. Specifically, HMDD v2.0 classifies miRNA-disease associations into four types based on evidence from circulation, epigenetics, genetics, and target; it contains 1,675 associations between 324 miRNAs and 169 diseases under the four types, with a density of 0.681%. HMDD v3.2 contains 16,341 associations between 713 miRNAs and 447 diseases under five types (circulation, epigenetics, genetics, target, and tissue), with a density of 1.025%. To obtain auxiliary information beyond the association data, Huang et al. also downloaded disease descriptors from the Medical Subject Headings (MeSH) and calculated the semantic similarity of the diseases [28]. Furthermore, to avoid dependence on known miRNA-disease-type associations, following previous studies [12,14], we extracted miRNA-gene associations from miRTarBase Release 8.0 [7] and functional association probabilities of genes from HumanNet [29].
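As a quick arithmetic check, the reported density of the HMDD v3.2 benchmark follows directly from the counts above (known associations divided by all possible miRNA-disease-type triples):

```python
# Sanity check of the reported HMDD v3.2 density: known associations divided
# by the total number of possible miRNA-disease-type triples.
n_mirnas, n_diseases, n_types = 713, 447, 5
n_assoc = 16341

density = n_assoc / (n_mirnas * n_diseases * n_types)
print(f"{density:.3%}")  # → 1.025%
```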
Then, the functional similarity of miRNAs can be obtained by combining these two kinds of association information.

Tensor construction

Given a set of miRNAs, a set of diseases, and a set of association types, all associations among miRNAs, diseases and types can be described by a third-order tensor. The (i, j, k) element of the tensor represents the relationship between miRNA mi and disease dj under association type tk: it equals 1 when mi and dj are associated under tk, and 0 otherwise. The matrix Y∷k is the kth frontal slice of the tensor, representing the kth type of miRNA-disease association. Previous studies on miRNA-disease associations [5,12,30,31] ignored the impact of association types and were equivalent to research on a single association type. Although some associations have been discovered and validated, many associations remain unverified, and inferring these potential associations may improve our understanding of the pathogenic mechanisms of different types of miRNAs. To this end, our goal is the prediction of potential miRNA-disease-type triples, which is a tensor completion problem. However, since there are few known associations, the tensor is very sparse and contains very limited useful information. Therefore, extracting auxiliary information about the miRNAs and the diseases is an effective way to improve prediction performance.

MiRNA functional similarity

To prevent dependency on known miRNA-disease associations, we exploited known miRNA-gene associations and gene similarity to calculate miRNA functional similarity. Following previous studies [12,32], let LLS(gi, gj) denote the association log-likelihood score between genes gi and gj obtained from HumanNet. Then, the similarity S(gi, gj) between gi and gj is (1) where e(gi, gj) is the edge between genes gi and gj, and LLSmin and LLSmax denote the minimum and maximum log-likelihood scores in HumanNet, respectively.
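One plausible reading of Eq (1), consistent with the min/max scores mentioned around it, is a min-max rescaling of the HumanNet log-likelihood scores to [0, 1]; the sketch below makes that assumption explicit:

```python
def gene_similarity(lls, lls_min, lls_max, has_edge=True):
    """Sketch of Eq (1), assuming a min-max rescaling of HumanNet
    log-likelihood scores to [0, 1]; gene pairs without an edge e(gi, gj)
    in HumanNet are assigned similarity 0."""
    if not has_edge:
        return 0.0
    return (lls - lls_min) / (lls_max - lls_min)

# Hypothetical scores: LLS = 2.5 with global min 1.0 and max 4.0.
print(gene_similarity(2.5, 1.0, 4.0))  # → 0.5
```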
Let Gi and Gj denote the gene sets associated with miRNAs mi and mj, respectively; then the functional similarity can be calculated as follows: (2) where S(g, G) = max{S(g, gi) | gi ∈ G}, and |Gi| and |Gj| represent the number of genes contained in Gi and Gj, respectively. In summary, for each benchmark dataset, we obtain a miRNA-disease-type tensor, a disease semantic similarity, and a miRNA functional similarity.

Network integration

The known tensor also contains important information, which can be an important supplement to the miRNA (disease) similarity [22,27]. Dong et al. [27] used the known tensor to construct a miRNA-disease association matrix and then calculated the Gaussian similarity of miRNAs, which ignores the influence of association types. Therefore, to retain the information of miRNAs, diseases and association types simultaneously, we extract features directly from the tensor and fuse them to obtain a more accurate similarity network. Let the matrix Y(1) represent the mode-1 matricization of the tensor, that is, the projection of the disease and type dimensions onto the columns of Y(1). Then, the ith row of Y(1) represents the interaction profile feature of the ith miRNA. Similarly, let the matrix Y(2) represent the mode-2 matricization of the tensor, whose rows represent the interaction profile features of the diseases. Since Y(1) and Y(2) are both high-dimensional, sparse association matrices that inevitably contain noise, we first employ non-negative double singular value decomposition (NNDSVD) [33] for dimensionality reduction to obtain the non-negative low-dimensional features Fm and Fd of miRNAs and diseases, respectively. Then, we adopt Kernel Soft-neighborhood Similarity (KSNS) to build the similarity network. As shown in previous studies [14,34–39], KSNS hierarchically integrates neighborhood information, mines nonlinear relationships among samples, and has been successfully applied to the prediction of various types of biological interactions.
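The unfold-reduce-compare pipeline of this step can be sketched as follows. Note the stand-ins: plain truncated SVD replaces NNDSVD and a cosine kernel replaces KSNS, since those specific algorithms are not reproduced here; shapes and names are illustrative only.

```python
import numpy as np

# Sketch of the network-integration step: unfold the binary association
# tensor, reduce its dimensionality, and build a similarity network from the
# low-dimensional profiles. Truncated SVD and a cosine kernel are simple
# stand-ins for the paper's NNDSVD and KSNS.
rng = np.random.default_rng(3)
I, J, K, d = 30, 20, 4, 8
Y = (rng.random((I, J, K)) < 0.1).astype(float)   # toy sparse binary tensor

Y1 = Y.reshape(I, J * K)                          # mode-1 matricization Y(1)
U, s, Vt = np.linalg.svd(Y1, full_matrices=False)
Fm = U[:, :d] * s[:d]                             # low-dimensional miRNA profiles

norms = np.clip(np.linalg.norm(Fm, axis=1, keepdims=True), 1e-12, None)
Sm = (Fm / norms) @ (Fm / norms).T                # cosine similarity network
print(Sm.shape)  # (30, 30)
```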
Therefore, from Fm and Fd, the interaction profile similarities of miRNAs and diseases are obtained by KSNS, respectively. Finally, we obtain two types of miRNA similarity and two types of disease similarity, which measure the similarity relationships of miRNAs (or diseases) from different perspectives. Following previous studies [36,40], we utilize clusDCA to fuse the two miRNA similarities into the integrated miRNA similarity Sm, and to fuse the two disease similarities into the integrated disease similarity Sd.

KBLTDARD

In previous research, we established a tensor decomposition model (LTDSSL), which introduces logistic functions into tensor decomposition to improve nonlinear learning capability, showing strong performance in higher-order relation prediction problems [41]. However, LTDSSL requires manually specifying the rank of the tensor and the values of the hyperparameters, without considering the uncertainty of the latent factors. On this basis, this study combines tensor decomposition and Bayesian inference to establish a new tensor decomposition model. Let G, H and W represent the latent factor matrices of miRNAs, diseases and association types, respectively. Then, the association probability of the ith miRNA, jth disease, and kth type is as follows: (3) where the reconstructed tensor has its (i, j, k)th entry determined by the factor matrices. The known miRNA-disease-type triplets are experimentally verified and thus have higher reliability. Therefore, the weighted logistic tensor decomposition model is obtained as follows: (4) where c ≥ 1 represents the importance level parameter. Fig 2 presents the probabilistic graphical model of KBLTDARD with latent variables and corresponding priors.

Fig 2. The probability graph framework of KBLTDARD. https://doi.org/10.1371/journal.pcbi.1012287.g002

In Fig 2, the occurrence probability of the tensor is calculated from G, W and H by (3).
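The probability in Eq (3) can be sketched as a sigmoid applied entry-wise to a rank-R CP reconstruction; the dimensions and random factors below are illustrative:

```python
import numpy as np

def reconstruct(G, H, W):
    """Sketch of Eq (3): entry (i, j, k) of the predicted tensor is
    sigmoid(sum_r G[i, r] * H[j, r] * W[k, r]) — a CP/PARAFAC reconstruction
    squashed to an association probability in (0, 1)."""
    X = np.einsum("ir,jr,kr->ijk", G, H, W)   # CP reconstruction
    return 1.0 / (1.0 + np.exp(-X))           # element-wise logistic function

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 2, 5
P = reconstruct(rng.normal(size=(I, R)),
                rng.normal(size=(J, R)),
                rng.normal(size=(K, R)))
print(P.shape)  # (4, 3, 2)
```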
The probability distribution of the factor matrix G is obtained from U combined with the miRNA similarity Sm, and the probability distribution of the factor matrix H is obtained from V combined with the disease similarity Sd. σg, σh and λ are precision parameters. We specify the priors of all latent variables and parameters in this section. To effectively integrate the auxiliary information, the elements of the factor matrix G are independent, and the (i, r)th element Gi,r satisfies a Gaussian distribution with the corresponding expectation and precision σg (5) Similarly, let the elements of H be independent, with the (j, r)th element Hj,r satisfying a Gaussian distribution with the corresponding expectation and precision σh (6) Here, the precision parameters σg and σh of the Gaussian distributions satisfy the Jeffreys prior (7) In general, the effective dimension R of the latent space is a tuning parameter whose selection is quite challenging and costly. To infer the value of R while avoiding overfitting, we introduce automatic rank determination into the priors of U, V and W [42]. Specifically, let each column of U be an independent random vector whose rth column U∙r satisfies a multivariate Gaussian distribution with mean vector 0 and precision matrix λrEI (8) where EI is the identity matrix of size I × I, and λr controls the rth column of U. When λr is large, U∙r approaches 0, indicating that it contributes little to the reconstruction and can be removed from U. This realizes the automatic determination of R. Similarly, the rth column of V satisfies a multivariate Gaussian prior with mean vector 0 and precision matrix λrEJ (9) where EJ is the identity matrix of size J × J. Similar to U and V, the prior distribution of W is as follows (10) where EK is the identity matrix of size K × K.
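The pruning effect of the automatic-rank-determination priors in Eqs (8)–(10) can be illustrated numerically: a column drawn with a large precision λr is driven toward zero and can be dropped. The precision values below are hypothetical.

```python
import numpy as np

# Illustration of the ARD prior: each column r of U is drawn from
# N(0, (1/lambda_r) I), so a large precision lambda_r forces that column
# toward zero and effectively removes component r, shrinking the rank.
rng = np.random.default_rng(1)
I, R = 100, 4
lam = np.array([1.0, 1.0, 1e6, 1e6])         # hypothetical learned precisions
U = rng.normal(size=(I, R)) / np.sqrt(lam)   # column r has variance 1/lambda_r

norms = np.linalg.norm(U, axis=0)
active = norms > 1e-2 * norms.max()          # prune near-zero columns
print(active.sum())  # → 2, the effective rank after pruning
```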
To achieve joint column sparsity of U, V and W [43], we further assign the conjugate Gamma prior to λ (11) where λ = {λ1, λ2, ⋯, λR}, and {α, β} are the two parameters of the Gamma distribution; we adopt the uninformative setting α = 1, β = 1 [38]. For notational simplicity, all unknown latent variables are collected and denoted by Θ = {G, H, W, U, V, λ, σg, σh}. The probabilistic graphical model is shown in Fig 2, from which we can easily write the joint distribution of the model as (12) Combining the likelihood in (4), the priors of the model parameters G and H in (5) and (6), the prior distributions of U, V and W in (8), (9) and (10), and the hyperpriors in (7) and (11), the logarithmic joint distribution of KBLTDARD can be obtained (see S1 Text of S1 File for details) (13) where Λ = diag(λ1, λ2, ⋯, λR), ln(Λ) = diag(ln(λ1), ln(λ2), ⋯, ln(λR)), and diag(∙) denotes converting a vector into a diagonal matrix. In (13), tr(∙) represents the trace of a square matrix, const represents a constant independent of Θ, and ℓ(Θ) represents the logarithmic joint distribution. Without loss of generality, maximum a posteriori estimation of Θ by maximizing (13) is essentially equivalent to optimizing a regularized squared-error objective for the logistic tensor decomposition with additional constraints on the regularization parameters. However, unlike point estimation, our goal is to compute the complete posterior distribution of all variables in Θ given the data tensor and the similarities Sm and Sd, that is, (14)

Model inference of KBLTDARD

Exact Bayesian inference of (14) requires integration over all latent variables, which is analytically intractable. Therefore, this study adopts variational inference to calculate an approximate posterior distribution q(Θ) of the latent variables [42–44].
The principle of variational inference is to define a parameterized family of distributions over the latent variables and update the parameters to minimize the Kullback–Leibler (KL) divergence between the true posterior and q(Θ) (15) where ln P(Y) is a constant representing the model evidence, with its lower bound defined accordingly. With a mean-field approximation, q(Θ) is factorized over the latent variables as (16) It is worth noting that (16) is the only assumption made on the posterior distribution of the latent variables. When the other variables are fixed, the approximate logarithmic posterior distribution of the latent variable Θk can be obtained exactly (17) where the expectation is taken with respect to the remaining factors, const denotes a constant that does not depend on the current variable, and Θ\Θk represents the set of all latent variables except Θk. 1) Estimate latent variables G and H: By (3) and (4), the likelihood function contains exponential forms of G and H and thus has no conjugate priors. Therefore, following [45], we adopt the approximation (18) where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. Combining (3), (4) and (18), the logarithmic likelihood satisfies (see S2 Text of S1 File for details) (19) where ξijk is a local variational parameter. From (19), ln(h(ξijk, G, H, W)) is a quadratic function of G and H and is a lower bound of the log likelihood. Replacing the likelihood in (12) with h(ξ, G, H, W), combining (5), and substituting into (17), the approximate posterior density of the ith row Gi∙ of G obeys a multivariate Gaussian distribution with the corresponding expectation and covariance matrix Σ(Gi∙) (see S3 Text of S1 File for details) (20) where A(1) represents the mode-1 matricization of the corresponding tensor, ⊛ represents the Hadamard product, ⊙ represents the Khatri-Rao product, and ER is the identity matrix of size R × R.
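Approximations of the form of Eq (18), following [45], are commonly realized with the Jaakkola–Jordan bound; assuming that is the bound used here, the sketch below verifies its two defining properties numerically (it never exceeds the log-sigmoid, and it is tight at x = ξ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def jaakkola_bound(x, xi):
    """Quadratic lower bound on log sigmoid(x) with local variational
    parameter xi (the standard Jaakkola-Jordan bound):
    log sigmoid(x) >= log sigmoid(xi) + (x - xi)/2 - lam(xi)*(x**2 - xi**2),
    where lam(xi) = tanh(xi/2) / (4*xi)."""
    lam = np.tanh(xi / 2.0) / (4.0 * xi)
    return np.log(sigmoid(xi)) + (x - xi) / 2.0 - lam * (x ** 2 - xi ** 2)

x = np.linspace(-5.0, 5.0, 201)
for xi in (0.5, 1.0, 3.0):
    gap = np.log(sigmoid(x)) - jaakkola_bound(x, xi)
    assert gap.min() >= -1e-12                                   # valid lower bound
    assert abs(np.log(sigmoid(xi)) - jaakkola_bound(xi, xi)) < 1e-12  # tight at x = xi
print("bound verified")
```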
The required second moments are the expectations of Hj∙THj∙ and Wk∙TWk∙, respectively, as follows: (21) Similarly, the posterior density of the jth row Hj∙ of H obeys a multivariate Gaussian distribution with the corresponding expectation and covariance matrix Σ(Hj∙), that is, (22) where A(2) represents the mode-2 matricization of the corresponding tensor. 2) Estimate the latent variable W: Similar to the solution for G and H, replacing the likelihood in (19) with h(ξijk, G, H, W), combining with (10), and substituting into (17), the approximate posterior density of the kth row Wk∙ of W obeys a multivariate Gaussian distribution with the corresponding expectation and covariance matrix Σ(Wk∙) (see S4 Text of S1 File for details) (23) where diag(∙) represents the conversion of a vector to a diagonal matrix, and A(3) represents the mode-3 matricization of the corresponding tensor. 3) Estimate latent variables U and V: Substituting the priors of U and G into (17), the logarithmic posterior approximation of the rth column U∙r of U satisfies (see S5 Text of S1 File for details) (24) From (24), the posterior approximation of U∙r obeys a multivariate Gaussian distribution with the corresponding expectation and covariance matrix Σ(U∙r), that is, (25) Similarly, the posterior approximation of V∙r also obeys a multivariate Gaussian distribution, as follows: (26) 4) Estimate the latent variable λ: Substituting the priors of U, V, W and λ in (8), (9), (10) and (11) into (17), the approximate logarithmic posterior of λ satisfies (see S6 Text of S1 File for details) (27) From (27), the approximate posterior of λr follows a Gamma distribution (28) whose parameters determine the expectation of λr. In (28), the required quantities are the expectations of U∙rTU∙r, V∙rTV∙r, and W∙rTW∙r, respectively, as follows: (29) where tr(∙) represents the trace of a square matrix, and σ2(∙) represents the variance.
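The updates above repeatedly rely on the identity linking mode-n matricization, CP factors, and the Khatri-Rao product; the sketch below checks it numerically under NumPy's row-major flattening convention (the ordering of the Khatri-Rao arguments depends on that convention):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: move axis `mode` first and flatten the rest, so
    the mode-1 (mode=0) unfolding of an (I, J, K) tensor has shape I x (J*K)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of two R-column matrices."""
    R = A.shape[1]
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, R)

# Identity used in the updates: unfolding a CP tensor gives a factor matrix
# times a Khatri-Rao product of the other two factors.
rng = np.random.default_rng(2)
I, J, K, R = 4, 3, 2, 5
G, H, W = (rng.normal(size=(n, R)) for n in (I, J, K))
X = np.einsum("ir,jr,kr->ijk", G, H, W)
assert np.allclose(unfold(X, 0), G @ khatri_rao(H, W).T)
print("matricization identity holds")
```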
5) Estimate latent variables σg and σh: Substituting (5) and (7) into (17), the approximate logarithmic posterior of σg satisfies (30) Therefore, the posterior distribution of σg is a Gamma distribution with mean (31) Referring to Theorem 1 (see S7 Text of S1 File for details), this expression is transformed into (32) Similarly, the posterior approximation of σh follows a Gamma distribution, whose expectation is (33) 6) Estimate the local variational parameter ξijk: From (19), taking the derivative of ln(h(ξijk, G, H, W)) with respect to ξijk and setting it to zero yields the ξijk that satisfies (see S8 Text of S1 File for details) (34) where 〈∙〉 is the generalized inner product, i.e., the element-wise product followed by summation [42]. In (34), the required quantities are the corresponding second-moment expectations, as follows: (35) In summary, the optimization algorithm for solving KBLTDARD is presented in Algorithm 1.

Algorithm 1. The algorithm of KBLTDARD
Input: known miRNA-disease-type tensor, miRNA similarity Sm, disease similarity Sd.
Output: factor matrices G, W, H, and the predicted association tensor.
Initialization: initialize the factor matrices with multivariate Gaussian random numbers; initialize Σ(G), Σ(H), Σ(W), Σ(U) and Σ(V) with unit tensors of the corresponding sizes; set the local variational parameters ξijk = 1 for i = 1, ⋯, I, j = 1, ⋯, J, k = 1, ⋯, K.
repeat
 Update the posterior expectations of σg and σh via (31) and (33), respectively.
 Update the posterior expectations and variances of G, H and W via (20), (22) and (23), respectively.
 Update the posterior expectations and variances of U and V via (25) and (26), respectively.
 for each r (1 ≤ r ≤ R)
  Update the posterior q(λr) via (28) and (29).
 end for
 Update the local variational parameters ξ via (34) and (35).
until convergence
Return the factor matrices G, W, H, and the predicted association probabilities.
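The outer loop of Algorithm 1 follows the standard iterate-until-convergence pattern. A minimal, generic scaffold for that pattern is sketched below, with a toy fixed-point map standing in for one full sweep of the posterior updates:

```python
import numpy as np

def run_until_converged(step, x0, tol=1e-6, max_iter=500):
    """Iteration scaffold in the spirit of Algorithm 1's outer loop: apply one
    full round of updates (`step`) until the relative change of the monitored
    quantity drops below `tol`."""
    x = x0
    for it in range(max_iter):
        x_new = step(x)
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1e-12):
            return x_new, it + 1
        x = x_new
    return x, max_iter

# Toy stand-in update (Babylonian iteration); it converges to sqrt(2).
x, n_iter = run_until_converged(lambda x: 0.5 * (x + 2.0 / x), np.array([1.0]))
print(x, n_iter)
```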
Complexity analysis

In Algorithm 1, the time complexity is primarily attributed to updating the posterior expectations and variances of the latent variables and iterating the local variational parameters. We therefore analyze the update formulas of the latent variables one by one: the precision parameters σg and σh in (31) and (33), the factor matrices W in (23), G in (20) and H in (22), the latent variables U in (25) and V in (26), the rank determination parameter λ in (28), and the local variational parameter ξ in (34) each contribute their respective per-update costs. In summary, since the maximum number of iterations is fixed, the total computational cost of one update sweep of Algorithm 1 is O(IJKR² + I³R + J³R + IR³ + JR³ + KR³), where the latent space dimension satisfies R ≪ min{I, J}.
https://doi.org/10.1371/journal.pcbi.1012287.g001 Dataset The Human MiRNA Disease Database (HMDD) contains extensive data on experimentally validated human miRNA-disease associations [15,26]. Many computational models utilize HMDD to establish benchmark data sets for execution studies [22–25,27]. To facilitate comparison, we utilize two widely used multi-type miRNA-disease data sets (HMDD v2.0 and HMDD v3.2) established by Huang et al. [22] as benchmark data sets. Specifically, HMDD v2.0 classifies miRNA-disease associations into four types based on evidence from circulation, epigenetics, genetics, and target, containing 1,675 associations for 324 miRNAs and 169 diseases under the four types, with a density of 0.681% in this dataset. HMDD v3.2 contains 16,341 associations of 713 miRNAs and 447 diseases under five types: circulation, epigenetics, genetics, target, and tissue, with a density of 1.025% in the dataset. To obtain additional auxiliary information beyond the associated data, Huang et al. also downloaded disease descriptors from the Medical Subject Headings (MeSH) and calculated the semantic similarity of the diseases [28]. Furthermore, to avoid dependence on known miRNA-disease-type associations, according to previous studies [12, 14], we extracted miRNA-gene associations from miRTarBase Release 8.0 [7] and functional association probabilities of genes from HumanNet [29]. Then, the functional similarity of miRNAs can be obtained by combining the above two kinds of association information. Tensor construction Given a set of miRNAs , a set of diseases , and a set of association types . Then, all associations of miRNAs, diseases and types can be described by the third-order tensor . The (i, j, k) element in tensor is recorded as , which represents the relationship between miRNA mi and disease dj under association type tk. When mi and dj are associated under tk, ; otherwise, . 
The matrix Y∷k is the kth frontal slice of , representing the k-th type of miRNA-disease association. Previous studies on miRNA-disease associations [5,12,30,31] ignored the impact of association types and were equivalent to research on a certain association type. Although some associations have been discovered and validated, there still exists many associations that have not been verified, and inferring these potential associations may improve our understanding of the pathogenic mechanisms of different types of miRNAs. To this end, our goal is the prediction of potential miRNA-disease-type triples, which is the tensor completion problem. However, since there are few known associations, the tensor is very sparse and contains very limited useful information. Therefore, extracting the auxiliary information of the miRNA and the disease is an effective way to improve the performance of prediction. MiRNA functional similarity To prevent dependency on known miRNA-disease associations, we exploited known miRNA-gene associations and gene similarity to calculate miRNA functional similarity. Referring to previous studies [12, 32], let LLS(gi, gj) denote the association log-likelihood score between gene gi and gj obtained from HumanNet. Then, the similarity S(gi, gj) between gi and gj is (1) where e(gi, gj) is the edge between genes gi and gj. LLSmin and LLSmax denote the minimum and maximum log-likelihood score in HumanNet, respectively. Let Gi and Gj denote the gene sets associated with miRNA mi and mj respectively, then the functional similarity can be calculated as follows: (2) where S(g,G) = max{S(g, gi) |gi ∈ G}, |Gi| and |Gj| represent the number of genes contained in Gi and Gj, respectively. In summary, for each benchmark dataset, we obtain miRNA-disease-type tensor , disease semantic similarity and miRNA functional similarity . 
Network integration The known tensor also contains important information, which can be an important supplement to the similarity of miRNA (disease) [22,27]. Dong et al. [27] employed to construct a miRNA-disease association matrix, and then calculated the Gaussian similarity of miRNAs, which ignored the influence of association type. Therefore, to retain the information of miRNA, disease and association types simultaneously, we extract features directly from and fuse them to obtain a more accurate similarity network. Let matrix represent the mode-1 matrixization of , that is, project the disease and type dimensions of onto the columns of Y(1). Then, is the ith row of Y(1), which represents the interaction profile feature of the i-th miRNA. Similarly, let matrix represent the mode-2 matrixization of , which also represents the interaction profile feature of diseases. Since Y(1) and Y(2) are both high-dimensional and sparse association matrices, which inevitably contain noise. Therefore, we first employ non-negative double singular value decomposition (NNDSVD) [33] for dimensionality reduction to obtain the non-negative low-dimensional features Fm and Fd of miRNA and disease respectively. Then, we adopt Kernel Soft-neighborhood Similarity (KSNS) to build the similarity network. Referring to previous studies [14,34–39], KSNS hierarchically integrates neighborhood information and mines nonlinear relationships of samples, and has been well applied to the prediction of various types of biological interactions. Therefore, according to Fm and Fd, the interaction profile similarities and of miRNAs and diseases are obtained by KSNS, respectively. Finally, we obtained two types of miRNA similarity ( and ) and two types of disease similarity ( and ), which measured the similarity relationship of miRNA (or diseases) from different perspectives. 
Referring to previous studies [36,40], we utilized clusDCA to fuse the two miRNA similarities into the integrated miRNA similarity Sm, and to fuse the two disease similarities into the integrated disease similarity Sd. KBLTDARD In previous research, we established a tensor decomposition model (LTDSSL), which introduces logistic functions into tensor decomposition to improve nonlinear learning capability, showing strong performance in higher-order relation prediction problems [41]. However, LTDSSL requires manually specifying the rank of the tensor and the values of the hyperparameters, without considering the uncertainty of the latent factors. Motivated by this, this study combines tensor decomposition and Bayesian inference to establish a new tensor decomposition model. Let G, H and W represent the latent factor matrices of miRNAs, diseases and association types, respectively. Then, the association probability of the ith miRNA, jth disease and kth type is as follows: (3) where the reconstructed tensor has (i, j, k)th entry given by the sigmoid of the CP inner product of the ith row of G, the jth row of H and the kth row of W. The known miRNA-disease-type triplets are experimentally verified and have higher reliability. Therefore, the weighted logistic tensor decomposition model is obtained as follows: (4) where c ≥ 1 represents the importance-level parameter. Fig 2 presents the probabilistic graphical model of KBLTDARD with latent variables and corresponding priors. Fig 2. The probability graph framework of KBLTDARD. https://doi.org/10.1371/journal.pcbi.1012287.g002 In Fig 2, the occurrence probability of the tensor entries is calculated from G, W and H by (3). The probability distribution of the factor matrix G is obtained from U combined with the miRNA similarity Sm, and the probability distribution of the factor matrix H is obtained from V combined with the disease similarity Sd. σg, σh and λ are precision parameters. We specify the priors of all latent variables and parameters in this section.
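Reading (3) as a logistic CP model, the reconstruction can be sketched as follows; the factor sizes and the random toy factors are illustrative assumptions, not the fitted model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(G, H, W):
    # Entry (i, j, k) of the reconstructed tensor is the sigmoid of the CP
    # inner product sum_r G[i, r] * H[j, r] * W[k, r], i.e. Eq (3).
    return sigmoid(np.einsum("ir,jr,kr->ijk", G, H, W))

rng = np.random.default_rng(0)
I, J, K, R = 5, 4, 3, 2
G = rng.normal(size=(I, R))
H = rng.normal(size=(J, R))
W = rng.normal(size=(K, R))
P = reconstruct(G, H, W)  # association probabilities, all strictly in (0, 1)
```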
To effectively integrate the auxiliary information, the elements of the factor matrix G are assumed independent, and the (i, r)th element Gi,r satisfies a Gaussian distribution whose expectation is determined by U and the miRNA similarity Sm and whose precision is σg (5) Similarly, the elements of H are assumed independent, and the (j, r)th element Hj,r satisfies a Gaussian distribution whose expectation is determined by V and the disease similarity Sd and whose precision is σh (6) Here, the precision parameters σg and σh of the Gaussian distributions satisfy the Jeffreys prior (7) In general, the effective dimension R of the latent space is a tuning parameter whose selection is quite challenging and costly. To both infer the value of R and avoid overfitting, we introduce automatic rank determination into the priors of U, V and W [42]. Specifically, let each column of U be an independent random vector whose rth column U∙r satisfies the multivariate Gaussian distribution with mean vector 0 and precision matrix λrEI (8) where EI represents the identity matrix of size I × I, and λr controls the rth column of U. When λr has a large value, U∙r approaches 0, indicating that it makes little contribution to the reconstruction and can be removed from U. This process realizes the automatic determination of R. Similarly, the rth column of V satisfies the multivariate Gaussian prior with mean vector 0 and precision matrix λrEJ (9) where EJ represents the identity matrix of size J × J. Similar to U and V, the prior distribution of W is as follows (10) where EK represents the identity matrix of size K × K. To achieve joint column sparsity of U, V and W [43], we further assign the conjugate Gamma prior to λ (11) where λ = {λ1, λ2, ⋯, λR}, and {α, β} are the two parameters of the Gamma distribution; we adopt the uninformative prior, that is, α = 1, β = 1 [38]. For simplicity of notation, all unknown latent variables are collected and denoted together by Θ = {G, H, W, U, V, λ, σg, σh}.
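Written out, the ARD prior structure of (8)-(11) takes the standard Bayesian CP form [42]; the notation below is a hedged reconstruction consistent with the symbols defined in the text (EI, EJ, EK the identity matrices, Ga the Gamma density):

```latex
p(U \mid \lambda) = \prod_{r=1}^{R} \mathcal{N}\!\left(U_{\cdot r} \mid \mathbf{0},\, (\lambda_r E_I)^{-1}\right), \qquad
p(V \mid \lambda) = \prod_{r=1}^{R} \mathcal{N}\!\left(V_{\cdot r} \mid \mathbf{0},\, (\lambda_r E_J)^{-1}\right),
```
```latex
p(W \mid \lambda) = \prod_{r=1}^{R} \mathcal{N}\!\left(W_{\cdot r} \mid \mathbf{0},\, (\lambda_r E_K)^{-1}\right), \qquad
p(\lambda) = \prod_{r=1}^{R} \mathrm{Ga}(\lambda_r \mid \alpha, \beta).
```

Because the same λr governs the rth columns of U, V and W simultaneously, a large inferred λr shrinks all three columns toward zero at once, which is what realizes the joint column sparsity and the automatic determination of R.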
The probabilistic graphical model is shown in Fig 2, from which we can easily write the joint distribution of the model as (12) Combining the likelihood in (4), the priors of the model parameters G and H in (5) and (6), the prior distributions of U, V and W in (8), (9) and (10), and the hyperpriors in (7) and (11), the logarithmic joint distribution of KBLTDARD can be obtained (see S1 Text of S1 File for details) (13) where Λ = diag(λ1, λ2, ⋯, λR), ln(Λ) = diag(ln(λ1), ln(λ2), ⋯, ln(λR)), and diag(∙) denotes converting a vector into a diagonal matrix. In (13), tr(∙) represents the trace of a square matrix, const represents a constant independent of Θ, and ℓ(Θ) represents the logarithmic joint distribution. Without loss of generality, performing maximum a posteriori estimation of Θ by maximizing (13) is essentially equivalent to optimizing a squared-error function with regularization applied to the logistic tensor decomposition, with additional constraints on the regularization parameters. However, unlike point estimation, our goal is to compute the complete posterior distribution of all variables in Θ given the data tensor and the similarities (Sm and Sd), that is, (14) Model Inference of KBLTDARD Exact Bayesian inference in (14) requires integration over all latent variables, which is analytically intractable. Therefore, this study adopts variational inference to calculate an approximate posterior distribution q(Θ) of the latent variables [42–44]. The principle of variational inference is to define a parametric family of distributions over the latent variables and update the parameters to minimize the Kullback–Leibler (KL) divergence between the true posterior and q(Θ) (15) where ln P(Y) is a constant representing the model evidence, and its lower bound is the variational objective. With a mean-field approximation, q(Θ) is factorized according to the latent variables as (16) It is worth noting that (16) is the only assumption made about the posterior distribution of the latent variables.
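The decomposition behind (15)-(16) is the standard variational identity; writing the data tensor as \(\mathcal{Y}\) and using the symbols above, it reads:

```latex
\ln P(\mathcal{Y}) \;=\; \mathcal{L}(q) \;+\; \mathrm{KL}\!\left(q(\Theta)\,\big\|\,p(\Theta \mid \mathcal{Y})\right),
\qquad
\mathcal{L}(q) \;=\; \mathbb{E}_{q(\Theta)}\!\left[\ln p(\mathcal{Y}, \Theta) - \ln q(\Theta)\right],
\qquad
q(\Theta) \;=\; \prod_{k} q(\Theta_k).
```

Since ln P(𝒴) is fixed with respect to q, minimizing the KL term is equivalent to maximizing the lower bound L(q), which is what the coordinate-wise updates of the following subsection accomplish.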
When the other variables are fixed, the approximate logarithmic posterior distribution of the latent variable Θk can be obtained exactly as (17) where the expectation is taken under the variational distributions of all the other variables, and const denotes a constant that does not depend on the current variable. Θ\Θk represents the set of all latent variables except Θk. 1) Estimate latent variables G and H: Combined with (3) and (4), the likelihood function contains exponential forms of G and H, so it admits no conjugate priors. Therefore, with reference to [45], we adopt the local variational approximation σ(x) ≥ σ(ξ) exp((x − ξ)/2 − λ(ξ)(x2 − ξ2)) with λ(ξ) = (σ(ξ) − 1/2)/(2ξ), (18) where σ(x) = 1/(1 + exp(−x)) represents the sigmoid function. Combining (3), (4) and (18), the logarithmic likelihood satisfies (see S2 Text of S1 File for details) (19) where ξijk represents the local variational parameter. From (19), Ln(h(ξijk, G, H, W)) is a quadratic function of G and H and is a lower bound of the log-likelihood. Replacing the likelihood in (12) with h(ξ, G, H, W), combining with (5), and substituting into (17), it is found that the approximate posterior density of the ith row Gi∙ of G obeys a multivariate Gaussian distribution with expectation and covariance matrix Σ(Gi∙) (see S3 Text of S1 File for details) (20) where A(1) represents the mode-1 matricization of the tensor, ⊛ represents the Hadamard product, ⊙ represents the Khatri-Rao product, and ER represents the identity matrix of size R × R. The required moments are the expectations of Hj∙THj∙ and Wk∙TWk∙, respectively, as follows: (21) Similarly, the posterior density of the jth row Hj∙ of H obeys a multivariate Gaussian distribution with expectation and covariance matrix Σ(Hj∙) (22) where A(2) represents the mode-2 matricization of the tensor.
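The local variational approximation attributed to [45] is, to our reading, the classical Jaakkola-Jordan quadratic lower bound on the sigmoid; the sketch below states it and checks it numerically (the grid and the ξ value are arbitrary choices of ours).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (sigmoid(xi) - 1/2) / (2*xi) = tanh(xi/2) / (4*xi),
    # the curvature coefficient of the Jaakkola-Jordan bound.
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def lower_bound(x, xi):
    # sigma(x) >= sigma(xi) * exp((x - xi)/2 - lambda(xi) * (x**2 - xi**2)),
    # with equality at x = +/- xi; the log of this bound is quadratic in x,
    # which is what makes the Gaussian updates for G, H and W tractable.
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2))

xs = np.linspace(-6.0, 6.0, 241)
gap = sigmoid(xs) - lower_bound(xs, xi=1.0)   # nonnegative everywhere
```

Setting the derivative of the bound with respect to ξ to zero is what yields the fixed-point update for ξijk in (34).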
2) Estimate the latent variable W: Similar to the solution for G and H, replacing the likelihood in (19) with h(ξijk, G, H, W), combining with (10), and substituting into (17), it can be found that the approximate posterior density of the kth row Wk∙ of W obeys a multivariate Gaussian distribution with expectation and covariance matrix Σ(Wk∙) (see S4 Text of S1 File for details) (23) where diag(∙) represents the conversion of a vector into diagonal matrix form, and A(3) represents the mode-3 matricization of the tensor. 3) Estimate latent variables U and V: Substituting the priors of U and G into (17), the approximate logarithmic posterior of the rth column U∙r of U satisfies (see S5 Text of S1 File for details) (24) From (24), the posterior approximation of U∙r obeys a multivariate Gaussian distribution with expectation and covariance matrix Σ(U∙r), that is, (25) Likewise, the posterior approximation of V∙r also obeys a multivariate Gaussian distribution, as follows: (26) 4) Estimate the latent variable λ: Substituting the priors of U, V, W and λ in (8), (9), (10) and (11) into (17), the approximate logarithmic posterior of λ satisfies (see S6 Text of S1 File for details) (27) From (27), the approximate posterior of λr follows a Gamma distribution with updated shape and rate parameters, that is, (28) whose mean gives the posterior expectation of λr. In (28), the required quantities are the expectations of U∙rTU∙r, V∙rTV∙r, and W∙rTW∙r, respectively, as follows: (29) where tr(∙) represents the trace of a square matrix, and σ2(∙) represents the variance.
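As a hedged illustration of the conjugate Gamma update in (28): the exact coefficients are in S6 Text, which is not reproduced here, so the form below is the standard ARD update for Bayesian CP models [42], with function and argument names of our own choosing.

```python
def lambda_posterior_mean(e_utu_r, e_vtv_r, e_wtw_r, I, J, K, alpha=1.0, beta=1.0):
    # Standard ARD-style conjugate update (an assumption, not the paper's exact
    # formula): the Gamma shape grows with the number of entries governed by
    # lambda_r (one column each of U, V and W), and the rate accumulates the
    # expected squared column norms from Eq (29).
    shape = alpha + 0.5 * (I + J + K)
    rate = beta + 0.5 * (e_utu_r + e_vtv_r + e_wtw_r)
    return shape / rate   # posterior expectation <lambda_r> = shape / rate
```

A column whose expected squared norms stay small yields a large ⟨λr⟩, which in turn shrinks that column further on the next sweep, pruning unused rank components.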
5) Estimate latent variables σg and σh: Substituting (5) and (7) into (17), the approximate logarithmic posterior of σg satisfies (30) Therefore, the posterior distribution of σg is a Gamma distribution with mean (31) Referring to Theorem 1 (see S7 Text of S1 File for details), this expectation is transformed into (32) Similarly, the posterior approximation of σh follows a Gamma distribution, whose expectation is (33) 6) Estimate the local variational parameter ξijk: From (19), taking the derivative of Ln(h(ξijk, G, H, W)) with respect to ξijk and setting the derivative equal to 0, we obtain ξijk satisfying (see S8 Text of S1 File for details) (34) where 〈∙〉 is the generalized inner product, that is, the product of corresponding elements, then summed [42]. In (34), the required terms are posterior expectations given as follows: (35) In summary, the optimization algorithm for solving KBLTDARD is presented in Algorithm 1. Algorithm 1. The Algorithm of KBLTDARD Input: Known miRNA-disease-type tensor, miRNA similarity Sm, disease similarity Sd. Output: G, H, W, and the reconstructed tensor. Initialization:  Initialize the factor matrices with multivariate Gaussian random numbers. Initialize Σ(G), Σ(H), Σ(W), Σ(U) and Σ(V) with unit tensors of the corresponding sizes. Set the local variational parameters ξijk = 1, i = 1, ⋯, I, j = 1, ⋯, J, k = 1, ⋯, K. repeat  Update the posterior expectations of σg and σh via (31) and (33), respectively.  Update the posterior expectations and variances of G, H and W via (20), (22) and (23), respectively.  Update the posterior expectations and variances of U and V via (25) and (26), respectively.   for each r (1 ≤ r ≤ R)    Update the posterior q(λr) via (28) and (29).   end for  Update the local variational parameter ξ via (34) and (35). until convergence. Return the factor matrices G, H, W, and the predicted association probabilities.
Complexity analysis In Algorithm 1, the time complexity is primarily attributed to updating the posterior expectations and variances of the latent variables and iterating the local variational parameters. We therefore analyze the update formula of each latent variable in turn: the precision parameters σg and σh in (31) and (33), the factor matrix W in (23), the factor matrices G in (20) and H in (22), the latent variables U in (25) and V in (26), the rank-determination parameter λ in (28), and the local variational parameter ξ in (34). In summary, since the maximum number of iterations is fixed, the total computational cost of one update sweep in Algorithm 1 is O(IJKR2 + I3R + J3R + IR3 + JR3 + KR3), where the latent space dimension R ≪ min{I, J}. Results Experimental settings To extensively assess the predictive performance for multi-type miRNA-disease associations, referring to previous studies [22–25,27], we adopt two types of 5-fold cross-validation. CVtype: evaluation of the accuracy of model predictions for types. We randomly divided the miRNA-disease pairs with at least one type association into 5 disjoint equal parts. In each experiment, one subset was alternately selected as the test set and the rest as the training set. CVtriplet: evaluation of the accuracy of model predictions for triples. We randomly divided all miRNA-disease-type triples into 5 disjoint equal parts. In each experiment, one subset was alternately selected as the test set and the rest as the training set. Regarding CVtype, we are interested in the type with the maximum score in the test set.
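The CVtriplet scheme described above can be sketched as follows; the toy number of triples and the seed are illustrative.

```python
import numpy as np

def five_fold_triplet_splits(n_triples, seed=0):
    # Randomly partition the indices of the known (miRNA, disease, type)
    # triples into 5 disjoint, nearly equal folds; each fold serves as the
    # test set once while the remaining four folds form the training set.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_triples), 5)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(five_fold_triplet_splits(23))   # 5 (train, test) index pairs
```

CVtype is organized the same way, except that the units being partitioned are miRNA-disease pairs having at least one known type rather than individual triples.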
Therefore, referring to previous studies [22–25,27,41], we sorted the types of each miRNA-disease pair according to prediction scores, selected the type with the highest score, and applied the average Top-1 precision, average Top-1 recall, and average Top-1 F1 as evaluation indicators. For CVtriplet, we chose commonly used overall evaluation indicators, namely the area under the precision-recall curve (AUPR), the area under the ROC curve (AUC), and the F1 value [22,24,27]. In this study, the Matlab tensor toolbox "tensor_toolbox" is used to perform tensor calculations [46]. Comparison experiments Several computational models have been developed for the prediction of multiple types of miRNA-disease associations. To comprehensively evaluate the performance of KBLTDARD, we selected six state-of-the-art tensor decomposition models as benchmarks. TFAI [22,47]: TFAI introduces graph Laplacian regularization into the CP decomposition to preserve the local structure of the data. FBCPARD [42]: FBCPARD is a standard Bayesian tensor decomposition model, which introduces the Bayesian framework into CP decomposition and utilizes automatic rank determination to achieve adaptive inference of the CP rank. TDRC [22]: TDRC establishes a new form of relation constraint and integrates auxiliary information of miRNAs and diseases into the CP decomposition. WeightTDAIGN [27]: WeightTDAIGN introduces a positive-sample weighting strategy into the CP decomposition to improve prediction performance, utilizes an L2,1-norm constraint on the projection matrix to reduce the impact of redundant information, and employs graph regularization to preserve local structural information. TFLP [25]: TFLP combines tensor robust principal component analysis and label propagation, introduces multiple similarity information of miRNAs (diseases), and achieves prediction through iteration of label information.
SPLDHyperAWNTF [24]: SPLDHyperAWNTF integrates hypergraph learning and tensor weighting with non-negative tensor decomposition to achieve miRNA-disease-type triple prediction. Except for FBCPARD, the above five methods have been applied to multi-type miRNA-disease association prediction. Therefore, we adopt the optimal parameters recommended by the above methods to perform the experiments. In the original literature, WeightTDAIGN, TFLP, SPLDHyperAWNTF, TDRC, and TFAI all construct miRNA functional similarity from known miRNA-disease associations, which results in bias towards known miRNA-disease associations. Therefore, differently from the original methods, this paper adopts the functional similarity of miRNAs described in Section 2.2.2 to run the above models. In addition, when performing CVtriplet, the random selection of negative samples may affect the evaluation metrics of the models. Therefore, following the suggestion of Huang et al. [22], for each experiment we perform negative-sample selection under 20 different seeds and report the mean as the final evaluation index. According to Table 1, under CVtype, KBLTDARD outperforms the other methods on both HMDD v2.0 and HMDD v3.2. Specifically, on HMDD v2.0, KBLTDARD's Top-1 precision, Top-1 recall and Top-1 F1 reached 0.6361, 0.5665 and 0.5869 respectively, which are 6.87%, 6.79% and 6.57% higher than those of the second-ranked SPLDHyperAWNTF. On HMDD v3.2, the Top-1 precision, Top-1 recall and Top-1 F1 of KBLTDARD are 0.6236, 0.4783 and 0.5197, respectively, which are substantially better than those of the other methods. Table 1. The predictive performance of all models under CVtype. https://doi.org/10.1371/journal.pcbi.1012287.t001 According to Table 2, under CVtriplet, KBLTDARD also achieves better prediction performance on HMDD v2.0 and HMDD v3.2.
Specifically, on HMDD v2.0, the AUPR of KBLTDARD is 0.8966, which is 5.00%, 16.29%, 4.23%, 9.74%, 9.08% and 1.69% higher than that of TFAI (0.8539), FBCPARD (0.7710), TDRC (0.8602), WeightTDAIGN (0.8170), TFLP (0.8220) and SPLDHyperAWNTF (0.8817), respectively. The AUC and F1 on HMDD v2.0 are 0.8893 and 0.8218 respectively, which are substantially better than those of the other methods. Furthermore, on HMDD v3.2, the AUPR, AUC and F1 of KBLTDARD are 0.9452, 0.9445 and 0.8775 respectively, which are better than those of the other methods. The comparison of KBLTDARD with the other models under 20 random seeds is detailed in S9 Text of S1 File. Table 2. The predictive performance of all models under CVtriplet. https://doi.org/10.1371/journal.pcbi.1012287.t002 In summary, KBLTDARD achieved the best prediction performance, followed by SPLDHyperAWNTF. FBCPARD's prediction ability was limited to some extent, as it relied solely on miRNA-disease-type associations for prediction. Furthermore, the models other than KBLTDARD contain many hyperparameters that often demand cumbersome tuning before prediction, greatly affecting their computational efficiency and generalization capability. In contrast, KBLTDARD, with the help of the Bayesian framework, treats the low-dimensional features and model hyperparameters as latent variables and solves the model by inferring the posterior distributions of the latent variables, avoiding complex parameter tuning and enhancing generalization ability. Ablation studies Compared with traditional Bayesian tensor decomposition methods [42,44], the improvements of KBLTDARD are manifested in three aspects: the introduction of the logistic function, the addition of auxiliary information, and the setting of importance levels.
For a better understanding of these contributions, we created comparison models by removing the logistic function, the auxiliary information, and the importance level from KBLTDARD, respectively. Specifically, KBTD denotes the model obtained by removing the logistic function from KBLTDARD, BLTD denotes the model obtained by removing the auxiliary information from KBLTDARD, and KBLTD-NOC denotes the model obtained by eliminating the importance level from KBLTDARD. Fig 3 shows the prediction performance of the ablation experiments evaluated by 5-fold cross-validation on HMDD v2.0 and HMDD v3.2. Comparing KBLTD-NOC and KBLTDARD, after setting the importance level, the predictive ability of KBLTDARD is substantially superior to that of KBLTD-NOC, especially on the sparser dataset (HMDD v2.0). This result indicates that increasing the importance of known associations can effectively mitigate the influence of false-negative samples on model performance. Comparing BLTD and KBLTDARD, after combining the auxiliary information, KBLTDARD achieves higher prediction performance than BLTD. This result shows that the introduction of auxiliary information corrects the iteration direction of the logistic tensor decomposition and improves the model's predictive ability for isolated samples. Comparing KBTD and KBLTDARD, after the introduction of the logistic function, the prediction performance of KBLTDARD is significantly improved, demonstrating that the logistic function substantially improves the model's nonlinear learning capability. To sum up, the addition of the logistic function, auxiliary information, and importance levels each improves the predictive ability of the model to a certain extent. Fig 3. The prediction performance of the ablation experiments.
https://doi.org/10.1371/journal.pcbi.1012287.g003 Case study To further evaluate the practical prediction performance of KBLTDARD, we conduct two types of case studies. The first strategy evaluates KBLTDARD from a global perspective, that is, it tests the model's predictive ability for all diseases. Therefore, we employ KBLTDARD to predict the 447 diseases in HMDD v3.2 one by one. The second strategy predicts four common diseases ('Gastric Neoplasms', 'Myocardial Infarction', 'Prostate Neoplasms' and 'Pancreatic Neoplasms') and checks the latest literature to verify the prediction results. In Case Study 1, for each disease from HMDD v3.2, all its associations with all miRNAs and types are removed, and the prediction scores for that disease are evaluated. We adopt the AUC to evaluate the overall predictive performance with respect to each disease and compute statistics over all AUC values. In addition, for each disease, researchers are more likely to focus on the fraction of known associations included in the top-ranked associations [38,48], that is, the hit rate: HR(ρ) = |Scand([ρ ∙ N]) ∩ STest| / |STest|, (36) where N is the total number of triples in the test set, ρ represents the scaling factor, which in this study is selected from {1%, 5%, 10%}, and [∙] indicates rounding to an integer. Scand([ρ ∙ N]) denotes the set of top [ρ ∙ N] highest-scoring triples, and STest denotes the set of real triples in the test set. As presented in Fig 4, the average AUC of KBLTDARD is 0.8165. Among the 447 diseases predicted by KBLTDARD, the AUC of 286 diseases exceeded 0.8, amounting to 63.98%. Only 17 diseases had an AUC of less than 0.6, amounting to less than 4%. The average hit rates of the top 1%, top 5%, and top 10% are 0.1490, 0.3865, and 0.5320 respectively, about 14, 7, and 5 times the random hit rates (0.01, 0.05, and 0.10). These results indicate that the top-scoring predictions contain the vast majority of known associations.
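The hit rate in (36) can be sketched as follows, under our reading that the candidate triples for a disease are scored and the held-out real triples are marked in a boolean mask; the variable names and toy data are illustrative.

```python
import numpy as np

def hit_rate(scores, is_real, rho):
    # scores: predicted score of each of the N candidate triples;
    # is_real: boolean mask of the real (held-out) triples, S_Test.
    # Returns |top-[rho*N] candidates ∩ S_Test| / |S_Test|.
    n = len(scores)
    top = int(rho * n)                        # [rho * N], truncated to an integer
    top_idx = np.argsort(scores)[::-1][:top]  # highest-scoring candidates, S_cand
    return is_real[top_idx].sum() / is_real.sum()

scores = np.arange(10) / 10.0                 # candidate scores 0.0 ... 0.9
is_real = np.zeros(10, dtype=bool)
is_real[[9, 8, 0]] = True                     # three real triples
hr = hit_rate(scores, is_real, rho=0.2)       # top 2 = indices 9, 8 -> 2/3
```

Under this definition a random ranking hits about ρ of the real triples in expectation, which is the baseline (0.01, 0.05, 0.10) quoted in the text.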
Fig 4. The prediction results of KBLTDARD for all diseases. https://doi.org/10.1371/journal.pcbi.1012287.g004 We then focus on four common diseases, Gastric Neoplasms, Myocardial Infarction, Prostate Neoplasms and Pancreatic Neoplasms, for further analysis. Table 3 shows the top 20 predictions and related evidence for Gastric Neoplasms, and S1, S2, and S3 Tables show the top 20 predictions and related evidence for Myocardial Infarction, Prostate Neoplasms, and Pancreatic Neoplasms, respectively. Table 3. Gastric Neoplasms-related miRNAs and types predicted by KBLTDARD. https://doi.org/10.1371/journal.pcbi.1012287.t003 Gastric cancer is the fifth most prevalent cancer in the world and the third leading cause of cancer-related deaths worldwide [49]. There are more than 1 million new cases of gastric cancer worldwide every year, and the number of gastric cancer-related deaths exceeds 780,000 [50]. Tchernitsa et al. [51] studied the differential expression of miRNAs in adjacent normal and tumor samples from gastric cancer patients. They found that miR-146a differed significantly between lymph node-positive and node-negative gastric cancer, and that its changes may affect local tumor growth and lymph node spread. Zhang et al. [52] found that hsa-mir-21 is up-regulated in gastric cancer tissues and is significantly related to the degree of differentiation, local invasion and lymph node metastasis of tumor tissues. As shown in Table 3, among the top 20 miRNA-type associations predicted by KBLTDARD, 17 have been verified by relevant literature. Furthermore, in S1, S2, and S3 Tables, among the top 20 associations predicted by KBLTDARD for Myocardial Infarction, Prostate Neoplasms, and Pancreatic Neoplasms, 16, 16, and 17 were confirmed, respectively.
Experimental settings To extensively assess the predictive performance of multi-type miRNA-disease associations, referring to previous studies [22–25,27], we adopt two types of 5-fold cross validation. CVtype: Evaluation of the accuracy of model predictions for types. We randomly divided the miRNA-disease pairs with at least one type association into 5 disjoint equal parts. In each experiment, one subset was alternately selected as the test set and the rest as the training set. CVtriplet: Evaluation of the accuracy of model predictions for triples. We randomly divided all miRNA-disease--type triples into 5 disjoint equal parts. In each experiment, one subset was alternately selected as the test set and the rest as the training set. Regarding CVtype, we are interested in the type with the maximum score in the test set. Therefore, referring to previous studies [22–25,27,41], we sorted the types of miRNA-disease pairs according to prediction scores, selected the type with the highest score, and applied the average Top-1 precision, average Top-1 recall, and average Top-1 F1 as evaluation indicators. For CVtriplet, we choose commonly used overall evaluation indicators, namely the area under the precision-recall curve (AUPR), the area under the ROC curve (AUC), and the F1 value [22,24,27]. In this study, the Matlab tensor toolbox "tensor_toolbox" is used to perform tensor calculations [46]. Comparison experiments Several computational models have been developed and applied for the prediction of multiple types of miRNA-disease associations. To comprehensively evaluate the performance of KBLTDARD, we select six state-of-the-art tensor decomposition models as benchmarks. TFAI [22,47]: TFAI introduces graph Laplacian regularization based on the CP decomposition to keep the information about the local structure of the data. 
FBCPARD [42]: FBCPARD is a standard Bayesian tensor decomposition model, which introduces the Bayesian framework into CP decomposition and utilizes automatic rank determination to achieve adaptive inference of CP rank. TDRC [22]: TDRC established a new way of relation constraint and integrated auxiliary information of miRNAs and diseases into CP decomposition. WeightTDAIGN [27]: WeightTDAIGN introduces a positive sample weighting strategy based on CP decomposition to improve prediction performance, utilizes the L2,1 norm constraint projection matrix to reduce the impact of redundant information, and employs graph regularization to preserve local structural information. TFLP [25]: TFLP combines tensor robust principal component analysis and label propagation, introduces multiple similarity information of miRNAs (diseases), and achieves prediction through iteration of label information. SPLDHyperAWNTF [24]: SPLDHyperAWNTF integrates hypergraph learning and tensor weighting with non-negative tensor decomposition to achieve miRNA disease-type triple prediction. Except for FBCPARD, the above five methods are applied to multiple types of miRNA-disease association prediction. Therefore, we adopt the optimal parameters recommended by the above methods to perform experiments. In the original literature, WeightTDAIGN, TFLP, SPLDHyperAWNTFTDRC, and TFAI all construct miRNA functional similarity through miRNA-disease associations, which results in bias towards known miRNA-disease associations. Therefore, different from the original method, this paper adopts the functional similarity of miRNAs described in Section 2.2.2 to perform the above model. In addition, when performing CVtriplet, the random selection of negative samples may have an impact on the evaluation metrics of the model. Therefore, referring to the suggestion of Huang et al. 
[22], for each experiment, we perform negative sample selection under 20 different seeds and calculate the mean as the final evaluation index. According to Table 1, under the CVtype, KBLTDARD outperforms other methods on both HMDD v2.0 and HMDD v3.2. Specifically, on HMDD v2.0, KBLTDARD’s Top-1 precision, Top-1 recall and Top-1 F1 reached 0.6361, 0.5665 and 0.5869 respectively, which are 6.87%, 6.79% and 6.57% higher than the second-ranked SPLDHyperAWNTF. On HMDD v3.2, the Top-1 precision, Top-1 recall and Top-1 F1 of KBLTDARD are 0.6236, 0.4783 and 0.5197, respectively, which are substantially better than other methods. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 1. The predictive performance of all models under CVtype. https://doi.org/10.1371/journal.pcbi.1012287.t001 According to Table 2, under the CVtriplet, KBLTDARD also performs better prediction performance on HMDD v2.0 and HMDD v3.2. Specifically, on HMDD v2.0, the AUPR of KBLTDARD is 0.8966, which is 5.00%, 16.29%, 4.23%, 9.74%, 9.08% and 1.69% higher than that of TFAI (0.8539), FBCPARD (0.7710), TDRC (0.8602), WeightTDAIGN (0.8170), TFLP (0.8220) and SPLDHyperAWNTF (0.8817), respectively. The AUC and F1 of HMDD v2.0 are 0.8893 and 0.8218 respectively, which are substantially better than other methods. Furthermore, on HMDD v3.2, the AUPR, AUC and F1 of KBLTDARD are 0.9452, 0.9445 and 0.8775 respectively, which is better than other methods. The comparison of KBLTDARD with other models under 20 random seeds is detailed in S9 Text of S1 File. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. The predictive performance of all models under CVtriplet. https://doi.org/10.1371/journal.pcbi.1012287.t002 In summary, KBLTDARD achieved the most optimal prediction performance, followed by SPLDHyperAWNTF. FBCPARD’s prediction ability was limited to some extent as it solely relied on miRNA-disease-type associations for prediction. 
Furthermore, models other than KBLTDARD contain many hyperparameters that often demand cumbersome debugging before conducting predictions, greatly affecting their computational efficiency and generalization capabilities. In contrast, KBLTDARD, with the help of Bayesian framework, takes low-dimensional features and model hyperparameters as latent variables, and realizes model solution by inferring the posterior distribution of latent variables, avoiding complex parameter debugging and enhancing generalization ability. Ablation studies Compared with the traditional Bayesian tensor decomposition methods [42,44], the improvement of KBLTDARD is manifested in three aspects: the introduction of logical functions, the addition of auxiliary information, and the set of importance levels. For a better understanding of these contributions, we created comparison models by removing the logistic function, auxiliary information, and importance level from KBLTDARD, respectively. Specifically, KBTD represents the model acquired by removing the logistic function from KBLTDARD, BLTD represents the model acquired by removing the auxiliary information from KBLTDARD, and KBLTD-NOC represents the model acquired by eliminating the importance level from KBLTDARD. Fig 3 shows the prediction performance of ablation experiments evaluated by 5-fold cross-validation under HMDD v2.0 and HMDD v3.2. For KBLTD-NOC and KBLTDARD, after setting the importance level, the predictive ability of KBLTDARD is substantially superior to that of KBLTD-NOC, especially on the sparse data set (HMDD v2.0). This result indicates that increasing the importance of known associations can effectively mitigate the influence of false-negative samples on model performance. For BLTD and KBLTDARD, after combining auxiliary information, KBLTDARD achieves higher prediction performance compared with BLTD. 
This result shows that the introduction of auxiliary information corrects the iteration direction of logical tensor decomposition and improves the model’s prediction ability for isolated samples. For KBTD and KBLTDARD, after the introduction of logistic functions, the prediction performance of KBLTDARD has been significantly improved. The result of this experiment demonstrates that the introduction of logical functions significantly improves nonlinear learning capabilities. To sum up, the addition of logical functions, auxiliary information, and importance levels can improve the predictive ability of the model to a certain extent. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 3. The prediction performance of ablation experiment. https://doi.org/10.1371/journal.pcbi.1012287.g003 Case study To further evaluate the actual prediction performance of KBLTDARD, we conduct two types of case studies. The first strategy evaluates KBLTDARD from a global perspective, that is, testing the model’s predictive ability for all diseases. Therefore, we employ KBLTDARD to predict 447 diseases in HMDD v3.2 one by one. The second strategy predicts four common diseases (‘Gastric Neoplasms’, ‘Myocardial Infarction’, ‘Prostate Neoplasms’ and ‘Pancreatic Neoplasms’) and checks the latest literature to test the prediction results. In Case Study 1, for each disease from HMDD v3.2, all its associations with all miRNAs and types are removed, and the prediction score for each disease is evaluated. We adopt AUC to evaluate the overall predictive performance with respect to each disease and perform statistics on all AUC values. 
In addition, for each disease, researchers are more likely to focus on the fraction of known associations included among the top-ranked associations [38,48], i.e., the hit rate, defined as

HR(\rho) = \frac{\left| S_{cand}([\rho \cdot N]) \cap S_{Test} \right|}{\left| S_{Test} \right|}, \tag{36}

where N is the total number of triples in the test set, \rho is the scaling factor, selected in this study from {1%, 5%, 10%}, and [\cdot] denotes the rounding-to-integer operation. S_{cand}([\rho \cdot N]) denotes the set of the [\rho \cdot N] highest-scoring triples, and S_{Test} denotes the set of real triples in the test set. As presented in Fig 4, the average AUC of KBLTDARD is 0.8165. Among the 447 diseases predicted by KBLTDARD, 286 diseases (63.98%) achieved an AUC above 0.8, and only 17 diseases (less than 4%) had an AUC below 0.6. The average hit rates in the top 1%, top 5%, and top 10% are 0.1490, 0.3865, and 0.5320, respectively, which are about 14, 7, and 5 times the corresponding random hit rates (0.01, 0.05, and 0.10). These results indicate that the top-scoring predicted associations contain the vast majority of known associations.

Fig 4. The prediction results of KBLTDARD for all diseases. https://doi.org/10.1371/journal.pcbi.1012287.g004

We then focus on four common diseases for further analysis: Gastric Neoplasms, Myocardial Infarction, Prostate Neoplasms and Pancreatic Neoplasms. Table 3 shows the top 20 predictions and related evidence for Gastric Neoplasms, and S1, S2, and S3 Tables show the top 20 predictions and related evidence for Myocardial Infarction, Prostate Neoplasms, and Pancreatic Neoplasms, respectively.

Table 3. Gastric Neoplasms-related miRNAs and types predicted by KBLTDARD. https://doi.org/10.1371/journal.pcbi.1012287.t003

Gastric cancer is the fifth most prevalent cancer in the world and the third leading cause of cancer-related deaths worldwide [49].
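The hit rate of Eq (36) can be computed directly from ranked prediction scores. A minimal sketch, with hypothetical candidate-triple IDs and a score dictionary that are illustrative only:

```python
import numpy as np

def hit_rate(scores, test_ids, rho):
    """Eq (36): fraction of test-set triples found among the
    top [rho * N] scoring candidates, where N = len(test_ids)."""
    k = int(rho * len(test_ids))                      # [rho * N]
    ranked = sorted(scores, key=scores.get, reverse=True)
    s_cand = set(ranked[:k])                          # top-k candidate set
    return len(s_cand & set(test_ids)) / len(test_ids)

# Toy example: 100 hypothetical candidate triples, the first 20 in the test set
rng = np.random.default_rng(1)
scores = {i: float(rng.random()) for i in range(100)}
test_ids = list(range(20))
print(hit_rate(scores, test_ids, 0.10))
```

With a perfect ranking (every test triple scored above every other candidate) the hit rate at rho = 100% is 1.0, matching the intuition that high top-percentile hit rates mean the known associations cluster at the head of the ranking.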
There are more than 1 million new cases of gastric cancer worldwide every year, and gastric cancer-related deaths exceed 780,000 [50]. Tchernitsa et al. [51] studied the differential expression of miRNAs in adjacent normal and tumor samples from gastric cancer patients, and found that miR-146a expression differed significantly between lymph node-positive and node-negative gastric cancer, with these changes possibly affecting local tumor growth and lymph node spread. Zhang et al. [52] found that hsa-mir-21 is up-regulated in gastric cancer tissues and is significantly related to the degree of differentiation, local invasion, and lymph node metastasis of tumor tissues. As shown in Table 3, among the top 20 miRNA-type associations predicted by KBLTDARD, 17 have been verified by the relevant literature. Furthermore, as shown in S1, S2, and S3 Tables, among the top 20 associations predicted by KBLTDARD for Myocardial Infarction, Prostate Neoplasms, and Pancreatic Neoplasms, 16, 16, and 17 were confirmed, respectively.

Discussion and conclusion

In this paper, we proposed a new model, KBLTDARD, to predict higher-order relationships between miRNAs, diseases, and types. The model utilizes miRNA-gene associations and gene similarity to calculate the functional similarity of miRNAs, avoiding the re-use of known miRNA-disease associations. We then combine logistic tensor decomposition and Bayesian inference, introduce auxiliary information, and build a probabilistic graphical model to describe the dependence between latent variables. In addition, we developed an efficient deterministic Bayesian inference algorithm for KBLTDARD to ensure the efficiency of the model solution. Under the 5-CV framework, the top-1 precision of KBLTDARD for new-type prediction reached 0.6320 and 0.6246, and the AUC values for new-triplet prediction reached 0.8834 and 0.9445, on the HMDD v2.0 and HMDD v3.2 datasets, respectively.
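The logistic link at the heart of the decomposition summarized above maps a CP-style multilinear score for each (miRNA, disease, type) triple to an association probability. A minimal NumPy sketch, where the factor shapes and rank R are illustrative assumptions rather than the paper's inferred values:

```python
import numpy as np

def logistic_cp_reconstruct(U, V, W):
    """Map a rank-R CP score tensor through the logistic function.

    U: (n_miRNAs, R), V: (n_diseases, R), W: (n_types, R) latent factors.
    Returns an (n_miRNAs, n_diseases, n_types) tensor of association
    probabilities sigma(sum_r U[i, r] * V[j, r] * W[k, r]).
    """
    scores = np.einsum('ir,jr,kr->ijk', U, V, W)   # multilinear CP scores
    return 1.0 / (1.0 + np.exp(-scores))           # logistic link

# Toy factors: 4 miRNAs, 3 diseases, 2 association types, rank R = 2
rng = np.random.default_rng(2)
U, V, W = (rng.standard_normal((n, 2)) for n in (4, 3, 2))
P = logistic_cp_reconstruct(U, V, W)
print(P.shape)
```

In the full model the factors are latent variables with priors rather than fixed arrays, and their posteriors are obtained by the deterministic Bayesian inference algorithm; this sketch only illustrates how the logistic function turns unbounded multilinear scores into probabilities.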
The results show that the performance of KBLTDARD is significantly improved compared with previous methods. Case studies of Gastric Neoplasms, Myocardial Infarction, Prostate Neoplasms, and Pancreatic Neoplasms further demonstrated the predictive power of KBLTDARD. Taken together, these results suggest that KBLTDARD can effectively identify multiple types of miRNA-disease associations. Several factors contribute to the reliable performance of KBLTDARD. First, we extract important information from miRNA- and disease-related databases and mine key features from the trained tensors to ensure the richness of information about miRNAs and diseases. Second, we combine logistic tensor decomposition and Bayesian inference to realize an automatic search of hyperparameters, which improves the nonlinear learning ability and generalization ability of the model. However, some limitations may affect the performance of KBLTDARD. First, to facilitate model inference, we selected Gaussian and Gamma distributions with conjugacy properties as prior distributions of the latent variables, which may not be optimal for model representation. In future research, we will further explore Bayesian theory and try more advanced prior distributions. Second, although our model shows stronger learning ability than some deep learning methods, in future studies we will try to combine advanced deep learning approaches (such as hypergraph neural networks) to improve the nonlinear representation ability of the model.

Supporting information

S1 File. Supporting information. https://doi.org/10.1371/journal.pcbi.1012287.s001 (DOCX)
S1 Fig. The boxplot of prediction results of KBLTDARD under CVtriplet. https://doi.org/10.1371/journal.pcbi.1012287.s002 (TIF)
S1 Table. Myocardial Infarction-related miRNAs and association types predicted by KBLTDARD. https://doi.org/10.1371/journal.pcbi.1012287.s003 (DOCX)
S2 Table.
Prostate Neoplasms-related miRNAs and association types predicted by KBLTDARD. https://doi.org/10.1371/journal.pcbi.1012287.s004 (DOCX)
S3 Table. Pancreatic Neoplasms-related miRNAs and association types predicted by KBLTDARD. https://doi.org/10.1371/journal.pcbi.1012287.s005 (DOCX)
S4 Table. Statistics of predictions for all models under CVtriplet. https://doi.org/10.1371/journal.pcbi.1012287.s006 (DOCX)
S5 Table. P-values of the Wilcoxon rank-sum test pairing KBLTDARD with the other models. https://doi.org/10.1371/journal.pcbi.1012287.s007 (DOCX)

Ma Y, Ma Y. Kernel Bayesian logistic tensor decomposition with automatic rank determination for predicting multiple types of miRNA-disease associations. PLoS Computational Biology. 2024;20(7):e1012287. https://doi.org/10.1371/journal.pcbi.1012287