TY - JOUR AB - Introduction Most complex traits such as obesity involve a diverse set of genes, intricate interplay among them and subtle interaction between genetic and environment factors. One of the first steps toward a systematic understanding of the genetic basis of a complex trait is the identification of causal genetic elements, e.g. genes, genetic markers and/or single nucleotide polymorphisms (SNPs), whose variations are responsible for the traits. The objective of this challenging task is two-fold: effectively identifying a subset of genetic elements out of a large pool of candidates whose patterns are characteristic of a trait of interest, and accurately predicting the phenotype with a model that accommodate interactions among selected genetic elements. Despite recent advances in high-throughput technologies that have produced an enormous amount of biological data, heterogeneous data types, non-linear relationships among genes and complex phenotypes have made this task difficult. Although conventional linkage analyses and association studies as well as the latest genome-wide association studies (GWAS) have produced a fruitful collection of genomic susceptibility loci for a variety of complex traits and diseases [1], [2], they have mainly been able to detect genetic elements of marginal effects while failed to respect epistatic interactions [3], [4]; as a result, they have a low power for predicting phenotypes [5]. As an intermediate between genotype and phenotype, gene expression has been proven to be a rich and valuable source of information complementary to genotype information for dissecting complex traits. On one extreme using gene expression data alone, classifiers or regressors have been built to predict disease types or stages with only a small number of disease-related genes [6]–[8]. By integrating information of genetics and gene expression, genetics of gene expression-based approaches [9]–[11] and network-based approaches [12]–[14] have been independently developed and applied to identify genes related to complex traits. Recently a few machine learning based methods have been proposed to integrate both genotype and gene expression data to not only identify relevant genes, but also predict phenotypes based on selected genes. Ruderfer et al. [15] adopted a SVM classifier to predict drug responses (i.e., sensitivity or resistance) in yeast. They showed that using both data of transcripts and genetic markers can improve prediction accuracy compared with using either transcripts or genetic markers alone. Based on the elastic net regularized regression [16], Chen et al. [17] developed Camelot to predict quantitative response (i.e. growth yield) of yeast to 94 drugs using genotype and gene expression data collected from drug-free conditions from yeast segregants. Compared with the work by Ruderfer et al. [15], Camelot was able to make accurate quantitative prediction on various drug treatments as opposed to dichotomic classes of drug response. Camelot also emphasized greatly on causal inference by incorporating a priori knowledge and adopting post statistic tests to select only handful genetic makers and expression transcripts as phenotype predictors. For example, for predicting the hydrogen peroxide response, only a single gene, DHH1, passed their pre-filtering criteria and was then used to construct the final prediction model. Although appropriate for downstream experimental validation as Camelot always make the most conservative choices, it remains unknown whether its filtering steps could indeed help improve prediction accuracy and whether it would otherwise prevent further novel discovery besides the factors known to have large marginal effects. Random Forests (RF) [18], an ensemble of classification or regression trees, has recently been successfully applied in various biological studies [19]–[23]. RF has many desirable characteristics that make it well suited for integrating both genotypic and gene expression information. It is well adapted for variable selection for high-dimensional data with competing prediction accuracy compared to the state-of-the-art machine learning techniques. RF is able to accommodate categorical (e.g. genotype) and continuous (e.g. gene expression) data. It can be used when the number of variables substantially exceeds the number of observations (e.g. thousands of SNP markers and probes of gene expression versus a few hundred samples of subject) [19]–[23]. Moreover, RF supports possible interactions among variables [4], which is critical for systems-biology studies where interplays between genetic (e.g. epistatically interacting SNPs) and gene expression (e.g. coactivator/corepressor) must be taken into consideration. While promising, however, conventional RF algorithms have several drawbacks that limit their success on large biological problems. Firstly, even though RF allows possible interactions among variables, it does not incorporate possible correlation among variables; even worse, with correlated variables, it suffers from biases introduced in measuring variable importance (VI) [24], [25], which can result in incorrect or misleading variable rankings. Secondly, RF's prediction accuracy may decline significantly when the proportion of truly informative variables among all variables is small [26]. In this paper, we develop a new method, called module-guided Random Forests (mgRF), to integrate genotypic and gene expression information to understand and possibly dissect complex relationships among different genetic elements underlying complex traits. mgRF combines the method of conventional RF and a network-based analysis to remedy the two aforementioned drawbacks of conventional RF by exploiting structural relationships, extracted from the network analysis, among different types of variables. As a test and application, we applied mgRF to the data of genetic markers and gene expression from a cohort of F2 female mouse intercross to examine its performance and demonstrate its ability to identify genetic elements that contribute to mouse weight, many of which were missed by the conventional RF algorithm. mgRF outperformed the state-of-the-art methods that combine information from multiple biological sources with more accurate predictions. Furthermore, using mgRF we investigated the interactions among multiple genetic elements underlying mouse weight. Statistically significant interactions of SNP-to-SNP, gene-to-gene, and SNP-to-gene identified by mgRF revealed genetic elements and their significant association underlying mouse weight. The results demonstrated a great expectation of mgRF as a complementary framework to “genetics of gene expression” analysis for dissecting genetic mechanism of complex traits, such as obesity. Results Overview of the mgRF method In mgRF our main objectives are to capture intrinsic structures of variable (genetic element) correlation and/or interaction and to incorporate such information in the RF framework to predict a complex phenotype. The major steps of mgRF algorithm, outlined in Figure 1, consist of the identification of variable modules from a variable correlation network (Figures 1A and 1B) and an iterative RFs construction process (Figure 1C). In the first step we construct a correlation network and identify modules in the network to group highly-correlated variables, which may be in different types, using a network clustering method such as HQCut [27]–[29]. In the second step of mgRF, we iteratively construct a series of RFs guided by the previously identified network modules. Instead of randomly sampling variables in each node of regression tree, we adopt a two-stage candidate variable sampling procedure, where we first select a subset of modules and then choose one representative variable for each of the selected modules (right panels in Figure 1C) to correct the bias of variable importance while incorporating the variable association information. Except the first RF construction, we use a modified weighted sampling to improve the prediction accuracy by prioritizing informative variables among a large pool of variables. A key element of mgRF is to correct possible bias of variable importance measure and improve the performance of RF for high-dimensional data. This is done in part by introducing a module importance (MI) to each network module identified. Initially all MI and variable importance (VI) are set to 0 so that in the first iteration the sampling of modules and variables is un-weighted. After each iteration of RF, new MIs and VIs are re-estimated (not accumulated). The values of MIs and VIs can typically converge within a small number of iterations, where little change can be observed between the last and the second to the last estimations. The final output of mgRF is an ensemble of trees as a model for future analysis and its corrected VIs (cVIs) and MIs for variable and module ranking. Details and parameter selections of the mgRF algorithm are described in Materials and Methods. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 1. Flow chart of mgRF algorithm. (A) A subgraph of a constructed correlation network where a node indicates a variable and an edge represents the correlation between two variables. (B) Identification of network modules using HQcut, where different colors encode different modules. (C) Iterative construction of RFs. The panels on the left show multiple iterations of RFs and the panels on the right illustrate the sampling scheme in each node during of tree construction. mtry (k) candidate variables are sampled using a two-stage candidate variable sampling procedure, where a subset of modules (e.g. the blue, green, orange modules) is first sampled and then one representative variable from each module is selected. In the first iteration (iteration 0), all modules and variables have the same weights. At the end of one iteration, module and variable importance are re-estimated. In the figure, the importance of variables is encoded as the size of node. Then modules and variables are sampled by their corresponding weights using modified weighted sampling. The best splitting variable and value at each node in the tree in the left panel are selected from mtry (k) candidate variables. https://doi.org/10.1371/journal.pcbi.1002956.g001 Combining genotypic and expression data can better predict mouse weight To investigate the benefits of integrating genotypic and gene expression data, we examined the performance of different models on the genotypic and gene expression data of a cohort of F2 mouse intercross in a three-way comparison: (1) using only data of genetic markers (genotype-only), (2) using only data of gene expression (expression-only), and (3) using both genotypic and expression data (combined). We first compared mgRF with group lasso [30], elastic net [16], SVR-RFE [8], and the conventional RF algorithm (see Text S1) in terms of the weight regression error (Root-Mean-Square Error or RMSE) in all three types of data using 10 trials of 10-fold cross-validation. The average cross-validation RMSEs of the methods compared are shown in Figure 2. mgRF achieved the smallest average error compared to all the other competing methods in all types of data. Since we assessed each model using the same training and test data in each fold of the cross-validation, we can compute the paired t-test of RMSEs to evaluate the significance of the results. As shown in Table S1, the RMSEs of all the other methods are significantly larger (p<2.52×10−13) than that of mgRF. Furthermore, the running time of mgRF is slightly less than the conventional RF with better prediction accuracy (Table S2). It is noteworthy to mention that SVR-RFE, RF and mgRF outperformed the linear models, group lasso and elastic net, suggesting the benefits of incorporating non-linearity between variables and the mouse weight response. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 2. Performance comparison of Group lasso, Elastic net, Support Vector Regressor (SVR), conventional RF, and mgRF. Average RMSEs of these methods in a 10-fold cross-validation on the mouse weight dataset. Even though the standard deviation (shown as error bar) of RMSEs among 10 trials of 10-fold cross-validation is relatively high due to the small number of samples, if we compare the RMSEs of different methods using the same set of training samples, the improvement is evident (mgRF vs conventional RF, one-tail paired t-test p-value<3.83×10−16). https://doi.org/10.1371/journal.pcbi.1002956.g002 In particular, we examined the RMSEs of mouse weight using mgRF. The boxplot of prediction errors on the three data types are shown in Figure 3A. The combined data have the smallest error (RMSE = 3.80 and R2 = 0.604), followed by the expression-only data (RMSE = 4.13, and R2 = 0.534), while the genotypic-only data have the highest error rate (RMSE = 5.62 and R2 = 0.137). Thus using either the genotypic or gene expression data alone is less effective than using the combined data. Although the standard error of RMSEs of different trials (quartile bar in Figure 3A) is relatively large comparing to the difference of mean RMSE between with and without genotypic data, there is a substantial improvement in pairwise comparisons using the same training samples (Figures 3B to 3D). The two-dimensional co-ordinates of point in each of these plots indicate the RMSEs of mgRF trained with the same set of samples but with different data types. In Figures 3B and 3C, most of the points appear under the reference diagonal line, which confirms that both expression-only and combined data achieved better performance in a single fold than the genotype-only data (paired one-tail t-test p-value≤4.7446×10−28 and ≤1.7272×10−32, respectively). This was probably because in general the linkage signal of genetic markers is weak (LOD score <4), while gene expression is more closely related to the phenotype than genotypes. Furthermore, mgRF using both genotypic and gene expression data outperforms using expression-only data in more than 90% of the trials (Figure 3D, paired one-tail t-test p-value≤1.691×10−19), showing that combining genotypic and gene expression data can indeed improve the prediction power and suggesting that information of gene expression plays a role in bridging the gap between genotype variations and complex traits. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 3. Summary of prediction errors of mgRF using three types of data. (A) Boxplot of prediction errors in root-mean-square error (RMSE) in 10 trials of 10-folds cross-validation. Scatter plots of (B) genotype-only (x-axis) vs. expression-only (y-axis), (C) genotype-only vs. combined, and (D) expression-only vs. combined. The dashed diagonal lines (x = y) indicate points of equal RMSE. Given two vectors, the p-value of one-tail paired t-test evaluate if the distribution of one vector (100 RMSE values for 10 trials of 10-fold cross validation) is statistically smaller than the other one. https://doi.org/10.1371/journal.pcbi.1002956.g003 Genetic elements identified by mgRF reveal perturbed pathways related to mouse weight The mgRF method used corrected variable importance (cVI) and module importance (MI) to identify variables and groups of variables that influence the trait of body weight. MIs were computed for network modules identified by the HQcut algorithm [27], [29], [31]. To assess and illustrate mgRF's ability for correcting the bias of variable importance, we evaluated different regression models regarding their abilities for recovering the true variable importance associated with the known data-generating pattern in a simulation study (see Text S2). mgRF was able to accurately recover the known pattern of variables' importance and the VI measure of mgRF was more stable than the other methods in all simulations, as discussed in Text S2. When applied to the mouse weight data from a cohort of 132 samples and compared with modules identified by topological overlap matrix based methods [12], [32], HQcut produced much smaller modules, allowing only highly-correlated variables to be clustered in a module (Figure S1). HQcut identified 146, 1036 and 1187 network modules (see module structures in Table S3, S4, S5) in the genotype-only, expression-only and combined data, respectively. As expected, SNPs in one module were generally in linkage disequilibrium. Genes in one module were co-expressed and potentially functionally related. There were SNPs and genes assigned to the same module in the combined data set due to the large correlation values among those gene expression and SNPs. The top-ranked genetic markers and genes in the combined data largely overlapped with those identified by genotype-only and expression-only data types indicating the stability of mgRF in terms of variable ranking. Here we reported the top-ranked modules of genetic markers and genes in Table 1 and 2. Among these top-ranked SNPs (Table 1), rs3662726 (Chromosome 5, 123 Mb) is near Gofm2 (gonadal fat mass 2) QTL which has been reported to confer increased fat mass in female mouse [33]. We also examined the LOD scores of SNPs using the traditional QTL mapping. Several “hotspot” QTLs on Chromosomes 1, 3, 5, 7, 10, 15 and 19 were partially overlapped with the top-ranked markers by mgRF (Figure 4). Table S6 lists all the top-ranked SNPs. Note that several markers with low LOD scores were assigned relative high cVIs, suggesting that a SNP with low marginal effect can be identified by mgRF because of their interaction with other SNPs, which may contribute to the variation of body weight. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 4. LOD scores from QTL mapping versus VI scores from mgRF. The black curves represent the LOD scores of a single marker genome scan in conventional QTL analysis. The blue bars represent the cVI scores of genetic markers output by mgRF. https://doi.org/10.1371/journal.pcbi.1002956.g004 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 1. Representative significant SNPs in the top 10 modules identified by mgRF in the combined dataset. https://doi.org/10.1371/journal.pcbi.1002956.t001 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. Representative significant genes in the top 10 modules identified by mgRF in the combined dataset.* https://doi.org/10.1371/journal.pcbi.1002956.t002 It is important to note that there was little overlap among the 100 top-ranked genes from group lasso, elastic net, SVR-RFE, the conventional RF algorithm, and mgRF (Figure S2A). The lack of consensus indicated that these algorithms identified their own top-ranked genes based on different (unspecified) assumptions on the given data and target models to be learned. Introduction of such assumptions seemed to be inevitable because of the lack of sufficient knowledge of the problem at hand and different objectives that these methods were devised to achieve. Nevertheless, all these methods strived to select predictive variables (genetic factors). On top of finding individual predictive genetic factors, mgRF was particularly designed to identify such predictive genetic factors whose association might contribute more significantly than individual variables at the module level because it propagated the contribution of individual variables to highly correlated neighbors, rather than fully focusing on individual genes. Table S7 lists the top-ranked genes related to mouse weight from mgRF. Among these top-ranked genes (Table 2), monoacylglycerol O-acyltrasferase 1 (Mogat1, cVI = 11.84) in module 189 has been previously identified to be located within Chromosome 1 obesity QTL interval near D1Mit215. Within this QTL interval on Chromosome 1, insulin-like growth factor binding protein 2 (Igfbp2, cVI = 10.94) has expression levels in liver negatively correlated with mesenteric fat pad weights [34]. Igfbp2 (appeared in module 32) has also been shown to prevent diet-induced obesity and insulin resistance in mice on overexpression [35]. In particular, module 32 contained Cyp2c37 (cVI = 9.65), C7orf24 (cVI = 7.43) and Gpld1 (cVI = 6.54), which were not among the top 100 ranked genes from any of the other methods compared, probably due to their correlation with Igfbp2. Raet1d (cVI = 9.38) in module 189 was also not identified by the competing methods (Table S8) probably due to its correlation with Mogat1. Remarkably, Cyp2c37 has been previously recognized as being associated with fat mass [10] and Gpld1 had been shown to be associated with the level of adiponectin, a hormone secreted from adipose tissue which is negatively correlated with obesity [36]. It is viable to hypothesize that other genes identified by mgRF, which were neglected by the other methods, may potentially contribute to mouse weight variation. To further assess the biological significance of the genes identified by mgRF, we conducted a Gene Ontology (GO) enrichment analysis (see Materials and Methods) on the top-ranked genes from the methods that were compared. mgRF identified more enriched biological processes than the other methods (Table S9). In particular, genes identified by mgRF were enriched with many obesity-related processes, such as regulation of lipid storage (p = 7.17×10−06), positive regulation of cholesterol storage (p = 1.07×10−05), regulation of growth (p = 0.000575), and cellular response to cholesterol (p = 0.00396). In contrast, the enriched biological processes provided by the other methods were less significant and less biologically meaningful (Tables S5B to S5E). As an example, Figure 5 shows a sub-network of some top-ranked genes from mgRF to compare the variable importance measures in mgRF and the conventional RF. The size of nodes represents relative cVIs from mgRF in Figure 5A and represents relative VIs from conventional RF in Figure 5B. In the conventional RF algorithm, one variable, Igfbp2, has a larger importance than others. As a result, the importance of Igfbp2 overshadows several other correlated variables such as Fmo3, Cyp2c37, and Raet1d, which may in fact be equally important as Igfbp2. In mgRF, several genes with the highest cVI, such as Mogat1, MGC137458, and Igfbp2, which are known to be critical to body weight, were also hub nodes in the network with many edges. It was consistent with our previous studies on the importance of hub genes in the co-expression network [37]. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 5. Network structure of 30 top ranked genes from mgRF. (A) The size of nodes represents the relative cVIs from mgRF. (B) The size of nodes represents the relative VIs from the conventional RF algorithm. https://doi.org/10.1371/journal.pcbi.1002956.g005 Significant interactions among multiple genetic elements revealed by mgRF Although the corrected variable importance (cVI) from mgRF quantifies the contribution of a genetic factor to the prediction power, it does not indicate whether the contribution is from the genetic factor alone or from its interaction between or association with other factors. One advantage of the RF method is its ability to incorporate variable interactions, which mgRF inherited. We devised a systematic statistical test (see Materials and Methods) to assess the significance of gene-to-gene, SNP-to-SNP, and SNP-to-gene interactions revealed by mgRF. To examine the biological relevance of genes identified in gene-to-gene interactions in the mouse weight data, we first tested the functional enrichment among 160 unique genes from the top 100 most significant pairs of interactions (Table S10). Interestingly, these genes were enriched with metabolic processes such as isoprenoid metabolic process (p = 0.00675), drug metabolic process (p = 0.00841), and terpenoid metabolic process (p = 0.00117), indicating obesity-related interaction among the identified genes. In particular, the pair of Avpr1a and Igfbp2 is one of the most significant interactions (p = 4.15×10−07), both of which are also among the most predictive genes. However, only 11 (∼7%) of the 160 unique genes from the significant gene-to-gene interactions were overlapped with the top 100 most predictive genes identified by cVI (Figure S2B). This suggested that our interaction test could indeed identify genes that were less significant when examined individually. For example, macrophage receptor with collagenous structure (Marco) was observed to interact with many other phenotype-related genes (e.g. Dhrs4, Cyp2d22, and Pdia5), even though its own cVI was relatively low. Among the top-ranked SNP-to-SNP interactions, we found significant interactions between SNPs on Chromosome 5 (123 Mb) and Chromosome 19 (51 Mb) (p = 3.34×10−11), on Chromosome 2 (96 Mb) and Chromosome 9 (61 Mb) (p = 8.97×10−7), and on Chromosome 1 (41 Mb) and Chromosome 15 (62 Mb) (p = 2.94×10−06, Table S11). The most significant interaction was between p45558 (Chromosome 5, 123 Mb) and p44890 (Chromosome 19, 51 Mb). Cis-eQTLs analysis [12] indicated that Bmp2, a key regulator of adipogenesis, was a candidate gene of p45558, and Cyp2c40, known to be presented in Fatty acid metabolism, was a candidate gene of p44890. For SNP-to-gene interactions (Table S12), there was little overlapping between genes involved in SNP-to-gene interactions and genes involved in gene-to-gene interactions (Figure S2B). Of particular interest was the interaction between SNP p45334 (Chromosome 1, 77 Mb) and gene Ehhadh, where Mogat1 is one of the candidate eQTL genes of p45334 and Ehhadh is annotated in the fatty acid metabolism pathway. We compared the top 50 unique SNPs involved in SNP-to-gene interactions with top SNPs ranked by cVIs. More than 20 of them were among the top-50 most predictive SNPs. On the other hand, only two genes involved in top 100 SNP-to-gene interactions were among the top 100 most predictive genes. We hypothesized that the most predictive SNPs did not interact with the most predictive genes because the information contained in these SNPs was redundant to these predictive genes, which usually were the expression traits of the corresponding predictive SNPs. In turn, by combining less predictive gene and marker profiles introduced extra information into the system and indeed improved the prediction accuracy. Genes involved in such interactions might reveal additional perturbed pathways underlying the trait of body weight. Overview of the mgRF method In mgRF our main objectives are to capture intrinsic structures of variable (genetic element) correlation and/or interaction and to incorporate such information in the RF framework to predict a complex phenotype. The major steps of mgRF algorithm, outlined in Figure 1, consist of the identification of variable modules from a variable correlation network (Figures 1A and 1B) and an iterative RFs construction process (Figure 1C). In the first step we construct a correlation network and identify modules in the network to group highly-correlated variables, which may be in different types, using a network clustering method such as HQCut [27]–[29]. In the second step of mgRF, we iteratively construct a series of RFs guided by the previously identified network modules. Instead of randomly sampling variables in each node of regression tree, we adopt a two-stage candidate variable sampling procedure, where we first select a subset of modules and then choose one representative variable for each of the selected modules (right panels in Figure 1C) to correct the bias of variable importance while incorporating the variable association information. Except the first RF construction, we use a modified weighted sampling to improve the prediction accuracy by prioritizing informative variables among a large pool of variables. A key element of mgRF is to correct possible bias of variable importance measure and improve the performance of RF for high-dimensional data. This is done in part by introducing a module importance (MI) to each network module identified. Initially all MI and variable importance (VI) are set to 0 so that in the first iteration the sampling of modules and variables is un-weighted. After each iteration of RF, new MIs and VIs are re-estimated (not accumulated). The values of MIs and VIs can typically converge within a small number of iterations, where little change can be observed between the last and the second to the last estimations. The final output of mgRF is an ensemble of trees as a model for future analysis and its corrected VIs (cVIs) and MIs for variable and module ranking. Details and parameter selections of the mgRF algorithm are described in Materials and Methods. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 1. Flow chart of mgRF algorithm. (A) A subgraph of a constructed correlation network where a node indicates a variable and an edge represents the correlation between two variables. (B) Identification of network modules using HQcut, where different colors encode different modules. (C) Iterative construction of RFs. The panels on the left show multiple iterations of RFs and the panels on the right illustrate the sampling scheme in each node during of tree construction. mtry (k) candidate variables are sampled using a two-stage candidate variable sampling procedure, where a subset of modules (e.g. the blue, green, orange modules) is first sampled and then one representative variable from each module is selected. In the first iteration (iteration 0), all modules and variables have the same weights. At the end of one iteration, module and variable importance are re-estimated. In the figure, the importance of variables is encoded as the size of node. Then modules and variables are sampled by their corresponding weights using modified weighted sampling. The best splitting variable and value at each node in the tree in the left panel are selected from mtry (k) candidate variables. https://doi.org/10.1371/journal.pcbi.1002956.g001 Combining genotypic and expression data can better predict mouse weight To investigate the benefits of integrating genotypic and gene expression data, we examined the performance of different models on the genotypic and gene expression data of a cohort of F2 mouse intercross in a three-way comparison: (1) using only data of genetic markers (genotype-only), (2) using only data of gene expression (expression-only), and (3) using both genotypic and expression data (combined). We first compared mgRF with group lasso [30], elastic net [16], SVR-RFE [8], and the conventional RF algorithm (see Text S1) in terms of the weight regression error (Root-Mean-Square Error or RMSE) in all three types of data using 10 trials of 10-fold cross-validation. The average cross-validation RMSEs of the methods compared are shown in Figure 2. mgRF achieved the smallest average error compared to all the other competing methods in all types of data. Since we assessed each model using the same training and test data in each fold of the cross-validation, we can compute the paired t-test of RMSEs to evaluate the significance of the results. As shown in Table S1, the RMSEs of all the other methods are significantly larger (p<2.52×10−13) than that of mgRF. Furthermore, the running time of mgRF is slightly less than the conventional RF with better prediction accuracy (Table S2). It is noteworthy to mention that SVR-RFE, RF and mgRF outperformed the linear models, group lasso and elastic net, suggesting the benefits of incorporating non-linearity between variables and the mouse weight response. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 2. Performance comparison of Group lasso, Elastic net, Support Vector Regressor (SVR), conventional RF, and mgRF. Average RMSEs of these methods in a 10-fold cross-validation on the mouse weight dataset. Even though the standard deviation (shown as error bar) of RMSEs among 10 trials of 10-fold cross-validation is relatively high due to the small number of samples, if we compare the RMSEs of different methods using the same set of training samples, the improvement is evident (mgRF vs conventional RF, one-tail paired t-test p-value<3.83×10−16). https://doi.org/10.1371/journal.pcbi.1002956.g002 In particular, we examined the RMSEs of mouse weight using mgRF. The boxplot of prediction errors on the three data types are shown in Figure 3A. The combined data have the smallest error (RMSE = 3.80 and R2 = 0.604), followed by the expression-only data (RMSE = 4.13, and R2 = 0.534), while the genotypic-only data have the highest error rate (RMSE = 5.62 and R2 = 0.137). Thus using either the genotypic or gene expression data alone is less effective than using the combined data. Although the standard error of RMSEs of different trials (quartile bar in Figure 3A) is relatively large comparing to the difference of mean RMSE between with and without genotypic data, there is a substantial improvement in pairwise comparisons using the same training samples (Figures 3B to 3D). The two-dimensional co-ordinates of point in each of these plots indicate the RMSEs of mgRF trained with the same set of samples but with different data types. In Figures 3B and 3C, most of the points appear under the reference diagonal line, which confirms that both expression-only and combined data achieved better performance in a single fold than the genotype-only data (paired one-tail t-test p-value≤4.7446×10−28 and ≤1.7272×10−32, respectively). This was probably because in general the linkage signal of genetic markers is weak (LOD score <4), while gene expression is more closely related to the phenotype than genotypes. Furthermore, mgRF using both genotypic and gene expression data outperforms using expression-only data in more than 90% of the trials (Figure 3D, paired one-tail t-test p-value≤1.691×10−19), showing that combining genotypic and gene expression data can indeed improve the prediction power and suggesting that information of gene expression plays a role in bridging the gap between genotype variations and complex traits. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 3. Summary of prediction errors of mgRF using three types of data. (A) Boxplot of prediction errors in root-mean-square error (RMSE) in 10 trials of 10-folds cross-validation. Scatter plots of (B) genotype-only (x-axis) vs. expression-only (y-axis), (C) genotype-only vs. combined, and (D) expression-only vs. combined. The dashed diagonal lines (x = y) indicate points of equal RMSE. Given two vectors, the p-value of one-tail paired t-test evaluate if the distribution of one vector (100 RMSE values for 10 trials of 10-fold cross validation) is statistically smaller than the other one. https://doi.org/10.1371/journal.pcbi.1002956.g003 Genetic elements identified by mgRF reveal perturbed pathways related to mouse weight The mgRF method used corrected variable importance (cVI) and module importance (MI) to identify variables and groups of variables that influence the trait of body weight. MIs were computed for network modules identified by the HQcut algorithm [27], [29], [31]. To assess and illustrate mgRF's ability for correcting the bias of variable importance, we evaluated different regression models regarding their abilities for recovering the true variable importance associated with the known data-generating pattern in a simulation study (see Text S2). mgRF was able to accurately recover the known pattern of variables' importance and the VI measure of mgRF was more stable than the other methods in all simulations, as discussed in Text S2. When applied to the mouse weight data from a cohort of 132 samples and compared with modules identified by topological overlap matrix based methods [12], [32], HQcut produced much smaller modules, allowing only highly-correlated variables to be clustered in a module (Figure S1). HQcut identified 146, 1036 and 1187 network modules (see module structures in Table S3, S4, S5) in the genotype-only, expression-only and combined data, respectively. As expected, SNPs in one module were generally in linkage disequilibrium. Genes in one module were co-expressed and potentially functionally related. There were SNPs and genes assigned to the same module in the combined data set due to the large correlation values among those gene expression and SNPs. The top-ranked genetic markers and genes in the combined data largely overlapped with those identified by genotype-only and expression-only data types indicating the stability of mgRF in terms of variable ranking. Here we reported the top-ranked modules of genetic markers and genes in Table 1 and 2. Among these top-ranked SNPs (Table 1), rs3662726 (Chromosome 5, 123 Mb) is near Gofm2 (gonadal fat mass 2) QTL which has been reported to confer increased fat mass in female mouse [33]. We also examined the LOD scores of SNPs using the traditional QTL mapping. Several “hotspot” QTLs on Chromosomes 1, 3, 5, 7, 10, 15 and 19 were partially overlapped with the top-ranked markers by mgRF (Figure 4). Table S6 lists all the top-ranked SNPs. Note that several markers with low LOD scores were assigned relative high cVIs, suggesting that a SNP with low marginal effect can be identified by mgRF because of their interaction with other SNPs, which may contribute to the variation of body weight. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 4. LOD scores from QTL mapping versus VI scores from mgRF. The black curves represent the LOD scores of a single marker genome scan in conventional QTL analysis. The blue bars represent the cVI scores of genetic markers output by mgRF. https://doi.org/10.1371/journal.pcbi.1002956.g004 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 1. Representative significant SNPs in the top 10 modules identified by mgRF in the combined dataset. https://doi.org/10.1371/journal.pcbi.1002956.t001 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. Representative significant genes in the top 10 modules identified by mgRF in the combined dataset.* https://doi.org/10.1371/journal.pcbi.1002956.t002 It is important to note that there was little overlap among the 100 top-ranked genes from group lasso, elastic net, SVR-RFE, the conventional RF algorithm, and mgRF (Figure S2A). The lack of consensus indicated that these algorithms identified their own top-ranked genes based on different (unspecified) assumptions on the given data and target models to be learned. Introduction of such assumptions seemed to be inevitable because of the lack of sufficient knowledge of the problem at hand and different objectives that these methods were devised to achieve. Nevertheless, all these methods strived to select predictive variables (genetic factors). On top of finding individual predictive genetic factors, mgRF was particularly designed to identify such predictive genetic factors whose association might contribute more significantly than individual variables at the module level because it propagated the contribution of individual variables to highly correlated neighbors, rather than fully focusing on individual genes. Table S7 lists the top-ranked genes related to mouse weight from mgRF. Among these top-ranked genes (Table 2), monoacylglycerol O-acyltrasferase 1 (Mogat1, cVI = 11.84) in module 189 has been previously identified to be located within Chromosome 1 obesity QTL interval near D1Mit215. Within this QTL interval on Chromosome 1, insulin-like growth factor binding protein 2 (Igfbp2, cVI = 10.94) has expression levels in liver negatively correlated with mesenteric fat pad weights [34]. Igfbp2 (appeared in module 32) has also been shown to prevent diet-induced obesity and insulin resistance in mice on overexpression [35]. In particular, module 32 contained Cyp2c37 (cVI = 9.65), C7orf24 (cVI = 7.43) and Gpld1 (cVI = 6.54), which were not among the top 100 ranked genes from any of the other methods compared, probably due to their correlation with Igfbp2. Raet1d (cVI = 9.38) in module 189 was also not identified by the competing methods (Table S8) probably due to its correlation with Mogat1. Remarkably, Cyp2c37 has been previously recognized as being associated with fat mass [10] and Gpld1 had been shown to be associated with the level of adiponectin, a hormone secreted from adipose tissue which is negatively correlated with obesity [36]. It is viable to hypothesize that other genes identified by mgRF, which were neglected by the other methods, may potentially contribute to mouse weight variation. To further assess the biological significance of the genes identified by mgRF, we conducted a Gene Ontology (GO) enrichment analysis (see Materials and Methods) on the top-ranked genes from the methods that were compared. mgRF identified more enriched biological processes than the other methods (Table S9). In particular, genes identified by mgRF were enriched with many obesity-related processes, such as regulation of lipid storage (p = 7.17×10−06), positive regulation of cholesterol storage (p = 1.07×10−05), regulation of growth (p = 0.000575), and cellular response to cholesterol (p = 0.00396). In contrast, the enriched biological processes provided by the other methods were less significant and less biologically meaningful (Tables S5B to S5E). As an example, Figure 5 shows a sub-network of some top-ranked genes from mgRF to compare the variable importance measures in mgRF and the conventional RF. The size of nodes represents relative cVIs from mgRF in Figure 5A and represents relative VIs from conventional RF in Figure 5B. In the conventional RF algorithm, one variable, Igfbp2, has a larger importance than others. As a result, the importance of Igfbp2 overshadows several other correlated variables such as Fmo3, Cyp2c37, and Raet1d, which may in fact be equally important as Igfbp2. In mgRF, several genes with the highest cVI, such as Mogat1, MGC137458, and Igfbp2, which are known to be critical to body weight, were also hub nodes in the network with many edges. It was consistent with our previous studies on the importance of hub genes in the co-expression network [37]. Download: PPT PowerPoint slide PNG larger image TIFF original image Figure 5. Network structure of 30 top ranked genes from mgRF. (A) The size of nodes represents the relative cVIs from mgRF. (B) The size of nodes represents the relative VIs from the conventional RF algorithm. https://doi.org/10.1371/journal.pcbi.1002956.g005 Significant interactions among multiple genetic elements revealed by mgRF Although the corrected variable importance (cVI) from mgRF quantifies the contribution of a genetic factor to the prediction power, it does not indicate whether the contribution is from the genetic factor alone or from its interaction between or association with other factors. One advantage of the RF method is its ability to incorporate variable interactions, which mgRF inherited. We devised a systematic statistical test (see Materials and Methods) to assess the significance of gene-to-gene, SNP-to-SNP, and SNP-to-gene interactions revealed by mgRF. To examine the biological relevance of genes identified in gene-to-gene interactions in the mouse weight data, we first tested the functional enrichment among 160 unique genes from the top 100 most significant pairs of interactions (Table S10). Interestingly, these genes were enriched with metabolic processes such as isoprenoid metabolic process (p = 0.00675), drug metabolic process (p = 0.00841), and terpenoid metabolic process (p = 0.00117), indicating obesity-related interaction among the identified genes. In particular, the pair of Avpr1a and Igfbp2 is one of the most significant interactions (p = 4.15×10−07), both of which are also among the most predictive genes. However, only 11 (∼7%) of the 160 unique genes from the significant gene-to-gene interactions were overlapped with the top 100 most predictive genes identified by cVI (Figure S2B). This suggested that our interaction test could indeed identify genes that were less significant when examined individually. For example, macrophage receptor with collagenous structure (Marco) was observed to interact with many other phenotype-related genes (e.g. Dhrs4, Cyp2d22, and Pdia5), even though its own cVI was relatively low. Among the top-ranked SNP-to-SNP interactions, we found significant interactions between SNPs on Chromosome 5 (123 Mb) and Chromosome 19 (51 Mb) (p = 3.34×10−11), on Chromosome 2 (96 Mb) and Chromosome 9 (61 Mb) (p = 8.97×10−7), and on Chromosome 1 (41 Mb) and Chromosome 15 (62 Mb) (p = 2.94×10−06, Table S11). The most significant interaction was between p45558 (Chromosome 5, 123 Mb) and p44890 (Chromosome 19, 51 Mb). Cis-eQTLs analysis [12] indicated that Bmp2, a key regulator of adipogenesis, was a candidate gene of p45558, and Cyp2c40, known to be presented in Fatty acid metabolism, was a candidate gene of p44890. For SNP-to-gene interactions (Table S12), there was little overlapping between genes involved in SNP-to-gene interactions and genes involved in gene-to-gene interactions (Figure S2B). Of particular interest was the interaction between SNP p45334 (Chromosome 1, 77 Mb) and gene Ehhadh, where Mogat1 is one of the candidate eQTL genes of p45334 and Ehhadh is annotated in the fatty acid metabolism pathway. We compared the top 50 unique SNPs involved in SNP-to-gene interactions with top SNPs ranked by cVIs. More than 20 of them were among the top-50 most predictive SNPs. On the other hand, only two genes involved in top 100 SNP-to-gene interactions were among the top 100 most predictive genes. We hypothesized that the most predictive SNPs did not interact with the most predictive genes because the information contained in these SNPs was redundant to these predictive genes, which usually were the expression traits of the corresponding predictive SNPs. In turn, by combining less predictive gene and marker profiles introduced extra information into the system and indeed improved the prediction accuracy. Genes involved in such interactions might reveal additional perturbed pathways underlying the trait of body weight. Discussion A systems biology approach is necessary to dissect complex traits, such as obesity, and understand relationships among various genetic factors. Combing heterogeneous data from multiple sources will become increasingly important to model a large quantity of data and interpret results. In this paper, we proposed a novel approach that integrates the method of Random Forests and a network analysis to incorporate genotypic and gene expression data for revealing genetic factors and their interaction or association that are characteristic of complex traits. To overcome the curse of dimensionality, mgRF enhanced conventional RF with the module structure of a correlation network and the weighted sampling procedure. As a result, it successfully identified a small subset of both predictive and biological meaningful genes and genetic markers out of thousands of candidates. Meanwhile, mgRF was able to model the complex associations and possibly interactions between heterogeneous variables, which lead to interesting findings that can shed some lights on solving the genetic multiplicity problem underlying complex traits. To rectify the bias in ranking correlated variables in conventional RF, simple but effective strategies such as grouping correlated variables prior to model fitting [38], [39] can be applied, where cluster centroids obtained from a hierarchical clustering could be used as supergenes to fit classification/regression models. Compared with Tolosi and Lengaue's work [39], the major differences and novelty of mgRF are three folds. (1) We maintained the original feature space in RF models so that the importance of individual variables can be estimated. (2) Instead of using a simple hierarchical clustering, we adopted the network modeling method HQCut, which is able to automatically and accurately determine the number of modules (clusters) in the network. (3) We further utilized the learnt MI and VI to guide a weighted sampling of variables. Compared with other regression models, such as elastic net and support vector regressor (SVR), mgRF naturally handles different types of variables in that the splitting points of continuous (or ordinal) variables preserve the order information, which, however, is disregarded in categorical variables. On the contrary, elastic net and SVR treat categorical variables as continuous ones, consequently imposing ordered information, which is related to how the categories should be encoded. There is a popular variable importance measure, permutation VI [18], which measures the increase of out-of-bag prediction error with the variable to be measured being permutated. However, the permutation VI suffers from several shortcomings for large problems. It requires an excessive computation time, since each variable needs to be permuted dozens of times to ensure statistical stability. In addition, when the baseline prediction error is large, there is little chance for permutation to make a prediction worse, which leads to an uniformly low VI [25]. More critically, it is still subject to the bias of correlated variables [40], [41]. Even though this problem can be corrected [24], [42], [43], the incurred computation time of additional permutation for a solution will make the excessive computation cost prohibitive for large application. Using a BxH F2 mouse intercross data set, we showed that the proposed algorithm was effective on not only reducing prediction error, but also identifying a subset of genetic markers and genes that are associated with the trait of body weight. By integrating genotypic and gene expression data, mgRF achieved a lower prediction error compared to using either type of data alone. These results support the idea that gene expression plays an intermediate bridging role between genotypic variations and a phenotype. Genotypic data alone are insufficient for accurately predicting the body weight due to their relative weak effects, while gene expression data bridge the gap between genotypic variants and a phenotype as gene expression can be intermediate traits of multiple genetic markers. Besides the annotated body weight relevant genetic elements, such as QTL rs3662726, genes Mogat1, Igfbp2, and Cyp2c37, mgRF provided valuable hypotheses on putative, novel genetic elements and their interactions that are potentially important for body weight and obesity. In particular, the top-ranked SNPs and genes, which have similar levels of importance but are lack of known annotations, are excellent candidates for further validations. A key feature of mgRF is that it exploited splitting variables to incorporate non-linear interactions of variables into the model and to identify intriguing associations within and across two types of data. The proposed statistical test for variable associations aimed at extracting biological relevant markers and genes that might have been overlooked by individual variable importance ranking. The results of mgRF showed that several known obesity-related genes and loci were associate or even interacted with each other and genes that were strongly associated were indeed related to the traits of obesity and/or body weight, as these genes were enriched with biological processes on metabolisms. More importantly, the results revealed that many genetic elements, which have not been indicated previously to be associated with the traits, interacted with obesity-related genes and their associations may contribute more significantly to the traits than associations between genes that were known to be related to obesity. In addition, the results also included significant pairs of genes even though the predictive scores of individual genes whose predictive scores were insignificant. These results suggested that more obesity related genetic factors remain to be discovered and mgRF is potentially an enabling method for identifying genetic factors whose significance would not be appreciated unless their associations or interactions were taken into consideration. Another key advantage of RF is that at each splitting point, it only considers mtry (k) candidate variables for splitting, usually k<