Abstract Motivation Transcriptomics and proteomics data have been integrated into constraint-based models to influence flux predictions. However, it has been reported recently for Escherichia coli and Saccharomyces cerevisiae, that model predictions from parsimonious flux balance analysis (pFBA), which does not use expression data, are as good or better than predictions from various algorithms that integrate transcriptomics or proteomics data into constraint-based models. Results In this paper, we describe a novel constraint-based method called Linear Bound Flux Balance Analysis (LBFBA), which uses expression data (either transcriptomic or proteomic) to predict metabolic fluxes. The method uses expression data to place soft constraints on individual fluxes, which can be violated. Parameters in the soft constraints are first estimated from a training expression and flux dataset before being used to predict fluxes from expression data in other conditions. We applied LBFBA to E.coli and S.cerevisiae datasets and found that LBFBA predictions were more accurate than pFBA predictions, with average normalized errors roughly half of those from pFBA. For the first time, we demonstrate a computational method that integrates expression data into constraint-based models and improves quantitative flux predictions over pFBA. Availability and implementation Code is available in the Supplementary data available at Bioinformatics online. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Constraint-based modeling (CBM) can be used to predict cell physiology (e.g. growth rate and metabolic fluxes) under different conditions and improve our understanding of cell metabolism. CBM has been applied in metabolic engineering (Burgard et al., 2003; Cotten and Reed, 2013; Kim et al., 2011; Tervo and Reed, 2012), metabolic comparisons (Bosi et al., 2016; Hamilton and Reed, 2012; Nuccio and Bäumler, 2014), drug discovery (Chavali et al., 2012; Kim et al., 2010; Lee et al., 2009) and other health applications (Becker and Palsson, 2008; Magnúsdóttir et al., 2017; Shlomi et al., 2008). Recent developments in sequencing and mass spectrometry have enabled transcriptomics and proteomics datasets to become more widely available. These omics datasets can be used to derive expression-based CBM constraints and/or objective functions, which can potentially improve model predictions. There are two fundamental ways that expression data has been integrated into constraint-based models. The first way is to directly integrate the expression information into the flux bound. For example, Åkesson et al. (Åkesson et al., 2004) set the fluxes to zero if expression of their associated genes was low. E-Flux (Colijn et al., 2009) directly models the maximum allowable flux value as a function of measured gene expression. The second way is to divide the reactions into different categories based on gene expression (e.g. highly expressed or lowly expressed) and then maximize the agreement (reactions associated with highly expressed genes have high flux) or minimize the disagreement (reactions associated with lowly expressed genes should not have high flux) between flux and expression. For example, Gene Inactivity Moderated by Metabolism and Expression (GIMME) (Becker and Palsson, 2008) minimizes the flux through reactions whose associated genes’ expression falls below a given threshold, weighted by the difference between the genes’ expression and the threshold. iMAT (Shlomi et al., 2008) divides the reactions into those associated with highly expressed genes and reactions associated with lowly expressed genes, and maximizes the number of reactions whose fluxes are consistent with their gene expression state. Transcriptionally controlled flux balance analysis (tFBA) (van Berlo et al., 2011) assumes that if gene expression changes significantly from one condition to the other, the flux through the reaction associated with that gene will change accordingly. tFBA minimizes the violation of this assumption. A detailed comparison of different methods is shown in Table 1 Table 1. A comparison between different constraint-based methods integrating gene expression data Åkesson E-flux GIMME iMAT tFBA MADE PROM LBFBA Directly integrated gene expression into flux bound Yes Yes No No No No Yes Yes Maximized agreement or minimized violation between flux and gene expression No No Yes Yes Yes Yes No No Needs flux data to parameterize constraints No No No No No No No Yes Compared flux predictions to measured intracellular fluxes Yes (4a) No No No No No No Yes (37a) Number of experimental conditions used 1 1 1 1 9 4 907 28b Åkesson E-flux GIMME iMAT tFBA MADE PROM LBFBA Directly integrated gene expression into flux bound Yes Yes No No No No Yes Yes Maximized agreement or minimized violation between flux and gene expression No No Yes Yes Yes Yes No No Needs flux data to parameterize constraints No No No No No No No Yes Compared flux predictions to measured intracellular fluxes Yes (4a) No No No No No No Yes (37a) Number of experimental conditions used 1 1 1 1 9 4 907 28b a The number of fluxes that were compared. b Sensitivity analysis showed that 4 or 5 conditions in the training dataset were sufficient. Note: These methods are Åkesson (Åkesson et al., 2004), E-flux (Colijn et al., 2009), GIMME (Becker and Palsson, 2008), Imat (Shlomi et al., 2008), tFBA (van Berlo et al., 2011), MADE (Jensen and Papin, 2011), PROM (Chandrasekaran and Price, 2010) and LBFBA. Table 1. A comparison between different constraint-based methods integrating gene expression data Åkesson E-flux GIMME iMAT tFBA MADE PROM LBFBA Directly integrated gene expression into flux bound Yes Yes No No No No Yes Yes Maximized agreement or minimized violation between flux and gene expression No No Yes Yes Yes Yes No No Needs flux data to parameterize constraints No No No No No No No Yes Compared flux predictions to measured intracellular fluxes Yes (4a) No No No No No No Yes (37a) Number of experimental conditions used 1 1 1 1 9 4 907 28b Åkesson E-flux GIMME iMAT tFBA MADE PROM LBFBA Directly integrated gene expression into flux bound Yes Yes No No No No Yes Yes Maximized agreement or minimized violation between flux and gene expression No No Yes Yes Yes Yes No No Needs flux data to parameterize constraints No No No No No No No Yes Compared flux predictions to measured intracellular fluxes Yes (4a) No No No No No No Yes (37a) Number of experimental conditions used 1 1 1 1 9 4 907 28b a The number of fluxes that were compared. b Sensitivity analysis showed that 4 or 5 conditions in the training dataset were sufficient. Note: These methods are Åkesson (Åkesson et al., 2004), E-flux (Colijn et al., 2009), GIMME (Becker and Palsson, 2008), Imat (Shlomi et al., 2008), tFBA (van Berlo et al., 2011), MADE (Jensen and Papin, 2011), PROM (Chandrasekaran and Price, 2010) and LBFBA. . Machado and Herrgård (Machado and Herrgård, 2014) compared flux predictions made by several of these expression-based methods and parsimonious flux balance analysis (pFBA), which does not use expression data and instead maximizes biomass yield and minimizes total flux through the network (Lewis et al., 2010). They found that pFBA predictions were as good as or better than those obtained using existing methods that integrated transcriptomics/proteomics data. In this paper, we developed a CBM method called Linear Bound FBA (LBFBA) that uses expression data to more accurately predict intracellular fluxes than pFBA. The method adds expression-derived flux bounds for a subset of fluxes based on the gene/protein expression data. These upper and lower flux bounds are linear functions of the transcriptomics/proteomics data and are reaction-specific. Instead of applying a single bound function to all reactions, LBFBA derives reaction-specific bounds for each reaction, which are learned from the transcriptomics (or proteomics) and fluxomics data in a training dataset. LBFBA requires measured transcriptomics or proteomics data and fluxomics data to first parameterize the flux bound functions and requires only transcriptomics or proteomics data to predict the flux distribution in a new condition. LBFBA was applied to Escherichia coli and Saccharomyces cerevisiae datasets and the flux distributions predicted by LBFBA were found to be more accurate than those fluxes predicted by pFBA. 2 Materials and methods 2.1 Overview of pFBA CBM is a powerful tool to predict cellular phenotypes and flux distributions. The basic formulation of a constraint-based model involves different constraints and an objective function. Flux balance analysis (FBA) is one of the CBM methods often used to predict a flux distribution which maximizes biomass yield. pFBA uses the sum of the absolute value of the fluxes as an objective function [Equation (1)] and can be formulated as: min∑j∈Reaction|vj| (1)s.t. ∑j∈ReactionSij·vj=0 ∀i∈Metabolite (2) LBj ≤ vj ≤ UBj ∀j∈Reaction (3) vj ≥ 0 ∀j∈Irreversible Reaction (4) vj=vjls ∀j∈Extracellular Reaction (5) vbiomass=vmeasured_biomass (6)Equation (2) is the steady-state mass balance constraint, meaning there is no accumulation for each metabolite in the cell. S denotes the stoichiometric matrix where Sij is the stoichiometric coefficient of metabolite i for reaction j . vj is the flux through reaction j . Equation (3) is the enzyme capacity constraint, which imposes an upper bound ( UBj ) and lower bound ( LBj ) for each reaction (which is typically −1000 and 1000 mmol/gDW/h, respectively). Equation (4) ensures that fluxes through irreversible reactions are non-negative. Equation (5) fixes the extracellular flux values to the best-estimates of the extracellular fluxes ( vjls ) obtained from a least squares fit between the metabolic model and extracellular flux measurements (see Supplementary Methods for details). Equation (6) fixes the biomass flux (i.e. growth rate) to the measured value. By solving pFBA, the flux distribution under a specific condition can be predicted. 2.2 Mathematical formulation of LBFBA In LBFBA, gene or protein expression data are used to further tighten the upper and lower bounds for individual fluxes. LBFBA is formulated as the following optimization problem: min∑j∈Reaction|vj|+β·∑j∈Rexpαj (7)s.t. ∑j∈ReactionSij·vj=0 ∀i∈Metabolite (8) LBj ≤ vj ≤ UBj ∀j∈Reaction (9) vj ≥ 0 ∀j∈Irreversible Reaction (10) vj=vjls ∀j∈Extracellular Reaction (11) vbiomass=vmeasured_biomass (12) vglucose·(ajgj+cj)−αj ≤ vj ≤ vglucose·(ajgj+bj)+αj ∀j∈Rexp (13) αj ≥ 0 ∀j∈Rexp (14) LBFBA includes the same mass balance, directionality, and capacity constraints as pFBA [Equations (8)–(12)]. Assuming the flux bounds are linearly correlated with gene or protein expression level, we formulated expression-based constraints [Equation (13)] for the subset of reactions ( Rexp ) with measured flux and expression values in a training dataset. In this work, Rexp contained 37 reactions for E.coli based on a mutant multi-omics dataset (Ishii et al., 2007) and 33 reactions for a yeast aerobicity multi-omics dataset (Jouhten et al., 2008; Rintala et al., 2009; Wiebe et al., 2008;). In Equation (13), gj is the expression level for reaction j. The gene-to-protein-to-reaction (GPR) association is used to calculate gj from the gene or protein expression levels for genes associated with reaction j. If one reaction can be catalyzed by multiple isoenzymes, then gj is calculated as the sum of the expression of these isoenzymes. For GPRs involving subunits, the expression is the minimum of expression across all subunits. In Equation (13), aj , bj and cj are parameters, which are estimated from expression and flux measurements from independent training datasets [using Equations (15) and (16)]. Given expression levels for each reaction ( gj ) and fixed parameters aj , and cj , the upper and lower flux bounds can be calculated for reactions in Rexp [using Equation (13)]. αj is a nonnegative slack variable which is introduced to avoid infeasible flux bounds [Equation (14)] and LBFBA tries to make these slack variables as small as possible. Therefore, these slack variables are weighted by β and penalized in the objective function. Unless noted otherwise, β was set to 10 (see Supplementary Fig. S1 for sensitivity analysis on the value of β ). LBFBA attempts to minimize the sum of absolute values of all the fluxes with the smallest total flux bound violations of the expression-based constraints [Equation (7)]. When using different multi-omics datasets as training and test sets, z-scores for the gj values were used instead along with minor modifications to the LBFBA constraints (see Supplementary Methods for details). Both pFBA and LBFBA were implemented for a genome-scale metabolic reconstruction of E.coli iJO1366 (Orth et al., 2011) and S.cerevisiae iMM904 (Mo et al., 2009). The following reactions were deleted in all iJO1366 simulations: CAT, FHL, SPODM, SPODMpp, DHPTDNR, DHPTDNRN, SUCASPtpp, SUCFUMtpp, SUCMALtpp and SUCTARTtpp. All optimization problems were solved using CPLEX in GAMS optimization environment (GAMS Development Corporation, Washington, DC, USA) and CVX (Grant and Boyd, 2014; Grant and Boyd, 2008), a package specifying and solving convex programs in MATLAB. 2.3 LBFBA expression parameterization The goal for the LBFBA parameter estimation is to find two parallel lines, which correspond to the upper and lower flux bounds, for each individual flux in Rexp (Fig. 1). Therefore, the slope ( aj ) and intercepts for the two lines ( bj and cj ) need to be calculated for each reaction using a training dataset, which is independent from the testing dataset. To obtain these parameters ( aj , bj , and cj ), a linear program was formulated separately for each reaction j in Rexp : min(bj−cj) (15)s.t. aj·gjk+cj ≤ vjkvglucosek ≤ aj·gjk+bj ∀k∈K (16) We used the mutant multi-omics dataset for E.coli and S.cerevisiae multi-omics dataset (see below) containing transcriptomics, proteomics and fluxomics to estimate the parameters used in Equation (13). As before gj is the expression level for reaction j . The flux for reaction j in condition k is defined as vjk , and the relative flux is found by dividing it by vglucosek (glucose uptake rate). gjk, vjk, vglucosek were obtained from the multi-omics dataset. The set K includes all the different conditions used from the multi-omics dataset to parameterize the flux bounds. All the points must be lay between the two lines defined with slope aj and intercept bj , cj . The vertical distance between the two lines, represented as bj−cj should be as small as possible so that there are tight bounds for each reaction flux. This parameter estimation algorithm requires at least three pairs of expression level and flux measurements (i.e. K ≥ 3 ). We included the points with 0 expression and 0 flux (which occurs for mutants without isozymes) in the parameter estimation since the flux must be zero if there is no enzyme available. Fig. 1. View largeDownload slide Relative flux through the phosphoglucose isomerase (PGI) reaction versus gene expression (b4025). The x-axis represents the mRNA expression value for b4025, which encodes Pgi. The y-axis represents the relative flux, which is the flux through the PGI reaction divided by the glucose uptake rate. Given 29 pairs of b4025 gene expression and PGI flux measurements in the E.coli mutant multi-omics dataset, 28 data points are plotted when performing a leave-one-out cross-validation to predict fluxes in the test condition. The two parallel lines shown are used to bound the PGI flux given a gene expression value for b4025. The slope (a) and intercepts (b and c) for the two lines can be obtained using linear optimization [Equations (15) and (16)] Fig. 1. View largeDownload slide Relative flux through the phosphoglucose isomerase (PGI) reaction versus gene expression (b4025). The x-axis represents the mRNA expression value for b4025, which encodes Pgi. The y-axis represents the relative flux, which is the flux through the PGI reaction divided by the glucose uptake rate. Given 29 pairs of b4025 gene expression and PGI flux measurements in the E.coli mutant multi-omics dataset, 28 data points are plotted when performing a leave-one-out cross-validation to predict fluxes in the test condition. The two parallel lines shown are used to bound the PGI flux given a gene expression value for b4025. The slope (a) and intercepts (b and c) for the two lines can be obtained using linear optimization [Equations (15) and (16)] 2.4 Normalized error calculation The normalized errors were used to compare how close the predicted flux distribution was to the measured flux distribution and was calculated as: ∑j∈measured|vjp,k−vjk|/∑j∈measured|vjk| (17) vjp,k is the predicted flux through reaction j in condition k and is the solution from pFBA [Equations (1)–(6)] or LBFBA [Equations (7)–(14)]. vjk is the measured flux in condition k through reaction j in the dataset. The normalized errors can vary due to alternative optimal solutions. Minimum and maximum possible normalized errors for a particular flux prediction method and condition were calculated (see Supplementary Methods for details). 3 Results 3.1 Application of LBFBA to the E.coli mutant multi-omics dataset To evaluate the accuracy of LBFBA predictions, we applied LBFBA to the E.coli mutant multi-omics dataset (Ishii et al., 2007) and performed leave-one-out (LOOCV) and 10-fold cross-validation. In the E.coli mutant multi-omics dataset, transcriptomics, proteomics and fluxomics were measured under 29 glucose aerobic conditions. Conditions 1–24 are knockout mutants with growth rates of 0.2 h−1. Conditions 25–29 are wild-type strains with growth rates of 0.1, 0.4, 0.5, 0.7, 0.2 h−1. In LOOCV, one condition was selected as the test set each time and the other 28 conditions were used as the training set to obtain the aj , bj and cj values. The normalized errors were calculated and compared to normalized errors from pFBA (Fig. 2). LBFBA was used with transcriptomics or proteomics data. For each condition, the best possible normalized error (Best case) was also calculated by making reaction fluxes as close as possible to their measured value (see Supplementary Methods for details). Among the 29 conditions, except for conditions 4 (E.coli Δpgi knockout at a growth rate of 0.2 h−1) and 26 (wild-type E.coli at a growth rate of 0.4 h−1), LBFBA with transcriptomics outperformed pFBA in terms of accuracy (Wilcoxon signed-rank test, P-value = 3.1e − 5). For example, in condition 28 (wild-type E.coli at a growth rate of 0.7 h−1) pFBA falsely predicted gluconate production, which diverts carbon from glycolysis and thus predicted lower fluxes in glycolysis and the TCA cycle. LBFBA with proteomics generally outperformed pFBA (except for conditions 4 and 5) and LBFBA with transcriptomics in terms of accuracy. We also performed a 10-fold cross-validation and found similar results (Supplementary Fig. S2). Fig. 2. View largeDownload slide Simulation result for LBFBA compared with pFBA for the E.coli dataset. The x-axis represents the 29 conditions in the multi-omics dataset. Each condition has measured transcriptomics, proteomics and fluxomics data. The y-axis represents the normalized error, which is defined as sum of the absolute error for each measured reaction divided by the sum of the absolute value of experimental measurements [Equation (17)]. For each condition, data from the 28 other conditions were used to parameterize flux bounds before LBFBA was performed on the test condition (i.e. leave-one-out-cross-validation). For most cases, pFBA (hollow diamond) has a higher normalized error than LBFBA integrating transcriptomics data (star) or proteomics data (solid triangle). Integrating proteomics data were more accurate than integrating transcriptomics for most conditions. The best case (hollow circle) represents the lowest possible error achieved by fitting constraint-based models to the measured fluxes. All the normalized errors predicted by pFBA or LBFBA were unique (i.e. fluxes through reactions in Rexp had no variability across alternative pFBA or LBFBA optimal solutions) Fig. 2. View largeDownload slide Simulation result for LBFBA compared with pFBA for the E.coli dataset. The x-axis represents the 29 conditions in the multi-omics dataset. Each condition has measured transcriptomics, proteomics and fluxomics data. The y-axis represents the normalized error, which is defined as sum of the absolute error for each measured reaction divided by the sum of the absolute value of experimental measurements [Equation (17)]. For each condition, data from the 28 other conditions were used to parameterize flux bounds before LBFBA was performed on the test condition (i.e. leave-one-out-cross-validation). For most cases, pFBA (hollow diamond) has a higher normalized error than LBFBA integrating transcriptomics data (star) or proteomics data (solid triangle). Integrating proteomics data were more accurate than integrating transcriptomics for most conditions. The best case (hollow circle) represents the lowest possible error achieved by fitting constraint-based models to the measured fluxes. All the normalized errors predicted by pFBA or LBFBA were unique (i.e. fluxes through reactions in Rexp had no variability across alternative pFBA or LBFBA optimal solutions) To determine the optimal training set size, a sensitivity analysis with respect to training set size for LBFBA with transcriptomics was performed (Fig. 3). When the training set size was 0, pFBA was run for each of the 29 conditions. The average normalized error for pFBA across the 29 conditions was around 40%. The training set size ( K ) in LBFBA was then varied from 3 to 28. When the training set size was K , K conditions were randomly selected from the 29 conditions as the training set to estimate the parameters (slope aj and intercept bj, cj for the set of measured fluxes), and then LBFBA was run on the testing set, containing the other (29- K ) conditions. The average normalized error for a given training set of size K was calculated by averaging the normalized errors for the (29- K ) independent conditions in the testing set. For each training set size except zero, this process was repeated 100 times. Each box plot in Figure 3 represents results from 100 randomly chosen testing datasets. The median of these 100 average normalized errors is represented as the horizontal line in each box plot. As shown in Figure 3, the best training set size was 4 or 5. When the training set size was larger than seven, the median average normalized error started to increase. If a suitable training set size (4 or 5) was chosen, the median of the 100 average normalized errors was around 20%. Compared to pFBA, where the average normalized error was around 40%, LBFBA decreased the average normalized error by more than half. The parameter values aj , bj and cj for all 37 reactions, which led to the lowest normalized error for K=3 are listed in Supplementary Table S1. Fig. 3. View largeDownload slide Sensitivity analysis of the training set size using transcriptomics data. The x-axis represents the number of instances (conditions) in the training set. The y-axis represents average normalized error for the test set. The average normalized error for pFBA across the 29 conditions is around 40% (corresponding to a training set size of zero). Each Tukey box plot was created using the boxplot function in MATLAB. The horizontal line represents the median value. The top and bottom edges of the boxes indicate the 75th and 25th percentiles and the plus signs indicate the outliers (which are 1.5 times the interquartile range away from the top/bottom edge of the boxes, where the interquartile range is the difference between 75th and 25th percentile values). The whiskers indicate the furthest values away from the boxes that are not considered outliers Fig. 3. View largeDownload slide Sensitivity analysis of the training set size using transcriptomics data. The x-axis represents the number of instances (conditions) in the training set. The y-axis represents average normalized error for the test set. The average normalized error for pFBA across the 29 conditions is around 40% (corresponding to a training set size of zero). Each Tukey box plot was created using the boxplot function in MATLAB. The horizontal line represents the median value. The top and bottom edges of the boxes indicate the 75th and 25th percentiles and the plus signs indicate the outliers (which are 1.5 times the interquartile range away from the top/bottom edge of the boxes, where the interquartile range is the difference between 75th and 25th percentile values). The whiskers indicate the furthest values away from the boxes that are not considered outliers 3.2 Application of LBFBA to E.coli carbon source multi-omics datasets To evaluate the accuracy of LBFBA predictions on a fully independent test set, we used the E.coli mutant multi-omics dataset (Ishii et al., 2007) to parameterize the flux bounds (values listed in Supplementary Table S2). Flux distributions for another E.coli carbon source multi-omics dataset (Gerosa et al., 2015) were predicted given transcriptomics data. In Gerosa et al., transcriptomics and fluxomics data were measured for wild-type E.coli under eight different carbon sources (acetate, fructose, galactose, glucose, glycerol, gluconate, pyruvate and succinate). Normalized errors for LBFBA compared with pFBA were found (Fig. 4). LBFBA slightly outperforms pFBA in terms of accuracy for conditions 2, 3, 7 (fructose, galactose, pyruvate as a carbon source, respectively). pFBA slightly outperforms LBFBA for conditions 1, 4, 5, 6 and 8 (acetate, glucose, glycerol, succinate as a carbon source, respectively). Unlike the mutant E.coli multi-omics dataset, the normalized errors between pFBA and LBFBA for this carbon source multi-omics dataset was not significantly different (Wilcoxon signed-rank test, P-value = 0.25). This is mainly because of the large differences between flux distributions in the training set (where glucose was the carbon source) and the test set (using eight different carbon sources). Condition 6 (where gluconate was the carbon source) does not fit the E.coli metabolic model (iJO1366) well, as the normalized errors for the best possible fit (Best case) to the measured E.coli fluxes was high. Fig. 4. View largeDownload slide Simulation result for LBFBA compared with pFBA for Gerosa E.coli dataset. The x-axis represents the eight conditions in the Gerosa multi-omics dataset. Each condition has measured transcriptomics and fluxomics data. The y-axis represents the normalized error, which is defined as sum of the absolute error for each measured reaction divided by the sum of the absolute value of experimental measurements [Equation (17)]. For each condition, data from the 29 conditions in the E.coli mutant multi-omics dataset by Ishii et al. was used to parameterize flux bounds before LBFBA was performed on the eight conditions in Gerosa E.coli dataset. The z-scores for expression values were used instead of the expression values directly (see Supplementary Methods for details). The best case (hollow circle) represents the lowest possible error achieved by fitting constraint-based models to the measured fluxes. All the normalized errors predicted by pFBA or LBFBA were unique Fig. 4. View largeDownload slide Simulation result for LBFBA compared with pFBA for Gerosa E.coli dataset. The x-axis represents the eight conditions in the Gerosa multi-omics dataset. Each condition has measured transcriptomics and fluxomics data. The y-axis represents the normalized error, which is defined as sum of the absolute error for each measured reaction divided by the sum of the absolute value of experimental measurements [Equation (17)]. For each condition, data from the 29 conditions in the E.coli mutant multi-omics dataset by Ishii et al. was used to parameterize flux bounds before LBFBA was performed on the eight conditions in Gerosa E.coli dataset. The z-scores for expression values were used instead of the expression values directly (see Supplementary Methods for details). The best case (hollow circle) represents the lowest possible error achieved by fitting constraint-based models to the measured fluxes. All the normalized errors predicted by pFBA or LBFBA were unique 3.3 Application of LBFBA to a yeast aerobicity multi-omics dataset A yeast aerobicity multi-omics dataset (Jouhten et al., 2008; Rintala et al., 2009; Wiebe et al., 2008) was also used for the evaluating LBFBA predictions using a LOOCV. In the yeast multi-omics dataset, transcriptomics, proteomics and fluxomics were measured under five conditions with different oxygen composition in the inlet gas. From condition 1–5, the oxygen composition in the inlet gas increased from 0 (fully anaerobic) to 20.9% (fully aerobic). Following the same protocol as in Section 3.1, the normalized errors for LBFBA compared with pFBA were found (Fig. 5). Each error bar represents the maximum and minimum normalized error that the method can achieve for a given condition due to alternative optimal solutions (see Supplementary Methods for details on how these error bars were calculated). LBFBA improved the yeast flux predictions and had a smaller error range for each condition, indicating a smaller flux variability across alternative solutions. Unlike the E.coli mutant multi-omics datasets, LBFBA had similar performance using proteomics or transcriptomics data. The normalized errors for the best possible fit to the measured yeast fluxes was high for conditions 2–5, indicating there are significant inconsistencies between the yeast metabolic model (iMM904) and measured fluxes. Fig. 5. View largeDownload slide Simulation result for LBFBA compared with pFBA for a yeast dataset. The x-axis represents the five conditions with different oxygen composition. Each condition has measured transcriptomics, proteomics and fluxomics data. The y-axis represents the normalized error, which is define as sum of the absolute error for each measured reaction divided by the sum of the absolute value of experimental measurements [Equation (17)]. The error bar for each normalized error was generated by minimizing and maximizing the normalized error while fixing the LBFBA objective. For all 5 conditions, pFBA (hollow diamond) had a higher normalized error than LBFBA integrating transcriptomics data (star) and proteomics data (solid triangle). The error bar for LBFBA was also smaller than pFBA. The best case (hollow circle) represents the lowest possible error achievable by the iMM904 constraint-based model Fig. 5. View largeDownload slide Simulation result for LBFBA compared with pFBA for a yeast dataset. The x-axis represents the five conditions with different oxygen composition. Each condition has measured transcriptomics, proteomics and fluxomics data. The y-axis represents the normalized error, which is define as sum of the absolute error for each measured reaction divided by the sum of the absolute value of experimental measurements [Equation (17)]. The error bar for each normalized error was generated by minimizing and maximizing the normalized error while fixing the LBFBA objective. For all 5 conditions, pFBA (hollow diamond) had a higher normalized error than LBFBA integrating transcriptomics data (star) and proteomics data (solid triangle). The error bar for LBFBA was also smaller than pFBA. The best case (hollow circle) represents the lowest possible error achievable by the iMM904 constraint-based model 3.4 Impact of reaction-specific flux bounds LBFBA uses reaction-specific bounds to limit the flux instead of using the same bound for every flux like pFBA or using a global assumption for all reactions (e.g. highly expressed reactions have high flux). This is the key difference between LBFBA and other methods like MADE (Jensen and Papin, 2011), GIMME (Becker and Palsson, 2008), IMAT (Shlomi et al., 2008) and tFBA (van Berlo et al., 2011). To evaluate whether LBFBA performs better due to these reaction-specific bounds, we instead applied common linear equations for flux bounds for every reaction in Rexp and predicted the flux distribution using LBFBA. These new generic linear bounds were parameterized by analyzing 1036 pairs of transcriptomics and fluxomic measurements from the E.coli mutant multi-omics dataset, which included 37 measured fluxes under 28 conditions (Fig. 6 shows data from all 29 conditions) to predict fluxes for an independent test condition. The normalized errors predicted by LBFBA with these generic linear bounds (i.e. not reaction-specific) using LOOCV were about the same as the normalized errors from pFBA (average normalized error across 29 conditions was 39.81% and 39.84% for LBFBA and pFBA, respectively) because these generic linear bounds were too loose. Fig. 6. View largeDownload slide Generic flux bounds generated from the relative flux and gene expression for all 37 reactions in the E.coli multi-omics dataset. The x-axis represents the gene expression value and the y-axis represents the flux normalized by the carbon uptake rate for the corresponding condition. Shown are all 1073 pairs of transcriptomics levels and relative fluxes for the E.coli multi-omics dataset, which included measurements for 37 fluxes under 29 conditions. The two parallel linear lines are used to bound the 37 fluxes in LBFBA Fig. 6. View largeDownload slide Generic flux bounds generated from the relative flux and gene expression for all 37 reactions in the E.coli multi-omics dataset. The x-axis represents the gene expression value and the y-axis represents the flux normalized by the carbon uptake rate for the corresponding condition. Shown are all 1073 pairs of transcriptomics levels and relative fluxes for the E.coli multi-omics dataset, which included measurements for 37 fluxes under 29 conditions. The two parallel linear lines are used to bound the 37 fluxes in LBFBA 4 Discussion In this paper, we describe a new method, LBFBA, which predicts flux distributions when transcriptomics/proteomics data are available along with flux values for a training dataset. The method integrates the transcriptomics/proteomics data by setting the flux bounds for a subset of reactions using reaction-specific linear functions based on their corresponding expression levels. LBFBA can predict the flux distribution more accurately than pFBA. With the E.coli mutant dataset, when using four sets of experimental measurements of transcriptomics and fluxomics, LBFBA can achieve a normalized error which is around half the normalized errors of pFBA. LBFBA can also reduce the flux variability across alternate solutions, as shown in the yeast example. LBFBA has several advantages compared to other expression-based methods: (i) LBFBA does not assume that higher expression will lead to higher corresponding metabolic fluxes since the slopes of the linear flux bound functions can take negative values. Literature has shown that it is not always true that higher expression will lead to higher corresponding metabolic fluxes (Kim et al., 2013). (ii) LBFBA is a simple linear programing CBM method and thus easy to solve. (iii) LBFBA only requires four or five sets of experimental measurements of transcriptomics/proteomics and fluxomics for parameterizing the flux bounds. Larger training set sizes may worsen the predictions because the flux bounds obtained from the training set may become too large due to data outliers. Therefore, the solution space does not shrink a lot and thus the LBFBA predictions become worse. Additionally, training with more datasets increases the likelihood that conditions are included where fluxes are controlled at the metabolic level instead of the expression level. In these instances, expression is not a good indicator of flux and could negatively impact LBFBA predictions. (iv) LBFBA was not sensitive to which specific conditions were used in the training of a given size (Supplementary Fig. S3) except for condition 4 (E.coli Δpgi knockout at a growth rate of 0.2 h−1, see Supplementary Figure S4 for explanation why including condition 4 in the training set worsens LBFBA predictions). (v) LBFBA works for both eukaryotes and prokaryotes, and for both wild-type and knockout strains, as shown in the E.coli example. LBFBA can integrate proteomics or transcriptomics and both lead to better flux predictions than pFBA. However, using proteomics data does show more accurate flux predictions than using transcriptomics data for most test cases. In the E.coli example, using proteomics data improved flux predictions for 24 out of the 29 conditions. In the yeast example, no significant difference was found between integrating transcriptomics or proteomics data. Therefore, it is generally recommended to integrate proteomics, if available. LBFBA uses linear functions to constrain a subset of metabolic fluxes to a range of values that are more physiologically realistic, rather than try and fit expression and flux data to a single linear function. Since there is often weak correlation between flux and expression (see Supplementary Table S3 for correlation coefficients for the E.coli mutant dataset), using flux bounds allows for weak correlations between flux and expression to still be accounted for by reducing the solution space. In this work linear bounds were used that required only three parameters be estimated for each reaction. Other types of functions could potentially be used; however, this could increase the number of parameters and training datasets required. The flux bounds in LBFBA are sensitive to the datasets used to parameterize them (Fig. 3). However, even with weak correlations between flux and expression the LBFBA flux predictions are still robust (since average normalized errors for half of the testing datasets differed by <0.05, Fig. 3). Some conditions when included in the training dataset decreased LBFBA accuracy (Supplementary Fig. S3). For example, including data for the Pgi mutant in the training dataset (i.e. condition 4 from the E.coli mutant multi-omics dataset) tended to worsen LBFBA flux predictions for other conditions. It would be beneficial to identify these outliers and eliminate them from the training datasets a priori; however, we were unable to find a way to identify them. For example, including data from the Pgi mutant into the training set changed the estimated a, b and c values, but the magnitude of these changes was not larger than adding other single conditions to the same training set. Additionally, measured fluxes for the Pgi mutant were not the most different compared to measured fluxes in other conditions, as determined by the mean squared distance between flux distributions (Long and Reed, 2017). While removing outliers could improve LBFBA predictions, these outliers can also capture more extreme cellular behaviors, so removing them could make predictions of other extreme behaviors worse since similar conditions were not included in the training datasets. Higher expression does not necessarily result in higher flux through the corresponding reaction. As shown in Supplementary Table S1, nearly half of the slopes are negative, which demonstrates either a weak or negative correlation between gene expression and flux levels. This is because there are multiple layers of enzyme control between gene expression and the corresponding flux (Hoppe, 2012). Hackett et al. showed that changes in flux across nutrient conditions are predominantly due to metabolite levels, and not enzyme levels (Hackett et al., 2016). This observation could explain why LBFBA performance was not significantly better compared to pFBA for the E.coli carbon source multi-omics dataset. Among the top six reactions (RPE, G6PDH2r, PGL, GND, PYK, PGI, RPI) with the most negative slope, five belong to the Pentose Phosphate Pathway (RPE, G6PDH2r, PGL, GND, RPI). As the gene expression increases, the flux bounds are shifting down. Entry of glucose 6-phosphate either into glycolysis via PGI or into the pentose phosphate pathway via G6PDH2r is largely determined by the relative concentrations of NADP+ and NADPH (Lehninger et al., 2008). Other reactions had slopes that were close to zero, indicating that expression does not have much of an impact on flux. This includes the D-lactate dehydrogenase (LDH_D), acetate kinase (ACKr) and phosphate acetyltransferase (PTAr) reactions. These results could indicate that for these reactions flux is controlled at the metabolite level rather than the gene expression level. LBFBA needs training data to parameterize the flux bounds. If transcriptomics/proteomics and fluxomics data for the same condition are available, LBFBA can be used. However, these datasets are usually hard to obtain due to the high cost of fluxomics measurements. Also, the fluxomics measurements used in this paper are limited to central metabolism. So LBFBA only compares the flux predictions in the central metabolism with those from pFBA, even though both methods can predict the genome-scale flux distributions. As more genome-scale flux measurements become available, flux bounds for more reactions can be included in LBFBA and comparisons can be made between predicted and measured fluxes beyond central metabolism. Acknowledgements Thanks to Caroline Mitchell for help editing the manuscript. Thanks to Joonhoon Kim, Prashant Kumar and Matthew Long for helpful discussions. Funding This work was funded in part by the Office of Science (BER), U.S. Department of Energy (DE-SC0008103), the U.S. Department of Energy Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-FC02‐07ER64494), and by grant U19AI106772, provided by the National Institute of Allergy and Infectious Diseases, National Institutes of Health. Conflict of Interest: none declared. References Åkesson M. , et al. ( 2004 ) Integration of gene expression data into genome-scale metabolic models . Metab. Eng. , 6 , 285 – 293 . Google Scholar Crossref Search ADS PubMed Becker S.A. , Palsson B.O. ( 2008 ) Context-specific metabolic networks are consistent with experiments . PLoS Comput. Biol. , 4 , e1000082 . Google Scholar Crossref Search ADS PubMed van Berlo R.J.P. , et al. ( 2011 ) Predicting metabolic fluxes using gene expression differences as constraints . IEEE/ACM Trans. Comput. Biol. Bioinform. , 8 , 206 – 216. Google Scholar Crossref Search ADS PubMed Bosi E. , et al. ( 2016 ) Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicity . Proc. Natl. Acad. Sci. USA , 113 , E3801 – E3809. Google Scholar Crossref Search ADS Burgard A.P. , et al. ( 2003 ) Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization . Biotechnol. Bioeng. , 84 , 647 – 657. Google Scholar Crossref Search ADS PubMed Chandrasekaran S. , Price N.D. ( 2010 ) Probabilistic integrative modeling of genome-scale metabolic and regulatory networks in Escherichia coli and Mycobacterium tuberculosis . Proc. Natl. Acad. Sci. USA , 107 , 17845 – 17850 . Google Scholar Crossref Search ADS Chavali A.K. , et al. ( 2012 ) A metabolic network approach for the identification and prioritization of antimicrobial drug targets . Trends Microbiol. , 20 , 113 – 123. Google Scholar Crossref Search ADS PubMed Colijn C. , et al. ( 2009 ) Interpreting expression data with metabolic flux models: predicting Mycobacterium tuberculosis mycolic acid production . PLoS Comput. Biol. , 5 , e1000489 . Google Scholar Crossref Search ADS PubMed Cotten C. , Reed J.L. ( 2013 ) Constraint-based strain design using continuous modifications (CosMos) of flux bounds finds new strategies for metabolic engineering . Biotechnol. J. , 8 , 595 – 604 . Google Scholar Crossref Search ADS PubMed Gerosa L. , et al. ( 2015 ) Pseudo-transition analysis identifies the key regulators of dynamic metabolic adaptations from steady-state data . Cell Syst. , 1 , 270 – 282. Google Scholar Crossref Search ADS PubMed Grant M. , Boyd S. ( 2008 ) Graph implementations for nonsmooth convex programs . In: Blondel V. , et al. (eds), Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences . Springer Limited , Berlin , pp. 95 – 110 . Grant M. , Boyd S. ( 2014 ) CVX: Matlab Software for Disciplined Convex Programming, Version 2.1 . http://cvxr.com/cvx. Hackett S.R. , et al. ( 2016 ) Systems-level analysis of mechanisms regulating yeast metabolic flux . Science , 354 , aaf2786 . Google Scholar Crossref Search ADS PubMed Hamilton J.J. , Reed J.L. ( 2012 ) Identification of functional differences in metabolic networks using comparative genomics and constraint-based models . PLoS One , 7 , e34670 . Google Scholar Crossref Search ADS PubMed Hoppe A. ( 2012 ) What mRNA abundances can tell us about metabolism . Metabolites , 2 , 614 – 631 . Google Scholar Crossref Search ADS PubMed Ishii N. , et al. ( 2007 ) Multiple high-throughput analyses monitor the response of E. coli to perturbations . Science , 316 , 593 – 597. Google Scholar Crossref Search ADS PubMed Jensen P.A. , Papin J.A. ( 2011 ) Functional integration of a metabolic network model and expression data without arbitrary thresholding . Bioinformatics , 27 , 541 – 547 . Google Scholar Crossref Search ADS PubMed Jouhten P. , et al. ( 2008 ) Oxygen dependence of metabolic fluxes and energy generation of Saccharomyces cerevisiae CEN.PK113‐1A . BMC Syst. Biol. , 2 , 60 . Google Scholar Crossref Search ADS PubMed Kim H.U. , et al. ( 2013 ) Flux-coupled genes and their use in metabolic flux analysis . Biotechnol. J. , 8 , 1035 – 1042 . Google Scholar Crossref Search ADS PubMed Kim H.U. , et al. ( 2010 ) Genome-scale metabolic network analysis and drug targeting of multi-drug resistant pathogen Acinetobacter baumannii AYE . Mol. Biosyst. , 6 , 339 – 348. Google Scholar Crossref Search ADS PubMed Kim J. , et al. ( 2011 ) Large-scale bi-level strain design approaches and mixed-integer programming solution techniques . PLoS One , 6 , e24162 . Google Scholar Crossref Search ADS PubMed Lee D.-S. , et al. ( 2009 ) Comparative genome-scale metabolic reconstruction and flux balance analysis of multiple Staphylococcus aureus genomes identify novel antimicrobial drug targets . J. Bacteriol. , 191 , 4015 – 4024. Google Scholar Crossref Search ADS PubMed Lehninger A.L. , et al. ( 2008 ) Lehninger Principles of Biochemistry . W.H. Freeman , New York . Lewis N.E. , et al. ( 2010 ) Omic data from evolved E. coli are consistent with computed optimal growth from genome-scale models . Mol. Syst. Biol. , 6 , 390 . Google Scholar Crossref Search ADS PubMed Long M.R. , Reed J.L. ( 2017 ) Improving flux predictions by integrating data from multiple strains . Bioinformatics , 33 , 893 – 900 . Google Scholar PubMed Machado D. , Herrgård M. ( 2014 ) Systematic evaluation of methods for integration of transcriptomic data into constraint-based models of metabolism . PLoS Comput. Biol. , 10 , e1003580 . Google Scholar Crossref Search ADS PubMed Magnúsdóttir S. , et al. ( 2017 ) Generation of genome-scale metabolic reconstructions for 773 members of the human gut microbiota . Nat. Biotechnol. , 35 , 81 – 89. Google Scholar Crossref Search ADS PubMed Mo M.L. , et al. ( 2009 ) Connecting extracellular metabolomic measurements to intracellular flux states in yeast . BMC Syst. Biol , 3 , 37 . Google Scholar Crossref Search ADS PubMed Nuccio S.-P. , Bäumler A.J. ( 2014 ) Comparative analysis of Salmonella genomes identifies a metabolic network for escalating growth in the inflamed gut . MBio , 5 , e00929 – 14 . Google Scholar Crossref Search ADS PubMed Orth J.D. , et al. ( 2011 ) A comprehensive genome-scale reconstruction of Escherichia coli metabolism–2011 . Mol. Syst. Biol. , 7 , 535 . Google Scholar Crossref Search ADS PubMed Rintala E. , et al. ( 2009 ) Low oxygen levels as a trigger for enhancement of respiratory metabolism in Saccharomyces cerevisiae . BMC Genomics , 10 , 461 . Google Scholar Crossref Search ADS PubMed Shlomi T. , et al. ( 2008 ) Network-based prediction of human tissue-specific metabolism . Nat. Biotechnol. , 26 , 1003 – 1010 . Google Scholar Crossref Search ADS PubMed Tervo C.J. , Reed J.L. ( 2012 ) FOCAL: an experimental design tool for systematizing metabolic discoveries and model development . Genome Biol. , 13 , R116 . Google Scholar Crossref Search ADS PubMed Wiebe M.G. , et al. ( 2008 ) Central carbon metabolism of Saccharomyces cerevisiae in anaerobic, oxygen-limited and fully aerobic steady-state conditions and following a shift to anaerobic conditions . FEMS Yeast Res. , 8 , 140 – 154 . Google Scholar Crossref Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Bioinformatics – Oxford University Press
Published: Nov 15, 2018
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.