Abstract Motivation Identification of complex relationships among members of microbial communities is key to understand and control the microbiota. Co-exclusion is arguably one of the most important patterns reflecting micro-organisms’ intolerance to each other’s presence. Knowing these relations opens an opportunity to manipulate microbiotas, personalize anti-microbial and probiotic treatments as well as guide microbiota transplantation. The co-exclusion pattern however, cannot be appropriately described by a linear function nor its strength be estimated using covariance or (negative) Pearson and Spearman correlation coefficients. This manuscript proposes a way to quantify the strength and evaluate the statistical significance of co-exclusion patterns between two, three or more variables describing a microbiota and allows one to extend analysis beyond micro-organism abundance by including other microbiome associated measurements such as, pH, temperature etc., as well as estimate the expected numbers of false positive co-exclusion patterns in a co-exclusion network. Results The implemented computational pipeline (CoEx) tested against 2380 microbial profiles (samples) from The Human Microbiome Project resulted in body-site specific pairwise co-exclusion patterns. Availability and implementation C++ source code for calculation of the score and P-value for two, three and four dimensional co-exclusion patterns as well as source code and executable files for the CoEx pipeline are available at https://scsb.utmb.edu/labgroups/fofanov/co-exclusion_in_microbial_communities.asp. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Every life-supporting environment is colonized by microbes. Understanding the complex relationships among micro-organisms opens new possibilities in controlling microbial communities (MC) and has a variety of applications from improving human health to reducing negative effects of microbial activities in industrial settings, such as micro-biologically influenced corrosion, product spoilage and formation of harmful biofilms. Over the last decade, High Throughput Sequencing technology has become a key component of a large variety of microbiome studies. Since whole genome sequencing remains relatively expensive, and requires extensive computational resources to perform analysis on data which may consist of up to 95% host DNA, many of the recent studies have been focused on marker gene sequencing approach where selected genes (usually highly conserved such as 16S/18S rRNA or COI) are amplified, sequenced and used to identify micro-organism profiles of microbiota samples (Srivathsan et al., 2015). The first steps of the analysis of this data include quality assessment of sequencing reads, clustering, taxonomic annotation and normalization resulting in relative abundance profiles which may contain thousands of organisms identified across hundreds of samples. The downstream analysis of these profiles typically focuses on estimation of microbial diversity of individual samples and/or search for microbes exhibiting distinctive abundance patterns across specific groups of samples such as micro-organisms highly abundant in one group and absent in another. This approach has already resulted in discoveries of micro-organisms potentially associated with colorectal cancer (Ahn et al., 2013; Flanagan et al., 2014; Kostic et al., 2012; Sun and Kato, 2016; Tjalsma et al., 2012), early gastric cancer (Lee et al., 2016), prostate cancer (Perry and Lambert, 2011) and cystic fibrosis (Zhao et al., 2014). Many recent studies (Fisher and Mehta, 2014; He et al., 2017; Mainali et al., 2017) have been dedicated to the identification of competitive, symbiotic and commensal (Reshef et al., 2011) relationships between members of MC, focusing on identification of micro-organisms for which pairwise abundances can be modeled using linear, exponential, or periodic (for time series) functions. Since manual search for even the simplest of patterns [such as Pearson or Spearman correlation (Fieller et al., 1957; Pearson, 1895)] in microbial abundance profiles is virtually impossible (the number of pairs of micro-organism profiles to be inspected is proportional to the square of the number of micro-organisms under consideration nn-12 , where n is the number of micro-organisms in the dataset), various algorithms and computational pipelines (CoExs) have been developed to perform such tasks automatically. For example, several CoExs employ Pearson or Spearman correlation to identify the most significant positively and negatively correlated micro-organism profiles (Barberán et al., 2012; Cardinale et al., 2015; Goodrich et al., 2014; Jackson et al., 2018; Maruyama et al., 2014; Wang et al., 2016). SparCC (Friedman and Alm, 2012) improves on correlation search by correcting undesirable effects of compositional data. Software packages such as CoNet (Faust et al., 2012) employ several similarity measures such as Kullback–Leibler divergence (Kullback, 1959), Pearson correlation (Pearson, 1895) and Spearman correlation (Fieller et al., 1957), as well as mutual information (MI; Cover and Thomas, 2005) to identify the highest scoring pairwise relationships. Among these methods, only MI and maximal information coefficient (MIC; Reshef et al., 2011) algorithms are capable of identifying non-linear and non-functional [impossible to represent in y = f(x) form] pairwise relations by using MI and maximum MI scores [see Weiss et al. (2017) for the review and performance comparison]. These two methods, however, lack the ability to discriminate against intuitively difficult to interpret patterns (Fig. 1a and b) and can miss some important relationships. For example, mutual exclusion among micro-organisms (Fig. 1c–f), produce low MI and MIC scores and as result such patterns will not appear on the top of the list of significant relationships identified and reported by the MIC pipeline (Reshef et al., 2011). It is also obvious, that complex relationships among members of MCs can include two, three and more organisms. To date however, no methods for identification of such multi-dimensional microbial appearance/abundance patterns have been adopted by the scientific community. In theory, MI and MIC approaches can be extended to identify patterns which include three, four, or more organisms but the computational complexity of such a task renders these methods impractical. Fig. 1. View largeDownload slide Non-random patterns (associations) including two and three organism profiles. (a, b)––Intuitively difficult to interpret patterns with high MIC score; (c)—Co-exclusion pattern involving three organisms; (d, e, f)—Pairwise co-exclusion between three organisms involved in the 3D co-exclusion pattern. Values in the figure represent MI; MIC; Pearson correlation ( ρ ); co-exclusion coefficient (c) Fig. 1. View largeDownload slide Non-random patterns (associations) including two and three organism profiles. (a, b)––Intuitively difficult to interpret patterns with high MIC score; (c)—Co-exclusion pattern involving three organisms; (d, e, f)—Pairwise co-exclusion between three organisms involved in the 3D co-exclusion pattern. Values in the figure represent MI; MIC; Pearson correlation ( ρ ); co-exclusion coefficient (c) Co-exclusion is arguably one of the most important patterns to be identified in MCs. Knowing which micro-organisms are unable to tolerate each other’s presence or can replace one another in the community opens an opportunity to manipulate and control the microbiota, guide microbiota transplantation and personalize the choice of micro-organisms for probiotic treatments. Examples of recently discovered co-exclusion relationships include Wolbachia sp. co-exclusion with Asaia sp.(Hughes et al., 2014) and Flaviviruses (Dengue, Zika; Bian et al., 2010; Schnettler et al., 2016); and co-exclusion between Lactobacillus spp. and Fusobacteria spp. (Sobhani et al., 2013), which are associated with elevated risk of colorectal cancer, may open new opportunities in cancer prevention. It is important to emphasize however, that mutual exclusion/avoidance pattern is not anti-correlation (negative Pearson or Spearman correlation). The nature of co-exclusion does not depend on the abundance of given species but (in an ideal case) has to be based only on the fact that one micro-organism can be present (in any abundance) only if the other micro-organism is absent (and vise-versa). This type of relationship cannot be described in terms of a functional relationship (mathematically represented as a function) and requires a different equation type of representation which in the two dimensional (2D) case can be defined as: Xa Xb=0 , where Xa and Xb are abundances of two organisms (a and b). The goal of the presented study is to introduce a quantified definition of the strength and statistical significance of multi-dimensional co-exclusion patterns between variables describing MCs as well as estimate the correspondence between statistical significance threshold of individual co-exclusion patterns and expected numbers of false positive co-exclusion patterns in a co-exclusion network. 2 Materials and methods 2.1 Goodness of fit for 2D co-exclusion patterns The coefficient of determination ( R2 ) is traditionally used to quantify the goodness of fit of experimental observation data points {yi;xi} i=1,…,N to the given functional ( y=f(x) ) model. In general terms, it reflects how much better the given model fits the experimental data in comparison with a base model and is mathematically defined as the ratio between the deviation of experimental observations from the model and the deviation from the ‘base model’ which is usually chosen as the average ( y=y- ): R2≡1-SSresidualSStotal=1-∑i=1Nf(xi)-yi2∑i=1Nyi-y-2 (1) Using the same principles and assuming that the co-exclusion pattern is described by the equation: xy=0 , the deviation of each experimental observation from the co-exclusion model can be defined as a geometric fit, which in this particular case is the shortest distance of the data point to the closest axis and SSresidual can be defined as: SSresidual=∑i=1Nmin(xi2,yi2) Since the deviation from the co-exclusion model is calculated using distances from both (x and y) axis, the base model also must include averages of both variables: { x=x-;y=y-} which allows definition of SStotal as: SStotal=∑i=1Nxi-x-2 +yi-y-2 Thus, the goodness of fit into the co-exclusion pattern co-exclusion coefficient (c) can be introduced as: c≡R2=1-∑i=1Nminxi2,yi2∑i=1Nxi-x-2 +yi-y-2 (2a) It is important to mention however, that in contrast with the goodness of fit for functional relationships (Formula 1) the value of the co-exclusion coefficient (Formula 2a) can be affected by linear transformation of one of the variables. In fact, its value can be artificially ‘improved’ by multiplying all the values associated to one of the variables by a very small coefficient (α). In order to make the co-exclusion coefficient independent of linear transformations (which also allows calculation of its value for variables measured in different scales such as bacterial abundance versus temperature, salinity, or pH), the linear transformation coefficient (α) must be included in the definition of co-exclusion coefficient: c≡R2=1-maxα∑i=1Nminαxi2,yi2∑i=1Nαxi-αx-2+yi-y-2 (2b) In this case (Formula 2b), the co-exclusion coefficient is defined as the lowest (worst) co-exclusion value across all possible values of the linear transformation coefficients (Fig. 2a and b). Fig. 2. View largeDownload slide Effect of linear transformation on the goodness of fit for 2D (a, b) and 3D Type-1 (c, d) and 3D Type-2 (e, f) co-exclusion. Figures b, d and f show the effect of linear transformation on the SSresidualSStotal ratio (Formulas 2b, 3a and 3b) Fig. 2. View largeDownload slide Effect of linear transformation on the goodness of fit for 2D (a, b) and 3D Type-1 (c, d) and 3D Type-2 (e, f) co-exclusion. Figures b, d and f show the effect of linear transformation on the SSresidualSStotal ratio (Formulas 2b, 3a and 3b) 2.2 Statistical significance of the co-exclusion pattern and the data properties influencing it Regardless of how strong the value of the co-exclusion coefficient is, certain properties of the data can significantly affect its ability to reflect what is intuitively considered as a mutually exclusive relationship. The most striking example of such effects would be if the co-exclusion coefficient is calculated between two variables out of which one is always zero. In this case, all the observations will be located on one axis resulting in a ‘perfect’ co-exclusion score. Another case where false but high scoring co-exclusion patterns can be observed simply by chance is when both variables have only a few non-zero values across a large number of observations and therefore, the data become dominated by a large number of zero-pairs (both variables equal to zero) which perfectly fit in the co-exclusion pattern (Fig. 2a). In fact, in zero-pair rich datasets, high scoring but false co-exclusion patterns can appear between pairs of independent variables due to the low probability of both variables having non-zero values in same sample (observation). Since no assumption regarding the distributions of the values of the variables can be made, the statistical significance (such as P-value) of the observed co-exclusion pattern can be estimated using a bootstrapping approach. In this case, P-value can be calculated as a ratio of the number of the simulated bootstrap attempts in which the co-exclusion coefficient values are better or equal to the co-exclusion score of the original data over the total number of bootstrap attempts. For all the results presented in this manuscript, we have used up to 100 000 bootstraps for each pair of variables. It is important to emphasize however, that in order to improve the computational performance, several additional criteria have been used to discontinue the bootstrapping iterations when it becomes clear that the expected P-value will be higher than the pre-defined threshold. To date, several authors have raised concerns over the significant effects of presence of zero-pairs (observations where both variables have zero or very low values) on results of correlation analysis between OTU profiles (Kaul et al., 2017; Weiss et al., 2017; Xu et al., 2015). To explore these effects on co-exclusion patterns, we have performed several computational experiments using simulated data. For four sample sizes (20, 40, 80 and 160), we have randomly generated 10 datasets with perfect co-exclusion patterns where all non-zero values were generated randomly in the 0.2–0.9 interval and were paired with zero values for the corresponding variable. The proportion of non-zero values on each axis was equal (50–50%). During the sequential steps of the simulation 0%, 10%, 20%, … and 90% of the data points were replaced by zero-pairs. Not surprisingly (Fig. 3a and b), the P-values of the observed co-avoidance pattern improve when number of samples increases, while increasing the proportion of zero-pairs has the opposite effect. This observation highlights the connection between the number of samples and proportion of zero-pairs with the expected co-exclusion pattern’s P-value. For example, for sample size 80 and 50% of zero-pairs, the P-value of the perfect co-avoidance pattern is expected to be equal or less than 10−5. Fig. 3. View largeDownload slide (a) The effects of zero-pairs and sample size on the P-value on the ‘perfect’ co-avoidance pattern; (c) the effects disproportional presence of no-zero values between two variables zero-pairs on the ‘perfect’ co-avoidance pattern; (b) and (d), same data in log scale; (e) and (f), Effects of zero-pairs and noise on the co-exclusion ( ssresidualsstotal ) score Fig. 3. View largeDownload slide (a) The effects of zero-pairs and sample size on the P-value on the ‘perfect’ co-avoidance pattern; (c) the effects disproportional presence of no-zero values between two variables zero-pairs on the ‘perfect’ co-avoidance pattern; (b) and (d), same data in log scale; (e) and (f), Effects of zero-pairs and noise on the co-exclusion ( ssresidualsstotal ) score The disproportional presence of non-zero values between two variables can also affect the P-value of a perfect co-exclusion pattern. The simulation results on Figure 3c and d show that while such an effect is minimal for 25–75%, it becomes stronger for 10–90% ratio, especially for cases with high number of zero-pairs. Random noise and measurement errors can also affect P-values of perfect co-exclusion patterns. It is important to mention however that introduction of low level (<10%) noise (Fig. 3e and f) affects the value of the co-exclusion coefficient, but not its significance (P-value). 2.3 Goodness of fit for multi-dimensional co-exclusion patterns There are two different ways how co-exclusion can be defined in the case of three variables (3D co-exclusion). The Type-1 co-exclusion characterizes the situation when each organism avoids every other organisms in a group (Fig. 2c and d). In this case, the co-exclusion pattern is represented by three axes and geometric distance for each experimental observation is calculated as the distance to the closest axis and the co-exclusion model has to be represented as: xyz=0 AND [(x≠0 AND y=0 AND z=0) OR (y≠0 AND x=0 AND z=0) OR (z≠0 AND x=0 AND y=0)] Similar to Formula 2b, to exclude the dependence of the co-exclusion score from linear transformations, 3D Type-1 co-exclusion definition will need to include two transformation coefficients (α and β). The Type-1 3D co-exclusion score can be defined as: c≡R2=1-maxα,β∑i=1Nmindxi,dyi,dzi∑i=1Nαxi-αx-2 +βyi-βy-2+zi-z-2 (3a) Where dxi=βyi2+zi2, dyi= αxi2+zi2 and dzi=αxi2+βyi2, are the distances to the x, y, z and t axes, respectively. Not surprisingly, to express Type-1 3D co-exclusion, pairwise co-exclusion patterns have to be present between each pair of variables involved in the 3D pattern. This allows one to establish simple exclusion rules and reduce the computational complexity of the search for multi-dimensional co-exclusion patterns from O(m3) to O(m2) , where m is number of variables considered in the pattern: Rule 1. Two variables exhibiting 2D co-exclusion pattern with zero or small number of observations where both organisms are absent in the samples cannot be part of any higher dimensional co-exclusion pattern. Rule 2. Only variables exhibiting all three pairwise co-exclusion patterns simultaneously can form a 3D Type-1 co-exclusion pattern. Another (Type-2) 3D co-exclusion pattern could be defined as the case when each pair of organisms could co-exist, but all three cannot be present simultaneously (Fig. 2e–g). In this case, the co-exclusion model can be represented as three axis planes: x = 0 (yz plane), y = 0 (xz plane) and z = 0 (xy plane) and the co-exclusion coefficient can be calculated using minimal distance from these planes and the co-exclusion model has to be represented as: xyz=0 AND [x=0 AND y≠0 AND z≠0 OR y=0 AND x≠0 AND z≠0 OR z=0 AND x≠0 AND y≠0] The Type-2 3D co-exclusion score can be defined as: c≡R2=1-maxα,β∑i=1Nmindxyi, dxzi, dyzi∑i=1Nαxi-αx-2+βyi-βy-2+zi-z-2 (3b) Interestingly, in this definition Type-2 co-exclusion pattern will show high co-exclusion score if variables are exhibiting Type-1 co-exclusion, but not vise-versa. A simple (but less effective) exclusion criterion can be used to avoid unnecessary calculations to identify if any three variables can be involved in Type-2 3D co-exclusion: every two variables must have a significant number of zero-pairs: observations where one or both organisms are absent. Where dxyi= zi2, dxzi=βyi2 and dyzi=αxi2 , are the distances to the xy,xz and yz planes,respectively. Co-exclusion patterns can be defined for higher number of variables (dimensions). For example, co-exclusion in four dimensions (4D) can be calculated using distances from axes (Formula 4a), axis planes (Formula 4b), as well as three dimensional (3D) sub-spaces (Formula 4c) where t stands for the fourth dimension (fourth variable): c≡R2=1-maxα,β,γ∑i=1Nmin dxi,dyi,dzi,dti∑i=1Nαxi-αx-2 +βyi-βy-2+γzi-γz-2+(ti-t-)2 (4a) Where dxi=βyi2+γzi2+ti2, dyi=αxi2+γzi2+ti2, dzi=αxi2+βyi2+ti2, and dti=αxi2+βyi2+(γzi)2, are the distances to the x, y, z and t axes, respectively. c≡R2=1-maxα,β,γ∑i=1Nmindxyi,dxzi,dxti,dyzi,dyti,dzti∑i=1Nαxi-αx-2 +βyi-βy-2+γzi-γz-2+(ti-t-)2 (4b) where dxyi=(γzi)2+ti2, dxzi=βyi2+ti2,dxti=βyi2+(γzi)2, dyzi=αxi2+ti2,dyti=αxi2+ (γzi)2, and dzti=αxi2+βyi2, are the distances to the xy, xz, xt, yz, yt and zt planes, respectively. c≡R2=1-maxα,β,γ∑i=1Nmindxyzi,dxyti,dxzti,dyzti∑i=1Nαxi-αx-2 +βyi-βy-2+γzi-γz-2+(ti-t-)2 (4c) Where dxyzi=ti2,dxyti=γzi2,dxzti=βyi2, and dyzti=αxi2, are the distances to the xyz, xyt, xzt, and yzt 3D, subspaces, respectively. While Formulas 2b, 3a, b and 4a–c can provide a numerical value for the strength of the co-exclusion, the above described bootstrapping approach can be used to estimate its confidence. 2.4 Co-exclusion network: type I/II errors and expected fraction of false positive pairs Considering the large number of combinations of two, three and more variables tested independently in search for co-exclusion patterns, a substantial number of statistically significant patterns can be observed simply by chance (multiple hypothesis testing problem). Assuming that the P-values of individual co-exclusion patterns reflect the probability of these patterns to appear by chance (false positive), one can use a bootstrapping approach to identify the correspondence between P-value thresholds and expected number/fraction of encountered false positive co-exclusion patterns. In this approach, the identification of co-exclusion patterns in the original data is supplemented by several iterations of identification of co-exclusion patterns in randomized datasets (Null Model) generated by shuffling and re-normalization of the original dataset. Comparison of the distributions of the P-values in the original and P-values of simulated data averaged across all iterations (Fig. 4b, e and h) allows one to estimate the expected fraction of the false positive edges (Type I errors) in the co-exclusion network (Fig. 4c, f and i) as a function of the P-value threshold. In all the microbiota datasets presented in Section 3, we have generated 20 randomized (shuffled) datasets. All the networks (Fig. 4 and Supplementary Material, Section 2) were generated using 0.05 threshold for the expected fraction of false positive pairs (5% false positive rate). Fig. 4. View largeDownload slide Examples of co-exclusion networks and P-value distributions using Stool, Posterior Fornix and Supragingival Plaque datasets from the HMP. The co-exclusion networks (a, d and g) are created using 5% cutoff for the allowed fraction of false positive pairs (F). Strong co-exclusion ( SSresidualSStotal ≤0.01 ) is indicated in bright red edges with a transition to blue as the co-exclusion score approaches 0.2. Node colors reflect different taxonomy assignments at Phylum level and node sizes are proportional to the number of co-exclusion relationships. The cumulative distributions of the P-values for pairs in observed (blue) and simulated (red) networks (b, e and h) are shown with a green dashed line which represents the P-value threshold (*P) used to create the networks. The expected fraction of the false positive edges (Type I errors) in the co-exclusion network as a function of the pairwise P-value threshold is shown in the 3rd column (c, f and i) Fig. 4. View largeDownload slide Examples of co-exclusion networks and P-value distributions using Stool, Posterior Fornix and Supragingival Plaque datasets from the HMP. The co-exclusion networks (a, d and g) are created using 5% cutoff for the allowed fraction of false positive pairs (F). Strong co-exclusion ( SSresidualSStotal ≤0.01 ) is indicated in bright red edges with a transition to blue as the co-exclusion score approaches 0.2. Node colors reflect different taxonomy assignments at Phylum level and node sizes are proportional to the number of co-exclusion relationships. The cumulative distributions of the P-values for pairs in observed (blue) and simulated (red) networks (b, e and h) are shown with a green dashed line which represents the P-value threshold (*P) used to create the networks. The expected fraction of the false positive edges (Type I errors) in the co-exclusion network as a function of the pairwise P-value threshold is shown in the 3rd column (c, f and i) The estimation of the Type II error (false negative) in reconstruction of any network describing microbial interactions is quite challenging. In general, it requires the presence of a model describing the ground-truth of the relationship between variables using artificial assumptions. For example, in the Friedman and Alm manuscript (Friedman and Alm, 2012), the false negative edges in the network created using SparCC algorithm were defined as edges present in the SparCC generated network but absent in the network created using Pearson correlation. It is important to emphasize that in the presented co-exclusion network reconstruction process, the selection of any cutoff for pairwise co-exclusion P-values results in the rejection of some number of ‘perfectly fine’ co-exclusion patterns which cannot be estimated, at least not without including additional assumptions/model(s) in consideration. 2.5 Implementation The C++ source code for the calculations of the score and P-values for 2, 3 and 4D co-exclusion patterns as well as source code and executable for the CoEx for the reconstruction of 2D co-exclusion networks is available at https://scsb.utmb.edu/labgroups/fofanov/co-exclusion_in_microbial_communities.asp). The presented CoEx also performs bootstrapping simulations for (a) each pairwise co-exclusion pattern and (b) the resulting network in a single run. Each run of the pipeline results in two tables (text files): one containing information (such as co-exclusion score and P-value) on co-exclusion patterns between pairs of OTUs within the original data and one containing the simulated network P-value distribution results (more details are available in the Supplementary Material). The pipeline source code can be compiled using g++ for Linux or Microsoft Visual Studio C++ compilers. The software is single-threaded and requires minimum 8 GB memory. The run time can vary depending on the resulting number of pairwise relationships as intensive bootstrapping must be completed for each reported pair. For example, the exhaustive analysis of 4851 OTU pairs for identification of statistically significant co-exclusion patterns using 160 stool samples involving 99 OTUs shown in Figure 4 took less than ∼18 min to complete on an Intel® Core™ i7-3630QM CPU @ 2.40 GHz ASUS laptop with 12.0 GB installed RAM (100 000 trials per pair and 20 network replicates). The co-exclusion network (Fig. 4) has been visualized using Cytoscape 3.4.0 (Shannon et al., 2003) visualization platform using the yFiles Organic network layout algorithm. MIC calculations (Fig. 1) have been performed in MATLAB (www.mathworks.com/products/matlab) using minepy (Albanese et al., 2013). 2.6 Human microbiome project data The data used in the analysis is originated from The NIH Human Microbiome Project (Peterson et al., 2009) and contained 18 datasets associated with 16 body sites. Microbial profiles for 2910 samples have been downloaded from the project website as of December 2016 in text format (HMQCP–QIIME Community Profiling v13 OTU table). Samples representing significantly low (less than 2000) and significantly high (over 50 000) number of sequencing reads were excluded from the analysis. The microbial profiles of the remaining 2380 samples, vary from 67 for Posterior Fornix to 200 for Antecubital Fossa, have been normalized against the total number of reads in each sample and transformed into relative abundance profiles merged to Genus taxonomy level for each body site resulting in 619 profiles. Analysis has been performed for each body site individually. For each body site under consideration, micro-organism profiles present in less than 5% of samples have been excluded from the analysis. 3 Results 3.1 Co-exclusion networks reconstruction To evaluate ability of the proposed approach to identify co-exclusion patterns, we have applied the developed pipeline (CoEx) to the microbial profiles obtained from the Human Microbiome Project. The list of the identified co-exclusion patterns, results of the bootstrapping based false positive edge estimations, as well as the networks figures are available in the Supplementary Material. Interestingly, no correlation has been observed between the number of organisms present in the dataset and the number of detected co-exclusion patterns, or the number of organisms involved in co-exclusion relationships. The highest fraction of organisms involved in co-exclusion patterns were identified in samples associated with the oral cavity: Tongue Dorsum (77.78%), Palatine Tonsils (64.65%), Supragingival Plaque (62.22%), Sub-gingival Plaque (61.70%) and Throat (46.96%, see Supplementary Material for more details), which could suggest that this environment can support a large variety of different (organism profiles) microbiomes.In Saliva samples, a significantly smaller fraction of organisms involved in co-exclusion relationships was identified (20.35%) when compared to the Throat dataset. Both datasets consist of similar number of samples and organisms (130 samples and 113 OTUs and 140 samples and 115 OTUs for Saliva and Throat datasets, respectively). This could be due to the specific properties of Saliva (chemical/food/dead epitalia cell concentration, pH, etc.) and its wide spatial reach which can be seen as a temporary reservoir of detached members of the established microbiota present in the oral cavity. The lowest fraction of the organisms involved in co-exclusion was found in Anterior Nares (1.55%) followed by Retroauricular Crease (3.70%)––despite large number of organisms found in Anterior Nares samples (200, largest number across all 16 body sites). This result however is not surprising considering that most of the micro-organisms identified in these two sample types could be of environmental origin and do not represent the native ‘core’ microbiota of this body site. In comparison to the oral cavity samples, a slightly lower fraction of organisms has been found in vaginal samples: Posterior Fornix (14.93%), Vaginal Introitus (14.58%) and Mid Vagina (11.76%). It is also interesting to mention that not all the detected highly significant patterns appear to have a good co-exclusion score. For example, the weak co-exclusion coefficient score (0.192253) between Dialister (Genus) and Lactobacillus (Genus) in the Posterior Fornix network (Fig. 4d and Supplementary Material, Section 2.2) has been found to have a very low P-value (0.000289). 4 Discussion Co-exclusion is one of many non-random association patterns which can be detected in MCs. This pattern however, is key to understanding complex relations in MCs. Knowing which organisms are incompatible, or can simply replace each other in MC can guide the development of patient-specific probiotics and improve microbial transplantation strategies. However, it is important to keep in mind that strong and statistically significant co-exclusion does not necessarily reflect competition or inability of organisms to co-exist in the same MC. Artificial co-exclusion (especially on the OTU level) may appear as an artifact of the taxonomical annotation. Moreover, in some datasets, co-exclusion can be observed simply because microbial compositions originated from very different environments. For example, by analyzing posterior fornix and gut (stool) microbiota profiles together, one can observe co-exclusion pattern between organisms specific (uniquely present) in each of these two environments. Specifically, Clostridium (from family Erysipelotrichaceae), Sub-doligranulum and Akkermansia absent in posterior fornix samples and present in 81%, 78% and 52% of stool samples will exhibit co-exclusion patterns with Finegoldia, Ureaplasma and Herbaspirillum absent in stool samples and present in posterior fornix (39%, 31% and 38%). It is also necessary to keep in mind that a high co-exclusion score with low statistical confidence cannot be interpreted as absence of co-exclusion. It simply reflects a lack of statistical evidence (not enough data points) to confirm the presence of co-exclusion. Surprisingly, in some of the analyzed MCs several high confidence (based on P-value) co-exclusion patterns have been found to have relatively low co-exclusion scores. This observation makes us hypothesize that to a certain extent, the pairs of variables showing high confidence, but low co-exclusion scores may represent some type of co-presence, but we would like to keep this discussion outside the scope of this manuscript. The overall complexity of the pairwise co-exclusion networks can reflect the diversity of MCs which are able to populate a given environment. Traditional directed/undirected graph representation however, cannot be used for multi-dimensional (except maybe Type-1) co-exclusion patterns. The ways how these and many other types of multi-dimensional relations can be visualized require development of novel approaches. It is also important to keep in mind that the best way to characterize the functional activity of microbiomes would be observing complete profiles of present (DNA) and expressed (RNA) genes, which at this time is cost prohibitive. The presented approach, however, is expected to be applicable to gene profile data when available. Acknowledgement The authors would like to thank Iryna Pinchuk and Grant Hughes for the fruitful discussions and constructive comments. Funding LA, KK, GG and YF work was partially supported by the Sealy Center for Structural Biology and Molecular Biophysics and a pilot grant from the Institute for Human Infections and Immunity at the University of Texas Medical Branch. Conflict of Interest: none declared. References Ahn J. et al. ( 2013 ) Human gut microbiome and risk for colorectal cancer . J. Natl. Cancer Inst ., 105 , 1907 – 1911 . Google Scholar Crossref Search ADS PubMed Albanese D. et al. ( 2013 ) Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers . Bioinformatics , 29 , 407 – 408 . Google Scholar Crossref Search ADS PubMed Barberán A. et al. ( 2012 ) Using network analysis to explore co-occurrence patterns in soil microbial communities . ISME J ., 6 , 343 – 351 . Google Scholar Crossref Search ADS PubMed Bian G. et al. ( 2010 ) The endosymbiotic Bacterium wolbachia induces resistance to dengue virus in Aedes aegypti . PLoS Pathog ., 6 , e1000833 . Google Scholar Crossref Search ADS PubMed Cardinale M. et al. ( 2015 ) Bacterial networks and co-occurrence relationships in the lettuce root microbiota . Environ. Microbiol ., 17 , 239 – 252 . Google Scholar Crossref Search ADS PubMed Cover T.M. , Thomas J.A. ( 2005 ) Entropy, relative entropy, and mutual information . Elements of Information Theory , 2nd edn., pp. 13 – 55 . Faust K. et al. ( 2012 ) Microbial co-occurrence relationships in the Human Microbiome . PLoS Comput. Biol ., 8 , e1002606 . Google Scholar Crossref Search ADS PubMed Fieller E.C. et al. ( 1957 ) Tests for rank correlation coefficients . Biometrika , 44 , 470 – 481 . Google Scholar Crossref Search ADS Fisher C.K. , Mehta P. ( 2014 ) Identifying keystone species in the human gut microbiome from metagenomic timeseries using sparse linear regression . PLoS One , 9 , e102451. Google Scholar Crossref Search ADS PubMed Flanagan L. et al. ( 2014 ) Fusobacterium nucleatum associates with stages of colorectal neoplasia development, colorectal cancer and disease outcome . Eur. J. Clin. Microbiol. Infect. Dis ., 33 , 1381 – 1390 . Google Scholar Crossref Search ADS PubMed Friedman J. , Alm E.J. ( 2012 ) Inferring correlation networks from genomic survey data . PLoS Comput. Biol ., 8 , e1002687. Google Scholar Crossref Search ADS PubMed Goodrich J.K. et al. ( 2014 ) Human genetics shape the gut microbiome . Cell , 159 , 789 – 799 . Google Scholar Crossref Search ADS PubMed He S. et al. ( 2017 ) Ecological diversity and co-occurrence patterns of bacterial community through soil profile in response to long-term switchgrass cultivation . Sci. Rep ., 7, 3608 . Google Scholar Crossref Search ADS Hughes G.L. et al. ( 2014 ) Native microbiome impedes vertical transmission of Wolbachia in Anopheles mosquitoes . Proc. Natl. Acad. Sci ., 111 , 12498 – 12503 . Google Scholar Crossref Search ADS Jackson M.A. et al. ( 2018 ) Detection of stable community structures within gut microbiota co-occurrence networks from different human populations . Peer J ., 6 , e4303. Google Scholar Crossref Search ADS PubMed Kaul A. et al. ( 2017 ) Analysis of microbiome data in the presence of excess zeros . Front. Microbiol ., 8 , 2114 . Google Scholar Crossref Search ADS PubMed Kostic A.D. et al. ( 2012 ) Genomic analysis identifies association of Fusobacterium with colorectal carcinoma . Genome Res ., 22 , 292 – 298 . Google Scholar Crossref Search ADS PubMed Kullback S. ( 1959 ) Statistics and Information Theory . John Wiley & Sons, New York . Lee Y.-C. et al. ( 2016 ) Association between helicobacter pylori eradication and gastric cancer incidence: a systematic review and meta-analysis . Gastroenterology , 150 , 1113 – 1124.e5 . Google Scholar Crossref Search ADS PubMed Mainali K.P. et al. ( 2017 ) Statistical analysis of co-occurrence patterns in microbial presence-absence datasets . PLoS One , 12 , e0187132 . Google Scholar Crossref Search ADS PubMed Maruyama N. et al. ( 2014 ) Intraindividual variation in core microbiota in peri-implantitis and periodontitis . Sci. Rep ., 4 , 6602. Google Scholar Crossref Search ADS PubMed Pearson K. ( 1895 ) Note on regression and inheritance in the case of two parents . Proc. R. Soc. London , 58 , 240 – 242 . Google Scholar Crossref Search ADS Perry A. , Lambert P. ( 2011 ) Propionibacterium acnes: infection beyond the skin . Expert Rev. Anti-Nfect Thera , 9 , 1149 – 1156 . Google Scholar Crossref Search ADS Peterson J. et al. ( 2009 ) The NIH human microbiome project . Genome Res ., 19 , 2317 – 2323 . Google Scholar Crossref Search ADS PubMed Reshef D.N. et al. ( 2011 ) Detecting novel associations in large data sets . Science (80-.) , 334 , 1518 – 1524 . Google Scholar Crossref Search ADS Schnettler E. et al. ( 2016 ) Wolbachia restricts insect-specific flavivirus infection in Aedes aegypti cells . J. Gen. Virol ., 97 , 3024 – 3029 . Google Scholar Crossref Search ADS PubMed Shannon P. et al. ( 2003 ) Cytoscape: a software Environment for integrated models of biomolecular interaction networks . Genome Res ., 13 , 2498 – 2504 . Google Scholar Crossref Search ADS PubMed Sobhani I. et al. ( 2013 ) Microbial dysbiosis and colon carcinogenesis: could colon cancer be considered a bacteria-related disease? Therap. Adv. Gastroenterol ., 6 , 215 – 229 . Google Scholar Crossref Search ADS PubMed Srivathsan A. et al. ( 2015 ) Comparing the effectiveness of metagenomics and metabarcoding for diet analysis of a leaf-feeding monkey (Pygathrix nemaeus) . Mol. Ecol. Resour ., 15 , 250 – 261 . Google Scholar Crossref Search ADS PubMed Sun J. , Kato I. ( 2016 ) Gut microbiota, inflammation and colorectal cancer . Genes Dis ., 3 , 130 – 143 . Google Scholar Crossref Search ADS PubMed Tjalsma H. et al. ( 2012 ) A bacterial driver-passenger model for colorectal cancer: beyond the usual suspects . Nat. Publ. Gr ., 10 , 575 – 582 . Wang K. et al. ( 2016 ) Preliminary analysis of salivary microbiome and their potential roles in oral lichen planus . Sci. Rep ., 6 , 22943. Google Scholar Crossref Search ADS PubMed Weiss S. et al. ( 2017 ) Normalization and microbial differential abundance strategies depend upon data characteristics . Microbiome , 5 , 27 . Google Scholar Crossref Search ADS PubMed Xu L. et al. ( 2015 ) Assessment and selection of competing models for zero-inflated microbiome data . PLoS One , 10 , e0129606 . Google Scholar Crossref Search ADS PubMed Zhao J. et al. ( 2014 ) Modeling the impact of antibiotic exposure on human microbiota . Sci. Rep ., 4 , 4345. Google Scholar Crossref Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Bioinformatics – Oxford University Press
Published: Nov 1, 2018
It’s your single place to instantly
discover and read the research
that matters to you.
Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.
All for just $49/month
Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly
Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.
Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.
Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.
All the latest content is available, no embargo periods.
“Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”
Daniel C.
“Whoa! It’s like Spotify but for academic articles.”
@Phil_Robichaud
“I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”
@deepthiw
“My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”
@JoseServera
DeepDyve Freelancer | DeepDyve Pro | |
---|---|---|
Price | FREE | $49/month |
Save searches from | ||
Create lists to | ||
Export lists, citations | ||
Read DeepDyve articles | Abstract access only | Unlimited access to over |
20 pages / month | ||
PDF Discount | 20% off | |
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.
ok to continue