A computational framework for complex disease stratification from multiple large-scale datasets

Background: Multilevel data integration is becoming a major area of research in systems biology. Within this area, multi-‘omics datasets on complex diseases are becoming more readily available and there is a need to set standards and good practices for integrated analysis of biological, clinical and environmental data. We present a framework to plan and generate single and multi-‘omics signatures of disease states.

Methods: The framework is divided into four major steps: dataset subsetting, feature filtering, ‘omics-based clustering and biomarker identification.

Results: We illustrate the usefulness of this framework by identifying potential patient clusters based on integrated multi-‘omics signatures in a publicly available ovarian cystadenocarcinoma dataset. The analysis generated a higher number of stable and clinically relevant clusters than previously reported, and enabled the generation of predictive models of patient outcomes.

Conclusions: This framework will help health researchers plan and perform multi-‘omics big data analyses to generate hypotheses and make sense of their rich, diverse and ever growing datasets, to enable implementation of translational P4 medicine.

Keywords: Molecular signatures, ‘Omics data, Stratification, Systems medicine

* Correspondence: bdemeulder@eisbm.org; cauffray@eisbm.org
Bertrand De Meulder and Diane Lefaudeux contributed equally to this work.
European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007 Lyon, France
Full list of author information is available at the end of the article
© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
De Meulder et al. BMC Systems Biology (2018) 12:60

Background
Since the early days of medicine, practitioners have always combined their observations from patient examinations with their medical knowledge and experience to diagnose medical conditions and find treatments tailored to the patient [1]. Nowadays, this rationale includes the integration of molecular, clinical, imaging information and other data sources to inform diagnosis and prognosis [2] or in other words, personalised medicine.

Various data integration methods developed through systems biology and computer science are now available to researchers. These methods aim at bridging the gap between the vast amounts of data generated in an ever-cheaper way [3] and our understanding of biology reflecting the complexity of biological systems [4]. Promises of data integration are the reduced cost of clinical trials, better statistical power, more accurate hypothesis generation and ultimately, individualised and cheaper healthcare [2].

However, a lack of communication exists between the fields of clinical medicine and systems biology, bioinformatics and biostatistics, as suggested by the reluctance or distrust to recent developments of personalised medicine by the medical community [1, 5, 6]. To address this issue, we developed a computational/analysis framework that aims at facilitating communication between healthcare professionals, computational biologists and bioinformaticians.

Among several ways of integrating data across biological levels, one of the components is multi-omics data integration. The identification of molecular signatures has been a focus of the biology and bioinformatics communities for over three decades. Early studies focused on a small number of molecules, paving the way for larger studies, eventually supporting the emergence of the ‘omics’ concept in the late 1990s, starting with ‘genomics’ [7, 8]. Owing to both technical and biological advances, many classes of molecules have been studied by ‘omics technologies such as transcriptomics [9–11], proteomics [12, 13], lipidomics [14, 15], metabolomics (first mentioned in [16, 17]), the composition of the exhaled breath by breathomics (first mentioned in [18]) [19], and interactomics [20, 21], among others.

Consequently, bioinformatics tools have been developed to analyse this new wealth of biological data, as reviewed in [22]. The concept of systems biology was developed first in the 1960s [23, 24] to study biological organisms as complete and complex systems, integrating various sources of information (phenotypic data, molecular data, etc.) in combination with pathway/network analysis and mathematical modelling [25–33]. These systems approaches are highly suitable for the discovery of disease phenotypes (based on empirical recognition of observed characteristics) and so-called endotypes (capturing complex causative mechanisms in disease) [34]. The logical next step was to apply systems biology tools to improve clinical diagnosis, refine the endotypes leading to diseases, develop a comprehensive approach to the human body and assess an individual’s health in light of their ‘omics status. In this way the ‘systems medicine’ concept was born [35–41]. The systems medicine rationale is outlined in Fig. 1.

Any meaningful experiment relies on a robust, bias-controlled study design [42] using appropriate technologies, leading to the production of trustworthy quality-checked data. Data curation then aims at organising, annotating, integrating and preserving data from various sources for reuse and further integration. The next step is to identify relevant molecular features using statistical evidence. A tremendous and constantly growing number of methods is available for this purpose, making the process of method selection a crucial and challenging task. We provide some guidelines here but recommend that the reader turns to specialised reviews (such as [43]) for more insights on the relevance and appropriateness of individual methods. Once features are statistically selected, their annotation is required to interpret results and produce a single ‘omic signature. Annotation is a complex task that links identifiers from the technological platforms to existing entities (genes, peptides, metabolites, lipids, etc.) [44, 45]. If the data permit, information from several ‘omics platforms is integrated into multi-‘omics signatures. Single and multi-‘omics signatures ultimately serve to identify molecular mechanisms driving pathobiology.

Contextualisation of signatures with existing knowledge is now standard practice (e.g. ontology, enrichment and pathway analysis [46]), or performed with more advanced tools for data integration and visualisation such as a disease map [47]. Exploratory analysis using network-based information is valuable, with tools such as the STRING database [48], among many others. Hypotheses can then be formulated and tested in two ways: with external datasets and/or new experiments, or by modelling and knowledge representation (see review in [49] and disease map examples in [47, 50–52]). With the help of systems pharmacology (see [53]), outcomes of this whole exercise are enabling: (i) identification of new potential drug targets associated with newly identified patient clusters, (ii) elucidation of potential biomarkers for diagnosis, (iii) repurposing of existing drugs and, ultimately, (iv) changes in diagnostic processes and development of new drugs and treatments for disease management. The key step in the systems medicine process is pattern recognition, for which a robust and step-wise framework is required.

Definitions
Our article focuses on the identification of disease mechanisms through statistical analysis of raw data, annotation with up-to-date ontologies to generate fingerprints (biomarker signatures derived from data collected from a single technical platform), handprints (biomarker signatures derived from data collected within multiple technical platforms, either by fusion of multiple fingerprints or by direct integration of several data types) and interpretation on a pathway level to identify disease-driving mechanisms.

One way to better define the different endotypes is to generate molecular fingerprints (e.g. blood cell transcriptomics analysis yields genes differentially expressed between clinical populations [54]) and handprints (e.g. mRNA expression, DNA methylation and miRNA expression data fused to generate clusters of cancer patients [55]). The latter can be combined to study patients e.g. at the ‘blood biological compartment’ level, and linked with specific disease markers to better define the underlying biology, hence providing new avenues for therapy.

Despite the wealth of ‘omics analyses, little consensus exists on which statistical or bioinformatics methods to apply to each type of data set, nor on the ‘best’ integrative

Fig. 1 Outline of the Systems Medicine rationale. Represented in orange are the steps linked to quality data production, followed by curation in grey, identification of interesting features through statistical analysis in blue and hypothesis generation and their validation in green.
Modelling and knowledge representation methods can inform the hypotheses generated through statistical analysis or generate hypotheses on their own (in purple). Outputs of this exercise are represented in red: drug repurposing, new drugs and improved diagnostics, with the help of clinical trials.

methods for their combined analysis (although standards exist for some data types, see [22]). Here, we present a generic framework to perform statistical and bioinformatics analyses of ‘omics measurements, starting from raw data management to multi-platform data integration, pathway and network modelling, that has been adopted by the Innovative Medicines Initiative (IMI) U-BIOPRED Consortium (Unbiased BIOmarkers for the PREDiction of respiratory disease outcomes, http://www.ubiopred.eu) and extended in the eTRIKS Consortium (https://www.etriks.org/) to support a large number of national and European translational medicine projects. This article is not a review of the very large body of literature on relevant bioinformatics methods. Instead it describes generic steps in ‘omics data analysis to which many methods can be mapped, to help multidisciplinary teams comprising clinical experts, wet-lab researchers, bioinformaticians, biostatisticians and computational systems biologists share a common understanding and communicate effectively throughout the systems medicine process [56].

We illustrate our pragmatic approach to the design and implementation of the analysis pipeline through a handprint analysis using the TCGA Research Network (The Cancer Genome Atlas – http://cancergenome.nih.gov/) Ovarian serous cystadenocarcinoma (OV) dataset.

Data preparation: Quality control, correction for possible batch effects, missing data handling, and outlier detection
Quality Control (QC) comprises several important steps in data preparation. First, the platform-specific technical QC and normalisation are performed according to the standards of the respective fields of each particular technological platform.

Batch effects are a technical bias arising during study design and data production, due to variability in production platforms, staff, batches, reagent lots, etc. Their impact can be assessed using descriptive methods such as Principal Component Analysis (PCA) and graphical displays. Tools such as ComBat [57] and methodologies developed by van der Kloet [58] can be used to adjust for batch effects when necessary.

Missing data are features of all biological studies and arise for a variety of reasons. If the source of the missingness is unrelated to phenotype or biology, the missing data points can be classified as Missing Completely At Random (MCAR). Such missing values may be handled through imputation (to the mean, mode, mean of nearest neighbours, or by multiple imputation, etc.) or by simple deletion [59].

Additional non-random missing data may arise due to assay- or platform-specific performances. For example, the measurement of abundances can fall below the lower limit of detection or quantitation (LLQ) of the instrument. In such instances, imputation is generally applied. Common methods include imputation to zero, LLQ, LLQ/2, or LLQ/√2; extrapolation and maximum likelihood estimation (MLE) can also be used [59].

Particular difficulty occurs in the analysis of mass spectrometry data, when it is impossible to distinguish MCAR data points from those below the LLQ of the technique. The combined levels of missing data often far exceed 10%. For these, the process depicted in Fig. 2 is proposed.

Critical appraisal of the pattern of missingness is crucial. Where extensive imputation is applied, the robustness of imputation needs to be assessed by re-analysis, using a second imputation method, or by discarding the imputed values.

Outliers are expected in any biological/platform data. When these are clearly seen to arise due to technical artefacts (differences by many orders of magnitude, etc.), they should be discarded. Otherwise and in general, outlying values in biological data should be retained, flagged and subjected to statistical analysis.

When there is no community-wide consensus on a specific quality threshold for a particular biological data type, the research group generating the data applies quality filters on the basis of their knowledge and experience. Precise description of each data processing step should accompany each dataset to inform colleagues performing downstream analysis.

Methods
The framework concept
Several key generic steps in data analysis were identified and are highlighted in Fig. 3 below.

Step 1: Dataset subsetting
This first box of Fig. 3 comprises two major steps: 1) formulating the biological question to be addressed and 2) preparing the data.

Formulating the biological question
Several types of biological questions can be tackled, leading to different partitions of the dataset(s) to study. A partitioning scheme may rely on cohort definitions based on current state of the art, a specific biological question (e.g. comparing highly atopic to non-atopic severe asthmatics), or clustering results, obtained with clinical variables alone, distinct specific ‘omic or multi-‘omics clustering, etc.

Data preparation
Depending on the question formulated at the previous step, data are then subsetted when appropriate. Then, an additional outlier detection check, data transformation and normalisation step can be performed, with methods

Fig. 2 Process proposed for handling high levels of non-random missing data.
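The constant-substitution options listed above for values below the LLQ (zero, LLQ, LLQ/2, LLQ/√2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the use of `None` to mark a missing measurement, and the default method are all assumptions made for the example, and it presumes every missing entry is a below-LLQ value rather than MCAR.

```python
import math


def impute_below_llq(values, llq, method="llq/sqrt2"):
    """Replace missing entries (None) by a constant derived from the LLQ.

    Hypothetical helper: assumes every None is a below-LLQ measurement.
    """
    substitutes = {
        "zero": 0.0,
        "llq": float(llq),
        "llq/2": llq / 2.0,
        "llq/sqrt2": llq / math.sqrt(2.0),
    }
    if method not in substitutes:
        raise ValueError(f"unknown imputation method: {method}")
    fill = substitutes[method]
    return [fill if v is None else v for v in values]


def missing_fraction(values):
    """Fraction of missing entries, to compare against a threshold such as 10%."""
    return sum(v is None for v in values) / len(values)


# Illustrative values only (not taken from any dataset).
feature = [1.2, None, 0.9, 3.4, None]
print(missing_fraction(feature))                 # share of missing values
print(impute_below_llq(feature, llq=0.5))        # LLQ/sqrt(2) substitution
print(impute_below_llq(feature, 0.5, "llq/2"))   # second method for re-analysis
```

Running the downstream analysis on the outputs of two different methods, as in the last two calls, is one simple way to carry out the robustness re-analysis recommended in the text.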
If there are fewer than 10% missing values, data imputation is used, then tested for association (artificial associations might arise from the imputation process, which would then skew the analysis downstream) and submitted to a sensitivity analysis. If there are more than 10% missing values, we either collapse the feature/patient to a binary (presence/absence) scheme and run a χ² test for difference in detection rates, or explore several imputation methods with highly cautious interpretation.

Fig. 3 Overview of the framework. Starting from quality-checked and pre-processed ‘omics data, four key generic steps are highlighted: (a) dataset subsetting, including formulation of the biological question to be answered and data preparation, (b) feature filtering (optional step) where features that are uninformative in relation to the question can be removed, (c) ‘omics-based unsupervised clustering (optional step) aiming at finding groups of participants arising from the data structure using the (optionally filtered) features, and finally (d) biomarker identification, including feature selection by bioinformatics means and machine learning algorithms for prediction.

described above. In this step, the statistical power that the analyst can expect (or the effect size that can be expected to be discovered) can be investigated (for more details on the computation of statistical power in ‘omics data analysis, see [60]). A decision on whether to split the datasets into training and validation sets is also made at this point (see section 4, replication of findings).

Step 2: Feature filtering
Given the complexity and large amount of clinical and ‘omics data in a complex dataset, the number of features measured far exceeds the number of replicates, creating various statistical challenges, i.e. the ‘curse of dimensionality’ [61, 62]. Feature filtering (Fig. 3b) is therefore often used to select a subset of features relevant to the biological question studied, remove noise from the dataset and reduce the computing power and time needed [63–65].

Features can be filtered according to specific criteria, based for example on nominal p-values arising from comparison between groups. Indeed, several methods exist to perform feature filtering, based on mean expression values, p-values, fold changes, correlation values [66, 67], information content measures [68, 69], network-based metrics (connectivity, centrality [70, 71]) or using a non-linear machine learning algorithm [72]. We redirect the reader to the following reviews for more details [33, 73–75]. As this step might introduce bias into the downstream analyses, it is not always applied.

Step 3: ‘Omics-based clustering
Clustering analysis groups elements so that objects in the same group are more similar to each other than to those in other groups (Fig. 3c). All methods available rely on similarity or distance measures and a clustering algorithm [76–78]. The most classical clustering methods may be categorized as ‘partitioning’ (constructing k clusters) or ‘hierarchical’ (seeking to build a hierarchy of clusters). Hierarchical methods are in turn either agglomerative (each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy, ending in a single cluster) or divisive (all observations start in the same cluster and splits are performed recursively as one moves down the hierarchy, ending with clusters containing one single observation).

It is important to note that clustering techniques are descriptive in nature and will yield clusters, whether they represent reality or not [76].
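To make the agglomerative scheme described above concrete, here is a minimal sketch in plain Python. The choices of Euclidean distance and average linkage, the function names, and the toy coordinates are assumptions made for illustration; the framework itself does not prescribe a particular distance or linkage.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerative(points, k):
    """Average-linkage agglomerative clustering, stopped at k clusters.

    Returns a list of clusters, each a sorted list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # each observation starts alone
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average pairwise distance between members of the two clusters.
            total = sum(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            dist = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]                          # safe because a < b
    return [sorted(c) for c in clusters]


# Toy coordinates, purely illustrative.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(agglomerative(toy, k=2))
print(agglomerative(toy, k=3))
```

The second call illustrates the caveat in the text: asking for three clusters returns three clusters even though the toy data contain only two obvious groups, which is why stability assessment is needed before interpreting a partition.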
One way of finding out whether clusters represent reality is to assess their stability, with the consensus clustering approach [79] for example. Using different stable clustering algorithms on the same dataset and comparing them with the meta-clustering rationale [80] is a further step to assess whether clusters accurately and reproducibly represent the biological situation in the data.

When several ‘omics datasets on the same patients are available, a handprint analysis can be performed with the Similarity Network Fusion (SNF) method to derive a patient-wise multi-‘omics similarity matrix [55]. Other methods for data integration in the context of subtype discovery are available, such as iCluster [81], Multiple Dataset Integration [82], or Patient-Specific Data Fusion [83], further discussed in [84], or under development, for example by the European Stategra FP7 project (http://www.stategra.eu).

Step 4: Biomarker identification
Steps 1 to 3 aim at finding groups of patients to best describe the biological condition(s), with respect to the questions addressed. Step 4 aims at 1) finding the smallest set of molecular features whose difference in abundance between these patient groups (Fig. 3d) enables their distinction (biomarkers) and 2) building classification models through machine-learning techniques, some of which use both feature reduction and classification model building together. The outcome is a fingerprint or handprint, depending on the number of different ‘omics datasets included in the analysis.

Over-fitting and false-discovery rate control
As already mentioned, ‘omics technologies suffer from what is known as the ‘curse of dimensionality’, typically due to the large number of features (p) and low number of samples (n). As statistical methods were historically developed for a situation where the dimensions were n ≫ p instead of the p ≫ n situation, adjustments to the methods had to be made. The main issue in statistical analysis is the high type I error rate (false positives) in null hypothesis testing. Several ways of correcting for this have been developed, the most well-known and used being the Bonferroni correction and the Benjamini-Hochberg False Discovery Rate (FDR) controlling procedure [85]. Discussions are still ongoing in the statistics community as to which method is best to control the false positive rates in the context of ‘omics data analysis [46, 86, 87]. We therefore advise splitting the data into testing and validation groups. Tests made within each group are corrected for FDR with the Benjamini-Hochberg procedure whenever possible or advised by domain experts, and only features detected in both groups should be considered for further analysis and interpretation.

Over-fitting may occur when a statistical model includes too many parameters relative to the number of observations. The over-fitted model describes random error instead of the underlying relationship of interest and performs poorly with independent data. In deriving prediction models, therefore, a guiding principle is that there should be at least ten observations (or events) per predictor element [88], while simple models with few parameters should be favoured whenever possible.

All in all, the combination of internal replication, FDR correction and conservative over-fitting considerations allows the detection of interesting ‘omics features with a reference statistical foundation.

Replication of findings
When a large number of statistical tests have been planned, a comprehensive adjustment for multiple testing can be detrimental to statistical power. Validation and replication of findings are therefore essential in order to avoid the widespread unvalidated biomarker syndrome that has plagued the vast majority of claimed biomarkers. Indeed, fewer than 1/1000 have proved clinically useful and been approved by regulatory authorities [89–94]. For each combination of platform and sample type, an assessment can be made as to whether the data should be split into training and validation sets, or instead analysed as a single pool.

The predictive value of a biomarker identified after proper internal replication applies to the dataset in which it was discovered. Replication of findings in additional sample sets is a crucial step in producing clinically usable biomarkers and predictive models [95, 96] and should thus always be sought.

Once the feature filtering step is performed, the next step is to make sense of the results, either in a biological or mathematical manner. Biological annotation can be performed using pathways (see review in [97]) or functional categories (reviewed in [98]); however, this kind of analysis is hampered by factors such as statistical considerations (which method to use, independence between genes and between pathways, how to take into account the magnitude of the changes) and pathway architecture considerations (pathways can cross and overlap, meaning that if one pathway is truly affected, one may observe other pathways being significantly affected due to the set of overlapping genes and proteins involved) [99]. One way of overcoming those limitations is to use the complete genome-scale network of protein-protein interactions to define affected sub-regions of the network, with available academic [100, 101] and commercial solutions (e.g. MetaCore™ Thomson Reuters, IPA Ingenuity Pathway Analysis). A recently proposed solution is the disease map concept, following the examples of the Parkinson’s disease map [47], the Atlas of Cancer Signalling
Networks [50] and the AlzPathway [51, 52], where an exhaustive set of interactions relevant to a particular disease is represented in detail as a single network, which can then be analysed biologically and mathematically, with the supervision of domain experts for coverage and specificity [102].

Results
Application to a public domain dataset: TCGA OV dataset handprint analysis
The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) is a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) in the USA. It aims to accelerate our understanding of the molecular basis of cancer through application of genome analysis technologies. Among other functionalities, TCGA offers a freely available database of multi-‘omics datasets (including clinical data, imaging, DNA, mRNA and miRNA sequencing, protein, gene exon and miRNA expression, DNA methylation and copy number variation (CNV)) for several cancer types, with patient numbers ranging from a few dozen to above a thousand.

As a use case, the ovarian cancer OV dataset was chosen, as it comprises several ‘omics measurements for a large group of patients; this dataset has already been well characterized in several publications but without a data fusion analysis, in contrast to the glioblastoma TCGA dataset, for example [55]. It comprises data from a total of 586 patients, along with several ‘omics datasets (such as SNP, Exome, methylation…), as shown in Table 1 below. All data matrices were downloaded using the Broad Institute FireBrowse TCGA interface (http://firebrowse.org/?cohort=OV&download_dialog=true#); the results shown here are based upon data generated by the TCGA Research Network.

Data preparation
We used the clinical, methylation, mRNA and miRNA data matrices from the 453 patients (out of a total of 586 patients) for which all four data types were available. The overview of the analysis is summarized in Fig. 4.

Feature selection
Preliminary analysis without feature selection was performed (data not shown). Briefly, this analysis led to the identification of four stable clusters, mainly differentiated by lymphatic and venous invasion status and clinical stage. Biologically speaking, the comparison of clusters led to the highlighting of well-known ovarian cancer biomarkers and pathways.

In order to produce a handprint more focused on the survival status of patients in the dataset, each ‘omics dataset was treated separately to identify features associated with survival status at the end of the study and overall survival time. The latter was obtained by summing the age (in days) of the participants at enrolment in the study and the post-study survival time, both values available in the clinical variables from the TCGA website. After data preparation including imputation of missing data in methylation and normalisation, linear models testing for survival status with survival time as a cofactor were fitted feature-wise and p-values for differential expression/abundance were derived. All features with a nominal p-value < 0.05 were selected. This yielded a total of 899 features in the methylation dataset, 37 miRNAs and 5817 probesets in transcriptomics.

‘Omics-based clustering
Similarity matrices were derived from each filtered ‘omics dataset, which were fused with SNF, and spectral clustering with a consensus clustering step was applied to detect stable clusters, as shown in Fig. 5 below. The choice of the optimal number of stable clusters is based on two mathematical parameters: the deviation from ideal stability (DIS, a measure of the deviation from horizontality of the CDF curves in the left panel of Fig. 5, the formulation of which can be found in the supplementary material of [103]), and the number of patients assigned to each cluster (clusters with fewer than 10 patients should be avoided [58]). The DIS across the number of clusters can be found in Additional file 1. The DIS shows a minimal value for k = 3 clusters, but very similar values can be seen for k = 6, 7, 9, 10, 11 and 12. As it is clinically interesting to distinguish a higher number of clusters and to define clusters with different survival status, we chose the number of clusters associated with low DIS, no clusters with fewer than 10 patients, and statistically significant differences in survival status and survival time of patients, k = 9.

The clinical characteristics of the nine clusters are shown in Table 2. Survival curves are also shown in the Kaplan-Meier plot (Fig. 6). Survival status and survival time differ between the nine clusters, showing for example that patients in cluster 1 have a higher mortality rate.

Table 1 Number of cases in each ‘omics platform available for the TCGA Ovarian Serous Cystadenocarcinoma dataset (source: https://gdc.cancer.gov/)

Ovarian serous cystadenocarcinoma
        Total  Exome  SNP  Methylation  mRNA  miRNA  Clinical
Cases   586    536    579  584          574   582    584

Fig. 4 Framework outline for the TCGA handprint analysis with additional feature filtering. Each dataset was separately filtered based on nominal p-values < 0.05 when comparing alive versus deceased patients at the end of the study, taking into account the total number of days alive. A total of 6753 features were selected: 899 differentially methylated genes, 37 miRNAs and 5817 differentially expressed probesets.
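The stability assessment behind the consensus clustering step can be illustrated with a small sketch of the consensus matrix: for each pair of patients, the fraction of resampling runs in which both were drawn and ended up in the same cluster. This is an assumption-laden illustration, not the authors' pipeline: the clustering runs are supplied as ready-made label dictionaries (in the analysis above they would come from spectral clustering on the SNF-fused matrix), the patient identifiers are hypothetical, and the DIS statistic is not computed here because its formulation is given only in the supplementary material of [103].

```python
def consensus_matrix(runs, samples):
    """Pairwise co-clustering frequencies across resampling runs.

    runs    : list of dicts mapping sample id -> cluster label, one dict per
              resampling run, containing only the samples drawn in that run.
    samples : ordered list of all sample ids (defines row/column order).
    """
    index = {s: i for i, s in enumerate(samples)}
    n = len(samples)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times i and j were both drawn
    for labels in runs:
        drawn = list(labels)
        for a in drawn:
            for b in drawn:
                i, j = index[a], index[b]
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Hypothetical patient ids and cluster labels, for illustration only.
patients = ["p1", "p2", "p3"]
runs = [
    {"p1": 0, "p2": 0, "p3": 1},
    {"p1": 1, "p2": 1},            # p3 not drawn in this run
    {"p1": 0, "p2": 1, "p3": 0},
]
for row in consensus_matrix(runs, patients):
    print(row)
```

Entries close to 0 or 1 indicate pairs that are consistently separated or consistently grouped; a partition is considered stable when most entries take such extreme values.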
Consensus clustering on the fused similarity matrices determined the number of stable clusters that were viewed in a Kaplan-Meier plot and tested for differential survival. Machine learning was then performed to identify candidate features predicting the identified groups: Recursive Feature Elimination (RFE) on a linear Support-Vector-Machine (SVM) model to identify informative features, followed by a Random Forest (RF) model building in parallel with DIABLO sPLS-DA on those features.

Biomarker identification
Enrichment analysis
In order to detect differentially expressed features that are specific to one group, each of the nine clusters was compared to the rest of the dataset. Table 3 shows the summary of statistically different features (p-value < 0.05, 5% FDR correction) identified in each comparison.

Enrichment analysis of features differentially expressed/abundant between the clusters was then performed. Complete results are presented in Additional file 2; an overview of results for which there is already evidence in the literature is presented below in Table 4.

In short, the biological functions enriched in each cluster are as follows: cluster 1 is mostly enriched in mitochondrial translation and energy metabolism, cell cycle regulation, negative regulation of apoptosis and DNA damage response. In addition, several miRNAs and transcription factors are enriched; the details can be found in Additional file 2.

Cluster 2 is associated with chemical carcinogenesis, miR-330-5p, miR-693-5p and the Pax-2 transcription factor. Other transcription factors are also highlighted through the methylation measurements.

Cluster 3 is associated with immune system regulation (T cell-related processes, and more precisely CD4 and CD8 T cell lineage-related processes…), cell-cell signalling, cAMP signalling, cytokine-cytokine interaction, G-Protein coupled receptor (GPCR) ligand binding and neuronal and muscle-related pathways (potassium and calcium channels, other ion channels and synapses). Again, several miRNAs and transcription factors are highlighted.

Cluster 4 is also associated with the immune response, and key functions such as lymphocyte activation, T cell aggregation, differentiation, proliferation and activation, adaptive immune system, regulation of lymphocyte cell-cell activation, immune response-regulating signalling pathway, cytokine-cytokine receptor interaction, antigen processing and presentation, hematopoietic cell lineage and hematopoiesis and B cell activation. Primary immunodeficiency pathway and cell adhesion molecules, along with miR-938 and several transcription factors, are also enriched.

Fig. 5 Consensus clustering results for the handprint analysis with feature filtering. A number of stable clustering schemes are available (k = 3, 6, 7, 8, 9).
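For readers unfamiliar with the Benjamini-Hochberg procedure named in the Methods and applied as a 5% FDR correction in the cluster comparisons above, a minimal step-up implementation is sketched below. It is a generic textbook version written for illustration; the article does not state which implementation was used, and the p-values in the usage lines are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that each adjusted
    # value is the minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up nominal p-values, for illustration only.
nominal = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(nominal)
selected = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted)
print(selected)
```

Features are then retained when their adjusted value falls below the chosen FDR level, which is how a threshold such as 5% would be applied to each cluster-versus-rest comparison.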
Nine clusters were chosen as the most informative, while keeping a low value of the deviation from ideal stability index and with clinical characteristics of the clusters statistically different in both survival time and survival status between clusters Cluster 5 is related to immune response, enriched in Each cluster is linked with one or several of the well- lymphocyte activation, T cell aggregation, differentiation, known hallmarks of cancer such as regulation of the cell activation and proliferation, leukocyte differentiation, ag- cycle (clusters 1 and 7), energy metabolism (cluster 1 and 7), gregation and activation, positive regulation of cell-cell immune system (clusters 3, 4, 5 and 8), epithelial-to- adhesion, antigen processing and presentation, cytokine mesenchymal transition (cluster 4) or angiogenesis production, inflammatory response, NK cell-mediated (cluster 5) [104–106]. Interestingly, our analysis based cytotoxicity and cytokine-cytokine receptor interaction. on ‘omicsprofilesis abletoidentifyclustersthat seemto Other processes involved are NF-κB signalling, Jak- separate some of those hallmarks out, while an analysis STAT signalling, Interferon α/β signalling, TCR signal- taking into account only the clinical data cannot. As seen ling, VEGF signalling, VEGFR2-mediated cell prolifera- above, cluster 6 is associated with a higher rate of survival. tion, Hedgehog ‘off’ state, along with several miRNAs It would therefore be interesting to further explore the and transcription factors. signalling networks enriched in the comparison between Cluster 6 is enriched in several signalling pathways, cluster 6 and the other clusters to identify the molecular such as cAMP, GPCR signalling, arachidonic acid metab- mechanisms responsible for the extended survival. olism and fatty acids metabolism, as well as positive T cell selection, several miRNAs and transcription factors. 
Machine-learning predictive modelling Cluster 7 is linked with respiratory metabolism, p53 The next step in the analysis is to establish a model that and cell cycle regulation, splicing regulation as well as can predict which cluster a patient belongs to, based on signalling by NF-κB and miRNAs and transcription the ‘omics measurements alone. Machine-learning tech- factors. niques (reviewed in [107, 108]), available in the caret R Cluster 8 is enriched with T cell lineage commit- package [109] and in the MixOmics R packages [110, ment, potassium channels, miRNAs and transcription 111] were used. factors. Two models were built in parallel, on the same dataset. Cluster 9 is associated with ion transport (including syn- aptic, calcium and potassium channels), cAMP signalling, 1. A Recursive Feature Elimination (RFE) procedure nicotine addiction, as well as miRNAs and transcription was performed to identify the smallest number of factors. features from the three ‘omics platforms that allow De Meulder et al. 
Table 2 Clinical characteristics of the nine clusters found in the focused handprint analysis

Cluster sizes: C1 (n = 49); C2 (n = 30); C3 (n = 75); C4 (n = 41); C5 (n = 47); C6 (n = 52); C7 (n = 46); C8 (n = 56); C9 (n = 57)

Age at initial pathologic diagnosis (Yr); P-value 3.40E-02
C1: 57.6 ± 13.2; C2: 53.5 ± 8.16; C3: 59.8 ± 10.7; C4: 61.1 ± 12; C5: 60.2 ± 9.67; C6: 63.4 ± 11.8; C7: 59.8 ± 12.5; C8: 59.4 ± 11.6; C9: 60 ± 11.4

Days from birth (Days); P-value 3.15E-02
C1: −21,200 ± 4830; C2: −19,700 ± 3030; C3: −21,900 ± 3870; C4: −22,700 ± 4260; C5: −22,200 ± 2580; C6: −23,300 ± 4290; C7: −22,000 ± 4560; C8: −21,900 ± 4240; C9: −22,200 ± 4140

Days to death (Days (IQR)); P-value 2.11E-02
C1: 1220 (725–1490); C2: 1480 (1210–2360); C3: 997 (404–1230); C4: 949 (563–1360); C5: 787 (512–1340); C6: 1090 (680–1580); C7: 978 (536–1450); C8: 1070 (340–1440); C9: 1290 (731–1700)

Days to last followup (Days (IQR)); P-value 3.74E-02
C1: 1090 (689–1460); C2: 1200 (688–1550); C3: 664 (238–1120); C4: 763 (272–1820); C5: 676 (185–1560); C6: 804 (339–1560); C7: 651 (347–1370); C8: 816 (223–1370); C9: 1280 (605–1690)

Initial pathologic diagnosis method; P-value 3.28E-03
C1: Cytology: 9; Excisional biopsy: 2; Fine needle aspiration biopsy: 2; Incisional biopsy: 4; Tumor resection: 32
C2: Cytology: 3; Excisional biopsy: 0; Fine needle aspiration biopsy: 0; Incisional biopsy: 0; Tumor resection: 27
C3: Cytology: 12; Excisional biopsy: 0; Fine needle aspiration biopsy: 3; Incisional biopsy: 0; Tumor resection: 59; NA: 1
C4: Cytology: 2; Excisional biopsy: 0; Fine needle aspiration biopsy: 2; Incisional biopsy: 1; Tumor resection: 36
C5: Cytology: 9; Excisional biopsy: 2; Fine needle aspiration biopsy: 0; Incisional biopsy: 2; Tumor resection: 33; NA: 1
C6: Cytology: 6; Excisional biopsy: 0; Fine needle aspiration biopsy: 1; Incisional biopsy: 0; Tumor resection: 44; NA: 1
C7: Cytology: 2; Excisional biopsy: 0; Fine needle aspiration biopsy: 0; Incisional biopsy: 0; Tumor resection: 44
C8: Cytology: 9; Excisional biopsy: 1; Fine needle aspiration biopsy: 0; Incisional biopsy: 3; Tumor resection: 43
C9: Cytology: 5; Excisional biopsy: 0; Fine needle aspiration biopsy: 1; Incisional biopsy: 0; Tumor resection: 51

Lymphatic invasion; P-value 2.43E-02
C1: No: 4; Yes: 9; NA: 36
C2: No: 6; Yes: 10; NA: 14
C3: No: 7; Yes: 19; NA: 49
C4: No: 13; Yes: 5; NA: 23
C5: No: 1; Yes: 17; NA: 29
C6: No: 13; Yes: 6; NA: 33
C7: No: 8; Yes: 21; NA: 17
C8: No: 4; Yes: 8; NA: 44
C9: No: 5; Yes: 14; NA: 38

Neoplasm histologic grade; P-value 1.89E-02
C1: G1: 1; G2: 13; G3: 33; G4: 0; Gb: 1; Gx: 1
C2: G1: 0; G2: 5; G3: 24; G4: 0; Gb: 0; Gx: 0; NA: 1
C3: G1: 0; G2: 5; G3: 70; G4: 0; Gb: 0; Gx: 0
C4: G1: 0; G2: 5; G3: 36; G4: 0; Gb: 0; Gx: 0
C5: G1: 0; G2: 6; G3: 39; G4: 0; Gb: 0; Gx: 2
C6: G1: 0; G2: 6; G3: 44; G4: 1; Gb: 0; Gx: 1
C7: G1: 0; G2: 8; G3: 38; G4: 0; Gb: 0; Gx: 0
C8: G1: 0; G2: 1; G3: 53; G4: 0; Gb: 0; Gx: 2
C9: G1: 0; G2: 6; G3: 49; G4: 0; Gb: 0; Gx: 1; NA: 1

Ethnicity; P-value 6.72E-01
C1: American Indian or Alaska native: 1; Asian: 1; Black or African American: 3; White: 43; NA: 1
C2: American Indian or Alaska native: 0; Asian: 1; Black or African American: 2; White: 27; NA: 0
C3: American Indian or Alaska native: 0; Asian: 3; Black or African American: 2; White: 68; NA: 2
C4: American Indian or Alaska native: 0; Asian: 1; Black or African American: 3; White: 37; NA: 0
C5: American Indian or Alaska native: 1; Asian: 1; Black or African American: 0; White: 41; NA: 4
C6: American Indian or Alaska native: 0; Asian: 2; Black or African American: 4; White: 44; NA: 2
C7: American Indian or Alaska native: 0; Asian: 3; Black or African American: 1; White: 41; NA: 1
C8: American Indian or Alaska native: 0; Asian: 3; Black or African American: 2; White: 49; NA: 2
C9: American Indian or Alaska native: 0; Asian: 0; Black or African American: 4; White: 51; NA: 2

Clinical stage; P-value 2.65E-02
C1: iia: 0; iib: 0; iic: 0; iiia: 1; iiib: 0; iiic: 38; iv: 10; NA: 0
C2: iia: 0; iib: 0; iic: 1; iiia: 0; iiib: 1; iiic: 24; iv: 3; NA: 1
C3: iia: 0; iib: 0; iic: 3; iiia: 1; iiib: 3; iiic: 51; iv: 16; NA: 1
C4: iia: 0; iib: 0; iic: 3; iiia: 4; iiib: 4; iiic: 22; iv: 7; NA: 1
C5: iia: 0; iib: 1; iic: 1; iiia: 0; iiib: 2; iiic: 33; iv: 9; NA: 1
C6: iia: 0; iib: 0; iic: 2; iiia: 1; iiib: 5; iiic: 38; iv: 6; NA: 0
C7: iia: 1; iib: 1; iic: 2; iiia: 0; iiib: 4; iiic: 34; iv: 4; NA: 0
C8: iia: 0; iib: 0; iic: 1; iiia: 0; iiib: 1; iiic: 42; iv: 12; NA: 0
C9: iia: 2; iib: 2; iic: 4; iiia: 0; iiib: 1; iiic: 41; iv: 7; NA: 0

Tumor residual disease; P-value 6.13E-02
C1: > 20 mm: 10; 1–10 mm: 26; 11–20 mm: 6; no macroscopic disease: 4; NA: 3
C2: > 20 mm: 5; 1–10 mm: 17; 11–20 mm: 5; no macroscopic disease: 12; NA: 4
C3: > 20 mm: 17; 1–10 mm: 29; 11–20 mm: 5; no macroscopic disease: 12; NA: 12
C4: > 20 mm: 6; 1–10 mm: 18; 11–20 mm: 1; no macroscopic disease: 12; NA: 4
C5: > 20 mm: 11; 1–10 mm: 21; 11–20 mm: 4; no macroscopic disease: 3; NA: 8
C6: > 20 mm: 4; 1–10 mm: 24; 11–20 mm: 5; no macroscopic disease: 12; NA: 7
C7: > 20 mm: 8; 1–10 mm: 15; 11–20 mm: 5; no macroscopic disease: 13; NA: 5
C8: > 20 mm: 6; 1–10 mm: 29; 11–20 mm: 2; no macroscopic disease: 14; NA: 5
C9: > 20 mm: 11; 1–10 mm: 25; 11–20 mm: 2; no macroscopic disease: 14; NA: 5

Tumor tissue site; P-value 5.01E-01
C1: Omentum: 0; Ovary: 48; Peritoneum ovary: 1
C2: Omentum: 0; Ovary: 30; Peritoneum ovary: 0
C3: Omentum: 1; Ovary: 74; Peritoneum ovary: 0
C4: Omentum: 0; Ovary: 41; Peritoneum ovary: 0
C5: Omentum: 1; Ovary: 46; Peritoneum ovary: 0
C6: Omentum: 0; Ovary: 52; Peritoneum ovary: 0
C7: Omentum: 0; Ovary: 46; Peritoneum ovary: 0
C8: Omentum: 0; Ovary: 56; Peritoneum ovary: 0
C9: Omentum: 0; Ovary: 57; Peritoneum ovary: 0

Venous invasion; P-value 7.24E-02
C1: No: 3; Yes: 3; NA: 43
C2: No: 3; Yes: 10; NA: 17
C3: No: 8; Yes: 7; NA: 60
C4: No: 12; Yes: 3; NA: 26
C5: No: 1; Yes: 10; NA: 36
C6: No: 10; Yes: 5; NA: 37
C7: No: 7; Yes: 20; NA: 19
C8: No: 3; Yes: 1; NA: 52
C9: No: 3; Yes: 10; NA: 44

Vital status; P-value 1.90E-03
C1: Alive: 9; Dead: 40; NA: 0
C2: Alive: 14; Dead: 16; NA: 0
C3: Alive: 33; Dead: 42; NA: 0
C4: Alive: 18; Dead: 23; NA: 0
C5: Alive: 20; Dead: 27; NA: 0
C6: Alive: 20; Dead: 31; NA: 1
C7: Alive: 28; Dead: 18; NA: 0
C8: Alive: 31; Dead: 25; NA: 0
C9: Alive: 27; Dead: 30; NA: 0

Primary therapy outcome success; P-value 5.08E-01
C1: Complete remission/response: 24; Partial remission/response: 12; Progressive disease: 3; Stable disease: 1; NA: 9
C2: Complete remission/response: 17; Partial remission/response: 3; Progressive disease: 4; Stable disease: 2; NA: 4
C3: Complete remission/response: 41; Partial remission/response: 7; Progressive disease: 2; Stable disease: 4; NA: 21
C4: Complete remission/response: 24; Partial remission/response: 4; Progressive disease: 2; Stable disease: 0; NA: 11
C5: Complete remission/response: 24; Partial remission/response: 8; Progressive disease: 4; Stable disease: 3; NA: 8
C6: Complete remission/response: 29; Partial remission/response: 6; Progressive disease: 1; Stable disease: 5; NA: 11
C7: Complete remission/response: 27; Partial remission/response: 5; Progressive disease: 4; Stable disease: 6; NA: 4
C8: Complete remission/response: 36; Partial remission/response: 4; Progressive disease: 7; Stable disease: 2; NA: 7
C9: Complete remission/response: 35; Partial remission/response: 5; Progressive disease: 5; Stable disease: 1; NA: 11

Days lived known; P-value 3.85E-02
C1: 22,300 ± 4750; C2: 21,100 ± 3150; C3: 22,800 ± 3930; C4: 23,800 ± 4050; C5: 23,300 ± 3840; C6: 24,500 ± 4140; C7: 23,000 ± 4490; C8: 22,800 ± 4430; C9: 23,400 ± 4240

Nominally statistically significant differences (p < 0.05) are shown in italic. Interestingly, significant differences are detected in lymphatic invasion, clinical stage at diagnosis, vital status and the overall number of days alive

Fig. 6 Kaplan-Meier plot of survival for patients from the nine clusters revealed with the consensus clustering analysis. The x axis bears the total amount of days that patients have lived, i.e.
the sum of their age at enrolment in the study plus the recorded amount of days they survived during the study, censored to the right by the end of measurements in the study (enrolment plus 4624 days)

satisfactory separation of the clusters. This procedure was controlled by Leave-Group-Out Cross Validation (LGOCV) with 100 iterations (this number was chosen to ensure convergence of the validation procedure) and using between 1 and 50 predictors, with the addition of the whole set of 6753 features. A Random Forest (RF) model was built with the features identified in the previous step. To avoid overfitting, the RF model was built using LGOCV with 100 iterations and in three quarters of the samples available (N = 300) and then tested in the remaining quarter of samples (N = 153). More details can be found in the Additional file 3.

2. Concatenation-based integration of data combines multiple datasets into a single large dataset, with the aim to predict an outcome. However, this approach does not account for or model relationships between datasets and thus limits our understanding of molecular interactions at multiple functional levels. This is the rationale behind the development of novel integrative modelling methods, such as the DIABLO sPLSDA method [112]. A DIABLO model was built using the same dataset as the SNF analysis described above. A DIABLO model is a type of partial least squares (sparse PLS Discriminant Analysis) regression model, which uses multiple ‘omics platform measurements on the same samples to predict an outcome, with a biomarker selection step (sparse) to select necessary and sufficient features to predict the groups (discriminant analysis) within the outcome. Details of this analysis can be found in the Additional file 4. In short, this analysis was run as follows: the datasets were split into 2/3 training and 1/3 testing sets. The DIABLO model was then trained with boundaries set on the number of features allowed per component (gene expression and methylation between 50 and 110 features, and between 5 and 35 miRNA features). The performances were then estimated within the training model by 10 repeats of 10-fold cross-validation and the prediction power estimated in the testing set.

Topological data analysis
In order to visualize the patients’ relationships as measured by their ‘omics profiles, we used Topological Data

Table 3 Number of statistically significant different features obtained when comparing each cluster against all other patients in the dataset, for each platform. P-values were computed by a linear model in each ‘omics platform independently, and Benjamini-Hochberg FDR corrected

Platform | 1 vs Rest (49 vs 404) | 2 vs Rest (30 vs 423) | 3 vs Rest (75 vs 378) | 4 vs Rest (41 vs 412) | 5 vs Rest (47 vs 406) | 6 vs Rest (52 vs 401) | 7 vs Rest (46 vs 407) | 8 vs Rest (56 vs 397) | 9 vs Rest (57 vs 396)
mRNA | 1861 | 245 | 4101 | 1073 | 2480 | 3617 | 2557 | 4620 | 1843
Methylation | 335 | 550 | 4 | 388 | 498 | 233 | 387 | 528 | 75
miRNA | 18 | 0 | 1 | 9 | 24 | 1 | 8 | 14 | 11

Table 4 Enrichment analysis for each comparison across all ‘omics types, with q-values, and the literature references mentioning involvement of the terms in ovarian cancer development.
Q-values are the minimal false discovery rate at which the test may be called significant, or in other words, the p-value threshold to satisfy the FDR criteria set by the Benjamini-Hochberg procedure

Term | Term type | ‘Omic type | Contrast | q-value | Reference of implication in ovarian cancer
E2F | Transcription factor | Transcriptomics | 1 vs Rest | 8.17E-48 | [123, 124]
Sp1 | Transcription factor | Transcriptomics | 1 vs Rest | 1.95E-35 | [125]
Mitochondrial translation | Reactome | Transcriptomics | 1 vs Rest | 9.02E-21 | [126]
hsa-miR-193a-5p | miRNA | Transcriptomics | 1 vs Rest | 4.33E-09 | [127]
CREM | Transcription factor | Methylation | 1 vs Rest | 2.45E-03 | [128]
hsa-miR-940 | miRNA | Transcriptomics | 1 vs Rest | 6.80E-03 | [129]
hsa-miR-601 | miRNA | Transcriptomics | 1 vs Rest | 6.81E-03 | [129]
hsa-miR-503 | miRNA | Transcriptomics | 1 vs Rest | 1.41E-02 | [129]
AP-1 | Transcription factor | Methylation | 1 vs Rest | 1.52E-02 | [130]
TCF-4 | Transcription factor | Methylation | 1 vs Rest | 2.04E-02 | [131]
hsa-miR-361-3p | miRNA | Transcriptomics | 1 vs Rest | 2.53E-02 | [129]
C/EBP | Transcription factor | Methylation | 2 vs Rest | 1.13E-05 | [132]
LMXB1 | Transcription factor | Methylation | 2 vs Rest | 9.32E-05 | [133]
hsa-miR-330-5p | miRNA | Transcriptomics | 2 vs Rest | 7.57E-03 | [134]
Chemical carcinogenesis | KEGG pathways | Transcriptomics | 2 vs Rest | 1.77E-02 | [135–137]
hsa-miR-335 | miRNA | Transcriptomics | 2 vs Rest | 3.95E-02 | [138]
MZF-1 | Transcription factor | Transcriptomics | 3 vs Rest | 4.06E-39 | [139]
SREBP-1 | Transcription factor | Transcriptomics | 3 vs Rest | 5.29E-38 | [140]
AP-2gamma | Transcription factor | Transcriptomics | 3 vs Rest | 1.79E-36 | [141]
GPCR ligand binding | Reactome | Transcriptomics | 3 vs Rest | 8.14E-10 | [142]
hsa-miR-328 | miRNA | Transcriptomics | 3 vs Rest | 9.92E-10 | [129]
hsa-miR-370 | miRNA | Transcriptomics | 3 vs Rest | 1.09E-08 | [129]
hsa-miR-601 | miRNA | Transcriptomics | 3 vs Rest | 1.07E-07 | [129]
hsa-miR-423-5p | miRNA | Transcriptomics | 3 vs Rest | 1.36E-06 | [129]
hsa-miR-139-3p | miRNA | Transcriptomics | 3 vs Rest | 2.28E-05 | [129]
hsa-miR-769-5p | miRNA | Transcriptomics | 3 vs Rest | 9.05E-05 | [129]
hsa-miR-339-3p | miRNA | Transcriptomics | 3 vs Rest | 2.16E-04 | [129]
hsa-miR-940 | miRNA | Transcriptomics | 3 vs Rest | 2.94E-04 | [129]
hsa-miR-542-5p | miRNA | Transcriptomics | 3 vs Rest | 8.13E-04 | [129]
hsa-miR-483-5p | miRNA | Transcriptomics | 3 vs Rest | 1.50E-03 | [129]
hsa-miR-361-3p | miRNA | Transcriptomics | 3 vs Rest | 7.88E-03 | [129]
hsa-miR-449a | miRNA | Transcriptomics | 3 vs Rest | 4.87E-02 | [129]
T cell aggregation | GO Biological Process | Transcriptomics | 4 vs Rest | 1.94E-38 | [143]
T cell activation | GO Biological Process | Transcriptomics | 4 vs Rest | 1.94E-38 | [144]
Natural killer cell mediated cytotoxicity | KEGG pathways | Transcriptomics | 4 vs Rest | 8.60E-14 | [145]
Cell adhesion molecules (CAMs) | KEGG pathways | Transcriptomics | 4 vs Rest | 2.37E-11 | [146]
Hedgehog ‘on’ state | Reactome | Transcriptomics | 4 vs Rest | 7.21E-05 | [147]
HIC1 | Transcription factor | Methylation | 4 vs Rest | 2.46E-04 | [148]
hsa-miR-328 | miRNA | Transcriptomics | 4 vs Rest | 1.49E-02 | [129]
AP-2gamma | Transcription factor | Transcriptomics | 4 vs Rest | 3.00E-02 | [141]
T cell activation | GO Biological Process | Transcriptomics | 5 vs Rest | 1.94E-38 | [144]
T cell aggregation | GO Biological Process | Transcriptomics | 5 vs Rest | 2.25E-22 | [143]
Natural killer cell mediated cytotoxicity | KEGG pathways | Transcriptomics | 5 vs Rest | 8.60E-14 | [145]
Antigen processing and presentation | KEGG pathways | Transcriptomics | 5 vs Rest | 4.33E-11 | [149]
Interferon alpha/beta signalling | Reactome | Transcriptomics | 5 vs Rest | 6.11E-08 | [150]
hsa-miR-423-5p | miRNA | Transcriptomics | 5 vs Rest | 3.09E-05 | [129]
hsa-miR-328 | miRNA | Transcriptomics | 5 vs Rest | 5.23E-04 | [129]
VEGFA-VEGFR2 Pathway | Reactome | Transcriptomics | 5 vs Rest | 2.57E-03 | [151, 152]
Hedgehog ‘off’ state | Reactome | Transcriptomics | 5 vs Rest | 1.21E-02 | [153]
hsa-miR-139-3p | miRNA | Transcriptomics | 5 vs Rest | 1.35E-02 | [129]
NF-κB signalling pathway | KEGG pathways | Transcriptomics | 5 vs Rest | 1.53E-02 | [154]
hsa-miR-601 | miRNA | Transcriptomics | 5 vs Rest | 2.71E-02 | [129]
Jak-STAT signalling pathway | KEGG pathways | Transcriptomics | 5 vs Rest | 3.54E-02 | [155]
hsa-miR-375 | miRNA | Transcriptomics | 5 vs Rest | 3.74E-02 | [129]
Signalling by GPCR | Reactome | Transcriptomics | 6 vs Rest | 1.24E-14 | [156]
hsa-miR-328 | miRNA | Transcriptomics | 6 vs Rest | 1.47E-08 | [129]
hsa-miR-601 | miRNA | Transcriptomics | 6 vs Rest | 6.94E-07 | [129]
hsa-miR-370 | miRNA | Transcriptomics | 6 vs Rest | 2.46E-06 | [129]
hsa-miR-423-5p | miRNA | Transcriptomics | 6 vs Rest | 4.81E-06 | [129]
hsa-miR-423-3p | miRNA | Transcriptomics | 6 vs Rest | 1.77E-05 | [129]
cAMP metabolic process | GO Biological Process | Transcriptomics | 6 vs Rest | 9.22E-05 | [157]
hsa-miR-769-5p | miRNA | Transcriptomics | 6 vs Rest | 5.13E-04 | [129]
hsa-miR-139-3p | miRNA | Transcriptomics | 6 vs Rest | 2.70E-03 | [129]
hsa-miR-483-5p | miRNA | Transcriptomics | 6 vs Rest | 4.90E-03 | [129]
hsa-miR-940 | miRNA | Transcriptomics | 6 vs Rest | 5.05E-03 | [129]
T cell selection | GO Biological Process | Transcriptomics | 6 vs Rest | 1.41E-02 | [158]
Arachidonic acid metabolism | KEGG pathways | Transcriptomics | 6 vs Rest | 1.42E-02 | [135]
hsa-miR-542-5p | miRNA | Transcriptomics | 6 vs Rest | 1.73E-02 | [129]
Oxidative phosphorylation | KEGG pathways | Transcriptomics | 7 vs Rest | 9.49E-13 | [159]
Stabilization of p53 | Reactome | Transcriptomics | 7 vs Rest | 1.06E-07 | [160]
Spliceosome | KEGG pathways | Transcriptomics | 7 vs Rest | 1.59E-07 | [161]
NF-κB signalling pathway | Reactome | Transcriptomics | 7 vs Rest | 3.97E-05 | [154]
hsa-miR-542-5p | miRNA | Transcriptomics | 7 vs Rest | 2.53E-03 | [129]
hsa-miR-601 | miRNA | Transcriptomics | 7 vs Rest | 2.62E-03 | [129]
hsa-miR-423-5p | miRNA | Transcriptomics | 7 vs Rest | 5.88E-03 | [129]
hsa-let-7c | miRNA | Transcriptomics | 7 vs Rest | 2.67E-02 | [129]
Regulation of HIF by oxygen | Reactome | Transcriptomics | 7 vs Rest | 3.32E-02 | [162]
hsa-miR-361-3p | miRNA | Transcriptomics | 7 vs Rest | 4.16E-02 | [129]
hsa-miR-328 | miRNA | Transcriptomics | 8 vs Rest | 9.25E-15 | [129]
hsa-miR-370 | miRNA | Transcriptomics | 8 vs Rest | 3.60E-11 | [129]
hsa-miR-940 | miRNA | Transcriptomics | 8 vs Rest | 1.37E-10 | [129]
hsa-miR-423-5p | miRNA | Transcriptomics | 8 vs Rest | 4.29E-10 | [129]
hsa-miR-423-3p | miRNA | Transcriptomics | 8 vs Rest | 7.47E-09 | [129]
hsa-miR-139-3p | miRNA | Transcriptomics | 8 vs Rest | 5.08E-07 | [129]
hsa-miR-601 | miRNA | Transcriptomics | 8 vs Rest | 9.47E-07 | [129]
hsa-miR-542-5p | miRNA | Transcriptomics | 8 vs Rest | 4.72E-04 | [129]
hsa-miR-361-3p | miRNA | Transcriptomics | 8 vs Rest | 1.07E-03 | [129]
hsa-miR-483-5p | miRNA | Transcriptomics | 8 vs Rest | 1.32E-03 | [129]
hsa-miR-769-5p | miRNA | Transcriptomics | 8 vs Rest | 1.68E-03 | [129]
Potassium signalling pathway | Reactome | Transcriptomics | 8 vs Rest | 1.15E-02 | [163]
hsa-miR-99b | miRNA | Transcriptomics | 8 vs Rest | 1.93E-02 | [129]
hsa-miR-339-3p | miRNA | Transcriptomics | 8 vs Rest | 2.28E-02 | [129]
T cell lineage commitment | GO Biological Process | Transcriptomics | 8 vs Rest | 3.80E-02 | [164]
hsa-miR-139-3p | miRNA | Transcriptomics | 9 vs Rest | 3.58E-09 | [129]
hsa-miR-423-5p | miRNA | Transcriptomics | 9 vs Rest | 5.89E-09 | [129]
hsa-miR-328 | miRNA | Transcriptomics | 9 vs Rest | 2.32E-08 | [129]
hsa-miR-370 | miRNA | Transcriptomics | 9 vs Rest | 4.83E-08 | [129]
hsa-miR-423-3p | miRNA | Transcriptomics | 9 vs Rest | 3.89E-06 | [129]
hsa-miR-940 | miRNA | Transcriptomics | 9 vs Rest | 5.37E-06 | [129]
hsa-miR-769-5p | miRNA | Transcriptomics | 9 vs Rest | 1.07E-04 | [129]
hsa-miR-339-3p | miRNA | Transcriptomics | 9 vs Rest | 1.73E-04 | [129]
hsa-miR-601 | miRNA | Transcriptomics | 9 vs Rest | 2.05E-04 | [129]
hsa-miR-483-5p | miRNA | Transcriptomics | 9 vs Rest | 7.33E-03 | [129]
Calcium signalling pathway | KEGG pathways | Transcriptomics | 9 vs Rest | 1.55E-02 | [165]
hsa-miR-542-5p | miRNA | Transcriptomics | 9 vs Rest | 1.69E-02 | [129]
cAMP signalling pathway | KEGG pathways | Transcriptomics | 9 vs Rest | 2.33E-02 | [166]
Ion transfer | GO Biological Process | Transcriptomics | 9 vs Rest | 3.43E-02 | [167]

Analysis (TDA), a general framework to analyse high-dimensional, incomplete and noisy data in a manner that is less sensitive to the particular metric that is chosen, and provides dimensionality reduction and robustness to noise. TDA is embedded in the software produced by the Ayasdi company to which the data were uploaded [113]. As shown in Fig. 7, the network of patients’ similarities obtained through TDA analysis and then colored by the vital status of the patients at the end of the study shows a higher level of complexity than is identified by the clustering analysis, suggesting that statistical and/or technical limitations of the clustering methods prevent us from accurately representing reality.

Fig. 7 Network of patients shown in the TDA platform. The network is constructed as ‘bins’ grouping patients who are similar based on their ‘omics profiles. Each dot in the network represents a bin. The bins are overlapping by an adaptable percentage, and if at least one patient is present in the overlap of two bins, the two bins will be linked in the network. The survival status of the patients is then translated as a color scheme (blue representing deceased patients and red alive patients). Using this technique, it is easy to identify ‘islands’ of good and poor survival among the patients, and equally easy to acknowledge that there are more such islands than is identified through the clustering technique. Thorough analysis of such networks can lead to insights into biology, as detailed in [168]

Discussion
Multi-omics data integration is, among other components of biological data integration, a very promising and emerging field. We show a structured and effective way to combine ‘omics data from multiple sources to search for molecular profiles of patients. This process allowed for the classification of a well-studied dataset of OV. Other studies have been performed, either on this same dataset [114–118], or on the same disease [119]. Tothill et al. in 2015 identified six clusters of patients, based on mRNA, immunohistochemistry and clinical data from a cohort of 285 Australian and Dutch participants, with a consensus clustering analysis of mRNA data alone. The TCGA consortium produced their own dataset in 2011, identifying four clusters based on combined mRNA, miRNA and DNA methylation data (data combined by summarising to the gene-level all datasets through a factor analysis) and using a non-negative matrix factorisation to identify clusters [120]. Further analysis of the same dataset was then performed by Zhang et al. [118], Jin et al. [115] and Kim et al. [116] (with some variations), but these authors did not look for new phenotypes in their analysis, rather comparing data based on clinical endpoints (survival time, histological grades and stage of disease). Gevaert et al. [114] used an original algorithm to combine DNA methylation, Copy Number Variation (CNV) and gene expression data, using the clusters defined in the TCGA original paper. Those studies showed different ways of analysing the data, leading to the identification of clinically relevant clusters in the case of Tothill and TCGA original paper [117, 119]. This paper is, however, the first in which TCGA mRNA, miRNA and methylation data were fused with an advanced data integration method to identify robust subtypes of disease.

The number of clusters found in the same dataset differs between the TCGA analysis and our analysis. We believe that the higher number of clusters we found is the result of more up-to-date and powerful methods for subtype discovery, as shown in the SNF original paper [55]. Moreover, the subtypes identified in this analysis do allow for a more in-depth classification of patients linked with specific molecular subtypes than was previously reported. Building predictive models based on multiple ‘omics profiles also contributes to the novelty of this approach as other reported studies did not produce such a model, with the exception of the Tothill et al. study [119] in which the authors developed a class prediction model based on transcriptomics data only.

Clinically speaking, classifications are most useful when they allow the identification of a subset of patients with a clinically relevant outcome, such as low or high survival rate, thus indicating where efforts may be focused to develop new drugs, therapies and procedures. In our analysis, the groups identified after feature reduction are statistically different in terms of survival rate and time. For example, cluster 6 shows the highest rate of survival among the 9 clusters identified and is associated with the GPCR signalling pathway, cAMP, ion channels, arachidonic acid metabolism and a number of miRNAs (see Table 4 or the Additional file 2 for more details).

Interestingly, while the two sets of groups defined with or without feature reduction show differences in invasion and clinical stage, statistically significant differences in vital status are only detected amongst groups defined with feature reduction. The reduced data also allows for the definition of a higher number of stable groups (9 instead of 4), thereby pointing to the usefulness of performing feature reduction prior to clustering analysis.

The biological functions highlighted by enrichment analysis between the clusters indicate that these are associated with different biological mechanisms leading to the development of cancer in patients, including immune system disorders, cell cycle dysregulation, impaired response to DNA damage and modified energy metabolism, among others.

The predictive models that were trained and tested with two different methods gave mixed results in terms of predictive power. In the Random Forest case, the model could predict quite well when patients did not belong to the clusters, but not so well when patients did belong to them; in other words, the model is specific but not sensitive. In the case of the DIABLO PLS, the model is able to predict clusters 4 and 8 fairly accurately and cluster 5 less accurately. Moreover, in the case of the DIABLO analysis, the model showed that the clusters have different ‘omics patterns, with clusters 2 and 8 showing distinct methylation profiles, and cluster 4 showing different methylation and transcriptomics profiles.

The results presented in this manuscript are not perfectly predictive, however. It seems that the cluster definitions are not as stable as they could be; the predictive models are not accurate in all clusters and the survival status of the clusters is not clear-cut. This reflects the fact shown in Fig. 7, that there seems to be much more complexity within the dataset than what the clustering analysis is able to detect.

This is due to multiple factors. First, the recurring issue of the low number of patients, which in turn influences the number of clusters we can find with statistical confidence (a point which is not taken into account in the TDA analysis discussed here), highlights the need for better stratification methods in the context of personalized medicine where, ideally, each patient is his/her own cluster (n = 1). Second, sub-optimal clustering methods and algorithms also play a part in this result, and it is our hope that continuous methods development will allow for better classification. Clustering analysis is descriptive in nature: applying a clustering algorithm to a dataset will always yield clusters, whether real clusters exist or not. Analytical methods exist to ascertain cluster ‘reality’, among which stability in patients through bootstrapping, stability in time through cluster identification from time-series experiments [121] and meta clustering across several studies, yet only replication studies may confirm the existence of these clusters. Such replication effort however lies outside the scope of this manuscript.

Despite the use of most recent databases and tools, the biological interpretation of the differences between the clusters remains challenging. The main issues stem from the overlapping nature of pathways described in literature and the non-unicity of relationships between biological entities, leading to a high false positive rate in the results of pathway analysis [97]. Efforts are made in the systems biology community to correct these shortcomings, among which the disease maps mentioned above.

This underlines the variability in biological events potentially leading to the development of cancer and metastasis and the need for a more personalised care for patients suffering from complex diseases, such as cancer. It is our hope that this methodology will be repeated on other datasets, diseases and clinical situations as it is one more step towards establishing a true personalised data analysis pipeline.

The clusters that were found in this analysis are interesting hypotheses. They would however require further validation to become clinically useful, as detailed in the replication of findings section above. We encourage other researchers to use our findings in their research towards a cross-validated and clinically useful stratification of ovarian cancer, towards a better and more personalized care.

Conclusion
This article presents an overview of the integrative systems biology analyses developed, performed and validated in the IMI U-BIOPRED and eTRIKS projects, proposing a template for other researchers wishing to perform similar analyses for other diseases. We demonstrate the usefulness of generating hypotheses through a fingerprint/handprint analysis by applying it to a well-studied dataset of ovarian carcinoma, identifying a higher number of robust groups than previously reported, potentially improving our understanding of this disease. Better characterisation of the clusters found in the handprint analyses and validation of the predictive model obtained by machine learning are both ongoing. We believe that handprint analyses, performed on large scale ‘omics datasets, will allow researchers to identify subtypes of disease (phenotypes and endotypes) [34] with greater confidence, providing better diagnosis tools for the clinicians, new avenues for drug development for the pharmaceutical industry and deeper insights into disease mechanisms. To be effective, handprint analyses need to be performed on the same subjects with multiple ‘omics platforms. They suffer from some limitations, such as the decreasing but nevertheless still elevated cost of ‘omics data production and the protocol standardisation requirements to avoid time-consuming data pre-processing, the rather large technical, human resources and expertise requirements to perform the analyses (particularly the machine-learning analysis) or the lack of accurate and independent benchmarking tools to identify the most powerful and/or best-suited method to analyse a particular dataset.

Additional work is therefore needed to make the framework and the analyses proposed here more accessible to a broad audience of health researchers. Efforts of the bioinformatics community are shifting in this direction; for instance, the eTRIKS European project (http://www.etriks.org) or the Galaxy project hosted in the USA (https://galaxyproject.org) mandate the delivery of user-friendly interfaces to advanced bioinformatics resources. Implementation of P4 medicine across the entire health spectrum [122] will be leveraged through promotion of advanced analytical tools available to the larger multidisciplinary community. The methods and results demonstrated in this paper should contribute to paving this promising road.

Additional files
Additional file 1: AUC of consensus clustering. (XLSX 13 kb)
Additional file 2: Complete results of the enrichment analysis between clusters. (XLSX 4293 kb)
Additional file 3: Table S7. Estimated accuracy and standard deviation of the RFE procedure. Table S8. Accuracy and Kappa values of the Random Forest models in the training set. Table S9. Performances values for the Random Forest model in the testing set. Figure S11. Relative importance of the top 20 predictors building the final model of the RF. The importance axis is scaled, with the mRNA expression of CD3D scaled to 100% and the methylation state of POLA2 to 0% (not shown). (DOCX 18 kb)
Additional file 4: DIABLO sPLSDA model results. (DOCX 18966 kb)

Respiratory Biomedical Research Unit, Southampton, UK), Tim Higgenbottam (Allergy Therapeutics, West Sussex, UK), Uruj Hoda (Imperial College, London, UK), Jans Hohlfeld (Fraunhofer ITEM, Hannover, Germany), Cecile Holweg (Genentech, San Francisco, USA), Ildiko Horvath (Semmelweis University, Budapest, Hungary), Peter Howarth (NIHR Southampton Respiratory Biomedical Research Unit, Southampton, UK), Richard Hu (Amgen Inc., Seattle, USA), Sile Hu (Imperial College London, UK), Xugang Hu (Amgen Inc., Seattle, USA), Val Hudson (Asthma UK, London, UK), Anna J. James (Karolinska Institutet, Stockholm, Sweden), Juliette Kamphuis (Longfonds, Amersfoort, The Netherlands), Erika J. Kennington (Asthma UK, London, UK), Dyson Kerry (CromSource, Stirling, UK), Matthias Klüglich (Boehringer Ingelheim Pharma GmbH & Co.
KG, Biberach, Germany), Hugo Knobel (Philips Research Laboratories, Eindhoven, The Netherlands), Richard Knowles (Arachos Pharma, UK), Alan Know (University of Nottingham, UK), Johan Kolmert (Karolinska Institutet, Stockholm, Sweden), Jon Acknowledgements Konradsen (Karolinska Institutet, Stockholm, Sweden), Maxim Kots (Chiesi The U-BIOPRED study group consists of Ian M. Adcock (Imperial College, Pharmaceutical, Parma, Italy), Linn Krueger (University Children’s Hospital, Bern, London, UK), Nora Adriaens (University of Amsterdam, The Netherlands), Has- Switzerland), Norbert Krug (Fraunhofer ITEM, Hannover, Germany), Scott Kuo san Ahmed (EISBM, Lyon, France), Antonios Aliprantis (Merck Research, Bos- (Imperial College, London, UK), Maciej Kupczyk (Karolinska Institutet, Stockholm, ton, USA), Kjell Alving (Uppsala University, Sweden), Charles Auffray (EISBM, Sweden), Bart Lambrecht (University of Gent, Belgium), Ann-Sofie Lantz Lyon, France), Philipp Badorrek (Fraunhofer ITEM, Hannover, Germany), Cor- (Karolinska Institutet, Stockholm, Sweden), Lars Lazarinis (Karolinska Institutet, nelia Faulenbach (Fraunhofer ITEM, Hannover, Germany), Per Bakke (Univer- Stockholm, Sweden), Diane Lefaudeux (EISBM, Lyon, France), Saeeda Lone-Latif sity of Bergen, Norway), David Balgoma (Karolinska Institutet, Stockholm, (University of Amsterdam, The Netherlands), Matthew J. Loza (Janssen R & D, Sweden), Aruna T. Bansal (Acclarogen Ltd. Cambridge, UK), Clair Barber (Uni- Springhouse, USA), Rene Lutter (University of Amsterdam, The Netherlands), Lisa versity of Southampton, UK), Frédéric Baribaud (Janssen R & D, Springhouse, Marouzet (NIHR Southampton Respiratory Biomedical Research Unit, Southampton, USA), An Bautmans (MSD Brussels, Belgium), Annelie F. 
Behndig (Umeå Uni- UK), Jane Martin (NIHR Southampton Respiratory Biomedical Research Unit, versity, Sweden), Elisabeth Bel (University of Amsterdam, The Netherlands), Southampton, UK), Sarah Masefield (European Lung Fondation, Shefield, UK), Jorge Beleta (Almirall S.A., Barcelona, Spain), Ann Berglind (Karolinska Institu- Caroline Mathon (Karolinska Institutet, Stockholm, Sweden), John G. Matthews tet, Stockholm, Sweden), Alix Berton (AstraZeneca, Mölndal, Sweden), Jean- (Genentech, San Francisco, USA), Alexander Mazein (EISBM, Lyon, France), Sally ette Bigler (Amgen Inc., Seattle, USA), Hans Bisgaard, University of Meah (Imperial College, London, UK), Andrea Meiser (Imperial College, London, Copenhagen, Denmark), Grazyna Bochenek, Jagiellonian University, Krakow, UK), Andrew Manzies-Gow (Royal Brompton and Harefield NHS Fondation Trust, Poland), Michael J. Boedigheimer (Amgen Inc., Seattle, USA), Klaus Bønnelykke London, UK), Leanne Metcalf (Asthma UK, London, UK), Roelinde Middelveld (Karo- (University of Copenhagen, Denmark), Joost Brandsma, (University of South- linska Institutet,Stockholm,Sweden),Maria Mikus (Science for Life Laboratory, ampton, UK), Armin Braun (Fraunhofer ITEM, Hannover, Germany), Paul Brink- Stockholm, Sweden), Montse Miralpeix (Almirall, Barcelona, Spain), Philip Monk man (University of Amsterdam, The Netherlands), Dominic Burg (University of (Synairgen Research Ltd., Southampton, UK), Paolo Montuschi (Università Cattolica Southampton, UK), Davide Campagna (University of Catania, Italy), Leon del Sacro Cuore, Rome, Italy), Nadia Mores (Università Cattolica del Sacro Cuore, Carayannopoulos, (MSD, USA), Massimo Caruso (University of Catania, Italy), Rome,Italy), ClareS.Murray(University of Manchester,UK),Jacek Musial Pedro Carvalho da Purificacão Rocha João Pedro (Royal Brompton and Harefield (Jagiellonian University Medical College, Krakow, Poland), David Myles (GSK, UK), NHS Fondation Trust, UK), Amphun Chaiboonchoe (EISBM, Lyon, France), 
Shama Naz (Karolinska Institutet, Stockholm, Sweden), Katja Nething (Boehringer Romanas Chaleckis (Karolinska Institutet, Stockholm, Sweden), Pascal Chanez Ingelheim Pharma GmbH & Co. KG, Biberach, Germany), Ben Nicholas (University (University of Aix Marseille, France), Kiang Fan Chung, Imperial College London, of Southampton, UK), Ulf Nihlen (AstraZeneca, Molndal, Sweden), Peter Nilsson UK), Courtney Coleman (Asthma UK, London, UK), Chris Compton (GSK, UK), (Science for Life Laboratory, Stockholm, Sweden), Björn Nordlund (Karolinska Julie Corfield (Arateva R & D, Nottingham, UK), Arnaldo D’Amico (University of Institutet,Stockholm,Sweden),JörgenÖstling (AstraZeneca,Molndal,Sweden), Rome ‘Tor Vergata’, Rome, Italy), Barbro Dahlén (Karolinska Institutet, Stockholm, Antonio Pacino (Lega Italiano Anti Fumo, Catania, Italy), Laurie Pahus (Aix-Marseille Sweden), Sven-Erik Dahlén (Karolinska Institutet, Stockhlom, Sweden), Jorge De University, Marseille, France), Susanna Palkonen (European Federation of Allergy Alba (Almirall S.A., Barcelona, Spain), Pim de Boer (Londfonds, Amersfoort, The and Airways Diseases Patient’s Associations, Brussels, Belgium), Ioannis Pandis Netherlands), Inge De Lepeleire (MSD, Brussels, Belgium), Bertrand De Meulder (Imperial College London, UK), Stelios Pavlidis (Imperial College London, UK), (EISBM, Lyon, France), Tamara Dekker (University of Amsterdam, The Giorgio Pennazza (University of Rome ‘Tor Vergata’,Rome, Italy),AnnePetrén Netherlands), Ingrid Delin (Karolinska Institutet, Stockholm, Sweden), Patrick (Karolinska Institutet, Stockholm, Sweden), Sandy Pink (NIHR Southampton Dennison (University of Southampton, UK), Annemiek Dijkhuis (University of Respiratory Biomedical Research Unit, Southampton, UK), Anthony Postle Amsterdam, The Netherlands), Ratko Djukanovic (University of Southampton, (University of Southampton, UK), Pippa Powel (European Lung Fondation, Sheffield, UK), Aleksandra Draper (BioSci Consulting, Maasmechelen, 
Belgium), Jessica UK), Malayka Rahman-Amin (Asthma UK, London, UK), Navin Rao (Janssen R & D, Edwards (Asthma UK, London, UK), Rosalia Emma (University of Catania, Italy), La Jolla, USA), Lara Ravanetti (University of Amsterdam, The Netherlands), Emma Magnus Ericsson (Karolinska University Hospital, Stockholm, Sweden), Veit Ray (NIHR Southampton Respiratory Biomedical Research Unit, Southampton, UK), Erpenbeck (Novartis Institutes for Biomedical Research, Basel, Switzerland), Stacey Reinke (Karolinska Institutet, Stockholm, Sweden), Leanne Reynolds (Asthma Damijan Erzen (Boehringer Ingelheim Pharma GmbH & Co. KKKG; Biberach, UK, London, UK), Kathrin Riemann (Boehringer Ingelheim Pharma GmbH & Co. KG, Germany), Klaus Fichtner (Boehringer Ingelheim Pharma GmbH & Co. KKKG; Biberach, Germany), John Riley (GSK, UK), Martine Robberechts (MSD, Brussels, Biberach, Germany), Neil Fitch (BioSci Consulting, Maasmechelen, Belgium), Belgium), Amanda Roberts (Asthma UK, London, UK), Graham Roberts (NIHR Louise J. Fleming (Imperial College London, UK), Breda Flood (Asthma UK, Southampton Respiratory Biomedical Research Unit, Southampton, UK), Christos London, UK), Stephen J. 
Fowler (Manchester Academic Health Sciences Center, Rossios (Imperial College London, UK), Anthony Rowe (Janssen R & D, UK), Kirsty Manchester, UK), Urs Frey (University Children’s Hospital, Basel, Switzerland), Russel (Imperial College London, UK), Michael Rutgers (Longfonds, Amersfoort, The Martina Gahlemann (Boehringer Ingelheim GmbH, Switzerland), Gabriella Galffy Netherlands), Thomas Sandström (Umeå University, Sweden), Giuseppe Santini (Semmelweis University, Budapest, Hungary), Hactor Gallart (Karolinska Institutet, (Università Cattolica del Sacro Cuore, Italy), Marco Santoninco (University of Rome Stockholm, Sweden), Trevor Garret (BioSci Consulting, Maasmechelen, Belgium), ‘Tor Vergata’, Rome, Italy), Corinna Schoelch (Boehringer Ingelheim Pharma GmbH Thomas Geiser (University Hospital Bern, Switzerland), Julaiha Gent (Royal & Co. KG, Biberach, Germany), James P.R. Schofield (University of Southampton, Brompton and Harefield NHS Fondation Trust, London, UK), Maria Gerhardsson de UK), Wolfgang Seibold (Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, Verdier (AstraZeneca Molndal, Sweden), David Gibeon (Imperial College, London, Germany), Dominick E. Shaw (University of Nottingham, UK), Ralf Sigmund UK), Cristina Gomez (Karolinska Institutet, Stockholm, Sweden), Kerry Gove (NIHR (Boehringer Ingelheim Pharma GmbH & Co. 
KG, Biberach, Germany), Florian Singer Southampton Respiratory Biomedical Research Unit and Clinical and Experimental (University Children’s Hospital, Zurich, Switzerland), Marcus Sjödin (Karolinska Sciences, Southampton, UK), Neil Gozzard (UCB, UK), Yi-ke Guo (Imperial College, Institutet,Stockholm,Sweden),PaulJ.Skipp (UniversityofSouthampton, UK), London, UK), Simone Hashimoto (University of Amsterdam, The Netherlands), Barbara Smids (University of Amsterdam, The Netherlands), Caroline Smith (NIHR John Haughney (International Primary Care Respiratory Group, Aberdeen, Southampton Respiratory Biomedical Research Unit, Southampton, UK), Jessica Scotland), Gunilla Hedlin (Karolinska Institutet, Stockholm, Sweden), Pieter-Paul Smith (Asthma UK, London, UK), Katherine M. Smith (University of Nottingham, Hekking (University of Amsterdam, The Netherlands), Elisabeth Henriksson (Karolinska Institutet, Stockholm, Sweden), Lorraine Hewitt (NIHR Southampton UK), Päivi Söderman, Karolinska Institutet, Stockholm, Sweden), Adesimbo De Meulder et al. BMC Systems Biology (2018) 12:60 Page 19 of 23 Sogbesan (Royal Brompton and Harefield NHS Fondation Trust, London, UK), Ana within and contributed to the development of the data analysis plan, as a R. Sousa (GSK, UK), Doroteya Staykova (University of Southampton, UK), Peter J. member of U-BIOPRED and eTRIKS projects. DL contributed to the writing of Sterk (University of Amsterdam, The Netherlands), Karin Strandberg (Karolinska the manuscript, to the planning and the performing of the analyses within Institutet, Stockholm, Sweden), Kai Sun (Imperial College, London, UK), David and contributed to the development of the data analysis plan, as a member Supple(Asthma UK,London,UK), Marton Szentkereszty (Semmelweis University, of U-BIOPRED and eTRIKS projects. 
ATB contributed to the design of the Budapest, Hungary), Lilla Tamasi (Semmelweis University, Budapest, Hungary), analyses presented within along with all statistical concerns during the de- Kamran Tariq (University of Southampton, UK), John-Olof Thörngren (Karolinska velopment of the data analysis plan, as a member of the U-BIOPRED project. University Hospital, Stockholm, Sweden), Bob Thornton (MSD, USA), Jonathan AMaz contributed to the enrichment analysis parts of the manuscript, as a Thorsen (University of Copenhagen, Denmark), Salvatore Valente (Università member of U-BIOPRED and eTRIKS projects. AC contributed to the design of Cattolica del Sacro Cuore, Rome, Italy), Wim van Aalderen (University of the data analysis plans and to the clustering parts of the manuscript as a Amsterdam, The Netherlands), Marianne van de Pol (University of Amsterdam, The Netherlands), Kees van Drunen (University of Amsterdam, The member of the U-BIOPRED project. HA contributed to the design of the data Netherlands), Marleen van Drunen (University of Amsterdam, The analysis plans and to the clustering parts of the manuscript as a member Netherlands), Jenny Versnel (Asthma UK, London, UK), Jorgen Vestbo of the U-BIOPRED project. IB contributed to the enrichment analysis and (Manchester Academic Health Sciences Centre, Manchester, UK), Anton machine-learning parts of the manuscript as a member of the eTRIKS project. Vink (Philips Research Laboratories, Eindhoven, The Netherlands), Nadja MS contributed to the enrichment analysis and machine-learning parts of Vising (University of Copenhagen, Denmark), Christophe von Garnier the manuscript as a member of the eTRIKS project. JP contributed to the data (University Hospital, Bern, Switzerland), Ariane Wagener (University of preparation parts and to the visualisations of the manuscript. 
SB contributed to Amsterdam, The Netherlands), Scott Wagers (BioSci Consulting, Maasmechelen, the design of the data analysis plan and to the clustering, data integration and Belgium), Frans Wald (Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, enrichment analysis parts of the manuscript. NL contributed to the data Germany), Samantha Walker (Asthma UK, London, UK), Jonathan Ward preparation parts of the manuscript. KS contributed to the data managements (University of Southampton, UK), Zsoka Weiszhart (Semmelweis University, aspects of the manuscript as a member of the eTRIKS project. IP contributed to Budapest Hungary), Kristiane Wetzel (Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, Germany), Craig E. Wheelock (Karolinska the data managements aspects of the manuscript as a member of the eTRIKS Institutet, Stockholm, Sweden), Coen Wiegman (Imperial College London, project. XY contributed to the data managements aspects of the manuscript as UK), Siân Williams (International Primary Care Respiratory Group, a member of the eTRIKS project. MB contributed to the data managements and Aberdeen, Scotland), Susan J. Wilson (University of Southampton, UK), clustering aspects of the manuscript as a member of the U-BIOPRED project. Ashley Woodcock (Manchester Academic Health Science Centre, Manchester, KK contributed to the development of the data analysis plan and related parts UK), Xian Yang (Imperial College London, UK), Elizabeth Yeyasingham (GSK, UK), in the manuscript as a member of the U-BIOPRED project. JvE contributed to Wen Yu (Amgen Inc., Seattle, USA), Wilhelm Zetterquist (Karolinska Institutet, the development of the data analysis plan and related parts in the manuscript Stockholm, Sweden), Koos Zwinderman (University of Amsterdam, The as a member of the U-BIOPRED project. AB contributed to the development of Netherlands). 
The eTRIKS consortium members are: Alireza Tamaddoni Nezhad the data analysis plan and related parts in the manuscript as a member of the (Imperial College London, UK), Adriano Barbosa da Silva (University of U-BIOPRED project. TD contributed to the development of the data analysis Luxemburg, Luxemburg), Alexander Mazein (EISBM, Lyon, France), plan and related parts in the manuscript as a member of the U-BIOPRED Andreas Tielmann (Merck), Angela Gaudette (Pfizer), Anna Silberberg project. PD contributed to the development of the data analysis plan and (Pfizer), Antigoni (Anna) Elefsinioti (Bayer), Axel Oehmichen (Imperial College London, UK), Maria Biryukov (University of Luxemburg, Luxemburg), related parts in the manuscript as a member of the U-BIOPRED project. CL Bertrand De Meulder (EISBM, Lyon, France), Jen Birgitte (Lundbeck), Bron Kisler contributed to the development of the data analysis plan and related parts in (CDISC), Anna Maria Carusi, Charles Auffray (EISBM, Lyon, France), Diana O’Malley the manuscript as a member of the U-BIOPRED project. AP contributed to the (Imperial College London, UK), David Henderson (Bayer), Dorina Bratfalean development of the data analysis plan and related parts in the manuscript as a (CDISC), Diane Lefaudeux (EISBM, Lyon, France), Denny Verbeeck (Janssen), Ejner member of the U-BIOPRED project. JC contributed to the development of the Knud Moltzen (Lundbeck), Eva Lindgren (Astra Zeneca), Florian Guitton (Imperial data analysis plan and related parts in the manuscript as a member of the College London, UK), Fabien Richard (EISBM, Lyon, France), Francisco Bonachela U-BIOPRED project. RD contributed to the development of the data analysis Capdevila (Janssen), Ghita Rahal (CNRS, Lyon, France), Heike Dagmar plan and related parts in the manuscript as a member of the U-BIOPRED Schuermann (Sanofi), Ibrahim Emam (Imperial College London, UK), Irina project. 
KFC contributed to the overall design of the study as a member of the Balaur (EISBM, Lyon, France), Ingrid Sofie Harbo (Lundbeck), Jay Bergeron U-BIOPRED project. IMA contributed to the overall design of the study as a (Pfizer), Kai Sun (Imperial College London, UK), Laurence Mazuranok member of the U-BIOPRED project. YG contributed to the data management (Sanofi), Laurence Painell’s (IDBS), Manfred Hendlich (Sanofi), Gino Marchetti (CNRS, Lyon, France), Derek Marren (Lilly), Jaroslav Martasek aspects of the manuscript as a member of the eTRIKS project. PJS contributed (Lilly), Martin Romacker (Roche), Michael Braxenthaler (Roche), Maria to the overall design of the study as a member of the U-BIOPRED project. AMan Manuela Nogueira (EISBM, Lyon, France), Mansoor Saqi (EISBM, Lyon, contributed to the development of the data analysis plan and co-led the France), Neil Fitch (BioSci Consulting), Nesrine Taibi (EISBM, Lyon, France), systems biology work package of the U-BIOPRED project. AR contributed to the Odile Brasier (EISBM, Lyon, France), Paul Agapow (Imperial College development of the data analysis plan and co-led the systems biology work London, UK), Peter Rice (Imperial College London, UK), Paul Houston package of the U-BIOPRED and eTRIKS projects. FB contributed to the (CDISC), Philippe Rocca-Serra (University of Oxford, UK), Reinhard development of the data analysis plan and co-led the systems biology Schneider (University of Luxemburg, Luxemburg), James Rimell (Lilly), work package of the U-BIOPRED project. CA contributed to the overall Stelios Pavlidis (Imperial College London, UK), Susanna-Assunta Sansone design and supervision of the study, to the development of the data (University of Oxford, UK), Sally Miles (Imperial College London, UK), analysis plan and co-led the systems biology work package of the U-BIOPRED Samiul Hasan (GSK), Sascha Herzinger (University of Luxemburg, Luxemburg), project and its extension in the eTRIKS project. 
Scott Wagers (BioSci Consulting), Sikander Hayat (Bayer), Tomas Dalentoft (Astra Zeneca), Vahid Elyasigomari (Imperial College London, UK), Venkata Satagopam (University of Luxemburg, Luxemburg), Wei Gu (University of Luxemburg, Ethics approval and consent to participate Luxemburg), Xian Yang (Imperial College London, UK), Yi-Ke Guo (Imperial Not applicable College London, UK). Consent for publication Funding Not applicable This work was supported through the Innovative Medicines Initiative U-BIOPRED and eTRIKS projects (IMI n°115010 and IMI n°115446 respectively). Competing interests Availability of data and materials ATB received fees from Acclarogen Ltd. KK received fees from UCB Celltech The datasets analysed in this study are available in the NIH National Cancer Ltd. JvE received fees from UCB Pharma S.A. AB received fees from Roche Institute repository (https://portal.gdc.cancer.gov/)[117]. Products Ltd. TD received fees from Janssen R & D High Wycombe Ltd. PD received fees from AstraZeneca Ltd. CL received fees from GSK Ltd. JC Authors’ contributions received fees from Areteva R & D Ltd. AMan received fees from Roche All authors read and approved the final version of the manuscript. BDM Diagnostics GmbH, AR received fees from Janssen R & D High Wycombe wrote the main body of the manuscript, performed the analyses presented Ltd. FB received fees from Janssen R & D Springhouse LLC. De Meulder et al. BMC Systems Biology (2018) 12:60 Page 20 of 23 Publisher’sNote 17. Tweeddale H, Notley-McRobb L, Ferenci T. Effect of slow growth on Springer Nature remains neutral with regard to jurisdictional claims in published metabolism of Escherichia coli, as revealed by global metabolite pool maps and institutional affiliations. (“metabolome”) analysis. J Bacteriol. 1998;180(19):5109–16. 18. Sterk PJ. Towards the Physionomics of asthma and COPD. Copenhagen: Author details European Respiratory Society Annual Congress; 2005. p. 17–21. 
European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, 19. Machado RF, Laskowski D, Deffenderfer O, Burch T, Zheng S, Mazzone PJ, EISBM, 50 Avenue Tony Garnier, 69007 Lyon, France. Acclarogen Ltd, St Mekhail T, Jennings C, Stoller JK, Pyle J, et al. Detection of lung cancer by John’s Innovation Centre, Cambridge CB4 OWS, UK. Data Science Institute, sensor array analyses of exhaled breath. Am J Respir Crit Care Med. 2005; Imperial College, London SW7 2AZ, UK. Janssen Research and Development 171(11):1286–91. Ltd, High Wycombe HP12 4DP, UK. UCB Pharma S.A, 1420 Braine-l’Alleud, 20. Sanchez C, Lachaize C, Janody F, Bellon B, Roder L, Euzenat J, Rechenmann 6 7 Belgium. UCB Celltech, 208 Bath Road, Slough SL13WE, UK. Roche Ltd, F, Jacq B. Grasping at molecular interactions and genetic networks in Welwyn Garden City AL7 1TW, UK. AstraZeneca Ltd, Alderley Park, Drosophila melanogaster using FlyNets, an internet database. Nucleic Acids Macclesfield SK10 4TG, UK. Target Sciences, GlaxoSmithKline, Gunnels Wood Res. 1999;27(1):89–94. Road, Stevenage SG1 2NY, UK. Faculty of Medicine, University of 21. Cesareni G, Ceol A, Gavrila C, Palazzi LM, Persico M, Schneider MV. Southampton, Southampton SO17 1BJ, UK. AstraZeneca R & D, 43150 Comparative interactomics. FEBS Lett. 2005;579(8):1828–33. 12 13 Mölndal, Sweden. Arateva R & D Ltd, Nottingham NG1 1GF, UK. National 22. Mayer B. Bioinformatics for omics data : methods and protocols. New York: Hearth and Lung Institute, Imperial College London, London SW3 6LY, UK. Humana Press; 2011. Department of Respiratory Medicine, Academic Medical Centre, University 23. Mesarovic MD. Case institute of technology. Systems research center.: of Amsterdam, Amsterdam AZ1105, The Netherlands. Research Informatics, systems theory and biology. Proceedings of the 3rd systems symposium at Roche Diagnostics GmbH, 82008 Unterhaching, Germany. Janssen Research case institute of technology. Berlin: Springer; 1968. 
and Development Ltd, Spring House, PA 19002, USA. 24. Noble D. Cardiac action and pacemaker potentials based on the Hodgkin- Huxley equations. Nature. 1960;188:495–7. Received: 20 July 2017 Accepted: 21 February 2018 25. Auffray C, Imbeaud S, Roux-Rouquie M, Hood L. From functional genomics to systems biology: concepts and practices. C R Biol. 2003;326(10–11):879–92. 26. Auffray C, Noble D. Origins of systems biology in William Harvey's masterpiece on the movement of the heart and the blood in animals. Int J References Mol Sci. 2009;10(4):1658–69. 1. Jameson JL, Longo DL. Precision medicine–personalized, problematic, and 27. Auffray C, Nottale L. Scale relativity theory and integrative systems biology: 1. promising. N Engl J Med. 2015;372(23):2229–34. Founding principles and scale laws. Prog Biophys Mol Biol. 2008;97(1):79–114. 2. Chen R, Snyder M. Promise of personalized omics to precision medicine. 28. Davidson EH, Rast JP, Oliveri P, Ransick A, Calestani C, Yuh CH, Minokawa T, Wiley Interdiscip Rev Syst Biol Med. 2013;5(1):73–82. Amore G, Hinman V, Arenas-Mena C, et al. A genomic regulatory network for development. Science. 2002;295(5560):1669–78. 3. Viceconti M, Hunter P, Hose R. Big data, big knowledge: big data for personalized healthcare. IEEE J Biomed Health Inform. 2015;19(4):1209–15. 29. Ideker T, Galitski T, Hood L. A new approach to decoding life: systems 4. Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of biology. Annu Rev Genomics Hum Genet. 2001;2:343–72. integrating data to uncover genotype-phenotype interactions. Nat Rev 30. Kitano H. Looking beyond the details: a rise in system-oriented approaches Genet. 2015;16(2):85–97. in genetics and molecular biology. Curr Genet. 2002;41(1):1–10. 5. Berger B, Gaasterland T, Lengauer T, Orengo C, Gaeta B, Markel S, Valencia 31. Noble D. Modeling the heart–from genes to cells to the whole organ. A. ISCB's initial reaction to the New England journal of medicine editorial on Science. 
2002;295(5560):1678–82. data sharing. PLoS Comput Biol. 2016;12(3):e1004816. 32. Nottale L, Auffray C. Scale relativity theory and integrative systems 6. Longo DL, Drazen JM. Data Sharing. N Engl J Med. 2016;374(3):276–7. biology: 2. Macroscopic quantum-type mechanics. Prog Biophys Mol Biol. 2008;97(1):115–57. 7. Hawkins TL, McKernan KJ, Jacotot LB, MacKenzie JB, Richardson PM, Lander 33. Prokop A, Csukas B. Systems biology - integrative biology and simulation ES. A magnetic attraction to high-throughput genomics. Science. 1997; tools. Dordrecht: Springer; 2013. 276(5320):1887–9. 8. MacKenzie S. High-throughput interpretation of pathways and biology. 34. Anderson GP. Endotyping asthma: new insights into key pathogenic Drug News Perspect. 2001;14(1):54–7. mechanisms in a complex, heterogeneous disease. Lancet. 2008;372(9643): 9. Pietu G, Mariage-Samson R, Fayein NA, Matingou C, Eveno E, Houlgatte R, 1107–19. Decraene C, Vandenbrouck Y, Tahi F, Devignes MD, et al. The Genexpress 35. Auffray C, Chen Z, Hood L. Systems medicine: the future of medical IMAGE knowledge base of the human brain transcriptome: a prototype genomics and healthcare. Gen Med. 2009;1(1):2. integrated resource for functional and computational genomics. Genome 36. Auffray C, Charron D, Hood L. Predictive, preventive, personalized and Res. 1999;9(2):195–209. participatory medicine: back to the future. Gen Med. 2010;2(8):57. 10. Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett DE Jr, 37. Auffray C, Hood L. Editorial: systems biology and personalized medicine - Hieter P, Vogelstein B, Kinzler KW. Characterization of the yeast the future is now. Biotechnol J. 2012;7(8):938–9. transcriptome. Cell. 1997;88(2):243–51. 38. Hood L, Auffray C. Participatory medicine: a driving force for revolutionizing healthcare. Gen Med. 2013;5(12):110. 11. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997; 39. 
Hood L, Balling R, Auffray C. Revolutionizing medicine in the 21st century 278(5338):680–6. through systems approaches. Biotechnol J. 2012;7(8):992–1001. 12. Wilkins MR, Pasquali C, Appel RD, Ou K, Golaz O, Sanchez JC, Yan JX, Gooley 40. Sobradillo P, Pozo F, Agusti A. P4 medicine: the future around the corner. AA, Hughes G, Humphery-Smith I, et al. From proteins to proteomes: large Arch Bronconeumol. 2011;47(1):35–40. scale protein identification by two-dimensional electrophoresis and amino 41. Wolkenhauer O, Auffray C, Jaster R, Steinhoff G, Dammann O. The acid analysis. Biotechnology (N Y). 1996;14(1):61–5. road from systems biology to systems medicine. Pediatr Res. 2013; 13. James P. Protein identification in the post-genome era: the rapid rise of 73(4 Pt 2):502–7. proteomics. Q Rev Biophys. 1997;30(4):279–331. 42. Leek JT,Scharpf RB,Bravo HC,SimchaD, LangmeadB,Johnson WE, 14. Kishimoto K, Urade R, Ogawa T, Moriyama T. Nondestructive quantification Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical of neutral lipids by thin-layer chromatography and laser-fluorescent impact of batch effects in high-throughput data. Nat Rev Genet. 2010; scanning: suitable methods for “lipidome” analysis. Biochem Biophys Res 11(10):733–9. Commun. 2001;281(3):657–62. 43. McDonald JH. Handbook of biological statistics. 3rd ed. Baltimore: Sparky 15. Han X, Gross RW. Global analyses of cellular lipidomes directly from crude House Publishing; 2014. extracts of biological samples by ESI mass spectrometry: a bridge to 44. Lapatas V, Stefanidakis M, Jimenez RC, Via A, Schneider MV. Data integration lipidomics. J Lipid Res. 2003;44(6):1071–9. in biological research: an overview. J Biol Res Thessalon. 2015;22:1–16. 16. Oliver SG, Winson MK, Kell DB, Baganz F. Systematic functional analysis of 45. Rhee SY, Wood V, Dolinski K, Draghici S. Use and misuse of the gene the yeast genome. Trends Biotechnol. 1998;16(9):373–8. ontology annotations. Nat Rev Genet. 2008;9(7):509–15. 
46. Reimand J, Arak T, Vilo J. G:profiler–a web server for functional interpretation of gene lists (2011 update). Nucleic Acids Res. 2011;39(Web Server issue):W307–15.
47. Fujita KA, Ostaszewski M, Matsuoka Y, Ghosh S, Glaab E, Trefois C, Crespo I, Perumal TM, Jurkowski W, Antony PM, et al. Integrating pathways of Parkinson's disease in a molecular interaction map. Mol Neurobiol. 2014;49(1):88–102.
48. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015;43(Database issue):D447–52.
49. Vallabhajosyula RR, Raval A. Computational modeling in systems biology. Methods Mol Biol. 2010;662:97–120.
50. Kuperstein I, Bonnet E, Nguyen HA, Cohen D, Viara E, Grieco L, Fourquet S, Calzone L, Russo C, Kondratova M, et al. Atlas of cancer Signalling network: a systems biology resource for integrative analysis of cancer data with Google maps. Oncogene. 2015;4:e160.
51. Mizuno S, Iijima R, Ogishima S, Kikuchi M, Matsuoka Y, Ghosh S, Miyamoto T, Miyashita A, Kuwano R, Tanaka H. AlzPathway: a comprehensive map of signaling pathways of Alzheimer's disease. BMC Syst Biol. 2012;6:52.
52. Ogishima S, Mizuno S, Kikuchi M, Miyashita A, Kuwano R, Tanaka H, Nakaya J. AlzPathway, an updated map of curated signaling pathways: towards deciphering Alzheimer's disease pathogenesis. Methods Mol Biol. 2016;1303:423–32.
53. Zhao S, Iyengar R. Systems pharmacology: network analysis to identify multiscale mechanisms of drug action. Annu Rev Pharmacol Toxicol. 2012;52:505–21.
54. Bigler J, Hu X, Boedigheimer M, Rowe A, Chung F, Djukanovic R, Sousa A, Corfield J, Adcock I, Sterk P, et al. Whole transcriptome analysis in peripheral blood from asthmatic and healthy subjects in the U-BIOPRED study. Eur Respir J. 2014;44(Suppl 58):2027.
55. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibe-Kains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–7.
56. Auffray C, Balling R, Barroso I, Bencze L, Benson M, Bergeron J, Bernal-Delgado E, Blomberg N, Bock C, Conesa A, et al. Making sense of big data in health research: towards an EU action plan. Gen Med. 2016;8(1):71.
57. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.
58. van der Kloet FM, Bobeldijk I, Verheij ER, Jellema RH. Analytical error reduction using single point calibration for accurate and precise metabolomic phenotyping. J Proteome Res. 2009;8(11):5132–41.
59. Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press; 2007.
60. Guo Y, Graber A, McBurney RN, Balasubramanian R. Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms. BMC Bioinformatics. 2010;11:447.
neuropathic pain and tissue embryological classes. Bioinformatics. 2010;26(18):i531–9.
73. Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized mutual information feature selection. IEEE Trans Neural Netw. 2009;20(2):189–201.
74. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
75. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.
76. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.
77. Ronan T, Qi Z, Naegle KM. Avoiding common pitfalls when clustering biological data. Sci Signal. 2016;9(432):re6.
78. Shirkhorshidi AS, Aghabozorgi S, Teh YW, Herawan T. Big Data Clustering: A Review. Computational Science and Its Applications. 2014;8583:707–20.
79. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010;26(12):1572–3.
80. Caruana R, Elhawary M, Nguyen N, Smith C. Meta clustering. IEEE Data Mining. 2006:107–18.
81. Shen R, Mo Q, Schultz N, Seshan VE, Olshen AB, Huse J, Ladanyi M, Sander C. Integrative subtype discovery in glioblastoma using iCluster. PLoS One. 2012;7(4):e35236.
82. Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012;28(24):3290–7.
83. Yuan Y, Savage RS, Markowetz F. Patient-specific data fusion defines prognostic cancer subtypes. PLoS Comput Biol. 2011;7(10):e1002227.
84. Bersanelli M, Mosca E, Remondini D, Giampieri E, Sala C, Castellani G, Milanesi L. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics. 2016;17(Suppl 2):15.
85. Benjamini Y, Hochberg Y. Controlling the false discovery rate - a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 1995;57(1):289–300.
86. Noble WS. How does multiple testing correction work? Nat Biotechnol. 2009;27(12):1135–7.
87. Xie J, Cai TT, Maris J, Li H. Optimal false discovery rate control for dependent data. Stat Interface. 2011;4(4):417–30.
88. Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol. 1995;48(12):1503–10.
89. Auffray C. Sharing knowledge: a new frontier for public-private partnerships in medicine. Genome Med. 2009;1(3):29.
90. Lindpaintner K. Biomarkers: call on industry to share. Nature. 2011;470(7333):175.
91. McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams PM, Mesirov JP, Polley MY, Kim KY, Tricoli JV, et al. Criteria for the use of
61. Michiels S, Kramar A, Koscielny S.
Multidimensionality of microarrays: omics-based predictors in clinical trials: explanation and elaboration. BMC statistical challenges and (im) possible solutions. Mol Oncol. 2011;5(2):190–6. Med. 2013;11:220. 62. Lee JA, Verleysen M. Nonlinear dimensionality reduction. New York: 92. McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams Springer; 2007. PM, Mesirov JP, Polley MY, Kim KY, Tricoli JV, et al. Criteria for the use of 63. Calza S, Raffelsberger W, Ploner A, Sahel J, Leveillard T, Pawitan Y. Filtering omics-based predictors in clinical trials. Nature. 2013;502(7471):317–20. genes to improve sensitivity in oligonucleotide microarray data analysis. 93. Poste G. Bring on the biomarkers. Nature. 2011;469(7329):156–7. Nucleic Acids Res. 2007;35(16):e102. 94. Sung J, Wang Y, Chandrasekaran S, Witten DM, Price ND. Molecular 64. Stanberry L, Mias GI, Haynes W, Higdon R, Snyder M, Kolker E. Integrative signatures from omics data: from chaos to consensus. Biotechnol J. 2012; analysis of longitudinal metabolomics data from a personal multi-omics 7(8):946–57. profile. Meta. 2013;3(3):741–60. 95. Altman DG, Vergouwe Y, Royston P, Moons KG. Prognosis and prognostic 65. Ideker T, Dutkowski J, Hood L. Boosting signal-to-noise in complex biology: research: validating a prognostic model. BMJ. 2009;338:b605. prior knowledge is power. Cell. 2011;144(6):860–3. 96. Hemingway H, Riley RD, Altman DG. Ten steps towards improving 66. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation prognosis research. BMJ. 2009;339:b4184. network analysis. BMC Bioinformatics. 2008;9:559. 97. Jin L, Zuo XY, Su WY, Zhao XL, Yuan MQ, Han LZ, Zhao X, Chen YD, Rao SQ. 67. Varshavsky R, Gottlieb A, Linial M, Horn D. Novel unsupervised feature Pathway-based analysis tools for complex diseases: a review. Genomics filtering of biological data. Bioinformatics. 2006;22(14):e507–13. Proteomics Bioinformatics. 2014;12(5):210–20. 68. Bonev B, Escolano F, Cazorla MA. 
A novel information theory method for 98. Khatri P, Draghici S. Ontological analysis of gene expression data: current filter feature selection. Lect Notes Artif Int. 2007;4827:431–40. tools, limitations, and open problems. Bioinformatics. 2005;21(18):3587–95. 69. Meyer PE. The rank Minrelation coefficient. Qual Technol Quant M. 2014; 99. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches 11(1):61–70. and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375. 70. Scardoni G, Petterlini M, Laudanna C. Analyzing biological network 100. Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, Caudy M, Garapati P, parameters with CentiScaPe. Bioinformatics. 2009;25(21):2857–9. Gillespie M, Kamdar MR, et al. The Reactome pathway knowledgebase. 71. Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T. Cytoscape 2.8: new features Nucleic Acids Res. 2014;42(Database issue):D472–7. for data integration and network visualization. Bioinformatics. 2011;27(3):431–2. 101. Milacic M, Haw R, Rothfels K, Wu G, Croft D, Hermjakob H, D'Eustachio P, 72. Cannistraci CV, Ravasi T, Montevecchi FM, Ideker T, Alessio M. Nonlinear Stein L. Annotating cancer variants and anti-cancer therapeutics in dimension reduction and clustering by minimum Curvilinearity unfold reactome. Cancers. 2012;4(4):1180–211. De Meulder et al. BMC Systems Biology (2018) 12:60 Page 22 of 23 102. Mizuno S, Ogishima S, Kitatani K, Kikuchi M, Tanaka H, Yaegashi N, Nakaya J. biogenesis genes may influence epithelial ovarian cancer risk. Cancer Network analysis of a comprehensive knowledge repository reveals a dual Epidemiol Biomark Prev. 2011;20(6):1131–45. role for ceramide in alzheimer's disease. PlosOne 2016;11(2):e0148431. 127. Nakano H, Yamada Y, Miyazawa T, Yoshida T. Gain-of-function microRNA 103. 
Lefaudeux D, De Meulder B, Loza MJ, Peffer N, Rowe A, Baribaud F, Bansal screens identify miR-193a regulating proliferation and apoptosis in epithelial AT, Lutter R, Sousa AR, Corfield J, et al. U-BIOPRED clinical adult asthma ovarian cancer cells. Int J Oncol. 2013;42(6):1875–82. clusters linked to a subset of sputum -omics. J Allergy Clin Immunol. 2016; 128. Archer MC. Role of sp transcription factors in the regulation of cancer cell In press metabolism. Genes Cancer. 2011;2(7):712–9. 104. Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000;100(1):57–70. 129. Li Y, Yao L, Liu F, Hong J, Chen L, Zhang B, Zhang W. Characterization of 105. Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. microRNA expression in serous ovarian carcinoma. Int J Mol Med. 2014;34(2):491–8. 2011;144(5):646–74. 130. Hein S, Mahner S, Kanowski C, Loning T, Janicke F, Milde-Langosch K. 106. Bast RC Jr, Hennessy B, Mills GB. The biology of ovarian cancer: new Expression of Jun and Fos proteins in ovarian tumors of different malignant opportunities for translation. Nat Rev Cancer. 2009;9(6):415–28. potential and in ovarian cancer cell lines. Oncol Rep. 2009;22(1):177–83. 107. Angermueller C, Parnamaa T, Parts L, Stegle O. Deep learning for 131. Wang JX, Zeng Q, Chen L, Du JC, Yan XL, Yuan HF, Zhai C, Zhou JN, Jia YL, computational biology. Mol Syst Biol. 2016;12(7):878. Yue W, et al. SPINDLIN1 promotes cancer cell proliferation through 108. Sommer C, Gerlich DW. Machine learning in cell biology - teaching activation of WNT/TCF-4 signaling. Mol Cancer Res. 2012;10(3):326–35. computers to recognize phenotypes. J Cell Sci. 2013;126(Pt 24):5529–39. 132. Sundfeldt K, Ivarsson K, Carlsson M, Enerback S, Janson PO, Brannstrom M, 109. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A. Caret: Hedin L. The expression of CCAAT/enhancer binding protein (C/EBP) in the classification and regression training, vol. 5; 2012. p. 15–044. 
human ovary in vivo: specific increase in C/EBPbeta during epithelial 110. Le Cao KA, Gonzalez I, Dejean S. Integromics: an R package to unravel tumour progression. Br J Cancer. 1999;79(7–8):1240–8. relationships between two omics datasets. Bioinformatics. 2009;25(21):2855–6. 133. He L, Guo L, Vathipadiekal V, Sergent PA, Growdon WB, Engler DA, Rueda 111. Le Cao KA, Rohart F4, Gonzalez I, Dejean S, Gautier B, Bartolo F, Monget P, BR, Birrer MJ, Orsulic S, Mohapatra G. Identification of LMX1B as a novel Coquery J, Yao FBL. mixOmics: omics data integration project: R package oncogene in human ovarian cancer. Oncogene. 2014;33(33):4226–35. version; 2016. p. 6.1.1. 134. White NM, Chow TF, Mejia-Guerrero S, Diamandis M, Rofael Y, Faragalla H, 112. Singh ABG, Shannon C, Vacher M, Rohart F, Tebutt S, Le Cao KA. DIABLO - Mankaruous M, Gabril M, Girgis A, Yousef GM. Three dysregulated miRNAs an integrative, multi-omics, multivariate method for multi-group control kallikrein 10 expression and cell proliferation in ovarian cancer. Br J classification: bioRxiv; 2016. Cancer. 2010;102(8):1244–53. 113. Lum PY, Singh G, Lehman A, Ishkanov T, Vejdemo-Johansson M, Alagappan 135. Downie D, McFadyen MC, Rooney PH, Cruickshank ME, Parkin DE, Miller ID, M, Carlsson J, Carlsson G. Extracting insights from the shape of complex Telfer C, Melvin WT, Murray GI. Profiling cytochrome P450 expression in data using topology. Sci Rep. 2013;3:1236. ovarian cancer: identification of prognostic markers. Clin Cancer Res. 2005; 114. Gevaert O, Villalobos V, Sikic BI, Plevritis SK. Identification of ovarian cancer 11(20):7369–75. driver genes by using module network integration of multi-omics data. 136. Gambineri A, Tomassoni F, Munarini A, Stimson RH, Mioni R, Pagotto U, Interface Focus. 2013;3(4):20130013. Chapman KE, Andrew R, Mantovani V, Pasquali R, et al. A combination of 115. 
Jin N, Wu H, Miao Z, Huang Y, Hu Y, Bi X, Wu D, Qian K, Wang L, Wang C, et polymorphisms in HSD11B1 associates with in vivo 11{beta}-HSD1 activity al. Network-based survival-associated module biomarker and its crosstalk and metabolic syndrome in women with and without polycystic ovary with cell death genes in ovarian cancer. Sci Rep. 2015;5:11566. syndrome. Eur J Endocrinol. 2011;165(2):283–92. 116. Kim D, Joung JG, Sohn KA, Shin H, Park YR, Ritchie MD, Kim JH. Knowledge 137. Howells REJ, Dhar KK, Hoban PR, Jones PW, Fryer AA, Redman CWE, Strange boosting: a graph-based integration approach with multi-omics data and RC. Association between glutathione-S-transferase GSTP1 genotypes, GSTP1 genomic knowledge for cancer clinical outcome prediction. J Am Med over-expression, and outcome in epithelial ovarian cancer. Int J Gynecol Inform Assoc. 2015;22(1):109–20. Cancer. 2004;14(2):242–50. 117. Network TCGAR. Integrated genomic analyses of ovarian carcinoma. Nature. 138. Cao J, Cai J, Huang D, Han Q, Yang Q, Li T, Ding H, Wang Z. miR-335 2011;474(7353):609–15. represents an invasion suppressor gene in ovarian cancer by targeting Bcl- 118. Zhang Q, Burdette JE, Wang JP. Integrative network analysis of TCGA data w. Oncol Rep. 2013;30(2):701–6. for ovarian cancer. BMC Syst Biol. 2014;8:1338. 139. Tsai SJ, Hwang JM, Hsieh SC, Ying TH, Hsieh YH. Overexpression of myeloid 119. Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett zinc finger 1 suppresses matrix metalloproteinase-2 expression and reduces MK, Etemadmoghadam D, Locandro B, et al. Novel molecular subtypes of invasiveness of SiHa human cervical cancer cells. Biochem Bioph Res Co. serous and endometrioid ovarian cancer linked to clinical outcome. Clin 2012;425(2):462–7. Cancer Res. 2008;14(16):5198–208. 140. Nie LY, Lu QT, Li WH, Yang N, Dongol S, Zhang X, Jiang J. Sterol regulatory 120. Brunet JP, Tamayo P, Golub TR, Mesirov JP. 
Metagenes and molecular element-binding protein 1 is required for ovarian tumor growth. Oncol Rep. pattern discovery using matrix factorization. P Natl Acad Sci USA. 2004; 2013;30(3):1346–54. 101(12):4164–9. 141. Odegaard E, Staff AC, Kaern J, Florenes VA, Kopolovic J, Trope CG, Abeler VM, Reich R, Davidson B. The AP-2gamma transcription factor is upregulated in 121. Paparrizos J, Gravano L. K-shape: efficient and accurate clustering of time advanced-stage ovarian carcinoma. Gynecol Oncol. 2006;100(3):462–8. series in: SIGMOD international conference on Management of Data: June 4, 2015. Melbourne: Australia: Edited by ACM; 2015. p. 1855–70. 142. Hudson LG, Zeineldin R, Silberberg M, Stack MS. Activated epidermal 122. Sagner M, McNeil A, Puska P, Auffray C, Price ND, Hood L, Lavie CJ, Han ZG, growth factor receptor in ovarian cancer. Cancer Treat Res. 2009;149:203–26. Chen Z, Brahmachari SK, et al. The P4 health Spectrum - a predictive, 143. Landskron J, Helland O, Torgersen KM, Aandahl EM, Gjertsen BT, Bjorge L, preventive, personalized and participatory continuum for promoting Tasken K. Activated regulatory and memory T-cells accumulate in malignant Healthspan. Prog Cardiovasc Dis. 2017;59(5):506–21. ascites from ovarian carcinoma patients. Cancer Immunol Immunother. 123. Reimer D, Sadr S, Wiedemair A, Goebel G, Concin N, Hofstetter G, 2015;64(3):337–47. Marth C, Zeimet AG. Expression of the E2F family of transcription 144. Gavalas NG, Karadimou A, Dimopoulos MA, Bamias A. Immune response factors and its clinical relevance in ovarian cancer. Ann N Y Acad Sci. in ovarian cancer: how is the immune system involved in prognosis 2006;1091:270–81. and therapy: potential for treatment utilization. Clin Dev Immunol. 2010; 124. Xanthoulis A, Tiniakos DG. E2F transcription factors and digestive system 2010:791603. malignancies: how much do we know? World J Gastroenterol. 2013;19(21): 145. Carlsten M, Norell H, Bryceson YT, Poschke I, Schedvins K, Ljunggren HG, 3189–98. 
Kiessling R, Malmberg KJ. Primary human tumor cells expressing CD155 125. Miyata K, Yotsumoto F, Nam SO, Odawara T, Manabe S, Ishikawa T, Itamochi impair tumor targeting by down-regulating DNAM-1 on NK cells. J H, Kigawa J, Takada S, Asahara H, et al. Contribution of transcription factor, Immunol. 2009;183(8):4921–30. SP1, to the promotion of HB-EGF expression in defense mechanism against 146. Bellone S, Siegel ER, Cocco E, Cargnelutti M, Silasi DA, Azodi M, Schwartz PE, the treatment of irinotecan in ovarian clear cell carcinoma. Cancer Med. Rutherford TJ, Pecorelli S, Santin AD. Overexpression of epithelial cell 2014;3(5):1159–69. adhesion molecule in primary, metastatic, and recurrent/chemotherapy- 126. Permuth-Wey J, Chen YA, Tsai YY, Chen Z, Qu X, Lancaster JM, Stockwell H, resistant epithelial ovarian cancer: implications for epithelial cell adhesion Dagne G, Iversen E, Risch H, et al. Inherited variants in mitochondrial molecule-specific immunotherapy. Int J Gynecol Cancer. 2009;19(5):860–6. De Meulder et al. BMC Systems Biology (2018) 12:60 Page 23 of 23 147. Szkandera J, Kiesslich T, Haybaeck J, Gerger A, Pichler M. Hedgehog signaling pathway in ovarian cancer. Int J Mol Sci. 2013;14(1):1179–96. 148. Feng Q, Deftereos G, Hawes SE, Stern JE, Willner JB, Swisher EM, Xi L, Drescher C, Urban N, Kiviat N. DNA hypermethylation, Her-2/neu overexpression and p53 mutations in ovarian carcinoma. Gynecol Oncol. 2008;111(2):320–9. 149. Clarke B, Tinker AV, Lee CH, Subramanian S, van de Rijn M, Turbin D, Kalloger S, Han G, Ceballos K, Cadungog MG, et al. Intraepithelial T cells and prognosis in ovarian carcinoma: novel associations with stage, tumor type, and BRCA1 loss. Mod Pathol. 2009;22(3):393–402. 150. Powell CB, Manning K, Collins JL. Interferon-alpha (IFN alpha) induces a cytolytic mechanism in ovarian carcinoma cells through a protein kinase C-dependent pathway. Gynecol Oncol. 1993;50(2):208–14. 151. Adham SA, Sher I, Coomber BL. 
Molecular blockade of VEGFR2 in human epithelial ovarian carcinoma cells. Lab Investig. 2010;90(5):709–23. 152. Chen H, Ye D, Xie X, Chen B, Lu W. VEGF, VEGFRs expressions and activated STATs in ovarian epithelial carcinoma. Gynecol Oncol. 2004;94(3):630–5. 153. Chen Q, Gao G, Luo S. Hedgehog signaling pathway and ovarian cancer. Chin J Cancer Res. 2013;25(3):346–53. 154. Darb-Esfahani S, Sinn BV, Weichert W, Budczies J, Lehmann A, Noske A, Buckendahl AC, Muller BM, Sehouli J, Koensgen D, et al. Expression of classical NF-kappaB pathway effectors in human ovarian carcinoma. Histopathology. 2010;56(6):727–39. 155. Wang H, Xie X, Lu WG, Ye DF, Chen HZ, Li X, Cheng Q. Ovarian carcinoma cells inhibit T cell proliferation: suppression of IL-2 receptor beta and gamma expression and their JAK-STAT signaling pathway. Life Sci. 2004; 74(14):1739–49. 156. Hurst JH, Hooks SB. Regulator of G-protein signaling (RGS) proteins in cancer biology. Biochem Pharmacol. 2009;78(10):1289–97. 157. Leung PC, Choi JH. Endocrine signaling in ovarian surface epithelium and cancer. Hum Reprod Update. 2007;13(2):143–62. 158. Townsend KN, Spowart JE, Huwait H, Eshragh S, West NR, Elrick MA, Kalloger SE, Anglesio M, Watson PH, Huntsman DG, et al. Markers of T cell infiltration and function associate with favorable outcome in vascularized high-grade serous ovarian carcinoma. PLoS One. 2013;8(12):e82406. 159. Matassa DS, Amoroso MR, Lu H, Avolio R, Arzeni D, Procaccini C, Faicchia D, Maddalena F, Simeon V, Agliarulo I, et al. Oxidative metabolism drives inflammation-induced platinum resistance in human ovarian cancer. Cell Death Differ. 2016; 160. Corney DC, Flesken-Nikitin A, Choi J, Nikitin AY. Role of p53 and Rb in ovarian cancer. Adv Exp Med Biol. 2008;622:99–117. 161. Sampath J, Long PR, Shepard RL, Xia X, Devanarayan V, Sandusky GE, Perry WL 3rd, Dantzig AH, Williamson M, Rolfe M, et al. 
Human SPF45, a splicing factor, has limited expression in normal tissues, is overexpressed in many tumors, and can confer a multidrug-resistant phenotype to cells. Am J Pathol. 2003;163(5):1781–90. 162. Daponte A, Ioannou M, Mylonis I, Simos G, Minas M, Messinis IE, Koukoulis G. Prognostic significance of hypoxia-inducible factor 1 alpha (HIF-1 alpha) expression in serous ovarian cancer: an immunohistochemical study. BMC Cancer. 2008;8:335. 163. Kim JH, Karnovsky A, Mahavisno V, Weymouth T, Pande M, Dolinoy DC, Rozek LS, Sartor MA. LRpath analysis reveals common pathways dysregulated via DNA methylation across cancer types. BMC Genomics. 2012;13:526. 164. Ye J, Livergood RS, Peng G. The role and regulation of human Th17 cells in tumor immunity. Am J Pathol. 2013;182(1):10–20. 165. Leung CS, Yeung TL, Yip KP, Pradeep S, Balasubramanian L, Liu J, Wong KK, Mangala LS, Armaiz-Pena GN, Lopez-Berestein G, et al. Calcium-dependent FAK/CREB/TNNC1 signalling mediates the effect of stromal MFAP5 on ovarian cancer metastatic potential. Nat Commun. 2014;5:5092. 166. Lengyel E. Ovarian cancer development and metastasis. Am J Pathol. 2010;177(3):1053–64. 167. Frede J, Fraser SP, Oskay-Ozcelik G, Hong Y, Ioana Braicu E, Sehouli J, Gabra H, Djamgoz MB. Ovarian cancer: ion channel and aquaporin expression as novel targets of clinical potential. Eur J Cancer. 2013;49(10):2331–44. 168. Bigler J, Boedigheimer M, Schofield JPR, Skipp PJ, Corfield J, Rowe A, Sousa AR, Timour M, Twehues L, Hu X, et al. A severe asthma disease signature from gene expression profiling of peripheral blood from U-BIOPRED cohorts. Am J Respir Crit Care Med. 2017;195(10):1311–20.
BMC Systems Biology
Springer Journals
Publisher
BioMed Central
Copyright
Copyright © 2018 by The Author(s).
Subject
Life Sciences; Bioinformatics; Systems Biology; Simulation and Modeling; Computational Biology/Bioinformatics; Physiological, Cellular and Medical Topics; Algorithms
eISSN
1752-0509
D.O.I.
10.1186/s12918-018-0556-z

Abstract

Background: Multilevel data integration is becoming a major area of research in systems biology. Within this area, multi-‘omics datasets on complex diseases are becoming more readily available and there is a need to set standards and good practices for integrated analysis of biological, clinical and environmental data. We present a framework to plan and generate single and multi-‘omics signatures of disease states.

Methods: The framework is divided into four major steps: dataset subsetting, feature filtering, ‘omics-based clustering and biomarker identification.

Results: We illustrate the usefulness of this framework by identifying potential patient clusters based on integrated multi-‘omics signatures in a publicly available ovarian cystadenocarcinoma dataset. The analysis generated a higher number of stable and clinically relevant clusters than previously reported, and enabled the generation of predictive models of patient outcomes.

Conclusions: This framework will help health researchers plan and perform multi-‘omics big data analyses to generate hypotheses and make sense of their rich, diverse and ever growing datasets, to enable implementation of translational P4 medicine.

Keywords: Molecular signatures, ‘Omics data, Stratification, Systems medicine

* Correspondence: bdemeulder@eisbm.org; cauffray@eisbm.org
Bertrand De Meulder and Diane Lefaudeux contributed equally to this work.
European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007 Lyon, France
Full list of author information is available at the end of the article
© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Since the early days of medicine, practitioners have always combined their observations from patient examinations with their medical knowledge and experience to diagnose medical conditions and find treatments tailored to the patient [1]. Nowadays, this rationale includes the integration of molecular, clinical, imaging information and other data sources to inform diagnosis and prognosis [2] or, in other words, personalised medicine.

Various data integration methods developed through systems biology and computer science are now available to researchers. These methods aim at bridging the gap between the vast amounts of data generated in an ever-cheaper way [3] and our understanding of biology reflecting the complexity of biological systems [4]. Promises of data integration are the reduced cost of clinical trials, better statistical power, more accurate hypothesis generation and ultimately, individualised and cheaper healthcare [2].

However, a lack of communication exists between the fields of clinical medicine and systems biology, bioinformatics and biostatistics, as suggested by the reluctance towards, or distrust of, recent developments in personalised medicine within the medical community [1, 5, 6]. To address this issue, we developed a computational/analysis framework that aims at facilitating communication between healthcare professionals, computational biologists and bioinformaticians.

Among several ways of integrating data across biological levels, one of the components is multi-omics data integration. The identification of molecular signatures has been a focus of the biology and bioinformatics communities for over three decades. Early studies focused on a small number of molecules, paving the way for larger studies, eventually supporting the emergence of the ‘omics’ concept in the late 1990s, starting with ‘genomics’ [7, 8]. Owing to both technical and biological advances, many classes of molecules have been studied by ‘omics technologies such as transcriptomics [9–11], proteomics [12, 13], lipidomics [14, 15], metabolomics (first mentioned in [16, 17]), the composition of the exhaled breath by breathomics (first mentioned in [18]) [19], and interactomics [20, 21], among others.

Consequently, bioinformatics tools have been developed to analyse this new wealth of biological data, as reviewed in [22]. The concept of systems biology was developed first in the 1960s [23, 24] to study biological organisms as complete and complex systems, integrating various sources of information (phenotypic data, molecular data, etc.) in combination with pathway/network analysis and mathematical modelling [25–33]. These systems approaches are highly suitable for the discovery of disease phenotypes (based on empirical recognition of observed characteristics) and so-called endotypes (capturing complex causative mechanisms in disease) [34]. The logical next step was to apply systems biology tools to improve clinical diagnosis, refine the endotypes leading to diseases, develop a comprehensive approach to the human body and assess an individual’s health in light of their ‘omics status. In this way the ‘systems medicine’ concept was born [35–41]. The systems medicine rationale is outlined in Fig. 1.

Fig. 1 Outline of the Systems Medicine rationale. Represented in orange are the steps linked to quality data production, followed by curation in grey, identification of interesting features through statistical analysis in blue and hypothesis generation and their validation in green. Modelling and knowledge representation methods can inform the hypotheses generated through statistical analysis or generate hypotheses on their own (in purple). Outputs of this exercise are represented in red: drug repurposing, new drugs and improved diagnostics, with the help of clinical trials

Any meaningful experiment relies on a robust, bias-controlled study design [42] using appropriate technologies, leading to the production of trustworthy quality-checked data. Data curation then aims at organising, annotating, integrating and preserving data from various sources for reuse and further integration. The next step is to identify relevant molecular features using statistical evidence. A tremendous and constantly growing number of methods is available for this purpose, making the process of method selection a crucial and challenging task. We provide some guidelines here but recommend that the reader turns to specialised reviews (such as [43]) for more insights on the relevance and appropriateness of individual methods. Once features are statistically selected, their annotation is required to interpret results and produce a single ‘omic signature. Annotation is a complex task that links identifiers from the technological platforms to existing entities (i.e. genes, peptides, metabolites, lipids, etc.) [44, 45]. If the data permit, information from several ‘omics platforms is integrated into multi-‘omics signatures. Single and multi-‘omics signatures ultimately serve to identify molecular mechanisms driving pathobiology.

Contextualisation of signatures with existing knowledge is now standard practice (e.g. ontology, enrichment and pathway analysis [46]), or performed with more advanced tools for data integration and visualisation such as a disease map [47]. Exploratory analysis using network-based information is valuable, with tools such as the STRING database [48], among many others. Hypotheses can then be formulated and tested in two ways: with external datasets and/or new experiments, or by modelling and knowledge representation (see review in [49] and disease map examples in [47, 50–52]). With the help of systems pharmacology (see [53]), outcomes of this whole exercise are enabling: (i) identification of new potential drug targets associated with newly identified patient clusters, (ii) elucidation of potential biomarkers for diagnosis, (iii) repurposing of existing drugs and, ultimately, (iv) changes in diagnostic processes and development of new drugs and treatments for disease management. The key step in the systems medicine process is pattern recognition, for which a robust and step-wise framework is required.

Definitions

Our article focuses on the identification of disease mechanisms through statistical analysis of raw data, annotation with up-to-date ontologies to generate fingerprints (biomarker signatures derived from data collected from a single technical platform), handprints (biomarker signatures derived from data collected within multiple technical platforms, either by fusion of multiple fingerprints or by direct integration of several data types) and interpretation on a pathway level to identify disease-driving mechanisms.

One way to better define the different endotypes is to generate molecular fingerprints (e.g. blood cell transcriptomics analysis yields genes differentially expressed between clinical populations [54]) and handprints (e.g. mRNA expression, DNA methylation and miRNA expression data fused to generate clusters of cancer patients [55]). The latter can be combined to study patients e.g. at the ‘blood biological compartment’ level, and linked with specific disease markers to better define the underlying biology, hence providing new avenues for therapy.

Despite the wealth of ‘omics analyses, little consensus exists on which statistical or bioinformatics methods to apply to each type of data set, nor on the ‘best’ integrative methods for their combined analysis (although standards exist for some data types, see [22]). Here, we present a generic framework to perform statistical and bioinformatics analyses of ‘omics measurements, starting from raw data management to multi-platform data integration, pathway and network modelling, that has been adopted by the Innovative Medicines Initiative (IMI) U-BIOPRED Consortium (Unbiased BIOmarkers for the PREDiction of respiratory disease outcomes, http://www.ubiopred.eu) and extended in the eTRIKS Consortium (https://www.etriks.org/) to support a large number of national and European translational medicine projects. This article is not a review of the very large body of literature on relevant bioinformatics methods. Instead it describes generic steps in ‘omics data analysis to which many methods can be mapped, to help multidisciplinary teams comprising clinical experts, wet-lab researchers, bioinformaticians, biostatisticians and computational systems biologists share a common understanding and communicate effectively throughout the systems medicine process [56].

We illustrate our pragmatic approach to the design and implementation of the analysis pipeline through a handprint analysis using the TCGA Research Network (The Cancer Genome Atlas, http://cancergenome.nih.gov/) Ovarian serous cystadenocarcinoma (OV) dataset.

Data preparation: Quality control, correction for possible batch effects, missing data handling, and outlier detection

Quality Control (QC) comprises several important steps in data preparation. First, the platform-specific technical QC and normalisation are performed according to the standards of the respective fields of each particular technological platform.

Batch effects are a technical bias arising during study design and data production, due to variability in production platforms, staff, batches, reagent lots, etc. Their impact can be assessed using descriptive methods such as Principal Component Analysis (PCA) and graphical displays. Tools such as ComBat [57] and methodologies developed by van der Kloet [58] can be used to adjust for batch effects when necessary.

Missing data are features of all biological studies and arise for a variety of reasons. If the source of the missingness is unrelated to phenotype or biology, the missing data points can be classified as Missing Completely At Random (MCAR). Such missing values may be handled through imputation (to the mean, mode, mean of nearest neighbours, or by multiple imputation etc.) or by simple deletion [59].

Additional non-random missing data may arise due to assay- or platform-specific performances. For example, the measurement of abundances can fall below the lower limit of detection or quantitation (LLQ) of the instrument. In such instances, imputation is generally applied. Common methods include imputation to zero, LLQ, LLQ/2, or LLQ/√2; extrapolation and maximum likelihood estimation (MLE) can also be used [59].

Particular difficulty occurs in the analysis of mass spectrometry data, when it is impossible to distinguish MCAR data points from those below the LLQ of the technique. The combined levels of missing data often far exceed 10%. For these, the process depicted in Fig. 2 is proposed.

Critical appraisal of the pattern of missingness is crucial. Where extensive imputation is applied, the robustness of imputation needs to be assessed by re-analysis, using a second imputation method, or by discarding the imputed values.

Outliers are expected in any biological/platform data. When these are clearly seen to arise due to technical artefacts (differences by many orders of magnitude, etc.), they should be discarded. Otherwise and in general, outlying values in biological data should be retained, flagged and subjected to statistical analysis.

When there is no community-wide consensus on a specific quality threshold for a particular biological data type, the research group generating the data applies quality filters on the basis of their knowledge and experience. A precise description of each data processing step should accompany each dataset to inform colleagues performing downstream analysis.

Methods

The framework concept

Several key generic steps in data analysis were identified and are highlighted in Fig. 3 below.

Step 1: Dataset subsetting

This first box of Fig. 3 comprises two major steps: 1) formulating the biological question to be addressed and 2) preparing the data.

Formulating the biological question

Several types of biological questions can be tackled, leading to different partitions of the dataset(s) to study. A partitioning scheme may rely on cohort definitions based on current state of the art, a specific biological question (e.g. comparing highly atopic to non-atopic severe asthmatics), or clustering results, obtained with clinical variables alone, distinct specific ‘omic or multi-‘omics clustering, etc.

Data preparation

Depending on the question formulated at the previous step, data are then subsetted when appropriate. Then, an additional outlier detection check, data transformation and normalisation step can be performed, with methods

Fig. 2 Process proposed for handling high levels of non-random missing data.
If there are less than 10% missing values, data imputation is used, then tested for association (artificial associations might arise from the imputation process, which would then skew the analysis downstream) and submitted to a sensitivity analysis. If there are more than 10% missing values, we either collapse the feature/patient to a binary (presence/absence) scheme and run a χ test for difference in detection rates, or explore several imputation methods with highly cautious interpretation De Meulder et al. BMC Systems Biology (2018) 12:60 Page 5 of 23 Fig. 3 Overview of the framework. Starting from quality-checked and pre-processed ‘omics data, four key generic steps are highlighted: (a)dataset subsetting, including formulation of the biological question to be answered and data preparation, (b) feature filtering (optional step) where features that are uninformative in relation to the question can be removed, (c) ‘omics-based unsupervised clustering (optional step) aiming at finding groups of participants arising from the data structure using the (optionally filtered) features, and finally d) biomarker identification, including feature selection by bioinformatics means and machine learning algorithms for prediction described above. In this step, the statistical power that information content measures [68, 69], network-based the analyst can expect (or the effect size that can be ex- metrics (connectivity, centrality [70, 71]) or using a pected to be discovered) can be investigated (for more non-linear machine learning algorithm [72]. We redir- details on the computation of statistical power in ‘omics ect the reader to the following reviews for more details data analysis, see [60]). A decision on whether to split [33, 73–75]. As this step might introduce bias into the the datasets into training and validation sets is also made downstream analyses, it is not always applied. at this point (see section 4, replication of findings). 
Step 3: ‘Omics-based clustering Step 2: Feature filtering Clustering analysis groups elements so that objects in the Given the complexity and large amount of clinical and same group are more similar to each other than to those ‘omics data in a complex dataset, the number of features in other groups (Fig. 3c). All methods available rely on measured is vastly superior to the number of replicates similarity or distance measures and a clustering algorithm creating various statistical challenges, i.e.. the ‘curse of [76–78]. The most classical clustering methods may be dimensionality’ [61, 62]. Feature filtering (Fig. 3b)is categorized as ‘partitioning’ (constructing k clusters) or therefore often used to select a subset of features rele- ‘hierarchical’ (seeking to build a hierarchy of clusters), and vant to the biological question studied, remove noise either agglomerative (each observation starts in its own from the dataset and reduce the computing power and cluster, and pairs of clusters are merged as one moves up time needed [63–65]. the hierarchy, ending in a single cluster) or divisive (all ob- Features can be filtered according to specific criteria, servations start in the same cluster and splits are per- based for example on nominal p-values arising from com- formed recursively as one moves down the hierarchy, parison between groups. Indeed, several methods exist to ending with clusters containing one single observation). perform feature filtering, based on mean expression It is important to note that clustering techniques are values, p-values, fold changes, correlation values [66, 67], descriptive in nature and will yield clusters, whether they De Meulder et al. BMC Systems Biology (2018) 12:60 Page 6 of 23 represent reality or not [76]. 
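The agglomerative scheme described above can be sketched in a few lines. This is a minimal illustration only, not the implementation used in this framework: it assumes Euclidean distance and average linkage, and the function names (`euclidean`, `average_linkage`, `agglomerative`) as well as the six two-feature participant profiles are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    # Each observation starts in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest average-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical toy data: six participants measured on two features,
# forming two visually obvious groups.
profiles = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
            (5.0, 5.1), (5.2, 4.9), (4.8, 5.3)]
clusters = sorted(agglomerative(profiles, k=2))
```

Requesting k = 3 on the same toy data also returns three clusters, even though only two groups are apparent, which is why the stability of a clustering has to be assessed separately from the clustering itself.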
One way of finding out whether clusters represent reality is to assess their stability, for example with the consensus clustering approach [79]. Using different stable clustering algorithms on the same dataset and comparing them with the meta-clustering rationale [80] is a further step to assess whether clusters accurately and reproducibly represent the biological situation in the data.

When several ‘omics datasets on the same patients are available, a handprint analysis can be performed with the Similarity Network Fusion (SNF) method to derive a patient-wise multi-‘omics similarity matrix [55]. Other methods for data integration in the context of subtype discovery are available, such as iCluster [81], Multiple Dataset Integration [82], or Patient-Specific Data Fusion [83], further discussed in [84], or under development, for example by the European Stategra FP7 project (http://www.stategra.eu).

Step 4: Biomarker identification

Steps 1 to 3 aim at finding groups of patients to best describe the biological condition(s), with respect to the questions addressed. Step 4 aims at 1) finding the smallest set of molecular features whose difference in abundance between these patient groups (Fig. 3d) enables their distinction (biomarkers) and 2) building classification models through machine-learning techniques, some of which use both feature reduction and classification model building together. The outcome is a fingerprint or handprint, depending on the number of different ‘omics datasets included in the analysis.

Over-fitting and false-discovery rate control

As already mentioned, ‘omics technologies suffer from what is known as the ‘curse of dimensionality’, typically due to the large number of features (p) and low number of samples (n). As statistical methods were historically developed for a situation where the dimensions were n ≫ p instead of the p ≫ n situation, methodological adjustments had to be made. The main issue in statistical analysis is the high type I error rate (false positives) in null hypothesis testing. Several ways of correcting for this have been developed, the most well-known and used being the Bonferroni correction and the Benjamini-Hochberg False Discovery Rate (FDR) controlling procedure [85]. Discussions are still ongoing in the statistics community as to which method is best to control the false positive rates in the context of ‘omics data analysis [46, 86, 87]. We therefore advise splitting the data into testing and validation groups. Tests made within each group are corrected for FDR with the Benjamini-Hochberg procedure whenever possible or advised by domain experts, and only features detected in both groups should be considered for further analysis and interpretation.

Over-fitting may occur when a statistical model includes too many parameters relative to the number of observations. The over-fitted model describes random error instead of the underlying relationship of interest and performs poorly with independent data. In deriving prediction models therefore, a guiding principle is that there should be at least ten observations (or events) per predictor element [88], while simple models with few parameters should be favoured whenever possible.

All in all, the combination of internal replication, FDR correction and conservative over-fitting considerations allows the detection of interesting ‘omics features with a sound statistical foundation.

Replication of findings

When a large number of statistical tests have been planned, a comprehensive adjustment for multiple testing can be detrimental to statistical power. Validation and replication of findings are therefore essential in order to avoid the widespread unvalidated biomarker syndrome that has plagued the vast majority of claimed biomarkers. Indeed, fewer than 1/1000 have proved clinically useful and been approved by regulatory authorities [89–94]. For each combination of platform and sample type, an assessment can be made as to whether the data should be split into training and validation sets, or instead analysed as a single pool.

The predictive value of a biomarker identified after proper internal replication applies to the dataset in which it was discovered. Replication of findings in additional sample sets is a crucial step in producing clinically usable biomarkers and predictive models [95, 96] and should thus always be sought.

Once the feature filtering step is performed, the next step is to make sense of the results, either in a biological or mathematical manner. Biological annotation can be performed using pathways (see review in [97]) or functional categories (reviewed in [98]); however, this kind of analysis is hampered by factors such as statistical considerations (which method to use, independence between genes and between pathways, how to take into account the magnitude of the changes) and pathway architecture considerations (pathways can cross and overlap, meaning that if one pathway is truly affected, one may observe other pathways being significantly affected due to the set of overlapping genes and proteins involved) [99]. One way of overcoming those limitations is to use the complete genome-scale network of protein-protein interactions to define affected sub-regions of the network, with available academic [100, 101] and commercial solutions (e.g. MetaCore™ Thomson Reuters, IPA Ingenuity Pathway Analysis). A recently proposed solution is the disease map concept, following the examples of the Parkinson’s disease map [47], the Atlas of Cancer Signalling
Networks [50] and the AlzPathway [51, 52], where an exhaustive set of interactions relevant to a particular disease is represented in detail as a single network, which can then be analysed biologically and mathematically, with the supervision of domain experts for coverage and specificity [102].

Results

Application to a public domain dataset: TCGA OV dataset for handprint analysis

The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) is a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) in the USA. It aims to accelerate our understanding of the molecular basis of cancer through application of genome analysis technologies. Among other functionalities, TCGA offers a freely available database of multi-‘omics datasets (including clinical data, imaging, DNA, mRNA and miRNA sequencing, protein, gene exon and miRNA expression, DNA methylation and copy number variation (CNV)) for several cancer types, with patient numbers ranging from a few dozen to above a thousand.

As a use case, the ovarian cancer OV dataset was chosen, as it comprises several ‘omics measurements for a large group of patients; this dataset has already been well characterized in several publications but without a data fusion analysis, in contrast to the glioblastoma TCGA dataset, for example [55]. It comprises data from a total of 586 patients, along with several ‘omics datasets (such as SNP, Exome, methylation…), as shown in Table 1 below. All data matrices were downloaded using the Broad Institute FireBrowse TCGA interface (http://firebrowse.org/?cohort=OV&download_dialog=true#); the results shown here are based upon data generated by the TCGA Research Network.

Table 1 This table shows the number of cases in each ‘omics platform available for the TCGA Ovarian Serous Cystadenocarcinoma dataset (source: https://gdc.cancer.gov/)
Ovarian serous cystadenocarcinoma | Total | Exome | SNP | Methylation | mRNA | miRNA | Clinical
Cases | 586 | 536 | 579 | 584 | 574 | 582 | 584

Data preparation

We used the clinical, methylation, mRNA and miRNA data matrices from the 453 patients (out of a total of 586 patients) for which all four data types were available. The overview of the analysis is summarized in Fig. 4.

Feature selection

Preliminary analysis without feature selection was performed (data not shown). Briefly, this analysis led to the identification of four stable clusters, mainly differentiated by lymphatic and venous invasion status and clinical stage. Biologically speaking, the comparison of clusters led to the highlighting of well-known ovarian cancer biomarkers and pathways.

In order to produce a handprint more focused on the survival status of patients in the dataset, each ‘omics dataset was treated separately to identify features associated with survival status at the end of the study and overall survival time. The latter was obtained by summing the age (in days) of the participants at enrolment in the study and the post-study survival time, both values available in the clinical variables from the TCGA website. After data preparation, including imputation of missing data in methylation and normalisation, linear models testing for survival status with survival time as a cofactor were fitted feature-wise and p-values for differential expression/abundance were derived. All features with a nominal p-value < 0.05 were selected. This yielded a total of 899 features in the methylation dataset, 37 miRNAs and 5817 probesets in transcriptomics.

‘Omics-based clustering

Similarity matrices were derived from each filtered ‘omics dataset, which were fused with SNF, and spectral clustering with a consensus clustering step was applied to detect stable clusters, as shown in Fig. 5 below. The choice of the optimal number of stable clusters is based on two mathematical parameters: the deviation from ideal stability (DIS, a measure of the deviation from horizontality of the CDF curves in the left panel of Fig. 5, the formulation of which can be found in the supplementary material of [103]), and the number of patients assigned in each cluster (clusters with fewer than 10 patients should be avoided [58]). The DIS across the number of clusters can be found in the Additional file 1. The DIS shows a minimal value for k = 3 clusters, but very similar values can be seen for k = 6, 7, 9, 10, 11 and 12. As it is clinically interesting to distinguish a higher number of clusters and to define clusters with different survival status, we chose the number of clusters associated with low DIS, no clusters with fewer than 10 patients, and statistically significant differences in survival status and survival time of patients, k = 9.

The clinical characteristics of the nine clusters are shown in Table 2. Survival curves are also shown in the Kaplan-Meier plot (Fig. 6). Survival status and survival time differ between the nine clusters, showing for example that patients in cluster 1 have a higher mortality rate.

Fig. 4 Framework outline for the TCGA handprint analysis with additional feature filtering. Each dataset was separately filtered based on nominal p-values < 0.05 when comparing alive versus deceased patients at the end of the study, taking into account the total amount of days alive. A total of 6753 features were selected: 899 differentially methylated genes, 37 miRNAs and 5817 differentially expressed probesets.
Consensus clustering on the fused similarity matrices determined the number of stable clusters that were viewed in a Kaplan-Meier plot and tested for differential survival. Machine learning was then performed to identify candidate features predicting the identified groups: Recursive Feature Elimination (RFE) on a linear Support-Vector-Machine (SVM) model to identify informative features, followed by a Random Forest (RF) model building in parallel with DIABLO sPLS-DA on those features

Fig. 5 Consensus clustering results for the handprint analysis with feature filtering. A number of stable clustering schemes are available (k = 3, 6, 7, 8, 9). Nine clusters were chosen as the most informative, while keeping a low value of the deviation from ideal stability index and with clinical characteristics of the clusters statistically different in both survival time and survival status between clusters

Table 2 Clinical characteristics of the nine clusters found in the focused handprint analysis
Variables/clusters | C1 (n = 49) | C2 (n = 30) | C3 (n = 75) | C4 (n = 41) | C5 (n = 47) | C6 (n = 52) | C7 (n = 46) | C8 (n = 56) | C9 (n = 57) | P-value
Age at initial pathologic diagnosis (Yr) | 57.6 ± 13.2 | 53.5 ± 8.16 | 59.8 ± 10.7 | 61.1 ± 12 | 60.2 ± 9.67 | 63.4 ± 11.8 | 59.8 ± 12.5 | 59.4 ± 11.6 | 60 ± 11.4 | 3.40E-02
Days from birth (Days) | −21,200 ± 4830 | −19,700 ± 3030 | −21,900 ± 3870 | −22,700 ± 4260 | −22,200 ± 2580 | −23,300 ± 4290 | −22,000 ± 4560 | −21,900 ± 4240 | −22,200 ± 4140 | 3.15E-02
Days to death (Days (IQR)) | 1220 (725–1490) | 1480 (1210–2360) | 997 (404–1230) | 949 (563–1360) | 787 (512–1340) | 1090 (680–1580) | 978 (536–1450) | 1070 (340–1440) | 1290 (731–1700) | 2.11E-02
Days to last followup (Days (IQR)) | 1090 (689–1460) | 1200 (688–1550) | 664 (238–1120) | 763 (272–1820) | 676 (185–1560) | 804 (339–1560) | 651 (347–1370) | 816 (223–1370) | 1280 (605–1690) | 3.74E-02
Initial pathologic diagnosis method | | | | | | | | | | 3.28E-03
  Cytology | 9 | 3 | 12 | 2 | 9 | 6 | 2 | 9 | 5 |
  Excisional biopsy | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 |
  Fine needle aspiration biopsy | 2 | 0 | 3 | 2 | 0 | 1 | 0 | 0 | 1 |
  Incisional biopsy | 4 | 0 | 0 | 1 | 2 | 0 | 0 | 3 | 0 |
  Tumor resection | 32 | 27 | 59 | 36 | 33 | 44 | 44 | 43 | 51 |
  NA | | | 1 | | 1 | 1 | | | |
Lymphatic invasion | | | | | | | | | | 2.43E-02
  No | 4 | 6 | 7 | 13 | 1 | 13 | 8 | 4 | 5 |
  Yes | 9 | 10 | 19 | 5 | 17 | 6 | 21 | 8 | 14 |
  NA | 36 | 14 | 49 | 23 | 29 | 33 | 17 | 44 | 38 |
Neoplasm histologic grade | | | | | | | | | | 1.89E-02
  G1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
  G2 | 13 | 5 | 5 | 5 | 6 | 6 | 8 | 1 | 6 |
  G3 | 33 | 24 | 70 | 36 | 39 | 44 | 38 | 53 | 49 |
  G4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
  Gb | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
  Gx | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 2 | 1 |
  NA | | 1 | | | | | | | 1 |
Ethnicity | | | | | | | | | | 6.72E-01
  American Indian or Alaska native | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
  Asian | 1 | 1 | 3 | 1 | 1 | 2 | 3 | 3 | 0 |
  Black or African American | 3 | 2 | 2 | 3 | 0 | 4 | 1 | 2 | 4 |
  White | 43 | 27 | 68 | 37 | 41 | 44 | 41 | 49 | 51 |
  NA | 1 | 0 | 2 | 0 | 4 | 2 | 1 | 2 | 2 |
Clinical stage | | | | | | | | | | 2.65E-02
  iia | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
  iib | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 2 |
  iic | 0 | 1 | 3 | 3 | 1 | 2 | 2 | 1 | 4 |
  iiia | 1 | 0 | 1 | 4 | 0 | 1 | 0 | 0 | 0 |
  iiib | 0 | 1 | 3 | 4 | 2 | 5 | 4 | 1 | 1 |
  iiic | 38 | 24 | 51 | 22 | 33 | 38 | 34 | 42 | 41 |
  iv | 10 | 3 | 16 | 7 | 9 | 6 | 4 | 12 | 7 |
  NA | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
Tumor residual disease | | | | | | | | | | 6.13E-02
  > 20 mm | 10 | 5 | 17 | 6 | 11 | 4 | 8 | 6 | 11 |
  1–10 mm | 26 | 17 | 29 | 18 | 21 | 24 | 15 | 29 | 25 |
  11–20 mm | 6 | 5 | 5 | 1 | 4 | 5 | 5 | 2 | 2 |
  No macroscopic disease | 4 | 12 | 12 | 12 | 3 | 12 | 13 | 14 | 14 |
  NA | 3 | 4 | 12 | 4 | 8 | 7 | 5 | 5 | 5 |
Tumor tissue site | | | | | | | | | | 5.01E-01
  Omentum | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
  Ovary | 48 | 30 | 74 | 41 | 46 | 52 | 46 | 56 | 57 |
  Peritoneum ovary | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Venous invasion | | | | | | | | | | 7.24E-02
  No | 3 | 3 | 8 | 12 | 1 | 10 | 7 | 3 | 3 |
  Yes | 3 | 10 | 7 | 3 | 10 | 5 | 20 | 1 | 10 |
  NA | 43 | 17 | 60 | 26 | 36 | 37 | 19 | 52 | 44 |
Vital status | | | | | | | | | | 1.90E-03
  Alive | 9 | 14 | 33 | 18 | 20 | 20 | 28 | 31 | 27 |
  Dead | 40 | 16 | 42 | 23 | 27 | 31 | 18 | 25 | 30 |
  NA | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Primary therapy outcome success | | | | | | | | | | 5.08E-01
  Complete remission/response | 24 | 17 | 41 | 24 | 24 | 29 | 27 | 36 | 35 |
  Partial remission/response | 12 | 3 | 7 | 4 | 8 | 6 | 5 | 4 | 5 |
  Progressive disease | 3 | 4 | 2 | 2 | 4 | 1 | 4 | 7 | 5 |
  Stable disease | 1 | 2 | 4 | 0 | 3 | 5 | 6 | 2 | 1 |
  NA | 9 | 4 | 21 | 11 | 8 | 11 | 4 | 7 | 11 |
Days lived known | 22,300 ± 4750 | 21,100 ± 3150 | 22,800 ± 3930 | 23,800 ± 4050 | 23,300 ± 3840 | 24,500 ± 4140 | 23,000 ± 4490 | 22,800 ± 4430 | 23,400 ± 4240 | 3.85E-02
Nominally statistically significant differences (p < 0.05) are shown in italic. Interestingly, significant differences are detected in lymphatic invasion, clinical stage at diagnosis, vital status and the overall number of days alive

Fig. 6 Kaplan-Meier plot of survival for patients from the nine clusters revealed with the consensus clustering analysis. The x axis bears the total amount of days that patients have lived, i.e. the sum of their age at enrolment in the study plus the recorded amount of days they survived during the study, censored to the right by the end of measurements in the study (enrolment plus 4624 days)

Biomarker identification

Enrichment analysis

In order to detect differentially expressed features that are specific to one group, each of the nine clusters was compared to the rest of the dataset. Table 3 shows the summary of statistically different features (p-value < 0.05, 5% FDR correction) identified in each comparison.

Table 3 Number of statistically significant different features obtained when comparing each cluster against all other patients in the dataset, for each platform. P-values were computed by a linear model in each ‘omics platform independently, and Benjamini-Hochberg FDR corrected
Platform | 1 vs Rest (49 vs 404) | 2 vs Rest (30 vs 423) | 3 vs Rest (75 vs 378) | 4 vs Rest (41 vs 412) | 5 vs Rest (47 vs 406) | 6 vs Rest (52 vs 401) | 7 vs Rest (46 vs 407) | 8 vs Rest (56 vs 397) | 9 vs Rest (57 vs 396)
mRNA | 1861 | 245 | 4101 | 1073 | 2480 | 3617 | 2557 | 4620 | 1843
Methylation | 335 | 550 | 4 | 388 | 498 | 233 | 387 | 528 | 75
miRNA | 18 | 0 | 1 | 9 | 24 | 1 | 8 | 14 | 11

Enrichment analysis of features differentially expressed/abundant between the clusters was then performed. Complete results are presented in the Additional file 2; an overview of results for which there is already evidence in the literature is presented below in Table 4.

In short, the biological functions enriched in each cluster are as follows: cluster 1 is mostly enriched in mitochondrial translation and energy metabolism, cell cycle regulation, negative regulation of apoptosis and DNA damage response. In addition, several miRNAs and transcription factors are enriched; the details can be found in the Additional file 2.

Cluster 2 is associated with chemical carcinogenesis, miR-330-5p, miR-693-5p and the Pax-2 transcription factor. Other transcription factors are also highlighted through the methylation measurements.

Cluster 3 is associated with immune system regulation (T cell-related processes, and more precisely CD4 and CD8-T cells lineages-related processes…), cell-cell signalling, cAMP signalling, cytokine-cytokine interaction, G-Protein coupled receptor (GPCR) ligand binding and neuronal and muscle-related pathways (potassium and calcium channels, other ion channels and synapses). Again, several miRNAs and transcription factors are highlighted.

Cluster 4 is also associated with the immune response, and key functions such as lymphocyte activation, T cell aggregation, differentiation, proliferation and activation, adaptive immune system, regulation of lymphocyte cell-cell activation, immune response-regulating signalling pathway, cytokine-cytokine receptor interaction, antigen processing and presentation, hematopoietic cell lineage and hematopoiesis and B cell activation. Primary immunodeficiency pathway and cell adhesion molecules, along with miR-938 and several transcription factors, are also enriched.

Cluster 5 is related to immune response, enriched in lymphocyte activation, T cell aggregation, differentiation, activation and proliferation, leukocyte differentiation, aggregation and activation, positive regulation of cell-cell adhesion, antigen processing and presentation, cytokine production, inflammatory response, NK cell-mediated cytotoxicity and cytokine-cytokine receptor interaction. Other processes involved are NF-κB signalling, Jak-STAT signalling, Interferon α/β signalling, TCR signalling, VEGF signalling, VEGFR2-mediated cell proliferation, Hedgehog ‘off’ state, along with several miRNAs and transcription factors.

Cluster 6 is enriched in several signalling pathways, such as cAMP, GPCR signalling, arachidonic acid metabolism and fatty acids metabolism, as well as positive T cell selection, several miRNAs and transcription factors.

Cluster 7 is linked with respiratory metabolism, p53 and cell cycle regulation, splicing regulation as well as signalling by NF-κB and miRNAs and transcription factors.

Cluster 8 is enriched with T cell lineage commitment, potassium channels, miRNAs and transcription factors.

Cluster 9 is associated with ion transport (including synaptic, calcium and potassium channels), cAMP signalling, nicotine addiction, as well as miRNAs and transcription factors.

Each cluster is linked with one or several of the well-known hallmarks of cancer such as regulation of the cell cycle (clusters 1 and 7), energy metabolism (clusters 1 and 7), immune system (clusters 3, 4, 5 and 8), epithelial-to-mesenchymal transition (cluster 4) or angiogenesis (cluster 5) [104–106]. Interestingly, our analysis based on ‘omics profiles is able to identify clusters that seem to separate some of those hallmarks out, while an analysis taking into account only the clinical data cannot. As seen above, cluster 6 is associated with a higher rate of survival. It would therefore be interesting to further explore the signalling networks enriched in the comparison between cluster 6 and the other clusters to identify the molecular mechanisms responsible for the extended survival.

Machine-learning predictive modelling

The next step in the analysis is to establish a model that can predict which cluster a patient belongs to, based on the ‘omics measurements alone. Machine-learning techniques (reviewed in [107, 108]), available in the caret R package [109] and in the mixOmics R packages [110, 111], were used.

Two models were built in parallel, on the same dataset.

1. A Recursive Feature Elimination (RFE) procedure was performed to identify the smallest number of features from the three ‘omics platforms that allow satisfactory separation of the clusters. This procedure was controlled by Leave-Group-Out Cross Validation (LGOCV) with 100 iterations (this number was chosen to ensure convergence of the validation procedure) and using between 1 and 50 predictors, with the addition of the whole set of 6753 features. A Random Forest (RF) model was built with the features identified in the previous step. To avoid overfitting, the RF model was built using LGOCV with 100 iterations and in three quarters of the samples available (N = 300) and then tested in the remaining quarter of samples (N = 153). More details can be found in the Additional file 3.
2. Concatenation-based integration of data combines multiple datasets into a single large dataset, with the aim to predict an outcome. However, this approach does not account for or model relationships between datasets and thus limits our understanding of molecular interactions at multiple functional levels. This is the rationale behind the development of novel integrative modelling methods, such as the DIABLO sPLS-DA method [112]. A DIABLO model was built using the same dataset as the SNF analysis described above. A DIABLO model is a type of partial least squares (sparse PLS Discriminant Analysis) regression model, which uses multiple ‘omics platform measurements on the same samples to predict an outcome, with a biomarker selection step (sparse) to select necessary and sufficient features to predict the groups (discriminant analysis) within the outcome. Details of this analysis can be found in the Additional file 4. In short, this analysis was run as follows: the datasets were split in 2/3 training and 1/3 testing sets. The DIABLO model was then trained with boundaries set on the number of features allowed per component (gene expression and methylation between 50 and 110 features, and between 5 and 35 miRNA features). The performances were then estimated within the training model by 10 repeats of 10-fold validation and the prediction power estimated in the testing set.

Topological data analysis

In order to visualize the patients’ relationships as measured by their ‘omics profiles, we used Topology Data

Table 4 Enrichment analysis for each comparison across all ‘omics types, with q-values, and the literature references mentioning involvement of the terms in ovarian cancer development.
Q-values are the minimal false discovery rate at which the test may be called significant, or in other words, the p-value threshold to satisfy the FDR criteria set by the Benjamini-Hochberg procedure Term Term type ‘Omic type Contrast q-value Reference of implication in ovarian cancer E2F Transcription factor Transcriptomics 1 vs Rest 8.17E-48 [123, 124] Sp1 Transcription factor Transcriptomics 1 vs Rest 1.95E-35 [125] Mitochondrial translation Reactome Transcriptomics 1 vs Rest 9.02E-21 [126] hsa-miR-193a-5p miRNA Transcriptomics 1 vs Rest 4.33E-09 [127] CREM Transcription factor Methylation 1 vs Rest 2.45E-03 [128] hsa-miR-940 miRNA Transcriptomics 1 vs Rest 6.80E-03 [129] hsa-miR-601 miRNA Transcriptomics 1 vs Rest 6.81E-03 [129] hsa-miR-503 miRNA Transcriptomics 1 vs Rest 1.41E-02 [129] AP-1 Transcription factor Methylation 1 vs Rest 1.52E-02 [130] TCF-4 Transcription factor Methylation 1 vs Rest 2.04E-02 [131] hsa-miR-361-3p miRNA Transcriptomics 1 vs Rest 2.53E-02 [129] C/EBP Transcription factor Methylation 2 vs Rest 1.13E-05 [132] LMXB1 Transcription factor Methylation 2 vs Rest 9.32E-05 [133] hsa-miR-330-5p miRNA Transcriptomics 2 vs Rest 7.57E-03 [134] Chemical carcinogenesis KEGG pathways Transcriptomics 2 vs Rest 1.77E-02 [135–137] hsa-miR-335 miRNA Transcriptomics 2 vs Rest 3.95E-02 [138] MZF-1 Transcription factor Transcriptomics 3 vs Rest 4.06E-39 [139] SREBP-1 Transcription factor Transcriptomics 3 vs Rest 5.29E-38 [140] AP-2gamma Transcription factor Transcriptomics 3 vs Rest 1.79E-36 [141] GPCR ligand binding Reactome Transcriptomics 3 vs Rest 8.14E-10 [142] hsa-miR-328 miRNA Transcriptomics 3 vs Rest 9.92E-10 [129] hsa-miR-370 miRNA Transcriptomics 3 vs Rest 1.09E-08 [129] hsa-miR-601 miRNA Transcriptomics 3 vs Rest 1.07E-07 [129] hsa-miR-423-5p miRNA Transcriptomics 3 vs Rest 1.36E-06 [129] hsa-miR-139-3p miRNA Transcriptomics 3 vs Rest 2.28E-05 [129] hsa-miR-769-5p miRNA Transcriptomics 3 vs Rest 9.05E-05 [129] hsa-miR-339-3p miRNA 
Transcriptomics 3 vs Rest 2.16E-04 [129] hsa-miR-940 miRNA Transcriptomics 3 vs Rest 2.94E-04 [129] hsa-miR-542-5p miRNA Transcriptomics 3 vs Rest 8.13E-04 [129] hsa-miR-483-5p miRNA Transcriptomics 3 vs Rest 1.50E-03 [129] hsa-miR-361-3p miRNA Transcriptomics 3 vs Rest 7.88E-03 [129] hsa-miR-449a miRNA Transcriptomics 3 vs Rest 4.87E-02 [129] T cell aggregation GO Biological Process Transcriptomics 4 vs Rest 1.94E-38 [143] T cell activation GO Biological Process Transcriptomics 4 vs Rest 1.94E-38 [144] Natural killer cell mediated cytotoxicity KEGG pathways Transcriptomics 4 vs Rest 8.60E-14 [145] Cell adhesion molecules (CAMs) KEGG pathways Transcriptomics 4 vs Rest 2.37E-11 [146] Hedgehog ‘on’ state Reactome Transcriptomics 4 vs Rest 7.21E-05 [147] HIC1 Transcription factor Methylation 4 vs Rest 2.46E-04 [148] hsa-miR-328 miRNA Transcriptomics 4 vs Rest 1.49E-02 [129] AP-2gamma Transcription factor Transcriptomics 4 vs Rest 3.00E-02 [141] T cell activation GO Biological Process Transcriptomics 5 vs Rest 1.94E-38 [144] T cell aggregation GO Biological Process Transcriptomics 5 vs Rest 2.25E-22 [143] De Meulder et al. BMC Systems Biology (2018) 12:60 Page 14 of 23 Table 4 Enrichment analysis for each comparison across all ‘omics types, with q-values, and the literature references mentioning involvement of the terms in ovarian cancer development. 
Q-values are the minimal false discovery rate at which the test may be called significant, or in other words, the p-value threshold to satisfy the FDR criteria set by the Benjamini-Hochberg procedure (Continued) Term Term type ‘Omic type Contrast q-value Reference of implication in ovarian cancer Natural killer cell mediated cytotoxicity KEGG pathways Transcriptomics 5 vs Rest 8.60E-14 [145] Antigen processing and presentation KEGG pathways Transcriptomics 5 vs Rest 4.33E-11 [149] Interferon alpha/beta signalling Reactome Transcriptomics 5 vs Rest 6.11E-08 [150] hsa-miR-423-5p miRNA Transcriptomics 5 vs Rest 3.09E-05 [129] hsa-miR-328 miRNA Transcriptomics 5 vs Rest 5.23E-04 [129] VEGFA-VEGFR2 Pathway Reactome Transcriptomics 5 vs Rest 2.57E-03 [151, 152] Hedgehog ‘off’ state Reactome Transcriptomics 5 vs Rest 1.21E-02 [153] hsa-miR-139-3p miRNA Transcriptomics 5 vs Rest 1.35E-02 [129] NF- κB signalling pathway KEGG pathways Transcriptomics 5 vs Rest 1.53E-02 [154] hsa-miR-601 miRNA Transcriptomics 5 vs Rest 2.71E-02 [129] Jak-STAT signalling pathway KEGG pathways Transcriptomics 5 vs Rest 3.54E-02 [155] hsa-miR-375 miRNA Transcriptomics 5 vs Rest 3.74E-02 [129] Signalling by GPCR Reactome Transcriptomics 6 vs Rest 1.24E-14 [156] hsa-miR-328 miRNA Transcriptomics 6 vs Rest 1.47E-08 [129] hsa-miR-601 miRNA Transcriptomics 6 vs Rest 6.94E-07 [129] hsa-miR-370 miRNA Transcriptomics 6 vs Rest 2.46E-06 [129] hsa-miR-423-5p miRNA Transcriptomics 6 vs Rest 4.81E-06 [129] hsa-miR-423-3p miRNA Transcriptomics 6 vs Rest 1.77E-05 [129] cAMP metabolic process GO Biological Process Transcriptomics 6 vs Rest 9.22E-05 [157] hsa-miR-769-5p miRNA Transcriptomics 6 vs Rest 5.13E-04 [129] hsa-miR-139-3p miRNA Transcriptomics 6 vs Rest 2.70E-03 [129] hsa-miR-483-5p miRNA Transcriptomics 6 vs Rest 4.90E-03 [129] hsa-miR-940 miRNA Transcriptomics 6 vs Rest 5.05E-03 [129] T cell selection GO Biological Process Transcriptomics 6 vs Rest 1.41E-02 [158] Arachidonic acid metabolism KEGG 
pathways Transcriptomics 6 vs Rest 1.42E-02 [135] hsa-miR-542-5p miRNA Transcriptomics 6 vs Rest 1.73E-02 [129] Oxidative phosphorylation KEGG pathways Transcriptomics 7 vs Rest 9.49E-13 [159] Stabilization of p53 Reactome Transcriptomics 7 vs Rest 1.06E-07 [160] Spliceosome KEGG pathways Transcriptomics 7 vs Rest 1.59E-07 [161] NF-kB signalling pathway Reactome Transcriptomics 7 vs Rest 3.97E-05 [154] hsa-miR-542-5p miRNA Transcriptomics 7 vs Rest 2.53E-03 [129] hsa-miR-601 miRNA Transcriptomics 7 vs Rest 2.62E-03 [129] hsa-miR-423-5p miRNA Transcriptomics 7 vs Rest 5.88E-03 [129] hsa-let-7c miRNA Transcriptomics 7 vs Rest 2.67E-02 [129] Regulation of HIF by oxygen Reactome Transcriptomics 7 vs Rest 3.32E-02 [162] hsa-miR-361-3p miRNA Transcriptomics 7 vs Rest 4.16E-02 [129] hsa-miR-328 miRNA Transcriptomics 8 vs Rest 9.25E-15 [129] hsa-miR-370 miRNA Transcriptomics 8 vs Rest 3.60E-11 [129] hsa-miR-940 miRNA Transcriptomics 8 vs Rest 1.37E-10 [129] hsa-miR-423-5p miRNA Transcriptomics 8 vs Rest 4.29E-10 [129] hsa-miR-423-3p miRNA Transcriptomics 8 vs Rest 7.47E-09 [129] hsa-miR-139-3p miRNA Transcriptomics 8 vs Rest 5.08E-07 [129] De Meulder et al. BMC Systems Biology (2018) 12:60 Page 15 of 23 Table 4 Enrichment analysis for each comparison across all ‘omics types, with q-values, and the literature references mentioning involvement of the terms in ovarian cancer development. 
Q-values are the minimal false discovery rate at which the test may be called significant, or in other words, the p-value threshold to satisfy the FDR criteria set by the Benjamini-Hochberg procedure (Continued) Term Term type ‘Omic type Contrast q-value Reference of implication in ovarian cancer hsa-miR-601 miRNA Transcriptomics 8 vs Rest 9.47E-07 [129] hsa-miR-542-5p miRNA Transcriptomics 8 vs Rest 4.72E-04 [129] hsa-miR-361-3p miRNA Transcriptomics 8 vs Rest 1.07E-03 [129] hsa-miR-483-5p miRNA Transcriptomics 8 vs Rest 1.32E-03 [129] hsa-miR-769-5p miRNA Transcriptomics 8 vs Rest 1.68E-03 [129] Potassium signalling pathway Reactome Transcriptomics 8 vs Rest 1.15E-02 [163] hsa-miR-99b miRNA Transcriptomics 8 vs Rest 1.93E-02 [129] hsa-miR-339-3p miRNA Transcriptomics 8 vs Rest 2.28E-02 [129] T cell lineage commitment GO Biological Process Transcriptomics 8 vs Rest 3.80E-02 [164] hsa-miR-139-3p miRNA Transcriptomics 9 vs Rest 3.58E-09 [129] hsa-miR-423-5p miRNA Transcriptomics 9 vs Rest 5.89E-09 [129] hsa-miR-328 miRNA Transcriptomics 9 vs Rest 2.32E-08 [129] hsa-miR-370 miRNA Transcriptomics 9 vs Rest 4.83E-08 [129] hsa-miR-423-3p miRNA Transcriptomics 9 vs Rest 3.89E-06 [129] hsa-miR-940 miRNA Transcriptomics 9 vs Rest 5.37E-06 [129] hsa-miR-769-5p miRNA Transcriptomics 9 vs Rest 1.07E-04 [129] hsa-miR-339-3p miRNA Transcriptomics 9 vs Rest 0.000173 [129] hsa-miR-601 miRNA Transcriptomics 9 vs Rest 2.05E-04 [129] hsa-miR-483-5p miRNA Transcriptomics 9 vs Rest 7.33E-03 [129] Calcium signalling pathway KEGG pathways Transcriptomics 9 vs Rest 1.55E-02 [165] hsa-miR-542-5p miRNA Transcriptomics 9 vs Rest 1.69E-02 [129] cAMP signalling pathway KEGG pathways Transcriptomics 9 vs Rest 2.33E-02 [166] Ion transfer GO Biological Process Transcriptomics 9 vs Rest 3.43E-02 [167] Analysis (TDA), a general framework to analyse high- OV. 
Other studies have been performed, either on this dimensional, incomplete and noisy data in a manner that same dataset [114–118], or on the same disease [119]. is less sensitive to the particular metric that is chosen, Tothill et al. in 2015 identified six clusters of patients, and provides dimensionality reduction and robustness to based on mRNA, immunohistochemistry and clinical noise. TDA is embedded in the software produced by data from a cohort of 285 Australian and Dutch partici- the Ayasdi company to which the data were uploaded pants, with a consensus clustering analysis of mRNA [113]. As shown in Fig. 7, the network of patients’ simi- data alone. The TCGA consortium produced their own larities obtained through TDA analysis and then colored dataset in 2011, identifying four clusters based on com- by the vital status of the patients at the end of the study bined mRNA, miRNA and DNA methylation data (data shows a higher level of complexity than is identified by combined by summarising to the gene-level all datasets the clustering analysis, suggesting that statistical and/or through a factor analysis) and using a non-negative technical limitations of the clustering methods prevent matrix factorisation to identify clusters [120]. Further us to accurately represent reality. analysis of the same dataset was then performed by Zhang et al. [118], Jin et al. [115] and Kim et al. [116] Discussion (with some variations), but these authors did not look Multi-omics data integration is, among other compo- for new phenotypes in their analysis, rather comparing nents of biological data integration, a very promising data based on clinical endpoints (survival time, histo- and emerging field. We show a structured and effective logical grades and stage of disease). Gevaert et al. [114] way to combine ‘omics data from multiple sources to used an original algorithm to combine DNA methyla- search for molecular profiles of patients. 
This process tion, Copy Number Variation (CNV) and gene expres- allowed for the classification of a well-studied dataset of sion data, using the clusters defined in the TCGA De Meulder et al. BMC Systems Biology (2018) 12:60 Page 16 of 23 Fig. 7 Network of patients shown in the TDA platform. The network is constructed as ‘bins’ grouping patients who are similar based on their ‘omics profiles. Each dot in the network represents a bin. The bins are overlapping by an adaptable percentage, and if at least one patient is present in the overlap of two bins, the two bins will be linked in the network. The survival status of the patients is then translated as a color scheme (blue representing deceased patients and red alive patients). Using this technique, it is easy to identify ‘islands’ of good and poor survival among the patients, and equally easy to acknowledge that there are more such islands than is identified through the clustering technique. Thorough analysis of such networks can lead to insights into biology, as detailed in [168] original paper. Those studies showed different ways of among the 9 clusters identified and is associated with the analysing the data, leading to the identification of clinic- GPCR signalling pathway, cAMP, ion channels, arachi- ally relevant clusters in the case of Tothill and TCGA donic acid metabolism and a number of miRNAs (see original paper [117, 119]. It is however the first time in Table 4 or the Additional file 2 for more details). this paper that TCGA mRNA, miRNA and methylation Interestingly, while the two sets of groups defined with data were fused with an advanced data integration or without feature reduction show differences in inva- method to identify robust subtypes of disease. sion and clinical stage, statistically significant differences The number of clusters found in the same dataset dif- in vital status are only detected amongst groups defined fers between the TCGA analysis and our analysis. We with feature reduction. 
The reduced data also allows for believe that the higher number of clusters we found is the definition of a higher number of stable groups (9 in- the result of more up-to-date and powerful methods for stead of 4), thereby pointing to the usefulness of per- subtype discovery, as shown in the SNF original paper forming feature reduction prior to clustering analysis. [55]. Moreover, the subtypes identified in this analysis The biological functions highlighted by enrichment do allow for a more in-depth classification of patients analysis between the clusters indicate that these are linked with specific molecular subtypes than was previ- associated with different biological mechanisms leading ously reported. Building predictive models based on to the development of cancer in patients, ranging from multiple ‘omics profiles also contributes to the novelty immune system disorders, cell cycle dysregulation, im- of this approach as other reported studies did not pro- paired response to DNA damage, modified energy me- duce such a model, with the exception of the Tothill et tabolism, etc. al. study [119] in which the authors developed a class The predictive models that were trained and tested prediction model based on transcriptomics data only. with two different methods gave mixed power results. In Clinically speaking, classifications are most useful when the Random Forest case, the model could predict quite they allow the identification of a subset of patients with a well when patients did not belong to the clusters, but clinically relevant outcome, such as low or high survival not so well when patients did belong to them; in other rate, thus indicating where efforts may be focused to de- words, the model is specific but not sensitive. In the case velop new drugs, therapies and procedures. 
In our ana- of the DIABLO PLS, the model is able to predict fairly lysis, the groups identified after feature reduction are accurately the clusters 4 and 8 and less accurately cluster statistically different in terms of survival rate and time. 5. Moreover, in the case of the DIABLO analysis, the For example, cluster 6 shows the highest rate of survival model showed that the clusters have different ‘omics De Meulder et al. BMC Systems Biology (2018) 12:60 Page 17 of 23 patterns, with clusters 2 and 8 showing distinct methyla- cross-validated and clinically useful stratification of ovarian tion profiles, and cluster 4 showing different methylation cancer, towards a better and more personalized care. and transcriptomics profiles. The results presented in this manuscript are not per- Conclusion fectly predictive, however. It seems that the cluster defi- This article presents an overview of the integrative sys- nitions are not as stable as they could be; the predictive tems biology analyses developed, performed and validated models are not accurate in all clusters and the survival in the IMI U-BIOPRED and eTRIKS projects, proposing a status of the clusters are not clear cut. This reflects the template for other researchers wishing to perform similar fact shown in Fig. 7, that there seems to be much more analyses for other diseases. We demonstrate the useful- complexity within the dataset than what the clustering ness of generating hypotheses through a fingerprint/hand- analysis is able to detect. print analysis by applying to a well-studied dataset of This is due to multiple factors: the recurring issue of ovarian carcinoma, identifying a higher number of robust low number of patients, which in turn influences the groups than previously reported, potentially improving number of clusters we can find with statistical confi- our understanding of this disease. 
Better characterisation dence – a point which is not taken into account in the of the clusters found in the handprint analyses and valid- TDA analysis discussed here – and highlighting the need ation of the predictive model obtained by machine learn- for better stratification methods in the context of per- ing are both ongoing. We believe that handprint analyses, sonalized medicine where, ideally, each patient is his/her performed on large scale ‘omics datasets will allow re- own cluster (n = 1); sub-optimal clustering methods and searchers to identify subtypes of disease (phenotypes and algorithms also play a part in this result and it is our endotypes) [34] with greater confidence, providing better hope that continuous methods development will allow diagnosis tools for the clinicians, new avenues for drug de- for better classification. Clustering analysis is descriptive velopment for the pharmaceutical industry and deeper in- in nature: applying a clustering algorithm to a dataset sights into disease mechanisms. To be effective, handprint will always yield clusters, whether real clusters exist analyses need to be performed on the same subjects with or not. Analytical methods exist to ascertain cluster multiple ‘omics platforms. Theysuffer fromsomelimita- ‘reality’, among which stability in patients through tions, such as the decreasing but nevertheless still elevated bootstrapping, stability in time through cluster identi- cost of ‘omics data production and the protocol standard- fication from time-series experiments [121], meta isation requirements to avoid time-consuming data pre- clustering across several studies, yet only replication processing, the rather large technical, human resources studies may confirm the existence of these clusters. and expertise requirements to perform the analyses (par- Such replication effort however lies outside the scope ticularly the machine-learning analysis) or the lack of ac- of this manuscript. 
curate and independent benchmarking tools to identify Despite the use of most recent databases and tools, the the most powerful and/or best-suited method to analyse a biological interpretation of the differences between the particular dataset. clusters remains challenging. The main issues stem from Additional work is therefore needed to make the frame- the overlapping nature of pathways described in literature work and the analyses proposed here more accessible to a and the non-unicity of relationships between biological broad audience of health researchers. Efforts of the bioinfor- entities, leading to a high false positive rate in the results matics community are shifting in this direction; for instance, of pathway analysis [97]. Efforts are made in the systems the eTRIKS European project (http://www.etriks.org)or the biology community to correct these shortcomings, among Galaxy project hosted in the USA (https://galaxyproject.org) which the disease maps mentioned above. mandate the delivery of user-friendly interfaces to advanced This underlines the variability in biological events po- bioinformatics resources. Implementation of P4 medicine tentially leading to the development of cancer and me- across the entire health spectrum [122] will be leveraged tastasis and the need for a more personalised care for through promotion of advanced analytical tools available to patients suffering from complex diseases, such as cancer. the larger multidisciplinary community. The methods and It is our hope that this methodology will be repeated on results demonstrated in this paper should contribute to other datasets, diseases and clinical situations as it is one pave this promising road. more step towards establishing a true personalised data analysis pipeline. Additional files The clusters that were found in this analysis are interest- ing hypotheses. They would however require further valid- Additional file 1: AUC of consensus clustering. 
(XLSX 13 kb) ation to become clinically useful, as detailed in the Additional file 2: Complete results of the enrichment analysis between replication of findings section above. We encourage other clusters. (XLSX 4293 kb) researchers to use our findings in their research towards a De Meulder et al. BMC Systems Biology (2018) 12:60 Page 18 of 23 Respiratory Biomedical Research Unit, Southampton, UK), Tim Higgenbottam Additional file 3: Table S7. Estimated accuracy and standard deviation (Allergy Therapeutics, West Sussex, UK), Uruj Hoda (Imperial College, London, UK), of the RFE procedure. Table S8. Accuracy and Kappa values of the Jans Hohlfeld (Fraunhofer ITEM, Hannover, Germany), Cecile Holweg (Genentech, Random Forest models in the training set. Table S9. Performances values San Francisco, USA), Ildiko Horvath (Semmelweis University, Budapest, Hungary), for the Random Forest model in the testing set. Figure S11. Relative Peter Howarth (NIHR Southampton Respiratory Biomedical Research Unit, importance of the top 20 predictors building the final model of the RF. Southampton, UK), Richard Hu (Amgen Inc., Seattle, USA), Sile Hu (Imperial The importance axis is scaled, with the mRNA expression of CD3D scaled College London, UK), Xugang Hu (Amgen Inc., Seattle, USA), Val Hudson (Asthma UK, London, UK), Anna J. James (Karolinska Institutet, Stockholm, to 100% and the methylation state of POLA2 to 0% (not shown). Sweden), Juliette Kamphuis (Longfonds, Amersfoort, The Netherlands), Erika J. (DOCX 18 kb) Kennington (Asthma UK, London, UK), Dyson Kerry (CromSource, Stirling, UK), Additional file 4: DIABLO sPLSDA model results. (DOCX 18966 kb) Matthias Klüglich (Boehringer Ingelheim Pharma GmbH & Co. 
KG, Biberach, Germany), Hugo Knobel (Philips Research Laboratories, Eindhoven, The Netherlands), Richard Knowles (Arachos Pharma, UK), Alan Know (University of Nottingham, UK), Johan Kolmert (Karolinska Institutet, Stockholm, Sweden), Jon Acknowledgements Konradsen (Karolinska Institutet, Stockholm, Sweden), Maxim Kots (Chiesi The U-BIOPRED study group consists of Ian M. Adcock (Imperial College, Pharmaceutical, Parma, Italy), Linn Krueger (University Children’s Hospital, Bern, London, UK), Nora Adriaens (University of Amsterdam, The Netherlands), Has- Switzerland), Norbert Krug (Fraunhofer ITEM, Hannover, Germany), Scott Kuo san Ahmed (EISBM, Lyon, France), Antonios Aliprantis (Merck Research, Bos- (Imperial College, London, UK), Maciej Kupczyk (Karolinska Institutet, Stockholm, ton, USA), Kjell Alving (Uppsala University, Sweden), Charles Auffray (EISBM, Sweden), Bart Lambrecht (University of Gent, Belgium), Ann-Sofie Lantz Lyon, France), Philipp Badorrek (Fraunhofer ITEM, Hannover, Germany), Cor- (Karolinska Institutet, Stockholm, Sweden), Lars Lazarinis (Karolinska Institutet, nelia Faulenbach (Fraunhofer ITEM, Hannover, Germany), Per Bakke (Univer- Stockholm, Sweden), Diane Lefaudeux (EISBM, Lyon, France), Saeeda Lone-Latif sity of Bergen, Norway), David Balgoma (Karolinska Institutet, Stockholm, (University of Amsterdam, The Netherlands), Matthew J. Loza (Janssen R & D, Sweden), Aruna T. Bansal (Acclarogen Ltd. Cambridge, UK), Clair Barber (Uni- Springhouse, USA), Rene Lutter (University of Amsterdam, The Netherlands), Lisa versity of Southampton, UK), Frédéric Baribaud (Janssen R & D, Springhouse, Marouzet (NIHR Southampton Respiratory Biomedical Research Unit, Southampton, USA), An Bautmans (MSD Brussels, Belgium), Annelie F. 
Behndig (Umeå Uni- UK), Jane Martin (NIHR Southampton Respiratory Biomedical Research Unit, versity, Sweden), Elisabeth Bel (University of Amsterdam, The Netherlands), Southampton, UK), Sarah Masefield (European Lung Fondation, Shefield, UK), Jorge Beleta (Almirall S.A., Barcelona, Spain), Ann Berglind (Karolinska Institu- Caroline Mathon (Karolinska Institutet, Stockholm, Sweden), John G. Matthews tet, Stockholm, Sweden), Alix Berton (AstraZeneca, Mölndal, Sweden), Jean- (Genentech, San Francisco, USA), Alexander Mazein (EISBM, Lyon, France), Sally ette Bigler (Amgen Inc., Seattle, USA), Hans Bisgaard, University of Meah (Imperial College, London, UK), Andrea Meiser (Imperial College, London, Copenhagen, Denmark), Grazyna Bochenek, Jagiellonian University, Krakow, UK), Andrew Manzies-Gow (Royal Brompton and Harefield NHS Fondation Trust, Poland), Michael J. Boedigheimer (Amgen Inc., Seattle, USA), Klaus Bønnelykke London, UK), Leanne Metcalf (Asthma UK, London, UK), Roelinde Middelveld (Karo- (University of Copenhagen, Denmark), Joost Brandsma, (University of South- linska Institutet,Stockholm,Sweden),Maria Mikus (Science for Life Laboratory, ampton, UK), Armin Braun (Fraunhofer ITEM, Hannover, Germany), Paul Brink- Stockholm, Sweden), Montse Miralpeix (Almirall, Barcelona, Spain), Philip Monk man (University of Amsterdam, The Netherlands), Dominic Burg (University of (Synairgen Research Ltd., Southampton, UK), Paolo Montuschi (Università Cattolica Southampton, UK), Davide Campagna (University of Catania, Italy), Leon del Sacro Cuore, Rome, Italy), Nadia Mores (Università Cattolica del Sacro Cuore, Carayannopoulos, (MSD, USA), Massimo Caruso (University of Catania, Italy), Rome,Italy), ClareS.Murray(University of Manchester,UK),Jacek Musial Pedro Carvalho da Purificacão Rocha João Pedro (Royal Brompton and Harefield (Jagiellonian University Medical College, Krakow, Poland), David Myles (GSK, UK), NHS Fondation Trust, UK), Amphun Chaiboonchoe (EISBM, Lyon, France), 
Shama Naz (Karolinska Institutet, Stockholm, Sweden), Katja Nething (Boehringer Romanas Chaleckis (Karolinska Institutet, Stockholm, Sweden), Pascal Chanez Ingelheim Pharma GmbH & Co. KG, Biberach, Germany), Ben Nicholas (University (University of Aix Marseille, France), Kiang Fan Chung, Imperial College London, of Southampton, UK), Ulf Nihlen (AstraZeneca, Molndal, Sweden), Peter Nilsson UK), Courtney Coleman (Asthma UK, London, UK), Chris Compton (GSK, UK), (Science for Life Laboratory, Stockholm, Sweden), Björn Nordlund (Karolinska Julie Corfield (Arateva R & D, Nottingham, UK), Arnaldo D’Amico (University of Institutet,Stockholm,Sweden),JörgenÖstling (AstraZeneca,Molndal,Sweden), Rome ‘Tor Vergata’, Rome, Italy), Barbro Dahlén (Karolinska Institutet, Stockholm, Antonio Pacino (Lega Italiano Anti Fumo, Catania, Italy), Laurie Pahus (Aix-Marseille Sweden), Sven-Erik Dahlén (Karolinska Institutet, Stockhlom, Sweden), Jorge De University, Marseille, France), Susanna Palkonen (European Federation of Allergy Alba (Almirall S.A., Barcelona, Spain), Pim de Boer (Londfonds, Amersfoort, The and Airways Diseases Patient’s Associations, Brussels, Belgium), Ioannis Pandis Netherlands), Inge De Lepeleire (MSD, Brussels, Belgium), Bertrand De Meulder (Imperial College London, UK), Stelios Pavlidis (Imperial College London, UK), (EISBM, Lyon, France), Tamara Dekker (University of Amsterdam, The Giorgio Pennazza (University of Rome ‘Tor Vergata’,Rome, Italy),AnnePetrén Netherlands), Ingrid Delin (Karolinska Institutet, Stockholm, Sweden), Patrick (Karolinska Institutet, Stockholm, Sweden), Sandy Pink (NIHR Southampton Dennison (University of Southampton, UK), Annemiek Dijkhuis (University of Respiratory Biomedical Research Unit, Southampton, UK), Anthony Postle Amsterdam, The Netherlands), Ratko Djukanovic (University of Southampton, (University of Southampton, UK), Pippa Powel (European Lung Fondation, Sheffield, UK), Aleksandra Draper (BioSci Consulting, Maasmechelen, 
Belgium), Jessica UK), Malayka Rahman-Amin (Asthma UK, London, UK), Navin Rao (Janssen R & D, Edwards (Asthma UK, London, UK), Rosalia Emma (University of Catania, Italy), La Jolla, USA), Lara Ravanetti (University of Amsterdam, The Netherlands), Emma Magnus Ericsson (Karolinska University Hospital, Stockholm, Sweden), Veit Ray (NIHR Southampton Respiratory Biomedical Research Unit, Southampton, UK), Erpenbeck (Novartis Institutes for Biomedical Research, Basel, Switzerland), Stacey Reinke (Karolinska Institutet, Stockholm, Sweden), Leanne Reynolds (Asthma Damijan Erzen (Boehringer Ingelheim Pharma GmbH & Co. KKKG; Biberach, UK, London, UK), Kathrin Riemann (Boehringer Ingelheim Pharma GmbH & Co. KG, Germany), Klaus Fichtner (Boehringer Ingelheim Pharma GmbH & Co. KKKG; Biberach, Germany), John Riley (GSK, UK), Martine Robberechts (MSD, Brussels, Biberach, Germany), Neil Fitch (BioSci Consulting, Maasmechelen, Belgium), Belgium), Amanda Roberts (Asthma UK, London, UK), Graham Roberts (NIHR Louise J. Fleming (Imperial College London, UK), Breda Flood (Asthma UK, Southampton Respiratory Biomedical Research Unit, Southampton, UK), Christos London, UK), Stephen J. 
Fowler (Manchester Academic Health Sciences Center, Rossios (Imperial College London, UK), Anthony Rowe (Janssen R & D, UK), Kirsty Manchester, UK), Urs Frey (University Children’s Hospital, Basel, Switzerland), Russel (Imperial College London, UK), Michael Rutgers (Longfonds, Amersfoort, The Martina Gahlemann (Boehringer Ingelheim GmbH, Switzerland), Gabriella Galffy Netherlands), Thomas Sandström (Umeå University, Sweden), Giuseppe Santini (Semmelweis University, Budapest, Hungary), Hactor Gallart (Karolinska Institutet, (Università Cattolica del Sacro Cuore, Italy), Marco Santoninco (University of Rome Stockholm, Sweden), Trevor Garret (BioSci Consulting, Maasmechelen, Belgium), ‘Tor Vergata’, Rome, Italy), Corinna Schoelch (Boehringer Ingelheim Pharma GmbH Thomas Geiser (University Hospital Bern, Switzerland), Julaiha Gent (Royal & Co. KG, Biberach, Germany), James P.R. Schofield (University of Southampton, Brompton and Harefield NHS Fondation Trust, London, UK), Maria Gerhardsson de UK), Wolfgang Seibold (Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, Verdier (AstraZeneca Molndal, Sweden), David Gibeon (Imperial College, London, Germany), Dominick E. Shaw (University of Nottingham, UK), Ralf Sigmund UK), Cristina Gomez (Karolinska Institutet, Stockholm, Sweden), Kerry Gove (NIHR (Boehringer Ingelheim Pharma GmbH & Co. 
KG, Biberach, Germany), Florian Singer Southampton Respiratory Biomedical Research Unit and Clinical and Experimental (University Children’s Hospital, Zurich, Switzerland), Marcus Sjödin (Karolinska Sciences, Southampton, UK), Neil Gozzard (UCB, UK), Yi-ke Guo (Imperial College, Institutet,Stockholm,Sweden),PaulJ.Skipp (UniversityofSouthampton, UK), London, UK), Simone Hashimoto (University of Amsterdam, The Netherlands), Barbara Smids (University of Amsterdam, The Netherlands), Caroline Smith (NIHR John Haughney (International Primary Care Respiratory Group, Aberdeen, Southampton Respiratory Biomedical Research Unit, Southampton, UK), Jessica Scotland), Gunilla Hedlin (Karolinska Institutet, Stockholm, Sweden), Pieter-Paul Smith (Asthma UK, London, UK), Katherine M. Smith (University of Nottingham, Hekking (University of Amsterdam, The Netherlands), Elisabeth Henriksson (Karolinska Institutet, Stockholm, Sweden), Lorraine Hewitt (NIHR Southampton UK), Päivi Söderman, Karolinska Institutet, Stockholm, Sweden), Adesimbo De Meulder et al. BMC Systems Biology (2018) 12:60 Page 19 of 23 Sogbesan (Royal Brompton and Harefield NHS Fondation Trust, London, UK), Ana within and contributed to the development of the data analysis plan, as a R. Sousa (GSK, UK), Doroteya Staykova (University of Southampton, UK), Peter J. member of U-BIOPRED and eTRIKS projects. DL contributed to the writing of Sterk (University of Amsterdam, The Netherlands), Karin Strandberg (Karolinska the manuscript, to the planning and the performing of the analyses within Institutet, Stockholm, Sweden), Kai Sun (Imperial College, London, UK), David and contributed to the development of the data analysis plan, as a member Supple(Asthma UK,London,UK), Marton Szentkereszty (Semmelweis University, of U-BIOPRED and eTRIKS projects. 
ATB contributed to the design of the Budapest, Hungary), Lilla Tamasi (Semmelweis University, Budapest, Hungary), analyses presented within along with all statistical concerns during the de- Kamran Tariq (University of Southampton, UK), John-Olof Thörngren (Karolinska velopment of the data analysis plan, as a member of the U-BIOPRED project. University Hospital, Stockholm, Sweden), Bob Thornton (MSD, USA), Jonathan AMaz contributed to the enrichment analysis parts of the manuscript, as a Thorsen (University of Copenhagen, Denmark), Salvatore Valente (Università member of U-BIOPRED and eTRIKS projects. AC contributed to the design of Cattolica del Sacro Cuore, Rome, Italy), Wim van Aalderen (University of the data analysis plans and to the clustering parts of the manuscript as a Amsterdam, The Netherlands), Marianne van de Pol (University of Amsterdam, The Netherlands), Kees van Drunen (University of Amsterdam, The member of the U-BIOPRED project. HA contributed to the design of the data Netherlands), Marleen van Drunen (University of Amsterdam, The analysis plans and to the clustering parts of the manuscript as a member Netherlands), Jenny Versnel (Asthma UK, London, UK), Jorgen Vestbo of the U-BIOPRED project. IB contributed to the enrichment analysis and (Manchester Academic Health Sciences Centre, Manchester, UK), Anton machine-learning parts of the manuscript as a member of the eTRIKS project. Vink (Philips Research Laboratories, Eindhoven, The Netherlands), Nadja MS contributed to the enrichment analysis and machine-learning parts of Vising (University of Copenhagen, Denmark), Christophe von Garnier the manuscript as a member of the eTRIKS project. JP contributed to the data (University Hospital, Bern, Switzerland), Ariane Wagener (University of preparation parts and to the visualisations of the manuscript. 
SB contributed to Amsterdam, The Netherlands), Scott Wagers (BioSci Consulting, Maasmechelen, the design of the data analysis plan and to the clustering, data integration and Belgium), Frans Wald (Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, enrichment analysis parts of the manuscript. NL contributed to the data Germany), Samantha Walker (Asthma UK, London, UK), Jonathan Ward preparation parts of the manuscript. KS contributed to the data managements (University of Southampton, UK), Zsoka Weiszhart (Semmelweis University, aspects of the manuscript as a member of the eTRIKS project. IP contributed to Budapest Hungary), Kristiane Wetzel (Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, Germany), Craig E. Wheelock (Karolinska the data managements aspects of the manuscript as a member of the eTRIKS Institutet, Stockholm, Sweden), Coen Wiegman (Imperial College London, project. XY contributed to the data managements aspects of the manuscript as UK), Siân Williams (International Primary Care Respiratory Group, a member of the eTRIKS project. MB contributed to the data managements and Aberdeen, Scotland), Susan J. Wilson (University of Southampton, UK), clustering aspects of the manuscript as a member of the U-BIOPRED project. Ashley Woodcock (Manchester Academic Health Science Centre, Manchester, KK contributed to the development of the data analysis plan and related parts UK), Xian Yang (Imperial College London, UK), Elizabeth Yeyasingham (GSK, UK), in the manuscript as a member of the U-BIOPRED project. JvE contributed to Wen Yu (Amgen Inc., Seattle, USA), Wilhelm Zetterquist (Karolinska Institutet, the development of the data analysis plan and related parts in the manuscript Stockholm, Sweden), Koos Zwinderman (University of Amsterdam, The as a member of the U-BIOPRED project. AB contributed to the development of Netherlands). 
The eTRIKS consortium members are: Alireza Tamaddoni Nezhad the data analysis plan and related parts in the manuscript as a member of the (Imperial College London, UK), Adriano Barbosa da Silva (University of U-BIOPRED project. TD contributed to the development of the data analysis Luxemburg, Luxemburg), Alexander Mazein (EISBM, Lyon, France), plan and related parts in the manuscript as a member of the U-BIOPRED Andreas Tielmann (Merck), Angela Gaudette (Pfizer), Anna Silberberg project. PD contributed to the development of the data analysis plan and (Pfizer), Antigoni (Anna) Elefsinioti (Bayer), Axel Oehmichen (Imperial College London, UK), Maria Biryukov (University of Luxemburg, Luxemburg), related parts in the manuscript as a member of the U-BIOPRED project. CL Bertrand De Meulder (EISBM, Lyon, France), Jen Birgitte (Lundbeck), Bron Kisler contributed to the development of the data analysis plan and related parts in (CDISC), Anna Maria Carusi, Charles Auffray (EISBM, Lyon, France), Diana O’Malley the manuscript as a member of the U-BIOPRED project. AP contributed to the (Imperial College London, UK), David Henderson (Bayer), Dorina Bratfalean development of the data analysis plan and related parts in the manuscript as a (CDISC), Diane Lefaudeux (EISBM, Lyon, France), Denny Verbeeck (Janssen), Ejner member of the U-BIOPRED project. JC contributed to the development of the Knud Moltzen (Lundbeck), Eva Lindgren (Astra Zeneca), Florian Guitton (Imperial data analysis plan and related parts in the manuscript as a member of the College London, UK), Fabien Richard (EISBM, Lyon, France), Francisco Bonachela U-BIOPRED project. RD contributed to the development of the data analysis Capdevila (Janssen), Ghita Rahal (CNRS, Lyon, France), Heike Dagmar plan and related parts in the manuscript as a member of the U-BIOPRED Schuermann (Sanofi), Ibrahim Emam (Imperial College London, UK), Irina project. 
KFC contributed to the overall design of the study as a member of the Balaur (EISBM, Lyon, France), Ingrid Sofie Harbo (Lundbeck), Jay Bergeron U-BIOPRED project. IMA contributed to the overall design of the study as a (Pfizer), Kai Sun (Imperial College London, UK), Laurence Mazuranok member of the U-BIOPRED project. YG contributed to the data management (Sanofi), Laurence Painell’s (IDBS), Manfred Hendlich (Sanofi), Gino Marchetti (CNRS, Lyon, France), Derek Marren (Lilly), Jaroslav Martasek aspects of the manuscript as a member of the eTRIKS project. PJS contributed (Lilly), Martin Romacker (Roche), Michael Braxenthaler (Roche), Maria to the overall design of the study as a member of the U-BIOPRED project. AMan Manuela Nogueira (EISBM, Lyon, France), Mansoor Saqi (EISBM, Lyon, contributed to the development of the data analysis plan and co-led the France), Neil Fitch (BioSci Consulting), Nesrine Taibi (EISBM, Lyon, France), systems biology work package of the U-BIOPRED project. AR contributed to the Odile Brasier (EISBM, Lyon, France), Paul Agapow (Imperial College development of the data analysis plan and co-led the systems biology work London, UK), Peter Rice (Imperial College London, UK), Paul Houston package of the U-BIOPRED and eTRIKS projects. FB contributed to the (CDISC), Philippe Rocca-Serra (University of Oxford, UK), Reinhard development of the data analysis plan and co-led the systems biology Schneider (University of Luxemburg, Luxemburg), James Rimell (Lilly), work package of the U-BIOPRED project. CA contributed to the overall Stelios Pavlidis (Imperial College London, UK), Susanna-Assunta Sansone design and supervision of the study, to the development of the data (University of Oxford, UK), Sally Miles (Imperial College London, UK), analysis plan and co-led the systems biology work package of the U-BIOPRED Samiul Hasan (GSK), Sascha Herzinger (University of Luxemburg, Luxemburg), project and its extension in the eTRIKS project. 
Scott Wagers (BioSci Consulting), Sikander Hayat (Bayer), Tomas Dalentoft (Astra Zeneca), Vahid Elyasigomari (Imperial College London, UK), Venkata Satagopam (University of Luxemburg, Luxemburg), Wei Gu (University of Luxemburg, Ethics approval and consent to participate Luxemburg), Xian Yang (Imperial College London, UK), Yi-Ke Guo (Imperial Not applicable College London, UK). Consent for publication Funding Not applicable This work was supported through the Innovative Medicines Initiative U-BIOPRED and eTRIKS projects (IMI n°115010 and IMI n°115446 respectively). Competing interests Availability of data and materials ATB received fees from Acclarogen Ltd. KK received fees from UCB Celltech The datasets analysed in this study are available in the NIH National Cancer Ltd. JvE received fees from UCB Pharma S.A. AB received fees from Roche Institute repository (https://portal.gdc.cancer.gov/)[117]. Products Ltd. TD received fees from Janssen R & D High Wycombe Ltd. PD received fees from AstraZeneca Ltd. CL received fees from GSK Ltd. JC Authors’ contributions received fees from Areteva R & D Ltd. AMan received fees from Roche All authors read and approved the final version of the manuscript. BDM Diagnostics GmbH, AR received fees from Janssen R & D High Wycombe wrote the main body of the manuscript, performed the analyses presented Ltd. FB received fees from Janssen R & D Springhouse LLC. De Meulder et al. BMC Systems Biology (2018) 12:60 Page 20 of 23 Publisher’sNote 17. Tweeddale H, Notley-McRobb L, Ferenci T. Effect of slow growth on Springer Nature remains neutral with regard to jurisdictional claims in published metabolism of Escherichia coli, as revealed by global metabolite pool maps and institutional affiliations. (“metabolome”) analysis. J Bacteriol. 1998;180(19):5109–16. 18. Sterk PJ. Towards the Physionomics of asthma and COPD. Copenhagen: Author details European Respiratory Society Annual Congress; 2005. p. 17–21. 
European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, 19. Machado RF, Laskowski D, Deffenderfer O, Burch T, Zheng S, Mazzone PJ, EISBM, 50 Avenue Tony Garnier, 69007 Lyon, France. Acclarogen Ltd, St Mekhail T, Jennings C, Stoller JK, Pyle J, et al. Detection of lung cancer by John’s Innovation Centre, Cambridge CB4 OWS, UK. Data Science Institute, sensor array analyses of exhaled breath. Am J Respir Crit Care Med. 2005; Imperial College, London SW7 2AZ, UK. Janssen Research and Development 171(11):1286–91. Ltd, High Wycombe HP12 4DP, UK. UCB Pharma S.A, 1420 Braine-l’Alleud, 20. Sanchez C, Lachaize C, Janody F, Bellon B, Roder L, Euzenat J, Rechenmann 6 7 Belgium. UCB Celltech, 208 Bath Road, Slough SL13WE, UK. Roche Ltd, F, Jacq B. Grasping at molecular interactions and genetic networks in Welwyn Garden City AL7 1TW, UK. AstraZeneca Ltd, Alderley Park, Drosophila melanogaster using FlyNets, an internet database. Nucleic Acids Macclesfield SK10 4TG, UK. Target Sciences, GlaxoSmithKline, Gunnels Wood Res. 1999;27(1):89–94. Road, Stevenage SG1 2NY, UK. Faculty of Medicine, University of 21. Cesareni G, Ceol A, Gavrila C, Palazzi LM, Persico M, Schneider MV. Southampton, Southampton SO17 1BJ, UK. AstraZeneca R & D, 43150 Comparative interactomics. FEBS Lett. 2005;579(8):1828–33. 12 13 Mölndal, Sweden. Arateva R & D Ltd, Nottingham NG1 1GF, UK. National 22. Mayer B. Bioinformatics for omics data : methods and protocols. New York: Hearth and Lung Institute, Imperial College London, London SW3 6LY, UK. Humana Press; 2011. Department of Respiratory Medicine, Academic Medical Centre, University 23. Mesarovic MD. Case institute of technology. Systems research center.: of Amsterdam, Amsterdam AZ1105, The Netherlands. Research Informatics, systems theory and biology. Proceedings of the 3rd systems symposium at Roche Diagnostics GmbH, 82008 Unterhaching, Germany. Janssen Research case institute of technology. Berlin: Springer; 1968. 
and Development Ltd, Spring House, PA 19002, USA. 24. Noble D. Cardiac action and pacemaker potentials based on the Hodgkin- Huxley equations. Nature. 1960;188:495–7. Received: 20 July 2017 Accepted: 21 February 2018 25. Auffray C, Imbeaud S, Roux-Rouquie M, Hood L. From functional genomics to systems biology: concepts and practices. C R Biol. 2003;326(10–11):879–92. 26. Auffray C, Noble D. Origins of systems biology in William Harvey's masterpiece on the movement of the heart and the blood in animals. Int J References Mol Sci. 2009;10(4):1658–69. 1. Jameson JL, Longo DL. Precision medicine–personalized, problematic, and 27. Auffray C, Nottale L. Scale relativity theory and integrative systems biology: 1. promising. N Engl J Med. 2015;372(23):2229–34. Founding principles and scale laws. Prog Biophys Mol Biol. 2008;97(1):79–114. 2. Chen R, Snyder M. Promise of personalized omics to precision medicine. 28. Davidson EH, Rast JP, Oliveri P, Ransick A, Calestani C, Yuh CH, Minokawa T, Wiley Interdiscip Rev Syst Biol Med. 2013;5(1):73–82. Amore G, Hinman V, Arenas-Mena C, et al. A genomic regulatory network for development. Science. 2002;295(5560):1669–78. 3. Viceconti M, Hunter P, Hose R. Big data, big knowledge: big data for personalized healthcare. IEEE J Biomed Health Inform. 2015;19(4):1209–15. 29. Ideker T, Galitski T, Hood L. A new approach to decoding life: systems 4. Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of biology. Annu Rev Genomics Hum Genet. 2001;2:343–72. integrating data to uncover genotype-phenotype interactions. Nat Rev 30. Kitano H. Looking beyond the details: a rise in system-oriented approaches Genet. 2015;16(2):85–97. in genetics and molecular biology. Curr Genet. 2002;41(1):1–10. 5. Berger B, Gaasterland T, Lengauer T, Orengo C, Gaeta B, Markel S, Valencia 31. Noble D. Modeling the heart–from genes to cells to the whole organ. A. ISCB's initial reaction to the New England journal of medicine editorial on Science. 
2002;295(5560):1678–82. data sharing. PLoS Comput Biol. 2016;12(3):e1004816. 32. Nottale L, Auffray C. Scale relativity theory and integrative systems 6. Longo DL, Drazen JM. Data Sharing. N Engl J Med. 2016;374(3):276–7. biology: 2. Macroscopic quantum-type mechanics. Prog Biophys Mol Biol. 2008;97(1):115–57. 7. Hawkins TL, McKernan KJ, Jacotot LB, MacKenzie JB, Richardson PM, Lander 33. Prokop A, Csukas B. Systems biology - integrative biology and simulation ES. A magnetic attraction to high-throughput genomics. Science. 1997; tools. Dordrecht: Springer; 2013. 276(5320):1887–9. 8. MacKenzie S. High-throughput interpretation of pathways and biology. 34. Anderson GP. Endotyping asthma: new insights into key pathogenic Drug News Perspect. 2001;14(1):54–7. mechanisms in a complex, heterogeneous disease. Lancet. 2008;372(9643): 9. Pietu G, Mariage-Samson R, Fayein NA, Matingou C, Eveno E, Houlgatte R, 1107–19. Decraene C, Vandenbrouck Y, Tahi F, Devignes MD, et al. The Genexpress 35. Auffray C, Chen Z, Hood L. Systems medicine: the future of medical IMAGE knowledge base of the human brain transcriptome: a prototype genomics and healthcare. Gen Med. 2009;1(1):2. integrated resource for functional and computational genomics. Genome 36. Auffray C, Charron D, Hood L. Predictive, preventive, personalized and Res. 1999;9(2):195–209. participatory medicine: back to the future. Gen Med. 2010;2(8):57. 10. Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett DE Jr, 37. Auffray C, Hood L. Editorial: systems biology and personalized medicine - Hieter P, Vogelstein B, Kinzler KW. Characterization of the yeast the future is now. Biotechnol J. 2012;7(8):938–9. transcriptome. Cell. 1997;88(2):243–51. 38. Hood L, Auffray C. Participatory medicine: a driving force for revolutionizing healthcare. Gen Med. 2013;5(12):110. 11. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997; 39. 
Hood L, Balling R, Auffray C. Revolutionizing medicine in the 21st century 278(5338):680–6. through systems approaches. Biotechnol J. 2012;7(8):992–1001. 12. Wilkins MR, Pasquali C, Appel RD, Ou K, Golaz O, Sanchez JC, Yan JX, Gooley 40. Sobradillo P, Pozo F, Agusti A. P4 medicine: the future around the corner. AA, Hughes G, Humphery-Smith I, et al. From proteins to proteomes: large Arch Bronconeumol. 2011;47(1):35–40. scale protein identification by two-dimensional electrophoresis and amino 41. Wolkenhauer O, Auffray C, Jaster R, Steinhoff G, Dammann O. The acid analysis. Biotechnology (N Y). 1996;14(1):61–5. road from systems biology to systems medicine. Pediatr Res. 2013; 13. James P. Protein identification in the post-genome era: the rapid rise of 73(4 Pt 2):502–7. proteomics. Q Rev Biophys. 1997;30(4):279–331. 42. Leek JT,Scharpf RB,Bravo HC,SimchaD, LangmeadB,Johnson WE, 14. Kishimoto K, Urade R, Ogawa T, Moriyama T. Nondestructive quantification Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical of neutral lipids by thin-layer chromatography and laser-fluorescent impact of batch effects in high-throughput data. Nat Rev Genet. 2010; scanning: suitable methods for “lipidome” analysis. Biochem Biophys Res 11(10):733–9. Commun. 2001;281(3):657–62. 43. McDonald JH. Handbook of biological statistics. 3rd ed. Baltimore: Sparky 15. Han X, Gross RW. Global analyses of cellular lipidomes directly from crude House Publishing; 2014. extracts of biological samples by ESI mass spectrometry: a bridge to 44. Lapatas V, Stefanidakis M, Jimenez RC, Via A, Schneider MV. Data integration lipidomics. J Lipid Res. 2003;44(6):1071–9. in biological research: an overview. J Biol Res Thessalon. 2015;22:1–16. 16. Oliver SG, Winson MK, Kell DB, Baganz F. Systematic functional analysis of 45. Rhee SY, Wood V, Dolinski K, Draghici S. Use and misuse of the gene the yeast genome. Trends Biotechnol. 1998;16(9):373–8. ontology annotations. Nat Rev Genet. 2008;9(7):509–15. 
46. Reimand J, Arak T, Vilo J. G:profiler–a web server for functional interpretation of gene lists (2011 update). Nucleic Acids Res. 2011;39(Web Server issue):W307–15.
47. Fujita KA, Ostaszewski M, Matsuoka Y, Ghosh S, Glaab E, Trefois C, Crespo I, Perumal TM, Jurkowski W, Antony PM, et al. Integrating pathways of Parkinson's disease in a molecular interaction map. Mol Neurobiol. 2014;49(1):88–102.
48. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015;43(Database issue):D447–52.
49. Vallabhajosyula RR, Raval A. Computational modeling in systems biology. Methods Mol Biol. 2010;662:97–120.
50. Kuperstein I, Bonnet E, Nguyen HA, Cohen D, Viara E, Grieco L, Fourquet S, Calzone L, Russo C, Kondratova M, et al. Atlas of cancer Signalling network: a systems biology resource for integrative analysis of cancer data with Google maps. Oncogene. 2015;4:e160.
51. Mizuno S, Iijima R, Ogishima S, Kikuchi M, Matsuoka Y, Ghosh S, Miyamoto T, Miyashita A, Kuwano R, Tanaka H. AlzPathway: a comprehensive map of signaling pathways of Alzheimer's disease. BMC Syst Biol. 2012;6:52.
52. Ogishima S, Mizuno S, Kikuchi M, Miyashita A, Kuwano R, Tanaka H, Nakaya J. AlzPathway, an updated map of curated signaling pathways: towards deciphering Alzheimer's disease pathogenesis. Methods Mol Biol. 2016;1303:423–32.
53. Zhao S, Iyengar R. Systems pharmacology: network analysis to identify multiscale mechanisms of drug action. Annu Rev Pharmacol Toxicol. 2012;52:505–21.
54. Bigler J, Hu X, Boedigheimer M, Rowe A, Chung F, Djukanovic R, Sousa A, Corfield J, Adcock I, Sterk P, et al. Whole transcriptome analysis in peripheral blood from asthmatic and healthy subjects in the U-BIOPRED study. Eur Respir J. 2014;44(Suppl 58):2027.
55. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibe-Kains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–7.
56. Auffray C, Balling R, Barroso I, Bencze L, Benson M, Bergeron J, Bernal-Delgado E, Blomberg N, Bock C, Conesa A, et al. Making sense of big data in health research: towards an EU action plan. Gen Med. 2016;8(1):71.
57. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.
58. van der Kloet FM, Bobeldijk I, Verheij ER, Jellema RH. Analytical error reduction using single point calibration for accurate and precise metabolomic phenotyping. J Proteome Res. 2009;8(11):5132–41.
59. Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press; 2007.
60. Guo Y, Graber A, McBurney RN, Balasubramanian R. Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms. BMC Bioinformatics. 2010;11:447.
61. Michiels S, Kramar A, Koscielny S. Multidimensionality of microarrays: statistical challenges and (im) possible solutions. Mol Oncol. 2011;5(2):190–6.
62. Lee JA, Verleysen M. Nonlinear dimensionality reduction. New York: Springer; 2007.
63. Calza S, Raffelsberger W, Ploner A, Sahel J, Leveillard T, Pawitan Y. Filtering genes to improve sensitivity in oligonucleotide microarray data analysis. Nucleic Acids Res. 2007;35(16):e102.
64. Stanberry L, Mias GI, Haynes W, Higdon R, Snyder M, Kolker E. Integrative analysis of longitudinal metabolomics data from a personal multi-omics profile. Meta. 2013;3(3):741–60.
65. Ideker T, Dutkowski J, Hood L. Boosting signal-to-noise in complex biology: prior knowledge is power. Cell. 2011;144(6):860–3.
66. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559.
67. Varshavsky R, Gottlieb A, Linial M, Horn D. Novel unsupervised feature filtering of biological data. Bioinformatics. 2006;22(14):e507–13.
68. Bonev B, Escolano F, Cazorla MA. A novel information theory method for filter feature selection. Lect Notes Artif Int. 2007;4827:431–40.
69. Meyer PE. The rank Minrelation coefficient. Qual Technol Quant M. 2014;11(1):61–70.
70. Scardoni G, Petterlini M, Laudanna C. Analyzing biological network parameters with CentiScaPe. Bioinformatics. 2009;25(21):2857–9.
71. Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics. 2011;27(3):431–2.
72. Cannistraci CV, Ravasi T, Montevecchi FM, Ideker T, Alessio M. Nonlinear dimension reduction and clustering by minimum Curvilinearity unfold neuropathic pain and tissue embryological classes. Bioinformatics. 2010;26(18):i531–9.
73. Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized mutual information feature selection. IEEE Trans Neural Netw. 2009;20(2):189–201.
74. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
75. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.
76. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.
77. Ronan T, Qi Z, Naegle KM. Avoiding common pitfalls when clustering biological data. Sci Signal. 2016;9(432):re6.
78. Shirkhorshidi AS, Aghabozorgi S, Teh YW, Herawan T. Big Data Clustering: A Review. Computational Science and Its Applications. 2014;8583:707–20.
79. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010;26(12):1572–3.
80. Caruana R, Elhawary M, Nguyen N, Smith C. Meta clustering. Ieee Data Mining. 2006:107–18.
81. Shen R, Mo Q, Schultz N, Seshan VE, Olshen AB, Huse J, Ladanyi M, Sander C. Integrative subtype discovery in glioblastoma using iCluster. PLoS One. 2012;7(4):e35236.
82. Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012;28(24):3290–7.
83. Yuan Y, Savage RS, Markowetz F. Patient-specific data fusion defines prognostic cancer subtypes. PLoS Comput Biol. 2011;7(10):e1002227.
84. Bersanelli M, Mosca E, Remondini D, Giampieri E, Sala C, Castellani G, Milanesi L. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics. 2016;17(Suppl 2):15.
85. Benjamini Y, Hochberg Y. Controlling the false discovery rate - a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 1995;57(1):289–300.
86. Noble WS. How does multiple testing correction work? Nat Biotechnol. 2009;27(12):1135–7.
87. Xie J, Cai TT, Maris J, Li H. Optimal false discovery rate control for dependent data. Stat Interface. 2011;4(4):417–30.
88. Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol. 1995;48(12):1503–10.
89. Auffray C. Sharing knowledge: a new frontier for public-private partnerships in medicine. Genome Med. 2009;1(3):29.
90. Lindpaintner K. Biomarkers: call on industry to share. Nature. 2011;470(7333):175.
91. McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams PM, Mesirov JP, Polley MY, Kim KY, Tricoli JV, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med. 2013;11:220.
92. McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams PM, Mesirov JP, Polley MY, Kim KY, Tricoli JV, et al. Criteria for the use of omics-based predictors in clinical trials. Nature. 2013;502(7471):317–20.
93. Poste G. Bring on the biomarkers. Nature. 2011;469(7329):156–7.
94. Sung J, Wang Y, Chandrasekaran S, Witten DM, Price ND. Molecular signatures from omics data: from chaos to consensus. Biotechnol J. 2012;7(8):946–57.
95. Altman DG, Vergouwe Y, Royston P, Moons KG. Prognosis and prognostic research: validating a prognostic model. BMJ. 2009;338:b605.
96. Hemingway H, Riley RD, Altman DG. Ten steps towards improving prognosis research. BMJ. 2009;339:b4184.
97. Jin L, Zuo XY, Su WY, Zhao XL, Yuan MQ, Han LZ, Zhao X, Chen YD, Rao SQ. Pathway-based analysis tools for complex diseases: a review. Genomics Proteomics Bioinformatics. 2014;12(5):210–20.
98. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21(18):3587–95.
99. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.
100. Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, Caudy M, Garapati P, Gillespie M, Kamdar MR, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42(Database issue):D472–7.
101. Milacic M, Haw R, Rothfels K, Wu G, Croft D, Hermjakob H, D'Eustachio P, Stein L. Annotating cancer variants and anti-cancer therapeutics in reactome. Cancers. 2012;4(4):1180–211.
102. Mizuno S, Ogishima S, Kitatani K, Kikuchi M, Tanaka H, Yaegashi N, Nakaya J. Network analysis of a comprehensive knowledge repository reveals a dual role for ceramide in alzheimer's disease. PLoS One. 2016;11(2):e0148431.
103. Lefaudeux D, De Meulder B, Loza MJ, Peffer N, Rowe A, Baribaud F, Bansal AT, Lutter R, Sousa AR, Corfield J, et al. U-BIOPRED clinical adult asthma clusters linked to a subset of sputum -omics. J Allergy Clin Immunol. 2016; In press
104. Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000;100(1):57–70.
105. Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144(5):646–74.
106. Bast RC Jr, Hennessy B, Mills GB. The biology of ovarian cancer: new opportunities for translation. Nat Rev Cancer. 2009;9(6):415–28.
107. Angermueller C, Parnamaa T, Parts L, Stegle O. Deep learning for computational biology. Mol Syst Biol. 2016;12(7):878.
108. Sommer C, Gerlich DW. Machine learning in cell biology - teaching computers to recognize phenotypes. J Cell Sci. 2013;126(Pt 24):5529–39.
109. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A. Caret: classification and regression training, vol. 5; 2012. p. 15–044.
110. Le Cao KA, Gonzalez I, Dejean S. Integromics: an R package to unravel relationships between two omics datasets. Bioinformatics. 2009;25(21):2855–6.
111. Le Cao KA, Rohart F4, Gonzalez I, Dejean S, Gautier B, Bartolo F, Monget P, Coquery J, Yao FBL. mixOmics: omics data integration project: R package version; 2016. p. 6.1.1.
112. Singh ABG, Shannon C, Vacher M, Rohart F, Tebutt S, Le Cao KA. DIABLO - an integrative, multi-omics, multivariate method for multi-group classification: bioRxiv; 2016.
113. Lum PY, Singh G, Lehman A, Ishkanov T, Vejdemo-Johansson M, Alagappan M, Carlsson J, Carlsson G. Extracting insights from the shape of complex data using topology. Sci Rep. 2013;3:1236.
114. Gevaert O, Villalobos V, Sikic BI, Plevritis SK. Identification of ovarian cancer driver genes by using module network integration of multi-omics data. Interface Focus. 2013;3(4):20130013.
115. Jin N, Wu H, Miao Z, Huang Y, Hu Y, Bi X, Wu D, Qian K, Wang L, Wang C, et al. Network-based survival-associated module biomarker and its crosstalk with cell death genes in ovarian cancer. Sci Rep. 2015;5:11566.
116. Kim D, Joung JG, Sohn KA, Shin H, Park YR, Ritchie MD, Kim JH. Knowledge boosting: a graph-based integration approach with multi-omics data and genomic knowledge for cancer clinical outcome prediction. J Am Med Inform Assoc. 2015;22(1):109–20.
117. Network TCGAR. Integrated genomic analyses of ovarian carcinoma. Nature. 2011;474(7353):609–15.
118. Zhang Q, Burdette JE, Wang JP. Integrative network analysis of TCGA data for ovarian cancer. BMC Syst Biol. 2014;8:1338.
119. Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett MK, Etemadmoghadam D, Locandro B, et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008;14(16):5198–208.
[…] biogenesis genes may influence epithelial ovarian cancer risk. Cancer Epidemiol Biomark Prev. 2011;20(6):1131–45.
127. Nakano H, Yamada Y, Miyazawa T, Yoshida T. Gain-of-function microRNA screens identify miR-193a regulating proliferation and apoptosis in epithelial ovarian cancer cells. Int J Oncol. 2013;42(6):1875–82.
128. Archer MC. Role of sp transcription factors in the regulation of cancer cell metabolism. Genes Cancer. 2011;2(7):712–9.
129. Li Y, Yao L, Liu F, Hong J, Chen L, Zhang B, Zhang W. Characterization of microRNA expression in serous ovarian carcinoma. Int J Mol Med. 2014;34(2):491–8.
130. Hein S, Mahner S, Kanowski C, Loning T, Janicke F, Milde-Langosch K. Expression of Jun and Fos proteins in ovarian tumors of different malignant potential and in ovarian cancer cell lines. Oncol Rep. 2009;22(1):177–83.
131. Wang JX, Zeng Q, Chen L, Du JC, Yan XL, Yuan HF, Zhai C, Zhou JN, Jia YL, Yue W, et al. SPINDLIN1 promotes cancer cell proliferation through activation of WNT/TCF-4 signaling. Mol Cancer Res. 2012;10(3):326–35.
132. Sundfeldt K, Ivarsson K, Carlsson M, Enerback S, Janson PO, Brannstrom M, Hedin L. The expression of CCAAT/enhancer binding protein (C/EBP) in the human ovary in vivo: specific increase in C/EBPbeta during epithelial tumour progression. Br J Cancer. 1999;79(7–8):1240–8.
133. He L, Guo L, Vathipadiekal V, Sergent PA, Growdon WB, Engler DA, Rueda BR, Birrer MJ, Orsulic S, Mohapatra G. Identification of LMX1B as a novel oncogene in human ovarian cancer. Oncogene. 2014;33(33):4226–35.
134. White NM, Chow TF, Mejia-Guerrero S, Diamandis M, Rofael Y, Faragalla H, Mankaruous M, Gabril M, Girgis A, Yousef GM. Three dysregulated miRNAs control kallikrein 10 expression and cell proliferation in ovarian cancer. Br J Cancer. 2010;102(8):1244–53.
135. Downie D, McFadyen MC, Rooney PH, Cruickshank ME, Parkin DE, Miller ID, Telfer C, Melvin WT, Murray GI. Profiling cytochrome P450 expression in ovarian cancer: identification of prognostic markers. Clin Cancer Res. 2005;11(20):7369–75.
136. Gambineri A, Tomassoni F, Munarini A, Stimson RH, Mioni R, Pagotto U, Chapman KE, Andrew R, Mantovani V, Pasquali R, et al. A combination of polymorphisms in HSD11B1 associates with in vivo 11β-HSD1 activity and metabolic syndrome in women with and without polycystic ovary syndrome. Eur J Endocrinol. 2011;165(2):283–92.
137. Howells REJ, Dhar KK, Hoban PR, Jones PW, Fryer AA, Redman CWE, Strange RC. Association between glutathione-S-transferase GSTP1 genotypes, GSTP1 over-expression, and outcome in epithelial ovarian cancer. Int J Gynecol Cancer. 2004;14(2):242–50.
138. Cao J, Cai J, Huang D, Han Q, Yang Q, Li T, Ding H, Wang Z. miR-335 represents an invasion suppressor gene in ovarian cancer by targeting Bcl-w. Oncol Rep. 2013;30(2):701–6.
139. Tsai SJ, Hwang JM, Hsieh SC, Ying TH, Hsieh YH. Overexpression of myeloid zinc finger 1 suppresses matrix metalloproteinase-2 expression and reduces invasiveness of SiHa human cervical cancer cells. Biochem Bioph Res Co. 2012;425(2):462–7.
140. Nie LY, Lu QT, Li WH, Yang N, Dongol S, Zhang X, Jiang J. Sterol regulatory […]
120. Brunet JP, Tamayo P, Golub TR, Mesirov JP.
Metagenes and molecular element-binding protein 1 is required for ovarian tumor growth. Oncol Rep. pattern discovery using matrix factorization. P Natl Acad Sci USA. 2004; 2013;30(3):1346–54. 101(12):4164–9. 141. Odegaard E, Staff AC, Kaern J, Florenes VA, Kopolovic J, Trope CG, Abeler VM, Reich R, Davidson B. The AP-2gamma transcription factor is upregulated in 121. Paparrizos J, Gravano L. K-shape: efficient and accurate clustering of time advanced-stage ovarian carcinoma. Gynecol Oncol. 2006;100(3):462–8. series in: SIGMOD international conference on Management of Data: June 4, 2015. Melbourne: Australia: Edited by ACM; 2015. p. 1855–70. 142. Hudson LG, Zeineldin R, Silberberg M, Stack MS. Activated epidermal 122. Sagner M, McNeil A, Puska P, Auffray C, Price ND, Hood L, Lavie CJ, Han ZG, growth factor receptor in ovarian cancer. Cancer Treat Res. 2009;149:203–26. Chen Z, Brahmachari SK, et al. The P4 health Spectrum - a predictive, 143. Landskron J, Helland O, Torgersen KM, Aandahl EM, Gjertsen BT, Bjorge L, preventive, personalized and participatory continuum for promoting Tasken K. Activated regulatory and memory T-cells accumulate in malignant Healthspan. Prog Cardiovasc Dis. 2017;59(5):506–21. ascites from ovarian carcinoma patients. Cancer Immunol Immunother. 123. Reimer D, Sadr S, Wiedemair A, Goebel G, Concin N, Hofstetter G, 2015;64(3):337–47. Marth C, Zeimet AG. Expression of the E2F family of transcription 144. Gavalas NG, Karadimou A, Dimopoulos MA, Bamias A. Immune response factors and its clinical relevance in ovarian cancer. Ann N Y Acad Sci. in ovarian cancer: how is the immune system involved in prognosis 2006;1091:270–81. and therapy: potential for treatment utilization. Clin Dev Immunol. 2010; 124. Xanthoulis A, Tiniakos DG. E2F transcription factors and digestive system 2010:791603. malignancies: how much do we know? World J Gastroenterol. 2013;19(21): 145. Carlsten M, Norell H, Bryceson YT, Poschke I, Schedvins K, Ljunggren HG, 3189–98. 
Kiessling R, Malmberg KJ. Primary human tumor cells expressing CD155 125. Miyata K, Yotsumoto F, Nam SO, Odawara T, Manabe S, Ishikawa T, Itamochi impair tumor targeting by down-regulating DNAM-1 on NK cells. J H, Kigawa J, Takada S, Asahara H, et al. Contribution of transcription factor, Immunol. 2009;183(8):4921–30. SP1, to the promotion of HB-EGF expression in defense mechanism against 146. Bellone S, Siegel ER, Cocco E, Cargnelutti M, Silasi DA, Azodi M, Schwartz PE, the treatment of irinotecan in ovarian clear cell carcinoma. Cancer Med. Rutherford TJ, Pecorelli S, Santin AD. Overexpression of epithelial cell 2014;3(5):1159–69. adhesion molecule in primary, metastatic, and recurrent/chemotherapy- 126. Permuth-Wey J, Chen YA, Tsai YY, Chen Z, Qu X, Lancaster JM, Stockwell H, resistant epithelial ovarian cancer: implications for epithelial cell adhesion Dagne G, Iversen E, Risch H, et al. Inherited variants in mitochondrial molecule-specific immunotherapy. Int J Gynecol Cancer. 2009;19(5):860–6. De Meulder et al. BMC Systems Biology (2018) 12:60 Page 23 of 23 147. Szkandera J, Kiesslich T, Haybaeck J, Gerger A, Pichler M. Hedgehog signaling pathway in ovarian cancer. Int J Mol Sci. 2013;14(1):1179–96. 148. Feng Q, Deftereos G, Hawes SE, Stern JE, Willner JB, Swisher EM, Xi L, Drescher C, Urban N, Kiviat N. DNA hypermethylation, Her-2/neu overexpression and p53 mutations in ovarian carcinoma. Gynecol Oncol. 2008;111(2):320–9. 149. Clarke B, Tinker AV, Lee CH, Subramanian S, van de Rijn M, Turbin D, Kalloger S, Han G, Ceballos K, Cadungog MG, et al. Intraepithelial T cells and prognosis in ovarian carcinoma: novel associations with stage, tumor type, and BRCA1 loss. Mod Pathol. 2009;22(3):393–402. 150. Powell CB, Manning K, Collins JL. Interferon-alpha (IFN alpha) induces a cytolytic mechanism in ovarian carcinoma cells through a protein kinase C-dependent pathway. Gynecol Oncol. 1993;50(2):208–14. 151. Adham SA, Sher I, Coomber BL. 
Molecular blockade of VEGFR2 in human epithelial ovarian carcinoma cells. Lab Investig. 2010;90(5):709–23. 152. Chen H, Ye D, Xie X, Chen B, Lu W. VEGF, VEGFRs expressions and activated STATs in ovarian epithelial carcinoma. Gynecol Oncol. 2004;94(3):630–5. 153. Chen Q, Gao G, Luo S. Hedgehog signaling pathway and ovarian cancer. Chin J Cancer Res. 2013;25(3):346–53. 154. Darb-Esfahani S, Sinn BV, Weichert W, Budczies J, Lehmann A, Noske A, Buckendahl AC, Muller BM, Sehouli J, Koensgen D, et al. Expression of classical NF-kappaB pathway effectors in human ovarian carcinoma. Histopathology. 2010;56(6):727–39. 155. Wang H, Xie X, Lu WG, Ye DF, Chen HZ, Li X, Cheng Q. Ovarian carcinoma cells inhibit T cell proliferation: suppression of IL-2 receptor beta and gamma expression and their JAK-STAT signaling pathway. Life Sci. 2004; 74(14):1739–49. 156. Hurst JH, Hooks SB. Regulator of G-protein signaling (RGS) proteins in cancer biology. Biochem Pharmacol. 2009;78(10):1289–97. 157. Leung PC, Choi JH. Endocrine signaling in ovarian surface epithelium and cancer. Hum Reprod Update. 2007;13(2):143–62. 158. Townsend KN, Spowart JE, Huwait H, Eshragh S, West NR, Elrick MA, Kalloger SE, Anglesio M, Watson PH, Huntsman DG, et al. Markers of T cell infiltration and function associate with favorable outcome in vascularized high-grade serous ovarian carcinoma. PLoS One. 2013;8(12):e82406. 159. Matassa DS, Amoroso MR, Lu H, Avolio R, Arzeni D, Procaccini C, Faicchia D, Maddalena F, Simeon V, Agliarulo I, et al. Oxidative metabolism drives inflammation-induced platinum resistance in human ovarian cancer. Cell Death Differ. 2016; 160. Corney DC, Flesken-Nikitin A, Choi J, Nikitin AY. Role of p53 and Rb in ovarian cancer. Adv Exp Med Biol. 2008;622:99–117. 161. Sampath J, Long PR, Shepard RL, Xia X, Devanarayan V, Sandusky GE, Perry WL 3rd, Dantzig AH, Williamson M, Rolfe M, et al. 
Human SPF45, a splicing factor, has limited expression in normal tissues, is overexpressed in many tumors, and can confer a multidrug-resistant phenotype to cells. Am J Pathol. 2003;163(5):1781–90. 162. Daponte A, Ioannou M, Mylonis I, Simos G, Minas M, Messinis IE, Koukoulis G. Prognostic significance of hypoxia-inducible factor 1 alpha (HIF-1 alpha) expression in serous ovarian cancer: an immunohistochemical study. BMC Cancer. 2008;8:335. 163. Kim JH, Karnovsky A, Mahavisno V, Weymouth T, Pande M, Dolinoy DC, Rozek LS, Sartor MA. LRpath analysis reveals common pathways dysregulated via DNA methylation across cancer types. BMC Genomics. 2012;13:526. 164. Ye J, Livergood RS, Peng G. The role and regulation of human Th17 cells in tumor immunity. Am J Pathol. 2013;182(1):10–20. 165. Leung CS, Yeung TL, Yip KP, Pradeep S, Balasubramanian L, Liu J, Wong KK, Mangala LS, Armaiz-Pena GN, Lopez-Berestein G, et al. Calcium-dependent Submit your next manuscript to BioMed Central FAK/CREB/TNNC1 signalling mediates the effect of stromal MFAP5 on and we will help you at every step: ovarian cancer metastatic potential. Nat Commun. 2014;5:5092. 166. Lengyel E. Ovarian cancer development and metastasis. Am J Pathol. 2010; • We accept pre-submission inquiries 177(3):1053–64. � Our selector tool helps you to find the most relevant journal 167. Frede J, Fraser SP, Oskay-Ozcelik G, Hong Y, Ioana Braicu E, Sehouli J, Gabra H, Djamgoz MB. Ovarian cancer: ion channel and aquaporin expression as � We provide round the clock customer support novel targets of clinical potential. Eur J Cancer. 2013;49(10):2331–44. � Convenient online submission 168. Bigler J, Boedigheimer M, Schofield JPR, Skipp PJ, Corfield J, Rowe A, Sousa � Thorough peer review AR, Timour M, Twehues L, Hu X, et al. A severe asthma disease signature from gene expression profiling of peripheral blood from U-BIOPRED � Inclusion in PubMed and all major indexing services cohorts. Am J Respir Crit Care Med. 
2017;195(10):1311–20. � Maximum visibility for your research Submit your manuscript at www.biomedcentral.com/submit

Journal

BMC Systems Biology, Springer Journals

Published: May 29, 2018
