CircView: a visualization and exploration tool for circular RNAs
Feng, Jing; Xiang, Yu; Xia, Siyu; Liu, Huan; Wang, Jun; Ozguc, Fatma Muge; Lei, Lijun; Kong, Ruoshan; Diao, Lixia; He, Chunjiang; Han, Leng
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx070; pmid: 29106456
Abstract
Circular RNAs (circRNAs) are novel rising stars of noncoding RNAs, which are highly abundant and evolutionarily conserved across species. The number of publications related to circRNAs has increased sharply in recent years, reflecting an emerging focus in the field. Accordingly, tools, pipelines and databases have been developed to identify and store circRNAs. However, there is no existing tool to visualize and explore circRNAs. We therefore introduce CircView, a user-friendly visualization tool for circRNAs detected by existing tools. CircView enables users to visualize circRNAs and to quantify the number of samples in which each circRNA is detected. CircView allows users to explore circRNAs detected by a single tool or by multiple tools. Furthermore, CircView allows users to view regulatory elements, such as microRNA response elements and RNA-binding protein binding sites. CircView is a unique tool to visualize and explore circRNAs, which helps users to better understand the potential functions of circRNAs and to design functional experiments.
Keywords: circular RNAs, circRNAs, visualization, exploration
Introduction
Circular RNAs (circRNAs) are covalently closed RNA molecules produced from precursor mRNAs through backsplicing between a downstream 5′ splice site and an upstream 3′ splice site [1]. Recently, circRNAs have emerged as a pervasive feature of gene expression in eukaryotes, from worm, fruit fly and zebrafish to mouse and human [2–4]. Approximately 10% of expressed genes can generate circRNAs [5, 6], which makes them extremely important in cellular processes. However, the functions of circRNAs remain largely unknown. Recent studies showed that circRNAs play important roles in regulating gene expression [3, 7–9]. For example, CiRS-7, a well-characterized circRNA, functions as a sponge for miR-7 in human and mouse [3, 9].
CircRNAs were also found to be highly abundant in brain and expressed in a spatiotemporal manner, suggesting important roles in brain development [7, 10]. Furthermore, our previous study showed that circRNAs are expressed in a tissue-specific manner, suggesting physiological functions in organ development and disorders [11].
Visualization is a crucial aspect of exploring and analyzing genomic/proteomic data, because an interactive visualization can help researchers to understand the underlying biological meaning. Several tools have therefore been developed to visualize various kinds of biological data sets. For example, the Integrative Genomics Viewer (IGV) is an important tool to visualize large-scale genomic data, which has been widely used in genome research [12]. OpenMS is an open-source tool for liquid chromatography/mass spectrometry (MS) data management, which provides a visualization tool, TOPPView, for advanced visualization of raw and processed MS data [13]. Juicebox provides a visualization system for Hi-C experimental data to explore the 3D structure of the genome [14]. Sushi provides genomic visualization of different sequence elements from common genomic data formats [15]. These tools allow researchers to better understand complicated features. However, no tool has been designed to visualize circRNAs and explore their potential functions. In the present study, we introduce a visualization tool designed specifically to display the features of circRNAs.
Historical background
Following the first discovery of thousands of circRNAs in human in 2012 [16], there has been a tremendous increase in circRNA research. The first tool to identify circRNAs from RNA sequencing (RNA-seq) data, find_circ, was released in 2013 [3]. This tool, as well as other detection tools/pipelines, accelerated discovery studies in the field of circRNAs. For example, only 17 papers were published in 2013 based on a PubMed search.
This number increased sharply to 26 in 2014, 57 in 2015 and 135 in 2016. The number of publications nearly doubles every year, and >200 publications are expected in 2017 (Figure 1).
Figure 1 Number of publications related to circRNAs. The number for 2017 is extrapolated from the rate of increase in previous years.
It is essential to develop accurate and efficient methods to identify and quantify circRNAs. To date, multiple bioinformatics pipelines and tools have been developed to detect circRNAs by identifying reads that span the backsplice junction (Figure 2). Salzman et al. [16] developed the first pipeline to discover thousands of circRNAs in the human genome by identifying exons in scrambled order. Jeck et al. [5] combined biochemical and informatics approaches to identify circRNAs. Guo et al. [6] developed a pipeline and analyzed the enrichment of circRNAs for microRNA (miRNA)-binding sites. Meanwhile, several tools have been developed to identify circRNAs from high-throughput RNA-seq data, such as find_circ [3], MapSplice2 [17], Segemehl [18], circExplorer [19], circRNA_finder [2], CIRI [20], ACFS [7], KNIFE [21], NCLscan [22], DCC [23] and UROBORUS [24]. The results of these tools differ greatly from each other, suggesting that identification of circRNAs should be performed with caution [25]. Furthermore, some of these tools have been improved for higher sensitivity and better performance by using more appropriate algorithms, as in CircExplorer2 [26] and CIRI2 [27].
With the increasing number of circRNAs detected by these tools/pipelines, several databases have been developed for the circRNA research community (Figure 2). For example, Circ2Traits was compiled to link circRNAs and human diseases, including cancer and asthma [28].
CircBase integrated several published circRNA data sets into a standardized database, which allowed users to explore public circRNAs or download customized Python scripts to identify circRNAs from their own RNA-seq data [29]. CircNet provided circRNA expression profiles across hundreds of samples and illustrated circRNA–miRNA–gene regulatory networks [30]. circRNADb provided protein-coding annotations for human exonic circRNAs [31]. Starbase [32] and CircInteractome [33] provided Web tools to study the interaction between circRNAs and RNA regulatory elements, including miRNAs and RNA-binding proteins (RBPs). We also constructed an integrated database, the tissue-specific circRNA database, to characterize tissue-specific circRNAs in the human and mouse genomes [11]. These bioinformatics circRNA detection tools and comprehensive circRNA databases accelerated the study of circRNAs as a novel rising star.
Figure 2 Timeline overview of 13 circRNA detection tools, 3 pipelines and 7 databases.
Tool architecture
CircView is a desktop application implemented in the Java programming language, which is compatible with all major operating systems that provide a Java virtual machine (JVM), including Mac, Windows and Linux. CircView is designed around sets of interfaces and extendable modules, organized in three conceptual layers: (i) an interaction layer, (ii) a model layer and (iii) a data persistence layer (Figure 3).
Figure 3 Tool architecture of CircView. Interaction layer includes main CircView window, user interface elements and controllers for user interaction (top). Model layer maintains core models of CircView, including circRNA model, gene model, transcript model and exon model (middle). Data persistence layer is responsible for supporting local file access and MySQL database access (bottom).
The interaction layer includes the main CircView window, user interface elements and controllers for user interaction. The menu is responsible for configuration settings and data control. The toolbar is responsible for selecting and searching data once data have been loaded. The list panel displays circRNAs based on parameters selected by the user. The image panel is built from Java Swing components; it is the major component CircView uses to display visualizations, and it fits the window size automatically. The image panel also handles certain mouse actions, such as zooming and selecting, based on the Java Mouse Listener. All elements in the image panel can be saved for further use at a customized resolution. The data table displays all comparison results, which can also be saved for further analysis.
The model layer maintains the core models of CircView, including the circRNA model, gene model, transcript model and exon model. The circRNA model is based on circRNA files loaded by users, while the gene, transcript and exon models are based on gene annotation files. The model layer maps circRNAs to genes/transcripts/exons by creating a new index to improve the performance of the mapping process. The model layer also performs all comparisons for the interaction layer.
The data persistence layer is responsible for supporting local file access and MySQL database access, so that all functions can be performed without a network connection. Annotation files and circRNA files are stored together with a configuration file that holds detailed information. Log files are stored to record the operating process.
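The article does not describe how the model layer's index is built, so the following is only a minimal Python sketch of one common approach, under stated assumptions: genes are grouped per chromosome and sorted by start coordinate, and a binary search limits the candidates that could contain a circRNA. The function names `build_gene_index` and `genes_containing`, and the gene names in the usage note, are hypothetical and are not taken from the CircView source code.

```python
from bisect import bisect_right
from collections import defaultdict


def build_gene_index(genes):
    """Group genes by chromosome and sort them by start coordinate.

    `genes` is an iterable of (chrom, start, end, name) tuples.
    This layout is an assumption for illustration, not CircView's format.
    """
    index = defaultdict(list)
    for chrom, start, end, name in genes:
        index[chrom].append((start, end, name))
    for chrom in index:
        index[chrom].sort()
    return index


def genes_containing(index, chrom, circ_start, circ_end):
    """Return names of genes whose span fully contains the circRNA interval."""
    entries = index.get(chrom, [])
    # Only genes that start at or before circ_start can contain the circRNA,
    # so a binary search bounds the candidates we need to check.
    upper = bisect_right(entries, (circ_start, float("inf"), ""))
    return [name for start, end, name in entries[:upper] if end >= circ_end]
```

For example, with hypothetical genes `GENE_A` (chr4:100-500) and `GENE_B` (chr4:300-900), a circRNA at chr4:350-450 maps to both genes, while one at chr4:600-800 maps only to `GENE_B`. A production index would more likely use an interval tree or binned lookup to avoid scanning all earlier genes.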
Furthermore, the data persistence layer is designed to optionally store miRNA response elements (MREs) and RBP binding site information in a MySQL database, so that CircView can search these regulatory elements with high performance.
Features
Overview of CircView
CircView is a desktop application for visualization and exploration of circRNAs in the context of a reference genome. A key feature of CircView is an intuitive display based on mapping, statistics and analysis. It allows users to visualize and explore circRNAs without any programming skills. With a simple mouse click, users can obtain detailed information for circRNAs, including exon composition, type and region, sample comparisons and detection tools. Users can also visualize MRE and RBP binding sites associated with circRNAs. Importantly, the visualized image can be saved for further use at a defined resolution, and detailed feature information can be saved as a table for further analysis.
Launching CircView and loading data
CircView is available for all platforms with an installed JVM. It requires no installation and can be launched after downloading the ‘CircView.jar’ file from http://gb.whu.edu.cn/CircView/ or https://github.com/GeneFeng/CircView. CircView is designed to load gene annotation data, circRNA data and the optional MRE and RBP binding site data. CircView provides seven reference genomes: human (hg38), human (hg19), mouse (mm10), mouse (mm9), zebrafish (zv9), fly (dm6) and worm (ce10) (Supplementary Figure S1). Users can also add other species through the species menu. CircView supports circRNA files from six existing tools, circRNA_finder [2], CIRCexplorer [19], CIRI/CIRI2 [20, 27], find_circ [3], Mapsplice [17] and UROBORUS [24], which are recently developed and well-characterized tools to detect the backsplice junction sites of circRNAs.
Output from other tools is also compatible if it is provided as six tab-delimited columns: chromosome, start point, end point, running number/name, junction reads and strand. CircView also provides links to existing databases, which allow users to be redirected to those databases (Supplementary Figure S2). Furthermore, CircView supports loading MRE and RBP binding site predictions with the following information: chromosome, start site, end site, MRE/RBP name and description. This optional function relies on a local MySQL database named ‘mre_rbp’.
Global visualization of circRNAs
The CircView window is composed of several control panels (Figure 4). The menu panel is at the top and helps the user to configure species, circRNA detection tools, analysis, MRE and RBP. The tool bar is below the menu and is responsible for selecting the reference genome, detection tools, sample and chromosome, and for searching by gene name and/or location. After circRNA information is loaded, genes with circRNAs are listed in the list panel, which can be sorted by gene name, position and circRNA abundance. When a gene is selected, the detailed information of its circRNAs is displayed in the image panel. Linear gene exons are displayed as rectangles with different colors, while introns are displayed as lines. Arrows indicate the gene direction. All circRNAs are illustrated as circles, with different colors representing different exons (Figure 4). Figures in the image panel can be zoomed in/out and can be saved for further analysis.
Figure 4 Global visualization of circRNAs by CircView. CircView window is composed of several control panels. Menu panel (top) configures species, circRNA detection tools, analysis, MRE and RBP. Tool bar (middle) is responsible for selecting reference genome, detection tools, sample and chromosome, and for searching based on gene name and/or position. Genes with circRNAs will be listed in list panel (bottom-left). Detailed information of circRNAs will be displayed in the image panel (bottom-right). Linear gene exons are displayed as rectangles with different grey scale, while introns are displayed as lines. Arrows indicate the gene direction. All circRNAs are illustrated as circles, with different colors to represent different exons.
Exploration of individual circRNAs
To view an individual circRNA, users can simply click on it. An enlarged circRNA pops up, with different colors indicating different exons. Exon number and length are displayed on arcs, while the abundance of the circRNA is displayed in the center of the circle (Figure 5). To validate the accuracy of CircView, we selected four experimentally validated circRNAs [11] and confirmed that CircView displayed them accurately (Supplementary Figure S3).
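The generic six-column, tab-delimited input described under 'Launching CircView and loading data' (chromosome, start point, end point, running number/name, junction reads and strand) can be illustrated with a short reader. This is a minimal Python sketch, not CircView's own Java loader: the `CircRecord` type and `parse_circ_table` function are hypothetical names, and the handling of blank lines and `#` comment lines is an assumption that the article does not specify.

```python
import csv
import io
from typing import NamedTuple


class CircRecord(NamedTuple):
    """One circRNA call in the six-column layout described in the text."""
    chrom: str
    start: int
    end: int
    name: str
    junction_reads: int
    strand: str


def parse_circ_table(handle):
    """Read six tab-delimited columns per line into CircRecord tuples."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and '#' comment lines are skipped.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 6:
            raise ValueError(f"expected 6 columns, got {len(row)}: {row}")
        chrom, start, end, name, reads, strand = row
        records.append(
            CircRecord(chrom, int(start), int(end), name, int(reads), strand)
        )
    return records


# Usage with an in-memory example; the coordinates are made up.
example = "chr4\t100\t500\tcirc_1\t12\t+\nchr1\t20\t90\tcirc_2\t3\t-\n"
parsed = parse_circ_table(io.StringIO(example))
```

Rejecting rows that do not have exactly six columns is a deliberate choice in this sketch, so that a file from an unsupported tool fails loudly instead of being misread.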
For each circRNA, CircView provides the following information: (1) position of the circRNA, including chromosome, donor site and acceptor site (Figure 5D); (2) gene transcript information (Figure 4); (3) exons that generate the circRNA (Figure 5D); (4) type and genomic region of the circRNA (Figure 5D); (5) samples in which the circRNA is detected, for both sample-specific and recurrent circRNAs (Figure 5); (6) whether the circRNA is detected by a single tool or by multiple tools (Figure 5); (7) maximum abundance of the circRNA detected by each tool (Figure 5) and (8) MRE and RBP sites on the circRNA (Supplementary Figure S4).
Figure 5 Exploration of individual circRNAs by CircView. (A) Sample-specific or tissue-specific circRNA. (B) Recurrent circRNA. (C) Low-confidence circRNA detected by few tools. (D) High-confidence circRNA detected by multiple tools, with display of MRE/RBP sites and other detailed information, including location, type, genomic region, detection tool and detection samples for this circRNA.
Furthermore, CircView displays important functional information. First, it is important to know whether a circRNA is identified in a sample-specific or a recurrent manner. Sample-specific circRNAs may be involved in sample-specific features, e.g. tissue specificity [11]. For example, CircView displays a sample-specific circRNA in CORIN identified in one of five (20%) samples (Figure 5A).
In contrast, recurrent circRNAs may represent common features across samples [21], and events that recur across multiple samples are often considered drivers of diseases, e.g. cancer [34, 35]. CircView displays a recurrent circRNA in CORIN identified in multiple samples (Figure 5B), suggesting that this circRNA may play important roles. Second, more than 10 detection tools have been developed. Different tools show different sensitivity and specificity, and identification of circRNAs by a single tool may not be reliable. Combining multiple tools may reduce false positives in detection [11]. Therefore, CircView reports the number of tools that detect each circRNA. For example, CircView displays a circRNA in CORIN that is identified by only one of six tools, suggesting low confidence in this circRNA (Figure 5C). In contrast, CircView displays another circRNA that is identified by four of six (67%) tools, suggesting high confidence in this circRNA (Figure 5D). Third, circRNAs have mainly been reported to exert their function by binding miRNAs via MREs, thereby acting as miRNA sponges [3, 9]. Moreover, circRNAs can bind to RBPs directly to exert further functions [3, 36], and RBP binding can also affect the biogenesis of circRNAs [37, 38]. Therefore, CircView displays MREs as lines in the inner circle and RBP-binding sites as triangles in the outer circle (Figure 5D). Detailed information for a circRNA is also displayed when the mouse moves across it (Figure 5) or when the ‘Details’ button is clicked (Supplementary Figure S4).
Display table for further analysis
To enable downstream functional analysis, CircView provides a table with detailed information for all of the above analyses (Figure 6). When comparing multiple circRNAs, users can adjust ‘Compare Overlap’ to allow mismatches of up to 10 bp at both donor sites and acceptor sites. Users can select either multiple samples or multiple tools for comparison.
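The ‘Compare Overlap’ tolerance and the per-circRNA tool count described above can be sketched as follows. This is an illustrative Python approximation, not CircView's implementation: the function names are hypothetical, and requiring the same chromosome and strand in addition to nearby donor and acceptor sites is an assumption, since the article only states that mismatches of up to 10 bp are allowed at both sites.

```python
def same_circrna(a, b, tolerance=0):
    """Decide whether two calls describe the same circRNA.

    Each call is a (chrom, donor, acceptor, strand) tuple. Donor and
    acceptor sites may each differ by at most `tolerance` bp.
    Matching on chromosome and strand is an assumption of this sketch.
    """
    return (
        a[0] == b[0]
        and a[3] == b[3]
        and abs(a[1] - b[1]) <= tolerance
        and abs(a[2] - b[2]) <= tolerance
    )


def tools_supporting(target, calls_by_tool, tolerance=0):
    """Return the sorted names of tools with at least one matching call."""
    return sorted(
        tool
        for tool, calls in calls_by_tool.items()
        if any(same_circrna(target, call, tolerance) for call in calls)
    )


def tool_fraction(target, calls_by_tool, tolerance=0):
    """Fraction of loaded tools that support the target circRNA."""
    if not calls_by_tool:
        return 0.0
    supporting = tools_supporting(target, calls_by_tool, tolerance)
    return len(supporting) / len(calls_by_tool)
```

With exact matching (`tolerance=0`), two tools that report the same junction a few bases apart count as separate circRNAs; raising the tolerance towards 10 merges them, which increases the tool count used to judge confidence. The 67% figure quoted for Figure 5D corresponds to `tool_fraction` returning 4/6.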
Detailed information for circRNAs, especially the number of tools and the number of samples, is displayed so that users can sort by it. For example, users can select circRNAs identified with high confidence by multiple tools by simply clicking the ‘tool num’ column. Users can also sort by ‘sample num’ to identify recurrent circRNAs for functional interpretation. This table can be saved for further analysis.
Figure 6 Detailed information displayed in table. Number of tools and number of samples are displayed for users to sort. This table can be saved for further analysis.
Open source and highly efficient
CircView is available at http://gb.whu.edu.cn/CircView/ or https://github.com/GeneFeng/CircView, with detailed documentation online, as well as in the Supplementary Material. The code is open source and licensed under the GNU General Public License. CircView creates a novel index to map circRNAs to genes, which improves the performance of mapping. Furthermore, CircView provides an optional local MySQL database to enable fast mapping of large numbers of MRE and RBP binding sites to circRNAs. Taken together, CircView is highly efficient: it takes <3 min to display 30 samples from six tools on a machine with a 2.4 GHz CPU and 8 GB of memory.
Conclusions
CircRNAs are emerging as rising stars in the RNA world. Studies focused on circRNAs have increased sharply in recent years, and many bioinformatics tools have been developed to accelerate related research. We developed a user-friendly tool, CircView, with three layers: an interaction layer, a model layer and a data persistence layer. With this tool, users can easily view all circRNAs for each gene, as well as individual circRNAs with detailed information.
Users can rank circRNAs by the number of samples, the number of detecting tools or other criteria for their downstream functional experiments. Furthermore, CircView allows users to view regulatory elements, including MRE and RBP binding sites. Taken together, CircView is a unique tool to visualize and explore circRNAs, which will help users to better understand the potential functions of circRNAs.
Key Points
Historical perspective on research and resources for circRNAs.
First visualization and exploration tool for circRNAs.
Comprehensive comparison across samples or tools.
Display of the regulatory elements that are sponged, such as MREs and RBPs.
Acknowledgements
The authors thank the University of Texas Health Science Center at Houston, Wuhan University, and China Scholarship Council for financial support to this research.
Funding
Cancer Prevention and Research Institute of Texas (grant number RR150085 to L.H.); Natural Science Foundation of Hubei Province, China (grant number 2015CFB170 to C.H.); and China Scholarship Council (grant number 201606275095 to J.F.).
Jing Feng is an associate professor at International School of Software, Wuhan University.
Yu Xiang is a postdoc fellow at Department of Biochemistry and Molecular Biology, The University of Texas Health Science Center at Houston McGovern Medical School.
Siyu Xia is a master student at School of Basic Medical Sciences, Wuhan University.
Huan Liu is an associate professor at Wuhan Institute of Virology, Chinese Academy of Sciences.
Jun Wang is a master student at School of Basic Medical Sciences, Wuhan University.
Fatma Muge Ozguc is a research assistant at Department of Biochemistry and Molecular Biology, The University of Texas Health Science Center at Houston McGovern Medical School.
Lijun Lei is a master student at School of Basic Medical Sciences, Wuhan University.
Ruoshan Kong is an associate professor at International School of Software, Wuhan University.
Lixia Diao is a Principal Statistical Analyst at Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center.
Chunjiang He is an associate professor at School of Basic Medical Sciences and Hubei Province Key Laboratory of Allergy and Immunology, Wuhan University. He is an expert in the identification and functional characterization of noncoding RNAs.
Leng Han is an assistant professor and CPRIT scholar at Department of Biochemistry and Molecular Biology, The University of Texas Health Science Center at Houston McGovern Medical School. He is an expert in high-throughput data mining.
References
1 Chen LL. The biogenesis and emerging roles of circular RNAs. Nat Rev Mol Cell Biol 2016;17:205–11.
2 Westholm JO, Miura P, Olson S, et al. Genome-wide analysis of drosophila circular RNAs reveals their structural and sequence properties and age-dependent neural accumulation. Cell Rep 2014;9:1966–80.
3 Memczak S, Jens M, Elefsinioti A, et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 2013;495:333–8.
4 Wang PL, Bao Y, Yee M-C, et al. Circular RNA is expressed across the eukaryotic tree of life. PLoS One 2014;9:e90859.
5 Jeck WR, Sorrentino JA, Wang K, et al. Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA 2013;19:141–57.
6 Guo JU, Agarwal V, Guo H, et al. Expanded identification and characterization of mammalian circular RNAs. Genome Biol 2014;15:409.
7 You X, Vlatkovic I, Babic A, et al. Neural circular RNAs are derived from synaptic genes and regulated by development and plasticity. Nat Neurosci 2015;18:603–10.
8 Li Z, Huang C, Bao C, et al. Exon-intron circular RNAs regulate transcription in the nucleus. Nat Struct Mol Biol 2015;22:256–64.
9 Hansen TB, Jensen TI, Clausen BH, et al. Natural RNA circles function as efficient microRNA sponges. Nature 2013;495:384–8.
10 Veno MT, Hansen TB, Veno ST, et al. Spatio-temporal regulation of circular RNA expression during porcine embryonic brain development. Genome Biol 2015;16:245.
11 Xia S, Feng J, Lei L, et al. Comprehensive characterization of tissue-specific circular RNAs in the human and mouse genomes. Brief Bioinform 2016. doi: 10.1093/bib/bbw081.
12 Robinson JT, Thorvaldsdottir H, Winckler W, et al. Integrative genomics viewer. Nat Biotechnol 2011;29:24–6.
13 Rost HL, Sachsenberg T, Aiche S, et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods 2016;13:741–8.
14 Durand NC, Robinson JT, Shamim MS, et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 2016;3:99–101.
15 Phanstiel DH, Boyle AP, Araya CL, et al. Sushi.R: flexible, quantitative and integrative genomic visualizations for publication-quality multi-panel figures. Bioinformatics 2014;30:2808–10.
16 Salzman J, Gawad C, Wang PL, et al. Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One 2012;7:e30733.
17 Wang K, Singh D, Zeng Z, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 2010;38:e178.
18 Hoffmann S, Otto C, Doose G, et al. A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection. Genome Biol 2014;15:R34.
19 Zhang XO, Wang HB, Zhang Y, et al. Complementary sequence-mediated exon circularization. Cell 2014;159:134–47.
20 Gao Y, Wang J, Zhao F. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biol 2015;16:4.
21 Szabo L, Morey R, Palpant NJ, et al. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol 2015;16:126.
22 Chuang TJ, Wu CS, Chen CY, et al. NCLscan: accurate identification of non-co-linear transcripts (fusion, trans-splicing and circular RNA) with a good balance between sensitivity and precision. Nucleic Acids Res 2016;44:e29.
23 Cheng J, Metge F, Dieterich C. Specific identification and quantification of circular RNAs from sequencing data. Bioinformatics 2016;32:1094–6.
24 Song X, Zhang N, Han P, et al. Circular RNA profile in gliomas revealed by identification tool UROBORUS. Nucleic Acids Res 2016;44:e87.
25 Hansen TB, Veno MT, Damgaard CK, et al. Comparison of circular RNA prediction tools. Nucleic Acids Res 2016;44:e58.
26 Zhang XO, Dong R, Zhang Y, et al. Diverse alternative back-splicing and alternative splicing landscape of circular RNAs. Genome Res 2016;26:1277–87.
27 Gao Y, Zhang J, Zhao F. Circular RNA identification based on multiple seed matching. Brief Bioinform 2017. doi: 10.1093/bib/bbx014.
28 Ghosal S, Das S, Sen R, et al. Circ2Traits: a comprehensive database for circular RNA potentially associated with disease and traits. Front Genet 2013;4:283.
29 Glazar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA 2014;20:1666–70.
30 Liu YC, Li JR, Sun CH, et al. CircNet: a database of circular RNAs derived from transcriptome sequencing data. Nucleic Acids Res 2016;44:D209–15.
31 Chen X, Han P, Zhou T, et al. circRNADb: a comprehensive database for human circular RNAs with protein-coding annotations. Sci Rep 2016;6:34985.
32 Li JH, Liu S, Zhou H, et al. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res 2014;42:D92–7.
33 Dudekula DB, Panda AC, Grammatikakis I, et al. CircInteractome: a web tool for exploring circular RNAs and their interacting proteins and microRNAs. RNA Biol 2016;13:34–42.
34 Kandoth C, McLellan MD, Vandin F, et al. Mutational landscape and significance across 12 major cancer types. Nature 2013;502:333–9.
35 Melton C, Reuter JA, Spacek DV, et al. Recurrent somatic mutations in regulatory regions of human cancer genomes. Nat Genet 2015;47:710–6.
36 Zhang Y, Zhang XO, Chen T, et al. Circular intronic long noncoding RNAs. Mol Cell 2013;51:792–806.
37 Conn SJ, Pillman KA, Toubia J, et al. The RNA binding protein quaking regulates formation of circRNAs. Cell 2015;160:1125–34.
38 Ashwal-Fluss R, Meyer M, Pamudurti NR, et al. circRNA biogenesis competes with pre-mRNA splicing. Mol Cell 2014;56:55–66.
Author notes
These authors contributed equally to this work.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Precision medicine needs pioneering clinical bioinformaticians
Gómez-López, Gonzalo; Dopazo, Joaquín; Cigudosa, Juan C; Valencia, Alfonso; Al-Shahrour, Fátima
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx144; pmid: 29077790
Abstract
Success in precision medicine depends on accessing high-quality genetic and molecular data from large, well-annotated patient cohorts that couple biological samples to comprehensive clinical data, which in conjunction can lead to effective therapies. From such a scenario emerges the need for a new professional profile, an expert bioinformatician with training in clinical areas who can make sense of multi-omics data to improve therapeutic interventions in patients, and the design of optimized basket trials. In this review, we first describe the main policies and international initiatives that focus on precision medicine. Secondly, we review the currently ongoing clinical trials in precision medicine, introducing the concept of ‘precision bioinformatics’, and we describe current pioneering bioinformatics efforts aimed at implementing tools and computational infrastructures for precision medicine in health institutions around the world. Thirdly, we discuss the challenges related to the clinical training of bioinformaticians, and the urgent need for computational specialists capable of assimilating medical terminologies and protocols to address real clinical questions. We also propose some skills required to carry out common tasks in clinical bioinformatics and some tips for emergent groups. Finally, we explore the future perspectives and the challenges faced by precision medicine bioinformatics.
Keywords: precision medicine, computing infrastructures, clinical bioinformatics, training, clinical bioinformatician, genomic report
Precision medicine in the real world: the dress rehearsals
The paradigm of precision medicine is defined by combining the use of population-based molecular profiling, clinical data, epidemiological information and other types of data to make clinical decisions that are tailored to individual patients [1].
The potential advantages of this approach, both for patients and doctors, include more accurate diagnosis and treatments, safer drug prescription, better disease prevention and, consequently, a reduction in healthcare costs. The integration of genomics into routine clinical practice requires systems and workforces that are equipped and prepared to handle the scale and complexity of genomic data. As such, bioinformatics plays an essential role in providing the elements required for the processing, visualization and interpretation of a patient’s multi-omics profiles, and for the integration of these profiles with clinical data to gain a mechanistic understanding of their disease, thereby facilitating more personalized treatment [2]. Common bioinformatics tasks in a precision medicine scenario include implementing and executing well-established and reproducible workflows to process a patient’s omics data; applying computational methods to detect altered genes (mutated, amplified/deleted, altered expression, etc.); interpreting the biological and clinical impact of such alterations; establishing therapeutic guidance based on the patient’s genomic profile; and mining health record data to achieve knowledge-driven clinical assessment (Figure 1). In theory, all these tasks enable genome-based reports to be generated that can eventually stratify patients to facilitate clinical decision-making. The aforementioned bioinformatics tasks must be supported by robust and stable technological platforms, and by computational infrastructures for data storage, data privacy and protection, which should also include protocols for server maintenance and pipeline management [3].

Figure 1. Precision medicine workflow: from data to patient care. Precision medicine requires computational infrastructures to efficiently store and process data on patient genotypes and phenotypes. The biological and clinical interpretation of such data is converted into an integral multi-omics report that will support clinical decision-making. ML, Machine Learning; CC, Cognitive Computing; EHRs, Electronic Health Records.

The focus on human health and disease brings new challenges and requirements to bioinformaticians, particularly given the volume, complexity, heterogeneity and nature of the data. Computational systems biomedicine is an emerging discipline [4] that aims to provide the computing methodologies, communication technologies and tools to tackle the problems derived from the complex nature of many human health issues and diseases. Although examples can be found in basic research areas, the application of systems medicine to the clinic is still relatively limited. Importantly, bioinformaticians are set to become new stakeholders in the healthcare sector who will collaborate closely with physicians in clinical decision-making, thereby becoming clinical bioinformaticians.

Translating cancer genomes into the clinic

Although the model of precision medicine may apply to many diseases, cancer is the disease for which it is clearly most advanced at this time. Precision oncology has incorporated the study of human cancers by genome sequencing and/or other genome-based technologies, proving that many tumours harbour hundreds of gene mutations and/or copy number changes [5].
In this scenario, the genetic heterogeneity arising from the bioinformatics analysis of tumour genome profiles indicates that the majority of cancers are not single diseases but rather, they are an array of disorders with distinct molecular mechanisms and where there is clinical variation between individuals [6–9]. Such genetic variation in tumours includes essential information to guide the diagnosis and treatment of cancer patients [10]. However, most clinical trials currently evaluate the efficacy and safety of a new drug by analysing its effects on largely unselected populations of patients, ignoring their genomic profile, a characteristic that might indeed help to identify the patients that are most likely to respond to the treatment [11]. Consequently, precision oncology has been incorporated into clinical trials. Moreover, molecular profiling of cancer has bolstered the concept of ‘basket trials’, a new and evolving type of clinical trial designed around the hypothesis that the presence of a molecular marker predicts the response to a targeted therapy independently of the tumour type [12]. Large-scale cancer genome consortia have emerged for nearly all major cancer types to comprehensively characterize the genomes of thousands of cases, including The Cancer Genome Atlas (TCGA; https://cancergenome.nih.gov/) and The International Cancer Genome Consortium (ICGC; http://icgc.org/) [13]. Unfortunately, these large-scale genomic projects were launched without complete and standardized clinical information of the donors, such as the treatments received or the patient’s medical history, making it difficult to identify new preventative and predictive genomic biomarkers that would aid the design of biomarker-driven clinical trials. 
Therefore, an important lesson drawn from the completion of these projects is that data sharing is essential to link genomic data with high-quality, standardized clinical information when attempting to identify genotype–phenotype associations [14, 15]. According to the 2016 Precision Medicine Essential Brief, >60% of respondents indicated that the most significant challenges in precision medicine are gaining access to clinical data, the integration of clinical data systems and the integration of clinical and genomic data [16]. Indeed, the best way to achieve clinical and genomic data sharing is currently a subject of intense debate [17–19].

Precision medicine national initiatives

In recent years, many countries have implemented precision medicine initiatives (PMIs) at a national level, including the USA, China, Australia, Qatar and South Korea and, in Europe, England, France, Finland, Denmark, the Netherlands and Germany (Table 1). The US PMI, known as ‘All of Us’, aims to initiate a paradigm shift for modern medicine by increasing population-based genome sequencing, and linking it with clinical data to understand and determine how best to prevent or treat disease. Notably, ‘All of Us’ will encourage open data sharing, allowing access to patients and researchers [20]. Another relevant personalized medicine approach is the ‘100 000 Genomes Project’ in the UK, which is generating significant amounts of genomic data from rare diseases and cancer to inform clinical decision-making. Here, it is important to highlight how governing bodies, including research and health policy makers, are working together to support these initiatives, establishing international consortiums that provide policy recommendations, as well as research and training activities designed to exploit the potential of personalized medicine to the full.
Some examples include the International Consortium for Personalized Medicine (ICPerMed; http://www.icpermed.eu/) and the European Alliance for Personalized Medicine (EAPM; http://euapm.eu).

Table 1. National and international PMIs and consortiums

Initiative | Description | Funding/Partners | URL
PMI’s cohort program, the ‘All of Us’ Research Program | PMI was launched in 2015 to make advances in tailoring medical care to the individual. The program will collect genetic and health data from one million people. | NIH | https://allofus.nih.gov/
MyCode Community Health Initiative | As part of PMI, Geisinger’s MyCode Community Health Initiative represents the largest study in the United States with EHRs linked to large-scale DNA sequencing data. Nearly 150 000 patient participants have already been signed up. | NIH/Geisinger Health System hospitals | https://www.geisinger.edu/research/departments-and-centers/genomic-medicine-institute/mycode-health-initiative
ICGCmed | ICGCmed will link the wealth of genomic data already amassed across the cancer spectrum, with new genomic data being generated, and with clinical and health information that includes lifestyle, patient history, cancer diagnostic data and response to and survival following therapy. | Funding agencies in Asia, Australia, Europe, North America and South America, supporting 88 projects in 17 jurisdictions (16 countries and the European Union), to study over 25 000 tumour genomes in 26 different tumour types. | https://icgcmed.org/
GenomeAsia100k | A non-profit consortium collaborating to sequence and analyse 100 000 Asian individuals. | Macrogen (South Korea) and MedGenome | http://www.genomeasia100k.com/
Project GENIE | GENIE is a multi-phase, multi-year, international data-sharing project that catalyses precision oncology through the development of a regulatory-grade registry that aggregates and links clinical-grade cancer genomic data with clinical outcomes from tens of thousands of cancer patients treated at multiple international institutions. | AACR/Dana-Farber Cancer Institute (USA), Gustave Roussy Cancer Campus (France), NKI (The Netherlands), Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins (USA), Memorial Sloan Kettering Cancer Center (USA), Princess Margaret Cancer Centre (Canada), University of Texas MD Anderson Cancer Center (USA), Vanderbilt-Ingram Cancer Center (USA) | http://www.aacr.org/Research/Research/Pages/aacr-project-genie.aspx
Worldwide Innovative Networking (WIN) Consortium in personalized cancer medicine | WIN was created to accelerate the pace and reduce the cost of translating novel cancer treatments to the bedside through worldwide clinical trials and research projects. | Global initiative, headquartered in Paris, includes 35 institutional members. | http://winconsortium.org/
EAPM | The EAPM initiative was created to bring together European healthcare experts and patient advocates involved with major chronic diseases. | EU | http://euapm.eu/
100 000 Genomes Project | This project will sequence 100 000 genomes from around 70 000 patients, combining genomic sequence data with medical records. Participants are NHS patients with a rare disease, plus their families, and patients with cancer. | NHS Genomics England, UK | https://www.genomicsengland.co.uk/
The ICPerMed | ICPerMed brings together over 30 European and international partners representing ministries, funding agencies and the European Commission. ICPerMed provides a platform to initiate and support communication and exchange on personalized medicine research, funding and implementation. | European Commission/EU countries and others | http://www.icpermed.eu/index.php
France Médecine Génomique 2025 | Among the objectives of France Médecine Génomique is to perform approximately 10 000 WGS corresponding to 20 000 patients with rare diseases and their families, and 50 000 patients with metastatic or refractory cancers. | French Government | http://presse.inserm.fr/wp-content/uploads/2016/06/Plan-France-me%CC%81decine-ge%CC%81nomique-2025.pdf
Estonian Genome Project (EGP) | The EGP is a large population-based databank that was established with health records and biological samples from a large portion of the population. Its aim is for these data to be used in biomedical and genetic research to improve future public healthcare in Estonia. | Estonian Government | http://www.geenivaramu.ee/et
Scottish Genomes Partnership (SGP) | The SGP will initially focus on the rapid screening of > 3000 cancer patients in Scotland, diagnosing childhood illnesses, rare genetic diseases and disorders of the central nervous system, also using the data in population studies. | Scottish Government | http://www.scottishgenomespartnership.org/
Danish National Strategy for Personalized Medicine (2017–2020) | The goal is to pave the way for the use of Personalized Medicine in the Danish healthcare system. The first phase was initiated at the beginning of 2017. It will focus on establishing joint governance and a national genome centre. The second phase will be focused on consolidation, research and development. | Danish Government | http://healthcaredenmark.dk/news/new-national-strategy-for-personalized-medicine.aspx
Qatar Genome Programme (QGP) | QGP is in its pilot phase which officially started in September 2015. The initiative aims to map the genome of the local population to apply Precision Medicine in Qatar. | Qatar Government | http://www.qatargenome.org.qa/
Genome of the Netherlands Consortium (GoNL) | GoNL is interested in genetic variation in the Dutch population. To date, the consortium has sequenced 750 whole genomes from Dutch people and 250 trios of two parents and an adult child. | Funded by Netherlands Organization for Scientific Research | http://www.nlgenome.nl/
Australian National Genomic Healthcare Initiative | Australian Genomics is a multi-disciplinary, multi-organization collaboration to integrate genomic medicine into Australian healthcare. | Australia | https://www.australiangenomics.org.au/

Both national and international PMIs can be found at different stages of development. Some interesting efforts are currently being made to evaluate the social and economic benefit of personalized medicine in the healthcare sector. However, for all of these initiatives, there are significant challenges that need to be addressed to aid the broader implementation of precision medicine in the hospital setting: organizational, ethical and regulatory challenges; shortfalls in the generation of evidence; challenges in data sharing and infrastructure; the slow uptake of genomic information into clinical care and research; and the economics of precision medicine [21]. Legislation to protect genomic data [22], the development of decision-making tools, preparing work groups and achieving greater patient/clinician engagement and trust [23] are all issues that must be addressed before implementing solutions to these challenges. Moreover, multidisciplinary efforts will be necessary to overcome the future challenges to the implementation of precision medicine, where bioinformatics will certainly play a key role.

International consortiums for precision medicine: unity is strength

Data mining and the potential for data sharing are key aspects of a number of recent high-profile genomics/personalized medicine initiatives with significant clinical potential (Table 1).
For instance, the International Cancer Genome Consortium for Medicine (ICGCmed; http://icgcmed.org/) aims to address the shortcomings in standardization and in quality control of the clinical information available for patients enrolled by ICGC or TCGA. ICGCmed will collect a much richer genomics data set linked to clinical and health-related information, including response to therapies, making it more valuable for personalized medicine. Similarly, the Genomics Evidence Neoplasia Information Exchange (GENIE) project [24], supported by the American Association for Cancer Research, recently released almost 19 000 de-identified genomic records along with some clinical data collected from cancer patients as part of routine patient care. Although progress has been made, the successful execution of these projects will rely on effective global strategies for sharing disease-related clinical data and on these data complying with standards, which in general are not yet fully defined. A large number of international initiatives have emerged to develop guidelines for the responsible collection, curation, sharing and use of patients’ clinical and genomic data, and to create a harmonious approach to data sharing between the existing databases. The Global Alliance for Genomics and Health (GA4GH; https://genomicsandhealth.org/) [25] is dedicated to creating interoperable technical standards to manage and share genomic and clinical data. Interestingly, the GA4GH website provides a full catalogue of worldwide genomic data initiatives, data-sharing efforts, databases and repositories, international genomics research consortia and projects and other genomics data resources [26]. The GA4GH roadmap includes the catalysis of data-sharing projects, and resolving particular needs in PMIs by applying working products and demonstration projects.
Some successful examples include the BRCA challenge project (http://brcaexchange.org/), which provides a platform for clinical information on BRCA mutations collected from patients, together with their phenotypic characteristics, and Matchmaker Exchange (http://www.matchmakerexchange.org/), a federated network of databases whose goal is to find genetic causes of rare diseases by matching similar phenotypic and genotypic profiles [27]. Owing to its medical relevance, the ClinVar initiative hosted by the NIH is also becoming accepted as a common shared repository that links phenotypes and genomic variation with supporting evidence [28].

Precision medicine requires advanced technologies and processes to collect, manage and analyse data, and to provide rapid and precise decision support within a clinical and public health context. Large computing resources already exist, yet they need to be fully public and operate according to global standards. ELIXIR is Europe’s leading life science infrastructure responsible for managing access to the massive amounts of data generated every day by publicly funded research (https://www.elixir-europe.org/). ELIXIR is building the technical infrastructure required by researchers to discover, combine and exchange human data via controlled access and in the context of the European Open Science Cloud initiative, while complying with data privacy and data security requirements. The backbone of the Human Data Use Case is the European Genome-phenome Archive (https://ega.crg.eu/) and the associated facilities, which include the GA4GH Beacon Project (https://beacon-network.org/). These initiatives aim to develop an open sharing platform, in the form of a simple public web service, that helps genomic data centres make their data more ‘discoverable’ without revealing any sensitive information.
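The idea of making data ‘discoverable’ without exposing it can be illustrated with a minimal sketch. The code below is not the GA4GH Beacon API, which is a richer web service; it is a toy in-memory model that assumes a beacon answers only "has this exact allele been observed here?" with a yes or no. The names `Variant` and `ToyBeacon`, and the coordinates used in the usage lines, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Variant:
    """One observed allele; the fields are an illustrative assumption."""
    chromosome: str
    position: int
    reference: str
    alternate: str


class ToyBeacon:
    """Hypothetical in-memory stand-in for a Beacon-style service."""

    def __init__(self, variants: Iterable[Variant]) -> None:
        # The variant set is kept private: callers never see records,
        # sample identifiers or counts, only the yes/no answer below.
        self._index = frozenset(variants)

    def query(self, chromosome: str, position: int,
              reference: str, alternate: str) -> dict:
        """Report only whether the exact allele is present in the data set."""
        wanted = Variant(chromosome, position, reference, alternate)
        return {"exists": wanted in self._index}


# Usage with made-up coordinates (not real clinical data).
beacon = ToyBeacon([Variant("17", 1000, "G", "A")])
print(beacon.query("17", 1000, "G", "A"))
print(beacon.query("17", 1000, "G", "T"))
```

Restricting the response to a single boolean is the design choice that matters here: a researcher learns which data centre is worth contacting for controlled access, while no individual-level information leaves the centre.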
As part of the global trend towards data sharing and open access to data repositories, the National Cancer Institute’s Genomic Data Commons constitutes a valuable research resource. This initiative provides significant quantities of data from NCI-sponsored research, with over four petabytes of data initially being released to the global cancer community to allow comprehensive mining of this information [29].

Testing the paradigm: precision clinical trials

The introduction of precision medicine at academic medical centres and in multi-hospital healthcare systems is ongoing. A recent survey conducted by the Healthcare Information and Management Systems Society (http://www.himssanalytics.org/) focused on larger hospitals and healthcare systems in the USA, and only 29% of the 137 respondents declared that they were engaged in some kind of precision medicine research. It is expected that current policies and initiatives driven by the ‘All of Us’ research programme in the USA and by the ICPerMed consortium in the EU will facilitate and expand the implementation of precision medicine in Western countries. Until they do, some cutting-edge projects are already setting the stage for the application of precision medicine via clinical trials. The Molecular Analysis for Therapy Choice (NCI-MATCH) and the Molecular Profiling-Based Assignment of Cancer Therapy (NCI-MPACT) trials sponsored by the NCI in the USA are clear examples of precision medicine clinical trials [30–32] (Table 2). In Europe, the PRECISION Panc platform [33] and TRACERx study [34] are ground-breaking efforts to apply precision medicine to clinical trials in the UK. Meanwhile, the Institut Curie is funding the SHIVA trial to evaluate the efficacy of precision medicine approaches, based on the idea that it is as yet unclear whether testing genetic alterations in cancer patients and assigning treatment targeting such alterations is more effective than standard non-targeted therapies.
The initial results of SHIVA have shown that molecularly targeted drugs do not improve progression-free survival outside their indications when compared with the treatment of the physician’s choice in heavily pre-treated cancer patients [35]. The European Research Council (ERC) has also promoted the Rational molecular Assessments and Innovative Drug Selection (RAIDS) trial, a pioneering effort focused on the development of targeted therapies for cervical cancer [36], as well as the AVATAR trial, where a biopsy of a metastatic lesion from patients will be used to perform a complete exome analysis using next-generation sequencing (NGS). In this trial, a mouse model (Avatar) derived from Patient-Derived Xenografts [37] will be generated for each patient, and candidate therapeutic targets can be experimentally tested in the patient’s Avatar to select the most effective regimen to ultimately be applied to the patient [38, 39].

Table 2 Precision medicine clinical trials

| Name | Description | Disease | Location | URL |
| --- | --- | --- | --- | --- |
| NCI-MATCH | Phase II basket trial designed to study the efficacy of targeted drug treatments tailored directly by genetic testing in advanced solid tumours and lymphomas. The trial will enrol at least 1000 patients in treatment arms based on 19 specific genetic changes, independent of tumour origin. | Solid tumours and lymphomas | Multicentric, USA | https://clinicaltrials.gov/ct2/show/NCT02465060 (ID: NCT02465060) |
| NCI-MPACT | Pilot phase II trial designed to assess if targeted treatments based on precision medicine rationale are more effective than standard non-targeted therapies. Molecular profiling-based targeted therapies are prescribed to treat patients with advanced metastatic solid tumours that are usually incurable or not controlled by standard treatments. NCI-MPACT randomly assigns patients with a mutation in a specific genetic pathway to either a targeted therapy for that pathway or a treatment not known to be pathway specific. | Solid tumours | Multicentric, USA | https://clinicaltrials.gov/ct2/show/NCT01827384 (ID: NCT01827384) |
| PRECISIONPanc | This platform aims to identify more suitable treatment options for pancreatic cancer patients, based on the molecular cancer subtypes defined by their promoters and within the framework of the ICGC consortium. PRECISIONPanc is not yet enrolling patients but it is expected that three clinical trials will recruit around 650 pancreatic cancer patients. | Pancreatic cancer | Multicentric. Sponsor: University of Glasgow, UK | http://www.precisionpanc.org/ |
| TRACERx | TRACERx is analysing the intratumour heterogeneity in approximately 850 stage I–IIIA lung cancer patients and tracking the evolutionary trajectory from diagnosis through to relapse. | Non-small-cell lung cancer | Multicentric. Sponsor: Cancer Research UK | http://www.cruklungcentre.org/Research/TRACERx |
| SHIVA | Randomized phase II trial to assess whether off-label use of commercial drugs for matched molecular alterations confers a clinical benefit to French patients with refractory cancer. | Recurrent metastatic solid tumours | Multicentric. Sponsor: Institut Curie, France | https://clinicaltrials.gov/show/NCT01771458 (ID: NCT01771458) |
| RAIDS | The RAIDS network collected a prospective data set (BioRAIDs: NCT02428842) of consecutive tumour tissues, whole blood and sera from 419 cervical cancer patients in 2013. At 54 months, whole exome sequencing (WES) became available for the first 98 patients and for 20 CC cell lines, as well as Reverse Phase Protein Array data for 154 patients with a common core set of 91 patients. Targeted sequencing is available for an additional 100 patients. These data allowed the patients to be stratified into different subgroups according to their molecular profile. The correlation with clinical outcomes is currently ongoing. | Cervical cancer | Multicentric. Coordinator: Institut Curie, France | http://www.raids-fp7.eu/ and http://cordis.europa.eu/project/rcn/106274_en.html (ERC ID: 304810) |
| AVATAR | Open label, randomized phase III study of patients receiving standard of care for resistant metastatic pancreatic cancer to test the hypothesis that an integrated personalized approach to treatment improves survival when compared with conventional treatment. | Pancreatic cancer | Multicentric. Madrid, Spain | http://cordis.europa.eu/project/rcn/198713_en.html (ERC ID: 670582) |

Precision bioinformatics for precision medicine

Precision medicine trials will depend on bioinformatics. Indeed, the success of all the aforementioned pioneering studies will largely depend on the strength and quality of the data linking patients, molecular targets and targeted therapies. ‘Precision bioinformatics’ is defined as a branch of translational bioinformatics that specializes in the development of computational infrastructures, methodologies and tools required for clinical studies that adopt a precision medicine paradigm.
Therefore, precision bioinformatics is heavily oriented towards: (i) implementing platforms for big data processing and integration (including electronic health records, EHRs); (ii) establishing protocols and standards for data exchange that preserve patients’ privacy and ensure data security; and (iii) developing computational strategies to exploit, visualize, interpret and summarize such data, ultimately facilitating clinical decision-making [40].

New computational platforms for clinical applications

High-throughput (HTP) technologies and NGS have fuelled the ‘Big Data revolution’ in biomedicine [41]. For example, the Illumina X-Ten System has the capacity to produce around 2 peta-bps per year. This situation now requires precision bioinformatics to implement platforms and algorithms for big data management and integration, ensuring that patient data can be used effectively and that clinically valuable information can be extracted. Diverse initiatives in precision bioinformatics along such lines are described in Table 3. For instance, the need to manage the activities associated with the NCI-MPACT trial motivated the implementation of the GeneMed platform and its public version called OpenGeneMed [42, 43]. The SHIVA and RAIDS trials have implemented the Knowledge and Data Integration (KDI) platform to facilitate data integration, track sample processing and deliver a genomic report that supports therapeutic decision-making [44]. For its part, the AVATAR trial integrates the RUbioSeq platform to automate sequencing data processing [45] and the PanDrugs method to guide anti-cancer therapy selection (http://www.pandrugs.org/).
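To make concrete the kind of rule that trial-management platforms have to encode, the sketch below mimics the NCI-MPACT design summarized in Table 2: a patient with a mutation in a tracked pathway is randomized between a therapy targeting that pathway and a treatment not known to be pathway specific. It is a hedged illustration only, not GeneMed's implementation. The pathway labels, the `Patient` type, the `assign_arm` function, the 1:1 allocation ratio and the rule of taking the first eligible pathway are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical pathway labels; the real trial's pathways and drugs
# are deliberately not reproduced here.
TRACKED_PATHWAYS = frozenset({"pathway_A", "pathway_B"})


@dataclass(frozen=True)
class Patient:
    patient_id: str
    mutated_pathways: frozenset


def assign_arm(patient: Patient, rng: random.Random) -> str:
    """Return an illustrative trial arm label for one patient."""
    eligible = sorted(patient.mutated_pathways & TRACKED_PATHWAYS)
    if not eligible:
        # No mutation in a tracked pathway: outside this toy protocol.
        return "not_eligible"
    # Assumption: use the first eligible pathway in alphabetical order.
    pathway = eligible[0]
    # Assumption: equal allocation between the two arms.
    if rng.random() < 0.5:
        return f"targeted:{pathway}"
    return "non_pathway_specific"


# Usage: a seeded generator makes the allocation reproducible and auditable.
rng = random.Random(2017)
for patient in (
    Patient("P001", frozenset({"pathway_A"})),
    Patient("P002", frozenset({"pathway_C"})),
):
    print(patient.patient_id, assign_arm(patient, rng))
```

Passing the random generator in explicitly, rather than using global state, is what lets an allocation be replayed later, which is the sort of traceability such platforms are meant to provide.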
Table 3 Precision bioinformatics infrastructure initiatives

| Initiative | Description | Promoter | Location | URL |
| --- | --- | --- | --- | --- |
| GeneMed | GeneMed is an informatics system designed to favour collaboration between a sequencing lab, the treatment selection team and clinical personnel, to reduce errors made by transferring and sharing data between groups, and to aid clear documentation in the NCI-MPACT clinical trial pipeline. | NCI-MPACT | USA | - |
| OpenGeneMed | Public version of the GeneMed system. | NCI-MPACT | USA | https://brb.nci.nih.gov/OpenGeneMed/ |
| DiscovEHR | The DiscovEHR browser facilitates access to variant frequency data from >50 000 MyCode participants. It facilitates allele frequency comparisons with other population-based and biobank resources. | Regeneron Genetics Center and Geisinger Health System MyCode | USA | http://www.discovehrshare.com/ |
| KDI | Informatics platform implemented to ensure information sharing, cross-software interoperability, automatic data extraction and secure data transfer in the context of SHIVA, RAIDS and other studies. KDI is currently used to manage all the high-throughput data at the Institut Curie. | Institut Curie | France | - |
| PrecisionFDA | A cloud-based data sharing system to evaluate NGS assays and for regulatory science exploration. | FDA | USA | https://precision.fda.gov/ |
| RD-Connect | An infrastructure project that brings together databases, registries, biobanks and clinical bioinformatics data used for rare disease research in a central resource for researchers worldwide. | EU | Europe | http://rd-connect.eu/ |
| SOUND | An international consortium established to create bioinformatics tools for statistically informed use of personal genomic and other -omics data in a medical context. | EU | Europe | http://www.sound-biomed.eu/ |
| tranSMART | An open-source cloud system implemented to facilitate -omics data exchange in clinical and translational research. | tranSMART foundation | USA-Europe | http://transmartfoundation.org/ |
| G-DOC Plus | Data integration and bioinformatics cloud platform to handle diverse biomedical big data, including gene expression arrays, NGS and medical images. | Georgetown University | USA | https://gdoc.georgetown.edu/gdoc/ |
| Oncobench | A platform developed to analyse tumour data in clinical practice. The current version is collecting up to 0.5 Gb of DNA data per patient. | Geneva University Hospitals, the Swiss Institute of Bioinformatics and others | Switzerland | - |
| Watson Oncology | A cognitive computing system designed to support clinical decision making and to interpret cancer patients’ clinical information, identifying individualized, evidence-based treatment options. | IBM and several US hospitals | USA | https://www.mskcc.org/about/innovative-collaborations/watson-oncology |
| Precision Medicine Knowledgebase Bot | PMKB Bot connects to several channels, including Microsoft Teams, Skype, Slack and WebChat. As a result, clinicians can access these data in many different ways and make clinical decisions faster. PMKB currently supports 163 genes and 518 variants with 404 clinical interpretations. | Microsoft and Englander Institute for Precision Medicine | USA | https://pmkb.weill.cornell.edu/ |
| Hortonworks Data Platform (HDP) | A platform to store and process huge amounts of liver cancer data, making that data and related tools accessible to researchers in five different teams. The HDP cluster at Arizona State University has accumulated more than a petabyte of genomic data from multiple studies involving over 500 individuals in each. | Hortonworks and Arizona State University | USA | https://es.hortonworks.com/customers/arizona-state-university/ |

Remarkably, a number of local health systems are implementing bioinformatics platforms to support precision medicine studies.
For instance, the Geisinger Health System launched DiscovEHR for the longitudinal integration of the EHRs corresponding to the MyCode initiative participants [46, 47]. Moreover, the Personalized Medicine plan was launched in Andalusia (Spain) to develop bioinformatics infrastructures that are interoperable with EHRs. Prototypes have already been tested using a common system for NGS-based diagnostics [48], together with the Medical Genome Project [49], an automated system for the discovery of disease genes [50]. These efforts underline the need for data integration platforms, together with new computational methodologies and bioinformatics tools to effectively use precision medicine in clinical trials and local healthcare institutions. Other significant projects have also focused on data integration platforms to support precision medicine, such as: the PrecisionFDA platform (https://precision.fda.gov/) for NGS data sharing and pipeline testing; the RD-Connect project (http://rd-connect.eu/) that provides a global and integrated platform in the context of research into rare diseases; the Statistical Multi-Omics Understanding Consortium (SOUND) that focuses on the development of statistical and bioinformatics tools for personal multi-omics data mining (e.g. rDGIdb package [51]); the tranSMART platform to facilitate -omics data exchange [52]; the G-DOC Plus system to combine sequencing data and medical images [53]; and Oncobench [54]. Finally, it is important to highlight that both the ‘All of Us’ initiative and the European H2020 programme are actively promoting an alignment among public and private sector precision medicine projects [55]. The private sector includes the major pharmaceutical companies, big data, hardware and software companies and, especially, small and medium enterprises.
Along these lines, Genomics England has launched the Genomics Expert Network for Enterprises consortium to identify effective and secure ways of bringing industry expertise into the 100 000 Genomes Project [56]. In this scenario, it is reasonable to expect a growing number of public–private partnerships to stimulate the application of commercial bioinformatics platforms in public hospitals and academic institutions. Clear examples include: the IBM Watson Oncology system that has already been introduced into some US cancer centres and hospitals [57]; Microsoft’s bot that has been implemented at the Englander Institute of Precision Medicine; the Hortonworks data platform (Table 3); and the WuXi NextCODE alliance, which has become associated with both the National Heart Centre in Singapore and the 100 000 Genomes Project (https://www.wuxinextcode.com/).

Exchanging patient information to improve healthcare while maintaining privacy

The practical value of developments in precision bioinformatics will depend on continued multidisciplinary discussion involving physicians, data scientists, biostatisticians and experts in clinical bioinformatics, but also on the adoption of EHRs by health institutes. In the light of this, the Health Information Technology for Economic and Clinical Health (HITECH) Act provided the US Department of Health and Human Services with $29.5 billion to encourage the adoption of EHRs and to establish meaningful use for interoperability [58]. Beyond the essential digitalization of patient phenotype-associated data, it is crucial to establish universal standards and controlled vocabularies to facilitate interoperability and data sharing, and to ensure clinical data are collected and interpreted using standardized protocols for privacy and consent. SNOMED CT [59] and the Human Phenotype Ontology [60] represent well-established biomedical ontologies used by the 100 000 Genomes Project.
Clear instances of standardization initiatives for clinical informatics protocols include the Clinical Data Interchange Standards Consortium (CDISC; http://www.cdisc.org/standards-and-implementations), Health IT Standards (http://healthcare.nist.gov/), ISO/TC215 (https://www.iso.org/committee/54960.html), Health Information Exchange (https://www.healthit.gov/HIE) and the Health Insurance Portability and Accountability Act (http://www.hhs.gov/ocr/privacy/). Other initiatives focused on general standards for EHRs include Open EHR (http://www.openehr.org/), CDISC (https://www.cdisc.org/standards) and HL7 (http://www.hl7.org/). Interoperability and integration are crucial yet challenging, particularly given the large heterogeneity of the data collected in EHRs and its confidential nature. In this sense, SemanticHEALTH was a pioneering text mining project focused on gathering fragmented semantic interoperability initiatives from EHRs in the EU [61]. Following SemanticHEALTH, other efforts have recently been initiated, and SemanticHealthNet (http://www.semantichealthnet.eu/) was launched to develop a scalable and sustainable pan-European semantic interoperable protocol for clinical and biomedical knowledge, and to help ensure that EHR systems are optimized for patient care, public health and clinical research in different healthcare systems and institutions [62]. Other collaborative initiatives for the systematic integration of EHRs include: P-medicine (http://www.p-medicine.eu/), focused on developing secure tools, robust data sharing and integration systems, IT infrastructure and virtual physiological human models to support precision medicine [63, 64]; and CER Hub, a web-based platform to combine comprehensive electronic clinical data from multiple healthcare organizations [65]. 
In addition to the heterogeneity of the data and the poor standardization, there are other technical and non-technical bottlenecks that must be overcome to establish the sustainable exchange of information via EHRs and genomic studies that is necessary for precision medicine. These include a lack of incentives, specifically in relation to the loss of patients to other hospitals. This problem can be overcome by making patients’ health data available anywhere, although it has long been recognized that healthcare systems are often reluctant to participate in health information exchanges. Other recognized difficulties include the inefficient sorting through excessively non-selective patient information and problems in understanding the data shared in a medical context, particularly owing to a lack of details associated with the clinical notes driven by privacy concerns [66]. The reluctance of patients and providers to exchange information based on privacy concerns and reports of healthcare data breaches is not unfounded [67]. The GA4GH Beacon Project (https://beacon-network.org/), which responds to allele-presence queries, represents an attempt to share genomic data without revealing individual patient information. However, this approach limits the utility of the data for research and diagnosis, and it is not exempt from the risk of patient re-identification attacks [68]. The conflict between the need for data sharing and the maintenance of data security has also stimulated active research into novel cryptographic implementations that both serve patient healthcare and protect sensitive genomic information and patient privacy [69, 70]. It is also necessary to note that precision medicine studies should describe the full bioinformatics settings used to evaluate the quality and the traceability of the data generated.
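The allele-presence query that a Beacon answers can be illustrated with a minimal Python sketch. This is not the GA4GH Beacon API or its implementation: the `Variant` and `ToyBeacon` names, the in-memory cohort and the coordinates are all hypothetical, chosen only to show that such a service keeps a pooled set of alleles and returns a yes/no answer rather than any patient-level record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A single allele, identified by position and base change (hypothetical model)."""
    chrom: str
    pos: int   # 1-based position
    ref: str
    alt: str


def normalise(chrom, pos, ref, alt):
    """Build a Variant with upper-cased alleles so that queries are case-insensitive."""
    return Variant(str(chrom), int(pos), ref.upper(), alt.upper())


class ToyBeacon:
    """Answers 'has this allele been observed in the cohort?' and nothing else."""

    def __init__(self, genotypes):
        # genotypes: mapping of patient id -> iterable of Variant.
        # Only the pooled set of alleles is kept; patient ids are discarded.
        self._alleles = frozenset(
            normalise(v.chrom, v.pos, v.ref, v.alt)
            for variants in genotypes.values()
            for v in variants
        )

    def query(self, chrom, pos, ref, alt):
        """Return True if the allele was seen in at least one patient."""
        return normalise(chrom, pos, ref, alt) in self._alleles


if __name__ == "__main__":
    # Made-up cohort with made-up coordinates, for illustration only.
    cohort = {
        "patient_a": [Variant("1", 100, "A", "G")],
        "patient_b": [Variant("2", 200, "C", "T"), Variant("1", 100, "A", "G")],
    }
    beacon = ToyBeacon(cohort)
    print(beacon.query("1", 100, "A", "G"))
    print(beacon.query("3", 300, "G", "C"))
```

Because each answer still reveals whether an allele exists in the cohort, many such queries can be combined to test whether a given individual is a member, which is the re-identification risk reported for this kind of service [68].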
Apart from the aforementioned ClinVar resource, used regularly by clinical geneticists and other clinicians worldwide, the SHIVA trial has set a good precedent in this sense. In particular, this clinical trial fully reported the ontologies used, together with the standards and the complete bioinformatics framework used in its pipeline [44]. This kind of good practice promotes transparency, serves as a reference for other studies and facilitates pipeline reproducibility for those who will be in charge of computational analysis and multi-omics data interpretation in precision medicine studies (i.e. the clinical bioinformaticians).

Clinical bioinformaticians at hospitals: the pioneer comes to town

As seen above, the challenges to successfully applying precision medicine in hospitals are numerous. In particular, they are related to the need for large computational infrastructures for data processing and storage; the establishment of guidelines, standards and controlled protocols; the implementation of security policies to access data, along with biobanking, legal issues and ethical questions; and, last but not least, the lack of trained and specialized professionals to perform such tasks. At first glance, this situation resembles what happened two decades ago in genomics, where the completion of the Human Genome Project and the development of multi-omics HTP methods highlighted the need for trained bioinformaticians to handle, analyse and extract information from large amounts of data [71]. Indeed, with the current big data explosion, this situation is even more evident and, consequently, experienced bioinformaticians and data scientists are in high demand at big pharma and biotech companies, and at academic institutions [72]. The most desired professional profiles include experts at the crossroads of computer sciences, statistics and molecular biology, those able to present big data information in a clear manner to decision-makers [73, 74].
Nevertheless, such a profile might not suffice to cover all precision medicine requirements, which call for an expert with bioinformatics skills who is familiar with the hospital environment and its particularities, who is fluent in clinical terminology, acquainted with clinical trial design and procedures and, most importantly, capable of understanding physicians’ demands for clinical decision-making tools as a priority to improve the standard of patient care (Box 1).

Box 1. Fundamental technical skills for clinical bioinformaticians
1. Informatics
○ Experience in UNIX command line.
○ A basic programming language (e.g. Python). R as a useful language for handling statistics.
○ Knowledge of big data environments.
2. Life sciences
○ Understand the different types of biological data and databases.
○ Comprehend HTP data analysis methods.
○ Multi-omics data integration and interpretation.
3. Clinical scenario
○ Be familiar with EHRs, clinical terminology and medical procedures and protocols.
○ Get to know medical genomics: diagnosis, predictive and prognosis biomarkers.
○ Understand clinical trial design and monitoring.

This type of bioinformatician, orientated towards the clinical environment, is still rare, yet it is an increasingly demanded species in today’s hospitals. Finding such specialists is imperative for the effective implementation of precision medicine in health institutions, which will certainly need to set up specialist teams of clinical bioinformaticians to bridge the gap between the patient’s genomic landscape and clinical decision-making, linking raw sequences, data analysis tools, interpretation algorithms and a variety of databases and EHR repositories for each particular clinical case (Figure 2).

Figure 2. Clinical bioinformatics laboratory profile. Clinical bioinformatics teams require multidisciplinary experts to perform regular tasks. Computational experts will be in charge of servers, databases, development and pipeline optimization. More biologically focused profiles will be responsible for genomic analysis and interpretation. In addition, all team members will have been trained to a varying degree in basic clinical sciences. There will be a continued knowledge exchange among the team members to achieve clinical goals and improve patient healthcare.

At the same time, it is clear that the expansion of precision medicine in the clinical environment will require an effort from physicians to better understand how bioinformatics can help in their work, assisting data-driven clinical decision-making. In fact, a number of authors have remarked on this issue, claiming physicians should become knowledgeable users who can understand the output from the bioinformatics analysis of a patient’s genomic data [75–78]. Indeed, other recent proposals go beyond specialized training in computational biology for physicians, proposing the integration of translational bioinformatics into the medical curriculum, and the establishment of official precision medicine certificates for clinical geneticists and health professionals interested in this field [79, 80].
The natural first targets of these proposals are clinical geneticists, who are aware of their lack of training in bioinformatics, and the working duo of geneticist–bioinformatician is becoming more common in hospitals. The landscape is not that clear for other medical specialties such as Oncology, Cardiology or Neurology, where the bioinformatician’s competences are considered far removed from the daily clinical work. Consequently, it is not surprising that a wide variety of opportunities for medical staff to train in bioinformatics have emerged in recent years [81]. Such opportunities must focus both on knowledge acquisition and on clear practical examples of clinical use, more directly orientated to ‘classical’ health professionals. At present it is hard to find proposals for clinical training that specifically target bioinformaticians and computational biologists. The absence of such training in the bioinformatics community is slowing down the integration of bioinformaticians into the healthcare sector, contributing to a deceleration in the implementation of precision medicine at healthcare institutions. Undoubtedly, bioinformaticians and physicians speak different languages and have distinct scientific cultures that are not always easy to reconcile. The clinical bioinformatics community should help to overcome this language barrier, adapting its curriculum to a medical scenario and adding clinical knowledge to its background to facilitate communication with health professionals. It is necessary to organize and promote courses in fundamental clinical areas for bioinformatics professionals to ensure their full integration into working teams in healthcare institutions (Figure 3).
Precision medicine will definitely benefit from the incorporation of clinically trained bioinformaticians capable of straddling the medical, biological and computational worlds, who can write code, interpret genome data and communicate with health professionals in clinical departments and hospitals.

Figure 3. Precision medicine workflow in hospitals. The patient’s standard of care in a precision medicine scenario requires specialist clinical bioinformaticians to participate in multidisciplinary clinical sessions with physicians and clinical geneticists, making knowledge exchange easier and facilitating data-driven diagnosis and clinical decision-making. Conversely, physicians and clinical geneticists should be incorporated into clinical bioinformatics discussions to guide and adapt genomic reports to the medical reality. Clinical bioinformaticians in hospitals should not be simple data extraction technicians but rather specialized partners of physicians in accordance with a patient-centred model. Mutual training for clinicians and bioinformaticians should provide reciprocal and bidirectional information exchange.

Approaching physicians: emerging efforts in clinical bioinformatics training

In Europe, the ELIXIR Training Platform represents a solid training initiative that offers guidelines and best practices for educational excellence in bioinformatics, focused on the training of bioinformaticians and computational biologists [82, 83]. In addition, the Global Organisation for Bioinformatics Learning, Education and Training (GOBLET; http://mygoblet.org) and the International Society for Computational Biology Education Committee offer a networking structure for bioinformatics trainers and trainees [84]. Although the ELIXIR Training Platform and GOBLET have traditionally been involved in teaching bioinformatics for life sciences, the current demand for clinical bioinformaticians has boosted new training efforts to overcome this bottleneck. Notably, the ELIXIR-EXCELERATE project and the UK’s Health Education England (HEE) have recently set up training workshops on clinical bioinformatics in the UK, focusing on bioinformatics instructors, and providing guidance and tips to develop and deliver training in clinical bioinformatics [85]. Moreover, the urgent need for early training in clinical bioinformatics skills drove the HEE team to produce a report that sets out a phased approach to provide the training required to support the 100 000 Genomes Project within the UK healthcare system. This report claims that the first urgent need for precision medicine and medical genomics is to recruit specialist healthcare scientists to assist clinical staff with the interpretation of genome sequencing data. The report includes a detailed long-term training plan that promotes PhD and postdoctoral fellowships in clinical bioinformatics, as well as MSc programmes in Medical Genomics, to develop and integrate workforces of bioinformaticians into British healthcare institutions [86].
This HEE strategy emphasizes the idea that implementing precision medicine requires a novel bioinformatics expert able to integrate and interact with the hospital environment. To do so, clinical bioinformaticians must be trained in basic clinical principles, clinical trial procedures and medical genomics. Such a specific clinical background would facilitate efficient and natural communication with physicians, medical geneticists and other healthcare specialists. Clinical training would also help bioinformaticians realize that their role in hospitals goes beyond data extraction, storage and interpretation. As new partners to physicians, clinical bioinformaticians should actively participate in clinical sessions, supporting clinical decision-making with clear and intuitive genomic reports designed on the basis of input from the clinical staff (Figure 3). Clinical courses, workshops and educational programmes orientated towards bioinformaticians, along with reciprocal training for healthcare specialists, would help to increase fluent and effective information transfer for the ultimate benefit of patient care.

Perspectives: the evolution of precision bioinformatics

The molecular diagnostics laboratories of tomorrow will depend on trained bioinformaticians and, until this happens, precision medicine will drive the incorporation of an increasing number of specialist bioinformaticians at healthcare institutions. It is expected that these diagnostic laboratories will offer sequencing services to healthcare systems and create efficient informatics solutions to execute such tasks, all under an approved regulatory process. Pioneering groups are currently incorporating bioinformaticians, albeit with limited experience in clinical procedures and hospital environments (Box 2). Clearly, the integration of these novel specialists is not a trivial task.
Indeed, there are a number of technical and non-technical barriers that need to be addressed and these will first require a strong commitment and decisive efforts by the institutions and their clinical staff, but also from the whole bioinformatics community.

Box 2. Tips for emergent clinical bioinformatics group leaders
1) Rule of thumb: the patient’s health is at the centre of everything you do.
2) You’ll need complementary roles in your team, as computational, biological and clinical profiles are interdependent.
3) Be pragmatic, the ultimate goal is to fulfil medical needs to facilitate clinical decision-making.
4) Stay up-to-date, parallel research projects are essential to achieve excellence in genomics service.
5) Be rigorous (yes, even more rigorous), especially as you will be dealing with patient data, not just research data. Your analysis will contribute to the patient’s healthcare.
6) Open-mindedness and flexibility are essential. In hospitals, hardly anyone shares their bioinformatics background. Whoever adapts, wins.
7) Be communicative, overcome your own scientific limitations and language barriers.
8) Hone your teaching skills, as you will need to explain your protocols and results repeatedly.
9) Do not isolate yourself. Commands and computational papers are ok but, well… you work with healthcare professionals and patients. There is a world out there beyond computers and algorithms.
10) You are not alone. Networking with bioinformatics colleagues in other health institutions really works. Share your knowledge…and your ignorance.

One such technical barrier is related to the high-performance computing infrastructures required for the efficient storage, processing and interpretation of routine large-scale genomic analysis in national healthcare systems. Unfortunately, the computing infrastructures that are currently found at healthcare institutions are not usually prepared to efficiently process such volumes of data.
To address this issue in the most cost-effective way, and to deliver a genomics-based service to the British health service, Genomics England is considering the possibility of establishing a pilot semi-centralized system of molecular diagnostics laboratories from 10 to 20 British cancer centres. These laboratories would share standardized operating procedures and NGS platforms. Raw sequence and phenotype data would be stored locally at each clinical site, yet data analysis would be carried out in a single cloud environment using a standardized mutation detection pipeline [87]. Other European strategies, such as EuroHPC [88] and the Partnership for Advanced Computing in Europe, are working towards the establishment of a multi-government cooperative framework to acquire and deploy an integrated supercomputing infrastructure. It is expected that EuroHPC will develop a test bed to create large European digital infrastructures for personalized medicine data. Such approaches, based on cloud computing, have been successfully applied in massive sequencing projects like the PanCancer Analysis Whole Genome project, where full genome analyses were accomplished in a cloud-based architecture across 13 data centres distributed over three continents. However, these strategies need to be studied thoroughly before they can be applied in clinical settings. In fact, some authors have reported concerns regarding the use of cloud-based systems for patient data computing. These concerns are related to perceived limitations in data security and protection, the need for due consideration of the rights of patient donors and research participants, and legal issues associated with local regulations owing to fundamental differences in the understanding of the right to data protection between different legal systems [89].
Without a doubt, the implementation of shared supercomputing initiatives to support precision medicine data processing will require a substantial economic investment on the part of governments. Nevertheless, sequencing costs are falling and it is expected that the creation of shared computing infrastructures to support sequencing on a larger scale will help push sequencing costs down further, with consequent savings in time and money for healthcare systems. In addition, it is expected that precision medicine will eventually help healthcare institutions save money in treatments, while at the same time enhancing the quality of life of patients. For instance, whole-exome sequencing (WES) of children’s rare disease was shown to have improved the diagnostic rate 5-fold compared with standard care, while reducing costs [90]. Similarly, other studies indicate that the use of WES as an early, routine clinical test for infants with suspected monogenic disorders more than triples the successful diagnosis, at one-third of the cost [91]. These are clear examples of the application of panels and exome sequencing already implemented in molecular diagnostics laboratories at hospitals. These alternatives to WGS have a significantly smaller processing burden and they are therefore cheaper, although they cannot compete with the capacity of WGS to offer a complete genome landscape and more information about structural variations. It is also expected that WGS will be industrialized as a single common process, becoming more cost-effective than panels or exomes. Accordingly, a patient’s whole genome data will be used as a multi-diagnostic test for a catalogue of diseases or clinical phenotypes [87]. The integration and exploitation of population-scale biomedical data generated by mobile and real-time wellness monitoring devices also raises significant challenges for precision medicine bioinformatics. 
It is expected that monitored data would promote data-driven public health studies to design precision prevention strategies [16, 92]. Such data would also help to stratify patients for active health management, dealing better with clinically asymptomatic patients and their underlying medical history. Clinical bioinformatics tools and resources are fundamental for these advances in implementing real-time biomedical and healthcare analyses in a clinical environment. The development of data capturing and storage strategies that integrate and correlate health data with patients’ EHRs on a population scale is an issue that must be addressed, and it will require the implementation of novel scientific and technical resources and methods [16]. Another forthcoming hurdle is related to radiomics, an emerging and promising field that involves the HTP extraction of many features from radiographic images [93]. Pioneering radiomics approaches have recently been incorporated into precision oncology, whereby tumour characterization is not limited to anatomy but can also reveal information at the cellular and genomic level that can be quantified as an imaging phenotype. Thus, computational algorithms for radiomics allow quantitative automated imaging features to be converted into mineable data. Pioneering studies in head and neck [94] or lung cancer [95] provide preliminary evidence that radiomics texture analyses can define distinctive tumour phenotypes that are driven by underlying genotypes. In this way, radiomics signatures can hold predictive and prognostic information to guide personalized radiotherapy. It is expected that the development of computational methods to efficiently extract and process radiomics data, in conjunction with the implementation of platforms to readily integrate radiomics with clinical, pathological and genomic information, will be research areas of great interest in coming years.
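How image texture becomes ‘mineable data’ can be sketched with a few hand-rolled features. The example below assumes a 2D grey-level image and a boolean tumour mask held as NumPy arrays; the function names, the eight grey levels and the choice of features (mean, variance, histogram entropy and a horizontal co-occurrence contrast) are illustrative assumptions, not the pipelines used in the cited studies [94, 95].

```python
import numpy as np


def first_order_features(image, mask, n_bins=8):
    """Intensity statistics of the pixels inside the region of interest."""
    roi = np.asarray(image, dtype=float)[np.asarray(mask, dtype=bool)]
    hist, _ = np.histogram(roi, bins=n_bins)
    p = hist[hist > 0] / roi.size          # probabilities of non-empty bins
    return {
        "mean": float(roi.mean()),
        "variance": float(roi.var()),
        "entropy": float(-(p * np.log2(p)).sum()),
    }


def glcm_contrast(image, mask, n_levels=8):
    """Contrast of a grey-level co-occurrence matrix for horizontal neighbours."""
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    roi = image[mask]
    lo, hi = roi.min(), roi.max()
    if hi == lo:
        return 0.0                         # a flat region has no texture contrast
    # Quantise intensities into n_levels grey levels using the ROI range.
    q = np.clip(((image - lo) / (hi - lo) * n_levels).astype(int), 0, n_levels - 1)
    left, right = q[:, :-1], q[:, 1:]
    valid = mask[:, :-1] & mask[:, 1:]     # both pixels of a pair must lie in the ROI
    glcm = np.zeros((n_levels, n_levels))
    np.add.at(glcm, (left[valid], right[valid]), 1)
    glcm = glcm + glcm.T                   # count each pair in both directions
    total = glcm.sum()
    if total == 0:
        return 0.0
    glcm /= total
    i, j = np.indices(glcm.shape)
    return float(((i - j) ** 2 * glcm).sum())


if __name__ == "__main__":
    # Synthetic 'image': a random texture with a square region of interest.
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(32, 32))
    mask = np.zeros((32, 32), dtype=bool)
    mask[8:24, 8:24] = True
    features = first_order_features(image, mask)
    features["glcm_contrast"] = glcm_contrast(image, mask)
    print(features)
```

Each region of interest is reduced to a small dictionary of numbers, which is the form in which imaging features can be stored alongside clinical, pathological and genomic variables for later analysis.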
In summary, the emerging precision medicine paradigm offers a new and challenging scenario to generate an unprecedented amount of health-related data. A novel type of bioinformatician is required to translate such data into knowledge that can be used to facilitate clinical decision-making. Clinical bioinformaticians will unify their efforts with the staff at hospitals and healthcare institutions to successfully deliver quality patient care. This situation provides new, attractive and challenging job opportunities to bioinformaticians, and maybe it is time to ask not what their health institutions can do for them but rather, what they can do for their health institutions.

Key Points
○ The precision medicine initiatives emerging around the world are facing many challenges.
○ Current public and private investments aim to establish the computational infrastructures required to support precision medicine initiatives.
○ Precision medicine and clinical bioinformatics can only work efficiently if electronic health records and patient genotypes are accessible to in-house bioinformaticians.
○ The successful implementation of precision medicine in health institutions requires bioinformaticians with a basic clinical training, as yet an unfulfilled need.
○ The clinical bioinformatician is a novel and specialized profile demanded increasingly by healthcare centres.

Acknowledgements

The authors would like to thank all members of the CNIO Bioinformatics Unit for their comments, all of which have helped improve this manuscript. A.V. thanks the BSC-IRB-CRG Program in Computational Biology and the Severo Ochoa Award, SEV 2015-0493.

Funding

This study was funded by the following grants: BIO2014-57291-R from the Spanish Ministry of Economy and Competitiveness and PT13/0001/0007 from the ISCIII, both co-funded with European Regional Development Funds (ERDF); EU H2020-INFRADEV-1-2015-1 ELIXIR-EXCELERATE (ref. 676559); and Marie-Curie Career Integration Grant (CIG) CIG334361.
Gonzalo Gómez-López is a senior computational biologist at the Spanish National Cancer Research Centre, and assistant professor of translational genomics and bioinformatics at the School of Medicine of the Autónoma University (Madrid). He is interested in cancer genome interpretation, transcriptomics and immunotherapy. Joaquín Dopazo is the director of the Clinical Bioinformatics Area of the Fundación Progreso y Salud (Seville). His research focuses on translational genomic data analysis for precision medicine, with an emphasis on systems medicine applications. Juan Cruz Cigudosa is a clinical geneticist with experience in genomics for research and clinical applications. He is the current president of the Spanish Human Genetics Association and scientific director of NIMGenetics. Alfonso Valencia is the director of the Spanish National Bioinformatics Institute (INB) and the Life Sciences department at the Barcelona Supercomputing Centre. His main research goals are to analyse the function and interactions of cancer-related proteins, and to develop novel computational methods to examine, represent and interpret cancer genome information. Fátima Al-Shahrour is head of the Bioinformatics Unit at Spanish National Cancer Research Centre. She is applying computational methods to precision medicine to interpret cancer genomes, for drug repositioning and for the prediction of anticancer therapies.
References
1 Biankin AV. The road to precision oncology. Nat Genet 2017;49(3):320–1.
2 Duffy DJ. Problems, challenges and promises: perspectives on precision medicine. Brief Bioinform 2016;17(3):494–504.
3 Valencia A, Hidalgo M. Getting personalized cancer genome analysis into the clinic: the challenges in bioinformatics. Genome Med 2012;13(7):61.
4 Dubitzky W. Computational systems biomedicine. Brief Bioinform 2016;17(3):367.
5 Vogelstein B, Papadopoulos N, Velculescu VE, et al. Cancer genome landscapes. Science 2013;339(6127):1546–58.
6 Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013;499(7457):214–8.
7 Gerlinger M, Rowan AJ, Horswell S, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med 2012;366(10):883–92.
8 Rubio-Perez C, Tamborero D, Schroeder MP, et al. In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities. Cancer Cell 2015;27(3):382–96.
9 Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21(8):846–53.
10 Eifert C, Powers RS. From cancer genomes to oncogenic drivers, tumors dependencies and therapeutic targets. Nature 2012;12:572–6.
11 Hyman DM, Taylor BS, Baselga J. Implementing genome-driven oncology. Cell 2017;168(4):584–99.
12 Redig AJ, Jänne PA. Basket trials and the evolution of clinical trial design in an era of genomic medicine. J Clin Oncol 2015;33(9):975–7.
13 Kandoth C, McLellan MD, Vandin F, et al. Mutational landscape and significance across 12 major cancer types. Nature 2013;502(7471):333–9.
14 Clinical Cancer Genome Task Team of the Global Alliance for Genomics and Health. Sharing clinical and genomic data on cancer—the need for global solutions. N Engl J Med 2017;376:2006–9.
15 Lawler M, Maughan T.
From Rosalind Franklin to Barack Obama: data sharing challenges and solutions in genomics and personalised medicine. New Bioeth 2017;23(1):64–73.
16 Shameer K, Badgeley MA, Miotto R, et al. Translational bioinformatics in the era of real-time biomedical, healthcare and wellness data streams. Brief Bioinform 2017;18(1):105–24.
17 Longo DL, Drazen JM. Data sharing. N Engl J Med 2016;374(3):276–7.
18 Bierer BE, Crosas M, Pierce HH. Data authorship as an incentive to data sharing. N Engl J Med 2017;376(17):1684–7.
19 Rosenbaum L. Bridging the data-sharing divide—seeing the devil in the details, not the other camp. N Engl J Med 2017;376(23):2201–3.
20 Reardon S. Giant study poses DNA data-sharing dilemma. Nature 2015;525(7567):16–7.
21 Vis DJ, Lewin J, Liao RG, et al. Towards a global cancer knowledge network: dissecting the current international cancer genomic sequencing landscape. Ann Oncol 2017;28(5):1145–51.
22 Feldman EA. The Genetic Information Nondiscrimination Act (GINA): public policy and medical practice in the age of personalized medicine. J Gen Intern Med 2012;27(6):743–6.
23 Dzau VJ, Geoffrey SG. Realizing the full potential of precision medicine in health and health care. JAMA 2016;316(16):1659–60.
24 AACR Project GENIE Consortium. AACR project GENIE: powering precision medicine through an international consortium. Cancer Discov 2017;7(8):818–31.
25 Global Alliance for Genomics and Health. GENOMICS.
A federated ecosystem for sharing genomic, clinical data. Science 2016;352(6291):1278–80.
26 http://genomicsandhealth.org/work-products-demonstration-projects/catalogue-global-activities-international-genomic-data-initiati.
27 Philippakis AA, Azzariti DR, Beltran S, et al. The matchmaker exchange: a platform for rare disease gene discovery. Hum Mutat 2015;36(10):915–21.
28 Landrum MJ, Lee JM, Benson M, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res 2016;44(D1):D862–8.
29 Grossman RL, Heath AP, Ferretti V, et al. Toward a shared vision for cancer genomic data. N Engl J Med 2016;375(12):1109–12.
30 Coyne GO, Takebe N, Chen AP. Defining precision: the precision medicine initiative trials NCI-MPACT and NCI-MATCH. Curr Probl Cancer 2017;41(3):182–93.
31 Brower V. NCI-MATCH pairs tumor mutations with matching drugs. Nat Biotechnol 2015;33(8):790–1.
32 Lih CJ, Sims DJ, Harrington RD, et al. Analytical validation and application of a targeted next-generation sequencing mutation-detection assay for use in treatment assignment in the NCI-MPACT trial. J Mol Diagn 2016;18(1):51–67.
33 Bailey P, Chang DK, Nones K, et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature 2016;531(7592):47–52.
34 Jamal-Hanjani M, Hackshaw A, Ngai Y, et al. Tracking genomic cancer evolution for precision medicine: the lung TRACERx study. PLoS Biol 2014;12:e1001906.
35 Le Tourneau C, Delord JP, Gonçalves A, et al.
Molecularly targeted therapy based on tumour molecular profiling versus conventional therapy for advanced cancer (SHIVA): a multicentre, open-label, proof-of-concept, randomised, controlled phase 2 trial. Lancet Oncol 2015;16(13):1324–34.
36 Samuels S, Balint B, von der Leyen H, et al. Precision medicine in cancer: challenges and recommendations from an EU-funded cervical cancer biobanking study. Br J Cancer 2016 6;115(12):1575–83.
37 Hidalgo M, Amant F, Biankin AV, et al. Patient-derived xenograft models: an emerging platform for translational cancer research. Cancer Discov 2014;4(9):998–1013.
38 Garralda E, Paz K, López-Casas PP, et al. Integrated next-generation sequencing and avatar mouse models for personalized cancer treatment. Clin Cancer Res 2014;20(9):2476–84.
39 Byrne AT, Alférez DG, Amant F, et al. Interrogating open issues in cancer precision medicine with patient-derived xenografts. Nat Rev Cancer 2017;17(4):254–68.
40 Rehm HL. Evolving health care through personal genomics. Nat Rev Genet 2017;18(4):259–67.
41 Eisenstein M. Big data: the power of petabytes. Nature 2015;527(7576):S2–4.
42 Zhao Y, Polley EC, Li MC, et al. GeneMed: an informatics hub for the coordination of next-generation sequencing studies that support precision oncology clinical trials. Cancer Inform 2015;14:45–55.
43 Palmisano A, Zhao Y, Li MC, et al. OpenGeneMed: a portable, flexible and customizable informatics hub for the coordination of next-generation sequencing studies in support of precision medicine trials. Brief Bioinform 2017;18:723–34. doi: 10.1093/bib/bbw059.
44 Servant N, Roméjon J, Gestraud P, et al. Bioinformatics for precision medicine in oncology: principles and application to the SHIVA clinical trial. Front Genet 2014;5:152.
45 Rubio-Camarillo M, Gómez-López G, Fernández JM, et al. RUbioSeq: a suite of parallelized pipelines to automate exome variation and bisulfite-seq analyses. Bioinformatics 2013;29(13):1687–9.
46 Carey DJ, Fetterolf SN, Davis FD, et al. The Geisinger MyCode community health initiative: an electronic health record–linked biobank for precision medicine research. Genet Med 2016;18(9):906–13.
47 Dewey FE, Murray MF, Overton JD, et al. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 2016;354.
48 Alemán A, Garcia-Garcia F, Medina I, et al. A web tool for the design and management of panels of genes for targeted enrichment and massive sequencing for clinical applications. Nucleic Acids Res 2014;42(W1):83–7.
49 Dopazo J, Amadoz A, Bleda M, et al. 267 Spanish exomes reveal population-specific differences in disease-related genetic variation. Mol Biol Evol 2016;33(5):1205–18.
50 Alemán A, Garcia-Garcia F, Salavert F, et al. A web-based interactive framework to assist in the prioritization of disease candidate genes in whole-exome sequencing studies. Nucleic Acids Res 2014;42(W1):88–93.
51 Thurnherr T, Singer F, Stekhoven DJ, et al. Genomic variant annotation workflow for clinical applications. F1000Res 2016;5:1963.
52 Bauer CR, Knecht C, Fretter C, et al.
Interdisciplinary approach towards a systems medicine toolbox using the example of inflammatory diseases. Brief Bioinform 2017;18:479–87.
53 Bhuvaneshwar K, Belouali A, Singh V, et al. G-DOC Plus—an integrative bioinformatics platform for precision medicine. BMC Bioinformatics 2016;17(1):193.
54 http://www.sib.swiss/about-us/news/news-2016/966-launch-of-a-new-cancer-diagnostic-platform-oncobench
55 Granados Moreno P, Joly Y, Knoppers BM. Public–private partnerships in cloud-computing services in the context of genomic research. Front Med 2017;4:3.
56 https://www.genomicsengland.co.uk/clinicians-researchers-and-industry-collaborate-with-the-100000-genomes-project/
57 Bringing precision medicine to community oncologists. Cancer Discov 2017;7:6–7.
58 Charles D, King J, Patel V, et al. Adoption of Electronic Health Record Systems among U.S. Non-federal Acute Care Hospitals: 2008–2012. ONC Data Brief, no 9. Washington, DC: Office of the National Coordinator for Health Information Technology, 2013.
59 Millar J. The need for a global language—SNOMED CT introduction. Stud Health Technol Inform 2016;225:683–5.
60 Robinson PN, Mundlos S. The human phenotype ontology. Clin Genet 2010;77(6):525–34.
61 Stroetman VN, Kalra D, Lewalle P, et al. Semantic Interoperability for Better Health and Safer Healthcare Research and Deployment Roadmap for Europe. Directorate-General for the Information Society and Media (European Commission), 2009. DOI: 10.2759/38514. https://publications.europa.eu/en/publication-detail/-/publication/9bb4f083-ac9d-47f8-ab4a-76a1f095ef15/language-en
62 Legaz-García Mdel C, Martínez-Costa C, Miñarro-Giménez JA, et al. Ontology patterns-based transformation of clinical information. Stud Health Technol Inform 2014;205:1018–22.
63 Schera F, Weiler G, Neri E, Kiefer S, et al. The p-medicine portal-a collaboration platform for research in personalised medicine. Ecancermedicalscience 2014;11:398.
64 Marés J, Shamardin L, Weiler G, et al. p-medicine: a medical informatics platform for integrated large scale heterogeneous patient data. AMIA Annu Symp Proc 2014;14:872–81.
65 Hazlehurst BL, Kurtz SE, Masica A, et al. CER Hub: an informatics platform for conducting comparative effectiveness research using multi-institutional, heterogeneous, electronic clinical data. Int J Med Inform 2015;84(10):763–73.
66 Sitpati A, Kim H, Berkovich B, et al. Integrated precision medicine: the role of electronic health records in delivering personalized treatment. WIREs Syst Biol Med 2017;9:e1378-12.
67 http://www.hipaajournal.com/largest-healthcare-data-breaches-of-2016-8631/.
68 Raisaro JL, Tramèr F, Ji Z, et al. Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks. J Am Med Inform Assoc 2017;14:799–805. doi: 10.1093/jamia/ocw167
69 Jagadeesh KA, Wu DJ, Birgmeier JA, et al. Deriving genomic diagnoses without revealing patient genomes. Science 2017;357(6352):692–5.
70 Simmons S, Sahinalp C, Berger B. Enabling privacy-preserving GWASs in heterogeneous human populations. Cell Syst 2016;3(1):54–61.
71 Wickware P. Training in a hybrid discipline. Nature 2001;413(6858):6–7.
72 Jeffrey C. Core services: reward bioinformaticians. Nature 2015;520:151–2.
73 Levine AG. An explosion of bioinformatics careers. Science 2014;344:1303–06.
74 Spotlight on bioinformatics. Biology goes digital.
NatureJobs 2016. doi: 10.1038/nj0478.
75 Lopez-Campos G, Lopez-Alonso V, Martin-Sanchez F. Training health professionals in bioinformatics. Experiences and lessons learned. Methods Inf Med 2010;49(3):299–304.
76 Rubinstein JC. Perspectives on an education in computational biology and medicine. Yale J Biol Med 2012;85(3):331–7.
77 Brazas MD, Lewitter F, Schneider MV, et al. A quick guide to genomics and bioinformatics training for clinical and public audiences. PLoS Comput Biol 2014;10(4):e1003510.
78 Clay MR, Fisher KE. Bioinformatics education in pathology training: current scope and future direction. Cancer Inform 2017;10:16.
79 Tan B, Ban KH, Tan TW. Integrating translational bioinformatics into the medical curriculum. Int J Med Educ 2014;5:132–4.
80 McGrath S, Ghersi D. Building towards precision medicine: empowering medical professionals for the next revolution. BMC Med Genomics 2016;9(1):23.
81 Rozman D, Acimovic J, Schmeck B. Training in systems approaches for the next generation of life scientists and medical doctors. Methods Mol Biol 2016;1386:73–86.
82 Via A, Blicher T, Bongcam-Rudloff E, et al. Best practices in bioinformatics training for life scientists. Brief Bioinform 2013;14(5):528–37.
83 Tramontano A, Valencia A. Education and research infrastructures. In: A. Cesario, F.B. Marcus (eds.), Cancer Systems Biology, Bioinformatics and Medicine. Springer Science, Business Media B.V., 2011, pp. 165–181. DOI: 10.1007/978-94-007-1567-7_6. ISBN 978-94-007-1567-7.
84 Attwood TK, Bongcam-Rudloff E, et al. GOBLET, the global organisation for bioinformatics learning, education and training.
Nature 2017;544:e1004143.
85 https://www.elixir-europe.org/events/excelerate-train-trainer-clinical-bioinformatics-and-best-practices-workshop
86 https://www.genomicseducation.hee.nhs.uk/images/publications/Developing_NHS_Clinical_Bioinformatics_Training.pdf
87 https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/624628/CMO_annual_report_generation_genome.pdf
88 https://ec.europa.eu/digital-single-market/en/news/eu-ministers-commit-digitising-europe-high-performance-computing-power
89 Molnár-Gábor F, Lueck R, Yakneen S, et al. Computing patient data in the cloud: practical and legal considerations for genetics and genomics research in Europe and internationally. Genome Med 2017;9(1):58.
90 Stark Z, Tan TY, Chong B, et al. A prospective evaluation of WES as a first-tier molecular test in infants with suspected monogenic disorders. Genet Med 2016;18:1090–6.
91 Stark Z, Schofield D, Alam K, et al. Prospective comparison of the cost-effectiveness of clinical whole-exome sequencing with that of usual care overwhelmingly supports early use and reimbursement. Genet Med 2017;19(8):867–74.
92 Khoury MJ, Iademarco MF, Riley WT. Precision public health for the era of precision medicine. Am J Prev Med 2016;50(3):398–401.
93 Lambin P, Rios-Velazquez E, Leijenaar R, et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer 2012;48(4):441–6.
94 Caudell JJ, Torres-Roca JF, Gillies RJ, et al. The future of personalised radiotherapy for head and neck cancer. Lancet Oncol 2017;18(5):e266–73.
95 Lee G, Lee HY, Park H, et al.
Radiomics and its emerging role in lung cancer research, imaging biomarkers and clinical management: state of the art. Eur J Radiol 2017;86:297–307.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Proteomics and phosphoproteomics in precision medicine: applications and challenges
Giudice, Girolamo; Petsalaki, Evangelia
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx141 pmid: 29077858
Abstract Recent advances in proteomics allow the accurate measurement of abundances for thousands of proteins and phosphoproteins from multiple samples in parallel. Therefore, for the first time, we have the opportunity to measure the proteomic profiles of thousands of patient samples or disease model cell lines in a systematic way, to identify the precise underlying molecular mechanism and discover personalized biomarkers, networks and treatments. Here, we review examples of the successful use of proteomics and phosphoproteomics data sets, as well as their integration with other omics data sets, with the aim of precision medicine. We will discuss the bioinformatics challenges posed by the generation, analysis and integration of such large data sets and present potential reasons why proteomics profiling and biomarkers are not currently widely used in the clinical setting. We will finally discuss ways to contribute to the better use of proteomics data in precision medicine and the clinical setting.
proteomics, phosphoproteomics, data integration, precision medicine
Introduction Precision medicine refers to the use of diagnostic, therapeutic and monitoring strategies for individual patients based on their molecular profiles [1]. While there has been one promising example of monitoring molecular data from a single individual over a long period to assess their health and disease status [2], in practice, the focus of the community lies mainly in the stratification of diseases into subtypes, based on molecular biomarkers or signatures, i.e. in the molecular taxonomy of disease [3]. The aim is to use these signatures to assign patients to specific disease subgroups and administer the most effective therapy for them. For example, patients with certain variants of TPMT, a thiopurine methyltransferase, are known to exhibit severe toxicity to the most common leukemia chemotherapy drug, thiopurine [4].
The dosage of the drug for their treatment is thus currently adjusted, based on TPMT variant screening, to avoid the toxicity and treat leukemia effectively [5]. Extensive molecular characterization of gene expression signatures in breast cancer [6–8] has allowed the development of multigene assays that are currently undergoing clinical trials for routine use in the clinic to guide patient treatment and monitoring [9]. Most efforts to molecularly characterize diseases use genomic-based methodologies to identify genetic variants, including copy number variations [10] and differential gene expression [6] associated with specific disease subtypes [11] (Figure 1). While significant progress has been made in stratifying patients and diseases, there has been limited success in using this information in the clinic. A recent meta-analysis of Phase 1 trial data for treating refractory malignant neoplasms found that, while the response rate using the ‘precision’ biomarker was significantly higher than in its absence, the median response rate was still only ∼30% [12]. Systems biology [13] has shown that focusing only on the genomic and transcriptomic layers of cell function regulation leaves us blind to other important regulators of cell phenotypes and outcomes. For example, metabolomics data provide information regarding the metabolism and energy balance regulation of the cell, and epigenomics can reflect the regulation of gene expression and the effect of environmental factors on the cell. The use of these data sets in precision medicine has been reviewed elsewhere [14, 15].
Figure 1. Example workflow for precision medicine. Multi-omics data are initially collected from patients and integrated to create their individual molecular profiles. These profiles are then matched to previously defined disease profiles that can guide the selection of treatment.
This is achieved either through a match to known biomarkers, omics signatures or network/pathway signatures. The appropriate drug is then selected based on this match, to improve the chance of successful treatment and reduce the probability of side effects.
It is well known that changes in gene expression do not always reflect changes in protein abundance [16–18]. Proteins are the major effectors of cell functions through changes in their posttranslational modifications (PTMs) and abundance, which are also reflected in changes in their interactome, with effects on cell phenotypes. It is therefore critical to also consider proteomics, phosphoproteomics and other PTM-‘omics’ data sets in our studies to understand disease development and subtypes, as they can better capture the functional state and dynamic properties of a cell. However, these data sets have not been extensively used in the precision medicine field because of the time required to run samples, the complexity and dynamic range of proteomics samples, the lack of reproducibility among laboratories, differences between quantification methods and other confounding factors [19, 20].
Recently, technological developments in instrumentation, sample preparation and data analysis [20–23] and initiatives to develop standards for the generation and evaluation of these data [24–30] have resulted in the availability of high-quality, reproducible and comprehensive proteomics and phosphoproteomics data sets and protocols to generate such data. For example, Sharma and colleagues [31] were able to detect 50 000 phosphopeptides in a single human cancer cell line, and scientists can routinely and accurately measure thousands of peptides within short time frames: Hebert et al. [32] were able to measure the entire yeast proteome comprising peptides from ∼3980 proteins in just over 1 h. Hundreds of targeted and global proteomics data sets are also collected by the CPTAC (Clinical Proteomic Tumor Analysis Consortium) to contribute to the study of cancer [33]. Therefore, the bioinformatics community must currently address the challenge of taking advantage of this new layer of information and integrating it with other valuable omics layers to study the mechanism of human disease and translate it into actionable insight in the clinic. Targeted proteomics methods such as SRM/MRM (Selected/Multiple Reaction Monitoring; [34]) and data-independent acquisition methods such as SWATH-MS (Sequential Windowed Acquisition of All Theoretical Fragment Ion Mass Spectra) also allow significant reduction in variability during data acquisition and improved data set quality [35]. For details on the technological advances that have allowed this revolution in proteomics and PTM-omics data acquisition, we redirect the reader to numerous existing publications [21, 22, 36–38]. Recent reviews have discussed proteomics and phosphoproteomics in the context of precision medicine [39, 40]. 
In this review, we will present an overview of bioinformatics approaches used to analyze these data individually as well as integrated with other omics data sets and will discuss challenges that should be tackled to gain insight into disease mechanisms and advance the field of precision medicine.
Proteomics-derived precision biomarkers and signatures A major application of proteomics is the identification of biomarkers for disease. Biomarkers can be divided into (i) diagnostic, to identify a given type of disease; (ii) prognostic, to measure the disease status; and (iii) predictive, to measure a response to a treatment [41]. Ideally, a biomarker should distinguish the disease unambiguously and should be detectable in an accessible body fluid such as plasma, blood, serum, urine, saliva or cerebrospinal fluid [42]. For example, the prostate-specific antigen (PSA) is one of the best-known noninvasive screening biomarkers and is used to detect prostate cancer [43]. However, a high concentration of PSA in the blood is also associated with benign prostatic hyperplasia and prostatitis [44–46]. Thus, even though PSA provides sufficient sensitivity, it fails to discriminate between prostate cancer and other prostate pathologies because of its poor specificity [47]. In recent years, to improve biomarker sensitivity and specificity, researchers have turned to a combination of biomarkers, i.e. a disease signature, instead of pursuing an ideal biomarker [48]. Using proteomics characterization of samples from different stages of luminal-type breast cancer progression, Pozniak et al. [49] identified differences in components of protein homeostasis and metabolic regulation that can differentiate healthy tissue from primary or lymph node-metastasized tumor tissue, and lymph node-positive from lymph node-negative breast cancers.
Proteomics-based subtyping of colon and rectal cancer patients by the CPTAC was also more fine-grained than that based on transcriptomics data, leading to better prediction of patient prognosis [50]. Combining protein with phosphoprotein abundance measurements using reverse phase protein arrays has also been used, e.g. for the prediction of ovarian cancer recurrence [51]. Numerous studies have showcased the value of phosphoproteomics data in providing mechanistic information about the disease [52–54]. For example, phosphoproteomics data have been used to discover the mechanism of resistance of melanoma cells to BRAF inhibitors [52] and of glioblastoma to mTOR (mechanistic target of rapamycin) inhibitors, leading to the discovery of a novel combination therapy for the latter [53]. Casado and colleagues [55] used phosphoproteomics data on hematological cancer cell lines to assign them to specific tumor types and potential treatments. They also studied acute myeloid leukemia primary cells to identify the differential activation of kinases in cells that presented different drug resistance profiles [56]. Excitingly, cell-specific phosphoproteomics has also been used to study bidirectional signaling between endothelial cells and tumor cells to understand metastatic mechanisms of tumor cells [54]. Recently, phosphoproteomics data were used to create mechanistic models of colorectal cancer cell line-specific drug resistance, suggesting that this could be a viable option also for patients [57]. It is therefore clear that the proteomics and phosphoproteomics layer of omics information can provide valuable insight in our quest towards precision medicine.
Extracting relevant and reliable features (proteins) from high-throughput proteomics data is the main challenge for the biomarker identification process. One approach is to use those proteins that are differentially expressed between normal and disease states [58–62].
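As an illustration of the differential-expression approach, the following is a minimal sketch that ranks proteins by Welch's t-statistic and reports a log2 fold change. It assumes log2-transformed intensities and uses hypothetical protein names and toy values; it is not the procedure of any of the cited studies, and a real analysis would also compute p-values and correct them for multiple testing.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances allowed)."""
    # Note: raises ZeroDivisionError if both groups have zero variance.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_differential_proteins(disease, control):
    """Rank proteins by |t|; inputs map protein name -> list of log2 intensities."""
    rows = []
    for protein in disease.keys() & control.keys():
        d, c = disease[protein], control[protein]
        log2_fc = mean(d) - mean(c)  # difference of log2 means = log2 fold change
        rows.append((protein, log2_fc, welch_t(d, c)))
    return sorted(rows, key=lambda r: abs(r[2]), reverse=True)

# Hypothetical toy data: three proteins, four samples per group.
disease = {
    "P1": [12.1, 12.4, 12.0, 12.3],
    "P2": [9.0, 9.2, 8.9, 9.1],
    "P3": [10.5, 10.1, 10.9, 10.3],
}
control = {
    "P1": [10.0, 10.2, 9.9, 10.1],
    "P2": [9.1, 8.9, 9.0, 9.2],
    "P3": [10.4, 10.6, 10.2, 10.5],
}
ranked = rank_differential_proteins(disease, control)
for protein, log2_fc, t in ranked:
    print(f"{protein}: log2FC={log2_fc:+.2f}, t={t:+.2f}")
```

In this toy example only P1 shows a large shift between groups, so it ranks first; the other two proteins differ by far less than their within-group spread.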
More sophisticated methods such as machine learning and network-based approaches are also used. Machine learning methods such as support vector machine (SVM) [63, 64], neural networks [65–69], decision tree [67], random forest [70, 71] and genetic algorithms [72] have been successfully applied to proteomics data to identify biomarkers for several cancer types, heart failure and other conditions. Ahn et al. [73] constructed a 29-plex array platform comprising 29 potential biomarkers associated with gastric adenocarcinoma. A total of 13 candidate biomarkers were selected by a random forest feature selection algorithm. Random forest and SVM were used to classify individuals as patients with gastric adenocarcinoma or controls. The algorithms, tested on an independent blinded set of 95 gastric adenocarcinoma sera and 51 controls, reached a mean accuracy of 89.2 and 85.6%, respectively. Random forest generally outperformed SVM, regardless of stage or tumor size; however, the SVM algorithm performed well for diagnosing small tumors. Rogers et al. [66] trained a neural network on either presence/absence of peaks or peak intensity values in a cohort of patients affected by renal cell carcinoma. Their model reached sensitivity and specificity values of 98.3–100%. However, in an independent validation cohort of 80 cases, the performance was significantly weaker (sensitivities and specificities ranged from 41.0 to 76.6%). This highlights the frequent tendency of machine learning approaches to overfit their functions to the noise inherent to the data set rather than to the signal. Model complexity should therefore be considered carefully, and independent control data sets should always be used, to avoid this issue when using such approaches. High-throughput proteomics data sets are characterized by a high number of variables/features compared with the total number of samples available.
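The overfitting risk described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are pure Gaussian noise, the sample and feature counts are hypothetical, and the classifier is a deliberately simple single-feature threshold rule rather than any of the cited methods. Because the rule is chosen from many noise features using only a few samples, its training accuracy tends to look good while its accuracy on an independent set is expected to stay near chance.

```python
import random

def stump_accuracy(X, y, j, thr, sign):
    """Accuracy of the rule 'predict 1 if sign * (x[j] - thr) > 0'."""
    hits = sum(int(sign * (row[j] - thr) > 0) == label for row, label in zip(X, y))
    return hits / len(y)

def fit_best_stump(X, y):
    """Exhaustive search for the single-feature rule with the highest training accuracy."""
    best = (-1.0, 0, 0.0, 1)  # (accuracy, feature index, threshold, sign)
    for j in range(len(X[0])):
        for row in X:
            for sign in (1, -1):
                acc = stump_accuracy(X, y, j, row[j], sign)
                if acc > best[0]:
                    best = (acc, j, row[j], sign)
    return best

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_features = 20, 500  # hypothetical sizes: far more features than samples

def noise_matrix(n):
    """n samples of pure noise; no feature carries any information about the labels."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]

labels = [i % 2 for i in range(n_samples)]  # balanced labels, unrelated to the features
X_train, X_test = noise_matrix(n_samples), noise_matrix(n_samples)

train_acc, j, thr, sign = fit_best_stump(X_train, labels)
test_acc = stump_accuracy(X_test, labels, j, thr, sign)
print(f"training accuracy: {train_acc:.2f}, independent-set accuracy: {test_acc:.2f}")
```

The same gap between training and independent-set performance is what an independent validation cohort is meant to expose.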
Hence, the input space includes many irrelevant or noisy features, which, coupled with the wide heterogeneity commonly found in biological samples, make it difficult to identify the truly important biomarkers. To tackle this problem, dimensionality reduction methods [74], such as PAM (Prediction Analysis for Microarrays) [75], SVM-RFE (Support Vector Machine-Recursive Feature Elimination) [48], SAM (Significance Analysis of Microarrays) [76], are used, in combination with machine learning methods, to reduce the noise in the data sets. This is achieved by discarding irrelevant features and enhances the generalization and the prediction performance. For reviews of feature selection algorithms, we redirect the reader elsewhere [77–79]. The lack of reproducibility across different data sets, technical issues such as the overfitting problem in machine learning approaches and the intrinsic complexity of human diseases often prevent promising biomarkers from reaching clinical application [80]. A promising idea to improve the reproducibility and the interpretation of the results is to incorporate prior biological knowledge and different high-throughput data sets to facilitate our understanding of biological processes at a mechanistic level. From lists to integrated networks Uncovering the individual mechanisms of disease development and progression in different patients will be key to designing accurate precision therapy strategies. As a first step in that direction, omics data analysis approaches typically attempt to identify affected biological processes and functions [49] by using Gene Ontology [81] or pathway (or other features) enrichment analyses [82] on the differentially regulated entities of each data set (e.g. genes, proteins or phosphopeptides). These differentially regulated entities can also be mapped onto existing interaction networks or pathway maps to provide a better picture of the cell processes affected in a specific sample. 
For example, in the tumor endothelial bidirectional signaling study mentioned above [54], the authors mapped the affected phosphopeptides onto KEGG pathway maps [83] to understand the pathways involved in the transendothelial metastasis of tumors. More recently, a collection of methods, mostly developed for and applied to genomics and transcriptomics data sets, has emerged that also takes into consideration the protein interaction network and pathway structure to identify patient-specific disease-perturbed pathways [84]. The SPIA algorithm (Signalling Pathway Impact Analysis) combines information on the differential expression of genes with their influence in a pathway based on their placement in a pathway topology [85]. HotNet2 [86] and Tied Diffusion Through Interacting Events (TieDIE; [87]) use slightly varied diffusion-based approaches that include a form of random walk and weighting according to the connection strength and network topology to propagate the effect of the perturbation in a given network [88]. There are many other methods available (the most widely used are reviewed in [84]) using, for example, network propagation [89] and clustering [90], current flow through the network [91], random walk [92, 93], pathway models [57] or other approaches for identifying perturbed functional modules or pathways in a network and using these as signatures to stratify patients or differentiate cancer model cell lines (Figure 2).
Figure 2. Different methods used in biomarker discovery. (A) Differentially expressed method, (B) machine learning method, (C) network-based method. DE, differentially expressed; NN, neural network; RF, random forest; DT, decision tree; GA, genetic algorithm; NBS, network-based stratification; RW, random walk.
These concepts and methodologies are also applicable to proteomics data sets; however, there are some issues that should be considered both when using these methods for transcriptomics/genomics data and when attempting to apply them to proteomics and phosphoproteomics data sets. Specifically, most of them tend to use existing interactome data and annotated pathway data, which are currently incomplete and biased toward highly expressed proteins [94–96]. This issue is further exacerbated, when trying to apply them to proteomics data sets, by the fact that these also inherently contain this bias. Moreover, our knowledge of tissue-specific interactions and their rewiring in different cellular states or conditions is currently limited [97, 98]. Such rewiring also occurs in disease and may vary across patients, and therefore, the use of generic networks and pathways for precision medicine applications may not be ideal. Finally, another issue to consider when applying such methods to proteomics and phosphoproteomics data sets is that they tend to have a much smaller coverage of the entire proteome than the corresponding data sets from other omics layers, depending on the instrument or technology used and the dynamic range of the abundances in the sample [99, 100]. It would therefore be useful to develop computational approaches that are tailored specifically to proteomics and phosphoproteomics data sets to account for these associated data characteristics. In the past few years, a number of such methods have been developed. They mainly focus on accurately estimating the activity of diverse kinases in the systems under study to highlight the signaling networks that are active in each context.
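One simple way to estimate kinase activity from phosphoproteomics data is to compare the mean log2 fold change of a kinase's known substrates with the mean over all measured phosphosites. The sketch below computes a z-score style value along these lines. It is an illustrative simplification with hypothetical site names, kinase names and fold changes, not the published implementation of any of the methods cited in this section.

```python
import math
import statistics

def kinase_scores(site_log2fc, kinase_substrates, min_substrates=2):
    """Score each kinase by how far the mean log2 fold change of its
    measured substrates lies from the mean of all measured phosphosites."""
    background = list(site_log2fc.values())
    mean_all = statistics.mean(background)
    sd_all = statistics.pstdev(background)
    if sd_all == 0:
        return {}  # no variation in the data, nothing to score
    scores = {}
    for kinase, substrates in kinase_substrates.items():
        measured = [site_log2fc[s] for s in substrates if s in site_log2fc]
        if len(measured) < min_substrates:
            continue  # too few measured substrates for a meaningful score
        mean_sub = statistics.mean(measured)
        scores[kinase] = (mean_sub - mean_all) * math.sqrt(len(measured)) / sd_all
    return scores

# Hypothetical phosphosite log2 fold changes (treated vs control)
site_log2fc = {"S1": 2.0, "S2": 1.5, "S3": -1.0, "S4": -1.5, "S5": 0.0, "S6": -1.0}
# Hypothetical kinase-substrate annotations; S9 is annotated but not measured
kinase_substrates = {
    "KIN_X": ["S1", "S2"],
    "KIN_Y": ["S3", "S4", "S9"],
    "KIN_Z": ["S5"],
}
scores = kinase_scores(site_log2fc, kinase_substrates)
for kinase, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kinase}: {z:.2f}")
```

A positive score suggests increased activity of the kinase and a negative score decreased activity; kinases with too few measured substrates (here the hypothetical KIN_Z) are skipped, which reflects the limited phosphoproteome coverage discussed above.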
The most widely used method is the kinase–substrate enrichment analysis method [56], which calculates kinase activity based on the differential abundance of the kinase’s known substrates. Other methods include IKAP (inference of kinase activities from phosphoproteomics; [101]), which uses a machine learning approach, KARP (kinase activity ranking using phosphoproteomics data; [102]), which calculates the relative phosphorylation of a kinase’s substrates versus the total phosphorylation in the data, and KinasePA (Perturbation analysis; [103]) and CLUE (CLUster Evaluation; [104]), which require perturbation or time series data. A few different approaches have been extensively benchmarked by Hernandez-Armenta and colleagues [105]. As proteomics and phosphoproteomics data sets provide a direct picture of the cell’s functional state, inclusion of prior knowledge in these methods, such as motifs or interaction interfaces, known enzyme–substrate relationships and effects of mutations on protein structure and function, can also help better understand the effect of perturbations on the functional network. Data integration approaches Despite the wealth of information that proteomics and phosphoproteomics data can provide, they still represent only one layer of cell function and regulation. Thus, to truly understand cell function in-depth, it is critical to consider as many layers of cell function regulation as possible [13]. This is especially true in the context of precision medicine where different layers of cell regulation may be important for each patient, and additional clinical information must also be included in the analysis. Therefore, one major challenge that our community is currently trying to solve is that of effective data integration of the less mature proteomics and phosphoproteomics layers of information with other omics data sets that have been more extensively studied and integrated in recent years.
There is currently no standard or optimal approach to data integration, and several methods have been developed (for reviews, see [106–108]). Here, we will focus on the main approaches used thus far to integrate proteomics data sets with other omics or clinical data. Depending on the data sets they integrate, methods can be divided into homogeneous, where the data sets contain the same type of data but from different sources, and heterogeneous, where multiple data sets with different data types are integrated. These methods can either integrate the layers of information in a step-wise fashion or in a single step to generate an integrated model of the system under study. For example, Drake et al. [109] used a step-wise approach to integrate genomic, transcriptomic and phosphoproteomics data to identify patient-specific networks that are affected in prostate cancer and suggest potential precision treatments for these patients. Specifically, they first used the data sets to broadly identify the pathways, transcription factors and kinases that are likely active in their samples and then applied their diffusion-based algorithm, TieDIE [87], to pinpoint the different functional modules and pathways that are affected in the different patients. In this study, they also showed that the integration of phosphoproteomics was able to uncover pathways that would have otherwise been missed, underlining the importance of including this level of information in precision medicine approaches. By applying this pipeline on three different prostate cell lines as validation, they were able to support their results either through evaluation of their predicted drug sensitivity or through gene essentiality studies. Rudolph and colleagues [110] integrated protein interaction networks with phosphoproteomics data and evolutionary conservation to define signaling functionalities for proteins in a data set and delineate the active signaling pathways in a given phosphoproteomics data set.
A recent systematic search for algorithms to reconstruct signaling pathways from phosphoproteomics [111] has shown that integration with prior knowledge yields the best results. The most promising methods that integrate data sets in a single step include principal component analysis (PCA) [112] (or factor analysis)-based and nonnegative matrix factorization (NMF)-based [113, 114] approaches, as they are able to integrate diverse and large data sets and perform effective dimensionality reduction to allow easy downstream machine learning [115] or network-based [113] analyses and creation of models that represent the system under study. The major issue with PCA-based approaches is the difficulty in interpreting the biological mechanism underlying the different factor associations. Therefore, different supervised [116] or unsupervised [117] approaches can be used to choose the appropriate factors and aid interpretation of the results. These can include implementation of linear discriminant analysis [116], Bayesian classifiers [118], SVMs [119] and K-nearest neighbor [120] approaches after PCA. Liu et al. [118] integrated microRNA, mRNA and proteomics data into a joint matrix. They then used factor analysis and linear discriminant analysis to extract the molecular mechanism of cancer in different cell lines. The integrated approach identified clinically relevant markers and outperformed the analyses performed on the separate data sets. While matrix factorization methods such as NMF and variations have been routinely applied to genomics and transcriptomics data [113, 121, 122], they have only recently been applied to proteomics data sets. For example, Yuan et al. [123] used pairwise NMF between omics data sets and clinical data to study the utility of using these omics data integration approaches in the clinic.
In the subgroups, which they identified by combining proteomics and clinical data, they were able to identify—among other biomarkers and activated pathways—an additional patient subgroup that might also benefit from MEK (Mitogen-activated protein kinase kinase) targeting therapies. A great advantage of matrix factorization approaches for proteomics and phosphoproteomics data sets is that they can also be used to impute missing data points [124]. This can be valuable for these data sets, as they inherently do not provide comprehensive measurements of all the components that might be present in other omics data types such as transcriptomic or genomic data sets. Other approaches for data imputation that can be applied in proteomics and phosphoproteomics data use nonlinear optimization approaches [125, 126]. Another integration approach that has been applied to the proteomics data is based on a multiple extension co-inertia analysis to identify the relationships among different omics data sets. Meng et al. [127], for example, integrated the transcriptome and proteome profiles of cells in the NCI-60 cancer cells. Using the integrated model, they found that the extravasation signaling pathway plays a fundamental role in leukemia; the same pathway was not identified in the single data set analyses. Other than the missing data points that were discussed above, one of the major challenges for integrating proteomics and phosphoproteomics with other omics data sets is the inconsistent annotation and reporting of such data sets and analysis pipelines. This, in combination with the dynamic nature of the proteome and phosphoproteome, can result in the introduction of noise to the integrative models used to study a disease or a patient. 
As unified data collection and standardization processes are being developed for use of these data in the clinic, consistent methods to record the associated meta-data for this information that can be used in conjunction with existing methods for genomic and other omic data sets need to also be developed. In recent years, there have been bioinformatics platforms and methods developed to reduce the variability from the data acquisition and analysis processes. Examples for this are the ProHits [128] and OpenMS [129]. ProHits is a software platform that is used mainly for interaction proteomics and provides a variety of options for data management and analysis that are systematically tracked to ensure the downstream reproducibility of the analysis pipelines. OpenMS is an open-source suite of analysis software for mass spectrometry data allowing the implementation of different pipelines and analyses procedures in a transparent and scalable way. These kinds of platforms ensure the reproducibility of the analyses pipelines. Methods for ensuring reproducibility during data acquisition are also important. For example, the TRIC (Transfer of Identification Confidence; [130]) algorithm, developed for SWATH-MS-targeted proteomics, uses a clever alignment approach to reduce the variability in peak picking and quantification across mass spectrometry runs. Other similar software has been previously compared by Navarro et al. [131]. The inherent variability of proteomics and phosphoproteomics data sets can also be a confounding factor in data integration efforts. It has been shown in single-cell studies that the noise and sample variability significantly decrease when a specific cell response is activated compared with a static state, because of regulatory coordination [132]. Therefore, acquiring nonstatic data points, where possible, will reduce data variability and increase the signal to noise ratio. 
Additionally, single-cell technologies, providing single-cell measurements of protein or phosphoprotein abundance, have the potential to mitigate the data variability issue and improve the use of these data for understanding disease development. From networks to mechanistic models For use of proteomics and phosphoproteomics in the clinic, it is important to provide mechanistic information for a disease beyond the pathways and functional modules that have been affected. Halasz et al. [133] used phosphoproteomics data sets and a probabilistic framework to create a mechanistic and executable model of the rewiring that occurs in signal transduction pathways in cancer cells. They were able to identify a cell line-specific feedback loop for inhibition of IRS1 by p70S6K in colorectal cell lines and to perform stimulations to identify ways to increase their sensitivity to TRIC (TCP-1 ring complex) inhibitors. Eduati et al. [57] used dynamic logic models and phosphoproteomics data to study the colorectal cell line-specific mechanism of drug resistance and identified a novel drug combination that can be used to overcome it. Such models can be invaluable in the clinic not only to understand the mechanism of disease but also to simulate and predict the outcome of a treatment on specific patient groups or even individuals, depending on the available models. Challenges for clinical application While the proteomics and phosphoproteomics layers of functional regulation provide valuable insight into disease development and mechanism, there are still some challenges that need to be tackled before they can be readily applied for stratification of patients, even if the data quality and bioinformatics challenges discussed above are tackled. One of the major challenges is that most current ‘omics’ data analyses provide results that are not readily interpretable or actionable.
For example, while identifying that a handful of pathways are affected in a specific patient subgroup may suggest the administration of specific kinase inhibitors as therapy, it does not necessarily uncover the full mechanism of a disease. There have been successful examples, such as the work of Zeevi and colleagues [134] that used omics data, clinical data and machine learning to devise an actionable change in personalized nutrition to regulate post-meal glucose levels, without an in-depth understanding of the mechanism at play. However, in most situations, lack of mechanistic information regarding a disease’s development, makes it difficult to identify the causal targets for therapy at a reliability level that is appropriate for precision therapies in the clinic. As new methods for proteomics data analysis develop, our community needs to take this into consideration: rather than providing ‘big picture’ representations of affected cell processes in a disease, there is a need for producing reliable ranked targets or biomarkers by probability of being effective [135–137] or ranked testable hypotheses to help decide on one, alongside an easy-to-interpret explanation for their selection. This requires an in-depth understanding of cell processes and their interactions. Recent years have seen the collaborations between computational biologists and clinicians or basic-science biologists dramatically increase, because of the advent of large-scale data sets and systems biology. The importance, however, of understanding basic biological processes in-depth to be able to understand disease mechanisms underlines the need for increased collaboration also between clinicians and basic research scientists. 
Interdisciplinary collaborations, including clinical data to take snapshots of the disease ‘omics’ profile, and iterations of computational analysis and basic biology for in-depth mechanistic studies of relevant cell processes, can lead to a detailed understanding and models of disease development, thus helping better stratify patients according to their disease subtype mechanism and design more knowledge-based treatments. Proteomics and phosphoproteomics data sets, as described above, can provide mechanistic insight into cell processes and are therefore ideal for inclusion in such studies to provide testable mechanistic hypotheses. Of course, the major disadvantage of such three-level approaches is that it takes time to perform in-depth studies of cell processes; however, as our knowledgebase of cell processes, their cross-talk and their role in different diseases increases, this will prove to be a worthwhile investment in the long run and might be the only way to truly achieve the goal of precision medicine across multiple diseases. An additional challenge is presented when associating identified affected cell processes with specific disease phenotypes or clinical data. Currently, most studies use patient survival data as the patient phenotype and associate omics signatures with remission or survival rates [138]. More detailed and standardized phenotyping of patients can provide a better understanding of the causal cell processes of a disease and can improve diagnosis and tracking both of its progression and the effects of treatment and other issues that might affect a patient’s quality of life [138]. As more omics data from patients are being generated, standardized protocols for systematically recording the phenotype of the relevant cells, if possible, and wider availability of in-depth patient clinical characteristics to data scientists beyond survival rate will also provide a significant contribution toward our community’s goal of precision medicine.
Ethical considerations to ensure patient anonymity and privacy also need to be taken into account in the development of these protocols as well as in the process of data sharing [140, 141]. The standardization of analysis pipelines and representation of results also present an issue for the routine application of proteomics protocols in the clinic. Whether the outcome of patient data analysis is the identification of a biomarker or a disease signature, robust quality control and analysis tools need to be readily available to clinicians, as well as accurate protocols for sample acquisition and results interpretation. This is critical to provide reproducible, high-quality precision care for patients across different hospitals and treatment centers. Proteomics and phosphoproteomics-specific data analysis pipelines have only recently started to be systematically developed and included in precision medicine studies [109]. Therefore, as the field matures, we expect to see significant progress in their standardized use across laboratories, institutes and eventually in the clinics. Future directions Proteomics and phosphoproteomics have recently emerged as a new layer of patient omics information in the field of precision medicine. Technological advancements and community efforts to standardize protocols and achieve robust and reproducible results [24–30] have contributed greatly to the utility of this data type in large-scale studies of disease and patient stratification. Their major strength lies in the fact that they give a picture of the actual workforce of the cell and are thus highly suited for studying the mechanism of disease development and progression.
Other than the data reproducibility issue that the community is now efficiently tackling, one of the main challenges from a bioinformatics perspective that still prevents the wide-spread use of proteomics and phosphoproteomics data is the need for effective, data type-specific methods to extract the valuable knowledge it encodes and to integrate it efficiently with other large-scale data sets and prior knowledge. There are significant efforts made in this direction, and as the field matures, and more PTMs are also included, we expect it to provide great insight into the development of disease and help improve stratification of patients and design of precision approaches to their treatment and monitoring. Additionally, proteomics and phosphoproteomics data, like transcriptomics, encode highly dynamic information. Therefore, to accurately highlight differences in disease mechanisms and functional networks, and to reduce data variation, it is optimal to collect data sets on stimulation or perturbation rather than in a static state. This is currently impractical in a clinical setting, where we rely on a single sample from a usually untreated patient, but it could prove useful when performing, for example, window of opportunity trials where novel drugs are tested on patients before the standard treatment to evaluate their effect on untreated individuals [142, 143]. Currently, the bulk of population-level omics data is being collected to study cancer for precision oncology applications. Clearly, for precision medicine to become widely applicable more focus should be placed on characterizing also other diseases and their subtypes. 
These cancer studies, nevertheless, provide a unique learning opportunity for our community: we can use this rich data set to define what is the best way to maximize the orthogonal information we acquire from all these different omics layers, to estimate how many data sets are sufficient for characterizing a disease and potentially to identify the minimal components that one needs to measure in a cell to get the global signaling, gene regulation and metabolic status from a sample. From a proteomics perspective, such information can dramatically reduce the cost and variability of a study, making it even more applicable for clinical applications, for example through an educated design of targeted proteomics or phosphoproteomics approaches. Of the drugs that are tested in clinical trials only 1 in 10 successfully go to the market [144]. This presents a huge financial burden for the pharmaceutical companies and the public. Bioinformatics approaches that effectively integrate omics data with in-depth clinical data can help guide many aspects of clinical trials to improve the chances of their success (for a recent review, see [145]): analysis of patients’ omics data can help to guide the selection of targets and associated drugs and the appropriate group to which a drug can be administered with improved chance of success. Bioinformatics data storage and automated analysis pipelines can also make this knowledge available to future studies. At later stages, side effects or outcomes of the trial can be associated with specific molecular signatures in the patients to understand their mechanisms and design approaches to circumvent them. Indeed, these methods are already in use, and there are already guidelines in place to guide the design of clinical trials using omics data sets [146]. 
Thus, as an increasing number of clinical records and associated omics data sets become available to scientists, bioinformatics approaches will play an important role in guiding clinical trials with an increased success rate. In an ideal precision medicine scenario, we would be able to create a widely used and robust clinical tool that can guide doctors with respect to the data required from a patient to provide their subdisease mechanism and guide the choice of therapy and monitoring. While we are several decades away from such a tool, and indeed from widespread use of any precision medicine approaches at all, it is nevertheless becoming increasingly clear that understanding at the molecular level and creating dynamic mechanistic models of cell functions during disease development and progression are critical for the success of precision medicine. Precision medicine for all is still a long-term goal for our community. However, the field is rapidly progressing, and it certainly does not seem as far-fetched as it did 10 years ago. Even not taking into consideration the improvement in global quality of life, studies have demonstrated the cost–benefit of applying such approaches in the clinic [147]. Programs such as the St Jude Children’s Research Hospital Pharmacogenomics of Anticancer Agents Research 4Kids (PG4Kids) program [148] and the Icahn School of Medicine at Mount Sinai Clinical Implementation of Personalized Medicine through Electronic Health Records and Genomics-Pharmacogenomics (CLIPMERGE PGx) program [149] can provide valuable knowledge regarding the practical prerequisites for real-life precision medicine implementation. Additionally, exciting developments in preclinical studies include the use of patient-derived xenograft mouse models of disease (e.g. at the Jackson Laboratory) for testing precision therapies.
We expect current and future advances in proteomics and phosphoproteomics data collection and analysis to greatly improve our understanding of disease development and progression, also contributing to improved implementation of precision medicine in real-world applications.
Key Points
Precision medicine aims to tailor diagnostic, therapeutic and monitoring approaches to specific patient subgroups.
Proteomics and phosphoproteomics data sets can provide mechanistic insight into disease development and are thus valuable for precision medicine approaches.
Major challenges presented by these data include the lack of data robustness and standardization as well as the limited proteome and phosphoproteome coverage.
Methods that are developed specifically for these data types as well as their effective integration with other data sets can mitigate the issues.
Funding
Funding for open access publication fees was provided by the European Molecular Biology Laboratory.
Girolamo Giudice has been a postdoc in the Petsalaki group at the EMBL-EBI since April 2017, working on the development of methods for the characterization of context-specific signaling networks. He acquired his PhD in Biomedicine from the University Autónoma de Madrid in March 2017.
Evangelia Petsalaki has been a Group Leader at the EMBL-EBI since February 2017, studying cell signaling using whole cell models. She did her post doc (2010–16) at the LTRI in Toronto with Tony Pawson and Fritz Roth and her PhD with Rob Russell at the EMBL in Heidelberg (2009).
References
1 Huang BE, Mulyasasmita W, Rajagopal G. The path from big data to precision medicine. Expert Rev Precis Med Drug Dev 2016;1(2):129–43.
2 Chen R, Mias G, Li-Pook-Than IJ, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148(6):1293–307.
3 National Research Council (US) Committee on A Framework for Developing a New Taxonomy of Disease. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. Washington, DC: National Academies Press, 2011.
4 Rudin S, Marable M, Huang RS. The promise of pharmacogenomics in reducing toxicity during acute lymphoblastic leukemia maintenance treatment. Genomics Proteomics Bioinformatics 2017;15(2):82–93.
5 Drew L. Pharmacogenetics: the right drug for you. Nature 2016;537(7619):S60–2.
6 Perou CM, Sørlie T, Eisen MB, et al. Molecular portraits of human breast tumours. Nature 2000;406(6797):747–52.
7 Curtis C, Shah SP, Chin SF, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 2012;486(7403):346–52.
8 Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 2012;490(7418):61–70.
9 Kwa M, Makris A, Esteva FJ. Clinical utility of gene-expression signatures in early stage breast cancer. Nat Rev Clin Oncol 2017;14:595–610.
10 Nik-Zainal S, Davies H, Staaf J, et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 2016;534(7605):47–54.
11 Byron SA, Van Keuren-Jensen KR, Engelthaler DM, et al. Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet 2016;17(5):257–71.
12 Schwaederle M, Zhao M, Lee JJ, et al. Association of biomarker-based treatment strategies with response rates and progression-free survival in refractory malignant neoplasms: a meta-analysis. JAMA Oncol 2016;2(11):1452–9.
13 Kitano H. Systems biology: a brief overview. Science 2002;295(5560):1662–4.
14 Kronfol MM, Dozmorov MG, Huang R, et al. The role of epigenomics in personalized medicine. Expert Rev Precis Med Drug Dev 2017;2(1):33–45.
15 Clish CB. Metabolomics: an emerging but powerful tool for precision medicine. Cold Spring Harb Mol Case Stud 2015;1(1):a000588.
16 Tchourine K, Poultney CS, Wang L, et al. One third of dynamic protein expression profiles can be predicted by simple rate equations. Mol Biosyst 2014;10(11):2850–62.
17 Vogel C, Abreu RS, Ko D, et al. Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line. Mol Syst Biol 2010;6(1):400.
18 Gholami AM, Moghaddas A, Hahne H, et al. Global proteome analysis of the NCI-60 cell line panel. Cell Rep 2013;4(3):609–20.
19 Bell AW, Deutsch EW, Au CE, et al. A HUPO test sample study reveals common problems in mass spectrometry–based proteomics. Nat Methods 2009;6(6):423–30.
20 Nilsson T, Mann M, Aebersold R, et al. Mass spectrometry in high-throughput proteomics: ready for the big time. Nat Methods 2010;7(9):681–5.
21 Zhou L, Wang K, Li Q, et al. Clinical proteomics-driven precision medicine for targeted cancer therapy: current overview and future perspectives. Expert Rev Proteomics 2016;13(4):367–81.
22 Guerin M, Gonçalves A, Toiron Y, et al. How may targeted proteomics complement genomic data in breast cancer? Expert Rev Proteomics 2017;14(1):43–54.
23 Mitchell P. Proteomics retrenches. Nat Biotechnol 2010;28(7):665–70.
24 Searle BC. Scaffold: a bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics 2010;10(6):1265–9.
25 Varjosalo M, Sacco R, Stukalov A, et al. Interlaboratory reproducibility of large-scale human protein-complex analysis by standardized AP-MS. Nat Methods 2013;10(4):307–14.
26 Mann M. Comparative analysis to guide quality improvements in proteomics. Nat Methods 2009;6(10):717–9.
27 Stead DA, Paton NW, Missier P, et al. Information quality in proteomics. Brief Bioinform 2008;9(2):174–88.
28 Tabb DL. Quality assessment for clinical proteomics. Clin Biochem 2013;46(6):411–20.
29 Wang X. Statistical assessment of QC metrics on raw LC-MS/MS data. In: Comai L, Katz JE, Mallick P (eds), Proteomics. Springer New York, 2017, pp. 325–37.
30 Whiteaker JR, Halusa GN, Hoofnagle AN, et al. Using the CPTAC Assay Portal to identify and implement highly characterized targeted proteomics assays. Methods Mol Biol 2016;1410:223–36.
31 Sharma K, D’Souza RCJ, Tyanova S, et al.
Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling . Cell Rep 2014 ; 8 ( 5 ): 1583 – 94 . Google Scholar Crossref Search ADS PubMed WorldCat 32 Hebert AS , Richards AL, Bailey DJ, et al. The one hour yeast proteome . Mol Cell Proteomics 2014 ; 13 ( 1 ): 339 – 47 . Google Scholar Crossref Search ADS PubMed WorldCat 33 Edwards NJ , Oberti M, Thangudu RR, et al. The CPTAC data portal: a resource for cancer proteomics research . J Proteome Res 2015 ; 14 ( 6 ): 2707 – 13 . Google Scholar Crossref Search ADS PubMed WorldCat 34 Elschenbroich S , Kislinger T. Targeted proteomics by selected reaction monitoring mass spectrometry: applications to systems biology and biomarker discovery . Mol Biosyst 2011 ; 7 ( 2 ): 292 – 303 . Google Scholar Crossref Search ADS PubMed WorldCat 35 Collins BC , Hunter CL, Liu Y, et al. Multi-laboratory assessment of reproducibility, qualitative and quantitative performance of SWATH-mass spectrometry . Nat Commun 2017 ; 8 : 291 . Google Scholar Crossref Search ADS PubMed WorldCat 36 Jain K. Role of pharmacoproteomics in the development of personalized medicine . Pharmacogenomics 2004 ; 5 ( 3 ): 331 – 6 . Google Scholar Crossref Search ADS PubMed WorldCat 37 Duarte TT , Spencer CT. Personalized proteomics: the future of precision medicine . Proteomes 2016 ; 4 ( 4 ): 29 . Google Scholar Crossref Search ADS PubMed WorldCat 38 Choudhary C , Mann M. Decoding signalling networks by mass spectrometry-based proteomics . Nat Rev Mol Cell Biol 2010 ; 11 ( 6 ): 427 – 39 . Google Scholar Crossref Search ADS PubMed WorldCat 39 Casado P , et al. Impact of phosphoproteomics in the translation of kinase-targeted therapies . Proteomics 2017 ; 17 ( 6 ): 1600235 . Google Scholar Crossref Search ADS WorldCat 40 Cutillas PR. Role of phosphoproteomics in the development of personalized cancer therapies . Proteomics Clin Appl 2015 ; 9 ( 3–4 ): 383 – 95 . 
Google Scholar Crossref Search ADS PubMed WorldCat 41 Kulasingam V , Diamandis EP. Strategies for discovering novel cancer biomarkers through utilization of emerging technologies . Nat Rev Clin Oncol 2008 ; 5 ( 10 ): 588 – 99 . Google Scholar Crossref Search ADS WorldCat 42 Kienzl-Wagner K , Pratschke J, Brandacher G. Proteomics—a blessing or a curse? Application of proteomics technology to transplant medicine . Transplantation 2011 ; 92 ( 5 ): 499 – 509 . Google Scholar Crossref Search ADS PubMed WorldCat 43 Papsidero LD , Wang MC, Valenzuela LA, et al. A prostate antigen in sera of prostatic cancer patients | cancer research . Cancer Res 1980 ; 40 ( 7 ): 2428 – 32 . Google Scholar PubMed OpenURL Placeholder Text WorldCat 44 Ilyin SE , Belkowski SM, Plata-Salamán CR. Biomarker discovery and validation: technologies and integrative approaches . Trends Biotechnol 2004 ; 22 ( 8 ): 411 – 6 . Google Scholar Crossref Search ADS PubMed WorldCat 45 Thompson IM , Pauler DK, Goodman PJ, et al. Prevalence of prostate cancer among men with a prostate-specific antigen level ≤4.0 ng per milliliter . N Engl J Med 2004 ; 350 ( 22 ): 2239 – 46 Google Scholar Crossref Search ADS PubMed WorldCat 46 Catalona WJ , Smith DS, Ratliff TL, et al. Measurement of prostate-specific antigen in serum as a screening test for prostate cancer . N Engl J Med 1991 ; 324 ( 17 ): 1156 – 61 . Google Scholar Crossref Search ADS PubMed WorldCat 47 Petricoin EF , Belluco C, Araujo RP, et al. The blood peptidome: a higher dimension of information content for cancer biomarker discovery . Nat Rev Cancer 2006 ; 6 ( 12 ): 961 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 48 Guyon I , Weston J, Barnhill S, et al. Gene selection for cancer classification using support vector machines—Kernel Machines . Mach Learn 2002 ; 46 : 389 – 422 . Google Scholar Crossref Search ADS WorldCat 49 Pozniak Y , Balint-Lahat N, Rudolph JD, et al. 
System-wide clinical proteomics of breast cancer reveals global remodeling of tissue homeostasis . Cell Syst 2016 ; 2 ( 3 ): 172 – 84 . Google Scholar Crossref Search ADS PubMed WorldCat 50 ZhangWang B , Wang JX, et al. Proteogenomic characterization of human colon and rectal cancer . Nature 2014 ; 513 ( 7518 ): 382 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 51 Yang J-Y , Yoshihara K, Tanaka K, et al. Predicting time to ovarian carcinoma recurrence using protein markers . J Clin Invest 2013 ; 123 ( 9 ): 3740 – 50 . Google Scholar Crossref Search ADS PubMed WorldCat 52 Parker R , Vella LJ, Xavier D, et al. Phosphoproteomic analysis of cell-based resistance to BRAF inhibitor therapy in melanoma . Front Oncol 2015 ; 5 : 95 . Google Scholar PubMed OpenURL Placeholder Text WorldCat 53 Wei W , Shin YS, Xue M, et al. Single-cell phosphoproteomics resolves adaptive signaling dynamics and informs targeted combination therapy in glioblastoma . Cancer Cell 2016 ; 29 ( 4 ): 563 – 73 . Google Scholar Crossref Search ADS PubMed WorldCat 54 Locard-Paulet M , Lim L, Veluscek G, et al. Phosphoproteomic analysis of interacting tumor and endothelial cells identifies regulatory mechanisms of transendothelial migration . Sci Signal 2016 ; 9 ( 414 ): ra15 . Google Scholar Crossref Search ADS PubMed WorldCat 55 Casado P , Alcolea MP, Iorio F, et al. Phosphoproteomics data classify hematological cancer cell lines according to tumor type and sensitivity to kinase inhibitors . Genome Biol 2013 ; 14 ( 4 ): R37 . Google Scholar Crossref Search ADS PubMed WorldCat 56 Casado P , Rodriguez-Prados J-C, Cosulich SC, et al. Kinase-substrate enrichment analysis provides insights into the heterogeneity of signaling pathway activation in leukemia cells . Sci Signal 2013 ; 6 ( 268 ): rs6 . Google Scholar Crossref Search ADS PubMed WorldCat 57 Eduati F , Doldàn-Martelli V, Klinger B, et al. 
Drug resistance mechanisms in colorectal cancer dissected with cell type–specific dynamic logic models . Cancer Res 2017 ; 77 ( 12 ): 3364 – 3375 . Google Scholar Crossref Search ADS PubMed WorldCat 58 Kuo K-K , Kuo C-J, Chiu C-Y, et al. Quantitative proteomic analysis of differentially expressed protein profiles involved in pancreatic ductal adenocarcinoma . Pancreas 2016 ; 45 ( 1 ): 71 – 83 . Google Scholar Crossref Search ADS PubMed WorldCat 59 Chung JC , Oh MJ, Choi SH, et al. Proteomic analysis to identify biomarker proteins in pancreatic ductal adenocarcinoma . ANZ J Surg 2008 ; 78 ( 4 ): 245 – 251 . Google Scholar Crossref Search ADS PubMed WorldCat 60 Xiao H , Zhang Y, Kim Y, et al. Differential proteomic analysis of human saliva using tandem mass tags quantification for gastric cancer detection . Sci Rep 2016 ; 6 ( 1 ): 22165 . Google Scholar Crossref Search ADS PubMed WorldCat 61 Beretov J , Wasinger VC, Millar EKA, et al. Proteomic analysis of urine to identify breast cancer biomarker candidates using a label-free LC-MS/MS approach . PLoS One 2015 ; 10 ( 11 ): e0141876 . Google Scholar Crossref Search ADS PubMed WorldCat 62 Kimura Y , Yanagimachi M, Ino Y, et al. Identification of candidate diagnostic serum biomarkers for Kawasaki disease using proteomic analysis . Sci Rep 2017 ; 7 : 43732 . Google Scholar Crossref Search ADS PubMed WorldCat 63 Willingale R , Jones DJL, Lamb JH, et al. Searching for biomarkers of heart failure in the mass spectra of blood plasma . Proteomics 2006 ; 6 ( 22 ): 5903 – 5914 . Google Scholar Crossref Search ADS PubMed WorldCat 64 Siebert S , Porter D, Paterson C, et al. Urinary proteomics can define distinct diagnostic inflammatory arthritis subgroups . Sci Rep 2017 ; 7 : 40473 . Google Scholar Crossref Search ADS PubMed WorldCat 65 ZhangChen F , Wang JM, et al. A neural network approach to multi-biomarker panel discovery by high-throughput plasma proteomics profiling of breast cancer . BMC Proc 2013 ; 7 : S10 . 
Google Scholar OpenURL Placeholder Text WorldCat 66 Rogers M , Clarke A, Noble PJ, et al. Proteomic profiling of urinary proteins in renal cancer by surface enhanced laser desorption ionization and neural-network analysis . Cancer Res 2003 ; 63 ( 20 ): 6971 – 83 . Google Scholar PubMed OpenURL Placeholder Text WorldCat 67 Chen Y , Zheng S, Yu J, et al. Artificial neural networks analysis of surface-enhanced laser desorption/ionization mass spectra of serum protein pattern distinguishes colorectal cancer from healthy population . Clin Cancer Res 2004 ; 10 ( 24 ): 8380 – 85 . Google Scholar Crossref Search ADS PubMed WorldCat 68 Luk JM , Lam BY, Lee NPY, et al. Artificial neural networks and decision tree model analysis of liver cancer proteomes . Biochem Biophys Res Commun 2007 ; 361 ( 1 ): 68 – 73 . Google Scholar Crossref Search ADS PubMed WorldCat 69 Ward DG , Suggett N, Cheng Y, et al. Identification of serum biomarkers for colon cancer by proteomic analysis . Br J Cancer 2006 ; 94 ( 12 ): 1898 – 905 . Google Scholar Crossref Search ADS PubMed WorldCat 70 Bouwman FG , de Roos B, Rubio-Aliaga I, et al. 2D-electrophoresis and multiplex immunoassay proteomic analysis of different body fluids and cellular components reveal known and novel markers for extended fasting . BMC Med Genomics 2011 ; 4 ( 1 ): 24 . Google Scholar Crossref Search ADS PubMed WorldCat 71 Ostroff RM , Mehan M, Stewart RA, et al. Early detection of malignant pleural mesothelioma in asbestos-exposed individuals with a noninvasive proteomics-based surveillance tool . PLoS One 2012 ; 7 : e46091 . Google Scholar Crossref Search ADS PubMed WorldCat 72 Petricoin EF , Ardekani AM, Hitt BA, et al. Use of proteomic patterns in serum to identify ovarian cancer . Lancet 2002 ; 359 ( 9306 ): 572 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 73 Ahn HS , Shin YS, Park PJ, et al. Serum biomarker panels for the diagnosis of gastric adenocarcinoma . Br J Cancer 2012 ; 106 ( 4 ): 733 – 9 . 
Google Scholar Crossref Search ADS PubMed WorldCat 74 Tan CS , Ploner A, Quandt A, et al. Finding regions of significance in SELDI measurements for identifying protein biomarkers . Bioinformatics 2006 ; 22 ( 12 ): 1515 – 23 . Google Scholar Crossref Search ADS PubMed WorldCat 75 Tibshirani R , Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression . Proc Natl Acad Sci USA 2002 ; 99 ( 10 ): 6567 – 72 . Google Scholar Crossref Search ADS PubMed WorldCat 76 Tusher VG , Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response . Proc Natl Acad Sci USA 2001 ; 98 ( 9 ): 5116 – 21 . Google Scholar Crossref Search ADS PubMed WorldCat 77 Saeys Y , Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics . Bioinformatics 2007 ; 23 ( 19 ): 2507 – 17 . Google Scholar Crossref Search ADS PubMed WorldCat 78 Ressom HW , Varghese RS, Zhang Z, et al. Classification algorithms for phenotype prediction in genomics and proteomics . Front Biosci J Virtual Libr 2008 ; 13 ( 13 ): 691 – 708 . Google Scholar Crossref Search ADS WorldCat 79 Guyon I , Elisseeff A. An introduction to variable and feature selection . J Mach Learn Res 2003 ; 3 : 1157 – 82 . Google Scholar OpenURL Placeholder Text WorldCat 80 Check E. Proteomics and cancer: running before we can walk? . Nature 2004 ; 429 ( 6991 ): 496 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 81 Ashburner M , Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology . Nat Genet 2000 ; 25 ( 1 ): 25 – 29 . Google Scholar Crossref Search ADS PubMed WorldCat 82 Subramanian A , Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles . Proc Natl Acad Sci USA 2005 ; 102 ( 43 ): 15545 – 50 . Google Scholar Crossref Search ADS PubMed WorldCat 83 Kanehisa M , Furumichi M, Tanabe M, et al. 
KEGG: new perspectives on genomes, pathways, diseases and drugs . Nucleic Acids Res 2017 ; 45 ( D1 ): D353 – 61 . Google Scholar Crossref Search ADS PubMed WorldCat 84 Creixell P , Reimand J, Haider S, et al. Pathway and network analysis of cancer genomes . Nat Methods 2015 ; 12 ( 7 ): 615 – 21 . Google Scholar Crossref Search ADS PubMed WorldCat 85 Tarca AL , Laurentiu A, Draghici S, et al. A novel signaling pathway impact analysis . Bioinformatics 2009 ; 25 ( 1 ): 75 – 82 . Google Scholar Crossref Search ADS PubMed WorldCat 86 Leiserson MD , Vandin MF, Wu H-T, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes . Nat Genet 2015 ; 47 ( 2 ): 106 – 14 . Google Scholar Crossref Search ADS PubMed WorldCat 87 Paull EO , Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE) . Bioinformatics 2013 ; 29 ( 21 ): 2757 – 64 . Google Scholar Crossref Search ADS PubMed WorldCat 88 CowenIdeker LT , Raphael BJ, et al. Network propagation: a universal amplifier of genetic associations . Nat Rev Genet 2017 ; 18 : 551 – 62 . Google Scholar Crossref Search ADS PubMed WorldCat 89 Vanunu O , Magger O, Ruppin E, et al. Associating genes and protein complexes with disease via network propagation . PLoS Comput Biol 2010 ; 6 ( 1 ): e1000641 . Google Scholar Crossref Search ADS PubMed WorldCat 90 Hofree M , Shen JP, Carter H, et al. Network-based stratification of tumor mutations . Nat Methods 2013 ; 10 ( 11 ): 1108 – 15 . Google Scholar Crossref Search ADS PubMed WorldCat 91 Kim Y-A , Wuchty S, Przytycka TM, Covert MW. Identifying causal genes and dysregulated pathways in complex diseases . PLoS Comput Biol 2011 ; 7 ( 3 ): e1001095 . Google Scholar Crossref Search ADS PubMed WorldCat 92 Köhler S , Bauer S, Horn D, Robinson PN. Walking the Interactome for prioritization of candidate disease genes . 
Am J Hum Genet 2008 ; 82 ( 4 ): 949 – 58 . Google Scholar Crossref Search ADS PubMed WorldCat 93 Navlakha S , Kingsford C. The power of protein interaction networks for associating genes with diseases . Bioinformatics 2010 ; 26 ( 8 ): 1057 – 63 . Google Scholar Crossref Search ADS PubMed WorldCat 94 Hakes L , Pinney JW, Robertson DL, et al. Protein-protein interaction networks and biology—what’s the connection? Nat Biotechnol 2008 ; 26 ( 1 ): 69 – 72 . Google Scholar Crossref Search ADS PubMed WorldCat 95 Müller T , Schrötter A, Loosse C, et al. Sense and nonsense of pathway analysis software in proteomics . J Proteome Res 2011 ; 10 ( 12 ): 5398 – 408 . Google Scholar Crossref Search ADS PubMed WorldCat 96 Soh D , Dong D, Guo Y, et al. Consistency, comprehensiveness, and compatibility of pathway databases . BMC Bioinformatics 2010 ; 11 : 449 . Google Scholar Crossref Search ADS PubMed WorldCat 97 Yeger-Lotem E , Sharan R. Human protein interaction networks across tissues and diseases . Front. Genet 2015 ; 6 : 257 . Google Scholar Crossref Search ADS PubMed WorldCat 98 Bossi A , Lehner B. Tissue specificity and the human protein interaction network . Mol Syst Biol 2009 ; 5 : 260 . Google Scholar Crossref Search ADS PubMed WorldCat 99 Meyer B , Papasotiriou DG, Karas M. 100% protein sequence coverage: a modern form of surrealism in proteomics . Amino Acids 2011 ; 41 ( 2 ): 291 – 310 . Google Scholar Crossref Search ADS PubMed WorldCat 100 Reinders J , Lewandrowski U, Moebius J, et al. Challenges in mass spectrometry-based proteomics . Proteomics 2004 ; 4 ( 12 ): 3686 – 703 . Google Scholar Crossref Search ADS PubMed WorldCat 101 Mischnik M , Sacco F, Cox J, et al. IKAP: a heuristic framework for inference of kinase activities from Phosphoproteomics data . Bioinformatics 2016 ; 32 ( 3 ): 424 – 31 . Google Scholar Crossref Search ADS PubMed WorldCat 102 Wilkes EH , Casado P, Rajeeve V, et al. 
Kinase activity ranking using phosphoproteomics data (KARP) quantifies the contribution of protein kinases to the regulation of cell viability . Mol Cell Proteom 2017 ; 16 ( 9 ): 1694 – 704 . Google Scholar Crossref Search ADS WorldCat 103 Yang P , Patrick E, Humphrey SJ, et al. KinasePA: Phosphoproteomics data annotation using hypothesis driven kinase perturbation analysis . Proteomics 2016 ; 16 ( 13 ): 1868 – 71 . Google Scholar Crossref Search ADS PubMed WorldCat 104 Yang P , Zheng X, Jayaswal V, et al. Knowledge-based analysis for detecting key signaling events from time-series phosphoproteomics data . PLoS Comput Biol 2015 ; 11 ( 8 ): e1004403 . Google Scholar Crossref Search ADS PubMed WorldCat 105 Hernandez-Armenta C , Ochoa D, Gonçalves E, et al. Benchmarking substrate-based kinase activity inference using phosphoproteomic data . Bioinformatics 2017 ; 33 : 1845 – 51 . Google Scholar Crossref Search ADS PubMed WorldCat 106 Ritchie MD , Holzinger ER, Li R, et al. Methods of integrating data to uncover genotype-phenotype interactions . Nat Rev Genet 2015 ; 16 ( 2 ): 85 – 97 . Google Scholar Crossref Search ADS PubMed WorldCat 107 Bersanelli M , Mosca E, Remondini D, et al. Methods for the integration of multi-omics data: mathematical aspects . BMC Bioinformatics 2016 ; 17 (Suppl 2): 15 . Google Scholar Crossref Search ADS PubMed WorldCat 108 Ruggles KV , Krug K, Wang X, et al. Methods, tools and current perspectives in proteogenomics . Mol Cell Proteomics 2017 ; 16 ( 6 ): 959 – 81 . Google Scholar Crossref Search ADS PubMed WorldCat 109 Drake JM , Paull EO, Graham NA, et al. Phosphoproteome integration reveals patient-specific networks in prostate cancer . Cell 2016 ; 166 ( 4 ): 1041 – 54 . Google Scholar Crossref Search ADS PubMed WorldCat 110 Rudolph JD , de Graauw M, van de Water B, et al. Elucidation of signaling pathways from large-scale phosphoproteomic data using protein interaction networks . Cell Syst 2016 ; 3 ( 6 ): 585 – 593.e3 . 
Google Scholar Crossref Search ADS PubMed WorldCat 111 Hill SM , et al. Inferring causal molecular networks: empirical assessment through a community-based effort . Nat Methods 2016 ; 13 ( 4 ): 310 – 18 . Google Scholar Crossref Search ADS PubMed WorldCat 112 Jolliffe IT. Principal Component Analysis , 2nd edn. New York: Springer-Verlag New York, Inc., 2002 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC 113 Vaske CJ , Benz SC, Sanborn Z, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM . Bioinformatics 2010 ; 26 ( 12 ): i237 – 45 . Google Scholar Crossref Search ADS PubMed WorldCat 114 Žitnik M , Zupan B. Data fusion by matrix factorization . IEEE Trans Pattern Anal Mach Intell 2015 ; 37 ( 1 ): 41 – 53 . Google Scholar Crossref Search ADS PubMed WorldCat 115 Fusi N , Elibol HM, Probabilistic Matrix Factorization for Automated Machine Learning. arXiv preprint arXiv:1705.05355 2017 . 116 Liu Y , Devescovi V, Chen S, et al. Multilevel omic data integration in cancer cell lines: advanced annotation and emergent properties . BMC Syst Biol 2013 ; 7 : 14 . Google Scholar Crossref Search ADS PubMed WorldCat 117 Shen R , Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lunch cancer subtype analysis . Bioinformatics 2009 ; 25 ( 22 ): 2906 – 12 . Google Scholar Crossref Search ADS PubMed WorldCat 118 Persson O , Krogh M, Saal LH, et al. Microarray analysis of gliomas reveals chromosomal position-associated gene expression patterns and identifies potential immunotherapy targets . J Neurooncol 2007 ; 85 ( 1 ): 11 – 24 . Google Scholar Crossref Search ADS PubMed WorldCat 119 Furey TS , Cristianini N, Duffy N, D, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data . Bioinformatics 2000 ; 16 ( 10 ): 906 – 14 . 
Google Scholar Crossref Search ADS PubMed WorldCat 120 Theilhaber J , Connolly T, Roman-Roman S, et al. Finding genes in the C2C12 osteogenic pathway by k-nearest-neighbor classification of expression data . Genome Res 2002 ; 12 ( 1 ): 165 – 76 . Google Scholar Crossref Search ADS PubMed WorldCat 121 Alexandrov LB , Nik-Zainal S, Wedge DC, et al. Deciphering signatures of mutational processes operative in human cancer . Cell Rep 2013 ; 3 ( 1 ): 246 – 59 . Google Scholar Crossref Search ADS PubMed WorldCat 122 Zhang S , Liu C-C, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data . Nucleic Acids Res 2012 ; 40 ( 19 ): 9379 – 91 . Google Scholar Crossref Search ADS PubMed WorldCat 123 Yuan Y , Van Allen EM, Omberg L, et al. Assessing the clinical utility of cancer genomic and proteomic data across tumor types . Nat Biotechnol 2014 ; 32 ( 7 ): 644 – 52 . Google Scholar Crossref Search ADS PubMed WorldCat 124 Li Y , Ngom A. The non-negative matrix factorization toolbox for biological data mining . Source Code Biol Med 2013 ; 8 ( 1 ): 10 . Google Scholar Crossref Search ADS PubMed WorldCat 125 Torres-García W , Zhang W, Runger GC, et al. Integrative analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: a non-linear model to predict abundance of undetected proteins . Bioinformatics 2009 ; 25 ( 15 ): 1905 – 14 . Google Scholar Crossref Search ADS PubMed WorldCat 126 Li F , Nie L, Wu G, et al. Prediction and characterization of missing proteomic data in Desulfovibrio vulgaris . Comp Funct Genomics 2011 ; 2011 : 78073 . Google Scholar Crossref Search ADS WorldCat 127 Meng C , Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets . BMC Bioinformatics 2014 ; 15 : 162 , Google Scholar Crossref Search ADS PubMed WorldCat 128 Liu G , Zhang J, Larsen B, et al. ProHits: an integrated software platform for mass spectrometry-based interaction proteomics . 
Nat Biotechnol 2010 ; 28 ( 10 ): 1015 – 17 . Google Scholar Crossref Search ADS PubMed WorldCat 129 Pfeuffer J , et al. OpenMS – A platform for reproducible analysis of mass spectrometry data . J Biotechnol 2017 ; 261 : 142 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 130 Röst HL , Liu Y, D'Agostino G, et al. TRIC: an automated alignment strategy for reproducible protein quantification in targeted proteomics . Nat Methods 2016 ; 13 ( 9 ): 777 – 83 . Google Scholar Crossref Search ADS PubMed WorldCat 131 Navarro P , Kuharev J, Gillet LC, et al. A multi-center study benchmarks software tools for label-free proteome quantification . Nat Biotechnol 2016 ; 34 ( 11 ): 1130 . Google Scholar Crossref Search ADS PubMed WorldCat 132 Martinez-Jimenez C , Pilar P, Eling CN, et al. Aging increases cell-to-cell transcriptional variability upon immune stimulation . Science 2017 ; 355 ( 6332 ): 1433 – 6 . Google Scholar Crossref Search ADS PubMed WorldCat 133 Halasz M , Kholodenko BN, Kolch W, et al. Integrating network reconstruction with mechanistic modeling to predict cancer therapies . Sci Signal 2016 ; 9 ( 455 ): ra114 . Google Scholar Crossref Search ADS PubMed WorldCat 134 Zeevi D , Korem T, Zmora N, et al. Personalized nutrition by prediction of glycemic responses . Cell 2015 ; 163 ( 5 ): 1079 – 94 . Google Scholar Crossref Search ADS PubMed WorldCat 135 Katsila T , Spyroulias GA, Patrinos GP, Matsoukas M-T. Computational approaches in target identification and drug discovery . Comput Struct Biotechnol J 2016 ; 14 : 177 – 84 . May Google Scholar Crossref Search ADS PubMed WorldCat 136 Kramer R , Cohen D. Functional genomics to new drug targets . Nat Rev Drug Discov 2004 ; 3 ( 11 ): 965 – 72 . Google Scholar Crossref Search ADS PubMed WorldCat 137 Terstappen GC , Schlüpen C, Raggiaschi R, Gaviraghi G. Target deconvolution strategies in drug discovery . Nat Rev Drug Discov 2007 ; 6 ( 11 ): 891 – 903 . 
Google Scholar Crossref Search ADS PubMed WorldCat 138 Gerstung M , Papaemmanuil E, Martincorena I, et al. Precision oncology for acute myeloid leukemia using a knowledge bank approach . Nat Genet 2017 ; 49 ( 3 ): 332 – 40 . Google Scholar Crossref Search ADS PubMed WorldCat 139 Robinson PN. Deep phenotyping for precision medicine . Hum Mutat 2012 ; 33 ( 5 ): 777 – 80 . Google Scholar Crossref Search ADS PubMed WorldCat 140 Juengst E , McGowan ML, Fishman JR, et al. From ‘personalized’ to ‘precision’ medicine: the ethical and social implications of rhetorical reform in genomic medicine ,. Hastings Cent Rep 2016 ; 46 ( 5 ): 21 – 33 . Google Scholar Crossref Search ADS PubMed WorldCat 141 Dzau VJ , Ginsburg GS. Realizing the full potential of precision medicine in health and health care . JAMA 2016 ; 316 ( 16 ): 1659 – 60 . Google Scholar Crossref Search ADS PubMed WorldCat 142 Glimelius B , Lahn M. Window-of-opportunity trials to evaluate clinical activity of new molecular entities in oncology . Ann Oncol 2011 ; 22 ( 8 ): 1717 – 25 . Google Scholar Crossref Search ADS PubMed WorldCat 143 Schmitz S , Duhoux F, Machiels J-P. Window of opportunity studies: do they fulfil our expectations? Cancer Treat Rev 2016 ; 43 : 50 – 57 . Google Scholar Crossref Search ADS PubMed WorldCat 144 Hay M , Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs . Nat Biotechnol 2014 ; 32 ( 1 ): 40 – 51 . Google Scholar Crossref Search ADS PubMed WorldCat 145 Gill SK , Christopher AF, Gupta V, Bansal P. Emerging role of bioinformatics tools and software in evolution of clinical research . Perspect Clin Res 2016 ; 7 ( 3 ): 115 – 22 . Google Scholar Crossref Search ADS PubMed WorldCat 146 McShane LM , Cavenagh MM, Lively TG, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration . BMC Med 2013 ; 11 ( 1 ): 220 . 
Google Scholar Crossref Search ADS PubMed WorldCat 147 Stenehjem DD , Bellows BK, Yager KM, et al. Cost-utility of a prognostic test guiding adjuvant chemotherapy decisions in early-stage non-small cell lung cancer . Oncologist 2016 ; 21 : 196 – 204 . Google Scholar Crossref Search ADS PubMed WorldCat 148 St Jude Children‘s Research Hospital . St Jude‘s Family Advisory Council: PGEN4Kids Study Information, 2012 . https://s.stjude.org/multimedia/PG4KDS/PGEN4Kid.html 149 Gottesman O , Scott SA, Ellis SB, et al. The CLIPMERGE PGx program: clinical implementation of personalized medicine through electronic health records and genomics - pharmacogenomics . Clin Pharmacol Ther 2013 ; 94 ( 2 ): 214 – 17 . Google Scholar Crossref Search ADS PubMed WorldCat © The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. © The Author 2017. Published by Oxford University Press.
Bioinformatics for precision oncology
Singer, Jochen; Irmisch, Anja; Ruscheweyh, Hans-Joachim; Singer, Franziska; Toussaint, Nora C; Levesque, Mitchell P; Stekhoven, Daniel J; Beerenwinkel, Niko
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx143 pmid: 29272324
Abstract Molecular profiling of tumor biopsies plays an increasingly important role not only in cancer research, but also in the clinical management of cancer patients. Multi-omics approaches hold the promise of improving diagnostics, prognostics and personalized treatment. To deliver on this promise of precision oncology, appropriate bioinformatics methods for managing, integrating and analyzing large and complex data are necessary. Here, we discuss the specific requirements of bioinformatics methods and software that arise in the setting of clinical oncology, owing to a stricter regulatory environment and the need for rapid, highly reproducible and robust procedures. We describe the workflow of a molecular tumor board and the specific bioinformatics support that it requires, from the primary analysis of raw molecular profiling data to the automatic generation of a clinical report and its delivery to decision-making clinical oncologists. Such workflows have to various degrees been implemented in many clinical trials, as well as in molecular tumor boards at specialized cancer centers and university hospitals worldwide. We review these and more recent efforts to include other high-dimensional multi-omics patient profiles into the tumor board, as well as the state of clinical decision support software to translate molecular findings into treatment recommendations.
cancer, molecular tumor board, data analysis pipeline, mutation calling, clinical decision support
Introduction The continuous improvement, greater availability and decreasing cost of next-generation sequencing (NGS) have allowed major cancer centers worldwide to offer NGS-based personalized oncology for clinical practice.
The goal is to profile the genetic aberrations of tumors, such as single-nucleotide variants (SNVs), copy number variants (CNVs), insertions and deletions (indels), structural variants (SVs) and gene fusions, and to suggest potential treatments based on the molecular lesions that are observed. These approaches can be organized either as a single institutional molecular tumor board (MTB), where detected genetic aberrations are evaluated for any potential matching treatments, or as a basket trial, in which predefined genetic alterations are assigned to matching treatment arms (baskets). Both approaches typically include patients whose disease has progressed on all conventional treatment options and those with rare cancers for which limited treatments exist, such as many pediatric tumors [1]. MTBs are now widespread in the USA, Europe and Australia, with up to 2000 patients reported per cancer center to date [2]. Ideally, a biopsy is taken at tumor progression after the last therapy so that it reflects the current genetic state of the evolved tumor [3, 4]. However, some MTB approaches also profile biopsies sampled at diagnosis, especially for high-risk tumors with few treatment options [5–7], or biopsies of patients currently responding to therapy but without further therapeutic options [8–10]. Typically, biopsies with a tumor content of at least 20% are analyzed by cancer-specific gene panels, such as FoundationOne [11], or whole-exome sequencing (WES) [12] (Figure 1). Some centers include additional measurements, such as profiling of the transcriptome, methylome or copy number alterations [5, 13, 14]. Whereas profiling by WES usually includes a germline control [5, 12], this control is missing in most panel sequencing approaches [2, 11]. In an ideal setup, matched tumor–normal DNA and RNA sequencing samples are processed under the same conditions, including in the same lane of the sequencer.
The resulting NGS data are analyzed for genetic aberrations and potential drug interactions. Specific treatment suggestions are, after careful consideration of available preclinical and clinical evidence, incorporated into a clinical report, which together with the patient’s clinical data, such as treatment history, comorbidities and radiology scans, forms the basis for therapeutic decision-making in an interdisciplinary MTB. The molecular report may suggest tumor genotype-matched clinical trials and targeted therapies, such as kinase inhibitors, or recommend the avoidance of drugs, for example, in cases where mutations that potentially confer treatment resistance have been detected.

Figure 1. Schematic overview of the workflow of an MTB. Tumor biopsies are obtained from consenting patients, and DNA is extracted and sequenced. Variants are called and then annotated and prioritized for potential functional or clinical relevance before being reported to a tumor board, where an interdisciplinary team decides about treatment options.

A number of challenges exist for current precision oncology approaches during all the steps of the process, starting from clinical sampling up to bioinformatics analysis, reporting and patient treatment. In addition to difficulties in obtaining a tumor biopsy and a sufficient quantity and quality of tumor DNA and RNA for molecular profiling, Massard et al. [14] reported that an actionable mutation was found in fewer than half of 843 patients with advanced solid tumors.
In the largest basket trial approach to date, the MATCH trial of the US National Cancer Institute (NCI), the restricted number of drug arms resulted in even fewer gene–drug matches. Only 9% of the patients could be assigned to a genetics-based treatment [15]. The development of more selective drugs over time is expected to increase these numbers. A further challenge is to translate a MTB suggestion into patient treatment. Beltran et al. [12] reported that although 94% of solid cancer patients in their MTB had an actionable alteration, only 5% were treated based on their genotype. The main reasons were rapid decline of condition and, more importantly, the lack of access to clinical trials or off-label drugs. Finally, the costs of molecular profiling can be challenging as well. Although it has been shown that panel sequencing is financially feasible [16], the costs of more comprehensive approaches such as WES, whole-genome sequencing (WGS) and RNA sequencing can be prohibitive for reimbursement. Nevertheless, it is to be expected that comprehensive sequencing will become cheaper, and therefore financially feasible. The final outcome of cancer genotype-matched patient treatment, namely, patient response to treatment, varies widely in the published literature. Schwaederle et al. [11] report a partial response in 36% of patients, whereas the MOSCATO trial [14] reports objective responses in 11% of patients receiving matched treatments. The currently ongoing basket trials such as NCI MATCH, which aims to include 6000 patients, will provide more conclusive data, owing to larger cohort sizes and well-defined genotype-matched treatment arms. Nonetheless, a single-gene aberration is not always predictive of treatment response as has been observed for the oncogenic BRAF mutations, which predict BRAF inhibitor response in melanoma [17], but not necessarily in non-melanoma cancers [18]. 
Furthermore, molecular tumor approaches reported to date are based on profiling of single biopsies. Large-scale sequencing studies have shown extensive intra-patient heterogeneity between different metastases and even within individual tumors [3, 4], indicating that a single biopsy might fail to identify the ubiquitous alterations and might also miss relevant ones. In this review, we discuss bioinformatics approaches to NGS-based precision oncology, including variant calling, annotation, interpretation, drug matching and reporting in an MTB setting. We have set up a bioinformatics analysis pipeline and reporting workflow for WES and WGS at the MTB of the University Hospital Zurich and will base this review on our experiences with this ongoing effort. For guidelines on the analysis of NGS-based oncology panels, please refer to [19]. Requirements on bioinformatics solutions for clinical oncology High-throughput NGS allows for time- and cost-effective molecular probing of tumors. However, the resulting sequencing data are challenging to analyze because of their large size and various confounding sources of variation, most notably amplification and sequencing errors. Careful analysis of NGS data is particularly important in the context of MTBs, where treatment suggestions based on mutation calls may have dramatic effects, ranging from recovery to death of a patient. Therefore, strict standards with respect to several aspects described below need to be followed. First and foremost, experimental noise needs to be distinguished from true biological signals. Treatment decisions have to be based only on validated, real biological alterations and should not be misled by technical artifacts. Toward this end, appropriate computational data analysis pipelines have to be used that cover the entire process from primary analysis of the read data to clinical reporting.
To understand the limitations of an implemented pipeline, it needs to be evaluated under defined conditions reflecting realistic use case conditions [20, 21]. Pipelines need to be robust with respect to new sequencing data that may differ in some aspects from previously analyzed samples. In addition, mutation calls should be reported with a confidence estimate. Although some mutation callers report, for example, P-values or posterior probabilities, it remains a major challenge to provide a meaningful notion of confidence for the results of an entire pipeline. This is particularly important, as the overlap of different approaches is often limited, as mentioned in [22–25]. The results produced by a bioinformatics pipeline have to be reproducible. This requirement entails several technical prerequisites discussed below and includes controlling random seeds for all steps that involve randomization. Another important aspect of reproducibility is a rigorous documentation of each step of the pipeline, including complete documentation of the used tools, their version and parameter settings. This also holds for databases and ensures complete transparency [20]. For instance, in the past, most genomic studies have used as a reference genome GRCh37 from the Genome Reference Consortium or its equivalent from the University of California Santa Cruz, version hg19. Even though there are only minor differences in their genetic information, the naming scheme is different, which can lead to confusion. Moreover, the new human genome assembly GRCh38 not only updated the main chromosomes, and therefore changed their coordinates, but also included new contigs to represent population haplotypes, further complicating reproducibility. Therefore, it is necessary that for each file used in the pipeline, its generation and dependencies are clearly described. Such a setup also guarantees the traceability of all results. 
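The reproducibility requirements above (controlled random seeds, documented tool versions and parameters, and a clear description of every file a step depends on) can be illustrated with a small provenance record. This is a minimal sketch under stated assumptions, not the pipeline described in this review: the function names, the record layout and the tool name `example_mapper` are hypothetical, and SHA-256 checksums are simply one way to pin input files.

```python
import hashlib
import json
import random


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest that pins the content of an input file."""
    return hashlib.sha256(data).hexdigest()


def make_provenance_record(step, tool, version, parameters, inputs, seed):
    """Collect what is needed to rerun and trace one pipeline step.

    `inputs` maps a file name to its raw bytes; in a real pipeline the bytes
    would be read from disk (assumption made to keep the sketch self-contained).
    """
    return {
        "step": step,
        "tool": tool,
        "version": version,
        "parameters": dict(sorted(parameters.items())),
        "input_checksums": {
            name: sha256_of_bytes(content) for name, content in sorted(inputs.items())
        },
        "random_seed": seed,
    }


def seeded_subsample(items, k, seed):
    """Draw a reproducible subsample using a dedicated, explicitly seeded generator."""
    rng = random.Random(seed)
    return rng.sample(list(items), k)


record = make_provenance_record(
    step="read_mapping",
    tool="example_mapper",   # hypothetical tool name
    version="0.0.1",         # hypothetical version string
    parameters={"reference": "GRCh37", "threads": 4},
    inputs={"sample.fastq": b"@read1\nACGT\n+\nIIII\n"},
    seed=42,
)
print(json.dumps(record, indent=2))
```

Storing one such record next to every output file makes it possible to trace a reported call back to the exact inputs, reference build and settings that produced it.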
For example, it should be possible to trace back the call of a treatment-critical mutation, to assess the call manually and to validate it before recommending the treatment. In addition, genomic alterations in the patient which are not directly linked to cancer, known as incidental variants, may be discovered. As these variants may be reported in various ways with potential ethical implications, a clear strategy needs to be defined, for example, reporting all relevant incidental findings [26]. In addition to these requirements on stability, robustness, reproducibility and traceability of the computational pipeline, the size, sensitivity and complexity of comprehensive clinical data sets combined with the urgency caused by the often critical state of the respective patient result in a set of challenging technical prerequisites for the computational infrastructure and the implemented data analysis software of an MTB. Technical prerequisites Medical data require secure data storage and distributed computing. Secure storage of sensitive data calls for restrictive authorization and authentication schemes that limit data access to those who hold valid credentials. These schemes have to be implemented and reviewed on a regular basis, in particular in a clinical setting in which data might have to be stored for many years. As data sets grow and the analysis becomes increasingly complex, the computation time of even single data sets outgrows the capacity of individual computers. Distributed computing, such as high-performance clusters or cloud engines, allows for efficient execution of data analysis workflows. The drawback is that these instances do not natively comply with the strict security requirements of medical data, as resources are shared among users with and without sufficient permissions. To address the strong requirement for speed, accuracy and reproducibility, the use of a workflow manager can help with standardization and automation of the analysis.
Multiple workflow managers are available such as Snakemake [27], Nextflow [28], Toil [29], Bpipe [30] and to some extent also the Galaxy framework [31]. Although they differ in features such as cluster support and programming language, they have all been implemented with the same rationale: the scientist defines the order, the parameters and the input data for a chain of tools, and the workflow manager takes care of the correct execution and documentation of the intermediate steps. Primary analysis of DNA data The primary analysis of genomic data sets typically starts with the raw sequencing data and finishes with a list of mutations. The different steps of this analysis are conducted in complex pipelines that differ according to the sequencing method used. Even for the same type of sequencing method, many pipelines are available and it has been observed repeatedly that the results can be different [24, 25, 32–35]. The primary analysis can be subdivided into (i) raw sequencing file processing, (ii) read mapping, (iii) alignment post-processing and (iv) variant calling (Figure 2). These steps are implemented to different extents in most pipelines. In the following, we will describe each of them briefly.

Figure 2. Schematic overview of analysis steps for DNA variant calling (blue, top) and RNA expression analysis (red, bottom).

Raw sequencing file processing The genomic sequencing data are provided in the form of reads, amplified DNA sequences of tens to hundreds of base pairs, in so-called FASTQ files. In addition to the sequencing information, for each nucleotide, the FASTQ file contains quality scores provided by the sequencing machine. These quantities represent the probability of the reported nucleotide to be a sequencing error, as estimated by the sequencer.
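The relation between a FASTQ quality character and the error probability it encodes can be made concrete with a short sketch. It relies on the standard Phred definition, Q = -10 log10(P), and assumes the Phred+33 ASCII encoding; older files may use a different offset, so the `offset` parameter is an assumption to check against the data at hand. The function names are illustrative only.

```python
def phred_quality(quality_char: str, offset: int = 33) -> int:
    """Decode one FASTQ quality character into its Phred score (Phred+33 assumed)."""
    return ord(quality_char) - offset


def error_probability(quality: int) -> float:
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def expected_errors(quality_string: str, offset: int = 33) -> float:
    """Sum the per-base error probabilities of a read."""
    return sum(error_probability(phred_quality(c, offset)) for c in quality_string)


# 'I' encodes Q40 (1 error in 10,000 calls), '5' encodes Q20 (1 in 100).
for char in "I5":
    q = phred_quality(char)
    print(char, q, error_probability(q))
```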
Quality scores can be used to trim reads such that the FASTQ files only contain high-confidence nucleotides, and the number of false-positive calls owing to sequencing errors is kept at a minimum [36]. Sequencing adapters are another source of artifacts. Adapters are short nucleotide sequences attached to the genomic DNA fragment and used for amplification and sequencing. Sometimes these adapters are contained within the nucleotide sequence of a read and may lead to false-positive mutation calls. Therefore, many pipelines include tools such as Cutadapt [37], Trimmomatic [38], SeqPurge [39] or Flexbar [40] to remove low-quality bases and artifacts in the raw sequencing data. Read mapping Owing to the sequencing protocols, the reads do not contain any information about their origin in the genome. This information is inferred by using read mappers, which align, or map, all reads to a given reference sequence. The importance of this time-critical step has led to the development of >60 different read mappers [41], with BWA [42] and Bowtie2 [43] being popular examples. They usually provide their results in Sequence Alignment/Mapping format (SAM, binary version BAM) files, which undergo different modifications during the alignment post-processing step. Alignment post-processing This phase typically starts with sorting the SAM/BAM files according to their genomic coordinates. Afterward, polymerase chain reaction (PCR) duplicates are often removed using, for example, Picard tools (http://broadinstitute.github.io/picard) or SAMtools [44]. These duplicates are copies of the same genomic fragment and indicate selective PCR amplification, which can bias the analysis. However, duplicated reads can also be biological copies originating from the same genomic location of chromosomes of different cells.
The probability of a duplicate read to be a biological copy increases with coverage [45], such that this step is typically not performed for deep-coverage targeted sequencing approaches. Another post-processing step is the re-alignment of reads around indels. As read mappers rely on heuristics to deal with the large amount of data, the resulting alignments can be suboptimal. This is especially true for sites harboring indels because here the difference between the reference genome and the patient reads is more pronounced. To reduce this bias, many pipelines perform re-alignments around these positions, for example, using the Genome Analysis Toolkit (GATK) [46–48]. For Illumina data, GATK also provides a tool to correct for biases in the sequencing process, which uses a machine learning approach to re-compute the quality scores of the nucleotides. The use of the re-alignment and quality score recalibration is generally recommended [47, 49], but they are not always performed in practice, as they are time-intensive and the impact is sometimes not obvious [50, 51]. Variant calling Variant calling in the context of oncology refers to the identification of somatic variants in the cancer genome. These variants have occurred during the development of the tumor and they need to be separated from germ line variants of the patient. Targeted cancer therapy aims to selectively inhibit cells with specific somatic mutations, such as SNVs, indels and SVs. There are two conceptually different approaches to identify somatic variants, namely, (i) filtering for somatic variants using existing variant databases and (ii) using a normal control sample to distinguish somatic from germ line variants. The first approach identifies variants in the genome by analyzing only the tumor sample, using tools such as VarScan2 [52], SiNVICT [53] or GATK HaplotypeCaller [46–48]. 
The identified mutations are then compared with existing databases, such as dbSNP [54, 55], ExAC [56], ClinVar [57] or COSMIC [58], to assess whether a given variant has previously been reported as a germ line variant or a cancer-associated change in the genome. The major advantage of such approaches is independence from a control tissue sample, while major drawbacks are dependence on quality and completeness of the databases as well as limited sensitivity because low-frequency variants are difficult to distinguish from sequencing noise. The second approach uses an additional non-cancerous sample from the same individual as a germ line control. This approach can further be subdivided into methods that (a) apply variant calling to the tumor and control sample independently (using tools of approach (i)) or (b) use the genomic information of the two samples jointly. Approaches in the first category subtract from the tumor sample all mutations in the control sample, i.e. the germ line variants. Methods of the second category directly call somatic mutations by comparing variants between tumor and control sample for each position, which increases the power for calling true mutations at a given false-positive rate [59]. The idea is to model the control and tumor sample jointly to transfer noise patterns learned from the control sample to handle confounding factors appropriately. The results of approaches in (b) are usually superior to results from approaches in category (a), especially with regard to specificity [60]. Examples are MuTect [61], Strelka [62], VarScan2 [52], JointSNVMix [60] and deepSNV [59, 63]. For the identification of SVs, there are four commonly used techniques, namely, clustering, split-read mapping, contig assembly and statistical testing, as described in more detail in [22]. SV detection can be divided into CNV detection and identification of other SVs such as translocations and inversions. 
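The subtraction strategy of category (a) above, in which variants are called independently in tumor and control and the germ line calls are then removed from the tumor calls, can be sketched in a few lines. This is a deliberately simplified illustration: the `Variant` tuple and the function name are hypothetical, the coordinates are made up, and real pipelines would additionally check that the control has sufficient coverage at each site before declaring a variant somatic.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Minimal variant key: position plus reference and alternative allele."""
    chrom: str
    pos: int
    ref: str
    alt: str


def subtract_germline(tumor_calls: Iterable[Variant],
                      normal_calls: Iterable[Variant]) -> List[Variant]:
    """Keep tumor variants that were not also called in the matched control."""
    germline = set(normal_calls)
    return [v for v in tumor_calls if v not in germline]


# Made-up example calls, for illustration only.
tumor = [Variant("1", 1000, "A", "T"), Variant("2", 2000, "G", "C"),
         Variant("3", 3000, "C", "T")]
normal = [Variant("2", 2000, "G", "C")]

print(subtract_germline(tumor, normal))
```

Because this comparison is done on final calls rather than on the reads, a germ line variant that was missed in the control is reported as somatic, which is one reason why the joint approaches of category (b) tend to be more specific.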
CNV calling is performed not only on WGS, but also on WES and even amplicon sequencing data. Numerous methods for CNV calling exist [64], including EXCAVATOR [65], BIC-seq2 [66] and CopywriteR [67]. In contrast, SVs like translocations and inversions are usually called based on WGS to determine the actual breakpoints of the genomic rearrangement. Popular methods include Pindel [68], SVDetect [69], Delly [70] and Lumpy [71]. As mentioned in [22], sensitive and specific SV calling remains a challenge, and choosing the appropriate approach greatly depends on the type of SV and NGS protocol features, such as the library size. For a more comprehensive review of CNV and SV calling, we refer to [22, 64, 72, 73]. Primary analysis of RNA data While variant calling is typically based on DNA data, differential expression analysis uses RNA sequencing data. Alignment and read pre- and post-processing are generally similar for DNA and RNA sequencing, with some key differences, for example, read mappers have to perform a special gapped alignment, because RNA reads sometimes do not continuously align to the reference sequence owing to splicing events, but map to different exons with large gaps in between. Popular RNA aligners are STAR [74] and TopHat [75]. In contrast to DNA alignments, the coverage of RNA alignments varies between regions in the genome owing to different gene expression levels. Thus, the coverage of RNA alignments can be used to infer gene expression levels after normalization with respect to total read count, gene length and possibly other confounding factors such as GC content. Here, commonly used tools include HTSeq [76] and featureCounts [77]. If matching control tissue is available, differential gene expression compared with normal can also be assessed, albeit with reduced statistical power owing to the lack of replicates. Typically, however, no adequate normal tissue is available. 
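Normalizing read counts by gene length and total read count, as described above, can be illustrated with transcripts per million (TPM). TPM is used here only as one widely used normalization and is not necessarily the one applied by the tools cited in the text; the gene names and counts are made up, and further confounders such as GC content are ignored in this sketch.

```python
from typing import Dict


def counts_to_tpm(counts: Dict[str, float],
                  lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Convert raw read counts per gene into transcripts per million (TPM).

    Step 1 corrects for gene length (reads per kilobase); step 2 rescales so
    that the values of one sample sum to one million, which corrects for the
    total number of reads.
    """
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Two hypothetical genes with equal read counts but different lengths.
example_counts = {"geneA": 100, "geneB": 100}
example_lengths = {"geneA": 1000, "geneB": 2000}
print(counts_to_tpm(example_counts, example_lengths))
```

With equal read counts, the shorter gene receives twice the TPM value of the longer one, reflecting that the same number of reads over a shorter transcript implies higher expression.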
Popular tools for differential gene expression analysis include DESeq2 [78] and edgeR [79, 80], which model read counts directly, account for various sources of confounding and provide robust statistical procedures for parameter estimation. An alternative, albeit imperfect, approach to detecting over- or under-expressed genes is the comparison of tumor gene expression levels to publicly available data sets of suitable tumor or normal cohorts, such as TCGA (https://cancergenome.nih.gov/) or GTEx [81]. For example, Oberg et al. used 124 transcriptomes from various normal tissues as a reference data set in a pediatric hematology-oncology setting [5]. Batch effects have to be taken into account when comparing separately generated RNA sequencing data sets. Multiple tools for batch effect removal are available, e.g. the R package SVA [82]. However, it remains a challenge to integrate transcriptome data in a clinical tumor board setting, where the task typically is to compare an individual tumor sample with a separate healthy reference or tumor cohort. Eventually, the goal is to use the RNA sequencing data in at least three ways: (1) to validate the expression of SNVs, CNVs or SVs, (2) to identify misregulated pathways that could potentially be targetable and (3) to determine the proportion of immune cell infiltration based on immune signatures. For each of these aims, different references might be necessary. As healthy tissue from individual cancer patients is not always available, public transcriptome databases may be used as a comparison. However, the transcriptional changes between healthy controls and cancer cells may be less revealing than a comparison with similar cohorts of cancer biopsies. For instance, different subtypes of melanoma (i.e. mucosal versus uveal or cutaneous) have some similarities, but differences might reveal informative vulnerabilities that could be targeted in an MTB setting.
Lastly, the ability to infer tumor infiltration of immune cells based on RNA expression could be a powerful means to complement traditional immunohistochemistry approaches that are still relevant for predicting response to immunotherapies. Variant annotation The process of variant annotation aims at assembling as much relevant information as necessary to select or discard a given variant while at the same time keeping the amount of information that needs to be parsed manually as small as possible. Possible annotations range from basic attributes like affected gene, coding or noncoding, synonymous or nonsynonymous to complex classifications like clinical significance. Clinical significance is the most relevant piece of information for a clinician about any variant. Typically, variants are categorized as pathogenic, likely pathogenic, of unknown significance, likely benign, or benign. However, the classification of specific variants is not consistent across available databases such as ClinVar [57], CIViC [83], COSMIC [58] and dbSNP [54, 55]. For instance, algorithms such as SnpEff [84] categorize variants based on the predicted impact on protein function, whereas ClinVar [57] links particular variants to known functional or clinical features. Additionally, the vast majority of detected variants have not yet been assigned a level of functional relevance or clinical significance. Thus, focussing only on variants annotated as (likely) pathogenic will often result in no variants at all being reported. This is unsatisfactory and potentially misleading. Annotation tools such as SnpEff [84] and ANNOVAR [85] can be applied to help extract interesting variants for the clinical report. Furthermore, a useful database for the identification of potentially deleterious SNVs is dbNSFP [86]. It contains predictions from a large set of functional prediction tools for all possible nonsynonymous SNVs and splice variants in the human genome. 
Among others, annotations include deleteriousness and affected protein domains. Both can be very useful for variant prioritization. For example, a deleterious variant ranks higher than a non-deleterious variant and a nonsynonymous coding variant within a protein domain ranks higher than a non-protein-truncating variant outside of a protein domain. For functional effect prediction of indels, PROVEAN [87] can be used. It predicts the functional effects of single and also multiple amino acid substitutions, in-frame insertions and deletions. Another helpful annotation when it comes to variant prioritization is whether a variant affects a potential cancer driver gene. Information on genes that have been reported as driver genes can be obtained from the literature [88, 89] and databases such as UniProt [90], IntOGen [91, 92] and COSMIC [58]. With the goal of recommending drugs, it is useful to annotate genes with drugs that target them. Popular online resources to query drug–gene interactions are DGIdb [93, 94], OncoKB [95] or CIViC [83]. It would be desirable to also annotate genes with indirectly interacting drugs, i.e. drugs that target proteins up- or downstream of the gene within the relevant pathway. Such annotation methods are currently being developed, e.g. [96], but no easy-to-use tool or API has yet been established. Interpretation of molecular profiles and clinical reporting Interpreting the clinical significance of genomic variants and transcriptional changes, i.e. the synthesis of all available information about an event and its relevance to clinical action [97] is a daunting and laborious task. It constitutes the bottleneck of the whole process from biopsy collection to reporting to the MTB [97] because it cannot be fully automated in a reliable way. Nevertheless, a properly curated list of evidence-based therapy recommendations forms the basis for the MTB to decide on the treatment of a patient. 
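The prioritization rules mentioned above (deleterious before non-deleterious, inside a protein domain before outside, affecting a reported driver gene) can be expressed as a simple sort key. This is a minimal sketch: the field names are hypothetical, the variant labels are placeholders, and the order in which the three criteria take precedence is an assumption made for illustration, not a rule stated in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AnnotatedVariant:
    """Hypothetical container for three of the annotations discussed above."""
    name: str
    deleterious: bool
    in_protein_domain: bool
    in_driver_gene: bool


def prioritize(variants: List[AnnotatedVariant]) -> List[AnnotatedVariant]:
    """Rank variants so that the most relevant candidates come first.

    Booleans compare as False < True, so sorting the tuples in descending
    order puts variants satisfying the earlier criteria on top. Ties keep
    their input order because Python's sort is stable.
    """
    return sorted(
        variants,
        key=lambda v: (v.deleterious, v.in_protein_domain, v.in_driver_gene),
        reverse=True,
    )


candidates = [
    AnnotatedVariant("variant_1", deleterious=False, in_protein_domain=True, in_driver_gene=True),
    AnnotatedVariant("variant_2", deleterious=True, in_protein_domain=False, in_driver_gene=False),
    AnnotatedVariant("variant_3", deleterious=True, in_protein_domain=True, in_driver_gene=False),
]
for variant in prioritize(candidates):
    print(variant.name)
```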
Thus, the ultimate goal of clinical reporting is to apply clinical interpretation to select relevant variants and to recommend targeted, personalized therapies [98]. The best case scenario for reporting is a single pathogenic mutation with an associated, clearly defined and clinically verified therapy, such as BRAF V600E and vemurafenib [17]. However, more often, several damaging mutations of unknown significance are identified and it is unclear which, if any, have functional or clinical relevance. This is especially true in the case of comprehensive sequencing. Consequently, the potentially long list of mutations and their associated drugs needs to be filtered automatically to obtain a relevant but manageable selection of drug–gene interactions that can then be further curated manually. Examples of such filters are exclusion of non-cancer drugs or of drugs with a nonsensical mode of action for their associated mutation, such as an inhibitor for a deleted gene. For the report, each listed drug–gene association has to be assigned a level of confidence. In 2017, the Association for Molecular Pathology, the American College of Medical Genetics and Genomics, the American Society of Clinical Oncology and the College of American Pathologists established four evidence levels based on professional guidelines as well as size and number of studies supporting a mutation and its associated drug [99]. While these categories may or may not fit the local or national situation of a reporting facility, adherence to a joint consensus is favorable, as it facilitates the comparison with other resources, like OncoKB [95] and PharmGKB [100], and also the longitudinal use of findings in the clinic. As mentioned above, it is not unusual for variants to be assigned contradicting levels of clinical significance across and even within individual databases.
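The two automatic filters named above, excluding non-cancer drugs and excluding drugs whose mode of action makes no sense for the observed alteration (such as an inhibitor for a deleted gene), can be sketched as follows. The record layout, the category labels and the gene and drug names are hypothetical placeholders; real drug–gene resources use their own vocabularies, which would need to be mapped onto such fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrugGeneInteraction:
    """Hypothetical, simplified drug-gene interaction record."""
    gene: str
    alteration: str        # e.g. "deletion", "amplification", "missense"
    drug: str
    interaction_type: str  # e.g. "inhibitor", "activator"
    is_cancer_drug: bool


def is_plausible(interaction: DrugGeneInteraction) -> bool:
    """Apply the two example filters from the text."""
    if not interaction.is_cancer_drug:
        return False
    # An inhibitor cannot act on a gene product that is no longer present.
    if interaction.interaction_type == "inhibitor" and interaction.alteration == "deletion":
        return False
    return True


def filter_interactions(interactions: List[DrugGeneInteraction]) -> List[DrugGeneInteraction]:
    """Return the interactions left for manual curation."""
    return [i for i in interactions if is_plausible(i)]


# Placeholder records, for illustration only.
interactions = [
    DrugGeneInteraction("GENE_A", "missense", "drug_x", "inhibitor", True),
    DrugGeneInteraction("GENE_B", "deletion", "drug_y", "inhibitor", True),
    DrugGeneInteraction("GENE_C", "amplification", "drug_z", "inhibitor", False),
]
print([i.drug for i in filter_interactions(interactions)])
```

Such filters only shorten the list; the remaining drug–gene associations still require the manual curation described in the text.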
Therefore, preparation of a meaningful tumor board report often needs to include a manual investigation of the associated literature to properly annotate and clinically interpret the identified variant. To determine the clinical actionability of a variant, one can consider, for example, the cell type content of the biopsy, the tissue-specificity of gene expression alterations and, when not using germ line controls, potential germ line variants. Alternatively, all findings, even contradictory ones, can be reported, thereby leaving the entire interpretation up to the MTB. However, it is questionable whether the latter approach is a practical solution given the often very short time frame that is available in the MTB to discuss particular cases. This trade-off between comprehensiveness and conciseness is a common theme in clinical reporting. Molecular Tumor Board Zurich In early 2015, we started the Molecular Tumor Board Zurich (MTBZ) to comprehensively profile and report on end-of-treatment line melanoma patients [101]. An important prerequisite for the success of this endeavor was to bridge the gap between the medical and technical disciplines and establish a common language to better understand the needs for efficient and effective reporting to the tumor board. The goal of this project was to overcome certain shortcomings in the standard of care. We address these issues by (i) comprehensive sequencing, (ii) automated and comprehensive annotation, (iii) investigation beyond disease-specific therapies and (iv) identification of therapies with lacking or reduced efficacy. For patients without any traditional treatment options remaining, comprehensive profiling of the tumor might offer new treatment options. Therefore, we established a protocol based on WES and WGS of tumor and matched normal samples, specifically WES for SNV and small indel calling and low-pass WGS for CNV calling. 
In addition to the identification of somatic variants, WES allows us to provide more information potentially relevant to the clinician, namely, mutational burden and the patient’s HLA type. We report the mutational burden of a tumor, which is especially useful for the decision on using immunotherapies, for instance, in the case of CTLA-4 blockade in melanoma [102]. Further, we put it into context by comparing it with the distribution of mutational load within publicly available samples from the same and other cancer types [103]. The HLA-I type of a patient, which can be inferred from WES data using, for example, OptiType [104], provides information on eligibility for certain cancer vaccination trials [105]. Another important difference to standard procedures is the implementation of an automated and comprehensive annotation pipeline querying multiple databases for clinical significance, finding clinical trial opportunities worldwide and putting observed variants into the context of large studies like TCGA using the cBio Cancer Genomics Portal [106]. The use of the latter is twofold: We can assess (i) whether a variant is typical for the cancer type which improves confidence, and (ii) whether a variant uncommon in the given type of cancer is commonly observed in another type of cancer and could explain why previous standard treatments had not been successful. We group therapies associated with detected somatic mutations into (i) cancer-type-specific therapies, (ii) non-cancer-type-specific therapies, (iii) investigational therapies and (iv) therapies potentially lacking benefit (Figure 3). The first category represents all suggested therapies which have been approved for the given cancer type by the local regulatory body, i.e. Swissmedic. The second group consists of therapies that are approved but not for the cancer type under consideration. 
This group is especially relevant owing to the increasing understanding that the genomic profile of a tumor is a better predictor for response than the tissue of origin alone [107]. By limiting this group to approved drugs only, it constitutes a source of available options to clinicians in Switzerland, where health insurances often approve the use of off-label treatments. The third group contains therapies which are not approved, but have been shown to be effective in preclinical studies and are currently in clinical trials, either open or ongoing. Although this group is usually based on low or insufficient levels of evidence, owing to singleton studies or only pre-clinical evidence, it frequently contains references to open clinical trials that the patient might be eligible for.

Figure 3. Example of concise report summary from an MTBZ report, including mutational burden, HLA-I type of the patient, mutational state of cancer-type-specific set of important genes, grouped according to level of approval.

The final group includes therapies for which the genetic profile might cause reduced efficacy. In the fast-moving process of understanding the efficacy of novel therapeutics and their range of effects on different targets, a single trial showing lack of efficacy may be sufficient to exclude a therapy. For example, in a patient with neuroendocrine carcinoma, paclitaxel was a candidate drug for non-cancer-type-specific therapies. However, a clinical phase II study [108] showed that high-dose paclitaxel lacked antitumor activity and displayed significant hematologic toxicity in patients with advanced neuroendocrine tumors. Therefore, paclitaxel was listed as potentially lacking benefit.
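The four report groups described above can be summarized as a small decision function. This is a sketch under assumptions rather than the MTBZ implementation: the `Therapy` fields and all names are hypothetical, evidence of lacking benefit is given precedence over approval status (mirroring the paclitaxel example), and therapies that are neither approved nor in a clinical trial are simply left unclassified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class TherapyGroup(Enum):
    CANCER_TYPE_SPECIFIC = "cancer-type-specific"
    NON_CANCER_TYPE_SPECIFIC = "non-cancer-type-specific"
    INVESTIGATIONAL = "investigational"
    POTENTIALLY_LACKING_BENEFIT = "potentially lacking benefit"


@dataclass(frozen=True)
class Therapy:
    """Hypothetical summary of what is known about one candidate therapy."""
    name: str
    approved_cancer_types: FrozenSet[str]  # indications approved by the local regulator
    in_clinical_trial: bool
    evidence_of_lacking_benefit: bool = False


def classify_therapy(therapy: Therapy, cancer_type: str) -> Optional[TherapyGroup]:
    """Assign a therapy to one of the four report groups, or None if it fits none."""
    if therapy.evidence_of_lacking_benefit:
        return TherapyGroup.POTENTIALLY_LACKING_BENEFIT
    if cancer_type in therapy.approved_cancer_types:
        return TherapyGroup.CANCER_TYPE_SPECIFIC
    if therapy.approved_cancer_types:
        return TherapyGroup.NON_CANCER_TYPE_SPECIFIC
    if therapy.in_clinical_trial:
        return TherapyGroup.INVESTIGATIONAL
    return None


# Placeholder therapies, for illustration only.
drug_a = Therapy("drug_a", frozenset({"melanoma"}), in_clinical_trial=False)
drug_b = Therapy("drug_b", frozenset(), in_clinical_trial=True)
drug_c = Therapy("drug_c", frozenset({"breast cancer"}), False, evidence_of_lacking_benefit=True)

for drug in (drug_a, drug_b, drug_c):
    print(drug.name, classify_therapy(drug, "melanoma"))
```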
In a first pilot study, we analyzed tumor biopsies and matched germline samples from five metastatic melanoma patients with progressive disease on standard treatment and produced reports within a clinically relevant time period of 4–12 weeks from tumor biopsy. Briefly, we performed WES and WGS on tumor biopsy samples together with a blood sample as matched normal control. Based on the pipeline framework described in [109], we use Trimmomatic [38] to remove adapters and quality-trim the raw reads. We apply BWA [42] for read mapping and subsequently remove PCR duplicates using Picard tools (http://broadinstitute.github.io/picard). Following the GATK best practices [46–48], we perform indel realignment and base recalibration before variant calling. SNVs are called using a combination of MuTect [61], Strelka [62] and VarScan2 [52] and are annotated using various databases, including dbSNP [54, 55], COSMIC [58] and ClinVar [57], with functional annotation based on dbNSFP [86, 110]. CNVs are called from WGS data using BIC-seq2 [66]. All variants are compared against DGIdb [93] to select a first set of candidates for possible targeted treatments based on reported drug–gene interactions. Candidate treatments are further prioritized, for instance, based on Swissmedic approval of the therapy, availability of clinical trials and treatment success in existing clinical studies. Finally, selected variants and the respective treatment options are reported in the clinical report and discussed with the treating clinician. In the five melanoma patients, we detected between 3 and 11 actionable aberrations per patient, most commonly in genes of the PI3K, cell cycle checkpoint and MAPK pathways. In two cases, the MTB recommended therapy based on our results: in one case, immunotherapy based on high mutational load, and in the other, a chemotherapeutic drug based on the loss of a receptor activating the detoxification pathway of the drug.
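The SNV-calling step of the pipeline described above combines the output of three callers, but the text does not state how the call sets are merged. One common option is a majority vote, sketched below under that assumption. The function name, the `min_callers` threshold and the variant coordinates are hypothetical and are not taken from the MTBZ pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# A variant is identified by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def consensus_snvs(
    calls_by_caller: Dict[str, Iterable[Variant]],
    min_callers: int = 2,
) -> Set[Variant]:
    """Keep variants reported by at least `min_callers` callers (assumed rule)."""
    votes: Counter = Counter()
    for calls in calls_by_caller.values():
        votes.update(set(calls))  # each caller votes at most once per variant
    return {variant for variant, n in votes.items() if n >= min_callers}


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    v1 = ("chr1", 1000, "A", "T")
    v2 = ("chr2", 2000, "G", "C")
    v3 = ("chr3", 3000, "C", "A")
    v4 = ("chr4", 4000, "T", "G")
    calls = {
        "mutect": [v1, v2, v3],
        "strelka": [v1, v2],
        "varscan2": [v1, v4],
    }
    print(sorted(consensus_snvs(calls)))  # v1 and v2 are supported by >= 2 callers
```

A stricter threshold (`min_callers=3`) trades sensitivity for precision, which may or may not be appropriate depending on tumor purity and sequencing depth.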
We observed a near-complete durable response in the first patient and a progression of disease in the second. The reasons for not following the report recommendations for the other patients were rapid decline of one patient’s condition and treatment with a newly approved immunotherapy regimen in two others. Together with our clinical collaborators, we were able to draft a set of best practices on what to include in the report. These best practices are also applicable to disciplines outside of oncology. First of all, the report should begin with a concise summary of the most important findings. In our report, we focus on mutational load, the state of genes commonly mutated in the specific cancer type, a therapy summary and HLA-I type (Figure 3). Starting on page 2, the report should increase in depth such that the reader who would like to know more details can simply read on. Given the limited time to discuss a case in the MTB meeting, it is key that the most important facts can be grasped quickly from scanning the first page. Nevertheless, ideally, the report provides all information obtained from processing of the patient samples. A selected list of clinical trial opportunities based on the molecular profile of the tumor is an important part of our report. Here, we refer to trials that are currently recruiting, thus offering a chance for the patient to get access to a potentially beneficial therapy that might otherwise not be available. To allow the clinician to quickly assess the suitability of a trial, our report includes drug name, trial phase and title, as well as trial locations. Given the rapid developments in molecular profiling technologies as well as in variant calling and annotation algorithms and databases, the MTBZ workflow is naturally a constant work in progress.
In our most recent reports, for example, we started to incorporate transcriptomics data, allowing us to detect up- and downregulation of genes and transcripts, gene fusions, alternative splicing events, as well as the expression status of somatic mutations.

Future directions

Bioinformatics workflows for the analysis and clinical interpretation of tumor molecular profiles have to various degrees been implemented in clinical trials and MTBs at specialized cancer centers and university hospitals worldwide. The initial results of these efforts are promising, but it has also become clear that exploiting the full potential of precision oncology faces many challenges. One current bottleneck is efficient and precise annotation of variants. This step requires databases containing well-curated variants as well as their interactions with potential drugs. Text mining is a promising approach to accelerate and improve the process of not only curating variants across the globe, but also finding evidence in the literature for interactions between drugs and genes as well as for the effect of drug combinations [111]. Stronger proof for annotation in the form of globally curated variants and better literature evidence will ultimately speed up the process of interpreting results from molecular diagnostic testing, and thus overcome the bottleneck of precision oncology. The rapid development of molecular profiling techniques will continue to provide new opportunities for precision oncology. For example, single-cell sequencing [112, 113], which allows for processing the DNA of hundreds and the gene expression levels of thousands of cells independently at the same time, will lead to increasing sensitivity levels with respect to mutation identification and the detection of tumor subclones, both of which are likely to affect treatment outcome.
Further, multi-omics approaches will provide more insight into dysregulated pathways and increase the level of confidence in reporting an actionable variant when it can be confirmed by RNA, protein or epigenetic profiling. At the same time, multi-omics data will pose new bioinformatics challenges to integrate multiple data types and identify potentially efficacious treatments. Moreover, powerful predictions of patient response to a personalized treatment strategy will come from functionally testing the suggested therapies on ex vivo tumor slices [114], in 2D or 3D cultures of the patient’s tumor or in patient-derived xenograft models [115]. This approach, although still in its infancy, will provide another level of therapeutic decision support for the MTB by allowing for the exclusion or confirmation of therapeutic efficacy and choice of the most efficacious drug combinations.

Key Points

Robust, reproducible, transparent and comprehensive bioinformatics pipelines are required for precision oncology, including molecular tumor boards and cancer basket trials.
Variant calling, interpretation and annotation are at the core of improving cancer treatment by providing timely and reliable therapy recommendations.
Clinical reporting of molecular findings is an important step that requires close interactions between bioinformaticians and clinicians.

Funding

Part of this work has been funded by EC Horizon 2020 project No. 633974 (SOUND – Statistical multi-Omics UNDerstanding of Patient Samples), SystemsX.ch RTD Grant 2013/150, ERC Synergy Grant 609883, and Innovation Pool Funding of the University Hospital Zurich.

Jochen Singer is a PhD student in the Computational Biology Group at the Department of Biosystems Science and Engineering of ETH Zurich in Basel, Switzerland.
Anja Irmisch is a cancer biology scientist in a translational oncology group at the Department of Dermatology at the University of Zurich Hospital in Zurich, Switzerland.
Hans-Joachim Ruscheweyh is a postdoc in the Computational Biology Group at the Department of Biosystems Science and Engineering of ETH Zurich in Basel, Switzerland.
Franziska Singer is a senior bioinformatics scientist in the Clinical Bioinformatics Unit of NEXUS Personalized Health Technologies at ETH Zurich in Basel, Switzerland.
Nora C. Toussaint is a senior bioinformatics scientist in the Clinical Bioinformatics Unit of NEXUS Personalized Health Technologies at ETH Zurich in Zurich, Switzerland.
Mitchell P. Levesque is associate professor for translational oncology at the Department of Dermatology at the University of Zurich Hospital in Zurich, Switzerland.
Daniel J. Stekhoven is head of the Clinical Bioinformatics Unit of NEXUS Personalized Health Technologies at ETH Zurich in Zurich, Switzerland.
Niko Beerenwinkel is associate professor of computational biology at the Department of Biosystems Science and Engineering of ETH Zurich in Basel, Switzerland.

References

1. Kotecha RS, Kees UR, Cole CH, et al. Rare childhood cancers–an increasing entity requiring the need for global consensus and collaboration. Cancer Med 2015;4(6):819–24.
2. Meric-Bernstam F, Brusco L, Shaw K, et al. Feasibility of large-scale genomic testing to facilitate enrollment onto genomically matched clinical trials. J Clin Oncol 2015;33(25):2753–62.
3. Burrell RA, McGranahan N, Bartek J, et al. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 2013;501(7467):338–45.
4. Gerlinger M, Rowan AJ, Horswell S, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med 2012;366(10):883–92.
5. Oberg JA, Glade Bender JL, Sulis ML, et al. Implementation of next generation sequencing into pediatric hematology-oncology practice: moving beyond actionable alterations. Genome Med 2016;8:133.
6. Lane BR, Bissonnette J, Waldherr T, et al. Development of a center for personalized cancer care at a regional cancer center: feasibility trial of an institutional tumor sequencing advisory board. J Mol Diagn 2015;17(6):695–704.
7. Pincez T, Clément N, Lapouble E, et al. Feasibility and clinical integration of molecular profiling for target identification in pediatric solid tumors. Pediatr Blood Cancer 2017;64(6):e26365.
8. Bryce AH, Egan JB, Borad MJ, et al. Experience with precision genomics and tumor board, indicates frequent target identification, but barriers to delivery. Oncotarget 2017;8:27145–54.
9. Seeber A, Gastl G, Ensinger C, et al. Treatment of patients with refractory metastatic cancer according to molecular profiling on tumor tissue in the clinical routine: an interim-analysis of the ONCO-T-PROFILE project. Genes Cancer 2016;7(9-10):301–8.
10. Parker BA, Schwaederlé M, Scur MD, et al. Breast cancer experience of the molecular tumor board at the University of California, San Diego Moores Cancer Center. J Oncol Pract 2015;11(6):442–9.
11. Schwaederle M, Parker BA, Schwab RB, et al. Molecular tumor board: the University of California-San Diego Moores Cancer Center experience. Oncologist 2014;19(6):631–6.
12. Beltran H, Eng K, Mosquera JM, et al. Whole-exome sequencing of metastatic cancer and biomarkers of treatment response. JAMA Oncol 2015;1(4):466–74.
13. Worst BC, van Tilburg CM, Balasubramanian GP, et al. Next-generation personalised medicine for high-risk paediatric cancer patients – the INFORM pilot study. Eur J Cancer 2016;65:91–101.
14. Massard C, Michiels S, Ferté C, et al. High-throughput genomics and clinical outcome in hard-to-treat advanced cancers: results of the MOSCATO 01 trial. Cancer Discov 2017;7(6):586.
15. Conley BA, Gray R, Chen A, et al. Abstract CT101: NCI-molecular analysis for therapy choice (NCI-MATCH) clinical trial: interim analysis. Cancer Res 2016;76:CT101.
16. Hamblin A, Wordsworth S, Fermont JM, et al. Clinical applicability and cost of a 46-gene panel for genomic analysis of solid tumours: retrospective validation and prospective audit in the UK National Health Service. PLoS Med 2017;14(2):e1002230.
17. Chapman PB, Hauschild A, Robert C, et al. Improved survival with vemurafenib in melanoma with BRAF V600E mutation. N Engl J Med 2011;364(26):2507–16.
18. Hyman DM, Puzanov I, Subbiah V, et al. Vemurafenib in multiple nonmelanoma cancers with BRAF V600 mutations. N Engl J Med 2015;373(8):726–36.
19. Jennings LJ, Arcila ME, Corless C, et al. Guidelines for validation of next-generation sequencing-based oncology panels: a joint consensus recommendation of the Association for Molecular Pathology and College of American Pathologists. J Mol Diagn 2017;19(3):341–65.
20. Aziz N, Zhao Q, Bry L, et al. College of American Pathologists’ laboratory standards for next-generation sequencing clinical tests. Arch Pathol Lab Med 2015;139(4):481–93.
21. Matthijs G, Souche E, Alders M, et al. Guidelines for diagnostic next-generation sequencing. Eur J Hum Genet 2016;24(1):2–5.
22. Guan P, Sung W-K. Structural variation detection using next-generation sequencing data: a comparative technical review. Methods 2016;102:36–49.
23. Alioto TS, Buchhalter I, Derdak S, et al. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat Commun 2015;6:10001.
24. Krøigård AB, Thomassen M, Lænkholm A-V, et al. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLoS One 2016;11(3):e0151664.
25. Hofmann AL, Behr J, Singer J, et al. Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers. BMC Bioinformatics 2017;18:8.
26. Rigter T, Henneman L, Kristoffersson U, et al. Reflecting on earlier experiences with unsolicited findings: points to consider for next-generation sequencing and informed consent in diagnostics. Hum Mutat 2013;34(10):1322–8.
27. Köster J, Rahmann S. Snakemake–a scalable bioinformatics workflow engine. Bioinformatics 2012;28(19):2520–2.
28. Di Tommaso P, Chatzou M, Floden EW, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol 2017;35(4):316–19.
29. Vivian J, Rao AA, Nothaft FA, et al. Toil enables reproducible, open source, big biomedical data analyses. Nat Biotechnol 2017;35(4):314–16.
30. Sadedin SP, Pope B, Oshlack A. Bpipe: a tool for running and managing bioinformatics pipelines. Bioinformatics 2012;28(11):1525–6.
31. Goecks J, Nekrutenko A, Taylor J, et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11(8):R86.
32. Pabinger S, Dander A, Fischer M, et al. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinformatics 2014;15(2):256–78.
33. Cai L, Yuan W, Zhang Z, et al. In-depth comparison of somatic point mutation callers based on different tumor next-generation sequencing depth data. Sci Rep 2016;6:36540.
34. Nam J-Y, Kim NKD, Kim SC, et al. Evaluation of somatic copy number estimation tools for whole-exome sequencing data. Brief Bioinformatics 2016;17(2):185–92.
35. Alkodsi A, Louhimo R, Hautaniemi S. Comparative analysis of methods for identifying somatic copy number alterations from deep sequencing data. Brief Bioinformatics 2015;16(2):242–54.
36. Del Fabbro C, Scalabrin S, Morgante M, et al. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS One 2013;8(12):e85024.
37. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet j 2011;17(1):10.
38. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
39. Sturm M, Schroeder C, Bauer P. SeqPurge: highly-sensitive adapter trimming for paired-end NGS data. BMC Bioinformatics 2016;17:208.
40. Dodt M, Roehr JT, Ahmed R, et al. FLEXBAR-flexible barcode and adapter processing for next-generation sequencing platforms. Biology 2012;1(3):895–905.
41. Fonseca NA, Rung J, Brazma A, et al. Tools for mapping high-throughput sequencing data. Bioinformatics 2012;28(24):3169–77.
42. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25(14):1754–60.
43. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
44. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009;25(16):2078–9.
45. Zhou W, Chen T, Zhao H, et al. Bias from removing read duplication in ultra-deep sequencing experiments. Bioinformatics 2014;30(8):1073–80.
46. McKenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297–303.
47. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43(5):491–8.
48. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics 2013;11:11.10.1–33.
49. Pirooznia M, Kramer M, Parla J, et al. Validation and assessment of variant calling pipelines for next-generation sequencing. Hum Genomics 2014;8:14.
50. Tian S, Yan H, Kalmbach M, et al. Impact of post-alignment processing in variant discovery from whole exome data. BMC Bioinformatics 2016;17(1):403.
51. Liu Q, Guo Y, Li J, et al. Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data. BMC Genomics 2012;13(Suppl 8):S8.
52. Koboldt DC, Zhang Q, Larson DE, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22(3):568–76.
53. Kockan C, Hach F, Sarrafi I, et al. SiNVICT: ultra-sensitive detection of single nucleotide variants and indels in circulating tumour DNA. Bioinformatics 2017;33(1):26–34.
54. Sherry ST, Ward M, Sirotkin K. dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res 1999;9:677–9.
55. Sherry ST, Ward MH, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29(1):308–11.
56. Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016;536(7616):285–91.
57. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014;42:D980–5.
58. Forbes SA, Beare D, Gunasekaran P, et al. COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res 2015;43:D805–11.
59. Gerstung M, Beisel C, Rechsteiner M, et al. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat Commun 2012;3:811.
60. Roth A, Ding J, Morin R, et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 2012;28(7):907–13.
61. Cibulskis K, Lawrence MS, Carter SL, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31(3):213–19.
62. Saunders CT, Wong WSW, Swamy S, et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 2012;28(14):1811–17.
63. Gerstung M, Papaemmanuil E, Campbell PJ. Subclonal variant calling with multiple samples and prior knowledge. Bioinformatics 2014;30(9):1198–204.
64. Zhao M, Wang Q, Wang Q, et al. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics 2013;14(Suppl 11):S1.
65. Magi A, Tattini L, Cifola I, et al. EXCAVATOR: detecting copy number variants from whole-exome sequencing data. Genome Biol 2013;14(10):R120.
66. Xi R, Lee S, Xia Y, et al. Copy number analysis of whole-genome data using BIC-seq2 and its application to detection of cancer susceptibility variants. Nucleic Acids Res 2016;44(13):6274–86.
67. Kuilman T, Velds A, Kemper K, et al. CopywriteR: DNA copy number detection from off-target sequence data. Genome Biol 2015;16:49.
68. Ye K, Schulz MH, Long Q, et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25(21):2865–71.
69. Zeitouni B, Boeva V, Janoueix-Lerosey I, et al. SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics 2010;26(15):1895–6.
70. Rausch T, Zichner T, Schlattl A, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28(18):i333–9.
71. Layer RM, Chiang C, Quinlan AR, et al. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol 2014;15(6):R84.
72. Tattini L, D’Aurizio R, Magi A. Detection of genomic structural variants from next-generation sequencing data. Front Bioeng Biotechnol 2015;3:92.
73. Ding L, Wendl MC, McMichael JF, et al. Expanding the computational toolbox for mining cancer genomes. Nat Rev Genet 2014;15(8):556–70.
74. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013;29(1):15–21.
75. Kim D, Pertea G, Trapnell C, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013;14(4):R36.
76. Anders S, Pyl PT, Huber W. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics 2015;31(2):166–9.
77. Liao Y, Smyth GK, Shi W. FeatureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014;30(7):923–30.
78. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15(12):550.
79. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 2012;40(10):4288–97.
80. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26(1):139–40.
81. Carithers LJ, Moore HM. The genotype-tissue expression (GTEx) project. Biopreserv Biobank 2015;13(5):307–8.
82. Leek JT, Johnson WE, Parker HS, et al. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012;28(6):882–3.
83. Griffith M, Spies NC, Krysiak K, et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat Genet 2017;49(2):170–4.
84. Cingolani P, Platts A, Wang LL, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 2012;6(2):80–92.
85. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38(16):e164.
86. Liu X, Wu C, Li C, et al. dbNSFP v3.0: a one-stop database of functional predictions and annotations for human nonsynonymous and splice-site SNVs. Hum Mutat 2016;37(3):235–41.
87. Choi Y, Sims GE, Murphy S, et al. Predicting the functional effect of amino acid substitutions and indels. PLoS One 2012;7(10):e46688.
88. Vogelstein B, Papadopoulos N, Velculescu VE, et al. Cancer genome landscapes. Science 2013;339(6127):1546–58.
89. Tamborero D, Gonzalez-Perez A, Perez-Llamas C, et al. Comprehensive identification of mutational cancer driver genes across 12 tumor types. Sci Rep 2013;3:2650.
90. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45:D158–69.
91. Rubio-Perez C, Tamborero D, Schroeder MP, et al. In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities. Cancer Cell 2015;27(3):382–96.
92. Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J, et al. IntOGen-mutations identifies cancer drivers across tumor types. Nat Methods 2013;10(11):1081–2.
93. Wagner AH, Coffman AC, Ainscough BJ, et al. DGIdb 2.0: mining clinically relevant drug-gene interactions. Nucleic Acids Res 2016;44:D1036–44.
94. Thurnherr T, Singer F, Stekhoven DJ, et al. Genomic variant annotation workflow for clinical applications. F1000Research 2016;5:1963.
95. Chakravarty D, Gao J, Phillips S, et al. OncoKB: a precision oncology knowledge base. JCO Precision Oncology 2017. doi: 10.1200/PO.17.00011.
96. Schneider L, Stöckel D, Kehl T, et al. DrugTargetInspector: an assistance tool for patient treatment stratification. Int J Cancer 2016;138(7):1765–76.
97. Good BM, Ainscough BJ, McMichael JF, et al. Organizing knowledge to enable personalization of medicine in cancer. Genome Biol 2014;15:438.
98. Le Tourneau C, Kamal M, Tsimberidou A-M, et al. Treatment algorithms based on tumor molecular profiling: the essence of precision medicine trials. J Natl Cancer Inst 2016;108(4):djv362.
99. Li MM, Datto M, Duncavage EJ, et al. Standards and guidelines for the interpretation and reporting of sequence variants in cancer: a joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. J Mol Diagn 2017;19(1):4–23.
100. Whirl-Carrillo M, McDonagh EM, Hebert JM, et al. Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther 2012;92(4):414–17.
101. Singer F, Irmisch A, Toussaint NC, et al. Establishing molecular diagnostics in Swiss clinics. 2017, In preparation.
102. Snyder A, Makarov V, Merghoub T, et al. Genetic basis for clinical response to CTLA-4 blockade in melanoma. N Engl J Med 2014;371(23):2189–99.
103. Alexandrov LB, Nik-Zainal S, Wedge DC, et al. Signatures of mutational processes in human cancer. Nature 2013;500(7463):415–21.
104. Szolek A, Schubert B, Mohr C, et al. OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics 2014;30(23):3310–16.
105. Legat A, Maby-El Hajjami H, Baumgaertner P, et al. Vaccination with LAG-3Ig (IMP321) and peptides induces specific CD4 and CD8 T-cell responses in metastatic melanoma patients–report of a phase I/IIa clinical trial. Clin Cancer Res 2016;22(6):1330–40.
106. Cerami E, Gao J, Dogrusoz U, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2012;2(5):401–4.
107. Redig AJ, Jänne PA. Basket trials and the evolution of clinical trial design in an era of genomic medicine. J Clin Oncol 2015;33(9):975–7.
108. Ansell SM, Pitot HC, Burch PA, et al. A phase II study of high-dose paclitaxel in patients with advanced neuroendocrine tumors. Cancer 2001;91(8):1543–8.
109. Singer J, Ruscheweyh H-J, Hofmann AL, et al. NGS-pipe: a flexible, easily extendable, and highly configurable framework for NGS analysis. Bioinformatics 2017. doi: 10.1093/bioinformatics/btx540.
110. Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat 2011;32(8):894–9.
111. Singhal A, Simmons M, Lu Z, et al. Text mining genotype-phenotype relationships from biomedical literature for database curation and precision medicine. PLoS Comput Biol 2016;12(11):e1005017.
112. Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet 2016;17(3):175–88.
113. Svensson V, Natarajan KN, Ly L-H, et al. Power analysis of single-cell RNA-sequencing experiments. Nat Methods 2017;14(4):381–7.
114. Davies EJ, Dong M, Gutekunst M, et al. Capturing complex tumour biology in vitro: histological and molecular characterisation of precision cut slices. Sci Rep 2015;5:17187.
115. Pauli C, Hopkins BD, Prandi D, et al. Personalized in vitro and in vivo cancer models to guide precision medicine. Cancer Discov 2017;7(5):462–77.

Author notes

Jochen Singer, Anja Irmisch and Hans-Joachim Ruscheweyh contributed equally to this work.

© The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Developing a ‘personalome’ for precision medicine: emerging methods that compute interpretable effect sizes from single-subject transcriptomes
Vitali, Francesca; Li, Qike; Schissler, A Grant; Berghout, Joanne; Kenost, Colleen; Lussier, Yves A
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx149
pmid: 29272327
Abstract The development of computational methods capable of analyzing -omics data at the individual level is critical for the success of precision medicine. Although unprecedented opportunities now exist to gather data on an individual’s -omics profile (‘personalome’), interpreting and extracting meaningful information from single-subject -omics remain underdeveloped, particularly for quantitative non-sequence measurements, including complete transcriptome or proteome expression and metabolite abundance. Conventional bioinformatics approaches have largely been designed for making population-level inferences about ‘average’ disease processes; thus, they may not adequately capture and describe individual variability. Novel approaches intended to exploit a variety of -omics data are required for identifying individualized signals for meaningful interpretation. In this review—intended for biomedical researchers, computational biologists and bioinformaticians—we survey emerging computational and translational informatics methods capable of constructing a single subject's ‘personalome’ for predicting clinical outcomes or therapeutic responses, with an emphasis on methods that provide interpretable readouts. Key points: (i) the single-subject analytics of the transcriptome shows the greatest development to date and, (ii) the methods were all validated in simulations, cross-validations or independent retrospective data sets. This survey uncovers a growing field that offers numerous opportunities for the development of novel validation methods and opens the door for future studies focusing on the interpretation of comprehensive ‘personalomes’ through the integration of multiple -omics, providing valuable insights into individual patient outcomes and treatments. 
Keywords: single-subject studies, personalome, precision medicine, n-of-1

Introduction

The arrival of precision medicine has led to a more individual-based view of diseases, with characteristics of single subjects being central to the prediction of clinical outcomes and prescription of tailored treatments. This concept is not new; in fact, evidence-based clinical practice guidelines [1] stratify treatments according to some patient characteristics (e.g. gender, ancestry, age, family history, some laboratory test results). However, precision medicine differs from the traditional medical approach, as it seeks to leverage not only clinical variables and clinician-selected genetic tests but also broad and data-intensive molecular and general -omics profiles of a patient [2]. These large and heterogeneous data cannot be interpreted directly by medical practitioners and require an automatic procedure for extracting relevant knowledge before incorporation into clinical practice. Therefore, it is fundamental to develop computational methods aimed at analyzing these data at the individual level. Current approaches aimed at analyzing disease or other biological processes, therapeutic efficacy and -omic data still leverage well-established cohort-based population analyses such as case-control studies [e.g. gene expression classifiers (GExpCs)], observational trials or controlled intervention trials. These large cohort/group approaches place emphasis on the group average rather than individual participants; though this group average may not represent any actual individual’s personal profile, let alone be meaningful to understanding the profile of a given specific patient. On the other hand, the framework of N-of-1 trials has been applied to repeated measures of a single analyte for over two decades [3]. This approach is based on the collection of various relevant data for one person as frequently as possible [4].
In this way, novel strategies can be explored to compare different treatments of the same person. Moreover, by looking at commonalities across multiple N-of-1 studies collecting the same type of data, it is possible to estimate the efficacy of an intervention in a specific subset population (i.e. people sharing a particular genetic profile). N-of-1 trials demonstrated their power to evaluate treatment effectiveness in a single subject for one variable [5], but proposed approaches for one analyte do not scale for -omics legion-size data sets. Although we now have an unprecedented technical opportunity to gather data relating to an individual’s -omic profile, bioinformatics tools to understand these data comprehensively, and at the individual level, remain underdeveloped. Novel approaches for identifying individualized (single-subject)—and not cohort—signals are required for gathering insights into the biology of diseases and healthy states of individuals. This review focuses on computational methods aimed at analyzing quantitative transcriptomic measurements of an individual and the combination of transcriptome with other -omic data. In this review, we define the personalome as an interpretable personal molecular mechanism profile of an individual derived from one or more scales of -omic data, especially when designed to enable precision medicine. ‘Personal -omics’ means the -omics measures of a single subject. Molecular mechanisms are any molecular functions or biological processes such as a missense mutation in DNA, or a differentially expressed pathway (DEP) at the transcriptome or proteome. To be considered interpretable at the molecular mechanism, the raw -omics profile must have been subjected to analyses performing (i) dimension reduction and (ii) biomolecular interpretation of the mechanisms involved in molecules of life (Figure 1). 
For example, full genomes are reduced to variant and mutation calls through analyses against a reference genome, or in the case of cancer, by also comparing paired cancer and unaffected tissue to determine somatic versus germline mutations. Here, we show how differentially expressed molecules of life and pathways can be unveiled in a single subject through the analysis of transcriptome data.

Figure 1 Flow chart of methods designed for clinical interpretation of single-subject -omics. This review addresses the gap of knowledge to compare and contrast single-subject methods designed to reduce the dimension of raw -omics data (left) and to provide a biomolecular interpretation of signals (gray rectangle). For DNA sequencing, variant and mutation calls as well as all functional annotations in single subjects (e.g. missense mutation) already bridge this gap. However, this intermediate step is often omitted for other molecules of life, such as mRNAs, miRNAs, proteins, methylated DNA regions and metabolites (carbohydrates and lipids). This review focuses on single-subject methods that analyze transcriptome data. The ‘Clinical applications’ section provides emerging evidence that the newly available, unbiased SSA of the transcriptome enables innovative types of studies to investigate their clinical utility by addressing the gap of biomolecular interpretation of raw -omics signals. Among possible studies, we demonstrate that -omics clinical prediction classifiers that operate directly at the -omics scale may be redesigned for the parsimonious transformed signal of single-subject studies for improved clinical utility.

We surveyed emerging novel computational biology, biostatistical and translational informatics methods that construct a single subject’s personalome by analyzing transcriptome data to predict outcomes or therapeutic responses without requiring the large cohort needed for conventional approaches. Our review methodology is detailed in the Supplementary Material S1. Particular emphasis is placed on those methods that provide clinically interpretable readouts rather than simple categorical classification, as the latter are known to be difficult to reproduce across data sets and contain noisy, incidental and passenger variation [6–9]. The papers and methods selected for review reflect the authors' views and are not intended to provide an exhaustive search.
Figure 2A depicts all considered publications by year of publication and number of citations; the studies are shown with different colors and shapes according to the type of required data input and output, respectively. Figure 2B shows the number of citations over time.

Figure 2 SSA studies included in this review. (A) Each numbered point represents a publication plotted by year of publication and the relative number of citations (in log2 scale). Numbers correspond to the publications in this article’s reference list. Colors indicate the type of input required, i.e. one single-subject sample (1 ss SAMPLE, green), two paired single-subject samples (2 ss SAMPLES, purple) or multiple samples collected from the same subject (multiple ss SAMPLES, orange). The shapes represent the type of output provided by the selected studies, i.e. DEGs (circle) or DEPs (X). Finally, blue squares indicate methods based on the integration of transcriptome data with other -omics. (B) Number of citations over time starting from the publication year for the single-subject studies analyzing transcriptome data. Color and shape coding is the same as in (A).

The review is divided according to the type of data inputs in the methods (i.e. transcriptome and integrated -omics). A review of the validations of all methods follows, and finally, we discuss and conclude with the broad challenges, the applications and the opportunities in developing a personalome for precision medicine, i.e. how the single-subject analyses (SSAs) of -omics data can bring novel insights into disease mechanisms specific to a patient and unveil potential patient-specific treatments. A table of contents for the review is provided in Table 1.

Table 1 Table of contents of the review
Section
Transcriptome
Cross-subject transcriptome analyses
Single-subject transcriptome analyses
DEGs identification in single-subjects
DEPs identification in single-subjects
Longitudinal time series analyses of transcriptome
Single-subject transcriptome integrated with other -omics
Validation of single-subject methods
Clinical applications
Perspective and conclusion

Transcriptome

Transcriptome analysis aims to interpret the quantification of transcribed genetic material, including both coding and noncoding RNA. Different from DNA, which is relatively static, analyses of the transcriptome capture the collective impact of tissue type, sequence variation, regulation, environment, external stimulation (e.g. drug treatments) and interactions between them. High-throughput technologies, such as microarray and RNA sequencing (RNA-Seq), are capable of assessing transcript expression at genome-scale for an individual sample, with RNA-Seq providing unbiased detection, broader dynamic range, increased specificity and sensitivity and easier detection of rare and low-abundance transcripts. The transcriptome provides a snapshot of transcriptional activity under the condition where the RNA was collected, allowing researchers to study the biological impact of certain diseases or effect of treatments [10]. This allows us to better understand general disease mechanisms, discover biomarkers or identify drug targets at the cohort scale when sufficient samples are collected, but also has the power to reveal individual-specific signals, whose detection and analysis through computational methods can lead to far more precise medical understanding and decision-making.
Analysis of more than one transcriptome of an individual enables the assessment of personal dynamic changes over time or in response to therapy or other environmental changes. Yet, identifying important individual signals is not a trivial task, as transcript expression variations in a given tissue and time point are further modulated by stochastic variability, cyclic patterns (e.g. circadian), platform biases and measurement errors, in addition to the signals that are truly relevant to the disease state. The power of the methods reported in this section is that, starting from thousands of genes, they are able to provide information on the key genes and mechanisms (i.e. pathways) of a disease. This can speed up the planning of future, effective studies.

Cross-subject transcriptome analyses

Conventional transcriptome analytics require well-powered cohorts of both cases and controls and describe variation in the transcriptome when comparing two or more classes with a variety of methods (e.g. t-test [11, 12], analysis of variance [13], linear mixed models [14], modeling via the negative binomial distribution [15–17]). These strategies are designed to identify DEGs of ‘average responses across patients’ under particular experimental conditions (e.g. disease versus normal; or predrug and postdrug treatment). To extract more interpretable results, genes detected as differentially expressed are often further categorized according to enrichment or membership in knowledge bases such as curated biological pathways or functional gene sets (e.g. Kyoto Encyclopedia of Genes and Genomes (KEGG) [18], Gene Ontology (GO) [19]). In this way, DEPs of average responses of patients can be identified, providing a more comprehensible view of the transcriptomic processes under study versus a simple gene list that requires significant gene recognition and subject matter expertise for interpretation.
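The enrichment step described above is commonly implemented as an over-representation test. The following is a minimal sketch, assuming a one-sided hypergeometric test; the function name `enrichment_p_value` and the toy numbers are illustrative assumptions and are not taken from any of the cited tools.

```python
from math import comb


def enrichment_p_value(n_background, n_in_set, n_degs, n_overlap):
    """One-sided hypergeometric P-value for gene set over-representation.

    n_background: number of genes measured (the background)
    n_in_set:     number of background genes annotated to the gene set
    n_degs:       number of genes called differentially expressed
    n_overlap:    number of DEGs that belong to the gene set

    Returns P(X >= n_overlap) when n_degs genes are drawn at random
    from the background without replacement.
    """
    total = comb(n_background, n_degs)
    upper = min(n_in_set, n_degs)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_degs - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 000 measured genes, a 100-gene pathway,
# 500 DEGs, 10 of which fall in the pathway.
p = enrichment_p_value(20000, 100, 500, 10)
print(f"enrichment P-value: {p:.3g}")
```

In practice the P-values of all tested gene sets would also need a multiple-testing correction, which this sketch leaves out.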
A wide array of studies and tools belong to this category, including popular ones such as gene set enrichment analysis (GSEA) [20] and DAVID [21]. In general, there are two main strategies to identify DEPs: (i) gene set-centric (GC) and (ii) pathway-centric (PC) analyses. The GC approach is generally performed in two steps: first, DEGs are selected; then, the DEPs are computed by statistically testing the genes against the background. A critical limitation of the GC strategy is that the results strongly depend on the DEGs identified in the first step. In fact, small changes in the DEG analysis may lead to the detection of a slightly different DEG list that can result in large changes in the identified DEPs. In addition, the final result is significantly affected by the arbitrary cutoff chosen in the enrichment step, as the majority of statistical tests require a P-value threshold [22]. Therefore, we are providing the minimum number of genes in each gene set (Figure 5, column ‘Minimum # of transcript per scored gene set’), as methods providing a higher minimal threshold will be less susceptible to this bias. However, another limitation common to all reviewed DEPs is that similar gene sets are not identified as biomolecularly related in the resulting set, though postprocessing methods are available to address it [23–25].

Table 2 Additional details on single-subject transcriptome analyses of DEGs
Publication | Name | Description
Wang et al. [34] | RankComp | RankComp requires two inputs: (i) a disease sample and (ii) a set of accumulated normal samples, which can be accrued during the same experiment or a priori from various external resources. RankComp begins by ranking genes within the samples (both the case and the normal) according to increasing expression values. Next, pairwise rank comparisons are performed to identify (a) stable gene pairs and (b) reversal gene pairs. Stable gene pairs are defined as those with the same ordering in 99% of the accumulated normal samples [expression(geneA) > expression(geneB)], while reversal gene pairs are identified by disruption of that ordering in the disease sample [expression(geneA) < expression(geneB)]. Fisher’s exact test is conducted to test the null hypothesis that the numbers of reversal gene pairs supporting a gene's upregulation or downregulation are equal. This procedure enables extraction of a list of DEGs for a single subject, and interpretable results can be obtained through manual examination or by performing gene set enrichment analyses
Liu et al. [35] | DNB | Computational approach based on DNB theory to detect pre-disease states
Wang et al. [36] | DEGseq | DEGseq identifies DEGs using RNA-Seq data collected from a single subject. When replicates are not available, the authors suggest an MA-plot-based method with a random sampling model, which assumes the expression counts follow a binomial distribution. Given the average of log2-transformed expression levels, it approximates the log2 expression fold change by a normal distribution, and then calculates a Z-score based on this distribution. P-values are computed based on Z-scores
Tarazona et al. [37] | NOISeq | NOISeq is a data-adaptive and nonparametric approach, which has a variant, NOISeq-sim, that works without replicates. NOISeq-sim uses simulated replicates when real replicates do not exist. It simulates replicates under the assumption that gene expression counts follow a multinomial distribution in which the probability of each gene corresponds to the probability of a read mapping to that gene. The probability of each gene is estimated by the proportion of its read counts relative to the total number of mapped reads from the only sample under the corresponding experimental condition. With the simulated replicates, NOISeq-sim generates a joint null distribution of fold-changes (M) and absolute differences (D) of the expression counts from the replicates within the same condition. This joint null distribution is then used to assess differential expression from each gene's (M, D) pair computed between conditions
Feng et al. [38] | GFOLD | This method assumes a Poisson distribution (λ) for the gene expression counts and a uniform prior distribution for λ. After computing a posterior distribution of λ for each gene, GFOLD ranks gene expression changes of all genes based on the c-th percentile of these posterior distributions, where c is determined by users. In this way, it penalizes genes with low expression levels for their larger variances
Anders et al. [39] | DESeq | When neither condition (i.e. affected and control sample) has replicate transcriptomes, DESeq assumes the majority of the genes to be non-DEGs and estimates a mean–variance relationship by treating the two samples as if they were replicates [33]
Robinson et al. [17] | edgeR | edgeR assumes that RNA-Seq data follow a negative binomial distribution for which, given the mean, the variance is determined by a dispersion parameter. When working without replicates, edgeR assigns the same value of the dispersion parameter to all genes and conducts a negative binomial exact test to compute P-values. Note that the value of dispersion is predetermined based on investigators' understanding of the biological nature of the samples rather than estimated from data [18]

The PC strategy is a distinct approach to derive statistics directly on the pathways without using DEGs.
This approach is more sensitive to a concordant change of expression in the same direction, even if the transcripts would not otherwise be identified as DEGs. Although more sensitive to directionally dysregulated pathways than GCs, current implementations of PCs are not designed to identify dysregulated pathways containing both upregulated and downregulated transcripts. A further limitation of DEP identification is its dependence on the choice of prior pathway knowledge. Currently, several knowledge sources, such as KEGG [18], Reactome [26] and Pathway Commons [27], can be used. Drawing on several sources may cause redundancy and different results; moreover, such data sources may contain incomplete, incorrect or inconsistent data. Such dependencies between pathways could result in correlated P-values and overdispersion of the number of significant pathways, leading to biased results [28]. Therefore, future studies are required to compare the robustness of DEP methods in the presence of noise and missing gene set annotations.
While these approaches for transcriptome analysis are strong in the right context and when properly powered, only a few are designed to scale down to individuals. For many DEG detection methods, this failure to scale down to a single subject is an inherent limitation of the underlying mathematical constructs, as they rely on a minimum of three replicates to assess gene-level variance, overdispersion and/or other parameters requiring multiple subjects. Under most experimental designs, cross-sample replicates are used, though triplicate samples from the same individual could potentially be used as a proxy when resources allow. Although the cost of high-throughput sequencing has been declining, it is still resource-prohibitive to sequence multiple samples, especially when sample procurement is invasive.
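As an illustration of how a method can avoid the replicate requirement, the rank-comparison idea summarized for RankComp in Table 2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the published implementation: the function names (`stable_pairs`, `fisher_exact_two_sided`, `single_subject_test`), the toy expression values, and in particular the construction of the 2×2 table passed to Fisher's exact test (counts of partner genes below and above the gene of interest, in the normal reference versus the case sample) are all hypothetical choices made for this example. Comparing raw expression values within a sample is used here as a stand-in for comparing within-sample ranks, since both give the same ordering.

```python
from itertools import combinations
from math import comb


def stable_pairs(normals, min_fraction=0.99):
    """Return (hi, lo) gene-index pairs whose ordering hi > lo holds in at
    least `min_fraction` of the normal samples."""
    n_samples = len(normals)
    pairs = []
    for i, j in combinations(range(len(normals[0])), 2):
        i_above = sum(s[i] > s[j] for s in normals) / n_samples
        j_above = sum(s[j] > s[i] for s in normals) / n_samples
        if i_above >= min_fraction:
            pairs.append((i, j))
        elif j_above >= min_fraction:
            pairs.append((j, i))
    return pairs


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum over all tables that are no more probable than the observed one.
    return min(1.0, sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-9)))


def single_subject_test(pairs, case, gene):
    """Return (P-value, direction) for one gene in one case sample."""
    higher = lower = up = down = 0
    for hi, lo in pairs:
        reversed_in_case = case[hi] < case[lo]
        if gene == hi:        # gene sat above its partner in the normals
            higher += 1
            down += reversed_in_case
        elif gene == lo:      # gene sat below its partner in the normals
            lower += 1
            up += reversed_in_case
    # Assumed 2x2 table: partners below/above the gene, normal vs case.
    p_value = fisher_exact_two_sided(higher, lower,
                                     higher - down + up, lower - up + down)
    direction = "up" if up > down else "down" if down > up else "none"
    return p_value, direction


# Toy data (hypothetical): five normal samples with a consistent ordering of
# six genes, and one case sample in which gene 0 jumps to the top.
normals = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.2, 2.1, 3.3, 4.1, 5.4, 6.2],
    [0.9, 1.8, 2.7, 4.4, 5.1, 5.9],
    [1.1, 2.3, 3.1, 3.9, 4.8, 6.5],
    [0.8, 2.2, 2.9, 4.2, 5.2, 6.1],
]
case = [10.0, 2.0, 3.0, 4.0, 5.0, 6.0]

pairs = stable_pairs(normals)
p, d = single_subject_test(pairs, case, gene=0)
print(f"gene 0: P = {p:.4f}, direction = {d}")  # gene 0: P = 0.0079, direction = up
```

In this toy case all five stable pairs involving gene 0 are reversed in the case sample, so the test flags it, whereas gene 1 takes part in only one reversal and is not flagged. No variance estimate is needed, which is why an approach of this kind can work with a single case sample, at the cost of requiring a reference set of normal samples.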
Other conventional approaches for analyzing transcriptomes exploit curated knowledge of a particular disease to specifically examine validated or hypothesized markers whose gene expression differs from the reference ‘normal’ or is expressed above a predetermined threshold. This is the case for Oncotype DX™ [29], PAM50 [30] and other clinically available tests that classify samples into tumor subtypes. Reliance on a predefined panel of genes dodges the problems of dimensionality and signal-to-noise detection in raw transcriptome data, but limits scalability across multiple characteristics of a disease and prevents the investigation of novel transcripts and disease mechanisms. To address these issues, other clustering-based techniques can be applied to gather patterned genes across/within samples or data sets (for a review, see [31]) and to obtain classifiers which can then be explored for within-group commonalities and cross-group differences [32]. However, they require a large number of samples, as well as careful external validation in large data sets that have adequate protection from bias and have been reviewed elsewhere [33]. Single-subject transcriptome analyses In the context of SSA, several studies have been proposed for extracting relevant biological knowledge from transcriptome data without the large cohort requirement. These approaches can be divided into different categories based on either (i) the number of samples from the same subject they require or (ii) the type of output they provide. As illustrated in Figure 3, single-subject studies can be categorized into GC (DEGs; Figure 4 and Table 2) or PC (DEPs; Figure 5). Based on this classification, we reported the related studies in ‘DEGs identification in single subjects’ and ‘DEP identification in single subjects’ sections, respectively. Table 3 Additional details on single-subject transcriptome analyses of DEPs Publication . Name . Description . Wang et al. 
[41] IndividPath IndividPath computes REOs from a pathway point of view reducing the dimension of the sample representation. Patient-specific DEPs of a sample are obtained by applying a similar procedure to RankComp [35], in which REOs in an individual sample are compared with the highly stable REOs identified from a large cohort of normal samples. The authors identify the biological pathways with significantly disrupted ordering of gene expression via P-values. In this case, P-values are determined by testing whether the frequency of reversal gene pairs observed in a sample within each pathway is significantly greater than that expected by chance using the hypergeometric distribution model (i.e. a Fisher’s exact test) Drier et al. [43] Pathifier Pathifier has been developed to compute PDSs for cancer tumor samples by aggregating gene-level information into pathway-level information, providing meaningful dimension reduction. Pathifier analyzes one pathway at a time and assigns a PDS to each sample by using the expression levels of the genes belonging to the pathway. To calculate PDSs, a PCA is performed to reduce the dimensions and capture the variation of the data. Next, the method identifies the best principal curve using both cohort samples (normal and disease). Then, the PDS of a sample is obtained by computing the distance of a single sample from the median of the normal samples on the principal curve. The output of this approach is therefore a list of DEPs for each sample representing the level of deregulation of each pathway Ahn et al. [42] iPAS iPAS provide gene-level statistics (i.e. Z-score) by standardizing the gene expression level of the disease sample with the mean and the standard deviation of the normal samples. Z-scores are used as inputs to calculate iPAS for the disease sample, for example, using the average of the Z-scores in a pathway. 
iPAS is then computed for every normal sample to construct a null distribution, which assesses the significance of disease iPAS’s deviation from the normal reference. Yang et al. [44] FAIME The FAIME transforms a vector of mRNA quantification into pathway-level metrics derived from a single biological sample. Each mRNA is annotated to a gene, and genes are annotated to gene sets via knowledge base integration. Every pathway receives a score that quantifies the ‘average’ over-expression of genes within the pathway, when compared with genes in background (not in the pathway). This process provides mechanism-level interpretation to a single transcriptome. Barbie et al. [45] ssGSEA ssGSEA uses the difference in empirical cumulative distribution functions of gene expression ranks inside and outside a gene set (i.e. pathway) to calculate an enrichment statistic per sample, akin to the FAIME methodology described above. The procedure adopted is similar to GSEA [21] except that ssGSEA uses gene expression intensity at the single sample level to compute enrichment scores Gardeux et al. [46] N-of-1 pathways Wilcoxon This method aggregates gene expression values from two paried samples into gene sets provided by external knowledge sources (e.g. GO, KEGG). Each externally defined gene set is assessed for differential expression using the nonparametric analog of a paired t-test, the Wilcoxon signed-rank test. The result is a metric of pathway-level dysregulation in the form of either a P-value or corresponding signed z-score (sign indicates whether the case sample is upregulated or downregulated compared with baseline sample). Computing such a metric across all pathways in an ontology provides a mechanistically anchored profile of personal transcriptome dysregulation for each patient Schissler et al. [47] N-of-1 pathways MD N-of-1-pathways MD seeks to improve the differential expression testing component of the framework introduced by Gardeux et al. [58]. 
The rationale behind using the statistical generalization of distance is to incorporate the observed covariance structure between the two paired samples (as they are derived from the same patient). Briefly, the average log2 fold-change of expression within the pathway is adjusted using components of the variance–covariance matrix. Then, a nonparametric bootstrap is performed to estimate the standard error of the pathway average expression. This provides pathway metrics that are more clinically relevant than a Wilcoxon test statistic and simulation studies showed increased power under the MD framework Schissler et al. [48] ClusterT The Cluster-T is yet another improvement to the differential test procedure of N-of-1-pathways. It was shown that under nontrivial inter-genetic correlation, the bootstrapping procedure of the MD failed to produce adequate estimates of the standard error of the average log2 fold-change of expression. This problem proved to be challenging without bringing in external knowledge of context-specific gene–gene correlation. With this external knowledge, genes are clustered within pathways and, under certain assumptions, the test statistic was shown to follow a t-distribution with degrees of freedom dependent on the number of clusters. In novel multivariate gene expression simulations, the Clustered-T showed far superior performance in false-positive rates Li et al. [50] N-of-1-pathways MixEnrich N-of-1 pathways MixEnrich improves both N-of-1 pathways Wilcoxon and MD by detecting DEPs when they are bidirectionally dysregulated and/or background noise is present. Both Wilcoxon and MD are not designed to detect dysregulated pathways with upregulated and downregulated genes (bidirectional dysregulation), which are ubiquitous in biological systems. MixEnrich identifies bidirectional dysregulation by first clustering genes into upregulated, downregulated and unaltered genes. 
Subsequently, MixEnrich identifies pathways enriched with upregulated and/or downregulated transcripts. The enrichment test performed by MixEnrich detects only pathways with a significantly higher proportion of dysregulated genes with respect to the background. It is therefore more robust in presence of background noise (i.e. a large number of dysregulated genes unrelated to the phenotype) Li et al. [49] N-of-1-pathways kMEn N-of-1 pathways kMEn further improves the N-of-1 pathways MixEnrich method by using a nonparametric model (i.e. k-means clustering) to cluster genes into upregulated, downregulated and unaltered clusters. The distribution of log2 fold-change of gene expression is complex and may vary from experiment to experiment. Hence, a nonparametric model might be more flexible to model that distribution Publication . Name . Description . Wang et al. [41] IndividPath IndividPath computes REOs from a pathway point of view reducing the dimension of the sample representation. Patient-specific DEPs of a sample are obtained by applying a similar procedure to RankComp [35], in which REOs in an individual sample are compared with the highly stable REOs identified from a large cohort of normal samples. The authors identify the biological pathways with significantly disrupted ordering of gene expression via P-values. In this case, P-values are determined by testing whether the frequency of reversal gene pairs observed in a sample within each pathway is significantly greater than that expected by chance using the hypergeometric distribution model (i.e. a Fisher’s exact test) Drier et al. [43] Pathifier Pathifier has been developed to compute PDSs for cancer tumor samples by aggregating gene-level information into pathway-level information, providing meaningful dimension reduction. Pathifier analyzes one pathway at a time and assigns a PDS to each sample by using the expression levels of the genes belonging to the pathway. 
To calculate PDSs, a PCA is performed to reduce the dimensions and capture the variation of the data. Next, the method identifies the best principal curve using both cohort samples (normal and disease). Then, the PDS of a sample is obtained by computing the distance of a single sample from the median of the normal samples on the principal curve. The output of this approach is therefore a list of DEPs for each sample representing the level of deregulation of each pathway Ahn et al. [42] iPAS iPAS provide gene-level statistics (i.e. Z-score) by standardizing the gene expression level of the disease sample with the mean and the standard deviation of the normal samples. Z-scores are used as inputs to calculate iPAS for the disease sample, for example, using the average of the Z-scores in a pathway. iPAS is then computed for every normal sample to construct a null distribution, which assesses the significance of disease iPAS’s deviation from the normal reference. Yang et al. [44] FAIME The FAIME transforms a vector of mRNA quantification into pathway-level metrics derived from a single biological sample. Each mRNA is annotated to a gene, and genes are annotated to gene sets via knowledge base integration. Every pathway receives a score that quantifies the ‘average’ over-expression of genes within the pathway, when compared with genes in background (not in the pathway). This process provides mechanism-level interpretation to a single transcriptome. Barbie et al. [45] ssGSEA ssGSEA uses the difference in empirical cumulative distribution functions of gene expression ranks inside and outside a gene set (i.e. pathway) to calculate an enrichment statistic per sample, akin to the FAIME methodology described above. The procedure adopted is similar to GSEA [21] except that ssGSEA uses gene expression intensity at the single sample level to compute enrichment scores Gardeux et al. 
[46] N-of-1 pathways Wilcoxon This method aggregates gene expression values from two paried samples into gene sets provided by external knowledge sources (e.g. GO, KEGG). Each externally defined gene set is assessed for differential expression using the nonparametric analog of a paired t-test, the Wilcoxon signed-rank test. The result is a metric of pathway-level dysregulation in the form of either a P-value or corresponding signed z-score (sign indicates whether the case sample is upregulated or downregulated compared with baseline sample). Computing such a metric across all pathways in an ontology provides a mechanistically anchored profile of personal transcriptome dysregulation for each patient Schissler et al. [47] N-of-1 pathways MD N-of-1-pathways MD seeks to improve the differential expression testing component of the framework introduced by Gardeux et al. [58]. The rationale behind using the statistical generalization of distance is to incorporate the observed covariance structure between the two paired samples (as they are derived from the same patient). Briefly, the average log2 fold-change of expression within the pathway is adjusted using components of the variance–covariance matrix. Then, a nonparametric bootstrap is performed to estimate the standard error of the pathway average expression. This provides pathway metrics that are more clinically relevant than a Wilcoxon test statistic and simulation studies showed increased power under the MD framework Schissler et al. [48] ClusterT The Cluster-T is yet another improvement to the differential test procedure of N-of-1-pathways. It was shown that under nontrivial inter-genetic correlation, the bootstrapping procedure of the MD failed to produce adequate estimates of the standard error of the average log2 fold-change of expression. This problem proved to be challenging without bringing in external knowledge of context-specific gene–gene correlation. 
With this external knowledge, genes are clustered within pathways and, under certain assumptions, the test statistic was shown to follow a t-distribution with degrees of freedom dependent on the number of clusters. In novel multivariate gene expression simulations, the Clustered-T showed far superior performance in false-positive rates Li et al. [50] N-of-1-pathways MixEnrich N-of-1 pathways MixEnrich improves both N-of-1 pathways Wilcoxon and MD by detecting DEPs when they are bidirectionally dysregulated and/or background noise is present. Both Wilcoxon and MD are not designed to detect dysregulated pathways with upregulated and downregulated genes (bidirectional dysregulation), which are ubiquitous in biological systems. MixEnrich identifies bidirectional dysregulation by first clustering genes into upregulated, downregulated and unaltered genes. Subsequently, MixEnrich identifies pathways enriched with upregulated and/or downregulated transcripts. The enrichment test performed by MixEnrich detects only pathways with a significantly higher proportion of dysregulated genes with respect to the background. It is therefore more robust in presence of background noise (i.e. a large number of dysregulated genes unrelated to the phenotype) Li et al. [49] N-of-1-pathways kMEn N-of-1 pathways kMEn further improves the N-of-1 pathways MixEnrich method by using a nonparametric model (i.e. k-means clustering) to cluster genes into upregulated, downregulated and unaltered clusters. The distribution of log2 fold-change of gene expression is complex and may vary from experiment to experiment. Hence, a nonparametric model might be more flexible to model that distribution REOs = Relative expression orderings. Open in new tab Table 3 Additional details on single-subject transcriptome analyses of DEPs Publication . Name . Description . Wang et al. [41] IndividPath IndividPath computes REOs from a pathway point of view reducing the dimension of the sample representation. 
Patient-specific DEPs of a sample are obtained by applying a similar procedure to RankComp [35], in which REOs in an individual sample are compared with the highly stable REOs identified from a large cohort of normal samples. The authors identify the biological pathways with significantly disrupted ordering of gene expression via P-values. In this case, P-values are determined by testing whether the frequency of reversal gene pairs observed in a sample within each pathway is significantly greater than that expected by chance using the hypergeometric distribution model (i.e. a Fisher’s exact test) Drier et al. [43] Pathifier Pathifier has been developed to compute PDSs for cancer tumor samples by aggregating gene-level information into pathway-level information, providing meaningful dimension reduction. Pathifier analyzes one pathway at a time and assigns a PDS to each sample by using the expression levels of the genes belonging to the pathway. To calculate PDSs, a PCA is performed to reduce the dimensions and capture the variation of the data. Next, the method identifies the best principal curve using both cohort samples (normal and disease). Then, the PDS of a sample is obtained by computing the distance of a single sample from the median of the normal samples on the principal curve. The output of this approach is therefore a list of DEPs for each sample representing the level of deregulation of each pathway Ahn et al. [42] iPAS iPAS provide gene-level statistics (i.e. Z-score) by standardizing the gene expression level of the disease sample with the mean and the standard deviation of the normal samples. Z-scores are used as inputs to calculate iPAS for the disease sample, for example, using the average of the Z-scores in a pathway. iPAS is then computed for every normal sample to construct a null distribution, which assesses the significance of disease iPAS’s deviation from the normal reference. Yang et al. 
[44] FAIME The FAIME transforms a vector of mRNA quantification into pathway-level metrics derived from a single biological sample. Each mRNA is annotated to a gene, and genes are annotated to gene sets via knowledge base integration. Every pathway receives a score that quantifies the ‘average’ over-expression of genes within the pathway, when compared with genes in background (not in the pathway). This process provides mechanism-level interpretation to a single transcriptome. Barbie et al. [45] ssGSEA ssGSEA uses the difference in empirical cumulative distribution functions of gene expression ranks inside and outside a gene set (i.e. pathway) to calculate an enrichment statistic per sample, akin to the FAIME methodology described above. The procedure adopted is similar to GSEA [21] except that ssGSEA uses gene expression intensity at the single sample level to compute enrichment scores Gardeux et al. [46] N-of-1 pathways Wilcoxon This method aggregates gene expression values from two paried samples into gene sets provided by external knowledge sources (e.g. GO, KEGG). Each externally defined gene set is assessed for differential expression using the nonparametric analog of a paired t-test, the Wilcoxon signed-rank test. The result is a metric of pathway-level dysregulation in the form of either a P-value or corresponding signed z-score (sign indicates whether the case sample is upregulated or downregulated compared with baseline sample). Computing such a metric across all pathways in an ontology provides a mechanistically anchored profile of personal transcriptome dysregulation for each patient Schissler et al. [47] N-of-1 pathways MD N-of-1-pathways MD seeks to improve the differential expression testing component of the framework introduced by Gardeux et al. [58]. The rationale behind using the statistical generalization of distance is to incorporate the observed covariance structure between the two paired samples (as they are derived from the same patient). 
Briefly, the average log2 fold-change of expression within the pathway is adjusted using components of the variance–covariance matrix. Then, a nonparametric bootstrap is performed to estimate the standard error of the pathway average expression. This provides pathway metrics that are more clinically relevant than a Wilcoxon test statistic and simulation studies showed increased power under the MD framework Schissler et al. [48] ClusterT The Cluster-T is yet another improvement to the differential test procedure of N-of-1-pathways. It was shown that under nontrivial inter-genetic correlation, the bootstrapping procedure of the MD failed to produce adequate estimates of the standard error of the average log2 fold-change of expression. This problem proved to be challenging without bringing in external knowledge of context-specific gene–gene correlation. With this external knowledge, genes are clustered within pathways and, under certain assumptions, the test statistic was shown to follow a t-distribution with degrees of freedom dependent on the number of clusters. In novel multivariate gene expression simulations, the Clustered-T showed far superior performance in false-positive rates Li et al. [50] N-of-1-pathways MixEnrich N-of-1 pathways MixEnrich improves both N-of-1 pathways Wilcoxon and MD by detecting DEPs when they are bidirectionally dysregulated and/or background noise is present. Both Wilcoxon and MD are not designed to detect dysregulated pathways with upregulated and downregulated genes (bidirectional dysregulation), which are ubiquitous in biological systems. MixEnrich identifies bidirectional dysregulation by first clustering genes into upregulated, downregulated and unaltered genes. Subsequently, MixEnrich identifies pathways enriched with upregulated and/or downregulated transcripts. The enrichment test performed by MixEnrich detects only pathways with a significantly higher proportion of dysregulated genes with respect to the background. 
It is therefore more robust in presence of background noise (i.e. a large number of dysregulated genes unrelated to the phenotype) Li et al. [49] N-of-1-pathways kMEn N-of-1 pathways kMEn further improves the N-of-1 pathways MixEnrich method by using a nonparametric model (i.e. k-means clustering) to cluster genes into upregulated, downregulated and unaltered clusters. The distribution of log2 fold-change of gene expression is complex and may vary from experiment to experiment. Hence, a nonparametric model might be more flexible to model that distribution Publication . Name . Description . Wang et al. [41] IndividPath IndividPath computes REOs from a pathway point of view reducing the dimension of the sample representation. Patient-specific DEPs of a sample are obtained by applying a similar procedure to RankComp [35], in which REOs in an individual sample are compared with the highly stable REOs identified from a large cohort of normal samples. The authors identify the biological pathways with significantly disrupted ordering of gene expression via P-values. In this case, P-values are determined by testing whether the frequency of reversal gene pairs observed in a sample within each pathway is significantly greater than that expected by chance using the hypergeometric distribution model (i.e. a Fisher’s exact test) Drier et al. [43] Pathifier Pathifier has been developed to compute PDSs for cancer tumor samples by aggregating gene-level information into pathway-level information, providing meaningful dimension reduction. Pathifier analyzes one pathway at a time and assigns a PDS to each sample by using the expression levels of the genes belonging to the pathway. To calculate PDSs, a PCA is performed to reduce the dimensions and capture the variation of the data. Next, the method identifies the best principal curve using both cohort samples (normal and disease). 
Then, the PDS of a sample is obtained by computing the distance of a single sample from the median of the normal samples on the principal curve. The output of this approach is therefore a list of DEPs for each sample representing the level of deregulation of each pathway Ahn et al. [42] iPAS iPAS provide gene-level statistics (i.e. Z-score) by standardizing the gene expression level of the disease sample with the mean and the standard deviation of the normal samples. Z-scores are used as inputs to calculate iPAS for the disease sample, for example, using the average of the Z-scores in a pathway. iPAS is then computed for every normal sample to construct a null distribution, which assesses the significance of disease iPAS’s deviation from the normal reference. Yang et al. [44] FAIME The FAIME transforms a vector of mRNA quantification into pathway-level metrics derived from a single biological sample. Each mRNA is annotated to a gene, and genes are annotated to gene sets via knowledge base integration. Every pathway receives a score that quantifies the ‘average’ over-expression of genes within the pathway, when compared with genes in background (not in the pathway). This process provides mechanism-level interpretation to a single transcriptome. Barbie et al. [45] ssGSEA ssGSEA uses the difference in empirical cumulative distribution functions of gene expression ranks inside and outside a gene set (i.e. pathway) to calculate an enrichment statistic per sample, akin to the FAIME methodology described above. The procedure adopted is similar to GSEA [21] except that ssGSEA uses gene expression intensity at the single sample level to compute enrichment scores Gardeux et al. [46] N-of-1 pathways Wilcoxon This method aggregates gene expression values from two paried samples into gene sets provided by external knowledge sources (e.g. GO, KEGG). 
Each externally defined gene set is assessed for differential expression using the nonparametric analog of a paired t-test, the Wilcoxon signed-rank test. The result is a metric of pathway-level dysregulation in the form of either a P-value or corresponding signed z-score (sign indicates whether the case sample is upregulated or downregulated compared with baseline sample). Computing such a metric across all pathways in an ontology provides a mechanistically anchored profile of personal transcriptome dysregulation for each patient Schissler et al. [47] N-of-1 pathways MD N-of-1-pathways MD seeks to improve the differential expression testing component of the framework introduced by Gardeux et al. [58]. The rationale behind using the statistical generalization of distance is to incorporate the observed covariance structure between the two paired samples (as they are derived from the same patient). Briefly, the average log2 fold-change of expression within the pathway is adjusted using components of the variance–covariance matrix. Then, a nonparametric bootstrap is performed to estimate the standard error of the pathway average expression. This provides pathway metrics that are more clinically relevant than a Wilcoxon test statistic and simulation studies showed increased power under the MD framework Schissler et al. [48] ClusterT The Cluster-T is yet another improvement to the differential test procedure of N-of-1-pathways. It was shown that under nontrivial inter-genetic correlation, the bootstrapping procedure of the MD failed to produce adequate estimates of the standard error of the average log2 fold-change of expression. This problem proved to be challenging without bringing in external knowledge of context-specific gene–gene correlation. With this external knowledge, genes are clustered within pathways and, under certain assumptions, the test statistic was shown to follow a t-distribution with degrees of freedom dependent on the number of clusters. 
In novel multivariate gene expression simulations, the Clustered-T showed far superior performance in false-positive rates.
Li et al. [50] | N-of-1-pathways MixEnrich | N-of-1 pathways MixEnrich improves both N-of-1 pathways Wilcoxon and MD by detecting DEPs when they are bidirectionally dysregulated and/or background noise is present. Both Wilcoxon and MD are not designed to detect dysregulated pathways with upregulated and downregulated genes (bidirectional dysregulation), which are ubiquitous in biological systems. MixEnrich identifies bidirectional dysregulation by first clustering genes into upregulated, downregulated and unaltered genes. Subsequently, MixEnrich identifies pathways enriched with upregulated and/or downregulated transcripts. The enrichment test performed by MixEnrich detects only pathways with a significantly higher proportion of dysregulated genes with respect to the background. It is therefore more robust in presence of background noise (i.e. a large number of dysregulated genes unrelated to the phenotype).
Li et al. [49] | N-of-1-pathways kMEn | N-of-1 pathways kMEn further improves the N-of-1 pathways MixEnrich method by using a nonparametric model (i.e. k-means clustering) to cluster genes into upregulated, downregulated and unaltered clusters. The distribution of log2 fold-change of gene expression is complex and may vary from experiment to experiment. Hence, a nonparametric model might be more flexible to model that distribution.
REOs = Relative expression orderings.

Table 4 Summary of the method validation in single subjects

| Publication | Method | In silico validation | Real dataset validation | Independent dataset validation | In vitro validation | In vivo validation | Clinical trial validation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transcriptome | | | | | | | |
| Gardeux et al. [46] | N-of-1 pathways W | • | • | ⊘ | • | ⊘ | • |
| Wang et al. [36] | DEGseq | • | • | • | ⊘ | ⊘ | ⊘ |
| Anders et al. [39] | DESeq | • | • | • | ⊘ | ⊘ | ⊘ |
| Feng et al. [38] | GFOLD | • | • | • | ⊘ | ⊘ | ⊘ |
| Wang et al. [34] | RankComp | • | • | • | ⊘ | ⊘ | ⊘ |
| Yang et al. [44] | FAIME | • | • | • | ⊘ | ⊘ | ⊘ |
| Drier et al. [43] | Pathifier | ⊘ | • | • | ⊘ | ⊘ | ⊘ |
| Li et al. [50] | N-of-1 pathways MixEnrich | • | • | ⊘ | ⊘ | ⊘ | ⊘ |
| Li et al. [49] | N-of-1-pathways kMEn | • | • | ⊘ | ⊘ | ⊘ | ⊘ |
| Schissler et al. [47] | N-of-1 pathways MD | • | • | ⊘ | ⊘ | ⊘ | ⊘ |
| Liu et al. [35] | DNB | • | • | ⊘ | ⊘ | ⊘ | ⊘ |
| Wang et al. [41] | IndividPath | ⊘ | • | ⊘ | ⊘ | ⊘ | ⊘ |
| Ahn et al. [42] | iPAS | ⊘ | • | ⊘ | ⊘ | ⊘ | ⊘ |
| Schissler et al. [48] | ClusterT | • | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ |
| Tarazona et al. [37] | NOISeq | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ |
| Robinson et al. [17] | edgeR | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ |
| Wu et al. [40] | FPCA | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ |
| Barbie et al. [45] | ssGSEA | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ |
| Martini et al. [51] | timeClip | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ |
| Multi-omics | | | | | | | |
| Vaske et al. [69] | PARADIGM | • | • | ⊘ | ⊘ | ⊘ | ⊘ |
| Chen et al. [68] | iPOP | ⊘ | • | ⊘ | ⊘ | ⊘ | ⊘ |

Figure 3 Current strategies to analyze single-subject transcriptomes.
Analyses of single-subject transcriptomes can usually be divided into three categories based on the required number of samples: (i) single-sample analyses, (ii) paired-sample analyses or (iii) analyses of more samples (not shown). They can also be categorized according to their outputs: (i) Differentially Expressed Genes (DEGs), (ii) Differentially Expressed Pathways (DEPs) or (iii) Disease Scores (DSs). Note: DEP* = not a true DEP, but rather a relative expression level of the pathways, because there is no reference or baseline against which to compare the pathway expression of a single sample.
Figure 4 Summary of single-subject methods that analyze transcriptome data to identify DEGs. Note: Additional details are available in Table 2.
Figure 5 Summary of single-subject methods that analyze transcriptome data to identify DEPs.
Figure 6 Summary of single-subject methods that analyze transcriptome data combined with other -omics.
The selected studies can be further divided according to the number of samples involved: (i) analysis of single samples (top of Figure 3), (ii) analyses of single individuals using paired samples from the same subject (bottom of Figure 3) and (iii) multiple measurements in single subjects (not shown in Figure 3). This last class of methods is reported in the ‘Longitudinal time series analyses of transcriptome’ section. The utility of single-subject discovery of differentially expressed patterns is central to precision medicine. For example, implicating DEGs in a patient may identify an unconventional treatment (i.e. personal drug repositioning) for this disease, assuming that these DEGs are well-established targets of the drug [52] in another related disease state (e.g. cancer targets). On the other hand, if the aim of the study is to investigate a disease or a particular condition from a broader point of view and to promote greater interpretation of the gene expression results, DEP analyses should be preferred. For example, Figure 5 refers to methods directly imputing DEPs from the transcriptome, i.e. PC approaches. To the best of our knowledge, these methods have not been compared with enriching DEGs into DEPs using methods from Figure 4. However, these DEP methods have been evaluated in vitro and in vivo as shown in the last section of the article, and thus remain the validated strategy for imputing DEPs until they are properly compared with single-subject DEGs followed by enrichment. Another key difference in the analysis of single-subject transcriptomes is the number of samples required from the subject.
In addition, we have found that successful single-sample methods require not only the individual sample but also a cohort reference (‘reference-based’) to perform comparisons and to detect DEGs or DEPs (Figures 4 and 5, column ‘Cohort reference size’). This type of strategy is particularly useful when matched normal and disease samples are unavailable or limited (e.g. brain or heart tissue samples). Analysis of individuals using paired samples naturally requires that both samples be drawn from the same subject. As the samples are isogenic aside from potential somatic variation and/or taken from the same tissue and environmental context aside from any experimentally induced stimuli, this design increases the signal-to-noise ratio and improves the detection of relevant DEGs or pathways. For example, studying both tumor and non-tumor tissues from a cancer patient focuses attention on pathogenic and compensatory mechanisms that differentiate the two tissues because of the disease state. While we review the methods that strive to mine the most information from limited data (e.g. a pair of transcriptomes of a single subject), investigators need to be cautious that methods do not replace data [53]. An additional aspect we underlined is whether the considered publications require user-defined parameters or heuristics (Column 8 in Figures 4 and 5). Automated methods, not requiring user-defined parameters, are considered superior, as they are less biased and more convenient. The following subsections will focus on methods for imputing single-subject DEGs (‘DEGs identification in single subjects’ section) and then single-subject DEPs (‘DEP identification in single subjects’ section).

DEGs identification in single subjects

In this section, we outline and describe emerging studies aimed at extracting DEGs starting from: (i) one sample of the individual and (ii) paired samples drawn from the same subject.
A detailed description of the methods is provided in Figure 4 and Table 2.

Approaches based on a single sample of an individual

We identified two single-sample methods [34, 35] designed to inform on an individual’s transcriptome aberrations. Both methods require the application of a cohort reference, but differ in their predicted outputs. The first one, called RankComp [34], identifies DEGs by comparing the gene expression of the affected sample with a baseline, akin to a reference genome or normal range for clinical testing. RankComp has been applied separately to both total mRNA and microRNA (miRNA) investigations [54]. In the miRNA study, the authors demonstrated the power of their method to identify deregulation of miRNAs and miRNA–target pairs with mutually exclusive alterations. This approach has the limitation of not being sensitive enough to detect genes whose differential expression causes only minor changes in the ranking. The second method [35], DNB (Figure 4), differs from RankComp in that it predicts critical disease transition from one sample of an individual by comparing it with multiple control samples (from other data sets). This type of approach is particularly interesting for investigating individual profiles and classifying them as healthy, pre-disease or disease state.

Approaches based on paired samples of an individual

Although DEG identification often requires a large cohort of samples, a few attempts have been made to identify DEGs from only a pair of transcriptomes. These methods provide an opportunity to identify a set of personalized DEGs of a single subject without requiring costly transcriptome replicates. Among these methods, DESeq [39] and edgeR [17] were originally designed as cohort-based methods (Figure 4, column ‘Designed for ss’), but have wide applications. When replicates are not available, these two methods can still be applied.
Without replicates, DESeq is conservative, as it assumes that the majority of genes are non-DEGs and estimates a mean–variance relationship by treating the two samples as if they were replicates. edgeR’s performance relies on the investigators’ understanding of the study, as a parameter in the model is predetermined by the biological nature of the samples. DEGseq [36] is designed for discovering DEGs from only a pair of transcriptomes; yet, its assumption of a binomial distribution of RNA-Seq data is insufficient when overdispersion in gene expression is present. NOISeq-sim [37] simulates replicates when real replicates do not exist. With the simulated replicates, NOISeq-sim generates a joint null distribution of fold-changes (M) and absolute differences (D) of the expression counts from the replicates within the same condition. This joint null distribution is then used to assess differential expression by a gene’s (M, D) pair computed between conditions. Finally, GFOLD [38] is another method designed for transcriptome analysis without replicates; it provides biologically meaningful gene ranks of differential expression, but no significance assessment.

DEPs identification in single subjects

In this subsection, we report other methods that create biologically interpretable results from a single subject’s transcriptome, bypassing detection of significant differences in gene-level expression to go directly to pathway-level signals (DEPs). Such analyses aim to promote a higher-level interpretation of the underlying gene expression data, providing a holistic view of pathway perturbation, instead of focusing attention on any particular gene. All the approaches belonging to this category incorporate a large body of prior biological knowledge (e.g. pathway knowledge sources such as KEGG [18, 55]). This allows researchers to reduce the dimension of a transcriptome-wide gene list (∼22k in human) to a much smaller set (e.g.
∼5000 GO-BP terms), which is then analyzed according to term or pathway overrepresentation or other involvement. This dimension reduction has been shown to improve the prediction of prognosis and therapies [56, 57]. A detailed description of the methods is provided in Figure 5 and Table 3.

Approaches based on a single sample of an individual

We identified three methods that require a single sample of an individual and a cohort reference (individPath [41], Pathifier [43], individualized pathway aberrance score, iPAS [42]) and two approaches capable of extracting DEPs from within an individual’s transcriptome without external comparison (single-sample GSEA, ssGSEA [45], Functional Analysis of Individual Microarray Expression, FAIME [44]) (Figure 5). A detailed description of the methodologies used by these studies is provided in Table 3. Each of the reference-based methods begins by aggregating gene-level information into pathway-level information, providing meaningful dimension reduction, and then applies statistical analyses directly at the pathway level. The first method, individPath, uses relative expression orderings to directly stratify patients based on individual deregulated pathway status. The authors showed that individPath could identify pathway biomarkers in lung adenocarcinoma and breast cancer data sets that were detected at the individual level, shared across patients and correlated with survival. The second method is Pathifier, which computes pathway deregulation scores (PDSs; Table 3) for SSA using principal component analysis (PCA) and curve fitting. Drier et al. [43] showed how PDSs successfully reflect deregulation of pathways in glioblastoma and colorectal cancer data sets and could provide clinically relevant stratification of patients.
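To make the idea of a pathway deregulation score concrete, the following minimal Python sketch scores samples along the main axis of variation within one pathway. It is an illustration only and not the Pathifier implementation: the principal curve is replaced here by a straight line (the first principal component, approximated by power iteration), and the function names `leading_component` and `deregulation_scores`, the list-of-lists input layout and the iteration count are assumptions made for this example.

```python
import math
from statistics import median


def leading_component(centered, n_iter=200):
    """Approximate the first principal direction of centered data by power iteration."""
    n_genes = len(centered[0])
    # Deterministic, non-uniform start vector (an arbitrary choice for this sketch).
    v = [float(i + 1) for i in range(n_genes)]
    for _ in range(n_iter):
        # w = X^T (X v): the (unscaled) covariance matrix applied to v.
        proj = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance at all: keep the current direction
            break
        v = [x / norm for x in w]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def deregulation_scores(samples, is_normal):
    """Distance of each sample from the median normal sample along the leading axis.

    samples:   one list of expression values per sample, restricted to the
               genes of a single pathway
    is_normal: one boolean per sample, True for reference (normal) samples
    """
    n_genes = len(samples[0])
    means = [sum(s[j] for s in samples) / len(samples) for j in range(n_genes)]
    centered = [[s[j] - means[j] for j in range(n_genes)] for s in samples]
    axis = leading_component(centered)
    # Position of every sample along the axis (a straight-line stand-in for
    # the position along a principal curve).
    positions = [sum(x * y for x, y in zip(row, axis)) for row in centered]
    reference = median(p for p, normal in zip(positions, is_normal) if normal)
    return [abs(p - reference) for p in positions]
```

Under these assumptions, samples whose pathway expression lies far from the bulk of the normal samples receive larger scores; a real principal curve would additionally follow nonlinear trends in the data.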
Pathifier has also been successfully applied to provide a classification of breast cancer subtype [59] and to perform a personalized analysis for understanding the status of homologous recombination pathway dysregulation in breast cancer [60]. Finally, an additional method proposed by Ahn et al. [42] quantifies the aberrance of an individual sample’s pathways by comparison with accumulated normal data. The authors provide gene-level statistics (i.e. Z-score) by standardizing the gene expression level of the disease sample with the mean and the SD of the normal samples. DEP approaches requiring a cohort reference, such as iPAS, Pathifier and individPath, are constrained by (i) the number of available normal samples (power), (ii) platform dependencies and (iii) limited sensitivity to detect pathways that contain only a few genes. The large number of normal cohort samples required limits the applicability of these methods in infrequent diseases, or when obtaining appropriate samples and/or defining an appropriate ‘normal’ state is complex. Moreover, the reference cohort may be heterogeneous, and pooling together normal samples means that the transcriptomes of different individuals are merged, which can obscure stratification and correlation patterns in the normal data. Because of these limitations, other methods have been proposed to circumvent the normal reference requirement using solely a sample from the subject under study. These strategies aim to reduce dimension by injecting domain knowledge while reducing gene-level noise inherent to a single case sample. Two such methods are FAIME [44] and ssGSEA [45]. Both methods seek to quantify the effect size and statistical significance of consistent overexpression or underexpression of aggregated gene expression within externally defined gene sets, compared with the genes not annotated to the gene set (background).
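The contrast between reference-based and reference-free scoring discussed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published iPAS, FAIME or ssGSEA code: the reference-based score is the plain average of gene-level Z-scores with an empirical P-value taken from the normal samples themselves, and the competitive score is a simple difference of mean ranks rather than the weighting used by FAIME or the cumulative distribution statistic used by ssGSEA. The function names and the dict-based sample layout (gene name mapped to expression value) are hypothetical.

```python
from statistics import mean, stdev


def pathway_z_score(sample, normals, pathway):
    """Average gene-level Z-score of `sample` over the genes in `pathway`.

    Each gene is standardized with the mean and SD of the normal samples.
    Genes with zero SD in the normals are skipped (a choice for this sketch);
    at least one gene must have nonzero SD.
    """
    z_scores = []
    for gene in pathway:
        values = [normal[gene] for normal in normals]
        mu, sd = mean(values), stdev(values)
        if sd > 0:
            z_scores.append((sample[gene] - mu) / sd)
    return mean(z_scores)


def reference_based_score(sample, normals, pathway):
    """Return (score, empirical two-sided P-value) against the normal cohort."""
    observed = pathway_z_score(sample, normals, pathway)
    # Null distribution: the same score computed for every normal sample.
    null = [pathway_z_score(normal, normals, pathway) for normal in normals]
    extreme = sum(1 for score in null if abs(score) >= abs(observed))
    return observed, (extreme + 1) / (len(null) + 1)


def competitive_score(sample, pathway):
    """Mean rank of pathway genes minus mean rank of background genes.

    Uses only the subject's own sample; the result is divided by the number
    of genes so that it lies between -1 and 1.
    """
    ordered = sorted(sample, key=sample.get)  # ascending expression
    rank = {gene: position + 1 for position, gene in enumerate(ordered)}
    inside = [rank[gene] for gene in pathway if gene in rank]
    outside = [r for gene, r in rank.items() if gene not in pathway]
    return (mean(inside) - mean(outside)) / len(rank)
```

Because `competitive_score` sees only one sample, a pathway that is already highly expressed in healthy tissue also scores highly, whereas `reference_based_score` measures departure from the normal cohort at the cost of requiring that cohort.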
In the terminology of Goeman and Buhlmann [61], this framework is ‘competitive’ in the sense that scores reflect relative gene set expression when compared with the background. In this manner, both FAIME and ssGSEA detect aberrant pathway expression for an individual’s sample. The methods differ in implementation, however: FAIME operates on the normalized gene expression, while ssGSEA performs calculations on the ranks. A limitation of these two methods is that they provide a ranking of pathways in terms of their deregulation with respect to other pathways using only the gene expression data of the individual (e.g. more or less expressed than an average expression). Therefore, these methods do not identify functionally altered pathways against a reference as the previous methods do, because a pathway that is more or less expressed than average may simply reflect the normal expected level of expression of that pathway. The output of both ssGSEA and FAIME is a list of DEPs, which allows enhanced functional interpretation of disease-associated biological processes relative to less readily interpretable lengthy lists of DEGs. This approach could be useful when little pathological knowledge is available for the disease or when substantial pathway heterogeneity may underlie the clinical phenotype. In the case of single-subject studies, DEP lists can be used not only to investigate biological mechanisms specific to certain patients but also to suggest potential treatments or combinations of treatments based on gene products annotated to the pathology-associated DEP, or other known interactions.

Approaches based on paired samples of an individual

In the following, we will focus on single-subject-based methods that analyze paired samples without the requirement of replicates. In this category, we identified the methods known as N-of-1-pathways (Figure 5 and Table 3).
These methods provide a statistical informatic approach by aggregating gene-level measurements from two samples into gene sets (pathways) provided by curated knowledge bases (e.g. GO, KEGG). This consolidation seeks to reduce noise from gene-level measurements and provide meaningful dimension reduction. These profiles are designed to have clinical translational value by providing a systems biology perspective instead of focusing on single biomarkers. A drug targeting a non-DEG product could at first glance seem useless. However, the pathway containing that product could still be dysregulated, and the drug may still have therapeutic value. For example, an epidermal growth factor receptor inhibitor (erlotinib) was successfully used in dual therapy to abate pathway-wide overexpression in oral carcinomas [58]. The first transcriptome analytic framework for quantifying within-patient differential expression from a pair of samples was introduced by Gardeux et al. [46], who developed the N-of-1-pathways Wilcoxon. Schissler et al. [47] extended the analysis of within-patient paired samples with the N-of-1-pathways Mahalanobis distance (MD) to improve on the Wilcoxon-based approach. MD provides an effect size of pathway-differential expression that incorporates the variance–covariance structure between the two samples. However, MD’s testing procedure failed to account for inter-gene correlation within pathways, which could result in the inflation of false-positive rates [48]. In response to this shortcoming, Schissler et al. [48] developed ClusterT to estimate co-expression of genes within the relevant biological context. For example, TCGA breast cancer RNA-Seq samples could be used to characterize clusters of genes within pathways, with positively correlated genes within the same cluster.
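As a minimal sketch of the first of these frameworks, the Python code below applies a Wilcoxon signed-rank test to the paired expression values of the genes in each gene set and reports a signed z-score with a two-sided P-value. It is an assumption-laden illustration rather than the N-of-1-pathways software: the normal approximation is used without tie or continuity correction, the inputs are assumed to be log-scale expression values stored in dicts, and the names `signed_rank_z` and `pathway_profile` are hypothetical.

```python
import math


def signed_rank_z(baseline, case):
    """Wilcoxon signed-rank test (normal approximation) on paired values.

    Returns (z, p): z > 0 means the case sample tends to be higher than the
    baseline sample; p is the two-sided P-value.
    """
    diffs = [c - b for b, c in zip(baseline, case) if c != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no tie correction (simplification)
    z = (w_plus - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def pathway_profile(baseline, case, gene_sets):
    """Test every gene set; samples are dicts mapping gene name to expression."""
    profile = {}
    for name, genes in gene_sets.items():
        shared = [g for g in genes if g in baseline and g in case]
        profile[name] = signed_rank_z(
            [baseline[g] for g in shared], [case[g] for g in shared]
        )
    return profile
```

A gene set in which half of the genes go up and half go down by similar amounts yields a z-score near zero in this sketch, which illustrates why the Wilcoxon-based test is not suited to bidirectional dysregulation.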
The ClusterT approach bears similarities to the ‘accumulated normal sample’ strategy described above, but differs in the way that an ontology is characterized by clusters within the context of analyzing a single subject’s pair. The authors envision co-expression cluster-augmented knowledge bases to enable clinical translation without the additional burden of accumulating ad hoc normal samples. The N-of-1-pathways Wilcoxon, MD and Clustered-T approaches perform gene set testing, one pathway at a time, in a ‘self-contained’ fashion [61]. This offers an opportunity for small-scale gene expression testing, as whole-transcriptome measurement is not required. Seeking gene set test procedures in the paired-sample setting that explore pathway-level expression relative to the rest of the transcriptome (i.e. a ‘competitive’ test), Li et al. developed two procedures, N-of-1-pathways k-Means Enrichment (kMEn) [49] and Mixture-Enrichment (MixEnrich) [50]. The benefits of these techniques lie in the detection of bidirectional pathway dysregulation (mRNAs within the same pathway that are both overexpressed and underexpressed) and in their robustness in noisy samples with a high frequency of DEGs. All the N-of-1-pathways methods showed their power in the identification of DEPs that result from diverse health disorders [46–50].

Longitudinal time series analyses of transcriptome

Biological processes are highly dynamic, and understanding how diseases evolve over time can reveal factors involved in determining the disease status, progression and compensation. However, -omic data are typically gathered at infrequent or even single static points. Comprehensive longitudinal -omics data (i.e. one or more types of omics measured over time) can provide key information for understanding the whole evolution of biological processes and underlying biological mechanisms. Comprehensive longitudinal analyses are typically limited by the substantial associated costs of sample collection and patient follow-up.
As a result, with perhaps rare exceptions, long time series experiments have few or no isogenic replicates in single subjects. Traditional analysis of time series gene expression data aims at identifying gene sets that exhibit common or distinct patterns of expression between two or more conditions (i.e. gene modification, treatment). The computational complexity of analyzing such data is higher, as time course data involve the three dimensions of gene, time and condition. When considering time series data from an individual, several strategies can be applied depending on the experimental setup (i.e. number of time points and conditions considered). For example, if multiple time points are sampled, baseline comparisons can be performed by considering samples gathered during ostensibly healthy physiological states of the patient as the reference population. Existing approaches to extract meaningful knowledge from time series transcriptome data are based on clustering algorithms [62], hidden Markov models [63], Gaussian processes [64] or Bayesian approaches [65] (for a review, see [66, 67]). These techniques can be applied for single-subject transcriptome analysis to extract DEGs or gene expression trajectory patterns from multiple experimental conditions where multiple time points are studied. However, when replicates are not available, few models have been proposed for the identification of DEGs or DEPs from longitudinal data. In Figure 4, we reported a method [40] aimed at extracting DEGs from time series data, i.e. genes whose expression changed significantly with respect to time. Wu et al. [40] propose a nonparametric method that integrates a functional principal component analysis (FPCA) into a hypothesis testing framework to extract gene-specific expression trajectories.
As the approach of Wu et al. is based on FPCA, the user has to select the number of leading principal components used to explain the data, and this choice can affect the overall results. On the other hand, Martini et al. [51] developed an approach to extract time-dependent pathways (DEPs) without the requirement of replicates (Figure 5). This method combines dimension reduction and graph decomposition theory. It first extracts time-dependent pathways and decomposes them into cliques to isolate the time-dependent portion. Although this approach is tailored to time course gene expression data without replicates, it does not provide information about the directionality of the identified DEPs. Its output is the activation versus non-activation of a pathway. Moreover, it has been designed specifically for long time series data and does not show statistical power with short time course data. Therefore, it cannot be applied to investigate biological processes involving small time series (generally short-term) responses. One of the major limitations of all transcriptome analyses is their inability to fully capture the dynamics of the represented system because of, for instance, posttranscriptional modifications. To this end, analysis of the products of the transcriptome can provide significant insights and an additional source of information.

Single-subject transcriptome integrated with other -omics

In this section, we will report the state of the art of current analyses aimed at analyzing the transcriptome combined with other -omics for SSA. The retrieved studies are summarized in Figure 6. The integration and analysis of different high-throughput molecular assays and data is one of the major topics in precision medicine for understanding patient-specific variations.
This approach makes it possible to obtain a comprehensive view of the genetic, biochemical, metabolic, proteomic and epigenetic processes underlying a disease that, otherwise, could not be fully investigated by using single -omics approaches. The increased power of multi-omics studies has already been assessed in the understanding of diseases, biomarkers and drug discovery. These methods are based on supervised or unsupervised machine learning techniques and typically aim at classifying patients into cancer subtypes [70–74] or are designed for drug repurposing [73, 75, 76]. Even if these strategies are useful for precision medicine, they are not able to extract meaningful knowledge on individual-specific biological mechanisms. They still rely on the integration of -omics profiles from populations of subjects. In our review of the literature, only a few computational single-subject algorithms aimed at analyzing transcriptome data combined with other -omics have been proposed in the past years. Chen et al. [77] pioneered an ambitious project to integrate, analyze and provide clinically interpretable results from multi-omics profiles of an individual. The authors proposed the integrative personal -omic profile (iPOP), using Dr Snyder’s -omics as a test case colloquially referred to as the ‘Snyderome’. iPOP combines genomic, transcriptomic, proteomic, metabolomic and autoantibody profiles collected from a single subject over a 14-month period. A key aspect of this study, other than the focus on collecting data from a single person, was its comprehensive longitudinal nature and sampling during a variety of incidental environmental exposures including two viral infections and physician-recommended diet changes. This resulted in 3 billion measurements taken over 20 time points and >30 TB of data [77].
The article confirmed that some disease risks could be assessed from the genome sequence of the patient, but actual onset and assessment of certain other diseases, such as hypertriglyceridemia, could not be diagnosed based only on the genomic profile. Interestingly, proteome and metabolome were also required to understand the biological mechanisms underlying response to the viral infections. Association between expression and disease status was also revealed through the analysis of transcriptome data. PARADIGM [78] integrates transcriptome and DNA CNV data to compute pathway scores that represent the alteration of a person’s pathways. Pathway scores are calculated as a joint probability of a directed factor graph, a form of probabilistic graphical model. Variables in the graphical model correspond to different molecular entities; edges in the graph represent within- or between-scale interactions. The interactions are determined by the central dogma and knowledge of annotated pathways, such as the Pathway Interaction Database [77].

Validation of single-subject omics methods

We further classified each publication in this review according to the method(s) used for result validation (Table 4). The majority of approaches have been validated with in silico simulation of data or by cross-validation in the same data set. A few methods have validated their results across replicate samples, or have had pathology-associated DEG or DEP results successfully reproduced in independent data sets. To our knowledge, only the N-of-1-pathways W [46] SSA method has been validated in vitro and as a prognostic outcome classifier in a prospective study. In that study, patient-specific DEPs were identified in response to an ex vivo stimulation of their PBMCs with rhinovirus and used to accurately predict risk of asthmatic exacerbation in those same patients over a 2-year follow-up period.
This strongly supports the conclusion that the single-subject study of personalomes is an emerging field in need of more rigorous validation for translation to clinical practice. Additionally, new validation strategies need to be developed for in vivo and clinical trial validations of personalome imputations.

Clinical applications

To better understand the requirement for single-subject studies, we revisit the types of approaches and transformations required for clinicians to interpret the more widely used clinical method of DNA sequencing. As shown in Figure 1, we highlighted the critical steps for clinical interpretation of DNA sequencing. The full genome of 3.5 billion base pairs is evaluated against reference genomes to identify the single-subject variants and mutations, yielding a substantial dimension reduction as well as a transformation from molecular data to a biomolecular interpretation of the sequence. Additional studies provided external knowledge for clinical interpretation. For example, missense and nonsense mutations are known to affect the host genes, many of which are known to lead to Mendelian diseases annotated in OMIM [78]. Reproducible genome-wide association studies led to the creation of the NHGRI Catalog, which annotates the disease risk associated with certain single-nucleotide polymorphisms. In other words, for DNAseq, an SSA (the intermediate step of mutation and variant calls) precedes clinical utility studies. However, this has not been the case for the majority of the studies at other omics scales. This review focuses on comparing and contrasting SSA that incorporates this previously unavailable intermediate step for other molecules of life, such as mRNAs, miRNAs, proteins, methylated DNA regions and metabolites (carbohydrates and lipids).
For example, oncologists already use assays for determining expression fold change and protein function of oncogenes and tumor suppressors through the comparison of tumor tissue with external references or unaffected paired tissue. As these curated approaches may not scale to the full omics data for other diseases, we provide emerging evidence that the newly available unbiased SSA enables new types of studies investigating their clinical utility by addressing the gap of biomolecular interpretation of raw omics signal. Among possible studies, we demonstrate that omics clinical prediction classifiers that operate directly at the omics scale may be redesigned for the parsimonious transformed signal of single-subject studies for improved clinical utility. For example, Gardeux et al. [79] quantified the personal pathway-level transcriptomic response of peripheral blood mononuclear cells (PBMCs) to rhinovirus ex vivo and trained a classifier predictive of children prone to asthma exacerbations. The dimension of the signal was reduced from the entire transcriptomes of paired samples in 20 subjects (∼10⁶ data points) to the effect size of pathways that were statistically significantly responsive in at least one subject (∼10⁴ data points).

While many unbiased fully specified GExpCs designed over the entire transcriptome have been published in peer-reviewed journals, few have been FDA approved because they lack a mechanistic relationship between the features (gene transcripts) and the disease progression [6, 80, 81]. Two additional important limitations of the clinical utility of conventional GExpCs are (i) their platform dependence, which limits their face-value validity (e.g. specific to Agilent™) [8], and (ii) the paradox that distinct GExpCs are obtained from distinct cohorts of the same phenotype [6, 8]. Interestingly, the transformation of a signal from conventional raw gene expression to the effect size obtained after DEP-type SSA enables us to address these three limitations.
First, SSA generates an effect size and P-value for each subject, analogous to mechanism-level features ascribed to a patient. In addition, Yang et al. [44] have shown that the FAIME DEP transformation leads to the rediscovery of at least 50% of the same gene set-level features (KEGG, GO) in seven distinct data sets of head and neck cancers when learning fully specified gene set-level classifiers (GenesetCs). The discovered features were consistently predictive of disease progression in independent validation data sets. Furthermore, three studies demonstrated that the discovered GenesetCs overlap by >50% of gene set features across expression platforms (Affymetrix, Agilent, RNAseq) [44, 82, 83], thus addressing another limitation of GExpCs. Finally, a recent report from Gardeux et al. [79] shows that DEP single-subject studies in paired samples could generate features of higher quality than those obtained directly from gene expression in small cohorts. Specifically, a GenesetC predictive of exacerbation in pediatric asthmatic patients was confirmed in an independent cohort (learning set 40 subjects, validation set 22 subjects). This study suggests that SSA could reduce the cohort size needed for classifier development, as conventional GExpCs generally require hundreds of subjects in their learning sets.

Perspective and conclusion

The development and analysis of personal transcriptome interpretation are essential for precision medicine, as therapeutic decision-making pertains not exclusively to genomic sequences but to Genome × Environment (GxE) interactions as well. For example, isogenic twins may experience different diseases because of their distinct environmental exposures, despite sharing identical genomes. Even in the presence of the same diseases, their therapeutic responses may vary as a result of other GxE conditions [79, 84].
The analysis of single-subject transcriptomes is valuable for extracting useful knowledge to better understand individual variability and patient-specific mechanisms underlying a disease and for suggesting tailored therapies. Selecting the best method for evaluating a given subject’s personalome depends first on the biological question and on the experimental design best suited to answering it. This review revealed that ongoing advances in high-throughput technologies, together with emerging research and clinical questions, call for continued investigation and development toward experimentally validated methods for unveiling tailored treatments from patient-specific transcriptomes (Table 4). In recent years, this nascent field of single-subject -omics has demonstrated considerable growth, as reflected by the number of approaches being published and underlined by the high number of citations for the earlier works (Figure 2). Figures 4, 5 and 6 detail the computational analysis options that are available for transcriptome data and the sampling regimens that each requires to be applied effectively (e.g. single sample, paired samples, longitudinal samples), as well as whether access to an appropriate external reference database is necessary and what type of output is provided (e.g. DEGs or DEPs).

Each broad category of currently available methods has both advantages and limitations. Approximately half of the bioinformatics methods we surveyed perform a comparison between the single subject’s profile and a reference, most often a cohort of accumulated normal samples or samples of a well-defined disease subtype [34, 35, 41–43]. These methods are generally able to capture patient variability and provide clinically interpretable results. However, accumulating the reference may be challenging, the heterogeneity of the reference samples may not be accounted for, and subtle effects may be difficult to detect.
This may result in missing crucial alterations present in the patient profiles. Nonetheless, these methods are appropriate when a robust reference is obtainable, or in cases where a paired-sample design does not make sense.

Recommended DEG and DEP approaches to SSAs

As transcriptomes vary by cell type and with environmental exposures, clinically or biologically interpretable altered mechanisms are more convincing when developed in isogenic (same subject) conditions than in heterogenic ones. We thus recommend clinical or experimental designs that generate a baseline in the same individual, i.e. paired samples (Figures 1, 2 and 5), which are well evaluated (Table 4) and have been validated in many publications (Figure 2). At this point in time, multi-omics methods have not been evaluated sufficiently to recommend one over another, even though they have the potential for being the best methods. Among analytical techniques exploiting a baseline, the more measures the better; thus FPCA and timeClip are favored for DEGs and DEPs, respectively, when three or more samples are available over time. For discovery of DEGs in paired-sample analyses, we recommend the use of DEGseq, as it is designed for single subjects, provides effect sizes and P-values, considers a limited variance estimate and is validated in independent samples. On the other hand, edgeR and GFOLD are suboptimal, as they require user-defined parameters (Figure 4, column ‘User-defined parameters heuristics’). The unbiased and parameter-free DESeq approach, although not designed for a single subject, likely performs better in these conditions than either edgeR or GFOLD, which require subjective, and possibly biased, user-defined parameters. However, no study has yet been conducted to compare the accuracy of different single-subject DEG methods against one another.
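To make the notion of a per-gene effect size in a paired-sample design concrete, the following minimal sketch computes the log2 fold change (M) and mean log2 expression (A) for each gene of one subject. It is an illustration only and does not reproduce DEGseq or any other method surveyed here: it yields effect sizes but no P-values, and it assumes the counts have already been normalized for library size. The function names, the pseudocount and the gene labels are all hypothetical.

```python
import math


def paired_ma_values(baseline, case, pseudocount=1.0):
    """Return {gene: (M, A)} for one subject's paired samples.

    M = log2(case) - log2(baseline) is the effect size (log2 fold change).
    A is the mean log2 expression; a large |M| at low A is less trustworthy.
    The pseudocount (an assumption of this sketch) avoids log2(0).
    """
    values = {}
    for gene in sorted(set(baseline) & set(case)):
        b = math.log2(baseline[gene] + pseudocount)
        c = math.log2(case[gene] + pseudocount)
        values[gene] = (c - b, (c + b) / 2.0)
    return values


def rank_by_effect_size(values):
    """Order genes by decreasing absolute log2 fold change."""
    return sorted(values, key=lambda g: abs(values[g][0]), reverse=True)


# Hypothetical normalized counts for three genes in one subject.
baseline = {"GENE_A": 7, "GENE_B": 15, "GENE_C": 7}
case = {"GENE_A": 31, "GENE_B": 15, "GENE_C": 0}

ma = paired_ma_values(baseline, case)
ranking = rank_by_effect_size(ma)
print(ranking)  # ['GENE_C', 'GENE_A', 'GENE_B']
```

Ranking by fold change alone ignores variance, which is precisely why the methods discussed above add a variance model or a user-defined parameter on top of this quantity.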
For discovery of DEPs in paired samples, N-of-1-pathways kMEn has been shown in simulation and in real data sets to outperform other paired DEP methods; however, N-of-1-pathways Wilcoxon remains the most validated, including in a clinical trial. Of note, ClusterT is the only approach controlling for intergenic correlation (Figure 5, column ‘Intergenic correlation’), which can create enrichment biases; however, additional validations are required. In the absence of multiple samples from a single subject with its own isogenic reference, we recommend analyses providing biologically and clinically interpretable results of altered expression against heterogenic references (a population). Among single-sample SSA, we recommend RankComp and individPath for DEG and DEP determination, respectively. RankComp is currently the only method that provides DEGs based on the comparison of a single sample against a reference cohort. For DEP determination, we suggest individPath because of its rigorous formal model and the small number of transcripts required to detect DEPs.

Imputing altered or dysregulated expression of a transcript or a pathway is not feasible for inadequately designed clinical assays or experiments aimed at interpreting a single transcriptome in the absence of any transcriptome reference (e.g. isogenic, heterogenic). To address this, transcript and pathway expression can be compared within a sample using FAIME or ssGSEA. However, the resulting higher or lower expression of a mechanism relative to the rest of the sample may simply be the normal state of that mechanism, leaving the interpretation ambiguous.

While transcriptome analyses can provide DEGs and DEPs for single subjects and are the most mature, we anticipate that as the field advances, it will be possible to reveal novel physiological state correlations through the construction and analysis of multi-scale personalomes. The analysis of a single scale (i.e.
-omics data) alone cannot reveal the complex picture underlying a disease, which may be fully captured only by fusing together multi-omics data (from genome to metabolome, to exposome) of an individual, i.e. via comprehensive personalome profiles. In fact, the combination of multiple -omics data can lead to the detection of comprehensive individual variability, essential for providing new insights into disease pathophysiology and mechanisms that may explain the differences in drug responses in the human population.

As shown in Figure 1 and discussed in the ‘Clinical applications’ section, by delivering dimension reduction and biomolecular interpretations, SSAs enable new types of transcriptomic analyses for clinical interpretation that are comparable to the methods applied to DNAseq. For example, current DNA sequence-based and classifier-based SSA commercial offerings provide oncologists with annotations of oncogenic or tumor suppressor genes with copy number variants, gain- or loss-of-function mutations, expression fold changes (tumor versus normal) or gene expression against reference tissue and, occasionally, protein activity. However, these are limited results for a handful of known genes that have been highly curated to apply to a narrow set of diseases. In contrast, the novel SSA approaches discussed in this manuscript assess the entire transcriptome for DEPs and DEGs in an unbiased manner, including in diseases that may be far less well studied; such analyses are not currently available commercially. Clinical utility of these assessments requires additional studies or a knowledge base, similarly to the interpretation of novel mutations for DNA (Figure 1; ‘Studies informing clinical interpretation’).
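As a concrete illustration of a paired-sample DEP call of the kind recommended in this section, the sketch below applies an exact two-sided Wilcoxon signed-rank test to the expression values of the genes of one pathway, measured in two samples from the same subject. It assumes, as the name of the N-of-1-pathways Wilcoxon method suggests, that the test is applied to within-pathway paired gene expression; it is a generic textbook test written for this example, not the published implementation, and it omits the multiple-testing correction that would be needed across many pathways. The function name and the expression values are hypothetical.

```python
def signed_rank_test(baseline, case):
    """Exact two-sided Wilcoxon signed-rank test on paired values.

    Returns (W_plus, p_value), where W_plus is the sum of the ranks of the
    positive differences. Pairs with zero difference are dropped.
    """
    diffs = [c - b for b, c in zip(baseline, case) if c != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences; tied values share their average rank.
    # Ranks are stored doubled so that averaged ranks remain integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for k in range(start, end + 1):
            doubled_ranks[order[k]] = start + end + 2
        start = end + 1

    observed = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    total = sum(doubled_ranks)

    # Null distribution: each rank is positive or negative with probability 1/2.
    # counts[s] = number of sign assignments whose positive ranks sum to s.
    counts = {0: 1}
    for r in doubled_ranks:
        updated = dict(counts)
        for s, c in counts.items():
            updated[s + r] = updated.get(s + r, 0) + c
        counts = updated

    # Two-sided p-value: assignments at least as far from the mean as observed.
    deviation = abs(2 * observed - total)
    extreme = sum(c for s, c in counts.items() if abs(2 * s - total) >= deviation)
    return observed / 2.0, extreme / 2 ** n


# Hypothetical expression values for the six genes of one pathway.
unstimulated = [10, 20, 30, 40, 50, 60]
stimulated = [11, 22, 33, 44, 55, 66]
w_plus, p_value = signed_rank_test(unstimulated, stimulated)
print(w_plus, p_value)  # 21.0 0.03125
```

Because the genes of a pathway are treated as exchangeable paired observations, intergenic correlation is ignored here, which is the source of the enrichment bias noted above for methods that do not control for it.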
Opportunities for future work

Analysis of multi-omics dynamic profiles including transcriptome, proteome, methylome and metabolome can additionally provide indicators of real-time phenotypes and physiology in individuals that cannot be obtained through examination of the static genome alone. In doing so, GxE interactions are revealed [79]. -Omics integration has been used successfully for the identification of novel associations between biological entities (e.g. genes, proteins) and disease [74], patient stratification [73] and biomarker discovery or drug repositioning [76]. However, these strategies have not yet been applied to the integration of the multi-omics data of an individual with biological knowledge. When integrating multiple -omics, an important aspect to consider is the variability of data between -omics, not only with respect to the represented biological process but also with respect to the associated noise levels, identification accuracy, coverage and temporal resolution of the data. These differences complicate the integration and joint modeling of multi-omics data. While this is intuitively clear, it remains computationally and experimentally challenging to effectively integrate longitudinal multi-omics data. First, each biological entity (e.g. gene, metabolite) has a different time-dependent modulation and responds to signals on a different specific time scale, even if contributing to the same biological process. Second, biological processes that take place in inaccessible tissues (e.g. brain, internal organs) cannot feasibly be monitored longitudinally, even if a single sample can be obtained. Additional challenges are related to autocorrelation among repeated measurements of the same variables, random effects and missing data.
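The autocorrelation issue can be illustrated with a minimal sketch that estimates the lag-1 autocorrelation of one variable measured repeatedly in a single subject. It assumes equally spaced time points and no missing values, which real longitudinal -omics series often violate; the function name and the example series are hypothetical.

```python
def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a repeated-measures series at a given lag.

    Assumes equally spaced measurements without missing values
    (an assumption of this sketch, not a property of real data).
    """
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Hypothetical expression of one transcript at five consecutive visits.
steady_trend = [1, 2, 3, 4, 5]
r1 = lag_autocorrelation(steady_trend)
print(round(r1, 3))  # 0.4
```

A clearly non-zero value indicates that consecutive measurements are not independent, so treating them as independent replicates would overstate the statistical evidence.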
Moreover, the design of longitudinal studies of a single subject must account for repeated measures, which should preferably be equally spaced in time to increase the statistical power of the approach [85]. An obvious opportunity that has not been reported is to learn convergent patterns at one -omics scale (e.g. transcriptome) and correlate them with those of another scale (e.g. proteome), thus providing internal validation and increasing the signal-to-noise ratio. Future studies for SSA of transcriptomes will need to focus on four underreported approaches: (1) variance estimation in isogenic conditions from single-subject measures (without requirement of reference transcriptomes), (2) activity level of pathways (functional, e.g. upregulated versus downregulated) rather than expression direction (overexpressed versus underexpressed), (3) the analysis and integration of comprehensive personal -omics data to infer dysregulated molecules and mechanisms and (4) rigorous validation of DEGs, DEPs or other results with appropriate in vitro, in vivo and clinical trial investigations.

As clinical research continues to explore the importance of patient heterogeneity, we encourage more investigators to adopt single-subject study designs and -omics analyses when appropriate to maximize the information made available by high-throughput technologies. Access to these analysis tools may also allow researchers to more thoroughly explore certain rare case studies, outliers and patients-of-special-interest in a way that would not be possible if relying only on traditional large cohort-based statistics. This is particularly true if an isogenic paired-sample study design can be used to answer a meaningful biological or clinical question. The use of personalome integrated with the available external knowledge (e.g.
repositories on disease–disease association, target–gene interactions, gene–gene interaction) can provide new opportunities for developing more robust and comprehensive results that account for all the interacting -omics and temporal behaviors of the biological system of an individual. Finally, personalome researchers should consider a creative application of powerful engineering and mathematical tools that have not yet been applied to study the mechanisms underpinning the personalome of an individual. For example, computational methods used to analyze time series data include generalized linear mixed models, generalized estimating equations, Markov models, nonparametric or semi-parametric models, Bayesian models and dynamic pathway analysis [85]. These methods are not directly applicable to the personalome, as they are cohort-centric; however, innovations may extend the paradigm of their current implementation.

Key Points

For the ‘personalome’ to enable precision medicine from -omics data, we need to move from cohort-focused assays and analytics to individualized (single-subject) studies.
We survey and categorize methodology by biological and informatics input, mathematical formalism and procedure output.
Our review focuses on the transcriptome dimension of the personalome, which has seen considerable development to date, while proteomics, multi-scale and other scales of biology present open challenges.
The personalome methods need more rigorous validations, as few have been validated in vitro, in vivo or in clinical trials.
The emerging personalome represents a largely unexplored application of -omics data and potentially has important consequences for improving patient outcomes.
Funding

Publication of this article has been funded in part by the following grants and organizations: National Institutes of Health (NIH)/Office of the Director Precision Medicine Initiative (grant number 1UG3OD023171-01), the Precision Medicine Initiative of the Center for Biomedical Informatics and Biostatistics of the University of Arizona Health Sciences, NIH/National Heart, Lung, and Blood Institute (grant numbers HL126609-01, HL132523, U01 HL125208), NIH/National Cancer Institute (grant numbers P30CA023074, 1R01CA190696-01) and NIH/National Institute of Allergy and Infectious Diseases (grant number U01AI122275-01).

Francesca Vitali, PhD, is a Research Assistant Professor at the University of Arizona. Her main research interests are in pharmacogenomics, drug repurposing, precision medicine, bioinformatics and big data techniques.
Qike Li is a PhD candidate at the University of Arizona. His research interests are in the area of single-subject analytics with applications in precision medicine.
Grant Schissler is an Assistant Professor at the University of Nevada, Reno. Recently, he has helped to build statistical informatics tools that allow clinical researchers to interpret genomic data of individual patients.
Joanne Berghout, PhD, is a Research Assistant Professor at the University of Arizona. She uses genetics and ontologies to uncover patterns and candidate genes associated with Mendelian and complex diseases.
Colleen Kenost, EdD, is Director of Operations at the Center for Biomedical Informatics and Biostatistics, University of Arizona. Her role is to translate research prerogatives into action and operationalize strategic plans.
Yves Lussier, MD, is Professor of Medicine and Director of the Center for Biomedical Informatics and Biostatistics, University of Arizona. His research group solves problems related to computational precision medicine and translational bioinformatics.

References

1 Stone NJ, Robinson JG, Lichtenstein AH, et al. 2013 ACC/AHA guideline on the treatment of blood cholesterol to reduce atherosclerotic cardiovascular risk in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J Am Coll Cardiol 2014;63(25 Pt B):2889–934.
2 Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med 2015;372(9):793–5. http://dx.doi.org/10.1056/NEJMp1500523
3 Guyatt GH, Keller JL, Jaeschke R, et al. The n-of-1 randomized controlled trial: clinical usefulness. Our three-year experience. Ann Intern Med 1990;112(4):293–9. http://dx.doi.org/10.7326/0003-4819-112-4-293
4 Schork NJ. Personalized medicine: time for one-person trials. Nature 2015;520(7549):609–11. http://dx.doi.org/10.1038/520609a
5 Scuffham PA, Nikles J, Mitchell GK, et al. Using N-of-1 trials to improve patient management and save costs. J Gen Intern Med 2010;25(9):906–13. http://dx.doi.org/10.1007/s11606-010-1352-7
6 Massague J. Sorting out breast-cancer gene signatures. N Engl J Med 2007;356:294–7. http://dx.doi.org/10.1056/NEJMe068292
7 Stec J, Wang J, Coombes K, et al. Comparison of the predictive accuracy of DNA array-based multigene classifiers across cDNA arrays and Affymetrix GeneChips. J Mol Diagn 2005;7(3):357–67. http://dx.doi.org/10.1016/S1525-1578(10)60565-X
8 Simon R, Radmacher MD, Dobbin K, et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003;95(1):14–18. http://dx.doi.org/10.1093/jnci/95.1.14
9 Dupuy A, Simon RM. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 2007;99(2):147–57. http://dx.doi.org/10.1093/jnci/djk018
10 Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol 2016;17(1):13. http://dx.doi.org/10.1186/s13059-016-0881-8
11 Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015;43(7):e47.
12 Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat Methods Med Res 2013;22(5):519–36. http://dx.doi.org/10.1177/0962280211428386
13 Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001;98(9):5116–21. http://dx.doi.org/10.1073/pnas.091062498
14 Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol 2000;7(6):819–37. http://dx.doi.org/10.1089/10665270050514954
15 Trapnell C, Hendrickson DG, Sauvageau M, et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 2013;31(1):46–53.
16 Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15(12):550. http://dx.doi.org/10.1186/s13059-014-0550-8
17 Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26(1):139–40. http://dx.doi.org/10.1093/bioinformatics/btp616
18 Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28(1):27–30. http://dx.doi.org/10.1093/nar/28.1.27
19 Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9. http://dx.doi.org/10.1038/75556
20 Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci 2005;102(43):15545–50. http://dx.doi.org/10.1073/pnas.0506580102
21 Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4:44–57.
22 Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37(1):1–13. http://dx.doi.org/10.1093/nar/gkn923
23 Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics 2007;23(2):257–8. http://dx.doi.org/10.1093/bioinformatics/btl567
24 Grossmann S, Bauer S, Robinson PN, et al. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics 2007;23(22):3024–31. http://dx.doi.org/10.1093/bioinformatics/btm440
25 Yang X, Li J, Lee Y, et al. GO-Module: functional synthesis and improved interpretation of gene ontology patterns. Bioinformatics 2011;27(10):1444–6. http://dx.doi.org/10.1093/bioinformatics/btr142
26 Fabregat A, Sidiropoulos K, Viteri G, et al. Reactome pathway analysis: a high-performance in-memory approach. BMC Bioinformatics 2017;18(1):142. http://dx.doi.org/10.1186/s12859-017-1559-2
27 Cerami EG, Gross BE, Demir E, et al. Pathway commons, a web resource for biological pathway data. Nucleic Acids Res 2011;39:D685–90.
28 Vivar JC, Pemu P, McPherson R, et al. Redundancy control in pathway databases (ReCiPa): an application for improving gene-set enrichment analysis in omics studies and “Big Data” Biology. OMICS 2013;17(8):414–22.
29 Sparano JA, Paik S. Development of the 21-gene assay and its application in clinical practice and clinical trials. J Clin Oncol 2008;26(5):721–8. http://dx.doi.org/10.1200/JCO.2007.15.1068
30 Parker JS, Mullins M, Cheang MC, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol 2009;27(8):1160–7. http://dx.doi.org/10.1200/JCO.2008.18.1370
31 Daxin J, Chun T, Aidong Z. Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng 2004;16:1370–86. http://dx.doi.org/10.1109/TKDE.2004.68
32 Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 2011;474:609–15. http://dx.doi.org/10.1038/nature10166
33 Nair VS, Maeda LS, Ioannidis JP. Clinical outcome prediction by microRNAs in human cancer: a systematic review. J Natl Cancer Inst 2012;104(7):528–40. http://dx.doi.org/10.1093/jnci/djs027
34 Wang H, Sun Q, Zhao W, et al. Individual-level analysis of differential expression of genes and pathways for personalized medicine. Bioinformatics 2015;31(1):62–8. http://dx.doi.org/10.1093/bioinformatics/btu522
35 Liu R, Yu X, Liu X, et al. Identifying critical transitions of complex diseases based on a single sample. Bioinformatics 2014;30(11):1579–86. http://dx.doi.org/10.1093/bioinformatics/btu084
36 Wang L, Feng Z, Wang X, et al. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 2010;26(1):136–8. http://dx.doi.org/10.1093/bioinformatics/btp612
37 Tarazona S, Furio-Tari P, Turra D, et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res 2015;43(21):e140.
38 Feng J, Meyer CA, Wang Q, et al. GFOLD: a generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics 2012;28(21):2782–8. http://dx.doi.org/10.1093/bioinformatics/bts515
39 Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol 2010;11(10):R106.
40 Wu S, Wu H. More powerful significant testing for time course gene expression data using functional principal component analysis approaches. BMC Bioinformatics 2013;14(1):6. http://dx.doi.org/10.1186/1471-2105-14-6
41 Wang H, Cai H, Ao L, et al. Individualized identification of disease-associated pathways with disrupted coordination of gene expression. Brief Bioinform 2016;17(1):78–87. http://dx.doi.org/10.1093/bib/bbv030
42 Ahn T, Lee E, Huh N, et al. Personalized identification of altered pathways in cancer using accumulated normal tissue data. Bioinformatics 2014;30(17):I422–9.
43 Drier Y, Sheffer M, Domany E. Pathway-based personalized analysis of cancer. Proc Natl Acad Sci USA 2013;110(16):6388–93. http://dx.doi.org/10.1073/pnas.1219651110
44 Yang X, Regan K, Huang Y, et al. Single sample expression-anchored mechanisms predict survival in head and neck cancer. PLoS Comput Biol 2012;8(1):e1002350.
45 Barbie DA, Tamayo P, Boehm JS, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 2009;462(7269):108–12. http://dx.doi.org/10.1038/nature08460
46 Gardeux V, Achour I, Li J, et al. ‘N-of-1-pathways’ unveils personal deregulated mechanisms from a single pair of RNA-Seq samples: towards precision medicine. J Am Med Inform Assoc 2014;21(6):1015–25.
47 Schissler AG, Gardeux V, Li Q, et al. Dynamic changes of RNA-sequencing expression for precision medicine: N-of-1-pathways Mahalanobis distance within pathways of single subjects predicts breast cancer survival. Bioinformatics 2015;31(12):i293–302.
48 Schissler AG, Piegorsch WW, Lussier YA. Testing for differentially expressed genetic pathways with single-subject N-of-1 data in the presence of inter-gene correlation. Stat Methods Med Res 2017. doi: 10.1177/0962280217712271.
49 Li Q, Schissler AG, Gardeux V, et al. kMEn: analyzing noisy and bidirectional transcriptional pathway responses in single subjects. J Biomed Inform 2017;66:32–41. http://dx.doi.org/10.1016/j.jbi.2016.12.009
50 Li Q, Schissler AG, Gardeux V, et al. N-of-1-pathways MixEnrich: advancing precision medicine via single-subject analysis in discovering dynamic changes of transcriptomes. BMC Med Genomics 2017;10(S1):27.
51 Martini P, Sales G, Calura E, et al. timeClip: pathway analysis for time course data without replicates. BMC Bioinformatics 2014;15(Suppl 5):S3.
52 Vitali F, Cohen LD, Demartini A, et al. A network-based data integration approach to support drug repurposing and multi-target therapies in triple negative breast cancer. PLoS One 2016;11:e0162407.
53 Hansen KD, Wu Z, Irizarry RA, et al. Sequencing technology does not eliminate biological variability. Nat Biotech 2011;29(7):572–3. http://dx.doi.org/10.1038/nbt.1910
54 Peng F, Zhang Y, Wang R, et al. Identification of differentially expressed miRNAs in individual breast cancer patient and application in personalized medicine. Oncogenesis 2016;5(2):e194.
55 Kanehisa M, Furumichi M, Tanabe M, et al. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res 2017;45(D1):D353–61.
56 Simon R. Lost in translation: problems and pitfalls in translating laboratory observations to clinical utility. Eur J Cancer 2008;44(18):2707–13. http://dx.doi.org/10.1016/j.ejca.2008.09.009
57 Narayanan M, Huynh JL, Wang K, et al. Common dysregulation network in the human prefrontal cortex underlies two neurodegenerative diseases. Mol Syst Biol 2014;10(7):743. http://dx.doi.org/10.15252/msb.20145304
58 Chawla A, Adkins D, Worden FP, et al. Effect of the addition of temsirolimus to cetuximab in cetuximab-resistant head and neck cancers: Results of the randomized PII MAESTRO study. J Clin Oncol 2014;32:6089.
59 Livshits A, Git A, Fuks G, et al. Pathway-based personalized analysis of breast cancer expression data. Mol Oncol 2015;9(7):1471–83. http://dx.doi.org/10.1016/j.molonc.2015.04.006
60 Liu C, Srihari S, Lal S, et al. Personalised pathway analysis reveals association between DNA repair pathway dysregulation and chromosomal instability in sporadic breast cancer. Mol Oncol 2016;10(1):179–93. http://dx.doi.org/10.1016/j.molonc.2015.09.007
61 Goeman JJ, Buhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007;23(8):980–7. http://dx.doi.org/10.1093/bioinformatics/btm051
62 Jung I, Jo K, Kang H, et al. TimesVector: a vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes. Bioinformatics 2017. pii: btw780. doi: 10.1093/bioinformatics/btw780.
63 Schliep A, Schonhuth A, Steinhoff C. Using hidden Markov models to analyze gene expression time course data.
Bioinformatics 2003 ; 19 (Suppl 1): i255 – 263 . Google Scholar Crossref Search ADS PubMed WorldCat 64 Heinonen M , Guipaud O, Milliat F, et al. Detecting time periods of differential gene expression using Gaussian processes: an application to endothelial cells exposed to radiotherapy dose fraction . Bioinformatics 2015 ; 31 ( 5 ): 728 – 35 . http://dx.doi.org/10.1093/bioinformatics/btu699 Google Scholar Crossref Search ADS PubMed WorldCat 65 Tai YC , Speed TP. On gene ranking using replicated microarray time course data . Biometrics 2009 ; 65 ( 1 ): 40 – 51 . http://dx.doi.org/10.1111/j.1541-0420.2008.01057.x Google Scholar Crossref Search ADS PubMed WorldCat 66 Spies D , Ciaudo C. Dynamics in transcriptomics: advancements in RNA-seq time course and downstream analysis . Comput Struct Biotechnol J 2015 ; 13 : 469 – 77 . http://dx.doi.org/10.1016/j.csbj.2015.08.004 Google Scholar Crossref Search ADS PubMed WorldCat 67 Bar-Joseph Z , Gitter A, Simon I. Studying and modelling dynamic biological processes using time-series gene expression data . Nat Rev Genet 2012 ; 13 ( 8 ): 552 – 64 . http://dx.doi.org/10.1038/nrg3244 Google Scholar Crossref Search ADS PubMed WorldCat 68 Chen R , Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes . Cell 2012 ; 148 ( 6 ): 1293 – 307 . http://dx.doi.org/10.1016/j.cell.2012.02.009 Google Scholar Crossref Search ADS PubMed WorldCat 69 Vaske CJ , Benz SC, Sanborn JZ, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM . Bioinformatics 2010 ; 26 ( 12 ): i237 – 45 . Google Scholar Crossref Search ADS PubMed WorldCat 70 Shen R , Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis . Bioinformatics 2009 ; 25 ( 22 ): 2906 – 12 . 
http://dx.doi.org/10.1093/bioinformatics/btp543 Google Scholar Crossref Search ADS PubMed WorldCat 71 List M , Hauschild AC, Tan Q, et al. Classification of breast cancer subtypes by combining gene expression and DNA methylation data . J Integr Bioinform 2014 ; 11 ( 2 ): 236 . Google Scholar Crossref Search ADS PubMed WorldCat 72 Ray P , Zheng L, Lucas J, et al. Bayesian joint analysis of heterogeneous genomics data . Bioinformatics 2014 ; 30 ( 10 ): 1370 – 6 . http://dx.doi.org/10.1093/bioinformatics/btu064 Google Scholar Crossref Search ADS PubMed WorldCat 73 Gligorijevic V , Malod-Dognin N, Przulj N. Patient-specific data fusion for cancer stratification and personalised treatment . Pac Symp Biocomput 2016 ; 21 : 321 – 32 . Google Scholar PubMed OpenURL Placeholder Text WorldCat 74 Lock EF , Hoadley KA, Marron JS, et al. Joint and Individual Variation Explained (Jive) for integrated analysis of multiple data types . Ann Appl Stat 2013 ; 7 ( 1 ): 523 – 42 . http://dx.doi.org/10.1214/12-AOAS597 Google Scholar Crossref Search ADS PubMed WorldCat 75 Gottlieb A , Stein GY, Ruppin E, et al. PREDICT: a method for inferring novel drug indications with application to personalized medicine . Mol Syst Biol 2011 ; 7 : 496 . Google Scholar Crossref Search ADS PubMed WorldCat 76 Napolitano F , Zhao Y, Moreira VM, et al. Drug repositioning: a machine-learning approach through data integration . J Cheminform 2013 ; 5 ( 1 ): 30 . http://dx.doi.org/10.1186/1758-2946-5-30 Google Scholar Crossref Search ADS PubMed WorldCat 77 Schaefer CF , Anthony K, Krupa S, et al. PID: the Pathway Interaction Database . Nucleic Acids Res 2009 ; 37 : D674 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 78 Amberger JS , Bocchini CA, Schiettecatte F, et al. OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders . Nucleic Acids Res 2015 ; 43 : D789 – 98 . 
Google Scholar Crossref Search ADS PubMed WorldCat 79 Gardeux V , Berghout J, Achour I, et al. A genome-by-environment interaction classifier for precision medicine: personal transcriptome response to rhinovirus identifies children prone to asthma exacerbations . J Am Med Inform Assoc 2017 ; 24 : 1116 – 26 . http://dx.doi.org/10.1093/jamia/ocx069 Google Scholar Crossref Search ADS PubMed WorldCat 80 Chen J , Sam L, Huang Y, et al. Protein interaction network underpins concordant prognosis among heterogeneous breast cancer signatures . J Biomed Inform 2010 ; 43 ( 3 ): 385 – 96 . http://dx.doi.org/10.1016/j.jbi.2010.03.009 Google Scholar Crossref Search ADS PubMed WorldCat 81 Chen JL , Li J, Stadler WM, et al. Protein-network modeling of prostate cancer gene signatures reveals essential pathways in disease recurrence . J Am Med Inform Assoc 2011 ; 18 ( 4 ): 392 – 402 . http://dx.doi.org/10.1136/amiajnl-2011-000178 Google Scholar Crossref Search ADS PubMed WorldCat 82 Perez-Rathke A , Li H, Lussier YA. Interpreting personal transcriptomes: personalized mechanism-scale profiling of RNA-seq data . Pac Symp Biocomput 2013 ; 159 – 70 . Google Scholar OpenURL Placeholder Text WorldCat 83 Chen JL , Hsu A, Yang X, et al. Curation-free biomodules mechanisms in prostate cancer predict recurrent disease . BMC Med Genomics 2013 ; 6 (Suppl 2): S4 . Google Scholar Crossref Search ADS PubMed WorldCat 84 Carrasco-Ramiro F , Peiro-Pastor R, Aguado B. Human genomics projects and precision medicine . Gene Ther 2017 ; 24 ( 9 ): 551 – 61 . Google Scholar Crossref Search ADS PubMed WorldCat 85 Sperisen P , Cominetti O, Martin FP. Longitudinal omics modeling and integration in clinical metabonomics research: challenges in childhood metabolic health research . Front Mol Biosci 2015 ; 2 : 44 . Google Scholar Crossref Search ADS PubMed WorldCat © The Author 2017. Published by Oxford University Press. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Systems Bioinformatics: increasing precision of computational diagnostics and therapeutics through network-based approaches
Oulas, Anastasis; Minadakis, George; Zachariou, Margarita; Sokratous, Kleitos; Bourdakou, Marilena M; Spyrou, George M
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx151 pmid: 29186305
Abstract
Systems Bioinformatics is a relatively new approach that lies at the intersection of systems biology and classical bioinformatics. It focuses on integrating information across different levels, combining the bottom-up approach of systems biology with the data-driven top-down approach of bioinformatics. The advent of omics technologies has provided the stepping stone for the emergence of Systems Bioinformatics. These technologies provide a spectrum of information ranging from genomics, transcriptomics and proteomics to epigenomics, pharmacogenomics, metagenomics and metabolomics. Systems Bioinformatics is the framework in which systems approaches are applied to such data, setting the level of resolution as well as the boundary of the system of interest and studying the emergent properties of the system as a whole rather than the sum of the properties derived from the system’s individual components. A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers. Here, we provide evidence on how Systems Bioinformatics enhances computational therapeutics and diagnostics, hence paving the way to precision medicine. The aim of this review is to familiarize the reader with the emerging field of Systems Bioinformatics and to provide a comprehensive overview of its current state-of-the-art methods and technologies. Moreover, we provide examples of success stories and case studies that utilize such methods and tools to significantly advance research in the fields of systems biology and systems medicine.
Keywords: Systems Bioinformatics, precision medicine, computational diagnostics, computational therapeutics, network analysis, drug repurposing
Introduction
Biological data, either as large-scale omics or as classical biodata, are the footprints of biological mechanisms.
These mechanisms consist of numerous synergistic effects emerging from various systems of interwoven biomolecules, cells and tissues. Therefore, it is necessary to explore them with a systemic approach to reveal the behaviour of the system as a whole rather than as the sum of its parts. Systems biology provides a holistic perspective on biological mechanisms by integrating information and knowledge from multiple fields (such as biology, chemistry, mathematics, computer science and physics). It aims to elucidate synergistic relationships between multiple factors, rather than representing them as single entities, and can lead to the generation of complex molecular interaction networks modelled by computational or mathematical approaches. Systems biology draws its power from technological advances in the field of ‘omics’ and the advent of next-generation sequencing. These technologies provide a spectrum of information ranging from genomics, transcriptomics and proteomics to epigenomics, pharmacogenomics, metagenomics and metabolomics. Bioinformatics and computational biology have made significant breakthroughs in the analysis and interpretation of the data obtained from these omics technologies. The sheer size of the data generated by these high-throughput methodologies, coupled with the need to analyse, integrate and concurrently interpret this avalanche of information in a systemic way, has paved the way to the emerging field of Systems Bioinformatics. Systems Bioinformatics is a relatively new approach that lies at the intersection of systems biology and classical bioinformatics. It focuses on integrating information across different levels, combining the bottom-up approach adopted in systems biology with the data-driven top-down approach used in bioinformatics.
The bottom-up approach in systems biology typically brings together information from molecules, cells and tissues in the framework of mathematical models to generate insights into the function and dynamic behaviour of cells, organs and organisms. The top-down approach uses bioinformatics methods to extract and analyse information from ‘omics’ data generated through high-throughput techniques. Initially, informatics approaches to systems biology focused mainly on modelling and simulation. However, owing to the lack of sufficient experimental data, these methods fell short in building reliable models. Subsequently, the explosion of multilevel data generation brought a plethora of new methods, tools and solutions capable of studying systemic properties. The application of systemic approaches such as information theory, statistical inference, probabilistic models, graph theory and other network science approaches to the analysis of biological data paved the way to the creation of a distinct field, namely, Systems Bioinformatics. Depending on the availability, quality and comprehensiveness of the data, Systems Bioinformatics methods help to narrow the gap between genotype and phenotype and provide additional information for biomarker and drug discovery. These methods are applied to classical biological data, clinical/patient data and omics data. They are suitable for extracting precise and personalized results, thus facilitating systems medicine (medicine that is in bidirectional interaction with computational multiscale analysis and modelling of disease-related mechanisms) and more specifically P4 Medicine (medicine that is personal, participatory, predictive and preventive) [1, 2] (as illustrated in Figure 1).
Figure 1. Systems Bioinformatics.
A schematic representation of the emergence of Systems Bioinformatics as a distinct discipline among other interrelated and interdependent disciplines. The information provided by Bioinformatics, Biology and Systems Biology is integrated in the Systems Bioinformatics framework through computational integration and network-based and other holistic approaches to tackle challenges in Systems Medicine and in particular P4 Medicine.
The delivery of individually adapted, high-precision medical care, based on multi-source patient information across various levels and scales, is the basic idea of modern medicine, which goes by various names depending on the emphasis given (e.g. translational/systems/P4/precision/personalized medicine). The omics spectrum offers both the opportunity and the challenge of multiscale and multi-source analysis towards building a comprehensive profile of the Human System (Figure 2). The major challenges faced by Systems Bioinformatics towards this demanding form of medicine are: (i) the design and development of suitable bioinformatics pipelines to provide valid and sufficient biological information from the high-throughput molecular profiles of the patient, (ii) the development of robust information systems capable of data integration, information extraction and knowledge sharing and (iii) the construction of mathematical models to predict the evolution of a particular disease, its relation to the measured markers, its tolerance/resistance to various drug families and the existing risks to the patient. These challenges can be tackled with state-of-the-art computational methodologies and techniques, such as computational intelligence, machine learning, pattern recognition and data mining, modelling and simulation, network reconstruction and visualization, complex network analysis, deep learning, text mining/semantics and association analysis.
In addition, Systems Bioinformatics serves as the framework for the development of powerful computational methods and tools to create user-friendly platforms that visualize and analyse big and heterogeneous information in the form of a network.
Figure 2. Network Integration. Multiscale and multisource data generated from the Human System can be represented in network form. These networks can be further analysed and, importantly, they can be integrated to form supernetworks and build a comprehensive profile of the Human System.
This review is structured in three main sections. In the first section on ‘Systems Bioinformatics’ we begin with an overview of the systems theory approach for complex biological problems. We then provide an in-depth summary of the network science approach in Systems Bioinformatics. We introduce certain basic network measures, which are used to analyse the components of a network, both locally and globally, and discuss the biological interpretation of such measures. We then describe in detail key biological network construction methods followed by a discussion on module-based approaches and network signatures. Finally, we describe network manipulation methods in the ‘Network controllability’ and ‘Network integration’ subsections. In the following subsection we provide a short summary of modelling and simulation approaches followed by a short discussion on the infrastructures and data management challenges in the field of Systems Bioinformatics. In the last two sections we provide an overview of methods and case studies with regard to the Systems Bioinformatics applications in biomarker and drug discovery.
Systems Bioinformatics
Systems approaches
Biological data have expanded tremendously in both size and complexity.
Systems Bioinformatics focuses on the investigation of such vast and complex biological systems and their internal interactions using a ‘holistic’ rather than a ‘reductionist’ approach, much like the field of systems biology. A holistic approach to the analysis and description of a complex phenomenon emphasizes the whole and the interaction of its parts, whereas the reductionist approach focuses on the fundamental parts. In fact, the debate on reductionism versus holism has its roots in antiquity. According to proponents of reductionism, the optimal way to understand any science is to decompose it into smaller components. Moreover, in its greedy form, reductionism may regard all of science as physics. Even in its layered-model form, reductionism considers the human/health sciences to be based on biology, biology on chemistry and chemistry on physics. On the other hand, strong dissent has given rise to a solid antireductionist trend. This trend has either epistemological or ontological origins, holding that complete reductionism is technically impossible and that there are emergent laws that govern the system and cannot be derived from the laws governing its components. Furthermore, it is argued that each system has a ‘buffering capacity’, whereby many micro-states correspond to fewer macro-states of the system, so that reductionism becomes pointless beyond a certain level of decomposition [3]. The reductionist approach in biology is epitomized by molecular biology, which in the past two decades has led to the generation of a plethora of omics data. These data provide information on the building blocks of the entire organism at different scales and for different types of cells, tissues and organs. Data on DNA fragments, genes, RNA fragments, peptides, proteins and metabolites measured at short time and space intervals provide a spatiotemporal distribution of these building blocks under various states of the organism.
These interwoven building blocks control and are controlled by signals in a non-linear way. As a result, understanding the system requires more than simply the bottom-up assembly of the system’s components. Systems theory, which is a holistic approach, addresses the limitations of the reductionist point of view by considering the system as a whole, adopting a top-down approach [4]. Thus, it studies the emergent properties of the system, such as homeostasis, adaptivity, tolerance, stability and modularity, through some basic overlying hierarchical principles such as entropy and positive and negative feedback control [4]. In the context of systems approaches, graph theory and the broader science of networks have been successfully applied to the investigation of complex phenomena across a range of scientific disciplines. The theoretical context of the complex networks approach includes concepts derived from information theory, dynamical systems, statistical physics and topology, as well as several mathematical methods suited to the analysis of the interaction of components in a complex network. In the following subsection we provide an in-depth overview of such network approaches and examples of their use in Systems Bioinformatics.
Networks
Biological network basics
Casting biological systems as networks and analysing their topology can be useful in understanding how such systems are organized. Graph theory provides a powerful mathematical framework for understanding the organization of such large and complex systems by considering them in the form of graphs [5, 6]. Graphs, also termed networks, can be used to model the pairwise relations between objects. A network is a collection of nodes or vertices connected by edges, arcs or lines (as shown in Figure 3A).
It may be undirected, meaning that there is no distinction between the two nodes associated with each edge, or its edges may be directed from one node to another (as shown in Figure 3B). In cell biology, nodes represent cellular components (e.g. proteins) and edges represent interactions or other relationships between these components [e.g. protein–protein interactions (PPIs)]. Basic network measures can be used to analyse the components of a network, both locally and globally, and facilitate the extraction of useful information from a biological network. The most elementary characteristic of a node is its degree, i.e. the number of edges connecting it to its neighbours. The probability distribution of the degrees over the whole network is called the degree distribution. In random networks, most nodes have a similar number of links and the degree distribution follows a Poisson distribution. In contrast, many real-world networks, including most biological networks, are scale free. This means that their degree distribution follows a power law: most of the nodes have few links and only a few nodes are densely connected [7].
Figure 3. Network Basics. (A) The basic elements of a network are illustrated in this simple network where a circle indicates a node and a line indicates an edge. (B) Networks can be either undirected (upper panel) or directed (lower panel). (C) Hubs (red nodes, or dark grey nodes in black-and-white printing) and bottlenecks (green nodes, or medium grey nodes in black-and-white printing) are illustrated in this sample graph. Two example modules (green and blue areas, or shadowed areas in black-and-white printing) are illustrated as subgroups of nodes and their respective edges.
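The degree and the degree distribution defined above can be computed directly from an edge list. The following is a minimal sketch in plain Python; the five-node network and the function names `degrees` and `degree_distribution` are hypothetical and chosen only for illustration, not taken from the review.

```python
from collections import Counter

# Hypothetical toy undirected network: each tuple is one edge between two nodes.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]


def degrees(edge_list):
    """Return {node: degree}, counting each undirected edge at both endpoints."""
    deg = Counter()
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def degree_distribution(deg):
    """Return {k: P(k)}, the fraction of nodes that have degree k."""
    counts = Counter(deg.values())
    n = len(deg)
    return {k: c / n for k, c in sorted(counts.items())}


deg = degrees(edges)
dist = degree_distribution(deg)
print(deg)   # degree of every node
print(dist)  # empirical degree distribution P(k)
```

On a network this small the distribution says little; on real interaction data, plotting P(k) against k on log-log axes is a common first check of whether a heavy-tailed, power-law-like shape is plausible.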
Nodes with high degree, known as ‘hubs’ (illustrated in Figure 3C), can be key players in molecular mechanisms such as (i) a protein interacting with multiple other proteins, (ii) regulation of multiple genes by a key transcription factor or (iii) multipart regulation by other regulatory elements [e.g. microRNAs (miRNAs)]. All these cellular processes may be highly significant in determining the outcome or phenotype of a disease of interest. In human cells, hub genes have been found to correspond to essential genes (i.e. genes critical for survival) rather than disease genes [8]. Another node feature is its betweenness, i.e. the extent to which a node participates in the shortest paths connecting other nodes. Nodes with high betweenness, known as ‘bottlenecks’, can be extremely influential in a network in the sense that they sit at critical junctions between hubs and can therefore act as bridges that allow groups of nodes to cross-talk with each other (as illustrated in Figure 3C). Importantly, some of these bottleneck nodes represent key connections whose removal results in the complete loss of connectivity between clusters of nodes, thus greatly affecting the overall topology and, as a result, the propagation of information in the network. In molecular terms, an example of bottleneck nodes is that of proteins whose loss of function leads to the deactivation of specific processes. In directed regulatory protein networks, betweenness was shown to be a good predictor of essentiality [9]. Various other network features can be calculated to provide insights into biological networks. One such measure is closeness (based on the average length of the shortest paths from one node to the other nodes), which indicates important nodes that can communicate quickly with other nodes of the network. For example, in a protein signalling network closeness can be interpreted as the ‘probability’ of a protein being functionally relevant for several other proteins.
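To make the distinction between hubs and bottlenecks concrete, the sketch below computes unnormalised betweenness (using Brandes' accumulation over breadth-first searches) and closeness for a hypothetical network of two triangles joined through a single bridging node. The network, the function names and the choice to leave betweenness unnormalised are assumptions made for illustration only.

```python
from collections import deque

# Hypothetical toy network: two triangles joined through a bridging node "D".
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G")]


def adjacency(edge_list):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs(adj, source):
    """Return distances, shortest-path counts, predecessors and visit order."""
    dist = {source: 0}
    sigma = {node: 0 for node in adj}
    sigma[source] = 1
    preds = {node: [] for node in adj}
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def betweenness(adj):
    """Unnormalised betweenness of every node in an undirected graph."""
    score = {node: 0.0 for node in adj}
    for s in adj:
        _, sigma, preds, order = bfs(adj, s)
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):        # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was counted once from each endpoint.
    return {node: value / 2.0 for node, value in score.items()}


def closeness(adj):
    """Inverse of the mean shortest-path distance to all reachable nodes."""
    result = {}
    for s in adj:
        dist, _, _, _ = bfs(adj, s)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


adj = adjacency(edges)
bc = betweenness(adj)
cl = closeness(adj)
print(sorted(bc.items(), key=lambda item: -item[1]))
print(sorted(cl.items(), key=lambda item: -item[1]))
```

In this toy network node D has only two neighbours, fewer than C or E, yet every shortest path between the two triangles passes through it, so it receives the highest betweenness. This is the bottleneck behaviour described above: removing D disconnects the two groups.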
An example illustrating how network measures, such as network efficiency and network clustering, can be used as biomarkers is the recent study of Blain-Morales et al. [10], in which a network was constructed using the alpha band (8–13 Hz) of electroencephalogram recordings during anaesthesia in healthy humans. Global network efficiency quantifies the efficiency of information exchange across the whole network and is defined as the average inverse shortest path length over all pairs of nodes. The clustering coefficient is a measure of the degree to which nodes in a network tend to cluster together (the global measure is calculated by averaging the local clustering coefficients of all nodes). In Blain-Morales et al. [10], network efficiency was significantly decreased and the network clustering coefficient was significantly increased during anaesthesia-induced unconsciousness. These measures returned to baseline 3 h post-recovery, suggesting that they could be used as potential biomarkers of the normal recovery of brain networks after general anaesthesia. Other network measures such as network size [11], density [11], PageRank versatility [12], path length [10] and modularity [10] can further be used to evaluate networks. For an extensive review of network measures, the reader is referred to [13]. Although topological properties from a graph-theory point of view do not always have a clear biological meaning, in many cases they can be good predictors of functional and disease modules (see [8] for further discussion).
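The two measures used in that study follow directly from the definitions given above. The sketch below is a minimal plain-Python illustration on a hypothetical five-node unweighted network. It is not the analysis pipeline of [10], and it assumes that unreachable pairs contribute zero to the efficiency and that nodes with fewer than two neighbours have a local clustering coefficient of zero.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy network: a triangle (A, B, C) with a two-edge tail (C-D-E).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]


def adjacency(edge_list):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def global_efficiency(adj):
    """Average of 1/d(i, j) over all ordered pairs of distinct nodes."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for s in adj:
        dist = shortest_path_lengths(adj, s)
        total += sum(1.0 / d for node, d in dist.items() if node != s)
    return total / (n * (n - 1))


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are linked to each other."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Global clustering coefficient as the mean of the local coefficients."""
    return sum(local_clustering(adj, node) for node in adj) / len(adj)


adj = adjacency(edges)
efficiency = global_efficiency(adj)
clustering = average_clustering(adj)
print(f"global efficiency: {efficiency:.3f}")
print(f"average clustering coefficient: {clustering:.3f}")
```

Adding long-range edges to such a network raises the efficiency, whereas adding edges between nodes that already share a neighbour raises the clustering coefficient, which is why the two measures can move in opposite directions.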
Biological network construction methods
Biological networks can be split into two broad categories that best characterize their underlying nature: (i) evidence-based molecular networks, which rely on experimental evidence for specific molecular interactions, such as PPI networks, metabolic networks and regulatory networks (transcription factor–gene networks, non-coding RNA–gene networks) [14–18], and (ii) statistically inferred networks, in which interactions between components are established by means of statistical analysis.
Evidence-based molecular networks: The information used to build networks of molecular interactions is obtained from small-, medium- or large-scale experimental data that are usually aggregated and available in online databases [19, 20]. A plethora of information can be derived from multiple resources including PPIs, gene regulatory relationships (including miRNAs) and metabolic pathways, using high-throughput (e.g. whole exome sequencing) and literature-curated data. In addition, valuable pre-compiled information can be derived from databases like Gene Ontology [21], REACTOME [22, 23] and literature-based annotations in Genome Recognition Analysis Internet Link [24]. The construction of biological interaction networks with the goal of uncovering causal relationships constitutes a major research topic in systems biology [25]. Many approaches have been developed to study the interactions among a large number of genes in order to highlight significant genes for each disease. Certain approaches utilize prior biological knowledge to find genes related to the disease of interest. Construction of such networks requires knowledge of PPIs, protein–DNA interactions (PDIs) and/or protein–metabolite interactions (PMIs). Such data can be obtained from open-access databases.
For example, PPI data can be obtained from the Search Tool for Recurring Instances of Neighbouring Genes (STRING) [26], the Human Protein Reference Database (HPRD) [27], the Biomolecular Interaction Network Database (BIND) [28], the Molecular INTeraction database (MINT) [29] and the Biological General Repository for Interaction Datasets (BioGRID) [30]. In a recent study, a network-based analysis of mass-spectrometry (MS)-based proteomics data from spinal nerves identified 19 biological processes involved in retrograde motoneurodegeneration and neuroprotection after axonal damage [31]. In this study, the authors used nine public PPI databases to obtain protein interaction data. PDI databases include EdgeExpressDB (FANTOM4-EEDB) [32], the Transcriptional Regulatory Element Database [33], MSigDB [34], MultiNet [35] and MetaCore [36]. The KEGG pathway database [37] can be used to obtain PMI data. Nevertheless, as a large number of genes are not functionally characterized, these approaches are compromised by the lack of available data [38]. To address this limitation, many statistical network inference methods have been developed to construct statistically inferred gene networks from omics data generated by high-throughput technologies, which provide snapshots of the transcriptome under many tested experimental conditions [39].
Statistically inferred networks: One type of statistically inferred network is the ‘co-expression network’, where genes are connected based on statistically significant correlated or anti-correlated (depending on the underlying question) expression profiles with respect to a disease of interest. Another type of statistically generated network is the ‘genetic network’ [40–42].
Sources like the BioGRID [43] database allow researchers to investigate how the dysregulation of one gene affects the downstream response of another gene and, moreover, how this cascade of molecular functions influences specific disease phenotypes. The basic idea behind network inference methods is to search for sets of co-expressed genes. Depending on the metric used, these methods can be classified into three major categories [38]: (i) mutual information-based methods, (ii) correlation-based methods and (iii) tree-based methods. Mutual information-based methods calculate the mutual information of every pair of genes in a given gene expression data set; if a pair’s value is larger than a given threshold, the two genes are considered linked. The resulting network is constructed based on this threshold by including a weighted edge between the two genes [44]. The weight can be calculated with several algorithms: ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [45], CLR (Context Likelihood of Relatedness) [46], MRNET (Maximum Relevance Minimum Redundancy) [47], MRNETB (Maximum Relevance Minimum Redundancy Backward) [47] and C3NET [48]. Correlation-based methods calculate the correlation or the partial correlation between pairs of genes. They are implemented through algorithms like GeneNet, a statistical learning algorithm that allows the assessment of Graphical Gaussian Models [49], and Weighted Correlation Network Analysis (WGCNA) [50], an algorithm that calculates correlations across each pair of genes. Other options for computing the adjacency matrix include the Spearman correlation, Lasso (a shrinkage and selection method for linear regression) [51] and Adaptive Lasso (a version of Lasso modified to include penalty weights) [52]. 
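The thresholding idea shared by the mutual information-based and correlation-based methods can be illustrated with a minimal sketch. It links two genes whenever the absolute Pearson correlation of their expression profiles reaches a cutoff. The gene names, expression values and the 0.9 threshold are illustrative assumptions, and plain Pearson correlation stands in here for the scores computed by the published algorithms (ARACNE, CLR, WGCNA and others), none of which is reproduced.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_network(profiles, threshold):
    """Return {(gene_a, gene_b): r} for gene pairs with |r| >= threshold."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r  # the correlation doubles as the edge weight
    return edges


# Hypothetical expression profiles measured across five samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to the others
}
network = coexpression_network(profiles, threshold=0.9)
```

In this toy data set geneA, geneB and geneC end up mutually linked, with negative weights marking anti-correlated pairs, whereas geneD stays isolated. Lowering the threshold makes the network denser, which is the trade-off every thresholding scheme has to settle.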
In the case of tree-based methods, algorithms use tree-based ensemble methods as feature selection techniques to solve a regression problem for each gene in the network. More specifically, the basic idea of tree-based regression is to recursively split the learning sample with binary tests, each based on one input variable (the expression of one gene). These binary tests are chosen to reduce as much as possible the variance of the output variable, namely the expression of another gene, in the resulting subsets of samples. Candidate splits compare values of the input variable with a threshold, which is determined while the tree is grown. Tree-based ensemble methods improve on single trees because they average the predictions of several trees. An example of a tree-based method is the GENIE3 algorithm [53], which emerged as the best performer in a significant network inference challenge [39]. In summary, statistical inference methods estimate the relationships between the expression patterns of all pairs of genes, leading to co-expression network inference. Correlation-based methods tend to be algorithmically straightforward and computationally fast, with the limitation that they assume linear relationships among variables. In contrast, methods based on mutual information capture non-linear as well as linear interactions, but they can be computationally expensive. Tree-based methods, in turn, are non-parametric and consequently do not need to make any assumption about the nature of the data. They can also deal effectively with high-dimensional data. Module-based approaches and network signatures Representing high-throughput data with networks often leads to complex and highly dense networks that cannot be easily interpreted by the human eye. 
To extract biologically meaningful information from these networks and establish links to disease, methods have been developed for scanning and parsing these networks. These methods allow for significant sub-networks to be highlighted in the sea of nodes and edges, often representing important ‘modules’ that are associated with a specific disease [8, 54, 55] (as illustrated in Figure 3C). This type of network ‘traversing’ can be performed between networks obtained from data from different phenotypes from the same disease (staging, subtyping) or from similar diseases (disease hierarchy). Through this way, common/shared modules can be identified by looking at network intersections. Alternatively, unique modules reflecting molecular signatures exclusive to specific conditions or phenotypes can be extracted. Depending on the integration mode, the identification of modules may lead to ‘active modules’ (by integrating molecular profiles and highlighting activity of nodes/interactions), ‘conserved modules’ (by comparing multiple species/states and concluding to conserved subnetworks), ‘differential modules’ (by comparing states and conditions and concluding to differentiated subnetworks) and ‘composite modules’ (by integrating multi-source information from complementary networks) [56]. Module identification can be performed using Systems Bioinformatics approaches and constitutes a powerful tool for delineating the systematic molecular basis of disease. Module identification thus abides by two assumptions: (i) modules that are specific to a disease of interest are expected to form dense clusters or hubs capable of detection by unsupervised network clustering algorithms, (ii) the functional relationship between the nodes residing within these clusters is expected to be similar with respect to underlying molecular mechanisms, biological processes and cellular/tissue localization [57]. 
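The comparison of networks by intersection described above can be sketched in a few lines. The example treats each network as a set of undirected edges and separates the edges shared by two conditions from those exclusive to each. The node labels and the two toy networks are hypothetical, and a real analysis would operate on modules detected by a clustering algorithm rather than on raw edge lists.

```python
def edge_set(edges):
    """Normalize undirected edges so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


def shared_and_unique(net_a, net_b):
    """Split two networks into shared edges and edges exclusive to each."""
    a, b = edge_set(net_a), edge_set(net_b)
    return a & b, a - b, b - a


# Hypothetical networks for two stages of the same disease.
stage_early = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
stage_late = [("g2", "g1"), ("g4", "g5"), ("g6", "g7")]

shared, only_early, only_late = shared_and_unique(stage_early, stage_late)
```

Here the shared set corresponds to a candidate common module, while the two exclusive sets are candidates for stage-specific signatures.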
Further functional significance of modules can be derived by using pathway enrichment analysis methods, either in the traditional sense (Fisher’s enrichment analysis) or by prioritizing/ranking pathways and genes based on network topological similarities to already validated disease network components. The latter case leads to another type of module functional annotation and assignment based on associations and connection to neighbouring nodes that are of known/validated functions. This approach has been used to elucidate functional and clinical significance of hazy areas of molecular networks for which associations between genes or proteins and specific disease phenotypes are not available in publicly available resources or through basic high-throughput analyses. There are several different types of example cases that have used network methods to gain information on functional modules and network signatures. Novel information for genes, pathways and other molecular interactions involved in numerous disorders has been discovered using network module-based approaches. These include disorders such as type 2 diabetes mellitus [58–61], Alzheimer’s disease (AD) [62, 63], Parkinson’s disease [60, 64], cardiovascular diseases, asthma [61, 65–67] and a variety of tumours [68, 69]. When existing bioinformatics resources and databases fall short in shedding light on a specific disorder of interest, novel experiments have to be designed and conducted to successfully unravel the clinicopathological, genetic and molecular mechanisms underlying disease. Success stories include studies for spinocerebellar ataxia [70], Huntington’s disease [71] and schizophrenia [72–74]. For example, a recent study utilized large-scale expression data to extract/identify biologically significant modules from gene expression networks [75]. 
It has been hypothesized that disease tissue specificity is governed by the expression of a specific functional disease module (sub-network) in the tissue of disease manifestation [75]. The authors of this study adopted this hypothesis and used systems approaches to investigate the tissue-specific expression patterns of disease genes in the human interactome. They observed that genes expressed in a specific tissue shared a topological neighbourhood in the human interactome network, in contrast to genes expressed in different tissues. The authors further provided evidence that expression of all components of the tissue-specific disease module was necessary for determining the disease outcome. The construction of this tissue-specific disease network further allowed for predictions on novel disease–tissue relationships. Network controllability Network controllability is defined as the potential to steer a network from a given initial state to a final desired state within a finite time and with appropriate inputs/modifications. Such modifications are also known as ‘network attacks’. Another Systems Bioinformatics’ emerging research direction related to the dynamic properties of complex networks is the way in which these properties are spreading and/or reforming during network attacks. The attack process is usually based on specific mathematical models that decide which nodes-edges or even specific hubs to remove, to examine issues like controllability [76], error tolerance [77], attack vulnerability [77, 78], robustness [79, 80], topological characteristics [81] and control centrality [82]. Cascade attacks [83], degree and betweenness-based attacks and prominence-based attacks are commonly used types of attacks [84]. It has been shown that intentional attacks as well as random failures may easily affect (or even destroy) network functions such as connectivity and synchronization [84, 85], while a small range of node failures can affect the network controllability [86]. 
In Systems Bioinformatics, relatively few studies have used such an approach, yet they are of great interest [87]. ‘Driving nodes’ are highly important nodes in a network that govern its controllability. Control theory shows that directing a complex network towards a desired state requires a minimum number of driving nodes. Determining this minimum number can be demanding with respect to both computational resources and time; hence, novel non-exhaustive algorithms for determining driving nodes are necessary. A recently published study [88] presents the actuation spectrum method, which optimizes the trade-off between driving node prediction and time. The authors validate their methodology across numerous complex networks and show that a small number of driving nodes is sufficient to determine the state of a complex network. Another approach makes use of PPI networks and network controllability. By controlling the structure of human PPI networks, using the correct cues or inputs, it is possible to activate specific cellular processes that determine disease outcome (e.g. apoptosis). A recent study [89] utilized a PPI network of 6339 proteins and 34 813 interactions to classify proteins with respect to their importance in the network. The authors quantified the effect of removing a specific protein from the network by calculating the number of remaining driving nodes. Results showed that the most important proteins according to this analysis were also the primary targets of disease-causing mutations, human viruses and drugs. This study showed that controllability of a network can provide crucial information for the shift between healthy and disease states, at the same time highlighting novel candidate drug targets. 
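The node-removal analysis described above can be sketched under one explicit assumption: that driving nodes are counted with the maximum-matching formulation of structural controllability, in which the minimum number of driving nodes of a directed network equals the number of nodes left unmatched by a maximum matching (and at least one). The text does not state which formulation the cited studies use, so this choice, the toy network and all function names are illustrative only.

```python
def max_matching(nodes, edges):
    """Size of a maximum matching in the bipartite view of a directed graph.

    Each edge u -> v links the 'out' copy of u to the 'in' copy of v.
    The matching is grown with augmenting paths (Kuhn's algorithm).
    """
    succ = {u: [] for u in nodes}
    for u, v in edges:
        succ[u].append(v)
    matched_to = {}  # in-copy v -> out-copy u currently matched to it

    def augment(u, seen):
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_to or augment(matched_to[v], seen):
                matched_to[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in nodes)


def n_driver_nodes(nodes, edges):
    """Unmatched nodes need an external input; at least one is always needed."""
    return max(len(nodes) - max_matching(nodes, edges), 1)


def removal_effect(nodes, edges):
    """Change in the driving-node count when each node is deleted in turn."""
    base = n_driver_nodes(nodes, edges)
    effect = {}
    for x in nodes:
        rest = [n for n in nodes if n != x]
        kept = [(u, v) for u, v in edges if x not in (u, v)]
        effect[x] = n_driver_nodes(rest, kept) - base
    return effect


# Hypothetical directed network: a chain a -> b -> c with a branch b -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("b", "d")]
drivers = n_driver_nodes(nodes, edges)
effects = removal_effect(nodes, edges)
```

In this toy network two driving nodes are needed. Deleting b raises the count, deleting c or d lowers it, and deleting a leaves it unchanged, which is the kind of per-node signal on which a classification of proteins by importance could be based.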
Network integration A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers (Figure 2). Different disease modules have been shown to act in synergy. Thus, to obtain a holistic picture of the complex mechanisms that underlie disease manifestation, it is necessary to construct networks of integrated disease modules. Such an integration can be achieved via various ways: (i) By investigating gene association it is possible to construct connections between different disease modules by looking at shared, common genes. This can reflect the genetic basis of diseases and provide associations between diseases of a common genetic background. (ii) Superimposing gene networks with gene expression data from RNA-Seq or microarray analysis can further enrich these networks of disease modules. (iii) A common genetic basis for disease modules can also be established by analysing genetic variants or polymorphisms (i.e. SNPs, indels). Networks of disease modules sharing genetic mutations can lead to important findings such as establishment of linkage and associations of variants as well as environmental factors with disease modules. (iv) Protein interaction network modules can also be merged using common PPIs between disease modules and moreover, overlaying this information with proteomics expression data can provide valuable insights into the proteome of diseases of interest. (v) Looking at common pathways between disease sub-networks can also provide valuable clues as to similarities and/or differences between diseases of interest. (vi) Metabolic pathways can also provide additional information towards the understanding of enzyme catalytic activity for different disease modules. Disorders that affect specific metabolic pathways (i.e. 
obesity) are more likely to share commonalities in their metabolic networks than diseases that share a genetic basis. (vii) Disease modules can also be linked using regulatory information such as shared miRNA regulators, thus highlighting important commonalities or differences between diseases. Specific cellular components (or modules) associated with a disease are believed to share a topological neighbourhood within the human interactome [90]. In a recent study [90], the authors utilized novel mathematical conditions to map the topological relationships between diseases in the human interactome. They showed that diseases with common expression profiles, symptoms and comorbidity share overlapping modules, in contrast to more phenotypically distinct diseases, which appear in distant topological neighbourhoods. These tools can provide valuable insights for predicting drug therapy for diseases with common phenotypes, even if they are genetically distinct. Another recent study [91] adopted a novel multiple-network framework for identifying epigenetic modules. This method utilized the Epigenetic Module based on Differential Networks (EMDN) algorithm, which simultaneously analyses DNA methylation and gene expression data [91]. Using The Cancer Genome Atlas (TCGA) breast cancer data, the authors reported that the EMDN algorithm could recognize positively and negatively correlated modules. These modules can serve as biomarkers to predict/diagnose breast cancer subtypes by using methylation profiles, where positively and negatively correlated modules are of equal importance in the classification of cancer subtypes. The authors also showed that epigenetic modules can be used to estimate the survival time of patients, a factor that is critical for cancer therapy. Tools that analyse the structure and topology of these integrated networks are of extreme value and can provide insights into the synergistic role of multiple network components in diseases of interest. 
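The first integration route listed above, connecting disease modules through shared genes, can be sketched as follows. Each module is reduced to a set of genes, and two modules are linked when they have at least one gene in common. The Jaccard index used to weight the link is an assumption introduced for illustration, as are the disease labels and gene identifiers; the text does not prescribe a particular overlap measure.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two gene sets: shared genes divided by all genes."""
    return len(a & b) / len(a | b)


def module_links(modules, min_shared=1):
    """Link every pair of disease modules sharing at least min_shared genes."""
    links = {}
    for d1, d2 in combinations(sorted(modules), 2):
        shared = modules[d1] & modules[d2]
        if len(shared) >= min_shared:
            links[(d1, d2)] = (sorted(shared), jaccard(modules[d1], modules[d2]))
    return links


# Hypothetical disease modules given as gene sets.
modules = {
    "disease_X": {"g1", "g2", "g3", "g4"},
    "disease_Y": {"g3", "g4", "g5"},
    "disease_Z": {"g6", "g7"},
}
links = module_links(modules)
```

The same pattern applies to the other routes in the list by replacing the gene sets with sets of variants, pathways or miRNA regulators.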
The methods for network analysis and integration we have discussed so far are mainly used to describe the topology of a biological network (or a set of networks). Although these methods capture the relationships between components, they fail to capture the dynamics, i.e. the time component is not modelled and, thus, simulations to obtain prediction of the evolution of the system cannot be performed. For example, static insights into the molecular basis of a disease do not provide a complete picture with regards to drug response without access to time-dependent data. Hence, the use of mathematical algorithms and computational tools for modelling the dynamics of these networks complements network analysis and is further detailed in the following section. Systems modelling and simulation To test the validity and predict the behaviour of complex biochemical systems, such as gene networks, it is often required to describe the effects of multiple, simultaneous and dynamic interactions within the components of the system that are too complex to interpret intuitively. Developing and simulating mathematical models is essential in investigating such complex biological systems. These complex systems can be further explored using mathematical models to describe the valid structure (i.e. the components of the system and their interactions based on experimental data) and identify the basic underlying principles of their function to predict behavioural responses to a certain perturbation [92]. Two types of models are commonly used to describe biological processes such as gene networks—‘quantitative’ and ‘logical’ [93, 94]. Quantitative models use differential equations to describe the non-linear dynamic interactions in a network, whereas logical models use the Boolean approach to describe dynamics in a qualitative way. Quantitative models provide precise information and can be directly compared with experiments including time-dependent data. 
However, they require sufficient knowledge of the mechanistic details and kinetic parameters and, thus, they are limited to applications to networks which are well characterized and are of small to moderate size. Logical models do not require such information and can be applied to large-scale networks with known structure, yet only provide limited information, as they cannot provide quantitative predictions and assist in choosing better alternative behaviours. In summary, each modelling approach has its advantages and disadvantages and recent work suggests that hybrid approaches might be optimal for challenges in systems biology (for a detailed discussion see [93]). Mathematical models are indispensable in pharmacology and diagnostics. For example, spatio-temporal mathematical models of the blood coagulation network have been developed to aid drug development and diagnostics (as extensively reviewed in [95]). Another relevant application is the use of mathematical models of drug-targeted pathways (modelled with a set of differential equations based on the mass action law) to explore drug combinations [96]. Classical bioinformatics and systems biology can complement and strengthen each other in drug discovery and therapeutics where concrete predictions are required [92]. The value of combining high-throughput data with mathematical modelling is shown, for example, in devising personalized treatments in cancer (for extensive review see [97]). Integration of multi-omic data can be used as an additional constraint in constraint-based modelling in systems biology (to optimize parameter estimation and validation). For a recent survey summarizing constraint-based metabolomic modelling and multi-omic integration methods see [98]. Infrastructures and data management It is important to highlight some of the modern computing trends that play an important role in driving research in Systems Bioinformatics and facilitate the transition to personalized medicine. 
The main limiting factor for research laboratories specializing in Systems Bioinformatics is computational power and resources. Significant investment is required to acquire high-performance computing (HPC) servers or clusters, which have the capacity to store, manage and process the vast amount of data generated from high-throughput omics technologies. Often the sheer maintenance of these machines is costly and becomes a limiting factor that prevents the full potential of HPC from being exploited. Cloud computing promises to solve major issues of system administration for these computer clusters by allowing HPC resources, stored and managed in an expert environment, to be used as virtual resources made available through the internet. Tool availability is also a major issue, and having organized platforms with tools like Cytoscape [99], GATK [100], BLAST [101], omics assemblers (e.g. IDBA-UD [102]) and programming languages and packages such as R's Bioconductor library for expression data analysis [103], Java, Python, SQL and others is of major importance for scientists to facilitate dissemination of algorithms, data and results. Systems bioinformatics applications In this section we present the impact of Systems Bioinformatics on diagnostics and therapeutics by highlighting success stories and cases in a structured format: introductory text/data set collection/network construction/network analysis/findings and significance of research. Systems bioinformatics applications in biomarker discovery The use of networks in computational diagnostics via the detection of molecular biomarkers is one of the hallmarks of Systems Bioinformatics. Numerous recent state-of-the-art studies have made use of such networks to characterize cellular systems by simultaneously analysing thousands of genes, proteins, isoforms and complexes to address issues of computational diagnostics. 
Here, we highlight a few studies, showcasing the essence of networks’ contribution in precision diagnostics. A recent study [62] used a machine learning approach, which integrates topological features from PPI networks, to identify candidate AD-associated genes. Dataset collection: Positive and negative data sets were collected from Entrez Gene database at the National Centre for Biotechnology Information (NCBI). The positive data set consisted of 458 genes known to be associated with AD. The negative data set consisted of the additional 55 947 Entrez genes, excluding the AD-associated genes. Network construction: Human PPI data sets were extracted from a variety of sources including Online Predicted Human Interaction Database (OPID), STRING, MINT, BIND and InTAct databases. Network analysis: By utilizing the PPI networks, the authors extracted topological features for the AD- and non-AD-associated genes. These features included nine topological properties of the PPI network for each gene, namely, the average shortest path length, betweenness centrality, closeness centrality, clustering coefficient, degree, eccentricity, neighbourhood connectivity, topological coefficient and radiality. Findings and significance of research: The authors further combined sequence features and functional annotations features and concurrently performed feature selection using seven methods including gain-ratio-based attribute evaluation, oneR algorithm, chi-square-based selection, correlation-based selection, information gain-based attribute evaluation and relief-based selection. The most important features were fed into 11 machine learning algorithms to generate classifiers using the training data set capable of predicting AD- and non-AD-associated genes using the selected network, sequence and functional features. 
Methods included Naive Bayes (NB), NB Tree, Bayes Net, Decision table/NB hybrid classifier, Random Forest, J48, Functional Tree, Locally Weighted Learning (J48 + k-nearest neighbour), Logistic Regression and Support Vector Machine. Training of sophisticated machine learning classifiers using systemic properties can be a key feature in generating personalized medicine diagnostic approaches. The authors finally combined diagnostics with therapeutics by using molecular docking to screen 45 known anti-Alzheimer drugs from DrugBank against novel predicted probable AD targets obtained from their trained classifiers. They further proposed a novel, previously untried candidate drug, AL-108, with high affinity to potential therapeutic targets. Additional tools were also used to validate preliminary findings, including molecular dynamics simulations and MM/GBSA calculations on the docked complexes [62]. Another interesting study [104] used data from TCGA [105] to successfully construct a multidimensional subnetwork atlas for cancer prognosis. The authors addressed how multiple genetic and epigenetic factors (i.e. gene expression, copy number variation, miRNA expression and DNA methylation) affect molecular states of networks and patient survival. Dataset collection: The multidimensional cancer-associated data sets for 1027 patients across four cancer types were collected from the TCGA Cancer Browser (https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/). They contained clinical information, copy-number variation, promoter DNA methylation, mRNA-gene and miRNA expression data. The authors furthermore extracted PPI data from HPRD for network construction. To enrich these networks with additional miRNA-regulatory information, the authors extracted miRNA and target gene information from two miRNA target databases [miRTarBase (Release 4.5) and TarBase v6], which provide experimentally validated miRNA–target interactions. 
Network construction: PPI interaction network was constructed using data collected from HPRD. Network analysis: The authors fitted a univariate Cox proportional hazards model between each molecular feature and patient survival time and thus scored each gene based on its significance to predict survival. Genes with a positive score were considered as survival-related genes. They next used this score (heat score) as the input into HotNet2, which uses a heat diffusion process and a statistical test-based algorithm to discover subnetwork signatures in the PPI network. Through this way subnetwork signatures of survival-related genes were determined both by the scores of their genes as well as gene topology in the PPI network. Findings and significance of research: The authors then used Monte Carlo cross-validation and permutation testing procedure to assess predictive power of the subnetworks on patient overall survival. They used a Cox proportional hazards model with L1 penalized log partial likelihood (LASSO) for feature selection to train the models based on the molecular profile of individual subnetworks. Finally, the prognostic outcomes for the training set were used to determine the regression coefficients. These coefficients were then used in the testing model to predict outcomes for patients in the test set and calculate the concordance index (C-index). Results reveal novel PPI subnetworks with significant prognostic capabilities for a variety of cancer types. The authors further validated their subnetworks by performing prognostic impact evaluation, functional enrichment analysis, drug target annotation, tumour stratification and independent validation. They highlighted distinct pathways in the underlying subnetworks as potential new targets for therapeutic intervention for certain cancer types. 
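The concordance index used above to assess predictive power can be made concrete with a short sketch of Harrell's C-index for right-censored survival data. Only the evaluation step is shown; fitting the penalized Cox model is not reproduced, and the follow-up times, event indicators and risk scores are invented for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    A pair of patients is comparable when the one with the shorter
    follow-up time experienced the event. The pair is concordant when
    that patient also has the higher predicted risk; ties count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical test set: follow-up time, event flag (1 = death observed,
# 0 = censored) and the risk score predicted by a trained model.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by risk. In this toy example four of the five comparable pairs are ordered correctly.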
This study integrated the protein interactome with cancer genomics data, thus allowing for a systemic analysis of the molecular mechanisms that underlie the genesis of cancer and providing new directions in personalized cancer therapy [104]. Another recent study [106] adopted an approach that uses an enriched library of single-stranded oligodeoxynucleotides to profile complex biological samples. This method allows for the analysis of systemic native biomolecules. The authors named their method Adaptive Dynamic Artificial Poly-ligand Targeting and further utilized it as a diagnostic tool to profile plasma exosomes of cancer patients. They achieved high classification accuracy in breast cancer patients by analysing the circulating exosomes in their blood [106]. The online database MelGene is yet another example of successful integration of Systems Bioinformatics approaches in current research for molecular diagnostics. This tool provides a comprehensive, regularly updated collection of data from genetic association studies in cutaneous melanoma, including random-effects meta-analysis results of all eligible polymorphisms [107]. The network connections proposed by MelGene highlight potentially new loci in relation to melanoma risk. Recent studies have shown that interpretation of proteomics data using network-based approaches can offer additional insights into the mechanisms and dynamics of protein assemblies, and hence into the molecular mechanisms underlying the system under study. Moreover, network-based approaches can be used to reconstruct a disease-perturbed cellular network model showing the interactions of identified differentially expressed proteins involved in selected cellular pathways related to the target pathophysiology. For example, Shirasaki et al. [108] have used affinity purification coupled to MS to investigate the proteome profile of Huntington’s disease. 
In particular, using a monoclonal antibody against huntingtin (Htt), they identified 747 proteins to be complexed with Htt. A systems-level view of Htt interactome was achieved by using WGCNA, which was used to construct weighted links between the Htt co-purifying proteins. Using topological overlap, the data were clustered into eight Htt-interactome modules that were related to distinct functional aspects such as brain region specificity, aging and protein aggregation modulation or Htt functions directly [108]. Moreover, several network-based approaches have been developed that can identify the cellular pathways which are altered under pathophysiological conditions, and can hence enrich biomarker discovery. For example, functional enrichment analysis of GO biological processes or KEGG pathways [37] of differentially expressed proteins can be performed using both free licence tools such as Database for Annotation, Visualization and Integrated Discovery (DAVID) [109], Protein ANalysis THrough Evolutionary Relationships (PANTHER) [110] and Gene Set Enrichment Analysis (GSEA) [34], as well as commercialized tools such as MetaCore [36] and Ingenuity Pathway Analysis. Furthermore, pathway topology approaches have been developed as alternative to enrichment analysis. For example, Signalling Pathway Impact Analysis [111] and Network Perturbation Amplitude [112] deliberate whether the proteins involved in functional modules defined by other databases interact with each other in cellular networks. Various tools are currently available, which can aid the Systems Bioinformatics application in biomarker discovery. GWAB, a recent tool, makes use of systems approaches and computational methods to boost weak association signals for Genome Wide Association Studies (GWAS), a common problem when analysing this type of data. 
This tool works by incorporating publicly available data in the form of using GWAS summary statistics (p-values) for SNPs along with reference genes for a disease of interest. The authors demonstrated the feasibility of boosting GWAS disease associations using gene networks and further present a web server for GWAB, for the network-based boosting of human GWAS data [113]. Other tools like GeneMANIA [114] and PINTA [115] allow for gene prioritization and gene function prediction and can greatly aid in computational diagnostics. Systems Bioinformatics applications in drug discovery Systems Bioinformatics contributes in computational therapeutics by providing tools and algorithms for novel drug discovery. Research in this direction is often done in close collaboration with pharmaceutical companies. One of the main challenges faced by both the research community and the industry is the prediction of adverse drug effects, especially during the early stages of drug development. These types of predictions can lead to significant cost reductions by allowing for accurate drug assessment and discontinuation of development for drugs with severe adverse effects. The use of human genetic variation has been known to play an important role in drug response [116]; however, the effect of this factor alone is not sufficient to provide a complete perspective on the matter in hand. Systems pharmacology is a term that is widely used today in many high-calibre, recently published studies [117–119]. Systems pharmacology is a systems biology approach, which focuses on enhancing the understanding of drugs function in the human body at a systems’ level, described by several types of networks, rather than looking at the effect of single molecular components. It shifts away from traditional practice, which considers the effects of a drug with respect to its target protein and instead strives to address the effects of the drug by considering a network of drug–target interactions. 
Systems Bioinformatics is closely related to systems pharmacology and provides important methods and tools for multi-source and multilevel integration of the omics spectrum with drug networks, shedding light on modern drug discovery. A recent area of great interest where Systems Bioinformatics can have substantial impact is drug repurposing, or repositioning [120]. This entails the use of Food and Drug Administration (FDA)-approved drugs to treat diseases other than the ones they were initially designed for. This offers an obvious shortcut for pharmaceutical companies, allowing them to bypass the time-consuming and costly process of FDA approval for novel drugs. Recent studies used gene expression data derived from microarrays or RNA-Seq to obtain expression profiles for specific diseases of interest. Comparing these profiles with collections of data sets from repositories such as CMap [121], Drugmap Central and more advanced versions such as LINCS and the recent Drug Repurposing Hub [121, 122] allows alternative drugs to be proposed for the treatment of the diseases under investigation.

In a recent study [38], this approach was used to devise drug/target networks obtained from algorithms of mutual information and co-expression networks, aiming to gain insights into the treatment of breast cancer subtypes.

Data collection: TCGA mRNA (microarray) gene expression data for Breast Invasive Carcinoma cases were obtained from Firehose (http://gdac.broadinstitute.org/). From a total of 587 samples (526 primary solid tumour samples and 61 primary solid normal samples; 17 814 genes), the authors selected a subset of tumour data containing information regarding breast cancer staging, HER2, ER and PR status with their corresponding normal samples as well as breast cancer stages I, II, III and IV.
Network construction: The authors examined three major categories of statistical network inference methods: (i) mutual information-based methods, (ii) correlation-based methods and (iii) tree-based methods. They further used biological information-based network methods and one ensemble scheme combining all statistical network inference methods. They used the Cytoscape platform, and more specifically the GeneMANIA plug-in, to reconstruct the biological information-based gene network. This plug-in uses a large data set unifying functional networks, comprising approximately 800 networks for six organisms including Homo sapiens. Using the H. sapiens network, they constructed a sub-network for the top 1000 differentially expressed genes (DEGs) from the TCGA data set, merging five network types: Co-expression, Physical Interaction, Genetic Interaction, Co-localization and Pathways.

Network analysis: The authors further performed gene re-ranking using the underlying networks. To investigate the influence of the 17 reconstructed gene networks (12 statistically and 5 biologically inferred) on gene prioritization, they applied a method that allows for a custom network selection, combining the absolute log fold change values with the topology of the selected underlying network to re-rank the initial DEGs. The basic idea of the method is to reconcile the gene expression values with topological features of the underlying gene network, such as degree and betweenness. The network patterns were further analysed to investigate their exclusive contribution with respect to breast cancer subtypes and stages. The authors then performed drug repurposing using the up- and down-regulated genes that form the disease signatures, querying them in a well-established drug repurposing pipeline, namely LINCS-L1000 (http://www.lincscloud.org/), an advanced version of CMap. In summary, the authors obtained 63 unique drugs for the breast cancer stages and 58 for the breast cancer subtypes.
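The re-ranking idea described above, combining absolute log fold change with network topology, can be illustrated with a minimal sketch. The scoring rule used here (absolute log fold change weighted by normalized degree) is an assumption made for illustration only and is not the method published in [38]; the gene names, fold changes and edges are hypothetical toy values, and betweenness is omitted for brevity.

```python
def node_degrees(edges):
    """Count the number of neighbours of each gene in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def rerank(log_fc, edges):
    """Re-rank genes by |log fold change| weighted by normalized degree.

    Hypothetical scoring rule: score = |logFC| * (1 + degree / max_degree).
    Genes absent from the network keep their plain |logFC| as score.
    """
    deg = node_degrees(edges)
    max_deg = max(deg.values(), default=1)
    scores = {
        gene: abs(fc) * (1 + deg.get(gene, 0) / max_deg)
        for gene, fc in log_fc.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy differential expression values (hypothetical).
log_fc = {"G1": 2.0, "G2": -1.5, "G3": 1.8, "G4": 0.5}
# Toy network in which G2 is a hub connected to every other gene.
edges = [("G2", "G1"), ("G2", "G3"), ("G2", "G4")]

by_expression = sorted(log_fc, key=lambda g: abs(log_fc[g]), reverse=True)
by_network = rerank(log_fc, edges)
print(by_expression)
print(by_network)
```

In this toy example the hub gene G2 moves from third place by expression alone to first place once degree is taken into account, which is the kind of shift a topology-aware re-ranking is meant to produce.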
To further examine the resulting drugs, the authors constructed super networks by combining the top drugs extracted from their analysis with the FDA-approved breast cancer drugs, connecting them with their target genes and superimposing these on the gene expression networks.

Findings and significance of research: The analysis yielded eight network patterns, four for the stages (I, II, III and IV) and four for the subtypes (Triple Negative, Luminal A, Luminal B and HER2). These patterns were shown to highlight four exclusive stage-related pathways: phenylalanine metabolism for Stage II, the peroxisome proliferator-activated signalling pathway and glycolysis and gluconeogenesis for Stage III, and the toll-like receptor signalling pathway for Stage IV. Finally, the authors performed drug repurposing to elucidate potential anti-breast-cancer properties of known drugs, and they compared the molecular structures of their predicted repurposed drugs against 25 FDA-approved drugs in clinical use. Two of these 25 drugs (Gemcitabine and Palbociclib) were also found as repurposed drugs by the authors. In Stage I, two repurposed drugs, Clofarabine and Kinetin-riboside, were found to be structurally similar to Gemcitabine. Clofarabine seems to have potential efficacy in epigenetic therapy of solid tumours, especially at early stages of carcinogenesis.

Another recent line of work [64] performed in silico drug efficacy screening using network-based approaches. The authors investigated the association between drug targets and diseases, presenting a drug–disease proximity measure [64].

Data collection: The authors used 1489 diseases defined by Medical Subject Headings (MeSH) compiled in a recent study [90]. For each disease, the disease–gene associations were collected from OMIM and the GWAS catalogue.
For each disease, the authors extracted information on FDA-approved drugs from DrugBank and matched 79 of these diseases with at least one drug using tools such as MEDI-HPS and Metab2Mesh, resulting in 238 unique drugs and 384 targets. The authors used the data set published in [90], which contains experimentally documented human protein physical interactions from TRANSFAC, IntAct, MINT, BioGRID, HPRD, KEGG, BIGG, CORUM, PhosphoSitePlus and a large-scale signalling network.

Network construction: The human PPI network was compiled from the databases described above to generate an elaborate human interactome. The largest connected component of this interactome, consisting of 141 150 interactions between 13 329 proteins, was used in the analysis. Entrez Gene IDs were used to map disease-associated genes to the corresponding proteins in the interactome.

Network analysis: The proximity between a disease and a drug was evaluated using various distance measures that take into account the path lengths between drug targets and disease proteins. The authors focused on two types of network-based proximity relationships between drugs and disease proteins: (i) the most straightforward measure, the average shortest path length between all targets of a drug and the proteins involved in the same disease; and (ii) the closest measure, the average shortest path length between the drug’s targets and the nearest disease protein.

Findings and significance of research: The authors validated their approach and optimized their proximity thresholds by assessing how well relative proximity discriminates 402 known drug–disease pairs from the 18 162 unknown drug–disease pairs, comparing the area under the Receiver Operating Characteristic curve across distance measures. Based on these results, the authors showed that network proximity delineates the therapeutic effects of a drug.
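The two proximity measures described above can be made concrete with a small sketch. It is a toy illustration under stated assumptions, not the implementation used in [64]: the network, the drug targets and the disease proteins are hypothetical, the graph is unweighted and assumed to be connected (as is the case for a largest connected component), and the normalization against random expectation that a relative proximity would require is omitted.

```python
from collections import deque
from statistics import mean


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbours mapping."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def bfs_distances(adjacency, source):
    """Unweighted shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_shortest_proximity(adjacency, targets, disease_proteins):
    """Mean shortest path length over all (drug target, disease protein) pairs."""
    lengths = []
    for target in targets:
        dist = bfs_distances(adjacency, target)
        lengths.extend(dist[protein] for protein in disease_proteins)
    return mean(lengths)


def closest_proximity(adjacency, targets, disease_proteins):
    """Mean, over drug targets, of the distance to the nearest disease protein."""
    nearest = []
    for target in targets:
        dist = bfs_distances(adjacency, target)
        nearest.append(min(dist[protein] for protein in disease_proteins))
    return mean(nearest)


# Hypothetical toy interactome: a chain A-B-C-D-E with a side branch B-F.
ppi = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "F")])
drug_targets = ["A", "F"]       # hypothetical targets of one drug
disease_proteins = ["C", "E"]   # hypothetical proteins of one disease

print(average_shortest_proximity(ppi, drug_targets, disease_proteins))
print(closest_proximity(ppi, drug_targets, disease_proteins))
```

In this toy network both targets lie two steps from C and four steps from E, so the average shortest measure is 3 while the closest measure is 2. The closest measure is smaller because it ignores disease proteins that are far from every target.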
This use of network proximity between drug targets and disease proteins in the interactome improved the understanding of the therapeutic effects of drugs. The authors used cases from Parkinson’s disease and several inflammatory disorders to further substantiate their findings. This approach can potentially have significant applications in drug discovery, drug repurposing and assessment of drug adverse effects.

Another study [123] led to the development of a current state-of-the-art tool that addresses computational therapeutics from a network perspective, the TCM-Mesh system. This tool allows for high-throughput network pharmacology analysis for Traditional Chinese Medicine (TCM) [123].

Dataset collection: TCM-Mesh uses data curated from collections of 6235 herbs, 383 840 compounds, 14 298 genes, 6204 diseases, 144 723 gene–disease associations, 3 440 231 pairs of gene interactions, 163 221 side-effect records and 71 toxic records (data as of April 2017). Information on traditional Chinese herbs and TCM preparations was extracted from TCM Database@Taiwan and TCMID; information on compounds and their targets, and on diseases and their related proteins, was obtained from STITCH and OMIM, respectively; protein interactions were obtained from STRING; and the toxicity and side-effect records of compounds were derived from TOXNET and SIDER.

Network construction: The authors used Cytoscape as well as web-based software to facilitate the construction and visualization of a compound–gene–disease network linking TCM and treated diseases.

Network analysis: The authors based their network analysis on, and scored their compounds with, the combined score as defined in and obtained from the STITCH database. This score represents the strength of the links between the compounds and their associated proteins.
Findings and significance of research: The authors used 1293 FDA-approved drugs, as well as compounds from the herbal material Panax ginseng and the patented drug Liuwei Dihuang Wan, to evaluate their database. By comparing different databases and checking against the literature, they demonstrated the completeness, effectiveness and accuracy of the TCM-Mesh database, which further aids the understanding of the molecular mechanisms of TCM action.

Various tools are currently available that can aid the application of Systems Bioinformatics in drug discovery. For example, tools such as Substructure-Drug-Target Network-Based Inference (SDTNBI) [124], C(2) Maps [125], Chem2Bio2RDF [126] and PROMISCUOUS [127] cumulatively provide integrated systems and pharmacology databases for chemoinformatics analysis, drug-target prediction, networks of disease–gene–drug connectivity relationships as well as drug repositioning analysis. For a full list of tools and databases adopting or supporting Systems Bioinformatics methodologies, see Table 1. A more comprehensive list of related tools and databases going back to 2010 can be found in Supplementary Table S1.

Table 1 Tools and databases for systems bioinformatics approaches in therapeutics, diagnostics, network visualization/analysis, integration and systems modelling
Tool category/description | Publication year | Link | Reference
Network-based therapeutics
TCM-Mesh: The database and analytical system for network pharmacology analysis for TCM preparations | 2017 | http://mesh.tcm.microbioinformatics.org/ | [123]
SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning | 2017 | The program is available on request | [124]
A protein network descriptor server and its use in studying protein, disease, metabolic and drug-targeted networks | 2016 | http://bidd2.nus.edu.sg/cgi-bin/profeat2016/main.cgi | [128]
systemsDock: a web server for network pharmacology-based prediction and analysis | 2016 | http://systemsdock.unit.oist.jp/iddp/home/index | [129]
BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology | 2016 | https://www.bindingdb.org/bind/index.jsp | [130]
NFFinder: an online bioinformatics tool for searching similar transcriptomics experiments in the context of drug repositioning | 2015 | http://nffinder.cnb.csic.es/ | [131]
NutriChem: a systems chemical biology resource to explore the medicinal value of plant-based foods | 2015 | http://sbb.hku.hk/services/NutriChem-2.0/FoodDisease.php | [132]
TIMMA-R: an R package for predicting synergistic multi-targeted drug combinations in cancer cell lines or patient-derived samples | 2015 | https://cran.r-project.org/web/packages/timma/ | [133]

Network-based diagnostics
GWAB: a web server for the network-based boosting of human genome-wide association data | 2017 | http://www.inetbio.org/gwab/ | [113]
Netter: re-ranking gene network inference predictions using structural network properties | 2016 | https://github.com/JRuyssinck/netter | [134]
MetaNetVar: Pipeline for applying network analysis tools for genomic variants analysis | 2016 | https://github.com/NCBI-Hackathons/Network_SNPs | [135]
GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets | 2016 | http://www.integrativegenomics.org/ | [136]
NetDecoder: a network biology platform that decodes context-specific biological networks and gene activities | 2016 | http://netdecoder.hms.harvard.edu/ | [137]
MUFFINN: cancer gene discovery via network analysis of somatic mutation data | 2016 | http://www.inetbio.org/muffinn/ | [138]
HitWalker2: visual analytics for precision medicine and beyond | 2016 | https://github.com/biodev/HitWalker2 | [139]
NCG 5.0: updates of a manually curated repository of cancer genes and associated properties from cancer mutational screenings | 2016 | http://ncg.kcl.ac.uk/ | [140]
dbSNO 2.0: a resource for exploring structural environment, functional and disease association and regulatory network of protein S-nitrosylation | 2015 | http://140.138.144.145/~dbSNO/index.php | [141]
Causal biological network database: a comprehensive platform of causal biological network models focused on the pulmonary and vascular systems | 2015 | http://causalbionet.com/ | [142]

Network reconstruction-visualization-analysis
MotifNet: a web-server for network motif analysis | 2017 | http://netbio.bgu.ac.il/motifnet/ | [143]
cMapper: gene-centric connectivity mapper for EBI-RDF platform | 2017 | http://cmapper.ewostech.net/ | [144]
BRANE Clust: Cluster-assisted gene regulatory network inference refinement | 2017 | http://www-syscom.univ-mlv.fr/~pirayre/Codes-GRN-BRANE-clust.html | [145]
vcfr: a package to manipulate and visualize variant call format data in R | 2017 | https://cran.r-project.org/web/packages/vcfR/index.html | [146]
shinyheatmap: Ultra-fast low-memory heatmap web interface for big data genomics | 2017 | http://shinyheatmap.com/ | [147]
PROXiMATE: a database of mutant protein–protein complex thermodynamics and kinetics | 2017 | http://www.iitm.ac.in/bioinfo/PROXiMATE/ | [148]
Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks | 2017 | https://github.com/ibalaur/MetabolicFramework | [149]
RAIN: RNA–protein association and interaction networks | 2017 | http://rth.dk/resources/rain/ | [150]
Phenopolis: an open platform for harmonization and analysis of genetic and phenotypic data | 2017 | https://uclex.cs.ucl.ac.uk/ | [151]
Pheno4J: a gene to phenotype graph database | 2017 | https://github.com/phenopolis/pheno4j | [152]
SigMod: an exact and efficient method to identify a strongly interconnected disease-associated module in a gene network | 2017 | https://github.com/YuanlongLiu/SigMod | [153]
iRegNet3D: three-dimensional integrated regulatory network for the genomic analysis of coding and non-coding disease mutations | 2017 | http://iregnet3d.yulab.org/index/ | [154]
JDINAC: joint density-based non-parametric differential interaction network analysis and classification using high-dimensional sparse omics data | 2017 | https://github.com/jijiadong/JDINAC | [155]
SmartR: An open-source platform for interactive visual analytics for translational research data | 2017 | https://github.com/transmart/SmartR | [156]
D-Map: random walking on gene network inference maps towards differential avenue discovery | 2017 | http://bioserver-3.bioacademy.gr/Bioserver/DMap/index.php | [157]
TRaCE+: Ensemble inference of gene regulatory networks from transcriptional expression profiles of gene knock-out experiments | 2016 | http://www.cabsel.ethz.ch/tools/trace.html | [158]
The Network Library: a framework to rapidly integrate network biology resources | 2016 | https://github.com/gsummer |
Web-based network analysis and visualization using CellMaps | 2016 | http://cellmaps.babelomics.org/ | [159]
PathwAX: a web server for network crosstalk based pathway annotation | 2016 | http://pathwax.sbc.su.se/ | [160]
Pathway Tools version 19.0 update: software for pathway/genome informatics and systems biology | 2016 | http://brg.ai.sri.com/ptools/ | [161]
NAPS: Network analysis of protein structures | 2016 | http://bioinf.iiit.ac.in/NAPS/ | [162]
UbiNet: an online resource for exploring the functional associations and regulatory networks of protein ubiquitylation | 2016 | http://140.138.144.145/~ubinet/index.php | [163]
MET network in PubMed: a text-mined network visualization and curation system | 2016 | http://btm.tmu.edu.tw/metastasisway | [164]
QuIN: a web server for querying and visualizing chromatin interaction networks | 2016 | https://quin.jax.org/ | [165]
NET-GE: a web server for NETwork-based human gene enrichment | 2016 | http://net-ge.biocomp.unibo.it/enrich | [166]
IIIDB: a database for isoform–isoform interactions and isoform network modules | 2015 | http://syslab.nchu.edu.tw/IIIDB/ | [167]
cyNeo4j: connecting Neo4j and Cytoscape | 2015 | http://apps.cytoscape.org/apps/cyneo4j | [168]
BRANE Cut: biologically related a priori network enhancement with graph cuts for gene regulatory network inference | 2015 | http://www-syscom.univ-mlv.fr/~pirayre/Codes-GRN-BRANE-cut.html | [169]
NetExplore: a web server for modelling small network motifs | 2015 | http://line.bioinfolab.net/nex/NetExplore.htm | [170]
COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems | 2015 | http://coxpresdb.jp/ | [171]
NAIL: a software toolset for inferring, analysing and visualizing regulatory networks | 2015 | https://sourceforge.net/projects/nailsystemsbiology/ | [172]
LncReg: a reference resource for lncRNA-associated regulatory networks | 2015 | http://bioinformatics.ustc.edu.cn/lncreg/ | [173]
TeloPIN: a database of telomeric proteins interaction network in mammalian cells | 2015 | http://songyanglab.sysu.edu.cn/telopin/ | [174]
MIsoMine: a genome-scale high-resolution data portal of expression, function and networks at the splice isoform level in the mouse | 2015 | http://guanlab.ccmb.med.umich.edu/misomine/ | [175]
CerebralWeb: a Cytoscape.js plug-in to visualize networks stratified by subcellular localization | 2015 | http://www.innatedb.ca/CerebralWeb/ | [176]

Network-based integration
NaviCom: a web application to create interactive molecular network portraits using multilevel omics data | 2017 | https://navicom.curie.fr/bridge.php | [177]
KeyPathwayMinerWeb: online multi-omics network enrichment | 2016 | https://keypathwayminer.compbio.sdu.dk/keypathwayminer/ | [178]
Visual Omics Explorer (VOE): a cross-platform portal for interactive data visualization | 2016 | http://bcil.github.io/VOE/ | [179]
ModuleAlign: module-based global alignment of PPI networks | 2016 | http://ttic.uchicago.edu/~hashemifar/ModuleAlign.html | [180]
Fuse: multiple network alignment via data fusion | 2016 | http://www0.cs.ucl.ac.uk/staff/natasa/FUSE/index.html | [181]
The SMAL web server: global multiple network alignment from pairwise alignments | 2016 | http://haddock6.sfsu.edu/smal/ | [182]
Mergeomics: a web server for identifying pathological pathways, networks and key regulators via multidimensional data integration | 2016 | http://mergeomics.research.idre.ucla.edu/ | [183]
MAGNA++: Maximizing accuracy in global network alignment via both node and edge conservation | 2015 | http://www3.nd.edu/~cone/MAGNA++/ | [184]
ZoomOut: analysing multiple networks as single nodes | 2015 | http://bioserver-3.bioacademy.gr/Bioserver/ZoomOut/ | [185]
RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse | 2015 | http://www.regnetworkweb.org/ | [186]

Systems biology and modelling
FAIRDOMHub: a repository and collaboration environment for sharing systems biology research | 2017 | https://fair-dom.org/publication/fairdomhub-a-repository-and-collaboration-environment-for-sharing-systems-biology-research/ | [187]
The systems biology format converter | 2016 | https://www.ebi.ac.uk/biomodels/tools/converters/ | [188]
SBtab: a flexible table format for data exchange in systems biology | 2016 | https://www.sbtab.net/ |
PeTTSy: a computational tool for perturbation analysis of complex systems biology models | 2016 | http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/ | [189]
AMIGO2: a toolbox for dynamic modelling, optimization and control in systems biology | 2016 | https://sites.google.com/site/amigo2toolbox/ | [190]
ComPPI: a cellular compartment-specific database for PPI network analysis | 2015 | http://comppi.linkgroup.hu/ | [191]
JSBML 1.0: providing a smorgasbord of options to encode systems biology models | 2015 | http://sbml.org/Software/JSBML | [192]
MpTheory Java library: a multi-platform Java library for systems biology based on the Metabolic P theory | 2015 | http://mptheory.scienze.univr.it/ | [193]
SYSBIONS: nested sampling for systems biology | 2015 | http://www.theosysbio.bio.ic.ac.uk/resources/sysbions/ | [194]
Dizzy-Beats: a Bayesian evidence analysis tool for systems biology | 2015 | https://sourceforge.net/p/bayesevidence/home/Home/ | [195]
Network-based therapeutics
TCM-Mesh: The database and analytical system for network pharmacology analysis for TCM preparations | 2017 | http://mesh.tcm.microbioinformatics.org/ | [123]
SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning | 2017 | The program is available on request | [124]
A protein network descriptor server and its use in studying protein, disease, metabolic and drug-targeted networks | 2016 | http://bidd2.nus.edu.sg/cgi-bin/profeat2016/main.cgi | [128]
systemsDock: a web server for network pharmacology-based prediction and analysis | 2016 | http://systemsdock.unit.oist.jp/iddp/home/index | [129]
BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology | 2016 | https://www.bindingdb.org/bind/index.jsp | [130]
NFFinder: an online bioinformatics tool for searching similar transcriptomics experiments in the context of drug repositioning | 2015 | http://nffinder.cnb.csic.es/ | [131]
NutriChem: a systems chemical biology resource to explore the medicinal value of plant-based foods | 2015 | http://sbb.hku.hk/services/NutriChem-2.0/FoodDisease.php | [132]
TIMMA-R: an R package for predicting synergistic multi-targeted drug combinations in cancer cell lines or patient-derived samples | 2015 | https://cran.r-project.org/web/packages/timma/ | [133]

Network-based diagnostics
GWAB: a web server for the network-based boosting of human genome-wide association data | 2017 | http://www.inetbio.org/gwab/ | [113]
Netter: re-ranking gene network inference predictions using structural network properties | 2016 | https://github.com/JRuyssinck/netter | [134]
MetaNetVar: Pipeline for applying network analysis tools for genomic variants analysis | 2016 | https://github.com/NCBI-Hackathons/Network_SNPs | [135]
GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets | 2016 | http://www.integrativegenomics.org/ | [136]
NetDecoder: a network biology platform that decodes context-specific biological networks and gene activities | 2016 | http://netdecoder.hms.harvard.edu/ | [137]
MUFFINN: cancer gene discovery via network analysis of somatic mutation data | 2016 | http://www.inetbio.org/muffinn/ | [138]
HitWalker2: visual analytics for precision medicine and beyond | 2016 | https://github.com/biodev/HitWalker2 | [139]
NCG 5.0: updates of a manually curated repository of cancer genes and associated properties from cancer mutational screenings | 2016 | http://ncg.kcl.ac.uk/ | [140]
dbSNO 2.0: a resource for exploring structural environment, functional and disease association and regulatory network of protein S-nitrosylation | 2015 | http://140.138.144.145/∼dbSNO/index.php | [141]
Causal biological network database: a comprehensive platform of causal biological network models focused on the pulmonary and vascular systems | 2015 | http://causalbionet.com/ | [142]

Network reconstruction-visualization-analysis
MotifNet: a web-server for network motif analysis | 2017 | http://netbio.bgu.ac.il/motifnet/ | [143]
cMapper: gene-centric connectivity mapper for EBI-RDF platform | 2017 | http://cmapper.ewostech.net/ | [144]
BRANE Clust: Cluster-assisted gene regulatory network inference refinement | 2017 | http://www-syscom.univ-mlv.fr/∼pirayre/Codes-GRN-BRANE-clust.html | [145]
vcfr: a package to manipulate and visualize variant call format data in R | 2017 | https://cran.r-project.org/web/packages/vcfR/index.html | [146]
shinyheatmap: Ultra-fast low-memory heatmap web interface for big data genomics | 2017 | http://shinyheatmap.com/ | [147]
PROXiMATE: a database of mutant protein–protein complex thermodynamics and kinetics | 2017 | http://www.iitm.ac.in/bioinfo/PROXiMATE/ | [148]
Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks | 2017 | https://github.com/ibalaur/MetabolicFramework | [149]
RAIN: RNA–protein association and interaction networks | 2017 | http://rth.dk/resources/rain/ | [150]
Phenopolis: an open platform for harmonization and analysis of genetic and phenotypic data | 2017 | https://uclex.cs.ucl.ac.uk/ | [151]
Pheno4J: a gene to phenotype graph database | 2017 | https://github.com/phenopolis/pheno4j | [152]
SigMod: an exact and efficient method to identify a strongly interconnected disease-associated module in a gene network | 2017 | https://github.com/YuanlongLiu/SigMod | [153]
iRegNet3D: three-dimensional integrated regulatory network for the genomic analysis of coding and non-coding disease mutations | 2017 | http://iregnet3d.yulab.org/index/ | [154]
JDINAC: joint density-based non-parametric differential interaction network analysis and classification using high-dimensional sparse omics data | 2017 | https://github.com/jijiadong/JDINAC | [155]
SmartR: An open-source platform for interactive visual analytics for translational research data | 2017 | https://github.com/transmart/SmartR | [156]
D-Map: random walking on gene network inference maps towards differential avenue discovery | 2017 | http://bioserver-3.bioacademy.gr/Bioserver/DMap/index.php | [157]
TRaCE+: Ensemble inference of gene regulatory networks from transcriptional expression profiles of gene knock-out experiments | 2016 | http://www.cabsel.ethz.ch/tools/trace.html | [158]
The Network Library: a framework to rapidly integrate network biology resources | 2016 | https://github.com/gsummer |
Web-based network analysis and visualization using CellMaps | 2016 | http://cellmaps.babelomics.org/ | [159]
PathwAX: a web server for network crosstalk based pathway annotation | 2016 | http://pathwax.sbc.su.se/ | [160]
Pathway Tools version 19.0 update: software for pathway/genome informatics and systems biology | 2016 | http://brg.ai.sri.com/ptools/ | [161]
NAPS: Network analysis of protein structures | 2016 | http://bioinf.iiit.ac.in/NAPS/ | [162]
UbiNet: an online resource for exploring the functional associations and regulatory networks of protein ubiquitylation | 2016 | http://140.138.144.145/∼ubinet/index.php | [163]
MET network in PubMed: a text-mined network visualization and curation system | 2016 | http://btm.tmu.edu.tw/metastasisway | [164]
QuIN: a web server for querying and visualizing chromatin interaction networks | 2016 | https://quin.jax.org/ | [165]
NET-GE: a web server for NETwork-based human gene enrichment | 2016 | http://net-ge.biocomp.unibo.it/enrich | [166]
IIIDB: a database for isoform–isoform interactions and isoform network modules | 2015 | http://syslab.nchu.edu.tw/IIIDB/ | [167]
cyNeo4j: connecting Neo4j and Cytoscape | 2015 | http://apps.cytoscape.org/apps/cyneo4j | [168]
BRANE Cut: biologically related a priori network enhancement with graph cuts for gene regulatory network inference | 2015 | http://www-syscom.univ-mlv.fr/∼pirayre/Codes-GRN-BRANE-cut.html | [169]
NetExplore: a web server for modelling small network motifs | 2015 | http://line.bioinfolab.net/nex/NetExplore.htm | [170]
COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems | 2015 | http://coxpresdb.jp/ | [171]
NAIL: a software toolset for inferring, analysing and visualizing regulatory networks | 2015 | https://sourceforge.net/projects/nailsystemsbiology/ | [172]
LncReg: a reference resource for lncRNA-associated regulatory networks | 2015 | http://bioinformatics.ustc.edu.cn/lncreg/ | [173]
TeloPIN: a database of telomeric proteins interaction network in mammalian cells | 2015 | http://songyanglab.sysu.edu.cn/telopin/ | [174]
MIsoMine: a genome-scale high-resolution data portal of expression, function and networks at the splice isoform level in the mouse | 2015 | http://guanlab.ccmb.med.umich.edu/misomine/ | [175]
CerebralWeb: a Cytoscape.js plug-in to visualize networks stratified by subcellular localization | 2015 | http://www.innatedb.ca/CerebralWeb/ | [176]

Network-based integration
NaviCom: a web application to create interactive molecular network portraits using multilevel omics data | 2017 | https://navicom.curie.fr/bridge.php | [177]
KeyPathwayMinerWeb: online multi-omics network enrichment | 2016 | https://keypathwayminer.compbio.sdu.dk/keypathwayminer/ | [178]
Visual Omics Explorer (VOE): a cross-platform portal for interactive data visualization | 2016 | http://bcil.github.io/VOE/ | [179]
ModuleAlign: module-based global alignment of PPI networks | 2016 | http://ttic.uchicago.edu/∼hashemifar/ModuleAlign.html | [180]
Fuse: multiple network alignment via data fusion | 2016 | http://www0.cs.ucl.ac.uk/staff/natasa/FUSE/index.html | [181]
The SMAL web server: global multiple network alignment from pairwise alignments | 2016 | http://haddock6.sfsu.edu/smal/ | [182]
Mergeomics: a web server for identifying pathological pathways, networks and key regulators via multidimensional data integration | 2016 | http://mergeomics.research.idre.ucla.edu/ | [183]
MAGNA ++: Maximizing accuracy in global network alignment via both node and edge conservation | 2015 | http://www3.nd.edu/∼cone/MAGNA±+/ | [184]
ZoomOut: analysing multiple networks as single nodes | 2015 | http://bioserver-3.bioacademy.gr/Bioserver/ZoomOut/ | [185]
RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse | 2015 | http://www.regnetworkweb.org/ | [186]

Systems biology and modelling
FAIRDOMHub: a repository and collaboration environment for sharing systems biology research | 2017 | https://fair-dom.org/publication/fairdomhub-a-repository-and-collaboration-environment-for-sharing-systems-biology-research/ | [187]
The systems biology format converter | 2016 | https://www.ebi.ac.uk/biomodels/tools/converters/ | [188]
SBtab: a flexible table format for data exchange in systems biology | 2016 | https://www.sbtab.net/ |
PeTTSy: a computational tool for perturbation analysis of complex systems biology models | 2016 | http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/ | [189]
AMIGO2: a toolbox for dynamic modelling, optimization and control in systems biology | 2016 | https://sites.google.com/site/amigo2toolbox/ | [190]
ComPPI: a cellular compartment-specific database for PPI network analysis | 2015 | http://comppi.linkgroup.hu/ | [191]
JSBML 1.0: providing a smorgasbord of options to encode systems biology models | 2015 | http://sbml.org/Software/JSBML | [192]
MpTheory Java library: a multi-platform Java library for systems biology based on the Metabolic P theory | 2015 | http://mptheory.scienze.univr.it/ | [193]
SYSBIONS: nested sampling for systems biology | 2015 | http://www.theosysbio.bio.ic.ac.uk/resources/sysbions/ | [194]
Dizzy-Beats: a Bayesian evidence analysis tool for systems biology | 2015 | https://sourceforge.net/p/bayesevidence/home/Home/ | [195]

Discussion
The concept of utilizing networks to visualize the complex interaction of mechanisms implicated in disease has been around for several years. However, two important breakthroughs distinguish current work from previous network-based approaches and are driving the state of the art in Systems Bioinformatics: (i) construction of multiple networks representing each level of the omics spectrum and the integration of these in a layered network that exchanges information within and between layers [62, 67, 97] and (ii) the advent of novel techniques and methodologies for analysing and understanding these networks using mathematical algorithms and approaches derived from graph theory and information theory [6].
Using these methods to extract biologically meaningful information from multiple levels of the omics spectrum can provide the integrated systemic knowledge needed to develop a comprehensive Human System profile, which increases diagnostic accuracy and concurrently enables novel therapeutic advances and the assessment of response to therapy. Nevertheless, network-based approaches, whether for evidence-based or for statistically inferred molecular networks, have a number of limitations. Specifically, networks based on experimental evidence are not complete, as experiments are only a snapshot of the real biological world. Moreover, statistically inferred networks represent an underdetermined computational problem because the number of inferred relationships is much larger than the number of independent measurements [196]. Owing to the lack of sufficient ground truth to validate the reconstructed molecular networks, special attention must be given when choosing benchmark data sets (e.g. existing curated databases with experimental information, medium-throughput experimental information and simulated data sets that mimic real data). To this end, the DREAM (Dialogue on Reverse Engineering Assessments and Methods) initiative helps researchers in the Systems Bioinformatics field assess the validity of the networks they are using and proceed with optimization and parameter-tuning for network reconstruction [197]. A further limitation of network-based approaches is the low overlap among the networks produced by different reconstruction approaches, and the difficulty of selecting the appropriate network in each case. It is likely that computational approaches that check and exploit complementarities and provide ensemble solutions in the form of a network construction consensus will maximize the information content.
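As an illustration of the ensemble idea mentioned above, the following is a minimal Python sketch of one simple consensus scheme: an undirected edge is retained only if at least a chosen number of reconstruction methods report it. The function name consensus_network, the min_support threshold and the toy edge lists are assumptions made for illustration; this plain voting rule is not the procedure of DREAM or of any tool listed in Table 1.

```python
from collections import defaultdict


def consensus_network(networks, min_support=2):
    """Return undirected edges reported by at least `min_support` networks.

    `networks` is a list of edge lists, one per reconstruction method.
    The result maps each retained edge (a sorted node pair) to its vote count.
    """
    votes = defaultdict(int)
    for edges in networks:
        # Treat (a, b) and (b, a) as the same edge, and deduplicate within a
        # single network so that each method votes at most once per edge.
        for edge in {frozenset(e) for e in edges}:
            if len(edge) == 2:  # ignore self-loops
                votes[edge] += 1
    return {
        tuple(sorted(edge)): count
        for edge, count in votes.items()
        if count >= min_support
    }


# Hypothetical toy output of three reconstruction methods (illustrative only).
method_a = [("TP53", "MDM2"), ("BRCA1", "BARD1"), ("MYC", "MAX")]
method_b = [("MDM2", "TP53"), ("MYC", "MAX"), ("EGFR", "GRB2")]
method_c = [("TP53", "MDM2"), ("EGFR", "GRB2"), ("BRCA1", "TP53")]

consensus = consensus_network([method_a, method_b, method_c], min_support=2)
for edge, count in sorted(consensus.items()):
    print(edge, count)
```

Raising min_support trades coverage for confidence: with the toy data, a threshold of 3 would keep only the single edge that all three methods agree on. Weighted or rank-based aggregation would be a natural refinement when methods report edge scores rather than plain edge lists.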
In our opinion, Systems Bioinformatics might currently appear rather aspirational; yet, considering its potential, it is likely to have a major impact on medicine and pharmacology in the next decade. The field of medicine is expected to benefit from the knowledge attained through Systems Bioinformatics methodologies. The molecular basis of complex, polygenic diseases is highly heterogeneous and affected by multiple factors simultaneously. These include genetic predisposition, multipart molecular mechanisms and effects of the environment, diet, drug administration/response and numerous other factors. Although it might not be possible to replace traditional approaches to therapeutics and diagnosis with computational methods, it is likely that Systems Bioinformatics will provide revolutionary approaches and tools that help clinicians demystify the complex nature of these diseases. Computational diagnostics and therapeutics, enhanced by Systems Bioinformatics approaches, will not only aid clinicians in patient consultation and care but will also catalyse significant breakthroughs in prognostic measures, detection of disease at an early stage and overall disease prevention.

Key Points
Systems Bioinformatics is an emerging field, which integrates information across different levels by combining the systems biology bottom-up approach with a data-driven top-down approach as in classical bioinformatics.
The advent of omics technologies has provided the stepping-stone for the emergence of Systems Bioinformatics as a holistic and systems approach in investigating complex biological systems.
The key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers.
The network approach in Systems Bioinformatics comes with limitations including the lack of ground truth for the constructed biological network.
Systems Bioinformatics methods can enhance computational therapeutics and diagnostics, hence paving the way to precision medicine.

Funding
Anastasis Oulas, George Minadakis, Margarita Zachariou, Kleitos Sokratous and George M. Spyrou are funded by the European Commission Research Executive Agency Grant BIORISE (No. 669026), under the Spreading Excellence, Widening Participation, Science with and for Society Framework. This work was partly supported by H2020-WIDESPREAD-04-2017-Teaming Phase 1, Grant Agreement 763781, Integrated Precision Medicine Technologies.

Anastasis Oulas is a Postdoctoral Research Fellow at the Bioinformatics ERA Chair and Bioinformatics Group of the Cyprus Institute of Neurology and Genetics.
George Minadakis is a Postdoctoral Research Fellow at the Bioinformatics ERA Chair and Bioinformatics Group of the Cyprus Institute of Neurology and Genetics.
Margarita Zachariou is a Postdoctoral Research Fellow at the Bioinformatics ERA Chair and Bioinformatics Group of the Cyprus Institute of Neurology and Genetics.
Kleitos Sokratous is a Postdoctoral Research Fellow at the Bioinformatics ERA Chair and Bioinformatics Group of the Cyprus Institute of Neurology and Genetics.
Marilena M. Bourdakou is a Visiting Scientist at the Bioinformatics ERA Chair and Bioinformatics Group of the Cyprus Institute of Neurology and Genetics.
George M. Spyrou holds the Bioinformatics ERA Chair and is the Head of the Bioinformatics Group at the Cyprus Institute of Neurology and Genetics.

References
1 Hood L, Friend SH. Predictive, personalized, preventive, participatory (P4) cancer medicine. Nat Rev Clin Oncol 2011;8(3):184–7. http://dx.doi.org/10.1038/nrclinonc.2010.227
2 Tian Q, Price ND, Hood L. Systems cancer medicine: towards realization of predictive, preventive, personalized and participatory (P4) medicine. J Intern Med 2012;271(2):111–21. http://dx.doi.org/10.1111/j.1365-2796.2011.02498.x
3 Gatherer D. So what do we really mean when we say that systems biology is holistic? BMC Syst Biol 2010;4:22.
4 Berlin R, Gruen R, Best J. Systems medicine-complexity within, simplicity without. J Healthc Inform Res 2017;1(1):119–37. http://dx.doi.org/10.1007/s41666-017-0002-9
5 Emmert-Streib F, Dehmer M. Networks for systems biology: conceptual connection of data and function. IET Syst Biol 2011;5(3):185–207. http://dx.doi.org/10.1049/iet-syb.2010.0025
6 Najafi A, Bidkhori G, Bozorgmehr JH, et al. Genome scale modeling in systems biology: algorithms and resources. Curr Genomics 2014;15(2):130–59. http://dx.doi.org/10.2174/1389202915666140319002221
7 Barabasi AL, Albert R. Emergence of scaling in random networks. Science 1999;286(5439):509–12. http://dx.doi.org/10.1126/science.286.5439.509
8 Barabasi AL, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet 2011;12(1):56–68. http://dx.doi.org/10.1038/nrg2918
9 Yu H, Kim PM, Sprecher E, et al. The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol 2007;3(4):e59.
10 Blain-Moraes S, Tarnal V, Vanini G, et al. Network efficiency and posterior alpha patterns are markers of recovery from general anesthesia: a high-density electroencephalography study in healthy volunteers. Front Hum Neurosci 2017;11:328. http://dx.doi.org/10.3389/fnhum.2017.00328
11 Farrar DC, Mian AZ, Budson AE, et al. Retained executive abilities in mild cognitive impairment are associated with increased white matter network connectivity. Eur Radiol 2017, doi: 10.1007/s00330-017-4951-4.
12 Gao ZK, Dang WD, Li S, et al. PageRank versatility analysis of multilayer modality-based network for exploring the evolution of oil-water slug flow. Sci Rep 2017;7(1):5493. http://dx.doi.org/10.1038/s41598-017-05890-0
13 Barabasi AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet 2004;5(2):101–13. http://dx.doi.org/10.1038/nrg1272
14 Rual JF, Venkatesan K, Hao T, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005;437(7062):1173–8. http://dx.doi.org/10.1038/nature04209
15 Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 2005;120(1):15–20. http://dx.doi.org/10.1016/j.cell.2004.12.035
16 Carninci P, Kasukawa T, Katayama S, et al. The transcriptional landscape of the mammalian genome. Science 2005;309(5740):1559–63. http://dx.doi.org/10.1126/science.1112014
17 Jeong H, Tombor B, Albert R, et al. The large-scale organization of metabolic networks. Nature 2000;407(6804):651–4. http://dx.doi.org/10.1038/35036627
18 Rolland T, Taşan M, Charloteaux B, et al. A proteome-scale map of the human interactome network. Cell 2014;159(5):1212–26.
19 Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38:D690–8.
20 Liu X, Yu X, Zack DJ, et al. TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics 2008;9:271. http://dx.doi.org/10.1186/1471-2105-9-271
21 Harris MA, Clark J, Ireland A, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004;32:D258–61.
22 Croft D, Mundo AF, Haw R, et al. The Reactome pathway knowledgebase. Nucleic Acids Res 2014;42:D472–7.
23 Fabregat A, Sidiropoulos K, Garapati P, et al. The Reactome pathway Knowledgebase. Nucleic Acids Res 2016;44(D1):D481–7.
24 Raychaudhuri S, Plenge R, Rossin E, et al. Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions. PLoS Genet 2009;5:e100534.
25 Siegenthaler C, Gunawan R. Assessment of network inference methods: how to cope with an underdetermined problem. PLoS One 2014;9(3):e90481.
26 Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.
27 Peri S, Navarro JD, Kristiansen TZ, et al. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res 2004;32(90001):D497–501.
28 Bader GD, Betel D, Hogue CWV. BIND: the biomolecular interaction network database. Nucleic Acids Res 2003;31(1):248–50. http://dx.doi.org/10.1093/nar/gkg056
29 Chatr-aryamontri A, Ceol A, Palazzi LM, et al. MINT: the molecular INTeraction database. Nucleic Acids Res 2007;35:D572–4.
30 Stark C, Breitkreutz B-J, Reguly T, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006;34(90001):D535–9.
31 Casas C, Isus L, Herrando-Grabulosa M, et al. Network-based proteomic approaches reveal the neurodegenerative, neuroprotective and pain-related mechanisms involved after retrograde axonal damage. Sci Rep 2015;5:9185. http://dx.doi.org/10.1038/srep09185
32 Severin J, Waterhouse AM, Kawaji H, et al. FANTOM4 EdgeExpressDB: an integrated database of promoters, genes, microRNAs, expression dynamics and regulatory interactions. Genome Biol 2009;10(4):R39.
33 Jiang C, Xuan Z, Zhao F, et al. TRED: a transcriptional regulatory element database, new entries and other development. Nucleic Acids Res 2007;35:D137–40.
34 Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005;102(43):15545–50. http://dx.doi.org/10.1073/pnas.0506580102
35 Feist P, Hummon A. Proteomic challenges: sample preparation techniques for microgram-quantity protein analysis from biological samples. Int J Mol Sci 2015;16(2):3537. http://dx.doi.org/10.3390/ijms16023537
36 Ekins S, Nikolsky Y, Bugrim A, et al. Pathway mapping tools for analysis of high content data. In: Taylor DL, Haskins JR, Giuliano KA (eds). High Content Screening: A Powerful Approach to Systems Cell Biology and Drug Discovery. Totowa, NJ: Humana Press, 2006, 319–50.
37 Kanehisa M, Goto S, Sato Y, et al. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 2012;40:D109–14.
38 Bourdakou MM, Athanasiadis EI, Spyrou GM. Discovering gene re-ranking efficiency and conserved gene-gene relationships derived from gene co-expression network analysis on breast cancer data. Sci Rep 2016;6(1):20518. http://dx.doi.org/10.1038/srep20518
39 Marbach D, Costello JC, Kuffner R, et al. Wisdom of crowds for robust gene network inference. Nat Methods 2012;9(8):796–804. http://dx.doi.org/10.1038/nmeth.2016
40 Beltrao P, Cagney G, Krogan NJ. Quantitative genetic interactions reveal biological modularity. Cell 2010;141(5):739–45. http://dx.doi.org/10.1016/j.cell.2010.05.019
41 Boone C, Bussey H, Andrews BJ. Exploring genetic interactions and networks with yeast. Nat Rev Genet 2007;8(6):437–49. http://dx.doi.org/10.1038/nrg2085
42 Stuart JM, Segal E, Koller D, et al. A gene-coexpression network for global discovery of conserved genetic modules. Science 2003;302(5643):249–55.
http://dx.doi.org/10.1126/science.1087447 Google Scholar Crossref Search ADS PubMed WorldCat 43 Chatr-Aryamontri A , Oughtred R, Boucher L, et al. The BioGRID interaction database: 2017 update . Nucleic Acids Res 2017 ; 45 ( D1 ): D369 – 79 . Google Scholar Crossref Search ADS PubMed WorldCat 44 Kraskov A , Stogbauer H, Grassberger P. Estimating mutual information . Phys Rev E Stat Nonlin Soft Matter Phys 2004 ; 69 ( 6 Pt 2 ): 066138 . Google Scholar Crossref Search ADS PubMed WorldCat 45 Margolin AA , Nemenman I, Basso K, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context . BMC Bioinformatics 2006 ; 7 (Suppl 1): S7 . Google Scholar Crossref Search ADS PubMed WorldCat 46 Faith JJ , Hayete B, Thaden JT, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles . PLoS Biol 2007 ; 5 ( 1 ): e8 . Google Scholar Crossref Search ADS PubMed WorldCat 47 Meyer PE , Kontos K, Lafitte F, et al. Information-theoretic inference of large transcriptional regulatory networks . EURASIP J Bioinform Syst Biol 2007 ; 2007 : 79879 . Google Scholar Crossref Search ADS WorldCat 48 Altay G , Emmert-Streib F. Inferring the conservative causal core of gene regulatory networks . BMC Syst Biol 2010 ; 4 : 132 . http://dx.doi.org/10.1186/1752-0509-4-132 Google Scholar Crossref Search ADS PubMed WorldCat 49 Opgen-Rhein R , Strimmer K. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data . BMC Syst Biol 2007 ; 1 ( 1 ): 37 . http://dx.doi.org/10.1186/1752-0509-1-37 Google Scholar Crossref Search ADS PubMed WorldCat 50 Langfelder P , Horvath S. WGCNA: an R package for weighted correlation network analysis . BMC Bioinformatics 2008 ; 9 : 559 . http://dx.doi.org/10.1186/1471-2105-9-559 Google Scholar Crossref Search ADS PubMed WorldCat 51 Friedman J , Hastie T, Tibshirani R. 
Sparse inverse covariance estimation with the graphical lasso . Biostatistics 2008 ; 9 ( 3 ): 432 – 41 . http://dx.doi.org/10.1093/biostatistics/kxm045 Google Scholar Crossref Search ADS PubMed WorldCat 52 Kramer N , Schafer J, Boulesteix AL. Regularized estimation of large-scale gene association networks using graphical Gaussian models . BMC Bioinformatics 2009 ; 10 : 384 . http://dx.doi.org/10.1186/1471-2105-10-384 Google Scholar Crossref Search ADS PubMed WorldCat 53 Huynh-Thu VA , Irrthum A, Wehenkel L, et al. Inferring regulatory networks from expression data using tree-based methods . PLoS One 2010 ; 5 ( 9 ): e12776 . Google Scholar Crossref Search ADS PubMed WorldCat 54 Goh KI , Choi IG. Exploring the human diseasome: the human disease network . Brief Funct Genomics 2012 ; 11 ( 6 ): 533 – 42 . http://dx.doi.org/10.1093/bfgp/els032 Google Scholar Crossref Search ADS PubMed WorldCat 55 Goh KI , Cusick ME, Valle D, et al. The human disease network . Proc Natl Acad Sci USA 2007 ; 104 ( 21 ): 8685 – 90 . http://dx.doi.org/10.1073/pnas.0701361104 Google Scholar Crossref Search ADS PubMed WorldCat 56 Mitra K , Carvunis AR, Ramesh SK, et al. Integrative approaches for finding modular structure in biological networks . Nat Rev Genet 2013 ; 14 ( 10 ): 719 – 32 . http://dx.doi.org/10.1038/nrg3552 Google Scholar Crossref Search ADS PubMed WorldCat 57 Loscalzo J , Barabasi AL. Systems biology and the future of medicine . Wiley Interdiscip Rev Syst Biol Med 2011 ; 3 ( 6 ): 619 – 27 . http://dx.doi.org/10.1002/wsbm.144 Google Scholar Crossref Search ADS PubMed WorldCat 58 Hart T , Dider S, Han W, et al. Toward repurposing metformin as a precision anti-cancer therapy using structural systems pharmacology . Sci Rep 2016 ; 6 ( 1 ): 20441 . http://dx.doi.org/10.1038/srep20441 Google Scholar Crossref Search ADS PubMed WorldCat 59 Kim BY , Song KH, Lim CY, et al. 
Therapeutic properties of Scutellaria baicalensis in db/db mice evaluated using connectivity map and network pharmacology . Sci Rep 2017 ; 7 : 41711 . http://dx.doi.org/10.1038/srep41711 Google Scholar Crossref Search ADS PubMed WorldCat 60 Moran LB , Graeber MB. Towards a pathway definition of Parkinson's disease: a complex disorder with links to cancer, diabetes and inflammation . Neurogenetics 2008 ; 9 ( 1 ): 1 – 13 . http://dx.doi.org/10.1007/s10048-007-0116-y Google Scholar Crossref Search ADS PubMed WorldCat 61 Mukhopadhyay S , Saha R, Palanisamy A, et al. A systems biology pipeline identifies new immune and disease related molecular signatures and networks in human cells during microgravity exposure . Sci Rep 2016 ; 6 ( 1 ): 25975 . http://dx.doi.org/10.1038/srep25975 Google Scholar Crossref Search ADS PubMed WorldCat 62 Jamal S , Goyal S, Shanker A, et al. Integrating network, sequence and functional features using machine learning approaches towards identification of novel Alzheimer genes . BMC Genomics 2016 ; 17 ( 1 ): 807 . http://dx.doi.org/10.1186/s12864-016-3108-1 Google Scholar Crossref Search ADS PubMed WorldCat 63 Ray M , Ruan J, Zhang W. Variations in the transcriptome of Alzheimer's disease reveal molecular networks involved in cardiovascular diseases . Genome Biol 2008 ; 9 ( 10 ): R148 . Google Scholar Crossref Search ADS PubMed WorldCat 64 Guney E , Menche J, Vidal M, et al. Network-based in silico drug efficacy screening . Nat Commun 2016 ; 7 : 10331 . http://dx.doi.org/10.1038/ncomms10331 Google Scholar Crossref Search ADS PubMed WorldCat 65 Breuer K , Foroushani AK, Laird MR, et al. InnateDB: systems biology of innate immunity and beyond–recent updates and continuing curation . Nucleic Acids Res 2013 ; 41 ( D1 ): D1228 – 33 . Google Scholar Crossref Search ADS PubMed WorldCat 66 Hwang S , Son SW, Kim SC, et al. A protein interaction network associated with asthma . J Theor Biol 2008 ; 252 ( 4 ): 722 – 31 . 
http://dx.doi.org/10.1016/j.jtbi.2008.02.011 Google Scholar Crossref Search ADS PubMed WorldCat 67 Menche J , Guney E, Sharma A, et al. Integrating personalized gene expression profiles into predictive disease-associated gene pools . NPJ Syst Biol Appl 2017 ; 3 ( 1 ): 10 . http://dx.doi.org/10.1038/s41540-017-0009-0 Google Scholar Crossref Search ADS PubMed WorldCat 68 Cheng F , Hong H, Yang S, et al. Individualized network-based drug repositioning infrastructure for precision oncology in the panomics era . Brief Bioinform 2017 ; 18 : 682 – 97 . Google Scholar PubMed OpenURL Placeholder Text WorldCat 69 Lee E , Jung H, Radivojac P, et al. Analysis of AML genes in dysregulated molecular networks . BMC Bioinformatics 2009 ; 10 (Suppl 9): S2 . Google Scholar Crossref Search ADS WorldCat 70 Lim J , Hao T, Shaw C, et al. A protein-protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration . Cell 2006 ; 125 ( 4 ): 801 – 14 . http://dx.doi.org/10.1016/j.cell.2006.03.032 Google Scholar Crossref Search ADS PubMed WorldCat 71 Goehler H , Lalowski M, Stelzl U, et al. A protein interaction network links GIT1, an enhancer of huntingtin aggregation, to Huntington's disease . Mol Cell 2004 ; 15 ( 6 ): 853 – 65 . http://dx.doi.org/10.1016/j.molcel.2004.09.016 Google Scholar Crossref Search ADS PubMed WorldCat 72 Camargo LM , Collura V, Rain JC, et al. Disrupted in Schizophrenia 1 Interactome: evidence for the close connectivity of risk genes and a potential synaptic basis for schizophrenia . Mol Psychiatry 2007 ; 12 ( 1 ): 74 – 86 . http://dx.doi.org/10.1038/sj.mp.4001880 Google Scholar Crossref Search ADS PubMed WorldCat 73 Simoes SN , Martins DC Jr, Pereira CA, et al. NERI: network-medicine based integrative approach for disease gene prioritization by relative importance . BMC Bioinformatics 2015 ; 16 (Suppl 19): S9 . Google Scholar Crossref Search ADS PubMed WorldCat 74 Wang H , Guo W, Liu F, et al. 
Patients with first-episode, drug-naive schizophrenia and subjects at ultra-high risk of psychosis shared increased cerebellar-default mode network connectivity at rest . Sci Rep 2016 ; 6 ( 1 ): 26124 . http://dx.doi.org/10.1038/srep26124 Google Scholar Crossref Search ADS PubMed WorldCat 75 Kitsak M , Sharma A, Menche J, et al. Tissue specificity of human disease module . Sci Rep 2016 ; 6 : 35241 . http://dx.doi.org/10.1038/srep35241 Google Scholar Crossref Search ADS PubMed WorldCat 76 Wang W-X , Ni X, Lai Y-C, et al. Optimizing controllability of complex networks by minimum structural perturbations . Phys Rev E 2012 ; 85 ( 2 ): 026115 . http://dx.doi.org/10.1103/PhysRevE.85.026115 Google Scholar Crossref Search ADS WorldCat 77 Albert R , Jeong H, Barabasi AL. Error and attack tolerance of complex networks . Nature 2000 ; 406 ( 6794 ): 378 – 82 . http://dx.doi.org/10.1038/35019019 Google Scholar Crossref Search ADS PubMed WorldCat 78 Lu ZM , Li XF. Attack vulnerability of network controllability . PLoS One 2016 ; 11 ( 9 ): e0162289 . Google Scholar Crossref Search ADS PubMed WorldCat 79 Dong G , Gao J, Du R, et al. Robustness of network of networks under targeted attack . Phys Rev E Stat Nonlin Soft Matter Phys 2013 ; 87 ( 5 ): 052804 . http://dx.doi.org/10.1103/PhysRevE.87.052804 Google Scholar Crossref Search ADS PubMed WorldCat 80 Samay N , Morrison S, Husseini A. Network Science Communities . Cambridge : Cambridge University Press , 2014 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC 81 Pósfai M , Liu Y-Y, Slotine J-J, et al. Effect of correlations on network controllability . Sci Rep 2013 ; 3 ( 1 ):. Google Scholar OpenURL Placeholder Text WorldCat 82 Liu Y-Y , Slotine J-J, Barabási A-L. Control centrality and hierarchical structure in complex networks . PLoS One 2012 ; 7 ( 9 ): e44459 . Google Scholar Crossref Search ADS PubMed WorldCat 83 Motter AE , Lai YC. Cascade-based attacks on complex networks . 
Phys Rev E Stat Nonlin Soft Matter Phys 2002 ; 66 ( 6 Pt 2 ): 065102 . Google Scholar PubMed OpenURL Placeholder Text WorldCat 84 Holme P , Kim BJ, Yoon CN, et al. Attack vulnerability of complex networks . Phys Rev E 2002 ; 65 ( 5 Pt 2 ): 056109 . Google Scholar Crossref Search ADS WorldCat 85 Wang XF , Chen G. Synchronization in scale-free dynamical networks: robustness and fragility . IEEE Trans Circuits Syst I Fundam Theory Appl 2002 ; 49 : 54 – 62 . http://dx.doi.org/10.1109/81.974874 Google Scholar Crossref Search ADS WorldCat 86 Pu C-L , Pei W-J, Michaelson A. Robustness analysis of network controllability . Physica A 2012 ; 391 ( 18 ): 4420 – 5 . http://dx.doi.org/10.1016/j.physa.2012.04.019 Google Scholar Crossref Search ADS WorldCat 87 Han J-DJ , Bertin N, Hao T, et al. Evidence for dynamically organized modularity in the yeast protein–protein interaction network . Nature 2004 ; 430 : 88 – 93 . http://dx.doi.org/10.1038/nature02555 Google Scholar Crossref Search ADS PubMed WorldCat 88 Pequito S , Preciado VM, Barabasi AL, et al. Trade-offs between driving nodes and time-to-control in complex networks . Sci Rep 2017 ; 7 : 39978 . http://dx.doi.org/10.1038/srep39978 Google Scholar Crossref Search ADS PubMed WorldCat 89 Vinayagam A , Gibson TE, Lee HJ, et al. Controllability analysis of the directed human protein interaction network identifies disease genes and drug targets . Proc Natl Acad Sci USA 2016 ; 113 ( 18 ): 4976 – 81 . http://dx.doi.org/10.1073/pnas.1603992113 Google Scholar Crossref Search ADS PubMed WorldCat 90 Menche J , Sharma A, Kitsak M, et al. Disease networks. Uncovering disease-disease relationships through the incomplete interactome . Science 2015 ; 347 ( 6224 ): 1257601 . http://dx.doi.org/10.1126/science.1257601 Google Scholar Crossref Search ADS PubMed WorldCat 91 Ma X , Liu Z, Zhang Z, et al. Multiple network algorithm for epigenetic modules via the integration of genome-wide DNA methylation and gene expression data . 
BMC Bioinformatics 2017 ; 18 ( 1 ): 72 . http://dx.doi.org/10.1186/s12859-017-1490-6 Google Scholar Crossref Search ADS PubMed WorldCat 92 Kitano H. Computational systems biology . Nature 2002 ; 420 ( 6912 ): 206 – 10 . http://dx.doi.org/10.1038/nature01254 Google Scholar Crossref Search ADS PubMed WorldCat 93 Le Novere N. Quantitative and logic modelling of molecular and gene networks . Nat Rev Genet 2015 ; 16 : 146 – 58 . http://dx.doi.org/10.1038/nrg3885 Google Scholar Crossref Search ADS PubMed WorldCat 94 Samaga R , Klamt S. Modeling approaches for qualitative and semi-quantitative analysis of cellular signaling networks . Cell Commun Signal 2013 ; 11 ( 1 ): 43 . http://dx.doi.org/10.1186/1478-811X-11-43 Google Scholar Crossref Search ADS PubMed WorldCat 95 Shibeko AM , Panteleev MA. Untangling the complexity of blood coagulation network: use of computational modelling in pharmacology and diagnostics . Brief Bioinform 2016 ; 17 ( 3 ): 429 – 39 . http://dx.doi.org/10.1093/bib/bbv040 Google Scholar Crossref Search ADS PubMed WorldCat 96 Huang L , Jiang Y, Chen Y. Predicting drug combination index and simulating the network-regulation dynamics by mathematical modeling of drug-targeted EGFR-ERK signaling pathway . Sci Rep 2017 ; 7 : 40752 . http://dx.doi.org/10.1038/srep40752 Google Scholar Crossref Search ADS PubMed WorldCat 97 Ram PT , Mendelsohn J, Mills GB. Bioinformatics and systems biology . Mol Oncol 2012 ; 6 ( 2 ): 147 – 54 . http://dx.doi.org/10.1016/j.molonc.2012.01.008 Google Scholar Crossref Search ADS PubMed WorldCat 98 Vijayakumar S , Conway M, Lio P, et al. Seeing the wood for the trees: a forest of methods for optimization and omic-network integration in metabolic modelling . Brief Bioinform 2017 , doi: 10.1093/bib/bbx053 . Google Scholar OpenURL Placeholder Text WorldCat 99 Shannon P , Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks . Genome Res 2003 ; 13 ( 11 ): 2498 – 504 . 
http://dx.doi.org/10.1101/gr.1239303 Google Scholar Crossref Search ADS PubMed WorldCat 100 McKenna A , Hanna M, Banks E, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data . Genome Res 2010 ; 20 ( 9 ): 1297 – 303 . http://dx.doi.org/10.1101/gr.107524.110 Google Scholar Crossref Search ADS PubMed WorldCat 101 Altschul SF , Gish W, Miller W, et al. Basic local alignment search tool . J Mol Biol 1990 ; 215 ( 3 ): 403 – 10 . http://dx.doi.org/10.1016/S0022-2836(05)80360-2 Google Scholar Crossref Search ADS PubMed WorldCat 102 Peng Y , Leung HC, Yiu SM, et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth . Bioinformatics 2012 ; 28 ( 11 ): 1420 – 8 . http://dx.doi.org/10.1093/bioinformatics/bts174 Google Scholar Crossref Search ADS PubMed WorldCat 103 Ihaka R , Gentleman R. R: a language for data analysis and graphics . J Comput Graph Stat 1996 ; 5 ( 3 ): 299 . Google Scholar OpenURL Placeholder Text WorldCat 104 Zhang F , Ren C, Lau KK, et al. A network medicine approach to build a comprehensive atlas for the prognosis of human cancer . Brief Bioinform 2016 ; 17 : 1044 – 59 . Google Scholar Crossref Search ADS PubMed WorldCat 105 Weinstein JN , Collisson EA, Mills GB, et al. The cancer genome atlas pan-cancer analysis project . Nat Genet 2013 ; 45 ( 10 ): 1113 – 20 . http://dx.doi.org/10.1038/ng.2764 Google Scholar Crossref Search ADS PubMed WorldCat 106 Domenyuk V , Zhong Z, Stark A, et al. Plasma exosome profiling of cancer patients by a next generation systems biology approach . Sci Rep 2017 ; 7 : 42741 . http://dx.doi.org/10.1038/srep42741 Google Scholar Crossref Search ADS PubMed WorldCat 107 Antonopoulou K , Stefanaki I, Lill CM, et al. Updated field synopsis and systematic meta-analyses of genetic association studies in cutaneous melanoma: the MelGene database . J Invest Dermatol 2015 ; 135 ( 4 ): 1074 – 9 . 
http://dx.doi.org/10.1038/jid.2014.491 Google Scholar Crossref Search ADS PubMed WorldCat 108 Shirasaki DI , Greiner ER, Al-Ramahi I, et al. Network organization of the huntingtin proteomic interactome in mammalian brain . Neuron 2012 ; 75 ( 1 ): 41 – 57 . http://dx.doi.org/10.1016/j.neuron.2012.05.024 Google Scholar Crossref Search ADS PubMed WorldCat 109 Huang DW , Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources . Nat Protocols 2008 ; 4 ( 1 ): 44 – 57 . http://dx.doi.org/10.1038/nprot.2008.211 Google Scholar Crossref Search ADS WorldCat 110 Mi H , Lazareva-Ulitsky B, Loo R, et al. The PANTHER database of protein families, subfamilies, functions and pathways . Nucleic Acids Res 2005 ; 33 : D284 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 111 Xu G , Paige JS, Jaffrey SR. Global analysis of lysine ubiquitination by ubiquitin remnant immunoaffinity profiling . Nat Biotech 2010 ; 28 ( 8 ): 868 – 73 . http://dx.doi.org/10.1038/nbt.1654 Google Scholar Crossref Search ADS WorldCat 112 Kim W , Bennett Eric J, Huttlin Edward L, et al. Systematic and quantitative assessment of the ubiquitin-modified proteome . Mol Cell 2011 ; 44 ( 2 ): 325 – 40 . http://dx.doi.org/10.1016/j.molcel.2011.08.025 Google Scholar Crossref Search ADS PubMed WorldCat 113 Shim JE , Bang C, Yang S, et al. GWAB: a web server for the network-based boosting of human genome-wide association data . Nucleic Acids Res 2017 , doi: 10.1093/nar/gkx284 . Google Scholar OpenURL Placeholder Text WorldCat 114 Warde-Farley D , Donaldson SL, Comes O, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function . Nucleic Acids Res 2010 ; 38 (Suppl 2): W214 – 20 . Google Scholar Crossref Search ADS PubMed WorldCat 115 Nitsch D , Tranchevent LC, Goncalves JP, et al. PINTA: a web server for network-based gene prioritization from expression data . 
Nucleic Acids Res 2011 ; 39 : W334 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 116 Madian AG , Wheeler HE, Jones RB, et al. Relating human genetic variation to variation in drug responses . Trends Genet 2012 ; 28 ( 10 ): 487 – 95 . http://dx.doi.org/10.1016/j.tig.2012.06.008 Google Scholar Crossref Search ADS PubMed WorldCat 117 Yu W , Li Z, Long F, et al. A systems pharmacology approach to determine active compounds and action mechanisms of Xipayi KuiJie'an enema for treatment of ulcerative colitis . Sci Rep 2017 ; 7 ( 1 ): 1189 . http://dx.doi.org/10.1038/s41598-017-01335-w Google Scholar Crossref Search ADS PubMed WorldCat 118 Wang J , Liu R, Liu B, et al. Systems Pharmacology-based strategy to screen new adjuvant for hepatitis B vaccine from Traditional Chinese Medicine Ophiocordyceps sinensis . Sci Rep 2017 ; 7 : 44788 . http://dx.doi.org/10.1038/srep44788 Google Scholar Crossref Search ADS PubMed WorldCat 119 Irurzun-Arana I , Pastor JM, Troconiz IF, et al. Advanced Boolean modeling of biological networks applied to systems pharmacology . Bioinformatics 2017 ; 33 : 1040 – 8 . Google Scholar PubMed OpenURL Placeholder Text WorldCat 120 Lotfi Shahreza M , Ghadiri N, Mousavi SR, et al. A review of network-based approaches to drug repositioning . Brief Bioinform 2017 , doi: 10.1093/bib/bbx017 . Google Scholar OpenURL Placeholder Text WorldCat 121 Lamb J , Crawford ED, Peck D, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease . Science 2006 ; 313 ( 5795 ): 1929 – 35 . http://dx.doi.org/10.1126/science.1132939 Google Scholar Crossref Search ADS PubMed WorldCat 122 Corsello SM , Bittker JA, Liu Z, et al. The Drug Repurposing Hub: a next-generation drug library and information resource . Nat Med 2017 ; 23 ( 4 ): 405 – 8 . http://dx.doi.org/10.1038/nm.4306 Google Scholar Crossref Search ADS PubMed WorldCat 123 Zhang RZ , Yu SJ, Bai H, et al. 
TCM-Mesh: the database and analytical system for network pharmacology analysis for TCM preparations . Sci Rep 2017 ; 7 ( 1 ): 2821 . http://dx.doi.org/10.1038/s41598-017-03039-7 Google Scholar Crossref Search ADS PubMed WorldCat 124 Wu Z , Cheng F, Li J, et al. SDTNBI: an integrated network and chemoinformatics tool for systematic prediction of drug-target interactions and drug repositioning . Brief Bioinform 2017 ; 18 : 333 – 47 . Google Scholar PubMed OpenURL Placeholder Text WorldCat 125 Huang H , Wu X, Pandey R, et al. C(2)Maps: a network pharmacology database with comprehensive disease-gene-drug connectivity relationships . BMC Genomics 2012 ; 13 (Suppl 6): S17 . Google Scholar Crossref Search ADS PubMed WorldCat 126 Chen B , Dong X, Jiao D, et al. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data . BMC Bioinformatics 2010 ; 11 ( 1 ): 255 . http://dx.doi.org/10.1186/1471-2105-11-255 Google Scholar Crossref Search ADS PubMed WorldCat 127 von Eichborn J , Murgueitio MS, Dunkel M, et al. PROMISCUOUS: a database for network-based drug-repositioning . Nucleic Acids Res 2011 ; 39 : D1060 – 6 . Google Scholar Crossref Search ADS PubMed WorldCat 128 Zhang P , Tao L, Zeng X, et al. A protein network descriptor server and its use in studying protein, disease, metabolic and drug targeted networks . Brief Bioinform 2016 , doi: 10.1093/bib/bbw071 . Google Scholar OpenURL Placeholder Text WorldCat 129 Hsin KY , Matsuoka Y, Asai Y, et al. systemsDock: a web server for network pharmacology-based prediction and analysis . Nucleic Acids Res 2016 ; 44 ( W1 ): W507 – 13 . Google Scholar Crossref Search ADS PubMed WorldCat 130 Gilson MK , Liu T, Baitaluk M, et al. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology . Nucleic Acids Res 2016 ; 44 ( D1 ): D1045 – 53 . Google Scholar Crossref Search ADS PubMed WorldCat 131 Setoain J , Franch M, Martinez M, et al. 
NFFinder: an online bioinformatics tool for searching similar transcriptomics experiments in the context of drug repositioning . Nucleic Acids Res 2015 ; 43 ( W1 ): W193 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 132 Jensen K , Panagiotou G, Kouskoumvekaki I. NutriChem: a systems chemical biology resource to explore the medicinal value of plant-based foods . Nucleic Acids Res 2015 ; 43 : D940 – 5 . Google Scholar Crossref Search ADS PubMed WorldCat 133 He L , Wennerberg K, Aittokallio T, et al. TIMMA-R: an R package for predicting synergistic multi-targeted drug combinations in cancer cell lines or patient-derived samples . Bioinformatics 2015 ; 31 ( 11 ): 1866 – 8 . http://dx.doi.org/10.1093/bioinformatics/btv067 Google Scholar Crossref Search ADS PubMed WorldCat 134 Ruyssinck J , Demeester P, Dhaene T, et al. Netter: re-ranking gene network inference predictions using structural network properties . BMC Bioinformatics 2016 ; 17 ( 1 ): 76 . http://dx.doi.org/10.1186/s12859-016-0913-0 Google Scholar Crossref Search ADS PubMed WorldCat 135 Moyer E , Hagenauer M, Lesko M, et al. MetaNetVar: pipeline for applying network analysis tools for genomic variants analysis . F1000Res 2016 ; 5 : 674 . http://dx.doi.org/10.12688/f1000research.8288.1 Google Scholar Crossref Search ADS PubMed WorldCat 136 Dozmorov MG , Cara LR, Giles CB, et al. GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets . Bioinformatics 2016 ; 32 ( 15 ): 2256 – 63 . http://dx.doi.org/10.1093/bioinformatics/btw169 Google Scholar Crossref Search ADS PubMed WorldCat 137 da Rocha EL , Ung CY, McGehee CD, et al. NetDecoder: a network biology platform that decodes context-specific biological networks and gene activities . Nucleic Acids Res 2016 ; 44 : e100 . Google Scholar Crossref Search ADS PubMed WorldCat 138 Cho A , Shim JE, Kim E, et al. MUFFINN: cancer gene discovery via network analysis of somatic mutation data . 
Genome Biol 2016 ; 17 ( 1 ): 129 . http://dx.doi.org/10.1186/s13059-016-0989-x Google Scholar Crossref Search ADS PubMed WorldCat 139 Bottomly D , McWeeney SK, Wilmot B. HitWalker2: visual analytics for precision medicine and beyond . Bioinformatics 2016 ; 32 ( 8 ): 1253 – 5 . http://dx.doi.org/10.1093/bioinformatics/btv739 Google Scholar Crossref Search ADS PubMed WorldCat 140 An O , Dall'Olio GM, Mourikis TP, et al. NCG 5.0: updates of a manually curated repository of cancer genes and associated properties from cancer mutational screenings . Nucleic Acids Res 2016 ; 44 ( D1 ): D992 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 141 Chen YJ , Lu CT, Su MG, et al. dbSNO 2.0: a resource for exploring structural environment, functional and disease association and regulatory network of protein S-nitrosylation . Nucleic Acids Res 2015 ; 43 : D503 – 11 . Google Scholar Crossref Search ADS PubMed WorldCat 142 Boue S , Talikka M, Westra JW, et al. Causal biological network database: a comprehensive platform of causal biological network models focused on the pulmonary and vascular systems . Database 2015 ; 2015 : bav030 . Google Scholar Crossref Search ADS PubMed WorldCat 143 Smoly IY , Lerman E, Ziv-Ukelson M, et al. MotifNet: a web-server for network motif analysis . Bioinformatics 2017 ; 33 : 1907 – 9 . http://dx.doi.org/10.1093/bioinformatics/btx056 Google Scholar Crossref Search ADS PubMed WorldCat 144 Shoaib M , Ansari AA, Ahn SM. cMapper: gene-centric connectivity mapper for EBI-RDF platform . Bioinformatics 2017 ; 33 ( 2 ): 266 – 71 . http://dx.doi.org/10.1093/bioinformatics/btw612 Google Scholar Crossref Search ADS PubMed WorldCat 145 Pirayre A , Couprie C, Duval L, et al. BRANE clust: cluster-assisted gene regulatory network inference refinement . IEEE/ACM Trans Comput Biol Bioinform 2017 , doi: 10.1109/TCBB.2017.2688355 . Google Scholar OpenURL Placeholder Text WorldCat 146 Knaus BJ , Grunwald NJ. 
© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Computational profiling of the gut–brain axis: microflora dysbiosis insights to neurological disorders
Dovrolis, Nikolas; Kolios, George; Spyrou, George M; Maroulakou, Ioanna
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx154
pmid: 29186317
Abstract Almost 2500 years after Hippocrates’ observations on health and its direct association to the gastrointestinal tract, a paradigm shift has recently occurred, making the gut and its symbionts (bacteria, fungi, archaea and viruses) a point of convergence for studies. It is nowadays well established that the gut microflora’s compositional diversity regulates via its genes (the microbiome) the host’s health and provides preliminary insights into disease progression and regulation. The microbiome’s involvement is evident in immunological and physiological studies that link changes in its biodiversity to its contributions to the host’s phenotype but also in neurological investigations, substantiating the aptly named gut–brain axis. The definitive mechanisms of this last bidirectional interaction will be our main focus because it presents researchers with a new conundrum. In this review, we prospect current literature for computational analysis methodologies that accommodate the need for better understanding of the microbiome–gut–brain interactions and neurological disorder onset and progression, through cross-disciplinary systems biology applications. We will present bioinformatics tools used in exploring these synergies that help build and interpret microbial 16S ribosomal RNA data sets, produced by shotgun and high-throughput sequencing of healthy and neurological disorder samples stored in biological databases. These approaches provide alternative means for researchers to form hypotheses for their inquiries faster, more cheaply and with precision. The goal of these studies relies on the integration of combined metagenomics and metabolomics assessments. An accurate characterization of the microbiome and its functionality can support new diagnostic, prognostic and therapeutic strategies for neurological disorders, customized for each individual host.
gut–brain axis, microflora, microbiome, neurological disorders, precision medicine, computational metagenomics
The host and its microflora: an interesting symbiosis
The philosophical expression ‘no man is an island’ takes on a whole new meaning when one considers the fact that from the time of birth, each of us coexists with an assortment of bacteria, fungi, archaea and viruses. These ∼10^14 microorganisms constitute the human microflora [1] (also known as microbiota), colonizing the skin, mouth, lungs, reproductive and gastrointestinal (GI) tract of everyone and creating a mutualistic biological interaction, a symbiosis. The gut in particular, with its physiology and large surface, acts as the perfect host environment for the microflora’s development, exhibiting the greatest diversity and abundance of bacterial populations. The composition of the human microflora, although evolving through the early stages of life and being perturbed by habitat, lifestyle, medication and health, is unique in each individual, creating a form of personal ‘fingerprint’ [2]. This evolution includes interactions between the members of the microflora fighting for ‘dominance’ among themselves. There are of course similarities across the field, with bacterial phyla like Bacteroidetes, Firmicutes and Actinobacteria being present in every host [3], but the difference lies in the abundance of their subpopulations. Interestingly enough, in 2009, Turnbaugh et al. [4] observed that even though the microflora composition may vary between individuals, its core function remains the same in similar pathophysiological conditions. In recent years, the combined genetic composition of the microflora, called the microbiome, has been implicated directly with numerous aspects of human health in ways that previously were, and in many cases still remain, unknown [5].
The beneficial role of the host–microflora relationship depends on a semi-stable homeostasis which, when disturbed, leads to dysbiosis [6], a state inducing or signifying pathological conditions. Under homeostasis, the functional role [7] of the microbiome includes defense against pathogens and inflammation via its interactions with the mucosa, vitamin synthesis, energy production, metabolism alteration and dietary modifications such as turning fibers into short-chain fatty acids (SCFAs), while contributing to neurodevelopment [8], adult brain function [9] and longevity [10, 11]. During dysbiosis, on the other hand, certain microbial populations become differentially abundant and their metabolic contributions shift accordingly, strongly affecting the host epigenome [12–14]. The gut microflora actively contributes to the development and maintenance of the gut immune system [15, 16] and to the permeability of the blood–brain barrier (BBB) [17], and its imbalance has already been linked to various pathological conditions such as inflammatory bowel diseases (IBDs) [18], cardiovascular conditions [19], atherosclerosis [20], diabetes [21], cancer [22], metabolic syndrome [23], human immunodeficiency virus (HIV) [24], chronic kidney disease [25], antiphospholipid syndrome [26] and, most importantly for the premise of this review, various neurological [27] and neuropsychiatric [28] conditions.
The gut–brain–microbiome axis
It has been known for some time that the enteric nervous system acts as a kind of second ‘brain’ [29, 30], providing a bridge between the gut, the mucosal immune system, the neuroendocrine system, the autonomic nervous system, the vagus nerve and by extension the brain [31]. Earlier hypotheses pointed at the brain as the instigator of this relationship, trying to ‘control’ the gut, but later studies pointed at a bidirectional relationship.
These observations provided the basis for the investigations of the gut–brain axis on a more advanced level, revealing four distinct signaling pathways composed of neural, immunological, endocrinological and microbial communications [32]. With the newfound knowledge of the microflora’s implication in human health, the axis expanded to include the microbiome among its components, forming what can be found in literature as the microbiome–gut–brain axis [33, 34]. Microbial metabolites interact with the host environment, controlling immune responses via the mucosa, reaching the brain via the bloodstream and modulating neural responses. It is clear that there is a whole ecosystem that affects homeostasis and pathological conditions alike, via known and unknown mechanisms [35]. For example, the microbiome’s contribution to the metabolism of tryptophan, an essential amino acid for the synthesis of serotonin in the central nervous system (CNS), leads to its absorption by the gut and the crossing of the BBB [36]. The SCFAs, which are immunoregulating metabolites of gut microflora, influence microglia homeostasis and shape brain development [37]. Nitric oxide inhibition via microbial metabolites contributes to microglia maturation [14]. Recently, Bellono et al. [38] have shown that enterochromaffin cells express chemosensors that regulate serotonin-sensitive nerve fibers and establish a direct communication between the gut ecosystem and the nervous system. Current knowledge has linked the gut–brain axis to various systemic pathological conditions such as obesity [39–41], irritable bowel syndrome [42–44], upper GI disorders [45] such as gastroparesis, dyspepsia and anorexia, and infant colic [46], but mainly to neurological conditions affecting mental state and development, memory and behavior [47]. Clinical and preclinical studies have delved into characterizing the gut microflora dysbiosis in neurological conditions, pointing at differentially abundant microbial genera.
From the early stages of life through adolescence, the gut microflora appears to influence not only normal neurological development but also the onset and/or the progression of pathological conditions like autism, schizophrenia, psychosis and bipolar disorder in both animal models and patients [48–50]. Autism spectrum disorders (ASDs), which are characterized by pathological neurodevelopment, have been linked to altered microbiome states in recent studies [51–56]. Increases in the population of bacteria of the genus Lactobacillus have been identified in patients exhibiting first episodes of psychosis and correlated positively with symptom severity (whereas Lachnospiraceae and Ruminococcaceae correlated negatively) in a study by Schwarz et al. [57]. These kinds of differences in microbial composition could possibly provide future strategies in the development of diagnostic tools for various disorders. A longitudinal study performed by Evans et al. [58] highlighted the population loss of Faecalibacterium as important in bipolar disorder, after excluding covariant factors. In 2013, Nieto et al. [59] used oral antibiotics in mice to alter the gut microbial composition, leading to an increase in the expression of brain-derived neurotrophic factor in the hippocampus, which is implicated in cognitive impairment, morphological and functional synaptic pathology and a contribution to N-methyl-D-aspartate receptor dysfunction. This dysfunction has been associated with schizophrenia. The gut–brain axis continues to shape our neurological and mental health beyond adolescence. Stress [60], insomnia [61], depression [62], anxiety [63] and even fear-related signaling [64], although not fatal in most cases, directly affect the quality of life of millions daily, regardless of age. As an example, Zheng et al.
[65], in a 2016 paper, presented a four-part study, which first tested germ-free mice and observed a reduction of depression-like symptoms, suggesting a microbiota–gut–brain axis involvement in depression. They then continued the experiment on patients with major depressive disorder (MDD) versus healthy controls and found significant differences in the abundance of the bacterial phyla Firmicutes, Actinobacteria and Bacteroidetes. The third step was fecal microflora transplantation from both MDD patients and healthy controls to the germ-free mice, which showed that the recipients of the ‘MDD microflora’ exhibited increased depression-like and anxiety-like symptomatology after 2 weeks. Finally, by applying functional shotgun metagenomics, they investigated the metabolic effects of microbiota on ‘MDD microflora’ mice and identified several dysregulated metabolic pathways, especially those involved with carbohydrate metabolism and its function in depression. When it comes to quality of life, and in some cases even mortality, strokes and progressive neurodegenerative diseases affect a dramatic proportion of the ageing population [36]. The microbiota–gut–brain axis has been implicated in the outcome of ischemic brain injury [66] and also in amyotrophic lateral sclerosis [67], multiple sclerosis [68], Parkinson’s [69] and Alzheimer’s disease (AD) [70]. A few months before this review, Bonfili et al. [71], using 3xTg-AD mouse models (transgenic mice with three mutations associated with familial AD), investigated the role of microflora regulation via administration of SLAB51 probiotics (a mixture of lactic acid bacteria and bifidobacteria) in the etiopathology of AD. Their experiments provided insights into regulating amyloid load, counteracting cognitive decline and brain damage, increasing gut hormone concentrations and regulating proteasomal and autophagic pathways.
They calculated statistically significant microflora compositional and functional changes between wild-type and AD models after probiotic treatment, specifically attributed to the increase in Bifidobacterium spp., the reduction in Campylobacterales and their role in inflammation via the regulation of pro-inflammatory cytokines. As evident from the above examples, preclinical and clinical studies can be enhanced significantly by bioinformatics approaches, enriching our understanding of the microbiome’s involvement. The findings of such approaches provide a unique perspective on the composition and functional role of the microbiome, allowing researchers to theorize on dysbiosis as a cause or an effect of specific conditions while investigating the effects of interventions on the microflora (Figure 1). The next sections of this review highlight these technology-based methodologies and outline how the in silico process is formulated in microbiome studies.
Figure 1 A graphical abstract of this review highlighting the gut–brain axis communication pathways, the host mechanisms the microflora regulates and some of its major perturbagens. It also presents a basic pipeline of computational analysis found in contemporary microbiome publications.
Computational metagenomic approaches
In the field of metagenomics research, some fundamental questions often arise: How do we know so much about the microbiome and how did we get there so fast after decades of speculation? How exactly do we know what the microbiome is composed of? Can we identify interactions between populations of the microflora?
How did we associate specific members of the microflora and their metabolic products with a diverse spectrum of health conditions? The answer lies in the technological advantage that gene and next-generation sequencing (NGS) [72, 73] have provided for studying the uncultured microflora, and in the fast strides of bioinformatics. Before delving into the functional role of microbial populations in the pathophysiology of disorders, we must be able to identify them with high sensitivity and specificity. NGS has provided these capabilities to a large extent by introducing shotgun sequencing alongside 16S ribosomal RNA (rRNA) sequencing [74, 75]. The 16S rRNA gene is considered to be the de facto housekeeping gene of bacterial and archaeal populations. At this point, the first concession of studying the microbiome is introduced, in the form of focusing on the implications of the bacteriome (bacterial microbiome) and often foregoing those of the mycobiome (fungal microbiome) [76, 77] and the virome (viral microbiome) [78, 79], both of which have been associated with pathological conditions but are still largely understudied. This concession is largely based on the richness (quantification of how many distinct species) and abundance (quantification of how many members the species have) of bacterial populations over those of the fungi and viruses, but also on their ease of detection and the better understanding of their biological processes. Metagenomics [80, 81] is the term introduced to specify the study of the metagenome, which is the combined DNA composition of environmental samples. In the case of human microflora, in fecal and histological biopsy samples, it refers to the identification and quantification of the genetic contributions of microbial subpopulations [82]. Shotgun metagenomics, although more expensive, provides higher resolution and accuracy, but its results are more complex because they include all the microorganisms of a sample [83], as well as host DNA.
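The distinction drawn above between richness (how many distinct taxa are present) and abundance (how many members each taxon has) can be made concrete with a small calculation. The following is a minimal Python sketch, not taken from any of the cited tools; the read assignments and the function names `richness` and `relative_abundance` are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def richness(counts):
    """Number of distinct taxa observed at least once in the sample."""
    return sum(1 for n in counts.values() if n > 0)


def relative_abundance(counts):
    """Fraction of all assigned reads that belong to each taxon."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical taxon assignments for the reads of one sample (assumption).
reads = ["Bacteroidetes"] * 6 + ["Firmicutes"] * 3 + ["Actinobacteria"]
sample = Counter(reads)

sample_richness = richness(sample)
sample_abundance = relative_abundance(sample)
print(sample_richness, sample_abundance)
```

Two samples can share the same richness while differing strongly in relative abundance, which is the kind of difference between hosts that the text describes.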
16S rRNA metagenomics, on the other hand, is more accessible and faster to carry out in a laboratory setting when the focus of the study is the bacteria and archaea in multiple control and patient samples. Both approaches use a practice that introduces an amount of variance between different studies: the construction of NGS libraries for RNA or DNA [84]. Additionally, the 16S rRNA standard operating procedure requires another library-building step with a fair amount of uncertainty: the amplification of hypervariable regions of the 16S rRNA gene via multiplex polymerase chain reaction (PCR) primers [85]. In both cases, whether it is the sequencing of a whole sample or of the 16S rRNA amplicons, we end up with short reads (25–500 base pairs), allowing microorganisms that are unknown or present in small abundances to be detected. These reads require extensive bioinformatics preprocessing with specialized tools for read trimming, merging, assembly, scaffolding and mapping [86]. Table 1 provides an overview of preprocessing tools and supplies information on their ability to perform:
Read preprocessing
Quality control, to ensure error reads, artifacts and bias are detected and corrected
Denoising, to remove the noise often introduced by DNA/RNA preparation and PCR
Chimera detection, to identify and remove chimeras, which are artificial recombinants formed during the PCR amplification stage [87]
Table 1 Applications for the preprocessing of microbial sequence reads
Columns: Tool; Trimming, merging, scaffolding, assembly; Quality control; Denoising; Chimera detection; Reference
Abyss 2.0 ✓ [88]
Bambus 2 ✓ [89]
BBAP ✓ [90]
CATCh ✓ [91]
ChimeraSlayer ✓ [92]
dupRadar ✓ [93]
EP_metagenomic ✓ [94]
IDBA-UD ✓ [95]
IM-TORNADO ✓ ✓ ✓ ✓ [96]
InteMAP ✓ [97]
IPED ✓ [98]
MAP ✓ [99]
MeFiT ✓ [100]
MEGAHIT ✓ [101]
MESER ✓ [102]
MetAMOS ✓ ✓ ✓ ✓ [103]
metaSPAdes ✓ [104]
MetaVelvet ✓ [105]
mothur ✓ ✓ ✓ ✓ [106]
NoDe ✓ [107]
OCToPUS ✓ ✓ ✓ ✓ [108]
Orione ✓ ✓ ✓ ✓ [109]
PRICE ✓ [110]
QIIME ✓ ✓ ✓ ✓ [111]
Qualimap2 ✓ [112]
Ray Meta ✓ [113]
ROP ✓ [114]
Sequins ✓ [115]
sleuth ✓ [116]
Snowball ✓ ✓ [117]
Trimmomatic ✓ ✓ [118]
UCHIME ✓ [119]
VSEARCH ✓ [120]
Xander ✓ [121]
Note: These steps precede the microbial characterization (binning/OTU picking).
It is obvious that there is no clear winner among sequencing methodologies, but rather one better suited to the job at hand. The products of the sequencing process, regardless of the technology used, are distinct sequences of the microflora members of the samples, reported in fasta or fastq files, and a mapping file containing all the necessary metadata for the samples. These files will be the input of the next steps: identifying the species the sequences belong to and assigning them taxonomies. Operational taxonomic unit (OTU) is a term introduced to describe clusters of similar sequences, which might represent a species. Although not necessarily flawless, this approach typically uses a 97% similarity of sequences for the clustering and leads to the selection of one sequence per OTU to represent the taxon it belongs to via phylogenetic alignment. Various bioinformatics approaches and algorithms exist for this process, which is also known as binning, either in workflows or in individual implementations of homology- and prediction-based methods, both for shotgun and 16S rRNA metagenomics. Most of these algorithms rely primarily on two specific practices and hybrid implementations of them: de novo and closed-reference OTU picking for 16S rRNA data, or homology-independent/dependent binning for shotgun data, respectively.
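To illustrate how sequences can be grouped into OTUs at a 97% similarity threshold with one representative sequence per cluster, the following Python sketch implements a simple greedy, centroid-based clustering. It is a simplified illustration under stated assumptions and not the algorithm of any specific tool named in this review: sequences are assumed to be pre-aligned and of equal length, identity is computed position by position, and the helper names (`identity`, `greedy_otu_clustering`, `mutate`) and toy sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def greedy_otu_clustering(sequences, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold;
    otherwise the sequence becomes the centroid (representative) of a new OTU."""
    centroids = []  # one representative sequence per OTU
    members = []    # member sequences per OTU, parallel to centroids
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                members[i].append(seq)
                break
        else:
            centroids.append(seq)
            members.append([seq])
    return centroids, members


def mutate(seq, positions):
    """Toy helper (assumption): substitute the base at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


base = "ACGT" * 25                        # 100-nt toy sequence
close = mutate(base, [10, 50])            # 98% identical to base
far = mutate(base, [0, 20, 40, 60, 80])   # 95% identical to base

centroids, members = greedy_otu_clustering([base, close, far])
print(len(centroids), [len(m) for m in members])
```

In this toy input, the 98%-identical sequence joins the first OTU while the 95%-identical one founds a second OTU. The outcome of greedy clustering depends on input order, which is one reason the text describes the approach as not necessarily flawless.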
De novo OTU picking is largely based on prediction-based implementations like Infernal [122], UPARSE [123], UCLUST [124], CD-HIT [125], PyNAST [126], METAXA2 [127], CLUSTOM-CLOUD [128], SWARM [129], OptiClust [130] and NINJA-OPS [131], which when clustering do not take into account any existing database of reference sequences but rather try to construct their own phylogenetic tree and assign taxonomies to OTUs after aligning them. The same concept applies to homology-independent binning through applications like CONCOCT [132], GroopM [133], MetaFast [134], MetaBAT [135], MaxBin [136], VizBin [137], COCACOLA [138] and MetaProb [139]. This methodology is better suited when trying to identify metagenomes of habitats with largely unknown members or trying to identify pathogenic microorganisms of unknown origin. It is by far the most computationally demanding approach, albeit the most accurate, as no reads are disregarded. On the contrary, when the host environment contains largely known species, like the gut microflora, a closed-reference OTU picking strategy (or a homology-dependent one for shotgun data) can provide accurate results quickly by using algorithms that look up reference sequences in the latest versions of databases like RDP [140], GreenGenes [141], SILVA [142], RefSeq [143], HPMCD [144], etc., and cluster the data according to their similarity with those. Implementations of this approach include Taxonomer [145], IMSA-A [146], BLCA [147] and SPINGO [148] for closed-reference OTU picking, and MetaPhlAn [149], MEGAN6 [150], Centrifuge [151], MGMapper [152] and OPAL [153] for homology-dependent binning. The output of these pipelines, independent of the methodology used, is usually an OTU table, which contains all the OTUs found in a sample, how many times each was found and their assigned taxonomy, among various other metadata. The processes described above are summarized visually in Figure 2.
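The closed-reference strategy and the resulting OTU table can be sketched in a few lines of Python. This is an illustrative simplification rather than the method of any tool cited above: the reference "database" is a hypothetical two-entry dictionary, reads are assumed to be pre-aligned to the same length as the references, identity is computed position by position, and reads without a hit at the 97% threshold are discarded, which mirrors the trade-off relative to de novo picking noted in the text.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def closed_reference_otu_table(samples, reference, threshold=0.97):
    """Count, per sample, the reads whose best reference hit reaches the threshold.
    Reads with no sufficiently similar reference sequence are dropped."""
    table = {}
    for sample_id, reads in samples.items():
        counts = Counter()
        for read in reads:
            best_otu, best_score = None, 0.0
            for otu_id, ref_seq in reference.items():
                score = identity(read, ref_seq)
                if score > best_score:
                    best_otu, best_score = otu_id, score
            if best_otu is not None and best_score >= threshold:
                counts[best_otu] += 1
        table[sample_id] = counts
    return table


# Hypothetical reference sequences and reads (assumptions for illustration).
reference = {"OTU_A": "ACGT" * 25, "OTU_B": "TTGCA" * 20}
near_a = "C" + reference["OTU_A"][1:]   # one mismatch, 99% identical to OTU_A
samples = {
    "control": [reference["OTU_A"], near_a, reference["OTU_B"]],
    "patient": [reference["OTU_B"], reference["OTU_B"], "G" * 100],
}

otu_table = closed_reference_otu_table(samples, reference)
for sample_id, counts in otu_table.items():
    print(sample_id, dict(counts))
```

Each row of the resulting table corresponds to a sample and each key to an OTU with its read count; real pipelines attach the assigned taxonomy and further metadata to each OTU.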
Figure 2. 16S rRNA and shotgun metagenomics pipelines for extracting information on the host's gut microbiome.

Because different tools are required for the shotgun and 16S rRNA approaches, researchers can, with the help of specialized platforms for bioinformatics resources like OMICtools [154], create their own workflows by combining applications from any of the aforementioned categories, or use standardized ones like QIIME, mothur and many others [103, 106, 109, 155–164], which perform multiple tasks of data preparation and downstream analysis. Standardized workflows are the easiest way for scientists to acquire and analyze their microbiome data, with the added benefit of producing standardized, reproducible results. At this point, we should highlight that metagenome bioinformatics is computationally cumbersome and requires copious amounts of processing power, memory and storage, but is advancing rapidly because of its rising popularity, the employment of bioinformatics scientists and its open-source nature.

It is widespread practice today for researchers to store their sequence and OTU data on online platforms after publication to help promote knowledge of the microbiome. These platforms are supported and sometimes financed by organizations and global microbiome initiatives like the Human Microbiome Project [165], whose goal is to standardize the process and communicate the necessity of similar studies. In this way, we are rapidly acquiring not only the tools but also the actual data to perform evaluations between different approaches and meta-analyses that address hypotheses the original authors might not have considered. This is highly dependent on correct metadata annotation of the stored data, which makes such annotation crucial for reuse and repurposing.
There is a variety of online solutions for metagenomics data publishing, a nonexhaustive list of which is included in Table 2. Users of these databases should note that comparing studies or samples created via different methodologies can be problematic in principle, as the data might not be directly comparable without further analysis.

Table 2. Repositories containing public data sets of sequence/OTU data that can be used for metagenomics studies

| Database | URL | Description | References |
|---|---|---|---|
| EBI-metagenomics | https://www.ebi.ac.uk/metagenomics/ | Part of the European Nucleotide Archive, it offers a pipeline for raw sequence analysis and archiving of metagenomic data. The added value is that users can view the analysis results of each sample. | [166] |
| Human Microbiome Project Data Portal | https://portal.hmpdacc.org/ | Perhaps the most daunting of the databases, hmpdacc provides a way for users to browse and download data from the Human Microbiome Project. The interface is hard to navigate when looking for data on specific conditions. The iHMP spin-off website, which focuses on three specific health conditions (pregnancy, IBD and diabetes type 2), makes things a little easier for those conditions only. | [167] |
| Human Pan-Microbe Community database | http://www.hpmcd.org/index.php | Taking an approach similar to IMG/M, HPMCD offers comparative metagenomics based on microbial populations. The samples are based on EBI metagenomics samples. | [144] |
| IMG/M | https://img.jgi.doe.gov/cgi-bin/m/main.cgi | The Integrated Microbial Genomes and Microbial Samples database takes a unique approach of providing microbial genomes from different studies and the ability to compare them. Perhaps not the most intuitive of the databases for reanalyses of specific conditions, but rather for the role of specific organisms. | [168] |
| iMicrobe | https://www.imicrobe.us/ | iMicrobe provides an intuitive, user-friendly search of its data sets based on metadata. One drawback, shared with MG-RAST, is that whole studies cannot be downloaded at once, only their individual samples. | [169] |
| MG-RAST | http://metagenomics.anl.gov/ | A constantly updated database and pipeline for NGS metagenomics. Data can be accessed via http, ftp and directly via the API. Perhaps a small drawback is the inability to download a whole study from the website, something that is possible via ftp. | [170] |
| QIITA | https://qiita.ucsd.edu/ | Web-based metagenomic database and pipeline of tools for 16S rRNA and shotgun data sets, originally created for the American Gut Project. QIITA offers data sets in various states of assembly, from raw sequences to OTU tables. End user-friendly, with resources that can easily be added to a different pipeline for reanalysis. | [171] |
| Repositive | https://repositive.io/ | Repositive is an all-purpose repository created as a central hub for genomic data, but it contains metagenomic studies as well. Requires a free account to get started on the data. | [172] |

Information overload and microbiome analytics

As with all -omics approaches, metagenomics is plagued by vast amounts of data which, although characterized using the techniques above, need to be analyzed, comprehended and rationalized. Humans, not only computers, must be able to view these data in easily understood ways and form conjectures about their involvement in human health. Certain metrics and visualization techniques were introduced with the advancement of bioinformatics toward that goal. Most of the standardized workflows mentioned previously, like QIIME, analyze the microbiome data and export the results as diagrams and figures. The analyses and feedback that bioinformatics applications can provide fall into the following categories:

Microbial community composition, hierarchy and quantitative representation (taxa abundance)
These tools focus on representing which taxa are abundant and at what percentage, in the individual samples or in sample groupings based on their metadata.
Raw read abundance percentages derive from counting the number of OTU sequences present in the samples, or from a comparison between them to calculate their relative abundance. Following the biological taxonomy of phylum -> class -> order -> family -> genus -> OTU (species), the microbial composition is visualized at distinct levels, and even in hierarchies, using phylogenetic trees, concentric diagrams and bar plots.

Diversity analysis
There are two basic metrics of diversity analysis in microbial samples: α-diversity, which represents the biodiversity within a sample (how rich a sample is in different microbial communities), and β-diversity, which characterizes how different the composition of the microbiome is across groupings of metadata that characterize the environment (e.g. healthy controls versus patients). α-Diversity is usually calculated via rarefaction [173] and indices like Chao1 and Shannon, and represented via rarefaction curves or box plots, while β-diversity is predominantly calculated using UniFrac distance metrics [174] and illustrated with principal coordinates analysis plots. In the case of the latter, a jackknifing algorithm [175] can also be used.

Multivariate statistical analysis of microbiome composition in correlation to sample metadata
This category focuses on inferring biological associations between microbial species and specific sample groupings. Researchers testing a specific hypothesis need to know the differential abundance between sample groupings to see which taxa contribute to dysbiosis to a statistically significant degree. Negative binomial models (DESeq2), random forests, Kruskal–Wallis, the Wilcoxon rank test, analysis of variance, the t-test and other parametric and nonparametric statistical tests are used to that effect.
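The relative abundance and α-diversity measures named above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming a single sample given as a dictionary of OTU counts; the function names are hypothetical, and the Chao1 variant shown is the commonly used bias-corrected form rather than the implementation of any specific tool cited in this review.

```python
import math


def relative_abundance(counts):
    """Convert raw OTU counts for one sample into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {otu: n / total for otu, n in counts.items()}


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    props = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in props if p > 0)


def chao1(counts):
    """Chao1 richness estimate from singleton (f1) and doubleton (f2) OTUs.

    Uses the bias-corrected form S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)),
    which stays defined when a sample has no doubletons.
    """
    observed = sum(1 for n in counts.values() if n > 0)
    f1 = sum(1 for n in counts.values() if n == 1)
    f2 = sum(1 for n in counts.values() if n == 2)
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))
```

Shannon rewards both the number of OTUs and the evenness of their abundances, whereas Chao1 estimates how many OTUs remain unseen from the number of rare ones, which is one reason the two are usually reported together.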
As metagenomics analysis involves multiple testing, it is important to correct the P-values for multiple comparisons, using procedures like Bonferroni, the Benjamini–Hochberg false discovery rate correction or the more recent StructFDR [176], which is specialized for metagenomic data. Guides like GUSTA ME [177] and Statistics How To (http://www.statisticshowto.com/) help researchers understand these statistical strategies faster and decide which one fits their needs. In addition, methods like MixMC [178], Pearson’s correlation heatmaps, canonical correspondence analysis, redundancy analysis, etc. [179] measure how quantitatively different the microbial composition is in different groupings and what changes researchers can expect to find while studying them.

Network analysis
Network metrics are used to detect microbial species that co-occur, are mutually exclusive or point to specific associations with the sample metadata. This helps researchers model microbial community interactions and infer relationships. Networks are visualized in their traditional node–edge form, where nodes usually represent individual taxa and edges represent their relationships. Pearson’s correlation, Spearman’s rho or the recent mLDM [180] are some of the algorithms used to calculate these relationships. Specialized network construction and analysis tools for microbe–microbe and microbe–host interactions, like MMinte [181], have been created to provide a semantic point of view on the microbiome. Additionally, external all-purpose network analysis and visualization applications like Cytoscape [182], Gephi [183] and the Network Workbench Tool [184] can be used, as many of the microbiome applications can export their constructed networks in appropriate formats.

Biomarker discovery
Biomarker discovery in metagenomics is the way to identify which specific microbial taxa, and which combinations of them, contribute to explanatory variables.
Once again, parametric or nonparametric tests are applied to OTU tables, and their results are represented in various forms like odds ratio diagrams. These tests usually apply when one wants to compare two different states side by side. In recent years, implementations such as LEfSe [185] have been introduced, which can analyze multiple factors simultaneously to discover biomarkers of dysbiosis.

Functional analysis of the microbiome: metabolomics
Even though quantification of the microflora’s composition is important for understanding the parties involved in dysbiosis and their association with pathophysiology, their actual functionality is the key to examining whether they are the cause or mere casualties of disorders. As showcased earlier in the discussion of the gut–brain axis, microbial metabolic processes, the preeminent way for microbes to interact with the host, play a vital role in health. Metabolomics is the large-scale study that identifies and quantifies metabolites, which can provide insights into the host environment during homeostasis or disease. Studies can focus either on cellular processes that affect the microbiome by creating a nurturing or hostile environment for the microflora, or on the extragenomic perturbations caused by microbial metabolites on the host. Modern studies usually focus on the latter, trying to prove or disprove a correlation between certain microbial populations and host disorders. Metabolite identification can occur either by analyzing the results of traditional methods like chromatography, mass spectrometry and nuclear magnetic resonance [186–189] or by using metagenomics tools that infer the metabolic products of microbial populations via their genes. Similar to the OTU classification process, functional metagenomics requires different approaches to the analysis and visualization of results.
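Returning to the multiple-testing correction discussed under multivariate statistical analysis, the Benjamini–Hochberg procedure can be illustrated with a short Python sketch. It assumes that one P-value per taxon has already been obtained from any of the tests listed in that category; the function names and the idea of passing a taxon-to-P-value dictionary are hypothetical and serve only as an illustration, not as the implementation used by any cited tool.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (q-values) in input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values never
    # increase as the raw P-values decrease (monotonicity of q-values).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_taxa(p_by_taxon, alpha=0.05):
    """Names of taxa whose adjusted P-value is at or below alpha."""
    taxa = list(p_by_taxon)
    q_values = benjamini_hochberg([p_by_taxon[t] for t in taxa])
    return [t for t, q in zip(taxa, q_values) if q <= alpha]
```

Because every raw P-value is scaled by the number of tests divided by its rank, a taxon that looks significant in isolation may no longer pass the threshold once hundreds of taxa are tested together, which is the situation the correction is meant to guard against.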
Because metagenomics downstream analysis tools often offer insights into several of the above categories, Table 3 summarizes some stand-alone implementations and R packages along with their functionalities. Most of the applications require appropriate input, in the form of sequences or OTU tables, to analyze and provide visualizations of their results. Even though Tool A might offer a wider variety of operations than Tool B and be preferred for that reason, most of them are interchangeable, and their usage relies on adoption by the scientific community and subjective ease of use. Some might argue that the speed and computational requirements of some implementations are not subjective and that there are clear winners, but this depends on the computational power of the end user’s equipment. Bioinformaticians may even choose to adapt some of them to their own needs, as they are open source, and create their own mix-and-match pipelines. What is important, though, is that the interpretation of the statistical analyses remains in the hands of the researchers and that the analyses are used properly with regard to different hypotheses. Statistics that are not viewed critically can lead to skewed conclusions, especially in metagenomics, where so many variables are relevant and should be considered. Some researchers might even choose to run their data through multiple applications with the same functionality to verify their findings and use each tool’s resolution and specificity to their benefit. Figure 3 also summarizes frequently asked questions that may arise during metagenomics research and which of these categories of tools can answer them.
Table 3. Open-source implementations of microbiome downstream analysis

Columns: Tool; Microbial community composition, hierarchy and quantitative representation; Diversity analysis; Multivariate statistical analysis of microbiome composition in correlation to sample metadata; Network analysis; Biomarker discovery; Functional analysis/metabolomics; Reference

Stand-alone implementations
BugBase ✓ ✓ ✓ ✓ [190]
Calypso ✓ ✓ ✓ ✓ ✓ ✓ [191]
COGNIZER ✓ [192]
EMPeror ✓ ✓ [193]
Explicet ✓ ✓ ✓ [194]
FishTaco ✓ [195]
FMAP ✓ [196]
FragGeneScan ✓ [197]
FuncTree ✓ [198]
Galaxy/Hutlab N/A
Genboree Microbiome Toolset ✓ ✓ ✓ ✓ [199]
Glimmer-MG ✓ [200]
GraPhlAn ✓ [201]
HUMAnN2 ✓ [202]
IMP ✓ ✓ ✓ ✓ ✓ ✓ [203]
Krona ✓ [204]
LEfSe ✓ ✓ [185]
MEGAN6 ✓ ✓ ✓ ✓ [150]
MetaCoMET ✓ ✓ ✓ [205]
METAGENassist ✓ ✓ ✓ [206]
MetaShot ✓ [161]
Metaviz ✓ ✓ ✓ [207]
MG-RAST ✓ ✓ ✓ [170]
Microbiome Analyst ✓ ✓ ✓ ✓ ✓ ✓ [208]
MMinte ✓ ✓ [181]
MOCAT 2 ✓ ✓ [209]
mothur ✓ ✓ ✓ [106]
Parallel-META 3 ✓ ✓ ✓ ✓ ✓ ✓ [210]
Phoenix 2 ✓ ✓ [211]
PICRUSt ✓ [212]
Prodigal ✓ [213]
QIIME ✓ ✓ ✓ ✓ [111]
Rhea ✓ ✓ ✓ [214]
SAMSA ✓ [215]
ShortBRED ✓ [216]
STAMP ✓ ✓ ✓ ✓ [217]
Tax4Fun ✓ [218]
Taxonomer ✓ [145]
VAMPS ✓ ✓ [219]
Vikodak ✓ [220]

R packages
ade4 ✓ ✓ [221]
enveomics ✓ ✓ ✓ [222]
metaDprof ✓ ✓ [223]
metagenomeSeq ✓ ✓ [224]
MMiRKAT ✓ [225]
mmnet ✓ ✓ ✓ [226]
phyloseq ✓ ✓ ✓ ✓ [227]
RAIDA ✓ [228]
RevEcoR ✓ ✓ [229]
ShotgunFunctionalizeR ✓ [230]
vegan ✓ ✓ ✓ [231]

Note: These tools use microbial sequences and/or OTU tables to extract information on the microflora’s composition and functionality.

Figure 3. Common questions in metagenomics research and the specific categories of downstream analysis that can provide answers.
Finally, it is worth mentioning that many commercial solutions, which in some cases come bundled with sequencing equipment, provide functionality similar to the tools mentioned above, with the added benefit of training and troubleshooting support but the disadvantage of cost. These solutions include products like ERA-7 (https://era7bioinformatics.com/), CLC Genomics Workbench (https://www.qiagenbioinformatics.com/products/clc-genomics-workbench/), Strand NGS (http://www.strand-ngs.com/) and NovoWorx (http://www.novocraft.com/products/novoworx/).

Computational systems have catered to the needs of the life sciences for many years now, progressing and evolving in parallel with them. Algorithms have been developed, applications coded and hardware constructed specifically for bioinformatics and medical informatics, as demonstrated here. The goal of these efforts is to enhance research and to accommodate new and complex hypotheses that can be examined with speed and precision. Future efforts will bring scientists closer to a complete modeling and emulation of the brain and the gut, allowing us to see, in silico, the machinations and evolution of the gut–brain axis, even in real time. Recent efforts toward that goal, like the works of Cockrell et al. [232], Leber et al. [233], Abedi et al. [234] and others, have shown great potential. It is our belief that these computational analyses will drive not only the identification but also the treatment of various conditions.

Treating the disease, the patient or the patient–microflora complex. Will precision medicine be treating all of them?
In 2015, the Precision Medicine Initiative (recently renamed ‘All of Us’ [235, 236]) was announced by the US government to facilitate a better focus on personalized health and on the type of treatment that accounts for variability and identifies the unique features of each individual. Given everything this review has shown about the microbiome, and how close we are today to characterizing it uniquely for each person because of our achievements in bioinformatics, we believe that the parallel with this initiative is clear. If we are to talk about a person’s diagnosis, prognosis and therapy, it seems almost imperative to consider the whole microflora–host system. It is the entire system that suffers and, perhaps, therein lies the correct course of treatment or the necessary diagnostic and prognostic biomarkers. After all, the microbiome has been implicated in regulating pharmacokinetics and pharmacodynamics and in driving pharmacogenetics [237–239], providing added value to our investigations of drug metabolism and response.

Exercise, diet and a non-sedentary lifestyle have long been known to promote health for assorted reasons, especially concerning the cardiovascular system [240, 241]. Today, we know that these factors perturb the gut’s microflora [242–244], driving homeostasis and, by extension, systemic health. Our diet and our medication regimen regulate our microflora’s composition on a larger scale, by adding new microorganisms or creating a hostile environment for others, affecting, among other systems, our gut–brain axis [245, 246]. By using the knowledge acquired via downstream analysis of the microbiome, we can discuss targeted practices of diet and antibiotic usage, customized for each person according to their microbial profile. It is an innovative approach to the well-known expression ‘We are what we eat’.
There is also a special category of intervention, comprising probiotics, prebiotics and synbiotics, which can influence the microflora, can be used as treatment for various conditions and has been the focus of many studies [247–252]. The terms, although popular in the literature and gaining popularity in everyday life, are not well understood by the public. Probiotics are live organisms (bacteria, yeasts, etc.) that can supplement a person’s microflora when they are introduced into the diet. Prebiotics are ingredients that help specific microorganisms already present in the host to flourish and fight off pathogens and/or reach the numbers appropriate for countering dysbiosis. Finally, synbiotics are a mixture of the previous two groups. Owing to their mechanism of action, these dietary supplements can be used to target specific populations, which current insights into dysbiosis have already identified by methods such as the ones described previously in this review. For example, Mehta et al. [253] have proposed the use of lactic acid bacteria probiotics to reduce the oxidative stress implicated in AD by suppressing D-galactose, which is implicated in increased reactive oxygen species production and nerve growth factor suppression. Finally, in recent years, a new term has emerged, psychobiotics [254], which refers to living organisms (gut bacteria) introduced into the host’s system to treat mental disorders. Their mode of action specifically targets the gut–brain axis via the neurotropic metabolic products of these microorganisms [255].

Although probiotics and prebiotics may be valuable additions to a personalized treatment regimen, they rely on daily consumption to be useful and contribute to homeostasis. In the past few years, the more targeted and permanent solution of fecal microbiota transplantation [256–258] has been successfully deployed to help repopulate the host’s microflora with ‘healthy’ symbionts.
Based on what we know, for a transplantation to be successful a plethora of confounding factors must be considered. What can be deemed a ‘healthy’ donor and a ‘normal’ microflora? Is a transplant from someone living in the United States appropriate for someone in Asia? Considering location and different lifestyles, we must rely on our knowledge of the functional role of the microbiome, as discussed previously. Also, is the host’s lack of clinical symptomatology enough to consider a transplant ‘healthy’, or do we have to test for ‘dormant’ GI pathogens [259]? Is fecal material a reliable source of microflora, given that it can change constantly because of external factors [260]? Despite the many difficulties, recent studies have shown promise in treating a variety of pathological conditions, including neuropsychological ones. For example, in a recent study by Kang et al. [261], microflora transplantation alleviated autism symptomatology: ASD-related behavior improved by 22% following transplantation and by up to 24% in the 8-week period after that (according to the Childhood Autism Rating Scale). Treating the microflora is not the only thing one must consider when trying to combat dysbiosis. One of the major causes of microbial population loss is the broad usage of antibiotics [262]. Although these drugs are critical for our health, their extended usage has caused issues that go beyond the creation of antibiotic-resistant bacteria [263]. Especially during early life, antibiotics can help combat pathogens introduced into the host but are also responsible for dysbiosis [264]. Once more, the need for targeted precision antibiotics comes to the foreground, requiring an extensive understanding of their implications for the composition of the microflora and of how populations vital to homeostasis can be spared.
Complementing antibacterial treatments with probiotics that are not susceptible to the antibiotics themselves [265] can prove useful for approaches customized to the needs of patients [266, 267]. In the past 5 years, the microbiome has seen a significant boost in scientific interest and publications. A search for the related terms (microbiome, microbiota, microflora) in PubMed alone yields over 35 000 results for just this period, with exponential growth each passing year. Some researchers [268] have even characterized 2016 as the ‘banner year’ for microbiome research, something that can be directly attributed to the bioinformatics approaches at our disposal and the constant flow of information linking the microbiome to systemic health. This shift toward a fuller account of the mechanisms that describe, and are perturbed by, our microbiome is driven by the need to better understand the host–microflora relationship. The acquisition of this knowledge can lead not only to a more precise definition of the pathophysiological attributes of disorders but also to the customization of treatment for individuals or specific patient groups. Several aspects of today’s medicine are being driven by genomics, proteomics, epigenomics, metabolomics, microbiomics and their integration via systems biology, allowing researchers to accurately predict the onset, progression and pharmacological response of a pathological condition [269, 270]. Scientists are now able not only to precisely identify and evaluate the microbiome but also to track its changes, and the changes it provokes, through time, dynamically following differences in bacterial population abundance and metabolite production [271–273]. The complexity of the gut–brain–microbiome axis makes it an interesting target for our research efforts and a perfect candidate for integrated multidisciplinary approaches [274].
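The PubMed term search mentioned above can be approximated programmatically. The following is a minimal sketch, assuming the NCBI E-utilities ESearch endpoint and a 2013 to 2017 publication window; the exact query and date range behind the reported figure are not stated in the text, so the year bounds, the OR-joined query and the helper names are illustrative assumptions only.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_count_url(terms, first_year, last_year):
    """Build an ESearch URL whose response carries the hit count only.

    The terms are joined with OR; the year window is an assumption,
    not a value taken from the review.
    """
    params = {
        "db": "pubmed",
        "term": " OR ".join(terms),
        "datetype": "pdat",          # filter on publication date
        "mindate": str(first_year),
        "maxdate": str(last_year),
        "retmax": "0",               # no record IDs needed, only the count
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def fetch_count(url):
    """Query PubMed (network access required) and return the hit count."""
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])


# Hypothetical window standing in for "the past 5 years".
url = build_pubmed_count_url(["microbiome", "microbiota", "microflora"], 2013, 2017)
print(url)
```

Calling `fetch_count(url)` would return the current number of matching records; because PubMed is continuously updated, the value need not match the figure quoted in the text.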
Starting as early as the embryonic stage via the maternal microbiome, and developing rapidly in the first 3–5 years of life [275, 276], our microbial partners help shape the development of our CNS and behavior. Throughout our lifespan, the gut microbiome contributes to neurological and mental health. The cross talk between the microflora ecology and the host’s physiology is based on interactions at the genetic, protein and metabolic levels on both sides. The studies previously mentioned in this review highlight the gut microbiome as a modulator of brain development and neurotransmitter signaling systems but also as a mediator of neurological, mental and behavioral function in adults. We are confronted with vast networks of signals and interactions, in which we are called to identify the components essential for homeostasis and to understand what perturbations dysbiosis introduces. In such a dynamic ecosystem, it is important that research focus on the factors that drive permanent or reversible changes essential to a variety of functions, and on their involvement in molecular mechanisms. These fundamental biological mechanisms can be explored via novel high-throughput computational methodologies that combine and analyze the evolution of the microbial communities and their genetic composition, the interaction between microbial and host biological systems and the effects of external environmental factors on the microbial–host ecosystem. More specifically, cross-analysis of computational metagenomics and host genetic susceptibility/genomic background will provide new insights into the onset and progression of CNS disease. In addition, characterizing and quantifying the genomic composition of the microbiome under different environmental factors can provide information on the microbiome’s role as a cause or effect of disease, something that is currently under investigation.
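To make the idea of quantifying microbiome composition under different conditions concrete, the sketch below normalizes read counts to relative abundances and compares two samples with the Bray–Curtis dissimilarity. Bray–Curtis is chosen here only as a common ecological measure for illustration; the text does not prescribe it, and the taxa and counts are invented.

```python
def relative_abundance(counts):
    """Convert raw per-taxon read counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for disjoint ones."""
    taxa = set(profile_a) | set(profile_b)
    # Sum of the lesser abundance of each taxon across the two samples.
    shared = sum(min(profile_a.get(t, 0.0), profile_b.get(t, 0.0)) for t in taxa)
    total = sum(profile_a.values()) + sum(profile_b.values())
    return 1.0 - 2.0 * shared / total


def abundance_shift(profile_a, profile_b):
    """Per-taxon change in relative abundance from sample A to sample B."""
    taxa = sorted(set(profile_a) | set(profile_b))
    return {t: profile_b.get(t, 0.0) - profile_a.get(t, 0.0) for t in taxa}


# Hypothetical read counts for one host under two conditions (made-up data).
baseline = relative_abundance({"Bacteroides": 400, "Lactobacillus": 100, "Clostridium": 500})
after_diet = relative_abundance({"Bacteroides": 300, "Lactobacillus": 300, "Clostridium": 400})

dissimilarity = bray_curtis(baseline, after_diet)
shift = abundance_shift(baseline, after_diet)
print(round(dissimilarity, 3), {t: round(d, 3) for t, d in shift.items()})
```

A value near 0 would indicate that the condition barely altered the community, whereas larger values flag a compositional change worth following up with functional analysis.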
By translating the biological networks into computational ones, which include host-omics, meta-omics and related phenotypes in tandem, we can construct prediction models that can reveal valuable information on metabolic and other molecular components as well as signaling pathways involved in brain health and disease. The development of new combinational databases [277], which disseminate the knowledge derived from our research, will help to make it accessible and usable by other investigators. These novel bioinformatics avenues lead to a better understanding of neurological and mental disease by pinpointing the modifiable factors that influence the microbiome and act as regulators of health. The outcome of this knowledge can be new therapeutic strategies that complement a possible prognostic and diagnostic role of the gut microbiome in medicine, at a personalized as well as a general population level.
Key Points
The gut–brain axis is a complex communication system mediating human health.
The microflora–gut–brain axis is based on a bidirectional relationship.
Shotgun and 16S rRNA sequencing precision is essential for our data.
Computational downstream analysis of the microbiome provides answers regarding its composition and function.
Microbiome research could offer a novel approach to precision medicine.
Funding
G.M.S. holds the Bioinformatics ERA Chair position funded by the European Commission Research Executive Agency (REA) Grant BIORISE (grant number 669026), under the Spreading Excellence, Widening Participation, Science with and for Society Framework.
Nikolas Dovrolis is a PhD candidate of Pharmacology at the Democritus University of Thrace. He is a Computer Science graduate with a Master’s Degree in Molecular Biology and Genetics.
George Kolios, MD, PhD, is a Professor of Pharmacology at Democritus University of Thrace, Greece. He is a clinical Gastroenterologist, with extensive research in mucosal immunology, focused on intestinal inflammation and microbiota.
George M.
Spyrou, PhD, holds the Bioinformatics ERA Chair and is the Head of the Bioinformatics Group at the Cyprus Institute of Neurology and Genetics.
Ioanna Maroulakou, PhD, is a Professor of Genetics at Democritus University of Thrace and has extensive experience and expertise in translational research and acquired genetic disorders, including neurodegenerative diseases.
References
1 Berg RD. The indigenous gastrointestinal microflora. Trends Microbiol 1996;4(11):430–5. http://dx.doi.org/10.1016/0966-842X(96)10057-3
2 Franzosa EA, Huang K, Meadow JF. Identifying personal microbiomes using metagenomic codes. Proc Natl Acad Sci USA 2015;112(22):E2930–8.
3 Sekirov I, Russell SL, Antunes LCM, et al. Gut microbiota in health and disease. Physiol Rev 2010;90(3):859–904. http://dx.doi.org/10.1152/physrev.00045.2009
4 Turnbaugh PJ, Hamady M, Yatsunenko T, et al. A core gut microbiome in obese and lean twins. Nature 2009;457(7228):480–4. http://dx.doi.org/10.1038/nature07540
5 Levy M, Blacher E, Elinav E. Microbiome, metabolites and host immunity. Curr Opin Microbiol 2017;35:8–15. http://dx.doi.org/10.1016/j.mib.2016.10.003
6 Carding S, Verbeke K, Vipond DT, et al. Dysbiosis of the gut microbiota in disease. Microb Ecol Health Dis 2015;26(0).
7 Flint HJ, Scott KP, Louis P, et al. The role of the gut microbiota in nutrition and health. Nat Rev Gastroenterol Hepatol 2012;9(10):577–89. http://dx.doi.org/10.1038/nrgastro.2012.156
8 Tognini P. Gut microbiota: a potential regulator of neurodevelopment. Front Cell Neurosci 2017;11:25.
9 Rogers G, Keating D, Young R, et al. From gut dysbiosis to altered brain function and mental illness: mechanisms and pathways. Mol Psychiatry 2016;21(6):738–48. http://dx.doi.org/10.1038/mp.2016.50
10 Gruber J, Kennedy BK. Microbiome and longevity: gut microbes send signals to host mitochondria. Cell 2017;169(7):1168–9. http://dx.doi.org/10.1016/j.cell.2017.05.048
11 Han B, Sivaramakrishnan P, Lin C-CJ, et al. Microbial genetic composition tunes host longevity. Cell 2017;169(7):1249–62.e1213.
12 Lee E-S, Song E-J, Nam Y-D. Dysbiosis of gut microbiome and its impact on epigenetic regulation. J Clin Epigene 2017, in press.
13 Krautkramer KA, Kreznar JH, Romano KA, et al. Diet-microbiota interactions mediate global epigenetic programming in multiple host tissues. Mol Cell 2016;164:982–92.
14 Tse JKY. Gut microbiota, nitric oxide and microglia as pre-requisites for neurodegenerative disorders. ACS Chem Neurosci 2017;8:1438–47. http://dx.doi.org/10.1021/acschemneuro.7b00176
15 Round JL, Mazmanian SK. The gut microbiota shapes intestinal immune responses during health and disease. Nat Rev Immunol 2009;9(5):313–23. http://dx.doi.org/10.1038/nri2515
16 Zmora N, Bashiardes S, Levy M, et al. The role of the immune system in metabolic health and disease. Cell Metab 2017;25(3):506–21. http://dx.doi.org/10.1016/j.cmet.2017.02.006
17 Braniste V, Al-Asmakh M, Kowal C, et al. The gut microbiota influences blood-brain barrier permeability in mice. Sci Transl Med 2014;6(263):263ra158.
18 Holleran G, Lopetuso L, Ianiro G, et al. Gut microbiota and inflammatory bowel disease: an update. Minerva Gastroenterol Dietol 2017;63:373–84.
19 Tang WW, Hazen SL. The gut microbiome and its role in cardiovascular diseases. Circulation 2017;135(11):1008–10. http://dx.doi.org/10.1161/CIRCULATIONAHA.116.024251
20 Drosos I, Tavridou A, Kolios G. New aspects on the metabolic role of intestinal microbiota in the development of atherosclerosis. Metabolism 2015;64(4):476–81. http://dx.doi.org/10.1016/j.metabol.2015.01.007
21 Stefanaki C, Peppa M, Mastorakos G, et al. Examining the gut bacteriome, virome, and mycobiome in glucose metabolism disorders: are we on the right track? Metabolism 2017;73:52–66.
22 Bhutia YD, Ogura J, Sivaprakasam S, et al. Gut microbiome and colon cancer: role of bacterial metabolites and their molecular targets in the host. Curr Colorectal Cancer Rep 2017;13(2):111–18. http://dx.doi.org/10.1007/s11888-017-0362-9
23 Bouter KE, van Raalte DH, Groen AK, et al. Role of the gut microbiome in the pathogenesis of obesity and obesity-related metabolic dysfunction. Gastroenterology 2017;13:111–18.
24 Liu J, Williams B, Frank D, et al. Inside out: HIV, the gut microbiome, and the mucosal immune system. J Immunol 2017;198(2):605–14. http://dx.doi.org/10.4049/jimmunol.1601355
25 Nallu A, Sharma S, Ramezani A, et al. Gut microbiome in chronic kidney disease: challenges and opportunities. Transl Res 2017;179:24–37. http://dx.doi.org/10.1016/j.trsl.2016.04.007
26 Ruff WE, Vieira SM, Kriegel MA. The role of the gut microbiota in the pathogenesis of antiphospholipid syndrome. Curr Rheumatol Rep 2015;17(1):472. http://dx.doi.org/10.1007/s11926-014-0472-1
27 Wang Y, Kasper LH. The role of microbiome in central nervous system disorders. Brain Behav Immun 2014;38:1–12. http://dx.doi.org/10.1016/j.bbi.2013.12.015
28 Dinan TG, Cryan JF. The impact of gut microbiota on brain and behaviour: implications for psychiatry. Curr Opin Clin Nutr Metabol Care 2015;18(6):552–8. http://dx.doi.org/10.1097/MCO.0000000000000221
29 Gershon M. The Second Brain: A Groundbreaking New Understanding of Nervous Disorders of the Stomach and Intestine. Harper Collins, New York, 1999.
30 Furness JB, Costa M. The enteric nervous system. Churchill Livingstone, Edinburgh etc., 1987.
31 Furness JB. The enteric nervous system and neurogastroenterology. Nat Rev Gastroenterol Hepatol 2012;9(5):286–94. http://dx.doi.org/10.1038/nrgastro.2012.32
32 Holzer P, Farzi A. Neuropeptides and the microbiota-gut-brain axis. In: Microbial Endocrinology: The Microbiota-Gut-Brain Axis in Health and Disease. Springer, New York, 2014, 195–219.
33 Cryan JF, O'Mahony SM. The microbiome-gut-brain axis: from bowel to behavior. Neurogastroenterol Motil 2011;23(3):187–92.
34 Bauer KC, Huus KE, Finlay BB. Microbes and the mind: emerging hallmarks of the gut microbiota–brain axis. Cell Microbiol 2016;18(5):632–44. http://dx.doi.org/10.1111/cmi.12585
35 Sampson TR, Mazmanian SK. Control of brain development, function, and behavior by the microbiome. Cell Host Microbe 2015;17(5):565–76. http://dx.doi.org/10.1016/j.chom.2015.04.011
36 Le Floc’h N, Otten W, Merlot E. Tryptophan metabolism, from nutrition to potential therapeutic applications. Amino Acids 2011;41(5):1195–205.
37 Erny D, Hrabě de Angelis AL, Jaitin D, et al. Host microbiota constantly control maturation and function of microglia in the CNS. Nat Neurosci 2015;18(7):965–77.
38 Bellono NW, Bayrer JR, Leitch DB, et al. Enterochromaffin cells are gut chemosensors that couple to sensory neural pathways. Cell 2017;170:185–98.e16.
39 Greathouse KL, Faucher MA, Hastings-Tolsma M. The gut microbiome, obesity, and weight control in women's reproductive health. West J Nurs Res 2017;39:1094–119.
40 Komaroff AL. The microbiome and risk for obesity and diabetes. JAMA 2017;317(4):355–6. http://dx.doi.org/10.1001/jama.2016.20099
41 Sanmiguel CP, Jacobs J, Gupta A, et al. Surgically induced changes in gut microbiome and hedonic eating as related to weight loss: preliminary findings in obese women undergoing bariatric surgery. Psychosomatic Med 2017;79:880–7. http://dx.doi.org/10.1097/PSY.0000000000000494
42 Tap J, Derrien M, Törnblom H, et al. Identification of an intestinal microbiota signature associated with severity of irritable bowel syndrome. Gastroenterology 2017;152(1):111–23.e118.
43 Ringel Y. The gut microbiome in irritable bowel syndrome and other functional bowel disorders. Gastroenterol Clin N Am 2017;46(1):91–101. http://dx.doi.org/10.1016/j.gtc.2016.09.014
44 Mahurkar-Joshi S, Labus JS, Jacobs J, et al. 143-Colonic mucosal microbiome is associated with mucosal microRNA expression in irritable bowel syndrome. Gastroenterology 2017;152(5):S40–1.
45 Sanger GJ, Lee K. Hormones of the gut-brain axis as targets for the treatment of upper gastrointestinal disorders. Nat Rev Drug Discov 2008;7(3):241. http://dx.doi.org/10.1038/nrd2444
46 Pärtty A, Kalliomäki M. Infant colic is still a mysterious disorder of the microbiota–gut–brain axis. Acta Paediatrica 2017;106(4):528–9.
47 Tremlett H, Bauer KC, Appel-Cresswell S, et al. The gut microbiome in human neurological disease: a review. Ann Neurol 2017;81:369–82.
48 Yang I, Corwin EJ, Brennan PA, et al. The infant microbiome: implications for infant health and neurocognitive development. Nurs Res 2016;65(1):76–88. http://dx.doi.org/10.1097/NNR.0000000000000133
49 Sharon G, Sampson TR, Geschwind DH, et al. The central nervous system and the gut microbiome. Cell 2016;167(4):915–32. http://dx.doi.org/10.1016/j.cell.2016.10.027
50 Desbonnet L, Clarke G, Traplin A, et al. Gut microbiota depletion from early adolescence in mice: Implications for brain and behaviour. Brain Behav Immun 2015;48:165–73. http://dx.doi.org/10.1016/j.bbi.2015.04.004
51 Li Q, Han Y, Dy ABC, et al. The gut microbiota and autism spectrum disorders. Front Cell Neurosci 2017;11:120. http://dx.doi.org/10.3389/fncel.2017.00120
52 Braun J. Tightening the Case for Gut Microbiota in Autism-Spectrum Disorder. Elsevier, 2017.
53 Ding HT, Taur Y, Walkup JT. Gut microbiota and autism: key concepts and findings. J Autism Dev Disord 2016;47:480–9.
54 Strati F, Cavalieri D, Albanese D, et al. New evidences on the altered gut microbiota in autism spectrum disorders. Microbiome 2017;5(1):24. http://dx.doi.org/10.1186/s40168-017-0242-1
55 Gogou M, Gogou C. The effect of intestinal microbiome on autism spectrum disorder. J Pediatr Sci 2016;8(0).
56 Vuong HE, Hsiao EY. Emerging roles for the gut microbiome in autism spectrum disorder. Biol Psychiatry 2017;81(5):411–23. http://dx.doi.org/10.1016/j.biopsych.2016.08.024
57 Schwarz E, Maukonen J, Hyytiäinen T, et al. Analysis of microbiota in first episode psychosis identifies preliminary associations with symptom severity and treatment response. Schizophr Res 2017, doi: 10.1016/j.schres.2017.04.017.
58 Evans SJ, Bassis CM, Hein R, et al. The gut microbiome composition associates with bipolar disorder and illness severity. J Psychiatr Res 2017;87:23–9. http://dx.doi.org/10.1016/j.jpsychires.2016.12.007
59 Nieto R, Kukuljan M, Silva H. BDNF and schizophrenia: from neurodevelopment to neuronal plasticity, learning, and memory. Front Psychiatry 2013;4:45.
60 Marin IA, Goertz JE, Ren T, et al. Microbiota alteration is associated with the development of stress-induced despair behavior. Sci Rep 2017;7:43859. http://dx.doi.org/10.1038/srep43859
61 Lothian J, Blampied NM, Rucklidge JJ. Effect of micronutrients on insomnia in adults: a multiple-baseline study. Clin Psychol Sci 2016;4(6):2167702616631740.
62 D’Mello C, Swain MG. Immune-to-brain communication pathways in inflammation-associated sickness and depression. In: Inflammation-Associated Depression: Evidence, Mechanisms and Implications. Springer, Switzerland, 2017:73–94.
63 MacQueen G, Surette M, Moayyedi P. The gut microbiota and psychiatric illness. J Psychiatry Neurosci 2017;42(2):75. http://dx.doi.org/10.1503/jpn.170028
64 Hoban A, Stilling R, Moloney G, et al. The microbiome regulates amygdala-dependent fear recall. Mol Psychiatry 2017, doi: 10.1038/mp.2017.100.
65 Zheng P, Zeng B, Zhou C, et al. Gut microbiome remodeling induces depressive-like behaviors through a pathway mediated by the host's metabolism. Mol Psychiatry 2016;21(6):786–96.
66 Benakis C, Brea D, Caballero S, et al. Commensal microbiota affects ischemic stroke outcome by regulating intestinal γδ T cells. Nat Med 2016;22:516–23. http://dx.doi.org/10.1038/nm.4068
67 Zhang Y-g, Wu S, Yi J, et al. Target intestinal microbiota to alleviate disease progression in amyotrophic lateral sclerosis. Clin Ther 2017;39:322–36. http://dx.doi.org/10.1016/j.clinthera.2016.12.014
68 Mirza A, Mao-Draayer Y. The gut microbiome and microbial translocation in multiple sclerosis. Clin Immunol 2017, doi: 10.1016/j.clim.2017.03.001.
69 Hill-Burns EM, Debelius JW, Morton JT, et al. Parkinson's disease and Parkinson's disease medications have distinct signatures of the gut microbiome. Mov Disord 2017;32:739–49.
70 Pistollato F, Sumalla Cano S, Elio I, et al. Role of gut microbiota and nutrients in amyloid formation and pathogenesis of Alzheimer disease. Nutr Rev 2016;74(10):624–34. http://dx.doi.org/10.1093/nutrit/nuw023
71 Bonfili L, Cecarini V, Berardi S, et al. Microbiota modulation counteracts Alzheimer's disease progression influencing neuronal proteolysis and gut hormones plasma levels. Sci Rep 2017;7(1):2426.
72 Schuster SC. Next-generation sequencing transforms today's biology. Nat Methods 2008;5(1):16. http://dx.doi.org/10.1038/nmeth1156
73 Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet 2010;11(1):31–46.
74 Jovel J, Patterson J, Wang W, et al. Characterization of the gut microbiome using 16S or shotgun metagenomics. Front Microbiol 2016;7:459.
75 Ranjan R, Rani A, Metwally A, et al. Analysis of the microbiome: advantages of whole genome shotgun versus 16S amplicon sequencing. Biochem Biophys Res Commun 2016;469(4):967–77. http://dx.doi.org/10.1016/j.bbrc.2015.12.083
76 Cui L, Morris A, Ghedin E. The human mycobiome in health and disease. Genome Med 2013;5(7):63. http://dx.doi.org/10.1186/gm467
77 Huseyin CE, O’Toole PW, Cotter PD, et al. Forgotten fungi – the gut mycobiome in human health and disease. FEMS Microbiol Rev 2017;41:479–511.
78 Zhao G, Wu G, Lim ES, et al. VirusSeeker, a computational pipeline for virus discovery and virome composition analysis. Virology 2017;503:21–30. http://dx.doi.org/10.1016/j.virol.2017.01.005
79 Czeczko P, Greenway SC, de Koning A. EzMap: a simple pipeline for reproducible analysis of the human virome. Bioinformatics 2017;33:2573–4. http://dx.doi.org/10.1093/bioinformatics/btx202
80 Handelsman J, Rondon MR, Brady SF, et al. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 1998;5(10):R245–9.
81 Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol 2010;6(2):e1000667.
82 Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 2004;68(4):669–85. http://dx.doi.org/10.1128/MMBR.68.4.669-685.2004
83 Sharpton TJ. An introduction to the analysis of shotgun metagenomic data. Front Plant Sci 2014;5:209.
84 Head SR, Komori HK, LaMere SA, et al. Library construction for next-generation sequencing: overviews and challenges. Biotechniques 2014;56(2):61.
85 Rintala A, Pietilä S, Munukka E, et al. Gut microbiota analysis results are highly dependent on the 16S rRNA gene target region, whereas the impact of DNA extraction is minor. J Biomol Tech 2017;28:19.
86 Olson ND, Treangen TJ, Hill CM, et al. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief Bioinform 2017, doi: 10.1093/bib/bbx098.
87 Bradley RD, Hillis DM. Recombinant DNA sequences generated by PCR amplification. Mol Biol Evol 1997;14(5):592–3. http://dx.doi.org/10.1093/oxfordjournals.molbev.a025797
88 Jackman SD, Vandervalk BP, Mohamadi H, et al. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res 2017;27:768–77. http://dx.doi.org/10.1101/gr.214346.116
89 Koren S, Treangen TJ, Pop M. Bambus 2: scaffolding metagenomes. Bioinformatics 2011;27(21):2964–71. http://dx.doi.org/10.1093/bioinformatics/btr520
90 Lin Y-Y, Hsieh C-H, Chen J-H, et al. De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline. BMC Bioinformatics 2017;18(1):223. http://dx.doi.org/10.1186/s12859-017-1630-z
91 Mysara M, Saeys Y, Leys N, et al. CATCh, an ensemble classifier for chimera detection in 16S rRNA sequencing studies. Appl Environ Microbiol 2015;81(5):1573–84. http://dx.doi.org/10.1128/AEM.02896-14
92 Haas BJ, Gevers D, Earl AM, et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res 2011;21(3):494–504. http://dx.doi.org/10.1101/gr.112730.110
93 Sayols S, Scherzinger D, Klein H. dupRadar: a Bioconductor package for the assessment of PCR artifacts in RNA-Seq data. BMC Bioinformatics 2016;17(1):428. http://dx.doi.org/10.1186/s12859-016-1276-2
94 Schirmer M, D’Amore R, Ijaz UZ, et al. Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data. BMC Bioinformatics 2016;17(1):125.
95 Peng Y, Leung HC, Yiu S-M, et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 2012;28(11):1420–8. http://dx.doi.org/10.1093/bioinformatics/bts174
96 Jeraldo P, Kalari K, Chen X, et al. IM-TORNADO: a tool for comparison of 16S reads from paired-end libraries. PLoS One 2014;9(12):e114804.
97 Lai B, Wang F, Wang X, et al. InteMAP: Integrated metagenomic assembly pipeline for NGS short reads. BMC Bioinformatics 2015;16(1):244. http://dx.doi.org/10.1186/s12859-015-0686-x
98 Mysara M, Leys N, Raes J, et al. IPED: a highly efficient denoising tool for Illumina MiSeq Paired-end 16S rRNA gene amplicon sequencing data. BMC Bioinformatics 2016;17(1):192. http://dx.doi.org/10.1186/s12859-016-1061-2
99 Lai B, Ding R, Li Y, et al. A de novo metagenomic assembly program for shotgun DNA reads. Bioinformatics 2012;28(11):1455–62. http://dx.doi.org/10.1093/bioinformatics/bts162
100 Parikh HI, Koparde VN, Bradley SP, et al. MeFiT: merging and filtering tool for illumina paired-end reads for 16S rRNA amplicon sequencing. BMC Bioinformatics 2016;17(1):491. http://dx.doi.org/10.1186/s12859-016-1358-1
101 Li D, Liu C-M, Luo R, et al. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 2015;31(10):1674–6. http://dx.doi.org/10.1093/bioinformatics/btv033
102 Unno T. Bioinformatic suggestions on MiSeq-based microbial community analysis. J Microbiol Biotechnol 2015;25(6):765–70. http://dx.doi.org/10.4014/jmb.1409.09057
103 Treangen TJ, Koren S, Sommer DD, et al. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol 2013;14(1):R2.
104 Nurk S, Meleshko D, Korobeynikov A, et al. metaSPAdes: a new versatile de novo metagenomics assembler. Genome Res 2017;27:824–34.
105 Namiki T, Hachiya T, Tanaka H, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 2012;40(20):e155.
106 Schloss PD, Westcott SL, Ryabin T, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009;75(23):7537–41. http://dx.doi.org/10.1128/AEM.01541-09
107 Mysara M, Leys N, Raes J, et al. NoDe: a fast error-correction algorithm for pyrosequencing amplicon reads. BMC Bioinformatics 2015;16:88. http://dx.doi.org/10.1186/s12859-015-0520-5
108 Mysara M, Njima M, Leys N, et al. From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data. Gigascience 2017;6(2):1–10. http://dx.doi.org/10.1093/gigascience/giw017
109 Cuccuru G, Orsini M, Pinna A, et al. Orione, a web-based framework for NGS analysis in microbiology. Bioinformatics 2014;30(13):1928–9. http://dx.doi.org/10.1093/bioinformatics/btu135
110 Ruby JG, Bellare P, DeRisi JL. PRICE: software for the targeted assembly of components of (Meta) genomic sequence data. G3 2013;3:865–80. http://dx.doi.org/10.1534/g3.113.005967
111 Caporaso JG, Kuczynski J, Stombaugh J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 2010;7(5):335–6. http://dx.doi.org/10.1038/nmeth.f.303
112 Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 2016;32:292–4.
113 Boisvert S, Raymond F, Godzaridis É, et al. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol 2012;13(12):R122.
114 Mangul S, Yang HT, Strauli N, et al. Dumpster diving in RNA-sequencing to find the source of every last read. bioRxiv 2016:053041.
115 Hardwick SA, Chen WY, Wong T, et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat Methods 2016;13(9):792–8. http://dx.doi.org/10.1038/nmeth.3958
116 Pimentel H, Bray N, Puente S, et al. Supplementary materials for “Differential analysis of RNA-Seq incorporating quantification uncertainty”. bioRxiv 2016.
117 Gregor I, Schönhuth A, McHardy AC. Snowball: strain aware gene assembly of metagenomes. Bioinformatics 2016;32(17):i649–57.
118 Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30:2114–20. http://dx.doi.org/10.1093/bioinformatics/btu170
119 Edgar RC, Haas BJ, Clemente JC, et al. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 2011;27(16):2194–200. http://dx.doi.org/10.1093/bioinformatics/btr381
120 Rognes T, Flouri T, Nichols B, et al. VSEARCH: a versatile open source tool for metagenomics. PeerJ 2016;4:e2584.
121 Wang Q, Fish JA, Gilman M, et al. Xander: employing a novel method for efficient gene-targeted metagenomic assembly. Microbiome 2015;3(1):32. http://dx.doi.org/10.1186/s40168-015-0093-6
122 Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinformatics 2009;25(10):1335–7. http://dx.doi.org/10.1093/bioinformatics/btp157
123 Edgar RC. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods 2013;10(10):996–8. http://dx.doi.org/10.1038/nmeth.2604
124 Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010;26(19):2460–1. http://dx.doi.org/10.1093/bioinformatics/btq461
125 Li W, Fu L, Niu B, et al. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform 2012;13:656–68. http://dx.doi.org/10.1093/bib/bbs035
126 Caporaso JG, Bittinger K, Bushman FD, et al. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 2010;26(2):266–7. http://dx.doi.org/10.1093/bioinformatics/btp636
127 Bengtsson-Palme J, Hartmann M, Eriksson KM, et al. METAXA2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Mol Ecol Resour 2015;15(6):1403–14.
128 Oh J, Choi C-H, Park M-K, et al. Clustom-cloud: In-memory data grid-based software for clustering 16S rRNA sequence data in the cloud environment. PLoS One 2016;11(3):e0151064.
129 Mahé F, Rognes T, Quince C, et al. Swarm v2: highly-scalable and high-resolution amplicon clustering. PeerJ 2015;3:e1420.
130 Westcott SL, Schloss PD. OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units. mSphere 2017;2:e00073-17.
131 Al-Ghalith GA, Montassier E, Ward HN, et al. NINJA-OPS: fast accurate marker gene alignment using concatenated ribosomes. PLoS Comput Biol 2016;12(1):e1004658.
132 Alneberg J, Bjarnason BS, De Bruijn I, et al. Binning metagenomic contigs by coverage and composition. Nat Methods 2014;11(11):1144–46. http://dx.doi.org/10.1038/nmeth.3103
133 Imelfort M, Parks D, Woodcroft BJ, et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes.
PeerJ 2014 ; 2 : e603. Google Scholar Crossref Search ADS PubMed 134 Ulyantsev VI , Kazakov SV , Dubinkina VB , et al. MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data . Bioinformatics 2016 ; 32 : 2760 – 7 . http://dx.doi.org/10.1093/bioinformatics/btw312 Google Scholar Crossref Search ADS PubMed 135 Kang DD , Froula J , Egan R , et al. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities . PeerJ 2015 ; 3 : e1165. Google Scholar Crossref Search ADS PubMed 136 Wu Y-W , Simmons BA , Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets . Bioinformatics 2015 ; 32 : 605 – 7 . Google Scholar Crossref Search ADS PubMed 137 Laczny CC , Sternal T , Plugaru V , et al. VizBin-an application for reference-independent visualization and human-augmented binning of metagenomic data . Microbiome 2015 ; 3 ( 1 ): 1 . http://dx.doi.org/10.1186/s40168-014-0066-1 Google Scholar Crossref Search ADS PubMed 138 Lu YY , Chen T , Fuhrman JA , et al. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment, and paired-end read LinkAge . Bioinformatics 2017 ; 33 : 791 – 8 . Google Scholar PubMed 139 Girotto S , Pizzi C , Comin M. MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures . Bioinformatics 2016 ; 32 ( 17 ): i567 – 75 . Google Scholar Crossref Search ADS PubMed 140 Cole JR , Wang Q , Fish JA , et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis . Nucleic Acids Res 2014 ; 42 : D633 – 42 . Google Scholar Crossref Search ADS PubMed 141 DeSantis TZ , Hugenholtz P , Larsen N , et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB . Appl Environ Microbiol 2006 ; 72 ( 7 ): 5069 – 72 . http://dx.doi.org/10.1128/AEM.03006-05 Google Scholar Crossref Search ADS PubMed 142 Pruesse E , Quast C , Knittel K , et al. 
SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB . Nucleic Acids Res 2007 ; 35 ( 21 ): 7188 – 96 . http://dx.doi.org/10.1093/nar/gkm864 Google Scholar Crossref Search ADS PubMed 143 O'Leary NA , Wright MW , Brister JR , et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation . Nucleic Acids Res 2016 ; 44 : D733 – 45 . Google Scholar Crossref Search ADS PubMed 144 Forster SC , Browne HP , Kumar N , et al. HPMCD: the database of human microbial communities from metagenomic datasets and microbial reference genomes . Nucleic Acids Res 2016 ; 44 ( D1 ): D604 – 9 . Google Scholar Crossref Search ADS PubMed 145 Flygare S , Simmon K , Miller C , et al. Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling . Genome Biol 2016 ; 17 ( 1 ): 111 . http://dx.doi.org/10.1186/s13059-016-0969-1 Google Scholar Crossref Search ADS PubMed 146 Cox JW , Ballweg RA , Taft DH , et al. A fast and robust protocol for metataxonomic analysis using RNAseq data . Microbiome 2017 ; 5 ( 1 ): 7 . http://dx.doi.org/10.1186/s40168-016-0219-5 Google Scholar Crossref Search ADS PubMed 147 Gao X , Lin H , Revanna K , et al. A Bayesian taxonomic classification method for 16S rRNA gene sequences with improved species-level accuracy . BMC Bioinformatics 2017 ; 18 ( 1 ): 247 . http://dx.doi.org/10.1186/s12859-017-1670-4 Google Scholar Crossref Search ADS PubMed 148 Allard G , Ryan FJ , Jeffery IB , et al. SPINGO: a rapid species-classifier for microbial amplicon sequences . BMC Bioinformatics 2015 ; 16 : 324.http://dx.doi.org/10.1186/s12859-015-0747-1 Google Scholar Crossref Search ADS PubMed 149 Segata N , Waldron L , Ballarini A , et al. Metagenomic microbial community profiling using unique clade-specific marker genes . Nat Methods 2012 ; 9 ( 8 ): 811 – 14 . 
http://dx.doi.org/10.1038/nmeth.2066 Google Scholar Crossref Search ADS PubMed 150 Huson DH , Weber N. Microbial community analysis using MEGAN . Methods Enzymol 2013 ; 531 : 465 – 85 . Google Scholar Crossref Search ADS PubMed 151 Kim D , Song L , Breitwieser FP , et al. Centrifuge: rapid and sensitive classification of metagenomic sequences . Genome Res 2016 ; 26 : 1721 – 9 . http://dx.doi.org/10.1101/gr.210641.116 Google Scholar Crossref Search ADS PubMed 152 Petersen TN , Lukjancenko O , Thomsen MCF , et al. MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads . PLoS One 2017 ; 12 ( 6 ): e0176469 . Google Scholar Crossref Search ADS PubMed 153 Luo Y , Yu YW , Zeng J , et al. Metagenomic binning through low density hashing . bioRxiv 2017 : 133116 . 154 Henry VJ , Bandrowski AE , Pepin A-S , et al. OMICtools: an informative directory for multi-omic data analysis . Database 2014 ; 2014 : bau069. Google Scholar Crossref Search ADS PubMed 155 Comeau AM , Douglas GM , Langille MGI , Eisen J. Microbiome helper: a custom and streamlined workflow for microbiome research . mSystems 2017 ; 2 ( 1 ): e00127-16 . Google Scholar Crossref Search ADS PubMed 156 Kultima JR , Sunagawa S , Li J , et al. MOCAT: a metagenomics assembly and gene prediction toolkit . PLoS One 2012 ; 7 ( 10 ): e47656 . Google Scholar Crossref Search ADS PubMed 157 Narayanasamy S , Jarosz Y , Muller EE , et al. IMP: a pipeline for reproducible metagenomic and metatranscriptomic analyses . bioRxiv 2016 : 039263 . 158 Lin H-H , Liao Y-C. drVM: a new tool for efficient genome assembly of known eukaryotic viruses from metagenomes . Gigascience 2017 ; 6 ( 2 ): 1 – 10 . http://dx.doi.org/10.1093/gigascience/gix003 Google Scholar Crossref Search ADS 159 Broeksema B , Calusinska M , McGee F , et al. ICoVeR–an interactive visualization tool for verification and refinement of metagenomic bins . BMC Bioinformatics 2017 ; 18 ( 1 ): 233 . 
http://dx.doi.org/10.1186/s12859-017-1653-5 Google Scholar Crossref Search ADS PubMed 160 Kerepesi C , Bánky D , Grolmusz V. AmphoraNet: the webserver implementation of the AMPHORA2 metagenomic workflow suite . Gene 2014 ; 533 ( 2 ): 538 – 40 . Google Scholar Crossref Search ADS PubMed 161 Fosso B , Santamaria M , D’Antonio M , et al. MetaShot: an accurate workflow for taxon classification of host-associated microbiome from shotgun metagenomic data . Bioinformatics 2017 ; 33 : 1730 – 2 . Google Scholar PubMed 162 Giongo A , Crabb DB , Davis-Richardson AG , et al. PANGEA: pipeline for analysis of next generation amplicons . ISME J 2010 ; 4 ( 7 ): 852 – 61 . http://dx.doi.org/10.1038/ismej.2010.16 Google Scholar Crossref Search ADS PubMed 163 Office of Cyber Infrastructure and Computational Biology (OCICB) N . Nephele. http://nephele.niaid.nih.gov 2016 . 164 Hildebrand F , Tadeo R , Voigt AY , et al. LotuS: an efficient and user-friendly OTU processing pipeline . Microbiome 2014 ; 2 ( 1 ): 30 . http://dx.doi.org/10.1186/2049-2618-2-30 Google Scholar Crossref Search ADS PubMed 165 Turnbaugh PJ , Ley RE , Hamady M , et al. The human microbiome project: exploring the microbial part of ourselves in a changing world . Nature 2007 ; 449 ( 7164 ): 804 . http://dx.doi.org/10.1038/nature06244 Google Scholar Crossref Search ADS PubMed 166 Mitchell A , Bucchini F , Cochrane G , et al. EBI metagenomics in 2016-an expanding and evolving resource for the analysis and archiving of metagenomic data . Nucleic Acids Res 2016 ; 44 : D595 – 603 . Google Scholar Crossref Search ADS PubMed 167 Peterson J , Garges S , Giovanni M , et al. The NIH human microbiome project . Genome Res 2009 ; 19 ( 12 ): 2317 – 23 . http://dx.doi.org/10.1101/gr.096651.109 Google Scholar Crossref Search ADS PubMed 168 Markowitz VM , Chen I-MA , Palaniappan K , et al. IMG: the integrated microbial genomes database and comparative analysis system . Nucleic Acids Res 2012 ; 40 ( D1 ): D115 – 22 . 
Google Scholar Crossref Search ADS PubMed 169 Hurwitz B. iMicrobe: advancing clinical and environmental microbial research using the iPlant cyberinfrastructure. In: Plant and animal genome XXII conference . San Diego, CA , 2014 . 170 Meyer F , Paarmann D , D'Souza M , et al. The metagenomics RAST server–a public resource for the automatic phylogenetic and functional analysis of metagenomes . BMC Bioinformatics 2008 ; 9 : 386.http://dx.doi.org/10.1186/1471-2105-9-386 Google Scholar Crossref Search ADS PubMed 171 Hyde ER , Sanders J , Tripathi A , et al. Comparing 16S rRNA Marker Gene and Shotgun Metagenomics Datasets in the American Gut Project Using State of the Art Tools . 172 Kovalevskaya NV , Whicher C , Richardson TD , et al. DNAdigest and repositive: connecting the World of Genomic Data . PLoS Biol 2016 ; 14 : e1002418. Google Scholar Crossref Search ADS PubMed 173 Simberloff D. Properties of the rarefaction diversity measurement . Am Nat 1972 ; 106 ( 949 ): 414 – 18 . http://dx.doi.org/10.1086/282781 Google Scholar Crossref Search ADS 174 Lozupone C , Knight R. UniFrac: a new phylogenetic method for comparing microbial communities . Appl Environ Microbiol 2005 ; 71 ( 12 ): 8228 – 35 . http://dx.doi.org/10.1128/AEM.71.12.8228-8235.2005 Google Scholar Crossref Search ADS PubMed 175 Heltshe JF , Forrester NE. Estimating species richness using the jackknife procedure . Biometrics 1983 ; 39 ( 1 ): 1 – 11 . http://dx.doi.org/10.2307/2530802 Google Scholar Crossref Search ADS PubMed 176 Xiao J , Cao H , Chen J. False discovery rate control incorporating phylogenetic tree increases detection power in microbiome-wide multiple testing . Bioinformatics 2017 ; 33 : 2873 – 81 . http://dx.doi.org/10.1093/bioinformatics/btx311 Google Scholar Crossref Search ADS PubMed 177 Buttigieg PL , Ramette A. A guide to statistical analysis in microbial ecology: a community-focused, living review of multivariate data analyses . FEMS Microbiol Ecol 2014 ; 90 ( 3 ): 543 – 50 . 
http://dx.doi.org/10.1111/1574-6941.12437 Google Scholar Crossref Search ADS PubMed 178 Le Cao K-A , Costello M-E , Lakis VA , et al. mixMC: a multivariate statistical framework to gain insight into Microbial Communities . PLoS One 2016 ; 11 : e0160169 . Google Scholar Crossref Search ADS PubMed 179 Ramette A. Multivariate analyses in microbial ecology . FEMS Microbiol Ecol 2007 ; 62 ( 2 ): 142 – 60 . http://dx.doi.org/10.1111/j.1574-6941.2007.00375.x Google Scholar Crossref Search ADS PubMed 180 Yang Y , Chen N , Chen T. mLDM: a new hierarchical Bayesian statistical model for sparse microbioal association discovery . bioRxiv 2016 :042630. 181 Mendes-Soares H , Mundy M , Soares LM , et al. MMinte: an application for predicting metabolic interactions among the microbial species in a community . BMC Bioinformatics 2016 ; 17 : 343 . http://dx.doi.org/10.1186/s12859-016-1230-3 Google Scholar Crossref Search ADS PubMed 182 Shannon P , Markiel A , Ozier O , et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks . Genome Res 2003 ; 13 ( 11 ): 2498 – 504 . http://dx.doi.org/10.1101/gr.1239303 Google Scholar Crossref Search ADS PubMed 183 Bastian M , Heymann S , Jacomy M. Gephi: an open source software for exploring and manipulating networks . ICWSM 2009 ; 8 : 361 – 2 . 184 Vespignani A , Wasserman S , Wernert E , et al. Network Workbench Tool . 185 Segata N , Izard J , Waldron L , et al. Metagenomic biomarker discovery and explanation . Genome Biol 2011 ; 12 ( 6 ): R60 . Google Scholar Crossref Search ADS PubMed 186 Turnbaugh PJ , Ley RE , Mahowald MA , et al. An obesity-associated gut microbiome with increased capacity for energy harvest . Nature 2006 ; 444 ( 7122 ): 1027 – 31 . Google Scholar Crossref Search ADS PubMed 187 Connelly S , Bristol A , Hubert S , et al. Clinical-stage, oral β-lactamase enzyme to prevent clostridium difficile infection triggered by antibiotic-mediated gut microbiome disruption. 
In: Open Forum Infectious Diseases . Oxford University Press , 2016 , 2221 . 188 Alexander JL , Scott A , Mroz A , et al. 91 Mass spectrometry imaging (MSI) of microbiome-metabolome interactions in colorectal cancer . Gastroenterology 2016 ; 150 ( 4 ): S23 . Google Scholar Crossref Search ADS 189 Weir TL , Manter DK , Sheflin AM , et al. Stool microbiome and metabolome differences between colorectal cancer patients and healthy adults . PLoS One 2013 ; 8 ( 8 ): e70803 . Google Scholar Crossref Search ADS PubMed 190 Ward T , Larson J , Meulemans J , et al. BugBase predicts organism level microbiome phenotypes . bioRxiv 2017 :133462. 191 Zakrzewski M , Proietti C , Ellis JJ , et al. Calypso: a user-friendly web-server for mining and visualizing microbiome–environment interactions . Bioinformatics 2016 ; 33 : 782 – 3 . 192 Bose T , Haque MM , Reddy C , et al. COGNIZER: a framework for functional annotation of metagenomic datasets . PLoS One 2015 ; 10 ( 11 ): e0142102 . Google Scholar Crossref Search ADS PubMed 193 Vázquez-Baeza Y , Pirrung M , Gonzalez A , et al. EMPeror: a tool for visualizing high-throughput microbial community data . Gigascience 2013 ; 2 ( 1 ): 16 . Google Scholar Crossref Search ADS PubMed 194 Robertson CE , Harris JK , Wagner BD , et al. Explicet: Graphical user interface software for metadata-driven management, analysis, and visualization of microbiome data . Bioinformatics 2013 ; 29 : 3100 – 1 . http://dx.doi.org/10.1093/bioinformatics/btt526 Google Scholar Crossref Search ADS PubMed 195 Manor O , Borenstein E. Systematic characterization and analysis of the taxonomic drivers of functional shifts in the human microbiome . Cell Host Microbe 2017 ; 21 ( 2 ): 254 – 67 . http://dx.doi.org/10.1016/j.chom.2016.12.014 Google Scholar Crossref Search ADS PubMed 196 Kim J , Kim MS , Koh AY , et al. FMAP: Functional Mapping and Analysis Pipeline for metagenomics and metatranscriptomics studies . BMC Bioinformatics 2016 ; 17 ( 1 ): 420 . 
http://dx.doi.org/10.1186/s12859-016-1278-0 Google Scholar Crossref Search ADS PubMed 197 Rho M , Tang H , Ye Y. FragGeneScan: predicting genes in short and error-prone reads . Nucleic Acids Res 2010 ; 38 ( 20 ): e191 . Google Scholar Crossref Search ADS PubMed 198 Uchiyama T , Irie M , Mori H , et al. FuncTree: functional analysis and visualization for large-scale omics data . PLoS One 2015 ; 10 ( 5 ): e0126967 . Google Scholar Crossref Search ADS PubMed 199 Riehle K , Coarfa C , Jackson A , et al. The Genboree Microbiome Toolset and the analysis of 16S rRNA microbial sequences . BMC Bioinformatics 2012 ; 13(Suppl 13) : S11 . Google Scholar Crossref Search ADS PubMed 200 Kelley DR , Liu B , Delcher AL , et al. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering . Nucleic Acids Res 2012 ; 40 ( 1 ): e9 . Google Scholar Crossref Search ADS PubMed 201 Asnicar F , Weingart G , Tickle TL , et al. Compact graphical representation of phylogenetic data and metadata with GraPhlAn . PeerJ 2015 ; 3 : e1029. Google Scholar Crossref Search ADS PubMed 202 Abubucker S , Segata N , Goll J , et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome . PLoS Comput Biol 2012 ; 8 ( 6 ): e1002358 . Google Scholar Crossref Search ADS PubMed 203 Narayanasamy S , Jarosz Y , Muller EE , et al. IMP: a pipeline for reproducible reference-independent integrated metagenomic and metatranscriptomic analyses . Genome Biol 2016 ; 17 : 260 . http://dx.doi.org/10.1186/s13059-016-1116-8 Google Scholar Crossref Search ADS PubMed 204 Ondov BD , Bergman NH , Phillippy AM. Interactive metagenomic visualization in a Web browser . BMC Bioinformatics 2011 ; 12 : 385.http://dx.doi.org/10.1186/1471-2105-12-385 Google Scholar Crossref Search ADS PubMed 205 Wang Y , Xu L , Gu YQ , et al. MetaCoMET: a web platform for discovery and visualization of the core microbiome . Bioinformatics 2016 ; 32 : 3469 – 70 . 
Google Scholar PubMed 206 Arndt D , Xia J , Liu Y , et al. METAGENassist: a comprehensive web server for comparative metagenomics . Nucleic Acids Res 2012 ; 40 : W88 – 95 . Google Scholar Crossref Search ADS PubMed 207 Wagner J , Chelaru F , Kancherla J , et al. Metaviz: interactive statistical and visual analysis of metagenomic data . bioRxiv 2017 :105205. 208 Dhariwal A , Chong J , Habib S , et al. MicrobiomeAnalyst: a web-based tool for comprehensive statistical, visual and meta-analysis of microbiome data . Nucleic Acids Res 2017 ; 45 : W180 – 8 . Google Scholar Crossref Search ADS PubMed 209 Kultima JR , Coelho LP , Forslund K , et al. MOCAT2: a metagenomic assembly, annotation and profiling framework . Bioinformatics 2016 ; 32 ( 16 ): 2520 – 3 . http://dx.doi.org/10.1093/bioinformatics/btw183 Google Scholar Crossref Search ADS PubMed 210 Jing G , Sun Z , Wang H , et al. Parallel-META 3: Comprehensive taxonomical and functional analysis platform for efficient comparison of microbial communities . Sci Rep 2017 ; 7 : 40371.http://dx.doi.org/10.1038/srep40371 Google Scholar Crossref Search ADS PubMed 211 Soh J , Dong X , Caffrey SM , et al. Phoenix 2: a locally installable large-scale 16S rRNA gene sequence analysis pipeline with Web interface . J Biotechnol 2013 ; 167 ( 4 ): 393 – 403 . http://dx.doi.org/10.1016/j.jbiotec.2013.07.004 Google Scholar Crossref Search ADS PubMed 212 Langille MG , Zaneveld J , Caporaso JG , et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences . Nat Biotechnol 2013 ; 31 ( 9 ): 814 – 21 . http://dx.doi.org/10.1038/nbt.2676 Google Scholar Crossref Search ADS PubMed 213 Hyatt D , Chen G-L , LoCascio PF , et al. Prodigal: prokaryotic gene recognition and translation initiation site identification . BMC Bioinformatics 2010 ; 11 : 119.http://dx.doi.org/10.1186/1471-2105-11-119 Google Scholar Crossref Search ADS PubMed 214 Lagkouvardos I , Fischer S , Kumar N , et al. 
Rhea: a transparent and modular R pipeline for microbial profiling based on 16S rRNA gene amplicons . PeerJ 2017 ; 5 : e2836. Google Scholar Crossref Search ADS PubMed 215 Westreich ST , Korf I , Mills DA , et al. SAMSA: a comprehensive metatranscriptome analysis pipeline . BMC Bioinformatics 2016 ; 17 ( 1 ): 399 . http://dx.doi.org/10.1186/s12859-016-1270-8 Google Scholar Crossref Search ADS PubMed 216 Kaminski J , Gibson MK , Franzosa EA , et al. High-specificity targeted functional profiling in microbial communities with ShortBRED . PLoS Comput Biol 2015 ; 11 ( 12 ): e1004557 . Google Scholar Crossref Search ADS PubMed 217 Parks DH , Tyson GW , Hugenholtz P , et al. STAMP: statistical analysis of taxonomic and functional profiles . Bioinformatics 2014 ; 30 ( 21 ): 3123 – 4 . http://dx.doi.org/10.1093/bioinformatics/btu494 Google Scholar Crossref Search ADS PubMed 218 Aßhauer KP , Wemheuer B , Daniel R , et al. Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data . Bioinformatics 2015 ; 31 ( 17 ): 2882 – 4 . Google Scholar Crossref Search ADS PubMed 219 Huse SM , Welch DBM , Voorhis A , et al. VAMPS: a website for visualization and analysis of microbial population structures . BMC Bioinformatics 2014 ; 15 : 41.http://dx.doi.org/10.1186/1471-2105-15-41 Google Scholar Crossref Search ADS PubMed 220 Nagpal S , Haque MM , Mande SS , Ahmed N. Vikodak-A modular framework for inferring functional potential of microbial communities from 16S metagenomic datasets . PLoS One 2016 ; 11 ( 2 ): e0148347. Google Scholar Crossref Search ADS PubMed 221 Dray S , Dufour A-B. The ade4 package: implementing the duality diagram for ecologists . J Stat Softw 2007 ; 22 ( 4 ): 1 – 20 . Google Scholar Crossref Search ADS 222 Rodriguez-R LM , Konstantinidis KT. The enveomics collection: a toolbox for specialized analyses of microbial genomes and metagenomes . PeerJ Preprints 2016 ; 4 : e1900v1 . 223 Luo D , Ziebell S , An L. 
An informative approach on differential abundance analysis for time-course metagenomic sequencing data . Bioinformatics 2017 ; 33 : 1286 – 92 . Google Scholar Crossref Search ADS PubMed 224 Paulson JN , Stine OC , Bravo HC , Pop M. Differential abundance analysis for microbial marker-gene surveys . Nat Methods 2013 ; 10 ( 12 ): 1200 – 2 . http://dx.doi.org/10.1038/nmeth.2658 Google Scholar Crossref Search ADS PubMed 225 Zhan X , Tong X , Zhao N. A small‐sample multivariate kernel machine test for microbiome association studies . Genet Epidemiol 2017 ; 41 : 210 – 20 . Google Scholar Crossref Search ADS PubMed 226 Cao Y , Zheng X , Li F , et al. mmnet: an R package for metagenomics systems biology analysis . Biomed Res Int 2015 ; 2015 : 1 . 227 McMurdie PJ , Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data . PLoS One 2013 ; 8 ( 4 ): e61217. Google Scholar Crossref Search ADS PubMed 228 Sohn MB , Du R , An L. A robust approach for identifying differentially abundant features in metagenomic samples . Bioinformatics 2015 ; 31 : 2269 – 75 . http://dx.doi.org/10.1093/bioinformatics/btv165 Google Scholar Crossref Search ADS PubMed 229 Cao Y , Wang Y , Zheng X , et al. RevEcoR: an R package for the reverse ecology analysis of microbiomes . BMC Bioinformatics 2016 ; 17 ( 1 ): 294 . http://dx.doi.org/10.1186/s12859-016-1088-4 Google Scholar Crossref Search ADS PubMed 230 Kristiansson E , Hugenholtz P , Dalevi D. ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes . Bioinformatics 2009 ; 25 ( 20 ): 2737 – 8 . http://dx.doi.org/10.1093/bioinformatics/btp508 Google Scholar Crossref Search ADS PubMed 231 Oksanen J , Kindt R , Legendre P , et al. The vegan package . Commun Ecol Package 2007 ; 10 : 631 – 7 . 232 Cockrell C , Christley S , An G , Gabhann FM. Investigation of inflammation and tissue patterning in the gut using a spatially explicit general-purpose model of enteric tissue (SEGMEnT) . 
PLoS Comput Biol 2014 ; 10 ( 3 ): e1003507. Google Scholar Crossref Search ADS PubMed 233 Leber A , Viladomiu M , Hontecillas R , et al. Systems modeling of interactions between mucosal immunity and the gut microbiome during clostridium difficile infection . PLoS One 2015 ; 10 ( 7 ): e0134849 . Google Scholar Crossref Search ADS PubMed 234 Abedi V , Hontecillas R , Hoops S , et al. ENISI multiscale modeling of mucosal immune responses driven by high performance computing. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) . IEEE, 2015 , p. 680–4. 235 Collins FS , Varmus H. A new initiative on precision medicine . N Engl J Med 2015 ; 372 ( 9 ): 793 – 5 . http://dx.doi.org/10.1056/NEJMp1500523 Google Scholar Crossref Search ADS PubMed 236 Initiative PM. Working group, the precision medicine initiative cohort program: building the foundation for 21st century medicine. PMI Working Group Report to the Advisory Committee to the Director, 2015 . 237 Somberg JC. The Human Microbiome and Therapeutics . LWW , 2012 . 238 ElRakaiby M , Dutilh BE , Rizkallah MR , et al. Pharmacomicrobiomics: the impact of human microbiome variations on systems pharmacology and personalized therapeutics . Omics 2014 ; 18 ( 7 ): 402 – 14 . http://dx.doi.org/10.1089/omi.2014.0018 Google Scholar Crossref Search ADS PubMed 239 Kuntz TM , Gilbert JA. Introducing the microbiome into precision medicine . Trends Pharmacol Sci 2017 ; 38 ( 1 ): 81 – 91 . http://dx.doi.org/10.1016/j.tips.2016.10.001 Google Scholar Crossref Search ADS PubMed 240 Johnson KW , Shameer K , Glicksberg BS , et al. Enabling precision cardiology through multiscale biology and systems medicine . JACC Basic Transl Sci 2017 ; 2 ( 3 ): 311 – 27 . http://dx.doi.org/10.1016/j.jacbts.2016.11.010 Google Scholar Crossref Search ADS PubMed 241 Antman EM , Loscalzo J. Precision medicine in cardiology . Nat Rev Cardiol 2016 ; 13 ( 10 ): 591 – 602 . 
http://dx.doi.org/10.1038/nrcardio.2016.101 Google Scholar Crossref Search ADS PubMed 242 Hold GL. The gut microbiota, dietary extremes and exercise . Gut 2014 ; 63 ( 12 ): 1838 – 9 . http://dx.doi.org/10.1136/gutjnl-2014-307305 Google Scholar Crossref Search ADS PubMed 243 Kang SS , Jeraldo PR , Kurti A , et al. Diet and exercise orthogonally alter the gut microbiome and reveal independent associations with anxiety and cognition . Mol Neurodegener 2014 ; 9 ( 1 ): 36 . http://dx.doi.org/10.1186/1750-1326-9-36 Google Scholar Crossref Search ADS PubMed 244 Barton W , Penney NC , Cronin O , et al. The microbiome of professional athletes differs from that of more sedentary subjects in composition and particularly at the functional metabolic level . Gut 2017 , doi: 10.1136/gutjnl-2016-313627. 245 Sandhu KV , Sherwin E , Schellekens H , et al. Feeding the microbiota-gut-brain axis: diet, microbiome, and neuropsychiatry . Transl Res 2017 ; 179 : 223 – 44 . http://dx.doi.org/10.1016/j.trsl.2016.10.002 Google Scholar Crossref Search ADS PubMed 246 Bokulich NA , Chung J , Battaglia T , et al. Antibiotics, birth mode, and diet shape microbiome maturation during early life . Sci Transl Med 2016 ; 8 ( 343 ): 343ra382 . Google Scholar Crossref Search ADS 247 Preidis GA , Versalovic J. Targeting the human microbiome with antibiotics, probiotics, and prebiotics: gastroenterology enters the metagenomics era . Gastroenterology 2009 ; 136 ( 6 ): 2015 – 31 . http://dx.doi.org/10.1053/j.gastro.2009.01.072 Google Scholar Crossref Search ADS PubMed 248 Petschow B , Doré J , Hibberd P , et al. Probiotics, prebiotics, and the host microbiome: the science of translation . Ann N Y Acad Sci 2013 ; 1306 : 1 – 17 . Google Scholar Crossref Search ADS PubMed 249 Damaskos D , Kolios G. Probiotics and prebiotics in inflammatory bowel disease: microflora ‘on the scope’ . Br J Clin Pharmacol 2008 ; 65 ( 4 ): 453 – 67 . Google Scholar Crossref Search ADS PubMed 250 Schrezenmeir J , de Vrese M. 
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Translational bioinformatics in mental health: open access data sources and computational biomarker discovery
Tenenbaum, Jessica D; Bhuvaneshwar, Krithika; Gagliardi, Jane P; Fultz Hollis, Kate; Jia, Peilin; Ma, Liang; Nagarajan, Radhakrishnan; Rakesh, Gopalkumar; Subbian, Vignesh; Visweswaran, Shyam; Zhao, Zhongming; Rozenblit, Leon
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx157 pmid: 29186302
Abstract
Mental illness is increasingly recognized as both a significant cost to society and a significant area of opportunity for biological breakthrough. As -omics and imaging technologies enable researchers to probe molecular and physiological underpinnings of multiple diseases, opportunities arise to explore the biological basis for behavioral health and disease. From individual investigators to large international consortia, researchers have generated rich data sets in the area of mental health, including genomic, transcriptomic, metabolomic, proteomic, clinical and imaging resources. General data repositories such as the Gene Expression Omnibus (GEO) and Database of Genotypes and Phenotypes (dbGaP) and mental health (MH)-specific initiatives, such as the Psychiatric Genomics Consortium, MH Research Network and PsychENCODE, represent a wealth of information yet to be gleaned. At the same time, novel approaches to integrate and analyze data sets are enabling important discoveries in the area of mental and behavioral health. This review discusses the increasingly diverse set of MH data resources available and catalogs them into an organizing framework, using schizophrenia as a focus area, and describes novel and integrative approaches to molecular biomarker discovery that make use of mental health data.
translational bioinformatics, mental health, open access, biomarker discovery
Introduction
In 2013, mental illness was highly prevalent and was estimated to incur the highest financial burden among medical conditions in the United States, with spending of $201 billion [1]. In light of the considerable cost to individuals and society, mental illness represents a compelling opportunity for discovery and improved patient care. As our ability to untangle biological mechanisms of disease grows, so too does our ability to leverage our richer understanding for better diagnoses, interventions and outcomes.
In many areas of medicine, biomarker discovery is causing a shift toward biomarker-based diagnoses that promise better targeted and thus more effective therapies. Precision medicine approaches incorporating molecular and imaging biomarkers into therapeutic decision-making are emerging. Publicly available ‘big data’ resources like TCGA (The Cancer Genome Atlas) are being used and reused in numerous ways, with thousands of downstream citations [2]. Despite the magnitude of opportunity, translation has been slower in mental health (MH) than in other areas of health care [3]. In this article, we first describe the motivation as well as some challenges for the use of biomarkers in MH. We address one major challenge, the findability of relevant resources, by providing a catalog of relevant data resources for biomarker discovery in MH, and a framework for their organization. Finally, we give an overview of existing approaches to biomarker discovery using publicly available data.
An exploration of biomarkers in MH is especially timely in light of recent announcements from the National Institute of Mental Health (NIMH) exhorting a renewed focus on causal models of disease [4, 5]. Over the past 7 years, NIMH awards have shifted away from clinical research and trials and toward mechanistic biological understanding, coinciding with the NIMH’s launch of Research Domain Criteria (RDoC) in 2011, a framework emphasizing research into mechanisms (rather than clinically observable signs and symptoms) of mental illness [6].
Despite ample motivation, data reuse in MH research remains sluggish even in the presence of available biological resources and an emphasis on data sharing [7]. One important obstacle is the surprising difficulty in identifying available resources, related at least in part to the absence of a systematic approach to cataloging them. We propose and use a systematic approach to organizing relevant resources.
We then catalog data sets pertinent to MH biomarkers to facilitate secondary use of these data for computational biomarker discovery. In addition to MH-focused resources and general resources that include MH conditions, a number of rich resources exist for specific MH disorders. However, inclusion of resources for every MH condition would far exceed space limitations for a single review. Therefore, in addition to general MH resources that span multiple disorders, we extend resource cataloging to a single MH disorder, schizophrenia (SCZ), and focus our biomarker discovery literature review on that disorder. SCZ was selected because it is one of the most studied MH disorders, it places a heavy burden on the community, and the co-authors have conducted both large data annotations and various analyses in this area. It is also a prime example of a diagnostic concept well-recognized to be problematic and in need of updating [8, 9]. Biological exploration, biomarker discovery and elucidation of underlying mechanisms are key to addressing this issue. We have also limited our catalog to resources that are publicly available. Private or proprietary data sets or tools that are neither intended nor accessible for secondary research by independent researchers are beyond the scope of this review.
The mind-biology problem: a challenge for another day
The mind-body problem, which asks what the relationship is between the mind (feelings, thoughts, beliefs) and the physical realm (matter, atoms, neurons), is commonly recognized in philosophy [10]. We stipulate, for the purposes of this discussion, that psychology becomes neurobiology once a biological mechanism is understood. As we gain insights into the neural basis of normal and abnormal behavior, syndromes historically described in terms of mental constructs can be described in terms of biological constructs.
While we do not seek to tackle the philosophical question of whether all mental constructs can be adequately described in biological terms, we do assert that understanding biological mechanisms in MH is valuable, is likely to expand and will benefit from integrative research connecting behavior to biomarkers.
RDoC: ‘Outcomes to Causes and Back’
MH disorders typically include a spectrum of symptoms that affect emotions, thoughts and behaviors [4]. Moreover, two people can be diagnosed with a single disorder such as SCZ, despite having no overlapping symptoms. The NIMH RDoC initiative is an attempt to ‘develop, for research purposes, new ways of classifying mental disorders based on dimensions of observable behavior and neurobiological measures’ [4, 11]. The goal is to generate categories stemming from basic behavioral neuroscience, rather than starting with a highly heterogeneous illness definition and then seeking its neurobiological underpinnings. To this end, RDoC makes no reference to current DSM-based (Diagnostic and Statistical Manual of Mental Disorders) classification but instead proposes an alternative organizing scheme for linking behavior to underlying mechanisms. The RDoC approach is directly relevant to MH biomarkers because it aims to identify specific elements, such as mutations, genes, molecules, cells, pathways, physiological measures or behaviors associated with specific mental constructs across different disorders [3, 11].
Data, like life, is not always FAIR (Findable, Accessible, Interoperable, Reusable)
One of the expected and desirable results of the NIMH funding shift has been the generation of large biological data sets relevant to MH that are expected to be shared and reused.
Recognizing the urgent need to improve infrastructure around data discoverability and reuse in the big data era, a group of stakeholders from academia, industry, funding organizations and publishers came together to design and endorse a set of measurable principles to act as guidelines for best practices in data sharing [12]. The resulting framework is known as the FAIR principles: Findable, Accessible, Interoperable, Reusable. FAIR principles put particular emphasis on enhancing the ability of computers to find and use existing data. Findable refers to whether a researcher who would want to use the data set in question is able to discover that the data exist. This requires clear, persistent and searchable metadata. Accessible refers to whether the data are available to be downloaded, that is, whether they are retrievable through a standard communications protocol that enables authentication and authorization. Interoperable considers whether appropriate data and metadata standards are used for knowledge representation. Reusable addresses whether the data and provenance are represented in sufficient detail, with clear guidelines for usage. While some researchers express concerns about reuse of clinical data in particular [13], many in the scientific community see significant benefit to be gained by data sharing and reuse [14]. The National Institutes of Health (NIH) has launched a Data Commons initiative to establish a virtual environment to facilitate the use, interoperability and discoverability of shared digital objects used for research [15]. This review focuses on those resources, data sets and publications that adhere to the spirit of the FAIR principles [12].
Biomarkers in mental health: What, why and how?
What is a ‘biomarker’ anyway?
A biomarker is traditionally defined as ‘a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacological response’ [16].
Biomarkers can be generally classified as (1) diagnostic or trait markers that indicate the presence of a disease, (2) prognostic markers that indicate the likely course of a disease or (3) theranostic markers that predict how an individual is likely to respond to a certain treatment [17, 18]. As yet, no clinically actionable biomarkers have been approved for use in MH [11]. However, there is increasing recognition of the biological underpinnings of MH, the importance of biomarker discovery and the significant opportunity that MH poses in this regard. To this end, substantial research efforts have been devoted to biomarker discovery, and a number of publications describe promising leads [19–24]. Importantly, many of these studies have made their data publicly available to varying degrees, enabling secondary research and innovative approaches to analysis, in some cases through novel, integrative methods that could not have been done with the original data alone.
Biomarker types
Physiological biomarkers span a wide range of modalities and data types and may be categorized as either microscopic or macroscopic in scale (Figure 1).
Figure 1. Overview of micro- and macro-level biomarkers. indels, small insertions/deletions; SV, structural variants.
Micro-scale biomarkers: all things omics
Micro-scale biomarkers refer to biomarkers at the molecular level. The various and ever-increasing number of ‘-omic’ data-based biomarkers has been documented elsewhere [25, 26]. Genomic biomarkers generally refer to DNA sequence, including single-nucleotide variations (SNVs), copy number variations (CNVs), insertions, deletions, structural variants, etc. Transcriptomics refers to RNA expression, including both coding and noncoding RNA. Epigenomics refers to features of DNA other than the sequence itself, e.g.
methylation, histone modification, etc. Proteomics refers to the presence, quantity and posttranslational modification state of proteins and peptides. Metabolomics refers to identification, quantification and ratios of various metabolites generated through the organism’s metabolism. Genomic and transcriptomic biomarkers have arguably received the most attention in the past decade, in part because they have become relatively low hanging fruit: microarrays and sequencing technologies make it fairly straightforward and increasingly inexpensive to make observations across the entire genome and transcriptome.
Macro-scale biomarkers: tissue and system level
Macro-scale biomarkers are observed at the tissue or system level, generally through imaging technologies. Advances in brain imaging technology over the past 20–30 years have enabled application to MH and illness. Commonly used imaging modalities include magnetic resonance imaging (MRI), magnetic resonance spectroscopy, positron-emission tomography (PET), single-photon emission computed tomography and diffusion tensor imaging (DTI). Other methods of neuroimaging involve recording of electrical currents or magnetic fields, for example electroencephalography (EEG) and magnetoencephalography (MEG). The additional biomarker types listed below are all macro-scale biomarkers.
Structural biomarkers of the brain
Structural imaging provides qualitative and quantitative information about the shape, size and integrity of gray and white matter structures in the brain. Typically, morphometric techniques measure the volume or shape of gray matter structures and white matter tracts. Structural MRI is used for identifying density or volume of brain matter, and DTI provides images of anatomical pathways and circuits, especially of white matter [27].
Functional biomarkers of the brain
Whereas structural imaging provides static anatomical information, functional imaging provides dynamic physiological information [28]. Functional MRI (fMRI) and PET measure localized changes in cerebral blood flow related to neural activity, while EEG and MEG measure electrical currents and magnetic fields that vary with function.
The connectome: a ‘wiring diagram’ for the brain
The brain connectome defines the connectivity architecture and network organization of the neural components of the brain in terms of both structure and function. The connectome is represented as a large graph with nodes (brain regions) and edges (pathways) and has been enabled by advances in neuroimaging including structural MRI, fMRI and diffusion MRI [29]. Connectivity analysis based on graph theory is used to explore variations in the type and strength of connectivity between brain regions. Current evidence demonstrates alterations in both large-scale network and local network connectivity in mental health, and these alterations define distinct clinical and cognitive phenotypes [30].
Biomarker data resources
A framework for resource classification
A surprising challenge awaits a novice attempting integrative analyses: simply identifying what resources are available, how they relate to each other and what each one can and cannot provide. In writing this review, we initially set out to catalog a list of publicly available data resources relevant to mental health. In the course of due diligence to identify these resources, certain categories and attributes emerged. Thus, our effort to catalog available resources also informed the creation of a candidate framework for classifying and organizing the different resource types. Data resources can be classified as one (or sometimes more than one) of four high-level categories: (1) Organizational entity; (2) Initiative; (3) Platform; or (4) Data set (Figure 2).
Examples of organizational entities include federal agencies, such as the NIMH, and nonprofit organizations, such as the Allen Institute for Brain Science [31]. Initiatives are activities or groups organized around activities aimed at creating, collecting or cataloging data for research. Examples include PsychENCODE, bioCADDIE and the Psychiatric Genomics Consortium (PGC) [32–34]. Data sharing platforms are Web-based applications that enable a researcher to search for data sets using metadata and to download the data. Examples include Sage Bionetworks’ Synapse platform and the Gene Expression Omnibus (GEO) [35, 36]. Finally, specific data sets may include data resulting from experimental assays, e.g. various data sets available in GEO, or curated knowledge bases like SZGR [28]. As shown in Figure 2, the relationships between different categories do not form a simple hierarchy but are instead many-to-many. An initiative may be associated with one or more organizational entities, whether through funding or logistical or administrative support, while an organizational entity may be associated with one or more initiatives. A given organizational entity or initiative may rely on one or more platforms. A platform may contain (or point to) one or more data sets from one or more initiatives or organizational entities. A given data set is generally stored in one platform, but may also be accessed through other platforms, whether because it is replicated there or because some platforms serve as portals to federated data sets. These categories are not strictly mutually exclusive, and some resources blur the boundaries between them. For example, it can be hard to differentiate between an organization and an initiative. As a general rule, if an organization was created primarily for the purpose of creating or collecting data, we consider it an initiative.
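The many-to-many relationships described above can be made concrete with a small data model. The following Python sketch is illustrative only and is not part of the framework as published: the class names (ResourceType, Resource, ResourceGraph), the helper build_example, the letter code "D" for data sets and the choice of an undirected link are all assumptions made for this example. The sample links follow the descriptions in Table 1 (for instance, PsychENCODE is funded by the NIMH and hosted on Synapse).

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ResourceType(Enum):
    ORGANIZATIONAL_ENTITY = "O"
    INITIATIVE = "I"
    PLATFORM = "P"
    DATA_SET = "D"  # assumed letter code; Table 1 itself uses only O, I and P


@dataclass(frozen=True)
class Resource:
    name: str
    types: frozenset  # more than one type is allowed, e.g. I/P


class ResourceGraph:
    """Hypothetical container for resources and their many-to-many links."""

    def __init__(self):
        self.resources = {}
        self.links = defaultdict(set)  # undirected associations by name

    def add(self, name, *types):
        self.resources[name] = Resource(name, frozenset(types))

    def link(self, a, b):
        for name in (a, b):
            if name not in self.resources:
                raise KeyError(f"unknown resource: {name}")
        self.links[a].add(b)
        self.links[b].add(a)

    def related(self, name, rtype):
        """Names of linked resources that carry the given type, sorted."""
        return sorted(
            other
            for other in self.links.get(name, set())
            if rtype in self.resources[other].types
        )


def build_example():
    g = ResourceGraph()
    g.add("NIMH", ResourceType.ORGANIZATIONAL_ENTITY)
    g.add("PsychENCODE", ResourceType.INITIATIVE)
    g.add("Common Mind Consortium", ResourceType.INITIATIVE)
    g.add("Synapse", ResourceType.PLATFORM)
    g.add("NIMH RGR", ResourceType.PLATFORM)
    g.add("NIF", ResourceType.INITIATIVE, ResourceType.PLATFORM)
    g.link("NIMH", "PsychENCODE")                # funding relationship
    g.link("PsychENCODE", "Synapse")             # initiative hosted on a platform
    g.link("PsychENCODE", "NIMH RGR")            # same initiative, second platform
    g.link("Common Mind Consortium", "NIMH RGR")
    return g


if __name__ == "__main__":
    graph = build_example()
    print(graph.related("PsychENCODE", ResourceType.PLATFORM))
    print(graph.related("NIMH RGR", ResourceType.INITIATIVE))
```

In this sketch a single initiative maps to two platforms and a single platform maps to two initiatives, which is the non-hierarchical pattern the framework describes. Storing types as a set rather than a single value is one way to accommodate resources that blur category boundaries.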
In addition, a curated knowledge base may import and redistribute some data sets on which it is based, making it both a platform and a data set.
Figure 2. A framework for classification of data-related resources. Nodes denote resource types (Entities, Initiatives, Platforms and Data sets), and edges show the many-to-many relationships among them.
With respect to platforms, several attributes are especially salient. Some platforms such as OpenfMRI focus on a single data type. Other platforms such as Synapse are meant to be general-purpose. In addition to storing different types of data, Synapse is disease-agnostic, storing data from many different diseases and medical domains. Other platforms, for example the Stanley Neuropathology Integrative Database, focus on a specific set of mental health conditions. Figure 3 shows where major data-sharing platforms relevant to mental health fall along the spectra of data-type specificity and disease focus. Some platforms, e.g. DataMed developed by the bioCADDIE project team for the NIH BD2K Data Discovery Index (DDI), are essentially portals to a federated collection of data sets that reside in still other platforms. Finally, some resources solely house data or information, while others are associated with biospecimens that may be available for generating additional data.
Figure 3. Visual representation of data platform attributes. See Table 1 for abbreviations.
Resource identification
Because the topic of data for biomarker discovery in mental health is so broad, a simple PubMed query was not feasible.
(For example, a query for ‘[mental health OR behavioral health OR psychiatric] AND [biomarker OR genomic OR imaging] AND data’ yields >25 000 hits.) A preliminary list of resources was established based on co-authors’ prior knowledge in the domains of open data, FAIR principles and MH. The list was then augmented through a series of searches in PubMed and Google Scholar using combinations and variations of the following terms: mental health, SCZ, open data, database, imaging, genomic, proteomic, metabolomic and biomarker. In addition to direct hits yielded by these terms, PubMed’s ‘similar articles’ provided valuable additional results. Finally, a search in bioCADDIE’s DataMed data search engine yielded additional sources. Inclusion criteria for resources were: (1) Scope includes one or more types of biomarker data (beyond clinical phenotype data); (2) Data accessibility, or at minimum some indication of how to request the data; and (3) Coverage of MH phenotypes, or in the case of disease-specific resources, SCZ. The resulting list of data resources and their metadata is provided in Table 1. Figure 4 gives a high-level landscape overview for the MH-specific organizational entities, initiatives and platforms and how they relate to each other.
Table 1. Open data resources for biomarker discovery in mental health, particularly in schizophrenia
Resource | Type | URL | Notes
Enhancing Neuro Imaging Genetics Through Meta Analysis (ENIGMA) | O | http://enigma.ini.usc.edu/ongoing/enigma-schizophrenia-working-group/ | The ENIGMA Network brings together researchers in imaging genomics to understand brain structure, function and disease, based on brain imaging and genetic data. Includes Schizophrenia Working Group (ENIGMA-SCZ)
NIMH | O | https://www.nimh.nih.gov/index.shtml | The institute within the NIH that focuses on mental health and disease. The NIMH is one of 27 institutes and centers within NIH, which is part of the US Department of Health and Human Services
Open Translational Science In Schizophrenia (OPTICS) | O | https://sites.google.com/site/opticsschizophrenia/home | A time-limited proof of concept pilot project designed to provide a forum for translational science based on Janssen clinical trial data made available to qualified investigators
Stanley Medical Research Institute (SMRI) | O | http://www.stanleyresearch.org/ | A nonprofit organization supporting research on the causes of, and treatments for, SCZ and bipolar disorder
Mental Health Research Network (MHRN) | O | http://hcsrn.org/mhrn | Consortium of 13 health system research centers dedicated to improving patient mental health through research, practice and policy. Supported by a cooperative agreement from the NIMH. The MHRN conducts pragmatic research in health systems serving over 12 million patients
Common Mind Consortium | I | http://commonmind.org | Public–private partnership to generate and analyze large-scale genomic data across several brain regions from human subjects with neuropsychiatric disease and to make these data and the associated analytical results broadly available to qualified investigators
Human Connectome Project (HCP) | I | http://www.humanconnectome.org/ | Large NIH-funded project for integrating genomics, behavior and brain imaging. Currently, high-resolution imaging data are available on 1200 individuals. Primary modalities measure brain activity (resting state fMRI and task-evoked fMRI), white matter integrity (diffusion imaging and T2 FLAIR) and oscillatory brain activity (EEG and)
NIMH Human Genetics Initiative | I | https://www.nimhgenetics.org/nimh_human_genetics_initiative/ | Intended to establish a national resource of clinical and diagnostic information and immortalized cell lines from individuals with SCZ, bipolar disorder or Alzheimer's disease and their relatives, available to qualified investigators for research on the genetic basis of these disorders
PsychENCODE | I | https://www.synapse.org//#!Synapse:syn4921369/wiki/235539 | Funded by the NIMH with the goal of accelerating discovery of noncoding functional genomic elements in the human brain and elucidating their role in the molecular pathophysiology of psychiatric disorders
Stanley Neuropathology Consortium (SNC) | I | http://www.stanleyresearch.org/brain-research/neuropathology-consortium/ | A collection of 60 brains, consisting of 15 each diagnosed with SCZ, bipolar disorder or major depression, and unaffected controls. Samples may be requested for research purposes. Associated data are available in the SNC Integrative Database (SNCID); see below
Psychiatric Genomics Consortium (PGC) | I | http://www.med.unc.edu/pgc | Founded in 2007, the PGC includes over 800 investigators from 38 countries with the goal of conducting meta- and mega-analyses of genomic data for psychiatric disorders. The initial focus was on autism, attention-deficit hyperactivity disorder, bipolar disorder, major depressive disorder and SCZ. More recently, the scope has expanded to other conditions and other types of genetic variation beyond SNVs
Neuroscience Information Framework (NIF) | I/P | https://neuinfo.org/ | An NIH-funded framework for identifying, locating, relating, accessing, integrating and analyzing information from the neuroscience research enterprise. NIF has come to refer to both this initiative and the set of tools and platforms that make up that framework, including the registry of electronic resources and the discovery portal for searching those resources. NIF includes >4500 curated resources and access to >100 databases
Allen Brain Atlas/Data Portal | I/P | http://human.brain-map.org/ | The Allen Institute for Brain Science is dedicated to understanding how the human brain works in health and disease. The Allen Human Brain Atlas integrates anatomic and genomic information across the brain. Data modalities include MRI, DTI, histology and gene expression data derived from both microarray and in situ hybridization (ISH) approaches. Microarray data are spatially mapped to the MRI. Complete microarray and RNA-seq data are available for six human brains. ISH data are available for ∼50 SCZ brains
NIMH Repository and Genomics Resource (RGR) | P | https://www.nimhgenetics.org/available_data/schizophrenia/ | Includes 100+ studies, including CommonMind, PsychENCODE. Formerly the Center for Collaborative Genomic Studies on Mental Disorders, the RGR was established in 1998 through the NIMH Human Genetics Initiative to leverage and increase the value of human genetic samples and data produced through NIMH-funded research. It contains a collection of >150 000 well-characterized, high-quality patient and control samples from patients with a range of mental disorders. The RGR’s Biologic Core and a Data Management Core are external to NIH
Function Biomedical Informatics Research Network Data Repository (FBIRN DR) | P | fbirnbdr.nbirn.net:8080 (BROKEN) | FBIRN was initially focused on assessing major sources of variation of fMRI data generated across different scanners. The FBIRN Phase 1 data set consists of a traveling subject study of five healthy subjects, each scanned on 10 different 1.5 to 4 T scanners. The FBIRN Phase 2 and Phase 3 data sets consist of subjects with SCZ or schizoaffective disorder along with healthy comparison subjects scanned at multiple sites. The BIRN Data Repository (BDR) includes imaging, clinical, cognitive and physiological data
OpenNeuro (previously OpenfMRI) | P | https://openneuro.org/ (https://openfmri.org/) | A neuroimaging repository to enable reproducible analysis and data sharing. Started in 2010, it initially focused only on task-based MRI, but is now open to all forms of neuroimaging data, reflected in the name transition from OpenfMRI to OpenNeuro. Data are anonymized before distribution to protect the confidentiality of participants and distributed using a Public Domain license
Research Domain Criteria Database (RDoC DB) | P | https://data-archive.nimh.nih.gov/rdocdb/ | A data repository for the harmonization and sharing of research data related to the RDoC initiative and mental health research more generally. The actual platform uses software designed to host the NIH’s National Database for Autism Research (NDAR)
SchizConnect | P | http://schizconnect.org/ | Federated access to several neuroimaging databases with images acquired on SCZ subjects. Data sources include FBIRN, NUSDAST, COINS and MCIC (maintained by the Mental Illness and Neuroscience Discovery Institute, now the Mind Research Network). More than 1100 subjects, of whom >1000 have imaging data, including resting state fMRI, task-related fMRI, structural and diffusion imaging
SNCID | P | http://sncid.stanleyresearch.org/ | Web-based tool for exploring neuropathological traits, gene expression and associated biological processes in psychiatric disorders generated by the SNC within the SMRI
Australian Schizophrenia Research Bank | P | http://www.schizophreniaresearch.org.au/bank/ | A research database and storage facility that links clinical and neuropsychological information, blood samples and structural and fMRI brain scans from people with SCZ and healthy nonpsychiatric controls, and currently has data on ∼900 cases and 900 controls
Internet Brain Volume Database (IBVD) | P | http://ibvd.virtualbrain.org/ | Centered around publications as the central data structure, IBVD is a Web-based searchable database of brain neuroanatomic volumetric observations that enables electronic access to the results in the published literature
dbGaP | P | https://www.ncbi.nlm.nih.gov/gap | Developed by the NIH’s NCBI to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype. While the focus is on genomic data, other data types are included as well, for example metabolomic data and laboratory values
Metabolights | P | http://www.ebi.ac.uk/metabolights/ | A database for metabolomics experiments and derived information. Metabolights is the slightly more established European counterpart to the NIH’s MW and the recommended metabolomics repository for a number of top journals
DataMed | P | http://datamed.org/ | Data search engine portal to enable users to search for data across different repositories, developed for the NIH BD2K DDI by the bioCADDIE project team. The initial prototype release (v2.0) features a set of data repositories selected by the bioCADDIE team, with a form to suggest additional repositories for inclusion
Metabolomics Workbench (MW) | P | http://www.metabolomicsworkbench.org/ | A repository for metabolomics data and metadata, MW provides analysis tools and access to metabolite standards, protocols, tutorials and training
PRIDE | P | https://www.ebi.ac.uk/pride/archive/ | A centralized, standards compliant, public data repository for proteomics data, including protein and peptide identifications, posttranslational modifications and supporting spectral evidence. Most of the data sets related to mental health disorders in PRIDE are derived from animal models
Synapse | P | https://www.synapse.org/ | Sage Bionetworks’ software platform for data sharing and provenance tracking. Synapse enables researchers to carry out, track and communicate research in real time and enables co-location of scientific content (data, code, results) and narrative descriptions of that work. The platform is agnostic regarding biomedical domain or data type and hosts a number of different file types and projects funded by a number of different sources
GEO | P | https://www.ncbi.nlm.nih.gov/geo/ | An international public repository developed by the NIH NCBI that archives and freely distributes microarray, next-generation sequencing and other high-throughput functional genomics data submitted by the research community
AE | P | https://www.ebi.ac.uk/arrayexpress/ | The European counterpart to GEO. AE is an archive of functional genomics data from high-throughput functional genomics experiments. A subset of experiments is imported from GEO, while others are submitted directly
GEMMA | P | http://www.chibi.ubc.ca/Gemma/ | Gemma is a website, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles
OmicsDI | P | http://www.omicsdi.org | Enables data set discovery across omics data resources spanning eight international repositories, including both open and controlled access data resources. The resource provides key metadata for each data set and uses this metadata to enable search capabilities and identification of related data sets. OmicsDI helps researchers to identify groups of related, multi-omics data sets across repositories
The NIMH is one of 27 institutes and centers within NIH, which is part of the US Department of Health and Human Services Open Translational Science In Schizophrenia (OPTICS) O https://sites.google.com/site/opticsschizophrenia/home A time-limited proof of concept pilot project designed to provide a forum for translational science based on Janssen clinical trial data made available to qualified investigators Stanley Medical Research Institute (SMRI) O http://www.stanleyresearch.org/ A nonprofit organization supporting research on the causes of, and treatments for, SCZ and bipolar disorder Mental Health Research Network (MHRN) O http://hcsrn.org/mhrn Consortium of 13 health system research centers dedicated to improving patient mental health through research, practice and policy. Supported by a cooperative agreement from the NIMH. The MHRN conducts pragmatic research in health systems serving over 12 million patients Common Mind Consortium I http://commonmind.org Public–private partnership to generate and analyze large-scale genomic data across several brain regions from human subjects with neuropsychiatric disease and to make these data and the associated analytical results broadly available to qualified investigators Human Connectome Project (HCP) I http://www.humanconnectome.org/ Large NIH-funded project for integrating genomics, behavior and brain imaging. Currently, high-resolution imaging data are available on 1200 individuals. 
Primary modalities measure brain activity (resting state fMRI and task-evoked fMRI), white matter integrity (diffusion imaging and T2 FLAIR) and oscillatory brain activity (EEG and) NIMH Human Genetics Initiative I https://www.nimhgenetics.org/nimh_human_genetics_initiative/ Intended to establish a national resource of clinical and diagnostic information and immortalized cell lines from individuals with SCZ, bipolar disorder or Alzheimer's disease and their relatives, available to qualified investigators for research on the genetic basis of these disorders PsychENCODE I https://www.synapse.org//#! Synapse: syn4921369/wiki/235539 Funded by the NIMH with the goal of accelerating discovery of noncoding functional genomic elements in the human brain and elucidating their role in the molecular pathophysiology of psychiatric disorders Stanley Neuropathology Consortium (SNC) I http://www.stanleyresearch.org/brain-research/neuropathology-consortium/ A collection of 60 brains, consisting of 15 each diagnosed with SCZ, bipolar disorder or major depression, and unaffected controls. Samples may be requested for research purposed. Associated data are available in the SNC Integrative Database (SNCID)—see below Psychiatrics Genomics Consortium (PGC) I http://www.med.unc.edu/pgc Founded in 2007, the PGC includes over 800 investigators from 38 countries with the goal of conducting meta- and mega-analyses of genomic data for psychiatric disorders. The initial focus was on autism, attention-deficit hyperactivity disorder, bipolar disorder, major depressive disorder and SCZ. More recently, the scope has expanded to other conditions and other types of genetic variation beyond SNVs Neuroscience Information Framework (NIF) I/P https://neuinfo.org/ An NIH-funded framework for identifying, locating, relating, accessing, integrating and analyzing information from the neuroscience research enterprise. 
NIF has come to refer to both this initiative and the set of tools and platforms that make up that framework including the registry of electronic resources and the discovery portal for searching those resources. NIF includes >4500 curated resources and access to > 100 databases Allen Brain Atlas/Data Portal I/P http://human.brain-map.org/ The Allen Institute for Brain Science is dedicated to understanding how the human brain works in health and disease. The Allen Human Brain Atlas integrates anatomic and genomic information across the brain. Data modalities include MRI, DTI, histology and gene expression data derived from both microarray and in situ hybridization (ISH) approaches. Microarray data are spatially mapped to the MRI. Complete microarray and RNA-seq data are available for six human brains. ISH data are available for ∼50 SCZ brains NIMH Repository and Genomics Resource (RGR) P https://www.nimhgenetics.org/available_data/schizophrenia/ Includes 100+ studies, including CommonMind, PsychENCODE. Formerly the Center for Collaborative Genomic Studies on Mental Disorders, the RGR was established in 1998 through the NIMH Human Genetics Initiative to leverage and increase the value of human genetic samples and data produced through NIMH-funded research. It contains a collection of > 150 000 well-characterized, high-quality patient and control samples from patients with a range of mental disorders. The RGR’s Biologic Core and a Data Management Core are external to NIH Function Biomedical Informatics Research Network Data Repository (FBIRN DR) P fbirnbdr.nbirn.net: 8080 (BROKEN) FBIRN was initially focused on assessing major sources of variation of fMRI data generated across different scanners. The FBIRN Phase 1 data set consists of a traveling subject study of five healthy subjects, each scanned on 10 different 1.5 to 4 T scanners. 
The FBIRN Phase 2 and Phase 3 data sets consist of subjects with SCZ or schizoaffective disorder along with healthy comparison subjects scanned at multiple sites. The BIRN Data Repository (BDR) includes imaging, clinical, cognitive and physiological data OpenNeuro (previously OpenfMRI) P https://openneuro.org/(https://openfmri.org/) A neuroimaging repository to enable reproducible analysis and data sharing. Started in 2010, it initially focused only on task-based MRI, but is now open to all forms of neuroimaging data, reflected in the name transition from OpenfMRI to OpenNeuro. Data are anonymized before distribution to protect the confidentiality of participants and distributed using a Public Domain license Research Domain Criteria Database (RDoC DB) P https://data-archive.nimh.nih.gov/rdocdb/ A data repository for the harmonization and sharing of research data related to the RDoC initiative and mental health research more generally. The actual platform uses software designed to host the NIH’s National Database for Autism Research (NDAR) SchizConnect P http://schizconnect.org/ Federated access to several neuroimaging databases with images acquired on SCZ subjects. Data sources include FBIRN, NUSDAST, COINS and MCIC (maintained by the Mental Illness and Neuroscience Discovery Institute, now the Mind Research Network). 
More than 1100 subjects with >1000 have imaging data, including resting state fMRI, task-related fMRI, structural and diffusion imaging SNCID P http://sncid.stanleyresearch.org/ Web-based tool for exploring neuropathological traits, gene expression and associated biological processes in psychiatric disorders generated by the SNC within the SMRI Australian Schizophrenia Research Bank P http://www.schizophreniaresearch.org.au/bank/ A research database and storage facility that links clinical and neuropsychological information, blood samples and structural and fMRI brain scans from people with SCZ and healthy nonpsychiatric controls, and currently has data on ∼900 cases and 900 controls Internet Brain Volume Database (IBVD) P http://ibvd.virtualbrain.org/ Centered around publications as the central data structure, IBVD is a Web-based searchable database of brain neuroanatomic volumetric observations that enables electronic access to the results in the published literature dbGap P https://www.ncbi.nlm.nih.gov/gap Developed by the NIH’s NCBI to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype. While the focus is on genomic data, other data types are included as well, for example metabolomic data and laboratory values Metabolights P http://www.ebi.ac.uk/metabolights/ A database for Metabolomics experiments and derived information. Metabolights is the slightly more established European counterpart to the NIH’s MW and the recommended metabolomics repository for a number of top journals DataMed P http://datamed.org/ Data search engine portal to enable users to search for data across different repositories developed for the NIH BD2K DDI by the bioCADDIE project team. 
The initial prototype release (v2.0) features a set of data repositories selected by the bioCADDIE team, with a form to suggest additional repositories for inclusion Metabolomics Workbench (MW) P http://www.metabolomicsworkbench.org/ A repository for metabolomics data and metadata, MW provides analysis tools and access to metabolite standards, protocols, tutorials and training PRIDE P https://www.ebi.ac.uk/pride/archive/ A centralized, standards compliant, public data repository for proteomics data, including protein and peptide identifications, posttranslational modifications and supporting spectral evidence. Most of the data sets related to mental health disorders in PRIDE are derived from animal models Synapse P https://www.synapse.org/ Sage Bionetworks’ software platform for data sharing and provenance tracking. Synapse enables researchers to carry out, track and communicate research in real time and enables co-location of scientific content (data, code, results) and narrative descriptions of that work. The platform is agnostic regarding biomedical domain or data type and hosts a number of different file types and projects funded by a number of different sources GEO P https://www.ncbi.nlm.nih.gov/geo/ An international public repository developed by the NIH NCBI that archives and freely distributes microarray, next-generation sequencing and other high-throughput functional genomics data submitted by the research community AE P https://www.ebi.ac.uk/arrayexpress/ The European counterpart to GEO. AE is an archive of functional genomics data from high-throughput functional genomics experiments. 
A subset of experiments is imported from GEO, while others are submitted directly GEMMA P http://www.chibi.ubc.ca/Gemma/ Gemma is a website, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles OmicsDI P http://www.omicsdi.org Enables data set discovery across omics data resources spanning eight international repositories, including both open and controlled access data resources. The resource provides key metadata for each data set and uses this metadata to enable search capabilities and identification of related data sets. OmicsDI helps researchers to idenitfy groups of related, multi-omics data sets across repositories Note: Type: O, organizational entity; I, initiative; P, platform. Open in new tab Table 1 Open data resources for biomarker discovery in mental health, particularly in schizophrenia Resource . Type . URL . Notes . Enhancing Neuro Imaging Genetics Through Meta Analysis (ENIGMA) O http://enigma.ini.usc.edu/ongoing/enigma-schizophrenia-working-group/ The ENIGMA Network brings together researchers in imaging genomics to understand brain structure, function and disease, based on brain imaging and genetic data. Includes Schizophrenia Working Group (ENIGMA-SCZ) NIMH O https://www.nimh.nih.gov/index.shtml The institute within the NIH that focuses on mental health and disease. 
The NIMH is one of 27 institutes and centers within NIH, which is part of the US Department of Health and Human Services Open Translational Science In Schizophrenia (OPTICS) O https://sites.google.com/site/opticsschizophrenia/home A time-limited proof of concept pilot project designed to provide a forum for translational science based on Janssen clinical trial data made available to qualified investigators Stanley Medical Research Institute (SMRI) O http://www.stanleyresearch.org/ A nonprofit organization supporting research on the causes of, and treatments for, SCZ and bipolar disorder Mental Health Research Network (MHRN) O http://hcsrn.org/mhrn Consortium of 13 health system research centers dedicated to improving patient mental health through research, practice and policy. Supported by a cooperative agreement from the NIMH. The MHRN conducts pragmatic research in health systems serving over 12 million patients Common Mind Consortium I http://commonmind.org Public–private partnership to generate and analyze large-scale genomic data across several brain regions from human subjects with neuropsychiatric disease and to make these data and the associated analytical results broadly available to qualified investigators Human Connectome Project (HCP) I http://www.humanconnectome.org/ Large NIH-funded project for integrating genomics, behavior and brain imaging. Currently, high-resolution imaging data are available on 1200 individuals. 
Primary modalities measure brain activity (resting state fMRI and task-evoked fMRI), white matter integrity (diffusion imaging and T2 FLAIR) and oscillatory brain activity (EEG and) NIMH Human Genetics Initiative I https://www.nimhgenetics.org/nimh_human_genetics_initiative/ Intended to establish a national resource of clinical and diagnostic information and immortalized cell lines from individuals with SCZ, bipolar disorder or Alzheimer's disease and their relatives, available to qualified investigators for research on the genetic basis of these disorders PsychENCODE I https://www.synapse.org//#! Synapse: syn4921369/wiki/235539 Funded by the NIMH with the goal of accelerating discovery of noncoding functional genomic elements in the human brain and elucidating their role in the molecular pathophysiology of psychiatric disorders Stanley Neuropathology Consortium (SNC) I http://www.stanleyresearch.org/brain-research/neuropathology-consortium/ A collection of 60 brains, consisting of 15 each diagnosed with SCZ, bipolar disorder or major depression, and unaffected controls. Samples may be requested for research purposed. Associated data are available in the SNC Integrative Database (SNCID)—see below Psychiatrics Genomics Consortium (PGC) I http://www.med.unc.edu/pgc Founded in 2007, the PGC includes over 800 investigators from 38 countries with the goal of conducting meta- and mega-analyses of genomic data for psychiatric disorders. The initial focus was on autism, attention-deficit hyperactivity disorder, bipolar disorder, major depressive disorder and SCZ. More recently, the scope has expanded to other conditions and other types of genetic variation beyond SNVs Neuroscience Information Framework (NIF) I/P https://neuinfo.org/ An NIH-funded framework for identifying, locating, relating, accessing, integrating and analyzing information from the neuroscience research enterprise. 
NIF has come to refer to both this initiative and the set of tools and platforms that make up that framework including the registry of electronic resources and the discovery portal for searching those resources. NIF includes >4500 curated resources and access to > 100 databases Allen Brain Atlas/Data Portal I/P http://human.brain-map.org/ The Allen Institute for Brain Science is dedicated to understanding how the human brain works in health and disease. The Allen Human Brain Atlas integrates anatomic and genomic information across the brain. Data modalities include MRI, DTI, histology and gene expression data derived from both microarray and in situ hybridization (ISH) approaches. Microarray data are spatially mapped to the MRI. Complete microarray and RNA-seq data are available for six human brains. ISH data are available for ∼50 SCZ brains NIMH Repository and Genomics Resource (RGR) P https://www.nimhgenetics.org/available_data/schizophrenia/ Includes 100+ studies, including CommonMind, PsychENCODE. Formerly the Center for Collaborative Genomic Studies on Mental Disorders, the RGR was established in 1998 through the NIMH Human Genetics Initiative to leverage and increase the value of human genetic samples and data produced through NIMH-funded research. It contains a collection of > 150 000 well-characterized, high-quality patient and control samples from patients with a range of mental disorders. The RGR’s Biologic Core and a Data Management Core are external to NIH Function Biomedical Informatics Research Network Data Repository (FBIRN DR) P fbirnbdr.nbirn.net: 8080 (BROKEN) FBIRN was initially focused on assessing major sources of variation of fMRI data generated across different scanners. The FBIRN Phase 1 data set consists of a traveling subject study of five healthy subjects, each scanned on 10 different 1.5 to 4 T scanners. 
The FBIRN Phase 2 and Phase 3 data sets consist of subjects with SCZ or schizoaffective disorder along with healthy comparison subjects scanned at multiple sites. The BIRN Data Repository (BDR) includes imaging, clinical, cognitive and physiological data OpenNeuro (previously OpenfMRI) P https://openneuro.org/(https://openfmri.org/) A neuroimaging repository to enable reproducible analysis and data sharing. Started in 2010, it initially focused only on task-based MRI, but is now open to all forms of neuroimaging data, reflected in the name transition from OpenfMRI to OpenNeuro. Data are anonymized before distribution to protect the confidentiality of participants and distributed using a Public Domain license Research Domain Criteria Database (RDoC DB) P https://data-archive.nimh.nih.gov/rdocdb/ A data repository for the harmonization and sharing of research data related to the RDoC initiative and mental health research more generally. The actual platform uses software designed to host the NIH’s National Database for Autism Research (NDAR) SchizConnect P http://schizconnect.org/ Federated access to several neuroimaging databases with images acquired on SCZ subjects. Data sources include FBIRN, NUSDAST, COINS and MCIC (maintained by the Mental Illness and Neuroscience Discovery Institute, now the Mind Research Network). 
More than 1100 subjects with >1000 have imaging data, including resting state fMRI, task-related fMRI, structural and diffusion imaging SNCID P http://sncid.stanleyresearch.org/ Web-based tool for exploring neuropathological traits, gene expression and associated biological processes in psychiatric disorders generated by the SNC within the SMRI Australian Schizophrenia Research Bank P http://www.schizophreniaresearch.org.au/bank/ A research database and storage facility that links clinical and neuropsychological information, blood samples and structural and fMRI brain scans from people with SCZ and healthy nonpsychiatric controls, and currently has data on ∼900 cases and 900 controls Internet Brain Volume Database (IBVD) P http://ibvd.virtualbrain.org/ Centered around publications as the central data structure, IBVD is a Web-based searchable database of brain neuroanatomic volumetric observations that enables electronic access to the results in the published literature dbGap P https://www.ncbi.nlm.nih.gov/gap Developed by the NIH’s NCBI to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype. While the focus is on genomic data, other data types are included as well, for example metabolomic data and laboratory values Metabolights P http://www.ebi.ac.uk/metabolights/ A database for Metabolomics experiments and derived information. Metabolights is the slightly more established European counterpart to the NIH’s MW and the recommended metabolomics repository for a number of top journals DataMed P http://datamed.org/ Data search engine portal to enable users to search for data across different repositories developed for the NIH BD2K DDI by the bioCADDIE project team. 
The initial prototype release (v2.0) features a set of data repositories selected by the bioCADDIE team, with a form to suggest additional repositories for inclusion Metabolomics Workbench (MW) P http://www.metabolomicsworkbench.org/ A repository for metabolomics data and metadata, MW provides analysis tools and access to metabolite standards, protocols, tutorials and training PRIDE P https://www.ebi.ac.uk/pride/archive/ A centralized, standards compliant, public data repository for proteomics data, including protein and peptide identifications, posttranslational modifications and supporting spectral evidence. Most of the data sets related to mental health disorders in PRIDE are derived from animal models Synapse P https://www.synapse.org/ Sage Bionetworks’ software platform for data sharing and provenance tracking. Synapse enables researchers to carry out, track and communicate research in real time and enables co-location of scientific content (data, code, results) and narrative descriptions of that work. The platform is agnostic regarding biomedical domain or data type and hosts a number of different file types and projects funded by a number of different sources GEO P https://www.ncbi.nlm.nih.gov/geo/ An international public repository developed by the NIH NCBI that archives and freely distributes microarray, next-generation sequencing and other high-throughput functional genomics data submitted by the research community AE P https://www.ebi.ac.uk/arrayexpress/ The European counterpart to GEO. AE is an archive of functional genomics data from high-throughput functional genomics experiments. 
A subset of experiments is imported from GEO, while others are submitted directly GEMMA P http://www.chibi.ubc.ca/Gemma/ Gemma is a website, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles OmicsDI P http://www.omicsdi.org Enables data set discovery across omics data resources spanning eight international repositories, including both open and controlled access data resources. The resource provides key metadata for each data set and uses this metadata to enable search capabilities and identification of related data sets. OmicsDI helps researchers to idenitfy groups of related, multi-omics data sets across repositories Resource . Type . URL . Notes . Enhancing Neuro Imaging Genetics Through Meta Analysis (ENIGMA) O http://enigma.ini.usc.edu/ongoing/enigma-schizophrenia-working-group/ The ENIGMA Network brings together researchers in imaging genomics to understand brain structure, function and disease, based on brain imaging and genetic data. Includes Schizophrenia Working Group (ENIGMA-SCZ) NIMH O https://www.nimh.nih.gov/index.shtml The institute within the NIH that focuses on mental health and disease. 
The NIMH is one of 27 institutes and centers within NIH, which is part of the US Department of Health and Human Services Open Translational Science In Schizophrenia (OPTICS) O https://sites.google.com/site/opticsschizophrenia/home A time-limited proof of concept pilot project designed to provide a forum for translational science based on Janssen clinical trial data made available to qualified investigators Stanley Medical Research Institute (SMRI) O http://www.stanleyresearch.org/ A nonprofit organization supporting research on the causes of, and treatments for, SCZ and bipolar disorder Mental Health Research Network (MHRN) O http://hcsrn.org/mhrn Consortium of 13 health system research centers dedicated to improving patient mental health through research, practice and policy. Supported by a cooperative agreement from the NIMH. The MHRN conducts pragmatic research in health systems serving over 12 million patients Common Mind Consortium I http://commonmind.org Public–private partnership to generate and analyze large-scale genomic data across several brain regions from human subjects with neuropsychiatric disease and to make these data and the associated analytical results broadly available to qualified investigators Human Connectome Project (HCP) I http://www.humanconnectome.org/ Large NIH-funded project for integrating genomics, behavior and brain imaging. Currently, high-resolution imaging data are available on 1200 individuals. 
Primary modalities measure brain activity (resting state fMRI and task-evoked fMRI), white matter integrity (diffusion imaging and T2 FLAIR) and oscillatory brain activity (EEG and) NIMH Human Genetics Initiative I https://www.nimhgenetics.org/nimh_human_genetics_initiative/ Intended to establish a national resource of clinical and diagnostic information and immortalized cell lines from individuals with SCZ, bipolar disorder or Alzheimer's disease and their relatives, available to qualified investigators for research on the genetic basis of these disorders PsychENCODE I https://www.synapse.org//#! Synapse: syn4921369/wiki/235539 Funded by the NIMH with the goal of accelerating discovery of noncoding functional genomic elements in the human brain and elucidating their role in the molecular pathophysiology of psychiatric disorders Stanley Neuropathology Consortium (SNC) I http://www.stanleyresearch.org/brain-research/neuropathology-consortium/ A collection of 60 brains, consisting of 15 each diagnosed with SCZ, bipolar disorder or major depression, and unaffected controls. Samples may be requested for research purposed. Associated data are available in the SNC Integrative Database (SNCID)—see below Psychiatrics Genomics Consortium (PGC) I http://www.med.unc.edu/pgc Founded in 2007, the PGC includes over 800 investigators from 38 countries with the goal of conducting meta- and mega-analyses of genomic data for psychiatric disorders. The initial focus was on autism, attention-deficit hyperactivity disorder, bipolar disorder, major depressive disorder and SCZ. More recently, the scope has expanded to other conditions and other types of genetic variation beyond SNVs Neuroscience Information Framework (NIF) I/P https://neuinfo.org/ An NIH-funded framework for identifying, locating, relating, accessing, integrating and analyzing information from the neuroscience research enterprise. 
NIF has come to refer to both this initiative and the set of tools and platforms that make up that framework including the registry of electronic resources and the discovery portal for searching those resources. NIF includes >4500 curated resources and access to > 100 databases Allen Brain Atlas/Data Portal I/P http://human.brain-map.org/ The Allen Institute for Brain Science is dedicated to understanding how the human brain works in health and disease. The Allen Human Brain Atlas integrates anatomic and genomic information across the brain. Data modalities include MRI, DTI, histology and gene expression data derived from both microarray and in situ hybridization (ISH) approaches. Microarray data are spatially mapped to the MRI. Complete microarray and RNA-seq data are available for six human brains. ISH data are available for ∼50 SCZ brains NIMH Repository and Genomics Resource (RGR) P https://www.nimhgenetics.org/available_data/schizophrenia/ Includes 100+ studies, including CommonMind, PsychENCODE. Formerly the Center for Collaborative Genomic Studies on Mental Disorders, the RGR was established in 1998 through the NIMH Human Genetics Initiative to leverage and increase the value of human genetic samples and data produced through NIMH-funded research. It contains a collection of > 150 000 well-characterized, high-quality patient and control samples from patients with a range of mental disorders. The RGR’s Biologic Core and a Data Management Core are external to NIH Function Biomedical Informatics Research Network Data Repository (FBIRN DR) P fbirnbdr.nbirn.net: 8080 (BROKEN) FBIRN was initially focused on assessing major sources of variation of fMRI data generated across different scanners. The FBIRN Phase 1 data set consists of a traveling subject study of five healthy subjects, each scanned on 10 different 1.5 to 4 T scanners. 
The FBIRN Phase 2 and Phase 3 data sets consist of subjects with SCZ or schizoaffective disorder along with healthy comparison subjects scanned at multiple sites. The BIRN Data Repository (BDR) includes imaging, clinical, cognitive and physiological data OpenNeuro (previously OpenfMRI) P https://openneuro.org/(https://openfmri.org/) A neuroimaging repository to enable reproducible analysis and data sharing. Started in 2010, it initially focused only on task-based MRI, but is now open to all forms of neuroimaging data, reflected in the name transition from OpenfMRI to OpenNeuro. Data are anonymized before distribution to protect the confidentiality of participants and distributed using a Public Domain license Research Domain Criteria Database (RDoC DB) P https://data-archive.nimh.nih.gov/rdocdb/ A data repository for the harmonization and sharing of research data related to the RDoC initiative and mental health research more generally. The actual platform uses software designed to host the NIH’s National Database for Autism Research (NDAR) SchizConnect P http://schizconnect.org/ Federated access to several neuroimaging databases with images acquired on SCZ subjects. Data sources include FBIRN, NUSDAST, COINS and MCIC (maintained by the Mental Illness and Neuroscience Discovery Institute, now the Mind Research Network). 
More than 1100 subjects with >1000 have imaging data, including resting state fMRI, task-related fMRI, structural and diffusion imaging SNCID P http://sncid.stanleyresearch.org/ Web-based tool for exploring neuropathological traits, gene expression and associated biological processes in psychiatric disorders generated by the SNC within the SMRI Australian Schizophrenia Research Bank P http://www.schizophreniaresearch.org.au/bank/ A research database and storage facility that links clinical and neuropsychological information, blood samples and structural and fMRI brain scans from people with SCZ and healthy nonpsychiatric controls, and currently has data on ∼900 cases and 900 controls Internet Brain Volume Database (IBVD) P http://ibvd.virtualbrain.org/ Centered around publications as the central data structure, IBVD is a Web-based searchable database of brain neuroanatomic volumetric observations that enables electronic access to the results in the published literature dbGap P https://www.ncbi.nlm.nih.gov/gap Developed by the NIH’s NCBI to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype. While the focus is on genomic data, other data types are included as well, for example metabolomic data and laboratory values Metabolights P http://www.ebi.ac.uk/metabolights/ A database for Metabolomics experiments and derived information. Metabolights is the slightly more established European counterpart to the NIH’s MW and the recommended metabolomics repository for a number of top journals DataMed P http://datamed.org/ Data search engine portal to enable users to search for data across different repositories developed for the NIH BD2K DDI by the bioCADDIE project team. 
The initial prototype release (v2.0) features a set of data repositories selected by the bioCADDIE team, with a form to suggest additional repositories for inclusion Metabolomics Workbench (MW) P http://www.metabolomicsworkbench.org/ A repository for metabolomics data and metadata, MW provides analysis tools and access to metabolite standards, protocols, tutorials and training PRIDE P https://www.ebi.ac.uk/pride/archive/ A centralized, standards compliant, public data repository for proteomics data, including protein and peptide identifications, posttranslational modifications and supporting spectral evidence. Most of the data sets related to mental health disorders in PRIDE are derived from animal models Synapse P https://www.synapse.org/ Sage Bionetworks’ software platform for data sharing and provenance tracking. Synapse enables researchers to carry out, track and communicate research in real time and enables co-location of scientific content (data, code, results) and narrative descriptions of that work. The platform is agnostic regarding biomedical domain or data type and hosts a number of different file types and projects funded by a number of different sources GEO P https://www.ncbi.nlm.nih.gov/geo/ An international public repository developed by the NIH NCBI that archives and freely distributes microarray, next-generation sequencing and other high-throughput functional genomics data submitted by the research community AE P https://www.ebi.ac.uk/arrayexpress/ The European counterpart to GEO. AE is an archive of functional genomics data from high-throughput functional genomics experiments. 
A subset of experiments is imported from GEO, while others are submitted directly GEMMA P http://www.chibi.ubc.ca/Gemma/ Gemma is a website, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles OmicsDI P http://www.omicsdi.org Enables data set discovery across omics data resources spanning eight international repositories, including both open and controlled access data resources. The resource provides key metadata for each data set and uses this metadata to enable search capabilities and identification of related data sets. OmicsDI helps researchers to idenitfy groups of related, multi-omics data sets across repositories Note: Type: O, organizational entity; I, initiative; P, platform. Open in new tab Figure 4 Open in new tabDownload slide Overview of landscape of organizational entities, initiatives and data sharing platforms. Figure 4 Open in new tabDownload slide Overview of landscape of organizational entities, initiatives and data sharing platforms. A number of potentially useful resources were deemed out of scope for this review because they lacked either -omics data (e.g. National Database for Clinical Trials Related to Mental Illness, NDCT, Yale Open Data Access Project YODA [37]), or psychiatric phenotype data (Exome Aggregation Consortium, ExAC [38], Genotype-Tissue Expression (GTEx) project [39]). Data sets Genomic data The two main data repositories for gene expression or transcriptomic data are GEO and ArrayExpress (AE). GEO is an international public repository developed by the US NIH’s National Center for Biotechnology Informatics (NCBI) that archives and freely distributes microarray, next-generation sequencing and other high-throughput functional genomics data submitted by the research community [35]. AE is the Europe-based repository, hosted by the European Bioinformatics Institute within the European Molecular Biology Laboratory (EMBL-EBI). 
Data are imported from GEO into AE on a weekly basis, making GEO a subset of AE. To be uploaded to these data repositories, data sets need to be in a specific format, such as GEOarchive, SOFT or MINiML. They must also include appropriate metadata about the clinical and experimental data. Both AE and GEO enable programmatic access to data via tools like R/Bioconductor. Data sets in GEO therefore are able to satisfy the F (findable) and R (reusable) FAIR criteria. Gene expression data in GEO are generally considered to be de-identified, and are thus freely available for public use.

There has been an increase in genomic profiling data related to MH in the last few years. Taking SCZ as an example, a search in AE for published SCZ data sets shows 92 data sets in humans (Supplementary Table S1). Only two data sets were published in 2007 as compared with 11 data sets published in 2016. Until 2010, the majority of published data sets concerned transcriptome profiling. In 2012 and 2013, other genomic methods gained popularity, including methylation and next-generation sequencing technologies, exploration of noncoding regions and gene expression and splicing. Since 2014, many studies have been published using newer genomic platforms including chromatin immunoprecipitation sequencing, RNA sequencing (RNA-seq) and microRNA-seq, amounting to approximately 15 published studies in 2014, 13 studies in 2015 and 11 studies in 2016. As of July 2017, we found that of the 92 data sets, 80 had been cited in one or more subsequent publications. Using Google Scholar queries on data set accession identifiers, it was determined that these 80 publications have been cited 6710 times (Supplementary Table S1). Note that citation does not necessarily imply analysis: many publications called attention to the existence of a data set without performing any additional analysis.
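As a rough illustration of the programmatic access mentioned above, the sketch below builds (but does not send) a search request against GEO DataSets through NCBI's E-utilities, which is an alternative to the R/Bioconductor route named in the text. The helper name `geo_search_url`, the query string and the `retmax` default are assumptions made for illustration; the sketch relies only on the public `esearch.fcgi` endpoint and its `db`, `term`, `retmax` and `retmode` parameters.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; "gds" is the Entrez database name for GEO DataSets.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for GEO DataSets (hypothetical helper; no request is sent)."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# Illustrative query only: schizophrenia records restricted to human samples.
url = geo_search_url('schizophrenia AND "Homo sapiens"[Organism]')
print(url)
```

Fetching the URL with any HTTP client should return a JSON document listing matching record identifiers. Keeping URL construction separate from the network call makes the query easy to inspect and log before it is run.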
NCBI’s database of Genotypes and Phenotypes (dbGaP) contains archived data and results from studies that have investigated the association between genotype and phenotype in humans [40]. The European equivalent is the EGA (European Genome-phenome Archive) [41]. As with the gene expression repositories, data sets need to be in a specific format along with minimal metadata to be submitted into dbGaP or EGA. Note that these repositories contain sequencing data that are unique to the individuals from whom they were derived, and thus cannot be considered completely de-identified. Users must therefore submit a data request form detailing the goals of their project, how they intend to use the data and how they will observe the data use policy, for approval by a data request committee. This approach has implications for meeting the accessibility aspect of FAIR criteria but represents a balance between data accessibility and data privacy for research participants. dbGaP contains a number of MH-related data sets, including SCZ. Of the 154 studies returned based on a query for the term ‘schizophrenia’, only a small subset was targeted at SCZ as determined by manual inspection. In this case, findability is hampered by the number of false positives (Table 2). The vast majority of the 24 studies returned in a search for ‘schizophrenia’ in EGA are either focused on SCZ or have some number of samples included with a SCZ diagnosis.

Table 2. SCZ data sets in dbGaP

| Data set ID | Name | # Participants | Platform | Publication (PMIDs) | Citations | Data type |
| --- | --- | --- | --- | --- | --- | --- |
| phs000979.v1.p1 (PRJNA293910) | Gene Expression in Postmortem DLPFC and Hippocampus from Schizophrenia and Mood Disorders | 914 | HumanHap650Yv3.0, Human1M-Duov3_B, Human HT-12 Expression Bead Ch | 28070120 | [4] | SNP array, mRNA expression |
| phs000473.v2.p2 (PRJNA157243, PRJNA94281) | Sweden-Schizophrenia Population-Based Case-Control Exome Sequencing | 12 380 | SureSelect Human All Exon v.1 Kit, SureSelect Human All Exon v. | 22641211 | [15] | WES |
| phs000738.v1.p1 | Exome Sequencing in Schizophrenia Families | 216 | SeqCap EZ Human Exome Library v2.0 | 23911319, 24317315 | [1] | WES |
| phs000687.v1.p1 | Bulgarian Schizophrenia Trio Sequencing Study | 1826 | SureSelect Human All Exon v.2 Kit, SureSelect Human All Exon v3-50Mb, SeqCap EZ Human Exome Library v2.0 | 23040492, 22083728, 24463507 | [1] | WES, SNP Genotype |
| phs000608.v1.p1 | Whole-Genome Profiling to Detect Schizophrenia Methylation Markers | 1459 | MBD-seq | 23244307 | [42] | Methylation |
| phs000448.v1.p1 | Genetics of Schizophrenia in an Ashkenazi Jewish Case-Control Cohort | 3096 | HumanOmni1-Quad_v1-0_B | | [4] | SNP array |
| phs000021.v3.p2 | Genome-Wide Association Study of Schizophrenia | 5064 | AFFY_6.0 | 16400611 | [43] | SNP array |
| phs000167.v1.p1 | Molecular Genetics of Schizophrenia-nonGAIN Sample (MGS nonGAIN) | 3029 | AFFY_6.0 | 16400611 | [44] | SNP array |

Proteomic and metabolomic data sets

EBI’s Metabolights and the NIH’s Metabolomics Workbench (MW) are two major metabolomics data repositories. Metabolights has no SCZ data sets; MW has one, but its data are not downloadable. A limited number of data sets appear to be available for other mental health phenotypes such as Alzheimer’s disease and autism spectrum disorder. PRIDE (PRoteomics IDEntifications), the leading proteomics data repository, has only three SCZ-related data sets: two in rat models and one in a mouse model. Data sets related to other MH disorders are similarly limited and largely generated from animal models.

Imaging repositories

SchizConnect is a federated portal that integrates data from three neuroimaging consortia on SCZ: FBIRN's Human Imaging Database (HID), MRN's Collaborative Imaging and Neuroinformatics System (COINS) and the Northwestern University Schizophrenia Data and Software Tool (NUSDAST) project [45]. A number of general purpose brain imaging repositories exist, such as OpenNeuro (formerly OpenfMRI) [46], and the Neuroimaging Informatics Tools and Resources Clearinghouse Image Repository (NITRC IR) [47].
However, to date, the publicly available data sets in those resources appear to be more cognition-oriented (e.g. classification learning, visual and auditory functions, attention) than psychiatric. Notable exceptions in SCZ include [42, 48]. The Functional Connectomes Project is also available through NITRC [49]. It comprises data from >1400 healthy subjects who underwent fMRI scans that assessed their brain activity when their minds were at rest. Included in the 1400 is a subset known as the COBRE (Center for Biomedical Research Excellence) data set, which includes anatomical and functional MR data from 72 patients with SCZ and 75 healthy controls. These data have been analyzed in a number of different ways by different groups [50–53]. Curated knowledge bases SZGR [28], SZGene [44] and SZDB [54] are three distinct but significantly overlapping knowledge repositories that include curated information regarding SCZ-related genes and data sets. The SZGR, available since 2009, is a ‘one-stop shop’ for genes and variants in SCZ, along with their function, regulation and drug information. It was created through systematic review and curation of multiple lines of evidence and includes ∼4200 common mutations and ∼1000 de novo mutations [28, 55]. SZGene is affiliated with the Schizophrenia Research Forum and contains data from 1700 studies. It enables the user to search by gene, protein, polymorphism, study or keyword to return the specific publications addressing those features [44]. However, the resource only contains data from studies before 2012 and is no longer supported. (Unfortunately, this is not uncommon for resources that require maintenance over time.) Finally, SZDB includes genomic, transcriptomic, molecular network data and functional annotations [54]. 
The DisGeNET database (http://www.disgenet.org) integrates human gene–disease associations from various expert curated databases and text-mining-derived associations including Mendelian, complex and environmental diseases. A search in July 2017 for genes and single-nucleotide polymorphisms (SNPs) associated with SCZ yielded 1871 and 1635, respectively. Genome-wide association studies and beyond: innovative, integrative approaches to biomarker discovery in mental health At the most basic level, data sharing can enable integrative analysis for biomarker discovery by allowing researchers to combine two or more comparable data sets to increase statistical power through increased sample size. Researchers have developed many creative methods to combine data, enabling the discovery of patterns not apparent when analyzing just a single data type. Genotype data, including those identified through both microarrays and next-generation sequencing, can be combined with other data types such as protein–protein interaction networks, biological pathways, gene expression and co-expression, methylation and microRNA regulation data [43, 56–60]. In some cases, researchers have been able to combine three or more data types in creative ways for biomarker discovery [61, 62]. Most studies start with the purpose of discovering novel variants and then make use of public resources to validate and support their initial discovery. Methods may be categorized at the gene level [63–66], pathway level [67] and network level [68–70]. Another way to categorize the methods is based on the multi-omics data. Some integrative studies involve only genetics and eQTL (expression quantitative trait loci) data [71], while others are more comprehensive, involving multiple genome-wide association studies (GWAS) and/or other dimensional data [72, 73]. 
Recently, with the dramatic increase of GWAS data, especially by the PGC, an increasing number of studies have been published for integrative analyses using multiple-disorders or multiple-omics data, aiming to identify shared or unique genetic variants among different MH disorders [74]. To generate an overview of these integrative studies, we used PubMed to systematically search for integrative studies using keywords listed in Table 3. In total, we obtained 595 publications for integrative studies of SCZ. A majority of them (497) were published after 2010, likely due in part to the curation of omics data in recent years. As shown in Figure 5, the publication of integrative studies in SCZ has been increasing sharply in recent years, mostly in the category of eQTL. Recurring themes among these integrative methods include overlap between variants from different omics modalities and randomization and permutation tests for statistical significance.

Table 3. Keywords and counts for integrative biomarker studies in schizophrenia published before May 2017

| Keywords | N |
| --- | --- |
| schizophrenia [TIAB] AND GWAS AND expression | 285 |
| schizophrenia [TIAB] AND SNP AND expression | 242 |
| schizophrenia [TIAB] AND GWAS AND network | 140 |
| schizophrenia [TIAB] AND SNP AND network | 75 |
| schizophrenia [TIAB] AND GWAS AND methylation | 36 |
| schizophrenia [TIAB] AND GWAS AND eQTL | 35 |
| schizophrenia [TIAB] AND SNP AND integrative | 32 |
| schizophrenia [TIAB] AND GWAS AND quantitative traits | 26 |
| schizophrenia [TIAB] AND GWAS AND transcriptome | 26 |
| schizophrenia [TIAB] AND SNP AND methylation | 20 |
| schizophrenia [TIAB] AND SNP AND eQTL | 19 |
| schizophrenia [TIAB] AND SNP AND quantitative traits | 11 |
| schizophrenia [TIAB] AND SNP AND transcriptome | 10 |
| schizophrenia [TIAB] AND GWAS AND integrative | 5 |
| schizophrenia [TIAB] AND SNP AND transcriptome | 4 |
| schizophrenia [TIAB] AND genotyping AND transcriptome | 3 |
| schizophrenia [TIAB] AND SNP AND ATAC-seq | 1 |

Figure 5. Publication summary of SCZ integrative studies. Note: Publications in 2017 were estimated based on the data between January and May in 2017.

Big big-data: large-scale GWASs

Since 2008, GWAS have reported a number of genetic variants associated with SCZ [63–66]. The Schizophrenia Working Group of the PGC conducted the largest GWAS in SCZ to date (36 989 SCZ cases and 113 075 controls) and identified 108 loci [75]. The largest ancestrally and phenotypically homogeneous GWAS of SCZ (11 260 cases and 24 542 controls) reported 50 novel SCZ risk loci [76]. Despite these large-scale studies, the genes or functional DNA elements through which these variants exert their effects remain unknown. An emerging trend is to integrate multidimensional data from genetics, epigenetics and transcriptomics to prioritize biomarkers and also to understand the underlying mechanisms by which they act.
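One recurring pattern noted above, testing whether the overlap between gene or variant sets from two modalities exceeds chance by permutation, can be sketched as follows. This is a minimal illustration and not the method of any cited study: the function name `overlap_permutation_p`, the toy gene identifiers and the null model (the second set is treated as a uniformly random draw from the gene universe) are all assumptions. Published analyses typically use more careful null models, for example matching on gene length or linkage disequilibrium.

```python
import random


def overlap_permutation_p(universe, set_a, set_b, n_perm=999, seed=0):
    """One-sided permutation p-value for the overlap between two gene sets.

    Null model (an assumption of this sketch): set_b is a uniformly random
    draw of the same size from the gene universe.
    """
    rng = random.Random(seed)
    universe = sorted(universe)  # fixed order so results are reproducible
    set_a, set_b = set(set_a), set(set_b)
    observed = len(set_a & set_b)
    hits = 0
    for _ in range(n_perm):
        random_b = rng.sample(universe, len(set_b))
        if len(set_a.intersection(random_b)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 1000 hypothetical genes; the two 50-gene sets share 40 members.
genes = [f"g{i}" for i in range(1000)]
gwas_genes = genes[:50]     # stand-in for genes near GWAS loci
eqtl_genes = genes[10:60]   # stand-in for genes with eQTL support
observed, p_value = overlap_permutation_p(genes, gwas_genes, eqtl_genes)
print(observed, p_value)
```

With 999 permutations the smallest attainable p-value is 1/1000, so the number of permutations bounds the resolution of the test.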
Combining expression and sequence data: eQTL

Several large-scale studies have markedly expanded the scope of known eQTLs [77–79]. One recent example integrates SCZ postmortem brain gene expression with GWAS signals at the pathway level [80]. Building on these eQTL resources as well as gene expression profiles, candidate genes have been identified by integrative studies [81–85]. Using expression data from 647 postmortem human brain samples collected by Stanley Medical Research Institute and the GTEx project, the gene complement component 4 (C4) in the major histocompatibility complex region was identified as contributing to SCZ risk [86]. In addition, the CommonMind Consortium (CMC) generated RNA-seq data from postmortem dorsolateral prefrontal cortices from 258 subjects with SCZ and 279 controls, and they identified a list of genes whose expression was significantly affected by SCZ risk variations [87]. Using the CMC data, a list of brain splicing quantitative trait loci was identified that are causally associated with SCZ [88]. In another study, SCZ risk genes were identified using summary data from GWAS and eQTL in which gene expression data were generated from 5311 peripheral blood samples [89]. More examples include valuable eQTLs located at NMDAR [90], CTCF and CACNB2 [91], and the 17q25 locus [92], most of which used public eQTL data such as those from GTEx. These eQTLs are mostly common variants. Rare noncoding SCZ risk variants were also identified [93].

Combining genomic data with multiple phenotypes: pleiotropy

Genomic data combined with multiple phenotypes can enable the discovery of pleiotropy-associated genes, i.e. genes whose alleles affect two or more apparently unrelated traits. The PGC provides a good example of this. They initially, and intentionally, focused studies on five major psychiatric disorders: autism, attention-deficit hyperactivity disorder, bipolar disorder, major depressive disorder and SCZ [74, 94].
A number of clinical features transcend these disease classifications, and previous research had suggested overlap in familial and genetic liability for different combinations of these disorders [74].

Combining genomic and imaging data

Another common integrative approach in the study of MH is the combination of genomic data with imaging data. The Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) consortium provides tools and protocols to meta-analyze genome-wide and neuroimaging data from research teams worldwide [95]. The consortium does not require participating investigators to contribute raw data nor provide access to such data for research or public use. Instead, ENIGMA provides standardized protocols with predefined covariates to allow sites to conduct GWAS and imaging studies locally and report meta-analyzed data, which are made publicly available through an interactive visualization tool, ENIGMA-Vis [96]. For example, the ENIGMA SCZ working group conducted a collaborative, prospective meta-analysis of neuroimaging data from >4500 study participants (2028 subjects with SCZ and 2540 healthy controls) across 15 sites [97].

Combining genomic and epigenetic data

DNA methylation is important for epigenetic regulation of gene expression. The 108 SCZ-GWAS risk loci identified by the PGC [75] were evaluated systematically as methylation quantitative trait loci in postmortem prefrontal cortex from 191 SCZ and 335 controls [98], 689 SCZ and 645 controls [99] and 1163 postmortem brains of European ancestry [100]. The loci were also systematically analyzed using 166 human fetal brains [101]. The evidence showed a much stronger differential DNA methylation enrichment in genes associated with SCZ, even using a medium sample size (<100) of postmortem brains [102]. The assay for transposase-accessible chromatin followed by sequencing (ATAC-seq) is another promising technique that can be used to map chromatin accessibility.
Recently, ATAC-seq has been used to study spatiotemporal regulation of gene expression of neuronal and nonneuronal nuclei isolated from frozen postmortem human brain to map chromatin accessibility for SCZ risk loci [103].

Concluding remarks

Our review of data resources and integrative biomarker discovery in MH with a focus on SCZ suggests a recent increase in the number and quality of resources and an even more recent growth in their use. Indeed, we are starting to see high-profile papers that leverage some of these existing data sets. For example, a recent paper in Nature Genetics described a GWAS in 36 000 individuals of Chinese ancestry that was combined with data from the PGC [75] to perform trans-ancestry meta-analyses yielding 30 novel risk loci for SCZ [104]. However, there remains significant untapped potential: many resources that could be made available under FAIR principles are not, and those resources that are available remain underused. While we leave other MH disorders for a future paper, we do not believe the results will be significantly different for anxiety or depression, or for most other disorders in mental health with the possible exception of autism spectrum disorder, where a concerted national effort by both public and private entities has created a large concentration of findable and accessible shared data [105, 106]. The challenges facing those seeking to reuse resources for integrative research are numerous and formidable, but we believe that many seeking to enter the fray are tripping over the threshold of findability. We hope that both the suggested organizing framework and the catalog of resources presented here will help new entrants into the space cross the threshold successfully and focus on more substantive challenges of data integration and analysis.
We also believe the proposed framework is useful more broadly in other biomedical domains to facilitate categorization and dissemination of information about data resources to support emerging precision medicine initiatives. As noted recently in a memo from Dr Joshua Gordon, director of NIMH, understanding the underlying biology of MH is more important than ever, and increasingly within reach given recent technological developments [4]. A consensus on FAIR principles, the NIH push toward data sharing, and NLM support for best practices, mean those who continue to develop innovative approaches to the vast and ever-increasing amount of publicly available data will help the rest of us gather valuable insights about mental health diagnosis and treatment. Key Points The growing number of data sets available to researchers in mental and behavioral health enables secondary analysis and novel integrative methods for biomarker discovery. We propose a framework for organizing and classifying publicly available resources for biomarker discovery in mental health using SCZ as an example. Many potential resources are not yet compliant with FAIR data-sharing principles, and currently available resources remain underused. While no clinically actionable biomarkers have yet been identified, a confluence of policies, initiatives and technological advances puts us at a potential inflection point for accelerating discovery and advancement in the field of MH. Acknowledgements The authors wish to thank Dr Piper Ranallo for her passion and perseverance in catalyzing creation of the AMIA Mental Health Informatics Working Group out of which this article emerged and Jyotishman Pathak for valuable comments on early drafts of the manuscript. The authors also thank the anonymous reviewers for their insightful comments and helpful suggestions to strengthen this review. Funding The National Institutes of Health (grant numbers UL1TR001117 to J.D.T., R01LM012095 to S.V. and R01LM012806 to P.J. 
and Z.Z.). Jessica D. Tenenbaum is an assistant professor of Translational Biomedical Informatics in the Department of Biostatistics and Bioinformatics at the Duke University School of Medicine. She is co-founder and chair of the American Medical Informatics Association (AMIA)’s Mental Health Informatics Working Group. Krithika Bhuvaneshwar is a research associate and project manager at the Innovation Center for Biomedical Informatics (ICBI) at Georgetown University. She has expertise in bioinformatics analysis and systems biology research, and combines her interdisciplinary skills in bioinformatics and biostatistics for numerous projects. Jane Gagliardi is an associate professor of psychiatry and behavioral sciences and an associate professor of Medicine at Duke University School of Medicine and serves as vice chair for education and residency training director in Psychiatry. Her main areas of interest and expertise are clinical psychiatry, clinical medicine, patient safety and quality and the impact of electronic technology on patient care and education. Kate Fultz Hollis is a research associate and instructor in the Department of Biomedical Informatics and Clinical Epidemiology at Oregon Health and Science University. She specializes in clinical research informatics and medical research data access. Peilin Jia is an assistant professor in Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston. She co-directs the Bioinformatics and Systems Medicine Laboratory. Liang Ma is a bioinformatics postdoctoral fellow in the Bioinformatics and Systems Medicine Laboratory (BSML), Center for Precision Health, School of Biomedical Informatics, the University of Texas Health Science Center at Houston. Radhakrishnan Nagarajan is an associate professor of biomedical informatics in the College of Medicine, University of Kentucky. 
His research involves developing novel analytics for knowledge discovery from heterogeneous molecular and health-care data.
Gopalkumar Rakesh is a physician-scientist in training with the Department of Psychiatry at Duke University Medical Center. His main areas of interest and expertise are clinical psychiatry, brain stimulation and big data analytics.
Vignesh Subbian is an assistant professor in the Department of Biomedical Engineering and the Department of Systems and Industrial Engineering at the University of Arizona.
Shyam Visweswaran is an associate professor of biomedical informatics and the Intelligent Systems Program at the University of Pittsburgh. He is the director of Clinical Informatics for the Department of Biomedical Informatics, the director of the Center for Clinical Research Informatics and the director of the Biomedical Informatics Core of the University of Pittsburgh Clinical and Translational Science Institute.
Zhongming Zhao is the chair and professor for Precision Health and director of Center for Precision Health, School of Biomedical Informatics, the University of Texas Health Science Center at Houston. He directs the Bioinformatics and Systems Medicine Laboratory.
Leon Rozenblit is the Founder and CEO of Prometheus Research, LLC, a research informatics services and technology company with a concentration in mental health informatics. He served as the informatics lead on a number of large-scale mental health research initiatives including the Simons Simplex Collection and the Autism Biomarker Consortium for Clinical Trials.

References
1 Roehrig C. Mental disorders top the list of the most costly conditions in the United States: $201 billion. Health Aff 2016;35(6):1130–5.
2 Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455(7216):1061–8.
3 Kapur S, Phillips AG, Insel TR. Why has it taken so long for biological psychiatry to develop clinical tests and what to do about it? Mol Psychiatry 2012;17(12):1174.
4 Gordon J. RDoC: Outcomes to Causes and Back. Bethesda, MD: NIH, 2017.
5 Reardon S. US mental-health agency’s push for basic research has slashed support for clinical trials. Nature 2017;546:339.
6 Insel T, Cuthbert B, Garvey M, et al. Research domain criteria (RDoC): toward a new classification framework for research on mental disorders. Am J Psychiatry 2010;167(7):748–51. http://dx.doi.org/10.1176/appi.ajp.2010.09091379
7 Johnson EC, Border R, Melroy-Greif WE, et al. No evidence that schizophrenia candidate genes are more associated with schizophrenia than noncandidate genes. Biol Psychiatry 2017;82:702–708. http://dx.doi.org/10.1016/j.biopsych.2017.06.033
8 Guloksuz S, van Os J. The slow death of the concept of schizophrenia and the painful birth of the psychosis spectrum. Psychol Med 2017;1–16.
9 McCarthy-Jones S. The concept of schizophrenia is coming to an end – here’s why. The Conversation 2017;2017.
10 Westphal J. The Mind–Body Problem. Cambridge, MA: MIT Press; 2016.
11 Cuthbert BN. Research domain criteria: toward future psychiatric nosologies. Dialogues Clin Neurosci 2015;17(1):89–97.
12 Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data 2016;3:160018.
http://dx.doi.org/10.1038/sdata.2016.18
13 Longo DL, Drazen JM. More on data sharing. N Engl J Med 2016;374(19):1896–7.
14 Greene CS, Garmire LX, Gilbert JA, et al. Celebrating parasites. Nat Genet 2017;49(4):483–4. http://dx.doi.org/10.1038/ng.3830
15 Jagodnik KM, Koplev S, Jenkins SL, et al. Developing a framework for digital objects in the Big Data to Knowledge (BD2K) commons: report from the Commons Framework Pilots workshop. J Biomed Inform 2017;71:49–57. http://dx.doi.org/10.1016/j.jbi.2017.05.006
16 Biomarkers Definitions Working Group. Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther 2001;69(3):89–95. http://dx.doi.org/10.1067/mcp.2001.113989
17 Weickert CS, Weickert TW, Pillai A, et al. Biomarkers in schizophrenia: a brief conceptual consideration. Dis Markers 2013;35(1):3–9.
18 Davis J, Maes M, Andreazza A, et al. Towards a classification of biomarkers of neuropsychiatric disease: from encompass to compass. Mol Psychiatry 2015;20(2):152–3. http://dx.doi.org/10.1038/mp.2014.139
19 Arranz MJ, Munro J, Sham P, et al. Meta-analysis of studies on genetic variation in 5-HT2A receptors and clozapine response. Schizophr Res 1998;32(2):93–9. http://dx.doi.org/10.1016/S0920-9964(98)00032-2
20 Kaddurah-Daouk R, McEvoy J, Baillie R, et al. Impaired plasmalogens in patients with schizophrenia. Psychiatry Res 2012;198(3):347–52.
http://dx.doi.org/10.1016/j.psychres.2012.02.019
21 Stevenson JM, Reilly JL, Harris MS, et al. Antipsychotic pharmacogenomics in first episode psychosis: a role for glutamate genes. Transl Psychiatry 2016;6(2):e739.
22 Yao JK, Condray R, Dougherty GG Jr, et al. Associations between purine metabolites and clinical symptoms in schizophrenia. PLoS One 2012;7(8):e42165.
23 Czerwensky F, Leucht S, Steimer W. MC4R rs489693: a clinical risk factor for second generation antipsychotic-related weight gain? Int J Neuropsychopharmacol 2013;16(9):2103.
24 McEvoy J, Baillie RA, Zhu H, et al. Lipidomics reveals early metabolic changes in subjects with schizophrenia: effects of atypical antipsychotics. PLoS One 2013;8(7):e68717.
25 Ghosh D, Poisson LM. “Omics” data and levels of evidence for biomarker discovery. Genomics 2009;93(1):13–6.
26 McDermott JE, Wang J, Mitchell H, et al. Challenges in biomarker discovery: combining expert insights with statistical analysis of complex omics data. Expert Opin Med Diagn 2013;7(1):37–51. http://dx.doi.org/10.1517/17530059.2012.718329
27 UC San Diego Center for Functional MRI. Structural MRI imaging. 2017. http://fmri.ucsd.edu/Howto/3T/structure.html (10 September 2017, date last accessed).
28 Jia P, Han G, Zhao J, et al. SZGR 2.0: a one-stop shop of schizophrenia candidate genes. Nucleic Acids Res 2017;45(D1):D915–24.
29 Van Essen DC, Ugurbil K, Auerbach E, et al. The Human Connectome Project: a data acquisition perspective. Neuroimage 2012;62(4):2222–31.
http://dx.doi.org/10.1016/j.neuroimage.2012.02.018
30 Pedrosa E, Shah A, Tenore C, et al. β-catenin promoter ChIP-chip reveals potential schizophrenia and bipolar disorder gene network. J Neurogenet 2010;24(4):182–93.
31 Allen Institute. https://www.alleninstitute.org (9 September 2017, date last accessed).
32 Sullivan PF. The psychiatric GWAS consortium: big science comes to psychiatry. Neuron 2010;68(2):182–6. http://dx.doi.org/10.1016/j.neuron.2010.10.003
33 Akbarian S, Liu C, Knowles JA, et al. The PsychENCODE project. Nat Neurosci 2015;18(12):1707–12. http://dx.doi.org/10.1038/nn.4156
34 Ohno-Machado L, Sansone SA, Alter G, et al. Finding useful data across multiple biomedical data repositories using DataMed. Nat Genet 2017;49(6):816–9. http://dx.doi.org/10.1038/ng.3864
35 Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002;30(1):207–10. http://dx.doi.org/10.1093/nar/30.1.207
36 Omberg L, Ellrott K, Yuan Y, et al. Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas. Nat Genet 2013;45(10):1121–6. http://dx.doi.org/10.1038/ng.2761
37 Krumholz HM, Waldstreicher J. The Yale open data access (YODA) project–a mechanism for data sharing. N Engl J Med 2016;375(5):403–5. http://dx.doi.org/10.1056/NEJMp1607342
38 Karczewski KJ, Weisburd B, Thomas B, et al. The ExAC browser: displaying reference data information from over 60 000 exomes.
Nucleic Acids Res 2017;45(D1):D840–5.
39 Lonsdale J, Thomas J, Salvatore M, et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013;45(6):580–5.
40 Mailman MD, Feolo M, Jin Y, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 2007;39(10):1181–6.
41 Lappalainen I, Almeida-King J, Kumanduri V, et al. The European Genome-phenome archive of human data consented for biomedical research. Nat Genet 2015;47(7):692–5. http://dx.doi.org/10.1038/ng.3312
42 Frazier JA, Hodge SM, Breeze JL, et al. Diagnostic and sex effects on limbic volumes in early-onset bipolar disorder and schizophrenia. Schizophr Bull 2007;34(1):37–46. http://dx.doi.org/10.1093/schbul/sbm120
43 Jia P, Zhao Z. Network-assisted analysis to prioritize GWAS results: principles, methods and perspectives. Hum Genet 2014;133(2):125–38. http://dx.doi.org/10.1007/s00439-013-1377-1
44 Allen NC, Bagade S, McQueen MB, et al. Systematic meta-analyses and field synopsis of genetic association studies in schizophrenia: the SzGene database. Nat Genet 2008;40(7):827–34. http://dx.doi.org/10.1038/ng.171
45 Ambite JL, Tallis M, Alpert K, et al. SchizConnect: virtual data integration in neuroimaging. Data Integr Life Sci 2015;9162:37–51.
46 Poldrack RA, Gorgolewski KJ. OpenfMRI: open sharing of task fMRI data. Neuroimage 2017;144(Pt B):259–61.
47 Kennedy DN, Haselgrove C, Riehl J, et al. The NITRC image repository.
Neuroimage 2016;124(Pt B):1069–73.
48 Repovš G, Barch DM. Working memory related brain network connectivity in individuals with schizophrenia and their siblings. Front Hum Neurosci 2012;6:137.
49 Dolgin E. This is your brain online: the Functional Connectomes Project. Nat Med 2010;16(4):351. http://dx.doi.org/10.1038/nm0410-351b
50 Calhoun VD, Sui J, Kiehl K, et al. Exploring the psychosis functional connectome: aberrant intrinsic networks in schizophrenia and bipolar disorder. Front Psychiatry 2011;2:75.
51 Hanlon FM, Houck JM, Pyeatt CJ, et al. Bilateral hippocampal dysfunction in schizophrenia. Neuroimage 2011;58(4):1158–68. http://dx.doi.org/10.1016/j.neuroimage.2011.06.091
52 Mayer AR, Ruhl D, Merideth F, et al. Functional imaging of the hemodynamic sensory gating response in schizophrenia. Hum Brain Mapp 2013;34(9):2302–12. http://dx.doi.org/10.1002/hbm.22065
53 Anderson A, Cohen MS. Decreased small-world functional network connectivity and clustering across resting state networks in schizophrenia: an fMRI classification tutorial. Front Hum Neurosci 2013;7:520.
54 Wu Y, Yao YG, Luo XJ. SZDB: a database for schizophrenia genetic research. Schizophr Bull 2017;43(2):459–71.
55 Jia P, Sun J, Guo AY, et al. SZGR: a comprehensive schizophrenia gene resource. Mol Psychiatry 2010;15(5):453–62. http://dx.doi.org/10.1038/mp.2009.93
56 Torkamani A, Dean B, Schork NJ, et al.
Coexpression network analysis of neural tissue reveals perturbations in developmental processes in schizophrenia. Genome Res 2010;20(4):403–12. http://dx.doi.org/10.1101/gr.101956.109
57 Pidsley R, Viana J, Hannon E, et al. Methylomic profiling of human brain tissue supports a neurodevelopmental origin for schizophrenia. Genome Biol 2014;15(10):483. http://dx.doi.org/10.1186/s13059-014-0483-2
58 O'Dushlaine C, Kenny E, Heron E, et al. Molecular pathways involved in neuronal cell adhesion and membrane scaffolding contribute to schizophrenia and bipolar disorder susceptibility. Mol Psychiatry 2011;16(3):286–92.
59 Jia P, Wang L, Fanous AH, et al. Network-assisted investigation of combined causal signals from genome-wide association studies in schizophrenia. PLoS Comput Biol 2012;8(7):e1002587.
60 Guo AY, Sun J, Jia P, et al. A novel microRNA and transcription factor mediated regulatory network in schizophrenia. BMC Syst Biol 2010;4:10. http://dx.doi.org/10.1186/1752-0509-4-10
61 Prabakaran S, Swatton JE, Ryan MM, et al. Mitochondrial dysfunction in schizophrenia: evidence for compromised brain metabolism and oxidative stress. Mol Psychiatry 2004;9(7):684–97, 643.
62 Ng B, White CC, Klein HU, et al. An xQTL map integrates the genetic architecture of the human brain's transcriptome and epigenome. Nat Neurosci 2017;20(10):1418.
63 Sullivan PF, Lin D, Tzeng JY, et al. Genomewide association for schizophrenia in the CATIE study: results of stage 1. Mol Psychiatry 2008;13(6):570–84.
http://dx.doi.org/10.1038/mp.2008.25
64 International Schizophrenia Consortium, Purcell SM, Wray NR, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 2009;460(7256):748–52.
65 Shi J, Levinson DF, Duan J, et al. Common variants on chromosome 6p22.1 are associated with schizophrenia. Nature 2009;460(7256):753–7.
66 Stefansson H, Ophoff RA, Steinberg S, et al. Common variants conferring risk of schizophrenia. Nature 2009;460(7256):744–7.
67 Network and Pathway Analysis Subgroup of Psychiatric Genomics Consortium. Psychiatric genome-wide association study analyses implicate neuronal, immune and histone pathways. Nat Neurosci 2015;18(2):199–209. http://dx.doi.org/10.1038/nn.3922
68 Wang Q, Yu H, Zhao Z, et al. EW_dmGWAS: edge-weighted dense module search for genome-wide association studies and gene expression profiles. Bioinformatics 2015;31(15):2591–4. http://dx.doi.org/10.1093/bioinformatics/btv150
69 Gilman SR, Iossifov I, Levy D, et al. Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses. Neuron 2011;70(5):898–907. http://dx.doi.org/10.1016/j.neuron.2011.05.021
70 Jia P, Zheng S, Long J, et al. dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics 2011;27(1):95–102. http://dx.doi.org/10.1093/bioinformatics/btq615
71 Richards AL, Jones L, Moskvina V, et al.
Schizophrenia susceptibility alleles are enriched for alleles that affect gene expression in adult human brain. Mol Psychiatry 2012;17(2):193–201. http://dx.doi.org/10.1038/mp.2011.11
72 Duan F, Duitama J, Al Seesi S, et al. Genomic and bioinformatic profiling of mutational neoepitopes reveals new rules to predict anticancer immunogenicity. J Exp Med 2014;211(11):2231–48. http://dx.doi.org/10.1084/jem.20141308
73 FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest AR, Kawaji H, et al. A promoter-level mammalian expression atlas. Nature 2014;507(7493):462–70. http://dx.doi.org/10.1038/nature13182
74 Cross-Disorder Group of the Psychiatric Genomics Consortium, Lee SH, Ripke S, et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat Genet 2013;45(9):984–94.
75 Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 2014;511(7510):421–7.
76 Pardiñas AF, Holmans P, Pocklington AJ, et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and maintained by background selection. bioRxiv 2016. doi: 10.1101/068593.
77 Kim Y, Xia K, Tao R, et al. A meta-analysis of gene expression quantitative trait loci in brain. Transl Psychiatry 2014;4(10):e459.
78 Colantuoni C, Lipska BK, Ye T, et al. Temporal dynamics and genetic control of transcription in the human prefrontal cortex. Nature 2011;478(7370):519–23.
http://dx.doi.org/10.1038/nature10524
79 Aguet F, Brown AA, Castel S, et al. Local genetic effects on gene expression across 44 human tissues. bioRxiv 2016. doi: 10.1101/074450.
80 Zhao Z, Xu J, Chen J, et al. Transcriptome sequencing and genome-wide association analyses reveal lysosomal function and actin cytoskeleton remodeling in schizophrenia and bipolar disorder. Mol Psychiatry 2015;20(5):563–72. http://dx.doi.org/10.1038/mp.2014.82
81 Tao R, Davis K, Li NC, et al. GAD1 alternative transcripts and DNA methylation in human prefrontal cortex and hippocampus in brain development, schizophrenia. Mol Psychiatry 2017, in press.
82 Tao R, Cousijn H, Jaffe AE, et al. Expression of ZNF804A in human brain and alterations in schizophrenia, bipolar disorder, and major depressive disorder: a novel transcript fetally regulated by the psychosis risk variant rs1344706. JAMA Psychiatry 2014;71(10):1112–20. http://dx.doi.org/10.1001/jamapsychiatry.2014.1079
83 Kunii Y, Hyde TM, Ye T, et al. Revisiting DARPP-32 in postmortem human brain: changes in schizophrenia and bipolar disorder and genetic associations with t-DARPP-32 expression. Mol Psychiatry 2014;19(2):192–9. http://dx.doi.org/10.1038/mp.2012.174
84 Bigos KL, Mattay VS, Callicott JH, et al. Genetic variation in CACNA1C affects brain circuitries related to mental illness. Arch Gen Psychiatry 2010;67(9):939–45. http://dx.doi.org/10.1001/archgenpsychiatry.2010.96
85 Li M, Jaffe AE, Straub RE, et al. A human-specific AS3MT isoform and BORCS7 are molecular risk factors in the 10q24.32 schizophrenia-associated locus.
Nat Med 2016;22(6):649–56.
86 Sekar A, Bialas AR, de Rivera H, et al. Schizophrenia risk from complex variation of complement component 4. Nature 2016;530(7589):177–83. http://dx.doi.org/10.1038/nature16549
87 Fromer M, Roussos P, Sieberts SK, et al. Gene expression elucidates functional impact of polygenic risk for schizophrenia. Nat Neurosci 2016;19(11):1442–53. http://dx.doi.org/10.1038/nn.4399
88 Takata A, Ionita-Laza I, Gogos JA, et al. De novo synonymous mutations in regulatory elements contribute to the genetic etiology of autism and schizophrenia. Neuron 2016;89(5):940–7. http://dx.doi.org/10.1016/j.neuron.2016.02.024
89 Zhu Z, Zhang F, Hu H, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet 2016;48(5):481–7. http://dx.doi.org/10.1038/ng.3538
90 Weickert CS, Fung SJ, Catts VS, et al. Molecular evidence of N-methyl-D-aspartate receptor hypofunction in schizophrenia. Mol Psychiatry 2013;18(11):1185–92. http://dx.doi.org/10.1038/mp.2012.137
91 Juraeva D, Haenisch B, Zapatka M, et al. Integrated pathway-based approach identifies association between genomic regions at CTCF and CACNB2 and schizophrenia. PLoS Genet 2014;10(6):e1004345.
92 Guan L, Wang Q, Wang L, et al. Common variants on 17q25 and gene-gene interactions conferring risk of schizophrenia in Han Chinese population and regulating gene expressions in human brain. Mol Psychiatry 2016;21(9):1244–50.
http://dx.doi.org/10.1038/mp.2015.204
93 Duan J, Shi J, Fiorentino A, et al. A rare functional noncoding variant at the GWAS-implicated MIR137/MIR2682 locus might confer risk to schizophrenia and bipolar disorder. Am J Hum Genet 2014;95(6):744–53. http://dx.doi.org/10.1016/j.ajhg.2014.11.001
94 Cross-Disorder Group of the Psychiatric Genomics Consortium. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet 2013;381(9875):1371–9.
95 Thompson PM, Stein JL, Medland SE, et al. The ENIGMA Consortium: large-scale collaborative analyses of neuroimaging and genetic data. Brain Imaging Behav 2014;8(2):153–82.
96 Novak NM, Stein JL, Medland SE, et al. EnigmaVis: online interactive visualization of genome-wide association studies of the Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) consortium. Twin Res Hum Genet 2012;15(3):414–8. http://dx.doi.org/10.1017/thg.2012.17
97 van Erp TG, Hibar DP, Rasmussen JM, et al. Subcortical brain volume abnormalities in 2028 individuals with schizophrenia and 2540 healthy controls via the ENIGMA consortium. Mol Psychiatry 2016;21(4):585. http://dx.doi.org/10.1038/mp.2015.118
98 Jaffe AE, Gao Y, Deep-Soboslay A, et al. Mapping DNA methylation across development, genotype and schizophrenia in the human frontal cortex. Nat Neurosci 2015;19(1):40–7. http://dx.doi.org/10.1038/nn.4181
99 Montano C, Taub MA, Jaffe A, et al. Association of DNA Methylation Differences With Schizophrenia in an Epigenome-Wide Association Study. JAMA Psychiatry 2016;73(5):506–14.
http://dx.doi.org/10.1001/jamapsychiatry.2016.0144
100 Lu AT, Hannon E, Levine ME, et al. Genetic architecture of epigenetic and neuronal ageing rates in human brain regions. Nat Commun 2017;8:15353. http://dx.doi.org/10.1038/ncomms15353
101 Hannon E, Spiers H, Viana J, et al. Methylation QTLs in the developing brain and their enrichment in schizophrenia risk loci. Nat Neurosci 2016;19(1):48–54.
102 Gagliano SA, Ptak C, Mak DYF, et al. Allele-Skewed DNA modification in the brain: relevance to a schizophrenia GWAS. Am J Hum Genet 2016;98(5):956–62. http://dx.doi.org/10.1016/j.ajhg.2016.03.006
103 Fullard JF, Giambartolomei C, Hauberg ME, et al. Open chromatin profiling of human postmortem brain infers functional roles for non-coding schizophrenia loci. Hum Mol Genet 2017;26(10):1942–51. http://dx.doi.org/10.1093/hmg/ddx103
104 Li Z, Chen J, Yu H, et al. Genome-wide association analysis identifies 30 new susceptibility loci for schizophrenia. Nat Genet 2017;49:1576–83. http://dx.doi.org/10.1038/ng.3973
105 Hall D, Huerta MF, McAuliffe MJ, et al. Sharing heterogeneous data: the national database for autism research. Neuroinformatics 2012;10(4):331–9. http://dx.doi.org/10.1007/s12021-012-9151-4
106 Fischbach GD, Lord C. The Simons simplex collection: a resource for identification of autism genetic risk factors. Neuron 2010;68(2):192–5. http://dx.doi.org/10.1016/j.neuron.2010.10.006

© The Author 2017. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Assessing the heterogeneity of in silico plasmid predictions based on whole-genome-sequenced clinical isolates
Laczny, Cedric C; Galata, Valentina; Plum, Achim; Posch, Andreas E; Keller, Andreas
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx162
pmid: 29220507
Abstract
High-throughput next-generation shotgun sequencing of pathogenic bacteria is growing in clinical relevance, especially for chromosomal DNA-based taxonomic identification and for antibiotic resistance prediction. Genetic exchange is facilitated for extrachromosomal DNA, e.g. plasmid-borne antibiotic resistance genes. Consequently, accurate identification of plasmids from whole-genome sequencing (WGS) data remains one of the major challenges for sequencing-based precision medicine in infectious diseases. Here, we assess the heterogeneity of four state-of-the-art tools (cBar, PlasmidFinder, plasmidSPAdes and Recycler) for the in silico prediction of plasmid-derived sequences from WGS data. Heterogeneity, sensitivity and precision were evaluated by reference-independent and reference-dependent benchmarking using 846 Gram-negative clinical isolates. Interestingly, the majority of predicted sequences were tool-specific, resulting in a pronounced heterogeneity across tools for the reference-independent assessment. In the reference-dependent assessment, sensitivity and precision values were found to vary substantially between tools and across taxa, with cBar exhibiting the highest median sensitivity (87.45%) but a low median precision (27.05%). Furthermore, integrating the individual tools into an ensemble approach increased sensitivity (95.55%) while reducing precision (25.62%). cBar and plasmidSPAdes exhibited the strongest concordance with respect to identified antibiotic resistance factors. Moreover, false-positive plasmid predictions typically contained only a few antibiotic resistance factors. In conclusion, while high degrees of heterogeneity and variation in sensitivity and precision were observed across the different tools and taxa, existing tools are valuable for investigating the plasmid-borne resistome.
Nevertheless, additional studies on representative clinical data sets will be necessary to translate in silico plasmid prediction approaches from research to clinical application.
Keywords: bacteria, plasmids, prediction, next-generation sequencing

Introduction
Bacterial plasmids play important roles in the emergence and spread of antibiotic resistance [1]. These genetic elements vary in size, are mostly circular, can replicate independently and often encode resistance- and/or virulence-related genes [1–4]. Moreover, the dissemination of pathogens is facilitated by inter-species plasmid exchange [5]. A prominent example is the plasmid-encoded mcr-1 gene, which confers colistin resistance and was originally reported by Y. Liu and Y. Wang in Enterobacteriaceae samples collected in China [6]. The mcr-1 gene was subsequently found in bacteria collected in Europe, Laos, Thailand and Nigeria [7]. Therefore, plasmid detection and classification are crucial steps for the identification and characterization of plasmid-mediated phenotypes.
Polymerase chain reaction-based replicon typing (based on elements of the replication machinery) [8, 9] and MOB typing (based on conserved motifs of the relaxase gene) are frequently used to detect and classify plasmids [10, 11]. Limitations of these approaches include that the available typing schemes do not cover all plasmids and that the complete genetic repertoire of the plasmid(s) remains unknown, as these approaches focus on a specific set of genes [12]. In contrast, whole-genome sequencing (WGS) indiscriminately resolves the chromosomal and extrachromosomal genetic complements. Subsequent annotation of de novo assembled sequences enables the characterization of chromosome- and plasmid-derived functional potential in addition to taxonomic identification of the studied organism.
In a detailed review of plasmid classification within the context of antibiotic resistance epidemiology, Orlek et al.
[12] describe the potential of the in silico analysis of WGS data to address the limitations of replicon and MOB typing. Furthermore, Arredondo-Alonso et al. [13] reviewed computational solutions for automated plasmid prediction on a set of 42 reference genomes. The existing in silico approaches can be divided into three main categories: marker-gene search-based approaches, e.g. searching for replicons in the sequences (PlasmidFinder [14]); approaches based on genomic signatures, e.g. k-mer frequencies, of plasmid-derived and chromosomal DNA (cBar [15]); and approaches identifying plasmids based on k-mer coverage differences and/or circular paths in the assembly graph (plasmidSPAdes [16], Recycler [17]). However, repetitive regions and/or genes found on multiple genomic units (chromosomes and plasmids) challenge the de novo assembly of short-read sequencing data, resulting in fragmented assemblies and mis-assemblies [17]. In accordance with studies reporting on the improved contiguity of genome assemblies based on or augmented by long reads [18–21], Arredondo-Alonso et al. conclude that long-read sequencing data are expected to greatly assist in the resolution of chromosomal and extrachromosomal sequences. Undoubtedly, full-length genomic resolution is ultimately desirable, but despite advances in long-read sequencing, short-read-based approaches currently dominate the WGS space and can provide crucial diagnostic information. Therefore, the analysis of a cohort of clinical samples will allow improved assessment of the variance in the predictions not only across different taxa but also between the individual tools.
Here, we analyzed the short-read WGS data of 846 Gram-negative, clinical bacterial isolates using four existing in silico plasmid prediction tools (cBar, PlasmidFinder, plasmidSPAdes and Recycler) and an ensemble approach that integrates the individual tools’ predictions.
The heterogeneity between the individual tools was first assessed using reference-independent approaches. Subsequently, an ad hoc ground truth was defined. This was necessary as the herein included isolates were patient-derived and the closest reference genome needed to be identified first. De novo assembled contigs were then aligned against the respective reference chromosome(s) and plasmid(s) to identify plasmid-positive samples. This information was used to evaluate the sensitivity and precision of the individual tools and the ensemble approach. Furthermore, the differences in k-mer coverage of chromosome and plasmid sequences in plasmid-positive samples were compared. Finally, we analyzed the concordance between the predictions and the ground truth with respect to plasmid-borne antibiotic resistance genes. Materials and methods WGS and preprocessing Batches of 96 samples were sequenced per lane for paired-end sequencing (2 × 100 bp) on Illumina Hiseq2000 or Hiseq2500 sequencers using TruSeq PE Cluster v3 and TruSeq SBS v3 sequencing chemistry (Illumina) as previously described in detail [22]. A total of 2 705 458 738 raw reads and a median of 2 987 123 reads per sample were generated. Trimmomatic version 0.35 was used with the command line parameters: ‘PE ILLUMINACLIP:NexteraPE-PE.fa:1:50:30 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36’ [23]. Only the trimmed, paired-end reads were used herein, if not stated otherwise. De novo assembly SPAdes version 3.10.1 [24] was used to assemble the trimmed, paired-end reads with the following parameters: ‘--careful -t 6 -k 21,33,55’. Predicting plasmid sequences plasmidSPAdes: ‘plasmidspades.py’ from SPAdes version 3.10.1 [16, 24] was used to assemble the trimmed, paired-end reads to identify candidate plasmid sequences, with the following parameters: ‘--careful -t 6 -k 21,33,55’. 
PlasmidFinder: Sequences for the Enterobacteriaceae were downloaded from https://bitbucket.org/genomicepidemiology/plasmidfinder_db/src (commit ID: d5a49e9b01b0) [14]. A BLASTN database was built using ‘makeblastdb’ of ncbi-blast-2.6.0+. cBar: Version 1.2 was used [15]. Recycler: Version 0.62 was used [17]. The required BAM file was generated using bwa-0.7.15. Recycler’s ‘make_fasta_from_fastg.py’ was used to generate the FASTA file (from the ‘assembly_graph.fastg’ file generated by SPAdes) required to build the bwa index [25, 26]. The trimmed paired-end reads were aligned against the resulting index with ‘bwa mem’, and the SAM output was directly converted to BAM format using ‘samtools view -buS - | samtools view -bF 0x0800 - | samtools sort -’ (samtools version 0.1.19-96b5f2294a) [27]. The resulting BAM file was indexed using ‘samtools index’. Finally, Recycler was run with the following options: ‘-g assembly_graph.fastg -k 55 -b assembly_graph.bam -i True’. Ensemble approach: To increase the sensitivity, we implemented a straightforward ensemble approach. The candidate plasmid sequences, as predicted by cBar, plasmidSPAdes, PlasmidFinder and Recycler, were pooled and clustered using ‘cd-hit-est’ from CD-HIT version 4.6.6 and the default parameters [28, 29]. Accompanying code can be found at https://github.com/claczny/2017_plasmid_prediction_review. Pairwise correlation of the individual tools’ predictions Sourmash [30] version 2.0.0a1 was used to compute signatures for each tool’s predicted sequences (‘compute -k 31 --scaled 50 --track-abundance’). As plasmid-derived sequences are typically much shorter than chromosomal sequences, a small scaling factor was chosen accordingly. Subsequently, for all tools, all predictions within each tool were compared against each other using sourmash’s ‘compare’ function. 
The resulting similarity matrices per tool were converted to distance matrices (1 − similarity value), and all pairwise tool combinations were correlated using the ‘mantel’ function (‘method=“pearson”, permutations = 999, parallel = 30’) in the vegan R package [31] for samples occurring in both of the predictions of the respective tool pair. The ‘superheat’ function in the superheat R package was used for plotting. Complete reference genomes Nucleotide FASTA files of complete bacterial genomes were downloaded from the NCBI RefSeq database (ncbi-genome-download, https://github.com/kblin/ncbi-genome-download, version 0.2.2, parameters: --section refseq --format fasta --assembly-level complete --human-readable --parallel 5 --retries 3 --verbose bacteria, on 24 May 2017). In total, 6901 genomes were retrieved; sequences containing the word ‘plasmid’ in their FASTA header were considered as plasmids, resulting in 5611 plasmid and 7415 non-plasmid sequences in total. Defining the ad hoc ground truth data Because dedicated, finished genomes were lacking for the present clinical isolates (i.e. patient-derived, in contrast to reference material), sourmash version 2.0.0a1 was used to identify the most similar, complete reference genome. Specifically, signatures were first computed for each complete reference genome and the contigs of each successful de novo assembly (‘compute -k 31 --scaled 2000 --track-abundance -o SEQ.sig SEQ.fa’). The reference genomes’ signatures were indexed (‘index REFIDXPREFIX -k 31 --traverse-directory PATH_TO_REF_SIGNATURES’). Subsequently, for each de novo assembly, the index was searched (‘search -k 31 ASSEMBLY.sig REFIDXPREFIX.sbt.json -o ASSEMBLY.best_only_hits.txt --best-only’), and the top hit returned by sourmash was used as the respective reference. 
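The pairwise correlation step described earlier in this section (similarity converted to distance, followed by a Mantel test) can be illustrated with a small, self-contained sketch. This is not the vegan implementation used in the study; it is a plain-Python approximation of the same idea, and the function names (`to_distance`, `mantel`), the fixed random seed and the toy matrices are assumptions made purely for illustration. The p-value follows the usual permutation-test convention of counting permuted statistics at least as large as the observed one.

```python
import random
from statistics import mean


def to_distance(sim):
    """Convert a square similarity matrix to distances (1 - similarity)."""
    return [[1.0 - v for v in row] for row in sim]


def lower_triangle(mat, order):
    """Flatten the strict lower triangle after reordering rows and columns."""
    return [mat[order[i]][order[j]] for i in range(len(order)) for j in range(i)]


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def mantel(d1, d2, permutations=999, seed=0):
    """Return (observed r, permutation p-value) for two distance matrices.

    Rows and columns of d2 are permuted jointly; d1 is left unchanged.
    """
    n = len(d1)
    identity = list(range(n))
    x = lower_triangle(d1, identity)
    r_obs = pearson(x, lower_triangle(d2, identity))
    rng = random.Random(seed)  # hypothetical fixed seed for reproducibility
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if pearson(x, lower_triangle(d2, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Toy similarity matrices for four samples shared by two tools (invented data).
sim_tool_a = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
sim_tool_b = [
    [1.0, 0.7, 0.3, 0.2],
    [0.7, 1.0, 0.2, 0.3],
    [0.3, 0.2, 1.0, 0.6],
    [0.2, 0.3, 0.6, 1.0],
]
r, p = mantel(to_distance(sim_tool_a), to_distance(sim_tool_b))
print(f"Mantel r = {r:.3f}, p = {p:.3f}")
```

Only the samples present in both tools' predictions enter the matrices, which matches the restriction stated in the text; with very few shared samples the number of distinct permutations is small and the p-value becomes coarse.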
For each isolate-reference pair, the isolate’s de novo contigs were aligned against the reference to identify plasmid or chromosome sequences using BLASTN (ncbi-blast-2.6.0+, format: ‘6 std qcovs qcovhsp qlen slen’) [32]. For each query sequence, the subject (reference sequence) with the longest alignment length and highest query-coverage-by-subject was selected. If multiple hits existed, the hit with the highest bitscore was chosen. If multiple hits remained, the first subject representing a plasmid was chosen. Thus, each de novo contig was assigned a label whether it represents a plasmid, and contigs not matching a sequence of the closest reference genome were considered ‘unclassified’. Evaluating the predictions Reference-independent analysis of heterogeneity: Sequences were clustered as described for the ‘Ensemble approach’. The ‘clstr2txt.pl’ script from CD-HIT version 4.6.6 was used to reformat the cluster output. The reformatted output was used to compute the fraction of the cumulative length of the cluster centroids that was represented by one, two, three or all four tools. It should be noted that the length of the cluster centroid was used here as a proxy. However, the actual shared fraction could be lower if cluster members are of shorter length than the cluster centroid. Reference sequence coverage: PlasmidSPAdes and Recycler generate their own set of contigs, whereas cBar and PlasmidFinder directly identify candidate plasmid sequences on the de novo assembled contigs. Thus, to stay consistent between all tools, the predicted sequences were linked with the ad hoc ground truth sequences by using the former as queries and the latter as the subjects in BLASTN (ncbi-blast-2.6.0+, format: ‘6 std qcovs qcovhsp qlen slen’). Similar to the approach used for defining the ground truth data, for each query sequence, the subject (de novo contig) with the longest alignment length and highest query-coverage-by-subject was selected. 
If multiple hits existed, the hit with the highest bitscore was chosen. If multiple hits remained, the longest subject was chosen. Should multiple hits still remain, the first subject representing a plasmid was chosen. Unclassified sequences were ignored. Based on the resulting prediction-to-ground-truth mapping, the sensitivity and precision were computed using the following definitions:
$P$ = cumulative length of ground truth plasmid sequences
$TP = \sum \mathrm{length}(\mathrm{subject})$, if query was predicted plasmid and subject was ground truth plasmid
$FN = P - TP$
$N$ = cumulative length of ground truth chromosome sequences
$FP = \sum \mathrm{length}(\mathrm{subject})$, if query was predicted plasmid and subject was ground truth chromosome
$TN = N - FP$
$\mathrm{Sensitivity} = \frac{TP}{P}$
$\mathrm{Precision} = \frac{TP}{TP + FP}$
The following edge cases were considered and handled accordingly:
The sample contained no plasmids and the tool predicted no plasmids: P=0, TP=0, FN=0, FP=0, TN=N
The sample contained plasmids and the tool predicted no plasmids: TP=0, FN=P, FP=0, TN=N
The sample contained no plasmids and the tool predicted plasmids: P=0, TP=0, FN=0
Antibiotic resistance genes: Prokka version 1.11 was used to annotate the genes of the predicted and ground truth plasmid sequences [33]. Translated coding DNA sequences were searched against the ResFams core database using hmmsearch version 3.1b2 (‘--cut_tc --tblout’) [34, 35]. The counts of ResFams hits per-sample-and-tool were compared with the respective counts of the ground truth for samples common to the respective tool and the ground truth. The Spearman correlation was computed using the ‘cor’ function in R version 3.3.2 [36]. For each comparison, a linear model including confidence intervals was fitted using the ‘geom_smooth’ function from the ggplot2 package version 2.2.1 in R version 3.3.2 [37]. Results and discussion Cultured isolates of Gram-negative bacteria from 846 clinical samples were sequenced as described in Galata et al. 
[22], and de novo assemblies were successfully created for 844 samples. We evaluated the performance of four plasmid-prediction tools: cBar, PlasmidFinder, plasmidSPAdes and Recycler. Moreover, we integrated the individual tools into an ensemble approach by merging and clustering the predictions according to their nucleotide sequence identity to remove redundant sequences. In addition to evaluating the predictions using reference-independent as well as reference-dependent approaches, the concordance between the predictions and the ground truth with respect to plasmid-borne antibiotic resistance genes was analyzed. Reference-independent assessment of plasmid predictions In a first analysis, we were mainly interested in the heterogeneity of the predictions between the individual tools. We thus performed the predictions and compared them with each other. Interestingly, all tested tools substantially varied in their number of predicted plasmid-positive samples, i.e. samples predicted to contain at least one plasmid-derived sequence. CBar predicted all 844 samples to be plasmid-positive, while plasmidSPAdes, PlasmidFinder and Recycler predicted 766, 446 and 375 plasmid-positive samples, respectively. Moreover, the cumulative lengths of the predicted sequences per sample were found to vary markedly: cBar was found to have the largest cumulative lengths, while Recycler had the lowest (Figure 1). Figure 1 Cumulative lengths of the predicted plasmid sequences per tool. The y-axis uses a log10 scale. The median values are shown and the boxplots represent the median, two hinges, two whiskers and all outlier points individually. 
Moreover, the tools were tested for their pairwise correlations across the predictions. CBar and plasmidSPAdes were found to exhibit the highest correlation value (0.82), suggesting that these two approaches resulted in somewhat related predictions (Figure 2). In contrast, plasmidSPAdes and Recycler were found to have the lowest pairwise correlation (0.3). Based on the clustering of the individual tools’ predictions using the ensemble approach, the tools’ heterogeneity was furthermore evaluated with respect to the shared fraction of the cumulative plasmid lengths. The largest fraction was represented by sequences predicted by a single tool (Figure 3). Conversely, all four tools were infrequently found to show pronounced overlap in their prediction. Furthermore, strong variations were observed in the fraction of cumulative length predicted by one or by two tools. This indicates a distinct heterogeneity between the individual tools’ predictions. Figure 2 Correlation of the tools’ predictions. For each tool, a distance matrix with respect to the tool’s predictions was computed. Pairwise distance matrix correlation was computed and is shown in the heatmap. The color indicates the correlation degree and correlation values are shown in each cell. Figure 3 Fraction of cumulative lengths shared by the tested tools. The fraction of the cumulative length is shown on the x-axis, and the number of tools exhibiting overlap of the respective sequence(s) is shown on the y-axis. The lengths of the cluster centroids were taken as proxies. 
Points are jittered randomly vertically per tool for representation purposes. The boxplots represent the median, two hinges and two whiskers. Reference-dependent assessment of sensitivity and precision A complete reference genome could be identified for 818 of the 846 samples. The ad hoc definition of the ground truth was required because of the lack of dedicated, finished genomes, e.g. using complementary long-read sequencing data, for the present set of clinical isolates. The median cumulative lengths of chromosome contigs, plasmid contigs and unclassified contigs were 4 907 449, 114 954 and 27 733 bp, respectively (Supplementary Figure S1). Seven samples had >1 Mbp of unclassified contigs and were thus excluded from further analyses, resulting in median lengths of 514.0, 437.5 and 118.0 bp, for chromosome contigs, plasmid contigs and unclassified contigs, respectively (Supplementary Figure S2). A total of 347 samples were considered to be plasmid-positive according to the ad hoc ground truth and were subsequently used to compute the sensitivity and precision of the individual tools and of the ensemble approach (Supplementary Figure S3). CBar was found to be the most sensitive (median sensitivity: 87.45%) among the individual tools, followed by plasmidSPAdes (81.49%) and PlasmidFinder (36.47%) (Figure 4). Recycler’s predictions generally had overall low cumulative lengths (Figure 1), consequently resulting in extremely low sensitivity values (median sensitivity: 0.00%). 
Importantly, Recycler was designed to recover circular sequences, and the present results suggest that their number was minimal in our de novo assemblies. The ensemble approach resulted in a median sensitivity value of 95.55%. Figure 4 Prediction performances of the tested tools and the ensemble approach for plasmid-positive samples based on the ad hoc ground truth. Sensitivity (‘Sens’) and precision (‘Prec’) are shown. The median values are shown and the boxplots represent the median, two hinges and two whiskers. Resolving the prediction performances by genus revealed strong variations, both within and between the individual tools (Supplementary Figure S4). For example, while cBar was found to exhibit overall high sensitivity values, plasmid sequences of Acinetobacter spp. were less well detected. Moreover, the sensitivity of plasmidSPAdes varied strongly for the Citrobacter spp., Enterobacter spp. and Salmonella spp. samples. PlasmidFinder exhibited particularly low sensitivity for Acinetobacter spp., which is to be expected, as this genus is a member of the Moraxellaceae family and, thus, not covered by PlasmidFinder’s Enterobacteriaceae-specific database. The sensitivity of the ensemble approach was found to be on par or better compared with the individual tools. While cBar had the highest median sensitivity, its median precision (27.05%) was below the median precision of PlasmidFinder (100%) and plasmidSPAdes (52.70%), indicating that cBar frequently misclassifies chromosomal contigs as being plasmid-derived (false positives). The median precision of the ensemble approach was 25.62%. 
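The sensitivity and precision values reported above are length-based, following the definitions given in Materials and methods. A minimal sketch of that calculation is shown here. The function name, the input layout (one entry per distinct ground truth contig hit by a predicted plasmid sequence) and the choice to return `None` when a ratio is undefined are assumptions for illustration; the source does not state how repeated hits to the same contig or undefined ratios were handled.

```python
def length_based_metrics(hit_subjects, p_total, n_total):
    """Compute length-based confusion values, sensitivity and precision.

    hit_subjects: dict mapping a ground truth contig ID to a tuple
        (contig_length, label), for contigs hit by predicted plasmid
        sequences; label is 'plasmid', 'chromosome' or 'unclassified'.
        Assumption: each contig is counted once, however many
        predicted sequences map to it.
    p_total: cumulative length of ground truth plasmid sequences (P).
    n_total: cumulative length of ground truth chromosome sequences (N).
    """
    tp = sum(length for length, label in hit_subjects.values()
             if label == "plasmid")
    fp = sum(length for length, label in hit_subjects.values()
             if label == "chromosome")
    # Unclassified contigs contribute to neither TP nor FP (they are ignored).
    fn = p_total - tp
    tn = n_total - fp
    # Assumption: undefined ratios are reported as None rather than 0.
    sensitivity = tp / p_total if p_total > 0 else None
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "sensitivity": sensitivity, "precision": precision}


# Invented example: 100 kbp of true plasmid sequence, 5 Mbp of chromosome.
example_hits = {
    "contig_1": (80_000, "plasmid"),
    "contig_2": (20_000, "chromosome"),
    "contig_3": (5_000, "unclassified"),
}
metrics = length_based_metrics(example_hits, p_total=100_000, n_total=5_000_000)
print(metrics)
```

Because true and false positives are summed in base pairs, a single long chromosomal contig labeled as plasmid can lower precision far more than several short ones.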
Importantly, the ensemble approach included all the false-positive predictions of the individual tools, which explains the low precision. Similar to the sensitivity results resolved by genus, the precision of the individual tools varied substantially (Supplementary Figure S4). Notably, the highest median precisions were observed for Klebsiella spp. In contrast, the precision was extremely low for Acinetobacter spp., regardless of the approach being reference-dependent (cBar, PlasmidFinder) or reference-independent (plasmidSPAdes, Recycler). Hierarchical clustering of the individual tools and the ensemble approach with respect to their true-positive values revealed cBar and the ensemble approach to be the most similar, followed by plasmidSPAdes (Figure 5). Figure 5 Hierarchical clustering of individual tools according to their true-positive values. True-positive values represent the cumulative base pair length correctly covered by the individual tools and were scaled and centered before computing the hierarchical clustering. Differential coverage between chromosome and plasmid sequences Plasmids can independently replicate [3, 4] and thus can occur in different copy numbers than the bacterial chromosome(s). PlasmidSPAdes uses this information to identify assembly graph components with (substantially) differing coverage, considering these components as candidate plasmids. However, this approach is, by design, challenged by plasmids occurring in similar copy numbers as the chromosome(s) (false negatives), or by components within the graph that exhibit coverage differences despite representing chromosomal sequences (false positives), e.g. 
because of bacterial cells at different stages in the replication cycle [38]. To study how frequently plasmid sequences significantly differed in their copy numbers from the chromosome sequences, we analyzed the k-mer coverage of the de novo assembled contigs. Of the 811 isolates (818−7 samples with >1 Mbp of unclassified sequences), 28.11% (228 of 811) showed statistically significant results (alpha = 0.05; false discovery rate-adjusted: 185 of 811) when tested for unimodality of the k-mer coverage distributions (Supplementary Figure S5), suggesting that these distributions could be considered multimodal. However, only 31.70% (110 of 347) of the plasmid-positive samples were likely multimodal in their k-mer coverage distributions. Moreover, 61.10% (212 of 347) of the plasmid-positive samples significantly differed in their k-mer coverages of the plasmid sequences and chromosome sequences (Wilcoxon-Mann-Whitney-test, P < 0.05). It should be noted that plasmidSPAdes’ median sensitivity was higher (81.29%; Figure 4); yet, this was computed using the sequence coverage of plasmid sequences rather than the number of samples. Correctly predicted, long-assembled sequences will increase the true-positive value, thereby leading to higher sensitivity values. Antibiotic resistance factors encoded in plasmid sequences The in silico separation of genomic sequences into ‘chromosome-derived’ or ‘extrachromosome-derived’ has proven to be a challenging task as demonstrated herein as well as in [12, 13]. Nevertheless, the identification of candidate plasmid-derived sequences in fragmented assemblies is relevant. Specifically, the functional potential can thus be assessed for the candidates. To this end, antibiotic resistance genes included in ResFams were identified on the predicted and ground truth plasmid sequences of plasmid-positive samples. The number of ResFams hits was found to vary within and between the individual tools but also for the ground truth (Figure 6). 
PlasmidFinder and Recycler recovered few of the expected ResFams hits, which is in accordance with the reduced sensitivity observed herein (Figure 4). CBar and plasmidSPAdes were found to more closely represent the ground truth distribution of the ResFams hits. Only plasmidSPAdes exhibited a higher number of hits than found in the ground truth. These extra hits might represent chromosome-borne antibiotic resistance genes. As plasmidSPAdes uses coverage information for its predictions, it could be speculated that the respective chromosomal regions exhibited differential coverage to the remainder of the chromosome. While there are various potential reasons as to why this could occur, e.g. competitive advantage under antibiotic pressure and thus increased replication, the exact reason is currently unknown. Moreover, the ResFams hit counts were compared pairwise between the ground truth and the individual tools, and the respective Spearman correlations were computed (Figure 7). CBar and plasmidSPAdes were found to be the closest to represent the ground truth, with cBar exhibiting a higher correlation (0.68 versus 0.56), likely because of the increased variation toward low or high counts for plasmidSPAdes. Figure 6 ResFams hits counts of plasmid-positive samples. The number of samples per tool is shown in parentheses. Only plasmid-positive samples with at least one ResFams hit are shown. Points are jittered randomly horizontally per tool for representation purposes. The boxplots represent the median, two hinges and two whiskers. 
Figure 7 Comparison of ResFams hits against the ad hoc ground truth. ResFams hits counts of the four herein tested tools are plotted against the respective counts in the ground truth for paired samples. A linear model was fitted (black line) and confidence intervals are shown (in orange). Moreover, a two-dimensional density estimate is plotted with transparency increasing with decreasing point density. Conclusion The importance of WGS has been repeatedly demonstrated for taxonomic identification of microorganisms, with its application in infectious disease diagnostics and in epidemiological studies providing direct benefits to individuals and the general public [39–41]. Furthermore, the indiscriminate extraction of the entire microbial genomic complement enables concurrent sequencing of chromosomal and extrachromosomal sequences, e.g. plasmids in bacteria. This is especially relevant as plasmid-encoded functions can strongly affect the bacterial phenotype, thus providing crucial information beyond chromosomes and taxonomy [42–45]. To this end, the present study analyzed the performances of four plasmid prediction tools on de novo assemblies of 846 Gram-negative WGS clinical isolates using reference-independent and reference-dependent evaluation approaches. With respect to the latter, the use of patient-derived isolates, in contrast to reference material, required the definition of an ad hoc ground truth. 
This approach was found to be robust for the plasmid-positive samples, as the cumulative length of unclassified sequences was limited. However, plasmid sequences, in particular if they were recently acquired, might have been missed; yet, this would not negatively affect the present evaluation, as unclassified sequences were ignored. Moreover, plasmid sequences recently introduced in the chromosome(s) or plasmid sequences homologous to chromosome sequences might represent confounding factors in the definition of the ad hoc ground truth. This further highlights the importance of full-length assemblies/reference genomes, which were, however, unavailable for the herein included isolates, and the generation of this complementary data was beyond the scope of the current study. Overall, no single-best approach was identified and pronounced variations in heterogeneity between the tools were observed, with cBar and plasmidSPAdes showing the strongest correlation. Moreover, the diversity of the present samples comprised 11 genera of at least 20 samples and allowed taxon-dependent variation to be revealed, both within and between tools. Interestingly, Acinetobacter-borne plasmids were less well detected by cBar, resulting in a low sensitivity, which may be because of a limited representation of this genus in the reference database that was originally used for cBar’s training. Furthermore, the generally low precision for this specific genus suggests that Acinetobacter spp. infections may require dedicated analyses and attention, e.g. in the case of plasmid-carrying, multidrug-resistant Acinetobacter baumannii organisms [46–48]. The taxon-dependent variation in the tools’ performances highlights the importance of concurrent identification of taxonomy and functional potential and the need for reference databases with an increased diversity, e.g. improved coverage of Acinetobacter spp. by cBar and PlasmidFinder. 
Moreover, we showed that copy numbers of plasmid sequences need not differ significantly from the copy number of the chromosome(s), thereby limiting coverage-based approaches. Accordingly, the use of complementary approaches that could lend mutual support, e.g. using cBar and plasmidSPAdes, appears sensible. In addition to the individual tools, an ensemble approach integrating the four independent predictions was evaluated. Overall, the sensitivity was found to be increased and less variable. However, the combination of the individual tools also led to reduced precision. Accordingly, the ensemble approach represents an interesting solution if the objective is to maximize the sensitivity, and false positives are acceptable and/or can be removed downstream, e.g. by identifying sequences with exceptionally high or low fold-coverage or by identifying sequences encoding relevant factors, such as antibiotic resistance genes. This approach is, however, not intended to replace the development of improved databases and prediction algorithms in the future. An example of the fast developments in this field is PlasmidTron, which was published while the present manuscript was in revision [49]. Moreover, PLACNET represents a recently published approach for plasmid reconstruction from WGS data [50]. It was excluded from the present evaluation of fully automated tools because of a manual pruning step in PLACNET’s workflow. The reconstructed genomic sequences, including the plasmid sequences, remained fragmented in the present study, which is in accordance with the results reported by Arredondo-Alonso et al. [13]. While long-read-based sequencing greatly improves the contiguity of genome assemblies [18, 19, 51], plasmid prediction tools can strongly reduce the search space for short-read-based data. 
Importantly, despite the frequent prediction of false positives, the accordance in the number of antibiotic resistance genes with respect to the ground truth was found to be high for cBar and plasmidSPAdes. Overall, this is expected to support precision medicine by reducing the time and work burden required for data examination. Furthermore, the present study illustrates that specific objectives are met by specific approaches and, thus, systematic benchmarking on extensive and curated data sets is important for the translation of bioinformatics tools from research to clinical application. Key Points Extrachromosomal DNA in the form of plasmids can carry phenotype-relevant information, e.g. antibiotic resistance factors. Next-generation sequencing of isolates allows linkage of taxonomy and extrachromosomal functional potential via concurrent resolution of chromosomal and extrachromosomal DNA. Existing in silico plasmid-prediction approaches showed limited agreement as well as strong inter- and intra-taxon variability on a set of 846 WGS clinical bacterial isolates. Combining the individual predictions resulted in increased sensitivity while reducing precision. Antibiotic resistance gene counts on predicted plasmid sequences were not strongly affected by false-positive predictions. Acknowledgement The authors would like to thank Curetis GmbH and Ares Genetics GmbH for the support as well as the provided data set. Availability of WGS data The raw WGS data are available on a reasonable request for academic research use only after signing a data transfer agreement. Funding In parts by the Best Ageing (grant number 306031) from the European Union. Cedric C. Laczny is a Postdoc at the Chair for Clinical Bioinformatics at Saarland University. Valentina Galata is a PhD student at the Chair of Clinical Bioinformatics at Saarland University. Achim Plum is a Managing Director of Ares Genetics GmbH and Curetis GmbH. Andreas E. Posch is a Managing Director of Ares Genetics GmbH. 
Andreas Keller is a professor and head of the Chair for Clinical Bioinformatics at Saarland University.

© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data
Jayakumar, Vasanthan; Sakakibara, Yasubumi
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx147
pmid: 29112696
Abstract
Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of the de novo assembly of sequences for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, and these exhibited superior quality of the assembled genomes in comparison with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important in guiding the appropriate choice for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms.

Keywords: de novo assembly, third-generation sequencing, single-molecule sequencing, PacBio SMRT, assembly evaluation

Introduction
Pacific Biosciences (PacBio) single-molecule real-time (SMRT) and Oxford Nanopore sequencing technologies are the two widely used third-generation, single-molecule sequencing (SMS) technologies, which can generate average read lengths of several thousand base pairs. SMRT sequencing technology suffers from high error rates reaching up to 15% [1]; however, as these errors are random, high-quality error-corrected consensus sequences can be generated with sufficient coverage. Application of SMRT sequencing to eukaryotic genomes [2–18] has already demonstrated the obvious advantages provided by long reads in de novo assembly, such as higher contiguity, fewer gaps and fewer errors. The contigs of recently assembled plant and animal genomes routinely achieve an N50 of 1 Mb using SMS data.
Hence, a significant rise in the number of genomes sequenced using SMS technologies is imminent, raising the need for evaluation of the available long-read assemblers. Large-scale evaluation studies such as GAGE [19], GAGE-B [20], Assemblathon1 [21] and Assemblathon2 [22] have been attempted with short-read assemblers, providing conclusions that serve as a useful guide for the de novo assembly of a given target organism. Although such evaluations have also been attempted for SMS data, these studies were either focused on bacterial and smaller eukaryotic genomes [23, 24] or were not sufficiently comprehensive to cover all of the available non-hybrid long-read assemblers [25–27], while others are already outdated because of continuous improvements in the technology [28, 29]. Also, genome size was found to correlate with contiguity in long-read assemblies [17]; hence, diverse genome sizes can help differentiate the effect of the assemblers on each data set. In this study, we attempted to comprehensively evaluate three important features (contiguity, completeness and correctness [1]) of long-read assemblers (Table 1), using SMRT data of a bacterium (Escherichia coli, ∼5 Mb), protist (Plasmodium falciparum, ∼23 Mb), nematode (Caenorhabditis elegans, ∼105 Mb) and plant (Ipomoea nil, ∼750 Mb). We also designed a pipeline (Figure 1) for assembling the data and evaluating the results of different assemblers, which can be applied both to model organisms and to non-model organisms with limited genomic resources.

Table 1 Summarized statistics of the assemblies

| Organism | Statistic | #Contigs | Assembly size (Mb) | Longest contig (Mb) | N50 (Mb) | L50 | CPU time (hours) | Maximum RSS (GB) |
|---|---|---|---|---|---|---|---|---|
| Escherichia coli (4.6 Mb) | Maximum | 1 | 4.7 | 4.7 | 4.7 | 1 | 83.9 | 44.5 |
| | Minimum | 1 | 4.6 | 4.6 | 4.6 | 1 | 2.2 | 3.6 |
| | Mean | 1 | 4.7 | 4.7 | 4.7 | 1 | 19.4 | 15.7 |
| Plasmodium falciparum (23 Mb) | Maximum | 43 | 23.8 | 3.3 | 1.7 | 7 | 2012.6 | 43.9 |
| | Minimum | 15 | 23.1 | 2.1 | 1.3 | 5 | 20.1 | 4.5 |
| | Mean | 26.3 | 23.4 | 2.9 | 1.5 | 6.1 | 441.7 | 22.7 |
| Caenorhabditis elegans (105 Mb) | Maximum | 452 | 106.9 | 7.1 | 3.7 | 38 | 6733.8 | 251.7 |
| | Minimum | 68 | 101.9 | 2.7 | 0.8 | 11 | 13.4 | 10.1 |
| | Mean | 166.7 | 104.2 | 5.1 | 2.2 | 19.4 | 1221.4 | 56.9 |
| Ipomoea nil (750 Mb) | Maximum | 8751 | 752.7 | 11.5 | 1.8 | 1194 | 28 504.7 | 331.2 |
| | Minimum | 1697 | 642 | 2.5 | 0.1 | 104 | 129.7 | 16.2 |
| | Mean | 4288 | 702.7 | 6.2 | 0.7 | 439.4 | 10 065.8 | 78.2 |

Note: L50 and N50 represent the number of contigs and the length of the contig, respectively, crossing the 50% mark of the assembly. Higher N50 and lower L50 values indicate highly contiguous assemblies. Max RSS represents the peak memory usage of the computational node.

Figure 1 Evaluation pipeline.

Materials and methods

Long-read assembly pipelines
The overlap layout consensus (OLC) approach, de Bruijn graphs and string graphs are the commonly used approaches for de novo assembly [30–33]. The advent of SMS data introduced a new challenge in de novo assembly because of the high error rates.
Hence, the application of de Bruijn graphs was rendered unfeasible [34], bringing the OLC approach, along with string graphs, back to prominence. The longer the reads, the more efficient the assembly using the OLC approach, resulting in a linear increase in contiguity [35]. Although second-generation sequencing (SGS) reads were initially used for correcting long reads [36], most of the current long-read OLC pipelines follow a hierarchical approach (Figure 2), exclusively using SMS data as follows: (a) select a subset of longer reads as seed data; (b) use shorter reads to align against the longer seed data as reference, and correct sequencing errors by consensus of the aligned reads; (c) use the error-corrected reads for a draft assembly; and (d) obtain a polished consensus of the draft assembly [36, 37]. The procedure to identify overlaps has been the key difference in most long-read assemblers, and some of the overlap detection methods have been evaluated previously [38]. The long-read assemblers assessed in the present work are briefly summarized below.

Figure 2 Hierarchical pipeline for OLC assembly approaches. Errors are displayed in Step C and are reduced in number in the corrected reads. After assembly, a consensus polishing step, which is not shown in the figure, is also performed as part of the hierarchical pipeline.

Hierarchical Genome Assembly Process
Hierarchical Genome Assembly Process (HGAP) [36] was one of the first hierarchical pipelines to exclusively use SMS reads for assembling a genome.
Higher-quality preassembled reads with around 25–30× coverage are generated by aligning shorter reads against longer seed reads. The preassembled reads are then fed to the Celera assembler [39] to obtain a draft assembly, followed by applying a consensus polishing procedure called quiver. BLASR [40] is used for aligning candidate overlaps, which are identified using a Ferragina–Manzini (FM)-index search and clustering of k-mer hits. The slower BLASR-based pipeline was replaced by FALCON in the latest version (v4). To distinguish between HGAP v3 and v4, the version used in the present evaluation is referred to as HGAP3.

PBcR
PBcR [41] also follows the hierarchical approach, using the MinHash Alignment Process (MHAP) for overlap detection. To identify k-mers shared between overlapping reads without performing any alignments, the k-mers of query reads are converted to integer fingerprints using multiple hash functions. The minimum values from the multiple hash functions are used to create a set, called a MinHash sketch, for each read. MHAP then calculates the Jaccard similarity index by comparing the sketches of query reads to identify overlap candidates. As in HGAP3, the assembly of the corrected reads is performed using the Celera assembler.

Canu
Canu [25] is a fork of the Celera assembler and consolidates the earlier PBcR pipeline into a single, comprehensive assembler. Highly repetitive k-mers, which are abundant in all the reads, can be non-informative. Hence, a term frequency, inverse document frequency (tf-idf) weighting statistic was added to MinHashing, favoring non-repetitive k-mers as minimum values in the MinHash sketches, and sensitivity has been demonstrated to reach up to 89% without any parameter adjustment. By retrospectively inspecting the assembly graphs and also statistically filtering out repeat-induced overlaps, the chances of mis-assemblies are reduced.

FALCON
FALCON [42] is a hierarchical, haplotype-aware genome assembly tool.
The sequence data are split into blocks for comparison using daligner [43]. Daligner first compiles a list of k-mers, along with their read identifiers and read coordinates, and then sorts them lexicographically. Identical k-mers from each block are merged into a new list containing both the query identifiers and their coordinates. A second sorting procedure, accounting for the query coordinates, places neighboring matches adjacent to each other, resulting in the identification of overlap candidates. A directed string graph is created from the alignment of the overlaps, with a collapsed diploid-aware layout, while maintaining the heterozygosity information.

HINGE
HINGE [34] is one of the few assemblers that do not require an error-correction step. Daligner is used for overlap detection. The key innovation of this assembler is the placement of hinges to mark repeat regions that are not spanned by longer reads. Repeats are identified using the coverage gradients of the alignments, and an in-hinge and an out-hinge are marked on the reads that lie on the boundaries of unbridged repeats. Only two reads per repeat region, those with the longest overlap within the repeat, are chosen for placing the hinges. When a repeat is spanned by a completely bridged read, the other overlapping reads are marked as poisoned and not considered for hinge placing, thereby separating bridged repeats. Hinge-aided greedy graphs are used to resolve repeat junctions before obtaining a consensus.

Miniasm
Miniasm [37] was the first long-read assembler to skip error correction and hence is fast. Minimap is used for overlap detection; it indexes subsampled k-mers (minimizers [44]) from all the reads in a hash table, against which the query minimizers are then compared. The matches are sorted and clustered to find the longest collinear matching chains to identify overlap candidates.
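The minimizer subsampling mentioned above can be illustrated with a minimal sketch. It assumes plain lexicographic ordering of k-mers on a single strand, whereas minimap orders hashed k-mers; the function names and the default values of `k` and `w` are hypothetical and chosen only for illustration.

```python
def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Simplifying assumption: k-mers are ordered lexicographically and only
    the forward strand is considered.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Keep the leftmost smallest k-mer among w consecutive k-mers.
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def shared_minimizers(read_a, read_b, k=5, w=4):
    """K-mers selected as minimizers in both reads (overlap evidence)."""
    set_a = {kmer for _, kmer in minimizers(read_a, k, w)}
    set_b = {kmer for _, kmer in minimizers(read_b, k, w)}
    return set_a & set_b


if __name__ == "__main__":
    a = "ACGTTGCATGCCGATAGC"
    b = a[6:] + "TTTTGGGA"  # b starts with the last 12 bases of a
    print(sorted(shared_minimizers(a, b)))
```

Two reads that share an exact stretch of at least w + k − 1 bases are guaranteed to select at least one common minimizer, which is why only the subsampled k-mers need to be indexed.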
An assembly graph layout is subsequently constructed from the collinear matches and output as the assembled contigs, without building any consensus. Because error-correction and consensus procedures are not executed, the error rate of the final assembly is equivalent to that of the raw reads. To circumvent this, Racon [26], a consensus module, was shown to generate high-quality contigs within reasonable run times and is included in the present study as part of the miniasm pipeline.

SMARTdenovo
SMARTdenovo (https://github.com/ruanjue/smartdenovo) is another fast assembler, which can also work without error correction of the raw reads. Similar to minimap, SMARTdenovo searches subsampled query k-mers in indexed hash tables, which are then sorted and merged into collinear matches. A dot-matrix alignment is performed for adjacent matches, and the overlap candidates are subsequently input to a string graph layout. The consensus module can reach an accuracy of up to 99.7%, although it accounts for much of the total computational time.

ABruijn
A de Bruijn graph [45] is a directed graph that is generally constructed from k−1 overlaps of adjacent k-mers. In ABruijn, a set of solid strings (frequent k-mers), rather than all k-mers, is used to construct the graphs because of the high error rates in SMS reads. A fast dynamic programming approach is used to find the longest common subpaths to obtain a rough estimate of the overlaps between two reads. Overlapping read vertices are added onto the graph, and the draft assembly is subsequently constructed. After aligning reads against the draft assembly, ABruijn graphs are constructed again to obtain a polished consensus assembly.

Wtdbg
Wtdbg (https://github.com/ruanjue/wtdbg) is another assembler that uses the framework of de Bruijn graphs.
Unlike in ABruijn graphs, overlapping k-mer hits are identified among the reads using a sorting approach similar to that adopted in minimap and SMARTdenovo, and the hits are used to construct the fuzzy de Bruijn graphs. The resulting graphs, in comparison with ABruijn graphs, have reduced complexity and thereby consume less memory.

Mapping, Error Correction and de novo Assembly Tool
Mapping, Error Correction and de novo Assembly Tool (MECAT) [27] scans for identical k-mers, in blocks of sequences among query reads, to calculate the distance difference factor (DDF) between neighboring k-mer hits. When the DDF is within a specified threshold, scores are assigned to the blocks of k-mers and extended to neighboring blocks. With this scoring mechanism, a large number of irrelevant read overlap candidates are filtered out, significantly reducing the computational time before alignment. After error correction, the corrected reads are pairwise-aligned and fed into a modified canu pipeline to construct contigs.

Data sets for evaluation
The evaluation data sets were chosen such that (i) the data are available for public use, and (ii) the genomes are of diverse sizes. Initially, the standard bacterial model organism E. coli was chosen, and the sequence data (1 SMRT cell: ∼140× coverage) of P6-C4 chemistry (Supplementary Figure S1A) were downloaded from the PacBio DevNet website (https://github.com/PacificBiosciences/DevNet/wiki/Datasets). Plasmodium falciparum (protist) is one of the few smaller eukaryotic genomes with long-read data available. Although the genome is only ∼23 Mb in length, it contains 14 chromosomes with a relatively high repeat content of 51.8% and a high AT% of 80.6% [46]. Plasmodium falciparum sequence data (9 SMRT cells: ∼180× coverage) of P6-C4 chemistry (Supplementary Figure S1B) were downloaded from the National Center for Biotechnology Information’s Sequence Read Archive (SRA360189) [47]. In contrast to P. falciparum, C. elegans (nematode) has a genome size of ∼105 Mb, but with only six, although much longer, chromosomes. The genome is also estimated to contain ∼20 000 genes, making it more complex than those of E. coli and P. falciparum, which have only ∼5000 genes each. There are also relatively fewer transposons (∼12%), although they are sufficiently long (1–3 kb) to confound the genome assembly [48]. Caenorhabditis elegans sequence data (11 SMRT cells: ∼45× coverage) of P6-C4 chemistry (Supplementary Figure S1C) were also downloaded from the PacBio DevNet website. Next, we tackled the main focus of this evaluation, the genome of a non-model plant with a high repetitive content and longer repeats. For this purpose, I. nil (plant) data [2] of P5-C3 chemistry (Supplementary Figure S1A) were obtained from our previous work (90 SMRT cells: ∼50× coverage; DRA002710). Ipomoea nil has a highly repetitive (64%) genome of an estimated size of 750 Mb, with limited available genomic resources, providing a good measure for similar repetitive plant genomes. To evaluate the correctness of the I. nil genome assemblies, restriction site-associated DNA (RAD)-seq (DRA002758), expressed sequence tags (ESTs; HY917605–HY949060) and bacterial artificial chromosome (BAC)-end data (GA933005–GA974698) were used. PacBio RSII was the sequencer used in all cases. The P6-C4 chemistry, in comparison with P5-C3, has shown an increase in average read lengths, and therefore the average read lengths of the I. nil data set are slightly shorter than those of the other data sets (Supplementary Figure S1). Only SMRT data were chosen for the present study because one of the aims was to evaluate long-read assemblies without depending on SGS data, whereas correcting the nonrandom errors of nanopore data may still have to rely on more accurate Illumina data [49, 50]. All four data sets were preprocessed using HGAP3 to obtain filtered subreads for assembly.
Two rounds of consensus polishing were applied to all assemblies using quiver.

Criteria for evaluation
For assessing the assembly results, we considered various metrics (Figure 1). Apart from N50 and L50 measures, the average contigs-to-chromosomes (ctg/chr) ratio was calculated for assessing contiguity. For gene-level completeness, BUSCO [51] and CEGMA [52] were used. In eukaryotic contigs, the terminal regions were scanned using tandem repeats finder [53] for the presence of telomeres. Peak computational memory in the form of maximum resident set size (RSS) and CPU time were determined to compare computational requirements. When complete reference sequences were available, single-nucleotide variations (SNVs), indels and structural variations (SVs) were analyzed from QUAST [54] and Assemblytics [55] to evaluate correctness; unique SVs provided a relative measure of assembly errors. In addition, dot plots were visualized for rearrangements. The percentage of reference sequences covered by the assemblies was calculated using MUMmer [56] alignments. For the non-model organism I. nil, linkage maps were constructed from RAD-seq [57] data using STACKS [58], to identify mis-assembled contigs. Because the marker density of the linkage maps was low, this also provided a good measure for contiguity, as larger contigs have a better chance of being incorporated in the linkage maps. ESTs and BAC-end reads were used for assessing completeness. Longer contigs had a better chance of concordantly mapping the 100 kb insert-sized BAC-end read pairs, whereas discordant mapping rates provided an indirect measure of mis-assemblies. Whole BAC sequences, of ∼100 kb in length, were used to assess contiguity and completeness, and also to identify SNVs and indels. Tpn transposons, a unique feature of I. nil flowers [2], were also considered to assess completeness. Ranks were assigned for all the criteria, as listed in Supplementary Methods.
The ranks for all criteria were summed up for each assembler. The summed scores, in decreasing order, were used to assign overall ranks. Also, z-scores were calculated for all observed metrics, so that significant observations received rewards or penalties [22]. The average of the z-scores, from all metrics, for each assembler was plotted to observe z-score-based rankings, which displayed high and low scores for better and worse performances, respectively. Assemblies that failed during execution were either left out of the rankings or assigned arbitrarily low ranks. Versions of the tools and the commands used are detailed in the Supplementary Methods.

Results

Contiguity
All of the assemblers reported good contiguity (Table 1).

Escherichia coli
A single contig representing the complete bacterial genome was reconstructed by all the assemblers (Supplementary Table S1).

Plasmodium falciparum
Few contigs (15–43 contigs), high N50 values (1.2–1.7 Mb), low L50 values (5–7) and low ctg/chr ratios (1–2.27) were generally observed in all the assemblies, representing a high level of contiguity despite the repetitive nature of the genome. MECAT, in particular, reconstructed every chromosome in one piece, whereas miniasm, SMARTdenovo and wtdbg produced comparatively fragmented or redundant contigs (Supplementary Tables S2 and S3).

Caenorhabditis elegans
The N50 exceeded 1 Mb in all but the PBcR assembly. Canu had the best N50 (3.6 Mb) and L50 (11) values, while PBcR had low N50 (847 kb) and high L50 (38) values. In general, six contigs, on average, were found to be sufficient to represent a chromosome (Supplementary Tables S4 and S5).

Ipomoea nil
HGAP3 obtained the best contiguity (N50=1.53 Mb; L50=120) and was the only assembler to have contigs >10 Mb in length. Canu and FALCON shared the next best N50 (934 and 904 kb, respectively) and L50 values (191), while both wtdbg and miniasm had fragmented assemblies (Supplementary Table S6).
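The N50 and L50 values reported in these contiguity results follow the definition given in the note to Table 1. A minimal sketch of that computation is shown here, assuming the contig lengths are already available as a list of numbers; the function name `n50_l50` is illustrative and is not taken from any of the evaluated tools.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths.

    N50 is the length of the contig at which the cumulative length of the
    contigs, sorted from longest to shortest, first reaches half of the
    total assembly size; L50 is the number of contigs needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("at least one contig is required")


if __name__ == "__main__":
    # Toy assembly of five contigs with a total length of 20.
    print(n50_l50([8, 5, 3, 2, 2]))
```

For a single-contig assembly, such as the E. coli results, the N50 equals the contig length and the L50 is 1.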
The shorter the genome, the smaller the differences observed in contiguity among the assemblers. However, with longer genomes, the contiguity profiles progressively started to differ among the assemblers (Supplementary Figures S2–S4).

Completeness

Escherichia coli
In all the cases, the assembly size was slightly larger than that of the reference genome, with 99.9% BUSCO completeness (Supplementary Table S1).

Plasmodium falciparum
On average, the contigs covered the 14 chromosomes in the range of 95.67–99.90% (Supplementary Table S7). Excluding ABruijn, the apicoplast genome was assembled by all the assemblers, while the mitochondrial genome was only present in the HGAP3 assembly. Canu was able to reconstruct 23 of the 28 telomeres, whereas the PBcR and wtdbg assemblies resolved <10 telomeres (Supplementary Table S8). Intriguingly, miniasm was unable to resolve even a single telomere. BUSCO analysis showed 67.4–68.9% completeness for all the assemblies, although it should be noted that the original reference sequence also yielded only 68.8% completeness.

Caenorhabditis elegans
At least 99% of all the chromosomes were covered by the assembled contigs on average, excluding the wtdbg assembly (Supplementary Table S9). Canu and HGAP3 produced 10 of 12 telomeres, whereas wtdbg produced only a single telomere (Supplementary Table S10). All the assemblies also showed high BUSCO (97.2–99.2%) completeness ratios.

Ipomoea nil
Most of the assemblies fell short of the expected genome size of 750 Mb; however, BUSCO reported completeness ratios in the range of 92.9–94%. Most of the assemblies mapped around 99% of the ESTs and BAC-end reads (Supplementary Table S11). PBcR (314) and HGAP3 (311) resolved the largest number of Tpn transposons (Supplementary Table S11), followed by canu (307) and MECAT (307). MECAT (18), FALCON (16) and SMARTdenovo (16) were better at resolving telomeres (Supplementary Table S11).
Some smaller PBcR contigs were redundant and were contained within larger contigs with short overhangs. The high BUSCO and CEGMA ratios indicated that the gene regions were captured effectively, despite differences in the assembly sizes. The mitochondrial genomes were largely unassembled; their short, circular and high-copy nature may have confounded the assemblers.

Correctness

After two rounds of consensus polishing of the draft assemblies, the indel rates were drastically reduced.

Escherichia coli

Analysis using QUAST showed that all contigs had mis-assemblies. However, closer inspection using Assemblytics traced the mis-assemblies reported by QUAST to three SVs, which are likely strain-specific differences rather than mis-assemblies (Supplementary Table S1). For instance, in the ABruijn assembly, the contig length was equal to the reference length when the SVs were tallied. However, most other assemblies still had a large number of SVs (an average of 68.8 SVs compared with 9 SVs for ABruijn), even after two rounds of polishing.

Plasmodium falciparum

More than 5000 SVs were shared among all the assemblies. Wtdbg (6448) produced the largest number of unique SVs, whereas ABruijn (389), canu (384), MECAT (311) and PBcR (332) performed better by producing a relatively smaller share of the unique SVs (Supplementary Table S12). Dot plots, which were used for observing rearrangements, displayed small rearrangements only in the ABruijn and wtdbg assemblies. In the other cases, an approximately straight diagonal line was observed, indicating strong congruity.

Caenorhabditis elegans

A total of 17 893 SVs were shared among all the assemblies. Wtdbg (30 622) produced the largest number of unique SVs, whereas canu (2374), FALCON (3337), MECAT (2358) and PBcR (4179) produced a relatively smaller share of unique SVs (Supplementary Table S13).
One or two mis-assembled contigs were visible in the dot plots of all assemblies, barring MECAT and SMARTdenovo.

Ipomoea nil

The miniasm (1.2 Mb) and wtdbg (5.8 Mb) assemblies had the smallest amounts of mis-assembled contigs, while HGAP3 (128 Mb) showed the largest share of mis-assembled data. HGAP3, FALCON and MECAT had >100 Mb of mis-assembled contigs, whereas canu offered the best balance, incorporating longer contigs (593.3 Mb) into the linkage maps with fewer (20.9 Mb) mis-assemblies (Supplementary Table S14). Wtdbg (1.04%) and miniasm (2.53%) had the fewest discordantly mapping BAC-end read pairs. Surprisingly, FALCON (6.36%) had the highest discordant mapping rate (Supplementary Table S11). When BAC sequences were completely covered by contigs, the per-base accuracy was 99.9% in four of the five BAC sequences (Supplementary Table S15), while mismatched bases were almost nonexistent. Fragmented contigs were not considered for assessing per-base accuracy, as they had unresolved errors in overlapping terminal regions.

Many SVs were shared among all the assemblers and may be actual variations rather than assembly errors. Unlike the SMRT data, the Illumina-based assembly was found to have large indels and plenty of mismatches in the regions covering the five BAC sequences in I. nil [2]. The evaluated assemblers, which are based on the overlap information of the longer reads, benefited not just in terms of contiguity but also in per-base accuracy for a repetitive genome like I. nil.

Circularity and overlapping fragmented contigs

With the application of Circlator [59], it was evident that the circularity of some of the E. coli assemblies was not resolved, which explained the presence of additional base pairs; these were subsequently trimmed. The increased indel rates were originally concentrated on the overlapping terminal ends of the circularly unresolved contigs.
As a result, the indel rates became almost identical in all the circularly resolved assemblies (Supplementary Table S16). However, Circlator was unable to resolve the circularity for the HGAP3, MECAT and wtdbg assemblies. Similarly, when the contigs were fragmented in repetitive regions, the breakpoints sometimes fell in such a way that two nearby contigs shared considerable overlapping terminal ends. Consensus polishing had no impact on such overlapping regions, leaving a large number of unresolved indel errors.

Resource usage

Escherichia coli

HINGE and wtdbg assemblies were quickly obtained, while HGAP3 was the slowest, as expected (Figure 3A). Miniasm was the fastest of all assemblers and finished in about 16 min of CPU time; however, two rounds of RACON execution required 25.81 CPU h, making this pipeline the second slowest. SMARTdenovo had the highest peak memory usage, while HGAP3 consumed the least amount of memory (Figure 3B).

Figure 3. Computational resource requirements. Computational requirements are represented as (A) log CPU time and (B) maximum RSS, a measure of peak memory usage, for all assemblers. Refer to Supplementary Table S17 for actual CPU times. Failed assemblies (HINGE for the eukaryotic genomes, and ABruijn for the I. nil genome) are not plotted.

Plasmodium falciparum

Wtdbg was the quickest assembler, closely followed by MECAT. Other assemblers generally consumed hundreds of CPU hours, with HGAP3 being almost 100-fold slower than wtdbg (Figure 3A).
ABruijn, SMARTdenovo and wtdbg were memory-intensive, whereas canu, FALCON and MECAT were memory-efficient (Figure 3B).

Caenorhabditis elegans

Wtdbg, followed by MECAT, was the fastest in producing assemblies, while PBcR was the slowest (Figure 3A). ABruijn consumed a huge amount of memory, while canu was the most memory-efficient, followed by MECAT and HGAP3 (Figure 3B).

Ipomoea nil

Wtdbg was again the fastest assembler (129.7 CPU h). It should be noted that HGAP3 took 83.9 CPU h even for a bacterial genome. MECAT was also fairly quick, while the celera-dependent pipelines were the slowest (Figure 3A). Wtdbg consumed 331.15 Gb of peak memory. MECAT was the best with respect to both CPU time and peak memory usage, while canu also showed a reasonable balance in resource usage (Figure 3B).

Ranking

Escherichia coli

ABruijn, canu and FALCON, in that order, were top-ranked in both rankings (Figures 4 and 5A). The rankings were heavily influenced by whether the assemblies were circularly resolved, and hence MECAT, HGAP3 and wtdbg were pushed to the bottom of the table.

Figure 4. Rankings for all assemblies. The lower the rank, the better the assembly.

Figure 5. Z-score-based rankings. Average z-scores of all ranking metrics are plotted for (A) E. coli, (B) P. falciparum, (C) C. elegans and (D) I. nil. The higher the average z-value, the better the assembly performance. The failed ABruijn assembly is left blank for the I. nil data set.
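The two ranking schemes used in this evaluation, summed per-metric ranks and averaged z-scores, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the assembler names and metric values are invented, ties are broken by sort order, the population standard deviation is used, and z-scores are sign-flipped for metrics where lower values are better so that a higher mean z-score always means better performance. The exact conventions used in the study are given in its Methods and Supplementary Methods, not here.

```python
from statistics import mean, pstdev


def rank_sums(metrics, higher_is_better):
    """Sum the per-metric ranks (1 = best) of every assembler."""
    assemblers = sorted(next(iter(metrics.values())))
    totals = {name: 0 for name in assemblers}
    for metric, values in metrics.items():
        # Best value first; ties are broken by sort order (an assumption).
        ordered = sorted(assemblers, key=lambda a: values[a],
                         reverse=higher_is_better[metric])
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return totals


def mean_z_scores(metrics, higher_is_better):
    """Average the signed z-scores of every assembler over all metrics."""
    assemblers = sorted(next(iter(metrics.values())))
    scores = {name: [] for name in assemblers}
    for metric, values in metrics.items():
        observed = [values[a] for a in assemblers]
        mu, sd = mean(observed), pstdev(observed)
        # Flip the sign for "lower is better" metrics so that a positive
        # z-score is always a reward and a negative one a penalty.
        sign = 1.0 if higher_is_better[metric] else -1.0
        for name in assemblers:
            z = 0.0 if sd == 0 else sign * (values[name] - mu) / sd
            scores[name].append(z)
    return {name: mean(zs) for name, zs in scores.items()}


# Hypothetical values for three made-up assemblers; not data from this study.
metrics = {
    "n50_mb": {"asm_a": 3.0, "asm_b": 2.0, "asm_c": 1.0},
    "misassemblies": {"asm_a": 10, "asm_b": 20, "asm_c": 30},
}
higher_is_better = {"n50_mb": True, "misassemblies": False}

print(rank_sums(metrics, higher_is_better))
print(mean_z_scores(metrics, higher_is_better))
```

The rank sum treats every step between neighbouring assemblers as equal, whereas the z-score preserves how far each value lies from the mean, so an assembler with one outstanding metric can place higher in the z-score ranking than in the rank-sum ranking.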
Plasmodium falciparum

Although HGAP3 had the highest N50 value, it was not the top-ranked assembler (Figures 4 and 5B). Four assemblers, in the order of MECAT, FALCON, ABruijn and canu, were top-ranked according to their z-scores (Figure 5B), corroborating that N50 should not be the sole factor in choosing an assembly. The HINGE assembly was excluded from the rankings, as it resulted in a segmentation fault, and HINGE was therefore not tested on the other eukaryotic data sets either.

Caenorhabditis elegans

Canu ranked at the top, followed by FALCON and MECAT (Figure 5C). Although miniasm was eighth in the ranking (Figure 4), it surprisingly ranked fourth according to the z-scores, as a result of obtaining considerably high z-scores for contiguity metrics (Figure 5C). Without error correction, it would be difficult to distinguish duplications and repeats [37]; however, the repeat-sparse nature of the C. elegans genome likely contributed to the better contiguity achieved by miniasm.

Ipomoea nil

The ABruijn assembly resulted in a segmentation fault and was not considered for evaluation. The highly repetitive nature and the shorter insert size of the I. nil data set prevented all of the assemblers except HGAP3 from reaching a 1 Mb contig N50. Nevertheless, canu ranked first, ahead of HGAP3, in both rankings (Figures 4 and 5D). If mis-assemblies were given additional penalties, the ranking of HGAP3 might come down further. For the first time, SMARTdenovo was ranked among the top five assemblers.

Mean ranking of the three eukaryotic assemblies

When the rankings of the eukaryotic assemblies were averaged (Figure 4), canu, MECAT, FALCON and HGAP3, in that order, were at the top of the rankings. Similarly, in the z-score-based mean rankings, canu, MECAT, FALCON and HGAP3, in that order, displayed better performances with positive mean z-scores (Figure 6).

Figure 6. Mean z-score-based rankings.
The mean scores of the individual average z-scores obtained from E. coli, P. falciparum, C. elegans and I. nil are plotted. The higher the average z-value, the better the assembly performance.

Discussion

De novo genome assemblies using SMRT data, when compared with earlier versions, have been shown to increase contiguity by several hundred-fold [3, 6, 10], and to resolve fragmented regions into contiguous, gapless sequences [6, 41]. The average and median contig N50 values of recently assembled plant and animal genomes using long reads are 6.24 and 3.60 Mb (Table 2), respectively. In the current study, the three important features of long-read assemblers (contiguity, completeness and correctness [1]) were evaluated.

Table 2. A list of recently assembled genomes using PacBio's SMRT data

| Organism | Technology | Assembly tool | Contig N50/NG50 (Mb) | Scaffold N50/NG50 (Mb) | Study |
| --- | --- | --- | --- | --- | --- |
| Taeniopygia guttata | PB | FALCON | 5.8 | NA | [3] |
| Calypte anna | PB | FALCON | 5.4 | NA | [3] |
| Drosophila serrata | PB | PBcR | 0.94 | NA | [4] |
| Utricularia gibba | PB | HGAP3 | 3.42 | NA | [5] |
| Arabidopsis thaliana | PB | PBcR | 11.16 | NA | [41] |
| Drosophila melanogaster | PB | Canu | 21.31 | NA | [25] |
| Homo sapiens CHM1 | PB | Canu | 21.95 | NA | [25] |
| Vitis vinifera | PB | FALCON | 2.39 | NA | [42] |
| Ipomoea nil | PB+Illumina+LM | HGAP3 | 1.87 | 2.88 | [2] |
| Vigna angularis | PB+Illumina+454 | Sprai, Celera | 0.8 | 2.95 | [7] |
| Oreochromis niloticus | PB+RH map+RAD map | Canu | 3.1 | NA | [8] |
| Gorilla gorilla | PB+BAC-end+Fosmid-end | FALCON | 9.56 | 23.14 | [6] |
| Lates calcarifer | PB+OM+LM | HGAP3 | 1.72 | 25.85 | [9] |
| Capra hircus | PB+OM+HiC | PBcR | 18.7 | 87.28 | [11] |
| Arabis alpina | PB+OM+HiC | PBcR, FALCON | 0.9 | 3.8 | [17] |
| Euclidium syriacum | PB+OM | PBcR, FALCON | 3.3 | 17.5 | [17] |
| Conringia planisiliqua | PB+OM | PBcR, FALCON | 3.6 | 8.9 | [17] |
| Corvus corone | PB+OM | FALCON | 8.91 | 18.36 | [10] |
| Zea mays | PB+OM | PBcR, FALCON | 1.19 | 9.56 | [13] |
| Homo sapiens NA12878 | PB+OM | PBcR, FALCON | 1.4 | 31.1 | [14] |
| Homo sapiens HX1 | PB+OM | FALCON | 8.3 | 22 | [12] |
| Oropetium thomaeum | PB+OM | HGAP3 | 2.4 | 7.1 | [16] |
| Oryza sativa indica | PB+Fosmids+OM+LM | PBcR | 4.43 | 1.22 | [15] |
| Homo sapiens NA19240 | PB+OM | FALCON | 7.25 | 78.6 | [18] |

PB, PacBio SMRT data; OM, optical mapping data; LR, linked reads; LM, linkage maps.

Canu ranked the best in the average rankings of all the assemblies from all the data sets. Canu, because of its efficiency in handling repeats [25], had fewer assembly errors, sometimes trading contiguity for correctness. Indeed, it is essential to prioritize correctness over contiguity; doing otherwise would defeat the purpose of building a reference genome for future studies. Canu and MECAT showed the best balance in computational requirements. MECAT requires longer reads to effectively distinguish non-repetitive overlaps, and was found to underperform in the case of I. nil, whose transposons can be longer than the 7 kb average insert size of the I. nil data.
FALCON, the only diploid-aware assembler, showed reasonable performance for genomes up to 100 Mb in length, similar to MECAT. The FALCON assembly was surprisingly filled with mis-assemblies for the I. nil data, probably because of the repeat filtering steps, leading to further loss of coverage in input data. An increase in insert sizes and coverage could yield better performance from both FALCON and MECAT. HGAP3 was found to be the most contiguous assembler, but with the disadvantage of extremely slow computation times. Mis-assemblies were also most abundant in the HGAP3 assemblies, possibly because of the greedier nature of celera’s algorithm at the layout stage [36]. In addition, as previously observed for PBcR in the rice genome assembly [15], the celera-based assemblers, PBcR and HGAP3, were found to have redundant contigs. PBcR is the second most widely used long-read assembler (Table 2); however, it is no longer maintained, as the focus has shifted to its successor canu, which seemed to outperform PBcR in almost every analysis. SMARTdenovo, although not the best, produced moderately good results in all metrics and would be a suitable choice for obtaining larger genome assemblies quickly. Leaving out the consensus module, miniasm was the fastest available assembler for all genomes evaluated, excluding I. nil. Miniasm requires as much as 13% divergence for repeat resolution, whereas canu and FALCON require only 3 and 5% divergence, respectively [25]. Hence, miniasm produced fragmented contigs for repeat-rich genomes, but obtained reasonable rankings otherwise. HINGE may not be ideal for assembling large genomes, but would be a good choice for assembling highly repetitive bacterial genomes. As observed in the assemblies of the slightly smaller yeast genome [24], ABruijn, despite its good contiguity, was chimeric. ABruijn failed to assemble the I. 
nil data set; however, when the error-corrected reads of canu were used, the assembly succeeded, but only after consuming almost 500 Gb of maximum RSS. Similarly, wtdbg was also memory-intensive, and both assemblers need high-end servers for handling larger genomes. In the case of repetitive genomes, both assemblers could collapse repeats, leading to loss of information. In particular, the wtdbg assembly was found to be >100 Mb short of the expected genome size in I. nil. Wtdbg assemblies, which always ranked last, mostly because no consensus procedure was executed, would need additional rounds of consensus polishing to effectively compete with other assemblers. Wtdbg assemblies also had fragmented contigs.

Mitochondrial genomes were generally left unassembled. Hence, it might be necessary to extract either (i) reads that do not align to the assembled contigs or (ii) reads that align to an available or a closely related mitochondrial genome. The extracted reads could be used to perform an additional round of assembly for reconstructing extra-chromosomal genomes [47]. In addition, redundancy at the ends of contigs can be a major obstacle to polishing the genome, as it might become difficult for the reads to be aligned to such regions, leaving errors stranded in the terminal portions of the contigs. Indeed, when whole BAC sequences of I. nil were covered by completely spanning contigs, the error rate was approximately homogeneous across all the assemblers, whereas when contigs were in overlapping fragmented pieces, the terminal overlapping regions were found to have increased error rates. The same phenomenon was observed in redundant regions from circularly unresolved bacterial assemblies. Identifying such regions and trimming the redundant base pairs may lead to an improved overall per-base correctness.

Dot plots showed that many of the breakpoints in contig mis-assemblies originated from different locations for different assemblers.
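The idea of identifying and trimming redundant terminal base pairs, raised above for circularly unresolved contigs, can be sketched in Python under strong simplifying assumptions. The sketch looks only for an exact match between the start and the end of a single contig; real long-read contigs carry indel errors in exactly these regions, so tools such as Circlator rely on error-tolerant alignment instead. The function names, the `min_len` threshold and the toy sequence are hypothetical.

```python
def terminal_overlap(seq, min_len=3):
    """Length of the longest exact prefix/suffix match of seq.

    Only overlaps between min_len and half of the sequence length are
    considered, so a contig is never reported as overlapping itself
    entirely. Returns 0 when no such overlap exists.
    """
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_circular(seq, min_len=3):
    """Remove the redundant copy of the overlap from the end of seq."""
    k = terminal_overlap(seq, min_len)
    return seq[:len(seq) - k] if k else seq


# Toy circular "genome" and a contig that runs past its own start by 4 bp.
genome = "ACGTTGCAAG"
contig = genome + genome[:4]
print(terminal_overlap(contig))  # 4
print(trim_circular(contig))     # ACGTTGCAAG
```

The same prefix/suffix comparison could in principle be applied between the end of one contig and the start of a neighbouring contig, although in practice an alignment-based overlap search would be needed to cope with sequencing errors.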
Contiguity profiles were also found to be different for FALCON and PBcR in plant genome assemblies, and a hybrid assembly using the different contiguity profiles was found to be highly successful [17]. Hence, an alternative solution for increasing the contiguity would be to combine different assemblies by using reconciliation tools such as quickmerge [60]. For example, miniasm had fewer contigs and breakpoints compared with MECAT for the C. elegans assemblies. Using the miniasm assembly as a backbone for extending the MECAT assembly may result in longer and more accurate contigs in this case.

Similar to the evaluation of short-read assemblers [19–22], the current study did not reveal a clear winner; a similar result was observed with evaluations of Nanopore sequencing data [24]. That is, an optimal assembler for one data set may not be optimal for a different data set. Hence, it would be ideal to try out a variety of assemblers, as performed in the Solanum pennellii genome project [49], and choose the best assembly based on various evaluation strategies. Any available resources, such as BAC-end data, whole BAC sequences and previously annotated gene sets, could be effectively used for the purpose of evaluation, as demonstrated in this study. Based on the results, we suggest that the best approach in handling larger genomes would be to generate assemblies from at least canu, FALCON, MECAT and SMARTdenovo, and to base the final choice of assembler on different evaluation metrics rather than on N50 alone. When time is not a limiting factor, HGAP3 could also be used, but care should be taken in recognizing mis-assembled and redundant contigs. Recently, scaffolding techniques, such as optical mapping, CHICAGO, Hi-C and linked reads, have been applied to correct mis-assemblies [10–18], and these can also be used for achieving chromosome-scale assemblies.

Key Points

All non-hybrid long-read assemblers are good at producing excellent contiguity.
Considering correctness, computational time and memory requirements, canu, MECAT, FALCON and SMARTdenovo are recommended as the minimum set of assemblers for assembling third-generation, single-molecule real-time (SMRT) sequencing data.
As observed in the previous evaluations of short-read assemblers [19–22], the assemblies should be carefully evaluated using different metrics before finalizing the assembly, without relying on the N50 metric alone.
Redundant base pairs in the overlapping terminal regions of fragmented contigs lead to unresolved errors, even after several rounds of consensus polishing.
High-copy extrachromosomal genomes have a significant chance of being filtered out. To reconstruct mitochondrial genomes, it may be necessary to identify reads that do not align to the assembled contigs, so that they can be assembled separately.

Funding

This work was supported by JSPS KAKENHI Grant Number 16H06279.

Vasanthan Jayakumar is a graduate student at the Department of Biosciences and Informatics, Keio University, Japan. His research interest is in Genomics and Bioinformatics.

Yasubumi Sakakibara is a Professor at the Department of Biosciences and Informatics, Keio University, Japan. His research interests are Bioinformatics including genome assembly, metagenome analysis and artificial intelligence.

References

1 Lee H, Gurtowski J, Yoo S, et al. Third-generation sequencing and the future of genomics. bioRxiv 2016:048603.
2 Hoshino A, Jayakumar V, Nitasaka E, et al. Genome sequence and analysis of the Japanese morning glory Ipomoea nil. Nat Commun 2016;7:13295. http://dx.doi.org/10.1038/ncomms13295
3 Korlach J, Gedman G, Kingan S, et al. De novo PacBio long-read and phased avian genome assemblies correct and add to genes important in neuroscience research. Gigascience 2017;6:1–16. doi: 10.1093/gigascience/gix085.
4 Allen SL, Delaney EK, Kopp A, Chenoweth SF. Single-molecule sequencing of the Drosophila serrata genome. G3 2017;7(3):781–8. http://dx.doi.org/10.1534/g3.116.037598
5 Lan T, Renner T, Ibarra-Laclette E, et al. Long-read sequencing uncovers the adaptive topography of a carnivorous plant genome. Proc Natl Acad Sci USA 2017;114(22):E4435–41.
6 Gordon D, Huddleston J, Chaisson MJ, et al. Long-read sequence assembly of the Gorilla genome. Science 2016;352(6281):aae0344.
7 Sakai H, Naito K, Ogiso-Tanaka E, et al. The power of single molecule real-time sequencing technology in the de novo assembly of a eukaryotic genome. Sci Rep 2015;5(1):16780. http://dx.doi.org/10.1038/srep16780
8 Conte MA, Gammerdinger WJ, Bartie KL, et al. A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions. BMC Genomics 2017;18(1):341. http://dx.doi.org/10.1186/s12864-017-3723-5
9 Vij S, Kuhl H, Kuznetsova IS, et al. Chromosomal-level assembly of the Asian Seabass genome using long sequence reads and multi-layered scaffolding. PLoS Genet 2016;12:e1005954.
10 Weissensteiner MH, Pang AW, Bunikis I, et al. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res 2017;27(5):697–708. http://dx.doi.org/10.1101/gr.215095.116
11 Bickhart DM, Rosen BD, Koren S, et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 2017;49(4):643–50. http://dx.doi.org/10.1038/ng.3802
12 Shi L, Guo Y, Dong C, et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun 2016;7:12065. http://dx.doi.org/10.1038/ncomms12065
13 Jiao Y, Peluso P, Shi J, et al. Improved maize reference genome with single-molecule technologies. Nature 2017;546(7659):524–7.
14 Pendleton M, Sebra R, Pang AW, et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Methods 2015;12(8):780–6. http://dx.doi.org/10.1038/nmeth.3454
15 Du H, Yu Y, Ma Y, et al. Sequencing and de novo assembly of a near complete indica rice genome. Nat Commun 2017;8:15324. http://dx.doi.org/10.1038/ncomms15324
16 VanBuren R, Bryant D, Edger PP, et al. Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum. Nature 2015;527(7579):508–11. http://dx.doi.org/10.1038/nature15714
17 Jiao WB, Accinelli GG, Hartwig B, et al. Improving and correcting the contiguity of long-read genome assemblies of three plant species using optical mapping and chromosome conformation capture data. Genome Res 2017;27(5):778–86. http://dx.doi.org/10.1101/gr.213652.116
18 Steinberg KM, Graves-Lindsay T, Schneider VA, et al. High-quality assembly of an individual of Yoruban descent. bioRxiv 2016:067447. doi: 10.1101/067447.
19 Salzberg SL, Phillippy AM, Zimin A, et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res 2012;22(3):557–67. http://dx.doi.org/10.1101/gr.131383.111
20 Magoc T, Pabinger S, Canzar S, et al. GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics 2013;29(14):1718–25. http://dx.doi.org/10.1093/bioinformatics/btt273
21 Earl D, Bradnam K, St John J, et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011;21(12):2224–41. http://dx.doi.org/10.1101/gr.126599.111
22 Bradnam KR, Fass JN, Alexandrov A, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2013;2(1):10. http://dx.doi.org/10.1186/2047-217X-2-10
23 Sović I, Križanović K, Skala K, Šikić M. Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads. Bioinformatics 2016;32(17):2582–9.
24 Istace B, Friedrich A, d'Agata L, et al. De novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer. Gigascience 2017;6(2):1–13. http://dx.doi.org/10.1093/gigascience/giw018
25 Koren S, Walenz BP, Berlin K, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 2017;27(5):722–36. http://dx.doi.org/10.1101/gr.215087.116
26 Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 2017;27(5):737–46.
27 Xiao CL, Chen Y, Xie SQ, et al. MECAT: an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing reads. Nat Methods 2017. doi: 10.1038/nmeth.4432.
28 Cherukuri Y, Janga SC. Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches. BMC Genomics 2016;17(S7):507.
29 Liao YC, Lin SH, Lin HH. Completing bacterial genome assemblies: strategy and performance comparisons. Sci Rep 2015;5(1):8747. http://dx.doi.org/10.1038/srep08747
30 Myers EW. A history of DNA sequence assembly. Inf Technol 58:126–32.
31 Simpson JT, Pop M. The theory and practice of genome sequence assembly. Annu Rev Genomics Hum Genet 2015;16:153–72. http://dx.doi.org/10.1146/annurev-genom-090314-050032
32 Chen Q, Lan C, Zhao L, et al. Recent advances in sequence assembly: principles and applications. Brief Funct Genomics 2017, in press. doi: 10.1093/bfgp/elx006.
33 Chaisson MJ, Wilson RK, Eichler EE. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 2015;16(11):627–40. http://dx.doi.org/10.1038/nrg3933
34 Kamath GM, Shomorony I, Xia F, et al. HINGE: long-read assembly achieves optimal repeat resolution. Genome Res 2017;27(5):747–56. http://dx.doi.org/10.1101/gr.216465.116
35 Koren S, Schatz MC, Walenz BP, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 2012;30:693–700. http://dx.doi.org/10.1038/nbt.2280
36 Chin CS, Alexander DH, Marks P, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 2013;10(6):563–9. http://dx.doi.org/10.1038/nmeth.2474
37 Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 2016;32(14):2103–10. http://dx.doi.org/10.1093/bioinformatics/btw152
38 Chu J, Mohamadi H, Warren RL, et al. Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art. Bioinformatics 2017;33:1261–70.
39 Myers EW, Sutton GG, Delcher AL, et al. A whole-genome assembly of Drosophila. Science 2000;287(5461):2196–204. http://dx.doi.org/10.1126/science.287.5461.2196
40 Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 2012;13(1):238. http://dx.doi.org/10.1186/1471-2105-13-238
41 Berlin K, Koren S, Chin CS, et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 2015;33(6):623–30. http://dx.doi.org/10.1038/nbt.3238
42 Chin CS, Peluso P, Sedlazeck FJ, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 2016;13(12):1050–4. doi: 10.1038/nmeth.4035.
43 Myers G. Efficient local alignment discovery amongst noisy long reads. In: Brown D, Morgenstern B (eds). Algorithms in Bioinformatics. Lecture Notes in Bioinformatics, 8701. Berlin: Springer, 2014, 52–67.
44 Roberts M, Hayes W, Hunt BR, et al. Reducing storage requirements for biological sequence comparison. Bioinformatics 2004;20(18):3363–9. http://dx.doi.org/10.1093/bioinformatics/bth408
45 Lin Y, Yuan J, Kolmogorov M, et al. Assembly of long error-prone reads using de Bruijn graphs. Proc Natl Acad Sci USA 113:E8396–405.
46 Girgis HZ. Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinformatics 2015;16(1):227. http://dx.doi.org/10.1186/s12859-015-0654-5
47 Vembar SS, Seetin M, Lambert C, et al. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing. DNA Res 2016;23(4):339–51.
48 Tyson JR, O'Neil NJ, Jain M, et al. Whole genome sequencing and assembly of a Caenorhabditis elegans genome with complex genomic rearrangements using the MinION sequencing device. bioRxiv 2017:099143.
49 Schmidt MH-W, Vogel A, Denton A, et al. Reconstructing the gigabase plant genome of Solanum pennellii using Nanopore sequencing. Plant Cell 2017. pii: tpc.00521.2017. doi: 10.1105/tpc.17.00521.
50 Jain M, Koren S, Quick J, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv 2017:128835.
51 Simão FA, Waterhouse RM, Ioannidis P, et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 2015;31(19):3210–12.
Google Scholar Crossref Search ADS PubMed WorldCat 52 Parra G , Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes . Bioinformatics 2007 ; 23 ( 9 ): 1061 – 7 . http://dx.doi.org/10.1093/bioinformatics/btm071 Google Scholar Crossref Search ADS PubMed WorldCat 53 Benson G. Tandem repeats finder: a program to analyze DNA sequences . Nucleic Acids Res 1999 ; 27 ( 2 ): 573 – 80 . http://dx.doi.org/10.1093/nar/27.2.573 Google Scholar Crossref Search ADS PubMed WorldCat 54 Gurevich A , Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies . Bioinformatics 2013 ; 29 ( 8 ): 1072 – 5 . http://dx.doi.org/10.1093/bioinformatics/btt086 Google Scholar Crossref Search ADS PubMed WorldCat 55 Nattestad M , Schatz MC. Assemblytics: a web analytics tool for the detection of variants from an assembly . Bioinformatics 2016 ; 32 ( 19 ): 3021 – 3 . http://dx.doi.org/10.1093/bioinformatics/btw369 Google Scholar Crossref Search ADS PubMed WorldCat 56 Kurtz S , Phillippy A, Delcher AL, et al. Versatile and open software for comparing large genomes . Genome Biol 2004 ; 5 ( 2 ): R12 . Google Scholar Crossref Search ADS PubMed WorldCat 57 Baird NA , Etter PD, Atwood TS, et al. Rapid SNP discovery and genetic mapping using sequenced RAD markers . PLoS One 2008 ; 3 ( 10 ): e3376 . Google Scholar Crossref Search ADS PubMed WorldCat 58 Catchen JM , Amores A, Hohenlohe P, et al. Stacks: building and genotyping Loci de novo from short-read sequences . G3 2011 ; 1 : 171 – 82 . http://dx.doi.org/10.1534/g3.111.000240 Google Scholar Crossref Search ADS PubMed WorldCat 59 Hunt M , Silva ND, Otto TD, et al. Circlator: automated circularization of genome assemblies using long sequencing reads . Genome Biol 2015 ; 16 ( 1 ): 294 . http://dx.doi.org/10.1186/s13059-015-0849-0 Google Scholar Crossref Search ADS PubMed WorldCat 60 Chakraborty M , Baldwin-Brown JG, Long AD, Emerson JJ. 
Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage . Nucleic Acids Res 2016 ; 44 : e147 . Google Scholar Crossref Search ADS PubMed WorldCat © The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com © The Author 2017. Published by Oxford University Press.
Coloured Petri nets for multilevel, multiscale and multidimensional modelling of biological systems
Liu, Fei; Heiner, Monika; Gilbert, David
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx150 pmid: 29112705
Abstract
Owing to the availability of data on one biological phenomenon at different levels/scales, modelling of biological systems is moving from single level/scale to multiple levels/scales, which introduces a number of challenges. Coloured Petri nets (ColPNs) have been successfully applied to multilevel, multiscale and multidimensional modelling of some biological systems, addressing many of these challenges. In this article, we first review the basics of ColPNs and some popular extensions, and then their applications for multilevel, multiscale and multidimensional modelling of biological systems. This understanding of how to use ColPNs for modelling biological systems will assist readers in selecting appropriate ColPN classes for specific modelling circumstances.
Keywords: multilevel, multiscale and multidimensional modelling, coloured Petri nets, biological systems
Introduction
Systems biology [1, 2] studies the interactions between the components of a biological system and how these interactions produce the behaviour of that system. Mathematical and computational modelling plays a crucial role in achieving this goal. So far, a variety of modelling approaches, including Petri nets, Boolean networks and (ordinary or partial) differential equations, have been applied to a wide range of biological systems (see [3, 4] for reviews). Among them, Petri nets are particularly appropriate for describing and analysing the concurrent, asynchronous and dynamic behaviour of complex biological systems. Since Reddy et al. [5] introduced qualitative Petri nets to model metabolic pathways, different types of Petri nets [e.g. stochastic Petri nets (SPNs), timed Petri nets, continuous Petri nets (CPNs) and hybrid Petri nets] have been proposed for modelling biological systems [4, 6, 7]. However, as unparameterized formalisms, these standard Petri nets do not scale easily, and so they are usually applicable to smaller (biological) systems only.
In the past few years, because of the availability of data on one biological phenomenon at different levels/scales, modelling of biological systems has moved from single level/scale to multiple levels/scales [8]. Multilevel/multiscale modelling integrates information at different levels/scales into one model, which can describe a system more accurately and thus provide more insights into the system. Although ‘multilevel’ and ‘multiscale’ are often used synonymously, they are in fact distinct [9, 10]. In this article, we wish to distinguish them, but do not intend to provide a rigorous definition for them. Multilevel modelling considers dynamic processes at multiple levels (e.g. subcellular, cellular, tissue level) of biological systems, while multiscale modelling incorporates multiple different temporal and spatial scales in one model, regardless of whether the model has multiple levels. A multilevel model is not necessarily a multiscale model, and vice versa. However, multiple levels usually coincide with multiple spatial and temporal scales. Apart from multilevel and multiscale aspects, a biological model could also be constructed as multidimensional [11]. For example, when studying reaction–diffusion processes, we can model this phenomenon in one-, two- or three-dimensional (1D, 2D or 3D for short) space. A model involving more dimensions usually represents the system under study more accurately. Modelling beyond one level/scale introduces plenty of challenges, such as repetition of components (e.g. cells, tissues), (hierarchical) organization, communication or movement of components, differentiation, division or deletion of components, or pattern formation of a biological system. To address these challenges, coloured Petri nets (ColPNs) have been used to construct multilevel, multiscale and multidimensional models, and have gained increasing popularity for a wide spectrum of applications [12, 13].
ColPNs [14, 15] are an extension of standard Petri nets, proposed to represent large complex systems. Using ColPNs, a group of similar components of a system can be folded into one component, with each member of the group encoded as, and thus distinguished by, a colour. ColPNs offer parameterized and compact representations of complex systems, without losing the analysis capabilities of standard Petri nets thanks to automatic unfolding. Moreover, ColPNs make it easy to increase the size of a model consisting of many similar components just by adding new colours. ColPNs have been widely applied to modelling protocols and technical networks, software, workflows and business processes, hardware and manufacturing systems [16]. Recently, ColPNs have been used for modelling biological systems, e.g. in an early attempt, ColPNs were used for discriminating metabolites, which follow different T-invariants [17]. Later, a ColPN-based approach to multilevel/multiscale modelling of biological systems was presented in [12], and some successful applications appeared, e.g. modelling multicellular systems [18] and spatial diffusion [11]. In summary, ColPNs have proved appropriate for constructing multilevel, multiscale and multidimensional models.
Multilevel modelling. The levels to be considered can be represented by the use of tuples within tuples. That is, each tuple encodes a level. For example, in the fly wing, we use a colour tuple (x, y) to represent the cell level, and another tuple (a, b) to represent each compartment of a cell. Thus, a nested tuple (x,y,(a,b)) describes two levels of the fly wing model [19].
Multiscale modelling. Multiscale modelling often goes hand in hand with multilevel modelling. Thus, the encoding of multiscale models with colours is similar to that for multilevel modelling.
The mapping functions between spatial scales can be implemented via media (auxiliary) nodes (places or transitions), which are then used by rate functions at different scales [20]. The mapping functions between temporal scales can be explicitly represented via hybrid Petri nets [21].
Multidimensional modelling. A multidimensional grid can be represented by the use of colour tuples, whose arity equals the number of dimensions: 1, 2 or 3. That is, a colour encodes a spatial locality of the grid in 1D, 2D or 3D space. For example, in a 2D grid, each grid cell can be defined as a colour tuple, e.g. (x, y), and the connectivity between cells can be defined as a neighbourhood function of colours [11]. Furthermore, tessellations with different cell shapes, e.g. hexagonal cells instead of rectangular cells, can also be easily defined [22].
In this article, we will review the basics and some extensions of ColPNs and also their applications for the modelling of biological systems in terms of the aforementioned three categories. We hope this review will open the door for a wide use of ColPNs in the systems biology area.
Coloured Petri nets
ColPNs offer a parameterized method for modelling a large system, where a group of similar components of the system is defined as and distinguished by a set of colours, thus presenting a compact representation of that system. For example, Figure 1B gives a ColPN by defining the left and right components (both components have the same structure) in Figure 1A as two colours.
Figure 1. A ColPN example. (A) A prey–predator Petri net model with migration. (B) A ColPN model by folding the left and right components in (A). The declarations are as follows: CS=enumeration with a, b; variable x: CS. The successor operator ‘+’ in the arc expression +x returns the successor of x in an ordered finite colour set; if x is the last colour, then it returns the first colour. See [23] for the syntax of all declarations.
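The colour encodings described above can be illustrated with a minimal Python sketch. It is not the declaration syntax of any ColPN tool; the function names (`successor`, `grid_colours`, `neighbours`, `two_level_colours`), the 1-based indexing and the choice of a 4-neighbourhood clipped at the border are assumptions made here for illustration only.

```python
from itertools import product

CS = ["a", "b"]  # colour set CS = enumeration with a, b (as in Figure 1)


def successor(colour, colour_set):
    """Cyclic successor in an ordered finite colour set (the '+x' operator)."""
    i = colour_set.index(colour)
    return colour_set[(i + 1) % len(colour_set)]


def grid_colours(nx, ny):
    """All colours (x, y) of an nx-by-ny grid, using 1-based indices."""
    return list(product(range(1, nx + 1), range(1, ny + 1)))


def neighbours(cell, nx, ny):
    """Assumed 4-neighbourhood of a grid cell, clipped at the grid border."""
    x, y = cell
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(i, j) for i, j in candidates if 1 <= i <= nx and 1 <= j <= ny]


def two_level_colours(nx, ny, na, nb):
    """Nested tuples (x, y, (a, b)): a cell level and a compartment level."""
    return [
        (x, y, (a, b))
        for (x, y) in grid_colours(nx, ny)
        for (a, b) in grid_colours(na, nb)
    ]
```

Under these assumptions, `successor("b", CS)` wraps around to the first colour, and enlarging the grid only changes the arguments `nx` and `ny`, which mirrors the way a ColPN model grows by adding colours rather than nodes.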
Like standard Petri nets, ColPNs [12, 24] are directed bipartite multi-graphs consisting of places, transitions and arcs connecting places and transitions. In the biological scenario, places may represent any species or chemical compounds, such as genes, mRNAs, proteins, protein conformations or protein complexes, while transitions may represent chemical reactions (such as transcription and translation), molecular interactions or intramolecular changes. Additionally, a group of colour sets is defined for a ColPN. Each colour set is based on a data type, which is a set of values (colours) that obey some properties of a programming language [25]; common data types include integer, Boolean, string, enumeration and structure. Each place is assigned a colour set and may contain distinguishable tokens, i.e. each token is associated with a specific colour. As there can be several tokens of the same colour on a given place, the tokens on the place are best described by a multiset over its colour set. A specific distribution of tokens over all places constitutes a marking of a ColPN. Each transition is associated with a guard, which is a Boolean expression over defined variables, constants and functions. The guard of a transition must evaluate to true for the transition to be enabled. The trivial guard ‘true’ is usually not given explicitly. Each arc is assigned an expression; the result type of the expression is a multiset over the colour set of the connected place. In Table 1, we briefly compare the properties of the elements of ColPNs and uncoloured Petri nets, taking the models in Figure 1 as an example.
Table 1. A comparison of properties of elements in ColPNs and uncoloured Petri nets
| Elements | ColPNs | Uncoloured Petri nets |
|---|---|---|
| Declaration | Colour sets, e.g. CS=enumeration with a, b | N/A |
| | Variables, e.g. x: CS | N/A |
| Place | A colour set, e.g. CS for p1 | N/A |
| | Coloured tokens, e.g. 10`a++10`b on p1 | Black tokens, e.g. 10 on place p1_a |
| Transition | A guard, e.g. ‘true’ for r1 | N/A |
| Arc | A multiset expression, e.g. 2`x on the arc (r1, p1) | A positive integer multiplicity, e.g. 2 on the arc (r1_a,p1_a) |
| Marking | A vector of multiset expressions | A vector of non-negative integers |
Note: N/A: Not applicable.
Each colour of a place corresponds to a place instance when unfolded.
Each transition is surrounded by a set of expressions, including its guard and the expressions on its adjacent arcs, which may involve a set of variables. Before the expressions are evaluated, the variables must be assigned values of suitable data types, which is called binding [24]. Each binding of a transition corresponds to a transition instance when unfolded. Enabling and firing of a transition instance are based on the evaluation of both its guard and related arc expressions. If the guard evaluates to true and the preplaces hold sufficient appropriately coloured tokens once the arc expressions have been evaluated for a given binding, the transition instance that corresponds to the binding is enabled and may fire. When a transition instance fires, it removes appropriately coloured tokens from its preplaces and adds appropriately coloured tokens to its postplaces, i.e. it changes the current marking to a new reachable one. The colours of the tokens that are removed from preplaces and added to postplaces are determined by the arc expressions. The set of markings reachable from the initial marking constitutes the state space of a given net. These reachable markings and the transition instances between them constitute the reachability graph of the net. An uncoloured Petri net (Figure 1A) can be folded to a ColPN (Figure 1B), either manually or in a semi-automatic way [26]. Conversely, a ColPN (Figure 1B) can be automatically unfolded to an uncoloured Petri net (Figure 1A); afterwards, all the simulation algorithms or analysis techniques for uncoloured Petri nets can be used for ColPNs [27]. Based on basic ColPNs, many extensions have been proposed for different purposes, e.g. arc extensions [coloured Petri nets with extended arcs (ColXPNs)], time extensions [coloured timed and coloured stochastic Petri nets (ColSPNs)] and state space extensions [coloured continuous and coloured hybrid Petri nets (ColHPNs)] [28].
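As a rough illustration of binding, enabling, firing and unfolding as described above, the following Python sketch represents a marking as one multiset (`collections.Counter`) per place. The two transitions `r1` and `migrate`, their arc expressions and all function names are hypothetical and only loosely inspired by Table 1; they are not the exact net of Figure 1B, and the sketch is not the algorithm of any particular tool.

```python
from collections import Counter

CS = ["a", "b"]  # colour set CS = enumeration with a, b


def succ(colour):
    """Cyclic successor in the ordered colour set (the '+x' operator)."""
    return CS[(CS.index(colour) + 1) % len(CS)]


# Hypothetical transitions on a single place p1.
# Guards and arc expressions are functions of a binding such as {"x": "a"}.
TRANSITIONS = {
    "r1": {  # assumed reproduction: consumes 1`x, produces 2`x
        "guard": lambda b: True,
        "pre": {"p1": lambda b: Counter({b["x"]: 1})},
        "post": {"p1": lambda b: Counter({b["x"]: 2})},
    },
    "migrate": {  # assumed migration: consumes 1`x, produces 1`(+x)
        "guard": lambda b: True,
        "pre": {"p1": lambda b: Counter({b["x"]: 1})},
        "post": {"p1": lambda b: Counter({succ(b["x"]): 1})},
    },
}


def enabled(marking, name, binding):
    """Guard is true and every preplace holds enough tokens of each colour."""
    t = TRANSITIONS[name]
    if not t["guard"](binding):
        return False
    return all(
        marking[place][colour] >= count
        for place, expr in t["pre"].items()
        for colour, count in expr(binding).items()
    )


def fire(marking, name, binding):
    """Return the new marking reached by firing one transition instance."""
    if not enabled(marking, name, binding):
        raise ValueError(f"{name} is not enabled for binding {binding}")
    t = TRANSITIONS[name]
    new = {place: Counter(tokens) for place, tokens in marking.items()}
    for place, expr in t["pre"].items():
        new[place].subtract(expr(binding))  # remove coloured tokens
    for place, expr in t["post"].items():
        new[place].update(expr(binding))  # add coloured tokens
    return new


def unfold_marking(marking):
    """One uncoloured place instance per colour, e.g. p1_a and p1_b."""
    return {f"{p}_{c}": tokens[c] for p, tokens in marking.items() for c in CS}


m0 = {"p1": Counter({"a": 10, "b": 10})}  # 10`a++10`b on p1
m1 = fire(m0, "r1", {"x": "a"})
m2 = fire(m0, "migrate", {"x": "b"})
```

Each pair of a transition name and a binding plays the role of one transition instance of the unfolded net, and `unfold_marking` maps the coloured marking to the vector of non-negative integers that the uncoloured net would carry.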
In the following, we briefly review the most important extensions, which have already been used or potentially could be used for the modelling of biological systems.
Coloured Petri nets with extended arcs
ColPNs have been extended to incorporate different special arc types such as read arcs (often also called test arcs), inhibitor arcs and reset arcs [12, 28]. These special arcs either make the model representation more compact while keeping the modelling power, or strictly extend the modelling power of the Petri net formalism. All these special arcs are only allowed to go from places to transitions. Read and inhibitor arcs add constraints on the firing of a transition, but the connected places are not affected on firing. A read arc models a resource (e.g. an enzyme in a chemical reaction) that is required, but not exclusively, and that is not consumed on firing; hence, the same token can be used at the same time by more than one transition. An inhibitor arc reverses the logic of the enabling condition of a place, i.e. it imposes the constraint that a transition may only fire if the place contains fewer tokens than the weight that the arc indicates. A reset arc empties the place connected by this arc once the transition fires; the number of tokens on the place does not matter for enabling. Besides, ColPNs can be further enriched to include marking-dependent arcs, i.e. the arc multiplicities are allowed to be marking-dependent expressions of various types in terms of a transition’s preplaces [29], which facilitates the modelling of some special biological scenarios such as cell division [30, 31]. ColPNs and ColXPNs can be analysed using a variety of techniques, such as structural analysis (confined to models without special arcs extending the modelling power) [12] or state space analysis based on computation tree logic (CTL), which is a branching time temporal logic [32] matching the needs for analysing reachability graphs (model checking).
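The effect of read, inhibitor and reset arcs on enabling and firing can be sketched in a few lines of Python. For brevity the sketch works on an uncoloured (unfolded) marking with integer token counts; the arc representation as `(place, kind, weight)` triples, the function names and the enzyme example are assumptions made here for illustration.

```python
def is_enabled(marking, in_arcs):
    """Check the enabling condition for arcs given as (place, kind, weight)."""
    for place, kind, weight in in_arcs:
        tokens = marking.get(place, 0)
        if kind in ("standard", "read") and tokens < weight:
            return False  # not enough tokens to consume or to test
        if kind == "inhibitor" and tokens >= weight:
            return False  # enabled only with fewer tokens than the weight
        # reset arcs never influence enabling
    return True


def fire(marking, in_arcs, out_arcs):
    """Fire once; only standard and reset arcs change their preplaces."""
    if not is_enabled(marking, in_arcs):
        raise ValueError("transition is not enabled")
    new = dict(marking)
    for place, kind, weight in in_arcs:
        if kind == "standard":
            new[place] = new.get(place, 0) - weight
        elif kind == "reset":
            new[place] = 0  # empty the place, whatever it contained
    for place, weight in out_arcs:
        new[place] = new.get(place, 0) + weight
    return new


# Hypothetical enzymatic reaction: S -> P, catalysed by E, blocked by I,
# with a by-product pool W that is cleared on firing.
in_arcs = [
    ("S", "standard", 1),
    ("E", "read", 1),
    ("I", "inhibitor", 1),
    ("W", "reset", 0),
]
out_arcs = [("P", 1)]

m0 = {"S": 3, "E": 1, "I": 0, "W": 5, "P": 0}
m1 = fire(m0, in_arcs, out_arcs)
blocked = is_enabled({**m0, "I": 1}, in_arcs)
```

In this sketch the enzyme place `E` keeps its token after firing, which is the compact form of a pair of opposite standard arcs, whereas the inhibitor and reset arcs are the cases that change what the formalism can express.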
See [12] for details on the use of these techniques for the analysis of ColPNs. ColPNs and ColXPNs have been widely used for modelling biological systems when kinetic data are not available.
Coloured timed Petri nets
There are many different types of coloured timed (or time) Petri nets (ColTPNs), but here we confine ourselves to the ColTPNs implemented in CPN Tools [24], which have gained wider use in different fields. In a ColTPN, each token carries a second value called a time stamp (a non-negative integer) in addition to the token’s colour. The time stamp of a token gives the time at which the token can be moved from its associated place. ColTPNs work in a similar way to the event queues of many discrete event simulation engines. Using ColTPNs, performance measures of a system can be computed. In the biological area, early applications were usually built with ColPNs or ColTPNs supported by CPN Tools or its predecessor Design/CPN [33]; see [34, 35].
Coloured stochastic Petri nets
ColSPNs are a coloured version of stochastic Petri nets (SPNs) [12]. A firing delay is introduced and associated with each transition, which is a random variable defined by an exponential probability distribution. Therefore, the semantics of a ColSPN is equivalent to a continuous time Markov chain (CTMC), which is constructed from the reachability graph of the underlying qualitative Petri net by labelling the arcs between states with the state transition rates. Thus, in addition to the analysis techniques given above, we can further use quantitative analysis techniques such as model checking with continuous stochastic logic (CSL) [36], a probabilistic counterpart of CTL, or probabilistic linear-time temporal logic with numerical constraints (PLTLc) [37]