Developing a ‘personalome’ for precision medicine: emerging methods that compute interpretable effect sizes from single-subject transcriptomes

The development of computational methods capable of analyzing -omics data at the individual level is critical for the success of precision medicine. Although unprecedented opportunities now exist to gather data on an individual’s -omics profile (‘personalome’), interpreting and extracting meaningful information from single-subject -omics remain underdeveloped, particularly for quantitative non-sequence measurements, including complete transcriptome or proteome expression and metabolite abundance. Conventional bioinformatics approaches have largely been designed for making population-level inferences about ‘average’ disease processes; thus, they may not adequately capture and describe individual variability. Novel approaches intended to exploit a variety of -omics data are required for identifying individualized signals for meaningful interpretation. In this review, intended for biomedical researchers, computational biologists and bioinformaticians, we survey emerging computational and translational informatics methods capable of constructing a single subject’s ‘personalome’ for predicting clinical outcomes or therapeutic responses, with an emphasis on methods that provide interpretable readouts. Key points: (i) the single-subject analytics of the transcriptome shows the greatest development to date and (ii) the methods were all validated in simulations, cross-validations or independent retrospective data sets. This survey uncovers a growing field that offers numerous opportunities for the development of novel validation methods and opens the door for future studies focusing on the interpretation of comprehensive ‘personalomes’ through the integration of multiple -omics, providing valuable insights into individual patient outcomes and treatments.
Key words: single-subject studies; personalome; precision medicine; n-of-1

Francesca Vitali, PhD, is a Research Assistant Professor at the University of Arizona. Her main research interests are in pharmacogenomics, drug repurposing, precision medicine, bioinformatics and big data techniques.

Qike Li is a PhD candidate at the University of Arizona. His research interests are in the area of single-subject analytics with applications in precision medicine.

Grant Schissler is an Assistant Professor at University of Nevada, Reno. Recently, he has helped to build statistic informatics tools that allow clinical researchers to interpret genomic data of individual patients.

Joanne Berghout, PhD, is a Research Assistant Professor at the University of Arizona. She uses genetics and ontologies to uncover patterns and candidate genes associated with Mendelian and complex diseases.

Colleen Kenost, EdD, is Director of Operations, Center for Biomedical Informatics and Biostatistics, University of Arizona. Her role is to translate research prerogatives into action and operationalize strategic plans.

Yves Lussier, MD, is Professor of Medicine and Director, Center for Biomedical Informatics and Biostatistics, University of Arizona. His research group solves problems related to computational precision medicine and translational bioinformatics.

Submitted: 15 July 2017; Received (in revised form): 6 October 2017

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/bib/article/20/3/789/4758622 by DeepDyve user on 16 July 2022
Introduction

The arrival of precision medicine has led to a more individual-based view of diseases, with characteristics of single subjects being central to the prediction of clinical outcomes and prescription of tailored treatments. This concept is not new; in fact, evidence-based clinical practice guidelines [1] stratify treatments according to some patient characteristics (e.g. gender, ancestry, age, family history, some laboratory test results). However, precision medicine differs from the traditional medical approach, as it seeks to leverage not only clinical variables and clinician-selected genetic tests but also broad and data-intensive molecular and general -omics profiles of a patient [2]. These large and heterogeneous data cannot be interpreted directly by medical practitioners and require an automatic procedure for extracting relevant knowledge before incorporation into clinical practice. Therefore, it is fundamental to develop computational methods aimed at analyzing these data at the individual level.

Current approaches aimed at analyzing disease or other biological processes, therapeutic efficacy and -omic data still leverage well-established cohort-based population analyses such as case-control studies [e.g. gene expression classifiers (GExpCs)], observational trials or controlled intervention trials. These large cohort/group approaches place emphasis on the group average rather than individual participants; though this group average may not represent any actual individual’s personal profile, let alone be meaningful to understanding the profile of a given specific patient. On the other hand, the framework of N-of-1 trials has been applied to repeated measures of a single analyte for over two decades [3]. This approach is based on the collection of various relevant data for one person as frequently as possible [4]. In this way, novel strategies can be explored to compare different treatments of the same person. Moreover, by looking at commonalities across multiple N-of-1 studies collecting the same type of data, it is possible to estimate the efficacy of an intervention in a specific subset population (i.e. people sharing a particular genetic profile). N-of-1 trials demonstrated their power to evaluate treatment effectiveness in a single subject for one variable [5], but proposed approaches for one analyte do not scale for -omics legion-size data sets.

Although we now have an unprecedented technical opportunity to gather data relating to an individual’s -omic profile, bioinformatics tools to understand these data comprehensively, and at the individual level, remain underdeveloped. Novel approaches for identifying individualized (single-subject), and not cohort, signals are required for gathering insights into the biology of diseases and healthy states of individuals. This review focuses on computational methods aimed at analyzing quantitative transcriptomic measurements of an individual and the combination of transcriptome with other -omic data.

In this review, we define the personalome as an interpretable personal molecular mechanism profile of an individual derived from one or more scales of -omic data, especially when designed to enable precision medicine. ‘Personal -omics’ means the -omics measures of a single subject. Molecular mechanisms are any molecular functions or biological processes such as a missense mutation in DNA, or a differentially expressed pathway (DEP) at the transcriptome or proteome. To be considered interpretable at the molecular mechanism, the raw -omics profile must have been subjected to analyses performing (i) dimension reduction and (ii) biomolecular interpretation of the mechanisms involved in molecules of life (Figure 1). For example, full genomes are reduced to variant and mutation calls through analyses against a reference genome, or in the case of cancer, by also comparing paired cancer and unaffected tissue to determine somatic versus germline mutations. Here, we show how differentially expressed molecules of life and pathways can be unveiled in a single subject through the analysis of transcriptome data.

Figure 1. Flow chart of methods designed for clinical interpretation of single-subject -omics. This review addresses the gap of knowledge to compare and contrast single-subject methods designed to reduce the dimension of raw -omics data (left) and to provide a biomolecular interpretation of signals (gray rectangle). For DNA sequencing, variant and mutation calls as well as all functional annotations in single subjects (e.g. missense mutation) already bridge this gap. However, this intermediate step is often omitted for other molecules of life, such as mRNAs, miRNAs, proteins, methylated DNA regions and metabolites (carbohydrates and lipids). This review focuses on single-subject methods that analyze transcriptome data. The ‘Clinical applications’ section provides emerging evidence that the newly available, unbiased SSA of the transcriptome enable innovative types of studies to investigate their clinical utility by addressing the gap of biomolecular interpretation of raw -omics signals. Among possible studies, we demonstrate that -omics clinical prediction classifiers that operate directly at the -omics scale may be redesigned for the parsimonious transformed signal of single-subject studies for improved clinical utility.

We surveyed emerging novel computational biology, biostatistical and translational informatics methods that construct a single subject’s personalome by analyzing transcriptome data to predict outcomes or therapeutic responses without requiring the large cohort needed for conventional approaches. Our review methodology is detailed in the Supplementary Material S1. Particular emphasis is placed on those methods that provide clinically interpretable readouts rather than simple categorical classification, as the latter are known to be difficult to reproduce across data sets and contain noisy, incidental and passenger variation [6–9]. The papers and methods selected for review reflect the authors’ views and are not intended to provide an exhaustive search. Figure 2A depicts all considered publications by year of publication and number of citations, and the studies are shown with different colors and shapes according to the type of required data input and output, respectively. Figure 2B shows the number of citations over time.

The review is divided according to the type of data inputs in the methods (i.e. transcriptome and integrated -omics). A review of the validations of all methods follows, and finally, we discuss and conclude with the broad challenges, the applications and the opportunities in developing a personalome for precision medicine, i.e. how the single-subject analyses (SSAs) of -omics data can bring novel insights into disease mechanisms specific to a patient and unveil potential patient-specific treatments. A table of contents for the review is provided in Table 1.

Transcriptome

Transcriptome analysis aims to interpret the quantification of transcribed genetic material, including both coding and noncoding RNA. Different from DNA, which is relatively static, analyses of the transcriptome capture the collective impact of tissue type, sequence variation, regulation, environment, external stimulation (e.g. drug treatments) and interactions between them. High-throughput technologies, such as microarray and RNA sequencing (RNA-Seq), are capable of assessing transcript expression at genome-scale for an individual sample, with RNA-Seq providing unbiased detection, broader dynamic range, increased specificity and sensitivity and easier detection of rare and low-abundance transcripts.

The transcriptome provides a snapshot of transcriptional activity under the condition where the RNA was collected, allowing researchers to study the biological impact of certain diseases or effect of treatments [10]. This allows us to better understand general disease mechanisms, discover biomarkers or identify drug targets at the cohort scale when sufficient samples are collected, but also has the power to reveal individual-specific signals, whose detection and analysis through computational methods can lead to far more precise medical understanding and decision-making. Analysis of more than one transcriptome of an individual enables the assessment of personal dynamic changes over time or in response to therapy or other environmental changes. Yet, identifying important individual signals is not a trivial task, as transcript expression variations in a given tissue and time point are further modulated by stochastic variability, cyclic patterns (e.g. circadian) and platform biases or measurement errors in

Figure 2. SSA studies included in this review. (A) Each numbered point represents a publication plotted by year of publication and the relative number of citations (in log2 scale). Numbers correspond to the publication in this article’s reference list; colors indicate the type of input required, i.e. one single-subject sample (1 ss SAMPLE, green), two paired single-subject samples (2 ss SAMPLES, purple) or the collection of multiple samples from the same subject (multiple ss SAMPLES, orange). The shapes represent the type of output provided by the selected studies, i.e. circles for DEGs and X for DEPs. Finally, blue squares indicate methods based on the integration of transcriptome data with other -omics.
(B) Number of citations over time starting from the publication year for the single-subject studies analyzing transcriptome data. Color and shape coding is the same as in (A).

addition to signals, which are truly relevant to the disease state. The power of the methods reported in this section is that, starting from thousands of genes, they are able to provide information on the key genes and mechanisms (i.e. pathways) of a disease. This can speed up the planning of future and effective studies.

Table 1. Table of contents of the review
Section
Transcriptome
Cross-subject transcriptome analyses
Single-subject transcriptome analyses
DEGs identification in single-subjects
DEPs identification in single-subjects
Longitudinal time series analyses of transcriptome
Single-subject transcriptome integrated with other -omics
Validation of single-subject methods
Clinical applications
Perspective and conclusion

Cross-subject transcriptome analyses

Conventional transcriptome analytics require well-powered cohorts of both cases and controls and describe variation in transcriptome when comparing two or more classes with a variety of methods (e.g. t-test [11, 12], analysis of variance [13], linear mixed models [14], modeling via the negative binomial distribution [15–17]). These strategies are designed to identify DEGs of ‘average responses across patients’ under particular experimental conditions (e.g. disease versus normal; or predrug and postdrug treatment). To extract more interpretable results, genes detected as differentially expressed are often further categorized according to enrichment or membership in knowledge bases such as curated biological pathways or functional gene sets (e.g. Kyoto Encyclopedia of Genes and Genomes (KEGG) [18], Gene Ontology (GO) [19]). In this way, DEPs of average responses of patients can be identified, providing a more comprehensible view of the transcriptomic processes under study versus a simple gene list that requires significant gene recognition and subject matter expertise for interpretation. A wide array of studies and tools belong to this category, including popular ones such as gene set enrichment analysis (GSEA) [20] and DAVID [21]. In general, there are two main strategies to identify DEPs: (i) gene set-centric (GC) and (ii) pathway-centric analyses (PC). The GC approach is generally performed in two steps: first, DEGs are selected, and then the DEPs are computed by statistically testing the genes against the background. A critical limitation of the GC strategy is that the results strongly depend on the DEGs identified in the first step. In fact, small changes in the DEG analysis may lead to the detection of a slightly different DEG list that can result in large changes of the identified DEPs. In addition, the final result is significantly affected by the arbitrary cutoff chosen in the enrichment step, as the majority of statistical tests require a P-value threshold [22]. Therefore, we are providing the minimum number of genes in each gene set (Figure 5, column ‘Minimum # of transcript per scored gene set’), as methods providing a higher minimal threshold will be less susceptible to this bias. However, another limitation common to all reviewed DEPs is that similar gene sets are not identified as biomolecularly related in the resulting set, though postprocessing methods are available to address it [23–25].

The PC strategy is a distinct approach to derive statistics directly on the pathways without using DEGs. This approach is more sensitive to a concordant change of expression in the same direction, even if the transcripts would not be otherwise identified as DEGs. While more sensitive to directionally dysregulated pathways than GCs, current implementations of PCs are not designed to identify dysregulated pathways with both upregulated and downregulated transcripts.

However, a limitation when focusing on the identification of DEPs relies on the selection of the considered prior knowledge on pathways. Currently, several knowledge sources, such as KEGG [18], Reactome [26] and Pathway Commons [27], can be used. This may cause redundancy and different results; moreover, such data sources may contain incomplete, incorrect or inconsistent data. Such dependencies between pathways could result in correlated P-values and overdispersion of the number of significant pathways, leading to biased results [28]. Therefore, future studies are required to compare the robustness of DEP methods in presence of noise and missing gene set annotations.

While these approaches for transcriptome analysis are strong in the right context and if properly powered, only few are designed to scale down to individuals. For many DEG detection methods, this failure to scale down to a single subject is an inherent limitation of the underlying mathematical constructs, as they rely on a minimum of three replicates to assess gene-level variance, overdispersion and/or other parameters requiring multiple subjects. Under most experimental designs, cross-sample replicates are used, though triplicate samples from the same individual could potentially be used as a proxy when these are not resource limiting. Although the cost of high-throughput sequencing has been declining, it is still resource-prohibitive to sequence multiple samples, especially when sample procurement is naturally invasive.

Other conventional approaches for analyzing transcriptomes exploit curated knowledge of a particular disease to specifically examine validated or hypothesized markers whose gene expression differs from the reference ‘normal’ or is expressed above a predetermined threshold. This is the case for Oncotype DX™ [29], PAM50 [30] and other clinically available tests that classify samples into tumor subtypes. Reliance on a predefined panel of genes dodges the problems of dimensionality and signal-to-noise detection in raw transcriptome data, but limits scalability across multiple characteristics of a disease and prevents the investigation of novel transcripts and disease mechanisms. To address these issues, other clustering-based techniques can be applied to gather patterned genes across/within samples or data sets (for a review, see [31]) and to obtain classifiers which can then be explored for within-group commonalities and cross-group differences [32]. However, they require a large number of samples, as well as careful external validation in large data sets that have adequate protection from bias and have been reviewed elsewhere [33].

Single-subject transcriptome analyses

In the context of SSA, several studies have been proposed for extracting relevant biological knowledge from transcriptome data without the large cohort requirement. These approaches can be divided into different categories based on either (i) the number of samples from the same subject they require or (ii) the type of output they provide.

As illustrated in Figure 3, single-subject studies can be categorized into GC (DEGs; Figure 4 and Table 2) or PC (DEPs; Figure 5). Based on this classification, we reported the related studies

Figure 3. Current strategies to analyze single-subject transcriptomes.
Analysis of single-subject transcriptomes can usually be divided by the required number of samples into (i) single sample analyses, (ii) paired sample analyses or (iii) analyses of more samples (not shown). They can also be categorized according to their outputs: (i) Differentially Expressed Genes (DEGs), (ii) Differentially Expressed Pathways (DEPs) or (iii) Disease Scores (DSs). Note: DEP* = not a true DEP, but rather a relative expression level of the pathways, because there is no reference or baseline against which to compare the pathway expression of a single sample.

in ‘DEGs identification in single subjects’ and ‘DEP identification in single subjects’ sections, respectively.

Figure 4. Summary of single-subject methods that analyze transcriptome data to identify DEGs. Note: Additional details are available in Table 2.

Table 2. Additional details on single-subject transcriptome analyses of DEGs

Publication: Wang et al. [34]
Name: RankComp
Description: RankComp requires two inputs: (i) a disease sample and (ii) a set of accumulated normal samples, which can be accrued during the same experiment or a priori from various external resources. RankComp begins by ranking genes within the samples (both the case and the normal) according to increasing expression values. Next, pairwise rank comparisons are performed to identify (a) stable gene pairs and (b) reversal gene pairs. Stable gene pairs are defined as those with the same ordering in 99% of the accumulated normal samples [expression_geneA > expression_geneB], while reversal gene pairs are identified by disruption of that ordering in the disease sample [expression_geneA < expression_geneB]. Fisher’s exact test is conducted to test the null hypothesis that, for a given gene, the numbers of reversal gene pairs supporting its upregulation or downregulation are equal. This procedure enables extraction of a list of DEGs for a single subject, and interpretable results can be obtained through manual examination or by performing gene set enrichment analyses

Publication: Liu et al. [35]
Name: DNB
Description: Computational approach based on DNB theory to detect pre-disease states

Publication: Wang et al. [36]
Name: DEGseq
Description: DEGseq identifies DEGs using RNA-Seq data collected from a single subject. When replicates are not available, the authors suggest an MA-plot-based method with a random sampling model, which assumes the expression counts follow a binomial distribution. Given the average of log2-transformed expression levels, it approximates the log2 expression fold change by a normal distribution, and then calculates a Z-score based on this distribution. P-values are computed based on Z-scores

Publication: Tarazona et al. [37]
Name: NOISeq
Description: NOISeq is a data-adaptive and nonparametric approach, which has a variant, NOISeq-sim, that works without replicates. NOISeq-sim uses simulated replicates when real replicates do not exist. It simulates replicates under the assumption that gene expression counts follow a multinomial distribution in which the probability of each gene corresponds to the probability of a read mapping to that gene. The probability of each gene is estimated by the proportion of its read counts relative to the total number of mapped reads from the only sample under the corresponding experimental condition. With the simulated replicates, NOISeq-sim generates a joint null distribution of fold-changes (M) and absolute differences (D) of the expression counts from the replicates within the same condition. This joint null distribution is then used to assess differential expression by a gene’s (M, D) pair computed between conditions

Publication: Feng et al. [38]
Name: GFOLD
Description: This method assumes a Poisson distribution (λ) for the gene expression counts and a uniform prior distribution for λ. After computing a posterior distribution of λ for each gene, GFOLD ranks gene expression changes of all genes based on the cth percentile of these posterior distributions, where c is determined by users. In this way, it penalizes genes with low expression levels for their larger variances

Publication: Anders et al. [39]
Name: DESeq
Description: When neither condition (i.e. affected and control sample) has replicate transcriptomes, DESeq assumes the majority of the genes as non-DEGs and estimates a mean–variance relationship from treating the two samples as if they were replicates [33]

Publication: Robinson et al. [17]
Name: edgeR
Description: edgeR assumes that RNA-Seq data follow a negative binomial distribution for which, given the mean, the variance is determined by a dispersion parameter. When working without replicates, edgeR assigns the same value of the dispersion parameter to all genes and conducts a negative binomial exact test to compute P-values. Note that the value of dispersion is predetermined based on investigators’ understanding of the biological nature of the samples rather than estimated from data [18]

The selected studies can be further divided according to the number of samples involved: (i) analysis of single samples (top of Figure 3), (ii) analyses of single individuals using paired samples from the same subject (bottom of Figure 3) and (iii) multiple measurements in single subjects (not shown in Figure 3). This last class of methods is reported in the ‘Longitudinal time series analyses of transcriptome’ section.

The utility of single-subject discovery of differentially expressed patterns is central to precision medicine. For example, implicating DEGs in a patient may identify an unconventional treatment (i.e. personal drug repositioning) for this disease, assuming that these DEGs are well-established targets of the drug [52] in another related disease state (e.g. cancer targets). On the other hand, if the aim of the study is to investigate a disease or a particular condition from a broader point of view and to promote greater interpretation of the gene expression results, DEP analyses should be preferred. For example, Figure 5 refers to methods directly imputing DEPs from the transcriptome, i.e. PC approaches. To the best of our knowledge, these methods have not been compared with enriching DEGs into DEPs using methods from Figure 4. However, these DEP methods have been evaluated in vitro and in vivo as shown in the last section of the article, and thus remain the validated strategy for imputing DEPs until properly compared with single-subject DEGs followed by enrichment.

Figure 5. Summary of single-subject methods that analyze transcriptome data to identify DEPs.

Another key difference in the analysis of single-subject transcriptomes is the number of samples required from the subject. In addition, we have found that successful single-sample methods require not only the individual sample but also a cohort reference (‘reference-based’) to perform comparisons and to detect DEGs or DEPs (Figures 4 and 5, column ‘Cohort reference size’). This type of strategy is particularly useful when matched normal and disease samples are unavailable or limited (e.g. brain or heart tissue samples).

Analysis of individuals using paired samples naturally requires that both samples be drawn from the same subject. As samples are isogenic aside from potential somatic variation and/or taken from the same tissue and environmental context
aside from any experimentally induced stimuli, this design increases the signal-to-noise ratio and improves the detection of relevant DEGs or pathways. For example, studying both tumor and non-tumor tissues from a cancer patient focuses attention on pathogenic and compensatory mechanisms that differentiate the two tissues because of the disease state. While we review the methods that strive to mine the most information from limited data (e.g. a pair of transcriptomes of a single subject), investigators need to be cautious that methods do not replace data [53].

An additional aspect we underlined is the requirement of user-defined parameters or heuristics in the considered publications (Column 8 in Figures 4 and 5). Automated methods, not requiring user-defined parameters, are considered superior, as they are less biased and more convenient.

The following subsections will focus on methods for imputing single-subject DEGs (‘DEGs identification in single subjects’ section) and then single-subject DEPs (‘DEP identification in single subjects’ section).

DEGs identification in single subjects

In this section, we outline and describe emerging studies aimed at extracting DEGs starting from: (i) one sample of the individual and (ii) paired samples drawn from the same subject. A detailed description of the methods is provided in Figure 4 and Table 2.

Approaches based on a single sample of an individual

We identified two single-sample methods [34, 35] designed to inform on an individual’s transcriptome aberrations. Both methods require the application of a cohort reference, but differ in their predicted outputs. The first one, called RankComp [34], identifies DEGs by comparing the gene expression of the affected sample with a baseline, akin to a reference genome or normal range for clinical testing. RankComp has been applied separately to both total mRNA and microRNA (miRNA) investigations [54]. In the miRNA study, the authors demonstrated the power of their method to identify deregulation of miRNAs and miRNA–target pairs with mutually exclusive alterations. This approach has the limitation of not being sensitive enough for detecting genes whose differential expression causes only minor changes in the ranking. The second method [35], DNB (Figure 4), unlike RankComp, predicts critical disease transition from one sample of an individual, by comparing it with multiple control samples (from other data sets). This type of approach is particularly interesting for investigating individual profiles and classifying them as healthy, pre-disease or disease state.

Approaches based on paired samples of an individual

Although DEG identification often requires a large cohort of samples, a few attempts have been made to identify DEGs from only a pair of transcriptomes. These methods provide an opportunity to identify a set of personalized DEGs of a single subject without requiring costly transcriptome replicates. Among these methods, DESeq [39] and edgeR [17] were originally designed as cohort-based methods (Figure 4, column ‘Designed for ss’), but have wide applications. When replicates are not available, these two methods can still be applied. Without replicates, DESeq is conservative, as it assumes the majority of the genes as non-DEGs and estimates a mean–variance relationship from treating the two samples as if they were replicates. edgeR’s performance relies on investigators’ understanding of the study, as a parameter in the model is predetermined by the biological nature of the samples. DEGseq [36] is designed for discovering DEGs from only a pair of transcriptomes; yet, its assumption of binomial distribution of RNA-Seq data is insufficient when overdispersion in gene expression is present. NOISeq-sim [37] simulates replicates when real replicates do not exist. With the simulated replicates, NOISeq-sim generates a joint null distribution of fold-changes (M) and absolute differences (D) of the expression counts from the replicates within the same condition. This joint null distribution is then used to assess differential expression by a gene’s (M, D) pair computed between conditions. Finally, GFOLD [38] is another method designed for transcriptome analysis without replicates, as it provides biologically meaningful gene ranks of differential expression, but no significance assessment.

DEPs identification in single subjects

In this subsection, we report other methods that create biologically interpretable results from a single subject’s transcriptome bypassing detection of significant differences in gene-level expression to go directly to pathway-level signals (DEPs). Such analyses aim to promote a higher-level interpretation of the underlying gene expression data, providing a holistic view of

GSEA, ssGSEA [45], Functional Analysis of Individual Microarray Expression, FAIME [44]) (Figure 5). A detailed description of the methodologies used by these studies is provided in Table 3.

Each of the reference-based methods begins by aggregating gene-level information into pathway-level information, providing meaningful dimension reduction, and then applies statistical analyses directly at the pathway level. The first method, individPath, uses relative expression orderings to directly stratify patients based on individual deregulated pathway status. The authors showed that individPath could predict individually identified, but in-common pathway biomarkers from lung adenocarcinoma and breast cancer data sets that were correlated with survival analysis.

The second method is Pathifier, which computes pathway deregulation scores (PDSs; Table 3) for SSA using principal component analysis (PCA) and curve fitting. Drier et al. [43] showed how PDSs successfully reflect deregulation of pathways in glioblastoma and colorectal cancer data sets and could provide clinically relevant stratification of patients. Pathifier has also been successfully applied to provide a classification of breast cancer subtype [59] and to perform a personalized analysis for understanding the status of homologous recombination pathway dysregulation in breast cancer [60].

Finally, an additional method proposed by Ahn et al. [42] quantifies the aberrance of an individual sample’s pathways by comparison with accumulated normal data. The authors provide gene-level statistics (i.e. Z-score) by standardizing the gene expression level of the disease sample with the mean and the SD of the normal samples.

DEP approaches requiring a cohort reference, such as iPAS, Pathifier and individPath, are constrained by (i) the number of available normal samples (power), (ii) platform-dependencies and (iii) limited sensitivity to detect pathways that contain only a few genes. The large number of normal cohort samples required limits the applicability of these methods in infrequent diseases, or when obtaining appropriate samples and/or defining an appropriate ‘normal’ state is complex. Moreover, the reference cohort may be heterogeneous, and pooling together normal samples means that transcriptomes of different individuals are merged, which can obscure stratification and correlation patterns in the normal data.

Because of these limitations, other methods have been proposed to circumvent the normal reference requirement using solely a sample from the subject under study. These strategies aim to reduce dimension by injecting domain knowledge while reducing gene-level noise inherent to a single case sample.
Two pathway perturbation, instead of focusing attention on any par- such methods are the FAIME [44] and ssGSEA [45]. Both methods ticular gene. All the approaches belonging to this category seek to quantify the effect size and statistical significance of incorporate a large body of prior biological knowledge (e.g. path- consistent overexpression or underexpression of aggregated way knowledge sources such as KEGG [18, 55]). This allows gene expression within externally defined gene sets, compared researchers to reduce the dimension of a transcriptome-wide with the genes not annotated to the gene set (background). In gene list (22k in human) to a much smaller set (e.g. 5000 GO- the terminology of Goeman and Buhlmann [61], this framework BP terms) which is then analyzed according to term or pathway is ‘competitive’ in the sense that scores reflect relative gene set overrepresentation or other involvement. This dimension expression when compared with the background. In this man- reduction has been showed to improve the prediction of prog- ner, both FAIME and ssGSEA detect aberrant pathway expres- nosis and therapies [56, 57]. A detailed description of the meth- sion for an individual’s sample. The methods differ in ods is provided in Figure 5 and Table 3. implementation, however; FAIME operates on the normalized gene expression, while ssGSEA performs calculations on the Approaches based on a single sample of an individual ranks. We identified three methods that require a single sample of an A limitation of these two methods is that they provide a individual and a cohort reference (individPath [41], Pathifier ranking of pathways in terms of their deregulated with respect [43], individualized pathway aberrance score, iPAS [42]) and two to other pathways using the gene expression data of the indi- approaches capable of extracting DEPs from within an individu- vidual (e.g. more or less expressed than an average expression). 
al’s transcriptome without external comparison (single-subject Therefore, these methods do not identify functionally altered Downloaded from https://academic.oup.com/bib/article/20/3/789/4758622 by DeepDyve user on 16 July 2022 Personalomeforprecisionmedicine | 797 Table 3. Additional details on single-subject transcriptome analyses of DEPs Publication Name Description Wang et al. [41] IndividPath IndividPath computes REOs from a pathway point of view reducing the dimension of the sam- ple representation. Patient-specific DEPs of a sample are obtained by applying a similar pro- cedure to RankComp [35], in which REOs in an individual sample are compared with the highly stable REOs identified from a large cohort of normal samples. The authors identify the biological pathways with significantly disrupted ordering of gene expression via P-values. In this case, P-values are determined by testing whether the frequency of reversal gene pairs observed in a sample within each pathway is significantly greater than that expected by chance using the hypergeometric distribution model (i.e. a Fisher’s exact test) Drier et al. [43] Pathifier Pathifier has been developed to compute PDSs for cancer tumor samples by aggregating gene- level information into pathway-level information, providing meaningful dimension reduc- tion. Pathifier analyzes one pathway at a time and assigns a PDS to each sample by using the expression levels of the genes belonging to the pathway. To calculate PDSs, a PCA is per- formed to reduce the dimensions and capture the variation of the data. Next, the method identifies the best principal curve using both cohort samples (normal and disease). Then, the PDS of a sample is obtained by computing the distance of a single sample from the median of the normal samples on the principal curve. The output of this approach is therefore a list of DEPs for each sample representing the level of deregulation of each pathway Ahn et al. 
[42] iPAS iPAS provide gene-level statistics (i.e. Z-score) by standardizing the gene expression level of the disease sample with the mean and the standard deviation of the normal samples. Z-scores are used as inputs to calculate iPAS for the disease sample, for example, using the average of the Z-scores in a pathway. iPAS is then computed for every normal sample to construct a null distribution, which assesses the significance of disease iPAS’s deviation from the nor- mal reference. Yang et al. [44] FAIME The FAIME transforms a vector of mRNA quantification into pathway-level metrics derived from a single biological sample. Each mRNA is annotated to a gene, and genes are annotated to gene sets via knowledge base integration. Every pathway receives a score that quantifies the ‘average’ over-expression of genes within the pathway, when compared with genes in background (not in the pathway). This process provides mechanism-level interpretation to a single transcriptome. Barbie et al. [45] ssGSEA ssGSEA uses the difference in empirical cumulative distribution functions of gene expression ranks inside and outside a gene set (i.e. pathway) to calculate an enrichment statistic per sample, akin to the FAIME methodology described above. The procedure adopted is similar to GSEA [21] except that ssGSEA uses gene expression intensity at the single sample level to compute enrichment scores Gardeux et al. [46] N-of-1 pathways This method aggregates gene expression values from two paried samples into gene sets pro- Wilcoxon vided by external knowledge sources (e.g. GO, KEGG). Each externally defined gene set is assessed for differential expression using the nonparametric analog of a paired t-test, the Wilcoxon signed-rank test. The result is a metric of pathway-level dysregulation in the form of either a P-value or corresponding signed z-score (sign indicates whether the case sample is upregulated or downregulated compared with baseline sample). 
Computing such a metric across all pathways in an ontology provides a mechanistically anchored profile of personal transcriptome dysregulation for each patient Schissler et al. [47] N-of-1 pathways MD N-of-1-pathways MD seeks to improve the differential expression testing component of the framework introduced by Gardeux et al. [58]. The rationale behind using the statistical gener- alization of distance is to incorporate the observed covariance structure between the two paired samples (as they are derived from the same patient). Briefly, the average log2 fold- change of expression within the pathway is adjusted using components of the variance– covariance matrix. Then, a nonparametric bootstrap is performed to estimate the standard error of the pathway average expression. This provides pathway metrics that are more clini- cally relevant than a Wilcoxon test statistic and simulation studies showed increased power under the MD framework Schissler et al. [48] ClusterT The Cluster-T is yet another improvement to the differential test procedure of N-of-1-path- ways. It was shown that under nontrivial inter-genetic correlation, the bootstrapping proce- dure of the MD failed to produce adequate estimates of the standard error of the average log2 fold-change of expression. This problem proved to be challenging without bringing in external knowledge of context-specific gene–gene correlation. With this external knowledge, genes are clustered within pathways and, under certain assumptions, the test statistic was shown to follow a t-distribution with degrees of freedom dependent on the number of clus- ters. In novel multivariate gene expression simulations, the Clustered-T showed far superior performance in false-positive rates Li et al. [50] N-of-1-pathways N-of-1 pathways MixEnrich improves both N-of-1 pathways Wilcoxon and MD by detecting MixEnrich DEPs when they are bidirectionally dysregulated and/or background noise is present. 
Both Continued Downloaded from https://academic.oup.com/bib/article/20/3/789/4758622 by DeepDyve user on 16 July 2022 798 | Vitali et al. Table 3. (continued) Publication Name Description Wilcoxon and MD are not designed to detect dysregulated pathways with upregulated and downregulated genes (bidirectional dysregulation), which are ubiquitous in biological sys- tems. MixEnrich identifies bidirectional dysregulation by first clustering genes into upregu- lated, downregulated and unaltered genes. Subsequently, MixEnrich identifies pathways enriched with upregulated and/or downregulated transcripts. The enrichment test per- formed by MixEnrich detects only pathways with a significantly higher proportion of dysre- gulated genes with respect to the background. It is therefore more robust in presence of background noise (i.e. a large number of dysregulated genes unrelated to the phenotype) Li et al. [49] N-of-1-pathways N-of-1 pathways kMEn further improves the N-of-1 pathways MixEnrich method by using a kMEn nonparametric model (i.e. k-means clustering) to cluster genes into upregulated, downregu- lated and unaltered clusters. The distribution of log2 fold-change of gene expression is com- plex and may vary from experiment to experiment. Hence, a nonparametric model might be more flexible to model that distribution REOs¼ Relative expression orderings. pathways against a reference as in the previous methods ClusterT to estimate co-expression of genes within the relevant because a pathway more or less expressed than average may biological context. For example, TCGA breast cancer RNA-Seq the normal expected level of expression of that pathway. samples could be used to characterize clusters of genes within The output of both ssGSEA and FAIME report DEPs, which pathways, with positively correlated genes within the same cluster. 
This approach bears similarities to the ‘accumulated allow enhanced functional interpretation of disease-associated biological processes relative to less readily interpretable lengthy normal sample’ strategy described above, but differs in the way that an ontology is characterized by clusters within the context lists of DEGs. This approach could be useful when little patho- of analyzing a single subject’s pair. The authors envision logical knowledge is available for the disease or when substan- co-expression cluster-augmented knowledge bases to enable tial pathway heterogeneity may underlie the clinical clinical translation without the additional burden of accumulat- phenotype. In the case of single-subject studies, DEP lists can be ing ad hoc normal samples. N-of-1 pathway Wilcoxon, MD and used not only to investigate biological mechanisms specific of the Clustered-T approaches perform gene set testing, one path- certain patients but also to suggest potential treatments or way at a time, in a ‘self-contained’ fashion [61]. This offers an combination of treatments based on gene products annotated opportunity for small-scale gene expression testing, as whole- to the pathology-associated DEP, or other known interactions. transcriptome measurement is not required. Seeking gene set test procedures in the paired-sample set- Approaches based on paired samples of an individual ting that explore pathway-level expression relative to the rest of In the following, we will focus on single-subject-based methods the transcriptome (i.e. a ‘competitive’ test), Li et al. developed that analyze paired samples without the requirement of repli- two procedures, N-of-1-pathways k-Means Enrichment (kMEn) cates. In this category, we identified the methods known as N- [49] and Mixture-Enrichment (MixEnrich) [50]. The benefits of of-1-pathways (Figure 5 and Table 3). 
These methods provide a the techniques lie in the detection of bidirectional pathway dys- statistical informatic approach by aggregating gene-level meas- regulation (mRNAs within the same pathway that are both urements from two samples into gene sets (pathways) provided overexpressed and underexpressed) and in noisy samples with by curated knowledge bases (e.g. GO, KEGG). This consolidation a high frequency of DEGs. seeks to reduce noise from gene-level measurements and pro- All the N-of-1 pathways methods showed their power in the vide meaningful dimension reduction. These profiles are identification of DEPs that result from diverse health disorder designed to have clinical translational value by providing a sys- [46, 47–50]. tems biology perspective instead of focusing on single bio- markers. A drug targeting a non-DEG product at first glance Longitudinal time series analyses of transcriptome could seem useless. However, the pathway could yet to be dys- regulated and the drug may still have therapeutic value. For Biological processes are highly dynamic, and understanding example, an epithelial growth factor receptor inhibitor (erloti- how diseases evolve over time can reveal factors involved in nib) was successfully used in dual therapy to abate pathway- determining the disease status, progression and compensation. wide overexpression in oral carcinomas [58]. However, -omic technologies are typically gathered at infre- The first transcriptome analytic framework for quantifying quent or even single static points. Comprehensive longitudinal within-patient differential expression from a pair of samples -omics data (i.e. one or more type of omics measured over time) was introduced by Gardeux et al. [46] that developed the N-of-1- can provide key information for understanding the whole evo- pathways Wilcoxon. Schissler et al. 
[47] extended the analysis of lution of biological processes and underlying biological mecha- within-patient paired samples with the N-of-1-pathways nisms. Comprehensive longitudinal analyses are typically Mahalanobis distance (MD) to improve on the Wilcoxon-based limited by the substantial associated costs of sample collection approach. MD provides an effect size of pathway-differential and patient follow-up. As a result, with perhaps rare exceptions, expression that incorporates the variance–covariance structure long time series experiments have few or no isogenic replicates between the two samples. However, MD’s testing procedure in single subjects. Traditional analysis of time series gene failed to account for inter-gene correlation within pathways, expression data aims at identifying gene sets that exhibit com- which could result in the inflation of false-positive rates [48]. In mon or distinct patterns of expression between two or more response to this shortcoming, Schissler et al. [48] developed conditions (i.e. gene modification, treatment). The Downloaded from https://academic.oup.com/bib/article/20/3/789/4758622 by DeepDyve user on 16 July 2022 Personalomeforprecisionmedicine | 799 computational complexity to analyze such data is higher, as statistical power with short time course data. Therefore, it can- time course data involves the three dimensions of gene, time not be applied to investigate biological processes involving and condition. When considering time series data from an indi- small time series (generally short-term) responses. One of the major limitations of all transcriptome analyses vidual, several strategies can be applied depending on the experimental setup (i.e. number of time points and condition is their inability to fully capture the dynamics of the considered). For example, baseline comparisons can be per- represented system because of, for instance, posttranscriptional modifications. 
To this end, analysis of the product of transcrip- formed by considering samples gathered during ostensibly healthy physiological states of the patient as the reference pop- tome can provide significant insights and source of ulation if multiple time points are sampled. Samples gathered information. during ostensibly healthy physiological states of the patient approaches to extract meaningful knowledge from time series Single-subject transcriptome integrated with transcriptome data are based on clustering algorithms [62], hid- other -omics den Markov models [63], Gaussian processes [64] or Bayesian approaches [65] (for a review, see [66, 67]). These techniques can In this section, we will report the state of art of current analy- be applied for single-subject transcriptome analysis to extract ses aimed at analyzing the transcriptome combined with other DEGs or gene expression trajectory patterns from multiple -omics for SSA. The retrieved studies are summarized in experimental conditions where multiple time points are Figure 6. studied. The integration and analysis of different high-throughput However, when replicates are not available, few models molecular assays and data is one of the major topics in preci- have been proposed for the identification of DEGs or DEPs from sion medicine for understanding patient-specific variations. longitudinal data. In Figure 4, we reported a method [40] aimed This approach enables the possibility of obtaining a comprehen- at extracting DEGs from time series data, i.e. gene whose sive view of the genetic, biochemical, metabolic, proteomic and expression changed significantly with respect to time. Wu et al. epigenetic processes underlying a disease that, otherwise, could [40] propose a nonparametric method that integrates a func- not be fully investigated by using single -omics approaches. 
The tional principal component analysis (FPCA) into a hypothesis increased power of multi-omics studies have been already testing framework to extract gene-specific expression trajecto- assessed in the understanding of diseases, biomarkers and drug ries. As this approach is based on FPCA, the user has to select discovery. These methods are based on supervised or unsuper- the number of the first principal components that can explain vised machine learning techniques and typically aim at classify- the data. Therefore, the selection of such parameters can affect ing patients into cancer subtypes [70–74] or are designed for the overall results. On the other hand, Martini et al. [51] devel- drug repurposing [73, 75, 76]. Even if these strategies are useful oped an approach to extract time-dependent pathways (DEPs) for precision medicine, they are not able to extract meaningful without the requirement of replicates (Figure 5). This method knowledge on individual-specific biological mechanisms. They combines dimension reduction and graph decomposition still rely on the integration of -omics profiles from populations theory. It first extracts time-dependent pathways and decom- of subjects. poses them into cliques to isolate the time-dependent portion. In our review of the literature, only few computational Although this approach is tailored to time course gene expres- single-subject algorithms aimed at analyzing transcriptome sion data without replicates, it does not provide information data combined with other -omics have been proposed in the about the directionality of the identified DEPs. Its output is the past years. Chen et al. [77] pioneered an ambitious project to activation versus non-activation of a pathway. Moreover, it has integrate, analyze and provide clinically interpretable results been designed specifically for long time series data, not showing from multi-omics profiles of an individual. The authors Figure 6. 
Summary of single-subject methods that analyze transcriptome data combined with other -omics. Downloaded from https://academic.oup.com/bib/article/20/3/789/4758622 by DeepDyve user on 16 July 2022 800 | Vitali et al. proposed the integrative personal -omic profile (iPOP), using example, missense and nonsense mutations are known to Dr Snyder’s -omics as a test case colloquially referred to as the affect the host genes, many of which are known to lead to ‘Snyderome’. iPOP combines genomic, transcriptomic, proteo- Mendelian diseases annotated in OMIM [78]. Reproducible mic, metabolomics and autoantibody profiles collected from a genome-wide association studies led to the creation of the single subject over a 14-month period. A key aspect of this NHGRI Catalog that annotates the disease risk associated to cer- study, other than the focus on collecting data from a single tain single-nucleotide polymorphisms. In other words, for person, was its comprehensive longitudinal nature and sam- DNAseq, an SSA (intermediate step of mutation and variant pling during a variety of incidental environmental exposures calls) precedes clinical utility studies. However, this has not including two viral infections and physician-recommended been the case for the majority of the studies at other omics diet changes. This resulted in 3 billion measurements taken scales. over 20 time points and >30 TB of data [77]. The article con- This review focuses on comparing and contrasting SSA that firmed that some disease risks could be assessed from the incorporates this previously unavailable intermediate step for genome sequence of the patient, but actual onset and assess- other molecules of life, such as mRNAs, miRNAs, proteins, ment of certain other diseases, such as hypertriglyceridemia, methylated DNA regions and metabolites (carbohydrates and could not be diagnosed based only on the genomic profile. lipids). 
For example, oncologists already use assays for deter- mining expression fold change and protein function of onco- Interestingly, proteome and metabolome were also required to understand the biological mechanisms underlying response to genes and tumor suppressors through the comparison of tumor the viral infections. Association between expression and dis- tissue with external references or unaffected paired tissue. As ease status was also revealed through the analysis of tran- these curated approaches may not scale to the full omics data scriptome data. for other diseases, we provide emerging evidence that the PARADIGM [78] integrates transcriptome and DNA CNV data newly available unbiased SSA enables new types of studies to compute pathway scores that represent the alternation of a investigating their clinical utility by addressing the gap of bio- person’s pathways. Pathway scores are calculated as a joint molecular interpretation of raw omics signal. Among possible probability of a directed factor graph, a form a probabilistic studies, we demonstrate that omics clinical prediction classi- graphical model. Variables in the graphical model correspond to fiers that operate directly at the omics scale may be redesigned different molecular entities; edges in the graph represent for the parsimonious transformed signal of single-subject stud- within- or between-scale interactions. The interactions are ies for improved clinical utility. For example, Gardeux et al. [79] determined by central dogma and knowledge of annotated quantified the personal pathway-level transcriptomic response pathways, such as pathway interaction database [77]. of peripheral blood mononucleocytes to rhinovirus ex vivo and trained a classifier predictive of children prone to asthma exac- erbations. 
The dimension of the signal was reduced from the Validation of single-subject omics methods entire transcriptomes of paired samples in 20 subjects (10 We further classified each publication in this review according data points) to the effect size of statistically significant respon- to the method(s) used for result validation (Table 4). The major- sive pathways in at least one subject (10 data points). ity of approaches have been validated with in silico simulation of While many unbiased fully specified GExpCs designed over data or by cross-validation in the same data set. A few methods the entire transcriptome have been published in peer-reviewed have validated their results across replicate samples, or have journals, few have been FDA approved because of their lack of a had pathology-associated DEG or DEP results successfully repro- mechanistic relationship between the features (gene tran- duced in independent data set. To our knowledge, only the N- scripts) and the disease progression [6, 80, 81]. Two additional of-1pathways W [46] SSA method has been validated in vitro and important limitations of the clinical utility of conventional as a prognostic outcome classifier in a prospective study. In that GExpCs include (i) their platform dependence that limits their study, patient-specific DEPs were identified in response to an ex face-value validity (e.g. specific to AgilentTM) [8], and (ii) dis- vivo stimulation of their PBMCs with rhinovirus and used to tinct GExpCs are paradoxically obtained from distinct cohorts of accurately predict risk of asthmatic exacerbation in those same the same phenotypes [6, 8]. Interestingly, the transformation of patients over a 2-year follow-up period. 
This strongly supports a signal from conventional raw gene expression to effect size the conclusion that the field of single-subject studies of person- obtained after DEP-type SSA enables us to address these three alomes is an emerging field that is in need of more rigorous vali- limitations. First, SSA generates an effect size and P-value for dations for translation to clinical practice. Additionally, new each subject, analogous to mechanisms-level features ascribed validation strategies need to be developed for in vivo and clinical to a patient. In addition, Zhang et al. [44] have shown that the trial validations of personalome imputations. FAIME DEP transformation leads to the rediscovery of at least 50% of the same gene set-level features (KEGG, GO) in seven dis- tinct data sets of head and neck cancers when learning fully Clinical applications specified gene set-level classifiers (GenesetCs). The discovered To better understand the requirement for single-subject studies, features were consistently predictive of disease progression in we revisit the types of approaches and transformations independent validation data sets. Furthermore, three studies required for clinicians to interpret the more clinically used demonstrated that the discovered GenesetCs overlap by >50% method of DNA sequencing. As shown in Figure 1, we high- of gene set features across expression platforms (Affymetrix, lighted the critical steps for clinical interpretation of DNA Agilent, RNAseq) [44, 82, 83], thus addressing another limitation sequencing. The full genome of 3.5 billion base pairs is eval- of GExpC. Finally, a recent report from Gardeux et al. 
uated against reference genomes to identify the single-subject variants and mutations, yielding a substantial dimension reduction as well as a transformation from molecular data to a biomolecular interpretation of the sequence. Additional studies provided external knowledge for clinical interpretation. For

[79] shows that DEP single-subject studies in paired samples could generate features of higher quality than those obtained directly from gene expression in small cohorts. Specifically, a GenesetC predictive of exacerbation of pediatric asthmatic patients was confirmed in an independent cohort (learning set 40 subjects, validation set 22 subjects). This study suggests that SSA could reduce the cohort size for classifier development, as conventional GExpCs generally require hundreds of subjects in their learning sets.

Table 4. Summary of the method validation in single subjects
Columns: Publication; Method; In silico validation; Real dataset validation; Independent dataset validation; In vitro validation; In vivo validation; Clinical trial validation
[The per-method validation marks are not preserved in this copy; publications and methods are listed below.]

Transcriptome
Gardeux et al. [46]: N-of-1-pathways W
Wang et al. [36]: DEGseq
Anders et al. [39]: DESeq
Feng et al. [38]: GFOLD
Wang et al. [34]: RankComp
Yang et al. [44]: FAIME
Drier et al. [43]: Pathifier
Li et al. [50]: N-of-1-pathways MixEnrich
Li et al. [49]: N-of-1-pathways kMEn
Schissler et al. [47]: N-of-1-pathways MD
Liu et al. [35]: DNB
Wang et al. [41]: IndividPath
Ahn et al. [42]: iPAS
Schissler et al. [48]: ClusterT
Tarazona et al. [37]: NOISeq
Robinson et al. [17]: edgeR
Wu et al. [40]: FPCA
Barbie et al. [45]: ssGSEA
Martini et al. [51]: timeClip

Multi-omics
Vaske et al. [69]: PARADIGM
Chen et al. [68]: iPOP

Perspective and conclusion

The development and analysis of personal transcriptome interpretation are essential for precision medicine, as therapeutic decision-making pertains not exclusively to genomic sequences but to Genome x Environment interactions (GxE) as well. For example, isogenic twins may experience different diseases because of their distinct environment exposures, despite sharing identical genomes. Even in the presence of the same diseases, their therapeutic responses may vary as a result of other GxE conditions [79, 84]. The analysis of single-subject transcriptomes is valuable for extracting useful knowledge to better understand individual variability and patient-specific mechanisms underlying a disease and for suggesting tailored therapies. Selecting the best method for evaluation of a given subject’s personalome is first dependent on the biological question and the experimental design approach that is best suited for determining an answer.

This review revealed that ongoing advances in high-throughput technologies, emerging research and clinical questions urge continued investigation and development toward experimentally validated methods for unveiling tailored treatments from patient-specific transcriptomes (Table 4). In recent years, this nascent field of single-subject -omics has demonstrated considerable growth, as reflected by the number of approaches being published and underlined by the high number of citations for the earlier works (Figure 2). Figures 4, 5 and 6 detail the computational analysis options that are available for transcriptome data and the sampling regimens that each requires to be applied effectively (e.g. single sample, paired samples, longitudinal samples), as well as whether access to an appropriate external reference database is necessary or what type of output is provided (e.g. DEGs or DEPs). Each broad category of currently available methods has both advantages and limitations.

Approximately half of the bioinformatics methods we surveyed perform a comparison between the single subject’s profile and a reference, most often a cohort of accumulated normal samples or samples of a well-defined disease subtype [34, 35, 41–43]. These methods are generally able to capture patient variability and provide clinically interpretable results. However, accumulating the reference may be challenging, the comparison may not factor in the heterogeneity of the reference samples, and subtle effects may be difficult to detect. This may result in missing crucial alterations present in the patient profiles. Nonetheless, these methods are appropriate when a robust reference is obtainable, and/or in cases where a paired sample design does not make sense.

Recommended DEG and DEP approaches to SSAs

As transcriptomes vary by cell type and with environmental exposures, clinically or biologically interpretable altered mechanisms are more convincing when developed in isogenic (same subject) conditions than in heterogenic ones. We thus recommend clinical or experimental designs that generate a baseline in the same individual, i.e. paired samples (Figures 1, 2 and 5), which are well evaluated (Table 4) and have been validated in many publications (Figure 2). At this point in time, multi-omics methods have not been evaluated sufficiently to recommend one over another, even though they have the potential for being the best methods. Among analytical techniques exploiting a baseline, the more measures the better; thus FPCA and timeClip are favored for DEGs and DEPs, respectively, when three or more samples are available over time. For discovery of DEGs among paired samples analyses, we recommend the use of DEGseq, as it is designed for single subjects, provides effect sizes and P-values, considers a limited variance estimate and is validated in independent samples. On the other hand, edgeR and GFOLD are suboptimal as they require user-defined parameters (Figure 4, column ‘User-defined parameters heuristics’). The unbiased and parameter-free DESeq approach, which is not designed for a single subject, likely performs better in these conditions than either edgeR or GFOLD, which require subjective, and possibly biased, user-defined parameters. However, no study has yet been conducted to compare the accuracy of different single-subject DEG methods against one another. For discovery of DEPs in paired samples, N-of-1-pathways kMEn has been shown in simulation and in real data sets to outperform other paired DEP methods; however, the N-of-1-pathways Wilcoxon remains the most validated, which includes a clinical trial. Of note, ClusterT is the only approach controlling for intergenic correlation (Figure 5, column ‘Intergenic correlation’) that can create enrichment biases; however, additional validations are required.

In absence of multiple samples from a single subject with its own isogenic reference, we recommend analyses providing biologically and clinically interpretable results of altered expression against heterogenic references (a population). Among single-sample SSA, we recommend RankComp and IndividPath for DEG and DEP determination, respectively. RankComp is currently the only method that provides DEGs based on the comparison of a single sample against a reference cohort. For DEP determination, we suggest IndividPath because of its rigorous formal model and the small number of transcripts required to detect DEPs.

Imputing altered or dysregulated expression of a transcript or a pathway is not feasible for inadequately designed clinical assays or experiments aimed at interpreting a single transcriptome in the absence of any transcriptome reference (e.g. isogenic, heterogenic). To address this, transcript and pathway expression can be compared within a sample using FAIME or ssGSEA. However, the output of higher or lower expression of a mechanism as compared with the sample expression may simply be the normal state of such a mechanism, with the interpretation being ambiguous.

While transcriptome analyses can provide DEGs and DEPs for single subjects and are the most mature, we anticipate that as the field advances, it will be possible to reveal novel physiological state correlations through the construction and analysis of multi-scale personalomes. The analysis of a single scale (i.e. -omics data) alone cannot reveal the complex picture underlying a disease that may be fully captured only by fusing together multi-omics data (from genome to metabolome, to exposome) of an individual, i.e. via comprehensive personalome profiles. In fact, the combination of multiple -omics data can lead to the detection of a comprehensive individual variability, essential for providing new insights into disease pathophysiology and mechanisms that may explain the differences in drug responses in the human population. As shown in Figure 1 and discussed in the ‘Clinical applications’ section, by delivering dimension reduction and biomolecular interpretations, SSAs enable new types of transcriptomic analyses for clinical interpretation that compare with the methods applied to DNAseq for clinical interpretation. For example, current DNA sequence-based and classifier-based SSA commercial offerings provide oncologists with annotations of oncogenic or tumor suppressor genes with copy number variants, gain- or loss-of-function mutation, expression fold changes (tumor versus normal) or gene expression against reference tissue and occasionally protein activity. However, these are limited results for a handful of known genes that have been highly curated to apply to a narrow set of diseases, while novel SSA approaches discussed in this manuscript unbiasedly assess the entire transcriptome for DEPs and DEGs in diseases that may be far less well studied, analyses which are not currently available commercially. Clinical utility of these assessments requires additional studies or a knowledge base, similarly to the interpretation of novel mutations for DNA (Figure 1; ‘Studies informing clinical interpretation’).

Opportunities for future work

Analysis of multi-omics dynamic profiles including transcriptome, proteome, methylome and metabolome can additionally provide indicators of real-time phenotypes and physiology in individuals that cannot be obtained through examination of the static genome alone. In doing so, GxE interactions are revealed [79]. -Omics integration has been used successfully for the identification of novel associations between biological entities (e.g. genes, proteins) and disease [74], patient stratification [73] and biomarker discovery or drug repositioning [76]. However, these strategies have not yet been applied for the integration of multi-omics data of an individual and biological knowledge.

When taking into account the integration of multiple -omics, an important aspect to consider is the variability of data between each -omics, not only with respect to the represented biological process but also with the associated noise levels, identification accuracy, coverage and temporal resolution of data. These differences complicate the integration and joint modeling of multi-omics data. While this is intuitively clear, it remains computationally and experimentally challenging to effectively integrate longitudinal multi-omics data. For one, each biological entity (e.g. gene, metabolites) has different time-dependent modulation and responds to signals on a different specific time scale, even if contributing to the same biological process. Second, biological processes that take place in inaccessible tissues (e.g. brain, internal organs) cannot be feasibly monitored in a longitudinal approach, even if a single sample is possible. Additional challenges are related to autocorrelation across repeated measurements, random effects and missing data. Moreover, the design of longitudinal studies of a single subject must account for repeated measures preferably being equally spaced in time, allowing the increase in statistical power of the approach [85]. An obvious opportunity that has not been reported is to learn convergent patterns at one -omics scale (e.g. transcriptome) and correlate them with those of another scale (e.g. proteomics), thus providing internal validation and increasing the signal-to-noise ratio.

Future studies for SSA of transcriptomes will need to focus on four underreported approaches: (1) variance estimation in isogenic conditions from single-subject measures (without requirement of reference transcriptomes), (2) activity level of pathways (functional, e.g. upregulated versus downregulated) rather than expression direction (overexpressed versus underexpressed), (3) the analysis and integration of comprehensive personal -omics data to infer dysregulated molecules and mechanisms and (4) rigorous validation of DEGs, DEPs or other results with appropriate in vitro, in vivo and clinical trial investigations.

As clinical research continues to explore the importance of patient heterogeneity, we encourage more investigators to adopt single-subject study designs and -omics analyses when appropriate to maximize the information made available by high-throughput technologies. Access to these analysis tools may also allow researchers to more thoroughly explore certain rare case studies, outliers and patients-of-special-interest in a way that they could not have done if relying only on traditional large cohort-based statistics. This is particularly true if an isogenic paired-sample study design can be used to answer a meaningful biological or clinical question. The use of personalome integrated with the available external knowledge (e.g. repositories on disease–disease association, target–gene interactions, gene–gene interaction) can provide new opportunities for developing more robust and comprehensive results that account for all the interacting -omics and temporal behaviors of the biological system of an individual.

Finally, personalome researchers should consider a creative application of powerful engineering and mathematical tools that have not yet been applied to study the mechanisms underpinning the personalome of an individual. For example, computational methods used to analyze time series data include generalized linear mixed models, generalized estimating equations, Markov models, nonparametric or semi-parametric models or Bayesian models and dynamic pathway analysis [85]. However, these methods have not yet been applied to study the machinery underpinning the personalome of an individual. Clearly, these methods are not directly applicable as they are cohort-centric; however, innovations may altogether extend the paradigm of their current implementation.

Key Points

• For the ‘personalome’ to enable precision medicine from -omics data, we need to move from cohort-focused assays and analytics to individualized (single-subject) studies.
• We survey and categorize methodology by biological and informatics input, mathematical formalism and procedure output.
• Our review focuses on the transcriptome dimension of the personalome showing a great development to date, while proteomics, multi-scale and other scales of biology present open challenges.
• The personalome methods need more rigorous validations, as few have been validated in vitro, in vivo or in clinical trials.
• The emerging personalome represents a largely unexplored application of -omics data and potentially has important consequences for improving patient outcomes.

Funding

Publication of this article has been funded in part by the following grants and organizations: National Institutes of Health (NIH)/Office of the Director Precision Medicine Initiative (grant number 1UG3OD023171-01), the Precision Medicine Initiative of the Center for Biomedical Informatics and Biostatistics of the University of Arizona Health Sciences, NIH/National Heart, Lung, and Blood Institute (grant numbers HL126609-01, HL132523, U01 HL125208), NIH/National Cancer Institute (grant numbers P30CA023074, 1R01CA190696-01), NIH/National Institute of Allergy and Infectious Diseases (grant number U01AI122275-01).

References

1. Stone NJ, Robinson JG, Lichtenstein AH, et al. 2013 ACC/AHA guideline on the treatment of blood cholesterol to reduce atherosclerotic cardiovascular risk in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J Am Coll Cardiol 2014;63(25 Pt B):2889–934.
2. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med 2015;372(9):793–5.
3. Guyatt GH, Keller JL, Jaeschke R, et al. The n-of-1 randomized controlled trial: clinical usefulness. Our three-year experience. Ann Intern Med 1990;112(4):293–9.
4. Schork NJ. Personalized medicine: time for one-person trials. Nature 2015;520(7549):609–11.
5. Scuffham PA, Nikles J, Mitchell GK, et al. Using N-of-1 trials to improve patient management and save costs. J Gen Intern Med 2010;25(9):906–13.
6. Massague J. Sorting out breast-cancer gene signatures. N Engl J Med 2007;356:294–7.
7. Stec J, Wang J, Coombes K, et al. Comparison of the predictive accuracy of DNA array-based multigene classifiers across cDNA arrays and Affymetrix GeneChips. J Mol Diagn 2005;7(3):357–67.
8. Simon R, Radmacher MD, Dobbin K, et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003;95(1):14–18.
9. Dupuy A, Simon RM. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 2007;99(2):147–57.
10. Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol 2016;17(1):13.
11. Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015;43(7):e47.
12. Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat Methods Med Res 2013;22(5):519–36.
13. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001;98(9):5116–21.
14. Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol 2000;7(6):819–37.
15. Trapnell C, Hendrickson DG, Sauvageau M, et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 2013;31(1):46–53.
16. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15(12):550.
17. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26(1):139–40.
18. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28(1):27–30.
19. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.
20. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005;102(43):15545–50.
21. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4:44–57.
22. Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37(1):1–13.
23. Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics 2007;23(2):257–8.
24. Grossmann S, Bauer S, Robinson PN, et al. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics 2007;23(22):3024–31.
25. Yang X, Li J, Lee Y, et al. GO-Module: functional synthesis and improved interpretation of gene ontology patterns. Bioinformatics 2011;27(10):1444–6.
26. Fabregat A, Sidiropoulos K, Viteri G, et al. Reactome pathway analysis: a high-performance in-memory approach. BMC Bioinformatics 2017;18(1):142.
27. Cerami EG, Gross BE, Demir E, et al. Pathway commons, a web resource for biological pathway data. Nucleic Acids Res 2011;39:D685–90.
28. Vivar JC, Pemu P, McPherson R, et al. Redundancy control in pathway databases (ReCiPa): an application for improving gene-set enrichment analysis in omics studies and “Big Data” Biology. OMICS 2013;17(8):414–22.
29. Sparano JA, Paik S. Development of the 21-gene assay and its application in clinical practice and clinical trials. J Clin Oncol 2008;26(5):721–8.
30. Parker JS, Mullins M, Cheang MC, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol 2009;27(8):1160–7.
31. Daxin J, Chun T, Aidong Z. Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng 2004;16:1370–86.
32. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 2011;474:609–15.
33. Nair VS, Maeda LS, Ioannidis JP. Clinical outcome prediction by microRNAs in human cancer: a systematic review. J Natl Cancer Inst 2012;104(7):528–40.
34. Wang H, Sun Q, Zhao W, et al. Individual-level analysis of differential expression of genes and pathways for personalized medicine. Bioinformatics 2015;31(1):62–8.
35. Liu R, Yu X, Liu X, et al. Identifying critical transitions of complex diseases based on a single sample. Bioinformatics 2014;30(11):1579–86.
36. Wang L, Feng Z, Wang X, et al. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 2010;26(1):136–8.
37. Tarazona S, Furio-Tari P, Turra D, et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res 2015;43(21):e140.
38. Feng J, Meyer CA, Wang Q, et al. GFOLD: a generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics 2012;28(21):2782–8.
39. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol 2010;11(10):R106.
40. Wu S, Wu H. More powerful significant testing for time course gene expression data using functional principal component analysis approaches. BMC Bioinformatics 2013;14(1):6.
41. Wang H, Cai H, Ao L, et al. Individualized identification of disease-associated pathways with disrupted coordination of gene expression. Brief Bioinform 2016;17(1):78–87.
42. Ahn T, Lee E, Huh N, et al. Personalized identification of altered pathways in cancer using accumulated normal tissue data. Bioinformatics 2014;30(17):I422–9.
43. Drier Y, Sheffer M, Domany E. Pathway-based personalized analysis of cancer. Proc Natl Acad Sci USA 2013;110(16):6388–93.
44. Yang X, Regan K, Huang Y, et al. Single sample expression-anchored mechanisms predict survival in head and neck cancer. PLoS Comput Biol 2012;8(1):e1002350.
45. Barbie DA, Tamayo P, Boehm JS, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 2009;462(7269):108–12.
46. Gardeux V, Achour I, Li J, et al. ‘N-of-1-pathways’ unveils personal deregulated mechanisms from a single pair of RNA-Seq samples: towards precision medicine. J Am Med Inform Assoc 2014;21(6):1015–25.
47. Schissler AG, Gardeux V, Li Q, et al. Dynamic changes of RNA-sequencing expression for precision medicine: N-of-1-pathways Mahalanobis distance within pathways of single subjects predicts breast cancer survival. Bioinformatics 2015;31(12):i293–302.
48. Schissler AG, Piegorsch WW, Lussier YA. Testing for differentially expressed genetic pathways with single-subject N-of-1 data in the presence of inter-gene correlation. Stat Methods Med Res 2017. doi: 10.1177/0962280217712271.
49. Li Q, Schissler AG, Gardeux V, et al. kMEn: analyzing noisy and bidirectional transcriptional pathway responses in single subjects. J Biomed Inform 2017;66:32–41.
50. Li Q, Schissler AG, Gardeux V, et al. N-of-1-pathways MixEnrich: advancing precision medicine via single-subject analysis in discovering dynamic changes of transcriptomes. BMC Med Genomics 2017;10(S1):27.
51. Martini P, Sales G, Calura E, et al. timeClip: pathway analysis for time course data without replicates. BMC Bioinformatics 2014;15(Suppl 5):S3.
52. Vitali F, Cohen LD, Demartini A, et al. A network-based data integration approach to support drug repurposing and multi-target therapies in triple negative breast cancer. PLoS One 2016;11:e0162407.
53. Hansen KD, Wu Z, Irizarry RA, et al. Sequencing technology does not eliminate biological variability. Nat Biotech 2011;29(7):572–3.
54. Peng F, Zhang Y, Wang R, et al. Identification of differentially expressed miRNAs in individual breast cancer patient and application in personalized medicine. Oncogenesis 2016;5(2):e194.
55. Kanehisa M, Furumichi M, Tanabe M, et al. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res 2017;45(D1):D353–61.
56. Simon R. Lost in translation: problems and pitfalls in translating laboratory observations to clinical utility. Eur J Cancer 2008;44(18):2707–13.
57. Narayanan M, Huynh JL, Wang K, et al. Common dysregulation network in the human prefrontal cortex underlies two neurodegenerative diseases. Mol Syst Biol 2014;10(7):743.
58. Chawla A, Adkins D, Worden FP, et al. Effect of the addition of temsirolimus to cetuximab in cetuximab-resistant head and neck cancers: Results of the randomized PII MAESTRO study. J Clin Oncol 2014;32:6089.
59. Livshits A, Git A, Fuks G, et al. Pathway-based personalized analysis of breast cancer expression data. Mol Oncol 2015;9(7):1471–83.
60. Liu C, Srihari S, Lal S, et al. Personalised pathway analysis reveals association between DNA repair pathway dysregulation and chromosomal instability in sporadic breast cancer. Mol Oncol 2016;10(1):179–93.
61. Goeman JJ, Buhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007;23(8):980–7.
62. Jung I, Jo K, Kang H, et al. TimesVector: a vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes. Bioinformatics 2017. pii: btw780. doi: 10.1093/bioinformatics/btw780.
63. Schliep A, Schonhuth A, Steinhoff C. Using hidden Markov models to analyze gene expression time course data. Bioinformatics 2003;19(Suppl 1):i255–263.
64. Heinonen M, Guipaud O, Milliat F, et al. Detecting time periods of differential gene expression using Gaussian processes: an application to endothelial cells exposed to radiotherapy dose fraction. Bioinformatics 2015;31(5):728–35.
65. Tai YC, Speed TP. On gene ranking using replicated microarray time course data. Biometrics 2009;65(1):40–51.
66. Spies D, Ciaudo C. Dynamics in transcriptomics: advancements in RNA-seq time course and downstream analysis. Comput Struct Biotechnol J 2015;13:469–77.
67. Bar-Joseph Z, Gitter A, Simon I. Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet 2012;13(8):552–64.
68. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148(6):1293–307.
69. Vaske CJ, Benz SC, Sanborn JZ, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 2010;26(12):i237–45.
70. Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 2009;25(22):2906–12.
71. List M, Hauschild AC, Tan Q, et al. Classification of breast cancer subtypes by combining gene expression and DNA methylation data. J Integr Bioinform 2014;11(2):236.
72. Ray P, Zheng L, Lucas J, et al. Bayesian joint analysis of heterogeneous genomics data. Bioinformatics 2014;30(10):1370–6.
73. Gligorijevic V, Malod-Dognin N, Przulj N. Patient-specific data fusion for cancer stratification and personalised treatment. Pac Symp Biocomput 2016;21:321–32.
74. Lock EF, Hoadley KA, Marron JS, et al. Joint and Individual Variation Explained (Jive) for integrated analysis of multiple data types. Ann Appl Stat 2013;7(1):523–42.
75. Gottlieb A, Stein GY, Ruppin E, et al. PREDICT: a method for inferring novel drug indications with application to personalized medicine. Mol Syst Biol 2011;7:496.
76. Napolitano F, Zhao Y, Moreira VM, et al. Drug repositioning: a machine-learning approach through data integration. J Cheminform 2013;5(1):30.
77. Schaefer CF, Anthony K, Krupa S, et al. PID: the Pathway Interaction Database. Nucleic Acids Res 2009;37:D674–9.
78. Amberger JS, Bocchini CA, Schiettecatte F, et al. OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders. Nucleic Acids Res 2015;43:D789–98.
79. Gardeux V, Berghout J, Achour I, et al. A genome-by-environment interaction classifier for precision medicine: personal transcriptome response to rhinovirus identifies children prone to asthma exacerbations. J Am Med Inform Assoc 2017;24:1116–26.
80. Chen J, Sam L, Huang Y, et al. Protein interaction network underpins concordant prognosis among heterogeneous breast cancer signatures. J Biomed Inform 2010;43(3):385–96.
81. Chen JL, Li J, Stadler WM, et al. Protein-network modeling of prostate cancer gene signatures reveals essential pathways in disease recurrence. J Am Med Inform Assoc 2011;18(4):392–402.
82. Perez-Rathke A, Li H, Lussier YA. Interpreting personal transcriptomes: personalized mechanism-scale profiling of RNA-seq data. Pac Symp Biocomput 2013;159–70.
83. Chen JL, Hsu A, Yang X, et al. Curation-free biomodules mechanisms in prostate cancer predict recurrent disease. BMC Med Genomics 2013;6(Suppl 2):S4.
84. Carrasco-Ramiro F, Peiro-Pastor R, Aguado B. Human genomics projects and precision medicine. Gene Ther 2017;24(9):551–61.
85. Sperisen P, Cominetti O, Martin FP. Longitudinal omics modeling and integration in clinical metabonomics research: challenges in childhood metabolic health research. Front Mol Biosci 2015;2:44.

Briefings in Bioinformatics, Oxford University Press
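The recommendations given under ‘Recommended DEG and DEP approaches to SSAs’ amount to a small decision procedure, which can be sketched in code. The following Python sketch is only an illustration of that reading and is not part of the reviewed methods: the function name, its parameters and the Recommendation container are hypothetical, and the caveat strings paraphrase the text. It assumes that three or more isogenic samples form a time course, and that isogenic designs take precedence over a heterogenic reference cohort when both are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    methods: tuple  # method names as cited in the review
    caveat: str     # paraphrase of the caveat stated in the review


def recommend_ssa_method(target, n_isogenic_samples, has_reference_cohort=False):
    """Map a single-subject study design to the method(s) the review suggests.

    All names here are hypothetical and used for illustration only.
    target               -- "DEG" (genes) or "DEP" (pathways)
    n_isogenic_samples   -- number of samples taken from the same subject
    has_reference_cohort -- True if a heterogenic reference cohort is available
    """
    if target not in ("DEG", "DEP"):
        raise ValueError("target must be 'DEG' or 'DEP'")
    if n_isogenic_samples < 1:
        raise ValueError("at least one sample is required")

    if n_isogenic_samples >= 3:
        # Three or more samples over time: FPCA for DEGs, timeClip for DEPs.
        methods = ("FPCA",) if target == "DEG" else ("timeClip",)
        return Recommendation(methods, "assumes the samples form a time course")

    if n_isogenic_samples == 2:
        # Paired samples from the same subject.
        if target == "DEG":
            return Recommendation(
                ("DEGseq",),
                "single-subject DEG methods have not been compared head to head",
            )
        return Recommendation(
            ("N-of-1-pathways kMEn", "N-of-1-pathways Wilcoxon"),
            "kMEn performed best in simulation and real data; Wilcoxon is the most validated",
        )

    if has_reference_cohort:
        # One sample compared against a heterogenic reference cohort.
        methods = ("RankComp",) if target == "DEG" else ("IndividPath",)
        return Recommendation(methods, "requires a robust reference cohort")

    # One sample and no reference of any kind: within-sample comparison only.
    return Recommendation(
        ("FAIME", "ssGSEA"),
        "within-sample comparison; the interpretation may be ambiguous",
    )


if __name__ == "__main__":
    designs = [("DEG", 2, False), ("DEP", 4, False), ("DEG", 1, True), ("DEP", 1, False)]
    for design in designs:
        print(design, "->", recommend_ssa_method(*design).methods)
```

Returning both kMEn and the Wilcoxon variant for paired DEP analyses reflects that the text reports one as the best performer and the other as the most validated, without ranking them further.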

Developing a ‘personalome’ for precision medicine: emerging methods that compute interpretable effect sizes from single-subject transcriptomes



Publisher
Oxford University Press
Copyright
Copyright © 2022 Oxford University Press
ISSN
1467-5463
eISSN
1477-4054
DOI
10.1093/bib/bbx149

Abstract

The development of computational methods capable of analyzing -omics data at the individual level is critical for the suc- cess of precision medicine. Although unprecedented opportunities now exist to gather data on an individual’s -omics profile (‘personalome’), interpreting and extracting meaningful information from single-subject -omics remain underdeveloped, particularly for quantitative non-sequence measurements, including complete transcriptome or proteome expression and metabolite abundance. Conventional bioinformatics approaches have largely been designed for making population-level inferences about ‘average’ disease processes; thus, they may not adequately capture and describe individual variability. Novel approaches intended to exploit a variety of -omics data are required for identifying individualized signals for mean- ingful interpretation. In this review—intended for biomedical researchers, computational biologists and bioinformati- cians—we survey emerging computational and translational informatics methods capable of constructing a single subject’s ‘personalome’ for predicting clinical outcomes or therapeutic responses, with an emphasis on methods that provide inter- pretable readouts. Key points: (i) the single-subject analytics of the transcriptome shows the greatest development to date and, (ii) the methods were all validated in simulations, cross-validations or independent retrospective data sets. This survey uncovers a growing field that offers numerous opportunities for the development of novel validation methods and opens the door for future studies focusing on the interpretation of comprehensive ‘personalomes’ through the integration of mul- tiple -omics, providing valuable insights into individual patient outcomes and treatments. Key words: single-subject studies; personalome; precision medicine; n-of-1 Francesca Vitali, PhD, is a Research Assistant Professor at the University of Arizona. 
Her main research interests are in pharmacogenomics, drug repurposing, precision medicine, bioinformatics and big data techniques.
Qike Li is a PhD candidate at the University of Arizona. His research interests are in the area of single-subject analytics with applications in precision medicine.
Grant Schissler is an Assistant Professor at University of Nevada, Reno. Recently, he has helped to build statistical informatics tools that allow clinical researchers to interpret genomic data of individual patients.
Joanne Berghout, PhD, is a Research Assistant Professor at the University of Arizona. She uses genetics and ontologies to uncover patterns and candidate genes associated with Mendelian and complex diseases.
Colleen Kenost, EdD, is Director of Operations, Center for Biomedical Informatics and Biostatistics, the University of Arizona. Her role is to translate research prerogatives into action and operationalize strategic plans.
Yves Lussier, MD, is Professor of Medicine and Director, Center for Biomedical Informatics and Biostatistics, University of Arizona. His research group solves problems related to computational precision medicine and translational bioinformatics.

Submitted: 15 July 2017; Received (in revised form): 6 October 2017

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/bib/article/20/3/789/4758622 by DeepDyve user on 16 July 2022

Introduction

The arrival of precision medicine has led to a more individual-based view of diseases, with characteristics of single subjects being central to the prediction of clinical outcomes and prescription of tailored treatments. This concept is not new; in fact, evidence-based clinical practice guidelines [1] stratify treatments according to some patient characteristics (e.g. gender, ancestry, age, family history, some laboratory test results). However, precision medicine differs from the traditional medical approach, as it seeks to leverage not only clinical variables and clinician-selected genetic tests but also broad and data-intensive molecular and general -omics profiles of a patient [2]. These large and heterogeneous data cannot be interpreted directly by medical practitioners and require an automatic procedure for extracting relevant knowledge before incorporation into clinical practice. Therefore, it is fundamental to develop computational methods aimed at analyzing these data at the individual level.

Current approaches aimed at analyzing disease or other biological processes, therapeutic efficacy and -omic data still leverage well-established cohort-based population analyses such as case-control studies [e.g. gene expression classifiers (GExpCs)], observational trials or controlled intervention trials. These large cohort/group approaches place emphasis on the group average rather than individual participants, although this group average may not represent any actual individual’s personal profile, let alone be meaningful for understanding the profile of a specific patient. On the other hand, the framework of N-of-1 trials has been applied to repeated measures of a single analyte for over two decades [3]. This approach is based on the collection of various relevant data for one person as frequently as possible [4]. In this way, novel strategies can be explored to compare different treatments of the same person. Moreover, by looking at commonalities across multiple N-of-1 studies collecting the same type of data, it is possible to estimate the efficacy of an intervention in a specific subset population (i.e. people sharing a particular genetic profile). N-of-1 trials demonstrated their power to evaluate treatment effectiveness in a single subject for one variable [5], but proposed approaches for one analyte do not scale to legion-size -omics data sets.

Although we now have an unprecedented technical opportunity to gather data relating to an individual’s -omic profile, bioinformatics tools to understand these data comprehensively, and at the individual level, remain underdeveloped. Novel approaches for identifying individualized (single-subject) signals, rather than cohort signals, are required for gathering insights into the biology of diseases and healthy states of individuals. This review focuses on computational methods aimed at analyzing quantitative transcriptomic measurements of an individual and the combination of transcriptome with other -omic data.

In this review, we define the personalome as an interpretable personal molecular mechanism profile of an individual derived from one or more scales of -omic data, especially when designed to enable precision medicine. ‘Personal -omics’ means the -omics measures of a single subject. Molecular mechanisms are any molecular functions or biological processes such as a missense mutation in DNA, or a differentially expressed pathway (DEP) at the transcriptome or proteome. To be considered interpretable at the molecular mechanism, the raw -omics profile must have been subjected to analyses performing (i) dimension reduction and (ii) biomolecular interpretation of the mechanisms involved in molecules of life (Figure 1). For example, full genomes are reduced to variant and mutation calls through analyses against a reference genome, or in the case of cancer, by also comparing paired cancer and unaffected tissue to determine somatic versus germline mutations. Here, we show how differentially expressed molecules of life and pathways can be unveiled in a single subject through the analysis of transcriptome data.

We surveyed emerging novel computational biology, biostatistical and translational informatics methods that construct a single subject’s personalome by analyzing transcriptome data to predict outcomes or therapeutic responses without requiring the large cohort needed for conventional approaches.

Our review methodology is detailed in the Supplementary Material S1. Particular emphasis is placed on those methods that provide clinically interpretable readouts rather than simple categorical classification, as the latter are known to be difficult to reproduce across data sets and contain noisy, incidental and passenger variation [6–9]. The papers and methods selected for review reflect the authors’ views and are not intended to provide an exhaustive search. Figure 2A depicts all considered publications by year of publication and number of citations, and the studies are shown with different colors and shapes according to the type of required data input and output, respectively. Figure 2B shows the number of citations over time.

The review is divided according to the type of data inputs in the methods (i.e. transcriptome and integrated -omics). A review of the validations of all methods follows, and finally, we discuss and conclude with the broad challenges, the applications and the opportunities in developing a personalome for precision medicine, i.e. how the single-subject analyses (SSAs) of -omics data can bring novel insights into disease mechanisms specific to a patient and unveil potential patient-specific treatments. A table of contents for the review is provided in Table 1.

Transcriptome

Transcriptome analysis aims to interpret the quantification of transcribed genetic material, including both coding and noncoding RNA. Unlike DNA, which is relatively static, analyses of the transcriptome capture the collective impact of tissue type, sequence variation, regulation, environment, external stimulation (e.g. drug treatments) and interactions between them. High-throughput technologies, such as microarray and RNA sequencing (RNA-Seq), are capable of assessing transcript expression at genome-scale for an individual sample, with RNA-Seq providing unbiased detection, broader dynamic range, increased specificity and sensitivity and easier detection of rare and low-abundance transcripts.

The transcriptome provides a snapshot of transcriptional activity under the condition where the RNA was collected, allowing researchers to study the biological impact of certain diseases or effect of treatments [10]. This allows us to better understand general disease mechanisms, discover biomarkers or identify drug targets at the cohort scale when sufficient samples are collected, but also has the power to reveal individual-specific signals, whose detection and analysis through computational methods can lead to far more precise medical understanding and decision-making. Analysis of more than one transcriptome of an individual enables the assessment of personal dynamic changes over time or in response to therapy or other environmental changes. Yet, identifying important individual signals is not a trivial task, as transcript expression variations in a given tissue and time point are further modulated by stochastic variability, cyclic patterns (e.g. circadian) and platform biases or measurement errors in

Figure 1. Flow chart of methods designed for clinical interpretation of single-subject -omics.
This review addresses a knowledge gap by comparing and contrasting single-subject methods designed to reduce the dimension of raw -omics data (left) and to provide a biomolecular interpretation of signals (gray rectangle). For DNA sequencing, variant and mutation calls as well as all functional annotations in single subjects (e.g. missense mutation) already bridge this gap. However, this intermediate step is often omitted for other molecules of life, such as mRNAs, miRNAs, proteins, methylated DNA regions and metabolites (carbohydrates and lipids). This review focuses on single-subject methods that analyze transcriptome data. The ‘Clinical applications’ section provides emerging evidence that the newly available, unbiased SSA of the transcriptome enables innovative types of studies to investigate their clinical utility by addressing the gap of biomolecular interpretation of raw -omics signals. Among possible studies, we demonstrate that -omics clinical prediction classifiers that operate directly at the -omics scale may be redesigned for the parsimonious transformed signal of single-subject studies for improved clinical utility.

Figure 2. SSA studies included in this review. (A) Each numbered point represents a publication plotted by year of publication and the relative number of citations (in log2 scale). Numbers correspond to the publication in this article’s reference list. Colors indicate the type of input required: one single-subject sample (1 ss SAMPLE, green), two paired single-subject samples (2 ss SAMPLES, purple) or multiple samples collected from the same subject (multiple ss SAMPLES, orange). The shapes represent the type of output provided by the selected studies: DEGs (circle) or DEPs (X). Finally, blue squares indicate methods based on the integration of transcriptome data with other -omics.
(B) Number of citations over time, starting from the publication year, for the single-subject studies analyzing transcriptome data. Color and shape coding is the same as in (A).

Table 1. Table of contents of the review

The PC strategy is a distinct approach to de