Access the full text.
Sign up today, get DeepDyve free for 14 days.
Bartel (2004)
281Cell, 116
D. Bartel (2004)
MicroRNAs Genomics, Biogenesis, Mechanism, and FunctionCell, 116
S. Eddy (2005)
A Model of the Statistical Power of Comparative Genome Sequence AnalysisPLoS Biology, 3
Drake (2005)
223Nat. Genet., 38
International Consortium (2004)
Finishing the euchromatic sequence of the human genomeNature, 431
G. Lunter, A. Rocco, N. Mimouni, A. Heger, Alexandre Caldeira, J. Hein (2008)
Uncertainty in homology inferences: assessing and improving genomic sequence alignment.Genome research, 18 2
T. Speed (2003)
Biological sequence analysisarXiv: Probability
Casillas (2007)
2222Mol. Biol. Evol., 24
J. Drake, C. Bird, J. Nemesh, D. Thomas, C. Newton-Cheh, A. Reymond, L. Excoffier, Homa Attar, S. Antonarakis, E. Dermitzakis, J. Hirschhorn (2006)
Conserved noncoding sequences are selectively constrained and not mutation cold spotsNature Genetics, 38
D. GuhaThakurta (2006)
Computational identification of transcriptional regulatory elements in DNA sequenceNucleic Acids Research, 34
A. Stark, Michael Lin, P. Kheradpour, J. Pedersen, L. Parts, J. Carlson, M. Crosby, Matthew Rasmussen, Sushmita Roy, Ameya Deoras, J. Ruby, J. Brennecke, E. Hodges, A. Hinrichs, A. Caspi, B. Paten, Seung‐Won Park, Mira Han, Morgan Maeder, Benjamin Polansky, Bryanne Robson, S. Aerts, J. Helden, Bassem Hassan, D. Gilbert, Deborah Eastman, Michael Rice, M. Weir, Matthew Hahn, Yongkyu Park, Colin Dewey, L. Pachter, W. Kent, D. Haussler, E. Lai, D. Bartel, G. Hannon, T. Kaufman, M. Eisen, A. Clark, Douglas Smith, S. Celniker, W. Gelbart, Manolis Kellis (2007)
Discovery of functional elements in 12 Drosophila genomes using evolutionary signaturesNature, 450
W. Miller, K. Rosenbloom, R. Hardison, Minmei Hou, James Taylor, B. Raney, Richard Burhans, D. King, R. Baertsch, Daniel Blankenberg, Sergei Pond, A. Nekrutenko, B. Giardine, Robert Harris, S. Tyekucheva, M. Diekhans, Tom Pringle, W. Murphy, A. Lesk, G. Weinstock, K. Lindblad-Toh, R. Gibbs, E. Lander, A. Siepel, D. Haussler, W. Kent (2007)
28-way vertebrate alignment and conservation track in the UCSC Genome Browser.Genome research, 17 12
G. Lunter (2007)
Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomesBioinformatics, 23 13
HMMoC a compiler for hidden Markov models
R. Satija, Adam Novak, I. Miklós, Rune Lyngsø, J. Hein (2009)
BigFoot: Bayesian alignment and phylogenetic footprinting with MCMCBMC Evolutionary Biology, 9
Chiaromonte (2003)
245
G. Lunter, C. Ponting, J. Hein (2005)
Genome-Wide Identification of Human Functional DNA Using a Neutral Indel ModelPLoS Computational Biology, 2
Rahul Satija, L. Pachter, J. Hein
Combining Statistical Alignment and Phylogenetic Footprinting to Detect Regulatory Elements
E. Margulies (2008)
Confidence in comparative genomics.Genome research, 18 2
D. Tagle, B. Koop, M. Goodman, J. Slightom, D. Hess, R. Jones (1988)
Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints.Journal of molecular biology, 203 2
E. Margulies, J. Vinson, W. Miller, D. Jaffe, K. Lindblad-Toh, Jean Chang, E. Green, E. Lander, J. Mullikin, M. Clamp (2005)
An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing.Proceedings of the National Academy of Sciences of the United States of America, 102 13
F. Chiaromonte, R. Weber, K. Roskin, M. Diekhans, W. Kent, D. Haussler (2003)
The share of human genomic DNA under selection estimated from human-mouse genomic alignments.Cold Spring Harbor symposia on quantitative biology, 68
A. Siepel, G. Bejerano, J. Pedersen, A. Hinrichs, Minmei Hou, K. Rosenbloom, H. Clawson, J. Spieth, L. Hillier, S. Richards, G. Weinstock, R. Wilson, R. Gibbs, W. Kent, W. Miller, D. Haussler (2005)
Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.Genome research, 15 8
Sònia Casillas, A. Barbadilla, C. Bergman (2007)
Purifying selection maintains highly conserved noncoding sequences in Drosophila.Molecular biology and evolution, 24 10
J. Hein, C. Wiuf, Bjarne Knudsen, Marianne Moller, G. Wibling (2000)
Statistical alignment: computational properties, homology testing and goodness-of-fit.Journal of molecular biology, 302 1
D. Tagle, B. Koop, M. Goodman, J. Slightom, D. Hess, Richard Jones (1988)
Embryonic ε and γ globin genes of a prosimian primate (Galago crassicaudatus): Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprintsJournal of Molecular Biology, 203
Vol. 26 no. 17 2010, pages 2116–2120 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btq360 Sequence analysis Advance Access publication July 7, 2010 Genome-wide functional element detection using pairwise statistical alignment outperforms multiple genome footprinting techniques 1,∗ 1 2 R. Satija , J. Hein and G. A. Lunter 1 2 Department of Statistics and Wellcome Trust Centre for Human Genetics, Oxford University, Oxford, UK Associate Editor: Dmitrij Frishman ABSTRACT conserved elements in both Drosophila and human genomes (Casillas et al., 2007; Drake et al., 2005). Motivation: Comparative genomic sequence analysis is a powerful There exists a clear need for general methods that identify approach for identifying putative functional elements in silico. The functional DNA in large mammalian and vertebrate genomes. availability of full-genome sequences from many vertebrate species Previous genome-wide conservation studies have estimated that has resulted in the development of popular tools, for example, the 3–8% of the human genome is under purifying selection phastCons software package that search large numbers of genomes (Chiaromonte et al., 2003; Lunter et al., 2006; Siepel et al., to identify conserved elements. While phastCons can analyze many 2005). However, protein-coding elements have been estimated to genomes simultaneously, it ignores potentially informative insertion make up only 1–2% of the genome (International Human Genome and deletion events and relies on a fixed, precomputed multiple Sequencing Consortium, 2004). Non-protein-coding conserved sequence alignment. material, thus, represents a large fraction of all functional DNA, Results: We have developed a new method, GRAPeFoot, which and specialized techniques such as exon-finders cannot identify the simultaneously aligns two full genomes and annotates a set of majority of the dark matter present in the human genome. conserved regions exhibiting reduced rates of insertion, deletion Two general techniques have been used for identifying putative and substitution mutations. We tested GRAPeFoot using the human functional regions in the human genome by searching for regions and mouse genomes and compared its performance to a set of with reduced rates of mutation. An example of an approach focusing phastCons predictions hosted on the UCSC genome browser. Our solely on substitution data is the phastCons software package (Siepel results demonstrate that despite the use of only two genomes, et al., 2005). phastCons utilizes a phylogenetic hidden Markov GRAPeFoot identified constrained elements at rates comparable model (phylo-HMM) which annotates regions in a given multiple with phastCons, which analyzed data from 28 vertebrate genomes. sequence alignment as conserved and non-conserved by modelling This study demonstrates how integrated modelling of substitutions, conserved regions with reduced rate of point mutations. The use indels and purifying selection allows a pairwise analysis to exhibit a of an HMM framework allows phastCons to estimate all model sensitivity similar to a heuristic analysis of many genomes. parameters, including the length and expected level of conservation, Availability: The GRAPeFoot software and set of genome-wide directly from the sequence data. Theoretical studies have concluded functional element predictions are freely available to download online that comparative studies may exhibit increased statistical power at http://www.stats.ox.ac.uk/∼satija/GRAPeFoot/ when analyzing datasets with larger numbers of genomes, and Contact: satija@stats.ox.ac.uk phastCons has been applied to both a set of 17 and a set of 28 Supplementary information: Supplementary data are available at vertebrate genomes (Eddy, 2005; Margulies et al., 2005). Bioinformatics online. A complementary approach, implemented in the consIndel Received on November 20, 2009; revised on June 7, 2010; accepted software package, searches a precomputed multiple sequence on June 30, 2010 alignment for regions with a reduced rate of insertion and deletion (indel) mutations (Lunter et al., 2006). In this approach, the annotated set of ancestral repeats (ARs) is analyzed to determine the 1 INTRODUCTION distribution of ungapped regions (sequences between indel events) in neutral regions. When the entire genome is examined, long Despite the wealth of currently available DNA sequence, the ungapped regions that deviate from this null distribution were found important task of identifying functionally important biological to be highly enriched with previously annotated functional elements. regions remains difficult. One popular approach is to search for Both phastCons and consIndel have dedicated tracks on the UCSC regions that are putative targets of purifying selection by virtue genome browser (Miller et al., 2007). of being conserved at a variety of evolutionary distances (Tagle While these techniques successfully identify many previously et al., 1988). While mutational cold spots could also explain the annotated functional elements, both make their predictions based on existence of conserved elements, population genetics studies refute a paucity of either substitution or indel mutations. Additionally, both this hypothesis and conclude that purifying selection maintains techniques make their predictions based on a fixed multiple sequence alignment. While this reduces the computational complexity of To whom correspondence should be addressed. both methods, uncertainty or errors in the alignment will impact 2116 © The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org [15:14 30/7/2010 Bioinformatics-btq360.tex] Page: 2116 2116–2120 Pairwise genome-wide functional element detection the downstream functional element predictions. Especially as the number of genome sequences increases, thereby exponentially Ins Ins Del Slow Fast Fast increasing the number of possible multiple sequence alignments and thus the difficulty of the multiple-alignment problem, the reliance on M M Slow Fast a fixed alignment can result in flawed annotations (Margulies, 2008). Additionally, previous studies have demonstrated how standard Del Del2 Ins2 Fast Slow Fast score-based alignment methods produce biased alignments that may drastically underestimate the level of indel mutations (Lunter, 2007b; Lunter et al., 2008). Statistical alignment offers a framework for combining sequence Fig. 1. Graphical representation of the GRAPeFoot HMM. Slow states are used to model conserved functional elements and fast states model alignment with comparative sequence annotation, thereby removing background sequence. An additional pair of insertion/deletion states (insert2, the traditional dependence on a fixed alignment (Hein et al., delete2) model a mixture geometric indel length distribution in background 2000). We have previously shown how computing a probability- sequence (Lunter et al., 2006). weighted alignment distribution can increase predictive accuracy by correctly accounting for alignment error or uncertainty during annotation (Satija et al., 2008, 2009). Our method, the length of short indel events and the other modeling long indels. This combining statistical alignment and phylogenetic footprinting was found to result in a small but significant improvement in alignment (SAPF), annotates functional elements in short genomic regions quality (Lunter et al., 2008). To complete the GRAPeFoot HMM, we add three states (slow_match, slow_insert and slow_delete) modeling conserved using a HMM which modeled both a reduced rate of substitution elements, and enable transitions between the match and slow_match states, and indel events in conserved regions. However, since the number thus allowing the model to alternate between annotating neutral sequence of hidden states in the model and the number of potential alignments and conserved elements. We assume that there are only short indel events in that must be analyzed both increase exponentially with the number conserved regions and thus there is no need for extra indel states. of sequences, expanding this full probabilistic treatment to the entire A path of state transitions and emissions through the GRAPeFoot HMM genome for multiple sequences is computationally infeasible. represents both a pairwise sequence alignment and an annotation of each It is possible to extend SAPF to perform full probabilistic alignment column as either neutral sequence or a conserved element. The annotation of two genomes by building upon the probabilistic aligner posterior probabilities for each state path can be calculated after running GRAPe (Lunter, 2007a; Lunter et al., 2008). GRAPe has been found the standard Forward and Backward algorithms (Durbin et al., 1998). We to produce superior alignments of the human and mouse genomes integrate over all possible alignments to calculate, for each base in the human genome, the predicted posterior probability that it is part of a functional using a technique called Marginalized Posterior Decoding. While element. the reduced number of genomes may potentially reduce the signal available for phylogenetic footprinting, previous studies have shown how this loss of signal could be fully offset by using probabilistic 2.3 The GRAPe whole-genome aligner alignment techniques (Lunter et al., 2008). Our new software tool, The HMM described in Figure 1 is the main building block of the GRAPeFoot GRAPeFoot, combines statistical alignment and annotation of the algorithm. It is used to assess homology and compute alignments of short human and mouse genomes in order to identify regions with reduced segments, up to 1000 bp, of DNA. In brief, the algorithm works by first rates of both substitution and indel mutations. aligning short segments around guides, or putatively homologous genome loci. We used BlastZ alignments as input, from which GRAPeFoot generates one guide for every 20 BlastZ alignment columns on-the-fly. The large 2 METHODS density ensures that all BlastZ alignments are considered, but does not preclude alternative alignments from being considered. It is also possible 2.1 Model overview to provide an explicit list of alignment guides, which may be generated GRAPeFoot uses a pairwise HMM (Durbin et al., 1998) to simultaneously from, for example, BLAST or BLAT searches. align and annotate two DNA sequences. The HMM contains two sets of Around each guide, local homology is assessed, and a local alignment is hidden states: one set models neutrally evolving sequences (fast evolution) computed if the homology test succeeds. Homology testing is implemented and a second set models conserved elements (slow evolution). Transition by using a modified Fisher–Yates algorithm to permute both sequences while and emission parameters in slow states are set to model a reduced rate keeping tri-nucleotide counts fixed. The alignment log-likelihood for the of evolutionary events (substitutions, insertions and deletions). The model permuted sequences is calculated by the Forward algorithm, and their mean can switch between fast and slow states in order to annotate both neutral and variance is estimated using a minimum of six permuted sequence pairs. and functional subsequences. Since we cannot model a continuum of When their log-likelihood exceeds that of the permutations by 2 SDs plus a constraint strengths, our method uses a single category of states to represent threshold of log(10.0) which was determined by trial and error. conserved sequence as an approximation to the range of conservation levels When alignments pass the homology test and extend sufficiently far, present in functional elements. We address one of the issues related to this further guides are generated on either side of the aligned sequence, in such approximation when training our model parameters. a way that when the alignment in fact extends beyond the 1000 bp window, the resulting alignments will overlap. When two alignments generated from 2.2 The GRAPeFoot HMM such guides in fact overlap, and exactly agree in at least one ‘match’ GRAPeFoot models neutral sequences using the standard pairwise alignment alignment column with posterior support over 0.25 , they are combined into a HMM implemented in the GRAPe software package. The HMM consists of single larger alignment. This process continues until no further homologous the basic three state (match, insert and delete) HMM with the addition of two sequence is detected. extra indel states (insert2 and delete2), which specifically model long indel In this way, the algorithm is able to iteratively grow large aligned events. These extra states allow the length of insertion and deletion events sequences, potentially far beyond the initial alignment guides. When the to be modeled as a mixture of two geometric distributions, one modeling non-overlapping sequences are far apart, the alignments are output separately. [15:14 30/7/2010 Bioinformatics-btq360.tex] Page: 2117 2116–2120 R.Satija et al. 2.4 Estimating model parameters posterior probability that each individual base in the human genome has been subjected to the effects of purifying selection. While these base pair The model transition and emission parameters control the expected lengths of probabilities are useful, many researchers are most interested in a set of fast and slow regions as well as the expected rates of substitutions, insertions contiguous elements which are predicted to be functional. One method of and deletions in each. The transition probabilities between match and indel producing such a list is to use the Viterbi algorithm, which calculates the states set the expected rate of indel events, and the emission probabilities single most likely state path through the HMM, and thus does not consider from the match states, represented by a substitution probability matrix, set the influence of multiple alternative alignments. Instead, we converted the expected rate of substitution. individual base probabilities into a list of contiguous elements. We defined From previous work (Lunter, 2007b), we took a set of substitution an element as a continuous stretch of DNA with a minimum length of 5 nt probability matrices, one from each of 20 G+C bands. The data were split into where the annotated posterior probability of each base was greater than a different G+C bands in order to account for observed variation in substitution probability threshold. For example, using a probability threshold of P =0.6, and indel mutation probabilities for sequences with different G+C content we examined the set of all bases with annotated posterior probabilities (Lunter, 2007b). These substitution probability matrices, denoted by p ˆ >60%, and all contiguous sequences with a length >5 nt defined the set of correspond to emission probabilities from the match state, and represent called elements. Using different probability thresholds alters the stringency proportions of each homologous nucleotide pairing observed in neutral of element calling and allows us to control for false discovery rate. We used sequence. We used this information to estimate a neutral substitution rate the approach of Lunter et al. (2006) and created 10 different sets of called matrix, denoted by Q which contains the rate of change for each base to every ˆ elements, each using a different probability threshold. Four of these sets Qt other base. These matrices are related by the equation p ˆ = e , where t is the can be downloaded from the GRAPeFoot web site and easily imported into divergence time between the two species. To calculate the slow HMM state the UCSC genome browser, and the remaining datasets are available from emission probabilities for the slow_match state, corresponding to the matrix the authors upon request. The Supplementary Material also provides basic Qαt p ˆ , we use the formula p ˆ = e . This allows us to model the reduced slow slow statistics, such as the percentage of the genome predicted to be functional, rate of substitution mutations in conserved regions by a single parameter for each of these predicted element sets. α, constrained to be <1, while retaining different emission probabilities for each G+C band. The rates of indel mutations are set by the transition probabilities from the 3 RESULTS match to the indel states. As with the emission probabilities, we obtained a set In order to assess the performance of GRAPeFoot, we benchmarked of indel probabilities for different G+C bands from previous work (Lunter, its performance against the phastCons and consIndel software 2007b) and used these to model non-conserved regions. To calculate the indel probabilities for conserved regions, we reduced the corresponding transition packages. We tested the three software package on five different probabilities in the non-conserved regions by a factor β. The scaling factors types of previously annotated elements in the human genome. α and β are two GRAPeFoot model parameters used to generate the complete • Exons: we used the set of exons from the UCSC known genes set of transition and emission probabilities. Two additional model parameters, track, which accumulates genes from a variety of sources δ and , were used to define the geometric distributions modeling the length based on protein data (SWISS-PROT and TrEMBL) and of conserved and non-conserved regions, and correspond to the probabilities of a fast state transitioning to a fast state, or a slow state transitioning to mRNA data (RefSeq, Genbank). a slow state. The final model parameter, θ, is the self-transition probability • Putative transcription regulatory regions (pTRRs): these applying to both the slow_insert and slow_delete state and thus sets the are a set of putative transcriptional regulatory sequences expected length of insertions in slow regions. This extra parameter enables that have been identified from ENCODE experimental GRAPeFoot to model indel events in functional and background sequence data. The 100 bp regions were identified via data from as differing both in frequency and in length. chromatin immunoprecipitated samples hybridized to high- While phastCons estimates some HMM parameters via maximum density microarray chips (ChIP-chip) for sequence-specific likelihood estimation (MLE), we chose a different approach for parameter transcription factors. Set specificity was increased by requiring estimation. We expected that a significant proportion of the functional sequence in the genome would be highly conserved, and the MLE additional experimental evidence (chromatic activation, parameters would be set to optimally detect these elements at the expense of DNAase hypersensitivity and nucleosome depletion). detecting weakly conserved elements. Thus, parameters which maximized • Predicted regulatory modules (PReMods): a genome-wide the likelihood of our model could cause our method to miss functional set of computationally predicted human transcriptional elements with lower levels of conservation. In contrast, model parameters modules. Predictions are made by searching for clusters of that are trained to efficiently identify weakly conserved regions allow the phylogenetically conserved and repeating transcription factor model to identify highly conserved elements with good power as well. Thus, motifs using a motif database consisting of 229 TFs. we utilized a supervised learning approach on a segment of chromosome 21 corresponding to 1% of the human genome in order to obtain our • microRNAs: we used the miRBase database, a database of parameter estimates. We trained our parameter set to maximally identify published microRNA sequences and annotations. weakly conserved exons, while requiring that the false positive rate remained • ARs: we used the set of repeats predicted by the RepeatMasker low. For comparison, we also computed a set of MLE parameters and found that our supervised learning parameters, while decreasing the overall program, which screens DNA sequences for interspersed likelihood, were far more tolerant to functional regions that were not perfectly repeats and low complexity DNA sequences. Since ARs are conserved and allowed our model to better discriminate between annotated presumed to evolve neutrally, we benchmark the performance exons and ARs in this region. This analysis, along with the exact parameter in finding these sequences in order to estimate a false positive values used by GRAPeFoot, is presented in the Supplementary Material. rate. We wanted to compare the proportion of previously annotated 2.5 Calling individual elements elements detected by each of the three methods for a variety of different thresholds. To compare the three methods, we constructed The output of the Forward and Backward algorithms can be used for posterior receiver operating characteristics (ROC) curves for each method decoding (Durbin et al., 1998), a process which returns the predicted [15:14 30/7/2010 Bioinformatics-btq360.tex] Page: 2118 2116–2120 Pairwise genome-wide functional element detection Exons pTRRs to compare the ROC curves for the different methods, and thus ● ● ● we chose to omit the point from both the graphs and the AUC ● ● calculations. The ROC analysis in Figure 2 shows that GRAPeFoot performs ● well when compared with both phastCons and consIndel for all four types of functional elements. Additionally, the AUC values demonstrate that the quality of predictions varies for the different GRAPeFoot : 0.009 GRAPeFoot : 0.002 phastCons28 : 0.005 phastCons28 : 0.001 methods. For all methods, accuracy was highest when detecting consIndel : 0.008 consIndel : 0.002 ● ● exons, followed (in decreasing order) by miRNAs, PReMods and 0.000 0.005 0.010 0.015 0.000 0.005 0.010 0.015 pTRRs. This performance ordering was not unexpected given False Positive Rate False Positive Rate the average length and expected conservation of these functional elements. Both theoretical and genome-based studies have shown miRNAs PreMODs ● ● that the ability to detect functional elements via comparative ● ● genomics improves with longer elements and higher levels of ● ● ● ● ● conservation (Eddy, 2005; Stark et al., 2007). Exons have an average length of 170 bp and are in general well conserved among vertebrates ● ● while microRNAs are significantly smaller, with an average length of 24 bp, and exhibit a reduced but similar level of conservation as GRAPeFoot : 0.007 GRAPeFoot : 0.009 phastCons28 : 0.004 exons (Bartel, 2004). In contrast, regulatory modules are expected phastCons28 : 0.006 consIndel : 0.005 consIndel : 0.006 ● ● to contain large percentages of neutral DNA interspersed with 0.000 0.005 0.010 0.015 0.000 0.005 0.010 0.015 regulatory signals such as transcription factor binding sites, which False Positive Rate False Positive Rate have an average length of 6–15 bp and can exhibit high degrees of variability (GuhaThakurta, 2006). Thus, while all pTRRs were Fig. 2. ROC curves comparing the performance of GRAPeFoot, phastCons constrained to be 100 bp, only a small proportion of these bases (e.g. and consIndel for four classes of functional elements. The legend for each a pTRR could contain only one transcription factor binding site) may plot displays the area under each of the ROC curves. be functional and therefore subject to purifying selection. While all methods performed significantly better when detecting PReMODs, for each of the different functional element types. The y-axis of another set of putative regulatory modules, these modules were the ROC curve represents the true positive rate: the percentage identified by searching for clusters of phylogenetically conserved of nucleotides within a functional element class that are correctly motifs and therefore were assured to exhibit at least partial sequence annotated as conserved (true positives), and the x-axis represents the conservation. false positive rate: the percentage of nucleotides within ARs detected by the method (false positives) for the same threshold. The curve is 4 CONCLUSIONS constructed by placing and connecting points on the plot for a range of different thresholds. For the GRAPeFoot results, each point on the Our results demonstrate the potential value of comparative analyses ROC curve corresponds to 1 of the 10 sets of contiguous elements performed on just two sequences by using integrated modeling of described in Section 2.5. Both phastCons and consIndel predictions substitutions, indels and selection. Though GRAPeFoot had the assign each predicted contiguous element a score ranging from 0 to benefit of analyzing only the human and mouse genomes, our 1000. When constructing ROC curves for these predictions, we used software identified constrained elements at rates comparable with these scores for thresholding. Each point on the ROC curves can be phastCons, which utilized data from 28 full vertebrate genomes. easily reproduced using the base-pair-wise intersection feature on These results imply that sequencing large numbers of genomes and the UCSC genome browser after loading in our publicly available running heuristic methods in order to process all the data may not be list of conserved element predictions. We demonstrate an example the only effective way to decode the locations of functional elements of this in the Supplementary Material. in the genome. Improved modeling methods, even if they can only Since the threshold value does not appear on either axis, the run on a reduced number of genomes, have the potential to squeeze ROC curve allows us to compare method performance for a variety more information out of the data, and can also be run on datasets of different sensitivity/specificity regimes. Figure 2 displays ROC where large numbers of extant species have not been sequenced. curves for four classes of functional elements: exons, pTRRs, GRAPeFoot exhibits how statistical alignment techniques can miRNAs and preMODs. be used to extend traditional comparative analyses. While both The legend for each plot displays the area under the curve (AUC), phastCons and consIndel require a precomputed fixed alignment, a summary statistic that evaluates the method’s performance while GRAPeFoot simultaneously performs probabilistic alignment and taking into account both sensitivity and specificity. Higher AUC conservation analysis, a process which significantly improves upon values correspond to increased predictive accuracy. Each ROC the accuracy of traditional pairwise genomic alignment. GRAPeFoot curve has been modified to exclude the point (1,1), which is present avoids relying on score-based alignment practices which often by definition in all ROC curves. This point corresponds to using a use integer penalties to represent substitution and indel mutations score or probability cutoff of zero in which case each method would in order to find the single alignment with the highest score, a annotate each nucleotide in the genome as functional, corresponding process that results in known alignment biases (Lunter et al., to both a true positive rate and a false positive rate of 100%. The 2008). In contrast, GRAPeFoot’s probabilistic alignment framework inclusion of this point on the graph, however, hinders the ability utilizes parameters which characterize evolutionary models, and [15:14 30/7/2010 Bioinformatics-btq360.tex] Page: 2119 2116–2120 True Positive Rate True Positive Rate 0.0 0.2 0.4 0.6 0.0 0.1 0.2 0.3 0.4 0.5 True Positive Rate True Positive Rate 0.0 0.1 0.2 0.3 0.4 0.00 0.05 0.10 0.15 R.Satija et al. GRAPeFoot sums over all possible indel configurations in order REFERENCES to remove these biases. These improvements enable GRAPeFoot Bartel,D. (2004) MicroRNAs genomics, biogenesis, mechanism, and function. Cell, to increase the accuracy of both the sequence alignment and the 116, 281–297. Casillas,S. et al. (2007) Purifying selection maintains highly conserved noncoding footprinting predictions. sequences in Drosophila. Mol. Biol. Evol., 24, 2222. The benchmarking results also demonstrate how both indel Chiaromonte,F. et al. (2003) The share of human genomic DNA under selection mutations and substitutions are crucially informative and should be estimated from human-mouse genomic alignments. In Cold Spring Harbor considered in conservation studies. While most footprinting methods Symposia on Quantitative Biology, Vol. 68, CSHL Press, NY, pp. 245–254. focus on finding regions with low numbers of substitutions, we found Drake,J. et al. (2005) Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat. Genet., 38, 223–227. that consIndel performed similarly to phastCons despite ignoring Durbin,R. et al. (1998) Biological Sequence Analysis. Cambridge University Press, substitution information and focusing only on the frequency of indel New York. events. In contrast, GRAPeFoot considers both substitution and indel Eddy,S. (2005) A model of the statistical power of comparative genome sequence mutations to be informative. Additionally, the transition parameters analysis. PLoS Biol., 3, e10. GuhaThakurta,D. (2006) Computational identification of transcriptional regulatory in GRAPeFoot allowed the model to discriminate between neutral elements in DNA sequence. Nucleic Acids Res., 34, 3585. and functional sequence not only based on the frequency of indel Hein,J. et al. (2000) Statistical alignment: computational properties, homology testing events, but also on their length as well. This ensures that short and goodness-of-fit. J. Mol. Biol., 302, 265–279. and adjacent indel mutations (which may be offsetting) will not International Human Genome Sequencing Consortium (2004) Finishing the incorrectly cause GRAPeFoot to annotate a region as neutral. euchromatic sequence of the human genome. Nature, 431, 931–945. Lunter,G. (2007a) HMMoC a compiler for hidden Markov models. Bioinformatics, 23, While we would like to extend GRAPeFoot to analyze multiple sequences, it is currently infeasible to do full probabilistic alignment Lunter,G. (2007b) Probabilistic whole-genome alignments reveal high indel rates in the and footprinting of even three sequences on a genome-wide scale. human and mouse genomes. Bioinformatics, 23, i289. We are currently attempting to develop methods to approximate the Lunter,G. et al. (2006) Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput. Biol., 2, e5. multiple alignment distribution instead of utilizing a full dynamic Lunter,G. et al. (2008) Uncertainty in homology inferences: assessing and improving programming calculation. One possible future route involves genomic sequence alignment. Genome Res., 18, 298. approximating the multiple-alignment distribution by analyzing Margulies,E. (2008) Confidence in comparative genomics. Genome Res., 18, 199. the full set of pairwise alignments. We are also modifying the Margulies,E. et al. (2005) An initial strategy for the systematic identification of GRAPeFoot HMM to allow for the insertion and deletion of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Natl Acad. Sci., 102, 4795–4800. functional elements. When analyzing data from multiple sequence Miller,W. et al. (2007) 28-Way vertebrate alignment and conservation track in the UCSC data, this modification would allow GRAPeFoot to annotate the Genome Browser. Genome Res., 17, 1797. evolution of functional elements along a tree and to trace the Satija,R. et al. (2008) Combining statistical alignment and phylogenetic footprinting to evolutionary history of different element classes. detect regulatory elements. Bioinformatics, 24, 1236. Satija,R. et al. (2009) BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC. BMC Evol. Biol., 9, 217. Siepel,A. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, ACKNOWLEDGEMENTS and yeast genomes. Genome Res., 15, 1034. Stark,A. et al. (2007) Discovery of functional elements in 12 Drosophila genomes using We thank István Miklós and Rune Lyngsø for helpful discussions. evolutionary signatures. Nature, 450, 219–232. We also thank the Oxford Supercomputing Centre for providing Tagle,D. et al. (1988) Embryonic epsilon and gamma globin genes of a computational resources. prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J. Mol. Biol., 203, 439–455. Funding: Rhodes Trust, UK (R.S.). Conflict of Interest: none declared. [15:14 30/7/2010 Bioinformatics-btq360.tex] Page: 2120 2116–2120
Bioinformatics – Oxford University Press
Published: Jul 7, 2010
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.