Nucleic Acids Research, 2006, Vol. 34, Web Server issue W369–W373
doi:10.1093/nar/gkl198

MEME: discovering and analyzing DNA and protein sequence motifs

Timothy L. Bailey*, Nadya Williams(1), Chris Misleh(1) and Wilfred W. Li(1)

Institute of Molecular Bioscience, The University of Queensland, St Lucia, QLD 4072, Australia and (1) SDSC, UCSD, La Jolla, CA, USA

Received February 14, 2006; Revised and Accepted March 21, 2006

*To whom correspondence should be addressed. Tel: +61 7 3346 2614; Fax: +61 7 3346 2101; Email: [email protected]

The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]

ABSTRACT

MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel ‘signals’ in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource (http://meme.nbcr.net) and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance.

INTRODUCTION

The purpose of MEME (Multiple EM For Motif Elicitation) (rhymes with ‘team’) (1,2) is to allow users to discover signals (called ‘motifs’) in DNA or protein sequences. The user of MEME inputs a set of sequences believed to share some (unknown) sequence signal(s). For example, some or all of a set of promoters from co-expressed and/or orthologous genes may contain binding sites (the ‘signal’) for the same transcription factor (3). Similarly, a set of proteins that interact with a single host protein may do so via similar domains (the ‘signal’) (4). Both types of sequence signals can often be represented as motifs: ungapped, approximate sequence patterns. Using a process akin to gapless, local, multiple sequence alignment, MEME searches for statistically significant motifs in the input sequence set. In this way, MEME can discover the binding sites for the shared transcription factor in the set of promoters or the common protein–protein binding domains in the set of proteins. MEME can also be used to discover motifs describing many other types of DNA or protein signals besides transcription factor binding sites and protein–protein interaction domains.

To use MEME via the website, the user provides a set of sequences in the FASTA format by either uploading a file or by cut-and-paste. The only other required input is an email address where the results will be sent. (A planned future version will remove this requirement by providing temporary storage of the results on the web server for a preset period of time.) By default, MEME looks for up to three motifs, each of which may be present in some or all of the input sequences. MEME chooses the width and number of occurrences of each motif automatically in order to minimize the ‘E-value’ of the motif, i.e. the probability of finding an equally well-conserved pattern in random sequences. By default, only motif widths between 6 and 50 are considered, but the user may change this as well as several other aspects of the search for motifs.

The MEME output is HTML and shows the motifs as local multiple alignments of (subsets of) the input sequences, as well as in several other formats (Figure 1). ‘Block diagrams’ show the relative positions of the motifs in each of the input sequences. Buttons on the MEME HTML output allow one or all of the motifs to be forwarded for analysis by other web-based programs. Clicking on a button allows all of the motifs to be sent to the MAST web server where various sequence databases (or uploaded sequences) can be searched for sequences matching the motifs. This is useful in cases, for example, where the user would like to find whether the motif of interest is also present in other genes or genomes.

Figure 1. Sample MEME output. This portion of an MEME HTML output form shows a protein motif that MEME has discovered in the input sequences. The sites identified as belonging to the motif are indicated, and above them is the ‘consensus’ of the motif and a color-coded bar graph showing the conservation of each position in the motif. Some of the hyperlinked buttons that allow the motif to be viewed and analyzed in other ways can be seen at the bottom of the screen shot.

MAST is a web-based tool that can be used to search for sequences that match one or more motifs. It can be used to look for sequences that contain motifs found by MEME, by other motif discovery tools or that are taken from a motif database. The MAST website, reached via the same URL as the MEME website, provides numerous nucleotide and protein databases for searching. MAST queries may contain any number of motifs, and it scores each sequence in the selected database using all of the motifs. In the first example above, MAST can search DNA sequences for matches to the putative transcription factor binding site (TFBS) motifs found by MEME in a set of promoter sequences. MAST can search for matches in protein sequences to the putative protein–protein interaction motifs found in the second MEME example.

Users of MEME via the website or locally installed versions are asked to cite this article as well as the primary reference for MEME (5). Users of MAST are asked to cite this article and Ref. (6).

MOTIF DISCOVERY STRATEGIES

Motif discovery can be viewed as a ‘needle in a haystack’ problem. The motif discovery algorithm is looking for a set of similar short sequences (the needle) in a set of much longer sequences (the haystack). The problem is easier when the motif instances are long and very similar to each other. It gets much harder when the motif instances are short and/or degenerate, or the input sequences are very long.

Discovering TFBS motifs in a set of DNA sequences (e.g. genomic regions upstream of genes) is a difficult task owing to the tendency of binding sites to be short and degenerate, and owing to the fact that promoter regions are often difficult to identify precisely. The problem tends to be worse in eukaryotes than in prokaryotes and yeast because eukaryotic TFBS tend to be shorter and more variable (7).

To successfully discover TFBS motifs with MEME, it is necessary to choose and prepare the input sequences carefully. Candidate sequences can be the promoters of genes believed to be co-regulated based on the evidence from expression microarray experiments, or sequences appearing to bind to a transcription factor based on chromatin immunoprecipitation experiments. The sequences should be as short as possible and contain as few ‘noise’ sequences (sequences not containing any motif) as possible. Ideally, the sequences should be <1000 bp long (8). Including more than 40 motif-containing sequences generally does not improve TFBS motif discovery with MEME and similar algorithms (9). If the sequences contain low-information segments that do not contain motifs of interest, it can be helpful to remove them using the DUST program (R. L. Tatusov and D. J. Lipman, unpublished NCBI/Toolkit), which is available for downloading at http://blast.wustl.edu/pub/dust/. Repetitive DNA elements should also be removed from the sequences input to MEME using the RepeatMasker program (A. Smit, R. Hubley and P. Green, unpublished data), which can be accessed via the Web (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker).

It should be noted that MEME is not suited to whole-genome TFBS motif discovery. Owing to their shortness and degeneracy, TFBS motifs become statistically ‘invisible’ in the context of a whole genome. The sensitivity of the search for TFBS motifs can be improved by using a ‘higher-order background sequence model’, but this option is only available currently when users download the MEME source code and install it locally. Instructions for the installation are available at the MEME website (http://meme.nbcr.net/meme/website/meme-download.html) by clicking on ‘View MEME man page’; see the documentation for the ‘-bfile’ switch there.

Protein motifs are generally easier to discover owing to the size of the protein alphabet and the chemical similarity among groups of amino acids. This allows shorter motifs to be more statistically significant and makes it easier to distinguish functional motifs from statistical artifacts. To use MEME to discover protein motifs, the same basic guidelines apply as with DNA motifs: keep the sequences as short as possible and include as few sequences that are not likely to contain the motif as possible in the input to MEME. Low-complexity regions can be removed from the protein input sequences using the SEG program (10).

ANALYZING MOTIFS USING THE MEME OUTPUT HYPERLINKS

The MEME HTML output contains buttons making it easy to analyze the motifs it discovers. By clicking on the button labeled ‘Compare PSPM to known motifs in JASPAR database’ following each motif, the DNA motif can be compared to each of the motifs in the JASPAR database (11) of known TFBS motifs. Similarly, protein motifs may be compared with protein motifs in the BLOCKS database of protein motifs (12) by clicking on the ‘submit BLOCK’ button following each motif on the MEME form. This takes the user to the ‘BLOCKS server’ where clicking on ‘LAMA’ will compare the motif with those in the BLOCKS database. The BLOCKS server also allows users to display protein motifs in many different ways, including LOGOS (13) or phylogenetic trees, by clicking on the corresponding buttons on the BLOCKS server form. By clicking on one of the file output formats under Logos, the user is able to obtain a LOGOS diagram similar to that shown in Figure 2.

Figure 2. LOGO of protein motif. LOGOS are a visualization tool for motifs. The height of a letter indicates its relative frequency at the given position (x-axis) in the motif.

To search sequences for matches to the motifs found by MEME, users can click on the ‘MAST’ button at the top of the MEME output form. This will take the user to the MAST website where they can select the database to search. Since MAST is sequence-oriented, TFBS motifs should only be used to search promoter regions. These are listed in the MAST database pull-down menu as ‘Upstream Sequence Databases’. Currently, only a few organisms are supported. However, users can upload their own database of promoter sequences for searching using MAST. Protein motifs can be used to search any of the sequence databases provided by the MAST website since MAST can search either protein or nucleotide databases with protein motifs. The MAST databases are updated weekly.

WEB SERVER AND USER SUPPORT

As of MEME version 3.5, the configuration and installation of MEME (including the web server) is significantly simplified by using Autoconf (http://www.gnu.org/software/autoconf/autoconf.html) and Automake (http://www.gnu.org/software/automake/automake.html) from the GNU Build System. An installation session for MEME and MAST web server may be as simple as follows:

    cd meme_3.5.2
    ./configure --prefix=$HOME/meme --with-url=http://www.nbcr.net/meme --enable-web
    make
    make test
    make install

Supported platforms now include Linux, Solaris, MacOS X, Cygwin and Irix.

The MEME web server hosted by NBCR is queried by about 800 different users (based on unique email addresses) each month. Usage has been growing steadily since the service was first introduced in 1996. Figure 3 shows usage growth at the NBCR server since 2000.

Figure 3. Usage of MEME at the NBCR web server. The plot shows the number of different users submitting jobs to the NBCR MEME web server each month since December 2000. Usage figures for March 2006 include up to March 20 only.

To meet the growing user demand and take advantage of the emerging grid-computing resources (14), we have made MEME available for the installation on Linux clusters using either the RPM package manager or Rocks. The RPM package manager is a tool for managing software installation on computers running many versions of the Linux operating system. Rocks (http://www.rocksclusters.org) is a highly customized toolkit for computational biologists and engineers to build and maintain Linux clusters. The current NBCR MEME web server cluster is built using the MEME roll for Rocks and requires minimal maintenance effort.

MEME and MAST can be downloaded and installed free of charge by academic users via the website (http://meme.nbcr.net/meme/website/meme-download.html). Approximately 300 users download the MEME/MAST software each month. The MEME support team offers assistance to the MEME and MAST user community through the forum (http://nbcr.net/forum/viewforum.php?f=5) or the mailing list ([email protected]). Institutes interested in setting up MEME mirror sites are encouraged to contact us for any assistance.

FUTURE DIRECTIONS

To increase the sensitivity of MEME searches, we will add an option in the web server to let the user upload a background sequence model to MEME. We hope to add algorithms for removing low-complexity regions (SEG and DUST) and repeated elements (RepeatMasker) in the MEME website as a convenience to users. These services will also be exposed as web services and integrated using workflow tools developed by NBCR.

We have also planned to add buttons to the MEME output to allow TFBS motifs to be used in searching for cis-regulatory modules via algorithms such as MCAST (15). MCAST will be configured to be able to search the same DNA databases as MAST. In conjunction with this, we will add databases of upstream sequences for many additional organisms to the MAST/MCAST websites to facilitate the analysis of TFBS motifs discovered by using MEME.

NBCR has developed a set of tools built on top of the open source software that allows bioinformatics applications to be deployed as Web Services easily (S. Krishnan, B. Stearn, K. Bhatia, W. W. Li and P. Arzberger, manuscript submitted) and leverage the Cyberinfrastructure components transparently (14). A prototype has been deployed using MEME as a scientific driver (16) that offers the user a dynamic pool of distributed compute resources, a workflow management console and a friendly user interface. This portal will be deployed to the production web server in the future.

ACKNOWLEDGEMENTS

The authors acknowledge NBCR award from NCRR, NIH P41 RR08605, for support of the MEME and MAST website. T.L.B. acknowledges grant from NIH, R01 RR021692-01, for support of continuing development of the MEME and related sequence analysis tools. T.L.B. also acknowledges the ARC Centre for Bioinformatics (ACB) (ARC CE0348221) for infrastructure support for the MEME mirror site at the ACB. Funding to pay the Open Access publication charges for this article was provided by the NIH.

Conflict of interest statement. None declared.

REFERENCES

1. Bailey,T.L. and Elkan,C. (1995) Unsupervised learning of multiple motifs in biopolymers using EM. Mach. Learn., 21, 51–80.
2. Bailey,T.L. and Elkan,C. (1995) The value of prior knowledge in discovering motifs with MEME. In Rawlings,C., Clark,D., Altman,R., Hunter,L., Lengauer,T. and Wodak,S. (eds), Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, July. AAAI Press, Menlo Park, CA, pp. 21–29.
3. Lyons,T.J., Gasch,A.P., Alex Gaither,L., Botstein,D., Brown,P.O. and Eide,D.J. (2000) Genome-wide characterization of the Zap1p zinc-responsive regulon in yeast. Proc. Natl Acad. Sci. USA, 97, 7957–7962.
4. Fang,J., Haasl,R.J., Dong,Y. and Lushington,G.H. (2005) Discover protein sequence signatures from protein-protein interaction data. BMC Bioinformatics, 6, 1–8.
5. Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Altman,R.B., Brutlag,D.L., Karp,P.D., Lathrop,R.H. and Searls,D.B. (eds), Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, August. AAAI Press, Menlo Park, CA, pp. 28–36.
6. Bailey,T.L. and Gribskov,M. (1998) Combining evidence using P-values: application to sequence homology searches. Bioinformatics, 14, 48–54.
7. Tompa,M., Li,N., Bailey,T.L., Church,G.M., De Moor,B., Eskin,E., Favorov,A.V., Frith,M.C., Fu,Y., Kent,W.J. et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol., 23, 137–147.
8. Pevzner,P.A. and Sze,S.H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences. In Bourne,P.E., Gribskov,M., Altman,R.B., Jensen,N., Hope,D., Lengauer,T., Mitchell,J.C., Scheeff,E.D., Smith,C., Strande,S. and Weissig,H. (eds), Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, August. AAAI Press, Menlo Park, CA, pp. 269–278.
9. Hu,J., Li,B. and Kihara,D. (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res., 33, 4899–4913.
10. Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.
11. Sandelin,A., Alkema,W., Engström,P., Wasserman,W.W. and Lenhard,B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91–D94.
12. Henikoff,J.G., Pietrokovski,S. and Henikoff,S. (1997) Recent enhancements to the blocks database servers. Nucleic Acids Res., 25, 222–225.
13. Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097–6100.
14. Foster,I. and Kesselman,C. (2004) The Grid 2: Blueprint for a New Computing Infrastructure. 2nd edn. Morgan Kaufmann Publishers, Inc., San Francisco, CA.
15. Bailey,T.L. and Noble,W.S. (2003) Searching for statistically significant regulatory modules. Bioinformatics, 19 (Suppl 2), II16–II25.
16. Li,W.W., Krishnan,S., Mueller,K., Misleh,C. and Arzberger,P. (2006) Building cyberinfrastructure for bioinformatics using service oriented architecture. In Bu Sung,F.L., Abramson,D., Cai,W., Graupner,S., Jin,H. and Sloot,P. (eds), Proceedings of the IEEE International Symposium on Cluster Computing and the Grid, May. IEEE Press, USA, (in press).
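
Illustrative note on preparing input. The article advises that input sequences be kept short (ideally <1000 bp) and that including more than 40 motif-containing sequences generally does not improve TFBS motif discovery. A FASTA file could be screened against those two guidelines before submission. The following is a minimal Python sketch under stated assumptions, not part of MEME or its web server: the function names `parse_fasta` and `check_meme_input`, the default thresholds, and the warning wording are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {name: sequence} mapping.

    The name is taken as the first word after '>' (an assumption;
    duplicate names would overwrite earlier records).
    """
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            records[name] += line
    return records


def check_meme_input(records: Dict[str, str],
                     max_len: int = 1000,
                     max_seqs: int = 40) -> List[str]:
    """Return warnings for inputs that exceed the article's guidelines.

    max_len and max_seqs are hypothetical defaults mirroring the
    '<1000 bp' and '40 sequences' figures quoted in the text.
    """
    warnings: List[str] = []
    if len(records) > max_seqs:
        warnings.append(
            f"{len(records)} sequences supplied; more than {max_seqs} "
            "generally does not improve TFBS motif discovery"
        )
    for name, seq in records.items():
        if len(seq) >= max_len:
            warnings.append(
                f"{name}: {len(seq)} residues; consider trimming below {max_len}"
            )
    return warnings
```

Calling `check_meme_input(parse_fasta(open("promoters.fa").read()))` (with a hypothetical file name) would return an empty list when both guidelines are met. The checks are advisory only: the 40-sequence figure is stated in the article for TFBS motifs, and protein inputs may warrant different thresholds.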
Nucleic Acids Research – Oxford University Press
Published: Jul 1, 2006