Multiple sequence alignment
Robert C Edgar¹ and Serafim Batzoglou²
Multiple sequence alignments are an essential tool for
protein structure and function prediction, phylogeny inference
and other common tasks in sequence analysis. Recently
developed systems have advanced the state of the art with
respect to accuracy, ability to scale to thousands of proteins
and flexibility in comparing proteins that do not share the same
domain architecture. New multiple alignment benchmark
databases include PREFAB, SABMARK, OXBENCH and
IRMBASE. Although CLUSTALW is still the most popular
alignment tool to date, recent methods offer significantly
better alignment quality and, in some cases, reduced
computational cost.
Addresses
¹ 45 Monterey Drive, Tiburon, CA, USA
² Department of Computer Science, Stanford University, Stanford,
CA 94305-9025, USA
Corresponding author: Edgar, Robert C (bob@drive5.com)
Current Opinion in Structural Biology 2006, 16:368–373
This review comes from a themed issue on
Sequences and topology
Edited by Nick V Grishin and Sarah A Teichmann
Available online 5th May 2006
0959-440X/$ – see front matter
© 2006 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.sbi.2006.04.004
Introduction
A multiple sequence alignment (MSA) arranges protein
sequences into a rectangular array with the goal that
residues in a given column are homologous (derived from
a single position in an ancestral sequence), superposable
(in a rigid local structural alignment) or play a common
functional role. Although these three criteria are essentially
equivalent for closely related proteins, sequence,
structure and function diverge over evolutionary time
and different criteria may result in different alignments.
Manually refined alignments continue to be superior to
those produced by purely automated methods; there is therefore a
continuous effort to improve the biological accuracy of MSA
tools. Additionally, the high computational cost of most
naive algorithms motivates improvements in speed and
memory usage to accommodate the rapid increase in
available sequence data. In this review, we describe
the state of the art in MSA software and benchmarking,
and offer our recommended procedures for creating
multiple alignments from typical types of input data.
Computational approaches to multiple sequence alignment
MSA algorithm development is an active area of research
two decades after the first programs were written. The
standard computational formulation of the pairwise problem
is to identify the alignment that maximizes protein
sequence similarity, which is typically defined as the sum
of substitution matrix scores for each aligned pair of
residues, minus some penalties for gaps. The mathematically
(though not necessarily biologically) exact
solution can be found in a fraction of a second for a pair
of proteins. This approach is generalized to the multiple
sequence case by seeking an alignment that maximizes
the sum of similarities for all pairs of sequences (the sum-
of-pairs, or SP, score).
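As a minimal sketch of the sum-of-pairs score just described, the following Python computes SP for a toy alignment. The scoring values (match +1, mismatch -1, a flat -2 per gapped position) are illustrative assumptions standing in for a real substitution matrix and gap penalty model, and the function names are hypothetical. Columns in which both sequences of a pair carry a gap are skipped, which is one common convention rather than the only one.

```python
from itertools import combinations

# Illustrative scoring parameters (assumptions, not values from the review).
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(row_a, row_b):
    """Score two rows of an alignment, column by column."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # gap-gap columns say nothing about this pair
        if a == "-" or b == "-":
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score

def sp_score(alignment):
    """Sum the pairwise scores over all pairs of rows in the alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))

toy_alignment = ["ACD-", "ACDE", "A-DE"]
print(sp_score(toy_alignment))
```

For N sequences the sum runs over N(N-1)/2 pairs, so evaluating the SP score of a given alignment is cheap; the difficulty lies in finding the alignment that maximizes it.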
The SP score is the foundation of many MSA algorithms,
but has a number of drawbacks. The minimum possible
computational time and memory required to maximize
the SP score have been shown to scale exponentially with
the number of sequences [1], so exact optimization is not practical for more
than a handful of sequences on current computers. Heuristic
or approximate alternatives are therefore required for
typical input data. The most widely used approach to
construct a multiple alignment is ‘progressive alignment’
[2], whereby a set of N proteins is aligned by performing
N–1 pairwise alignments of pairs of proteins or pairs of
intermediate alignments, guided by a phylogenetic tree
connecting the sequences.
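The order of operations in progressive alignment can be sketched as a post-order traversal of the guide tree. The Python below assumes a hypothetical binary guide tree written as nested tuples and only lists which groups would be aligned at each step; the pairwise or profile alignment itself is omitted, so this is an illustration of the N–1 merge schedule rather than a working aligner.

```python
def merge_order(tree, steps=None):
    """Post-order traversal of a binary guide tree given as nested tuples.

    Leaves are sequence names (str); internal nodes are (left, right) tuples.
    Returns (leaf names under this node, list of merge steps so far).
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):
        return [tree], steps  # a single sequence needs no alignment step
    left, right = tree
    left_group, _ = merge_order(left, steps)
    right_group, _ = merge_order(right, steps)
    # Both children are resolved, so their (sub)alignments can now be merged.
    steps.append((tuple(left_group), tuple(right_group)))
    return left_group + right_group, steps

# Hypothetical guide tree for five sequences named A to E.
guide_tree = ((("A", "B"), "C"), ("D", "E"))
leaves, steps = merge_order(guide_tree)
for number, (left_group, right_group) in enumerate(steps, start=1):
    print(f"step {number}: align {left_group} with {right_group}")
```

Each internal node of a binary tree with N leaves contributes exactly one merge, which is where the count of N–1 pairwise alignments comes from.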
In contrast to the pairwise case, the SP score has no rigorous
theoretical foundation and, in particular, fails to exploit
phylogeny or incorporate an evolutionary model. SP, like
most other scores in common use, assumes that the input
sequences are globally alignable, that is to say, substitutions
and small insertions and deletions are the only
mutational events separating the sequences. If full-length
sequences are used, this implies that all proteins must have
the same domain organization (the same domains in the
same order); otherwise, the user is required to identify
globally alignable subsequences, such as a common
domain, before creating an MSA. For known domains,
tools such as PFAM [3] can be used; progress towards an
automated solution is demonstrated by the recently
released ProDA program (http://proda.stanford.edu).
‘Consistency-based’ scoring [4–6] has been used successfully to
improve on progressive alignment based on the SP
formulation. Given
three sequences, A, B and C, the pairwise alignments A-B
and B-C imply an alignment of A and C that may be
different from the directly computed A-C alignment.
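The composition of A-B and B-C into an implied A-C alignment can be illustrated with a small Python sketch. Here each pairwise alignment is represented, by assumption, as a dictionary mapping residue indices in the first sequence to the residue indices they are aligned with in the second; the example maps are invented, and the code shows only the composition and comparison step, not any published scoring scheme.

```python
def implied_alignment(ab, bc):
    """Compose A-B and B-C residue maps into the A-C pairs they imply."""
    return {a: bc[b] for a, b in ab.items() if b in bc}

# Hypothetical pairwise alignments as residue-index maps (0-based).
ab = {0: 0, 1: 1, 2: 2}          # residue i of A aligned to residue ab[i] of B
bc = {0: 0, 1: 2, 2: 3}          # residue j of B aligned to residue bc[j] of C
ac_direct = {0: 0, 1: 1, 2: 3}   # A-C alignment computed directly

ac_implied = implied_alignment(ab, bc)

# Residue pairs on which the direct and the implied A-C alignments agree.
consistent = {a: c for a, c in ac_direct.items() if ac_implied.get(a) == c}
print("implied:", ac_implied)
print("consistent with direct alignment:", consistent)
```

In this toy case residue 1 of A is paired with residue 1 of C directly but with residue 2 of C via B, so that pair is the one on which the two routes disagree.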