Plant Molecular Biology 48: 39–48, 2002.
© 2002 Kluwer Academic Publishers. Printed in the Netherlands.
Computational gene finding in plants
and Steven L. Salzberg
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
(∗author for correspondence; e-mail email@example.com)
Key words: computational gene finding, genome sequencing
Automated methods for identifying protein coding regions in genomic DNA have progressed significantly in recent
years, but there is still a strong need for more accurate computational solutions to the gene finding problem.
Large-scale genome sequencing projects depend greatly on gene finding to generate accurate and complete gene
annotation. Improvements in gene finding software are being driven by the development of better computational
algorithms, a better understanding of the cell’s mechanisms for transcription and translation, and the enormous
increases in genomic sequence data. This paper reviews some of the most widely used algorithms for gene finding
in plants, including technical descriptions of how they work and recent measurements of their success on the
genomes of Arabidopsis thaliana and rice.
Computational methods for finding genes have become
an increasingly important tool in recent years. As the
pace of genome sequencing has increased, the need for
rapid methods of gene discovery has become ever
greater. The genome sequence is just the beginning of
a larger effort to understand the functions of an
organism, and one of the first and most critical steps
in that process is the accurate identification of all
genes and their associated proteins.
After a genome has been sequenced and assembled, the
first step in the annotation process is to find the
locations of the genes. For prokaryotes (bacteria and
Archaea), this means identifying the positions of the
start and stop codons of each gene, and possibly
identifying regulatory sequences around them. For
eukaryotes, this step requires identification not only
of the start and stop codons, but also of the positions
of all the introns, which vary tremendously in size and
number even within a single species. Between species
the variation is even greater: for example, the
parasite Plasmodium falciparum (the causative agent of
malaria) has on average just one intron per gene, and
these introns tend to be small, around 200 bp or less.
In contrast, human genes have 4–5 introns, with an
average size of 350 bp and a size range from about
10 bp up to 1 Mb. Genes in Arabidopsis also have 4–5
introns on average, but the introns are smaller than
in humans. Finding these more complicated gene
structures by computer is a demanding task, and no
existing program solves it perfectly.
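For the prokaryotic case described above, where a gene is bounded by a start codon and an in-frame stop codon with no introns in between, the simplest possible search can be sketched as an open reading frame (ORF) scan. This is only an illustrative sketch, not the method of any program reviewed here: the function name `find_orfs`, the `min_codons` parameter, the use of ATG as the only start codon, and the restriction to the forward strand are all assumptions made for brevity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based and half-open. Each ORF runs from the first
    ATG seen in a reading frame through the next in-frame stop codon
    (the stop codon is included). min_codons counts the start and stop
    codons as well. Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it and keep scanning this frame
    return sorted(orfs)


# Toy sequence invented for this example: ATG AAA TAG in frame 2.
print(find_orfs("CCATGAAATAGGG"))  # [(2, 11)]
```

A scan of this kind says nothing about introns, which is why it does not carry over to the eukaryotic gene structures discussed above.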
Sequence patterns around genes
A key component of the most successful gene-finding
algorithms is the ability to recognize the DNA and RNA
sequence patterns that are critical for transcription,
splicing, and translation. These signals are usually
characterized as short sequence patterns in the genomic
DNA that correspond directly to regions on mRNA or
pre-mRNA that have a key role in splicing or
translation. The signals most commonly used by
computational methods are translational starts and
stops and the splice junctions surrounding introns. If
all of these signals could be detected perfectly, then
the protein coding region could be identified simply by
removing the introns, concatenating all the exons, and
reading off the protein sequence from start to stop.
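The idealized procedure in the preceding sentence (remove the introns, concatenate the exons, read off the protein) can be sketched in a few lines, assuming the signal positions are already known exactly. The function name `splice_and_translate`, the 0-based half-open exon coordinates, the forward-strand restriction, and the toy sequence are assumptions introduced for illustration; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a
               for c, a in zip(product(BASES, repeat=3), AMINO)}


def splice_and_translate(genomic, exons):
    """Join exons and translate the result up to the first stop codon.

    exons is a list of (start, end) pairs, 0-based and half-open, sorted
    along the forward strand, with the first exon beginning at the start
    codon. Returns (coding_sequence, protein). Codons containing
    unexpected characters are translated as 'X'.
    Hypothetical helper for illustration only.
    """
    cds = "".join(genomic[s:e] for s, e in exons).upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return cds, "".join(protein)


# Toy gene invented for this example: two exons around a GT...AG intron.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
cds, protein = splice_and_translate(genomic, [(0, 6), (18, 24)])
print(cds)      # ATGGCCAAATGA
print(protein)  # MAK
```

The hard part of gene finding is producing the `exons` list, not this final step.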
Unfortunately, there is no completely accurate method
to identify any of these signals, although increasingly
complex computational techniques have been