Plant Molecular Biology 48: 59–74, 2002.
© 2002 Kluwer Academic Publishers. Printed in the Netherlands.
Surviving in a sea of data: a survey of plant genome data resources and
issues in building data management systems
, Lukas A. Mueller and Seung Yon Rhee
Carnegie Institution, Department of Plant Biology, 260 Panama Street, Stanford, CA 94305, USA (
correspondence; e-mail firstname.lastname@example.org)
Key words: controlled vocabulary, databases, data management, genomics, information systems, nomenclature
Exponential growth of data, largely from whole-genome analyses, has changed the way biologists think about
and handle data. Optimal use of these data requires effective methods to analyze and manage these data sets.
Computers, software and the World Wide Web are now integral components of biological discovery. Understanding
how information is obtained, processed and annotated in public databases allows researchers to effectively organize,
analyze and export their own data into these databases. In this review we focus largely on two areas related to
management of genomic data. We cite examples of resources available in the public domain and describe some
of the software for data management systems currently available for plant research. In addition, we discuss a
few concepts of data management from the perspective of an individual or group that wishes to provide data to
the public databases, to use the information in the public databases more efficiently, or to develop a database to
manage large data sets internally or for public access. These concepts include data descriptions, exchange format,
curation, attribution, and database implementation.
Biological research during the past decade has
generated an exponential increase in data. For example,
the number of sequences in GenBank increased
from 4 864 490 in 1999 to 10 106 023 in 2000, totaling
11 101 066 288 bp
(http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html).
In addition, exploration-driven methods (e.g. genome
sequencing, gene expression profiling) create large
data sets that often exist with little biological
context, and many of them are published
electronically without peer review.
In order to derive meaning from these large data
sets, tools are required to analyze and identify patterns
in the data, and to allow data to be put into a
biological context. For the tools to be developed and
refined, data must be easily accessible and amenable to
analysis. The analyzed data must be fed back into the
loop so that the data can be re-analyzed, refined and
verified, unexplored areas identified, and new
hypotheses built. The development and maintenance of
systems and procedures that allow the manipulation
of data in the above processes can be defined as data
management. Good data management practices are
fundamental to generators and users of genomic data,
as well as to those who are concerned with the
development of resources for public access (Kaminski, 2000;
Stevens et al., 2001).
This paper is divided into two parts. First, we
describe different types of data management systems
and tools. In the second part, we present issues
relevant to the development of data management systems,
such as nomenclature, controlled vocabulary, data
exchange formats, curation, attribution, conceptual data
modeling, and physical database implementation.
Resources and tools for data management
In a recent survey, biologists were asked to assess the
tasks needed to support the utilization and
analysis of data (Stevens et al., 2001). Of primary