Transforming and Integrating Biomedical Data using Kleisli: A Perspective

S.B. Davidson*, P. Buneman, S. Harker, C. Overton and V. Tannen
Dept. of Computer and Information Science & Center for Bioinformatics
University of Pennsylvania, Philadelphia, PA 19104-6389
*Phone: (215) 898-3490, Fax: (215) 898-0587, {susan,peter,sharker,coverton,val}@cis.upenn.edu

I. Introduction

Building a new database for a field of study in biology involves extracting, transforming, integrating and cleansing data from multiple external data sources, as well as adding new material and annotations. For example, EpoDB [1] is a database created at the University of Pennsylvania Center for Bioinformatics (www.cbil.upenn.edu) to study gene regulation during the differentiation and development of vertebrate red blood cells. In building EpoDB, data relevant to red blood cells were extracted from Genbank, Swissprot, TRRD (transcriptional regulation data) and GERD (expression levels data) and transformed into a relational form. The data were then cleansed of errors using a semi-automated approach. The cleansed data were then integrated, and additional information and annotations were added manually to the integrated data. Because of the cleansing and the value added to the extracted data, databases such as EpoDB are commonly referred to as curated data warehouses, or in database jargon, as materialized