TY - JOUR AU - Sinha, Animesh A. AB - Abstract DNA microarrays represent a technological intersection between biology and computers that enables gene expression analysis in human tissues on a genome-wide scale. This application can be expected to prove extremely valuable for the study of the genetic basis of complex diseases. Despite the enormous promise of this revolutionary technology, there are several issues and possible pitfalls that may undermine the authority of the microarray platform. We discuss some of the conceptual, practical, statistical, and logistical issues surrounding the use of microarrays for gene expression profiling. These issues include the imprecise definition of normal in expression comparisons; the cellular and subcellular heterogeneity of the tissues being studied; the difficulty in establishing the statistically valid comparability of arrays; the logistical logjam in analysis, presentation, and archiving of the vast quantities of data generated; and the need for confirmational studies that address the functional relevance of findings. Although several complicated issues must be resolved, the potential payoff remains large.
Increasing numbers of human diseases, both acquired and genetic, are being considered to be based at least in part on alterations in DNA sequence. For most diseases, inheritance and acquisition are likely to be complex and polygenic. The efforts of the Human Genome Project to elucidate the structural genetic background by identifying the chromosomal positions and genomic organization of between approximately 30 000 and 35 000 human genes are nearly complete.1 Based on this structural knowledge, a byproduct should be a better "scaffolding" to help link specific genes to susceptibility to various human diseases.
However, to understand how the products of these genetic linkages work together to orchestrate the initiation and progression of particular complex diseases, there will be a need to apply a functional genetic rather than a structural genetic approach.2,3 Until recently, functional genetic studies have generally been of limited scope, only able to elucidate the role of 1 or a few genes at a time in 1 system. Information on the specificity and relative abundance of expression products has traditionally been obtained by techniques such as RNA Northern blot hybridization and ribonuclease protection assays. Somewhat more sophisticated methods, such as differential display4 and Serial Analysis of Gene Expression,5 have been used to screen larger numbers of complementary DNA (cDNA) clones. However, technical limitations render these techniques nonconducive to large-scale genetic survey. To this end, a powerful new technology is emerging, using hybridization to nucleotide arrays, the so-called gene chips.6,7 This technological intersection of biology and computers enables the reliable screening of a vast number of genes simultaneously and is amenable to automation. On a nylon membrane or glass surface, gene-specific cDNAs can be spotted, or oligonucleotides can be synthesized in situ by a combination of photolithography and oligonucleotide chemistry. This permits simultaneous monitoring of the expression of thousands of genes in a single step. Individual chips can be customized to include any chosen set of fully or partially characterized genomic or expressed sequences. Chips can monitor over 50 000 unique sequences. The power of these chips lies in the potential for comparative expression studies in diseased vs normal samples, and in documenting changes at different stages during the natural course of the disease or in response to treatment.
It provides the researcher with a new arsenal to analyze underlying pathomechanisms on a grand scale and also to review the rationale of therapeutic concepts. However, despite the enormous potential of this revolutionary technology, there are several issues and possible pitfalls that attenuate the power of microarrays. First, the definition of normal in expression comparisons is neither precise nor unambiguous. Second, the heterogeneity of the tissues being studied complicates the meaning of the expression profiles. Third, the statistically valid comparability of arrays is an unresolved problem. Fourth, the vast quantities of data create a logistical logjam for analysis, presentation, and archiving. Finally, confirmational studies are needed to corroborate the biological significance of microarray data (Figure 1). Trouble with normal The standard normal vs diseased tissue type of comparison, which is the basic design foundation of profiling studies, may be more quicksand than bedrock. Normal is not so easy to define—neither is diseased. Gene expression in normal tissue is likely to be dependent on several factors involving patient and sample variation. These factors will also have an impact on expression profiles of diseased tissue. Patient Variation: Ethnicity, Sex, Age, Genetic Background, Disease States The ethnicity, sex, age, and genetic background of a patient are likely to affect the gene expression profiles of many tissues to varying extents.8,9 A simple example is provided by the expression profiles of genes involved in scalp and body hair follicle activity, which can be expected to vary over a normal range under the influence of all of these sources of patient variation. The effects of these parameters on gene expression are likely to be subtle but pervasive, not fully understood at this time, and quite problematic for defining normal. 
The presence of disease in a subject who is the source of control tissue introduces further potential variability. For example, there may be a significant difference in the conclusions reached by 2 similar microarray expression profiling studies. One may compare genes expressed in a patient's diseased lung tissue with those expressed in normal, nondiseased lung tissue from the same patient, and another may compare genes expressed in the same patient's diseased lung tissue with those expressed in normal, nondiseased lung tissue from a healthy control or normal individual. Moreover, it is also possible that seemingly unrelated disease states may influence gene expression at distant sites. For instance, the presence of diabetes in 1 of 2 renal cancer patients may complicate the direct comparison of renal tissues.
Sample Variation: Proximity to Disease, Anatomic Location, and Developmental Range
Yet another complication derives from the proximity of the normal tissue used as a control for the diseased tissue. Tissue adjacent to an area of disease may not be normal despite absence of evidence of disease clinically or under the light microscope. Normal-appearing tissue near a tumor could, for example, be genotypically altered or exhibit an altered gene expression profile.10-13 Moreover, factors such as the degree of disease-associated inflammation may have a significant impact on gene expression profiles. Other bystander effects, epiphenomena, or secondary disease processes could all play important roles in determining expression profiles within these adjacent, so-called normal tissues. These factors must be considered in the choice of normal.
The precise location within a particular organ may be another important factor that affects gene expression.9 For example, just as location relative to the urethra may influence expression profiles in the prostate,14 skin from the nose, back, and palm are certain to have different expression profiles as well, despite all being from the same organ. Thus, site and specific anatomic location must also be taken into account in a description of normal. It must also be kept in mind that the definition of normal actually represents a dynamic state.14 All tissues, which are composed of early and late-stage cells, have a normal developmental range. For example, normal epithelium in prostatic ducts ranges from atrophic to resting to hyperplastic, and each has a unique pattern of gene expression.14 A 3-dimensional analytic approach is a strategy that has been used to address some of these concerns about defining normal. Cole et al14 used a 3-dimensional model to characterize the entire prostate gland in their study of gene expression profiles in prostate cancer. In this study, whole-mount prostatectomy specimens were divided into transverse cross sections such that the entire prostate gland, including the complete spectrum of normal epithelium and tumor progression, was available for viewing, microdissection, and microarray analysis. This method was used to determine the exact physical relationship of the normal ducts, premalignant lesions, and tumors, thus obtaining an anatomic framework on which to overlay gene expression data. This technique offers several advantages over the normal vs tumor comparison.
Previous studies had used normal epithelium in prostatic ducts as a baseline control against which to compare and contrast tumor gene expression profiles.15,16 However, the expression profile of this normal epithelium is affected by proximity to tumor, location within the gland, and developmental state.14 These factors can be better appreciated using a 3-dimensional approach. Disease-Related Variation Of course, many of the parameters that affect normal expression profiles (patient ethnicity, age, sex, and genetic background, location within an organ, and developmental stage) will also affect disease expression profiles.17-19 Disease heterogeneity, including subtype, activity, severity, stage, and previous as well as current treatments, also may have a significant impact on gene expression.20-25 Categorizing and subgrouping patients on entry into a study may be useful to control for as many of these factors as possible. However, there may be problems surrounding attempts to define microarray-based categorization on the basis of another imperfect categorization system, such as histology, as these groups are sometimes arbitrary or inconsistently designated. Nevertheless, determining whether gene expression profiles correlate with existing clinical or histological categories can provide new insights into the meaning of these categories as can new methods of classifying cancers or other diseases into specific diagnostic categories based on their gene expression signatures. 
Several studies have been able to establish expression-based criteria (class predictors) for preexisting categories and then use these new criteria to categorize new cases (class prediction).26-28 Global profiling may also allow the development of new classification systems based on gene expression alone (class discovery).29,30 Thus, when possible, it will be of value to profile a range of normal and diseased cell populations from a number of patients to distinguish between differences in expression that are relevant to the disease process and those reflecting the biological spectrum of the normal tissue or that have occurred for reasons unrelated to the disease. The significance of this distinction is further appreciated when taking into account the vast quantity of data generated from microarrays and the potential for confounding interpretation from the inclusion of differential expression unrelated to disease processes. It is worth noting, however, that the issues of patient and sample variability are not unique to microarray experiments. In fact, microarray experiments, in contrast to classic single-gene experiments, may actually provide the tools for identifying this heterogeneity. For example, DNA microarrays have been used to explore physiological variation in gene expression on a genomic scale in 60 cell lines derived from diverse tumor tissues.31 Cluster analysis allows the identification of prominent features in gene expression patterns that appear to reflect molecular signatures of the tissue from which the cells originated.31 Heterogeneous cell populations A further complication encountered with expression profiles is that any given tissue is composed of several cell types, members of which are likely to be within a spectrum of dynamic functional states. 
For example, a simple punch biopsy of the skin may contain keratinocytes, melanocytes, Langerhans cells, Merkel cells, adipocytes, smooth muscle cells of arrector pili, striated muscle cells of the panniculus carnosus, blood cells including immune system cells, and cellular elements of blood vessels, nerves, hair follicles, sebaceous glands, and sweat glands. Moreover, cells from each of these populations will be at various stages of development and levels of activation, performing different functions and responding to disease processes or treatments in different ways and to varying extents. The result is a highly heterogeneous sampling of cells, each expressing a unique set of genes. An expression profile generated from a microarray study of the RNA in such a sample will thus represent merely a snapshot of the genes expressed by a plethora of cells at a moment in time. Such extensive cellular heterogeneity complicates the ability to draw conclusions about specific processes occurring within a tissue specimen. An illustrative example is provided by Stanton et al,32 who used microarrays to identify genes differentially expressed during myocardial infarction. The expression profiles they studied represented transcripts from cell populations as diverse as immune system cells, which migrated to the infarct region and are responsible for the inflammatory response, cardiac myocytes within the ischemic area undergoing apoptosis and necrosis, fibroblasts undergoing proliferation and participating in the formation of scar tissue to replace the infarct, and cardiac myocytes undergoing hypertrophy to compensate for the loss of cells in the infarct area.33 The issue of such cellular heterogeneity was avoided by categorizing the differentially expressed genes into functional categories to look for patterns indicative of cardiac remodeling without attempting to attribute specific transcripts to specific cell types. 
For gene expression studies involving samples with mixed cellular populations, further investigation, such as with in situ messenger RNA (mRNA) hybridization, may be necessary to localize the transcripts before conclusions can be drawn about the roles of specific genes in specific cell types during the disease process. Laser Capture Microdissection An ingenious but technically delicate approach to the study of complex biological samples has become possible with the development of laser capture microdissection (Figure 2).34 This technique allows for the rapid and accurate procurement of cells from specific areas of tissue under direct microscopic visualization, and thus makes the molecular genetic analysis of defined populations in their native tissue environment possible.35 Sgroi et al36 demonstrated the feasibility of combining laser capture microdissection with high-throughput cDNA arrays. They showed that in vivo subpopulations of malignant cells from multiple stages of breast cancer progression could be separated from nonmalignant populations, and their expression profiles could subsequently be analyzed using microarrays. The potential is powerful. Specimens could be separated into tissue layers; for example, separating a skin biopsy into epidermis, dermis, and hypodermis. Tissues could be further differentiated into specific structural components, such as dermis into blood vessels, adipose, arrector pili, and sebaceous glands. Structures could be separated into defined cell types, such as blood vessels into endothelial cells, erythrocytes, and lymphocytes. Cell types could even be separated into marker-defined subtypes, such as lymphocytes into CD4 and CD8 cells. Expression profiles from refined and defined structures and cell types likely would be extremely valuable in the study of disease. Potential aside, there are significant limitations to this technology at the present time. 
The standard protocols for fixing and embedding tissue samples from surgical resections were not designed to be compatible with microarray experiments, with or without laser capture microdissection. Typically, tissue suspected of being important for diagnosis and staging is processed through aldehyde-based fixatives, such as formalin, which damage mRNA integrity.37 If frozen tissue is available, mRNA can be recovered and studied from dissected cell populations. However, frozen tissue sections are technically difficult to prepare, the histology is often severely compromised, and the tissue available may contain only a limited portion of the lesion.14 Moreover, the sample amounts generated from laser capture microdissection can be small, even as minuscule as a single cell.38 Consequently, the yields of RNA are low. Arrays have a threshold for the quantity of molecular starting material: at least 5 to 15 µg for oligonucleotide arrays and between 2 and 100 µg for cDNA arrays, depending on the manufacturer, the source of the RNA, and the use of signal amplification.39,40 Studies that have successfully integrated laser capture microdissection with microarray technology have used samples of approximately 1 × 10⁴ to 1 × 10⁵ cells with 95% to 98% homogeneity as determined by microscopic visualization.36,41 If needed, amplification techniques may be used to generate sufficient genetic material for microarray hybridizations.41 Laser capture microdissection is an intriguing technology, but time will tell whether its potential is realized. Although some biological issues related to gene expression may be complicated by the presence of heterogeneous cell populations in studied samples, it is also true that some biological conditions can be understood only in the context of these heterogeneous cell populations. The nature of global gene expression experiments is to uncover differences between 2 biological samples, including those differences based on diverse cell populations.
For example, to appreciate a disease that is characterized by an inflammatory infiltrate, it must be understood that the inflammatory infiltrate is part of the disease and is part of the difference between diseased and nondiseased tissue. Thus, the isolation of specific cell populations for study is not necessarily required or even desirable in all instances.
Making microarrays comparable
Ideally, microarray experiments should be comparable both within and between laboratory or manufacturing systems, but obtaining consistent and comparable data is a critical challenge for microarray-based expression analysis. Major sources for the observed variability of microarray data include the normal physiological gene expression variations in different samples and the noise introduced in the microarray assay process.42
Physiological Gene Expression Variation
Inextricably linked to the issues of patient and sample variability and tissue heterogeneity discussed above is the problem of normal gene expression variations and how to distinguish these variations from significant disease-associated changes. Few studies have systematically investigated physiological expression changes, but data from in situ hybridizations suggest that normal variance for many tightly regulated tissue-specific genes can be within 20% to 30%.42 However, there can be as much as 2- to 4-fold random fluctuations for many genes in yeast.43,44 Affymetrix (Santa Clara, Calif) guidelines have suggested that for most of the "housekeeping" genes in human tissues, which are likely to be less tightly regulated, differences of less than 4-fold are probably not biologically significant.45 Consequently, a significant portion of microarray data variability for high- or medium-abundance mRNAs may simply be due to normal expression variations.
Several previous studies have used arbitrary 2-fold change criteria to define significant expression change.46 However, the 2-fold threshold has been shown to be statistically invalid even for duplicate experiments.46 In a recent study that used cDNA microarrays to profile gene expression in samples of normal skin from breast-reduction surgeries, 71 of 4400 genes were found to demonstrate variability in expression greater than 2 SDs from the mean of each gene.47 These included genes coding for transport proteins, gene transcription, cell-signaling proteins, and cell-surface proteins. Thus, physiological variation should be considered in the analysis and interpretation of microarray data. More stringent criteria for defining significant expression change may be useful. Noise in the Microarray Assay Process For the tightly regulated (mostly low abundance) mRNA species, inconsistencies introduced at any stage of the microarray-based assay process may play a major role in data variability.42 Due to the miniaturization and the large number of genes involved, it is difficult to maintain consistent processing conditions for each sequence across multiple assays, and obtaining accurate absolute signals is unlikely.42 Noise may be introduced by slide heterogeneities, printing irregularities (eg, pin-to-pin variations), and spotting volume fluctuations.48 Some of the systematic variations may be reduced by the inclusion of controls, but random fluctuation at various manufacturing stages cannot be completely controlled and can accumulate quickly in a complicated assay.48 In certain microarray systems, 2 samples are competitively hybridized to 1 array using different fluors for labeling. In other systems, there is only single-sample hybridization. A 2-color system might be expected to be more reliable since variations in spot size or amount of cDNA probe on the chip should not affect the signal ratio (both signals are derived from the same spot). 
However, this only holds true if signals are well above the background in both detection channels.42 In fact, the signal level for most of the tightly regulated genes will likely be close to the background level.42 In addition, background level on a slide can also vary significantly from spot to spot due to factors such as unevenness in slide surface properties, dust contamination, and incomplete washing, leading to high levels of signal variability for low-abundance mRNA species even in 2-color systems.42 The high levels of variability of microarray data also mean that subtle changes in experimental conditions may significantly alter the results, making it difficult for separate laboratories to compare experimental data. In addition, the lack of standard controls, the predominant use of relative signals (ratios), and the adoption of incompatible data formats contribute to poor comparability between studies.42 Despite the hard-wired variability introduced by chip manufacturing conditions, most of the published studies to date using microarray-based expression analysis include only limited numbers of replicates.49 In fact, many studies conduct the experiment only once. Considering the potential sources of assay variation, the need for sufficiently replicated studies is underscored.49 Microarray Data Normalization Because of variability of microarray data for single sample arrays and for further analysis of 2-color system arrays, each must be brought into the same scale to compare 2 or more arrays. How to perform this normalization of gene expression levels across multiple arrays, thus removing systematic variation between the arrays and rendering different experiments comparable, remains an issue that is not yet fully resolved.50 Many of the early microarray studies in the literature simply ignored this issue. A more statistically rigorous approach is needed. 
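The earlier caveat that a 2-color ratio is trustworthy only when both channels sit well above background can be illustrated numerically. The following is a minimal sketch, not a published error model: the helper names, the intensity values, and the size of the background error are all assumptions chosen for illustration.

```python
def background_corrected_ratio(red, green, red_bg, green_bg):
    """Ratio of background-subtracted signals for one spot (hypothetical helper)."""
    r = red - red_bg
    g = green - green_bg
    if r <= 0 or g <= 0:
        return None  # signal cannot be distinguished from background
    return r / g


def ratio_range(red, green, nominal_bg, bg_error):
    """Smallest and largest ratio obtained when the background estimate
    in each channel is off by +/- bg_error (an assumed error size)."""
    candidates = []
    for red_shift in (-bg_error, bg_error):
        for green_shift in (-bg_error, bg_error):
            ratio = background_corrected_ratio(
                red, green, nominal_bg + red_shift, nominal_bg + green_shift
            )
            if ratio is not None:
                candidates.append(ratio)
    return min(candidates), max(candidates)


# Illustrative intensities in arbitrary units (not measured data).
# Both spots have a true background-corrected ratio of 2.
bright = ratio_range(red=10200, green=5200, nominal_bg=200, bg_error=50)
faint = ratio_range(red=400, green=300, nominal_bg=200, bg_error=50)

print("high-abundance spot ratio range:", bright)
print("low-abundance spot ratio range:", faint)
```

Under these assumed numbers, the same background error barely moves the ratio for the bright spot but spreads the ratio for the faint spot across a several-fold range, which is the behavior the text attributes to low-abundance mRNA species.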
One difficulty has been that leading microarray manufacturers have not published statistical error models for their products. Thus, users are unclear how much to adjust data for variations in spot intensity, hybridization patterns, and intensity measurement sensitivity. Software does exist to allow for array-to-array comparisons by using a scaling factor to normalize gene expression patterns across arrays. However, in general, these algorithms assume that intensity differences between arrays are linearly related with a zero y-intercept.51 This assumption allows software to trim the tails off distributions of expression from different arrays at statistical cutoffs and then simply move the distributions along an axis to a common level to provide comparisons. However, this linear relationship often does not hold true.51 When the average expression level of 1 array is higher than that of a second array, a longer tail will be trimmed from the second array. Thus, a greater number of genes from the first array will be counted as being expressed because their expression level is above the statistical cutoff point. In this case, the 2 arrays cannot be considered comparable. Although bioinformatic software has recently been developed that offers more statistically robust normalization, the cost of these commercially available programs (combined with the already expensive microarrays) has been prohibitive for many researchers.50 Standardization of these processes awaits the development of improved methods of normalization leading to valid statistical models widely available to all researchers. To this end, Schadt et al51 have developed a standard nonlinear curve technique for normalizing the data in arrays that do not demonstrate a linear relationship between data sets. This model performs well when the 2 samples being compared demonstrate a low number of differentially expressed genes. 
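The zero-intercept assumption behind scaling-factor normalization can be made concrete with a small example. This is a sketch under stated assumptions, not any vendor's algorithm: the trimmed-mean scaling rule, the function names, and the intensity values are illustrative only.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after discarding the lowest and highest trim_fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_to_reference(reference, target, trim_fraction=0.02):
    """Multiply every target intensity by one factor so that the trimmed
    means of the two arrays agree (assumed form of scaling normalization)."""
    factor = trimmed_mean(reference, trim_fraction) / trimmed_mean(target, trim_fraction)
    return [value * factor for value in target], factor


# Illustrative intensities in arbitrary units (not measured data).
reference = [50, 100, 200, 400, 800, 1600, 3200]
proportional = [2 * v for v in reference]       # linear, zero intercept
with_offset = [2 * v + 300 for v in reference]  # linear, nonzero intercept

scaled_prop, _ = scale_to_reference(reference, proportional)
scaled_off, _ = scale_to_reference(reference, with_offset)

# After normalization, unchanged genes should show a ratio near 1.
ratios_prop = [s / r for s, r in zip(scaled_prop, reference)]
ratios_off = [s / r for s, r in zip(scaled_off, reference)]

print("zero intercept:   ", [round(x, 2) for x in ratios_prop])
print("nonzero intercept:", [round(x, 2) for x in ratios_off])
```

When the second array is an exact multiple of the first, a single factor brings every gene to a ratio of 1. With an additive offset, the same procedure leaves an intensity-dependent bias: low-intensity genes appear induced and high-intensity genes appear repressed even though nothing changed.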
However, when expression profiles of 2 samples vary to a greater extent, Schadt et al51 recommend a rank-selection method. Using this method, genes expressed on an array are ranked from highest to lowest level of expression. Then, for the array expressing a greater number of genes, the genes with the lowest expression levels are removed from the list until the 2 arrays list a comparable number of expressed genes. This type of rank-selection method has gained support from other groups, but it too has limitations.50 Removing low-expression level data points restricts the study to the more extreme and easily detected entities, a technique that blunts the genomic-scale potential of microarray technology. Efforts continue to improve comparability between arrays. Jones50 recently applied a statistical model to normalize spotted cDNA array data that takes into account not only the differences in numbers of genes expressed between arrays, but also the interarray variations in fluorescent dye intensity and mechanical error occurring in the printing process. Nevertheless, the issue of how to properly normalize array data has not been settled. Researchers must continue to demand statistical rigor in their comparisons before they can believe the mathematical results of their data. Logistical logjam Microarrays deliver massive amounts of data on tens of thousands of genes. The result is an immense quantity of biological information that must be analyzed, presented, and archived in a meaningful way. Data Analysis In human studies, the number of hybridizations that can be performed for any set of experimental conditions is often restricted by the limited number of obtainable tissue samples and by the expense of arrays. Restricted numbers of hybridizations for each experiment hamper the ability to assess the biological significance of variation within or between given sets of conditions. 
Thus, for the assessment of thousands of genes in a setting of limited hybridizations, the importance of reliable and sophisticated algorithms for data analysis becomes amplified.51 A logical beginning is to examine the extremes, that is, genes with significant differential expression in individual samples. For example, a comparison of 2 samples can be visualized in the form of a simple bivariate scatterplot in which the expression profile of 1 sample (x-axis) is plotted against that of the second sample (y-axis). The distribution pattern generally demonstrates that the expression ratios cluster around the line in which x is equal to y (indicating comparable levels), with individual genes falling varying distances from this line. Additional lines can be placed on the scatterplot to represent various fold changes of expression. Data points that fall above or below these lines represent genes exhibiting expression ratios greater or less than the specified fold change. Thus, one can begin by examining those genes that demonstrate a 10-fold or greater change in expression level. To expand the number of genes under investigation, one can examine genes that demonstrate a 5-fold or greater change, or a 3-fold or greater change, and so forth. Many studies define a 2-fold or greater change in expression level to represent significant differential expression. The 2-fold threshold, however, as noted above, has been shown to be statistically invalid.46 Although this simple technique can be efficient and effective for focusing on expanding sets of differentially expressed sequences, again, such an analysis does not take advantage of the full potential of genome-scale experiments to enhance our comprehension of cellular biology that would be provided by an inclusive analysis of the entire repertoire of transcripts in a cell as it goes through a biological process.52 A more holistic approach, which allows the deciphering of patterns from the entire data set, is needed. 
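The stepwise fold-change screen described above can be sketched in a few lines. The gene names, intensities, and the floor applied to very low values are assumptions for illustration; the thresholds mirror the 10-, 5-, 3-, and 2-fold steps mentioned in the text, and the caveat about the statistical validity of a fixed 2-fold cutoff still applies.

```python
import math


def log2_ratio(x, y, floor=1.0):
    """log2 of y/x, with both values floored to avoid division by zero
    (the floor value is an assumption, not a recommended setting)."""
    return math.log2(max(y, floor) / max(x, floor))


def genes_beyond_fold(sample_x, sample_y, fold):
    """Genes whose expression differs by at least `fold` in either direction,
    i.e. points outside the fold-change lines on the scatterplot."""
    threshold = math.log2(fold)
    selected = {}
    for gene, x in sample_x.items():
        ratio = log2_ratio(x, sample_y[gene])
        if abs(ratio) >= threshold:
            selected[gene] = ratio
    return selected


# Illustrative intensities in arbitrary units (not measured data).
sample_x = {"geneA": 100, "geneB": 100, "geneC": 100, "geneD": 1000, "geneE": 100}
sample_y = {"geneA": 1200, "geneB": 600, "geneC": 350, "geneD": 400, "geneE": 110}

# Relax the threshold step by step to widen the set under investigation.
for fold in (10, 5, 3, 2):
    hits = genes_beyond_fold(sample_x, sample_y, fold)
    print(f"{fold}-fold or greater:", sorted(hits))
```

Working in log2 space treats induction and repression symmetrically, so a gene that falls 2.5-fold is captured by the same threshold as one that rises 2.5-fold.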
Data Organization and Presentation Statistical algorithms can be applied to detect and extract patterns within profiling data. It is a basic assumption of many gene expression studies that knowledge of where and when a gene is expressed provides information about the function of the gene. Therefore, an important beginning is to organize genes on the basis of similarities in their expression profiles.53 However, even this basic tenet deserves critical consideration. Similarity of gene expression profile does not mandate similarity of function or mechanistic pathway, and it may be purely coincidental. Nevertheless, the idea of clustering genes on the basis of their expression patterns is well established and cluster analysis has become the most widely used statistical technique applied to large-scale gene expression data.52 Although various cluster methods can usefully organize tables of gene expression measurements, the resulting ordered but still massive collection of numbers remains difficult to assimilate. Thus, another important component of genome-wide expression data exploration is the development of powerful data visualization methods and tools. Approaches have been developed that present clustering results in simple graphical displays such as dendrograms, which represent relationships among genes by a tree whose branch lengths reflect the degree of similarity in expression between the genes. Similarity is mathematically defined.54 The computed trees can be used to order genes in the original data table such that genes or groups of genes with similar expression level patterns are placed adjacently. Clustering methods can also be combined with representation of each data point with a color that quantitatively and qualitatively reflects the original experimental observations.52 Visual assimilation is then more intuitive. 
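To make the clustering and reordering step concrete, the sketch below groups genes by average-linkage hierarchical clustering, using 1 minus the Pearson correlation as the distance. This is one common choice rather than the specific similarity definition of the cited work, and the gene names and expression values are invented for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_linkage(profiles):
    """Agglomerative clustering with distance = 1 - Pearson correlation.
    Returns the leaf order of the final tree and the merge history, whose
    distances play the role of branch heights in a dendrogram."""
    names = list(profiles)
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0], merges


# Illustrative log-ratio profiles over 4 conditions (not measured data),
# deliberately listed so that similar genes are not adjacent.
profiles = {
    "g1": [0.1, 1.0, 2.0, 3.0],
    "g3": [3.0, 2.0, 1.0, 0.0],
    "g2": [0.0, 1.1, 2.1, 2.9],
    "g4": [2.9, 2.2, 0.9, 0.1],
}

order, merges = average_linkage(profiles)
print("row order for the reordered table:", order)
for left, right, height in merges:
    print(left, "+", right, "joined at distance", round(height, 3))
```

The returned leaf order places co-regulated genes in adjacent rows, which is the reordering of the data table described above; a production analysis would normally rely on an established statistics library rather than this quadratic toy implementation.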
Data Archiving and Mining Ultimately, successful interpretation of gene profiling studies is likely to be dependent on the integration of experimental data with external information resources. As multiple experiments involving multiple cell types and tissues from multiple laboratory groups accumulate, data archiving may well become the watershed issue. Ideally, all data, in a suitably standardized form, would be freely accessible in the public domain. Even assuming a willingness to share the data, such utopian goals would require a user-friendly and powerful database system and standardization of correction and normalization procedures such that data points from various projects become comparable.55 The National Center for Biotechnology Information Entrez system (http://www.ncbi.nlm.nih.gov/Entrez/) does provide useful data in this regard, but current databases may be limited in scope or computability.53 A major focus of infrastructure development to support genomic-scale expression studies will need to be in the area of electronic biological pathway databases and resources. Confirmational studies The development of more sophisticated analytical algorithms and databases will help lend credence to the biological significance of differential gene expression determined by microarray analysis. In the meantime, several studies have begun to examine the sensitivity and specificity of microarray-based experiments. 
Sensitivity, defined as the minimum reproducible signal detected by a given array scanning system, has been reported for microarrays to be approximately 10 mRNA copies per cell, which is slightly inferior to the sensitivity of Northern blot analyses.56,57 Specificity studies showed that for a given probe any nontarget transcripts with more than 75% sequence similarity may show cross-hybridization.56 The problem of clone misidentification and the need for clone confirmation have also been addressed.58 One study found that of 1189 bacterial stock cultures, only 62.2% were uncontaminated and contained cDNA inserts that had significant sequence identity with published data for the ordered clones.59 Thus, the use of sequence-verified clones for cDNA microarray construction is warranted. Additionally, potential gene candidates can be assessed for relevance to disease using parallel technologies. Several such alternative platforms have been used to bolster the importance of specific sequences first suggested in gene chip comparisons including (1) methods at the RNA level, (2) methods at the protein level, and (3) functional studies.
Methods at the RNA Level
Reverse transcriptase polymerase chain reaction (RT-PCR) is a method often used to verify microarray data. Although RT-PCR is not well suited to quantitation, the relative technological ease of this assay and the ability to rapidly monitor multiple samples make it a useful technology.60,61 Hybridization data can be verified and multiple putative markers can be screened in a short period. Several other studies have used real-time quantitative RT-PCR (TaqMan, PE Applied Biosystems, Foster City, Calif).15,62 Real-time PCR is a technique that increases the quantitative ability of RT-PCR by providing accurate and reproducible information on RNA copy number (Figure 3).
In this method, a fluorogenic probe (labeled at the 5′ end with a reporter fluorochrome and at the 3′ end with a quencher fluorochrome) is annealed to 1 strand of the target cDNA sequence between the forward and reverse PCR primers. As Taq polymerase extends the forward primer, its intrinsic 5′ to 3′ nuclease activity displaces and degrades the dual-labeled probe, releasing the reporter fluorochrome from the quencher label and allowing the detection of a fluorescent signal that is proportional to the amount of PCR product generated in each cycle.63 Northern blot analysis is also commonly used as a confirmational technique, as it is a standard, specific, and semiquantitative method.15,57,61 For mRNA expressed at moderate-to-high levels, and for which cDNA probes are available, Northern blot analysis works well, but it is not well suited for low-copy mRNA.64,65 Furthermore, only a small number of genes can be analyzed with this conventional method.
Methods at the Protein Level
DNA microarray technology is limited to the study of gene expression at the mRNA level. However, it has been established that mRNA levels do not necessarily correlate with protein levels. Moreover, the level of expression or even presence of a protein is not tightly linked to physiological consequences. An investigation conducted by Winzeler et al,66 for example, provides a cautionary tale. Their study demonstrated that genes upregulated in yeast growing in minimal medium did not prove to be more important for growth than genes that were not upregulated.33 They found only 2 of 8 genes required for yeast growth in minimal medium to be induced. The lesson to be learned is that genes that are not differentially expressed may be of equal functional importance in disease states compared with those that are. Furthermore, the regulation of some genes may be at the translational rather than the transcriptional level, which would preclude detection by DNA microarrays.
Posttranslational modification of proteins is also an important mode of regulation that cannot be detected by DNA microarrays. Protein activity, particularly receptor activity, is heavily dependent on phosphorylation, for example. DNA and mRNA reveal nothing about whether a given protein is active, and can be deceptive when used to speculate about quantities of proteins. It has been demonstrated that the correlation between mRNA and protein abundance is less than 0.5,67 emphasizing that ideally, mRNA expression studies should be accompanied by analyses at the protein level.39 Radioimmunoassay and immunohistochemistry have been used in a number of studies.15,68,69 These techniques, however, are not well suited to detecting low levels of expression, and they require the availability of an antibody specific for the protein to be studied. The field of proteomics, the large-scale parallel analysis of the proteins that are present in a cell, is developing rapidly, but has problems of its own. Proteins vary in abundance by many orders of magnitude within a given cell, and there is no PCR equivalent for the amplification of proteins. Moreover, proteins fold in many known (and unknown) ways that affect their function. The feasibility of the microarray analysis of proteins has begun to be explored. Antibodies attached to microarrays can be used to bind to and quantitatively detect proteins that have been tagged with fluorescent dyes.70 Skeptics doubt the plausibility of identifying thousands of unknown proteins in this manner.70 The diverse chemistry of various proteins poses serious difficulties, and it will be challenging to find antibodies for every protein. Thus, although it is important to incorporate protein analyses into expression profiling studies, current platforms are technically limiting. 
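The kind of mRNA-to-protein correlation mentioned above can be sketched in a few lines. This is an illustration only: the paired abundance values are invented, the use of the Pearson coefficient on log-transformed values is an assumption (the source does not state how the reported correlation was computed), and the helper `pearson` is hypothetical.

```python
from math import log10, sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists.

    Assumes neither list is constant (a constant list has zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical paired measurements for six genes (invented data):
# mRNA copies per cell and protein molecules per cell.
mrna_copies = [2, 15, 40, 120, 300, 900]
protein_copies = [5000, 800, 200000, 1500, 90000, 30000]

# Abundances span several orders of magnitude, so compare them on a log scale.
log_mrna = [log10(v) for v in mrna_copies]
log_protein = [log10(v) for v in protein_copies]

r = pearson(log_mrna, log_protein)
print(f"mRNA vs protein correlation (log scale): r = {r:.2f}")
```

A coefficient well below 1 in such a comparison means that ranking genes by transcript level would only loosely predict their ranking by protein level, which is the reason given above for pairing mRNA studies with protein-level analyses.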
Functional Studies
Confirming the role of a gene initially identified in a microarray experiment in animal models with transgene or knockout studies provides a particularly powerful alternative platform. Transcript function, rather than mere presence, is addressed. However, this approach is ill-suited for high-throughput conditions. It may be ideal for an in-depth investigation of 1 or 2 genes of interest, but it is not practical for confirming large quantities of profiling data. Confirmational studies are useful to corroborate the biological significance of differential gene expression determined by microarray analysis. While improved databases and more reliable statistical models will help to lend greater authority to array data, alternative platforms can be used to assess the relevance of genes first identified by array comparisons. It should be realized, however, that the alternative technologies are not intended for large-scale analyses. Realistically, only selected sequences from the array data can be confirmed with other platforms in the short term, a retreat from the initial purpose of the genome-scale investigation by microarray.
Conclusions
Microarrays can be expected to prove extremely valuable as tools for the study of the genetic basis of complex diseases. The ability to measure expression profiles across entire genomes provides a level of information not previously attainable. Although complicated issues must be resolved, the potential payoff is large. Microarrays make it possible to investigate differential gene expression in normal vs diseased tissue, in treated vs nontreated tissue, and in different stages during the natural course of a disease, all on a genomic scale. Gene expression profiles may help to unlock the molecular basis of phenotype, response to treatment, and heterogeneity of disease.
They may also help to define patterns of expression that will aid in diagnosis as well as define susceptibility loci that may lead to the identification of individuals at risk. Finally, as specific genes are identified and their functional roles in the development and course of disease are characterized, new targets for therapy should be identified. Despite the problems of defining normal, understanding tissue heterogeneity, making arrays comparable, analyzing and archiving massive quantities of data, and performing confirmational studies in alternative platforms, expression profiling with microarrays stands as a truly revolutionary technology. As we continue to delve into the possibilities, we will surely progress in our understanding of current issues and complications. No doubt the ride on the high-throughput highway will be exhilarating.
References
1. Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science. 2001;291:1304-1351.
2. Fields S. The future is function. Nat Genet. 1997;15:325-327.
3. Lander ES. The new genomics. Science. 1996;274:536-539.
4. Liang P, Pardee AB. Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science. 1992;257:967-971.
5. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270:484-487.
6. Lockhart DJ, Dong H, Byrne MC, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996;14:1675-1680.
7. Strachan T, Abitbol M, Davidson D, Beckmann JS. A new dimension for the human genome project. Nat Genet. 1997;16:126-132.
8. Nishimoto IN, Hanaoka T, Sugimura H, et al. Cytochrome P450 2E1 polymorphism in gastric cancer in Brazil. Cancer Epidemiol Biomarkers Prev. 2000;9:675-680.
9. Furuya KN, Gebhardt R, Schuetz EG, Schuetz JD. Isolation of rat pgp3 cDNA. Biochim Biophys Acta. 1994;1219:636-644.
10. Deng G, Lu Y, Zlotnikov G, et al. Loss of heterozygosity in normal tissue adjacent to breast carcinomas. Science. 1996;274:2057-2059.
11. Zhuang Z, Vortmeyer AO, Mark EJ, et al. Barrett's esophagus. Cancer Res. 1996;56:1961-1964.
12. Hung J, Kishimoto Y, Sugio K, et al. Allele-specific chromosome 3p deletions occur at an early stage in the pathogenesis of lung carcinoma. JAMA. 1995;273:558-563. [published correction appears in JAMA. 1995;273:1908].
13. Shimada S, Shiomori K, Tashima S, et al. Frequent p53 mutation in brain (fetal)-type glycogen phosphorylase positive foci adjacent to human 'de novo' colorectal carcinomas. Br J Cancer. 2001;84:1497-1504.
14. Cole KA, Krizman DB, Emmert-Buck MR. The genetics of cancer—a 3D model. Nat Genet. 1999;21:38-41.
15. Xu J, Stolk JA, Zhang X, et al. Identification of differentially expressed genes in human prostate cancer using subtraction and microarray. Cancer Res. 2000;60:1677-1682.
16. Elek J, Park KH, Narayanan R. Microarray-based expression profiling in prostate tumors. In Vivo. 2000;14:173-182.
17. Chia SJ, Tang WY, Elnatan J, et al. Prostate tumours from an Asian population. Br J Cancer. 2000;83:761-768.
18. Dong M, Nio Y, Tamura K, et al. Ki-ras point mutation and p53 expression in human pancreatic cancer. Cancer Epidemiol Biomarkers Prev. 2000;9:279-284.
19. Pettaway CA. Racial differences in the androgen/androgen receptor pathway in prostate cancer. J Natl Med Assoc. 1999;91:653-660.
20. Hata K, Fujiwaki R, Nakayama K, Miyazaki K. Expression of the endostatin gene in epithelial ovarian cancer. Clin Cancer Res. 2001;7:2405-2409.
21. Saida T. Recent advances in melanoma research. J Dermatol Sci. 2001;26:1-13.
22. Hegde U, Wilson WH. Gene expression profiling of lymphomas. Curr Oncol Rep. 2001;3:243-249.
23. Liu L, Yang K. A study on C-erbB2, nm23 and p53 expressions in epithelial ovarian cancer and their clinical significance [in Chinese]. Zhonghua Fu Chan Ke Za Zhi. 1999;34:101-104.
24. Zhang Z, DuBois RN. Detection of differentially expressed genes in human colon carcinoma cells treated with a selective COX-2 inhibitor. Oncogene. 2001;20:4450-4456.
25. Oguri T, Isobe T, Fujitaka K, et al. Association between expression of the MRP3 gene and exposure to platinum drugs in lung cancer. Int J Cancer. 2001;93:584-589.
26. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer. Science. 1999;286:531-537.
27. Khan J, Wei JS, Ringner M, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001;7:673-679.
28. Zhang H, Yu CY, Singer B, Xiong M. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci U S A. 2001;98:6730-6735.
29. Bittner M, Meltzer P, Chen Y, et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature. 2000;406:536-540.
30. Welsh JB, Zarrinkar PP, Sapinoso LM, et al. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci U S A. 2001;98:1176-1181.
31. Ross DT, Scherf U, Eisen MB, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000;24:227-235.
32. Stanton LW, Garrard LJ, Damm D, et al. Altered patterns of gene expression in response to myocardial infarction. Circ Res. 2000;86:939-945.
33. Abdellatif M. Leading the way using microarray. Circ Res. 2000;86:919-920.
34. Bonner RF, Emmert-Buck M, Cole K, et al. Laser capture microdissection. Science. 1997;278:1481,1483.
35. Emmert-Buck MR, Bonner RF, Smith PD, et al. Laser capture microdissection. Science. 1996;274:998-1001.
36. Sgroi DC, Teng S, Robinson G, et al. In vivo gene expression profile analysis of human breast cancer progression. Cancer Res. 1999;59:5656-5661.
37. Klimecki WT, Futscher BW, Dalton WS. Effects of ethanol and paraformaldehyde on RNA yield and quality. Biotechniques. 1994;16:1021-1023.
38. Dolter KE, Braman JC. Small-sample total RNA purification. Biotechniques. 2001;30:1358-1361.
39. van Hal NL, Vorst O, van Houwelingen AM, et al. The application of DNA microarrays in gene expression analysis. J Biotechnol. 2000;78:271-280.
40. Burgess JK. Gene expression studies using microarrays. Clin Exp Pharmacol Physiol. 2001;28:321-328.
41. Kitahara O, Furukawa Y, Tanaka T, et al. Alterations of gene expression during colorectal carcinogenesis revealed by cDNA microarrays after laser-capture microdissection of tumor tissues and normal epithelia. Cancer Res. 2001;61:3544-3549.
42. Watson SJ, Meng F, Thompson RC, Akil H. The "chip" as a specific genetic tool. Biol Psychiatry. 2000;48:1147-1156.
43. Cho RJ, Campbell MJ, Winzeler EA, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell. 1998;2:65-73.
44. Klevecz RR, Kauffman SA, Shymko RM. Cellular clocks and oscillators. Int Rev Cytol. 1984;86:97-128.
45. Warrington JA, Nair A, Mahadevappa M, Tsyganskaya M. Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol Genomics. 2000;2:143-147.
46. Claverie JM. Computational methods for the identification of differential and coordinated gene expression. Hum Mol Genet. 1999;8:1821-1832.
47. Cole J, Tsou R, Wallace K, et al. Comparison of normal human skin gene expression using cDNA microarrays. Wound Repair Regen. 2001;9:77-85.
48. Schuchhardt J, Beule D, Malik A, et al. Normalization strategies for cDNA microarrays. Nucleic Acids Res. 2000;28:E47.
49. Lee ML, Kuo FC, Whitmore GA, Sklar J. Importance of replication in microarray gene expression studies. Proc Natl Acad Sci U S A. 2000;97:9834-9839.
50. Jones MM. Researchers attempt to defuse the microarray data minefield. Available at: http://www.genomeweb.com/articles/view.asp?Article=200142175122. Accessibility verified October 8, 2001.
51. Schadt EE, Li C, Su C, Wong WH. Analyzing high-density oligonucleotide gene expression array data. J Cell Biochem. 2000;80:192-202.
52. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95:14863-14868.
53. Bassett DE Jr, Eisen MB, Boguski MS. Gene expression informatics—it's all in your mine. Nat Genet. 1999;21:51-55.
54. Alon U, Barkai N, Notterman DA, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 1999;96:6745-6750.
55. Granjeaud S, Bertucci F, Jordan BR. Expression profiling. Bioessays. 1999;21:781-790.
56. Kane MD, Jatkoe TA, Stumpf CR, et al. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 2000;28:4552-4557.
57. Taniguchi M, Miura K, Iwao H, Yamanaka S. Quantitative assessment of DNA microarrays—comparison with Northern blot analyses. Genomics. 2001;71:34-39.
58. Bowtell DD. Options available—from start to finish—for obtaining expression data by microarray. Nat Genet. 1999;21:25-32.
59. Halgren RG, Fielden MR, Fong CJ, Zacharewski TR. Assessment of clone identity and sequence fidelity for 1189 IMAGE cDNA clones. Nucleic Acids Res. 2001;29:582-588.
60. Ichikawa JK, Norris A, Bangera MG, et al. Interaction of Pseudomonas aeruginosa with epithelial cells. Proc Natl Acad Sci U S A. 2000;97:9659-9664.
61. Wang K, Gan L, Jeffery E, et al. Monitoring gene expression profile changes in ovarian carcinomas using cDNA microarray. Gene. 1999;229:101-108.
62. Wong KK, Cheng RS, Mok SC. Identification of differentially expressed genes from ovarian cancer cells by MICROMAX cDNA microarray system. Biotechniques. 2001;30:670-675.
63. Grove DS. Quantitative real-time polymerase chain reaction for the core facility using TaqMan and the Perkin-Elmer/Applied Biosystems Division 7700 Sequence Detector. J Biomol Techniques [serial online]. 1999;10. Available at: http://www.abrf.org/JBT/1999/March99/mar99grove.html. Accessibility verified October 9, 2001.
64. Raval P. Qualitative and quantitative determination of mRNA. J Pharmacol Toxicol Methods. 1994;32:125-127.
65. Jung R, Soondrum K, Neumaier M. Quantitative PCR. Clin Chem Lab Med. 2000;38:833-836.
66. Winzeler EA, Shoemaker DD, Astromoff A, et al. Functional characterization of the S cerevisiae genome by gene deletion and parallel analysis. Science. 1999;285:901-906.
67. Garber K. Proteomics gears up. Signals [serial online]. 1999. Available at: http://www.signalsmag.com/signalsmag.nsf/0/F8A34B7EFDE4EB6C8825681C000B8A96. Accessibility verified October 16, 2001.
68. Shirota Y, Kaneko S, Honda M, et al. Identification of differentially expressed genes in hepatocellular carcinoma with cDNA microarrays. Hepatology. 2001;33:832-840.
69. Storz M, Zepter K, Kamarashev J, et al. Coexpression of CD40 and CD40 ligand in cutaneous T-cell lymphoma (mycosis fungoides). Cancer Res. 2001;61:452-454.
70. Dalton R, Abbott A. Can researchers find recipe for proteins and chips? Nature. 1999;402:718-719.
TI - Gene Expression Profile Analysis by DNA Microarrays: Promise and Pitfalls JF - JAMA DO - 10.1001/jama.286.18.2280 DA - 2001-11-14 UR - https://www.deepdyve.com/lp/american-medical-association/gene-expression-profile-analysis-by-dna-microarrays-promise-and-46Ym49Eb7X SP - 2280 EP - 2288 VL - 286 IS - 18 DP - DeepDyve ER -