Access the full text.
Sign up today, get DeepDyve free for 14 days.
Petra Platzer, M. Upender, Keith Wilson, J. Willis, J. Lutterbaugh, Arman Nosrati, J. Willson, David Mack, T. Ried, S. Markowitz (2002)
Silence of chromosomal amplifications in colon cancer.Cancer research, 62 4
C. Papadimitriou (1977)
The Euclidean Traveling Salesman Problem is NP-CompleteTheor. Comput. Sci., 4
Z. Bar-Joseph, D. Gifford, T. Jaakkola (2001)
Fast optimal leaf ordering for hierarchical clusteringBioinformatics, 17 Suppl 1
J. Lepre, J. Rice, Y. Tu, G. Stolovitzky (2004)
Genes@Work: an efficient algorithm for pattern discovery and multivariate feature selection in gene expression dataBioinformatics, 20 7
P. Spellman, G. Sherlock, Michael Zhang, V. Iyer, Klaus Anders, M. Eisen, P. Brown, D. Botstein, B. Futcher (1998)
Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.Molecular biology of the cell, 9 12
T. Rozovskaia, O. Ravid-Amir, S. Tillib, Godfrey Getz, E. Feinstein, Himanshu Agrawal, A. Nagler, E. Rappaport, Irina Issaeva, Y. Matsuo, U. Kees, T. Lapidot, F. Coco, R. Foa, A. Mazo, Tomonori Nakamura, C. Croce, G. Cimino, E. Domany, E. Canaani (2003)
Expression profiles of acute lymphoblastic and myeloblastic leukemias with ALL-1 rearrangementsProceedings of the National Academy of Sciences of the United States of America, 100
M. Eisen, P. Spellman, P. Brown, D. Botstein (1998)
Cluster analysis and display of genome-wide expression patterns.Proceedings of the National Academy of Sciences of the United States of America, 95 25
R. Maul, D. Chang (1999)
EPLIN, Epithelial protein lost in neoplasmOncogene, 18
G. Getz, Erel Levine, E. Domany (2000)
Coupled two-way clustering analysis of gene microarray data.Proceedings of the National Academy of Sciences of the United States of America, 97 22
A. Ben-Dor, B. Chor, R. Karp, Z. Yakhini (2002)
Discovering local structure in gene expression data: the order-preserving submatrix problemJournal of computational biology : a journal of computational molecular cell biology, 10 3-4
A. Frieze, J. Yadegar (1983)
On the quadratic assignment problemDiscret. Appl. Math., 5
U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mach, A. Levine (1999)
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.Proceedings of the National Academy of Sciences of the United States of America, 96 12
T. Koopmans, M. Beckmann (1957)
Assignment Problems and the Location of Economic ActivitiesEconometrica, 25
P. Behrens, U. Brinkmann, A. Wellmann (2004)
CSE1L/CAS: Its role in proliferation and apoptosisApoptosis, 8
A. Ben-Dor, B. Chor, R. Karp, Z. Yakhini (2003)
Discovering local structure in gene expression data: the order-preserving submatrix problem.Journal of Computational Biology, 10
O. Alter, P. Brown, D. Botstein (2000)
Singular value decomposition for genome-wide expression data processing and modeling.Proceedings of the National Academy of Sciences of the United States of America, 97 18
(2004)
A novel mathematical approach to analyzing gene expression data: results from an international colon cancer consortium
D. Notterman, U. Alon, A. Sierk, A. Levine (2001)
Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays.Cancer research, 61 7
Sridhar Ramaswamy, P. Tamayo, R. Rifkin, Sayan Mukherjee, Chen-Hsiang Yeang, Michael Angelo, C. Ladd, Michael Reich, É. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander, T. Golub (2001)
Multiclass cancer diagnosis using tumor gene expression signaturesProceedings of the National Academy of Sciences of the United States of America, 98
Wolfram Liebermeister (2002)
Linear modes of gene expression determined by independent component analysisBioinformatics, 18 1
S. Dudoit, J. Fridlyand, T. Speed (2002)
Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression DataJournal of the American Statistical Association, 97
D. Ghosh (2004)
Mixture models for assessing differential expression in complex tissues using microarray dataBioinformatics, 20 11
(1969)
An algorithm for the solution of the assignment problem
W. Pan (2002)
A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experimentsBioinformatics, 18 4
Y. Benjamini, Y. Hochberg (1995)
Controlling the false discovery rate: a practical and powerful approach to multiple testingJournal of the royal statistical society series b-methodological, 57
K. Birkenkamp-Demtroder, L. Christensen, S. Olesen, C. Frederiksen, P. Laiho, L. Aaltonen, S. Laurberg, F. Sørensen, Rikke Hagemann, T. Ørntoft (2002)
Gene expression in colorectal cancer.Cancer research, 62 15
Vol. 21 no. 10 2005, pages 2301–2308 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bti329 Phylogenetics Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices 1,∗ 1 1 1 2 1 D. Tsafrir , I. Tsafrir , L. Ein-Dor ,O.Zuk , D.A. Notterman and E. Domany 1 2 Department of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel and Department of Pediatrics and Molecular Genetics, UMDNJ-Robert Wood Johnson Medical School, New Brunswick, NJ 08903, USA Received on November 23, 2004; revised on January 27, 2005; accepted on February 12, 2005 Advance Access publication February 18, 2005 ABSTRACT by clear, abrupt changes between discrete states. For such cases Summary: We introduce a novel unsupervised approach for the clustering algorithms, whose aim is to partition the data into several organization and visualization of multidimensional data. At the heart of distinct groups, fail to capture the gradual nature of the phenomena. the method is a presentation of the full pairwise distance matrix of the For example, when studying the evolution of a certain disease one data points, viewed in pseudocolor. The ordering of points is iteratively expects the existence of continuous variables associated with disease permuted in search of a linear ordering, which can be used to study progression. Therefore, any attempt to place a sharp division on such embedded shapes. Several examples indicate how the shapes of cer- an inherently continuous phenomenon is doomed to be somewhat tain structures in the data (elongated, circular and compact) manifest arbitrary. themselves visually in our permuted distance matrix. It is important to The problem of uncovering and presenting continuous trajectories identify the elongated objects since they are often associated with a and variables led us to develop sorting points into neighborhoods set of hidden variables, underlying continuous variation in the data. (SPIN), an unsupervised sorting method, very different in spirit, The problem of determining an optimal linear ordering is shown to be philosophy and implementation from clustering. SPIN uses an iter- NP-Complete, and therefore an iterative search algorithm with O(n ) ative process to find an informative permutation of the data points. step-complexity is suggested. By using sorting points into neighbor- This is challenging since permutation space is factorially large, hoods, i.e. SPIN to analyze colon cancer expression data we were containing a very small measure of meaningful orderings (needle able to address the serious problem of sample heterogeneity, which in a haystack problem). SPIN’s intuitively color coded image of hinders identification of metastasis related genes in our data. Our the reordered distance matrix uncovers elongated structures that methodology brings to light the continuous variation of heterogeneity— reveal the existence of continuous variables that govern the vari- starting with homogeneous tumor samples and gradually increasing ation in the data. Our ordering approach is especially appropriate the amount of another tissue. Ordering the samples according to their for studying scenarios characterized by the accumulation of gradual degree of contamination by unrelated tissue allows the separation of changes, since it excels at tracking progression. There exist other genes associated with irrelevant contamination from those related to methods that search efficiently for informative submatrices in expres- cancer progression. sion data (Ben-Dor et al., 2003; Getz et al., 2000; Lepre et al., Availability: Software package will be available for academic users 2004). One of these (Ben-Dor et al., 2003) is limited to the case upon request. when the expression levels of the selected genes vary monoton- Contact: fedafna@wisemail.weizmann.ac.il ously over the ordered samples, whereas SPIN uncovers with equal Supplementary information: http://www.weizmann.ac.il/physics/ ease a multidimensional subspace of genes in which the samples complex/compphys/spin trace a complicated trajectory, along which not a single gene var- ies monotonously. Another method (Lepre et al., 2004) captures efficiently a group of samples that forms a tight sphere in the spe- INTRODUCTION cial subspace of genes, but is not designed to capture continuous Exploratory data analysis is critical in a broad range of research areas, variation. where large collections of data need to be meaningfully arranged and The purpose of identifying shapes is to gain insight into the under- presented. It is especially relevant in biology, where the past decade lying process, such as the continuous nature of cell differentiation or has witnessed an explosion in data production, largely attributed to the closed loop formed by cells along different stages in the yeast cell the widespread use of high-throughput technologies, such as gene- cycle. In our main application we demonstrate a possible resolution expression arrays. One major challenge in the analysis of large-scale of a well-known problem in microarray measurements, namely that expression data is effective data organization and visualization (Eisen of sample heterogeneity. Previous expression-array based studies of et al., 1998), where typical goals include class discovery and feature cancer (Alon et al., 1999) recognized the issue of variability in tis- extraction (Ramaswamy et al., 2001). However, in many cases the sue composition of samples, and showed that some of the variation data is characterized by inherently gradual progression rather than in expression between normal and cancer tissue can be attributed to such causes. This may result in identification of differentially To whom correspondence should be addressed. expressed genes that are unrelated to the biological agenda, wrongly © The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org 2301 D.Tsafrir et al. implicating them with cancer. We demonstrate how SPIN can be from the other side. The most obvious fingerprint of a circular object used to distinguish disease related genes from irrelevant ones. is the appearance of blue at the corners far from the main diagonal We start with the concept and intuition of SPIN, explaining how of the distance matrix. to infer shape characteristics from SPIN permuted distance matrices. For a real-world application consider the yeast elutriation- This is done by going through a set of examples of increasing com- synchronized cell-cycle expression data (taken from Spellman et al., plexity, including toy models as well as real biological data. Next, we 1998). Spellman et al. employed a supervised phasing method to explain how the algorithm works, and the rationale behind a specific assign genes to five known classes, namely G , S, S/G ,G /M and 1 2 2 implementation. We prove that the problem the algorithm attempts to M/G , utilizing the expression profiles of genes that were previously solve is NP-hard, and also show that our heuristic quickly converges known to participate in specific phases of the cell cycle. They then 2 3 (O(n ) − O(n )) to useful solutions. We stress the general applic- proceeded to perform unsupervised analysis, specifically hierarch- ability of SPIN to any dataset where a dissimilarity metric between ical clustering, and found that most genes belonging to the same points can be defined. Finally, we describe an application to analysis class were clustered together. Here we simply analyzed the most of large-scale gene-array data, specifically addressing the issue of highly varying transcripts, as monitored during the progression of disease progression versus sample contamination in colon cancer. cell cycle, and proceeded to calculate their pairwise dissimilarity matrix. As seen in Figure 1b (plate 4), the permuted distance matrix contains the signature of a ring. Assigning such a cyclic nature to DEMONSTRATING THE CONCEPT genes associated with cell cycle is in accordance with known bio- Ferreting elongation logical dynamics and functions. This example highlights the ease Our methodology assumes that a distance matrix D, can be defined, of ordering gene-expression data in SPIN, and the informative and whose element D represents the dissimilarity between points i and j intuitive nature of the color-enhanced output. Previous studies have ij (Euclidean distance was used throughout this paper). Starting with a recognized the inherent cyclic nature of this dataset (Alter et al., random ordering of data points, the corresponding initial unordered 2000), but required several stages of data manipulation and normal- distance matrix is impossible to interpret. However, the permuted ization, followed by a manual ordering using the PCA projection to image, obtained after reordering the data by SPIN, is highly inform- convey the results that are easily captured in SPIN. ative. For our first example consider points uniformly distributed Multiple clusters within a cylinder, as presented in Figure 1a (plate 1). SPIN orders the points from one end of the cylinder to the other, such that the cor- The most common approach to analyzing a dataset composed of mul- respondingly permuted distance matrix has a characteristic pattern, tiple objects is clustering. However, by emphasizing the partitioning as seen in Figure 1a (plate 3); the elements near the main diagonal of data, the clustering approach neglects the issue of elucidating the stand for short distances (colored blue), with a clear gradient of shapes of embedded objects in multidimensional data. SPIN, on the increasing distances (colors vary from blues to reds) as one moves other hand, focuses on meaningful ordering and presentation, thus away from the main diagonal. Although both the ordered [Fig. 1a gaining insight into local and global structures. This is demonstrated (plate 3)] and unordered [Fig. 1a (plate 2)] matrices contain exactly on artificial data (Fig. 2a), where a presentation of the permuted dis- the same elements, only the permuted matrix allows an observer to tance matrix (Fig. 2b) brings to light the separation into four groups, deduce structural information. as can be seen by the sharp boundaries. Furthermore, one can infer In gene-expression data an elongated conformation may be asso- the shape of each cluster, as well as the global conformation. The ciated with a gradual process, such as cells going through several two tight spherical clusters (eyes) appear as dark blue squares on successive stages of differentiation. In Rozovskaia et al. (2003) the the main diagonal. From the light blue color of the squares between U95 affymetrix chip was used to determine expression profiles of them we can deduce that the eyes are relatively close to each other, acute lymphoblastic leukemias (ALLs), including tumors at various i.e. their relative placement. The next cluster (smile) has a gradient stages of differentiation (such as pre-B, pro-B and T cell ALLs). One of colors, from dark blue on the main diagonal to light blue at the particular group of genes identified by Rozovskaia et al. displayed corners. As explained above, this indicates an elongated structure. expression profiles that are sensitive to the differences between the The fourth cluster cycles through the entire spectrum, returning to early and late differentiation stages (pro-B versus pre-B and T cell dark blue at the corners, signifying a cyclic shape (Fig. 1b). The fact tumors). Here the dissimilarity matrix between samples was calcu- that the distance between opposing points on the ring is the largest lated using only the differentiation-implicated genes as features. In (i.e. the darkest red in the matrix) indicates that the ring encompasses Figure 1a (plate 4) the SPIN permuted matrix displays a clear pattern all other points. of elongation, with early cells (pro-B) placed at one side and more Some clustering algorithms, such as average-linkage and k-means, differentiated cells (from pre-B and T cell tumors) located at the other fail to cluster this data correctly (Fig. 2c and d). Others, such as end of the trajectory. single-linkage, succeed in rightly separating the data (Fig. 2e), but If the elongated object closes upon itself, i.e. represents a cycle, are not able to convey the different shape characteristics of all four then the corresponding fingerprint is also periodic, as shown in objects. A linkage algorithm can be supplemented by a leaf-ordering Figure 1b (plate 1–3). The phasing in the colors as one progressively algorithm (Bar-Joseph et al., 2001), in order to provide a meaningful deviates from the main diagonal can be understood by considering organization of points within clusters. However, even an ordered tree the organization of a circle. Starting from any arbitrary point A and is lacking with respect to highlighting shapes. In SPIN, the inherent going around the ring, the distance of the current point to A increases coupling between visualization and organization produces a powerful monotonously (color changes from blue to red) until the diametric- presentation tool. The permuted distance matrix captures the overall ally opposing point is reached. At this stage the distances begin to layout of compound structures, as well as the local conformation of decrease (color goes back to blue), as we approach the point of origin its components. 2302 Data analysis and visualization by ordering distance matrices Projection enhancement FORMAL STATEMENT OF THE PROBLEM n×n In the exploration of gene-expression data linear models were The input to SPIN is a distance matrix D ∈ R calculated for data employed to describe the expression levels of genes as a linear composed of n points, and its output is a reordered distance matrix, function of common hidden variables. Singular value decomposi- obtained by permuting the n objects according to a particular per- tion (SVD) (Alter et al., 2000) was used to decompose the gene mutation P ∈ S (the permutation group of n points). We denote by profiles into linear combinations of eigengenes, i.e. the eigenvectors P also the permutation matrix associated with P . In search for cri- of the covariance matrix; independent component analysis (ICA) teria for an informative permutation, we observed that well-ordered (Liebermeister, 2002) produced a linear model based on hidden vari- distance matrices exhibit two distinct and sometimes competing ables termed expression modes. In such approaches the projection of properties. First, in many cases the values in the upper rows tend data onto smaller subspaces reduces noise and allows useful visualiz- to increase with the column index (and decrease in the bottom rows), ation. In SPIN we suggest a different approach, in which the distance as in Figure 1a (plate 3). This type of ordering demands that large matrix is not subjected to any distortions, thus fully preserving the distances are assigned to corners, far from the diagonal. The second, original structure of the data. One advantage of avoiding distortion alternative aim is to ensure that the elements near the main diagonal is elimination of false positives, in the sense that the fingerprint of an tend to have smaller dissimilarity values, i.e. the linear ordering is elongated structure in the SPIN permuted matrix invariably implies such that if two points are positioned near each other, their distance a genuine elongation in the data. In the Supplementary informa- in the full high-dimensional space is also small [Fig. 1b (plate 3)]. We tion section we further discuss the relationships between the SPIN term the two properties ‘Side-to-Side’ (STS) and ‘Neighborhood’. permutation and projection according to PCA. These attributes can be mathematically formulated by introdu- In the examples presented so far, by applying dimensionality cing an energy (or cost) function F ≡ F :S → R quantifying D n reduction methods, the shapes of objects could be clearly discerned the quality of a permutation. Thus, the ordering problem becomes from a three-dimensional (3D) projection of the data points. The next one of finding the permutation P that minimizes F . We concen- example illustrates SPIN’s ability to deal with more complex objects trate on the following family of functions: F (P ) = tr(P DP W) = N ×N embedded in a truly high-dimensional space, objects whose struc- W D , where tr denotes matrix trace and W ∈ R ij P(i)P(j) i,j =1 ture is seriously distorted when projected onto 3D. In such cases even is some weight matrix. For this family, the optimization problem is the most up-to-date dimensionality reduction methods are doomed known as the quadratic assignment problem (QAP), introduced by to fail in finding 3D representations which properly capture the data Koopmans and Beckmann (1957). The general QAP is considered structure. Figure 3a shows a PCA projection of points constituting an extremely difficult optimization problem. It is known to be NP- a set of seven intersecting twisted cylinders in d = 7 dimensions. Hard even to approximate, and in practice, usually untractable for Projecting such a relatively complicated object onto the first prin- n>30. [See Burkard et al. (1998) for a comprehensive survey of the cipal components does not produce a clear image (Fig. 3a). Coloring problem.] the points according to SPIN’s linear ordering (Fig. 3b) produces The STS property is captured by setting W = XX , for some a much more informative image. Furthermore, the distance matrix strictly increasing (column) vector X (in our implementation we (Fig. 3c) identifies each rod as an elongated structure (along the main worked with X = i − (n + 1)/2). Neighborhood is reflected by diagonal). The relationships between the seven rods can be deduced choosing W to be symmetric and concentrated in a region, determ- from the patterns in the off-diagonal regions in the organized dis- ined by a parameter σ , around its main diagonal (our choice of W tance matrix. For example, the fact that the rods share a common is defined below). We show below that finding a global minimum nexus is reflected by a grid of blue patches. The combination of any for our particular choices of F is NP-hard, and we propose two iter- dimensionality reduction technique with SPIN may serve to highlight ative heuristic algorithms to search for minima. We prove, for both the shape-characteristics of a high-dimensional object, which are not algorithms, that the energy is non-increasing on every iteration. Both immediately made evident by projection onto a lower dimensional algorithms were used in the examples presented in this paper, but the space. displayed images are from Neighborhood. Expression data analysis The STS algorithm In the context of gene-expression data we implemented SPIN in an interactive GUI that accepts an expression matrix as input, and We have shown that the STS problem is NP-Complete by redu- supports the following actions: cing it to the well known k-clique problem in graph theory (see Supplementary material). (1) ordering of samples using genes as features, The STS algorithm is shown below: (2) ordering of genes using samples as features and (3) zooming in on subsets of the original expression matrix to order Side-to-Side objects in a reduced subspace. Input : D and X. 0 −1 (1) Set X = X, t = 0, define P = I . n×n A coupling between samples and genes is produced by the ability to t t (2) Calculate S = DX . identify a group of genes (samples) that fluctuate in a synchronized t t (3) Find P which sorts S in a descending order. manner. Similar in spirit to the coupled two-way clustering (CTWC) T t t t −1 t t +1 t 0 (4) If P S = P S , set X = P X , approach (Getz et al., 2000) we proceed by using the zoom in oper- set t = t + 1 and go to 2. ation to order the samples (genes) in a selected reduced subspace. t t (5) Output P DP . One can redefine the working space in a recursive manner. 2303 D.Tsafrir et al. We call each pass through steps 2–4 a STS iteration, whose guaranteed after a finite number of steps. The proof is based on complexity is O(n ). Each STS iteration can be viewed as a map- showing that every STS iteration reduces the cost function, F , ping from the permutation group S to itself, G :S → S . guaranteeing convergence to a local minimum. Note that the STS n D n n Thus P is a possible output of STS if and only if it is a fixed procedure may converge to a P which does not correspond to the point of G . global minimum ofF ; for different initial permutations the algorithm In the Supplementary material we prove that when the input matrix, may terminate at different fixed points, with different values of F . D, is a distance matrix, convergence of STS to a fixed point is A known strategy to cope with this problem is to start the algorithm a High 1 2 3 4 ab Low -2 -4 -6 -6 -4 -2 0 2 4 6 Eyes Mouth Ring PCA 1 c d e Fig. 1 3 3 K-means Average linkage Single linkage -1 -1 -2 -2 -3 -3 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4 PCA1 PCA1 Fig. 3 Fig. 2 Fig. 4 PCA 2 PCA 2 PCA 2 Data analysis and visualization by ordering distance matrices from many randomly generated initial permutations, and choose the The following algorithm attempts to relocate a point A to a best fixed point obtained. Moreover, it is also possible to have mul- local neighborhood that best fits it, i.e. none of the points in the tiple global minima. For example, define for every permutation P its neighborhood of A are at a large distance from it. ¯ ¯ ‘reverse’ P by P(i) = P(n + 1 − i), (i = 1, ... , n).If X is anti- symmetric we get for STS: F (P ) = F (P), leading to at least two Neighborhood global minima. Some data-sets may contain further degeneracies Input : D and W n×n n×n due to inherent symmetries. In practice it is not essential to reach 0 −1 (1) Set W = W , P = I , t = 0. n×n the global minimum since the fixed points to which the algorithm t t (2) Compute M = DW . converges are often just as informative. t t (3) Set P = argmin tr(QM ). Q∈S t t t −1 t −1 t +1 t (4) If tr(P M ) = tr(P M ), set W = P W , The Neighborhood algorithm t = t + 1 and go to 2. t t (5) Output P DP . Claim. The Neighborhood problem is NP-Hard. Each passage of steps 2–4 constitutes one Neighborhood iteration. Proof. The two ingredients of the problem are the distance The size of the neighborhood is dictated by the choice of W , and, matrix D and the weight matrix W . Setting W = 1 gives ij |i−j |=1 in turn, affects the scale at which objects are distinguished. Step 3 n−1 n n−1 tr(P DP W) = D + D = 2 can be accomplished by solving the linear assignment problem. This P(i+1),P(i) P(i−1),P(i) i=1 i=2 i=1 D . This is the cost function for the Travelling Salesman solution reflects the best current guess for an improved location for P(i+1),P(i) Problem, which is known to be NP-Hard, even in the Euclidian case all the data points. At every iteration, points are sent to their new (Papadimitriou, 1977). location, based on the current ordering of the points. That is, point A Fig. 1. Shapes of simple objects, each consisting of 500 points (except for plate 4 of a where only 27 samples are available). (a) 1. Points uniformly distributed within a cylinder. 2. The corresponding distance matrix: the color of element D reflects the relative distance between points i and j , where blue (red) denotes ij small (large) distances, respectively. This randomly ordered distance matrix is the input to SPIN. 3. The final permutation in which points are ordered along the trajectory of the cylinder. Only the permutation of the rows/columns changes. 4. Leukemia samples: the permuted distance matrix for 27 blasts at various stages of differentiation, including pre-B, pro-B and T cell ALLs. (b) Corresponding images for a cyclic object: A ring of points is characterized by a cyclic pattern, with small distances (blue) near the main diagonal and at the corners. 4. Yeast cell cycle: 500 genes with highest standard deviation across the samples were analyzed, using the raw expression data without any manipulation (except for thresholding at the 99th percentile to avoid spikes). The ordered image reveals the heterogeneous nature of the ring, corresponding to separation into different stages of the cell cycle. Fig. 2. Relations and shapes of multiple clusters. SPIN’s results for a toy dataset of 800 points in 10D. The complex object was originally generated in 3D, and then seven additional dimensions of noise (uniformly distributed between −1 and 1) were added. (a) The projection of the data points onto the first and second PCA plane. (b) The SPIN sorted distance matrix. From this organized matrix one can easily infer the shapes of the four clusters and their relative placement. For example, the position on the ring closest to the top eye is marked by a red circle. This can be inferred from the sorted matrix by locating the darkest blue elements in the rectangle corresponding to the distances between the eye and the ring, as shown by the arrows. For comparison, the results of three popular clustering methods were translated to permutations on the distance matrix: (c) k-Means (with k = 4); (d) average linkage and (e) single linkage. The division of the points into four clusters (red, blue, green and black) is presented below each matrix. k-Means and average linkage fail to identify the correct clusters. Single linkage does identify the four clusters, but does not order the points within them in an informative manner. Fig. 3. Intersecting rods. A set of seven orthogonal intersecting cylinders, comprised 1400 points in seven dimensions. The rods were twisted by rotation with angles that increase linearly with the distance from the origin. (a) The points displayed in the first two PCAs. (b) The same projection with the points colored according to their placement in SPIN; the first point in the SPIN permutation is colored dark blue, going through blue, green, yellow and orange, with the last points colored dark red. In this example the coloring is crucial for making sense out of a complex image. (c) The correspondingly SPIN-permuted distance matrix. The region of the intersection creates blue patches in the off-diagonal regions of the distance matrix. Fig. 4. Colon cancer data; expression levels of the 1000 highest variance transcripts overall 144 samples. (a) Projection of the samples onto the first (x-axis) and second (y-axis) principal components, calculated in gene space. The clinical identity of samples is indicated by a color: primary carcinomas (blue), adenomas (green), normal colon (red), liver metastasis (magenta), lung metastasis (orange), normal liver (black) and normal lung (cyan). This coloring scheme for tissues is kept in all subfigures. The first PCA reflects 34% of variance, and is dominated by the differences between normal liver and all other samples. (b) SPIN-permuted distance matrix for the samples. Colors depict dissimilarity levels between samples, with red (blue) indicating large (small) distances. (c) Genes SPIN-permuted distance matrix. The genes display several distinct expression profiles. (d) Two-way sorted expression matrix. Here colors depict relative expression intensities, where red (blue) denotes relatively high (low) expression. The colored bar below the matrix provides the clinical identity of the tissues. Some of the dominant gene clusters and their expression levels are highlighted by dark rectangles. Each gene cluster is used to construct the distance matrix of a particular subset of the samples. (e) The distance matrix of normal liver, liver metastasis and carcinoma samples, as calculated in the subspace of the liver-specific gene cluster. The normal liver and carcinoma samples form two distinctly separated, tight spherical clusters, while the metastases form a connecting elongated cloud, with some of the samples displaying higher proximity (i.e. similarity) to the normal liver samples. The six metastasis samples that were placed farthest from the liver samples presumably contain the lowest amounts of normal liver tissue, and are therefore referred to as clean metastasis. (f) Muscle and connective tissue associated genes. Expression profiles related to cell mixtures can be distinguished in SPIN by the fact that affected samples tend to order into an elongated shape, due to the relatively high variation in the composition of the samples. Here the ordering of the normal colon samples is indicative of levels of muscle and connective tissue contamination—lowest in the polyp samples. (g) Genes related with a gradual loss of differentiation. Note the placement of the polyp samples between normal and cancer tissue. In (e)–(g) the clinical identity of the tissues’ is given by the colored bar to the right of each distance matrix. 2305 D.Tsafrir et al. is sent to a new location i(A) on the basis of the presently residing liver metastases, 19 lung metastases, 11 normal livers, and 5 normal points near i(A). However, since all the points are permuted simul- lungs. Standard preprocessing of the data included thresholding to taneously, there is no guarantee that this assignment remains optimal, 10 (i.e. all expression values <10 were set to 10) and log trans- since the points that were near i(A) may have moved elsewhere. formation. A variance filter was utilized to concentrate on the most Hence the need to re-iterate. Since the linear assignment problem is relevant genes. We started with the 500 highest varying transcripts, known to be solvable in time O(n ) (Dinic and Kronrod, 1969), the then doubled the number; since there was a significant change in complexity of each iteration is O(n ). the results, the number of transcripts was doubled again, to 2000. We prove that the energy is improved on every iteration; Ensuring that this did not alter the main conclusions to a noticeable thus convergence to a fixed point is guaranteed after a finite degree we continued to work with the top 1000. time In the context of such complex data, the search for genes and pathways that are causally involved in cancer is complicated by the t +1 t t t −1 Claim. tr(P DP W) ≤ tr(P DP W). need to distinguish their signal from a large background of inno- t +1 t t +1 t +1 t +1 cent bystander genes, whose expression levels appear altered due Proof.tr(P DP W) = tr(P DW ) ≤ tr(QDW ) to secondary causes. An initial objective is to generate an overall ∀Q ∈ S . impression of the data’s structure, identifying major partitions and Using the symmetry of W and the property tr(AB) = tr(BA) relationships. By filtering the highest variance genes and ordering the we get: resulting expression matrix in SPIN (Fig. 4d) one can get a global T T t +1 t t T view of the data. Two separate ordering operations were performed: tr(QDW ) = t r(QDP W) = t r((QDP W) ) one on the genes’ distance matrix (rows, Fig. 4c) and another on t T t T = tr(W P DQ ) = tr(P DQ W). the samples’ distance matrix (columns, Fig. 4b). Thus, the two-way organized expression matrix allows one to study concurrently the t −1 Taking Q = P gives the desired result. structure of both samples and genes. In consecutive analysis stages, According to step 4, the algorithm terminates unless a strict detailed in the following paragraphs, we proceeded to focus indi- inequality holds in the above claim. This prevents cycles of con- vidually on sets of correlated genes that were identified in this initial stant energy. Since the permutation space is finite, termination in a step. SPIN is used to re-order the samples in the context of each fixed point after a finite number of steps is guaranteed. gene set separately, and the resulting permutation is shown to be informative of the underlying biology (Fig. 4e–g). This process of Our choice for the weight matrix is taken to be Gaussian, W = ij iteratively identifying and focusing on relevant subsets of the ini- −(i−j) /nσ e , which is then normalized into a doubly stochastic matrix tial data matrix is reminiscent of the previously proposed coupled (i.e. sum of each row and column is equal to 1). In this case the mis- two-way clustering algorithm (Getz et al., 2000). match matrix M = DW can be viewed as a Gaussian smoothing of variance σ on each row of D. For a given dataset, there exists a range Liver contamination of relevant length scales, where large scales reflect the overall layout Previous expression data studies recognized the challenge posed of the data, while smaller values give a better local organization at the by the heterogeneous composition of sampled tissues (Alon et al., expense of possibly fragmenting larger structures. This is captured 1999), which was not answered in the context of traditional analysis in SPIN by controlling the value of σ . One heuristic scheme that usu- methods (Ghosh, 2004). In the current data the clearest separa- ally works well is starting with a very large σ , iterating several times, tion in the samples is according to their organ of origin—either then lowering σ (e.g. by a factor of 2) and so forth, in the spirit of colon, liver or lung—with the liver samples forming the most dis- simulated annealing. Moreover, the solution of the linear assignment tinct group (Fig. 4b). Even though the tissue samples were carefully problem (step 3 in the algorithm) can be efficiently approximated by dissected, the strongest expression signals are indeed related with finding the minimum of each row of M , and then sorting the indices the composition of the various samples. The most prominent gene of the minima (ties are broken arbitrarily). This heuristic, though cluster, highlighted by the bottom black rectangle (Fig. 4c and d), not guaranteed to reduce F at every iteration, generally yields a low is characterized by highest expression levels in the liver samples. energy solution, while considerably speeding up the calculations. The annotation of genes belonging to this cluster is related to liver functions (including SERPINA3, CP, HP and APOC1), and therefore APPLICATION TO COLON CANCER we refer to it as liver-specific. These liver-specific genes are totally The biological question addressed here is that of recognizing altera- irrelevant to the disease, and yet when performing a PCA projection tions in gene expression that may be linked with the progression of of the samples (Fig. 4a) the first principal direction (explaining 34% cancer. SPIN is especially appropriate for this analysis, since can- of variance) is dominated by the difference between normal liver and cer evolution is an inherently continuous process, which arises from all other samples. The highly relevant aspect of this phenomenon is a gradual accumulation of genetic alterations that promote selec- that some of the liver metastasis samples display elevated expres- tion of cells with increasingly aggressive behavior. Such continuity sion levels for the liver-specific genes, shifting their placement in may be completely overlooked by traditional methods that emphas- the SPIN ordering toward the location of the normal liver samples. ize clear separations. Colon cancer is a good model system since This hinders the ability of traditional statistical analysis methods samples are readily available across several well-defined stages of the to generate a list of genes associated with metastatic cancer; when disease, enabling a study of the onset of the neoplastic transforma- searching for genes with high expression in liver metastasis versus tion. Expression profiles were determined for seven types of samples carcinoma samples, liver-specific genes may be implicated. Indeed using the Affymetrix U133A GeneChip (Tsafrir et al., 2004): 47 a supervised hypothesis test (Pan, 2002) generated a list of genes primary carcinomas, 24 adenomas, 22 normal colon epithelium, 16 significantly over expressed in liver metastasis as compared to the 2306 Data analysis and visualization by ordering distance matrices primary tumor samples [387 transcripts of the examined top 1000 distinguish the desired set of disease-progression associated genes, passed the Wilcoxon ranksum test with FDR of q = 0.05 (Benjamini and show that the reduction in their expression is correlated with the and Hochberg, 1995)]. The vast majority of these (97%) are associ- gradual onset of the cancer. Focusing on this subset of genes reveals ated with liver functions and are in fact members of our liver-specific that in this context the samples trace an elongated shape (Fig. 4g), cluster (Fig. 4e). The increased expression for these genes is probably with the normal colon epithelium placed to one side, followed by the a byproduct caused by contamination of the metastasis samples with adenomas that exhibit a somewhat reduced expression, which is even normal liver tissue. Therefore, these genes could potentially serve as lower in the carcinoma samples. This set includes genes that were the basis for constructing a liver-metastasis classifier (Dudoit et al., observed to be preferentially expressed in human epithelial cells and 2002); however, analysis based on SPIN clarifies that they do not play downregulated in cancer, such as carbonic anhydrases (Notterman a role in the progression of cancer, but rather as a tissue-of-origin et al., 2001), Guanylate cyclase activators (Birkenkamp-Demtroder indicator. et al., 2002) and EPLIN (Maul and Chang, 1999). A plausible hypo- thesis is that these genes are associated with colon functions, and Muscle and connective tissue contamination that the SPIN-permutation highlights a gradual loss of differenti- As demonstrated in the previous section, the problem of tissue het- ation in the transformed tissue. Perhaps the percentage of cells that erogeneity may be a major complication, and one that was mostly still keep their colon functions is steadily reduced with the progres- unresolved by traditional analysis methods. In some data sets an sion of the disease. To conclude, supervised tests were employed to assessment by the pathologist of the percentage of relevant tissues in answer a specific question—e.g. differential expression in sick versus each sample is available (Notterman et al., 2001; Alon et al., 1999), healthy tissue—while the analysis in SPIN revealed that some of the and this information can be utilized to construct an appropriate stat- implicated genes answer a very different question, i.e. which samples istical test (Ghosh, 2004). In the current data no such knowledge contain the highest proportion of muscle and connective tissue. is available, which prevents the proper employment of supervised Metastasis associated signal methods and necessitates the use of an unsupervised approach. For The analysis of the colon cancer data demonstrates a situation where example, consider a group of genes that appear significantly under- SPIN can be used to assign new labels to samples, and employ this expressed in the neoplastic samples as compared with normal tissue knowledge to improve the application of supervised methods. Meta- (434 transcripts of the examined top 1000 passed the Wilcoxon rank- stasis samples, for example, can be marked according to the degree sum test with FDR of q = 0.05). It has already been observed in colon of surrounding normal tissue inadvertently included in the sample’s cancer studies that tumor samples are more biased toward epithelium preparation. One way of gaining this information is in the context tissue than their normal counterparts, causing apparent underexpres- of the liver-specific cluster, where the expression of sample profiles sion of genes functioning in muscle and connective tissues (Alon can be viewed as the result of a gradual mixing process, starting with et al., 1999). In the SPIN-permuted data (Fig. 4c and d) the transcripts samples extracted from the colon, that contain no liver tissue, and that show reduced expression in diseased tissue clearly separate into continuing with the metastasis samples that vary in the amount of two different gene profiles. One of this gene clusters (Fig. 4f ) exhib- liver contamination. The degree of liver mixture in each sample is its extreme variation in expression in the context of the normal colon reflected by the SPIN ordering, as can be seen in Figure 4e. The least samples, which is visually manifested by a pattern of elongation in contaminated metastasis samples can be distinguished by their place- the relevant SPIN-sorted distance matrix (Fig. 4f ). The annotation ment next to the cluster of primary tumors, and labeled as clean. A of these genes associates them with smooth muscle and connect- clustering algorithm, such as average linkage, although clearly sep- ive tissue. Therefore, a likely cause for the disparity in expression arating between normal liver and primary tumors, does not produce among the normal samples are the differences in tissue composition. such meaningful ordering of the metastasis samples. Therefore, SPIN The reduced variability detected in the tumor samples (most tumors is especially useful in this situation since it can be used to perform form a tighter, less elongated shape in Fig. 4f ) is consistent with a type of electronic micro-dissection, allowing identification of the the observation made in earlier studies that those samples contain cleanest metastasis samples. A similar procedure can be performed mostly epithelial tissue (Alon et al., 1999), and with the fact that in for the lung metastasis samples by using the normal lung samples. this experiment they were carefully dissected (Tsafrir et al., 2004). It is then possible to proceed by focusing on the ‘clean’ metastasis The adenomas exhibit the lowest expression, perhaps associated with samples (from both liver and lung) to uncover genes relevant to the the fact that these benign precursors of cancer protrude into the lumen metastatic process. The resulting list included several known onco- of the colon, making it easy to remove them surgically without inad- genes [such as VEGF, CSE1L (Behrens et al., 2003), TGIF2 and vertently including some surrounding muscle or connective tissue. UBE2C]; some, in particular, are located on chromosomal arm 20q, a Therefore, using SPIN to study the profile of this gene cluster clarified region which has been previously shown to be amplified in metastatic that even though the genes are significantly differentially expressed colon cancer (Platzer et al., 2002). In SPIN one can further observe between normal and tumor cells, they are not connected with the that this group of genes exhibits a gradual elevation in expression neoplastic transformation, but rather with tissue mixtures. which is coupled with the progression of the cancer—from normal Gradual loss of differentiation tissue, through polyps, increasing in primary tumors and culminating in the ‘clean’ metastasis samples. The analysis described in the previous section illustrates how an unsupervised visualization tool such as SPIN can serve to guide rigor- SUMMARY AND DISCUSSION ous statistical analysis. Employing supervised statistical tests to com- pare our normal colon samples with the tumors resulted in a mixed The emphasis of SPIN is on providing an informative image of list, which included some genes that the SPIN analysis revealed to the data, one that facilitates extraction of meaningful characterist- be related with tissue mixtures. It is further possible using SPIN to ics. The distance matrix ordered by SPIN can reveal the fingerprint 2307 D.Tsafrir et al. of objects (e.g. regions with high density of data points) of vari- ACKNOWLEDGEMENTS ous shapes that are embedded in a high-dimensional representation This work was supported by the NIH under grant #5 P01 CA of the data. We demonstrated this for objects of fairly general 65930-06. We thank P.B. Paty and W.L. Gerald for preparation of shapes, including an elongated rod, associated with a continuous the colon cancer samples and acknowledge the use of the Gene variable and for a curve that closes upon itself indicating a cycle, Expression Core Facility of the Cancer Institute of New Jersey. as well as for gene-expression data on cell-cycle and differenti- We acknowledge the partial support by an EC Research Training ation. Furthermore, the reordered distance matrix is also able to Network (STIPCO), by the Ridgefield Foundation and by EC FP6 identify multiple objects, where the main linear variation of each funding. This publication reflects the author’s views and not neces- entity is followed sequentially, and for which the interobject rela- sarily those of the EC. The Community is not liable for any use that tionships, such as their relative placement, can be also identified. may be made of the information contained herein. We thank U. Feige, The concept of presenting an organized distance matrix is not new, I. Kanter, A. Natanzon, Y. Pilpel and R. Raz for useful discussions but the SPIN-permuted matrix is shown to be more informative than and their comments. images produced by popular methods. SPIN was used to resolve the problem of tissue mixtures in colon cancer expression data, REFERENCES and consequently to allow a clear identification of a set of genes Alon,U. et al. (1999) Broad patterns of gene expression revealed by clustering analysis implicated in the gradual loss of differentiation of the transformed of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. tissue. Sci. USA, 96, 6745–6750. We presented two different search heuristics for exploring Alter,O. et al. (2000) Singular value decomposition for genome-wide expression data permutation-space: STS generates a distance matrix that preferen- processing and modeling. Proc. Natl Acad. Sci. USA, 97, 10101–10106. Bar-Joseph,Z. et al. (2001) Fast optimal leaf ordering for hierarchical clustering. tially places red-colored elements (which denote large distances) Bioinformatics, 17, S22–S29. near the top-right (and bottom-left) corners. Thus points that are Behrens,P. et al. (2003) CSE1l/CAS: its role in proliferation and apoptosis. Apoptosis, placed far apart in the linear ordering are also distant in the full 8, 39–44. high-dimensional space. Neighborhood, on the other hand, tries to Ben-Dor,A. et al. (2003) Discovering local structure in gene expression data: The order- preserving submatrix problem. J. Comput. Biol., 10, 373–384. make sure that elements located near the main diagonal are blue- Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical colored, i.e. neighboring points in the linear ordering are also close and powerful approach to multiple testing. J. R. Stat. Soc., 57, 289–300. to each other in the high-dimensional space. This subtle distinction Birkenkamp-Demtroder,K. et al. (2002) Gene expression in colorectal cancer. Cancer in emphasis may lead to a substantial difference in the results, as Res., 62, 4352–4363. Burkard,R.E., Cela,E., Pardalos,P. and Pitsoulis,S.L. (1998) The Quadratic Assignment different energy functions reveal alternative aspects of the data, thus Problem, Kluwer Academic Publishers, Dordecht, Vol. 3, pp. 241–339. enabling the study of diverse properties. From a practical point of Dinic,E.A. and Kronrod,M.A. (1969) An algorithm for the solution of the assignment view, the STS algorithm is faster, while the Neighborhood algorithm problem. Soviet Math. Dokl., 10, 1324–1326. Dudoit,S. et al. (2002) Comparison of discrimination methods for the classification of produces better results for complex data, especially containing com- tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87. pound objects. Therefore, a user could start by applying STS, which Eisen,M. et al. (1998) Cluster analysis and display of genome-wide expression patterns. would generate an image that visually manifests the major elonga- Proc. Natl Acad. Sci. USA, 95, 14863–14868. tion in the data, and proceed by utilizing Neighborhood to study the Getz,G. et al. (2000) Coupled two-way clustering analysis of gene microarray data. Proc. Natl Acad. Sci. USA, 97, 12079–12084. more intricate objects. Ghosh,D. (2004) Mixture models for assessing differential expression in complex tissues The important advantages of SPIN are: (1) the simplicity of using microarray data. Bioinformatics, 20, 1663–1669. the underlying algorithm, which makes it easily implementable, Koopmans,T. and Beckmann,M. (1957) Assignment problems and the location of accessible and clear to a wide range of users. (2) Running time economic activities. Econometrica, 25, 53–76. 2 3 Lepre,J. et al. (2004) Genes@work: an efficient algorithm for pattern discovery of O(n ) − O(n ) gives almost instant feedback for datasets of and multivariate feature selection in gene expression data. Bioinformatics, 20, reasonable size (see Table 1 in Supplementary material). (3) A 1033–1044. fingerprint of an elongated structure in the sorted distance matrix Liebermeister,W. (2002) Linear modes of gene expression determined by independent invariably implies an elongation in the data. This follows from the component analysis. Bioinformatics, 18, 51–60. Maul,R.S. and Chang,D.D. (1999) Eplin, epithelial protein lost in neoplasm. Oncogene, fact that SPIN permutes the distance matrix with no distortions. (4) 18, 7838–7841. Synergy with other exploratory analysis techniques. For example, Notterman,D.A. et al. (2001) Transcriptional gene expression profiles of colorectal SPIN can be used to order within and between predefined clusters adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. obtained with most standard clustering algorithms. It was also shown Cancer Res., 61, 3124–3130. Pan,W. (2002) A comparative review of statistical methods for discovering differentially that SPIN can enhance dimensionality reduction analysis, as exem- expressed genes in replicated microarray experiments. Bioinformatics, 18, 546–554. plified in Figure 3 where the color coded ordering significantly Papadimitriou,C.H. (1977) The euclidean traveling salesman problem is NP-complete. clarifies the PCA image. Furthermore, the colon cancer application Theoret. Comput. Sci., 4, 237–244. demonstrates a biologically important scenario where the lack of Platzer,P. et al. (2002) Silence of chromosomal amplifications in colon cancer. Cancer Res., 62, 1134–1138. sufficient labels prevents the exclusive employment of supervised Ramaswamy,S. et al. (2001) Multiclass cancer diagnosis using tumor gene expression statistical methods, while the continuous nature of the underlying signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154. biological process makes SPIN an especially appropriate exploration Rozovskaia,T. et al. (2003) Expression profiles of acute lymphoblastic and myeloblastic methodology. leukemias with all-1 rearrangements. Proc. Natl Acad. Sci. USA, 100, 7853–7858. Spellman,P.T. et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, In order to show that SPIN does not find structures that do not exist, we per- 9, 3273–3297. formed a random-permutation test on expression data. Results are presented Tsafrir,D. et al. (2004) A novel mathematical approach to analyzing gene expression in the Supplementary material, in the section ‘Loss of structure in randomized data: results from an international colon cancer consortium. In Proceedings of expression data’. AACR 2004.
Bioinformatics – Oxford University Press
Published: Feb 18, 2005
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.