Patient similarity by joint matrix trifactorization to identify subgroups in acute myeloid leukemia

JAMIA Open, 0(0), 2018, 1–12
doi: 10.1093/jamiaopen/ooy008
Research and Applications

F. Vitali (1,2,3), S. Marini (4), D. Pala (5), A. Demartini (5,6), S. Montoli (5,6), A. Zambelli (7), and R. Bellazzi (5,6,8)

1 Center for Biomedical Informatics and Biostatistics, The University of Arizona, Tucson, Arizona, USA
2 BIO5 Institute, The University of Arizona, Tucson, Arizona, USA
3 Department of Medicine, The University of Arizona, Tucson, AZ, USA
4 Department of Computational Biology and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
5 Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, PV, Italy
6 Centre for Health Technologies, University of Pavia, PV, Italy
7 Oncology Unit, ASST Papa Giovanni XXIII, Bergamo, BG, Italy
8 IRCCS Istituti Clinici Scientifici Maugeri, Pavia, PV, Italy

Corresponding Author: Dr. Riccardo Bellazzi, University of Pavia, Department of Electrical, Computer and Biomedical Engineering, 27100, Pavia, PV, Italy (riccardo.bellazzi@unipv.it)

These authors contributed equally to the work.

Received 20 December 2017; Revised 7 March 2018; Accepted 20 March 2018

ABSTRACT

Objective: Computing patients' similarity is of great interest in precision oncology since it supports clustering and subgroup identification, eventually leading to tailored therapies. The availability of large amounts of biomedical data, characterized by large feature sets and sparse content, motivates the development of new methods to compute patient similarities that can fuse heterogeneous data sources with the available knowledge.
Materials and Methods: In this work, we developed a data integration approach based on matrix trifactorization to compute patient similarities by integrating several sources of data and knowledge. We assess the accuracy of the proposed method: (1) on several synthetic data sets whose similarity structures are affected by increasing levels of noise and data sparsity, and (2) on a real data set coming from an acute myeloid leukemia (AML) study. The results obtained are finally compared with those of traditional similarity calculation methods.

Results: In the analysis of the synthetic data sets, where the ground truth is known, we measured the capability of reconstructing the correct clusters, while in the AML study we evaluated the Kaplan-Meier curves obtained with the different clusters and measured their statistical difference by means of the log-rank test. In the presence of noise and sparse data, our data integration method outperforms other techniques, both in the synthetic and in the AML data.

Discussion: In the case of multiple heterogeneous data sources, a matrix trifactorization technique can successfully fuse all the information in a joint model. We demonstrated how this approach can be efficiently applied to discover meaningful patient similarities and therefore may be considered a reliable data-driven strategy for the definition of new research hypotheses for precision oncology.

Conclusion: The better performance of the proposed approach presents an advantage over previous methods to provide accurate patient similarities supporting precision medicine.

Key words: data integration, matrix trifactorization, acute myeloid leukemia (AML), patient similarity, precision medicine

Published by Oxford University Press on behalf of the American Medical Informatics Association 2018. This work is written by US Government employees and is in the public domain in the US.
This Open Access article contains public sector information licensed under the Open Government Licence v2.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/).

Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018

BACKGROUND AND SIGNIFICANCE

The concept of precision medicine is based on the assumption that a careful identification of patients' subgroups is able to properly take into account individual variability, which may play a major role in any prevention and treatment strategies. This concept is not new: blood typing, for instance, has been used to correctly allocate blood transfusions for more than a century.

The oncology field seems to be a clear choice for taking advantage of precision medicine. Cancers are common diseases with a high impact on the population because of their lethality, severe symptoms, and the toxicity associated with the oncological treatment. Moreover, each cancer has its own genomic signature, along with some common features shared by multiple cancer types. Patient similarity is an emerging approach in precision oncology and medicine, identifying patients with similar profiles and deriving insights to investigate diseases and potential treatments. In precision oncology, patient similarity is traditionally measured through preidentified signatures (eg Oncotype DX™ [4], PAM50 [5], and other clinically available classifiers), or patient-specific biomarkers. However, these preidentified onco-signatures rely only on a relatively small number of molecular features, and the trials launched to test the real impact of precision oncology in daily clinical practice have so far yielded few results [7–9], being limited only to a tiny proportion of the entire population enrolled in the trials [9–11]. Therefore, conventional studies designed on the basis of the classical 4 phases of drug development are probably neither effective nor fit for precision oncology.

The availability of large heterogeneous biomedical data naturally opens ways to develop computational methods that leverage the whole multidimensional patient data framework to search for patient similarity. These data include, among others, clinical (ie coded data, text, images, signals), -omics (from genome to metabolome), and exposome data. The capability of computing patient similarity in the presence of such large feature sets therefore becomes a crucial component to enable large-scale precision medicine implementation. In the literature, patient similarity seems highly dependent on the specific problems considered, and there is no consensus about the best metrics or the best algorithms to calculate it in the presence of heterogeneous and sparse data. To face these problems and consider patient similarity from a multidimensional perspective, in recent years a number of methods to determine patient similarity have been developed [15–22].

In this article, we propose a novel method to compute patient similarity for precision oncology by an unsupervised discovery of patient subgroups. This method is based on a strategy that integrates data and knowledge in a sound and formal way. In particular, we exploited a modified version of a non-negative matrix trifactorization algorithm recently developed and applied also to biomedical problems by Gligorijevic et al, Zang et al, Utro et al, and Zitnik et al [21,24–30]. Factorization techniques are efficient tools for data fusion of large sparse data sets (like the ones available in the clinical setting). These approaches adopt a useful dimension reduction by directly compressing the starting data into a lower number of features (ie vector components). The decomposition methods play a central role in the analysis of the latent structure hidden in the data that may unveil unknown interactions of the initial data, that is patient similarities. In recent years, several dimension reduction techniques have been proposed to tackle biological problems. Examples rely on collective matrix factorizations [31,32], tensor decomposition [33], Bayesian multitensor factorization [34], and group factor analysis (GFA) [35,36].

Our approach takes into account the structural relation of several highly heterogeneous data, such as clinical and genomic data, and available knowledge from several public repositories. It implements both metafeature extraction and distance measures to reveal hidden patient similarities. In the majority of state-of-the-art dimension reduction methods, there is no constraint on the sign of the metafeature elements, thus admitting negative components or subtractive combinations into the representation. Our method, on the contrary, is based on non-negative trifactorization. The incorporation of non-negative constraints has been shown to enhance the interpretability of the integrated data. We tested the performance of the proposed algorithm on different synthetic data sets, affected by increasing levels of noise and data sparsity. We further validated the method by fusing a real data set coming from an AML study with several external knowledge sources. By comparing it with state-of-the-art techniques, we show that our method outperforms other approaches in both simulated and real data. The identified patient subgroups are validated as significantly different by survival curves.

MATERIALS AND METHODS

An overview of the trifactorization algorithm

The structure of data sources and knowledge bases is typically organized into relational matrices associating various objects/concepts, such as patients, clinical data, genes, diseases, and so on. The non-negative trifactorization algorithm naturally exploits these data structures to perform data fusion by first representing them in a matrix form and subsequently organizing them in a single large block matrix. The algorithm aims at identifying low-rank non-negative matrices whose product can provide a good approximation of the original non-negative matrix. The result is a new matrix containing predictions and novel knowledge about the associations represented. This algorithm can be considered as a knowledge-based method that allows dealing with sparsity by interpolating missing data through a prediction derived by explicitly modeling the correlation and the dependency between attributes.

A Matlab implementation of our algorithm is available at https://gitlab.com/smarini/MaDDA.

The algorithm is described step by step as follows.

Let us consider $r$ different types of concepts, say, patients, genes, miRNAs, ..., which we call objects $o_1, o_2, \ldots, o_r$, and let us suppose that we have a set of data sources that relate pairs of objects $(o_i, o_j)$ for some $i$ and $j$: for example, we can have the objects "gene" and "disease" and the repository "DisGeNET" that relates them. If the number of objects of type $o_i$ is $n_i$ and the number of objects of type $o_j$ is $n_j$, the data source when $i \neq j$ can be represented as a sparse matrix $R_{ij} \in \mathbb{R}^{n_i \times n_j}$, called relation matrix (Figure 1). For instance, the relation matrix may contain information on the relationships between genes (eg BRCA1) and diseases (eg breast cancer). If we also have observations about the relationships of the objects of the same type, such as gene coexpression, we might represent them with a matrix $H_i \in \mathbb{R}^{n_i \times n_i}$, called constraint matrix (Figure 1).

Considering the entire set of $R_{ij}$ relation matrices given by all the data sources of interest, we can represent them as a block matrix $R$, which may miss elements (eg not all the genes in the genome can be related to a given disease):

$$R = \begin{pmatrix} * & R_{12} & \cdots & R_{1r} \\ R_{21} & * & \cdots & R_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ R_{r1} & R_{r2} & \cdots & * \end{pmatrix} \tag{1}$$

Figure 1. An example of the trifactorization algorithm constructed by considering 3 data sources. All the data sources are represented as relation matrices. $R_{i,j}$ matrices are used to describe associations between objects of different types (eg gene-disease); their values range between [0, 1], where 1 indicates strong interaction and 0 the absence or lack of association. $H_i$ matrices represent relations between objects of the same type (eg gene-gene), and $H_i$ elements vary between [−1, 1], where −1 represents a strong association and 1 a lack of association. The $R$ matrix is then trifactorized by running an optimization algorithm (see Eq. 5) into a set of lower-rank factors, eg $G_i$ and $S_{i,j}$. Finally, the whole matrix is reconstructed by multiplying matrices $G_i$ and $S_{i,j}$, thus revealing new associations.

The values of the matrix $R$ express the strength of the relationships between objects, and they correspond to numbers between 0 and 1, with 0 meaning no known relationships.

On the other hand, the constraint matrices $H$ can be expressed as a diagonal block set, $H^{(t)}$, where $t = 1, 2, \ldots, \max_i t_i$ denotes the possible multiplicity of relationships of the same type, which can be derived from $t_i$ different knowledge sources (corresponding to different $H_i$ matrices of the same object). For example, coexpression may be measured through different types of experiments.

$$H^{(t)} = \mathrm{Diag}\left(H_1^{(t)}, H_2^{(t)}, \ldots, H_r^{(t)}\right) \tag{2}$$

Differently from the $R_{ij}$ matrices, $H_i$ values vary between −1 and 1, expressing the dissimilarity between elements of the same object types, so that −1 means full similarity while 1 is full dissimilarity.

Once the data are represented as matrices, the trifactorization algorithm jointly factorizes the matrices $R_{ij}$ using the matrices $H_i$ as constraints. First of all, a set of design parameters $k_i \ll n_i$ is defined for each object. These parameters, also called ranks, define the dimension of the latent factors for the $i$th object type with the objective of revealing hidden structure in the data. This is a crucial step in the algorithm, since wrongly assigned ranks may lead to overfitting (if too big), or may not be able to capture all the information (if too small). There is no general consensus about how to select these values, and different approaches can be applied [21,26,30]. In this work, we opted for an empirical approach (see Selection of initial parameters section).

After rank selection, each block of the matrix $R$ is factorized in 2 lower-rank block matrices, $G$ and $S$ (Figure 1), as follows:

$$G = \mathrm{Diag}\left(G_1^{n_1 \times k_1}, G_2^{n_2 \times k_2}, \ldots, G_r^{n_r \times k_r}\right) \tag{3}$$

$$S = \begin{pmatrix} * & S_{12}^{k_1 \times k_2} & \cdots & S_{1r}^{k_1 \times k_r} \\ S_{21}^{k_2 \times k_1} & * & \cdots & S_{2r}^{k_2 \times k_r} \\ \vdots & \vdots & \ddots & \vdots \\ S_{r1}^{k_r \times k_1} & S_{r2}^{k_r \times k_2} & \cdots & * \end{pmatrix} \tag{4}$$

Thanks to the joint factorization, information is spread out from and to the different relation matrices, so that the missing elements are predicted on the basis of the multiplication of elements of much lower rank.

The matrices $S$ and $G$ are reconstructed by minimizing the following objective function:

$$\min_{G \ge 0} J(G; S) = \sum_{R_{ij} \in R} \left\| R_{ij} - G_i S_{ij} G_j^{T} \right\|^2 + \sum_{t=1}^{\max_i t_i} \mathrm{tr}\left(G^{T} H^{(t)} G\right) \tag{5}$$

where $\|\cdot\|$ is the Frobenius norm and $\mathrm{tr}(\cdot)$ the trace of the matrix. The procedure adopted to solve Eq. 5 starts with a random initialization of $G$; next, the $G$ and $S$ matrices are iteratively updated until convergence (proof of convergence and details in references 23 and 30). Details on the adopted procedure to solve the optimization problem are provided in Supplementary Material Method S1.

The algorithm described above has been adapted to calculate the similarity between objects of the same type: in this article, we are interested in the object "patients." Looking more closely at the approximation $R_{ij} \approx G_i S_{ij} G_j^{T}$, we can notice that matrix $G_i$ is shared by all blocks that are related to the object type $o_i$ (in our case patients), while $S_{ij}$ is specific to the relations between the objects $o_i$ and $o_j$. Since $G_i$ is an $n_i \times k_i$ matrix, the rows correspond to the elements of the $i$th object type (in our case the different patients), while the columns represent $k_i$ groups. Therefore, each element can be interpreted as the degree of membership of each patient (row) to each group (column), and we can assign an element (ie a patient) to the group (ie to a cluster) with the largest value, that is the column with the maximum value for the corresponding row.

Since the optimization strategy strongly depends on the initialization (ie the selection of the dimension of the $k$ parameters), we averaged the results over 10 applications to obtain a final consensus matrix ($\hat{C}$), which is calculated as the element-wise average of the connectivity matrices. In our case, for example, a consensus matrix element showing a value of 0.5 means that 5 times out of 10, in the connectivity matrix, the 2 patients corresponding to the row and column indexes of the element ended up grouped together.

Patient and external knowledge-based data sets

Table 1 reports the data sources used in this work. In detail, we propose to integrate patient data (extracted from TCGA) and several publicly available knowledge bases to extract meaningful patient similarities.
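The cluster assignment and consensus matrix described in the algorithm overview can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function names `connectivity_matrix` and `consensus_matrix` are hypothetical, and the patient membership factors are replaced by random non-negative matrices standing in for 10 independent runs.

```python
import numpy as np

def connectivity_matrix(G):
    """Binary co-clustering matrix from one membership factor G (n x k)."""
    # Each row (patient) is assigned to the column (group) with the largest value.
    labels = np.argmax(G, axis=1)
    # Entry (a, b) is 1 when patients a and b fall in the same group, else 0.
    return (labels[:, None] == labels[None, :]).astype(float)

def consensus_matrix(factors):
    """Element-wise mean of the connectivity matrices over repeated runs."""
    return np.mean([connectivity_matrix(G) for G in factors], axis=0)

# Hypothetical stand-ins for the patient factor obtained from 10 random restarts
# (6 patients, 3 groups); a real run would use the factors from the optimizer.
rng = np.random.default_rng(0)
runs = [rng.random((6, 3)) for _ in range(10)]
C_hat = consensus_matrix(runs)
```

With 10 runs, every entry of `C_hat` is a multiple of 0.1, which matches the reading given in the text: a value of 0.5 means the two patients were grouped together in 5 runs out of 10.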
Table 1. Patient data and external knowledge sources

| Data type | Retrieved data | Resource |
|---|---|---|
| Patient data on acute myeloid leukemia | | |
| 1) Clinical data (200 patients) | Gender, age, vital status: dead or alive, days to death (if dead), days to birth, days to last follow-up, date of the diagnosis | TCGA (c) [39] |
| 2) Somatic mutations (195 patients) | 1620 mutations associated with 428 genes | TCGA (c) [39] |
| 3) Gene expression profiles (197 patients) | 22 578 genes (8897 after filter application) | TCGA (c) [39] |
| External knowledge data sources | | |
| 4) Gene-gene interactions | Starting from the 186 genes involved in AML (extracted from MalaCards) and the 428 genes associated with the mutations, we extracted 37 811 first-degree gene-gene interactions between 8897 unique genes. | BioGRID (d) [40] |
| 5) Gene-pathway associations | 3202 associations between the 8897 genes and 383 KEGG pathways | KEGG (e) [41] |
| 6) Disease-disease relationships | 35 201 associations between 6402 unique diseases. | DO (f) [42] |
| 7) Disease-gene associations | 1925 associations between 6402 diseases and 278 genes. | DisGeNET v4.0 |
| 8) Disease-pathway relations | 605 associations between the 6402 diseases and the 383 pathways. | KEGG (e) [41] |

a We listed the clinical variables used in this work.
b Data have been accessed in November 2016.
c The Cancer Genome Atlas. Data have been retrieved by using the cBioPortal for Cancer Genomics [43], last updated 05/31/16.
d Biological General Repository for Interaction Datasets, Release 3.4.142.
e Kyoto Encyclopedia of Genes and Genomes, Release 80.0.
f Disease Ontology, Release 2016-01-07.

The AML TCGA cohort was used (1) to generate the data set for a simulated study and (2) to validate the proposed approach on a real data set. Gene expression data were normalized by using the robust multichip average (RMA) normalization method. Mutation data were analyzed using the software PaPi [45]. PaPi is a machine-learning approach to classify and score human coding variants by estimating the probability that they damage their protein-related function [45]. Each of the 1620 mutations gets a score (probability) between 0 (ie tolerated mutation) and 1 (ie damaging mutation). A list of 186 genes involved in AML was selected from MalaCards [46]. This list was subsequently integrated with the 428 genes associated with the mutation data (Table 1). We finally retained the gene expression and mutation data for that list of 8900 genes and for the genes extracted as first-degree interactors in BioGRID [40].

Matrix definition

We represented all the data sources (Table 1) in the form of relation matrices, each one formalizing the known associations between pairs of objects. In detail, the considered objects are: $o_1$ clinical data; $o_2$ mutations; $o_3$ genes; $o_4$ pathways; $o_5$ diseases; and $o_6$ patients. According to the integration algorithm and in order to make the matrices comparable, associations between objects of different types ($R_{ij}$ matrices) were rescaled in the interval [0, 1]. On the other hand, associations between objects of the same type ($H_i$ matrices) were rescaled in the interval [−1, 1]. Details on data processing are provided in Table 2. Matrix data are available at https://gitlab.com/smarini/MaDDA.

Table 2. Matrix $R_{ij}$ and $H_i$ construction

| Matrix | Relation | Matrix value definition |
|---|---|---|
| $R_{16} = R_{61}^{T}$ | Clinical data-patient** | We used 4 clinical variables (ie rows) for each patient (ie column), that is Female, Male, Age, and Survival. The gender variable was used to create 2 rows, that is Female and Male, for each patient, whose value was set to 0 or 1 according to the gender. The age field was used to define the Age row, whose values are given by $\mathrm{age}(i) = \frac{a_i - a_{\min}}{a_{\max} - a_{\min}}$ (6), where $a_i$ is the $i$th patient's age at the time of death or at the last follow-up, while $a_{\min}$ and $a_{\max}$ are the cohort minimum and maximum ages, respectively. Finally, a Survival row for each patient was obtained by dividing the patients into alive and deceased individuals. Since the majority of the patients died within 1 year from the diagnosis date, we considered the survival $S$ at 1 year, computed as $S(i) = \begin{cases} 1 & \text{if } d_i > 365 \text{ days} \\ \frac{d_i - d_{\min}}{d_{\max} - d_{\min}} & \text{if } d_i < 365 \text{ days and vital status = deceased} \\ 0 & \text{if } d_i < 365 \text{ days and vital status = alive} \end{cases}$ (7), where $d_i$ is the number of days between the diagnosis and the death date or the follow-up date of the patient $i$; $d_{\min}$ and $d_{\max}$ are the global minimum and maximum number of days between the diagnosis and death date, respectively. A value of 0 means that it is unknown whether the patient is alive or deceased 1 year after the diagnosis, since the last follow-up date occurred before that time. This yields values ranging between [0, 1]. |
| $R_{23} = R_{32}^{T}$ | Mutation-gene | Mutations mapped to their respective genes. |
| $R_{26} = R_{62}^{T}$ | Mutation-patient | PaPi scores evaluating the harmfulness of each mutation, per patient. |
| $R_{34} = R_{43}^{T}$ | Gene-disease | We used DisGeNET data associating genes and diseases. Since the DisGeNET score provided is already in the interval [0, 1], no further processing was required. |
| $R_{35} = R_{53}^{T}$ | Gene-pathway* | Genes mapped to KEGG pathways. Presence/absence of a gene in a pathway determines its binary 1/0 value. |
| $R_{36}$ | Gene-patient | Gene-patient values correspond to the sum (mutational burden) of the PaPi scores of all the mutations associated with a specific gene. The obtained values were then rescaled between [0, 1]. |
| $R_{45} = R_{54}^{T}$ | Disease-pathway* | Matrix values correspond to 0 or 1 according to the information extracted from KEGG about all the diseases altering each KEGG pathway. |
| $R_{46} = R_{64}^{T}$ | Disease-patient | Rows of this matrix represent the diseases and columns the 200 patients. The values of the row corresponding to acute myeloid leukemia (DOID: 9119) were set to 1, indicating the association. |
| $R_{63}$ | Patient-gene | Gene expression data from TCGA were used as matrix values according to the formula $\mathrm{Expression}(i, j) = \frac{e_{i,j} - e_{\min}}{e_{\max} - e_{\min}}$ (8), where $e_{i,j}$ is the patient $i$ expression of the gene $j$, while $e_{\min}$ and $e_{\max}$ are the global minimum and maximum values of the gene $j$. |
| $H_1$ | Clinical data-clinical data | The rows and the columns of this matrix are Age, Female, Male, and Survival. The diagonal values are −1 (ie fully associated). Male and female association is 1 (ie not associated). |
| $H_2$ | Mutation-mutation | No assumption was made on the mutation similarity, besides considering each mutation similar to itself. |
| $H_3$ | Gene-gene | Gene-gene interactions from BioGRID. The raw data needed a preprocessing step, since for each gene pair the related associations may appear multiple times (corresponding to different kinds of interaction, eg direct physical binding, genetic interaction). For this reason, denoting with $x$ the number of times a certain pair appears, its score was determined by $f(x) = -1 + \frac{\ln(x)}{2 \ln(x_{\max})}$ (9). |
| $H_4$ | Disease-disease | The similarity between 2 diseases is set to $0.8^{n}$, where $n$ is the length of the shortest path between corresponding terms in the DO (ie the minimum number of steps to reach one disease from the other one). |
| $H_5$ | Pathway-pathway | KEGG relations were used to measure binary pathway similarity. Also, each pathway is considered similar to itself. |
| $H_6$ | Patient-patient | No assumption was made on the patient similarity, besides considering each patient similar to itself. |

* Data have been automatically extracted by using Python Bioservices 1.4.8 [47].
** In case of missing data, the value of the element was set to 0 (ie unknown).

Selection of initial parameters

A crucial step in the factorization algorithm is the selection of the input ranks $k_i$. All the ranks, except the one related to the patient object ($k_6$), were computed resorting to an empirical rule proposed in reference 29 (Eq. 10).

$$k_i = \frac{1}{200} \cdot \frac{nnz_i^{rows} + nnz_i^{cols}}{2}, \quad i \neq 6 \tag{10}$$

where $nnz_i$ are the nonzero elements of the object $i$ counted on the rows ($nnz_i^{rows}$) and on the columns ($nnz_i^{cols}$), respectively.

On the other hand, the patient rank $k_6$, corresponding to the number of expected clusters, is computed following the approach presented in reference 21. We applied a grid search to select $k_6$, since Eq. 10 would provide a very high rank (ie a high number of patient clusters) due to the low sparsity of the gene/patient relationships. Different values of $k_6$ from a predefined interval are used as inputs to the integration algorithm, and the results are compared in terms of their dispersion coefficient, the larger the better (Eq. 11):

$$\rho\left(\hat{C}^{(k_6)}\right) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} 4 \left[\hat{C}_{ij}^{(k_6)} - \frac{1}{2}\right]^2, \quad k_6 \in l = [3, 5, 10, 20] \tag{11}$$

where $k_6$ is a patient rank value from the list $l$, and $\hat{C}^{(k_6)}$ is the consensus matrix (see An overview of the trifactorization algorithm section) computed by using $k_6$. The rank $k_6$ obtaining the highest $\rho(\hat{C}^{(k_6)})$ value was used as rank input for the proposed approach. This procedure has been used for both the synthetic and the real data sets.

Simulated data set construction

To evaluate the performance of the proposed approach, we generated different synthetic data sets with the same size as the real one (Table 1).

By varying 2 simulation parameters listed in Supplementary Material Table S1, we created 25 data sets, differing in the amount of added noise, missing data, degrees of patient similarity, and data sparsity. For each scenario, we simulated 200 virtual patients grouped into 5 clusters with known similarity structures. The simulation process is as follows:

1. Patient clinical data, gene expression levels, and the PaPi score mutation data are used to construct a patient similarity matrix by computing the Euclidean distance (ED) between patients. Next, we selected the 5 least similar patients according to the mean ED values.
2. For each patient P of the 5 patients identified in (1), 39 'virtual patients' are generated by adding Gaussian noise with mean of zero and variance equal to one half of the population variance to all of P's clinical data (excluding the gender variable). The gender is assigned with a probability of .9 to be the same as P. Finally, gene expression levels and PaPi mutation scores of a new simulated patient $j$ are obtained according to Eq. 12.

$$D_i^{j} = D_i \pm CV \cdot m_i, \quad 1 < j < 39 \tag{12}$$

where $D_i^{j}$ is the simulated patient $j$'s value of the object $i$ (gene or mutation), $D_i$ is the corresponding value in the reference data set, $m_i$ is the mean value of an element in the population of the object $i$ (eg mean expression value of a gene), and $CV$ is the coefficient of variation.

3. A percentage $d$ of sparsity is added to all the simulated data sets in order to test the robustness of the approach to sparse data.

With this procedure, we generated the 25 simulated data sets by varying the CV parameter and the data sparseness as reported in Supplementary Material Table S1. The obtained data were then integrated with the external knowledge (Table 1) in order to apply the proposed algorithm.

Result evaluation of synthetic data

We evaluated the performance of the trifactorization algorithm on the 25 simulated data sets by measuring the mean absolute error (MAE), defined as:

$$MAE = \frac{1}{N} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| C_{ij}^{expected} - \hat{C}_{ij} \right| \tag{13}$$

where $n$ and $m$ are the number of matrix rows and columns, $N$ is the total number of matrix elements, $\hat{C}_{ij}$ is the estimated consensus matrix in the position $ij$, and $C_{ij}^{expected}$ is the ideal consensus matrix (ie each simulated patient is clustered with its corresponding real patient, for a total of 5 clusters of 40 patients).

Result comparison with other techniques

In addition, using the simulated data sets, we assessed and compared the proposed approach with 2 widely used methods, that is principal component analysis (PCA) and the ED measure. The results were also compared to 2 more advanced techniques to integrate heterogeneous data: (1) a deep learning method based on restricted Boltzmann machines, that is multimodal deep belief networks (MDBN) [20,52], and (2) a GFA [36], based on factor analysis. Unlike the matrix trifactorization algorithm, these methods are not designed to include external knowledge sources associating entities not related to the prediction target (ie gene-gene interactions do not involve patients). These methods were therefore applied considering only the patient-related data ($R_{1,6}$, $R_{2,6}$, $R_{3,6}$, $R_{6,3}$; Table 2). Details on the procedures used to apply the PCA, ED, MDBN, and GFA methods are provided in Supplementary Material Method S1.

In order to evaluate their performances, MAE was computed by replacing in Eq. 13 the $\hat{C}_{ij}$ matrix with the similarity matrix $Sim_{ij}$ obtained by applying the PCA, ED, MDBN, or GFA method. In this case, $Sim$ is built by setting its elements $Sim_{ij} = 1$ if the patient $i$ and the patient $j$ belong to the same cluster, and $Sim_{ij} = 0$ otherwise.

Validation case study of acute myeloid leukemia

We investigated the biological relevance of the patient similarities uncovered by the proposed methodology focusing on AML. AML is a myeloid neoplasm related to an uncontrolled proliferation of white blood cells. It is the most common leukemia in adult patients. AML is curable in 35% to 40% of patients younger than 60 years of age, while for older patients the percentage decreases to 5% to 10% [53,54]. Moreover, AML is a heterogeneous disease [55] and, over the past 15 years, its molecular heterogeneity has become apparent, thanks to -omics technologies. The analysis of public data on AML combined with external knowledge sources may reveal novel insights into AML patient profiles to discriminate between good or bad responders and suggest tailored therapies. In order to test our approach, we collected patient data from The Cancer Genome Atlas (TCGA), and integrated them with external knowledge sources (Table 1). Collected data were then transformed into a matrix format according to Table 2.

Figure 2. Data sources and matrix representation. The figure shows the number of rows of each matrix constructed starting from the patient-related data and the external knowledge. $N_{ij}$ corresponds to the number of nonzero elements in the matrix. Matrix data are available at https://gitlab.com/smarini/MaDDA/tree/master/Patient_similarity_TCGA/matrices.

The resulting matrices (Figure 2) were used to construct the $R$ and $H$ block matrices providing the inputs for the factorization algorithm. Note that the $R$ block matrix will be symmetrical except for the $R_{36}$ and $R_{63}$ blocks, since these 2 matrices were obtained by using 2 different knowledge sources (ie mutation and gene expression data).

All the object ranks except the patient one were initialized as in Eq. 10, while the patient rank $k_6$ was initialized to the value resulting from a grid search on the dispersion coefficient as in Eq. 11.

The convergence of the algorithm was then monitored by measuring the objective function (Eq. 5). The algorithm stopped when the difference between 2 consecutive norms was under the threshold $10^{-5}$. The number of repetitions $n$ was set to 10, in order to reduce the effect of the initialization. The resulting consensus matrix $\hat{C}$ was used as a similarity matrix to extract patient-patient similarities.

Validation of real data results

To validate the results on the real data set, we further applied the proposed algorithm on the same data set (Figure 2) but excluding the Survival column (ie $n_1 = 3$), which corresponds to what we want to estimate. We applied a hierarchical clustering on the resulting patient similarity matrix, where the linkage was determined using the complete link method. The survival curves were estimated by the Kaplan-Meier method and were compared using the log-rank test.

As for the simulated data sets, we assessed and compared the real data results of the proposed methodology with the ones obtained by applying the PCA, ED, MDBN, and GFA to the patient-related data. As for the trifactorization approach, a hierarchical clustering using the complete link method was applied to the similarity matrices resulting from these techniques. The different performances were finally evaluated by comparing survival curves of the clusters of patients identified.

RESULTS

Simulation study

To evaluate the performance of the proposed approach, we produced 25 synthetic data sets as described in the Simulated data set construction section.

The $k$ parameters (Selection of initial parameters section) were initialized following Eq. 10 for all the objects except the patient rank ($k_6$). $k_6$ was defined through a grid search (Eq. 11) on the simulated data set with 10% of missing data and noise between 0% and 10% (Supplementary Material Table S1). $k_6 = 10$ resulted as the rank with the highest values of $\rho$ (Supplementary Material Table S2). The proposed algorithm was therefore applied to all the 25 data sets by selecting as input $k_6 = 10$, while the other object ranks were computed according to Eq. 10.

The trifactorization performances on simulated data were evaluated by comparing the algorithm's MAE (Eq. 13) with the MAEs obtained by applying the PCA, ED measure, MDBN, and GFA on the same data sets. The results are shown through heatmaps in Figure 3 (details in Supplementary Material Table S3). These confirmed that the best performance is achieved with the trifactorization algorithm, and our approach showed its robustness in the presence of different similarity structures and missing data.

Figure 3. Heatmaps of the mean absolute error (MAE). MAEs were obtained by applying: A, the proposed trifactorization algorithm; B, principal component analysis (PCA); C, Euclidean distance; D, multimodal deep belief network (MDBN); E, group factor analysis (GFA) to the 25 synthetic data sets. The simulated data sets were constructed by considering different percentages of missing data (ie heatmap columns) and noise (ie heatmap rows). The noise was added based on the coefficient of variation (CV) (Eq. 12). The map clearly shows smaller MAEs for the proposed approach. The trifactorization gave results less sensitive to sparse data (ie high number of missing data).

Validation study: patient similarity in acute myeloid leukemia

We investigated the patient-patient similarities uncovered by the trifactorization algorithm by applying it to a real case, that is considering the AML data sources reported in Figure 1. The rank parameters were computed as in Eq. 10 for all the objects except the patient rank, which was set to $k_6 = 5$ (ie the rank with the highest $\rho$ values; see Eq. 11). The other input ranks obtained were: $k_1 = 2$,

and different blood tests for different patients are sparse along a mostly uneventful time line).

"large m small p" problems, is the fact that the number of records (predictors) can be orders of magnitude bigger than samples (observations).
For example, a gene panel can measure the expression of thousands of genes, but cohorts are usually tens to hundreds of patients. A number of approaches have been recently proposed to data in- tegration and to bring out intrinsic characteristics of the data. In particular, a specific class of methods rely on the dimension reduc- tion of the data through the definition of metafeatures that allows the projection of the data into a low dimensional space. This prop- erty can be used in a machine learning framework in order to cap- ture the hidden interaction effects between variable. State-of-the-art algorithms are typically based on the metafea- 20 19 ture extraction method, or the patient distance measure. They also leverage on a small set of different data types, mostly gene ex- 15–17,19,20,22 19,20 15 pression, methylation, copy number variation, 17 16 18 protein interactions, clinical data, and diseases. Gligorijevic et al, used also drug information since their aim was to reposition Figure 4. Consensus matrix obtained by applying the trifactorization algo- drugs for subgroups of patients. rithm to the acute myeloid leukemia (AML) data. Matrix rows and columns In this work, we proposed a framework based on matrix trifacto- correspond to the 200 patients, and it was constructed by considering the rization that integrates several source of data and knowledge with resulting G matrix (An overview of the factorization algorithm section). The the aim of predicting patient similarity. The proposed approach matrix shows 2 groups of patients clustered together (the corresponding den- drogram is reported in Supplementary Material Figure S1). combines several more data and external knowledge sources respect to other methods recently developed. In detail, our approach pro- vides a comprehensive framework for the prediction of patent simi- k ¼ 14; k ¼ 480, k ¼ 13, k ¼ 19. 
The consensus matrix 2 3 4 5 larity by fusing both clinical and multiomics data of patients, and obtained by applying this procedure is shown in Figure 4. automatically integrating them with external knowledge sources (eg The result validation was then conducted by applying the pro- gene-gene interactions, disease-disease associations). posed algorithm to same data set but not containing the survival in- The findings obtained by applying the trifactorization algorithm formation (Result validation section). The trifactorization revealed on synthetic data sets showed better performances if compared to 2 major groups that we labeled G and G (dendrogram reported in 1 2 other traditional methods (Figure 3). Moreover, this novel strategy Supplementary Material Figure S1). provides more resistance to sparsity and noise, which is ubiquitous In order to validate the obtained clusters, we plotted the Kaplan- in biological data. The application of the trifactorization algorithm Meier survival curves of the 2 patients’ groups, as reported in to the real AML data set underlined 2 big groups of similar patients Figure 5A. The survivals of the 2 groups were clearly different (log (Figure 4). To further confirm this finding, we searched for the ex- rank P-value ¼ .0159) with G indicating a better prognosis clusive gene signatures characterizing each group, in order to investi- than G . gate the presence of a molecular mechanism distinguishing the 2 In addition, we compared our results by performing on the AML groups. A univariate analysis of gene expression level did not pro- data set the PCA, ED, MDBN, GFA methods. In Figure 5 are shown vide significant differences (data not shown). 
On the other hand, by all the survival curves obtained with these approaches and no statis- analyzing mutation data, we extracted a common gene signature tical difference between the 2 groups was found (ED P-val- consisting of 19 genes mutated in at least one patient in both groups ue ¼ .7763, PCA P-value ¼ .6278, MDBN P-value ¼ .2954, and (Supplementary Material Table S4). As expected, these genes GFA P-value ¼ .1652). Taken together, these analyses demonstrated resulted significantly enriched for ‘leukemia’ Online Mendelian In- the capacity of the proposed method to add value in integrating dif- heritance in Man (OMIM), annotation (Table 3; Fisher’s exact ferent knowledge sources and provide patient-patient similarities for P-value ¼ .00245;—enrichment results shown in Supplementary personalized medicine. Material Table S5). In addition, we further searched for the genes whose mutation frequency differs significantly between the 2 groups [Fisher’s Exact test, False Discovery Rate (FDR) adjusted P-value]. DISCUSSION We found 5 genes, namely IDH1 (P ¼ .00022), NPM1 (P ¼ 3.54867e14), NRAS (P ¼ .00145), PTPN11 (P ¼ .01), TET2 The availability of increasingly larger amount of biomedical data is (P ¼ .00015). Mutations associated with all these genes have been pushing researchers and companies to investigate useful information 61–65 found to be highly related to AML development and progression. by combining data from difference resources, with the aim of unveil- These findings confirm both the 2 identified groups characterized by a ing hidden knowledge on patients, diseases and therapies. However, common AML gene signature. Finally, we identified gene signatures biomedical data are characterized by high complexity, heterogene- 58 59 58 discriminating the 2 groups by retrieving the mutations present ex- ity, sparsity, and “large m small p” problems. 
Sparsity, is a clusively either in patients of G (better prognosis) or the G (good characteristic of matricial/graph representations where most of the 1 2 prognosis). The gene signatures resulted in 342 (Supplementary Ma- elements are null, that is zeros/absent links. This particularly true terial Table S6) and 52 (Supplementary Material Table S7) genes for for genomic data (eg in a gene network, the number of links con- G and G group, respectively. An OMIM enrichment analysis con- necting the nodes will be orders of magnitude smaller than the num- 1 2 ducted on the 2 gene signatures retrieved the term ‘leukemia’ only ber of nodes), or temporal data (eg hospitalizations, complications Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 JAMIA Open, 2018, Vol. 0, No. 0 9 Figure 5. Survival curves. Survival curves corresponding to the clusters obtained by the trifactorization algorithm (A), principal component analysis (PCA) (B), Eu- clidean distance (ED) measure (C), multimodal deep belief network (MDBN) (D), and group factor analysis (GFA) (E). All the plots report the P-value (P) resulting from the log-rank test. Only the 2 clusters obtained with the proposed approach have statistically significant survival curves (trifactorization P-value ¼ .01; ED P-value ¼ .7763; PCA P-value ¼ .6278; MDBN P-value ¼ .2954; GFA P-value ¼ .1652). for G (Table 3; Fisher’s exact P-value ¼ .0175, complete results in involve associations between clinical data, mutations, genes, dis- Supplementary Material Tables S8 and S9), coherently with the hy- eases, pathways, and patients to extract patient similarity. The same pothesis the second group, associated with a poor survival progno- approach can be applied to classify novel patients and, the results sis, presents a more aggressive instance of AML. 
Interestingly, the might be used to suggest potential tailored treatments based on groups are not significantly different for sex (P ¼ .2594, Fisher’s Ex- successful drug treatments extracted from the patient’s clinical his- act test) or age (P ¼ .0884, Wilcoxon rank sum test), indicating that tories. Moreover, further improvements and novel features can be our unsupervised method did not simply discriminated obvious pa- provided by directly including into the trifactorization framework 66 67 tient subgroups, but was able to mine more subtle differences char- data sources involving drugs (ie DrugBank, PharmGKB, ). In acterizing poor-vs-good prognoses. this way, by selecting as target the patient-drug matrix, the algo- The current study showed is power in discovery patient-patient rithm could predict targeted therapies for specific profiles of similarity by integrating several data and knowledge sources that patients. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 10 JAMIA Open, 2018, Vol. 0, No. 0 Table 3. OMIM enrichment results OMIM term Significant P-value (Fisher’s exact test) OMIM enrichment analysis of the genes associated with Leukemia .00246 Common mutations between G1 and G2 Bardet-biedl_syndrome .026431 Mutations of G1 Macular_degeneration .037245 Leukemia .017554 Mutations of G2 Long_qt_syndrome .030766 Fibrosis .030766 Cone-rod_dystrophy .038311 Thyroid_carcinoma .043309 Note: Significant P-values (P < .05). The full enrichment results for the genes associated with common mutations between G1 and G2, mutations of G1, and mutations of G2 are respectively shown in Supplementary Material Tables S5, S8, and S9. OMIM: Online Mendelian Inheritance in Man. The main limitations of the proposed approach are 2-fold: Conflict of interest statement. None declared. 
On the one hand, it is necessary to represent data and knowledge in terms of bidimensional relation matrices. This often requires SUPPLEMENTARY MATERIAL flattening concept hierarchies and graphs, thus generating matri- ces of very high dimensionality. The computational burden of Supplementary material is available at Journal of the American the optimization algorithm is highly dependent on the number of Medical Informatics Association online. data sources and on the size of the relational matrices, in particu- lar when such matrices are not sparse. On the other hand, the method has a number of crucial design parameters, the ranks, which requires fine tuning in order to obtain ACKNOWLEDGMENTS a compromise between the quality of matrix reconstruction and the We thank Ivan Limongelli for the help in organizing genome data. We are need of representing data with low dimensionality latent vectors. also grateful to Marco Piastra, Gianluca Gerard and Stefano Montoli for the analysis carried on with MDBN. We sincerely acknowledge Blaz Zupan and Nevertheless, we believe that the proposed approach represents a Marinka Zitnik for their original ideas, and for their support and training in powerful and interesting strategy to deal with data and knowledge applying trifactorization methods. fusion for similarity calculation, which may provide advantages in precision oncology applications. Finally, the proposed approach is particularly suitable for preci- REFERENCES sion oncology due to the high number of available data sources on 1. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J cancer, but it can be easy applied to other fields and diseases. Med 2015; 372 (9): 793–5. 2. Lu YF, Goldstein DB, Angrist M, Cavalleri G. Personalized medicine and human genetic diversity. Cold Spring Harbor Perspect Med 2014; 4 (9): a008581. CONCLUSION 3. Chin L, Gray JW. Translating insights from the cancer genome into clini- cal practice. Nature 2008; 452 (7187): 553–63. 
In this work, we analyzed the problems related to the application of pre- 4. Sparano JA, Paik S. Development of the 21-gene assay and its application cision oncology. Literature evidence seems to show how a traditional in clinical practice and clinical trials. J Clin Oncol 2008; 26 (5): 721–8. clinical trial approach looking for biomarker-specific patient subgroups 5. Parker JS, Mullins M, Cheang MC. Supervised risk predictor of breast is particularly hard to implement due to heavy, intrinsic statistical limi- cancer based on intrinsic subtypes. J Clin Oncol 2009; 27 (8): 1160–7. tations. We have shown how the problem of subgroup identification 6. Pellagatti A, Benner A, Mills KI, et al. Identification of gene can be solved with a data fusion approach exploiting relations of hetero- expression-based prognostic markers in the hematopoietic stem cells geneous, multidimensional data sources. In our application, we consid- of patients with myelodysplastic syndromes. J Clin Oncol 2013; 31 (28): ered objects as diverse as clinical data, diseases, genes, mutations, 3557–64. 7. Meric-Bernstam F, Brusco L, Shaw K, et al. Feasibility of large-scale geno- pathways, and the patients themselves. Thanks to a latent feature repre- mic testing to facilitate enrollment onto genomically matched clinical tri- sentation via matrix trifactorization, we were able to identify clinically als. J Clin Oncol 2015; 33 (25): 2753–62. meaningful patient subgroups. The approach showed better perfor- 8. Group E-ACR. Executive Summary: Interim Analysis of the NCI- mance when compared with standard clustering strategies. Future MATCH Trial. Secondary Executive Summary: Interim Analysis of the works will deal with optimizing the factorization strategy and to pro- NCI-MATCH Trial. 2016. http://ecog-acrin.org/nci-match-eay131/in- vide automated explanations of the results obtained. terim-analysis. Accessed June 2016. 9. Tredan O, Corset V, Wang Q, et al. 
Routine molecular screening of ad- vanced refractory cancer patients: An analysis of the first 2490 patients of the ProfiLER study. J Clinical Oncol 2017; 35 (18_suppl): LBA100. FUNDING 10. Le Tourneau C, Delord JP, Goncalves A, et al. Molecularly targeted ther- This work was supported by the 3 project grant no. 2015-0042 “Genomic apy based on tumour molecular profiling versus conventional therapy for profiling of rare hematologic malignancies, development of personalized med- advanced cancer (SHIVA): a multicentre, open-label, proof-of-concept, icine strategies, and their implementation into the Rete Ematologica Lom- randomised, controlled phase 2 trial. Lancet Oncol 2015; 16 (13): barda (REL) clinical network.” 1324–34. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 JAMIA Open, 2018, Vol. 0, No. 0 11 11. Prasad V, Vandross A. Characteristics of exceptional or super responders 34. Khan SA, Lepp€ aaho E, Kaski S. Bayesian multi-tensor factorization. Mach to cancer drugs. Mayo Clin Proc 2015; 90 (12): 1639–49. Learn 2016; 105 (2): 233–53. 12. Biankin AV, Piantadosi S, Hollingsworth SJ. Patient-centric trials for ther- 35. Virtanen S, Klami A, Khan AK, Kaski S. Bayesian group factor analysis. apeutic development in precision oncology. Nature 2015; 526 (7573): In: Artificial Intelligence and Statistics. La Palma, Canary Islands: 361–70. AISTATS; 2012. 13. Sun J, Wang F, Hu J, Edabollahi S. Supervised patient similarity measure 36. Klami A, Virtanen S, Lepp€ aaho E, Kaski S. Group factor analysis. IEEE of heterogeneous patient records. ACM SIGKDD Explor Newsl 2012; 14 Trans Neural Netw Learn Syst 2015; 26 (9): 2136–47. (1): 16–24. 37. Wang Y-X, Zhang Y-J. Nonnegative matrix factorization: a comprehen- 14. Brown SA. Patient similarity: emerging concepts in systems and precision sive review. IEEE Trans Knowl Data Eng 2013; 25 (6): 1336–53. medicine. 
Front Physiol 2016; 7: 561. 38. Pinero J, Bravo A, Queralt-Rosinach N, et al. DisGeNET: a comprehen- 15. Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic sive platform integrating information on human disease-associated genes data types using a joint latent variable model with application to breast and variants. Nucleic Acids Res 2017; 45 (D1): D833–d39. and lung cancer subtype analysis. Bioinformatics (Oxford, England) 39. Hudson TJ, Anderson W, Aretz A, et al. International network of cancer 2009; 25 (22): 2906–12. genome projects. Nature 2010; 464 (7291): 993. 16. Ow GS, Tang Z, Kuznetsov VA. Big data and computational biology strat- 40. Chatr-Aryamontri A, Breitkreutz B-J, Oughtred R, et al. The BioGRID in- egy for personalized prognosis. Oncotarget 2016; 7 (26): 40200–20. teraction database: 2015 update. Nucleic Acids Res 2014; 43 (D1): 17. Xu T, Le TD, Liu L, Wang R, Sun B, Li J. Identifying cancer subtypes D470–D78. from miRNA-TF-mRNA regulatory networks and expression data. PLoS 41. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. One 2016; 11 (4): e0152792. Nucleic Acids Res 2000; 28 (1): 27–30. 18. Girardi D, Wartner S, Halmerbauer G, Ehrenmu ¨ ller M, Kosorus H, Drei- 42. Kibbe WA, Arze C, Felix V, et al. Disease Ontology 2015 update: an ex- seitl S. Using concept hierarchies to improve calculation of patient similar- panded and updated database of human diseases for linking biomedical ity. J Biomed Inform 2016; 63: 66–73. knowledge through disease data. Nucleic Acids Res 2014; 43 (D1): 19. Wang B, Mezlini AM, Demir F, et al. Similarity network fusion for aggre- D1071–D78. gating data types on a genomic scale. Nat Methods 2014; 11 (3): 333–7. 43. Gao J, Aksoy BA, Dogrusoz U, et al. Integrative analysis of complex can- 20. Liang M, Li Z, Chen T, Zeng J. Integrative data analysis of multi-platform cer genomics and clinical profiles using the cBioPortal. 
Sci Signal 2013; 6 cancer data with a multimodal deep learning approach. IEEE/ACM Trans (269): pl1. Comput Biol and Bioinf 2015; 12 (4): 928–37. 44. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries 21. Gligorijevic V, Malod-Dognin N, Przulj N. Patient-specific data fusion for of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003; 31 (4): e15. cancer stratification and personalised treatment. Pac Symp Biocomput 45. Limongelli I, Marini S, Bellazzi R. PaPI: pseudo amino acid composition to 2016; 21: 321–32. score human protein-coding variants. BMC Bioinformatics 2015; 16 (1): 123. 22. Planey CR, Gevaert O. CoINcIDE: a framework for discovery of patient 46. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated subtypes across multiple datasets. Genome Med 2016; 8 (1): 27. compendium for diseases and their annotation. Database 2013; 2013: 23. Wang F, Tao L, Changshui Z. Semi-supervised clustering via matrix fac- bat018. torization. In: Proceedings of the 2008 SIAM International Conference on 47. Cokelaer T, Pultz D, Harder LM, Serra-Musach J, Saez-Rodriguez J. Bio- Data Mining. Atlanta, GA: SIAM; 2008. Services: a common Python package to access biological Web Services pro- 24. Zhang P, Wang F, Hu J. Towards drug repositioning: a unified computa- grammatically. Bioinformatics 2013; 29 (24): 3241–2. tional framework for integrating multiple aspects of drug similarity and 48. Brown CE. Coefficient of Variation. Applied Multivariate Statistics in disease similarity. In: AMIA Annual Symposium Proceedings. Bethesda, Geohydrology and Related Sciences. Berlin, Heidelberg: Springer Berlin MD: American Medical Informatics Association; 2014. Heidelberg; 1998: 155–57. 25. Zitnik M, Janjic V, Larminie C, Zupan B, Przulj N. Discovering disease- 49. Chai T, Draxler RR. Root mean square error (RMSE) or mean absolute disease associations by fusing systems-level molecular data. 
Sci Rep 2013; error (MAE)?—Arguments against avoiding RMSE in the literature. Geo- 3 (1): 3202. sci Model Dev 2014; 7 (3): 1247–50. 26. Zitnik M, Nam EA, Dinh C, Kuspa A, Shaulsky G, Zupan B. Gene priori- 50. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemom tization by compressive data fusion and chaining. PLoS Comput Biol Intell Lab Syst 1987; 2 (1-3): 37–52. 2015; 11 (10): e1004552. 51. Hinton G. A practical guide to training restricted Boltzmann machines. 27. Zitnik M, Zupan B. Matrix factorization-based data fusion for gene func- Momentum 2010; 9 (1): 926. tion prediction in baker’s yeast and slime mold. Pac Symp Biocomput 52. Hinton GE, Osindero S, Teh YW. A fast learning algorithm for deep belief 2014; 19: 400. nets. Neural Comput 2006; 18 (7): 1527–54. 28. Zitnik M, Zupan B. Matrix factorization-based data fusion for drug- 53. Lowenberg B, Downing JR, Burnett A. Acute myeloid leukemia. N Engl J induced liver injury prediction. Syst Biomed 2014; 2 (1): 16–22. Med 1999; 341 (14): 1051–62. 29. Vitali F, Cohen LD, Demartini A, et al. A network-based data integration 54. Dohner H, Weisdorf DJ, Bloomfield CD. Acute myeloid leukemia. N Engl approach to support drug repurposing and multi-target therapies in triple J Med 2015; 373 (12): 1136–52. negative breast cancer. PLoS One 2016; 11 (9): e0162407. 55. Hartigan JA, Hartigan J. Clustering Algorithms. Wiley: New York; 1975. 30. Zitnik M, Zupan B. Data fusion by matrix factorization. IEEE Trans Pat- 56. Dinse GE, Lagakos SW. Nonparametric estimation of lifetime and disease tern Anal Mach Intell 2015; 37 (1): 41–53. onset distributions from incomplete observations. Biometrics 1982; 38 31. Singh AP, Gordon JG. Relational learning via collective matrix factori- (4): 921–32. zation. In: Proceedings of the 14th ACM SIGKDD International Con- 57. Gray RJ. A class of K-sample tests for comparing the cumulative incidence ference on Knowledge Discovery and Data Mining. Las Vegas, NV: of a competing risk. 
Ann Stat 1988; 16: 1141–54. ACM; 2008. 58. Ye J, Liu J. Sparse methods for biomedical data. SIGKDD Explor Newsl 32. Klami A, Virtanen S, Lepp€ aaho E, Kaski S. Group Factor Analysis. IEEE 2012; 14 (1): 4–15. Transactions on Neural Networks and Learning Systems 2015; 26 (9): 59. Scott DW. The curse of dimensionality and dimension reduction. In: Mul- 2136–47. tivariate Density Estimation: Theory, Practice, and Visualization. 2nd ed. 33. Ruffini M, Gavalda R, Limon E. Clustering patients with tensor decompo- New York: Wiley; 1992: 217–40. sition. In: Finale D-V, Jim F, David K, Rajesh R, Byron W, Jenna W, eds. 60. Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMI- Proceedings of the 2nd Machine Learning for Healthcare Conference. M.org: Online Mendelian Inheritance in Man (OMIM(R)), an online cata- Proceedings of Machine Learning Research: PMLR. Boston, MA: PMLR; log of human genes and genetic disorders. Nucleic Acids Res 2015; 43 2017: 126–46. (D1): D789–98. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 12 JAMIA Open, 2018, Vol. 0, No. 0 61. Paschka P, Schlenk RF, Gaidzik VI, et al. IDH1 and IDH2 mutations are 64. Bentires-Alj M, Paez JG, David FS, et al. Activating mutations of the frequent genetic alterations in acute myeloid leukemia and confer adverse noonan syndrome-associated SHP2/PTPN11 gene in human solid tumors prognosis in cytogenetically normal acute myeloid leukemia with NPM1 and adult acute myelogenous leukemia. Cancer Res 2004; 64 (24): mutation without FLT3 internal tandem duplication. J Clin Oncol 2010; 8816–20. 28 (22): 3636–43. 65. Gaidzik VI, Paschka P, Spath D, et al. TET2 mutations in acute myeloid 62. Verhaak RG, Goudswaard CS, van Putten W, et al. 
Mutations in nucleo- leukemia (AML): results from a comprehensive genetic and clinical analy- phosmin (NPM1) in acute myeloid leukemia (AML): association with other sis of the AML study group. J Clin Oncol 2012; 30 (12): 1350–7. gene abnormalities and previously established gene expression signatures and 66. Law V, Knox C, Djoumbou Y, et al.DrugBank 4.0:sheddingnew their favorable prognostic significance. Blood 2005; 106 (12): 3747–54. light on drug metabolism. Nucleic Acids Res 2014; 42 (D1): 63. Schlenk RF, Dohner K, Krauter J, et al. Mutations and treatment outcome D1091–7. in cytogenetically normal acute myeloid leukemia. N Engl J Med 2008; 67. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics 358 (18): 1909–18. knowledge base. Nucleic Acids Res 2002; 30 (1): 163–65. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png JAMIA Open Oxford University Press
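
Illustrative note on the MAE comparison. The simulation study scores each method by the mean absolute error between a patient similarity matrix and the ideal consensus matrix, and for the baseline methods the similarity matrix is binary (1 if two patients fall in the same cluster, 0 otherwise). The following is a minimal sketch of that comparison, assuming MAE is the average absolute element-wise difference over all matrix elements. The function names, the toy matrices, and the 4-patient ground truth are hypothetical and are not taken from the paper, which uses 5 clusters of 40 patients.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def co_membership(labels: Sequence[int]) -> Matrix:
    """Binary similarity: 1.0 if two patients share a cluster label, else 0.0."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def mean_absolute_error(estimated: Matrix, expected: Matrix) -> float:
    """Average of |estimated_ij - expected_ij| over all matrix elements."""
    n_elements = sum(len(row) for row in expected)
    total = sum(
        abs(est - exp)
        for est_row, exp_row in zip(estimated, expected)
        for est, exp in zip(est_row, exp_row)
    )
    return total / n_elements


# Hypothetical ground truth: 2 clusters of 2 patients each.
true_labels = [0, 0, 1, 1]
ideal = co_membership(true_labels)

# Hypothetical consensus matrix, as a factorization-based method might return.
consensus = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]

# Hypothetical baseline clustering with one patient assigned to the wrong cluster.
baseline_sim = co_membership([0, 1, 1, 1])

mae_consensus = mean_absolute_error(consensus, ideal)
mae_baseline = mean_absolute_error(baseline_sim, ideal)
print(f"MAE consensus: {mae_consensus:.4f}, MAE baseline: {mae_baseline:.4f}")
```

Because a binary co-membership matrix penalizes every misassigned pair by a full unit, a single misplaced patient can cost more than many small deviations in a graded consensus matrix. This is worth keeping in mind when reading MAE heatmaps that mix the two kinds of similarity matrix.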

Patient similarity by joint matrix trifactorization to identify subgroups in acute myeloid leukemia

Copyright
Published by Oxford University Press on behalf of the American Medical Informatics Association 2018. This work is written by US Government employees and is in the public domain in the US. This Open Access article contains public sector information licensed under the Open Government Licence v2.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/).
eISSN
2574-2531
D.O.I.
10.1093/jamiaopen/ooy008

Abstract

JAMIA Open, 0(0), 2018, 1–12
doi: 10.1093/jamiaopen/ooy008
Research and Applications

Patient similarity by joint matrix trifactorization to identify subgroups in acute myeloid leukemia

F. Vitali (1,2,3), S. Marini (4), D. Pala (5), A. Demartini (5,6), S. Montoli (5,6), A. Zambelli (7), and R. Bellazzi (5,6,8)

(1) Center for Biomedical Informatics and Biostatistics, The University of Arizona, Tucson, Arizona, USA
(2) BIO5 Institute, The University of Arizona, Tucson, Arizona, USA
(3) Department of Medicine, The University of Arizona, Tucson, AZ, USA
(4) Department of Computational Biology and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
(5) Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, PV, Italy
(6) Centre for Health Technologies, University of Pavia, PV, Italy
(7) Oncology Unit, ASST Papa Giovanni XXIII, Bergamo, BG, Italy
(8) IRCCS Istituti Clinici Scientifici Maugeri, Pavia, PV, Italy

Corresponding Author: Dr. Riccardo Bellazzi, University of Pavia, Department of Electrical, Computer and Biomedical Engineering, 27100, Pavia, PV, Italy (riccardo.bellazzi@unipv.it)

These authors contributed equally to the work.

Received 20 December 2017; Revised 7 March 2018; Accepted 20 March 2018

ABSTRACT

Objective: Computing patients’ similarity is of great interest in precision oncology since it supports clustering and subgroup identification, eventually leading to tailored therapies. The availability of large amounts of biomedical data, characterized by large feature sets and sparse content, motivates the development of new methods to compute patient similarities able to fuse heterogeneous data sources with the available knowledge.

Materials and Methods: In this work, we developed a data integration approach based on matrix trifactorization to compute patient similarities by integrating several sources of data and knowledge.
We assess the accuracy of the proposed method: (1) on several synthetic data sets whose similarity structures are affected by increasing levels of noise and data sparsity, and (2) on a real data set coming from an acute myeloid leukemia (AML) study. The results obtained are finally compared with the ones of traditional similarity calculation methods.

Results: In the analysis of the synthetic data set, where the ground truth is known, we measured the capability of reconstructing the correct clusters, while in the AML study we evaluated the Kaplan-Meier curves obtained with the different clusters and measured their statistical difference by means of the log-rank test. In the presence of noise and sparse data, our data integration method outperforms other techniques, both in the synthetic and in the AML data.

Discussion: In case of multiple heterogeneous data sources, a matrix trifactorization technique can successfully fuse all the information in a joint model. We demonstrated how this approach can be efficiently applied to discover meaningful patient similarities and therefore may be considered a reliable data-driven strategy for the definition of new research hypotheses for precision oncology.

Conclusion: The better performance of the proposed approach presents an advantage over previous methods to provide accurate patient similarities supporting precision medicine.

Key words: data integration, matrix trifactorization, acute myeloid leukemia (AML), patient similarity, precision medicine

Published by Oxford University Press on behalf of the American Medical Informatics Association 2018. This work is written by US Government employees and is in the public domain in the US. This Open Access article contains public sector information licensed under the Open Government Licence v2.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/).
BACKGROUND AND SIGNIFICANCE

The concept of precision medicine is based on the assumption that a careful identification of patients’ subgroups can properly take into account individual variability, which may play a major role in any prevention and treatment strategy. This concept is not new: blood typing, for instance, has been used to correctly allocate blood transfusions for more than a century.

Oncology seems to be a clear choice for taking advantage of precision medicine. Cancers are common diseases with a high impact on the population because of their lethality, severe symptoms, and the toxicity associated with oncological treatment. Moreover, each cancer has its own genomic signature, along with some common features shared by multiple cancer types. Patient similarity is an emerging approach in precision oncology and medicine, which identifies patients with similar profiles and derives insights to investigate diseases and potential treatments. In precision oncology, patient similarity is traditionally measured through preidentified signatures (eg Oncotype DX™,[4] PAM50,[5] and other clinically available classifiers), or patient-specific biomarkers. However, these preidentified onco-signatures rely only on a relatively small number of molecular features, and the trials launched to test the real impact of precision oncology in daily clinical practice have so far yielded few results,[7–9] being limited to a tiny proportion of the entire population enrolled in the trials.[9–11] Therefore, conventional studies designed on the basis of the classical 4 phases of drug development are probably neither effective nor fit for precision oncology.

The availability of large heterogeneous biomedical data naturally opens the way to computational methods that leverage the whole multidimensional patient data framework to search for patient similarity. These data include, among others, clinical (ie coded data, text, images, signals), -omics (from genome to metabolome), and exposome data. The capability of computing patient similarity in the presence of such large feature sets therefore becomes a crucial component to enable large-scale precision medicine implementation. In the literature, patient similarity seems highly dependent on the specific problems considered, and there is no consensus about the best metrics or the best algorithms to calculate it in the presence of heterogeneous and sparse data. To face these problems and consider patient similarity from a multidimensional perspective, in recent years a number of methods to determine patient similarity have been developed.[15–22]

In this article, we propose a novel method to compute patient similarity for precision oncology by an unsupervised discovery of patient subgroups. This method is based on a strategy that integrates data and knowledge in a sound and formal way. In particular, we exploited a modified version of a non-negative matrix trifactorization algorithm recently developed and applied also to biomedical problems by Gligorijevic et al,[21] Zang et al,[24] Utro et al, and Zitnik et al.[21,25–30] Factorization techniques are efficient tools for data fusion of large sparse data sets (like the ones available in the clinical setting). These approaches perform a useful dimension reduction by directly compressing the starting data into a lower number of features (ie vector components). Decomposition methods play a central role in the analysis of the latent structure hidden in the data, which may unveil unknown interactions in the initial data, that is, patient similarities. In recent years, several dimension reduction techniques have been proposed to tackle biological problems. Examples rely on collective matrix factorizations,[31,32] tensor decomposition,[33] Bayesian multitensor factorization,[34] and group factor analysis (GFA).[35,36]

Our approach takes into account the structural relation of several highly heterogeneous data, such as clinical and genomic data, and available knowledge from several public repositories. It implements both metafeature extraction and distance measures to reveal hidden patient similarities. In the majority of state-of-the-art dimension reduction methods, there is no constraint on the sign of the metafeature elements, thus admitting negative components or subtractive combinations into the representation. Our method, on the contrary, is based on non-negative trifactorization. The incorporation of non-negative constraints has been shown to enhance the interpretability of the integrated data. We tested the performance of the proposed algorithm on different synthetic data sets, affected by increasing levels of noise and data sparsity. We further validated the method by fusing a real data set coming from an AML study with several external knowledge sources. By comparing it with state-of-the-art techniques, we show that our method outperforms other approaches on both simulated and real data. The identified patients’ subgroups are validated as significantly different by survival curves.

MATERIALS AND METHODS

An overview of the trifactorization algorithm

The structure of data sources and knowledge bases is typically organized into relational matrices associating various objects/concepts, such as patients, clinical data, genes, diseases, and so on. The non-negative trifactorization algorithm naturally exploits these data structures to perform data fusion by first representing them in matrix form and subsequently organizing them in a single large block matrix. The algorithm aims at identifying low-rank non-negative matrices whose product can provide a good approximation of the original non-negative matrix. The result is a new matrix containing predictions and novel knowledge about the associations represented. This algorithm can be considered as a knowledge-based method that deals with sparsity by interpolating missing data through a prediction derived by explicitly modeling the correlation and the dependency between attributes.

A Matlab implementation of our algorithm is available at https://gitlab.com/smarini/MaDDA.

The algorithm is described step by step as follows.

Let us consider r different types of concepts, say, patients, genes, miRNAs, ..., which we call objects $o_1, o_2, \ldots, o_r$, and let us suppose that we have a set of data sources that relate pairs of objects $(o_i, o_j)$ for some i and j: for example, we can have the objects “gene” and “disease” and the repository “DisGeNeT” that relates them. If the number of objects of type $o_i$ is $n_i$ and the number of objects of type $o_j$ is $n_j$, the data source, when $i \neq j$, can be represented as a sparse matrix $R_{ij} \in \mathbb{R}^{n_i \times n_j}$, called relation matrix (Figure 1). For instance, the relation matrix may contain information on the relationships between genes (eg BRCA1) and diseases (eg breast cancer). If we also have observations about the relationships between objects of the same type, such as gene coexpression, we can represent them with a matrix $H_i \in \mathbb{R}^{n_i \times n_i}$, called constraint matrix (Figure 1).

Considering the entire set of $R_{ij}$ relation matrices given by all the data sources of interest, we can represent them as a block matrix R, which may miss elements (eg not all the genes in the genome can be related to a given disease):

$$
R = \begin{pmatrix}
\ast & R_{12} & \cdots & R_{1r} \\
R_{21} & \ast & \cdots & R_{2r} \\
\vdots & \vdots & \ddots & \vdots \\
R_{r1} & R_{r2} & \cdots & \ast
\end{pmatrix} \tag{1}
$$
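The block layout of Eq. 1 can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' Matlab implementation: it assumes NumPy, the function name `build_block_relation_matrix` and the toy gene/disease/patient sizes are hypothetical, and missing blocks are stored as zeros together with a boolean mask recording which blocks were actually supplied.

```python
import numpy as np


def build_block_relation_matrix(sizes, relations):
    """Assemble a block matrix R (as in Eq. 1) from pairwise relation matrices.

    sizes     -- list [n_1, ..., n_r]: number of elements of each object type
    relations -- dict mapping (i, j), 0-based with i != j, to an n_i x n_j
                 array with entries in [0, 1]
    Returns (R, observed); observed[i, j] is True if block (i, j) was supplied.
    Hypothetical helper written for illustration only.
    """
    r = len(sizes)
    offsets = np.concatenate(([0], np.cumsum(sizes)))  # start index of each block
    total = int(offsets[-1])
    R = np.zeros((total, total))
    observed = np.zeros((r, r), dtype=bool)
    for (i, j), block in relations.items():
        if i == j:
            # Same-type relations are constraints (H_i), not relation blocks.
            raise ValueError("same-type relations belong in a constraint matrix")
        block = np.asarray(block, dtype=float)
        if block.shape != (sizes[i], sizes[j]):
            raise ValueError(f"block ({i}, {j}) has shape {block.shape}")
        if block.min() < 0.0 or block.max() > 1.0:
            raise ValueError("relation values are expected in [0, 1]")
        R[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = block
        observed[i, j] = True
    return R, observed


# Toy example (assumed sizes): 3 genes, 2 diseases, 4 patients.
sizes = [3, 2, 4]
rng = np.random.default_rng(0)
gene_disease = np.array([[1.0, 0.0], [0.0, 0.5], [0.0, 0.0]])
gene_patient = rng.random((3, 4))
R, observed = build_block_relation_matrix(
    sizes, {(0, 1): gene_disease, (0, 2): gene_patient}
)
print(R.shape)
print(observed.astype(int))
```

Keeping the `observed` mask separate from the zero-filled matrix is a design choice of this sketch: it distinguishes a block that no data source covers from a supplied block whose entries happen to be 0 (no known relationship).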
It implements Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 JAMIA Open, 2018, Vol. 0, No. 0 3 Figure 1. An example of the trifactorization algorithm constructed by considering 3 data sources. All the data sources are represented as relation matrices. R i; j matrices are used to describe associations between objects of the different type (eg gene-disease), their values range between [0, 1], where 1 indicates strong in- teraction and 0 association absence or lack. H matrices represent relations between objects of the same type (eg gene-gene) and H elements vary between i i [1, 1], where 1 represents a strong association and 1 a lack of association. R matrix is then trifactorized by running and optimization algorithm (see Eq. 5) into a set of lower-rank factors, eg G and S . Finally, the whole matrix is reconstructed by multiplying matrices G and S , thus revealing new associations. i i; j i i; j The values of the matrix R express the strength of the relation- are predicted on the basis of the multiplication of elements of much ships between objects, and they correspond to numbers between 0 lower rank. and 1, with 0 meaning no known relationships. The matrices S and G are reconstructed by minimizing the fol- On the other hand, the constraint matrices H can be expressed lowing objective function: as a diagonal block set, H , where t ¼ 1, 2,..., i denotes the possible max t i i X X t 2 t ðtÞ multiplicity of relationships of the same type, which can be derived min JGðÞ ; S ¼ kR  G S G k þ trðG H GÞ; (5) G0 ij i ij by i different knowledge sources (corresponding to different H ma- R 2R t¼1 i ij trices of the same object). For example, coexpression may be mea- where jj  jj is the Frobenius norm and tr() the trace of the matrix. sured through different types of experiments. The procedure adopted to solve Eq. 
5 starts with a random initiali- ðÞ t ðÞ t ðtÞ ðÞ t zation of the G, next, S matrices are iteratively updated until conver- H ¼ Diag H ;H ; .. . ;H : (2) 1 2 r gence (proof of convergence and details in references 23 and 30). Differently from R matrices, H values vary between 1 and 1, ij i Details on the adopted procedure to solve the optimization problem expressing the dissimilarity between elements of the same object provided in Supplementary Material Method S1. types, so that 1 means full similarity while 1 is full dissimilarity. The algorithm described above has been adapted to calculate the Once the data are represented into matrices, the trifactorization similarity between the same type of objects: in this article, we are in- algorithm jointly factorizes the matrices R using the matrices H as ij i terested in the object “patients.” With a closer look to the approxi- constraints. First of all, a set of design parameters k  n is defined i i mation R  G S G , we can notice that matrix G is shared by all ij i ij i for each object. These parameters, also called ranks, define the di- blocks that are related to the object type o (in our case patients), mension of the latent factors for the ith object type with the objec- while S is specific to the relations between the objects o and o . ij i j tive of revealing hidden structure in the data. This is a crucial step in Since G is an n  k matrix, the rows correspond to the elements of i i i the algorithm, since wrongly assigned ranks may lead to overfitting the ith object type (in our case the different patients), while the col- (if too big), or may not be able to capture all the information (if too umns represent k groups. Therefore, each element can be inter- small). There is no general consensus about how to select these val- preted as the degree of membership of each patient (row) to each 21,26,30 ues and different approaches can be applied. In this work, we group (column). 
Therefore, we can assign an element (ie a patient) opted for an empirical approach (see Selection of initial parameters to the group (ie to a cluster) with the largest value, that is the col- section). umn with the maximum value for the corresponding row. After rank selection, each block of the matrix R is factorized in 2 Since the optimization strategy strongly depends on the initiali- lower-rank block matrices,G and S (Figure 1), as follows: zation (ie the selection of the dimension of k parameters), we aver- aged the results over 10 applications to obtain a final consensus n k n k 1 1 2 2 nrkr G ¼ DiagðG ; G ; .. . ; G Þ (3) 1 2 r ^ matrix (C), which is calculated as the element-wise averaged sum of the connectivity matrices. In our case, for example, a consensus ma- 0 1 k k k k 1 2 1 r trix element showing a value of 0.5 means that 5 times out of 10, in S ... S 12 1r B C the connectivity matrix, the 2 patients corresponding to the row and B C B C k2k1 k2kr B S  S C column indexes of the element ended up grouped together. 21 2r B C S ¼ (4) B C B C B . . . C . . . B C . . .  Patient and external knowledge-based data sets @ A In Table 1 are reported the data sources used in this work. In detail, k k k k r 2 r 2 S S .. . r1 r2 we propose to integrate patient data (extracted from TCGA, ) and Thanks to the joint factorization, information is spread out from several publicly available knowledge bases to extract meaningful pa- and to the different relation matrices, so that the missing elements tient similarities. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 4 JAMIA Open, 2018, Vol. 0, No. 0 Table 1. 
Patient data and external knowledge sources

| Data type | Retrieved data | Resource |
|---|---|---|
| Patient data on acute myeloid leukemia | | |
| 1) Clinical data (200 patients) | Gender, age, vital status: dead or alive, days to death (if dead), days to birth, days to last follow-up, date of the diagnosis (a) | TCGA (c) [39] |
| 2) Somatic mutations (195 patients) | 1620 mutations associated with 428 genes | TCGA (c) [39] |
| 3) Gene expression profiles (197 patients) | 22 578 genes (8897 after filter application) | TCGA (c) [39] |
| External knowledge data sources (b) | | |
| 4) Gene-gene interactions | Starting from the 186 genes involved in AML (extracted from MalaCards) and the 428 genes associated with the mutations, we extracted 37 811 first-degree gene-gene interactions between 8897 unique genes. | BioGRID (d) [40] |
| 5) Gene-pathway associations | 3202 associations between the 8897 genes and 383 KEGG pathways | KEGG (e) [41] |
| 6) Disease-disease relationships | 35 201 associations between 6402 unique diseases | DO (f) [42] |
| 7) Disease-gene associations | 1925 associations between 6402 diseases and 278 genes | DisGeNET v4.0 |
| 8) Disease-pathway relations | 605 associations between the 6402 diseases and the 383 pathways | KEGG (e) [41] |

(a) We listed the clinical variables used in this work.
(b) Data have been accessed in November 2016.
(c) The Cancer Genome Atlas. Data have been retrieved by using the cBioPortal for Cancer Genomics,[43] last updated 05/31/16.
(d) Biological General Repository for Interaction Datasets, Release 3.4.142.
(e) Kyoto Encyclopedia of Genes and Genomes, Release 80.0.
(f) Disease Ontology, Release 2016-01-07.

The AML TCGA cohort was used (1) to generate the data set for a simulated study and (2) to validate the proposed approach on a real data set. Gene expression data were normalized by using the robust multichip average (RMA) normalization method. Mutation data were analyzed using the software PaPi.[45] PaPi is a machine-learning approach to classify and score human coding variants by estimating the probability that they damage their protein-related function.[45] Each of the 1620 mutations gets a score (probability) between 0 (ie tolerated mutation) and 1 (ie damaging mutation). A list of 186 genes involved in AML was selected from MalaCards.[46] This list was subsequently integrated with the 428 genes associated with the mutation data (Table 1). We finally retained the gene expression and mutation data for that list of 8900 genes and for the genes extracted as first-degree interactors in BioGRID.[40]

Matrix definition

We represented all the data sources (Table 1) in the form of relation matrices, each one formalizing the known associations between pairs of objects. In detail, the considered objects are: $o_1$ clinical data; $o_2$ mutations; $o_3$ genes; $o_4$ pathways; $o_5$ diseases; and $o_6$ patients. According to the integration algorithm, and in order to make the matrices comparable, associations between objects of different types ($R_{ij}$ matrices) were rescaled to the interval [0, 1]. On the other hand, associations between objects of the same type ($H_i$ matrices) were rescaled to the interval [−1, 1]. Details on data processing are provided in Table 2. Matrix data are available at https://gitlab.com/smarini/MaDDA.

Table 2. Matrix $R_{ij}$ and $H_i$ construction

| Matrix | Relation | Matrix value definition |
|---|---|---|
| $R_{16} = R_{61}$ | Clinical data-patient** | We used 4 clinical variables (ie rows) for each patient (ie column), that is Female, Male, Age, and Survival. The gender variable was used to create 2 rows, that is Female and Male, for each patient, whose values were set to 0 or 1 according to the gender. The age field was used to define the Age row, whose values are given by $age(i) = \frac{a_i - a_{min}}{a_{max} - a_{min}}$ (Eq. 6), where $a_i$ is the ith patient age at the time of death or at the last follow-up, while $a_{min}$ and $a_{max}$ are the cohort minimum and maximum ages, respectively. Finally, a Survival row was obtained by dividing the patients into alive and deceased individuals. Since the majority of the patients died within 1 year from the diagnosis date, we considered the survival S at one year, computed as $S(i) = \begin{cases} 1 & \text{if } d_i > 365 \text{ days} \\ \frac{d_i - d_{min}}{d_{max} - d_{min}} & \text{if } d_i < 365 \text{ days and vital status = deceased} \\ 0 & \text{if } d_i < 365 \text{ days and vital status = alive} \end{cases}$ (Eq. 7), where $d_i$ is the number of days between the diagnosis and the death date or the follow-up date of patient i, and $d_{min}$ and $d_{max}$ are the global minimum and maximum number of days between the diagnosis and the death date, respectively. A value of 0 means that it is unknown whether the patient is alive or deceased 1 year after the diagnosis, since the last follow-up date occurred before that time. This yields values ranging in [0, 1]. |
| $R_{23} = R_{32}$ | Mutation-gene | Mutations mapped to their respective genes. |
| $R_{26} = R_{62}$ | Mutation-patient | PaPi scores evaluating the harmfulness of each mutation, per patient. |
| $R_{34} = R_{43}$ | Gene-disease | We used DisGeNET data associating genes and diseases. Since the score provided by DisGeNET is already in the interval [0, 1], no further processing was required. |
| $R_{35} = R_{53}$ | Gene-pathway* | Genes mapped to KEGG pathways. Presence/absence of a gene in a pathway determines its binary 1/0 value. |
| $R_{36}$ | Gene-patient | Gene-patient values correspond to the sum (mutational burden) of the PaPi scores of all the mutations associated with a specific gene. The obtained values were then rescaled to [0, 1]. |
| $R_{45} = R_{54}$ | Disease-pathway* | Matrix values correspond to 0 or 1 according to the information extracted from KEGG about all the diseases altering each KEGG pathway. |
| $R_{46} = R_{64}$ | Disease-patient | Rows of this matrix represent the diseases and columns the 200 patients. The values of the row corresponding to acute myeloid leukemia (DOID: 9119) were set to 1, indicating the association. |
| $R_{63}$ | Patient-gene | Gene expression data from TCGA were used as matrix values according to the formula $Expression(i, j) = \frac{e_{i,j} - e_{min}}{e_{max} - e_{min}}$ (Eq. 8), where $e_{i,j}$ is the expression of gene j in patient i, while $e_{min}$ and $e_{max}$ are the global minimum and maximum values of gene j. |
| $H_1$ | Clinical data-clinical data | The rows and the columns of this matrix are Age, Female, Male, and Survival. The diagonal values are −1 (ie fully associated). The Male-Female association is 1 (ie not associated). |
| $H_2$ | Mutation-mutation | No assumption was made on the mutation similarity, besides considering each mutation similar to itself. |
| $H_3$ | Gene-gene | Gene-gene interactions from BioGRID. The raw data needed a preprocessing step, since for each gene pair the related associations may appear multiple times (corresponding to different kinds of interaction, eg direct physical binding, genetic interaction). For this reason, denoting with x the number of times a certain pair appears, its score was determined by $f(x) = -1 + \frac{\ln(x)}{2\ln(x_{max})}$ (Eq. 9). |
| $H_4$ | Disease-disease | The similarity between 2 diseases is set to $0.8^{n}$, where n is the length of the shortest path between the corresponding terms in the DO (ie the minimum number of steps to reach one disease from the other one). |
| $H_5$ | Pathway-pathway | KEGG relations were used to measure binary pathway similarity. Also, each pathway is considered similar to itself. |
| $H_6$ | Patient-patient | No assumption was made on the patient similarity, besides considering each patient similar to itself. |

* Data have been automatically extracted by using Python Bioservices[47] 1.4.8.
** In case of missing data, the value of the element was set to 0 (ie unknown).

Figure 2. Data sources and matrix representation. The figure shows the number of rows of each matrix constructed starting from the patient-related data and the external knowledge. N corresponds to the number of nonzero elements in the matrix. Matrix data are available at https://gitlab.com/smarini/MaDDA/tree/master/Patient_similarity_TCGA/matrices.

Selection of initial parameters

A crucial step in the factorization algorithm is the selection of the input ranks $k_i$. All the ranks, except the one related to the patient object ($k_6$), were computed by resorting to an empirical rule proposed in reference 29 (Eq. 10):

$$
k_i = \frac{1}{200} \cdot \frac{nnz_i^{rows} + nnz_i^{cols}}{2}, \qquad i \neq 6 \tag{10}
$$

where nnz are the nonzero elements of the object i counted on the rows ($nnz_i^{rows}$) and on the columns ($nnz_i^{cols}$), respectively.

On the other hand, the patient rank $k_6$, corresponding to the number of expected clusters, is computed following the approach presented in reference 21. We applied a grid search to select $k_6$, since Eq. 10 would provide a very high rank (ie a high number of patient clusters) due to the low sparsity of the gene/patient relationships. Different values of $k_6$ from a predefined interval are used as inputs to the integration algorithm, and the results are compared in terms of their dispersion coefficient, the larger the better (Eq. 11):

$$
\rho\left(\hat{C}^{(k_6)}\right) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} 4\left(\hat{C}_{ij}^{(k_6)} - \frac{1}{2}\right)^2, \qquad k_6 \in l = [3, 5, 10, 20] \tag{11}
$$

where $k_6$ is a patient rank value from the list l and $\hat{C}^{(k_6)}$ is the consensus matrix (see An overview of the trifactorization algorithm section) computed by using $k_6$. The rank $k_6$ obtaining the highest $\rho(\hat{C}^{(k_6)})$ value was used as rank input for the proposed approach. This procedure has been used for both the synthetic and the real data sets.

Simulated data set construction

To evaluate the performance of the proposed approach, we generated different synthetic data sets with the same size as the real one (Table 1). By varying 2 simulation parameters listed in Supplementary Material Table S1, we created 25 data sets, differing in the amount of added noise, missing data, degrees of patient similarity, and data sparsity. For each scenario, we simulated 200 virtual patients grouped into 5 clusters with known similarity structures. The simulation process is as follows:

1. Patient clinical data, gene expression levels, and the PaPi score mutation data are used to construct a patient similarity matrix by computing the Euclidean distance (ED) between patients. Next, we selected the 5 least similar patients according to the mean ED values.
2. For each patient P of the 5 patients identified in (1), 39 ‘virtual patients’ are generated by adding Gaussian noise with mean of zero and variance equal to one half of the population variance to all of P’s clinical data (excluding the gender variable). The gender is assigned to be the same as P with a probability of .9. Finally, gene expression levels and PaPi mutation scores of a new simulated patient j are obtained according to Eq. 12:

$$
D_i^{j} = D_i^{P} \pm CV \cdot m_i, \qquad 1 \leq j \leq 39 \tag{12}
$$

   where $D_i^{j}$ is patient j’s value of the object i (gene or mutation), $D_i^{P}$ is the corresponding value in the reference data set, $m_i$ is the mean value of an element in the population of the object i (eg mean expression value of a gene), and CV is the coefficient of variation.
3. A percentage d of sparsity is added to all the simulated data sets in order to test the robustness of the approach to sparse data.

With this procedure, we generated the 25 simulated data sets by varying the CV parameter and the data sparseness as reported in Supplementary Material Table S1. The obtained data were then integrated with the external knowledge (Table 1) in order to apply the proposed algorithm.

Result evaluation of synthetic data

We evaluated the performance of the trifactorization algorithm on the 25 simulated data sets by measuring the mean absolute error (MAE), defined as:

$$
MAE = \frac{1}{N} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| C_{ij}^{expected} - \hat{C}_{ij} \right| \tag{13}
$$

where n and m are the number of matrix rows and columns, N is the
total number of matrix elements, $\hat{C}_{ij}$ is the element of the estimated consensus matrix in position ij, and $C_{ij}^{expected}$ is the corresponding element of the ideal consensus matrix (ie each simulated patient is clustered with its corresponding real patient, for a total of 5 clusters of 40 patients).

Result comparison with other techniques

In addition, using the simulated data sets, we assessed and compared the proposed approach with 2 widely used methods, that is principal component analysis (PCA) and the ED measure. The results were also compared with 2 more advanced techniques to integrate heterogeneous data: (1) a deep learning method based on restricted Boltzmann machines, that is multimodal deep belief networks (MDBN),[20,52] and (2) GFA,[36] based on factor analysis. Unlike the matrix trifactorization algorithm, these methods are not designed to include external knowledge sources associating entities not related to the prediction target (ie gene-gene interactions do not involve patients). These methods were therefore applied considering only the patient-related data ($R_{1,6}$, $R_{2,6}$, $R_{3,6}$, $R_{6,3}$; Table 2). Details on the procedures used to apply the PCA, ED, MDBN, and GFA methods are provided in Supplementary Material Method S1.

In order to evaluate their performances, MAE was computed by replacing in Eq. 13 the $\hat{C}_{ij}$ matrix with the similarity matrix $Sim_{ij}$ obtained by applying the PCA, ED, MDBN, or GFA method. In this case, Sim is built by setting its elements $Sim_{ij} = 1$ if patient i and patient j belong to the same cluster, and $Sim_{ij} = 0$ otherwise.

Validation case study of acute myeloid leukemia

We investigated the biological relevance of the patient similarities uncovered by the proposed methodology focusing on AML. AML is a myeloid neoplasm related to an uncontrolled proliferation of white blood cells. It is the most common leukemia in adult patients. AML is curable in 35% to 40% of patients younger than 60 years of age, while for older patients the percentage decreases to 5% to 10%.[53,54] Moreover, AML is a heterogeneous disease[55] and, over the past 15 years, its molecular heterogeneity has become apparent, thanks to -omics technologies. The analysis of public data on AML combined with external knowledge sources may reveal novel insights into AML patient profiles to discriminate between good and bad responders and suggest tailored therapies. In order to test our approach, we collected patient data from The Cancer Genome Atlas (TCGA) and integrated them with external knowledge sources (Table 1). Collected data were then transformed into a matrix format according to Table 2.

The resulting matrices (Figure 2) were used to construct the R and H block matrices providing the inputs for the factorization algorithm. Note that the R block matrix is symmetrical except for the $R_{36}$ and $R_{63}$ blocks, since these 2 matrices were obtained by using 2 different knowledge sources (ie mutation and gene expression data).

All the object ranks except the patient one were initialized as in Eq. 10, while the patient rank $k_6$ was initialized to the value resulting from a grid search based on the dispersion coefficient of Eq. 11.

The convergence of the algorithm was then monitored by measuring the objective function (Eq. 5). The algorithm stopped when the difference between 2 consecutive norms was under the threshold $10^{-5}$. The number of repetitions n was set to 10, in order to reduce the effect of the initialization. The resulting consensus matrix $\hat{C}$ was used as a similarity matrix to extract patient-patient similarities.

Validation of real data results

To validate the results on the real data set, we further applied the proposed algorithm to the same data set (Figure 2) but excluding the Survival column (ie $n_1 = 3$), which corresponds to what we want to estimate. We applied hierarchical clustering to the resulting patient similarity matrix, where the linkage was determined using the complete link method. The survival curves were estimated by the Kaplan-Meier method and were compared using the log-rank test.

As for the simulated data sets, we assessed and compared the real data results of the proposed methodology with the ones obtained by applying PCA, ED, MDBN, and GFA to the patient-related data. As for the trifactorization approach, hierarchical clustering using the complete link method was applied to the similarity matrices resulting from these techniques. The different performances were finally evaluated by comparing the survival curves of the clusters of patients identified.

Figure 3. Heatmaps of the mean absolute error (MAE). MAEs were obtained by applying: A, the proposed trifactorization algorithm; B, principal component analysis (PCA); C, Euclidean distance; D, multimodal deep belief network (MDBN); E, group factor analysis (GFA) to the 25 synthetic data sets. The simulated data sets were constructed by considering different percentages of missing data (ie heatmap columns) and noise (ie heatmap rows). The noise was added based on the coefficient of variation (CV) (Eq. 12). The map clearly shows smaller MAEs for the proposed approach. The trifactorization gave results less sensitive to sparse data (ie a high number of missing data).

RESULTS

Simulation study

To evaluate the performance of the proposed approach, we produced 25 synthetic data sets as described in the Simulated data set construction section.

The k parameters (Selection of initial parameters section) were initialized following Eq. 10 for all the objects except the patient rank ($k_6$). $k_6$ was defined through a grid search (Eq. 11) on the simulated data set with 10% of missing data and noise between 0% and 10% (Supplementary Material Table S1). $k_6 = 10$ resulted as the rank with the highest values of $\rho$ (Supplementary Material Table S2). The proposed algorithm was therefore applied to all the 25 data sets by selecting as input $k_6 = 10$, while the other object ranks were computed according to Eq. 10.

The trifactorization performances on simulated data were evaluated by comparing the algorithm’s MAE (Eq. 13) with the MAEs obtained by applying PCA, the ED measure, MDBN, and GFA to the same data sets. The results are shown through heatmaps in Figure 3 (details in Supplementary Material Table S3). These confirmed that the best performance is achieved with the trifactorization algorithm, and that our approach is robust in the presence of different similarity structures and missing data.

Validation study: patient similarity in acute myeloid leukemia

We investigated the patient-patient similarities uncovered by the trifactorization algorithm by applying it to a real case, that is considering the AML data sources reported in Figure 2. The rank parameters were computed as in Eq. 10 for all the objects except the patient rank, which was set to $k_6 = 5$ (ie the rank with the highest $\rho$ values, Eq. 11). The other input ranks obtained were: $k_1 = 2$, $k_2 = 14$, $k_3 = 480$, $k_4 = 13$, $k_5 = 19$. The consensus matrix obtained by applying this procedure is shown in Figure 4.

Figure 4. Consensus matrix obtained by applying the trifactorization algorithm to the acute myeloid leukemia (AML) data. Matrix rows and columns correspond to the 200 patients, and it was constructed by considering the resulting G matrix (An overview of the trifactorization algorithm section). The matrix shows 2 groups of patients clustered together (the corresponding dendrogram is reported in Supplementary Material Figure S1).

The result validation was then conducted by applying the proposed algorithm to the same data set but without the survival information (Validation of real data results section). The trifactorization revealed 2 major groups that we labeled $G_1$ and $G_2$ (dendrogram reported in Supplementary Material Figure S1).

In order to validate the obtained clusters, we plotted the Kaplan-Meier survival curves of the 2 patients’ groups, as reported in Figure 5A. The survivals of the 2 groups were clearly different (log-rank P-value = .0159), with one of the 2 groups showing a better prognosis than the other.

and different blood tests for different patients are sparse along a mostly uneventful time line). “large m small p” problems, is the fact that the number of records (predictors) can be orders of magnitude bigger than the number of samples (observations). For example, a gene panel can measure the expression of thousands of genes, but cohorts are usually tens to hundreds of patients.

A number of approaches have been recently proposed for data integration and for bringing out intrinsic characteristics of the data. In particular, a specific class of methods relies on the dimension reduction of the data through the definition of metafeatures that allow the projection of the data into a low-dimensional space. This property can be used in a machine learning framework in order to capture the hidden interaction effects between variables.

State-of-the-art algorithms are typically based on the metafeature extraction method[20] or the patient distance measure.[19] They also leverage a small set of different data types, mostly gene expression,[15–17,19,20,22] methylation,[19,20] copy number variation,[15] protein interactions,[17] clinical data,[16] and diseases.[18] Gligorijevic et al also used drug information, since their aim was to reposition drugs for subgroups of patients.

In this work, we proposed a framework based on matrix trifactorization that integrates several sources of data and knowledge with the aim of predicting patient similarity. The proposed approach combines more data and external knowledge sources than other recently developed methods. In detail, our approach provides a comprehensive framework for the prediction of patient similarity by fusing both clinical and multiomics data of patients, and automatically integrating them with external knowledge sources (eg gene-gene interactions, disease-disease associations).

The findings obtained by applying the trifactorization algorithm to synthetic data sets showed better performance than other traditional methods (Figure 3). Moreover, this novel strategy provides more resistance to sparsity and noise, which are ubiquitous in biological data. The application of the trifactorization algorithm to the real AML data set highlighted 2 large groups of similar patients (Figure 4). To further confirm this finding, we searched for the exclusive gene signatures characterizing each group, in order to investigate the presence of a molecular mechanism distinguishing the 2 groups. A univariate analysis of gene expression levels did not provide significant differences (data not shown).

In addition, we compared our results by applying the PCA, ED, MDBN, and GFA methods to the AML data set. In Figure 5 are shown
On the other hand, by all the survival curves obtained with these approaches and no statis- analyzing mutation data, we extracted a common gene signature tical difference between the 2 groups was found (ED P-val- consisting of 19 genes mutated in at least one patient in both groups ue ¼ .7763, PCA P-value ¼ .6278, MDBN P-value ¼ .2954, and (Supplementary Material Table S4). As expected, these genes GFA P-value ¼ .1652). Taken together, these analyses demonstrated resulted significantly enriched for ‘leukemia’ Online Mendelian In- the capacity of the proposed method to add value in integrating dif- heritance in Man (OMIM), annotation (Table 3; Fisher’s exact ferent knowledge sources and provide patient-patient similarities for P-value ¼ .00245;—enrichment results shown in Supplementary personalized medicine. Material Table S5). In addition, we further searched for the genes whose mutation frequency differs significantly between the 2 groups [Fisher’s Exact test, False Discovery Rate (FDR) adjusted P-value]. DISCUSSION We found 5 genes, namely IDH1 (P ¼ .00022), NPM1 (P ¼ 3.54867e14), NRAS (P ¼ .00145), PTPN11 (P ¼ .01), TET2 The availability of increasingly larger amount of biomedical data is (P ¼ .00015). Mutations associated with all these genes have been pushing researchers and companies to investigate useful information 61–65 found to be highly related to AML development and progression. by combining data from difference resources, with the aim of unveil- These findings confirm both the 2 identified groups characterized by a ing hidden knowledge on patients, diseases and therapies. However, common AML gene signature. Finally, we identified gene signatures biomedical data are characterized by high complexity, heterogene- 58 59 58 discriminating the 2 groups by retrieving the mutations present ex- ity, sparsity, and “large m small p” problems. 
Sparsity, is a clusively either in patients of G (better prognosis) or the G (good characteristic of matricial/graph representations where most of the 1 2 prognosis). The gene signatures resulted in 342 (Supplementary Ma- elements are null, that is zeros/absent links. This particularly true terial Table S6) and 52 (Supplementary Material Table S7) genes for for genomic data (eg in a gene network, the number of links con- G and G group, respectively. An OMIM enrichment analysis con- necting the nodes will be orders of magnitude smaller than the num- 1 2 ducted on the 2 gene signatures retrieved the term ‘leukemia’ only ber of nodes), or temporal data (eg hospitalizations, complications Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 JAMIA Open, 2018, Vol. 0, No. 0 9 Figure 5. Survival curves. Survival curves corresponding to the clusters obtained by the trifactorization algorithm (A), principal component analysis (PCA) (B), Eu- clidean distance (ED) measure (C), multimodal deep belief network (MDBN) (D), and group factor analysis (GFA) (E). All the plots report the P-value (P) resulting from the log-rank test. Only the 2 clusters obtained with the proposed approach have statistically significant survival curves (trifactorization P-value ¼ .01; ED P-value ¼ .7763; PCA P-value ¼ .6278; MDBN P-value ¼ .2954; GFA P-value ¼ .1652). for G (Table 3; Fisher’s exact P-value ¼ .0175, complete results in involve associations between clinical data, mutations, genes, dis- Supplementary Material Tables S8 and S9), coherently with the hy- eases, pathways, and patients to extract patient similarity. The same pothesis the second group, associated with a poor survival progno- approach can be applied to classify novel patients and, the results sis, presents a more aggressive instance of AML. 
Interestingly, the might be used to suggest potential tailored treatments based on groups are not significantly different for sex (P ¼ .2594, Fisher’s Ex- successful drug treatments extracted from the patient’s clinical his- act test) or age (P ¼ .0884, Wilcoxon rank sum test), indicating that tories. Moreover, further improvements and novel features can be our unsupervised method did not simply discriminated obvious pa- provided by directly including into the trifactorization framework 66 67 tient subgroups, but was able to mine more subtle differences char- data sources involving drugs (ie DrugBank, PharmGKB, ). In acterizing poor-vs-good prognoses. this way, by selecting as target the patient-drug matrix, the algo- The current study showed is power in discovery patient-patient rithm could predict targeted therapies for specific profiles of similarity by integrating several data and knowledge sources that patients. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 10 JAMIA Open, 2018, Vol. 0, No. 0 Table 3. OMIM enrichment results OMIM term Significant P-value (Fisher’s exact test) OMIM enrichment analysis of the genes associated with Leukemia .00246 Common mutations between G1 and G2 Bardet-biedl_syndrome .026431 Mutations of G1 Macular_degeneration .037245 Leukemia .017554 Mutations of G2 Long_qt_syndrome .030766 Fibrosis .030766 Cone-rod_dystrophy .038311 Thyroid_carcinoma .043309 Note: Significant P-values (P < .05). The full enrichment results for the genes associated with common mutations between G1 and G2, mutations of G1, and mutations of G2 are respectively shown in Supplementary Material Tables S5, S8, and S9. OMIM: Online Mendelian Inheritance in Man. The main limitations of the proposed approach are 2-fold: Conflict of interest statement. None declared. 
On the one hand, it is necessary to represent data and knowledge in terms of bidimensional relation matrices. This often requires SUPPLEMENTARY MATERIAL flattening concept hierarchies and graphs, thus generating matri- ces of very high dimensionality. The computational burden of Supplementary material is available at Journal of the American the optimization algorithm is highly dependent on the number of Medical Informatics Association online. data sources and on the size of the relational matrices, in particu- lar when such matrices are not sparse. On the other hand, the method has a number of crucial design parameters, the ranks, which requires fine tuning in order to obtain ACKNOWLEDGMENTS a compromise between the quality of matrix reconstruction and the We thank Ivan Limongelli for the help in organizing genome data. We are need of representing data with low dimensionality latent vectors. also grateful to Marco Piastra, Gianluca Gerard and Stefano Montoli for the analysis carried on with MDBN. We sincerely acknowledge Blaz Zupan and Nevertheless, we believe that the proposed approach represents a Marinka Zitnik for their original ideas, and for their support and training in powerful and interesting strategy to deal with data and knowledge applying trifactorization methods. fusion for similarity calculation, which may provide advantages in precision oncology applications. Finally, the proposed approach is particularly suitable for preci- REFERENCES sion oncology due to the high number of available data sources on 1. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J cancer, but it can be easy applied to other fields and diseases. Med 2015; 372 (9): 793–5. 2. Lu YF, Goldstein DB, Angrist M, Cavalleri G. Personalized medicine and human genetic diversity. Cold Spring Harbor Perspect Med 2014; 4 (9): a008581. CONCLUSION 3. Chin L, Gray JW. Translating insights from the cancer genome into clini- cal practice. Nature 2008; 452 (7187): 553–63. 
In this work, we analyzed the problems related to the application of pre- 4. Sparano JA, Paik S. Development of the 21-gene assay and its application cision oncology. Literature evidence seems to show how a traditional in clinical practice and clinical trials. J Clin Oncol 2008; 26 (5): 721–8. clinical trial approach looking for biomarker-specific patient subgroups 5. Parker JS, Mullins M, Cheang MC. Supervised risk predictor of breast is particularly hard to implement due to heavy, intrinsic statistical limi- cancer based on intrinsic subtypes. J Clin Oncol 2009; 27 (8): 1160–7. tations. We have shown how the problem of subgroup identification 6. Pellagatti A, Benner A, Mills KI, et al. Identification of gene can be solved with a data fusion approach exploiting relations of hetero- expression-based prognostic markers in the hematopoietic stem cells geneous, multidimensional data sources. In our application, we consid- of patients with myelodysplastic syndromes. J Clin Oncol 2013; 31 (28): ered objects as diverse as clinical data, diseases, genes, mutations, 3557–64. 7. Meric-Bernstam F, Brusco L, Shaw K, et al. Feasibility of large-scale geno- pathways, and the patients themselves. Thanks to a latent feature repre- mic testing to facilitate enrollment onto genomically matched clinical tri- sentation via matrix trifactorization, we were able to identify clinically als. J Clin Oncol 2015; 33 (25): 2753–62. meaningful patient subgroups. The approach showed better perfor- 8. Group E-ACR. Executive Summary: Interim Analysis of the NCI- mance when compared with standard clustering strategies. Future MATCH Trial. Secondary Executive Summary: Interim Analysis of the works will deal with optimizing the factorization strategy and to pro- NCI-MATCH Trial. 2016. http://ecog-acrin.org/nci-match-eay131/in- vide automated explanations of the results obtained. terim-analysis. Accessed June 2016. 9. Tredan O, Corset V, Wang Q, et al. 
Routine molecular screening of ad- vanced refractory cancer patients: An analysis of the first 2490 patients of the ProfiLER study. J Clinical Oncol 2017; 35 (18_suppl): LBA100. FUNDING 10. Le Tourneau C, Delord JP, Goncalves A, et al. Molecularly targeted ther- This work was supported by the 3 project grant no. 2015-0042 “Genomic apy based on tumour molecular profiling versus conventional therapy for profiling of rare hematologic malignancies, development of personalized med- advanced cancer (SHIVA): a multicentre, open-label, proof-of-concept, icine strategies, and their implementation into the Rete Ematologica Lom- randomised, controlled phase 2 trial. Lancet Oncol 2015; 16 (13): barda (REL) clinical network.” 1324–34. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 JAMIA Open, 2018, Vol. 0, No. 0 11 11. Prasad V, Vandross A. Characteristics of exceptional or super responders 34. Khan SA, Lepp€ aaho E, Kaski S. Bayesian multi-tensor factorization. Mach to cancer drugs. Mayo Clin Proc 2015; 90 (12): 1639–49. Learn 2016; 105 (2): 233–53. 12. Biankin AV, Piantadosi S, Hollingsworth SJ. Patient-centric trials for ther- 35. Virtanen S, Klami A, Khan AK, Kaski S. Bayesian group factor analysis. apeutic development in precision oncology. Nature 2015; 526 (7573): In: Artificial Intelligence and Statistics. La Palma, Canary Islands: 361–70. AISTATS; 2012. 13. Sun J, Wang F, Hu J, Edabollahi S. Supervised patient similarity measure 36. Klami A, Virtanen S, Lepp€ aaho E, Kaski S. Group factor analysis. IEEE of heterogeneous patient records. ACM SIGKDD Explor Newsl 2012; 14 Trans Neural Netw Learn Syst 2015; 26 (9): 2136–47. (1): 16–24. 37. Wang Y-X, Zhang Y-J. Nonnegative matrix factorization: a comprehen- 14. Brown SA. Patient similarity: emerging concepts in systems and precision sive review. IEEE Trans Knowl Data Eng 2013; 25 (6): 1336–53. medicine. 
Front Physiol 2016; 7: 561. 38. Pinero J, Bravo A, Queralt-Rosinach N, et al. DisGeNET: a comprehen- 15. Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic sive platform integrating information on human disease-associated genes data types using a joint latent variable model with application to breast and variants. Nucleic Acids Res 2017; 45 (D1): D833–d39. and lung cancer subtype analysis. Bioinformatics (Oxford, England) 39. Hudson TJ, Anderson W, Aretz A, et al. International network of cancer 2009; 25 (22): 2906–12. genome projects. Nature 2010; 464 (7291): 993. 16. Ow GS, Tang Z, Kuznetsov VA. Big data and computational biology strat- 40. Chatr-Aryamontri A, Breitkreutz B-J, Oughtred R, et al. The BioGRID in- egy for personalized prognosis. Oncotarget 2016; 7 (26): 40200–20. teraction database: 2015 update. Nucleic Acids Res 2014; 43 (D1): 17. Xu T, Le TD, Liu L, Wang R, Sun B, Li J. Identifying cancer subtypes D470–D78. from miRNA-TF-mRNA regulatory networks and expression data. PLoS 41. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. One 2016; 11 (4): e0152792. Nucleic Acids Res 2000; 28 (1): 27–30. 18. Girardi D, Wartner S, Halmerbauer G, Ehrenmu ¨ ller M, Kosorus H, Drei- 42. Kibbe WA, Arze C, Felix V, et al. Disease Ontology 2015 update: an ex- seitl S. Using concept hierarchies to improve calculation of patient similar- panded and updated database of human diseases for linking biomedical ity. J Biomed Inform 2016; 63: 66–73. knowledge through disease data. Nucleic Acids Res 2014; 43 (D1): 19. Wang B, Mezlini AM, Demir F, et al. Similarity network fusion for aggre- D1071–D78. gating data types on a genomic scale. Nat Methods 2014; 11 (3): 333–7. 43. Gao J, Aksoy BA, Dogrusoz U, et al. Integrative analysis of complex can- 20. Liang M, Li Z, Chen T, Zeng J. Integrative data analysis of multi-platform cer genomics and clinical profiles using the cBioPortal. 
Sci Signal 2013; 6 cancer data with a multimodal deep learning approach. IEEE/ACM Trans (269): pl1. Comput Biol and Bioinf 2015; 12 (4): 928–37. 44. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries 21. Gligorijevic V, Malod-Dognin N, Przulj N. Patient-specific data fusion for of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003; 31 (4): e15. cancer stratification and personalised treatment. Pac Symp Biocomput 45. Limongelli I, Marini S, Bellazzi R. PaPI: pseudo amino acid composition to 2016; 21: 321–32. score human protein-coding variants. BMC Bioinformatics 2015; 16 (1): 123. 22. Planey CR, Gevaert O. CoINcIDE: a framework for discovery of patient 46. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated subtypes across multiple datasets. Genome Med 2016; 8 (1): 27. compendium for diseases and their annotation. Database 2013; 2013: 23. Wang F, Tao L, Changshui Z. Semi-supervised clustering via matrix fac- bat018. torization. In: Proceedings of the 2008 SIAM International Conference on 47. Cokelaer T, Pultz D, Harder LM, Serra-Musach J, Saez-Rodriguez J. Bio- Data Mining. Atlanta, GA: SIAM; 2008. Services: a common Python package to access biological Web Services pro- 24. Zhang P, Wang F, Hu J. Towards drug repositioning: a unified computa- grammatically. Bioinformatics 2013; 29 (24): 3241–2. tional framework for integrating multiple aspects of drug similarity and 48. Brown CE. Coefficient of Variation. Applied Multivariate Statistics in disease similarity. In: AMIA Annual Symposium Proceedings. Bethesda, Geohydrology and Related Sciences. Berlin, Heidelberg: Springer Berlin MD: American Medical Informatics Association; 2014. Heidelberg; 1998: 155–57. 25. Zitnik M, Janjic V, Larminie C, Zupan B, Przulj N. Discovering disease- 49. Chai T, Draxler RR. Root mean square error (RMSE) or mean absolute disease associations by fusing systems-level molecular data. 
Sci Rep 2013; error (MAE)?—Arguments against avoiding RMSE in the literature. Geo- 3 (1): 3202. sci Model Dev 2014; 7 (3): 1247–50. 26. Zitnik M, Nam EA, Dinh C, Kuspa A, Shaulsky G, Zupan B. Gene priori- 50. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemom tization by compressive data fusion and chaining. PLoS Comput Biol Intell Lab Syst 1987; 2 (1-3): 37–52. 2015; 11 (10): e1004552. 51. Hinton G. A practical guide to training restricted Boltzmann machines. 27. Zitnik M, Zupan B. Matrix factorization-based data fusion for gene func- Momentum 2010; 9 (1): 926. tion prediction in baker’s yeast and slime mold. Pac Symp Biocomput 52. Hinton GE, Osindero S, Teh YW. A fast learning algorithm for deep belief 2014; 19: 400. nets. Neural Comput 2006; 18 (7): 1527–54. 28. Zitnik M, Zupan B. Matrix factorization-based data fusion for drug- 53. Lowenberg B, Downing JR, Burnett A. Acute myeloid leukemia. N Engl J induced liver injury prediction. Syst Biomed 2014; 2 (1): 16–22. Med 1999; 341 (14): 1051–62. 29. Vitali F, Cohen LD, Demartini A, et al. A network-based data integration 54. Dohner H, Weisdorf DJ, Bloomfield CD. Acute myeloid leukemia. N Engl approach to support drug repurposing and multi-target therapies in triple J Med 2015; 373 (12): 1136–52. negative breast cancer. PLoS One 2016; 11 (9): e0162407. 55. Hartigan JA, Hartigan J. Clustering Algorithms. Wiley: New York; 1975. 30. Zitnik M, Zupan B. Data fusion by matrix factorization. IEEE Trans Pat- 56. Dinse GE, Lagakos SW. Nonparametric estimation of lifetime and disease tern Anal Mach Intell 2015; 37 (1): 41–53. onset distributions from incomplete observations. Biometrics 1982; 38 31. Singh AP, Gordon JG. Relational learning via collective matrix factori- (4): 921–32. zation. In: Proceedings of the 14th ACM SIGKDD International Con- 57. Gray RJ. A class of K-sample tests for comparing the cumulative incidence ference on Knowledge Discovery and Data Mining. Las Vegas, NV: of a competing risk. 
Ann Stat 1988; 16: 1141–54. ACM; 2008. 58. Ye J, Liu J. Sparse methods for biomedical data. SIGKDD Explor Newsl 32. Klami A, Virtanen S, Lepp€ aaho E, Kaski S. Group Factor Analysis. IEEE 2012; 14 (1): 4–15. Transactions on Neural Networks and Learning Systems 2015; 26 (9): 59. Scott DW. The curse of dimensionality and dimension reduction. In: Mul- 2136–47. tivariate Density Estimation: Theory, Practice, and Visualization. 2nd ed. 33. Ruffini M, Gavalda R, Limon E. Clustering patients with tensor decompo- New York: Wiley; 1992: 217–40. sition. In: Finale D-V, Jim F, David K, Rajesh R, Byron W, Jenna W, eds. 60. Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMI- Proceedings of the 2nd Machine Learning for Healthcare Conference. M.org: Online Mendelian Inheritance in Man (OMIM(R)), an online cata- Proceedings of Machine Learning Research: PMLR. Boston, MA: PMLR; log of human genes and genetic disorders. Nucleic Acids Res 2015; 43 2017: 126–46. (D1): D789–98. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018 12 JAMIA Open, 2018, Vol. 0, No. 0 61. Paschka P, Schlenk RF, Gaidzik VI, et al. IDH1 and IDH2 mutations are 64. Bentires-Alj M, Paez JG, David FS, et al. Activating mutations of the frequent genetic alterations in acute myeloid leukemia and confer adverse noonan syndrome-associated SHP2/PTPN11 gene in human solid tumors prognosis in cytogenetically normal acute myeloid leukemia with NPM1 and adult acute myelogenous leukemia. Cancer Res 2004; 64 (24): mutation without FLT3 internal tandem duplication. J Clin Oncol 2010; 8816–20. 28 (22): 3636–43. 65. Gaidzik VI, Paschka P, Spath D, et al. TET2 mutations in acute myeloid 62. Verhaak RG, Goudswaard CS, van Putten W, et al. 
Mutations in nucleo- leukemia (AML): results from a comprehensive genetic and clinical analy- phosmin (NPM1) in acute myeloid leukemia (AML): association with other sis of the AML study group. J Clin Oncol 2012; 30 (12): 1350–7. gene abnormalities and previously established gene expression signatures and 66. Law V, Knox C, Djoumbou Y, et al.DrugBank 4.0:sheddingnew their favorable prognostic significance. Blood 2005; 106 (12): 3747–54. light on drug metabolism. Nucleic Acids Res 2014; 42 (D1): 63. Schlenk RF, Dohner K, Krauter J, et al. Mutations and treatment outcome D1091–7. in cytogenetically normal acute myeloid leukemia. N Engl J Med 2008; 67. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics 358 (18): 1909–18. knowledge base. Nucleic Acids Res 2002; 30 (1): 163–65. Downloaded from https://academic.oup.com/jamiaopen/advance-article-abstract/doi/10.1093/jamiaopen/ooy008/4996526 by Ed 'DeepDyve' Gillespie user on 08 June 2018
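
Illustrative sketch. The cluster validation described in this article (complete-link hierarchical clustering of a patient similarity matrix, followed by a log-rank comparison of the survival of the 2 resulting groups) could be sketched as follows. This is not the authors' code: the function names `complete_linkage` and `logrank_p_value`, the conversion of similarities to distances as 1 - similarity, and the 6-patient toy data are all assumptions made for illustration, and only the Python standard library is used.

```python
import math


def complete_linkage(similarity, n_clusters=2):
    """Agglomerative clustering with complete linkage (hypothetical helper).

    Distances are taken as 1 - similarity, an assumption of this sketch.
    Returns one cluster label per patient.
    """
    n = len(similarity)
    dist = [[1.0 - similarity[i][j] for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def logrank_p_value(times, events, groups):
    """Two-group log-rank test; events[i] is 1 for death, 0 for censoring."""
    obs_minus_exp = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n = len(at_risk)                        # patients still at risk at t
        n0 = sum(1 for g in at_risk if g == 0)  # of which in group 0
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d0 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 0)
        obs_minus_exp += d0 - d * n0 / n
        if n > 1:
            variance += d * (n0 / n) * (1 - n0 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Toy data invented for this example: 6 patients forming 2 similarity blocks.
similarity = [
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.1],
    [0.9, 1.0, 0.85, 0.2, 0.1, 0.15],
    [0.8, 0.85, 1.0, 0.1, 0.2, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9, 0.8],
    [0.2, 0.1, 0.2, 0.9, 1.0, 0.85],
    [0.1, 0.15, 0.1, 0.8, 0.85, 1.0],
]
times = [30, 36, 40, 5, 8, 12]   # follow-up times (arbitrary units)
events = [0, 1, 0, 1, 1, 1]      # 1 = death observed, 0 = censored

labels = complete_linkage(similarity, n_clusters=2)
p_value = logrank_p_value(times, events, labels)
print(labels, p_value)
```

In practice, a statistics library would normally replace both helpers; the explicit loops are kept here only to make the two steps visible. Note that the survival data play no role in the clustering step, which mirrors the validation design in which survival information is withheld from the algorithm and used only to compare the resulting groups.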

Journal: JAMIA Open, Oxford University Press

Published: May 14, 2018
