Biometrics 74, 300–312 DOI: 10.1111/biom.12715
Integrative Analysis of Transcriptomic and Metabolomic Data via
Sparse Canonical Correlation Analysis with Incorporation of
Sandra E. Safo ,
and Qi Long
Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, U.S.A.
Department of Medicine, Division of Pulmonary, Allergy and Critical Care Medicine, Emory University, Atlanta,
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, U.S.A.
Summary. Integrative analysis of high dimensional omics data is becoming increasingly popular. At the same time,
incorporating known functional relationships among variables in the analysis of omics data has been shown to help
elucidate underlying mechanisms for complex diseases. In this article, our goal is to assess the association between
transcriptomic and metabolomic data from a Predictive Health Institute (PHI) study that includes healthy adults at a
high risk of developing cardiovascular diseases. Adopting a strategy that is both data-driven and knowledge-based, we
develop statistical methods for sparse canonical correlation analysis (CCA) with incorporation of known biological
information. Our proposed methods use prior network structural information among genes and among metabolites to guide
selection of relevant genes and metabolites in sparse CCA, providing insight into the molecular underpinnings of
cardiovascular disease. Our simulations demonstrate that the structured sparse CCA methods outperform several existing
sparse CCA methods in selecting relevant genes and metabolites when structural information is informative, and that
they are robust to mis-specified structural information. Our analysis of the PHI study reveals that a number of gene
and metabolic pathways, including some known to be associated with cardiovascular diseases, are enriched in the set of
genes and metabolites selected by our proposed approach.
Key words: Biological information; Canonical correlation analysis; High dimension; Integrative analysis; Low sample size;
Sparsity; Structural information.
Recent advances in high-throughput biomedical technologies
have enabled the measurement of multiple high-dimensional
omics data types in a single study, including genomics,
epigenomics, transcriptomics, and metabolomics. Each of
these data types provides a different snapshot of the
underlying biological system, and combining multiple data
types has been shown to be very valuable in investigating
complex diseases. It has been demonstrated that individual
components in these data are functionally structured in
networks or pathways, and that incorporating such structural
(or biological) information can improve analysis and lead to
biologically more meaningful results (Li and Li, 2008; Pan
et al., 2010; Chen et al., 2013). By the same token, it is
desirable to jointly assess the association between these
data types while incorporating the available structural
information for each data type, enabling us to uncover
drivers that individually or in combination provide better
insight into the biological mechanism. In this article, we
develop new canonical correlation analysis (CCA) methods for
studying the overall dependency structure between
transcripts and metabolites while incorporating structural
information for each data type.
1.1. The PHI Study
Our work is motivated by data from the Emory University
and Georgia Tech Predictive Health Institute (PHI) study.
The PHI was established in 2005 with the goal of maintaining
health rather than treating disease. The PHI data are
collected from a longitudinal study of health measures in over
750 healthy employees of Emory University and Georgia Tech.
We use data for 52 participants for whom gene expression and
metabolomics data at baseline were available and who were
at a high risk of developing cardiovascular diseases, as
defined by the Framingham risk scores (D’Agostino et al.,
2008). These participants comprise 32 females and 20 males
with a mean age of 47.35 years. The gene expression data
consist of 38,624 probes and the metabolomic data consist of
6,009 features, where each metabolomic feature is defined by
its mass-to-charge ratio (m/z) and retention time, and its
relative concentration is captured by ion intensity. We
exclude genes whose expression variance and entropy fall
below the 90th and 20th percentiles, respectively, resulting
in 1,547 genes. For the metabolomics data, we exclude
features with more than 50% zeros and use mummichog (Li et
al., 2013) to annotate the m/z features, resulting in 252
metabolites.
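The filtering steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' code: all function names are hypothetical, entropy is taken to be the Shannon entropy of an equal-width histogram (the text does not name the estimator or the number of bins), and the two gene cutoffs are computed on the full gene set and applied jointly (the text does not say whether they were applied jointly or in sequence). The mummichog annotation step is not shown.

```python
import math
from collections import Counter
from statistics import variance


def percentile(values, pct):
    """Linear-interpolation percentile (pct in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def histogram_entropy(values, bins=10):
    """Shannon entropy (in bits) of one feature after equal-width binning.

    ASSUMPTION: the histogram estimator and bins=10 are illustrative choices.
    """
    low, high = min(values), max(values)
    if high == low:
        return 0.0  # a constant feature falls in a single bin
    width = (high - low) / bins
    # The maximum value is clamped into the last bin.
    counts = Counter(min(int((v - low) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def filter_genes(expr, var_pct=90.0, ent_pct=20.0, bins=10):
    """Return indices of genes kept by the variance and entropy cutoffs.

    expr is a list with one list of expression values per gene.
    ASSUMPTION: both cutoffs are computed over all genes and applied jointly.
    """
    variances = [variance(g) for g in expr]
    entropies = [histogram_entropy(g, bins) for g in expr]
    var_cut = percentile(variances, var_pct)
    ent_cut = percentile(entropies, ent_pct)
    return [j for j in range(len(expr))
            if variances[j] >= var_cut and entropies[j] >= ent_cut]


def filter_zero_features(features, max_zero_frac=0.5):
    """Return indices of features whose fraction of zeros is at most max_zero_frac.

    features is a list with one list of ion intensities per metabolomic feature.
    """
    return [j for j, f in enumerate(features)
            if sum(1 for v in f if v == 0) / len(f) <= max_zero_frac]
```

With the default arguments, `filter_genes` keeps a gene only if its variance is at or above the 90th percentile and its entropy is at or above the 20th percentile, while `filter_zero_features` drops a feature only when strictly more than half of its intensities are zero. For data of the size described here, an array library would be the more practical choice; the standard library is used only to keep the sketch self-contained.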
2017, The International Biometric Society