Genetics and Genomics
Cancer subtype identification using somatic mutation data
Marieke Lydia Kuijjer, Joseph Nathaniel Paulson, Peter Salzman, Wei Ding and John Quackenbush
BACKGROUND: With the advent of next-generation sequencing technologies, we have made great progress in identifying recurrent
mutational drivers of cancer. As cancer tissues are now frequently screened for specific sets of mutations, a large number of
samples have become available for analysis. Classification of patients with similar mutation profiles may help identify subgroups
of patients who might benefit from specific types of treatment. However, classification based on somatic mutations is challenging
due to the sparseness and heterogeneity of the data.
METHODS: Here we describe a new method to de-sparsify somatic mutation data using biological pathways. We applied this
method to 23 cancer types from The Cancer Genome Atlas, including samples from 5805 primary tumours.
RESULTS: We show that, for most cancer types, de-sparsified mutation data associate with phenotypic data. We identify poor
prognostic subtypes in three cancer types, which are associated with mutations in signal transduction pathways for which targeted
treatment options are available. We identify subtype–drug associations for 14 additional subtypes. Finally, we perform a pan-cancer
subtyping analysis and identify nine pan-cancer subtypes, which associate with mutations in four overarching sets of biological pathways.
CONCLUSIONS: This study is an important step toward understanding mutational patterns in cancer.
British Journal of Cancer (2018) 118:1492–1501; https://doi.org/10.1038/s41416-018-0109-7
Cancer is a heterogeneous disease that can develop in different
tissues and cell types. Even within one cancer type, the disease
may manifest itself in multiple subtypes, which are usually
distinguished based on different histology, molecular profiles or
specific mutations, and which may lead to different clinical
outcomes. Identifying new cancer subtypes can help classify
patients into groups with similar clinical phenotypes, prognosis
or response to treatment. As an example, breast cancer is typically
classified into four primary molecular subtypes based on the
expression of HER2, hormone receptors and tumour grade, and
these subtypes have different prognoses and respond
differently to hormone therapy. While these subtypes are used to
manage patient treatment, even here we know that individual
subtypes themselves represent a diversity of smaller groups.
Since the onset of large-scale genomic experiments, cancer
subtypes have been identified in multiple cancers, using mRNA
and microRNA expression levels, copy
number alterations and combinations of different ‘omics data,
but few studies have subtyped patients based on somatic
mutations. Somatic mutations play a large role in cancer
development and disease progression, and mutational profiling
is used far more commonly than other ‘omics analyses in clinical
practice because most clinical guidelines are based on single gene
mutations. Consequently, classification based on patterns of
mutation could be particularly informative for identification of
subgroups of patients who might respond to specific targeted
treatment regimens and of those who are unlikely to respond.
However, subtype classification using somatic mutations in
cancer is challenging, mainly because the data are very sparse:
many tumours only have a handful of mutations in coding regions,
yet the total number of mutations within a population is typically
substantial. Often, frequent cancer drivers such as TP53 are
mutated, as well as so-called “passenger” events that are
considered mutational noise yet which may still influence tumour
properties. And even within the same cancer type, tumours often
exhibit very different mutational patterns, including drivers and
passengers, as well as mutations that may fall somewhere in
between.
To classify sparse somatic mutation data into subtypes,
published methods generally first de-sparsify the data. Some
methods use a gene–gene network as “prior” knowledge to
de-sparsify the data. Hofree et al., for example, use network
propagation to “fill in” the mutational status of genes that
neighbour mutated drivers in protein–protein interaction networks,
while Le Morvan et al. use networks from Pathway
Commons to normalise a patient’s mutational profile by adding
“missing” or by removing “non-essential” mutations.
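To make the idea of network-based de-sparsification concrete, the following is a minimal sketch of network propagation in the style of a random walk with restart. It is an illustration under stated assumptions, not the implementation used by Hofree et al. or Le Morvan et al.: the function name `propagate_mutations`, the row normalisation of the adjacency matrix, the restart weight `alpha = 0.7` and the toy three-gene network are all hypothetical choices made for this example.

```python
import numpy as np


def propagate_mutations(mutations, adjacency, alpha=0.7, tol=1e-6, max_iter=100):
    """Smooth a patients-by-genes mutation matrix over a gene-gene network.

    Iterates F <- alpha * F @ W + (1 - alpha) * F0 until the update is
    smaller than `tol`, where W is the row-normalised adjacency matrix.
    All parameter defaults are illustrative assumptions.
    """
    f0 = np.asarray(mutations, dtype=float)
    adj = np.asarray(adjacency, dtype=float)

    # Row-normalise so each gene shares its score equally among its neighbours.
    degree = adj.sum(axis=1, keepdims=True)
    degree[degree == 0] = 1.0  # avoid division by zero for isolated genes
    w = adj / degree

    f = f0.copy()
    for _ in range(max_iter):
        f_next = alpha * f @ w + (1 - alpha) * f0
        if np.linalg.norm(f_next - f) < tol:
            return f_next
        f = f_next
    return f


# Toy network (hypothetical): a chain of three genes, 0 - 1 - 2.
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
# One patient with a single mutation in gene 0.
mutations = np.array([[1, 0, 0]])
smoothed = propagate_mutations(mutations, adjacency)
print(smoothed.round(3))
```

In this toy case the single mutation in gene 0 spreads along the chain, so the unmutated genes 1 and 2 receive non-zero scores that decrease with distance from the mutated gene. This is the sense in which propagation “fills in” a sparse profile, and it also shows why the result depends entirely on which edges the prior network contains.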
Data de-sparsification using gene–gene networks has been
helpful in identifying subnetworks involved in cancer, as well as
in identifying genes associated with patient survival. However,
gene–gene networks depend on a set of known “prior” interactions,
and these priors may not be “correct”, in the sense
that they may not be relevant to the tissue or tumour under study.
This reliance on “canonical” networks might overemphasise genes
that are connected to mutational drivers through such
Received: 16 January 2018 Revised: 11 April 2018 Accepted: 12 April 2018
Published online: 16 May 2018
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA;
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA;
Department of Biostatistics, Product Development, Genentech Inc., South San Francisco, San Francisco, CA, USA;
Bristol-Myers Squibb, Devens, MA, USA;
Department of Computer Science, University of Massachusetts, Boston, MA, USA and
Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA
Correspondence: Marieke Lydia Kuijjer (email@example.com)
© The Author(s) 2018 Published by Springer Nature on behalf of Cancer Research UK