Computational identification of micro-structural variations and their proteogenomic consequences in cancer

Computational identification of micro-structural variations and their proteogenomic consequences... Abstract Motivation Rapid advancement in high throughput genome and transcriptome sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. We introduce a computational framework, ProTIE, to integratively analyze all three types of omics data for a complete molecular profile of a tissue sample. Our framework features MiStrVar, a novel algorithmic method to identify micro structural variants (microSVs) on genomic HTS data. Coupled with deFuse, a popular gene fusion detection method we developed earlier, MiStrVar can accurately profile structurally aberrant transcripts in tumors. Given the breakpoints obtained by MiStrVar and deFuse, our framework can then identify all relevant peptides that span the breakpoint junctions and match them with unique proteomic signatures. Observing structural aberrations in all three types of omics data validates their presence in the tumor samples. Results We have applied our framework to all The Cancer Genome Atlas (TCGA) breast cancer Whole Genome Sequencing (WGS) and/or RNA-Seq datasets, spanning all four major subtypes, for which proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) have been released. A recent study on this dataset focusing on SNVs has reported many that lead to novel peptides. Complementing and significantly broadening this study, we detected 244 novel peptides from 432 candidate genomic or transcriptomic sequence aberrations. Many of the fusions and microSVs we discovered have not been reported in the literature. Interestingly, the vast majority of these translated aberrations, fusions in particular, were private, demonstrating the extensive inter-genomic heterogeneity present in breast cancer. Many of these aberrations also have matching out-of-frame downstream peptides, potentially indicating novel protein sequence and structure. Availability and implementation MiStrVar is available for download at https://bitbucket.org/compbio/mistrvar, and ProTIE is available at https://bitbucket.org/compbio/protie. Contact cenksahi@indiana.edu Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Rapid advances in high throughput sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. The availability of three types of fundamental omics data provides complementary views on the global molecular profile of a tissue under normal and disease conditions (Nesvizhskii, 2014). Recently developed computational methods have aimed to integrate two or three of these data types to address important biological questions, such as (i) correlating the abundances of transcription and translation products (Ning and Nesvizhskii, 2010); (ii) detecting peptides associated with un-annotated genes or splice variants [in mouse (Ning et al., 2012), C.elegans (Woo et al., 2014), zebrafish (Castellana et al., 2014) and human samples (Mo et al., 2008; Sheynkman et al., 2013)]; (iii) characterizing chimeric proteins by searching unidentified tandem mass spectrometry(MS/MS) data through the use of conventional peptide identification algorithms applied to a pre-assembled database of ‘known’ chimeric transcripts from the literature (Frenkel-Morgenstern et al., 2012). In the past year or so, several studies have aimed to identify novel peptides matching patient specific transcripts derived from RNA-Seq data. For example, Zhang et al. (2014) focused on identifying novel peptides involving Single Amino Acid Variants (SAAVs) in colorectal cancer. A later study by Cesnik et al. (2015) also considered novel splice junctions and (a limited set of user defined) Post-Translational Modifications (PTM) in a number of cell lines. Because of the importance of phosphorylation in cellular activity and cancer treatment (Reimand et al., 2013), this was further expanded to identify novel phosphorylation sites by Mertins et al. (2016), on the CPTAC breast cancer dataset, which is the subject of our paper. However, none of these studies aimed to perform integrative analysis of transcribed and translated genomic structural alterations such as fusions, inversions and duplications in tumor tissues. Genomic structural variants (SVs) significantly alter the sequence composition of associated genomic regions. Major SV types include (segmental) deletions, duplications (tandem or interspersed), inversions, translocations and transpositions. SVs observed in exonic regions may lead to aberrant protein products. Many such SVs have been associated with disease conditions and especially cancer. Common SVs associated with cancer include deletions in tumor suppressors such as BRCA1/2 (Ewald et al., 2009) in breast cancer, duplications in FMS-like tyrosine kinase (FLT3) gene in acute myeloid leukemia (AML) (Nakao et al., 1996) and an inversion causing cyclin D1 overexpression in parathyroid neoplasms (Hemmer et al., 2001). A gene fusion occurs when exonic regions of two (or more) distinct genes are concatenated to form a new chimeric gene, as a result of a large scale SV. Gene fusions can disrupt the normal function of one or both partners, for example by up-regulating an oncogene (e.g. TMPRSS2-ERG) or generating a novel or truncated protein [e.g. BCR-ABL1 (Fernandez-Luna, 2000)]. They have been demonstrated to play important roles in the development of haematological disorders, childhood sarcomas and in a variety of solid tumors. For example, ETS gene fusions are present in 80% of malignancies of the male genital organs, and as a result these fusions alone are associated with 16% of all cancer morbidity (Mitelman et al., 2007). There are a number of available computational tools for detecting structural variants, each based on one or more of the following general strategies. (i) Detection of variants using discordantly mapping paired end reads i.e. read mappings that either invert one or both of the read ends, or change the expected distance between the read ends. Tools using this approach include Breakdancer (Fan et al., 2014) and VariationHunter (Hormozdiari et al., 2009). (ii) Detection of variants using split-read mappings, which partition a single end read into two and map them independently to two distant loci, or soft-clipped read mappings, which map only a prefix/suffix of a read. One example employing this approach is Socrates (Schroder et al., 2014). (iii) Detection of variants using an assembly based approach. These tools map assembled contigs for improved precision. Examples include Barnacle (Swanson et al., 2013) and Dissect (Yorukoglu et al., 2012) (both of which are RNA-Seq analysis tools, but can also analyze genomic data). Additional tools employing a combination of these strategies include Pindel (Ye et al., 2009), Delly (Rausch et al., 2012), GASVPro (Sindi et al., 2012) and HYDRA (Quinlan et al., 2010). Our focus in this paper is microSVs (micro structural variants), i.e. events involving genomic sequences shorter than 1 kb, especially in exonic regions, since they are more likely to result in a translated protein. Available tools for SV discovery typically fail to capture microSVs, or do so while producing many false positives, thus the problem of robustly discovering microSVs remains open. Our first contribution in this paper is a novel algorithmic tool named MiStrVar (Micro Structural Variant caller), which identifies microSV breakpoints at single-nucleotide resolution by (1) identifying each one-end anchor (OEA), i.e. a paired-end read where one end maps to the reference genome and the other end cannot be mapped, (2) clustering OEAs based on (i) mapping loci similarity and (ii) the possibility of assembling the unmappable ends into a single contig and (3) aligning the contig formed by unmappable ends with the reference genome—in the vicinity of the mapped ends—simultaneously detecting putative inversions, duplications, indels or single nucleotide variants (SNVs) through a unified dynamic programming formulation. MiStrVar's approach has several advantages over existing SV discovery tools. First, MiStrVar analyzes many more reads than those considered by the tools using only split-reads or soft clipped reads. Any mapped read which has a hamming distance to the reference greater than four (as a default parameter, which can be user modified) is considered for assembly. This allows for the discovery of inversions or duplications as short as 5 bp and inversions with palindromic sequences, improving sensitivity. Second, this approach is much less time consuming than assembly based methods, since only the subset of unmappable reads are assembled rather than the entire genome. Finally, MiStrVar uses a unified dynamic programming formulation, superior to tools that identify each type of variant individually, especially because these tools misinterpret certain variants, such as inversions, as a combination of other variants. See Supplementary Figure S1 for a detailed illustration. Both fusions and microSVs may be independently observed in genomic, transcriptomic and proteomic data; however, the most impactful aberrations, especially in the context of cancer, are the ones that can be observed in all levels in the same tissue simultaneously. In such cases, integrative analysis of these three omics data types can provide independent evidence for the presence and heritability of aberrations. For example, trans-splicing events, which lead to chimeric transcripts, can only be observed in transcriptomic (but not in genomic) data, and thus can be distinguished from fusion events with genomic breakpoints through simultaneous analysis of genomic and transcriptomic data acquired from the same sample. No available large-scale study has been conducted on the detection and validation of aberrant proteins and their genomic and transcriptomic origins. As mentioned earlier, expressed aberrant genome variants can have considerable functional influence on proteins, and as such, they may affect molecular pathway activity or pathogenesis in disease, especially in cancer. Detection of aberrant protein variants provides new insights into diagnostic marker identification and drug development (recurrent protein aberrations can imply potential drug targets) and can help develop novel strategies for therapeutic intervention. Proteomic technologies have enabled high throughput, sensitive and deep protein analysis for complex disease-associated samples, aiming at discovering potential disease protein biomarkers (Gillette and Carr, 2013; Mustafa et al., 2013; Wulfkuhle et al., 2003), including low-abundant proteins or protein isoforms, or variants. Moreover, proteomic analyses can provide complementary information to transcriptomic and genomic analysis, as proteomic analyses are carried out by completely different technologies (i.e. mass spectrometry or MS) from DNA sequencing. Furthermore, advancement in MS instrumentation has enabled proteomic analysis to achieve sensitivity on par with RNA-seq in detecting low abundant events of gene expression in complex samples (Zhang et al., 2014). Therefore, integrating transcriptomic and proteomic data can improve both the sensitivity and confidence in characterizing expressed aberrant variants in complex samples such as tumor tissues. Our second contribution in this paper is MiStrVar (ProTeogenomics Integration Engine), the first computational framework that integrates high throughput genomic, transcriptomic and proteomic data to identify translated structural aberrations, specifically gene fusions and microSVs, in protein-coding genes. In particular, ProTIE takes sequence aberrations from WGS and RNA-Seq data as its input and validates them on the mass-spectrometry based proteomics data, while ensuring that each such proteomic signature is unique to the matching sequence aberration. By integrating multiple data sources simultaneously, ProTIE is able to provide a strongly supported set of candidate aberrations from the highly sensitive results of MiStrVar and deFuse. This is particularly helpful for selecting target events or genes for clinical studies. Results. We ran our computational framework to detect all translated gene fusions in RNA-Seq (low coverage 50 bp paired-end) data in the complete set of 105 TCGA (The Cancer Genome Atlas) breast cancer samples for which CPTAC (Clinical Proteomic Tumor Analysis Consortium) mass spectrometry data have been released (The primary goal of CPTAC is to characterize protein level expression differences for SNVs/SAAVs. Our focus here is complementary to the goals of CPTAC.). These 105 samples include all four of the most common intrinsic subtypes of breast cancer (Koboldt et al., 2012). Among them, 22 samples also have matching WGS data, on which we used our framework to identify exonic microSVs. This resulted in 206 255 fusions and 69 876 microSVs across the 105 samples. 2215 of these microSVs are also supported by transcriptomic (RNA-Seq) evidence. All breakpoints from the predicted fusions and microSVs were then analyzed for identifying supporting peptides from mass spectrometry data. This yielded 244 aberrant peptides from 432 possible aberrations. More specifically, 169 novel peptides originate from 295 fusion candidates (many of the fusions are recurrent and thus produce the same novel fusion peptide) and 75 peptides originate from 137 potential microSVs; this is of particular note since many of the genomic microSVs are recurrent, yet the ones that are translated are mostly private. Note that a sequence aberration may give rise to more than one novel peptide in case it results in a frameshift. See Table 1 for a summary of results [One interesting observation is that among the microSVs discovered, only 4 (1 microinversions and 3 tandem microduplications) have supporting evidence at all omics levels. This implies that the transcriptomic support for the remaining translated microSVs are too low to be detected, partially due to low abundance of RNA-Seq data made available by TCGA on the breast cancer samples. This also suggests that with deeper coverage RNA-Seq data, ProTIE is likely to detect additional translated gene fusions.]. Table 1. Distribution of 244 detected, high confidence, aberrant peptides over four breast cancer subtypes, across 105 patients Cancer # Patients # Peptides Subtype Total Aberrant Fusion Inversion Duplication Basal-Like 25 22 50 57a 2 HER2-Enriched 18 17 41 3 3 Luminal A 29 26 49 0 2 Luminal B 33 31 78 8 3 Cancer # Patients # Peptides Subtype Total Aberrant Fusion Inversion Duplication Basal-Like 25 22 50 57a 2 HER2-Enriched 18 17 41 3 3 Luminal A 29 26 49 0 2 Luminal B 33 31 78 8 3 # Aberrant Patients indicates the number of patients with either detected fusion peptides or microSV peptides in that subtype. a The high number of microinversion peptides in Basal-Like breast cancer can be attributed to two patients, A0CM, A0J6, whose genomes had gone through substantial reorganization. Table 1. Distribution of 244 detected, high confidence, aberrant peptides over four breast cancer subtypes, across 105 patients Cancer # Patients # Peptides Subtype Total Aberrant Fusion Inversion Duplication Basal-Like 25 22 50 57a 2 HER2-Enriched 18 17 41 3 3 Luminal A 29 26 49 0 2 Luminal B 33 31 78 8 3 Cancer # Patients # Peptides Subtype Total Aberrant Fusion Inversion Duplication Basal-Like 25 22 50 57a 2 HER2-Enriched 18 17 41 3 3 Luminal A 29 26 49 0 2 Luminal B 33 31 78 8 3 # Aberrant Patients indicates the number of patients with either detected fusion peptides or microSV peptides in that subtype. a The high number of microinversion peptides in Basal-Like breast cancer can be attributed to two patients, A0CM, A0J6, whose genomes had gone through substantial reorganization. 2 Materials and methods Our computational framework (see Fig. 1), is comprised of a number of algorithmic tools that we developed for detecting transcriptomic and genomic aberrations, and searching for expressed protein variants resulting from these aberrant sequences. Given a set of genomic (WGS), transcriptomic (RNA-seq) and proteomic (Mass Spectrometry) data, each collected from the tumor tissue of a patient, our pipeline detects translated sequence aberrations in three major steps. Fig. 1. View largeDownload slide Overview of the computational pipeline for identifying translated sequence aberrations Fig. 1. View largeDownload slide Overview of the computational pipeline for identifying translated sequence aberrations Each whole genome sequencing dataset is analyzed with MiStrVar, the microSV discovery tool we introduce in this paper, to identify microSVs occurring in protein-coding genes. (Note that our computational framework provides the option of validating genomic microSVs at the transcriptomic level by identifying RNA-Seq reads associated with each microSV breakpoint.) Each transcriptomic dataset is analyzed by our in-house fusion detection method deFuse (McPherson et al., 2011b), which reports potential fusion events between two protein coding genes, and the fused transcript sequences spanning the fusion breakpoints. [Note that our computational framework enables the use of our integrative fusion detection methods nFuse (McPherson et al., 2012)/Comrad(McPherson et al., 2011a) for corroborating potential fusions observable in WGS and RNA-Seq data.] All omics data is finally integratively analyzed through ProTIE, our novel ProTeogenomics Integration Engine as follows. Each mass spectrometry dataset is searched against a protein sequence database consisting of all human proteins from Ensembl human protein database GRCh37.70 (Ensembl, ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.70.pep.all.fa.gz), along with a database of proteins generated by fused transcripts and microSVs, by the use of MS-GF+ search engine (Kim and Pevzner, 2014). Aberrant peptides identified by the procedure with high confidence [e.g. at 1% false discovery rate estimated by using the target-decoy approach (Elias and Gygi, 2007)] are reported, provided they are also detected in the genomic/transcriptomic dataset from the same tumor tissue sample. 2.1 Detection of fusions and microSVs in WGS and RNA-seq data To detect fusions in RNA-Seq data, we applied deFuse (McPherson et al., 2011b) which predicts fusion transcripts based on analyzing discordantly mapped read-pairs and one-end anchors. To detect microSVs in WGS data, we applied our novel micro-structural variant caller, MiStrVar, which works in three major steps (See Supplementary Fig. S2 for an overview): In step (A), MiStrVar identifies all one-end anchors (OEA) in the read data: an OEA is a paired-end read for which only one end maps to the reference genome within a user defined error threshold. Once all reads are (multiply) mapped to a reference genome using mrsFAST-ultra (Hach et al., 2010, 2014), and all OEAs are extracted, the mapped ends of OEAs are clustered based on the mapping loci. MiStrVar provides the user two options for cluster identification, each satisfying one of the following distinct goals. For applications where sensitivity is of high priority, MiStrVar employs a sweeping algorithm for OEA mapping loci [introduced for VariationHunter (Hormozdiari et al., 2009)]. For applications where running time is of high priority, MiStrVar employs an iterative greedy strategy. In step (B), for each OEA cluster identified in step (A), MiStrVar assembles the unmapped end of the reads to form contigs (of length <400 bp) by aiming to solve the NP-hard (Gallant et al., 1980) dominant superstring (DSS) problem. MiStrVar employs a greedy strategy similar to that used to compute a constant factor approximation to the shortest superstring problem (Blum et al., 1994). In step (C), each contig associated to an OEA cluster is aligned to a region (of length several kilobases long) surrounding the OEA mapping loci, first through a simple local-to-global sequence alignment algorithm, that does not consider any structural alteration. (The reverse complement of the contig is also aligned to the same region.) The start and end position of this first, crude alignment is used to determine the approximate locus and length of the potential microSV implied by the contig. The exact microSV breakpoints are obtained in the next step through a more sophisticated alignment that considers structural alterations, which is applied to the portion of the reference genome restricted by the first alignment. The dynamic programming formulation for this alignment is an extension of the Schöniger-Waterman algorithm (Schöniger and Waterman, 1992) which was designed to capture inversions in the alignment. Specifically, our extensions enable the user to (i) discover the single best optimal event, rather than an arbitrary number of events, (ii) handle gaps extending over breakpoints (in cases of missing contig sequence) and (iii) simultaneously predict duplications, insertions, deletions and SNVs in addition to inversions. Further details on deFuse and MiStrVar methodology can be found in the Supplementary Text. 2.2 Identification of translated and transcribed sequence aberrations ProTIE provides the ability to detect translated aberrations by searching mass spectra against an aberrant peptide database. More specifically, given transcriptomic breakpoints pointing to fusions or microSVs, ProTIE identifies respective aberrant peptides from proteomic data by first generating a peptide database, and then identifying aberrant peptides based on mass spectrometry search results provided by MS-GF+ (Kim and Pevzner, 2014). (See Section S2.3 in Supplementary Material for details of database construction and parameters used in proteomics search.) Our pipeline also provides the user with the additional ability to jointly analyze matching WGS and RNA-Seq data for identifying transcribed genomic microSVs. Given a set of genomic microSVs, along with their breakpoints detected by MiStrVar, our pipeline generates corresponding aberrant transcripts. It then maps RNA-Seq reads to the collection of these aberrant transcripts. After filtering reads that can be mapped to a known isoform or potential novel spliceform, the remaining mappings provide evidence for aberrations in transcribed regions. See Section S2.5 in Supplementary Material for details about mappings/read filtration steps. 3 Experimental results CPTAC Breast Cancer Dataset. Clinical Proteomic Tumor Analysis Consortium (CPTAC, http://proteomics.cancer.gov) (Ellis et al., 2013; Whiteaker et al., 2014) aims to provide proteogenomic characterization of specific cancers based on joint analysis of proteomic, transcriptomic and genomic data acquired from the same group of cancer patients. CPTAC currently focuses on the relationship between protein abundance, somatic mutations and copy number alterations occurring in cancer-related genes (Zhang et al., 2014). Information about aberrations hidden in unidentified spectra and unmapped sequenced reads have not been revealed in the current CPTAC analysis framework; this happens to be the main focus of our paper. At the time of submission of this paper, proteomics data for tumor samples from three cancer types had been released by CPTAC: colorectal cancer, breast cancer and ovarian cancer. In addition, The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) has released RNA-Seq and WGS data on both normal and tumor tissues from the same group of patients through Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu/). RNA-Seq data for breast and ovarian cancer patients are in the form of paired-end reads, however, for most of colon and rectal cancer samples only single-end reads were collected. Because we rely on paired-end mappings for detecting fusions and microSVs and since the RNA-Seq data from normal tissues from the ovarian cancer patients had not been released at the time of the submission, our focus in this paper is the breast cancer dataset. Details about CPTAC samples used in our analysis can be found in Supplementary Tables S5, S6. Breast Cancer Cell Line. In addition to the CPTAC and TCGA datasets, we used the HCC1143 ductal breast cancer cell line (triple negative breast cancer cell line from ATCC) for which we obtained matching tumor/normal Illumina HiSeq WGS, RNA-Seq and mass spectrometry data. The matching normal cell line, HCC1143-BL, is a B lymphoblastoid cell line initiated from peripheral blood lymphocytes from the same patient as HCC1143 by transformation with Epstein-Barr virus (EBV). The WGS data was obtained from NCI Genomic Data Commons (https://gdc.cancer.gov/), originally sequenced as part of the Cancer Cell Line Encyclopedia Project (Barretina et al., 2012). We used this cell line as preliminary validation for our approach before starting full scale analysis. 3.1 Gene fusion detection by deFuse Gene Fusions in the HCC1143 Breast Cancer Cell Line. We have run our fusion detection method, deFuse to detect gene fusions on RNA-Seq data from HCC1143 cell line. There are 81.73 M paired end reads of 101 bp length. Based on concordant mapping results, the average fragment length and standard deviation were 264.2 and 86.59 bp, respectively. deFuse predicts 1325 fusions from this dataset, out of which 74 are considered high confidence predictions based on the filtering criteria employed by deFuse (McPherson et al., 2011b). Gene fusions in breast cancer patient RNA-seq data. Each RNA-Seq dataset from the CPTAC breast cancer patient cohort was, on average, comprised of 76 M paired-end Illumina reads with length 50 bp. Based on transcriptome mapping results, the average fragment length and standard deviation were 190.3 and 65.47 bp, respectively. In total deFuse detected 206 255 fusions; on average, this amounts to 1964 predictions per sample. However, many of these predictions had low deFuse scores, either due to low sequence similarity or limited read support, and thus were not good fusion candidates. Only 3907 of these predictions (roughly 2% of all predictions) in total are considered to be high confidence calls by deFuse. 3.2 MicroSV detection by MiStrVar MicroSV predictions were based on three WGS datasets. The first is a simulation dataset based on the Venter genome developed with the goal of assessing sensitivity and precision of our methods with respect to available tools for SV discovery. These results are summarized in Figure 2; more details can be found in Supplementary Material. The second dataset consists of WGS data from the HCC1143 cell line (both tumor and normal), which was used to assess our methods’ accuracy on a homogeneous tumor sample. The third dataset is comprised of 22 TCGA/CPTAC breast cancer WGS data, which were used for full scale evaluation of our methods. Fig. 2. View largeDownload slide Comparison of precision and recall of MiStrVar against other SV discovery tools Fig. 2. View largeDownload slide Comparison of precision and recall of MiStrVar against other SV discovery tools 3.3 MicroSVs in the HCC1143 breast cancer cell line Before running MiStrVar on the TCGA/CPTAC breast cancer samples, we applied it to the HCC1143 breast cancer cell line. We identified 116 microinversions and 197 microduplications (Supplementary Table S3.2) on this sample. Among these, 11 inversions and 12 duplications have both high read coverage and low mapping multiplicity. We focus only on these microSVs for the remainder of the discussion. Details on the 11 inversion candidates can be found in Table 2. All 11 inversions appear in both normal and matching tumor samples indicating that they are germline events. We experimentally validated these inversions using Sanger sequencing. The primers were constructed by using the inverted sequence flanked by 200–300 bp from the reference genome. Five of the predicted inversions show a clear sequence match between the amplicon and predicted inversion, validating these inversion candidates. A representative example is given for the inversion in SLC3A1 in Supplementary Table S8 and the complete set of chromatograms is included in the appendix. We note here that all the high confidence microinversions, except for the one found in UBP1, have an associated multiple nucleotide polymorphism (MNP) entry in dbSNP. This includes the microinversion in BOK, which was not validated by Sanger sequencing. Table 2. Sanger sequencing validation of top 11 microinversion and top 12 exonic microduplication (tandem or interspersed) candidates in the breast cancer cell line HCC1143 WGS Sup. Val. Len. Gene Ra Ib Tc Nd Tc Nd RSe dbSNP Inversion Calls 27 SLC3A1 Ut 100.0 66 62 – rs71416108 26 TNIK In 100.0 96 76 – rs781523247 29 CTTNBP2 In 100.0 76 81 – rs386717124 24 PFKP In 98.8 62 51 – rs386740061 32 NLRP4 In 98.0 80 103 – rs386811126 29 ZNF571 In 99.3 26 35 ① ① – rs386809055 23 OSBP2 In 100.0 49 24 ① ① – rs67147751 18 GNG12 In 98.7 37 43 ① – rs386632129 29 LINGO2 In 100.0 18 41 ① – rs386733960 30 UBP1 In 100.0 10 40 – – 12 BOK In 98.6 77 117 – rs386657165 Duplication Calls 6 GTPBP6 Ex 100.0 28 33 0 – 34 FAM20C Ex 98.0 12 22 0 rs774848096 45 KIAA1009 Ex 98.6 34 0 7 rs539790644 9 BAIAP2L2 Ex 100.0 48 33 0 rs142739979 27 RBMXL3 Ex 98.7 37 41 NE rs782097222 9 GPRIN2 Ex 100.0 90 69 ① ① 36 rs112620425 6 PALM2-AKAP2 Ex 99.3 16 46 ① 5 rs150402481 5 PRSS48 Ex 100.0 0 47 ① – rs71901196 6 ADAMTS19 Ex 98.6 0 29 – rs142924298 16 CIDEA Ex 100.0 0 24 – rs71369912 5 IRAK1BP1 Ex 98.8 79 65 0 rs146020132 7 ADAMTS7 Ex 100.0 40 33 0 rs781638345 WGS Sup. Val. Len. Gene Ra Ib Tc Nd Tc Nd RSe dbSNP Inversion Calls 27 SLC3A1 Ut 100.0 66 62 – rs71416108 26 TNIK In 100.0 96 76 – rs781523247 29 CTTNBP2 In 100.0 76 81 – rs386717124 24 PFKP In 98.8 62 51 – rs386740061 32 NLRP4 In 98.0 80 103 – rs386811126 29 ZNF571 In 99.3 26 35 ① ① – rs386809055 23 OSBP2 In 100.0 49 24 ① ① – rs67147751 18 GNG12 In 98.7 37 43 ① – rs386632129 29 LINGO2 In 100.0 18 41 ① – rs386733960 30 UBP1 In 100.0 10 40 – – 12 BOK In 98.6 77 117 – rs386657165 Duplication Calls 6 GTPBP6 Ex 100.0 28 33 0 – 34 FAM20C Ex 98.0 12 22 0 rs774848096 45 KIAA1009 Ex 98.6 34 0 7 rs539790644 9 BAIAP2L2 Ex 100.0 48 33 0 rs142739979 27 RBMXL3 Ex 98.7 37 41 NE rs782097222 9 GPRIN2 Ex 100.0 90 69 ① ① 36 rs112620425 6 PALM2-AKAP2 Ex 99.3 16 46 ① 5 rs150402481 5 PRSS48 Ex 100.0 0 47 ① – rs71901196 6 ADAMTS19 Ex 98.6 0 29 – rs142924298 16 CIDEA Ex 100.0 0 24 – rs71369912 5 IRAK1BP1 Ex 98.8 79 65 0 rs146020132 7 ADAMTS7 Ex 100.0 40 33 0 rs781638345 Note: Known cancer genes are indicated in bold face. a Region (Ut: UTR, In: Intron, Ex: Exon). b Identity. c Tumor. d Normal Two alleles ① One allele Zero alleles Inconclusive in the sample. e RNA-Seq NE: No expression of gene. Table 2. Sanger sequencing validation of top 11 microinversion and top 12 exonic microduplication (tandem or interspersed) candidates in the breast cancer cell line HCC1143 WGS Sup. Val. Len. Gene Ra Ib Tc Nd Tc Nd RSe dbSNP Inversion Calls 27 SLC3A1 Ut 100.0 66 62 – rs71416108 26 TNIK In 100.0 96 76 – rs781523247 29 CTTNBP2 In 100.0 76 81 – rs386717124 24 PFKP In 98.8 62 51 – rs386740061 32 NLRP4 In 98.0 80 103 – rs386811126 29 ZNF571 In 99.3 26 35 ① ① – rs386809055 23 OSBP2 In 100.0 49 24 ① ① – rs67147751 18 GNG12 In 98.7 37 43 ① – rs386632129 29 LINGO2 In 100.0 18 41 ① – rs386733960 30 UBP1 In 100.0 10 40 – – 12 BOK In 98.6 77 117 – rs386657165 Duplication Calls 6 GTPBP6 Ex 100.0 28 33 0 – 34 FAM20C Ex 98.0 12 22 0 rs774848096 45 KIAA1009 Ex 98.6 34 0 7 rs539790644 9 BAIAP2L2 Ex 100.0 48 33 0 rs142739979 27 RBMXL3 Ex 98.7 37 41 NE rs782097222 9 GPRIN2 Ex 100.0 90 69 ① ① 36 rs112620425 6 PALM2-AKAP2 Ex 99.3 16 46 ① 5 rs150402481 5 PRSS48 Ex 100.0 0 47 ① – rs71901196 6 ADAMTS19 Ex 98.6 0 29 – rs142924298 16 CIDEA Ex 100.0 0 24 – rs71369912 5 IRAK1BP1 Ex 98.8 79 65 0 rs146020132 7 ADAMTS7 Ex 100.0 40 33 0 rs781638345 WGS Sup. Val. Len. Gene Ra Ib Tc Nd Tc Nd RSe dbSNP Inversion Calls 27 SLC3A1 Ut 100.0 66 62 – rs71416108 26 TNIK In 100.0 96 76 – rs781523247 29 CTTNBP2 In 100.0 76 81 – rs386717124 24 PFKP In 98.8 62 51 – rs386740061 32 NLRP4 In 98.0 80 103 – rs386811126 29 ZNF571 In 99.3 26 35 ① ① – rs386809055 23 OSBP2 In 100.0 49 24 ① ① – rs67147751 18 GNG12 In 98.7 37 43 ① – rs386632129 29 LINGO2 In 100.0 18 41 ① – rs386733960 30 UBP1 In 100.0 10 40 – – 12 BOK In 98.6 77 117 – rs386657165 Duplication Calls 6 GTPBP6 Ex 100.0 28 33 0 – 34 FAM20C Ex 98.0 12 22 0 rs774848096 45 KIAA1009 Ex 98.6 34 0 7 rs539790644 9 BAIAP2L2 Ex 100.0 48 33 0 rs142739979 27 RBMXL3 Ex 98.7 37 41 NE rs782097222 9 GPRIN2 Ex 100.0 90 69 ① ① 36 rs112620425 6 PALM2-AKAP2 Ex 99.3 16 46 ① 5 rs150402481 5 PRSS48 Ex 100.0 0 47 ① – rs71901196 6 ADAMTS19 Ex 98.6 0 29 – rs142924298 16 CIDEA Ex 100.0 0 24 – rs71369912 5 IRAK1BP1 Ex 98.8 79 65 0 rs146020132 7 ADAMTS7 Ex 100.0 40 33 0 rs781638345 Note: Known cancer genes are indicated in bold face. a Region (Ut: UTR, In: Intron, Ex: Exon). b Identity. c Tumor. d Normal Two alleles ① One allele Zero alleles Inconclusive in the sample. e RNA-Seq NE: No expression of gene. The 12 duplication candidates are summarized in Table 2; all were exonic, i.e. fully or partially overlapping with exons. All of these duplications produced amplicons except for the one located in IRAK1BP1. Additionally, two amplicons from the normal sample (on genes ADAMTS19 and CIDEA) yielded a weak signal in the chromatogram so it was impossible to determine if they support the call or not; furthermore, the corresponding amplicon from the tumor sample showed no evidence of the call. Three of the nine remaining calls, in FAM20C, GTPBP6 and KIAA1009, show a clear match in the tumor sample but not in normal, indicating they are true somatic calls. Two calls, in BAIAP2L2 and RBMXL3, have a clear match in both tumor and normal samples, indicating they are germline calls. For discussion on hetrozygous calls, see Supplementary Section S3.2.1. In addition to MiStrVar we ran all the SV callers we tested on the HCC1143 cell line data. The parameters for all tools were identical to those used in the simulation. Out of these tools, only Pindel was able to identify any of the inversions. However, Pindel missed 2 of the 9 PCR validated inversion calls (in PFKP and OSBP2), out of 11 tested. The two calls made by MiStrVar that could not be validated were also called by Pindel, providing further evidence that MiStrVar improves Pindel with respect to both precision and recall. None of the tools were able to predict any of the microduplications. (Note that ITDetector was never able to complete execution after more than a month of processing). 3.3.1 MicroSVs in the complete set of TCGA-CPTAC breast cancer samples We applied MiStrVar and MiStrVar to the complete set of matched tumor/normal samples from 22 TCGA breast cancer patients for which matching WGS, RNA-Seq and CPTAC Mass Spectrometry data were all available (see Supplementary Material for details). Minimal filtering was used on the calls since few calls uniquely matched proteomic signatures. Note that we only focus on exonic microinversion and microduplication calls (either fully or partially overlapping with exons) for further analysis. The number of calls for each sample can be found in Supplementary Table S7. Although only exonic calls were used for further analysis, the highest confidence calls within intronic and UTR regions, with respective support of > 40 and > 10 (identity = 100%) were also collected (see Supplementary Table S8). We also provide the highest confidence microduplications without proteomic support (support > 40, identity = 100%) as well as somatic microduplications (see Supplementary Table S9). 3.4 ProTIE proteogenomics analysis of CPTAC breast cancer datasets CPTAC has produced global proteome and phosphor-proteome data for 105 TCGA breast cancer samples using iTRAQ protein quantification method. Each iTRAQ experiment included three TCGA samples and one common internal reference control sample. The internal reference is comprised of a mixture of 40 TCGA samples (out of the 105 breast cancer samples) with equal representation of the four breast cancer subtypes. Three of the TCGA samples were analyzed in duplicates for quality control purposes. See Section S5 in Supplementary Material for more details about mass spectrometry datasets. Based on the database search strategy mentioned in Supplementary Section S2.3, in each mixture, our first level analysis resulted in approximately 5342 spectra (1% FDR) matching to fusion peptide sequences, and about 620 spectra matching to microSV peptide sequences. If a matched peptide is identical to a substring of any known protein in Ensembl database, the corresponding spectra is discarded so as to ensure that the peptide is novel. The remaining results thus consist of all mass spectra in a single mixture supporting novel peptides originating from high confidence sequence aberrations. For a specific mixture, we can extract all the genes and the corresponding patient(s) generating these translated aberrations based on deFuse and MiStrVar calls. We also apply class-specific peptide-level FDR (Nesvizhskii, 2014) to provide more stringent results of novel proteins detection. In Table 3, a checkmark in the last column (labeled Str FDR) indicates that the corresponding PSMs pass this more stringent class-specific peptide-level FDR under 1%. Table 3. The list of interesting aberration events with translated peptides Patient Clinical Gene 1 Gene 2 deFuse Breakpoint BP Peptide # of Str Infoa score location Spectra FDR Fusions satisfying RNA-Seq level filtration conditions Fusion Calls A08G LB, IIA UBAP2 TEAD1 0.94 coding coding ✓ AINILLEGNSDTDQTAK 1 ✓ A0AM LB, IIA C17orf85 ZMYND15 0.92 coding downstream ✓ AQTPGDQETR 1 A12E LA, IIB C20orf111 FITM2 0.93 utr5p coding ✓ NVLNVVNR 1 A142 Bl, IIB ACTG1 ACTB 0.53 coding coding ✓ QDATLALGLVTNWDDMEK 1 ✓ A159 Bl, IIA ACTL7B KLF9 0.54 coding utr3p ✓ EAQLPLEALGEAIQLCFLSFLSVR 1 A15A LB, IIIC HOOK3 CTA-392C11.1 0.43 coding intron ✓ CHELDMQEK 1 ✓ YHMFSLISGAEQGEHMDTGR 2 ✓ A18U LB, IIIA ZNF354A RP11-383H13.1 0.48 coding intron ✓ DGSGVSSLGVTPESR 2 ✓ Additional Fusions with Multiple Supporting Peptides Fusion Calls A04A LA, IIIA ACTG1 ACTB 0.03 coding coding ✓ QKEALFQPSFLGMESCGIHETTFNSIMK 30 ✓ ✓ KEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A06N LB, IIIB KRT19 CTD-2165H16.1 0.01 coding pseudogene ✓ DNPGVLKPGMVVTFAPVNVTTEVK 1 ✓ NPGVLKPGMVVTFAPVNVTTEVK 13 ✓ A0AS LB, IIIA ACTG1 ACTG1P2 0.39 coding pseudogene ✓ (*)DLYTNTVLSGGTTMYPGIADR 5 ✓ ✓ (*)LYTNTVLSGGTTMYPGIADR 6 ✓ A0AS LB, IIIA ACTB KDM4C 0.01 coding intron ✓ (*)FCCPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ (*)CCPEALFQPSFLGMESCGIHETTFNSIMK 6 A0D1 H2, IIA RPL8 CTD-2165H16.1 0.01 coding pseudogene ✓ EAVPIVAAGVGEFEAGISK 1 AFVPISGWNGNNMLEPSANMPWFK 2 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPW 1 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWF 3 ✓ IGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 16 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 2 ✓ A0TT LB, III ACTB ACTG1 0.47 coding coding ✓ AWSPEALFQPSFLGMESCGIHETTFNSIMK 13 ✓ ✓ WSPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A12Z H2, II ACTB FNIP1 0.21 coding intron ✓ MTQIMFETFNTPAVYMAI 2 ✓ ✓ MTQIMFETFNTPAVYMAIQ 1 ✓ A158 Bl, IIA EEF1A1P7 EEF1A1P29 0.03 pseudogene pseudogene ✓ KIGYNPNTVAFVPISGWNGDNMLEPSANMPWFK 1 ✓ DGNASGTILLEALDCILPPTRPTDK 5 ✓ Patient Clinical Gene dbSNP RNA-Seq WGS Breakpoint SV Peptide # of Str Infoa score sourceb Support Length Spectra FDR Microduplication calls with supporting peptides Duplication Calls A0DG LA, IIA FAM134A ✓ B 1/1 6 QALDS|EE|EEEEEDVAAK 1 ✓ A0JM LB, IIB HSPBP1 rs3040014 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 2 ✓ A18U LB, IIIA HSPBP1 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 1 ✓ A18R H2, IB NUPL2 rs200880793 ✓ B 1/1 12 QQP|RQQP|QQQPSGNNR 1 A12Q LB, IIIC RPL14 rs369485042 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 4 ✓ A0DG Bl, I RPL14 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 3 ✓ A0YG LA, IIA RPL14 rs369485042 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 3 ✓ A0JM LA, IIB RPL14 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 6 ✓ A0CE Bl, IIA RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 9 ✓ A18R LB, IIB RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 4 ✓ A18U LB, IIA RPL14 ✓ B 1/1 27 GT|AAAAAAAAA|AAAAAAAAAAK 3 ✓ Microinversion calls with supporting peptides Inversion Calls A0D2 Bl, IIB PLIN4 rs12327614, rs56366613 N/A N 2/2 6 DTVCSGVT|SA|MNVAK 2 ✓ A0AV Bl, IIC PLIN4 rs75031432, rs79662071 ✓ B 2/2 6 DTVCSGVT|SA|MNVAK 4 ✓ A0YG LA, IIA CHD5 N/A N 1/2 939 KQVNYNDASQEDQ|GSER 1 ✓ A0J6 Bl, IIA C4orf21 T Between 73 KMTYVVNR 1 ✓ A0EY LB, IIB PTPN4 N/A N Between 794 FINNYIIK 1 A0J6 Bl, IIA ZNF415 Low T Between 497 QRAEILEK 1 A18R H2, IB ACSM2A Low T Between 528 VSQGNIK 1 A0CM Bl, IIA CC2D2A Low T Between 886 MEHMIQASVT 1 A0CM Bl, IIA ZNF257 Low T Between 732 FSHLIAGK 1 Patient Clinical Gene 1 Gene 2 deFuse Breakpoint BP Peptide # of Str Infoa score location Spectra FDR Fusions satisfying RNA-Seq level filtration conditions Fusion Calls A08G LB, IIA UBAP2 TEAD1 0.94 coding coding ✓ AINILLEGNSDTDQTAK 1 ✓ A0AM LB, IIA C17orf85 ZMYND15 0.92 coding downstream ✓ AQTPGDQETR 1 A12E LA, IIB C20orf111 FITM2 0.93 utr5p coding ✓ NVLNVVNR 1 A142 Bl, IIB ACTG1 ACTB 0.53 coding coding ✓ QDATLALGLVTNWDDMEK 1 ✓ A159 Bl, IIA ACTL7B KLF9 0.54 coding utr3p ✓ EAQLPLEALGEAIQLCFLSFLSVR 1 A15A LB, IIIC HOOK3 CTA-392C11.1 0.43 coding intron ✓ CHELDMQEK 1 ✓ YHMFSLISGAEQGEHMDTGR 2 ✓ A18U LB, IIIA ZNF354A RP11-383H13.1 0.48 coding intron ✓ DGSGVSSLGVTPESR 2 ✓ Additional Fusions with Multiple Supporting Peptides Fusion Calls A04A LA, IIIA ACTG1 ACTB 0.03 coding coding ✓ QKEALFQPSFLGMESCGIHETTFNSIMK 30 ✓ ✓ KEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A06N LB, IIIB KRT19 CTD-2165H16.1 0.01 coding pseudogene ✓ DNPGVLKPGMVVTFAPVNVTTEVK 1 ✓ NPGVLKPGMVVTFAPVNVTTEVK 13 ✓ A0AS LB, IIIA ACTG1 ACTG1P2 0.39 coding pseudogene ✓ (*)DLYTNTVLSGGTTMYPGIADR 5 ✓ ✓ (*)LYTNTVLSGGTTMYPGIADR 6 ✓ A0AS LB, IIIA ACTB KDM4C 0.01 coding intron ✓ (*)FCCPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ (*)CCPEALFQPSFLGMESCGIHETTFNSIMK 6 A0D1 H2, IIA RPL8 CTD-2165H16.1 0.01 coding pseudogene ✓ EAVPIVAAGVGEFEAGISK 1 AFVPISGWNGNNMLEPSANMPWFK 2 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPW 1 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWF 3 ✓ IGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 16 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 2 ✓ A0TT LB, III ACTB ACTG1 0.47 coding coding ✓ AWSPEALFQPSFLGMESCGIHETTFNSIMK 13 ✓ ✓ WSPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A12Z H2, II ACTB FNIP1 0.21 coding intron ✓ MTQIMFETFNTPAVYMAI 2 ✓ ✓ MTQIMFETFNTPAVYMAIQ 1 ✓ A158 Bl, IIA EEF1A1P7 EEF1A1P29 0.03 pseudogene pseudogene ✓ KIGYNPNTVAFVPISGWNGDNMLEPSANMPWFK 1 ✓ DGNASGTILLEALDCILPPTRPTDK 5 ✓ Patient Clinical Gene dbSNP RNA-Seq WGS Breakpoint SV Peptide # of Str Infoa score sourceb Support Length Spectra FDR Microduplication calls with supporting peptides Duplication Calls A0DG LA, IIA FAM134A ✓ B 1/1 6 QALDS|EE|EEEEEDVAAK 1 ✓ A0JM LB, IIB HSPBP1 rs3040014 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 2 ✓ A18U LB, IIIA HSPBP1 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 1 ✓ A18R H2, IB NUPL2 rs200880793 ✓ B 1/1 12 QQP|RQQP|QQQPSGNNR 1 A12Q LB, IIIC RPL14 rs369485042 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 4 ✓ A0DG Bl, I RPL14 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 3 ✓ A0YG LA, IIA RPL14 rs369485042 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 3 ✓ A0JM LA, IIB RPL14 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 6 ✓ A0CE Bl, IIA RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 9 ✓ A18R LB, IIB RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 4 ✓ A18U LB, IIA RPL14 ✓ B 1/1 27 GT|AAAAAAAAA|AAAAAAAAAAK 3 ✓ Microinversion calls with supporting peptides Inversion Calls A0D2 Bl, IIB PLIN4 rs12327614, rs56366613 N/A N 2/2 6 DTVCSGVT|SA|MNVAK 2 ✓ A0AV Bl, IIC PLIN4 rs75031432, rs79662071 ✓ B 2/2 6 DTVCSGVT|SA|MNVAK 4 ✓ A0YG LA, IIA CHD5 N/A N 1/2 939 KQVNYNDASQEDQ|GSER 1 ✓ A0J6 Bl, IIA C4orf21 T Between 73 KMTYVVNR 1 ✓ A0EY LB, IIB PTPN4 N/A N Between 794 FINNYIIK 1 A0J6 Bl, IIA ZNF415 Low T Between 497 QRAEILEK 1 A18R H2, IB ACSM2A Low T Between 528 VSQGNIK 1 A0CM Bl, IIA CC2D2A Low T Between 886 MEHMIQASVT 1 A0CM Bl, IIA ZNF257 Low T Between 732 FSHLIAGK 1 Note: A (✓) in the column BP (BreakPoint) indicates that the peptide crosses the fusion breakpoint, and in the last column that the peptide satisfies our more stringent FDR criterion. For inversions, associated peptides always span one or two breakpoints (1/2 or 2/2), or the inverted sequence between the breakpoints (‘Between’). For duplications, the associated peptide always spans the single breakpoint and the entire inserted sequence (1/1). Breakpoints in the peptide sequences are marked with ‘|’. Calls marked as ‘Low’ in the RNA-seq column are those from genes with low sequence coverage; and as ‘N/A’ indicate the lack of RNA-seq data for this sample. Known cancer genes are indicated in bold face. Starred peptides (*) are SAAVs according to validated peptides in Ensembl GRCh38 protein database. a LA, LB, Bl, H2 stand for Luminal A, Lunimal B, Basal-like and HER2-enriched, respectively. b T, B, N indicate event detection in tumor tissue, normal tissue or both. Table 3. The list of interesting aberration events with translated peptides Patient Clinical Gene 1 Gene 2 deFuse Breakpoint BP Peptide # of Str Infoa score location Spectra FDR Fusions satisfying RNA-Seq level filtration conditions Fusion Calls A08G LB, IIA UBAP2 TEAD1 0.94 coding coding ✓ AINILLEGNSDTDQTAK 1 ✓ A0AM LB, IIA C17orf85 ZMYND15 0.92 coding downstream ✓ AQTPGDQETR 1 A12E LA, IIB C20orf111 FITM2 0.93 utr5p coding ✓ NVLNVVNR 1 A142 Bl, IIB ACTG1 ACTB 0.53 coding coding ✓ QDATLALGLVTNWDDMEK 1 ✓ A159 Bl, IIA ACTL7B KLF9 0.54 coding utr3p ✓ EAQLPLEALGEAIQLCFLSFLSVR 1 A15A LB, IIIC HOOK3 CTA-392C11.1 0.43 coding intron ✓ CHELDMQEK 1 ✓ YHMFSLISGAEQGEHMDTGR 2 ✓ A18U LB, IIIA ZNF354A RP11-383H13.1 0.48 coding intron ✓ DGSGVSSLGVTPESR 2 ✓ Additional Fusions with Multiple Supporting Peptides Fusion Calls A04A LA, IIIA ACTG1 ACTB 0.03 coding coding ✓ QKEALFQPSFLGMESCGIHETTFNSIMK 30 ✓ ✓ KEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A06N LB, IIIB KRT19 CTD-2165H16.1 0.01 coding pseudogene ✓ DNPGVLKPGMVVTFAPVNVTTEVK 1 ✓ NPGVLKPGMVVTFAPVNVTTEVK 13 ✓ A0AS LB, IIIA ACTG1 ACTG1P2 0.39 coding pseudogene ✓ (*)DLYTNTVLSGGTTMYPGIADR 5 ✓ ✓ (*)LYTNTVLSGGTTMYPGIADR 6 ✓ A0AS LB, IIIA ACTB KDM4C 0.01 coding intron ✓ (*)FCCPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ (*)CCPEALFQPSFLGMESCGIHETTFNSIMK 6 A0D1 H2, IIA RPL8 CTD-2165H16.1 0.01 coding pseudogene ✓ EAVPIVAAGVGEFEAGISK 1 AFVPISGWNGNNMLEPSANMPWFK 2 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPW 1 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWF 3 ✓ IGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 16 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 2 ✓ A0TT LB, III ACTB ACTG1 0.47 coding coding ✓ AWSPEALFQPSFLGMESCGIHETTFNSIMK 13 ✓ ✓ WSPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A12Z H2, II ACTB FNIP1 0.21 coding intron ✓ MTQIMFETFNTPAVYMAI 2 ✓ ✓ MTQIMFETFNTPAVYMAIQ 1 ✓ A158 Bl, IIA EEF1A1P7 EEF1A1P29 0.03 pseudogene pseudogene ✓ KIGYNPNTVAFVPISGWNGDNMLEPSANMPWFK 1 ✓ DGNASGTILLEALDCILPPTRPTDK 5 ✓ Patient Clinical Gene dbSNP RNA-Seq WGS Breakpoint SV Peptide # of Str Infoa score sourceb Support Length Spectra FDR Microduplication calls with supporting peptides Duplication Calls A0DG LA, IIA FAM134A ✓ B 1/1 6 QALDS|EE|EEEEEDVAAK 1 ✓ A0JM LB, IIB HSPBP1 rs3040014 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 2 ✓ A18U LB, IIIA HSPBP1 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 1 ✓ A18R H2, IB NUPL2 rs200880793 ✓ B 1/1 12 QQP|RQQP|QQQPSGNNR 1 A12Q LB, IIIC RPL14 rs369485042 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 4 ✓ A0DG Bl, I RPL14 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 3 ✓ A0YG LA, IIA RPL14 rs369485042 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 3 ✓ A0JM LA, IIB RPL14 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 6 ✓ A0CE Bl, IIA RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 9 ✓ A18R LB, IIB RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 4 ✓ A18U LB, IIA RPL14 ✓ B 1/1 27 GT|AAAAAAAAA|AAAAAAAAAAK 3 ✓ Microinversion calls with supporting peptides Inversion Calls A0D2 Bl, IIB PLIN4 rs12327614, rs56366613 N/A N 2/2 6 DTVCSGVT|SA|MNVAK 2 ✓ A0AV Bl, IIC PLIN4 rs75031432, rs79662071 ✓ B 2/2 6 DTVCSGVT|SA|MNVAK 4 ✓ A0YG LA, IIA CHD5 N/A N 1/2 939 KQVNYNDASQEDQ|GSER 1 ✓ A0J6 Bl, IIA C4orf21 T Between 73 KMTYVVNR 1 ✓ A0EY LB, IIB PTPN4 N/A N Between 794 FINNYIIK 1 A0J6 Bl, IIA ZNF415 Low T Between 497 QRAEILEK 1 A18R H2, IB ACSM2A Low T Between 528 VSQGNIK 1 A0CM Bl, IIA CC2D2A Low T Between 886 MEHMIQASVT 1 A0CM Bl, IIA ZNF257 Low T Between 732 FSHLIAGK 1 Patient Clinical Gene 1 Gene 2 deFuse Breakpoint BP Peptide # of Str Infoa score location Spectra FDR Fusions satisfying RNA-Seq level filtration conditions Fusion Calls A08G LB, IIA UBAP2 TEAD1 0.94 coding coding ✓ AINILLEGNSDTDQTAK 1 ✓ A0AM LB, IIA C17orf85 ZMYND15 0.92 coding downstream ✓ AQTPGDQETR 1 A12E LA, IIB C20orf111 FITM2 0.93 utr5p coding ✓ NVLNVVNR 1 A142 Bl, IIB ACTG1 ACTB 0.53 coding coding ✓ QDATLALGLVTNWDDMEK 1 ✓ A159 Bl, IIA ACTL7B KLF9 0.54 coding utr3p ✓ EAQLPLEALGEAIQLCFLSFLSVR 1 A15A LB, IIIC HOOK3 CTA-392C11.1 0.43 coding intron ✓ CHELDMQEK 1 ✓ YHMFSLISGAEQGEHMDTGR 2 ✓ A18U LB, IIIA ZNF354A RP11-383H13.1 0.48 coding intron ✓ DGSGVSSLGVTPESR 2 ✓ Additional Fusions with Multiple Supporting Peptides Fusion Calls A04A LA, IIIA ACTG1 ACTB 0.03 coding coding ✓ QKEALFQPSFLGMESCGIHETTFNSIMK 30 ✓ ✓ KEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A06N LB, IIIB KRT19 CTD-2165H16.1 0.01 coding pseudogene ✓ DNPGVLKPGMVVTFAPVNVTTEVK 1 ✓ NPGVLKPGMVVTFAPVNVTTEVK 13 ✓ A0AS LB, IIIA ACTG1 ACTG1P2 0.39 coding pseudogene ✓ (*)DLYTNTVLSGGTTMYPGIADR 5 ✓ ✓ (*)LYTNTVLSGGTTMYPGIADR 6 ✓ A0AS LB, IIIA ACTB KDM4C 0.01 coding intron ✓ (*)FCCPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ (*)CCPEALFQPSFLGMESCGIHETTFNSIMK 6 A0D1 H2, IIA RPL8 CTD-2165H16.1 0.01 coding pseudogene ✓ EAVPIVAAGVGEFEAGISK 1 AFVPISGWNGNNMLEPSANMPWFK 2 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPW 1 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWF 3 ✓ IGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 16 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 2 ✓ A0TT LB, III ACTB ACTG1 0.47 coding coding ✓ AWSPEALFQPSFLGMESCGIHETTFNSIMK 13 ✓ ✓ WSPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A12Z H2, II ACTB FNIP1 0.21 coding intron ✓ MTQIMFETFNTPAVYMAI 2 ✓ ✓ MTQIMFETFNTPAVYMAIQ 1 ✓ A158 Bl, IIA EEF1A1P7 EEF1A1P29 0.03 pseudogene pseudogene ✓ KIGYNPNTVAFVPISGWNGDNMLEPSANMPWFK 1 ✓ DGNASGTILLEALDCILPPTRPTDK 5 ✓ Patient Clinical Gene dbSNP RNA-Seq WGS Breakpoint SV Peptide # of Str Infoa score sourceb Support Length Spectra FDR Microduplication calls with supporting peptides Duplication Calls A0DG LA, IIA FAM134A ✓ B 1/1 6 QALDS|EE|EEEEEDVAAK 1 ✓ A0JM LB, IIB HSPBP1 rs3040014 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 2 ✓ A18U LB, IIIA HSPBP1 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 1 ✓ A18R H2, IB NUPL2 rs200880793 ✓ B 1/1 12 QQP|RQQP|QQQPSGNNR 1 A12Q LB, IIIC RPL14 rs369485042 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 4 ✓ A0DG Bl, I RPL14 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 3 ✓ A0YG LA, IIA RPL14 rs369485042 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 3 ✓ A0JM LA, IIB RPL14 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 6 ✓ A0CE Bl, IIA RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 9 ✓ A18R LB, IIB RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 4 ✓ A18U LB, IIA RPL14 ✓ B 1/1 27 GT|AAAAAAAAA|AAAAAAAAAAK 3 ✓ Microinversion calls with supporting peptides Inversion Calls A0D2 Bl, IIB PLIN4 rs12327614, rs56366613 N/A N 2/2 6 DTVCSGVT|SA|MNVAK 2 ✓ A0AV Bl, IIC PLIN4 rs75031432, rs79662071 ✓ B 2/2 6 DTVCSGVT|SA|MNVAK 4 ✓ A0YG LA, IIA CHD5 N/A N 1/2 939 KQVNYNDASQEDQ|GSER 1 ✓ A0J6 Bl, IIA C4orf21 T Between 73 KMTYVVNR 1 ✓ A0EY LB, IIB PTPN4 N/A N Between 794 FINNYIIK 1 A0J6 Bl, IIA ZNF415 Low T Between 497 QRAEILEK 1 A18R H2, IB ACSM2A Low T Between 528 VSQGNIK 1 A0CM Bl, IIA CC2D2A Low T Between 886 MEHMIQASVT 1 A0CM Bl, IIA ZNF257 Low T Between 732 FSHLIAGK 1 Note: A (✓) in the column BP (BreakPoint) indicates that the peptide crosses the fusion breakpoint, and in the last column that the peptide satisfies our more stringent FDR criterion. For inversions, associated peptides always span one or two breakpoints (1/2 or 2/2), or the inverted sequence between the breakpoints (‘Between’). For duplications, the associated peptide always spans the single breakpoint and the entire inserted sequence (1/1). Breakpoints in the peptide sequences are marked with ‘|’. Calls marked as ‘Low’ in the RNA-seq column are those from genes with low sequence coverage; and as ‘N/A’ indicate the lack of RNA-seq data for this sample. Known cancer genes are indicated in bold face. Starred peptides (*) are SAAVs according to validated peptides in Ensembl GRCh38 protein database. a LA, LB, Bl, H2 stand for Luminal A, Lunimal B, Basal-like and HER2-enriched, respectively. b T, B, N indicate event detection in tumor tissue, normal tissue or both. 3.4.1 ProTIE inferred fusion peptides Given the proteomics search results for a specific mixture, a peptide will be further evaluated only if the corresponding fusion is also observed in at least one patient within the mixture. Among the remaining 5579 spectra, 3185 match to peptides coming from immunoglobulin heavy and light chain fusions. These peptides are not considered any further since highly repeated regions shared between those genes can lead to false positives in both fusion detection and proteomics search stages (Boutz et al., 2014; Cheung et al., 2012). Among the peptides remaining, we also discard those associated with a fusion for which no breakpoint crossing peptide is observed (This is due to the difficulty of determining whether such a peptide is a result of a fusion or because of a reading frame shift). At the end of these filtering steps ProTIE returns 807 spectra matching to 169 potential fusion peptides. Among fusions related to these potential fusion peptides, we summarize special events with either high confidence RNA-Seq level evidence or proteomics support in Table 3. The first part of Table 3 shows events with better fusion quality based on reports of deFuse (see Section S7 in Supplementary Material for details). Among these fusions, two stand out with respect to peptide-spectrum matching quality, respectively observed in patients A08G and A15A. The PSMs supporting these two fusions are shown in Supplementary Material. We also provide a list of fusions with multiple translation peptides in the second part of Table 3. More specifically, four of these fusions have matching peptides located on both at the breakpoint and further downstream. Note that although we only detect a single peptide for some additional fusions, the peptide may be supported by multiple spectra as can be seen in Supplementary Table S10. 3.4.2 ProTIE inferred MicroSV peptides As per ProTIE's translated fusion peptide inference approach, for each mixture, we only consider previously unknown peptides that can be unique byproducts of microSVs detected in at least one patient within the mixture. To ensure that these peptides support microSV (duplication or inversion) calls and not SNVs/SNPs, we only consider potential peptides from an interspersed duplication or inversion with a minimum of two amino acids on each side of at least one of the two breakpoints associated with that microSV; for tandem duplications we ensure that at least two amino acids are present in the peptide from both sides of the single breakpoint. Proteomics search of these peptides on 22 patients resulted in 115 spectra potentially supporting microSVs. These spectra support a total of 75 peptides, due to the fact that some of the peptides are supported by more than one spectra. Of these 75 peptides, 7 support microduplications and 68 support microinversions. Incorporating the RNA-Seq results from Section S6.2 in Supplementary Material, we obtain 4 microSV calls with support on all omics levels. The resulting peptides with the highest quality spectra support are summarized in Table 3. Here the number of spectra supporting these peptides is indicated in the ‘Spectra’ column. Similarly, column ‘Breakpoint Support’ indicates the number and type of the breakpoints supported by spectra for each peptide. 4 Discussion 4.1 Genomic MicroSVs detected with MiStrVar Our simulations show that MiStrVar effectively and accurately identifies all microSVs, specifically, insertions, tandem and interspersed duplications in WGS datasets. In particular, MiStrVar has high sensitivity, as well as high precision—especially for inversions. For duplications, even though its precision may not look as impressive, MiStrVar still outperforms all available alternatives. In addition, the precision values for duplications are likely to have been underestimated, since many of calls labelled as ‘false positives’ could, in fact, be true germline differences between the Venter genome and the reference genome. On a very high coverage dataset (120x) from the Venter genome, with no simulated microSVs, duplications detected by MiStrVar have a large overlap with those it detects in the simulation dataset. Elimination of these calls from the simulation dataset increases MiStrVar’s precision to 71% without any additional filtering. MiStrVar is also very accurate in identifying the exact breakpoint loci of the microSVs. This is particularly important for our proteogenomics analysis where we only consider exact peptide matches. If a breakpoint were off even by only one nucleotide there is a high likelihood the predicted peptide would not match. With the exception of Pindel for inversions, which correctly identified 10% fewer exact breakpoints, no tool was even close to correctly identifying as many single-nucleotide resolution microSV breakpoints as MiStrVar. For inversions, the calls where MiStrVar cannot identify the exact breakpoints are often due to the presence of palindromic sequences, resulting in co-optimal breakpoint predictions. More importantly, these cases yield identical peptides and therefore do not affect further analysis results. For duplications, the errors are usually observed in cases where the insertion is into a low complexity region. Again, in many of these cases the resulting peptides would be identical (e.g. consider a duplication that occurs in a polynucleotide tract). Furthermore, even in the worst case, MiStrVar predictions are within 30 bp from the real breakpoints, still much better than the available alternatives. It should also be noted here that unlike other tools, MiStrVar provides not only the duplication breakpoint coordinates but also the precise coordinates of the ‘source’ sequence (i.e. the region of the genome that is duplicated). Through this feature it becomes easier for the user to interpret interspersed as well as tandem duplications. 4.2 Translated aberrations detected with ProTIE The use of a proteogenomic approach enables two novel capabilities that are highly relevant to cancer biology and precision medicine. (i) The ability to hone in on potential clinically actionable mutations that are expressed at the protein level. The vast majority of clinical cancer testing focuses only on DNA-level mutations. A gene mutation–drug association is predicated on the assumption that a mutation will translate to the protein level, however, this is often not the case, as genes that contain a mutation may not be expressed in RNA. Likewise, RNA expression does not always directly translate to protein expression. Thus, having protein level evidence to confirm genomic aberrations provides assurance of the functional presence of a mutation. (ii) The ability to observe the presence of protein spectra from fusion transcripts that are predicted to be ‘out-of-frame’. The vast majority of fusion annotation pipelines filter out fusions that are not in-frame secondary to a widely-held reasoning that these protein products would be misfolded and degraded or subject to non-sense mediated decay. Surprisingly, in this study, high quality spectra were observed from out-of-frame fusion spectra. While additional studies will need to be performed, these data suggest these out-of-frame fusion products are stable enough and at a relative abundance to be detected by Mass Spectrometry. Whether these products are stable by chance or confer a gain-of-function capability is yet to be seen, but these data at minimum suggest that out-of-frame fusions should not be eliminated from consideration when searching for oncogenic candidates. 4.2.1 Translated gene fusions To better understand the properties of genes with translation evidence for fusions, we analyzed these genes through Ingenuity Pathway Analysis (https://www.ingenuity.com). Note that we used all fusion genes detected by deFuse as the background genes in the analysis. The top 3 categories for gene function enrichment are: Cancer (137 genes), Organismal Injury and Abnormalities (150 genes), and Respiratory Disease (39 genes). All three sets of genes come with adjusted P-value around 0.0035 (via Benjamini-Hochberg procedure). Given that fusions are a somatic cancer-specific event, enrichment of cancer related genes provides a validation of our approach. See Section 8.1 in Supplementary Material for more details on functional interpretations of these genes. 4.2.2 Translated MicroSVs Most of the microinversions with proteomics support are in the 400 bp to 1 kb length range. Microinversions shorter than 100 bp are much less common in exonic regions. However in intronic and UTR regions, microinversions with the best genomic support (in terms of both read coverage and sequence similarity) are predominately of length less than 100 bp; See Supplementary Table S8, for a summary of intronic and UTR microinversions. We also observed that shorter microinversions tend to be germline events while longer events tend to be somatic. All of the microduplication calls with proteomic support (all of these—with the exception of the one in NUPL2—satisfy our more stringent FDR criterion) were predicted to be germline events. Indeed nearly all of these events have corresponding dbSNP entries. The call in FAM134A appears to be a novel germline event. The longest duplication in RPL14 also appears to be novel (rs369485042 includes variants with up to 5 alanines). Since we observed relatively few translated microduplications, it is unlikely that this type of microSV plays a major role in breast cancer through translation to aberrant proteins. However we predicted many high confidence microduplications in exonic regions, some with RNA-Seq support, in addition to many in UTRs and introns (Supplementary Table S9). It is possible that such exonic duplications lead to truncated or rapidly degraded proteins and the duplications in UTRs and intronic regions may affect gene expression and splicing. Our analysis resulted in 4 microSV calls with support on all omics levels. This includes 3 microduplications (within genes FAM134A, NUPL2 and RPL14) and 1 microinversions (within PLIN4). The microduplications in FAM134A and RPL14 (that with 27 bp) appear to be novel events. Additionally, there are several events with both genomic and proteomic support, which possibly lack RNA-Seq support due to low expression of the associated gene or the lack of RNA-Seq data for the sample. See Section S8.2 in Supplementary Material for more details on functional interpretations of these genes. 5 Conclusion Integration of genomic, transcriptomic and proteomic data provides a comprehensive view of the patient’s molecular profile. TCGA/CPTAC now offers matching genomic, transcriptomic and proteomic data across several cancer types, with a focus on the impact of Single Amino Acid Variants (SAAVs) and SNVs on protein abundances. In order to complement TCGA/CPTAC study and better establish the relationship between genomic, transcriptomic and proteomic aberrations and the cancer phenotype, we introduce MiStrVar, the first tool to capture multiple types of microSVs in WGS datasets. MiStrVar, and deFuse, a fusion detection tool we developed earlier, form key components of ProTIE, a computational framework we introduce here to automatically and jointly identify translated fusions and microSVs in matching omics datasets. Concurrently, ProTIE also incorporates RNA-Seq evidence to validate expressed microSVs. Based on both simulation and cell line data, we demonstrate that MiStrVar significantly outperforms available tools for SV detection. Our results on the TCGA/CPTAC breast cancer datasets also suggest the possibility of automatic calibration for some entries in dbSNP, which we believe are misannotated. It is interesting to note that the majority of the translated microSVs and fusions we observed in the breast cancer samples were private events; this prompts a larger and more detailed integrated study of all three omics data types through the use of ProTIE for a comprehensive molecular profiling of breast cancer subtypes. Acknowledgements The WGS and RNA-Seq datasets were retrieved from the Cancer Genomics Hub (now Genomic Data Commons); the proteomics data was released by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium. Clinical information was obtained through the database of Genotypes and Phenotypes (dbGaP). Funding This paper was partially funded by the US National Institutes of Health Grant GM108348, the NSERC Discovery Frontiers Grant, “Cancer Genome Collaboratory” (to S.C.S.), NSERC Discovery Grant (to F.H.), and NIGMS/NIH Grant No R01 GM103725-04 (to S.L.). Additional funding is provided by the Indiana University Grand Challenges Program, Precision Health Initiative (to S.C.S. and H.T.). Conflict of Interest: none declared. References Barretina J. et al. ( 2012 ) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity . Nature , 483 , 603 – 607 . Google Scholar CrossRef Search ADS PubMed Blum A. et al. ( 1994 ) Linear approximation of shortest superstrings . J. ACM , 41 , 630 – 647 . Google Scholar CrossRef Search ADS Boutz D.R. et al. ( 2014 ) Proteomic identification of monoclonal antibodies from serum . Anal. Chem ., 86 , 4758 – 4766 . Google Scholar CrossRef Search ADS PubMed Castellana N.E. et al. ( 2014 ) An automated proteogenomic method uses mass spectrometry to reveal novel genes in zea mays . Mol. Cell. Proteomics , 13 , 157 – 167 . Google Scholar CrossRef Search ADS PubMed Cesnik A.J. et al. ( 2015 ) Human proteomic variation revealed by combining RNA-seq proteogenomics and global Post-Translational modification (G-PTM) search strategy . J. Proteome Res ., 15 , 800 – 808 . Google Scholar CrossRef Search ADS Cheung W.C. et al. ( 2012 ) A proteomics approach for the identification and cloning of monoclonal antibodies from serum . Nat. Biotechnol ., 30 , 447 – 452 . Google Scholar CrossRef Search ADS PubMed Elias J.E. , Gygi S.P. ( 2007 ) Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry . Nat. Methods , 4 , 207 – 214 . Google Scholar CrossRef Search ADS PubMed Ellis M.J. et al. ( 2013 ) Connecting genomic alterations to cancer biology with proteomics: the NCI clinical proteomic tumor analysis consortium . Cancer Discov ., 3 , 1108 – 1112 . Google Scholar CrossRef Search ADS PubMed Ewald I.P. et al. ( 2009 ) Genomic rearrangements in BRCA1 and BRCA2: a literature review . Genet. Mol. Biol ., 32 , 437 – 446 . Google Scholar CrossRef Search ADS PubMed Fan X. et al. ( 2014 ) BreakDancer – identification of genomic structural variation from paired-end read mapping . Curr. Protoc. Bioinf ., 45 , 15.6.1 – 15.6.11 . Google Scholar CrossRef Search ADS Fernandez-Luna J.L. ( 2000 ) Bcr-Abl and inhibition of apoptosis in chronic myelogenous leukemia cells . Apoptosis Int. J. Program. Cell Death , 5 , 315 – 318 . Google Scholar CrossRef Search ADS Frenkel-Morgenstern M. et al. ( 2012 ) Chimeras taking shape: potential functions of proteins encoded by chimeric RNA transcripts . Genome Res ., 22 , 1231 – 1242 . Google Scholar CrossRef Search ADS PubMed Gallant J. et al. ( 1980 ) On finding minimal length superstrings . J. Comput. Syst. Sci ., 20 , 50 – 58 . Google Scholar CrossRef Search ADS Gillette M.A. , Carr S.A. ( 2013 ) Quantitative analysis of peptides and proteins in biomedicine by targeted mass spectrometry . Nat. Methods , 10 , 28 – 34 . Google Scholar CrossRef Search ADS PubMed Hach F. et al. ( 2010 ) mrsFAST: a cache-oblivious algorithm for short-read mapping . Nat. Methods , 7 , 576 – 577 . Google Scholar CrossRef Search ADS PubMed Hach F. et al. ( 2014 ) mrsFAST-ultra: a compact, SNP-aware mapper for high performance sequencing applications . Nucleic Acids Res ., 42 , W494 – 500 . Google Scholar CrossRef Search ADS PubMed Hemmer S. et al. ( 2001 ) Deletion of 11q23 and cyclin D1 overexpression are frequent aberrations in parathyroid adenomas . Am. J. Pathol ., 158 , 1355 – 1362 . Google Scholar CrossRef Search ADS PubMed Hormozdiari F. et al. ( 2009 ) Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes . Genome Res ., 19 , 1270 – 1278 . Google Scholar CrossRef Search ADS PubMed Kim S. , Pevzner P.A. ( 2014 ) MS-GF+ makes progress towards a universal database search tool for proteomics . Nat. Commun ., 5 , 5277+ . Google Scholar CrossRef Search ADS PubMed Koboldt D.C. et al. ( 2012 ) Comprehensive molecular portraits of human breast tumours . Nature , 490 , 61 – 70 . Google Scholar CrossRef Search ADS PubMed McPherson A. et al. ( 2011a ) Comrad: detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data . Bioinformatics , 27 , 1481 – 1488 . Google Scholar CrossRef Search ADS McPherson A. et al. ( 2011b ) defuse: an algorithm for gene fusion discovery in tumor RNA-seq data . PLoS Comput Biol ., 7 , e1001138 . Google Scholar CrossRef Search ADS McPherson A. et al. ( 2012 ) nFuse: discovery of complex genomic rearrangements in cancer using high-throughput sequencing . Genome Res ., 22 , 2250 – 2261 . Google Scholar CrossRef Search ADS PubMed Mertins P. et al. ( 2016 ) Proteogenomics connects somatic mutations to signalling in breast cancer . Nature , 534 , 55 – 62 . Google Scholar CrossRef Search ADS PubMed Mitelman F. et al. ( 2007 ) The impact of translocations and gene fusions on cancer causation . Nat. Rev. Cancer , 7 , 233 – 245 . Google Scholar CrossRef Search ADS PubMed Mo F. et al. ( 2008 ) A compatible exon-exon junction database for the identification of exon skipping events using tandem mass spectrum data . BMC Bioinformatics , 9 , 537+ . Google Scholar CrossRef Search ADS PubMed Mustafa M.G. et al. ( 2013 ) Biomarker discovery for early detection of hepatocellular carcinoma in hepatitis c infected patients . Mol. Cell. Proteomics , 12 , 3640 – 3652 . Google Scholar CrossRef Search ADS PubMed Nakao M. et al. ( 1996 ) Internal tandem duplication of the flt3 gene found in acute myeloid leukemia . Leukemia , 10 , 1911 – 1918 . Google Scholar PubMed Nesvizhskii A.I. ( 2014 ) Proteogenomics: concepts, applications and computational strategies . Nat. Methods , 11 , 1114 – 1125 . Google Scholar CrossRef Search ADS PubMed Ning K. , Nesvizhskii A. ( 2010 ) The utility of mass spectrometry-based proteomic data for validation of novel alternative splice forms reconstructed from RNA-seq data: a preliminary assessment . BMC Bioinformatics , 11 , S14+ . Google Scholar CrossRef Search ADS PubMed Ning K. et al. ( 2012 ) Comparative analysis of different Label-Free mass spectrometry based protein abundance estimates and their correlation with RNA-seq gene expression data . J. Proteome Res ., 11 , 2261 – 2271 . Google Scholar CrossRef Search ADS PubMed Quinlan A.R. et al. ( 2010 ) Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome . Genome Res ., 20 , 623 – 635 . Google Scholar CrossRef Search ADS PubMed Rausch T. et al. ( 2012 ) DELLY: structural variant discovery by integrated paired-end and split-read analysis . Bioinformatics , 28 , i333 – i339 . Google Scholar CrossRef Search ADS PubMed Reimand J. et al. ( 2013 ) The mutational landscape of phosphorylation signaling in cancer . Sci. Rep ., 3 , 2651 . Google Scholar CrossRef Search ADS PubMed Schöniger M. , Waterman M.S. ( 1992 ) A local algorithm for DNA sequence alignment with inversions . Bull. Math. Biol ., 54 , 521 – 536 . Google Scholar CrossRef Search ADS PubMed Schroder J. et al. ( 2014 ) Socrates: identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads . Bioinformatics , 30 , 1064 – 1072 . Google Scholar CrossRef Search ADS PubMed Sheynkman G.M. et al. ( 2013 ) Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-seq . Mol. Cell. Proteomics MCP , 12 , 2341 – 2353 . Google Scholar CrossRef Search ADS Sindi S.S. et al. ( 2012 ) An integrative probabilistic model for identification of structural variation in sequencing data . Genome Biol ., 13 , R22 . Google Scholar CrossRef Search ADS PubMed Swanson L. et al. ( 2013 ) Barnacle: detecting and characterizing tandem duplications and fusions in transcriptome assemblies . BMC Genomics , 14 , 550 . Google Scholar CrossRef Search ADS PubMed Whiteaker J.R. et al. ( 2014 ) CPTAC assay portal: a repository of targeted proteomic assays . Nat. Methods , 11 , 703 – 704 . Google Scholar CrossRef Search ADS PubMed Woo S. et al. ( 2014 ) Proteogenomic database construction driven from large scale RNA-seq data . J. Proteome Res ., 13 , 21 – 28 . Google Scholar CrossRef Search ADS PubMed Wulfkuhle J.D. et al. ( 2003 ) Proteomic applications for the early detection of cancer . Nat. Rev. Cancer , 3 , 267 – 275 . Google Scholar CrossRef Search ADS PubMed Ye K. et al. ( 2009 ) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads . Bioinformatics , 25 , 2865 – 2871 . Google Scholar CrossRef Search ADS PubMed Yorukoglu D. et al. ( 2012 ) Dissect: detection and characterization of novel structural alterations in transcribed sequences . Bioinformatics , 28 , i179 – i187 . Google Scholar CrossRef Search ADS PubMed Zhang B. et al. ( 2014 ) Proteogenomic characterization of human colon and rectal cancer . Nature , 513 , 382 – 387 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Computational identification of micro-structural variations and their proteogenomic consequences in cancer

Loading next page...
 
/lp/ou_press/computational-identification-of-micro-structural-variations-and-their-XR4waEkMM0
Publisher
Oxford University Press
Copyright
© The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/btx807
Publisher site
See Article on Publisher Site

Abstract

Abstract Motivation Rapid advancement in high throughput genome and transcriptome sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. We introduce a computational framework, ProTIE, to integratively analyze all three types of omics data for a complete molecular profile of a tissue sample. Our framework features MiStrVar, a novel algorithmic method to identify micro structural variants (microSVs) on genomic HTS data. Coupled with deFuse, a popular gene fusion detection method we developed earlier, MiStrVar can accurately profile structurally aberrant transcripts in tumors. Given the breakpoints obtained by MiStrVar and deFuse, our framework can then identify all relevant peptides that span the breakpoint junctions and match them with unique proteomic signatures. Observing structural aberrations in all three types of omics data validates their presence in the tumor samples. Results We have applied our framework to all The Cancer Genome Atlas (TCGA) breast cancer Whole Genome Sequencing (WGS) and/or RNA-Seq datasets, spanning all four major subtypes, for which proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) have been released. A recent study on this dataset focusing on SNVs has reported many that lead to novel peptides. Complementing and significantly broadening this study, we detected 244 novel peptides from 432 candidate genomic or transcriptomic sequence aberrations. Many of the fusions and microSVs we discovered have not been reported in the literature. Interestingly, the vast majority of these translated aberrations, fusions in particular, were private, demonstrating the extensive inter-genomic heterogeneity present in breast cancer. Many of these aberrations also have matching out-of-frame downstream peptides, potentially indicating novel protein sequence and structure. Availability and implementation MiStrVar is available for download at https://bitbucket.org/compbio/mistrvar, and ProTIE is available at https://bitbucket.org/compbio/protie. Contact cenksahi@indiana.edu Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Rapid advances in high throughput sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. The availability of three types of fundamental omics data provides complementary views on the global molecular profile of a tissue under normal and disease conditions (Nesvizhskii, 2014). Recently developed computational methods have aimed to integrate two or three of these data types to address important biological questions, such as (i) correlating the abundances of transcription and translation products (Ning and Nesvizhskii, 2010); (ii) detecting peptides associated with un-annotated genes or splice variants [in mouse (Ning et al., 2012), C.elegans (Woo et al., 2014), zebrafish (Castellana et al., 2014) and human samples (Mo et al., 2008; Sheynkman et al., 2013)]; (iii) characterizing chimeric proteins by searching unidentified tandem mass spectrometry(MS/MS) data through the use of conventional peptide identification algorithms applied to a pre-assembled database of ‘known’ chimeric transcripts from the literature (Frenkel-Morgenstern et al., 2012). In the past year or so, several studies have aimed to identify novel peptides matching patient specific transcripts derived from RNA-Seq data. For example, Zhang et al. (2014) focused on identifying novel peptides involving Single Amino Acid Variants (SAAVs) in colorectal cancer. A later study by Cesnik et al. (2015) also considered novel splice junctions and (a limited set of user defined) Post-Translational Modifications (PTM) in a number of cell lines. Because of the importance of phosphorylation in cellular activity and cancer treatment (Reimand et al., 2013), this was further expanded to identify novel phosphorylation sites by Mertins et al. (2016), on the CPTAC breast cancer dataset, which is the subject of our paper. However, none of these studies aimed to perform integrative analysis of transcribed and translated genomic structural alterations such as fusions, inversions and duplications in tumor tissues. Genomic structural variants (SVs) significantly alter the sequence composition of associated genomic regions. Major SV types include (segmental) deletions, duplications (tandem or interspersed), inversions, translocations and transpositions. SVs observed in exonic regions may lead to aberrant protein products. Many such SVs have been associated with disease conditions and especially cancer. Common SVs associated with cancer include deletions in tumor suppressors such as BRCA1/2 (Ewald et al., 2009) in breast cancer, duplications in FMS-like tyrosine kinase (FLT3) gene in acute myeloid leukemia (AML) (Nakao et al., 1996) and an inversion causing cyclin D1 overexpression in parathyroid neoplasms (Hemmer et al., 2001). A gene fusion occurs when exonic regions of two (or more) distinct genes are concatenated to form a new chimeric gene, as a result of a large scale SV. Gene fusions can disrupt the normal function of one or both partners, for example by up-regulating an oncogene (e.g. TMPRSS2-ERG) or generating a novel or truncated protein [e.g. BCR-ABL1 (Fernandez-Luna, 2000)]. They have been demonstrated to play important roles in the development of haematological disorders, childhood sarcomas and in a variety of solid tumors. For example, ETS gene fusions are present in 80% of malignancies of the male genital organs, and as a result these fusions alone are associated with 16% of all cancer morbidity (Mitelman et al., 2007). There are a number of available computational tools for detecting structural variants, each based on one or more of the following general strategies. (i) Detection of variants using discordantly mapping paired end reads i.e. read mappings that either invert one or both of the read ends, or change the expected distance between the read ends. Tools using this approach include Breakdancer (Fan et al., 2014) and VariationHunter (Hormozdiari et al., 2009). (ii) Detection of variants using split-read mappings, which partition a single end read into two and map them independently to two distant loci, or soft-clipped read mappings, which map only a prefix/suffix of a read. One example employing this approach is Socrates (Schroder et al., 2014). (iii) Detection of variants using an assembly based approach. These tools map assembled contigs for improved precision. Examples include Barnacle (Swanson et al., 2013) and Dissect (Yorukoglu et al., 2012) (both of which are RNA-Seq analysis tools, but can also analyze genomic data). Additional tools employing a combination of these strategies include Pindel (Ye et al., 2009), Delly (Rausch et al., 2012), GASVPro (Sindi et al., 2012) and HYDRA (Quinlan et al., 2010). Our focus in this paper is microSVs (micro structural variants), i.e. events involving genomic sequences shorter than 1 kb, especially in exonic regions, since they are more likely to result in a translated protein. Available tools for SV discovery typically fail to capture microSVs, or do so while producing many false positives, thus the problem of robustly discovering microSVs remains open. Our first contribution in this paper is a novel algorithmic tool named MiStrVar (Micro Structural Variant caller), which identifies microSV breakpoints at single-nucleotide resolution by (1) identifying each one-end anchor (OEA), i.e. a paired-end read where one end maps to the reference genome and the other end cannot be mapped, (2) clustering OEAs based on (i) mapping loci similarity and (ii) the possibility of assembling the unmappable ends into a single contig and (3) aligning the contig formed by unmappable ends with the reference genome—in the vicinity of the mapped ends—simultaneously detecting putative inversions, duplications, indels or single nucleotide variants (SNVs) through a unified dynamic programming formulation. MiStrVar's approach has several advantages over existing SV discovery tools. First, MiStrVar analyzes many more reads than those considered by the tools using only split-reads or soft clipped reads. Any mapped read which has a hamming distance to the reference greater than four (as a default parameter, which can be user modified) is considered for assembly. This allows for the discovery of inversions or duplications as short as 5 bp and inversions with palindromic sequences, improving sensitivity. Second, this approach is much less time consuming than assembly based methods, since only the subset of unmappable reads are assembled rather than the entire genome. Finally, MiStrVar uses a unified dynamic programming formulation, superior to tools that identify each type of variant individually, especially because these tools misinterpret certain variants, such as inversions, as a combination of other variants. See Supplementary Figure S1 for a detailed illustration. Both fusions and microSVs may be independently observed in genomic, transcriptomic and proteomic data; however, the most impactful aberrations, especially in the context of cancer, are the ones that can be observed in all levels in the same tissue simultaneously. In such cases, integrative analysis of these three omics data types can provide independent evidence for the presence and heritability of aberrations. For example, trans-splicing events, which lead to chimeric transcripts, can only be observed in transcriptomic (but not in genomic) data, and thus can be distinguished from fusion events with genomic breakpoints through simultaneous analysis of genomic and transcriptomic data acquired from the same sample. No available large-scale study has been conducted on the detection and validation of aberrant proteins and their genomic and transcriptomic origins. As mentioned earlier, expressed aberrant genome variants can have considerable functional influence on proteins, and as such, they may affect molecular pathway activity or pathogenesis in disease, especially in cancer. Detection of aberrant protein variants provides new insights into diagnostic marker identification and drug development (recurrent protein aberrations can imply potential drug targets) and can help develop novel strategies for therapeutic intervention. Proteomic technologies have enabled high throughput, sensitive and deep protein analysis for complex disease-associated samples, aiming at discovering potential disease protein biomarkers (Gillette and Carr, 2013; Mustafa et al., 2013; Wulfkuhle et al., 2003), including low-abundant proteins or protein isoforms, or variants. Moreover, proteomic analyses can provide complementary information to transcriptomic and genomic analysis, as proteomic analyses are carried out by completely different technologies (i.e. mass spectrometry or MS) from DNA sequencing. Furthermore, advancement in MS instrumentation has enabled proteomic analysis to achieve sensitivity on par with RNA-seq in detecting low abundant events of gene expression in complex samples (Zhang et al., 2014). Therefore, integrating transcriptomic and proteomic data can improve both the sensitivity and confidence in characterizing expressed aberrant variants in complex samples such as tumor tissues. Our second contribution in this paper is MiStrVar (ProTeogenomics Integration Engine), the first computational framework that integrates high throughput genomic, transcriptomic and proteomic data to identify translated structural aberrations, specifically gene fusions and microSVs, in protein-coding genes. In particular, ProTIE takes sequence aberrations from WGS and RNA-Seq data as its input and validates them on the mass-spectrometry based proteomics data, while ensuring that each such proteomic signature is unique to the matching sequence aberration. By integrating multiple data sources simultaneously, ProTIE is able to provide a strongly supported set of candidate aberrations from the highly sensitive results of MiStrVar and deFuse. This is particularly helpful for selecting target events or genes for clinical studies. Results. We ran our computational framework to detect all translated gene fusions in RNA-Seq (low coverage 50 bp paired-end) data in the complete set of 105 TCGA (The Cancer Genome Atlas) breast cancer samples for which CPTAC (Clinical Proteomic Tumor Analysis Consortium) mass spectrometry data have been released (The primary goal of CPTAC is to characterize protein level expression differences for SNVs/SAAVs. Our focus here is complementary to the goals of CPTAC.). These 105 samples include all four of the most common intrinsic subtypes of breast cancer (Koboldt et al., 2012). Among them, 22 samples also have matching WGS data, on which we used our framework to identify exonic microSVs. This resulted in 206 255 fusions and 69 876 microSVs across the 105 samples. 2215 of these microSVs are also supported by transcriptomic (RNA-Seq) evidence. All breakpoints from the predicted fusions and microSVs were then analyzed for identifying supporting peptides from mass spectrometry data. This yielded 244 aberrant peptides from 432 possible aberrations. More specifically, 169 novel peptides originate from 295 fusion candidates (many of the fusions are recurrent and thus produce the same novel fusion peptide) and 75 peptides originate from 137 potential microSVs; this is of particular note since many of the genomic microSVs are recurrent, yet the ones that are translated are mostly private. Note that a sequence aberration may give rise to more than one novel peptide in case it results in a frameshift. See Table 1 for a summary of results [One interesting observation is that among the microSVs discovered, only 4 (1 microinversions and 3 tandem microduplications) have supporting evidence at all omics levels. This implies that the transcriptomic support for the remaining translated microSVs are too low to be detected, partially due to low abundance of RNA-Seq data made available by TCGA on the breast cancer samples. This also suggests that with deeper coverage RNA-Seq data, ProTIE is likely to detect additional translated gene fusions.]. Table 1. Distribution of 244 detected, high confidence, aberrant peptides over four breast cancer subtypes, across 105 patients Cancer # Patients # Peptides Subtype Total Aberrant Fusion Inversion Duplication Basal-Like 25 22 50 57a 2 HER2-Enriched 18 17 41 3 3 Luminal A 29 26 49 0 2 Luminal B 33 31 78 8 3 Cancer # Patients # Peptides Subtype Total Aberrant Fusion Inversion Duplication Basal-Like 25 22 50 57a 2 HER2-Enriched 18 17 41 3 3 Luminal A 29 26 49 0 2 Luminal B 33 31 78 8 3 # Aberrant Patients indicates the number of patients with either detected fusion peptides or microSV peptides in that subtype. a The high number of microinversion peptides in Basal-Like breast cancer can be attributed to two patients, A0CM, A0J6, whose genomes had gone through substantial reorganization. Table 1. Distribution of 244 detected, high confidence, aberrant peptides over four breast cancer subtypes, across 105 patients Cancer # Patients # Peptides Subtype Total Aberrant Fusion Inversion Duplication Basal-Like 25 22 50 57a 2 HER2-Enriched 18 17 41 3 3 Luminal A 29 26 49 0 2 Luminal B 33 31 78 8 3 Cancer # Patients # Peptides Subtype Total Aberrant Fusion Inversion Duplication Basal-Like 25 22 50 57a 2 HER2-Enriched 18 17 41 3 3 Luminal A 29 26 49 0 2 Luminal B 33 31 78 8 3 # Aberrant Patients indicates the number of patients with either detected fusion peptides or microSV peptides in that subtype. a The high number of microinversion peptides in Basal-Like breast cancer can be attributed to two patients, A0CM, A0J6, whose genomes had gone through substantial reorganization. 2 Materials and methods Our computational framework (see Fig. 1), is comprised of a number of algorithmic tools that we developed for detecting transcriptomic and genomic aberrations, and searching for expressed protein variants resulting from these aberrant sequences. Given a set of genomic (WGS), transcriptomic (RNA-seq) and proteomic (Mass Spectrometry) data, each collected from the tumor tissue of a patient, our pipeline detects translated sequence aberrations in three major steps. Fig. 1. View largeDownload slide Overview of the computational pipeline for identifying translated sequence aberrations Fig. 1. View largeDownload slide Overview of the computational pipeline for identifying translated sequence aberrations Each whole genome sequencing dataset is analyzed with MiStrVar, the microSV discovery tool we introduce in this paper, to identify microSVs occurring in protein-coding genes. (Note that our computational framework provides the option of validating genomic microSVs at the transcriptomic level by identifying RNA-Seq reads associated with each microSV breakpoint.) Each transcriptomic dataset is analyzed by our in-house fusion detection method deFuse (McPherson et al., 2011b), which reports potential fusion events between two protein coding genes, and the fused transcript sequences spanning the fusion breakpoints. [Note that our computational framework enables the use of our integrative fusion detection methods nFuse (McPherson et al., 2012)/Comrad(McPherson et al., 2011a) for corroborating potential fusions observable in WGS and RNA-Seq data.] All omics data is finally integratively analyzed through ProTIE, our novel ProTeogenomics Integration Engine as follows. Each mass spectrometry dataset is searched against a protein sequence database consisting of all human proteins from Ensembl human protein database GRCh37.70 (Ensembl, ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.70.pep.all.fa.gz), along with a database of proteins generated by fused transcripts and microSVs, by the use of MS-GF+ search engine (Kim and Pevzner, 2014). Aberrant peptides identified by the procedure with high confidence [e.g. at 1% false discovery rate estimated by using the target-decoy approach (Elias and Gygi, 2007)] are reported, provided they are also detected in the genomic/transcriptomic dataset from the same tumor tissue sample. 2.1 Detection of fusions and microSVs in WGS and RNA-seq data To detect fusions in RNA-Seq data, we applied deFuse (McPherson et al., 2011b) which predicts fusion transcripts based on analyzing discordantly mapped read-pairs and one-end anchors. To detect microSVs in WGS data, we applied our novel micro-structural variant caller, MiStrVar, which works in three major steps (See Supplementary Fig. S2 for an overview): In step (A), MiStrVar identifies all one-end anchors (OEA) in the read data: an OEA is a paired-end read for which only one end maps to the reference genome within a user defined error threshold. Once all reads are (multiply) mapped to a reference genome using mrsFAST-ultra (Hach et al., 2010, 2014), and all OEAs are extracted, the mapped ends of OEAs are clustered based on the mapping loci. MiStrVar provides the user two options for cluster identification, each satisfying one of the following distinct goals. For applications where sensitivity is of high priority, MiStrVar employs a sweeping algorithm for OEA mapping loci [introduced for VariationHunter (Hormozdiari et al., 2009)]. For applications where running time is of high priority, MiStrVar employs an iterative greedy strategy. In step (B), for each OEA cluster identified in step (A), MiStrVar assembles the unmapped end of the reads to form contigs (of length <400 bp) by aiming to solve the NP-hard (Gallant et al., 1980) dominant superstring (DSS) problem. MiStrVar employs a greedy strategy similar to that used to compute a constant factor approximation to the shortest superstring problem (Blum et al., 1994). In step (C), each contig associated to an OEA cluster is aligned to a region (of length several kilobases long) surrounding the OEA mapping loci, first through a simple local-to-global sequence alignment algorithm, that does not consider any structural alteration. (The reverse complement of the contig is also aligned to the same region.) The start and end position of this first, crude alignment is used to determine the approximate locus and length of the potential microSV implied by the contig. The exact microSV breakpoints are obtained in the next step through a more sophisticated alignment that considers structural alterations, which is applied to the portion of the reference genome restricted by the first alignment. The dynamic programming formulation for this alignment is an extension of the Schöniger-Waterman algorithm (Schöniger and Waterman, 1992) which was designed to capture inversions in the alignment. Specifically, our extensions enable the user to (i) discover the single best optimal event, rather than an arbitrary number of events, (ii) handle gaps extending over breakpoints (in cases of missing contig sequence) and (iii) simultaneously predict duplications, insertions, deletions and SNVs in addition to inversions. Further details on deFuse and MiStrVar methodology can be found in the Supplementary Text. 2.2 Identification of translated and transcribed sequence aberrations ProTIE provides the ability to detect translated aberrations by searching mass spectra against an aberrant peptide database. More specifically, given transcriptomic breakpoints pointing to fusions or microSVs, ProTIE identifies respective aberrant peptides from proteomic data by first generating a peptide database, and then identifying aberrant peptides based on mass spectrometry search results provided by MS-GF+ (Kim and Pevzner, 2014). (See Section S2.3 in Supplementary Material for details of database construction and parameters used in proteomics search.) Our pipeline also provides the user with the additional ability to jointly analyze matching WGS and RNA-Seq data for identifying transcribed genomic microSVs. Given a set of genomic microSVs, along with their breakpoints detected by MiStrVar, our pipeline generates corresponding aberrant transcripts. It then maps RNA-Seq reads to the collection of these aberrant transcripts. After filtering reads that can be mapped to a known isoform or potential novel spliceform, the remaining mappings provide evidence for aberrations in transcribed regions. See Section S2.5 in Supplementary Material for details about mappings/read filtration steps. 3 Experimental results CPTAC Breast Cancer Dataset. Clinical Proteomic Tumor Analysis Consortium (CPTAC, http://proteomics.cancer.gov) (Ellis et al., 2013; Whiteaker et al., 2014) aims to provide proteogenomic characterization of specific cancers based on joint analysis of proteomic, transcriptomic and genomic data acquired from the same group of cancer patients. CPTAC currently focuses on the relationship between protein abundance, somatic mutations and copy number alterations occurring in cancer-related genes (Zhang et al., 2014). Information about aberrations hidden in unidentified spectra and unmapped sequenced reads have not been revealed in the current CPTAC analysis framework; this happens to be the main focus of our paper. At the time of submission of this paper, proteomics data for tumor samples from three cancer types had been released by CPTAC: colorectal cancer, breast cancer and ovarian cancer. In addition, The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) has released RNA-Seq and WGS data on both normal and tumor tissues from the same group of patients through Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu/). RNA-Seq data for breast and ovarian cancer patients are in the form of paired-end reads, however, for most of colon and rectal cancer samples only single-end reads were collected. Because we rely on paired-end mappings for detecting fusions and microSVs and since the RNA-Seq data from normal tissues from the ovarian cancer patients had not been released at the time of the submission, our focus in this paper is the breast cancer dataset. Details about CPTAC samples used in our analysis can be found in Supplementary Tables S5, S6. Breast Cancer Cell Line. In addition to the CPTAC and TCGA datasets, we used the HCC1143 ductal breast cancer cell line (triple negative breast cancer cell line from ATCC) for which we obtained matching tumor/normal Illumina HiSeq WGS, RNA-Seq and mass spectrometry data. The matching normal cell line, HCC1143-BL, is a B lymphoblastoid cell line initiated from peripheral blood lymphocytes from the same patient as HCC1143 by transformation with Epstein-Barr virus (EBV). The WGS data was obtained from NCI Genomic Data Commons (https://gdc.cancer.gov/), originally sequenced as part of the Cancer Cell Line Encyclopedia Project (Barretina et al., 2012). We used this cell line as preliminary validation for our approach before starting full scale analysis. 3.1 Gene fusion detection by deFuse Gene Fusions in the HCC1143 Breast Cancer Cell Line. We have run our fusion detection method, deFuse to detect gene fusions on RNA-Seq data from HCC1143 cell line. There are 81.73 M paired end reads of 101 bp length. Based on concordant mapping results, the average fragment length and standard deviation were 264.2 and 86.59 bp, respectively. deFuse predicts 1325 fusions from this dataset, out of which 74 are considered high confidence predictions based on the filtering criteria employed by deFuse (McPherson et al., 2011b). Gene fusions in breast cancer patient RNA-seq data. Each RNA-Seq dataset from the CPTAC breast cancer patient cohort was, on average, comprised of 76 M paired-end Illumina reads with length 50 bp. Based on transcriptome mapping results, the average fragment length and standard deviation were 190.3 and 65.47 bp, respectively. In total deFuse detected 206 255 fusions; on average, this amounts to 1964 predictions per sample. However, many of these predictions had low deFuse scores, either due to low sequence similarity or limited read support, and thus were not good fusion candidates. Only 3907 of these predictions (roughly 2% of all predictions) in total are considered to be high confidence calls by deFuse. 3.2 MicroSV detection by MiStrVar MicroSV predictions were based on three WGS datasets. The first is a simulation dataset based on the Venter genome developed with the goal of assessing sensitivity and precision of our methods with respect to available tools for SV discovery. These results are summarized in Figure 2; more details can be found in Supplementary Material. The second dataset consists of WGS data from the HCC1143 cell line (both tumor and normal), which was used to assess our methods’ accuracy on a homogeneous tumor sample. The third dataset is comprised of 22 TCGA/CPTAC breast cancer WGS data, which were used for full scale evaluation of our methods. Fig. 2. View largeDownload slide Comparison of precision and recall of MiStrVar against other SV discovery tools Fig. 2. View largeDownload slide Comparison of precision and recall of MiStrVar against other SV discovery tools 3.3 MicroSVs in the HCC1143 breast cancer cell line Before running MiStrVar on the TCGA/CPTAC breast cancer samples, we applied it to the HCC1143 breast cancer cell line. We identified 116 microinversions and 197 microduplications (Supplementary Table S3.2) on this sample. Among these, 11 inversions and 12 duplications have both high read coverage and low mapping multiplicity. We focus only on these microSVs for the remainder of the discussion. Details on the 11 inversion candidates can be found in Table 2. All 11 inversions appear in both normal and matching tumor samples indicating that they are germline events. We experimentally validated these inversions using Sanger sequencing. The primers were constructed by using the inverted sequence flanked by 200–300 bp from the reference genome. Five of the predicted inversions show a clear sequence match between the amplicon and predicted inversion, validating these inversion candidates. A representative example is given for the inversion in SLC3A1 in Supplementary Table S8 and the complete set of chromatograms is included in the appendix. We note here that all the high confidence microinversions, except for the one found in UBP1, have an associated multiple nucleotide polymorphism (MNP) entry in dbSNP. This includes the microinversion in BOK, which was not validated by Sanger sequencing. Table 2. Sanger sequencing validation of top 11 microinversion and top 12 exonic microduplication (tandem or interspersed) candidates in the breast cancer cell line HCC1143 WGS Sup. Val. Len. Gene Ra Ib Tc Nd Tc Nd RSe dbSNP Inversion Calls 27 SLC3A1 Ut 100.0 66 62 – rs71416108 26 TNIK In 100.0 96 76 – rs781523247 29 CTTNBP2 In 100.0 76 81 – rs386717124 24 PFKP In 98.8 62 51 – rs386740061 32 NLRP4 In 98.0 80 103 – rs386811126 29 ZNF571 In 99.3 26 35 ① ① – rs386809055 23 OSBP2 In 100.0 49 24 ① ① – rs67147751 18 GNG12 In 98.7 37 43 ① – rs386632129 29 LINGO2 In 100.0 18 41 ① – rs386733960 30 UBP1 In 100.0 10 40 – – 12 BOK In 98.6 77 117 – rs386657165 Duplication Calls 6 GTPBP6 Ex 100.0 28 33 0 – 34 FAM20C Ex 98.0 12 22 0 rs774848096 45 KIAA1009 Ex 98.6 34 0 7 rs539790644 9 BAIAP2L2 Ex 100.0 48 33 0 rs142739979 27 RBMXL3 Ex 98.7 37 41 NE rs782097222 9 GPRIN2 Ex 100.0 90 69 ① ① 36 rs112620425 6 PALM2-AKAP2 Ex 99.3 16 46 ① 5 rs150402481 5 PRSS48 Ex 100.0 0 47 ① – rs71901196 6 ADAMTS19 Ex 98.6 0 29 – rs142924298 16 CIDEA Ex 100.0 0 24 – rs71369912 5 IRAK1BP1 Ex 98.8 79 65 0 rs146020132 7 ADAMTS7 Ex 100.0 40 33 0 rs781638345 WGS Sup. Val. Len. Gene Ra Ib Tc Nd Tc Nd RSe dbSNP Inversion Calls 27 SLC3A1 Ut 100.0 66 62 – rs71416108 26 TNIK In 100.0 96 76 – rs781523247 29 CTTNBP2 In 100.0 76 81 – rs386717124 24 PFKP In 98.8 62 51 – rs386740061 32 NLRP4 In 98.0 80 103 – rs386811126 29 ZNF571 In 99.3 26 35 ① ① – rs386809055 23 OSBP2 In 100.0 49 24 ① ① – rs67147751 18 GNG12 In 98.7 37 43 ① – rs386632129 29 LINGO2 In 100.0 18 41 ① – rs386733960 30 UBP1 In 100.0 10 40 – – 12 BOK In 98.6 77 117 – rs386657165 Duplication Calls 6 GTPBP6 Ex 100.0 28 33 0 – 34 FAM20C Ex 98.0 12 22 0 rs774848096 45 KIAA1009 Ex 98.6 34 0 7 rs539790644 9 BAIAP2L2 Ex 100.0 48 33 0 rs142739979 27 RBMXL3 Ex 98.7 37 41 NE rs782097222 9 GPRIN2 Ex 100.0 90 69 ① ① 36 rs112620425 6 PALM2-AKAP2 Ex 99.3 16 46 ① 5 rs150402481 5 PRSS48 Ex 100.0 0 47 ① – rs71901196 6 ADAMTS19 Ex 98.6 0 29 – rs142924298 16 CIDEA Ex 100.0 0 24 – rs71369912 5 IRAK1BP1 Ex 98.8 79 65 0 rs146020132 7 ADAMTS7 Ex 100.0 40 33 0 rs781638345 Note: Known cancer genes are indicated in bold face. a Region (Ut: UTR, In: Intron, Ex: Exon). b Identity. c Tumor. d Normal Two alleles ① One allele Zero alleles Inconclusive in the sample. e RNA-Seq NE: No expression of gene. Table 2. Sanger sequencing validation of top 11 microinversion and top 12 exonic microduplication (tandem or interspersed) candidates in the breast cancer cell line HCC1143 WGS Sup. Val. Len. Gene Ra Ib Tc Nd Tc Nd RSe dbSNP Inversion Calls 27 SLC3A1 Ut 100.0 66 62 – rs71416108 26 TNIK In 100.0 96 76 – rs781523247 29 CTTNBP2 In 100.0 76 81 – rs386717124 24 PFKP In 98.8 62 51 – rs386740061 32 NLRP4 In 98.0 80 103 – rs386811126 29 ZNF571 In 99.3 26 35 ① ① – rs386809055 23 OSBP2 In 100.0 49 24 ① ① – rs67147751 18 GNG12 In 98.7 37 43 ① – rs386632129 29 LINGO2 In 100.0 18 41 ① – rs386733960 30 UBP1 In 100.0 10 40 – – 12 BOK In 98.6 77 117 – rs386657165 Duplication Calls 6 GTPBP6 Ex 100.0 28 33 0 – 34 FAM20C Ex 98.0 12 22 0 rs774848096 45 KIAA1009 Ex 98.6 34 0 7 rs539790644 9 BAIAP2L2 Ex 100.0 48 33 0 rs142739979 27 RBMXL3 Ex 98.7 37 41 NE rs782097222 9 GPRIN2 Ex 100.0 90 69 ① ① 36 rs112620425 6 PALM2-AKAP2 Ex 99.3 16 46 ① 5 rs150402481 5 PRSS48 Ex 100.0 0 47 ① – rs71901196 6 ADAMTS19 Ex 98.6 0 29 – rs142924298 16 CIDEA Ex 100.0 0 24 – rs71369912 5 IRAK1BP1 Ex 98.8 79 65 0 rs146020132 7 ADAMTS7 Ex 100.0 40 33 0 rs781638345 WGS Sup. Val. Len. Gene Ra Ib Tc Nd Tc Nd RSe dbSNP Inversion Calls 27 SLC3A1 Ut 100.0 66 62 – rs71416108 26 TNIK In 100.0 96 76 – rs781523247 29 CTTNBP2 In 100.0 76 81 – rs386717124 24 PFKP In 98.8 62 51 – rs386740061 32 NLRP4 In 98.0 80 103 – rs386811126 29 ZNF571 In 99.3 26 35 ① ① – rs386809055 23 OSBP2 In 100.0 49 24 ① ① – rs67147751 18 GNG12 In 98.7 37 43 ① – rs386632129 29 LINGO2 In 100.0 18 41 ① – rs386733960 30 UBP1 In 100.0 10 40 – – 12 BOK In 98.6 77 117 – rs386657165 Duplication Calls 6 GTPBP6 Ex 100.0 28 33 0 – 34 FAM20C Ex 98.0 12 22 0 rs774848096 45 KIAA1009 Ex 98.6 34 0 7 rs539790644 9 BAIAP2L2 Ex 100.0 48 33 0 rs142739979 27 RBMXL3 Ex 98.7 37 41 NE rs782097222 9 GPRIN2 Ex 100.0 90 69 ① ① 36 rs112620425 6 PALM2-AKAP2 Ex 99.3 16 46 ① 5 rs150402481 5 PRSS48 Ex 100.0 0 47 ① – rs71901196 6 ADAMTS19 Ex 98.6 0 29 – rs142924298 16 CIDEA Ex 100.0 0 24 – rs71369912 5 IRAK1BP1 Ex 98.8 79 65 0 rs146020132 7 ADAMTS7 Ex 100.0 40 33 0 rs781638345 Note: Known cancer genes are indicated in bold face. a Region (Ut: UTR, In: Intron, Ex: Exon). b Identity. c Tumor. d Normal Two alleles ① One allele Zero alleles Inconclusive in the sample. e RNA-Seq NE: No expression of gene. The 12 duplication candidates are summarized in Table 2; all were exonic, i.e. fully or partially overlapping with exons. All of these duplications produced amplicons except for the one located in IRAK1BP1. Additionally, two amplicons from the normal sample (on genes ADAMTS19 and CIDEA) yielded a weak signal in the chromatogram so it was impossible to determine if they support the call or not; furthermore, the corresponding amplicon from the tumor sample showed no evidence of the call. Three of the nine remaining calls, in FAM20C, GTPBP6 and KIAA1009, show a clear match in the tumor sample but not in normal, indicating they are true somatic calls. Two calls, in BAIAP2L2 and RBMXL3, have a clear match in both tumor and normal samples, indicating they are germline calls. For discussion on hetrozygous calls, see Supplementary Section S3.2.1. In addition to MiStrVar we ran all the SV callers we tested on the HCC1143 cell line data. The parameters for all tools were identical to those used in the simulation. Out of these tools, only Pindel was able to identify any of the inversions. However, Pindel missed 2 of the 9 PCR validated inversion calls (in PFKP and OSBP2), out of 11 tested. The two calls made by MiStrVar that could not be validated were also called by Pindel, providing further evidence that MiStrVar improves Pindel with respect to both precision and recall. None of the tools were able to predict any of the microduplications. (Note that ITDetector was never able to complete execution after more than a month of processing). 3.3.1 MicroSVs in the complete set of TCGA-CPTAC breast cancer samples We applied MiStrVar and MiStrVar to the complete set of matched tumor/normal samples from 22 TCGA breast cancer patients for which matching WGS, RNA-Seq and CPTAC Mass Spectrometry data were all available (see Supplementary Material for details). Minimal filtering was used on the calls since few calls uniquely matched proteomic signatures. Note that we only focus on exonic microinversion and microduplication calls (either fully or partially overlapping with exons) for further analysis. The number of calls for each sample can be found in Supplementary Table S7. Although only exonic calls were used for further analysis, the highest confidence calls within intronic and UTR regions, with respective support of > 40 and > 10 (identity = 100%) were also collected (see Supplementary Table S8). We also provide the highest confidence microduplications without proteomic support (support > 40, identity = 100%) as well as somatic microduplications (see Supplementary Table S9). 3.4 ProTIE proteogenomics analysis of CPTAC breast cancer datasets CPTAC has produced global proteome and phosphor-proteome data for 105 TCGA breast cancer samples using iTRAQ protein quantification method. Each iTRAQ experiment included three TCGA samples and one common internal reference control sample. The internal reference is comprised of a mixture of 40 TCGA samples (out of the 105 breast cancer samples) with equal representation of the four breast cancer subtypes. Three of the TCGA samples were analyzed in duplicates for quality control purposes. See Section S5 in Supplementary Material for more details about mass spectrometry datasets. Based on the database search strategy mentioned in Supplementary Section S2.3, in each mixture, our first level analysis resulted in approximately 5342 spectra (1% FDR) matching to fusion peptide sequences, and about 620 spectra matching to microSV peptide sequences. If a matched peptide is identical to a substring of any known protein in Ensembl database, the corresponding spectra is discarded so as to ensure that the peptide is novel. The remaining results thus consist of all mass spectra in a single mixture supporting novel peptides originating from high confidence sequence aberrations. For a specific mixture, we can extract all the genes and the corresponding patient(s) generating these translated aberrations based on deFuse and MiStrVar calls. We also apply class-specific peptide-level FDR (Nesvizhskii, 2014) to provide more stringent results of novel proteins detection. In Table 3, a checkmark in the last column (labeled Str FDR) indicates that the corresponding PSMs pass this more stringent class-specific peptide-level FDR under 1%. Table 3. The list of interesting aberration events with translated peptides Patient Clinical Gene 1 Gene 2 deFuse Breakpoint BP Peptide # of Str Infoa score location Spectra FDR Fusions satisfying RNA-Seq level filtration conditions Fusion Calls A08G LB, IIA UBAP2 TEAD1 0.94 coding coding ✓ AINILLEGNSDTDQTAK 1 ✓ A0AM LB, IIA C17orf85 ZMYND15 0.92 coding downstream ✓ AQTPGDQETR 1 A12E LA, IIB C20orf111 FITM2 0.93 utr5p coding ✓ NVLNVVNR 1 A142 Bl, IIB ACTG1 ACTB 0.53 coding coding ✓ QDATLALGLVTNWDDMEK 1 ✓ A159 Bl, IIA ACTL7B KLF9 0.54 coding utr3p ✓ EAQLPLEALGEAIQLCFLSFLSVR 1 A15A LB, IIIC HOOK3 CTA-392C11.1 0.43 coding intron ✓ CHELDMQEK 1 ✓ YHMFSLISGAEQGEHMDTGR 2 ✓ A18U LB, IIIA ZNF354A RP11-383H13.1 0.48 coding intron ✓ DGSGVSSLGVTPESR 2 ✓ Additional Fusions with Multiple Supporting Peptides Fusion Calls A04A LA, IIIA ACTG1 ACTB 0.03 coding coding ✓ QKEALFQPSFLGMESCGIHETTFNSIMK 30 ✓ ✓ KEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A06N LB, IIIB KRT19 CTD-2165H16.1 0.01 coding pseudogene ✓ DNPGVLKPGMVVTFAPVNVTTEVK 1 ✓ NPGVLKPGMVVTFAPVNVTTEVK 13 ✓ A0AS LB, IIIA ACTG1 ACTG1P2 0.39 coding pseudogene ✓ (*)DLYTNTVLSGGTTMYPGIADR 5 ✓ ✓ (*)LYTNTVLSGGTTMYPGIADR 6 ✓ A0AS LB, IIIA ACTB KDM4C 0.01 coding intron ✓ (*)FCCPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ (*)CCPEALFQPSFLGMESCGIHETTFNSIMK 6 A0D1 H2, IIA RPL8 CTD-2165H16.1 0.01 coding pseudogene ✓ EAVPIVAAGVGEFEAGISK 1 AFVPISGWNGNNMLEPSANMPWFK 2 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPW 1 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWF 3 ✓ IGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 16 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 2 ✓ A0TT LB, III ACTB ACTG1 0.47 coding coding ✓ AWSPEALFQPSFLGMESCGIHETTFNSIMK 13 ✓ ✓ WSPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A12Z H2, II ACTB FNIP1 0.21 coding intron ✓ MTQIMFETFNTPAVYMAI 2 ✓ ✓ MTQIMFETFNTPAVYMAIQ 1 ✓ A158 Bl, IIA EEF1A1P7 EEF1A1P29 0.03 pseudogene pseudogene ✓ KIGYNPNTVAFVPISGWNGDNMLEPSANMPWFK 1 ✓ DGNASGTILLEALDCILPPTRPTDK 5 ✓ Patient Clinical Gene dbSNP RNA-Seq WGS Breakpoint SV Peptide # of Str Infoa score sourceb Support Length Spectra FDR Microduplication calls with supporting peptides Duplication Calls A0DG LA, IIA FAM134A ✓ B 1/1 6 QALDS|EE|EEEEEDVAAK 1 ✓ A0JM LB, IIB HSPBP1 rs3040014 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 2 ✓ A18U LB, IIIA HSPBP1 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 1 ✓ A18R H2, IB NUPL2 rs200880793 ✓ B 1/1 12 QQP|RQQP|QQQPSGNNR 1 A12Q LB, IIIC RPL14 rs369485042 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 4 ✓ A0DG Bl, I RPL14 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 3 ✓ A0YG LA, IIA RPL14 rs369485042 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 3 ✓ A0JM LA, IIB RPL14 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 6 ✓ A0CE Bl, IIA RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 9 ✓ A18R LB, IIB RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 4 ✓ A18U LB, IIA RPL14 ✓ B 1/1 27 GT|AAAAAAAAA|AAAAAAAAAAK 3 ✓ Microinversion calls with supporting peptides Inversion Calls A0D2 Bl, IIB PLIN4 rs12327614, rs56366613 N/A N 2/2 6 DTVCSGVT|SA|MNVAK 2 ✓ A0AV Bl, IIC PLIN4 rs75031432, rs79662071 ✓ B 2/2 6 DTVCSGVT|SA|MNVAK 4 ✓ A0YG LA, IIA CHD5 N/A N 1/2 939 KQVNYNDASQEDQ|GSER 1 ✓ A0J6 Bl, IIA C4orf21 T Between 73 KMTYVVNR 1 ✓ A0EY LB, IIB PTPN4 N/A N Between 794 FINNYIIK 1 A0J6 Bl, IIA ZNF415 Low T Between 497 QRAEILEK 1 A18R H2, IB ACSM2A Low T Between 528 VSQGNIK 1 A0CM Bl, IIA CC2D2A Low T Between 886 MEHMIQASVT 1 A0CM Bl, IIA ZNF257 Low T Between 732 FSHLIAGK 1 Patient Clinical Gene 1 Gene 2 deFuse Breakpoint BP Peptide # of Str Infoa score location Spectra FDR Fusions satisfying RNA-Seq level filtration conditions Fusion Calls A08G LB, IIA UBAP2 TEAD1 0.94 coding coding ✓ AINILLEGNSDTDQTAK 1 ✓ A0AM LB, IIA C17orf85 ZMYND15 0.92 coding downstream ✓ AQTPGDQETR 1 A12E LA, IIB C20orf111 FITM2 0.93 utr5p coding ✓ NVLNVVNR 1 A142 Bl, IIB ACTG1 ACTB 0.53 coding coding ✓ QDATLALGLVTNWDDMEK 1 ✓ A159 Bl, IIA ACTL7B KLF9 0.54 coding utr3p ✓ EAQLPLEALGEAIQLCFLSFLSVR 1 A15A LB, IIIC HOOK3 CTA-392C11.1 0.43 coding intron ✓ CHELDMQEK 1 ✓ YHMFSLISGAEQGEHMDTGR 2 ✓ A18U LB, IIIA ZNF354A RP11-383H13.1 0.48 coding intron ✓ DGSGVSSLGVTPESR 2 ✓ Additional Fusions with Multiple Supporting Peptides Fusion Calls A04A LA, IIIA ACTG1 ACTB 0.03 coding coding ✓ QKEALFQPSFLGMESCGIHETTFNSIMK 30 ✓ ✓ KEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A06N LB, IIIB KRT19 CTD-2165H16.1 0.01 coding pseudogene ✓ DNPGVLKPGMVVTFAPVNVTTEVK 1 ✓ NPGVLKPGMVVTFAPVNVTTEVK 13 ✓ A0AS LB, IIIA ACTG1 ACTG1P2 0.39 coding pseudogene ✓ (*)DLYTNTVLSGGTTMYPGIADR 5 ✓ ✓ (*)LYTNTVLSGGTTMYPGIADR 6 ✓ A0AS LB, IIIA ACTB KDM4C 0.01 coding intron ✓ (*)FCCPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ (*)CCPEALFQPSFLGMESCGIHETTFNSIMK 6 A0D1 H2, IIA RPL8 CTD-2165H16.1 0.01 coding pseudogene ✓ EAVPIVAAGVGEFEAGISK 1 AFVPISGWNGNNMLEPSANMPWFK 2 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPW 1 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWF 3 ✓ IGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 16 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 2 ✓ A0TT LB, III ACTB ACTG1 0.47 coding coding ✓ AWSPEALFQPSFLGMESCGIHETTFNSIMK 13 ✓ ✓ WSPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A12Z H2, II ACTB FNIP1 0.21 coding intron ✓ MTQIMFETFNTPAVYMAI 2 ✓ ✓ MTQIMFETFNTPAVYMAIQ 1 ✓ A158 Bl, IIA EEF1A1P7 EEF1A1P29 0.03 pseudogene pseudogene ✓ KIGYNPNTVAFVPISGWNGDNMLEPSANMPWFK 1 ✓ DGNASGTILLEALDCILPPTRPTDK 5 ✓ Patient Clinical Gene dbSNP RNA-Seq WGS Breakpoint SV Peptide # of Str Infoa score sourceb Support Length Spectra FDR Microduplication calls with supporting peptides Duplication Calls A0DG LA, IIA FAM134A ✓ B 1/1 6 QALDS|EE|EEEEEDVAAK 1 ✓ A0JM LB, IIB HSPBP1 rs3040014 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 2 ✓ A18U LB, IIIA HSPBP1 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 1 ✓ A18R H2, IB NUPL2 rs200880793 ✓ B 1/1 12 QQP|RQQP|QQQPSGNNR 1 A12Q LB, IIIC RPL14 rs369485042 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 4 ✓ A0DG Bl, I RPL14 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 3 ✓ A0YG LA, IIA RPL14 rs369485042 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 3 ✓ A0JM LA, IIB RPL14 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 6 ✓ A0CE Bl, IIA RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 9 ✓ A18R LB, IIB RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 4 ✓ A18U LB, IIA RPL14 ✓ B 1/1 27 GT|AAAAAAAAA|AAAAAAAAAAK 3 ✓ Microinversion calls with supporting peptides Inversion Calls A0D2 Bl, IIB PLIN4 rs12327614, rs56366613 N/A N 2/2 6 DTVCSGVT|SA|MNVAK 2 ✓ A0AV Bl, IIC PLIN4 rs75031432, rs79662071 ✓ B 2/2 6 DTVCSGVT|SA|MNVAK 4 ✓ A0YG LA, IIA CHD5 N/A N 1/2 939 KQVNYNDASQEDQ|GSER 1 ✓ A0J6 Bl, IIA C4orf21 T Between 73 KMTYVVNR 1 ✓ A0EY LB, IIB PTPN4 N/A N Between 794 FINNYIIK 1 A0J6 Bl, IIA ZNF415 Low T Between 497 QRAEILEK 1 A18R H2, IB ACSM2A Low T Between 528 VSQGNIK 1 A0CM Bl, IIA CC2D2A Low T Between 886 MEHMIQASVT 1 A0CM Bl, IIA ZNF257 Low T Between 732 FSHLIAGK 1 Note: A (✓) in the column BP (BreakPoint) indicates that the peptide crosses the fusion breakpoint, and in the last column that the peptide satisfies our more stringent FDR criterion. For inversions, associated peptides always span one or two breakpoints (1/2 or 2/2), or the inverted sequence between the breakpoints (‘Between’). For duplications, the associated peptide always spans the single breakpoint and the entire inserted sequence (1/1). Breakpoints in the peptide sequences are marked with ‘|’. Calls marked as ‘Low’ in the RNA-seq column are those from genes with low sequence coverage; and as ‘N/A’ indicate the lack of RNA-seq data for this sample. Known cancer genes are indicated in bold face. Starred peptides (*) are SAAVs according to validated peptides in Ensembl GRCh38 protein database. a LA, LB, Bl, H2 stand for Luminal A, Lunimal B, Basal-like and HER2-enriched, respectively. b T, B, N indicate event detection in tumor tissue, normal tissue or both. Table 3. The list of interesting aberration events with translated peptides Patient Clinical Gene 1 Gene 2 deFuse Breakpoint BP Peptide # of Str Infoa score location Spectra FDR Fusions satisfying RNA-Seq level filtration conditions Fusion Calls A08G LB, IIA UBAP2 TEAD1 0.94 coding coding ✓ AINILLEGNSDTDQTAK 1 ✓ A0AM LB, IIA C17orf85 ZMYND15 0.92 coding downstream ✓ AQTPGDQETR 1 A12E LA, IIB C20orf111 FITM2 0.93 utr5p coding ✓ NVLNVVNR 1 A142 Bl, IIB ACTG1 ACTB 0.53 coding coding ✓ QDATLALGLVTNWDDMEK 1 ✓ A159 Bl, IIA ACTL7B KLF9 0.54 coding utr3p ✓ EAQLPLEALGEAIQLCFLSFLSVR 1 A15A LB, IIIC HOOK3 CTA-392C11.1 0.43 coding intron ✓ CHELDMQEK 1 ✓ YHMFSLISGAEQGEHMDTGR 2 ✓ A18U LB, IIIA ZNF354A RP11-383H13.1 0.48 coding intron ✓ DGSGVSSLGVTPESR 2 ✓ Additional Fusions with Multiple Supporting Peptides Fusion Calls A04A LA, IIIA ACTG1 ACTB 0.03 coding coding ✓ QKEALFQPSFLGMESCGIHETTFNSIMK 30 ✓ ✓ KEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A06N LB, IIIB KRT19 CTD-2165H16.1 0.01 coding pseudogene ✓ DNPGVLKPGMVVTFAPVNVTTEVK 1 ✓ NPGVLKPGMVVTFAPVNVTTEVK 13 ✓ A0AS LB, IIIA ACTG1 ACTG1P2 0.39 coding pseudogene ✓ (*)DLYTNTVLSGGTTMYPGIADR 5 ✓ ✓ (*)LYTNTVLSGGTTMYPGIADR 6 ✓ A0AS LB, IIIA ACTB KDM4C 0.01 coding intron ✓ (*)FCCPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ (*)CCPEALFQPSFLGMESCGIHETTFNSIMK 6 A0D1 H2, IIA RPL8 CTD-2165H16.1 0.01 coding pseudogene ✓ EAVPIVAAGVGEFEAGISK 1 AFVPISGWNGNNMLEPSANMPWFK 2 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPW 1 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWF 3 ✓ IGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 16 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 2 ✓ A0TT LB, III ACTB ACTG1 0.47 coding coding ✓ AWSPEALFQPSFLGMESCGIHETTFNSIMK 13 ✓ ✓ WSPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A12Z H2, II ACTB FNIP1 0.21 coding intron ✓ MTQIMFETFNTPAVYMAI 2 ✓ ✓ MTQIMFETFNTPAVYMAIQ 1 ✓ A158 Bl, IIA EEF1A1P7 EEF1A1P29 0.03 pseudogene pseudogene ✓ KIGYNPNTVAFVPISGWNGDNMLEPSANMPWFK 1 ✓ DGNASGTILLEALDCILPPTRPTDK 5 ✓ Patient Clinical Gene dbSNP RNA-Seq WGS Breakpoint SV Peptide # of Str Infoa score sourceb Support Length Spectra FDR Microduplication calls with supporting peptides Duplication Calls A0DG LA, IIA FAM134A ✓ B 1/1 6 QALDS|EE|EEEEEDVAAK 1 ✓ A0JM LB, IIB HSPBP1 rs3040014 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 2 ✓ A18U LB, IIIA HSPBP1 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 1 ✓ A18R H2, IB NUPL2 rs200880793 ✓ B 1/1 12 QQP|RQQP|QQQPSGNNR 1 A12Q LB, IIIC RPL14 rs369485042 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 4 ✓ A0DG Bl, I RPL14 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 3 ✓ A0YG LA, IIA RPL14 rs369485042 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 3 ✓ A0JM LA, IIB RPL14 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 6 ✓ A0CE Bl, IIA RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 9 ✓ A18R LB, IIB RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 4 ✓ A18U LB, IIA RPL14 ✓ B 1/1 27 GT|AAAAAAAAA|AAAAAAAAAAK 3 ✓ Microinversion calls with supporting peptides Inversion Calls A0D2 Bl, IIB PLIN4 rs12327614, rs56366613 N/A N 2/2 6 DTVCSGVT|SA|MNVAK 2 ✓ A0AV Bl, IIC PLIN4 rs75031432, rs79662071 ✓ B 2/2 6 DTVCSGVT|SA|MNVAK 4 ✓ A0YG LA, IIA CHD5 N/A N 1/2 939 KQVNYNDASQEDQ|GSER 1 ✓ A0J6 Bl, IIA C4orf21 T Between 73 KMTYVVNR 1 ✓ A0EY LB, IIB PTPN4 N/A N Between 794 FINNYIIK 1 A0J6 Bl, IIA ZNF415 Low T Between 497 QRAEILEK 1 A18R H2, IB ACSM2A Low T Between 528 VSQGNIK 1 A0CM Bl, IIA CC2D2A Low T Between 886 MEHMIQASVT 1 A0CM Bl, IIA ZNF257 Low T Between 732 FSHLIAGK 1 Patient Clinical Gene 1 Gene 2 deFuse Breakpoint BP Peptide # of Str Infoa score location Spectra FDR Fusions satisfying RNA-Seq level filtration conditions Fusion Calls A08G LB, IIA UBAP2 TEAD1 0.94 coding coding ✓ AINILLEGNSDTDQTAK 1 ✓ A0AM LB, IIA C17orf85 ZMYND15 0.92 coding downstream ✓ AQTPGDQETR 1 A12E LA, IIB C20orf111 FITM2 0.93 utr5p coding ✓ NVLNVVNR 1 A142 Bl, IIB ACTG1 ACTB 0.53 coding coding ✓ QDATLALGLVTNWDDMEK 1 ✓ A159 Bl, IIA ACTL7B KLF9 0.54 coding utr3p ✓ EAQLPLEALGEAIQLCFLSFLSVR 1 A15A LB, IIIC HOOK3 CTA-392C11.1 0.43 coding intron ✓ CHELDMQEK 1 ✓ YHMFSLISGAEQGEHMDTGR 2 ✓ A18U LB, IIIA ZNF354A RP11-383H13.1 0.48 coding intron ✓ DGSGVSSLGVTPESR 2 ✓ Additional Fusions with Multiple Supporting Peptides Fusion Calls A04A LA, IIIA ACTG1 ACTB 0.03 coding coding ✓ QKEALFQPSFLGMESCGIHETTFNSIMK 30 ✓ ✓ KEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A06N LB, IIIB KRT19 CTD-2165H16.1 0.01 coding pseudogene ✓ DNPGVLKPGMVVTFAPVNVTTEVK 1 ✓ NPGVLKPGMVVTFAPVNVTTEVK 13 ✓ A0AS LB, IIIA ACTG1 ACTG1P2 0.39 coding pseudogene ✓ (*)DLYTNTVLSGGTTMYPGIADR 5 ✓ ✓ (*)LYTNTVLSGGTTMYPGIADR 6 ✓ A0AS LB, IIIA ACTB KDM4C 0.01 coding intron ✓ (*)FCCPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ (*)CCPEALFQPSFLGMESCGIHETTFNSIMK 6 A0D1 H2, IIA RPL8 CTD-2165H16.1 0.01 coding pseudogene ✓ EAVPIVAAGVGEFEAGISK 1 AFVPISGWNGNNMLEPSANMPWFK 2 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPW 1 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWF 3 ✓ IGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 16 ✓ KIGYNPDTVAFVPISGWNGNNMLEPSANMPWFK 2 ✓ A0TT LB, III ACTB ACTG1 0.47 coding coding ✓ AWSPEALFQPSFLGMESCGIHETTFNSIMK 13 ✓ ✓ WSPEALFQPSFLGMESCGIHETTFNSIMK 1 ✓ A12Z H2, II ACTB FNIP1 0.21 coding intron ✓ MTQIMFETFNTPAVYMAI 2 ✓ ✓ MTQIMFETFNTPAVYMAIQ 1 ✓ A158 Bl, IIA EEF1A1P7 EEF1A1P29 0.03 pseudogene pseudogene ✓ KIGYNPNTVAFVPISGWNGDNMLEPSANMPWFK 1 ✓ DGNASGTILLEALDCILPPTRPTDK 5 ✓ Patient Clinical Gene dbSNP RNA-Seq WGS Breakpoint SV Peptide # of Str Infoa score sourceb Support Length Spectra FDR Microduplication calls with supporting peptides Duplication Calls A0DG LA, IIA FAM134A ✓ B 1/1 6 QALDS|EE|EEEEEDVAAK 1 ✓ A0JM LB, IIB HSPBP1 rs3040014 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 2 ✓ A18U LB, IIIA HSPBP1 Low B 1/1 9 LPLALPPASQGCSSGGGGG|GG|GGSSAGGSGNSRPPR 1 ✓ A18R H2, IB NUPL2 rs200880793 ✓ B 1/1 12 QQP|RQQP|QQQPSGNNR 1 A12Q LB, IIIC RPL14 rs369485042 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 4 ✓ A0DG Bl, I RPL14 ✓ B 1/1 9 GT|AAA|AAAAAAAAAAK 3 ✓ A0YG LA, IIA RPL14 rs369485042 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 3 ✓ A0JM LA, IIB RPL14 ✓ B 1/1 15 GT|AAAAA|AAAAAAAAAAK 6 ✓ A0CE Bl, IIA RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 9 ✓ A18R LB, IIB RPL14 ✓ N 1/1 15 GT|AAAAA|AAAAAAAAAAK 4 ✓ A18U LB, IIA RPL14 ✓ B 1/1 27 GT|AAAAAAAAA|AAAAAAAAAAK 3 ✓ Microinversion calls with supporting peptides Inversion Calls A0D2 Bl, IIB PLIN4 rs12327614, rs56366613 N/A N 2/2 6 DTVCSGVT|SA|MNVAK 2 ✓ A0AV Bl, IIC PLIN4 rs75031432, rs79662071 ✓ B 2/2 6 DTVCSGVT|SA|MNVAK 4 ✓ A0YG LA, IIA CHD5 N/A N 1/2 939 KQVNYNDASQEDQ|GSER 1 ✓ A0J6 Bl, IIA C4orf21 T Between 73 KMTYVVNR 1 ✓ A0EY LB, IIB PTPN4 N/A N Between 794 FINNYIIK 1 A0J6 Bl, IIA ZNF415 Low T Between 497 QRAEILEK 1 A18R H2, IB ACSM2A Low T Between 528 VSQGNIK 1 A0CM Bl, IIA CC2D2A Low T Between 886 MEHMIQASVT 1 A0CM Bl, IIA ZNF257 Low T Between 732 FSHLIAGK 1 Note: A (✓) in the column BP (BreakPoint) indicates that the peptide crosses the fusion breakpoint, and in the last column that the peptide satisfies our more stringent FDR criterion. For inversions, associated peptides always span one or two breakpoints (1/2 or 2/2), or the inverted sequence between the breakpoints (‘Between’). For duplications, the associated peptide always spans the single breakpoint and the entire inserted sequence (1/1). Breakpoints in the peptide sequences are marked with ‘|’. Calls marked as ‘Low’ in the RNA-seq column are those from genes with low sequence coverage; and as ‘N/A’ indicate the lack of RNA-seq data for this sample. Known cancer genes are indicated in bold face. Starred peptides (*) are SAAVs according to validated peptides in Ensembl GRCh38 protein database. a LA, LB, Bl, H2 stand for Luminal A, Lunimal B, Basal-like and HER2-enriched, respectively. b T, B, N indicate event detection in tumor tissue, normal tissue or both. 3.4.1 ProTIE inferred fusion peptides Given the proteomics search results for a specific mixture, a peptide will be further evaluated only if the corresponding fusion is also observed in at least one patient within the mixture. Among the remaining 5579 spectra, 3185 match to peptides coming from immunoglobulin heavy and light chain fusions. These peptides are not considered any further since highly repeated regions shared between those genes can lead to false positives in both fusion detection and proteomics search stages (Boutz et al., 2014; Cheung et al., 2012). Among the peptides remaining, we also discard those associated with a fusion for which no breakpoint crossing peptide is observed (This is due to the difficulty of determining whether such a peptide is a result of a fusion or because of a reading frame shift). At the end of these filtering steps ProTIE returns 807 spectra matching to 169 potential fusion peptides. Among fusions related to these potential fusion peptides, we summarize special events with either high confidence RNA-Seq level evidence or proteomics support in Table 3. The first part of Table 3 shows events with better fusion quality based on reports of deFuse (see Section S7 in Supplementary Material for details). Among these fusions, two stand out with respect to peptide-spectrum matching quality, respectively observed in patients A08G and A15A. The PSMs supporting these two fusions are shown in Supplementary Material. We also provide a list of fusions with multiple translation peptides in the second part of Table 3. More specifically, four of these fusions have matching peptides located on both at the breakpoint and further downstream. Note that although we only detect a single peptide for some additional fusions, the peptide may be supported by multiple spectra as can be seen in Supplementary Table S10. 3.4.2 ProTIE inferred MicroSV peptides As per ProTIE's translated fusion peptide inference approach, for each mixture, we only consider previously unknown peptides that can be unique byproducts of microSVs detected in at least one patient within the mixture. To ensure that these peptides support microSV (duplication or inversion) calls and not SNVs/SNPs, we only consider potential peptides from an interspersed duplication or inversion with a minimum of two amino acids on each side of at least one of the two breakpoints associated with that microSV; for tandem duplications we ensure that at least two amino acids are present in the peptide from both sides of the single breakpoint. Proteomics search of these peptides on 22 patients resulted in 115 spectra potentially supporting microSVs. These spectra support a total of 75 peptides, due to the fact that some of the peptides are supported by more than one spectra. Of these 75 peptides, 7 support microduplications and 68 support microinversions. Incorporating the RNA-Seq results from Section S6.2 in Supplementary Material, we obtain 4 microSV calls with support on all omics levels. The resulting peptides with the highest quality spectra support are summarized in Table 3. Here the number of spectra supporting these peptides is indicated in the ‘Spectra’ column. Similarly, column ‘Breakpoint Support’ indicates the number and type of the breakpoints supported by spectra for each peptide. 4 Discussion 4.1 Genomic MicroSVs detected with MiStrVar Our simulations show that MiStrVar effectively and accurately identifies all microSVs, specifically, insertions, tandem and interspersed duplications in WGS datasets. In particular, MiStrVar has high sensitivity, as well as high precision—especially for inversions. For duplications, even though its precision may not look as impressive, MiStrVar still outperforms all available alternatives. In addition, the precision values for duplications are likely to have been underestimated, since many of calls labelled as ‘false positives’ could, in fact, be true germline differences between the Venter genome and the reference genome. On a very high coverage dataset (120x) from the Venter genome, with no simulated microSVs, duplications detected by MiStrVar have a large overlap with those it detects in the simulation dataset. Elimination of these calls from the simulation dataset increases MiStrVar’s precision to 71% without any additional filtering. MiStrVar is also very accurate in identifying the exact breakpoint loci of the microSVs. This is particularly important for our proteogenomics analysis where we only consider exact peptide matches. If a breakpoint were off even by only one nucleotide there is a high likelihood the predicted peptide would not match. With the exception of Pindel for inversions, which correctly identified 10% fewer exact breakpoints, no tool was even close to correctly identifying as many single-nucleotide resolution microSV breakpoints as MiStrVar. For inversions, the calls where MiStrVar cannot identify the exact breakpoints are often due to the presence of palindromic sequences, resulting in co-optimal breakpoint predictions. More importantly, these cases yield identical peptides and therefore do not affect further analysis results. For duplications, the errors are usually observed in cases where the insertion is into a low complexity region. Again, in many of these cases the resulting peptides would be identical (e.g. consider a duplication that occurs in a polynucleotide tract). Furthermore, even in the worst case, MiStrVar predictions are within 30 bp from the real breakpoints, still much better than the available alternatives. It should also be noted here that unlike other tools, MiStrVar provides not only the duplication breakpoint coordinates but also the precise coordinates of the ‘source’ sequence (i.e. the region of the genome that is duplicated). Through this feature it becomes easier for the user to interpret interspersed as well as tandem duplications. 4.2 Translated aberrations detected with ProTIE The use of a proteogenomic approach enables two novel capabilities that are highly relevant to cancer biology and precision medicine. (i) The ability to hone in on potential clinically actionable mutations that are expressed at the protein level. The vast majority of clinical cancer testing focuses only on DNA-level mutations. A gene mutation–drug association is predicated on the assumption that a mutation will translate to the protein level, however, this is often not the case, as genes that contain a mutation may not be expressed in RNA. Likewise, RNA expression does not always directly translate to protein expression. Thus, having protein level evidence to confirm genomic aberrations provides assurance of the functional presence of a mutation. (ii) The ability to observe the presence of protein spectra from fusion transcripts that are predicted to be ‘out-of-frame’. The vast majority of fusion annotation pipelines filter out fusions that are not in-frame secondary to a widely-held reasoning that these protein products would be misfolded and degraded or subject to non-sense mediated decay. Surprisingly, in this study, high quality spectra were observed from out-of-frame fusion spectra. While additional studies will need to be performed, these data suggest these out-of-frame fusion products are stable enough and at a relative abundance to be detected by Mass Spectrometry. Whether these products are stable by chance or confer a gain-of-function capability is yet to be seen, but these data at minimum suggest that out-of-frame fusions should not be eliminated from consideration when searching for oncogenic candidates. 4.2.1 Translated gene fusions To better understand the properties of genes with translation evidence for fusions, we analyzed these genes through Ingenuity Pathway Analysis (https://www.ingenuity.com). Note that we used all fusion genes detected by deFuse as the background genes in the analysis. The top 3 categories for gene function enrichment are: Cancer (137 genes), Organismal Injury and Abnormalities (150 genes), and Respiratory Disease (39 genes). All three sets of genes come with adjusted P-value around 0.0035 (via Benjamini-Hochberg procedure). Given that fusions are a somatic cancer-specific event, enrichment of cancer related genes provides a validation of our approach. See Section 8.1 in Supplementary Material for more details on functional interpretations of these genes. 4.2.2 Translated MicroSVs Most of the microinversions with proteomics support are in the 400 bp to 1 kb length range. Microinversions shorter than 100 bp are much less common in exonic regions. However in intronic and UTR regions, microinversions with the best genomic support (in terms of both read coverage and sequence similarity) are predominately of length less than 100 bp; See Supplementary Table S8, for a summary of intronic and UTR microinversions. We also observed that shorter microinversions tend to be germline events while longer events tend to be somatic. All of the microduplication calls with proteomic support (all of these—with the exception of the one in NUPL2—satisfy our more stringent FDR criterion) were predicted to be germline events. Indeed nearly all of these events have corresponding dbSNP entries. The call in FAM134A appears to be a novel germline event. The longest duplication in RPL14 also appears to be novel (rs369485042 includes variants with up to 5 alanines). Since we observed relatively few translated microduplications, it is unlikely that this type of microSV plays a major role in breast cancer through translation to aberrant proteins. However we predicted many high confidence microduplications in exonic regions, some with RNA-Seq support, in addition to many in UTRs and introns (Supplementary Table S9). It is possible that such exonic duplications lead to truncated or rapidly degraded proteins and the duplications in UTRs and intronic regions may affect gene expression and splicing. Our analysis resulted in 4 microSV calls with support on all omics levels. This includes 3 microduplications (within genes FAM134A, NUPL2 and RPL14) and 1 microinversions (within PLIN4). The microduplications in FAM134A and RPL14 (that with 27 bp) appear to be novel events. Additionally, there are several events with both genomic and proteomic support, which possibly lack RNA-Seq support due to low expression of the associated gene or the lack of RNA-Seq data for the sample. See Section S8.2 in Supplementary Material for more details on functional interpretations of these genes. 5 Conclusion Integration of genomic, transcriptomic and proteomic data provides a comprehensive view of the patient’s molecular profile. TCGA/CPTAC now offers matching genomic, transcriptomic and proteomic data across several cancer types, with a focus on the impact of Single Amino Acid Variants (SAAVs) and SNVs on protein abundances. In order to complement TCGA/CPTAC study and better establish the relationship between genomic, transcriptomic and proteomic aberrations and the cancer phenotype, we introduce MiStrVar, the first tool to capture multiple types of microSVs in WGS datasets. MiStrVar, and deFuse, a fusion detection tool we developed earlier, form key components of ProTIE, a computational framework we introduce here to automatically and jointly identify translated fusions and microSVs in matching omics datasets. Concurrently, ProTIE also incorporates RNA-Seq evidence to validate expressed microSVs. Based on both simulation and cell line data, we demonstrate that MiStrVar significantly outperforms available tools for SV detection. Our results on the TCGA/CPTAC breast cancer datasets also suggest the possibility of automatic calibration for some entries in dbSNP, which we believe are misannotated. It is interesting to note that the majority of the translated microSVs and fusions we observed in the breast cancer samples were private events; this prompts a larger and more detailed integrated study of all three omics data types through the use of ProTIE for a comprehensive molecular profiling of breast cancer subtypes. Acknowledgements The WGS and RNA-Seq datasets were retrieved from the Cancer Genomics Hub (now Genomic Data Commons); the proteomics data was released by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium. Clinical information was obtained through the database of Genotypes and Phenotypes (dbGaP). Funding This paper was partially funded by the US National Institutes of Health Grant GM108348, the NSERC Discovery Frontiers Grant, “Cancer Genome Collaboratory” (to S.C.S.), NSERC Discovery Grant (to F.H.), and NIGMS/NIH Grant No R01 GM103725-04 (to S.L.). Additional funding is provided by the Indiana University Grand Challenges Program, Precision Health Initiative (to S.C.S. and H.T.). Conflict of Interest: none declared. References Barretina J. et al. ( 2012 ) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity . Nature , 483 , 603 – 607 . Google Scholar CrossRef Search ADS PubMed Blum A. et al. ( 1994 ) Linear approximation of shortest superstrings . J. ACM , 41 , 630 – 647 . Google Scholar CrossRef Search ADS Boutz D.R. et al. ( 2014 ) Proteomic identification of monoclonal antibodies from serum . Anal. Chem ., 86 , 4758 – 4766 . Google Scholar CrossRef Search ADS PubMed Castellana N.E. et al. ( 2014 ) An automated proteogenomic method uses mass spectrometry to reveal novel genes in zea mays . Mol. Cell. Proteomics , 13 , 157 – 167 . Google Scholar CrossRef Search ADS PubMed Cesnik A.J. et al. ( 2015 ) Human proteomic variation revealed by combining RNA-seq proteogenomics and global Post-Translational modification (G-PTM) search strategy . J. Proteome Res ., 15 , 800 – 808 . Google Scholar CrossRef Search ADS Cheung W.C. et al. ( 2012 ) A proteomics approach for the identification and cloning of monoclonal antibodies from serum . Nat. Biotechnol ., 30 , 447 – 452 . Google Scholar CrossRef Search ADS PubMed Elias J.E. , Gygi S.P. ( 2007 ) Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry . Nat. Methods , 4 , 207 – 214 . Google Scholar CrossRef Search ADS PubMed Ellis M.J. et al. ( 2013 ) Connecting genomic alterations to cancer biology with proteomics: the NCI clinical proteomic tumor analysis consortium . Cancer Discov ., 3 , 1108 – 1112 . Google Scholar CrossRef Search ADS PubMed Ewald I.P. et al. ( 2009 ) Genomic rearrangements in BRCA1 and BRCA2: a literature review . Genet. Mol. Biol ., 32 , 437 – 446 . Google Scholar CrossRef Search ADS PubMed Fan X. et al. ( 2014 ) BreakDancer – identification of genomic structural variation from paired-end read mapping . Curr. Protoc. Bioinf ., 45 , 15.6.1 – 15.6.11 . Google Scholar CrossRef Search ADS Fernandez-Luna J.L. ( 2000 ) Bcr-Abl and inhibition of apoptosis in chronic myelogenous leukemia cells . Apoptosis Int. J. Program. Cell Death , 5 , 315 – 318 . Google Scholar CrossRef Search ADS Frenkel-Morgenstern M. et al. ( 2012 ) Chimeras taking shape: potential functions of proteins encoded by chimeric RNA transcripts . Genome Res ., 22 , 1231 – 1242 . Google Scholar CrossRef Search ADS PubMed Gallant J. et al. ( 1980 ) On finding minimal length superstrings . J. Comput. Syst. Sci ., 20 , 50 – 58 . Google Scholar CrossRef Search ADS Gillette M.A. , Carr S.A. ( 2013 ) Quantitative analysis of peptides and proteins in biomedicine by targeted mass spectrometry . Nat. Methods , 10 , 28 – 34 . Google Scholar CrossRef Search ADS PubMed Hach F. et al. ( 2010 ) mrsFAST: a cache-oblivious algorithm for short-read mapping . Nat. Methods , 7 , 576 – 577 . Google Scholar CrossRef Search ADS PubMed Hach F. et al. ( 2014 ) mrsFAST-ultra: a compact, SNP-aware mapper for high performance sequencing applications . Nucleic Acids Res ., 42 , W494 – 500 . Google Scholar CrossRef Search ADS PubMed Hemmer S. et al. ( 2001 ) Deletion of 11q23 and cyclin D1 overexpression are frequent aberrations in parathyroid adenomas . Am. J. Pathol ., 158 , 1355 – 1362 . Google Scholar CrossRef Search ADS PubMed Hormozdiari F. et al. ( 2009 ) Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes . Genome Res ., 19 , 1270 – 1278 . Google Scholar CrossRef Search ADS PubMed Kim S. , Pevzner P.A. ( 2014 ) MS-GF+ makes progress towards a universal database search tool for proteomics . Nat. Commun ., 5 , 5277+ . Google Scholar CrossRef Search ADS PubMed Koboldt D.C. et al. ( 2012 ) Comprehensive molecular portraits of human breast tumours . Nature , 490 , 61 – 70 . Google Scholar CrossRef Search ADS PubMed McPherson A. et al. ( 2011a ) Comrad: detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data . Bioinformatics , 27 , 1481 – 1488 . Google Scholar CrossRef Search ADS McPherson A. et al. ( 2011b ) defuse: an algorithm for gene fusion discovery in tumor RNA-seq data . PLoS Comput Biol ., 7 , e1001138 . Google Scholar CrossRef Search ADS McPherson A. et al. ( 2012 ) nFuse: discovery of complex genomic rearrangements in cancer using high-throughput sequencing . Genome Res ., 22 , 2250 – 2261 . Google Scholar CrossRef Search ADS PubMed Mertins P. et al. ( 2016 ) Proteogenomics connects somatic mutations to signalling in breast cancer . Nature , 534 , 55 – 62 . Google Scholar CrossRef Search ADS PubMed Mitelman F. et al. ( 2007 ) The impact of translocations and gene fusions on cancer causation . Nat. Rev. Cancer , 7 , 233 – 245 . Google Scholar CrossRef Search ADS PubMed Mo F. et al. ( 2008 ) A compatible exon-exon junction database for the identification of exon skipping events using tandem mass spectrum data . BMC Bioinformatics , 9 , 537+ . Google Scholar CrossRef Search ADS PubMed Mustafa M.G. et al. ( 2013 ) Biomarker discovery for early detection of hepatocellular carcinoma in hepatitis c infected patients . Mol. Cell. Proteomics , 12 , 3640 – 3652 . Google Scholar CrossRef Search ADS PubMed Nakao M. et al. ( 1996 ) Internal tandem duplication of the flt3 gene found in acute myeloid leukemia . Leukemia , 10 , 1911 – 1918 . Google Scholar PubMed Nesvizhskii A.I. ( 2014 ) Proteogenomics: concepts, applications and computational strategies . Nat. Methods , 11 , 1114 – 1125 . Google Scholar CrossRef Search ADS PubMed Ning K. , Nesvizhskii A. ( 2010 ) The utility of mass spectrometry-based proteomic data for validation of novel alternative splice forms reconstructed from RNA-seq data: a preliminary assessment . BMC Bioinformatics , 11 , S14+ . Google Scholar CrossRef Search ADS PubMed Ning K. et al. ( 2012 ) Comparative analysis of different Label-Free mass spectrometry based protein abundance estimates and their correlation with RNA-seq gene expression data . J. Proteome Res ., 11 , 2261 – 2271 . Google Scholar CrossRef Search ADS PubMed Quinlan A.R. et al. ( 2010 ) Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome . Genome Res ., 20 , 623 – 635 . Google Scholar CrossRef Search ADS PubMed Rausch T. et al. ( 2012 ) DELLY: structural variant discovery by integrated paired-end and split-read analysis . Bioinformatics , 28 , i333 – i339 . Google Scholar CrossRef Search ADS PubMed Reimand J. et al. ( 2013 ) The mutational landscape of phosphorylation signaling in cancer . Sci. Rep ., 3 , 2651 . Google Scholar CrossRef Search ADS PubMed Schöniger M. , Waterman M.S. ( 1992 ) A local algorithm for DNA sequence alignment with inversions . Bull. Math. Biol ., 54 , 521 – 536 . Google Scholar CrossRef Search ADS PubMed Schroder J. et al. ( 2014 ) Socrates: identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads . Bioinformatics , 30 , 1064 – 1072 . Google Scholar CrossRef Search ADS PubMed Sheynkman G.M. et al. ( 2013 ) Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-seq . Mol. Cell. Proteomics MCP , 12 , 2341 – 2353 . Google Scholar CrossRef Search ADS Sindi S.S. et al. ( 2012 ) An integrative probabilistic model for identification of structural variation in sequencing data . Genome Biol ., 13 , R22 . Google Scholar CrossRef Search ADS PubMed Swanson L. et al. ( 2013 ) Barnacle: detecting and characterizing tandem duplications and fusions in transcriptome assemblies . BMC Genomics , 14 , 550 . Google Scholar CrossRef Search ADS PubMed Whiteaker J.R. et al. ( 2014 ) CPTAC assay portal: a repository of targeted proteomic assays . Nat. Methods , 11 , 703 – 704 . Google Scholar CrossRef Search ADS PubMed Woo S. et al. ( 2014 ) Proteogenomic database construction driven from large scale RNA-seq data . J. Proteome Res ., 13 , 21 – 28 . Google Scholar CrossRef Search ADS PubMed Wulfkuhle J.D. et al. ( 2003 ) Proteomic applications for the early detection of cancer . Nat. Rev. Cancer , 3 , 267 – 275 . Google Scholar CrossRef Search ADS PubMed Ye K. et al. ( 2009 ) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads . Bioinformatics , 25 , 2865 – 2871 . Google Scholar CrossRef Search ADS PubMed Yorukoglu D. et al. ( 2012 ) Dissect: detection and characterization of novel structural alterations in transcribed sequences . Bioinformatics , 28 , i179 – i187 . Google Scholar CrossRef Search ADS PubMed Zhang B. et al. ( 2014 ) Proteogenomic characterization of human colon and rectal cancer . Nature , 513 , 382 – 387 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

BioinformaticsOxford University Press

Published: Dec 18, 2017

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off