Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

An automated method for detecting alternatively spliced protein domains

An automated method for detecting alternatively spliced protein domains Motivation: Alternative splicing (AS) has been demonstrated to play a role in shaping eukaryotic gene diversity at the transcriptional level. However, the impact of AS on the proteome is still con- troversial. Studies that seek to explore the effect of AS at the proteomic level are hampered by technical difficulties in the cumbersome process of casting forth and back between genome, tran- scriptome and proteome space coordinates, and the naı ¨ve prediction of protein domains in the presence of AS suffers many redundant sequence scans that emerge from constitutively spliced regions that are shared between alternative products of a gene. Results: We developed the AstaFunk pipeline that computes for every generic transcriptome all domains that are altered by AS events in a systematic and efficient manner. In a nutshell, our method employs Viterbi dynamic programming, which guarantees to find all score-optimal hits of the domains under consideration, while complementary optimizations at different levels avoid re- dundant and other irrelevant computations. We evaluate AstaFunk qualitatively and quantitatively using RNAseq in well-studied genes with AS, and on large-scale employing entire transcriptomes. Our study confirms complementary reports that the effect of most AS events on the proteome seems to be rather limited, but our results also pinpoint several cases where AS could have a major impact on the function of a protein domain. Availability and implementation: The JAVA implementation of AstaFunk is available as an open source project on http://astafunk.sammeth.net. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction (Djebali et al., 2012; Wang et al., 2008), albeit the observed differ- Alternative splicing (AS) has already early been demonstrated to be ences are usually subordinate to gene expression. In specific cases, capable of alternating the proteins produced from specific messenger AS has also been demonstrated to severely compromise the function- transcripts, for instance in genes that encode receptors, cell-adhesion ality of genes, and aberrant AS variants have been linked to several molecules and membrane channels [reviewed by Black (2000)]. physiological diseases (Tazi et al., 2009), as well as to cancer pro- Quantitative assays based on microarrays and then on high- gression and metastasis (Oltean and Bates, 2014). However, there is throughput sequencing of the cellular RNA complement (RNA-Seq) to date still relatively few evidence that allows to conclude on an further showed AS to be an ubiquitous process that plays a role in equally large impact of AS in the proteome as has been reported in fine-tuning the eukaryotic transcriptome diversity according to spe- the transcriptome (Tress et al., 2017), underlining the importance of cific genotypes (Lappalainen et al., 2013), cell types and cell states automated methods that help to reduce the size of this evident gap V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected] 3809 3810 V.Coelho and M.Sammeth by a systematic analysis of the nucleic acid diversity in the context develop a method that enables systematic studies on the effect of AS of the encoded proteins. on protein domains and avoids unnecessary sequence searches. In structural biology, the term protein ‘domain’ was initially coined for regions or fragments of globular proteins that are able to 2 System and methods fold independently (Wetlaufer, 1973). This definition was later extended to a concept of domains as the basic structural, functional Herein we employ the generic Astalavista notation that delimits an and/or evolutionary units of proteins (Chothia et al., 2003; Moore AS event e ¼ðsource ; sink ;V Þ by the first/last common splice site e e e et al., 2008). Protein domains are harboured by one or multiple (‘source ’ and “sink ”) and specifies the collection of variants e e 1 d i exons (Kaessmann et al., 2002; Moore et al., 2008; Vibranovski V ¼fV ; .. . ; V g, where each variant V describes a specific exon- e e e et al., 2005), and therefore the functionality of a protein can be intron pattern (i.e. a sequence of splice sites) between source and regulated by AS by completely or partially removing/inserting some sink supported by a set of jV j 1 transcripts (Sammeth et al., protein-coding (PC) exons (Light and Elofsson, 2013; Keren et al., 2008). Collapsing transcripts with identical local splicing patterns 2010; Tress et al., 2007). into variants avoids redundancy introduced by different alternative The objective of investigating proteome changes caused by AS transcripts that locally employ the same exons and introns of an has not come up only recently. However, driven by the lack of an event. Supplementary Figure S1 shows an example of applying this universal method, previous studies on AS triggered protein modifica- notation to AS events in the TNNT1 gene. Note that both events tions employ substantially variable in-house data analysis pipelines shown in Supplementary Figure S1 have a variant collection with only that rely on heterogeneous sources (i.e. different databases of AS two variants jV j¼jV j¼ 2 that group the three alternative tran- e e 1 2 events and of protein annotations) and on specific ad hoc decisions scripts of the locus by common local splicing patterns. Dimension d for the filtering thresholds applied, which makes the corresponding specifies the number of different variants in e, and ‘alternative’ splic- results difficult to compare. For instance, Liu and Altman (2003) ing events naturally involve d  2 variants. Throughout this work we and Resch et al. (2004) conducted early transcriptome-wide studies are exclusively considering ‘complete’ AS events that comprise all d on AS in human protein domains. The study of Liu and Altman variants between source and sink (Sammeth, 2009). e e (2003) employed gene loci reviewed by the NCBI with more than The amino acid sequence of domains with the same function can one Reference Sequence (RefSeq) transcript, linked them to entries vary in different proteins of a protein family, therefore the represen- of the conserved domains database (CDD) and considered domains tative consensus patterns of a protein domain is often modelled by a that were not present in all splice variants to be ‘spliced-out’. Resch statistical profiles. The Pfam database (Finn et al., 2014) produces et al. (2004) in contrast identified a representative ‘major transcript’ sequence alignments of all proteins from the same family and subse- for each gene that maximized the number of mRNA and EST sup- quently profiles the attributes of each domain by condensing its se- port, and then determined the longest open reading frame in each quence conservation in a hidden Markov model (HMM). candidate transcript. Subsequently, they aligned the major–minor p ¼ðÞ L ; PðÞ AjZ ; PðÞ xjA ; p t e protein pairs and recorded the amino acid intervals that are removed (1) or replaced. The overlap between these alternatively spliced regions x 2X; Z ! A; A; Z 2A and annotated protein domains then allowed to identify domains HMM domain profiles p of length L are described by all Z ! A that were affected by AS. transition probabilities PðÞ AjZ between different states A; Z 2A of Later-on, Hegyi et al. (2011) described a large-scale study of AS the model, as well as by the (additional) emission probabilities in protein domains, employing major and minor alternative protein PðÞ xjA when aligning A 2fD ; I ; M g at position 1  i  L of e i i i p forms according to the underlying EST support as defined in the the model to an amino acid x 2X from the amino acid alphabet X . Swissprot database. The authors then investigated domain modifica- Latter emission probabilities are employed to evaluate the likelihood tions by the minor form by considering all major–minor transcript of p matching the symbols x 2X of a given peptide sequence X. pairs with AS in the region of domains annotated by the Protein Note that some state transitions Z ! A are not productive, for in- Family (Pfam) database. Whereas Buljan et al. (2012) analyzed how stance PðÞ D jI ¼ 0 and vice versa PðÞ I jD ¼ 0 to enforce at t i i1 t i i1 often amino acid sequences encoded by tissue-specific (TS) cassette least one mis-/matching symbol between insertions and deletions exons, non-TS cassette exons and constitutive exons are overlapping (Supplementary Algorithm S1). the Pfam protein domains; cassette exons and tissue specificity in Supplementary Figure S2 provides a more detailed description of this study were obtained by overlapping exons supported by at this so-called ‘Plan 7 architecture’ of Pfam HMMs as proposed by least two mature Ensembl (release 54) transcripts with RNAseq Durbin et al. (1998), which distinguishes in total 10 different states expression data from Wang et al. (2008); exons present in all anno- A¼ fB; C; D ; E; I ; J; M ; N; S; Tg, of which we employ in the fol- i i i tated transcripts were consequently assumed to be constitutive. lowing mainly the beginning/end state {B, E} of a domain alignment Subsequently, each exon was associated with the longest annotated and the alignment states fD ; I ; M g (deletion, insertion and mis-/ i i i ORF supporting it. match states). In contrast to Durbin et al. (1998), the model Summarizing the methodology behind these previous we apply herein prohibits partial (‘local’) alignments of p by B 6! M approaches, most approaches identified based on some reference i 8i > 1 and M 6! E8i < L . However, multiple local alignment(s) gene annotation one specific reference transcript per gene, the so- i p of the entire model p within X are still possible, by iterating the called ‘o-call transcript. This reference transcript then was usually interleaving amino acids x 2 X in the states {C, J, N} as described compared to all ‘llpare products by sequence alignments of the j by Durbin et al. (1998) (Supplementary Fig. S2). amino acid or mRNA sequences. Such alignments are computation- ally expensive and often partially redundant due to the presence of many identical subsequences from constitutively included exons. 3 System Furthermore, alignments of mature transcripts or protein sequences do not provide information about the location of introns or splicing We then developed AstaFunk (Alternative Splicing Transcriptome patterns. In order to overcome these shortcomings we sought to Annotation with Functional Knowledge), a computational pipeline AstaFunk 3811 [source; sink] by just a single nucleotide,ðÞ L  3  1 . This region then can be further extended by inserting gaps, as long as the align- ment score does not fall below the domain-specific gathering thresh- old h for a domain hit provided by the Pfam model. We further obtain for each model p ¼(L , P (AjZ), P (xjA)) the globally lowest p t e gap-opening penalty a considering transitions PðÞ I jM p t i a ¼ max bit PðÞ I jM PðÞ xjI ; (2) p t i e i 1  i  L ;x2X and correspondingly the lowest gap-extension penalty b consider- ing transitions PðÞ I jI , over all 1  i  L and emissions t i p PðÞ xjI ; x 2X e i b ¼ max bit PðÞ I jI  PðÞ xjI : (3) p t i e i 1  i  L ;x2X Note that by definition of the Pfam models (Eddy, 2011), gap- opening transitions have to be preceded by match states M ! I (Section 2.2), and that the ‘lowest penalty’ corresponds to the ‘highest bit score’ bitðÞ maximized by Equations (2) and (3). As bit scores of gap-opening and gap-extension probabilities are conse- quently negative throughout Pfam profiles p, we employ the abso- lute value of bitðÞ as a penalty measure for a and b . Fig. 1 Overview of the main steps in the AstaFunk workflow. Algorithmic We can compute the maximum number of amino acids that can steps are grey boxes labelled by letters in the order of their execution, from be introduced in the alignment of a perfect domain match x ðÞ 0 by (a)to(h). Calling AS events (left panel) requires as input a transcriptome an- notation (GTF format) and the sequence of the corresponding genome as- the expression x ðÞ 0  h  a  b , where p p p p sembly (FASTA format); domain calling (right panel) additionally requires HMM domain profiles (Pfam format). A pre-computed list of reference x ðiÞ¼ max bitðP ðM jM Þ P ðxjM Þ p t k k1 e k domains for each gene (proprietary format) can optionally be provided in x2X k¼iþ1 order to avoid brute force exhaustive search (4) P ðxjM Þ e k ¼ max log ðP ðM jM Þþ log 2 t 2 k k1 x2X P ðxÞ designed to analyse the modifications caused by AS in protein 0 k¼iþ1 domains in a non-redundant manner, i.e. considering each alterna- describes the score when aligning the suffix½ i; L of p against its tive transcript subsequence only once. Figure 1 resumes the main consensus sequence, and x ðÞ 0 consequently converges to the opti- steps of our pipeline that are described in detail in the subsequent mal score for an alignment of the entire profile p. Employing this sections: based on the genomic exon–intron coordinates of a tran- term, the size of region D in nucleotide space can be delimited by scriptome annotation, the first step (a) consists in iterating all genes the function shown in Equation (5). to determine the positions of complete AS events; in step (b), the al- gorithm computes the size of a domain-specific region D around ðÞ L  3 1 x ðÞ 0  h  a p p p p D ¼ þ  3 (5) each AS event, which is required in order to ensure that all optimal 3 b alignments of a profile p with the alternatively spliced transcript h is the Pfam minimum score that a subsequence must reach in subsequences are found; if this region D possibly overlaps adjacent p order to belong to a protein domain (Finn et al., 2014). AS events, step (c) merges the corresponding events (Algorithm 1), and step (d) removes redundancy amongst alternatively spliced sub- sequences of merged events by reorganising the corresponding var- 3.2 Merge as events iants (Algorithm 2); in step (e), the nucleic acid sequences of the As detailed in the previous section, the AstaFunk algorithm extends alternatively spliced variants are translated in silico to amino acid the event region ½source ; sink  to both sides by D when scanning e e p sequences, which subsequently in step (f) are aligned with the profile for a profile p to delineate the relevant sequences of AS events p of the protein domain (Supplementary Algorithm S2); step (g) e ¼ðsource ; sink ;V Þ. Note that intronic sequence stretches obvi- e e e selects from the set of possibly overlapping domain hits a subset of ously are to be disregarded in these D extensions because the do- disjoint alignments ranked by their score; eventually, step (h) out- main alignment is carried out at the level of amino acids translated puts all AS events in the regions of thus selected domains hits. from exons. However, the extension of an AS event e by D can overlap the boundaries of an adjacent event e . In this case, events e 2 1 3.1 Compute D and e are mutually overlapping by D and need to be merged in order 2 p Domains can completely or just partially overlap the region of an to consider their variants together when aligning profile p. Obviously, the problem of events overlapping by D can be generalized for n > 2 AS event. Therefore, also parts of the sequences flanking the event are to be considered when computing relevant alignments of a do- adjacent events he ; ... ; e i, ordered by their genomic positions 1 n main profile p ¼ðÞ L ; PðÞ AjZ ; PðÞ xjA , and we developed an algo- source  source in the directionality of transcription. p t e e e i iþ1 rithm that determines the minimal extension D necessary to find all Algorithm 1 identifies all sets of D –overlapping events from a sequence of AS events E, sorted by aforementioned pre-order score-optimal alignments of p. In a first consideration we determine a lower bound on D by the number of nucleotides that participate on their position in the genome. While iterating jEj in a double loop in an exact domain match overlapping the region of an AS event (i, j), the algorithm identifies runs of adjacent events he ; .. . ; e i i j1 3812 V.Coelho and M.Sammeth merged events e (line 7) are also not contained in the same variant Algorithm 1: MergeEvents of the merged event e (lines 8–10). The reason behind this re- 1;...;n quirement is that each transcript t 2 V in variant x describes an exon–intron pattern on source ; sink that differs from the pat- e e k k tern of transcripts t in another variant V , and since source ; sink is necessarily a subregion of the merged event e e k k (Algorithm 1), both transcripts t and t cannot share a common vari- i j ant in the merged event. Supplementary Figure S1 exemplifies the application of Algorithm 2 by merging the variant collections of two AS events e and e from the Troponin T type 1 gene (TNNT1). Starting with 1 2 an initial value of V ¼fTg, SplitVariants iterates all (t , t ) e1;2 i j from T¼ {NM_003283, NM_001126132, NM_001126133}. When identifying that t ¼ NM 003283 and t ¼ NM 00112132 are from 1 2 different variants in e , the algorithm adapts V ¼ffNM 003283; 1 e 1;2 NM 001126133g; fNM 001126132gg: Next, also NM_003283 2 1 2 V and NM_00112133 2 V are split, producing the final variant e e 2 2 collection V ¼ffNM 003283g; fNM 001126133g; fNM 00 1;2 1126132gg: A general proof for the correctness of Algorithm 2 is that mutually overlap by D . In the trivial case of j ¼ðÞ i þ 1 , the given in the Supplementary Material (Supplementary Proof S1). merged event e corresponds to e . Otherwise, e extends i;...;j1 i i;...;j1 from source to sink and combines the variant collections of all e e i j1 3.4 Align profiles the merged events fV ; .. . ;V g, which we reorganized by e e i j1 By the principle of non-redundancy in variant sets (Sammeth, 2009), SplitVariantsðÞ described in the next section (Algorithm 2). any transcript t in a variant x of event e describes the same splicing i k Obviously, the complexity of Algorithm 1 is linear w.r.t. jEj, and pattern, and therefore also the same exonic sequence, between the result E comprises all (potentially merged) events. source and sink . To this end the AstaFunk pipeline considers for e e k k each variant x the exon–intron structure of an arbitrarily selected 3.3 Split variants x transcript t 2 V when translating the genomic subsequence Algorithm 1 also shows that a merge operation modifies not only source ; sink to a sequence X of correspondingly encoded amino e e k k the boundaries of consecutive events he ; ... ; e i, but also their re- 1 n acids. spective variant collections. The non-redundant concept of variants Subsequently, X is aligned to p ¼ðÞ L ; PðÞ AjZ ; PðÞ xjA by com- p t e requires that each transcript t 2 V is represented by exactly one i puting the dynamic programming (DP) recursion of the Viterbi L jXjjAj variant x of each event 1  k  n (Sammeth, 2009). Consequently, Algorithm (Supplementary Algorithm S1) in the matrix R . procedure SplitVariantsðÞ employed by Algorithm 1 has to find a Each cell r 2 R represents an optimal sub-solution of the partial i;j non-redundant variant collection V that complies with the var- alignment between the i first positions of p and j symbols in the 1;...;n iants described by each of the merged events. prefix of X. In order to reduce the number of cells that are to be computed in R, the AstaFunk algorithm employs an exact branch- and-bound heuristic for excluding subsolutions that cannot lead to a Algorithm 2: SplitVariants relevant alignment of p (see Supplementary Algorithm S2). Following this heuristic, subsolutions r are considered feasible iff i;j r þ x ðÞ i  log ðÞ jXj  h (6) p 2 p i;j holds (‘branch’). Otherwise the partial alignment is not extended (‘bound’), and consequently all DP cells r that extend exclusively i;j A A A infeasible subsolutions r ; r ; r also represent infeasible i;j1 i1;j i1;j1 (sub-)solutions and limit of the search space. As introduced in Section 3.1, the score of the optimal suffix alignment x ðÞ i in Equation (6) is estimated by the scores of exclusively match state transitions in the suffix M ! M ; i  k < L . Threshold h k kþ1 p p then is employed as discriminator whether a significant alignment score can be reached from a subsolution r . i;j 3.5 Call domains All relevant alignments of a profile p with a variant sequence X that exceed threshold h are considered hits (h). However, the Viterbi Algorithm described in Supplementary Algorithm S2 possibly identi- fies multiple hits h 2 H of the same profile that mutually overlap p;X Algorithm 2 (SplitVariants) commences by collecting non- on X, especially in the case of repetitive domains that carry intrinsic redundantly all transcripts that describe variants in any of the symmetries. In order to globally quantify the changes in the domain merged events in set T. The algorithm then ensures that all pairs of landscape introduced by AS, we determine for each profile p a sub- transcripts t ; t 2 T that are from different variants in one of the set of compatible (i.e. non-overlapping) hits H H . i j p;X p;X AstaFunk 3813 PROSITE (a) SMART AFUNK HMMER (b) Claudin family 1.0 Stomach 0.8 Lung 0.6 0.4 0.2 0.0 Exon 1A Exon 1B Exon 2 Exon 3 Exon 4 137,720,000 137,725,000 137,730,000 137,735,000 137,740,000 137,745,000 137,750,000 Bs = 112.7 ... ENST00000343735 Bs = 106.9 ... ENST00000183605 Fig. 2. Case studies of the AstaFunk pipeline. (a) The schema shows the positions (x-axis) of reference annotations obtained from the PROSITE and the SMART database in comparison to domain predictions by the AstaFunk, respectively, by the HMMER3 program in the zinc finger protein 74 (transcript ZNF74-202, acces- sion identifier ENST00000611540). Annotations of the KRAB box domain (PF01352.22) are depicted by grey boxes, and annotations of the C2H2-zinc finger domains (PF00096.21) by white boxes. (b) The alternative first exons 1A and 1B of the CLDN18 gene (ENSG00000066405.8) are overlapping with a Claudin family protein domain. The bottom panel shows in black the score (Bs ¼ bit score) and the position of AstaFunk domain predictions relative to splicing patterns anno- tated by Ensembl reference transcripts (in grey). The top panel superimposes exon quantifications obtained from the GTEx Project to these exon–intron struc- tures (x-axis) and indicates a tissue selectivity for exon 1A in stomach, whereas exon 1B is preferentially included in lung. We computed for each exon the inclusion level relative to the highest exon expression level of a transcript (y-axis) To this end the AstaFunk algorithm performs a greedy selection (Finn et al., 2014), AstaFunk predicts in most cases the same number of hits to be included in a set H of non-overlapping hits between of domains as HMMER3, as can be expected when optimizing the p;X the profile p and the sequence X: all initial hits h 2 H are iterated same scoring function on the same input. To further investigate the p;X according to their alignment score ranking score(h), from high to marginal differences in the number of C2H2-zinc domains, we com- pared the locations of this domain in the protein ZNF74 to the do- low scores, and h is included in H iff the aligned interval of X p;X main reference annotations from the PROSITE (Sigrist et al., 2013) does not overlap with any previously selected hit h 2 H of the p;X same profile p. Note that this greedy selection does neither guaran- and from the SMART (Letunic et al., 2015) database, obtained from Ensembl (release 87). Our results indicate that the additional tee the number nor the sum of scores of accepted hits h 2 H to p;X domains predicted by our AstaFunk heuristics are reproducing well be maximal. However, in agreement with the underlying biology the locations annotated by the two reference databases (Fig. 2a). our greedy selection prioritises high-scoring alignments of a domain Next, we employed the AstaFunk pipeline to investigate TS AS as the most trustful predictions. events reported earlier in the RBFOX1 gene (Castle et al., 2008), in Eventually, each hit h 2 H has the properties: p;X the SLC25A3 gene (Wang et al., 2008), and in the CLDN18 gene i. (i) h represents a hit, thus a significant alignment of profile p to (Tureci et al., 2011). First, we ran the AstaFunk pipeline to system- the target sequence X provided by scoreðÞ h  log ðÞ jXj  h ; 2 p atically predict all protein domains that overlap AS events described ii. (ii) h is alternatively spliced, i.e. the genomic area between the by the Gencode (v19) annotation in these genes. Figure 2b, start/end positions of h intersects with½ source ; sink of at e e Supplementary Figures S3 and S4 show that the domain predictions least one AS event e; on the alternative transcripts of each gene vary in their length and in iii. (iii) h is the best alignment of p with the corresponding subse- their score (Supplementary Data S1). According to our results, the quence of X, i.e. there exists no other hit h 2 H with score p;X score of the Claudin domain in the CLDN18 gene decreases when ðÞ h > scoreðÞ h in the subsequence of X that aligns with h . including the downstream exon 1B (score 114.9) instead of the alter- native upstream exon 1A (score 120.7, shown in the bottom panel of Fig. 2b). The splicing of exon 14 in the RBFOX1 gene increases 4 Results the score of the predicted Fox-1C domain (score 155.9 and 174.9) 4.1 Case studies of alternatively spliced protein as compared with the splicing of the mutually exclusive exon 15 domains downstream (score 146.6 and 165.7, Supplementary Fig. S3). In Our AstaFunk method employs and branch-and-bound heuristic contrast, the mitochondrial carrier domain of SLC25A3 yields a when computing the Viterbi Dynamic Programming matrices slightly lower score (77.0) when including the upstream exon 3A in- (Section 3.4) but a greedy selection step to resolve overlapping align- stead of its mutually exclusive counterpart 3B further downstream ments of the same domain profile (Section 3.5). We therefore first (scoring 78.3 and 78.6, Supplementary Fig. S4). sought to assess whether AstaFunk correctly identifies domains in In order to further assess the earlier reported tissue specificity of protein sequences in a case study of Zinc-Finger ZnF proteins from AS in these three genes, we next superimposed exon quantifications the Kru ¨ ppel family of transcription factors (Resch et al., 2004). To (Supplementary Data S2) based on RNA-Seq data produced by the this end we compared the domains found by our method to comple- GTEx Project (Ardlie et al., 2015) to these alternatively spliced mentary predictions by the HMMER3.2b software (Eddy, 2011). domains. In agreement with previous studies related to the CLDN18 Supplementary Table S1 shows that using the set of 14 831 gene (Niimi et al., 2001; Tureci et al., 2011), these quantifications manually curated profiles from the Pfam-A database v.27 demonstrate a TS selectivity for the alternative inclusion of exon 1A, Relative Mean Exon count 3814 V.Coelho and M.Sammeth Table 1. Comparison of the characteristics of (a) the benchmarked transcriptomes, (b) the reduction of the search space and (c) running time improvement achieved by the AstaFunk method in the D. melanogaster, in the Caenorhabditis elegans and in three Homo sapiens transcriptome annotations (columns) WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 Note: All running times were estimated by the CPU time reported when monitoring the program execution on an Intel Xeon E5-4650 processor clocked at 2.7 GHz by the time command. carrying the domain alignment that agrees better with the Pfam ref- reduced to loci with multiple (i.e. >1) alternatively spliced PC tran- erence, in stomach (on average 1460.3 in exon 1A versus 44.4 reads scripts; from these, we further removed transcripts with a length of in exon 1B, solid line in the top panel of Fig. 2b). In contrast, lung the annotated ORF that is not multiple of three in the genomic align- samples preferentially include exon 1B (488.7 in exon 1B versus 0.4 ment, or with an in-frame stop codon in the in silico translation of reads in exon 1A, dashed line), which yields a slightly worse align- the genomic sequence; then we scanned the longest ORF of each ment score with the Claudin domain profile (bottom panel of locus heuristically for significant Pfam-A profile hits (Eddy, 2011) Fig. 2b). and disconsidered loci without any domain hit. For all the remaining Similarly, our results also confirm well the previously reported loci we stored the domains that produced hit(s) in order to reduce tissue preferences of RBFOX1, where exon 14 preferentially spliced the domain space that is to be explored in the subsequent bench- in brain and exon 15 preferentially spliced in skeletal muscle (Castle mark (Supplementary Table S4). et al., 2008), and of SLC25A3, where exon 3A preferentially spliced Table 1a summarizes the attributes of thus obtained ‘PCAS loci’ in muscle tissues whereas exon 3B is preferred in testis and liver with  2 PC transcripts remaining after the previous filtering pro- (Wang et al., 2008). The AstaFunk domain prediction scores of cess, the corresponding sum of alternative PCAS ORFs and PCAS these case studies are summarized in Supplementary Data S1, and transcripts they produce, as well as the total number of ‘PCAS even- the reference read counts in each exon in Supplementary Data S2 ts’ collected from their splicing patterns (Foissac and Sammeth, (Supplementary Figs S3 and S4). 2007). Row ‘PCAS events per gene’ then lazily computes the fraction of (PCAS events)/(PCAS loci). Our results in Table 1b show a grad- ually increasing reduction of the search space in terms of number of 4.2 Evaluation of the AstaFunk time complexity amino acids that are explored by the AstaFunk method when com- A formal estimation of the AstaFunk algorithmic time complexity is paring WormBase to FlyBase and with the human RefSeq annota- not trivial since many algorithmic steps depend on specific attributes tion (44%, 49% and 53%), about proportional to the number of PC of the analyzed transcriptome and/or of a specific gene, for instance AS events per gene in each dataset (1.7, 1.9 and 2.3). This reduction the number and mutual distance of AS events, the type and the num- of the search space achieved by delimiting domain scans to the ber of domains found in the longest ORF of a gene, the agreement regions of AS events is also reflected by the speedup of the total run- between the variant sequences and a domain profile during branch- and-bound DP, and the number and mutual overlap of domains pre- time ( 43%timee Table 1c), although the overall gain in perform- dicted in variant sequences. We therefore based our evaluation of ance varies with specific attributes of each transcriptome; the the time complexity on empiric results, comparing the performance improvement obtained by AS independent branch-and-bound heu- of the AstaFunk method with the naı ¨ve approach of exhaustively ristics shows expectedly a rather constant performance comparing scanning all alternative transcript sequences with the corresponding the evaluated transcriptomes ( 25% on average, Supplementary domain profiles. Table S5), which does not affect the reported speedup since we For our benchmark, we employed the following genome assem- employed in our benchmark branch-and-bound heuristics for both, exhaustive as well as AstaFunk domain searches. blies and transcriptome annotations downloaded from the UCSC However, our evaluation also demonstrates that the speedup of Genome Browser (Kent et al., 2002): FlyBase Genes (Crosby et al., the AstaFunk method decreases when the AS diversity annotated in 2007; Smith et al., 2007) aligned to the dm3 genome of Drosophila a transcriptome further increases, from 47% in the human RefSeq, melanogaster, WormBase Genes (Harris et al., 2003) aligned to the to 30% in the human UCSC and 26% in the human Gencode anno- ce3 genome of Caenorhabditis elegans, and RefSeq transcripts tation (Table 1c). This reduction in efficiency is apparently caused (Pruitt et al., 2014), UCSC Genes (Rosenbloom et al., 2015), as well as the Gencode Comprehensive Gene Collection v19 (Harrow et al., by the overhead of additional operations that are necessary for 2012) aligned to the hg19 genome of Homo sapiens. After obtaining merging AS events in transcriptomes with high AS event densities these input annotations from the corresponding website (3.4 and 6 AS events per PC locus, Table 1a). Indeed, when consid- (Supplementary Table S2), we preprocessed the transcripts as fol- ering the proportion of AS events that have to be merged because lows (Supplementary Table S3): the total number of PC loci was first they overlap by D regions, we observe that RefSeq comprises the p AstaFunk 3815 comparatively highest proportion of loci with few (0–10%) merged (a) events, whereas the UCSC and the Gencode annotation show less such loci. Vice versa, the proportion of loci with most of their AS events being merged follows the ranking Gencode > UCSC > RefSeq (Supplementary Fig. S5). 4.3 The impact of alternative splicing on protein domains We then employed our results from Section 4.2 in order to estimate the transcriptome-wide impact of AS on protein domains. Assuming the prediction scores to be a proxy for the molecular activity of domains, we analyzed the impact of AS in the Gencode annotation, the transcriptome with the highest AS activity in our benchmark Domain Score Conservation (Table 1a). For our study, we clustered all AstaFunk predictions of (b) the same domain that overlap by genomic coordinates, and under the hypothesis of purifying selection we assumed the highest scoring prediction to represent the wild-type of the domain. We then com- puted for each AS variant of the domain a corresponding ‘domain conservation’ coefficient by the fraction between the domain score assigned to the alternatively spliced domain and the wild-type score. Figure 3a shows that most scores assigned by AstaFunk to alter- natively spliced domains preserve the wild-type score well, which implies that also changes to the corresponding domainso functional- ities are likely limited. According to our analysis, domains that show AS variants with low conservation scores tend to be longer and over- all less frequent in the proteome (Fig. 3b). To further investigate an alternatively spliced domain with poor prediction score, we selected Domain Score Conservation the TNNT1 gene (Supplementary Fig. S1) that harbours in total four different predictions of the troponin domain (Supplementary Fig. 3. Evaluation of the AS impact on protein domains in the human Gencode transcriptome. (a) Most of the domain predictions that differ due to Fig. S6). If we assume the best prediction with a score of 170.5 to AS changes to the coding sequence preserve the wild-type score well, i.e. represent the wild-type, AS creates two minor modifications of the most alternative scores fall into the [0.9; 1] interval of domain conservation. domain, one by an alternative acceptor choice in exon 12 (conserva- (b) Domains with AS variants that exhibit very low scores (domain conserva- tion 98.9%) and another by a postponed start codon triggered by tion <0.1 of the wild-type) are on average longer (white boxes, left y-axis) and AS in the 5 -UTR (conservation 88.7%); however, a major modifica- less frequent in the transcriptome (grey boxes, right y-axis) tion can be caused by the skipping of exon 8 that leads to the loss of a considerable portion of amino acids in the C-terminal end of the required calculations can dominate the running time when the dens- encoded troponin domain (conservation 27.5%). The phenotype of ity of AS events is as elevated as in the UCSC and the Gencode anno- individuals that exhibit the latter splicing anomaly has been tation of the human transcriptome (Table 1). Nevertheless, in all of characterized as ‘nemaline myopathy’ (van der Pol et al., 2014). the transcriptomes we investigated, AstaFunk reduces the CPU time needed for computing the genome-wide set of domains in AS events by > 1/4 of the time consumed by an exhaustive search approach. 5 Discussion Since we employed for our benchmark only a heuristically deter- We developed AstaFunk, a computational pipeline that links AS mined subset of domain profiles, i.e. domains occurring in the lon- events to protein domains and identifies alternatively spliced gest ORF of each gene, the speedup that we measure here in hours domains without requiring a cumbersome manual overlap between extrapolates to a reduction of days or weeks when performing a full AS events (in nucleotide space) and protein domains (in amino acid exhaustive search of all Pfam domains in each gene. space). To this end, our method combines previously published As a proof of principle, our results show that superimposing methods for the automated predicition of AS events (Foissac and RNAseq-based exon quantifications from the GTEx project to the Sammeth, 2007) with in silico predictions of protein domains (Eddy, AstaFunk domain predictions confirms well previously established 2011). To boost efficiency, our approach reduces the search space knowledge on tissue-selective preferences of mutually exclusive primarily by omitting redundant sequence scans of identical (i.e. exons in three in-depth studied genes. Since our AstaFunk protocol constitutively spliced) parts that are intrinsic to alternatively spliced is not restricted to a specific database of gene models or AS events, it transcripts, considering each unique splicing pattern (i.e. variant) of also provides the potential to automatically and exhaustively detect an AS event exactly once. AS changes to protein domains implied by generic RNA-Seq results However, some additional operations are necessary to detect in combination with ab initio and/or de novo gene predictors. But whether two neighbouring AS events overlap by a profile-specific D p also when employing AS events from reference gene annotations, region (Algorithm 1), and in such cases to merge the corresponding the AstaFunk pipeline could provide a unified platform for produc- variants of neighbouring events (Algorithm 2). Our benchmark ing comparable results on the impact of transcriptional modifica- shows that the computational overhead for these additional opera- tions on protein domains. tions is rather negligible in transcriptomes with modest AS activity, If we believe that a comparison of the scores computed based on i.e. in the WormBase, FlyBase and the RefSeq collection, but the the Pfam models allows us to make conclusions about the molecular [0;0.1[ [0.1;0.2[ [0.2;0.3[ [0.3;0.4[ [0.4;0.5[ [0.5;0.6[ [0.6;0.7[ [0.7;0.8[ [0.8;0.9[ [0;0.1[ [0.9;1] [0.1;0.2[ [0.2;0.3[ [0.3;0.4[ [0.4;0.5[ [0.5;0.6[ [0.6;0.7[ [0.7;0.8[ [0.8;0.9[ [0.9;1] Domain Length [aa] Density 0 100 300 500 700 900 0 0.05 0.1 0.15 0.2 0.25 0.3 0 50 100 150 200 250 300 350 Number of Predictions 3816 V.Coelho and M.Sammeth Harrow,J. et al. (2012) Gencode: the reference human genome annotation for activity of a domain, then the predictions produced by AstaFunk en- the encode project. Genome Res., 22, 1760–1774. able us to suggest that most AS events impact the domains they affect Hegyi,H. et al. (2011) Verification of alternative splicing variants based on do- only marginally. These observations agree well with complementary main integrity, truncation length and intrinsic protein disorder. Nucleic proteomics studies (Tress et al.,2017), that conclude that the impact Acids Res., 39, 1208–1219. of AS in the proteome is likely less severe than the plethora of splicing Kaessmann,H. et al. (2002) Signatures of domain shuffling in the human gen- variants found in transcriptome interrogations would suggest (Djebali ome. Genome Res., 12, 1642–1650. et al.,2012). However, some of the score changes predicted by Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., AstaFunk on AS configurations of a domain indicate that the conse- 12, 996–1006. quences on the protein functionality could be non-negligible. One Keren,H. et al. (2010) Alternative splicing and evolution: diversification, exon such example is the skipping of exonic sequences in an AS variant of definition and function. Nat. Rev. 11, 345–355. the TNNT1 gene that leads to the predicted loss of the C-terminal Lappalainen,T. et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501, 506–511. part of its essential troponin domain and was linked to a pathogen Letunic,I. et al. (2015) Smart: recent updates, new developments and status in phenotype described as nemaline myopathy. Although certainly only 2015. Nucleic Acids Res., 43, D257–D260. a minority of individuals suffer this AS-triggered domain malfunction, Light,S. and Elofsson,A. (2013) The impact of splicing on protein domain the example demonstrates the exploratory power provided by our architecture. Curr. Opin. Struct. Biol., 23, 451–458. AstaFunk pipeline to detect potential consequences of AS on protein Liu,S. and Altman,R.B. (2003) Large scale study of protein domain distribu- domains, which later on can be further explored by complementary tion in the context of alternative splicing. Nucleic Acids Res., 31, genetic or proteomic studies. 4828–4835. Moore,A.D. et al. (2008) Arrangements in the modular evolution of proteins. Trends Biochem. Sci., 33, 444–451. Funding Niimi,T. et al. (2001) claudin-18, a novel downstream target gene for the This work was supported by the National Counsel of Technological t/ebp/nkx2.1 homeodomain transcription factor, encodes lung- and stomach-specific isoforms through alternative splicing. Mol. Cell. Biol., 21, and Scientific Development [132455/2013-7 and 140941/2015-0 to 7380–7390. V.C. 310132/2015-0 to M.S.]; and the Research Support Oltean,S. and Bates,D.O. (2014) Hallmarks of alternative splicing in cancer. Foundation of the State of Rio de Janeiro [E_06/2015 to M.S.]. Oncogene, 33, 5311–5318. Conflict of Interest: none declared. van der Pol,W.L. et al. (2014) Nemaline myopathy caused by tnnt1 mutations in a Dutch pedigree. Mol. Genet. Genomic Med., 2, 134–137. Pruitt,K.D. et al. (2014) Refseq: an update on mammalian reference sequences. References Nucleic Acids Res., 42, D756–D763. Resch,A. et al. (2004) Assessing the impact of alternative splicing on domain Black,D. (2000) Protein diversity from alternative splicing: a challenge for bio- interactions in the human proteome. J. Proteome Res., 3, 76–83. informatics and post-genome biology. Cell, 103, 367–370. Rosenbloom,K.R. et al. (2015) The ucsc genome browser database: 2015 up- Buljan,M. et al. (2012) Tissue-specific splicing of disordered segments that date. Nucleic Acids Res., 43, D670–D681. embed binding motifs rewires protein interaction networks. Mol. Cell, 46, Sammeth,M. (2009) Complete alternative splicing events are bubbles in splic- 871–883. ing graphs. J. Comput. Biol., 16, 1117–1140. Castle,J.C. et al. (2008) Expression of 24, 426 human alternative splicing Sammeth,M. et al. (2008) A general definition and nomenclature for alterna- events and predicted cis regulation in 48 tissues and cell lines. Nat. Genet., tive splicing events. PLoS Comput. Biol., 4, e1000147. 40, 1416–1425. Sigrist,C.J.A. et al. (2013) New and continuing developments at prosite. Chothia,C. et al. (2003) Evolution of the protein repertoire. Science, 300, Nucleic Acids Res., 41, D344–D347. 1701–1703. Smith,C.D. et al. (2007) The release 5.1 annotation of Drosophila mela- Crosby,M.A. et al. (2007) Flybase: genomes by the dozen. Nucleic Acids Res., nogaster heterochromatin. Science, 316, 1586–1591. 35, D486–D491. Tazi,J. et al. (2009) Alternative splicing and disease. Biochim. Biophys. Acta, Djebali,S. et al. (2012) Landscape of transcription in human cells. Nature, 1792, 14–26. 489, 101–108. Tress,M.L. et al. (2007) The implications of alternative splicing in the encode Durbin,R. et al. (1998). Biological Sequence Analysis: Probabilistic Models of protein complement. PNAS, 104, 5495–5500. Proteins Nucleic Acids. 1st edn. Cambridge University Press. Tress,M.L. et al. (2017) Alternative splicing may not be the key to proteome Eddy,S.R. (2011) Accelerated profile hmm searches. PLoS Comput. Biol., 7, complexity. Trends Biochem. Sci., 42, 98–110. e1002195–e1002116. Tureci,O. et al. (2011) Claudin-18 gene structure, regulation, and expression Finn,R. et al. (2014) Pfam: the protein families database. Nucleic Acids Res., is evolutionary conserved in mammals. Gene, 481, 83–92. 42, D222–D230. Vibranovski,M.D. et al. (2005) Signs of ancient and modern exon-shuffling Foissac,S. and Sammeth,M. (2007) Astalavista: dynamic and flexible analysis are correlated to the distribution of ancient and modern domains along pro- of alternative splicing events in custom gene datasets. Nucleic Acids Res., 35, w297–w299. teins. J. Mol. Evol., 61, 341–350. Ardlie,K.G. et al. (2015) The genotype-tissue expression (gtex) pilot analysis: Wang,E.T. et al. (2008) Alternative isoform regulation in human tissue tran- multitissue gene regulation in humans. Science, 348, 648–660. scriptomes. Nature, 456, 470–476. Harris,T.W. et al. (2003) Wormbase: a cross-species database for comparative Wetlaufer,D.B. (1973) Nucleation, rapid folding, and globular intrachain genomics. Nucleic Acids Res., 31, 133–137. regions in proteins. Proc. Natl Acad. Sci. USA, 70, 697–701. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

An automated method for detecting alternatively spliced protein domains

Bioinformatics , Volume 34 (22): 8 – Jun 1, 2018

Loading next page...
 
/lp/ou_press/an-automated-method-for-detecting-alternatively-spliced-protein-0IYy7aFJL2

References (51)

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bty425
Publisher site
See Article on Publisher Site

Abstract

Motivation: Alternative splicing (AS) has been demonstrated to play a role in shaping eukaryotic gene diversity at the transcriptional level. However, the impact of AS on the proteome is still con- troversial. Studies that seek to explore the effect of AS at the proteomic level are hampered by technical difficulties in the cumbersome process of casting forth and back between genome, tran- scriptome and proteome space coordinates, and the naı ¨ve prediction of protein domains in the presence of AS suffers many redundant sequence scans that emerge from constitutively spliced regions that are shared between alternative products of a gene. Results: We developed the AstaFunk pipeline that computes for every generic transcriptome all domains that are altered by AS events in a systematic and efficient manner. In a nutshell, our method employs Viterbi dynamic programming, which guarantees to find all score-optimal hits of the domains under consideration, while complementary optimizations at different levels avoid re- dundant and other irrelevant computations. We evaluate AstaFunk qualitatively and quantitatively using RNAseq in well-studied genes with AS, and on large-scale employing entire transcriptomes. Our study confirms complementary reports that the effect of most AS events on the proteome seems to be rather limited, but our results also pinpoint several cases where AS could have a major impact on the function of a protein domain. Availability and implementation: The JAVA implementation of AstaFunk is available as an open source project on http://astafunk.sammeth.net. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction (Djebali et al., 2012; Wang et al., 2008), albeit the observed differ- Alternative splicing (AS) has already early been demonstrated to be ences are usually subordinate to gene expression. In specific cases, capable of alternating the proteins produced from specific messenger AS has also been demonstrated to severely compromise the function- transcripts, for instance in genes that encode receptors, cell-adhesion ality of genes, and aberrant AS variants have been linked to several molecules and membrane channels [reviewed by Black (2000)]. physiological diseases (Tazi et al., 2009), as well as to cancer pro- Quantitative assays based on microarrays and then on high- gression and metastasis (Oltean and Bates, 2014). However, there is throughput sequencing of the cellular RNA complement (RNA-Seq) to date still relatively few evidence that allows to conclude on an further showed AS to be an ubiquitous process that plays a role in equally large impact of AS in the proteome as has been reported in fine-tuning the eukaryotic transcriptome diversity according to spe- the transcriptome (Tress et al., 2017), underlining the importance of cific genotypes (Lappalainen et al., 2013), cell types and cell states automated methods that help to reduce the size of this evident gap V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected] 3809 3810 V.Coelho and M.Sammeth by a systematic analysis of the nucleic acid diversity in the context develop a method that enables systematic studies on the effect of AS of the encoded proteins. on protein domains and avoids unnecessary sequence searches. In structural biology, the term protein ‘domain’ was initially coined for regions or fragments of globular proteins that are able to 2 System and methods fold independently (Wetlaufer, 1973). This definition was later extended to a concept of domains as the basic structural, functional Herein we employ the generic Astalavista notation that delimits an and/or evolutionary units of proteins (Chothia et al., 2003; Moore AS event e ¼ðsource ; sink ;V Þ by the first/last common splice site e e e et al., 2008). Protein domains are harboured by one or multiple (‘source ’ and “sink ”) and specifies the collection of variants e e 1 d i exons (Kaessmann et al., 2002; Moore et al., 2008; Vibranovski V ¼fV ; .. . ; V g, where each variant V describes a specific exon- e e e et al., 2005), and therefore the functionality of a protein can be intron pattern (i.e. a sequence of splice sites) between source and regulated by AS by completely or partially removing/inserting some sink supported by a set of jV j 1 transcripts (Sammeth et al., protein-coding (PC) exons (Light and Elofsson, 2013; Keren et al., 2008). Collapsing transcripts with identical local splicing patterns 2010; Tress et al., 2007). into variants avoids redundancy introduced by different alternative The objective of investigating proteome changes caused by AS transcripts that locally employ the same exons and introns of an has not come up only recently. However, driven by the lack of an event. Supplementary Figure S1 shows an example of applying this universal method, previous studies on AS triggered protein modifica- notation to AS events in the TNNT1 gene. Note that both events tions employ substantially variable in-house data analysis pipelines shown in Supplementary Figure S1 have a variant collection with only that rely on heterogeneous sources (i.e. different databases of AS two variants jV j¼jV j¼ 2 that group the three alternative tran- e e 1 2 events and of protein annotations) and on specific ad hoc decisions scripts of the locus by common local splicing patterns. Dimension d for the filtering thresholds applied, which makes the corresponding specifies the number of different variants in e, and ‘alternative’ splic- results difficult to compare. For instance, Liu and Altman (2003) ing events naturally involve d  2 variants. Throughout this work we and Resch et al. (2004) conducted early transcriptome-wide studies are exclusively considering ‘complete’ AS events that comprise all d on AS in human protein domains. The study of Liu and Altman variants between source and sink (Sammeth, 2009). e e (2003) employed gene loci reviewed by the NCBI with more than The amino acid sequence of domains with the same function can one Reference Sequence (RefSeq) transcript, linked them to entries vary in different proteins of a protein family, therefore the represen- of the conserved domains database (CDD) and considered domains tative consensus patterns of a protein domain is often modelled by a that were not present in all splice variants to be ‘spliced-out’. Resch statistical profiles. The Pfam database (Finn et al., 2014) produces et al. (2004) in contrast identified a representative ‘major transcript’ sequence alignments of all proteins from the same family and subse- for each gene that maximized the number of mRNA and EST sup- quently profiles the attributes of each domain by condensing its se- port, and then determined the longest open reading frame in each quence conservation in a hidden Markov model (HMM). candidate transcript. Subsequently, they aligned the major–minor p ¼ðÞ L ; PðÞ AjZ ; PðÞ xjA ; p t e protein pairs and recorded the amino acid intervals that are removed (1) or replaced. The overlap between these alternatively spliced regions x 2X; Z ! A; A; Z 2A and annotated protein domains then allowed to identify domains HMM domain profiles p of length L are described by all Z ! A that were affected by AS. transition probabilities PðÞ AjZ between different states A; Z 2A of Later-on, Hegyi et al. (2011) described a large-scale study of AS the model, as well as by the (additional) emission probabilities in protein domains, employing major and minor alternative protein PðÞ xjA when aligning A 2fD ; I ; M g at position 1  i  L of e i i i p forms according to the underlying EST support as defined in the the model to an amino acid x 2X from the amino acid alphabet X . Swissprot database. The authors then investigated domain modifica- Latter emission probabilities are employed to evaluate the likelihood tions by the minor form by considering all major–minor transcript of p matching the symbols x 2X of a given peptide sequence X. pairs with AS in the region of domains annotated by the Protein Note that some state transitions Z ! A are not productive, for in- Family (Pfam) database. Whereas Buljan et al. (2012) analyzed how stance PðÞ D jI ¼ 0 and vice versa PðÞ I jD ¼ 0 to enforce at t i i1 t i i1 often amino acid sequences encoded by tissue-specific (TS) cassette least one mis-/matching symbol between insertions and deletions exons, non-TS cassette exons and constitutive exons are overlapping (Supplementary Algorithm S1). the Pfam protein domains; cassette exons and tissue specificity in Supplementary Figure S2 provides a more detailed description of this study were obtained by overlapping exons supported by at this so-called ‘Plan 7 architecture’ of Pfam HMMs as proposed by least two mature Ensembl (release 54) transcripts with RNAseq Durbin et al. (1998), which distinguishes in total 10 different states expression data from Wang et al. (2008); exons present in all anno- A¼ fB; C; D ; E; I ; J; M ; N; S; Tg, of which we employ in the fol- i i i tated transcripts were consequently assumed to be constitutive. lowing mainly the beginning/end state {B, E} of a domain alignment Subsequently, each exon was associated with the longest annotated and the alignment states fD ; I ; M g (deletion, insertion and mis-/ i i i ORF supporting it. match states). In contrast to Durbin et al. (1998), the model Summarizing the methodology behind these previous we apply herein prohibits partial (‘local’) alignments of p by B 6! M approaches, most approaches identified based on some reference i 8i > 1 and M 6! E8i < L . However, multiple local alignment(s) gene annotation one specific reference transcript per gene, the so- i p of the entire model p within X are still possible, by iterating the called ‘o-call transcript. This reference transcript then was usually interleaving amino acids x 2 X in the states {C, J, N} as described compared to all ‘llpare products by sequence alignments of the j by Durbin et al. (1998) (Supplementary Fig. S2). amino acid or mRNA sequences. Such alignments are computation- ally expensive and often partially redundant due to the presence of many identical subsequences from constitutively included exons. 3 System Furthermore, alignments of mature transcripts or protein sequences do not provide information about the location of introns or splicing We then developed AstaFunk (Alternative Splicing Transcriptome patterns. In order to overcome these shortcomings we sought to Annotation with Functional Knowledge), a computational pipeline AstaFunk 3811 [source; sink] by just a single nucleotide,ðÞ L  3  1 . This region then can be further extended by inserting gaps, as long as the align- ment score does not fall below the domain-specific gathering thresh- old h for a domain hit provided by the Pfam model. We further obtain for each model p ¼(L , P (AjZ), P (xjA)) the globally lowest p t e gap-opening penalty a considering transitions PðÞ I jM p t i a ¼ max bit PðÞ I jM PðÞ xjI ; (2) p t i e i 1  i  L ;x2X and correspondingly the lowest gap-extension penalty b consider- ing transitions PðÞ I jI , over all 1  i  L and emissions t i p PðÞ xjI ; x 2X e i b ¼ max bit PðÞ I jI  PðÞ xjI : (3) p t i e i 1  i  L ;x2X Note that by definition of the Pfam models (Eddy, 2011), gap- opening transitions have to be preceded by match states M ! I (Section 2.2), and that the ‘lowest penalty’ corresponds to the ‘highest bit score’ bitðÞ maximized by Equations (2) and (3). As bit scores of gap-opening and gap-extension probabilities are conse- quently negative throughout Pfam profiles p, we employ the abso- lute value of bitðÞ as a penalty measure for a and b . Fig. 1 Overview of the main steps in the AstaFunk workflow. Algorithmic We can compute the maximum number of amino acids that can steps are grey boxes labelled by letters in the order of their execution, from be introduced in the alignment of a perfect domain match x ðÞ 0 by (a)to(h). Calling AS events (left panel) requires as input a transcriptome an- notation (GTF format) and the sequence of the corresponding genome as- the expression x ðÞ 0  h  a  b , where p p p p sembly (FASTA format); domain calling (right panel) additionally requires HMM domain profiles (Pfam format). A pre-computed list of reference x ðiÞ¼ max bitðP ðM jM Þ P ðxjM Þ p t k k1 e k domains for each gene (proprietary format) can optionally be provided in x2X k¼iþ1 order to avoid brute force exhaustive search (4) P ðxjM Þ e k ¼ max log ðP ðM jM Þþ log 2 t 2 k k1 x2X P ðxÞ designed to analyse the modifications caused by AS in protein 0 k¼iþ1 domains in a non-redundant manner, i.e. considering each alterna- describes the score when aligning the suffix½ i; L of p against its tive transcript subsequence only once. Figure 1 resumes the main consensus sequence, and x ðÞ 0 consequently converges to the opti- steps of our pipeline that are described in detail in the subsequent mal score for an alignment of the entire profile p. Employing this sections: based on the genomic exon–intron coordinates of a tran- term, the size of region D in nucleotide space can be delimited by scriptome annotation, the first step (a) consists in iterating all genes the function shown in Equation (5). to determine the positions of complete AS events; in step (b), the al- gorithm computes the size of a domain-specific region D around ðÞ L  3 1 x ðÞ 0  h  a p p p p D ¼ þ  3 (5) each AS event, which is required in order to ensure that all optimal 3 b alignments of a profile p with the alternatively spliced transcript h is the Pfam minimum score that a subsequence must reach in subsequences are found; if this region D possibly overlaps adjacent p order to belong to a protein domain (Finn et al., 2014). AS events, step (c) merges the corresponding events (Algorithm 1), and step (d) removes redundancy amongst alternatively spliced sub- sequences of merged events by reorganising the corresponding var- 3.2 Merge as events iants (Algorithm 2); in step (e), the nucleic acid sequences of the As detailed in the previous section, the AstaFunk algorithm extends alternatively spliced variants are translated in silico to amino acid the event region ½source ; sink  to both sides by D when scanning e e p sequences, which subsequently in step (f) are aligned with the profile for a profile p to delineate the relevant sequences of AS events p of the protein domain (Supplementary Algorithm S2); step (g) e ¼ðsource ; sink ;V Þ. Note that intronic sequence stretches obvi- e e e selects from the set of possibly overlapping domain hits a subset of ously are to be disregarded in these D extensions because the do- disjoint alignments ranked by their score; eventually, step (h) out- main alignment is carried out at the level of amino acids translated puts all AS events in the regions of thus selected domains hits. from exons. However, the extension of an AS event e by D can overlap the boundaries of an adjacent event e . In this case, events e 2 1 3.1 Compute D and e are mutually overlapping by D and need to be merged in order 2 p Domains can completely or just partially overlap the region of an to consider their variants together when aligning profile p. Obviously, the problem of events overlapping by D can be generalized for n > 2 AS event. Therefore, also parts of the sequences flanking the event are to be considered when computing relevant alignments of a do- adjacent events he ; ... ; e i, ordered by their genomic positions 1 n main profile p ¼ðÞ L ; PðÞ AjZ ; PðÞ xjA , and we developed an algo- source  source in the directionality of transcription. p t e e e i iþ1 rithm that determines the minimal extension D necessary to find all Algorithm 1 identifies all sets of D –overlapping events from a sequence of AS events E, sorted by aforementioned pre-order score-optimal alignments of p. In a first consideration we determine a lower bound on D by the number of nucleotides that participate on their position in the genome. While iterating jEj in a double loop in an exact domain match overlapping the region of an AS event (i, j), the algorithm identifies runs of adjacent events he ; .. . ; e i i j1 3812 V.Coelho and M.Sammeth merged events e (line 7) are also not contained in the same variant Algorithm 1: MergeEvents of the merged event e (lines 8–10). The reason behind this re- 1;...;n quirement is that each transcript t 2 V in variant x describes an exon–intron pattern on source ; sink that differs from the pat- e e k k tern of transcripts t in another variant V , and since source ; sink is necessarily a subregion of the merged event e e k k (Algorithm 1), both transcripts t and t cannot share a common vari- i j ant in the merged event. Supplementary Figure S1 exemplifies the application of Algorithm 2 by merging the variant collections of two AS events e and e from the Troponin T type 1 gene (TNNT1). Starting with 1 2 an initial value of V ¼fTg, SplitVariants iterates all (t , t ) e1;2 i j from T¼ {NM_003283, NM_001126132, NM_001126133}. When identifying that t ¼ NM 003283 and t ¼ NM 00112132 are from 1 2 different variants in e , the algorithm adapts V ¼ffNM 003283; 1 e 1;2 NM 001126133g; fNM 001126132gg: Next, also NM_003283 2 1 2 V and NM_00112133 2 V are split, producing the final variant e e 2 2 collection V ¼ffNM 003283g; fNM 001126133g; fNM 00 1;2 1126132gg: A general proof for the correctness of Algorithm 2 is that mutually overlap by D . In the trivial case of j ¼ðÞ i þ 1 , the given in the Supplementary Material (Supplementary Proof S1). merged event e corresponds to e . Otherwise, e extends i;...;j1 i i;...;j1 from source to sink and combines the variant collections of all e e i j1 3.4 Align profiles the merged events fV ; .. . ;V g, which we reorganized by e e i j1 By the principle of non-redundancy in variant sets (Sammeth, 2009), SplitVariantsðÞ described in the next section (Algorithm 2). any transcript t in a variant x of event e describes the same splicing i k Obviously, the complexity of Algorithm 1 is linear w.r.t. jEj, and pattern, and therefore also the same exonic sequence, between the result E comprises all (potentially merged) events. source and sink . To this end the AstaFunk pipeline considers for e e k k each variant x the exon–intron structure of an arbitrarily selected 3.3 Split variants x transcript t 2 V when translating the genomic subsequence Algorithm 1 also shows that a merge operation modifies not only source ; sink to a sequence X of correspondingly encoded amino e e k k the boundaries of consecutive events he ; ... ; e i, but also their re- 1 n acids. spective variant collections. The non-redundant concept of variants Subsequently, X is aligned to p ¼ðÞ L ; PðÞ AjZ ; PðÞ xjA by com- p t e requires that each transcript t 2 V is represented by exactly one i puting the dynamic programming (DP) recursion of the Viterbi L jXjjAj variant x of each event 1  k  n (Sammeth, 2009). Consequently, Algorithm (Supplementary Algorithm S1) in the matrix R . procedure SplitVariantsðÞ employed by Algorithm 1 has to find a Each cell r 2 R represents an optimal sub-solution of the partial i;j non-redundant variant collection V that complies with the var- alignment between the i first positions of p and j symbols in the 1;...;n iants described by each of the merged events. prefix of X. In order to reduce the number of cells that are to be computed in R, the AstaFunk algorithm employs an exact branch- and-bound heuristic for excluding subsolutions that cannot lead to a Algorithm 2: SplitVariants relevant alignment of p (see Supplementary Algorithm S2). Following this heuristic, subsolutions r are considered feasible iff i;j r þ x ðÞ i  log ðÞ jXj  h (6) p 2 p i;j holds (‘branch’). Otherwise the partial alignment is not extended (‘bound’), and consequently all DP cells r that extend exclusively i;j A A A infeasible subsolutions r ; r ; r also represent infeasible i;j1 i1;j i1;j1 (sub-)solutions and limit of the search space. As introduced in Section 3.1, the score of the optimal suffix alignment x ðÞ i in Equation (6) is estimated by the scores of exclusively match state transitions in the suffix M ! M ; i  k < L . Threshold h k kþ1 p p then is employed as discriminator whether a significant alignment score can be reached from a subsolution r . i;j 3.5 Call domains All relevant alignments of a profile p with a variant sequence X that exceed threshold h are considered hits (h). However, the Viterbi Algorithm described in Supplementary Algorithm S2 possibly identi- fies multiple hits h 2 H of the same profile that mutually overlap p;X Algorithm 2 (SplitVariants) commences by collecting non- on X, especially in the case of repetitive domains that carry intrinsic redundantly all transcripts that describe variants in any of the symmetries. In order to globally quantify the changes in the domain merged events in set T. The algorithm then ensures that all pairs of landscape introduced by AS, we determine for each profile p a sub- transcripts t ; t 2 T that are from different variants in one of the set of compatible (i.e. non-overlapping) hits H H . i j p;X p;X AstaFunk 3813 PROSITE (a) SMART AFUNK HMMER (b) Claudin family 1.0 Stomach 0.8 Lung 0.6 0.4 0.2 0.0 Exon 1A Exon 1B Exon 2 Exon 3 Exon 4 137,720,000 137,725,000 137,730,000 137,735,000 137,740,000 137,745,000 137,750,000 Bs = 112.7 ... ENST00000343735 Bs = 106.9 ... ENST00000183605 Fig. 2. Case studies of the AstaFunk pipeline. (a) The schema shows the positions (x-axis) of reference annotations obtained from the PROSITE and the SMART database in comparison to domain predictions by the AstaFunk, respectively, by the HMMER3 program in the zinc finger protein 74 (transcript ZNF74-202, acces- sion identifier ENST00000611540). Annotations of the KRAB box domain (PF01352.22) are depicted by grey boxes, and annotations of the C2H2-zinc finger domains (PF00096.21) by white boxes. (b) The alternative first exons 1A and 1B of the CLDN18 gene (ENSG00000066405.8) are overlapping with a Claudin family protein domain. The bottom panel shows in black the score (Bs ¼ bit score) and the position of AstaFunk domain predictions relative to splicing patterns anno- tated by Ensembl reference transcripts (in grey). The top panel superimposes exon quantifications obtained from the GTEx Project to these exon–intron struc- tures (x-axis) and indicates a tissue selectivity for exon 1A in stomach, whereas exon 1B is preferentially included in lung. We computed for each exon the inclusion level relative to the highest exon expression level of a transcript (y-axis) To this end the AstaFunk algorithm performs a greedy selection (Finn et al., 2014), AstaFunk predicts in most cases the same number of hits to be included in a set H of non-overlapping hits between of domains as HMMER3, as can be expected when optimizing the p;X the profile p and the sequence X: all initial hits h 2 H are iterated same scoring function on the same input. To further investigate the p;X according to their alignment score ranking score(h), from high to marginal differences in the number of C2H2-zinc domains, we com- pared the locations of this domain in the protein ZNF74 to the do- low scores, and h is included in H iff the aligned interval of X p;X main reference annotations from the PROSITE (Sigrist et al., 2013) does not overlap with any previously selected hit h 2 H of the p;X same profile p. Note that this greedy selection does neither guaran- and from the SMART (Letunic et al., 2015) database, obtained from Ensembl (release 87). Our results indicate that the additional tee the number nor the sum of scores of accepted hits h 2 H to p;X domains predicted by our AstaFunk heuristics are reproducing well be maximal. However, in agreement with the underlying biology the locations annotated by the two reference databases (Fig. 2a). our greedy selection prioritises high-scoring alignments of a domain Next, we employed the AstaFunk pipeline to investigate TS AS as the most trustful predictions. events reported earlier in the RBFOX1 gene (Castle et al., 2008), in Eventually, each hit h 2 H has the properties: p;X the SLC25A3 gene (Wang et al., 2008), and in the CLDN18 gene i. (i) h represents a hit, thus a significant alignment of profile p to (Tureci et al., 2011). First, we ran the AstaFunk pipeline to system- the target sequence X provided by scoreðÞ h  log ðÞ jXj  h ; 2 p atically predict all protein domains that overlap AS events described ii. (ii) h is alternatively spliced, i.e. the genomic area between the by the Gencode (v19) annotation in these genes. Figure 2b, start/end positions of h intersects with½ source ; sink of at e e Supplementary Figures S3 and S4 show that the domain predictions least one AS event e; on the alternative transcripts of each gene vary in their length and in iii. (iii) h is the best alignment of p with the corresponding subse- their score (Supplementary Data S1). According to our results, the quence of X, i.e. there exists no other hit h 2 H with score p;X score of the Claudin domain in the CLDN18 gene decreases when ðÞ h > scoreðÞ h in the subsequence of X that aligns with h . including the downstream exon 1B (score 114.9) instead of the alter- native upstream exon 1A (score 120.7, shown in the bottom panel of Fig. 2b). The splicing of exon 14 in the RBFOX1 gene increases 4 Results the score of the predicted Fox-1C domain (score 155.9 and 174.9) 4.1 Case studies of alternatively spliced protein as compared with the splicing of the mutually exclusive exon 15 domains downstream (score 146.6 and 165.7, Supplementary Fig. S3). In Our AstaFunk method employs and branch-and-bound heuristic contrast, the mitochondrial carrier domain of SLC25A3 yields a when computing the Viterbi Dynamic Programming matrices slightly lower score (77.0) when including the upstream exon 3A in- (Section 3.4) but a greedy selection step to resolve overlapping align- stead of its mutually exclusive counterpart 3B further downstream ments of the same domain profile (Section 3.5). We therefore first (scoring 78.3 and 78.6, Supplementary Fig. S4). sought to assess whether AstaFunk correctly identifies domains in In order to further assess the earlier reported tissue specificity of protein sequences in a case study of Zinc-Finger ZnF proteins from AS in these three genes, we next superimposed exon quantifications the Kru ¨ ppel family of transcription factors (Resch et al., 2004). To (Supplementary Data S2) based on RNA-Seq data produced by the this end we compared the domains found by our method to comple- GTEx Project (Ardlie et al., 2015) to these alternatively spliced mentary predictions by the HMMER3.2b software (Eddy, 2011). domains. In agreement with previous studies related to the CLDN18 Supplementary Table S1 shows that using the set of 14 831 gene (Niimi et al., 2001; Tureci et al., 2011), these quantifications manually curated profiles from the Pfam-A database v.27 demonstrate a TS selectivity for the alternative inclusion of exon 1A, Relative Mean Exon count 3814 V.Coelho and M.Sammeth Table 1. Comparison of the characteristics of (a) the benchmarked transcriptomes, (b) the reduction of the search space and (c) running time improvement achieved by the AstaFunk method in the D. melanogaster, in the Caenorhabditis elegans and in three Homo sapiens transcriptome annotations (columns) WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 Note: All running times were estimated by the CPU time reported when monitoring the program execution on an Intel Xeon E5-4650 processor clocked at 2.7 GHz by the time command. carrying the domain alignment that agrees better with the Pfam ref- reduced to loci with multiple (i.e. >1) alternatively spliced PC tran- erence, in stomach (on average 1460.3 in exon 1A versus 44.4 reads scripts; from these, we further removed transcripts with a length of in exon 1B, solid line in the top panel of Fig. 2b). In contrast, lung the annotated ORF that is not multiple of three in the genomic align- samples preferentially include exon 1B (488.7 in exon 1B versus 0.4 ment, or with an in-frame stop codon in the in silico translation of reads in exon 1A, dashed line), which yields a slightly worse align- the genomic sequence; then we scanned the longest ORF of each ment score with the Claudin domain profile (bottom panel of locus heuristically for significant Pfam-A profile hits (Eddy, 2011) Fig. 2b). and disconsidered loci without any domain hit. For all the remaining Similarly, our results also confirm well the previously reported loci we stored the domains that produced hit(s) in order to reduce tissue preferences of RBFOX1, where exon 14 preferentially spliced the domain space that is to be explored in the subsequent bench- in brain and exon 15 preferentially spliced in skeletal muscle (Castle mark (Supplementary Table S4). et al., 2008), and of SLC25A3, where exon 3A preferentially spliced Table 1a summarizes the attributes of thus obtained ‘PCAS loci’ in muscle tissues whereas exon 3B is preferred in testis and liver with  2 PC transcripts remaining after the previous filtering pro- (Wang et al., 2008). The AstaFunk domain prediction scores of cess, the corresponding sum of alternative PCAS ORFs and PCAS these case studies are summarized in Supplementary Data S1, and transcripts they produce, as well as the total number of ‘PCAS even- the reference read counts in each exon in Supplementary Data S2 ts’ collected from their splicing patterns (Foissac and Sammeth, (Supplementary Figs S3 and S4). 2007). Row ‘PCAS events per gene’ then lazily computes the fraction of (PCAS events)/(PCAS loci). Our results in Table 1b show a grad- ually increasing reduction of the search space in terms of number of 4.2 Evaluation of the AstaFunk time complexity amino acids that are explored by the AstaFunk method when com- A formal estimation of the AstaFunk algorithmic time complexity is paring WormBase to FlyBase and with the human RefSeq annota- not trivial since many algorithmic steps depend on specific attributes tion (44%, 49% and 53%), about proportional to the number of PC of the analyzed transcriptome and/or of a specific gene, for instance AS events per gene in each dataset (1.7, 1.9 and 2.3). This reduction the number and mutual distance of AS events, the type and the num- of the search space achieved by delimiting domain scans to the ber of domains found in the longest ORF of a gene, the agreement regions of AS events is also reflected by the speedup of the total run- between the variant sequences and a domain profile during branch- and-bound DP, and the number and mutual overlap of domains pre- time ( 43%timee Table 1c), although the overall gain in perform- dicted in variant sequences. We therefore based our evaluation of ance varies with specific attributes of each transcriptome; the the time complexity on empiric results, comparing the performance improvement obtained by AS independent branch-and-bound heu- of the AstaFunk method with the naı ¨ve approach of exhaustively ristics shows expectedly a rather constant performance comparing scanning all alternative transcript sequences with the corresponding the evaluated transcriptomes ( 25% on average, Supplementary domain profiles. Table S5), which does not affect the reported speedup since we For our benchmark, we employed the following genome assem- employed in our benchmark branch-and-bound heuristics for both, exhaustive as well as AstaFunk domain searches. blies and transcriptome annotations downloaded from the UCSC However, our evaluation also demonstrates that the speedup of Genome Browser (Kent et al., 2002): FlyBase Genes (Crosby et al., the AstaFunk method decreases when the AS diversity annotated in 2007; Smith et al., 2007) aligned to the dm3 genome of Drosophila a transcriptome further increases, from 47% in the human RefSeq, melanogaster, WormBase Genes (Harris et al., 2003) aligned to the to 30% in the human UCSC and 26% in the human Gencode anno- ce3 genome of Caenorhabditis elegans, and RefSeq transcripts tation (Table 1c). This reduction in efficiency is apparently caused (Pruitt et al., 2014), UCSC Genes (Rosenbloom et al., 2015), as well as the Gencode Comprehensive Gene Collection v19 (Harrow et al., by the overhead of additional operations that are necessary for 2012) aligned to the hg19 genome of Homo sapiens. After obtaining merging AS events in transcriptomes with high AS event densities these input annotations from the corresponding website (3.4 and 6 AS events per PC locus, Table 1a). Indeed, when consid- (Supplementary Table S2), we preprocessed the transcripts as fol- ering the proportion of AS events that have to be merged because lows (Supplementary Table S3): the total number of PC loci was first they overlap by D regions, we observe that RefSeq comprises the p AstaFunk 3815 comparatively highest proportion of loci with few (0–10%) merged (a) events, whereas the UCSC and the Gencode annotation show less such loci. Vice versa, the proportion of loci with most of their AS events being merged follows the ranking Gencode > UCSC > RefSeq (Supplementary Fig. S5). 4.3 The impact of alternative splicing on protein domains We then employed our results from Section 4.2 in order to estimate the transcriptome-wide impact of AS on protein domains. Assuming the prediction scores to be a proxy for the molecular activity of domains, we analyzed the impact of AS in the Gencode annotation, the transcriptome with the highest AS activity in our benchmark Domain Score Conservation (Table 1a). For our study, we clustered all AstaFunk predictions of (b) the same domain that overlap by genomic coordinates, and under the hypothesis of purifying selection we assumed the highest scoring prediction to represent the wild-type of the domain. We then com- puted for each AS variant of the domain a corresponding ‘domain conservation’ coefficient by the fraction between the domain score assigned to the alternatively spliced domain and the wild-type score. Figure 3a shows that most scores assigned by AstaFunk to alter- natively spliced domains preserve the wild-type score well, which implies that also changes to the corresponding domainso functional- ities are likely limited. According to our analysis, domains that show AS variants with low conservation scores tend to be longer and over- all less frequent in the proteome (Fig. 3b). To further investigate an alternatively spliced domain with poor prediction score, we selected Domain Score Conservation the TNNT1 gene (Supplementary Fig. S1) that harbours in total four different predictions of the troponin domain (Supplementary Fig. 3. Evaluation of the AS impact on protein domains in the human Gencode transcriptome. (a) Most of the domain predictions that differ due to Fig. S6). If we assume the best prediction with a score of 170.5 to AS changes to the coding sequence preserve the wild-type score well, i.e. represent the wild-type, AS creates two minor modifications of the most alternative scores fall into the [0.9; 1] interval of domain conservation. domain, one by an alternative acceptor choice in exon 12 (conserva- (b) Domains with AS variants that exhibit very low scores (domain conserva- tion 98.9%) and another by a postponed start codon triggered by tion <0.1 of the wild-type) are on average longer (white boxes, left y-axis) and AS in the 5 -UTR (conservation 88.7%); however, a major modifica- less frequent in the transcriptome (grey boxes, right y-axis) tion can be caused by the skipping of exon 8 that leads to the loss of a considerable portion of amino acids in the C-terminal end of the required calculations can dominate the running time when the dens- encoded troponin domain (conservation 27.5%). The phenotype of ity of AS events is as elevated as in the UCSC and the Gencode anno- individuals that exhibit the latter splicing anomaly has been tation of the human transcriptome (Table 1). Nevertheless, in all of characterized as ‘nemaline myopathy’ (van der Pol et al., 2014). the transcriptomes we investigated, AstaFunk reduces the CPU time needed for computing the genome-wide set of domains in AS events by > 1/4 of the time consumed by an exhaustive search approach. 5 Discussion Since we employed for our benchmark only a heuristically deter- We developed AstaFunk, a computational pipeline that links AS mined subset of domain profiles, i.e. domains occurring in the lon- events to protein domains and identifies alternatively spliced gest ORF of each gene, the speedup that we measure here in hours domains without requiring a cumbersome manual overlap between extrapolates to a reduction of days or weeks when performing a full AS events (in nucleotide space) and protein domains (in amino acid exhaustive search of all Pfam domains in each gene. space). To this end, our method combines previously published As a proof of principle, our results show that superimposing methods for the automated predicition of AS events (Foissac and RNAseq-based exon quantifications from the GTEx project to the Sammeth, 2007) with in silico predictions of protein domains (Eddy, AstaFunk domain predictions confirms well previously established 2011). To boost efficiency, our approach reduces the search space knowledge on tissue-selective preferences of mutually exclusive primarily by omitting redundant sequence scans of identical (i.e. exons in three in-depth studied genes. Since our AstaFunk protocol constitutively spliced) parts that are intrinsic to alternatively spliced is not restricted to a specific database of gene models or AS events, it transcripts, considering each unique splicing pattern (i.e. variant) of also provides the potential to automatically and exhaustively detect an AS event exactly once. AS changes to protein domains implied by generic RNA-Seq results However, some additional operations are necessary to detect in combination with ab initio and/or de novo gene predictors. But whether two neighbouring AS events overlap by a profile-specific D p also when employing AS events from reference gene annotations, region (Algorithm 1), and in such cases to merge the corresponding the AstaFunk pipeline could provide a unified platform for produc- variants of neighbouring events (Algorithm 2). Our benchmark ing comparable results on the impact of transcriptional modifica- shows that the computational overhead for these additional opera- tions on protein domains. tions is rather negligible in transcriptomes with modest AS activity, If we believe that a comparison of the scores computed based on i.e. in the WormBase, FlyBase and the RefSeq collection, but the the Pfam models allows us to make conclusions about the molecular [0;0.1[ [0.1;0.2[ [0.2;0.3[ [0.3;0.4[ [0.4;0.5[ [0.5;0.6[ [0.6;0.7[ [0.7;0.8[ [0.8;0.9[ [0;0.1[ [0.9;1] [0.1;0.2[ [0.2;0.3[ [0.3;0.4[ [0.4;0.5[ [0.5;0.6[ [0.6;0.7[ [0.7;0.8[ [0.8;0.9[ [0.9;1] Domain Length [aa] Density 0 100 300 500 700 900 0 0.05 0.1 0.15 0.2 0.25 0.3 0 50 100 150 200 250 300 350 Number of Predictions 3816 V.Coelho and M.Sammeth Harrow,J. et al. (2012) Gencode: the reference human genome annotation for activity of a domain, then the predictions produced by AstaFunk en- the encode project. Genome Res., 22, 1760–1774. able us to suggest that most AS events impact the domains they affect Hegyi,H. et al. (2011) Verification of alternative splicing variants based on do- only marginally. These observations agree well with complementary main integrity, truncation length and intrinsic protein disorder. Nucleic proteomics studies (Tress et al.,2017), that conclude that the impact Acids Res., 39, 1208–1219. of AS in the proteome is likely less severe than the plethora of splicing Kaessmann,H. et al. (2002) Signatures of domain shuffling in the human gen- variants found in transcriptome interrogations would suggest (Djebali ome. Genome Res., 12, 1642–1650. et al.,2012). However, some of the score changes predicted by Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., AstaFunk on AS configurations of a domain indicate that the conse- 12, 996–1006. quences on the protein functionality could be non-negligible. One Keren,H. et al. (2010) Alternative splicing and evolution: diversification, exon such example is the skipping of exonic sequences in an AS variant of definition and function. Nat. Rev. 11, 345–355. the TNNT1 gene that leads to the predicted loss of the C-terminal Lappalainen,T. et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501, 506–511. part of its essential troponin domain and was linked to a pathogen Letunic,I. et al. (2015) Smart: recent updates, new developments and status in phenotype described as nemaline myopathy. Although certainly only 2015. Nucleic Acids Res., 43, D257–D260. a minority of individuals suffer this AS-triggered domain malfunction, Light,S. and Elofsson,A. (2013) The impact of splicing on protein domain the example demonstrates the exploratory power provided by our architecture. Curr. Opin. Struct. Biol., 23, 451–458. AstaFunk pipeline to detect potential consequences of AS on protein Liu,S. and Altman,R.B. (2003) Large scale study of protein domain distribu- domains, which later on can be further explored by complementary tion in the context of alternative splicing. Nucleic Acids Res., 31, genetic or proteomic studies. 4828–4835. Moore,A.D. et al. (2008) Arrangements in the modular evolution of proteins. Trends Biochem. Sci., 33, 444–451. Funding Niimi,T. et al. (2001) claudin-18, a novel downstream target gene for the This work was supported by the National Counsel of Technological t/ebp/nkx2.1 homeodomain transcription factor, encodes lung- and stomach-specific isoforms through alternative splicing. Mol. Cell. Biol., 21, and Scientific Development [132455/2013-7 and 140941/2015-0 to 7380–7390. V.C. 310132/2015-0 to M.S.]; and the Research Support Oltean,S. and Bates,D.O. (2014) Hallmarks of alternative splicing in cancer. Foundation of the State of Rio de Janeiro [E_06/2015 to M.S.]. Oncogene, 33, 5311–5318. Conflict of Interest: none declared. van der Pol,W.L. et al. (2014) Nemaline myopathy caused by tnnt1 mutations in a Dutch pedigree. Mol. Genet. Genomic Med., 2, 134–137. Pruitt,K.D. et al. (2014) Refseq: an update on mammalian reference sequences. References Nucleic Acids Res., 42, D756–D763. Resch,A. et al. (2004) Assessing the impact of alternative splicing on domain Black,D. (2000) Protein diversity from alternative splicing: a challenge for bio- interactions in the human proteome. J. Proteome Res., 3, 76–83. informatics and post-genome biology. Cell, 103, 367–370. Rosenbloom,K.R. et al. (2015) The ucsc genome browser database: 2015 up- Buljan,M. et al. (2012) Tissue-specific splicing of disordered segments that date. Nucleic Acids Res., 43, D670–D681. embed binding motifs rewires protein interaction networks. Mol. Cell, 46, Sammeth,M. (2009) Complete alternative splicing events are bubbles in splic- 871–883. ing graphs. J. Comput. Biol., 16, 1117–1140. Castle,J.C. et al. (2008) Expression of 24, 426 human alternative splicing Sammeth,M. et al. (2008) A general definition and nomenclature for alterna- events and predicted cis regulation in 48 tissues and cell lines. Nat. Genet., tive splicing events. PLoS Comput. Biol., 4, e1000147. 40, 1416–1425. Sigrist,C.J.A. et al. (2013) New and continuing developments at prosite. Chothia,C. et al. (2003) Evolution of the protein repertoire. Science, 300, Nucleic Acids Res., 41, D344–D347. 1701–1703. Smith,C.D. et al. (2007) The release 5.1 annotation of Drosophila mela- Crosby,M.A. et al. (2007) Flybase: genomes by the dozen. Nucleic Acids Res., nogaster heterochromatin. Science, 316, 1586–1591. 35, D486–D491. Tazi,J. et al. (2009) Alternative splicing and disease. Biochim. Biophys. Acta, Djebali,S. et al. (2012) Landscape of transcription in human cells. Nature, 1792, 14–26. 489, 101–108. Tress,M.L. et al. (2007) The implications of alternative splicing in the encode Durbin,R. et al. (1998). Biological Sequence Analysis: Probabilistic Models of protein complement. PNAS, 104, 5495–5500. Proteins Nucleic Acids. 1st edn. Cambridge University Press. Tress,M.L. et al. (2017) Alternative splicing may not be the key to proteome Eddy,S.R. (2011) Accelerated profile hmm searches. PLoS Comput. Biol., 7, complexity. Trends Biochem. Sci., 42, 98–110. e1002195–e1002116. Tureci,O. et al. (2011) Claudin-18 gene structure, regulation, and expression Finn,R. et al. (2014) Pfam: the protein families database. Nucleic Acids Res., is evolutionary conserved in mammals. Gene, 481, 83–92. 42, D222–D230. Vibranovski,M.D. et al. (2005) Signs of ancient and modern exon-shuffling Foissac,S. and Sammeth,M. (2007) Astalavista: dynamic and flexible analysis are correlated to the distribution of ancient and modern domains along pro- of alternative splicing events in custom gene datasets. Nucleic Acids Res., 35, w297–w299. teins. J. Mol. Evol., 61, 341–350. Ardlie,K.G. et al. (2015) The genotype-tissue expression (gtex) pilot analysis: Wang,E.T. et al. (2008) Alternative isoform regulation in human tissue tran- multitissue gene regulation in humans. Science, 348, 648–660. scriptomes. Nature, 456, 470–476. Harris,T.W. et al. (2003) Wormbase: a cross-species database for comparative Wetlaufer,D.B. (1973) Nucleation, rapid folding, and globular intrachain genomics. Nucleic Acids Res., 31, 133–137. regions in proteins. Proc. Natl Acad. Sci. USA, 70, 697–701.

Journal

BioinformaticsOxford University Press

Published: Jun 1, 2018

There are no references for this article.