An automated method for detecting alternatively spliced protein domains

An automated method for detecting alternatively spliced protein domains Abstract Motivation Alternative splicing (AS) has been demonstrated to play a role in shaping eukaryotic gene diversity at the transcriptional level. However, the impact of AS on the proteome is still controversial. Studies that seek to explore the effect of AS at the proteomic level are hampered by technical difficulties in the cumbersome process of casting forth and back between genome, transcriptome and proteome space coordinates, and the naïve prediction of protein domains in the presence of AS suffers many redundant sequence scans that emerge from constitutively spliced regions that are shared between alternative products of a gene. Results We developed the AstaFunk pipeline that computes for every generic transcriptome all domains that are altered by AS events in a systematic and efficient manner. In a nutshell, our method employs Viterbi dynamic programming, which guarantees to find all score-optimal hits of the domains under consideration, while complementary optimizations at different levels avoid redundant and other irrelevant computations. We evaluate AstaFunk qualitatively and quantitatively using RNAseq in well-studied genes with AS, and on large-scale employing entire transcriptomes. Our study confirms complementary reports that the effect of most AS events on the proteome seems to be rather limited, but our results also pinpoint several cases where AS could have a major impact on the function of a protein domain. Availability and implementation The JAVA implementation of AstaFunk is available as an open source project on http://astafunk.sammeth.net. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Alternative splicing (AS) has already early been demonstrated to be capable of alternating the proteins produced from specific messenger transcripts, for instance in genes that encode receptors, cell-adhesion molecules and membrane channels [reviewed by Black (2000)]. Quantitative assays based on microarrays and then on high-throughput sequencing of the cellular RNA complement (RNA-Seq) further showed AS to be an ubiquitous process that plays a role in fine-tuning the eukaryotic transcriptome diversity according to specific genotypes (Lappalainen et al., 2013), cell types and cell states (Djebali et al., 2012; Wang et al., 2008), albeit the observed differences are usually subordinate to gene expression. In specific cases, AS has also been demonstrated to severely compromise the functionality of genes, and aberrant AS variants have been linked to several physiological diseases (Tazi et al., 2009), as well as to cancer progression and metastasis (Oltean and Bates, 2014). However, there is to date still relatively few evidence that allows to conclude on an equally large impact of AS in the proteome as has been reported in the transcriptome (Tress et al., 2017), underlining the importance of automated methods that help to reduce the size of this evident gap by a systematic analysis of the nucleic acid diversity in the context of the encoded proteins. In structural biology, the term protein ‘domain’ was initially coined for regions or fragments of globular proteins that are able to fold independently (Wetlaufer, 1973). This definition was later extended to a concept of domains as the basic structural, functional and/or evolutionary units of proteins (Chothia et al., 2003; Moore et al., 2008). Protein domains are harboured by one or multiple exons (Kaessmann et al., 2002; Moore et al., 2008; Vibranovski et al., 2005), and therefore the functionality of a protein can be regulated by AS by completely or partially removing/inserting some protein-coding (PC) exons (Light and Elofsson, 2013; Keren et al., 2010; Tress et al., 2007). The objective of investigating proteome changes caused by AS has not come up only recently. However, driven by the lack of an universal method, previous studies on AS triggered protein modifications employ substantially variable in-house data analysis pipelines that rely on heterogeneous sources (i.e. different databases of AS events and of protein annotations) and on specific ad hoc decisions for the filtering thresholds applied, which makes the corresponding results difficult to compare. For instance, Liu and Altman (2003) and Resch et al. (2004) conducted early transcriptome-wide studies on AS in human protein domains. The study of Liu and Altman (2003) employed gene loci reviewed by the NCBI with more than one Reference Sequence (RefSeq) transcript, linked them to entries of the conserved domains database (CDD) and considered domains that were not present in all splice variants to be ‘spliced-out’. Resch et al. (2004) in contrast identified a representative ‘major transcript’ for each gene that maximized the number of mRNA and EST support, and then determined the longest open reading frame in each candidate transcript. Subsequently, they aligned the major–minor protein pairs and recorded the amino acid intervals that are removed or replaced. The overlap between these alternatively spliced regions and annotated protein domains then allowed to identify domains that were affected by AS. Later-on, Hegyi et al. (2011) described a large-scale study of AS in protein domains, employing major and minor alternative protein forms according to the underlying EST support as defined in the Swissprot database. The authors then investigated domain modifications by the minor form by considering all major–minor transcript pairs with AS in the region of domains annotated by the Protein Family (Pfam) database. Whereas Buljan et al. (2012) analyzed how often amino acid sequences encoded by tissue-specific (TS) cassette exons, non-TS cassette exons and constitutive exons are overlapping the Pfam protein domains; cassette exons and tissue specificity in this study were obtained by overlapping exons supported by at least two mature Ensembl (release 54) transcripts with RNAseq expression data from Wang et al. (2008); exons present in all annotated transcripts were consequently assumed to be constitutive. Subsequently, each exon was associated with the longest annotated ORF supporting it. Summarizing the methodology behind these previous approaches, most approaches identified based on some reference gene annotation one specific reference transcript per gene, the so-called ‘o-call transcript. This reference transcript then was usually compared to all ‘llpare products by sequence alignments of the amino acid or mRNA sequences. Such alignments are computationally expensive and often partially redundant due to the presence of many identical subsequences from constitutively included exons. Furthermore, alignments of mature transcripts or protein sequences do not provide information about the location of introns or splicing patterns. In order to overcome these shortcomings we sought to develop a method that enables systematic studies on the effect of AS on protein domains and avoids unnecessary sequence searches. 2 System and methods Herein we employ the generic Astalavista notation that delimits an AS event e=(sourcee,sinke,Ve) by the first/last common splice site (‘ sourcee’ and “ sinke”) and specifies the collection of variants Ve={Ve1,…,Ved} ⁠, where each variant Vei describes a specific exon-intron pattern (i.e. a sequence of splice sites) between sourcee and sinke supported by a set of |Vei|≥1 transcripts (Sammeth et al., 2008). Collapsing transcripts with identical local splicing patterns into variants avoids redundancy introduced by different alternative transcripts that locally employ the same exons and introns of an event. Supplementary Figure S1 shows an example of applying this notation to AS events in the TNNT1 gene. Note that both events shown in Supplementary Figure S1 have a variant collection with only two variants |Ve1|=|Ve2|=2 that group the three alternative transcripts of the locus by common local splicing patterns. Dimension d specifies the number of different variants in e, and ‘alternative’ splicing events naturally involve d≥2 variants. Throughout this work we are exclusively considering ‘complete’ AS events that comprise all d variants between sourcee and sinke (Sammeth, 2009). The amino acid sequence of domains with the same function can vary in different proteins of a protein family, therefore the representative consensus patterns of a protein domain is often modelled by a statistical profiles. The Pfam database (Finn et al., 2014) produces sequence alignments of all proteins from the same family and subsequently profiles the attributes of each domain by condensing its sequence conservation in a hidden Markov model (HMM). π=(Lπ,Pt(A|Z),Pe(x|A)),x∈X,Z→A,A,Z∈A (1) HMM domain profiles π of length Lπ are described by all Z→Atransition probabilities Pt(A|Z) between different states A,Z∈A of the model, as well as by the (additional) emission probabilities Pe(x|A) when aligning A∈{Di,Ii,Mi} at position 1≤i≤Lπ of the model to an amino acid x∈X from the amino acid alphabet X ⁠. Latter emission probabilities are employed to evaluate the likelihood of π matching the symbols xj∈X of a given peptide sequence X. Note that some state transitions Z→A are not productive, for instance Pt(Di|Ii−1)=0 and vice versa Pt(Ii|Di−1)=0 to enforce at least one mis-/matching symbol between insertions and deletions (Supplementary Algorithm S1). Supplementary Figure S2 provides a more detailed description of this so-called ‘Plan 7 architecture’ of Pfam HMMs as proposed by Durbin et al. (1998), which distinguishes in total 10 different states A={B,C,Di,E,Ii,J,Mi,N,S,T} ⁠, of which we employ in the following mainly the beginning/end state {B, E} of a domain alignment and the alignment states {Di,Ii,Mi} (deletion, insertion and mis-/match states). In contrast to Durbin et al. (1998), the model we apply herein prohibits partial (‘local’) alignments of π by B ↛   Mi∀i>1 and Mi ↛   E∀i<Lπ ⁠. However, multiple local alignment(s) of the entire model π within X are still possible, by iterating the interleaving amino acids xj∈X in the states {C, J, N} as described by Durbin et al. (1998) (Supplementary Fig. S2). 3 System We then developed AstaFunk (Alternative Splicing Transcriptome Annotation with Functional Knowledge), a computational pipeline designed to analyse the modifications caused by AS in protein domains in a non-redundant manner, i.e. considering each alternative transcript subsequence only once. Figure 1 resumes the main steps of our pipeline that are described in detail in the subsequent sections: based on the genomic exon–intron coordinates of a transcriptome annotation, the first step (a) consists in iterating all genes to determine the positions of complete AS events; in step (b), the algorithm computes the size of a domain-specific region Δπ around each AS event, which is required in order to ensure that all optimal alignments of a profile π with the alternatively spliced transcript subsequences are found; if this region Δπ possibly overlaps adjacent AS events, step (c) merges the corresponding events (Algorithm 1), and step (d) removes redundancy amongst alternatively spliced subsequences of merged events by reorganising the corresponding variants (Algorithm 2); in step (e), the nucleic acid sequences of the alternatively spliced variants are translated in silico to amino acid sequences, which subsequently in step (f) are aligned with the profile π of the protein domain (Supplementary Algorithm S2); step (g) selects from the set of possibly overlapping domain hits a subset of disjoint alignments ranked by their score; eventually, step (h) outputs all AS events in the regions of thus selected domains hits. Fig. 1 View largeDownload slide Overview of the main steps in the AstaFunk workflow. Algorithmic steps are grey boxes labelled by letters in the order of their execution, from (a) to (h). Calling AS events (left panel) requires as input a transcriptome annotation (GTF format) and the sequence of the corresponding genome assembly (FASTA format); domain calling (right panel) additionally requires HMM domain profiles (Pfam format). A pre-computed list of reference domains for each gene (proprietary format) can optionally be provided in order to avoid brute force exhaustive search Fig. 1 View largeDownload slide Overview of the main steps in the AstaFunk workflow. Algorithmic steps are grey boxes labelled by letters in the order of their execution, from (a) to (h). Calling AS events (left panel) requires as input a transcriptome annotation (GTF format) and the sequence of the corresponding genome assembly (FASTA format); domain calling (right panel) additionally requires HMM domain profiles (Pfam format). A pre-computed list of reference domains for each gene (proprietary format) can optionally be provided in order to avoid brute force exhaustive search 3.1 Compute Δπ Domains can completely or just partially overlap the region of an AS event. Therefore, also parts of the sequences flanking the event are to be considered when computing relevant alignments of a domain profile π=(Lπ,Pt(A|Z),Pe(x|A)) ⁠, and we developed an algorithm that determines the minimal extension Δπ necessary to find all score-optimal alignments of π. In a first consideration we determine a lower bound on Δπ by the number of nucleotides that participate in an exact domain match overlapping the region of an AS event [source; sink] by just a single nucleotide, (Lπ×3−1) ⁠. This region then can be further extended by inserting gaps, as long as the alignment score does not fall below the domain-specific gathering threshold θπ for a domain hit provided by the Pfam model. We further obtain for each model π=(Lπ, Pt(A|Z), Pe(x|A)) the globally lowest gap-opening penalty απ considering transitions Pt(Ii|M) απ=|max1≤i≤Lπ;x∈Xbit(Pt(Ii|M)×Pe(x|Ii))|, (2) and correspondingly the lowest gap-extension penalty βπ considering transitions Pt(Ii|I) ⁠, over all 1≤i≤Lπ and emissions Pe(x|Ii),x∈X βπ=|max1≤i≤Lπ;x∈Xbit(Pt(Ii|I)×Pe(x|Ii))|. (3) Note that by definition of the Pfam models (Eddy, 2011), gap-opening transitions have to be preceded by match states M→Ii (Section 2.2), and that the ‘lowest penalty’ corresponds to the ‘highest bit score’ bit() maximized by Equations (2) and (3). As bit scores of gap-opening and gap-extension probabilities are consequently negative throughout Pfam profiles π, we employ the absolute value of bit() as a penalty measure for απ and βπ ⁠. We can compute the maximum number of amino acids that can be introduced in the alignment of a perfect domain match ωπ(0) by the expression (ωπ(0)−θπ−απ)÷βπ ⁠, where ωπ(i)=∑k=i+1Lπmaxx∈X(bit(Pt(Mk|Mk−1)×Pe(x|Mk))=∑k=i+1Lπmaxx∈X( log 2(Pt(Mk|Mk−1)+ log 2(Pe(x|Mk)P0(x))) (4) describes the score when aligning the suffix [i;Lπ] of π against its consensus sequence, and ωπ(0) consequently converges to the optimal score for an alignment of the entire profile π. Employing this term, the size of region Δπ in nucleotide space can be delimited by the function shown in Equation (5). Δπ=⌈(Lπ×3)−13+ωπ(0)−θπ−απβπ⌉×3 (5) θπ is the Pfam minimum score that a subsequence must reach in order to belong to a protein domain (Finn et al., 2014). 3.2 Merge as events As detailed in the previous section, the AstaFunk algorithm extends the event region [sourcee;sinke] to both sides by Δπ when scanning for a profile π to delineate the relevant sequences of AS events e=(sourcee,sinke,Ve) ⁠. Note that intronic sequence stretches obviously are to be disregarded in these Δπ extensions because the domain alignment is carried out at the level of amino acids translated from exons. However, the extension of an AS event e1 by Δπ can overlap the boundaries of an adjacent event e2. In this case, events e1 and e2 are mutually overlapping by Δπ and need to be merged in order to consider their variants together when aligning profile π. Obviously, the problem of events overlapping by Δπ can be generalized for n > 2 adjacent events ⟨e1,…,en⟩ ⁠, ordered by their genomic positions (sourceei≺sourceei+1) in the directionality of transcription. Algorithm 1: MergeEvents Algorithm 1 identifies all sets of Δπ–overlapping events from a sequence of AS events E, sorted by aforementioned pre-order ≺ on their position in the genome. While iterating |E| in a double loop (i, j), the algorithm identifies runs of adjacent events ⟨ei,…,ej−1⟩ that mutually overlap by Δπ ⁠. In the trivial case of j=(i+1) ⁠, the merged event ei,…,j−1 corresponds to ei. Otherwise, ei,…,j−1 extends from sourceei to sinkej−1 and combines the variant collections of all the merged events {Vei,…,Vej−1} ⁠, which we reorganized by SplitVariants () described in the next section (Algorithm 2). Obviously, the complexity of Algorithm 1 is linear w.r.t. |E| ⁠, and the result EΔπ comprises all (potentially merged) events. 3.3 Split variants Algorithm 1 also shows that a merge operation modifies not only the boundaries of consecutive events ⟨e1,…,en⟩ ⁠, but also their respective variant collections. The non-redundant concept of variants requires that each transcript ti∈Vekx is represented by exactly one variant x of each event 1≤k≤n (Sammeth, 2009). Consequently, procedure SplitVariants () employed by Algorithm 1 has to find a non-redundant variant collection Ve1,…,n that complies with the variants described by each of the merged events. Algorithm 2: SplitVariants Algorithm 2 (SplitVariants) commences by collecting non-redundantly all transcripts that describe variants in any of the merged events in set T. The algorithm then ensures that all pairs of transcripts (ti,tj)∈T that are from different variants in one of the merged events ek (line 7) are also not contained in the same variant of the merged event e1,…,n (lines 8–10). The reason behind this requirement is that each transcript ti∈Vekx in variant x describes an exon–intron pattern on [sourceek;sinkek] that differs from the pattern of transcripts tj in another variant Veky ⁠, and since [sourceek;sinkek] is necessarily a subregion of the merged event (Algorithm 1), both transcripts ti and tj cannot share a common variant in the merged event. Supplementary Figure S1 exemplifies the application of Algorithm 2 by merging the variant collections of two AS events e1 and e2 from the Troponin T type 1 gene (TNNT1). Starting with an initial value of Ve1,2={T} ⁠, SplitVariants iterates all (ti, tj) from T= {NM_003283, NM_001126132, NM_001126133}. When identifying that t1=NM_003283 and t2=NM_00112132 are from different variants in e1, the algorithm adapts Ve1,2={{NM_003283, NM_001126133},{NM_001126132}}. Next, also NM_003283 ∈Ve21 and NM_00112133 ∈Ve22 are split, producing the final variant collection Ve1,2={{NM_003283}, {NM_001126133}, {NM_001126132}}. A general proof for the correctness of Algorithm 2 is given in the Supplementary Material (Supplementary Proof S1). 3.4 Align profiles By the principle of non-redundancy in variant sets (Sammeth, 2009), any transcript ti in a variant x of event ek describes the same splicing pattern, and therefore also the same exonic sequence, between sourceek and sinkek ⁠. To this end the AstaFunk pipeline considers for each variant x the exon–intron structure of an arbitrarily selected transcript ti∈Vekx when translating the genomic subsequence [sourceek;sinkek] to a sequence X of correspondingly encoded amino acids. Subsequently, X is aligned to π=(Lπ,Pt(A|Z),Pe(x|A)) by computing the dynamic programming (DP) recursion of the Viterbi Algorithm (Supplementary Algorithm S1) in the matrix RLπ×|X|×|A| ⁠. Each cell ri,jA∈R represents an optimal sub-solution of the partial alignment between the i first positions of π and j symbols in the prefix of X. In order to reduce the number of cells that are to be computed in R, the AstaFunk algorithm employs an exact branch-and-bound heuristic for excluding subsolutions that cannot lead to a relevant alignment of π (see Supplementary Algorithm S2). Following this heuristic, subsolutions ri,jA are considered feasible iff ri,jA+ωπ(i)− log 2(|X|)≥θπ (6) holds (‘branch’). Otherwise the partial alignment is not extended (‘bound’), and consequently all DP cells ri,jA that extend exclusively infeasible subsolutions ri,j−1A, ri−1,jA, ri−1,j−1A also represent infeasible (sub-)solutions and limit of the search space. As introduced in Section 3.1, the score of the optimal suffix alignment ωπ(i) in Equation (6) is estimated by the scores of exclusively match state transitions in the suffix Mk→Mk+1,i≤k<Lπ ⁠. Threshold θπ then is employed as discriminator whether a significant alignment score can be reached from a subsolution ri,jA ⁠. 3.5 Call domains All relevant alignments of a profile π with a variant sequence X that exceed threshold θπ are considered hits (h). However, the Viterbi Algorithm described in Supplementary Algorithm S2 possibly identifies multiple hits h∈Hπ,X of the same profile that mutually overlap on X, especially in the case of repetitive domains that carry intrinsic symmetries. In order to globally quantify the changes in the domain landscape introduced by AS, we determine for each profile π a subset of compatible (i.e. non-overlapping) hits Hπ,X*⊆Hπ,X ⁠. To this end the AstaFunk algorithm performs a greedy selection of hits to be included in a set Hπ,X* of non-overlapping hits between the profile π and the sequence X: all initial hits h∈Hπ,X are iterated according to their alignment score ranking score(h), from high to low scores, and h is included in Hπ,X*iff the aligned interval of X does not overlap with any previously selected hit h*∈Hπ,X* of the same profile π. Note that this greedy selection does neither guarantee the number nor the sum of scores of accepted hits h*∈Hπ,X* to be maximal. However, in agreement with the underlying biology our greedy selection prioritises high-scoring alignments of a domain as the most trustful predictions. Eventually, each hit h*∈Hπ,X* has the properties: (i) h* represents a hit, thus a significant alignment of profile π to the target sequence X provided by score(h*)− log 2(|X|)≥θπ ⁠; (ii) h* is alternatively spliced, i.e. the genomic area between the start/end positions of h* intersects with [sourcee;sinke] of at least one AS event e; (iii) h* is the best alignment of π with the corresponding subsequence of X, i.e. there exists no other hit h∈Hπ,X with score(h)>score(h*) in the subsequence of X that aligns with h* ⁠. 4 Results 4.1 Case studies of alternatively spliced protein domains Our AstaFunk method employs and branch-and-bound heuristic when computing the Viterbi Dynamic Programming matrices (Section 3.4) but a greedy selection step to resolve overlapping alignments of the same domain profile (Section 3.5). We therefore first sought to assess whether AstaFunk correctly identifies domains in protein sequences in a case study of Zinc-Finger ZnF proteins from the Krüppel family of transcription factors (Resch et al., 2004). To this end we compared the domains found by our method to complementary predictions by the HMMER3.2b software (Eddy, 2011). Supplementary Table S1 shows that using the set of 14 831 manually curated profiles from the Pfam-A database v.27 (Finn et al., 2014), AstaFunk predicts in most cases the same number of domains as HMMER3, as can be expected when optimizing the same scoring function on the same input. To further investigate the marginal differences in the number of C2H2-zinc domains, we compared the locations of this domain in the protein ZNF74 to the domain reference annotations from the PROSITE (Sigrist et al., 2013) and from the SMART (Letunic et al., 2015) database, obtained from Ensembl (release 87). Our results indicate that the additional domains predicted by our AstaFunk heuristics are reproducing well the locations annotated by the two reference databases (Fig. 2a). Fig. 2. View largeDownload slide Case studies of the AstaFunk pipeline. (a) The schema shows the positions (x-axis) of reference annotations obtained from the PROSITE and the SMART database in comparison to domain predictions by the AstaFunk, respectively, by the HMMER3 program in the zinc finger protein 74 (transcript ZNF74-202, accession identifier ENST00000611540). Annotations of the KRAB box domain (PF01352.22) are depicted by grey boxes, and annotations of the C2H2-zinc finger domains (PF00096.21) by white boxes. (b) The alternative first exons 1A and 1B of the CLDN18 gene (ENSG00000066405.8) are overlapping with a Claudin family protein domain. The bottom panel shows in black the score (Bs = bit score) and the position of AstaFunk domain predictions relative to splicing patterns annotated by Ensembl reference transcripts (in grey). The top panel superimposes exon quantifications obtained from the GTEx Project to these exon–intron structures (x-axis) and indicates a tissue selectivity for exon 1A in stomach, whereas exon 1B is preferentially included in lung. We computed for each exon the inclusion level relative to the highest exon expression level of a transcript (y-axis) Fig. 2. View largeDownload slide Case studies of the AstaFunk pipeline. (a) The schema shows the positions (x-axis) of reference annotations obtained from the PROSITE and the SMART database in comparison to domain predictions by the AstaFunk, respectively, by the HMMER3 program in the zinc finger protein 74 (transcript ZNF74-202, accession identifier ENST00000611540). Annotations of the KRAB box domain (PF01352.22) are depicted by grey boxes, and annotations of the C2H2-zinc finger domains (PF00096.21) by white boxes. (b) The alternative first exons 1A and 1B of the CLDN18 gene (ENSG00000066405.8) are overlapping with a Claudin family protein domain. The bottom panel shows in black the score (Bs = bit score) and the position of AstaFunk domain predictions relative to splicing patterns annotated by Ensembl reference transcripts (in grey). The top panel superimposes exon quantifications obtained from the GTEx Project to these exon–intron structures (x-axis) and indicates a tissue selectivity for exon 1A in stomach, whereas exon 1B is preferentially included in lung. We computed for each exon the inclusion level relative to the highest exon expression level of a transcript (y-axis) Next, we employed the AstaFunk pipeline to investigate TS AS events reported earlier in the RBFOX1 gene (Castle et al., 2008), in the SLC25A3 gene (Wang et al., 2008), and in the CLDN18 gene (Tureci et al., 2011). First, we ran the AstaFunk pipeline to systematically predict all protein domains that overlap AS events described by the Gencode (v19) annotation in these genes. Figure 2b, Supplementary Figures S3 and S4 show that the domain predictions on the alternative transcripts of each gene vary in their length and in their score (Supplementary Data S1). According to our results, the score of the Claudin domain in the CLDN18 gene decreases when including the downstream exon 1B (score 114.9) instead of the alternative upstream exon 1A (score 120.7, shown in the bottom panel of Fig. 2b). The splicing of exon 14 in the RBFOX1 gene increases the score of the predicted Fox-1C domain (score 155.9 and 174.9) as compared with the splicing of the mutually exclusive exon 15 downstream (score 146.6 and 165.7, Supplementary Fig. S3). In contrast, the mitochondrial carrier domain of SLC25A3 yields a slightly lower score (77.0) when including the upstream exon 3A instead of its mutually exclusive counterpart 3B further downstream (scoring 78.3 and 78.6, Supplementary Fig. S4). In order to further assess the earlier reported tissue specificity of AS in these three genes, we next superimposed exon quantifications (Supplementary Data S2) based on RNA-Seq data produced by the GTEx Project (Ardlie et al., 2015) to these alternatively spliced domains. In agreement with previous studies related to the CLDN18 gene (Niimi et al., 2001; Tureci et al., 2011), these quantifications demonstrate a TS selectivity for the alternative inclusion of exon 1A, carrying the domain alignment that agrees better with the Pfam reference, in stomach (on average 1460.3 in exon 1A versus 44.4 reads in exon 1B, solid line in the top panel of Fig. 2b). In contrast, lung samples preferentially include exon 1B (488.7 in exon 1B versus 0.4 reads in exon 1A, dashed line), which yields a slightly worse alignment score with the Claudin domain profile (bottom panel of Fig. 2b). Similarly, our results also confirm well the previously reported tissue preferences of RBFOX1, where exon 14 preferentially spliced in brain and exon 15 preferentially spliced in skeletal muscle (Castle et al., 2008), and of SLC25A3, where exon 3A preferentially spliced in muscle tissues whereas exon 3B is preferred in testis and liver (Wang et al., 2008). The AstaFunk domain prediction scores of these case studies are summarized in Supplementary Data S1, and the reference read counts in each exon in Supplementary Data S2 (Supplementary Figs S3 and S4). 4.2 Evaluation of the AstaFunk time complexity A formal estimation of the AstaFunk algorithmic time complexity is not trivial since many algorithmic steps depend on specific attributes of the analyzed transcriptome and/or of a specific gene, for instance the number and mutual distance of AS events, the type and the number of domains found in the longest ORF of a gene, the agreement between the variant sequences and a domain profile during branch-and-bound DP, and the number and mutual overlap of domains predicted in variant sequences. We therefore based our evaluation of the time complexity on empiric results, comparing the performance of the AstaFunk method with the naïve approach of exhaustively scanning all alternative transcript sequences with the corresponding domain profiles. For our benchmark, we employed the following genome assemblies and transcriptome annotations downloaded from the UCSC Genome Browser (Kent et al., 2002): FlyBase Genes (Crosby et al., 2007; Smith et al., 2007) aligned to the dm3 genome of Drosophila melanogaster, WormBase Genes (Harris et al., 2003) aligned to the ce3 genome of Caenorhabditis elegans, and RefSeq transcripts (Pruitt et al., 2014), UCSC Genes (Rosenbloom et al., 2015), as well as the Gencode Comprehensive Gene Collection v19 (Harrow et al., 2012) aligned to the hg19 genome of Homo sapiens. After obtaining these input annotations from the corresponding website (Supplementary Table S2), we preprocessed the transcripts as follows (Supplementary Table S3): the total number of PC loci was first reduced to loci with multiple (i.e. >1) alternatively spliced PC transcripts; from these, we further removed transcripts with a length of the annotated ORF that is not multiple of three in the genomic alignment, or with an in-frame stop codon in the in silico translation of the genomic sequence; then we scanned the longest ORF of each locus heuristically for significant Pfam-A profile hits (Eddy, 2011) and disconsidered loci without any domain hit. For all the remaining loci we stored the domains that produced hit(s) in order to reduce the domain space that is to be explored in the subsequent benchmark (Supplementary Table S4). Table 1a summarizes the attributes of thus obtained ‘PCAS loci’ with ≥2 PC transcripts remaining after the previous filtering process, the corresponding sum of alternative PCAS ORFs and PCAS transcripts they produce, as well as the total number of ‘PCAS events’ collected from their splicing patterns (Foissac and Sammeth, 2007). Row ‘PCAS events per gene’ then lazily computes the fraction of (PCAS events)/(PCAS loci). Our results in Table 1b show a gradually increasing reduction of the search space in terms of number of amino acids that are explored by the AstaFunk method when comparing WormBase to FlyBase and with the human RefSeq annotation (44%, 49% and 53%), about proportional to the number of PC AS events per gene in each dataset (1.7, 1.9 and 2.3). This reduction of the search space achieved by delimiting domain scans to the regions of AS events is also reflected by the speedup of the total runtime (∼43%timee Table 1c), although the overall gain in performance varies with specific attributes of each transcriptome; the improvement obtained by AS independent branch-and-bound heuristics shows expectedly a rather constant performance comparing the evaluated transcriptomes (∼25% on average, Supplementary Table S5), which does not affect the reported speedup since we employed in our benchmark branch-and-bound heuristics for both, exhaustive as well as AstaFunk domain searches. Table 1. Comparison of the characteristics of (a) the benchmarked transcriptomes, (b) the reduction of the search space and (c) running time improvement achieved by the AstaFunk method in the D. melanogaster, in the Caenorhabditis elegans and in three Homo sapiens transcriptome annotations (columns) WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 Note: All running times were estimated by the CPU time reported when monitoring the program execution on an Intel Xeon E5-4650 processor clocked at 2.7 GHz by the time command. Table 1. Comparison of the characteristics of (a) the benchmarked transcriptomes, (b) the reduction of the search space and (c) running time improvement achieved by the AstaFunk method in the D. melanogaster, in the Caenorhabditis elegans and in three Homo sapiens transcriptome annotations (columns) WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 Note: All running times were estimated by the CPU time reported when monitoring the program execution on an Intel Xeon E5-4650 processor clocked at 2.7 GHz by the time command. However, our evaluation also demonstrates that the speedup of the AstaFunk method decreases when the AS diversity annotated in a transcriptome further increases, from 47% in the human RefSeq, to 30% in the human UCSC and 26% in the human Gencode annotation (Table 1c). This reduction in efficiency is apparently caused by the overhead of additional operations that are necessary for merging AS events in transcriptomes with high AS event densities (3.4 and 6 AS events per PC locus, Table 1a). Indeed, when considering the proportion of AS events that have to be merged because they overlap by Δπ regions, we observe that RefSeq comprises the comparatively highest proportion of loci with few (0–10%) merged events, whereas the UCSC and the Gencode annotation show less such loci. Vice versa, the proportion of loci with most of their AS events being merged follows the ranking Gencode > UCSC > RefSeq (Supplementary Fig. S5). 4.3 The impact of alternative splicing on protein domains We then employed our results from Section 4.2 in order to estimate the transcriptome-wide impact of AS on protein domains. Assuming the prediction scores to be a proxy for the molecular activity of domains, we analyzed the impact of AS in the Gencode annotation, the transcriptome with the highest AS activity in our benchmark (Table 1a). For our study, we clustered all AstaFunk predictions of the same domain that overlap by genomic coordinates, and under the hypothesis of purifying selection we assumed the highest scoring prediction to represent the wild-type of the domain. We then computed for each AS variant of the domain a corresponding ‘domain conservation’ coefficient by the fraction between the domain score assigned to the alternatively spliced domain and the wild-type score. Figure 3a shows that most scores assigned by AstaFunk to alternatively spliced domains preserve the wild-type score well, which implies that also changes to the corresponding domainso functionalities are likely limited. According to our analysis, domains that show AS variants with low conservation scores tend to be longer and overall less frequent in the proteome (Fig. 3b). To further investigate an alternatively spliced domain with poor prediction score, we selected the TNNT1 gene (Supplementary Fig. S1) that harbours in total four different predictions of the troponin domain (Supplementary Fig. S6). If we assume the best prediction with a score of 170.5 to represent the wild-type, AS creates two minor modifications of the domain, one by an alternative acceptor choice in exon 12 (conservation 98.9%) and another by a postponed start codon triggered by AS in the 5′-UTR (conservation 88.7%); however, a major modification can be caused by the skipping of exon 8 that leads to the loss of a considerable portion of amino acids in the C-terminal end of the encoded troponin domain (conservation 27.5%). The phenotype of individuals that exhibit the latter splicing anomaly has been characterized as ‘nemaline myopathy’ (van der Pol et al., 2014). Fig. 3. View largeDownload slide Evaluation of the AS impact on protein domains in the human Gencode transcriptome. (a) Most of the domain predictions that differ due to AS changes to the coding sequence preserve the wild-type score well, i.e. most alternative scores fall into the [0.9; 1] interval of domain conservation. (b) Domains with AS variants that exhibit very low scores (domain conservation <0.1 of the wild-type) are on average longer (white boxes, left y-axis) and less frequent in the transcriptome (grey boxes, right y-axis) Fig. 3. View largeDownload slide Evaluation of the AS impact on protein domains in the human Gencode transcriptome. (a) Most of the domain predictions that differ due to AS changes to the coding sequence preserve the wild-type score well, i.e. most alternative scores fall into the [0.9; 1] interval of domain conservation. (b) Domains with AS variants that exhibit very low scores (domain conservation <0.1 of the wild-type) are on average longer (white boxes, left y-axis) and less frequent in the transcriptome (grey boxes, right y-axis) 5 Discussion We developed AstaFunk, a computational pipeline that links AS events to protein domains and identifies alternatively spliced domains without requiring a cumbersome manual overlap between AS events (in nucleotide space) and protein domains (in amino acid space). To this end, our method combines previously published methods for the automated predicition of AS events (Foissac and Sammeth, 2007) with in silico predictions of protein domains (Eddy, 2011). To boost efficiency, our approach reduces the search space primarily by omitting redundant sequence scans of identical (i.e. constitutively spliced) parts that are intrinsic to alternatively spliced transcripts, considering each unique splicing pattern (i.e. variant) of an AS event exactly once. However, some additional operations are necessary to detect whether two neighbouring AS events overlap by a profile-specific Δπ region (Algorithm 1), and in such cases to merge the corresponding variants of neighbouring events (Algorithm 2). Our benchmark shows that the computational overhead for these additional operations is rather negligible in transcriptomes with modest AS activity, i.e. in the WormBase, FlyBase and the RefSeq collection, but the required calculations can dominate the running time when the density of AS events is as elevated as in the UCSC and the Gencode annotation of the human transcriptome (Table 1). Nevertheless, in all of the transcriptomes we investigated, AstaFunk reduces the CPU time needed for computing the genome-wide set of domains in AS events by > 1/4 of the time consumed by an exhaustive search approach. Since we employed for our benchmark only a heuristically determined subset of domain profiles, i.e. domains occurring in the longest ORF of each gene, the speedup that we measure here in hours extrapolates to a reduction of days or weeks when performing a full exhaustive search of all Pfam domains in each gene. As a proof of principle, our results show that superimposing RNAseq-based exon quantifications from the GTEx project to the AstaFunk domain predictions confirms well previously established knowledge on tissue-selective preferences of mutually exclusive exons in three in-depth studied genes. Since our AstaFunk protocol is not restricted to a specific database of gene models or AS events, it also provides the potential to automatically and exhaustively detect AS changes to protein domains implied by generic RNA-Seq results in combination with ab initio and/or de novo gene predictors. But also when employing AS events from reference gene annotations, the AstaFunk pipeline could provide a unified platform for producing comparable results on the impact of transcriptional modifications on protein domains. If we believe that a comparison of the scores computed based on the Pfam models allows us to make conclusions about the molecular activity of a domain, then the predictions produced by AstaFunk enable us to suggest that most AS events impact the domains they affect only marginally. These observations agree well with complementary proteomics studies (Tress et al., 2017), that conclude that the impact of AS in the proteome is likely less severe than the plethora of splicing variants found in transcriptome interrogations would suggest (Djebali et al., 2012). However, some of the score changes predicted by AstaFunk on AS configurations of a domain indicate that the consequences on the protein functionality could be non-negligible. One such example is the skipping of exonic sequences in an AS variant of the TNNT1 gene that leads to the predicted loss of the C-terminal part of its essential troponin domain and was linked to a pathogen phenotype described as nemaline myopathy. Although certainly only a minority of individuals suffer this AS-triggered domain malfunction, the example demonstrates the exploratory power provided by our AstaFunk pipeline to detect potential consequences of AS on protein domains, which later on can be further explored by complementary genetic or proteomic studies. Funding This work was supported by the National Counsel of Technological and Scientific Development [132455/2013-7 and 140941/2015-0 to V.C. 310132/2015-0 to M.S.]; and the Research Support Foundation of the State of Rio de Janeiro [E_06/2015 to M.S.]. Conflict of Interest: none declared. References Black D. ( 2000 ) Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology . Cell , 103 , 367 – 370 . Google Scholar Crossref Search ADS PubMed Buljan M. et al. ( 2012 ) Tissue-specific splicing of disordered segments that embed binding motifs rewires protein interaction networks . Mol. Cell , 46 , 871 – 883 . Google Scholar Crossref Search ADS PubMed Castle J.C. et al. ( 2008 ) Expression of 24, 426 human alternative splicing events and predicted cis regulation in 48 tissues and cell lines . Nat. Genet ., 40 , 1416 – 1425 . Google Scholar Crossref Search ADS PubMed Chothia C. et al. ( 2003 ) Evolution of the protein repertoire . Science , 300 , 1701 – 1703 . Google Scholar Crossref Search ADS PubMed Crosby M.A. et al. ( 2007 ) Flybase: genomes by the dozen . Nucleic Acids Res ., 35 , D486 – D491 . Google Scholar Crossref Search ADS PubMed Djebali S. et al. ( 2012 ) Landscape of transcription in human cells . Nature , 489 , 101 – 108 . Google Scholar Crossref Search ADS PubMed Durbin R. et al. ( 1998 ). Biological Sequence Analysis: Probabilistic Models of Proteins Nucleic Acids . 1 st edn. Cambridge University Press . Eddy S.R. ( 2011 ) Accelerated profile hmm searches . PLoS Comput. Biol ., 7 , e1002195 – e1002116 . Google Scholar Crossref Search ADS PubMed Finn R. et al. ( 2014 ) Pfam: the protein families database . Nucleic Acids Res ., 42 , D222 – D230 . Google Scholar Crossref Search ADS PubMed Foissac S. , Sammeth M. ( 2007 ) Astalavista: dynamic and flexible analysis of alternative splicing events in custom gene datasets . Nucleic Acids Res ., 35 , w297 – w299 . Google Scholar Crossref Search ADS PubMed Ardlie K.G. et al. ( 2015 ) The genotype-tissue expression (gtex) pilot analysis: multitissue gene regulation in humans . Science , 348 , 648 – 660 . Google Scholar Crossref Search ADS PubMed Harris T.W. et al. ( 2003 ) Wormbase: a cross-species database for comparative genomics . Nucleic Acids Res ., 31 , 133 – 137 . Google Scholar Crossref Search ADS PubMed Harrow J. et al. ( 2012 ) Gencode: the reference human genome annotation for the encode project . Genome Res ., 22 , 1760 – 1774 . Google Scholar Crossref Search ADS PubMed Hegyi H. et al. ( 2011 ) Verification of alternative splicing variants based on domain integrity, truncation length and intrinsic protein disorder . Nucleic Acids Res ., 39 , 1208 – 1219 . Google Scholar Crossref Search ADS PubMed Kaessmann H. et al. ( 2002 ) Signatures of domain shuffling in the human genome . Genome Res ., 12 , 1642 – 1650 . Google Scholar Crossref Search ADS PubMed Kent W.J. et al. ( 2002 ) The human genome browser at UCSC . Genome Res ., 12 , 996 – 1006 . Google Scholar Crossref Search ADS PubMed Keren H. et al. ( 2010 ) Alternative splicing and evolution: diversification, exon definition and function . Nat. Rev . 11 , 345 – 355 . Google Scholar Crossref Search ADS Lappalainen T. et al. ( 2013 ) Transcriptome and genome sequencing uncovers functional variation in humans . Nature , 501 , 506 – 511 . Google Scholar Crossref Search ADS PubMed Letunic I. et al. ( 2015 ) Smart: recent updates, new developments and status in 2015 . Nucleic Acids Res ., 43 , D257 – D260 . Google Scholar Crossref Search ADS PubMed Light S. , Elofsson A. ( 2013 ) The impact of splicing on protein domain architecture . Curr. Opin. Struct. Biol ., 23 , 451 – 458 . Google Scholar Crossref Search ADS PubMed Liu S. , Altman R.B. ( 2003 ) Large scale study of protein domain distribution in the context of alternative splicing . Nucleic Acids Res ., 31 , 4828 – 4835 . Google Scholar Crossref Search ADS PubMed Moore A.D. et al. ( 2008 ) Arrangements in the modular evolution of proteins . Trends Biochem. Sci ., 33 , 444 – 451 . Google Scholar Crossref Search ADS PubMed Niimi T. et al. ( 2001 ) claudin-18, a novel downstream target gene for the t/ebp/nkx2.1 homeodomain transcription factor, encodes lung- and stomach-specific isoforms through alternative splicing . Mol. Cell. Biol ., 21 , 7380 – 7390 . Google Scholar Crossref Search ADS PubMed Oltean S. , Bates D.O. ( 2014 ) Hallmarks of alternative splicing in cancer . Oncogene , 33 , 5311 – 5318 . Google Scholar Crossref Search ADS PubMed van der Pol W.L. et al. ( 2014 ) Nemaline myopathy caused by tnnt1 mutations in a Dutch pedigree . Mol. Genet. Genomic Med ., 2 , 134 – 137 . Google Scholar Crossref Search ADS PubMed Pruitt K.D. et al. ( 2014 ) Refseq: an update on mammalian reference sequences . Nucleic Acids Res ., 42 , D756 – D763 . Google Scholar Crossref Search ADS PubMed Resch A. et al. ( 2004 ) Assessing the impact of alternative splicing on domain interactions in the human proteome . J. Proteome Res ., 3 , 76 – 83 . Google Scholar Crossref Search ADS PubMed Rosenbloom K.R. et al. ( 2015 ) The ucsc genome browser database: 2015 update . Nucleic Acids Res ., 43 , D670 – D681 . Google Scholar Crossref Search ADS PubMed Sammeth M. ( 2009 ) Complete alternative splicing events are bubbles in splicing graphs . J. Comput. Biol ., 16 , 1117 – 1140 . Google Scholar Crossref Search ADS PubMed Sammeth M. et al. ( 2008 ) A general definition and nomenclature for alternative splicing events . PLoS Comput. Biol ., 4 , e1000147. Google Scholar Crossref Search ADS PubMed Sigrist C.J.A. et al. ( 2013 ) New and continuing developments at prosite . Nucleic Acids Res ., 41 , D344 – D347 . Google Scholar Crossref Search ADS PubMed Smith C.D. et al. ( 2007 ) The release 5.1 annotation of Drosophila melanogaster heterochromatin . Science , 316 , 1586 – 1591 . Google Scholar Crossref Search ADS PubMed Tazi J. et al. ( 2009 ) Alternative splicing and disease . Biochim. Biophys. Acta , 1792 , 14 – 26 . Google Scholar Crossref Search ADS PubMed Tress M.L. et al. ( 2007 ) The implications of alternative splicing in the encode protein complement . PNAS , 104 , 5495 – 5500 . Google Scholar Crossref Search ADS PubMed Tress M.L. et al. ( 2017 ) Alternative splicing may not be the key to proteome complexity . Trends Biochem. Sci ., 42 , 98 – 110 . Google Scholar Crossref Search ADS PubMed Tureci O. et al. ( 2011 ) Claudin-18 gene structure, regulation, and expression is evolutionary conserved in mammals . Gene , 481 , 83 – 92 . Google Scholar Crossref Search ADS PubMed Vibranovski M.D. et al. ( 2005 ) Signs of ancient and modern exon-shuffling are correlated to the distribution of ancient and modern domains along proteins . J. Mol. Evol ., 61 , 341 – 350 . Google Scholar Crossref Search ADS PubMed Wang E.T. et al. ( 2008 ) Alternative isoform regulation in human tissue transcriptomes . Nature , 456 , 470 – 476 . Google Scholar Crossref Search ADS PubMed Wetlaufer D.B. ( 1973 ) Nucleation, rapid folding, and globular intrachain regions in proteins . Proc. Natl Acad. Sci. USA , 70 , 697 – 701 . Google Scholar Crossref Search ADS © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

An automated method for detecting alternatively spliced protein domains

Loading next page...
 
/lp/ou_press/an-automated-method-for-detecting-alternatively-spliced-protein-0IYy7aFJL2
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty425
Publisher site
See Article on Publisher Site

Abstract

Abstract Motivation Alternative splicing (AS) has been demonstrated to play a role in shaping eukaryotic gene diversity at the transcriptional level. However, the impact of AS on the proteome is still controversial. Studies that seek to explore the effect of AS at the proteomic level are hampered by technical difficulties in the cumbersome process of casting forth and back between genome, transcriptome and proteome space coordinates, and the naïve prediction of protein domains in the presence of AS suffers many redundant sequence scans that emerge from constitutively spliced regions that are shared between alternative products of a gene. Results We developed the AstaFunk pipeline that computes for every generic transcriptome all domains that are altered by AS events in a systematic and efficient manner. In a nutshell, our method employs Viterbi dynamic programming, which guarantees to find all score-optimal hits of the domains under consideration, while complementary optimizations at different levels avoid redundant and other irrelevant computations. We evaluate AstaFunk qualitatively and quantitatively using RNAseq in well-studied genes with AS, and on large-scale employing entire transcriptomes. Our study confirms complementary reports that the effect of most AS events on the proteome seems to be rather limited, but our results also pinpoint several cases where AS could have a major impact on the function of a protein domain. Availability and implementation The JAVA implementation of AstaFunk is available as an open source project on http://astafunk.sammeth.net. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Alternative splicing (AS) has already early been demonstrated to be capable of alternating the proteins produced from specific messenger transcripts, for instance in genes that encode receptors, cell-adhesion molecules and membrane channels [reviewed by Black (2000)]. Quantitative assays based on microarrays and then on high-throughput sequencing of the cellular RNA complement (RNA-Seq) further showed AS to be an ubiquitous process that plays a role in fine-tuning the eukaryotic transcriptome diversity according to specific genotypes (Lappalainen et al., 2013), cell types and cell states (Djebali et al., 2012; Wang et al., 2008), albeit the observed differences are usually subordinate to gene expression. In specific cases, AS has also been demonstrated to severely compromise the functionality of genes, and aberrant AS variants have been linked to several physiological diseases (Tazi et al., 2009), as well as to cancer progression and metastasis (Oltean and Bates, 2014). However, there is to date still relatively few evidence that allows to conclude on an equally large impact of AS in the proteome as has been reported in the transcriptome (Tress et al., 2017), underlining the importance of automated methods that help to reduce the size of this evident gap by a systematic analysis of the nucleic acid diversity in the context of the encoded proteins. In structural biology, the term protein ‘domain’ was initially coined for regions or fragments of globular proteins that are able to fold independently (Wetlaufer, 1973). This definition was later extended to a concept of domains as the basic structural, functional and/or evolutionary units of proteins (Chothia et al., 2003; Moore et al., 2008). Protein domains are harboured by one or multiple exons (Kaessmann et al., 2002; Moore et al., 2008; Vibranovski et al., 2005), and therefore the functionality of a protein can be regulated by AS by completely or partially removing/inserting some protein-coding (PC) exons (Light and Elofsson, 2013; Keren et al., 2010; Tress et al., 2007). The objective of investigating proteome changes caused by AS has not come up only recently. However, driven by the lack of an universal method, previous studies on AS triggered protein modifications employ substantially variable in-house data analysis pipelines that rely on heterogeneous sources (i.e. different databases of AS events and of protein annotations) and on specific ad hoc decisions for the filtering thresholds applied, which makes the corresponding results difficult to compare. For instance, Liu and Altman (2003) and Resch et al. (2004) conducted early transcriptome-wide studies on AS in human protein domains. The study of Liu and Altman (2003) employed gene loci reviewed by the NCBI with more than one Reference Sequence (RefSeq) transcript, linked them to entries of the conserved domains database (CDD) and considered domains that were not present in all splice variants to be ‘spliced-out’. Resch et al. (2004) in contrast identified a representative ‘major transcript’ for each gene that maximized the number of mRNA and EST support, and then determined the longest open reading frame in each candidate transcript. Subsequently, they aligned the major–minor protein pairs and recorded the amino acid intervals that are removed or replaced. The overlap between these alternatively spliced regions and annotated protein domains then allowed to identify domains that were affected by AS. Later-on, Hegyi et al. (2011) described a large-scale study of AS in protein domains, employing major and minor alternative protein forms according to the underlying EST support as defined in the Swissprot database. The authors then investigated domain modifications by the minor form by considering all major–minor transcript pairs with AS in the region of domains annotated by the Protein Family (Pfam) database. Whereas Buljan et al. (2012) analyzed how often amino acid sequences encoded by tissue-specific (TS) cassette exons, non-TS cassette exons and constitutive exons are overlapping the Pfam protein domains; cassette exons and tissue specificity in this study were obtained by overlapping exons supported by at least two mature Ensembl (release 54) transcripts with RNAseq expression data from Wang et al. (2008); exons present in all annotated transcripts were consequently assumed to be constitutive. Subsequently, each exon was associated with the longest annotated ORF supporting it. Summarizing the methodology behind these previous approaches, most approaches identified based on some reference gene annotation one specific reference transcript per gene, the so-called ‘o-call transcript. This reference transcript then was usually compared to all ‘llpare products by sequence alignments of the amino acid or mRNA sequences. Such alignments are computationally expensive and often partially redundant due to the presence of many identical subsequences from constitutively included exons. Furthermore, alignments of mature transcripts or protein sequences do not provide information about the location of introns or splicing patterns. In order to overcome these shortcomings we sought to develop a method that enables systematic studies on the effect of AS on protein domains and avoids unnecessary sequence searches. 2 System and methods Herein we employ the generic Astalavista notation that delimits an AS event e=(sourcee,sinke,Ve) by the first/last common splice site (‘ sourcee’ and “ sinke”) and specifies the collection of variants Ve={Ve1,…,Ved} ⁠, where each variant Vei describes a specific exon-intron pattern (i.e. a sequence of splice sites) between sourcee and sinke supported by a set of |Vei|≥1 transcripts (Sammeth et al., 2008). Collapsing transcripts with identical local splicing patterns into variants avoids redundancy introduced by different alternative transcripts that locally employ the same exons and introns of an event. Supplementary Figure S1 shows an example of applying this notation to AS events in the TNNT1 gene. Note that both events shown in Supplementary Figure S1 have a variant collection with only two variants |Ve1|=|Ve2|=2 that group the three alternative transcripts of the locus by common local splicing patterns. Dimension d specifies the number of different variants in e, and ‘alternative’ splicing events naturally involve d≥2 variants. Throughout this work we are exclusively considering ‘complete’ AS events that comprise all d variants between sourcee and sinke (Sammeth, 2009). The amino acid sequence of domains with the same function can vary in different proteins of a protein family, therefore the representative consensus patterns of a protein domain is often modelled by a statistical profiles. The Pfam database (Finn et al., 2014) produces sequence alignments of all proteins from the same family and subsequently profiles the attributes of each domain by condensing its sequence conservation in a hidden Markov model (HMM). π=(Lπ,Pt(A|Z),Pe(x|A)),x∈X,Z→A,A,Z∈A (1) HMM domain profiles π of length Lπ are described by all Z→Atransition probabilities Pt(A|Z) between different states A,Z∈A of the model, as well as by the (additional) emission probabilities Pe(x|A) when aligning A∈{Di,Ii,Mi} at position 1≤i≤Lπ of the model to an amino acid x∈X from the amino acid alphabet X ⁠. Latter emission probabilities are employed to evaluate the likelihood of π matching the symbols xj∈X of a given peptide sequence X. Note that some state transitions Z→A are not productive, for instance Pt(Di|Ii−1)=0 and vice versa Pt(Ii|Di−1)=0 to enforce at least one mis-/matching symbol between insertions and deletions (Supplementary Algorithm S1). Supplementary Figure S2 provides a more detailed description of this so-called ‘Plan 7 architecture’ of Pfam HMMs as proposed by Durbin et al. (1998), which distinguishes in total 10 different states A={B,C,Di,E,Ii,J,Mi,N,S,T} ⁠, of which we employ in the following mainly the beginning/end state {B, E} of a domain alignment and the alignment states {Di,Ii,Mi} (deletion, insertion and mis-/match states). In contrast to Durbin et al. (1998), the model we apply herein prohibits partial (‘local’) alignments of π by B ↛   Mi∀i>1 and Mi ↛   E∀i<Lπ ⁠. However, multiple local alignment(s) of the entire model π within X are still possible, by iterating the interleaving amino acids xj∈X in the states {C, J, N} as described by Durbin et al. (1998) (Supplementary Fig. S2). 3 System We then developed AstaFunk (Alternative Splicing Transcriptome Annotation with Functional Knowledge), a computational pipeline designed to analyse the modifications caused by AS in protein domains in a non-redundant manner, i.e. considering each alternative transcript subsequence only once. Figure 1 resumes the main steps of our pipeline that are described in detail in the subsequent sections: based on the genomic exon–intron coordinates of a transcriptome annotation, the first step (a) consists in iterating all genes to determine the positions of complete AS events; in step (b), the algorithm computes the size of a domain-specific region Δπ around each AS event, which is required in order to ensure that all optimal alignments of a profile π with the alternatively spliced transcript subsequences are found; if this region Δπ possibly overlaps adjacent AS events, step (c) merges the corresponding events (Algorithm 1), and step (d) removes redundancy amongst alternatively spliced subsequences of merged events by reorganising the corresponding variants (Algorithm 2); in step (e), the nucleic acid sequences of the alternatively spliced variants are translated in silico to amino acid sequences, which subsequently in step (f) are aligned with the profile π of the protein domain (Supplementary Algorithm S2); step (g) selects from the set of possibly overlapping domain hits a subset of disjoint alignments ranked by their score; eventually, step (h) outputs all AS events in the regions of thus selected domains hits. Fig. 1 View largeDownload slide Overview of the main steps in the AstaFunk workflow. Algorithmic steps are grey boxes labelled by letters in the order of their execution, from (a) to (h). Calling AS events (left panel) requires as input a transcriptome annotation (GTF format) and the sequence of the corresponding genome assembly (FASTA format); domain calling (right panel) additionally requires HMM domain profiles (Pfam format). A pre-computed list of reference domains for each gene (proprietary format) can optionally be provided in order to avoid brute force exhaustive search Fig. 1 View largeDownload slide Overview of the main steps in the AstaFunk workflow. Algorithmic steps are grey boxes labelled by letters in the order of their execution, from (a) to (h). Calling AS events (left panel) requires as input a transcriptome annotation (GTF format) and the sequence of the corresponding genome assembly (FASTA format); domain calling (right panel) additionally requires HMM domain profiles (Pfam format). A pre-computed list of reference domains for each gene (proprietary format) can optionally be provided in order to avoid brute force exhaustive search 3.1 Compute Δπ Domains can completely or just partially overlap the region of an AS event. Therefore, also parts of the sequences flanking the event are to be considered when computing relevant alignments of a domain profile π=(Lπ,Pt(A|Z),Pe(x|A)) ⁠, and we developed an algorithm that determines the minimal extension Δπ necessary to find all score-optimal alignments of π. In a first consideration we determine a lower bound on Δπ by the number of nucleotides that participate in an exact domain match overlapping the region of an AS event [source; sink] by just a single nucleotide, (Lπ×3−1) ⁠. This region then can be further extended by inserting gaps, as long as the alignment score does not fall below the domain-specific gathering threshold θπ for a domain hit provided by the Pfam model. We further obtain for each model π=(Lπ, Pt(A|Z), Pe(x|A)) the globally lowest gap-opening penalty απ considering transitions Pt(Ii|M) απ=|max1≤i≤Lπ;x∈Xbit(Pt(Ii|M)×Pe(x|Ii))|, (2) and correspondingly the lowest gap-extension penalty βπ considering transitions Pt(Ii|I) ⁠, over all 1≤i≤Lπ and emissions Pe(x|Ii),x∈X βπ=|max1≤i≤Lπ;x∈Xbit(Pt(Ii|I)×Pe(x|Ii))|. (3) Note that by definition of the Pfam models (Eddy, 2011), gap-opening transitions have to be preceded by match states M→Ii (Section 2.2), and that the ‘lowest penalty’ corresponds to the ‘highest bit score’ bit() maximized by Equations (2) and (3). As bit scores of gap-opening and gap-extension probabilities are consequently negative throughout Pfam profiles π, we employ the absolute value of bit() as a penalty measure for απ and βπ ⁠. We can compute the maximum number of amino acids that can be introduced in the alignment of a perfect domain match ωπ(0) by the expression (ωπ(0)−θπ−απ)÷βπ ⁠, where ωπ(i)=∑k=i+1Lπmaxx∈X(bit(Pt(Mk|Mk−1)×Pe(x|Mk))=∑k=i+1Lπmaxx∈X( log 2(Pt(Mk|Mk−1)+ log 2(Pe(x|Mk)P0(x))) (4) describes the score when aligning the suffix [i;Lπ] of π against its consensus sequence, and ωπ(0) consequently converges to the optimal score for an alignment of the entire profile π. Employing this term, the size of region Δπ in nucleotide space can be delimited by the function shown in Equation (5). Δπ=⌈(Lπ×3)−13+ωπ(0)−θπ−απβπ⌉×3 (5) θπ is the Pfam minimum score that a subsequence must reach in order to belong to a protein domain (Finn et al., 2014). 3.2 Merge as events As detailed in the previous section, the AstaFunk algorithm extends the event region [sourcee;sinke] to both sides by Δπ when scanning for a profile π to delineate the relevant sequences of AS events e=(sourcee,sinke,Ve) ⁠. Note that intronic sequence stretches obviously are to be disregarded in these Δπ extensions because the domain alignment is carried out at the level of amino acids translated from exons. However, the extension of an AS event e1 by Δπ can overlap the boundaries of an adjacent event e2. In this case, events e1 and e2 are mutually overlapping by Δπ and need to be merged in order to consider their variants together when aligning profile π. Obviously, the problem of events overlapping by Δπ can be generalized for n > 2 adjacent events ⟨e1,…,en⟩ ⁠, ordered by their genomic positions (sourceei≺sourceei+1) in the directionality of transcription. Algorithm 1: MergeEvents Algorithm 1 identifies all sets of Δπ–overlapping events from a sequence of AS events E, sorted by aforementioned pre-order ≺ on their position in the genome. While iterating |E| in a double loop (i, j), the algorithm identifies runs of adjacent events ⟨ei,…,ej−1⟩ that mutually overlap by Δπ ⁠. In the trivial case of j=(i+1) ⁠, the merged event ei,…,j−1 corresponds to ei. Otherwise, ei,…,j−1 extends from sourceei to sinkej−1 and combines the variant collections of all the merged events {Vei,…,Vej−1} ⁠, which we reorganized by SplitVariants () described in the next section (Algorithm 2). Obviously, the complexity of Algorithm 1 is linear w.r.t. |E| ⁠, and the result EΔπ comprises all (potentially merged) events. 3.3 Split variants Algorithm 1 also shows that a merge operation modifies not only the boundaries of consecutive events ⟨e1,…,en⟩ ⁠, but also their respective variant collections. The non-redundant concept of variants requires that each transcript ti∈Vekx is represented by exactly one variant x of each event 1≤k≤n (Sammeth, 2009). Consequently, procedure SplitVariants () employed by Algorithm 1 has to find a non-redundant variant collection Ve1,…,n that complies with the variants described by each of the merged events. Algorithm 2: SplitVariants Algorithm 2 (SplitVariants) commences by collecting non-redundantly all transcripts that describe variants in any of the merged events in set T. The algorithm then ensures that all pairs of transcripts (ti,tj)∈T that are from different variants in one of the merged events ek (line 7) are also not contained in the same variant of the merged event e1,…,n (lines 8–10). The reason behind this requirement is that each transcript ti∈Vekx in variant x describes an exon–intron pattern on [sourceek;sinkek] that differs from the pattern of transcripts tj in another variant Veky ⁠, and since [sourceek;sinkek] is necessarily a subregion of the merged event (Algorithm 1), both transcripts ti and tj cannot share a common variant in the merged event. Supplementary Figure S1 exemplifies the application of Algorithm 2 by merging the variant collections of two AS events e1 and e2 from the Troponin T type 1 gene (TNNT1). Starting with an initial value of Ve1,2={T} ⁠, SplitVariants iterates all (ti, tj) from T= {NM_003283, NM_001126132, NM_001126133}. When identifying that t1=NM_003283 and t2=NM_00112132 are from different variants in e1, the algorithm adapts Ve1,2={{NM_003283, NM_001126133},{NM_001126132}}. Next, also NM_003283 ∈Ve21 and NM_00112133 ∈Ve22 are split, producing the final variant collection Ve1,2={{NM_003283}, {NM_001126133}, {NM_001126132}}. A general proof for the correctness of Algorithm 2 is given in the Supplementary Material (Supplementary Proof S1). 3.4 Align profiles By the principle of non-redundancy in variant sets (Sammeth, 2009), any transcript ti in a variant x of event ek describes the same splicing pattern, and therefore also the same exonic sequence, between sourceek and sinkek ⁠. To this end the AstaFunk pipeline considers for each variant x the exon–intron structure of an arbitrarily selected transcript ti∈Vekx when translating the genomic subsequence [sourceek;sinkek] to a sequence X of correspondingly encoded amino acids. Subsequently, X is aligned to π=(Lπ,Pt(A|Z),Pe(x|A)) by computing the dynamic programming (DP) recursion of the Viterbi Algorithm (Supplementary Algorithm S1) in the matrix RLπ×|X|×|A| ⁠. Each cell ri,jA∈R represents an optimal sub-solution of the partial alignment between the i first positions of π and j symbols in the prefix of X. In order to reduce the number of cells that are to be computed in R, the AstaFunk algorithm employs an exact branch-and-bound heuristic for excluding subsolutions that cannot lead to a relevant alignment of π (see Supplementary Algorithm S2). Following this heuristic, subsolutions ri,jA are considered feasible iff ri,jA+ωπ(i)− log 2(|X|)≥θπ (6) holds (‘branch’). Otherwise the partial alignment is not extended (‘bound’), and consequently all DP cells ri,jA that extend exclusively infeasible subsolutions ri,j−1A, ri−1,jA, ri−1,j−1A also represent infeasible (sub-)solutions and limit of the search space. As introduced in Section 3.1, the score of the optimal suffix alignment ωπ(i) in Equation (6) is estimated by the scores of exclusively match state transitions in the suffix Mk→Mk+1,i≤k<Lπ ⁠. Threshold θπ then is employed as discriminator whether a significant alignment score can be reached from a subsolution ri,jA ⁠. 3.5 Call domains All relevant alignments of a profile π with a variant sequence X that exceed threshold θπ are considered hits (h). However, the Viterbi Algorithm described in Supplementary Algorithm S2 possibly identifies multiple hits h∈Hπ,X of the same profile that mutually overlap on X, especially in the case of repetitive domains that carry intrinsic symmetries. In order to globally quantify the changes in the domain landscape introduced by AS, we determine for each profile π a subset of compatible (i.e. non-overlapping) hits Hπ,X*⊆Hπ,X ⁠. To this end the AstaFunk algorithm performs a greedy selection of hits to be included in a set Hπ,X* of non-overlapping hits between the profile π and the sequence X: all initial hits h∈Hπ,X are iterated according to their alignment score ranking score(h), from high to low scores, and h is included in Hπ,X*iff the aligned interval of X does not overlap with any previously selected hit h*∈Hπ,X* of the same profile π. Note that this greedy selection does neither guarantee the number nor the sum of scores of accepted hits h*∈Hπ,X* to be maximal. However, in agreement with the underlying biology our greedy selection prioritises high-scoring alignments of a domain as the most trustful predictions. Eventually, each hit h*∈Hπ,X* has the properties: (i) h* represents a hit, thus a significant alignment of profile π to the target sequence X provided by score(h*)− log 2(|X|)≥θπ ⁠; (ii) h* is alternatively spliced, i.e. the genomic area between the start/end positions of h* intersects with [sourcee;sinke] of at least one AS event e; (iii) h* is the best alignment of π with the corresponding subsequence of X, i.e. there exists no other hit h∈Hπ,X with score(h)>score(h*) in the subsequence of X that aligns with h* ⁠. 4 Results 4.1 Case studies of alternatively spliced protein domains Our AstaFunk method employs and branch-and-bound heuristic when computing the Viterbi Dynamic Programming matrices (Section 3.4) but a greedy selection step to resolve overlapping alignments of the same domain profile (Section 3.5). We therefore first sought to assess whether AstaFunk correctly identifies domains in protein sequences in a case study of Zinc-Finger ZnF proteins from the Krüppel family of transcription factors (Resch et al., 2004). To this end we compared the domains found by our method to complementary predictions by the HMMER3.2b software (Eddy, 2011). Supplementary Table S1 shows that using the set of 14 831 manually curated profiles from the Pfam-A database v.27 (Finn et al., 2014), AstaFunk predicts in most cases the same number of domains as HMMER3, as can be expected when optimizing the same scoring function on the same input. To further investigate the marginal differences in the number of C2H2-zinc domains, we compared the locations of this domain in the protein ZNF74 to the domain reference annotations from the PROSITE (Sigrist et al., 2013) and from the SMART (Letunic et al., 2015) database, obtained from Ensembl (release 87). Our results indicate that the additional domains predicted by our AstaFunk heuristics are reproducing well the locations annotated by the two reference databases (Fig. 2a). Fig. 2. View largeDownload slide Case studies of the AstaFunk pipeline. (a) The schema shows the positions (x-axis) of reference annotations obtained from the PROSITE and the SMART database in comparison to domain predictions by the AstaFunk, respectively, by the HMMER3 program in the zinc finger protein 74 (transcript ZNF74-202, accession identifier ENST00000611540). Annotations of the KRAB box domain (PF01352.22) are depicted by grey boxes, and annotations of the C2H2-zinc finger domains (PF00096.21) by white boxes. (b) The alternative first exons 1A and 1B of the CLDN18 gene (ENSG00000066405.8) are overlapping with a Claudin family protein domain. The bottom panel shows in black the score (Bs = bit score) and the position of AstaFunk domain predictions relative to splicing patterns annotated by Ensembl reference transcripts (in grey). The top panel superimposes exon quantifications obtained from the GTEx Project to these exon–intron structures (x-axis) and indicates a tissue selectivity for exon 1A in stomach, whereas exon 1B is preferentially included in lung. We computed for each exon the inclusion level relative to the highest exon expression level of a transcript (y-axis) Fig. 2. View largeDownload slide Case studies of the AstaFunk pipeline. (a) The schema shows the positions (x-axis) of reference annotations obtained from the PROSITE and the SMART database in comparison to domain predictions by the AstaFunk, respectively, by the HMMER3 program in the zinc finger protein 74 (transcript ZNF74-202, accession identifier ENST00000611540). Annotations of the KRAB box domain (PF01352.22) are depicted by grey boxes, and annotations of the C2H2-zinc finger domains (PF00096.21) by white boxes. (b) The alternative first exons 1A and 1B of the CLDN18 gene (ENSG00000066405.8) are overlapping with a Claudin family protein domain. The bottom panel shows in black the score (Bs = bit score) and the position of AstaFunk domain predictions relative to splicing patterns annotated by Ensembl reference transcripts (in grey). The top panel superimposes exon quantifications obtained from the GTEx Project to these exon–intron structures (x-axis) and indicates a tissue selectivity for exon 1A in stomach, whereas exon 1B is preferentially included in lung. We computed for each exon the inclusion level relative to the highest exon expression level of a transcript (y-axis) Next, we employed the AstaFunk pipeline to investigate TS AS events reported earlier in the RBFOX1 gene (Castle et al., 2008), in the SLC25A3 gene (Wang et al., 2008), and in the CLDN18 gene (Tureci et al., 2011). First, we ran the AstaFunk pipeline to systematically predict all protein domains that overlap AS events described by the Gencode (v19) annotation in these genes. Figure 2b, Supplementary Figures S3 and S4 show that the domain predictions on the alternative transcripts of each gene vary in their length and in their score (Supplementary Data S1). According to our results, the score of the Claudin domain in the CLDN18 gene decreases when including the downstream exon 1B (score 114.9) instead of the alternative upstream exon 1A (score 120.7, shown in the bottom panel of Fig. 2b). The splicing of exon 14 in the RBFOX1 gene increases the score of the predicted Fox-1C domain (score 155.9 and 174.9) as compared with the splicing of the mutually exclusive exon 15 downstream (score 146.6 and 165.7, Supplementary Fig. S3). In contrast, the mitochondrial carrier domain of SLC25A3 yields a slightly lower score (77.0) when including the upstream exon 3A instead of its mutually exclusive counterpart 3B further downstream (scoring 78.3 and 78.6, Supplementary Fig. S4). In order to further assess the earlier reported tissue specificity of AS in these three genes, we next superimposed exon quantifications (Supplementary Data S2) based on RNA-Seq data produced by the GTEx Project (Ardlie et al., 2015) to these alternatively spliced domains. In agreement with previous studies related to the CLDN18 gene (Niimi et al., 2001; Tureci et al., 2011), these quantifications demonstrate a TS selectivity for the alternative inclusion of exon 1A, carrying the domain alignment that agrees better with the Pfam reference, in stomach (on average 1460.3 in exon 1A versus 44.4 reads in exon 1B, solid line in the top panel of Fig. 2b). In contrast, lung samples preferentially include exon 1B (488.7 in exon 1B versus 0.4 reads in exon 1A, dashed line), which yields a slightly worse alignment score with the Claudin domain profile (bottom panel of Fig. 2b). Similarly, our results also confirm well the previously reported tissue preferences of RBFOX1, where exon 14 preferentially spliced in brain and exon 15 preferentially spliced in skeletal muscle (Castle et al., 2008), and of SLC25A3, where exon 3A preferentially spliced in muscle tissues whereas exon 3B is preferred in testis and liver (Wang et al., 2008). The AstaFunk domain prediction scores of these case studies are summarized in Supplementary Data S1, and the reference read counts in each exon in Supplementary Data S2 (Supplementary Figs S3 and S4). 4.2 Evaluation of the AstaFunk time complexity A formal estimation of the AstaFunk algorithmic time complexity is not trivial since many algorithmic steps depend on specific attributes of the analyzed transcriptome and/or of a specific gene, for instance the number and mutual distance of AS events, the type and the number of domains found in the longest ORF of a gene, the agreement between the variant sequences and a domain profile during branch-and-bound DP, and the number and mutual overlap of domains predicted in variant sequences. We therefore based our evaluation of the time complexity on empiric results, comparing the performance of the AstaFunk method with the naïve approach of exhaustively scanning all alternative transcript sequences with the corresponding domain profiles. For our benchmark, we employed the following genome assemblies and transcriptome annotations downloaded from the UCSC Genome Browser (Kent et al., 2002): FlyBase Genes (Crosby et al., 2007; Smith et al., 2007) aligned to the dm3 genome of Drosophila melanogaster, WormBase Genes (Harris et al., 2003) aligned to the ce3 genome of Caenorhabditis elegans, and RefSeq transcripts (Pruitt et al., 2014), UCSC Genes (Rosenbloom et al., 2015), as well as the Gencode Comprehensive Gene Collection v19 (Harrow et al., 2012) aligned to the hg19 genome of Homo sapiens. After obtaining these input annotations from the corresponding website (Supplementary Table S2), we preprocessed the transcripts as follows (Supplementary Table S3): the total number of PC loci was first reduced to loci with multiple (i.e. >1) alternatively spliced PC transcripts; from these, we further removed transcripts with a length of the annotated ORF that is not multiple of three in the genomic alignment, or with an in-frame stop codon in the in silico translation of the genomic sequence; then we scanned the longest ORF of each locus heuristically for significant Pfam-A profile hits (Eddy, 2011) and disconsidered loci without any domain hit. For all the remaining loci we stored the domains that produced hit(s) in order to reduce the domain space that is to be explored in the subsequent benchmark (Supplementary Table S4). Table 1a summarizes the attributes of thus obtained ‘PCAS loci’ with ≥2 PC transcripts remaining after the previous filtering process, the corresponding sum of alternative PCAS ORFs and PCAS transcripts they produce, as well as the total number of ‘PCAS events’ collected from their splicing patterns (Foissac and Sammeth, 2007). Row ‘PCAS events per gene’ then lazily computes the fraction of (PCAS events)/(PCAS loci). Our results in Table 1b show a gradually increasing reduction of the search space in terms of number of amino acids that are explored by the AstaFunk method when comparing WormBase to FlyBase and with the human RefSeq annotation (44%, 49% and 53%), about proportional to the number of PC AS events per gene in each dataset (1.7, 1.9 and 2.3). This reduction of the search space achieved by delimiting domain scans to the regions of AS events is also reflected by the speedup of the total runtime (∼43%timee Table 1c), although the overall gain in performance varies with specific attributes of each transcriptome; the improvement obtained by AS independent branch-and-bound heuristics shows expectedly a rather constant performance comparing the evaluated transcriptomes (∼25% on average, Supplementary Table S5), which does not affect the reported speedup since we employed in our benchmark branch-and-bound heuristics for both, exhaustive as well as AstaFunk domain searches. Table 1. Comparison of the characteristics of (a) the benchmarked transcriptomes, (b) the reduction of the search space and (c) running time improvement achieved by the AstaFunk method in the D. melanogaster, in the Caenorhabditis elegans and in three Homo sapiens transcriptome annotations (columns) WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 Note: All running times were estimated by the CPU time reported when monitoring the program execution on an Intel Xeon E5-4650 processor clocked at 2.7 GHz by the time command. Table 1. Comparison of the characteristics of (a) the benchmarked transcriptomes, (b) the reduction of the search space and (c) running time improvement achieved by the AstaFunk method in the D. melanogaster, in the Caenorhabditis elegans and in three Homo sapiens transcriptome annotations (columns) WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 WormBase (ce6) FlyBase (dm3) RefSeq (hg19) UCSC (hg19) Gencode (hg19) (a) Transcriptome PCAS loci 1,601 1,771 6,397 9,872 12,744 PCAS ORFs (non-redundant) 4,172 5,259 19,934 38,059 54,716 PCAS transcripts 5,918 6,244 22,340 42,045 81,091 PCAS events 2,771 3,288 14,609 33,756 76,554 PCAS events per locus 1.7 1.9 2.3 3.4 6.0 (b) Space Amino acids (all ORFs) 2,783,189 4,289,104 11,827,322 21,470,870 24,205,084 Amino acids in query 8,967,542 14,339,424 40,941,621 78,169,360 89,084,428 Amino acids aligned 5,007,648 7,310,503 19,130,559 55,891,207 68,720,579 Search space reduction [%] 44.2 49.0 53.3 28.5 22.9 (c) Time Time Exhaustive search [h] 0.57 0.9 2.22 9.1 10.29 Time AstaFunk [h] 0.33 0.45 1.18 6.35 7.62 Speedup [%] 42.5 49.8 46.93 30.2 25.97 Note: All running times were estimated by the CPU time reported when monitoring the program execution on an Intel Xeon E5-4650 processor clocked at 2.7 GHz by the time command. However, our evaluation also demonstrates that the speedup of the AstaFunk method decreases when the AS diversity annotated in a transcriptome further increases, from 47% in the human RefSeq, to 30% in the human UCSC and 26% in the human Gencode annotation (Table 1c). This reduction in efficiency is apparently caused by the overhead of additional operations that are necessary for merging AS events in transcriptomes with high AS event densities (3.4 and 6 AS events per PC locus, Table 1a). Indeed, when considering the proportion of AS events that have to be merged because they overlap by Δπ regions, we observe that RefSeq comprises the comparatively highest proportion of loci with few (0–10%) merged events, whereas the UCSC and the Gencode annotation show less such loci. Vice versa, the proportion of loci with most of their AS events being merged follows the ranking Gencode > UCSC > RefSeq (Supplementary Fig. S5). 4.3 The impact of alternative splicing on protein domains We then employed our results from Section 4.2 in order to estimate the transcriptome-wide impact of AS on protein domains. Assuming the prediction scores to be a proxy for the molecular activity of domains, we analyzed the impact of AS in the Gencode annotation, the transcriptome with the highest AS activity in our benchmark (Table 1a). For our study, we clustered all AstaFunk predictions of the same domain that overlap by genomic coordinates, and under the hypothesis of purifying selection we assumed the highest scoring prediction to represent the wild-type of the domain. We then computed for each AS variant of the domain a corresponding ‘domain conservation’ coefficient by the fraction between the domain score assigned to the alternatively spliced domain and the wild-type score. Figure 3a shows that most scores assigned by AstaFunk to alternatively spliced domains preserve the wild-type score well, which implies that also changes to the corresponding domainso functionalities are likely limited. According to our analysis, domains that show AS variants with low conservation scores tend to be longer and overall less frequent in the proteome (Fig. 3b). To further investigate an alternatively spliced domain with poor prediction score, we selected the TNNT1 gene (Supplementary Fig. S1) that harbours in total four different predictions of the troponin domain (Supplementary Fig. S6). If we assume the best prediction with a score of 170.5 to represent the wild-type, AS creates two minor modifications of the domain, one by an alternative acceptor choice in exon 12 (conservation 98.9%) and another by a postponed start codon triggered by AS in the 5′-UTR (conservation 88.7%); however, a major modification can be caused by the skipping of exon 8 that leads to the loss of a considerable portion of amino acids in the C-terminal end of the encoded troponin domain (conservation 27.5%). The phenotype of individuals that exhibit the latter splicing anomaly has been characterized as ‘nemaline myopathy’ (van der Pol et al., 2014). Fig. 3. View largeDownload slide Evaluation of the AS impact on protein domains in the human Gencode transcriptome. (a) Most of the domain predictions that differ due to AS changes to the coding sequence preserve the wild-type score well, i.e. most alternative scores fall into the [0.9; 1] interval of domain conservation. (b) Domains with AS variants that exhibit very low scores (domain conservation <0.1 of the wild-type) are on average longer (white boxes, left y-axis) and less frequent in the transcriptome (grey boxes, right y-axis) Fig. 3. View largeDownload slide Evaluation of the AS impact on protein domains in the human Gencode transcriptome. (a) Most of the domain predictions that differ due to AS changes to the coding sequence preserve the wild-type score well, i.e. most alternative scores fall into the [0.9; 1] interval of domain conservation. (b) Domains with AS variants that exhibit very low scores (domain conservation <0.1 of the wild-type) are on average longer (white boxes, left y-axis) and less frequent in the transcriptome (grey boxes, right y-axis) 5 Discussion We developed AstaFunk, a computational pipeline that links AS events to protein domains and identifies alternatively spliced domains without requiring a cumbersome manual overlap between AS events (in nucleotide space) and protein domains (in amino acid space). To this end, our method combines previously published methods for the automated predicition of AS events (Foissac and Sammeth, 2007) with in silico predictions of protein domains (Eddy, 2011). To boost efficiency, our approach reduces the search space primarily by omitting redundant sequence scans of identical (i.e. constitutively spliced) parts that are intrinsic to alternatively spliced transcripts, considering each unique splicing pattern (i.e. variant) of an AS event exactly once. However, some additional operations are necessary to detect whether two neighbouring AS events overlap by a profile-specific Δπ region (Algorithm 1), and in such cases to merge the corresponding variants of neighbouring events (Algorithm 2). Our benchmark shows that the computational overhead for these additional operations is rather negligible in transcriptomes with modest AS activity, i.e. in the WormBase, FlyBase and the RefSeq collection, but the required calculations can dominate the running time when the density of AS events is as elevated as in the UCSC and the Gencode annotation of the human transcriptome (Table 1). Nevertheless, in all of the transcriptomes we investigated, AstaFunk reduces the CPU time needed for computing the genome-wide set of domains in AS events by > 1/4 of the time consumed by an exhaustive search approach. Since we employed for our benchmark only a heuristically determined subset of domain profiles, i.e. domains occurring in the longest ORF of each gene, the speedup that we measure here in hours extrapolates to a reduction of days or weeks when performing a full exhaustive search of all Pfam domains in each gene. As a proof of principle, our results show that superimposing RNAseq-based exon quantifications from the GTEx project to the AstaFunk domain predictions confirms well previously established knowledge on tissue-selective preferences of mutually exclusive exons in three in-depth studied genes. Since our AstaFunk protocol is not restricted to a specific database of gene models or AS events, it also provides the potential to automatically and exhaustively detect AS changes to protein domains implied by generic RNA-Seq results in combination with ab initio and/or de novo gene predictors. But also when employing AS events from reference gene annotations, the AstaFunk pipeline could provide a unified platform for producing comparable results on the impact of transcriptional modifications on protein domains. If we believe that a comparison of the scores computed based on the Pfam models allows us to make conclusions about the molecular activity of a domain, then the predictions produced by AstaFunk enable us to suggest that most AS events impact the domains they affect only marginally. These observations agree well with complementary proteomics studies (Tress et al., 2017), that conclude that the impact of AS in the proteome is likely less severe than the plethora of splicing variants found in transcriptome interrogations would suggest (Djebali et al., 2012). However, some of the score changes predicted by AstaFunk on AS configurations of a domain indicate that the consequences on the protein functionality could be non-negligible. One such example is the skipping of exonic sequences in an AS variant of the TNNT1 gene that leads to the predicted loss of the C-terminal part of its essential troponin domain and was linked to a pathogen phenotype described as nemaline myopathy. Although certainly only a minority of individuals suffer this AS-triggered domain malfunction, the example demonstrates the exploratory power provided by our AstaFunk pipeline to detect potential consequences of AS on protein domains, which later on can be further explored by complementary genetic or proteomic studies. Funding This work was supported by the National Counsel of Technological and Scientific Development [132455/2013-7 and 140941/2015-0 to V.C. 310132/2015-0 to M.S.]; and the Research Support Foundation of the State of Rio de Janeiro [E_06/2015 to M.S.]. Conflict of Interest: none declared. References Black D. ( 2000 ) Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology . Cell , 103 , 367 – 370 . Google Scholar Crossref Search ADS PubMed Buljan M. et al. ( 2012 ) Tissue-specific splicing of disordered segments that embed binding motifs rewires protein interaction networks . Mol. Cell , 46 , 871 – 883 . Google Scholar Crossref Search ADS PubMed Castle J.C. et al. ( 2008 ) Expression of 24, 426 human alternative splicing events and predicted cis regulation in 48 tissues and cell lines . Nat. Genet ., 40 , 1416 – 1425 . Google Scholar Crossref Search ADS PubMed Chothia C. et al. ( 2003 ) Evolution of the protein repertoire . Science , 300 , 1701 – 1703 . Google Scholar Crossref Search ADS PubMed Crosby M.A. et al. ( 2007 ) Flybase: genomes by the dozen . Nucleic Acids Res ., 35 , D486 – D491 . Google Scholar Crossref Search ADS PubMed Djebali S. et al. ( 2012 ) Landscape of transcription in human cells . Nature , 489 , 101 – 108 . Google Scholar Crossref Search ADS PubMed Durbin R. et al. ( 1998 ). Biological Sequence Analysis: Probabilistic Models of Proteins Nucleic Acids . 1 st edn. Cambridge University Press . Eddy S.R. ( 2011 ) Accelerated profile hmm searches . PLoS Comput. Biol ., 7 , e1002195 – e1002116 . Google Scholar Crossref Search ADS PubMed Finn R. et al. ( 2014 ) Pfam: the protein families database . Nucleic Acids Res ., 42 , D222 – D230 . Google Scholar Crossref Search ADS PubMed Foissac S. , Sammeth M. ( 2007 ) Astalavista: dynamic and flexible analysis of alternative splicing events in custom gene datasets . Nucleic Acids Res ., 35 , w297 – w299 . Google Scholar Crossref Search ADS PubMed Ardlie K.G. et al. ( 2015 ) The genotype-tissue expression (gtex) pilot analysis: multitissue gene regulation in humans . Science , 348 , 648 – 660 . Google Scholar Crossref Search ADS PubMed Harris T.W. et al. ( 2003 ) Wormbase: a cross-species database for comparative genomics . Nucleic Acids Res ., 31 , 133 – 137 . Google Scholar Crossref Search ADS PubMed Harrow J. et al. ( 2012 ) Gencode: the reference human genome annotation for the encode project . Genome Res ., 22 , 1760 – 1774 . Google Scholar Crossref Search ADS PubMed Hegyi H. et al. ( 2011 ) Verification of alternative splicing variants based on domain integrity, truncation length and intrinsic protein disorder . Nucleic Acids Res ., 39 , 1208 – 1219 . Google Scholar Crossref Search ADS PubMed Kaessmann H. et al. ( 2002 ) Signatures of domain shuffling in the human genome . Genome Res ., 12 , 1642 – 1650 . Google Scholar Crossref Search ADS PubMed Kent W.J. et al. ( 2002 ) The human genome browser at UCSC . Genome Res ., 12 , 996 – 1006 . Google Scholar Crossref Search ADS PubMed Keren H. et al. ( 2010 ) Alternative splicing and evolution: diversification, exon definition and function . Nat. Rev . 11 , 345 – 355 . Google Scholar Crossref Search ADS Lappalainen T. et al. ( 2013 ) Transcriptome and genome sequencing uncovers functional variation in humans . Nature , 501 , 506 – 511 . Google Scholar Crossref Search ADS PubMed Letunic I. et al. ( 2015 ) Smart: recent updates, new developments and status in 2015 . Nucleic Acids Res ., 43 , D257 – D260 . Google Scholar Crossref Search ADS PubMed Light S. , Elofsson A. ( 2013 ) The impact of splicing on protein domain architecture . Curr. Opin. Struct. Biol ., 23 , 451 – 458 . Google Scholar Crossref Search ADS PubMed Liu S. , Altman R.B. ( 2003 ) Large scale study of protein domain distribution in the context of alternative splicing . Nucleic Acids Res ., 31 , 4828 – 4835 . Google Scholar Crossref Search ADS PubMed Moore A.D. et al. ( 2008 ) Arrangements in the modular evolution of proteins . Trends Biochem. Sci ., 33 , 444 – 451 . Google Scholar Crossref Search ADS PubMed Niimi T. et al. ( 2001 ) claudin-18, a novel downstream target gene for the t/ebp/nkx2.1 homeodomain transcription factor, encodes lung- and stomach-specific isoforms through alternative splicing . Mol. Cell. Biol ., 21 , 7380 – 7390 . Google Scholar Crossref Search ADS PubMed Oltean S. , Bates D.O. ( 2014 ) Hallmarks of alternative splicing in cancer . Oncogene , 33 , 5311 – 5318 . Google Scholar Crossref Search ADS PubMed van der Pol W.L. et al. ( 2014 ) Nemaline myopathy caused by tnnt1 mutations in a Dutch pedigree . Mol. Genet. Genomic Med ., 2 , 134 – 137 . Google Scholar Crossref Search ADS PubMed Pruitt K.D. et al. ( 2014 ) Refseq: an update on mammalian reference sequences . Nucleic Acids Res ., 42 , D756 – D763 . Google Scholar Crossref Search ADS PubMed Resch A. et al. ( 2004 ) Assessing the impact of alternative splicing on domain interactions in the human proteome . J. Proteome Res ., 3 , 76 – 83 . Google Scholar Crossref Search ADS PubMed Rosenbloom K.R. et al. ( 2015 ) The ucsc genome browser database: 2015 update . Nucleic Acids Res ., 43 , D670 – D681 . Google Scholar Crossref Search ADS PubMed Sammeth M. ( 2009 ) Complete alternative splicing events are bubbles in splicing graphs . J. Comput. Biol ., 16 , 1117 – 1140 . Google Scholar Crossref Search ADS PubMed Sammeth M. et al. ( 2008 ) A general definition and nomenclature for alternative splicing events . PLoS Comput. Biol ., 4 , e1000147. Google Scholar Crossref Search ADS PubMed Sigrist C.J.A. et al. ( 2013 ) New and continuing developments at prosite . Nucleic Acids Res ., 41 , D344 – D347 . Google Scholar Crossref Search ADS PubMed Smith C.D. et al. ( 2007 ) The release 5.1 annotation of Drosophila melanogaster heterochromatin . Science , 316 , 1586 – 1591 . Google Scholar Crossref Search ADS PubMed Tazi J. et al. ( 2009 ) Alternative splicing and disease . Biochim. Biophys. Acta , 1792 , 14 – 26 . Google Scholar Crossref Search ADS PubMed Tress M.L. et al. ( 2007 ) The implications of alternative splicing in the encode protein complement . PNAS , 104 , 5495 – 5500 . Google Scholar Crossref Search ADS PubMed Tress M.L. et al. ( 2017 ) Alternative splicing may not be the key to proteome complexity . Trends Biochem. Sci ., 42 , 98 – 110 . Google Scholar Crossref Search ADS PubMed Tureci O. et al. ( 2011 ) Claudin-18 gene structure, regulation, and expression is evolutionary conserved in mammals . Gene , 481 , 83 – 92 . Google Scholar Crossref Search ADS PubMed Vibranovski M.D. et al. ( 2005 ) Signs of ancient and modern exon-shuffling are correlated to the distribution of ancient and modern domains along proteins . J. Mol. Evol ., 61 , 341 – 350 . Google Scholar Crossref Search ADS PubMed Wang E.T. et al. ( 2008 ) Alternative isoform regulation in human tissue transcriptomes . Nature , 456 , 470 – 476 . Google Scholar Crossref Search ADS PubMed Wetlaufer D.B. ( 1973 ) Nucleation, rapid folding, and globular intrachain regions in proteins . Proc. Natl Acad. Sci. USA , 70 , 697 – 701 . Google Scholar Crossref Search ADS © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

BioinformaticsOxford University Press

Published: Nov 15, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off