The Transcriptome Analysis and Comparison Explorer—T-ACE: a platform-independent, graphical tool to process large RNAseq datasets of non-model organisms

BIOINFORMATICS ORIGINAL PAPER, Vol. 28 no. 6 2012, pages 777–783, doi:10.1093/bioinformatics/bts056
Sequence analysis. Advance Access publication January 27, 2012

E. E. R. Philipp (1,∗,†), L. Kraemer (1,∗,†), D. Mountfort (2), M. Schilhabel (1), S. Schreiber (1) and P. Rosenstiel (1,∗)

(1) Department of Cell biology, Institute of Clinical Molecular Biology, Christian-Albrechts-University Kiel, Schittenhelmstrasse 12, 24105 Kiel, Germany and (2) Cawthron Institute, 98 Halifax Street East Nelson 7010, Private Bag 2, Nelson 7042, New Zealand

Associate Editor: Ivo Hofacker

ABSTRACT

Motivation: Next generation sequencing (NGS) technologies allow a rapid and cost-effective compilation of large RNA sequence datasets in model and non-model organisms. However, the storage and analysis of transcriptome information from different NGS platforms is still a significant bottleneck, leading to a delay in data dissemination and subsequent biological understanding. Especially database interfaces with transcriptome analysis modules going beyond mere read counts are missing. Here, we present the Transcriptome Analysis and Comparison Explorer (T-ACE), a tool designed for the organization and analysis of large sequence datasets, and especially suited for transcriptome projects of non-model organisms with little or no a priori sequence information. T-ACE offers a TCL-based interface, which accesses a PostgreSQL database via a php-script. Within T-ACE, information belonging to single sequences or contigs, such as annotation or read coverage, is linked to the respective sequence and immediately accessible. Sequences and assigned information can be searched via keyword- or BLAST-search. Additionally, T-ACE provides within and between transcriptome analysis modules on the level of expression, GO terms, KEGG pathways and protein domains. Results are visualized and can be easily exported for external analysis. We developed T-ACE for laboratory environments, which have only a limited amount of bioinformatics support, and for collaborative projects in which different partners work on the same dataset from different locations or platforms (Windows/Linux/MacOS). For laboratories with some experience in bioinformatics and programming, the low complexity of the database structure and open-source code provides a framework that can be customized according to the different needs of the user and transcriptome project.

Contact: e.philipp@ikmb.uni-kiel.de; l.kraemer@ikmb.uni_kiel.de; p.rosenstiel@mucosa.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on June 27, 2011; revised on January 20, 2012; accepted on January 24, 2012

∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

1 INTRODUCTION

Recent advances in sequencing technology have led to an increasing accumulation of transcriptomic (RNAseq) data. Although most of the data has been generated in model organisms for which a high number of annotated sequences and complete genomes are available, increasing transcriptomic sequence data for non-model organisms are generated, for which only little or no a priori sequence information exists. The analysis of the obtained sequences and/or contigs, therefore, relies on comparative analysis with annotated genes or protein domains of other organisms. A multimodal comparison with a high number of sequences from different organisms and databases, e.g. NCBI, UniProtKB, Gene Ontology (GO), KEGG (Ashburner et al., 2000; Kanehisa and Goto, 2000), should be pursued to gain a wider picture of the putative biological role of a specific sequence. This approach, however, increases the quantity of information per sequence and adds to the already large amount of data. Not only bioinformatic processing of sequences (i.e. cleaning, assembly and annotation) is still a bottleneck in RNASeq, but so is the subsequent analysis of annotated sequences. The analyzing scientist is faced with the problem of organizing the manifold BLAST hits, protein domains, GO terms and KEGG pathway information assigned to ten thousands of individual sequences, together with the mere sequence information such as nucleotide and translated protein sequence, protein domain organization or, in case of assembled contigs, read coverage.

To date a number of software solutions exist for the assembly and annotation of transcriptomes. Major efforts have been put into the development of reference-guided and de novo assembly tools. Several assemblers such as MIRA (Chevreux et al., 2004), Newbler (Roche/454 Life Sciences), CAP3 (Huang and Madan, 1999), Velvet (Zerbino and Birney, 2008) and others were developed by researchers together with software and sequencing companies, of which some are especially designed for different NGS applications e.g. 454, Illumina or SOLID. It must be noted, that de novo RNA sequence assembly is a particularly difficult bioinformatic problem, as e.g. the existence of transcript isoforms usually results in many different contigs that cannot be merged in a simple fashion (for review see Kumar and Blaxter, 2010; Martin and Wang, 2011). Beyond the mere sequence assembly, a suite of software solutions for clustering (Partigen), protein prediction (prot4EST) and GO, KEGG and EC annotation (annot8r, Blast2Go, AutoFACT) of non-model organism EST data has been developed by different working groups (Conesa et al., 2005; Koski et al., 2005; Parkinson et al., 2004; Schmid and Blaxter, 2008; Wasmuth and Blaxter, 2004). To work with the results tables of the various software tools, non-bioinformatically skilled biologists, however, rely on user-friendly front ends which preferably work on various operating systems (Windows/Linux/MacOS). Within the Generic Model Organism Database project (GMOD, gmod.org), the web front end TRIPAL and also Gbrowse was developed for the Chado database structure. Also annot8r recommends visualization by Gbrowse, whereas Blast2Go has an integrated user interface to view and manage the annotated sequence data. These currently available interfaces are, however, predominantly focused on the visualization of genomes and their associated annotation-, expression- or publication data. Concerning transcriptome studies, however, databases and interfaces especially designed for transcriptome studies with analysis modules going beyond a mere display of read count numbers are still missing.

Fig. 1. Schematic overview of the generation and annotation of a T-ACE database. After creating a new database there are the following steps to
perform: (i) adding sequence information: sequence information (protein or nucleotides) is added by uploading an ACE- or FASTA-file into the database. Uploading an ACE-file will not only add contig entries, but will also provide information about the positioning of the reads in each contig. Nucleotide/contig sequences can be directly translated into protein database entries. (ii) Blasting: nucleotide/contig or protein entries can be blasted at NCBI with the ‘NCBI-BLAST’-module or BLAST results imported into a database from a text file. The BLAST annotation can be used to deduce GO, InterProScan or KEGG annotations or Blast2GO results in .annot format can be imported into a database. (iii) InterProScan: contig or protein entries can be annotated by InterProScan, for this a custom installation of InterProScan with web interface is needed. Alternatively, InterProScan result files, in .raw format, can be imported into a database. InterProScan hits also lead to GO annotations, which can be used to deduce KEGG annotations.

Ideally, a software tool for transcriptome analysis will link all available information for a single transcript, and also enable a transcriptome wide overview and analysis on transcript and read count (expression) level. Information and results files should be exportable in a common format and the tool should allow an implementation of additional analysis modules for customization to specific needs. In order to serve these needs, we developed the software tool T-ACE with a Windows/Linux/MacOS interface to organize and analyze large amounts of transcriptome data, especially of non-model organisms with limited sequence information.

2 METHODS

2.1 Installation

All components of T-ACE are written as TCL scripts using the TCL/TK 8.5 software. Most of the scripts depend on additional TCL packages, such as: bwidget v1.9.2, tablelist, tclthread v2.6.5 (and libpgtcl v1.7, in case of the T-ACEpg version or the T-ACE_DB_Manager). For the full function of T-ACE, the additional software tools such as InterProScan v4.6 (Hunter et al., 2009), NCBI-BLAST-2.2.25+, PHOBOS v3.3.2 (Mayer, 2007) and Primer3 v1.1.4 (Rozen and Skaletsky, 2000) are required. T-ACE is based on a PostgreSQL 8.4 database system. For non-local use, the PostgreSQL-server can be accessed via a PHP-enabled Apache 2.0 Web server. T-ACE and its necessary PostgreSQL-server runs on any standard computer, but the performance depends on the size of the examined dataset. The biggest dataset tested so far contains ∼120 000 contigs, 400 000 protein open reading frames (ORFs), over 3.1 million reads and according annotations. This database currently runs without difficulty on a dual-core unix system with 8 GB RAM. The T-ACE client can be executed on the same machine without memory issues or high processor load (4 GB RAM should be sufficient).

Two different versions of T-ACE are currently available. The ‘T-ACE’ version accesses the PostgreSQL database through a php script, which has to run on the database server. With this version Pgtcl is not needed for running the T-ACE client. The ‘T-ACEpg’ version accesses the PostgreSQL database directly. For this version, the Pgtcl package is needed. In both versions, the T-ACE_DB_Manager accesses the database directly, therefore needs the Pgtcl package. A scheme of the T-ACE database is given in Supplementary Figure S1. After the required software is installed, T-ACE.tcl and T-ACE_DB_Manager.tcl should be executable. Detailed instructions, such as information about the additional software and its installation or how to set up the parent database, are described in the T-ACE manual and webpage (http://www.ikmb.uni-kiel.de/tace/).

2.2 Implementation

General comment: T-ACE was developed for the analysis and organization of transcriptome projects but is also helpful for the organization of small sequence datasets e.g. extracts of a large transcriptome databases. The current version of T-ACE does not provide an assembly function, thus data gained by NGS projects (e.g. 454, Illumina, SoliD) have to be assembled prior to the upload (e.g. using Newbler, Celera or TGICL). T-ACE currently accepts ACE or SAM files from different assembly and alignment programs. It must be emphasized that the choice of the assembler and the assembly strategy may result in slightly different models that represent a given transcriptome (Kumar and Blaxter, 2010; Martin and Wang, 2011). Depending on the type of study, the assembly strategy has to be carefully evaluated and the highlighted assembly results have to be validated for each novel dataset. T-ACE does offer the option of an automatic BLAST against NCBI databases and InterProScan, but sequence annotation can be also undertaken outside T-ACE with sometimes more sophisticated annotation tools (e.g. Blast2Go) and the results files are loaded into T-ACE for further analysis (Fig. 1). For transcriptome comparisons, T-ACE is currently designed to work with a dataset composed of several transcriptomes of different treatments or tissues, which are assembled into a consensus transcriptome (Fig. 2). Subsequently, reads of the different transcriptomes are again mapped against the consensus transcriptome for information of transcriptome expression pattern i.e. number of reads per specific contig. Throughout the text, we will use the term ‘transcriptome’ for sequences (i.e. reads) gained from different samples and ‘consensus transcriptome’ for the contigs gained from the assembly of all ‘transcriptome’ sequences. Together with the annotations this builds the ‘database’.

Example datasets: in the following, we will give examples for different features of T-ACE using two independent datasets generated by 454 pyrosequencing (Roche/454 Life Sciences).
One dataset was generated by Craft et al. (2010) for the marine mussel Mytilus galloprovincialis and was downloaded from the MG-RAST portal (download 12/10; http://metagenomics.nmpdr.org/; Meyer et al., 2008). The second dataset contains sequences of the marine invader species Sabella spallanzanii and was sequenced at the Institute of Clinical Molecular Biology (ICMB Kiel, Germany) in cooperation with the Cawthron Institute (Dr D. Mountfort, New Zealand). Both datasets can be used to test T-ACE in the cloud. Detailed instructions how to load the T-ACE client and enter the databases are given on the T-ACE webpage (http://www.ikmb.uni-kiel.de/tace/) and in video tutorials on the webpage.

The M.galloprovincialis database contains sequences of foot, gill, digestive gland and mantle tissue. Sequences were quality controlled and cleaned for primer and adapter sequences (Smart primer sequences and 454 adapters) as well as polyA tails by ‘seqclean’ and ‘cln2qual’ (TGI, The Gene Index Project) before the assembly. Reads <40 bp after the quality control and cleaning were excluded from further sequence assembly and annotation. The trimmed reads were assembled with the ‘GS De novo Assembler 2.3’ from Roche (Newbler, Roche/454 Life Sciences), as a first step. The standard parameters of the ‘GS De novo Assembler 2.3’ were used for this initial assembly (‘minimum overlap length’ = 40; ‘minimum overlap identity’ = 90). Afterwards the resulting contigs and singletons were further assembled in multiple rounds using the TGICL (Cap3) assembler. The ‘minimum overlap length’ varied between 40 and 300 bp and the ‘minimum overlap identity’ between 85 and 100%. In total, 104 123 MG-RAST sequences (average length 211 bp) of M.galloprovincialis were assembled into 12 827 contigs (average length 279 bp) and 40 972 singletons (average length 207 bp). Using AMOS (http://sourceforge.net), reads originating from the different tissues were subsequently mapped against the generated contigs and the Mytilus edulis mitochondrion genome (GI:55977238), which resulted in the final contigs and assigned reads from the different tissues (Fig. 2).

The S.spallanzanii dataset contains 86 490 read sequences deduced from fan tissue which were assembled into 4714 contigs and 21 086 singletons. More detailed information of both datasets is given on the T-ACE webpage. For both datasets, putative gene names and protein domains were assigned to all contigs by using the BLASTx algorithm against the UniProtKB/Swiss-Prot and UniRef100 protein databases of UniProt Knowledgebase (UniProtKB, http://www.expasy.org/sprot/) with an e-value cut off of 10^−3, as well as tBLASTx (e-value 10^−3) and BLASTn (e-value 10^−10) against the NCBI nucleotide database (http://www.ncbi.nlm.nih.gov). The S.spallanzanii dataset was further analyzed for conserved domains by running the assembled contigs through InterProScan (1). GO terms were deduced from BLAST and InterProScan results and KEGG information from GO and BLAST results. Datasets as well as required databases (e.g. reference list, GO term list, etc.) for performing the below described analyses can be downloaded from the T-ACE webpage (http://www.ikmb.uni-kiel.de/tace/#Package/).

Fig. 2. Illustration of the assembly approach undertaken for the M.galloprovincialis dataset generated by Craft et al. (2010). Tissue-specific sequence reads were combined and assembled into a consensus transcriptome using the ‘GS De novo Assembler 2.3’ from Roche (Newbler, Roche/454 Life Sciences) and the TGICL (Cap3) assembler. Subsequently, tissue-specific reads were mapped against the consensus transcriptome using AMOScmp (http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOScmp) for the information of tissue-specific read number per contig i.e. tissue-specific transcript expression.

3 RESULTS

3.1 Working with T-ACE

Single sequence reads or assembled contigs, with their corresponding reads (e.g. ACE and SAM files), can be uploaded into the database together with information of independently performed BLAST- and protein domain-annotations. The current version of T-ACE, for example, supports the Blast2Go annot. and InterProScan files. Alternatively, BLAST annotation and protein domain identification (InterProScan) can be performed within T-ACE. When performed in T-ACE, GO and KEGG information is deduced from the BLAST and InterProScan entries. The nucleotide and translated protein sequence of individual sequences is then linked with the respective annotations. Different characters of a single sequence such as nucleotide and amino acid sequence, protein domain composition, ORF detection or contig coverage are visualized and can be curated manually within T-ACE. Such possibilities allow a detailed inspection and refinement of the results and goes beyond what is provided by other database interfaces, such as TRIPAL, Gbrowse or tbrowse (http://code.google.com/p/tbrowse/). At the whole database level, a complete overview of the sequence, GO term and KEGG pathway composition can be calculated and visualized within the interface. Data can be easily extracted for further external analysis and graphical processing. To identify genes of interest within the database, the tool offers the option for key word- or BLAST-searches.

T-ACE is especially designed for the comparison of transcriptomes of non-model organisms without genome or large sequence information. Software solutions for such transcriptome comparisons are currently still missing, but are urgently needed due to the decreasing costs and increasing sequence numbers in NGS. T-ACE offers a first solution by using the information of the number of transcriptome specific single reads assigned to a defined contig, and changes in contig expression can be analyzed and visualized (further details below).

In T-ACE, database organization and analysis is combined. To organize sequences and results, the tool enables individual databases to be set up, to sort sequences into projects or export data on the level of FASTA files or annotation tables for further analysis in external software solutions. T-ACE consists of a PostgreSQL database and a TCL Client interface, which allows multiple users to access the Postgres server from different computers and/or various operating systems (Windows/Linux/MacOS). Results saved within the database or in individual projects can thus be interchanged between different partners working on Windows, Linux or Mac platforms. On the one hand, T-ACE is attractive for laboratory environments, which have only a limited amount of bioinformatic support and for cooperating partners working on the same dataset from different locations. On the other, for laboratories with expertise in bioinformatics and programming, T-ACE provides a framework that can be extended by adding further modules, customized according to the different needs of the user and transcriptome project. The low complexity of the database allows an easy understanding of its structure and therefore facilitates the extension and integration of new tables and functions.

3.2 Functions of T-ACE

3.2.1 Whole database overview and analysis

For a first overview of a database after the upload of sequences, basic information
about the database content, such as the number and average length of nucleotide, protein and read sequences are displayed in the ‘Database info’ window (Supplementary Fig. S2). Number of annotation entries from externally performed analysis, or after BLAST and InterProScan annotation within T-ACE, are displayed according to the origin (e.g. GO, KEGG, Pfam). All databases can be exported directly from T-ACE, which facilitates the interchange or further analysis of data outside T-ACE. The ‘Database info’ window also gives information of the transcriptomes (e.g. different treatments, tissues, populations) included in the database. These are displayed in the ‘run’-table and can be selected for whole database analysis and comparison between different transcriptomes (Supplementary Fig. S2). In the following, we will describe T-ACE functions by using a nucleotide database. It is also possible, however, to create a pure protein database or view a nucleotide database in a protein mode. By switching a nucleotide database into ‘protein’-mode, the nucleotide sequence list of the ‘Database browser’ will be replaced by a list of all protein sequences contained in the database. In this way, annotations for distinct open reading frames of a contig can, for example, be reviewed.

Database sequence statistics: to get a first insight in how different RNAseq datasets/libraries compose the database, the nucleotide coverage and percentage of partially covered contigs of the consensus transcriptome can be calculated for transcriptomes selected within the ‘run’ table (Supplementary Fig. S3), and is executed via the ‘Coverage’ button. A more detailed statistical analysis about the database content is performed in the ‘Database statistics’ window. This menu allows an overview of sequence frequency on the level of reads or contigs, as well as contig and whole transcriptome coverage on the level of reads and nucleotides. Results are visualized as graphs within the tool or can be displayed as a list (Supplementary Fig. S4) and exported from T-ACE as txt/tables for analysis in external software solutions (e.g. Excel or GraphPad Prism).

Database GO statistics: in the GO statistics, the number of contigs of a specific GO term is listed and visualized for all levels of the GO tree (Supplementary Fig. S5). GO terms are deduced from the BLAST and InterProScan results of the consensus transcriptome and sorted into the different subcategories for molecular function, cellular component and biological process. Contigs detected within lower levels of the GO tree are also listed in the parent directory at higher levels of the GO tree but not counted twice. Alternatively, GO analysis can be conducted in Blast2GO and .annot files loaded into T-ACE. The GO table is graphically visualized or the list can be directly copied and imported in external software tools for graphical processing (e.g. for the generation of GO pie charts). Contigs belonging to a specific GO term can be directly exported by right mouse click into a new project file for detailed inspection. To investigate whether a group of specific contigs e.g. which show extremely high expression levels, represent distinct GO terms, entries of a project tab can be added as red bars to the GO tree diagram of the consensus transcriptome with the ‘Compare’-button. A more detailed comparison of GO term patterns between and within RNAseq datasets can, however, be undertaken in the ‘Run compare’ tool, which is described below in the section of transcriptome comparisons.

Database KEGG maps and composition: to analyze the gene composition and regulation of specific pathways within transcriptome data, the analysis of contigs on the basis of reference pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) was implemented in T-ACE. KEGG pathway annotations for specific genes originate from the respective BLAST and/or GO annotations. Within the ‘KEGG map’ menu, an overview of all KEGG pathways identified in the consensus transcriptome is given together with the number of contigs covering this pathway. Further, the user gets an overview on which number and percentage (ko, ko%) of all pathway members within a specific reference pathway is covered by the contigs of the consensus transcriptome. Contigs are listed when clicking on the pathway to allow a more detailed investigation of the transcripts. Both tables have web links to either a pathway or a specific KEGG Orthology (KO)-ID (http://www.genome.jp/kegg/).

3.2.2 Organization and overview of sequence information

The organization and structured overview of sequence data within a database is an important component of T-ACE. In the ‘Database Browser’ window, all sequence entries of the database are listed together with the associated information such as sequence length and, in case of contigs, number of reads, as well as the number of different annotations (e.g. BLAST, GO, InterPro).

Working with single sequences and contigs: selection of a sequence entry will open detailed associated information like BLAST, GO and KEGG hits, InterProScan results, domain structure, read coverage and user-specific comments in different tabs in the lower windows of T-ACE (Fig. 3). Pop-up windows visualizing sequence read-coverage and ORF information can be opened by right click on single sequences (Fig. 4). Further options for processing single sequences can be selected e.g. by adding the sequence to the BLAST window or creating primers with Primer 3. This enables an immediate access and processing of different associated data of a sequence entry. Single and multiple sequences can be transferred and saved in a project file, which will help the user to organize groups of sequences. Project files can then be, for example, exchanged between research partners working on the same database but different platforms (Windows/Linux/MacOS).

Search for target genes or protein domains: T-ACE offers database searches for target genes or protein domains either via keyword search or BLAST analysis. Keyword searches can be performed with user-specific filters for e-values or type of annotation in which should be searched (e.g. BLAST, KEGG). To perform BLAST searches with T-ACE, either a BLAST server or a local installation of NCBI-BLAST+ can be used and the required T-ACE databases have to be available as BLAST databases. For a local BLAST installation, this can easily be done by using the ‘Create BLAST database’-module. The standard BLAST parameters are set through the ‘BLAST parameters’-option in the ‘Config’-menu. If a BLAST server is used for BLAST-analysis (selected in the ‘Config’-menu ‘BLAST configuration’), the ‘Database’-combo box will contain a list of every database available to the user. When using the ‘Local’-option, only databases situated in the BLAST_dbs folder in the T-ACE directory are listed in the ‘Database’-combo box. BLAST results are displayed in a separate window in which the alignment of the BLAST matches can be given. By running the ‘Mapping’ option, the position of the different matches on the query can be visualized (Supplementary Fig. S6). Within BLAST, the selection of ‘overlaps only’ enables a result filter that, when activated, displays only BLAST hits which reach the start or end of the query sequence. This option is useful when searching for sequences elongating a given query.

Fig. 3. Information assigned to single sequences or contigs are listed and visualized in different windows of T-ACE.

Fig. 4. Visualization of the open reading frames, read coverage and domain organization of sequence entries of the Sabella database.

3.2.3 Transcriptome comparison

An important aspect when working with RNAseq datasets from different samples (e.g. different tissues or treatments) is the easy accessibility of information about transcriptomal changes in pathways described by GO/KEGG terms, changes in domain abundance and gene expression on the contig level. Besides T-ACE, only the Blast2Go tool offers a first statistical analysis between transcriptomes. This is, however, restricted to GO-terms. In T-ACE, the information of sequence reads from different transcriptomes assigned to specific contigs of the consensus transcriptome within the database is needed as a prerequisite for transcriptome analysis. In case of a de novo assembly, as done for M.galloprovincialis and S.spallanzanii, all reads from the different transcriptomes are assembled into contigs and the information of transcriptome-specific reads per database contig are extracted (Fig. 2). In T-ACE, different options for transcriptome comparisons are implemented that can be executed in the ‘Database statistic’ window or the ‘Tool’ drop-down menu.

Expression analysis: the ‘Expression analysis’ tool investigates the origin, i.e. transcriptomal dataset affiliation of reads contained in a contig. For this, transcriptomes of the run table in the ‘Database info’-window are marked as defined groups (A or B). Results can be displayed as the number of ‘A’ and ‘B’ reads per contig on the linear or logarithmic scale as well as fold change in a list or graph (Fig. 5). If the compared transcriptomes are composed of a different number of assembled sequences, T-ACE gives the option to calculate the results on normalized values. The results can be filtered by e.g. setting minimum contig length, read number or the fold change threshold. Statistical analysis is undertaken when the A and B group each consists >2 transcriptomes. Fold change values are calculated for mean or median values and a P-value for significantly up- or downregulated contigs between groups is calculated by the Mann–Whitney U test. P-values are corrected for multiple testing using the Benjamini–Hochberg Step-Up false discovery rate-controlling procedure. Due to the still high sequencing costs, however, generally a low number of transcriptomes per treatment group (biological replicates) are generated, which is in most cases not suitable for statistical calculation and leading to non-significant P-values (P > 0.05) for up- or downregulated genes. To yet reduce the number of genes to be chosen for a subsequent more detailed investigation of expression changes e.g. by real time PCR, T-ACE holds the option to filter for ‘distinct’ differences. The number of reads per contig within groups A and B are ranked and the two groups are 100% distinctly different when the transcriptome with the lowest number of reads for a specific contig within one group (e.g. A) still has a higher number of reads compared with the transcriptome with the highest number of reads per contig in the other group (e.g. B). The percentage (%) of distinction can be set by the user. The obtained sequence entries for regulated genes can be transferred to separate project tabs by the ‘Show UP’- and/or ‘Show DOWN’-button for further analysis or export.

Fig. 5. Visualization of expression analysis when comparing different transcriptomes. As an example, the gill tissue transcriptome of the M.galloprovincialis database was set as group A and digestive gland tissue as group B and the fold change calculated for contigs >100 bp and containing minimum 20 reads. Each dot represents a sequence entry. Dots can be selected by mouse click and the information associated to the respective sequence is displayed in the different windows of T-ACE.

Analysis of GO term, KEGG pathway member or protein domain distribution within and between transcriptomes: in some cases, the composition of GO terms, KEGG pathway members or protein domains in different transcriptomal datasets or in the group of up- and downregulated genes of treatments can be more informative than investigating expression changes on the level of single genes. A high number of small expression changes of several genes within one pathway may not be detected as a major result when only looking at the single genes, but may be striking when investigated on the whole pathway level. The ‘Run compare’ tool compares multiple subsets of contigs (called cohorts) against the combined contigs of all subsets. This gives a statistical estimation on which GO terms, KEGG pathways or domains are enriched or depleted within the respective transcriptome or treatment group. Calculations can be conducted for the whole set of different GO terms and KEGG pathways, or for specific GO levels and KEGG pathways. P-values are calculated by performing the Fisher’s exact test. In Figure 6A, a typical run compare table is shown with significant enriched- or depleted-GO terms marked in green (enriched) or red (depleted). Data obtained by the run compare tool can then be exported for further analysis and graphical display as, for example, undertaken for the GO subcategory ‘Molecular function’ for different tissues of M.galloprovincialis (Fig. 6B). A similar approach is undertaken in the Blast2Go software, which uses GOSSIP (Bluthgen et al., 2005) to compare enriched GO terms between two datasets. BioMyn (http://www.biomyn.de/explore/) and skypainter (http://www.reactome.org/cgi-bin/skypainter2) are other online tools in which two lists of genes with identifiers can be uploaded for GO and pathway analysis. The analysis is, however, restricted to a limited number of species and excludes GO, KEGG and domain analysis of sequences without a proper gene BLAST hit i.e. with only a domain annotation. The advantage in T-ACE is that the two datasets to be compared do not need to be annotated and loaded separately but are within one T-ACE project. The calculation is not based on the gene identifier of a distinct species but on the previous performed GO, KEGG and domain annotation and further, not only overrepresentation but also underrepresentation of the respective terms or pathways is calculated.

3.3 Conclusion and outlook

The T-ACE was especially developed for NGS transcriptome projects of non-model organisms where significant a priori sequence information is missing on DNA and RNA level. We wanted to design a software tool, which goes beyond a webpage application for gene mining by keyword or BLAST searches. We also explicitly did not aim to compete with larger central sequence databases (e.g. short read archives of EBI and NCBI) that allow sharing of the unannotated data with the wider public. T-ACE was built in order to provide a graphical interface that can be used locally among different scientists in a single lab and also between collaborating laboratories in order to work with large amounts of transcriptome sequence data. T-ACE exhibits modules for manual curation and visualization of single contigs and underlying reads. Furthermore, statistical tools have been implemented for the analysis of differential expression or occurrence of GO, KEGG terms and protein domains comparison of transcriptomes. The relatively simple bioinformatic structure sets a framework for further integration of analysis modules and customization of the tool. Future development of T-ACE will particularly focus on such additional modules for transcriptome comparisons in the light of the increasing use of RNAseq data as ‘virtual microarrays’ i.e. a consensus transcriptome will be used as a ‘virtual microarray’ against which sequence data of short-read sequencing technologies (e.g.
SoliD, Illumina) are mapped for gene expression analysis. [14:26 29/2/2012 Bioinformatics-bts056.tex] Page: 782 777–783 Copyedited by: ES MANUSCRIPT CATEGORY: ORIGINAL PAPER T-ACE Fig. 6. (A) Example for a run compare table showing significant enriched (marked green) or depleted (marked red) GO terms, calculated for the subcategory ‘Molecular function’ in digestive gland, gill, mantle and foot tissue and compared with all GO terms found within the reference database (all tissues combined). GO terms were deduced form BLAST hits using the BLASTX algorithm against the UniProtKB/Swiss-Prot and UniRef100 protein databases of UniProt −3 −3 −10 Knowledgebase (UniProtKB, http://www.expasy.org/sprot/) with a cut off e 10 , as well as tBLASTx (e 10 ) and BLASTn (e 10 ) against the NCBI non-redundant protein (nr) database ( http://www.ncbi.nlm.nih.gov). (B) Relative abundance of GO terms (%) for the subcategory Molecular function calculated from data obtained by the ‘run compare’ tool. Arrows indicate one significantly enriched and depleted GO term within the respective tissue. Taken together, T-ACE has been designed for scientists in Conesa,A. et al. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674–3676. different laboratories working cooperatively on a central database Craft,J.A. et al. (2010) Pyrosequencing of Mytilus galloprovincialis cDNAs: tissue- and allows its users access from different terminals and platforms specific expression patterns. PLoS One, 5, e8875. (Linux/Windows/MacOS). Huang,X. and Madan,A. (1999) CAP3: A DNA sequence assembly program. Genome Res., 9, 868–877. Hunter,S. et al. (2009) InterPro: the integrative protein signature database. Nucleic Acids ACKNOWLEDGEMENTS Res., 37, D211–D215. Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. The authors would like to thank the cell biology group and Nucleic Acids Res., 28, 27–30. 
sequencing platform of the IKMB for technical support, especially Koski,L. et al. (2005) AutoFACT: An Automatic Functional Annotation and Tanja Kaacksteen, Melanie Friskovec and Anita Dietsch. We thank Classification Tool. BMC Bioinformatics, 6,151. Dr Georg Hemmrich-Stanisak for helpful comments while writing Kumar,S. and Blaxter,M. (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics, 11, 571. the manuscript as well as three anonymous reviewers. Martin,J.A. and Wang,Z. (2011) Next-generation transcriptome assembly. Nat. Rev. Genet., 12, 671–682. Funding: DFG clusters of Excellence ‘The Future Ocean’ and Mayer,C. (2007) PHOBOS – a tandem repeat search tool for complete genomes. ‘Inflammation at Interfaces’ and the DFG priority programme 1399 http://www.rub.de/spezzoo/cm. ‘Host-parasite covolution’, Genomics Analysis platform. Meyer,F. et al. (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Conflict of Interest: none declared. Bioinformatics, 9, 386. Parkinson,J. et al. (2004) PartiGene—constructing partial genomes. Bioinformatics, 20, 1398–1404. REFERENCES Rozen,S. and Skaletsky,H. (2000) Primer3 on the WWW for general users and for Ashburner,M. et al. (2000) Gene Ontology: tool for the unification of biology. Nat. biologist programmers. Methods Mol. Biol., 132, 365–386. Genet., 25, 25–29. Schmid,R. and Blaxter,M.L. (2008) annot8r: rapid assignment of GO, EC and KEGG Bluthgen,N. et al. (2005) Biological profiling of gene groups utilizing Gene Ontology. annotations. BMC Bioinformatics, 2008, 9, 180. Genome Inform., 16, 106–115. Wasmuth,J. and Blaxter,M. (2004) prot4EST: Translating Expressed Sequence Tags Chevreux,B. et al. (2004) Using the miraEST assembler for reliable and automated from neglected genomes. BMC Bioinformatics, 5, 187. mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res., Zerbino,D. and Birney,E. 
(2008) Velvet: algorithms for de novo short read assembly 14, 1147–1159. using de Bruijn graphs. Genome Res., 18, 821–829. [14:26 29/2/2012 Bioinformatics-bts056.tex] Page: 783 777–783 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press
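As an illustration of the ‘Expression analysis’ filters described in Section 3.2.3, the following is a minimal Python sketch of the fold change calculation (mean or median) and of the ‘100% distinct’ criterion. It is not taken from the T-ACE source code: the function names, the example contig names and the pseudocount that avoids division by zero are assumptions made here, and the user-adjustable percentage of distinction is left out because the text does not define how it is computed.

```python
from statistics import mean, median


def fold_change(reads_a, reads_b, use_median=False, pseudocount=1.0):
    """Fold change of group A over group B for one contig.

    reads_a / reads_b hold the read counts of the contig in each
    transcriptome of the group. The pseudocount is an assumption of this
    sketch (it avoids division by zero); the paper does not mention one.
    """
    center = median if use_median else mean
    return (center(reads_a) + pseudocount) / (center(reads_b) + pseudocount)


def is_fully_distinct(reads_a, reads_b):
    """True if the groups are '100% distinct' for this contig: the lowest
    count in one group is still higher than the highest in the other."""
    return min(reads_a) > max(reads_b) or min(reads_b) > max(reads_a)


def filter_contigs(counts, min_fold=2.0):
    """Keep contigs that are fully distinct and change at least
    min_fold-fold in either direction.

    counts maps a contig name to a (reads_a, reads_b) pair.
    """
    selected = {}
    for name, (reads_a, reads_b) in counts.items():
        fc = fold_change(reads_a, reads_b)
        changed = fc >= min_fold or fc <= 1.0 / min_fold
        if changed and is_fully_distinct(reads_a, reads_b):
            selected[name] = fc
    return selected


if __name__ == "__main__":
    # Hypothetical read counts: two transcriptomes per group.
    example = {
        "contig_0001": ([30, 50], [9, 9]),
        "contig_0002": ([30, 5], [9, 12]),
        "contig_0003": ([2, 3], [40, 44]),
    }
    for name, fc in filter_contigs(example).items():
        print(name, round(fc, 2))
```

In this hypothetical input, the second contig is dropped because its read counts overlap between the groups, which is the situation the ‘distinct’ filter is meant to exclude when there are too few replicates for a statistical test.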
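The statistical step of the ‘Expression analysis’ (Mann–Whitney U test followed by Benjamini–Hochberg correction, Section 3.2.3) can be sketched in plain Python as follows. This is an assumption-laden illustration, not the T-ACE implementation: the exact two-sided P-value is obtained here by enumerating all group assignments of the pooled ranks, which is only practical for the small replicate numbers discussed in the text, and all function names are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_exact(group_a, group_b):
    """Exact two-sided P-value of the rank-sum (Mann-Whitney U) statistic,
    computed by enumerating every way to pick group A from the pooled ranks."""
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    n_a = len(group_a)
    expected = n_a * (len(pooled) + 1) / 2
    observed_dev = abs(sum(ranks[:n_a]) - expected)
    extreme = total = 0
    for picked in combinations(range(len(pooled)), n_a):
        total += 1
        if abs(sum(ranks[i] for i in picked) - expected) >= observed_dev - 1e-12:
            extreme += 1
    return extreme / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjusted P-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Three replicates per group with no overlap at all.
    print(mann_whitney_exact([10, 12, 15], [1, 2, 3]))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

With three transcriptomes per group, even completely separated read counts give a two-sided P-value of 2/20 = 0.1, which illustrates the point made in the text that low replicate numbers lead to non-significant P-values.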
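The enrichment and depletion test of the ‘Run compare’ tool (Section 3.2.3) relies on Fisher’s exact test. The sketch below shows one way such a test can be set up for a single GO term, using only the Python standard library. How T-ACE builds its contingency table is not described in the text, so the layout used here (cohort versus the combined contigs of all cohorts, i.e. a hypergeometric model) and all names are assumptions.

```python
from math import comb


def fisher_exact_two_sided(k, n, K, N):
    """Two-sided Fisher's exact test under a hypergeometric model.

    k: contigs in the cohort annotated with the term
    n: contigs in the cohort
    K: contigs annotated with the term in the combined set
    N: contigs in the combined set (the cohort is part of it)
    """
    def prob(x):
        return comb(K, x) * comb(N - K, n - x) / comb(N, n)

    lowest = max(0, n - (N - K))
    highest = min(n, K)
    p_observed = prob(k)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def direction(k, n, K, N):
    """Label the term as enriched or depleted in the cohort."""
    if k * N > K * n:
        return "enriched"
    if k * N < K * n:
        return "depleted"
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical numbers: 9 of 10 cohort contigs carry the term,
    # compared with 10 of 20 contigs in the combined set.
    print(direction(9, 10, 10, 20), fisher_exact_two_sided(9, 10, 10, 20))
```

Because the test is two-sided, the same function reports both overrepresented and underrepresented terms; the separate `direction` helper only decides which label to attach to the result.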


Publisher
Oxford University Press
Copyright
© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bts056
pmid
22285826

assembly, as done for M.galloprovincialis and S.spallanzanii, all reads from the different transcriptomes are assembled into contigs and the information of transcriptome-specific reads per database 3.2.3 Transcriptome comparison An important aspect when contig are extracted (Fig. 2). In T-ACE, different options for working with RNAseq datasets from different samples (e.g. transcriptome comparisons are implemented that can be executed different tissues or treatments) is the easy accessibility of in the ‘Database statistic’ window or the ‘Tool’ drop-down menu. information about transcriptomal changes in pathways described by GO/KEGG terms, changes in domain abundance and gene Expression analysis: the ‘Expression analysis’ tool investigates the expression on the contig level. Besides T-ACE, only the Blast2Go origin, i.e. transcriptomal dataset affiliation of reads contained in tool offers a first statistical analysis between transcriptomes. This a contig. For this, transcriptomes of the run table in the ‘Database [14:26 29/2/2012 Bioinformatics-bts056.tex] Page: 781 777–783 Copyedited by: ES MANUSCRIPT CATEGORY: ORIGINAL PAPER E.E.R.Philipp et al. one pathway may not be detected as a major result when only looking at the single genes, but may be striking when investigated on the whole pathway level. The ‘Run compare’ tool compares multiple subsets of contigs (called cohorts) against the combined contigs of all subsets. This gives a statistical estimation on which GO terms, KEGG pathways or domains are enriched or depleted within the respective transcriptome or treatment group. Calculations can be conducted for the whole set of different GO terms and KEGG pathways, or for specific GO levels and KEGG pathways. P-values are calculated by performing the Fisher’s exact test. In Figure 6A, a typical run compare table is shown with significant enriched- or depleted-GO terms marked in green (enriched) or red (depleted). 
Data obtained by the run compare tool can then be exported for further analysis and graphical display as, for example, undertaken for the GO subcategory ‘Molecular function’ Fig. 5. Visualization of expression analysis when comparing different transcriptomes. As an example, the gill tissue transcriptome of the for different tissues of M.galloprovincialis (Fig. 6B). A similar M.galloprovincialis database was set as group A and digestive gland tissue as approach is undertaken in the Blast2Go software, which uses group B and the fold change calculated for contigs >100 bp and containing GOSSIP (Bluthgen et al., 2005) to compare enriched GO terms minimum 20 reads. Each dot represents a sequence entry. Dots can be between two datasets. BioMyn (http://www.biomyn.de/explore/) selected by mouse click and the information associated to the respective and skypainter (http://www.reactome.org/cgi-bin/skypainter2) are sequence is displayed in the different windows of T-ACE. other online tools in which two lists of genes with identifiers can be uploaded for GO and pathway analysis. The analysis is, however, info’-window are marked as defined groups (A or B). Results can be restricted to a limited number of species and excludes GO, KEGG displayed as the number of ‘A and ‘B reads per contig on the linear and domain analysis of sequences without a proper gene BLAST hit or logarithmic scale as well as fold change in a list or graph (Fig. 5). i.e. with only a domain annotation. The advantage in T-ACE is that If the compared transcriptomes are composed of a different number the two datasets to be compared do not need to be annotated and of assembled sequences, T-ACE gives the option to calculate the loaded separately but are within one T-ACE project. The calculation results on normalized values. The results can be filtered by e.g. 
is not based on the gene identifier of a distinct species but on setting minimum contig length, read number or the fold change the previous performed GO, KEGG and domain annotation and threshold. Statistical analysis is undertaken when the A and B further, not only overrepresentation but also underrepresentation of group each consists >2 transcriptomes. Fold change values are the respective terms or pathways is calculated. calculated for mean or median values and a P-value for significantly up- or downregulated contigs between groups is calculated by the Mann–Whitney U test. P-values are corrected for multiple testing using the Benjamini–Hochberg Step-Up false discovery 3.3 Conclusion and outlook rate-controlling procedure. Due to the still high sequencing costs, The T-ACE was especially developed for NGS transcriptome however, generally a low number of transcriptomes per treatment projects of non-model organisms where significant a priori group (biological replicates) are generated, which is in most cases sequence information is missing on DNA and RNA level. We not suitable for statistical calculation and leading to non- significant wanted to design a software tool, which goes beyond a webpage P-values (P > 0.05) for up- or downregulated genes. To yet reduce application for gene mining by keyword or BLAST searches. We the number of genes to be chosen for a subsequent more detailed also explicitly did not aim to compete with larger central sequence investigation of expression changes e.g. by real time PCR, T-ACE databases (e.g. short read archives of EBI and NCBI) that allow holds the option to filter for ‘distinct’ differences. The number of sharing of the unannotated data with the wider public. 
T-ACE was reads per contig within groups A and B are ranked and the two built in order to provide a graphical interface that can be used groups are 100% distinctly different when the transcriptome with locally among different scientists in a single lab and also between the lowest number of reads for a specific contig within one group collaborating laboratories in order to work with large amounts (e.g. A) still has a higher number of reads compared with the of transcriptome sequence data. T-ACE exhibits modules for transcriptome with the highest number of reads per contig in the manual curation and visualization of single contigs and underlying other group (e.g. B). The percentage (%) of distinction can be set reads. Furthermore, statistical tools have been implemented for the by the user. The obtained sequence entries for regulated genes can analysis of differential expression or occurrence of GO, KEGG be transferred to separate project tabs by the ‘Show UP’- and/or terms and protein domains comparison of transcriptomes. The ‘Show DOWN’-button for further analysis or export. relatively simple bioinformatic structure sets a framework for Analysis of GO term, KEGG pathway member or protein domain further integration of analysis modules and customization of the distribution within and between transcriptomes: in some cases, the tool. Future development of T-ACE will particularly focus on such composition of GO terms, KEGG pathway members or protein additional modules for transcriptome comparisons in the light of domains in different transcriptomal datasets or in the group of up- the increasing use of RNAseq data as ‘virtual microarrays’ i.e. a and downregulated genes of treatments can be more informative consensus transcriptome will be used as a ‘virtual microarray’ than investigating expression changes on the level of single genes. against which sequence data of short-read sequencing technologies A high number of small expression changes of several genes within (e.g. 
SoliD, Illumina) are mapped for gene expression analysis. [14:26 29/2/2012 Bioinformatics-bts056.tex] Page: 782 777–783 Copyedited by: ES MANUSCRIPT CATEGORY: ORIGINAL PAPER T-ACE Fig. 6. (A) Example for a run compare table showing significant enriched (marked green) or depleted (marked red) GO terms, calculated for the subcategory ‘Molecular function’ in digestive gland, gill, mantle and foot tissue and compared with all GO terms found within the reference database (all tissues combined). GO terms were deduced form BLAST hits using the BLASTX algorithm against the UniProtKB/Swiss-Prot and UniRef100 protein databases of UniProt −3 −3 −10 Knowledgebase (UniProtKB, http://www.expasy.org/sprot/) with a cut off e 10 , as well as tBLASTx (e 10 ) and BLASTn (e 10 ) against the NCBI non-redundant protein (nr) database ( http://www.ncbi.nlm.nih.gov). (B) Relative abundance of GO terms (%) for the subcategory Molecular function calculated from data obtained by the ‘run compare’ tool. Arrows indicate one significantly enriched and depleted GO term within the respective tissue. Taken together, T-ACE has been designed for scientists in Conesa,A. et al. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674–3676. different laboratories working cooperatively on a central database Craft,J.A. et al. (2010) Pyrosequencing of Mytilus galloprovincialis cDNAs: tissue- and allows its users access from different terminals and platforms specific expression patterns. PLoS One, 5, e8875. (Linux/Windows/MacOS). Huang,X. and Madan,A. (1999) CAP3: A DNA sequence assembly program. Genome Res., 9, 868–877. Hunter,S. et al. (2009) InterPro: the integrative protein signature database. Nucleic Acids ACKNOWLEDGEMENTS Res., 37, D211–D215. Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. The authors would like to thank the cell biology group and Nucleic Acids Res., 28, 27–30. 
sequencing platform of the IKMB for technical support, especially Koski,L. et al. (2005) AutoFACT: An Automatic Functional Annotation and Tanja Kaacksteen, Melanie Friskovec and Anita Dietsch. We thank Classification Tool. BMC Bioinformatics, 6,151. Dr Georg Hemmrich-Stanisak for helpful comments while writing Kumar,S. and Blaxter,M. (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics, 11, 571. the manuscript as well as three anonymous reviewers. Martin,J.A. and Wang,Z. (2011) Next-generation transcriptome assembly. Nat. Rev. Genet., 12, 671–682. Funding: DFG clusters of Excellence ‘The Future Ocean’ and Mayer,C. (2007) PHOBOS – a tandem repeat search tool for complete genomes. ‘Inflammation at Interfaces’ and the DFG priority programme 1399 http://www.rub.de/spezzoo/cm. ‘Host-parasite covolution’, Genomics Analysis platform. Meyer,F. et al. (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Conflict of Interest: none declared. Bioinformatics, 9, 386. Parkinson,J. et al. (2004) PartiGene—constructing partial genomes. Bioinformatics, 20, 1398–1404. REFERENCES Rozen,S. and Skaletsky,H. (2000) Primer3 on the WWW for general users and for Ashburner,M. et al. (2000) Gene Ontology: tool for the unification of biology. Nat. biologist programmers. Methods Mol. Biol., 132, 365–386. Genet., 25, 25–29. Schmid,R. and Blaxter,M.L. (2008) annot8r: rapid assignment of GO, EC and KEGG Bluthgen,N. et al. (2005) Biological profiling of gene groups utilizing Gene Ontology. annotations. BMC Bioinformatics, 2008, 9, 180. Genome Inform., 16, 106–115. Wasmuth,J. and Blaxter,M. (2004) prot4EST: Translating Expressed Sequence Tags Chevreux,B. et al. (2004) Using the miraEST assembler for reliable and automated from neglected genomes. BMC Bioinformatics, 5, 187. mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res., Zerbino,D. and Birney,E. 
(2008) Velvet: algorithms for de novo short read assembly 14, 1147–1159. using de Bruijn graphs. Genome Res., 18, 821–829. [14:26 29/2/2012 Bioinformatics-bts056.tex] Page: 783 777–783
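As an illustration of the KEGG overview described in the text (number of contigs per pathway, and the number and percentage of pathway members covered, ko and ko%), the following is a minimal Python sketch. It is not taken from T-ACE: the function name `pathway_coverage`, the input dictionaries and the example identifiers are assumptions made for illustration only.

```python
def pathway_coverage(pathway_members, contig_annotations):
    """Summarize how well each reference pathway is covered by contigs.

    pathway_members:    dict mapping a pathway ID to the set of KO IDs it contains
    contig_annotations: dict mapping a contig name to the set of KO IDs annotated to it
    (Both structures are hypothetical; T-ACE's internal data model is not described.)
    """
    rows = []
    for pathway, members in pathway_members.items():
        # Contigs that carry at least one KO belonging to this pathway
        hit_contigs = sorted(
            contig for contig, kos in contig_annotations.items() if kos & members
        )
        # Distinct pathway members seen in any contig (each member counted once)
        covered = set()
        for contig in hit_contigs:
            covered |= contig_annotations[contig] & members
        ko = len(covered)
        ko_pct = 100.0 * ko / len(members) if members else 0.0
        rows.append(
            {"pathway": pathway, "contigs": hit_contigs, "ko": ko, "ko_pct": ko_pct}
        )
    return rows


# Example with made-up annotations
example_pathways = {"map00010": {"K00844", "K01810", "K00850", "K01623"}}
example_contigs = {
    "contig_1": {"K00844"},
    "contig_2": {"K00844", "K00850"},
    "contig_3": {"K99999"},
}
example_rows = pathway_coverage(example_pathways, example_contigs)
```

Here two contigs hit the pathway, but they cover only two of its four members, so contig count and member coverage are reported separately.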
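The fold change and the 100% ‘distinct’ criterion from the expression analysis can be sketched as follows. This covers only the case stated in the text (the lowest read count in one group still exceeds the highest read count in the other). How T-ACE computes intermediate percentages of distinction is not described, so that part is left out. The function names and the handling of a zero denominator are assumptions.

```python
import math
from statistics import mean, median


def fold_change(reads_a, reads_b, use_median=False):
    """Fold change of group A over group B, based on mean or median read counts.

    Returning math.inf for a zero denominator is an assumption of this sketch.
    """
    center = median if use_median else mean
    numerator, denominator = center(reads_a), center(reads_b)
    if denominator == 0:
        return math.inf if numerator > 0 else 1.0
    return numerator / denominator


def is_fully_distinct(reads_a, reads_b):
    """True if the two groups do not overlap at all in read counts for a contig.

    This is the 100% case: the lowest value of one group is still higher
    than the highest value of the other group.
    """
    return min(reads_a) > max(reads_b) or min(reads_b) > max(reads_a)


# Read counts of one contig in three transcriptomes per group (made-up numbers)
group_a = [50, 60, 70]
group_b = [10, 20, 30]
```

With these numbers the mean-based fold change is 3.0 and the groups are fully distinct. Replacing 60 by 25 in group A would make the ranges overlap and the contig would no longer pass the 100% filter.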
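The multiple-testing correction named in the expression analysis, the Benjamini–Hochberg step-up procedure, can be written in a few lines. The sketch below is a generic textbook version in plain Python, not T-ACE's code. The input P-values would come from a per-contig test such as the Mann–Whitney U test and are made up here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order.

    Each P-value is scaled by m / rank (rank 1 = smallest P-value), and
    monotonicity is enforced by walking from the largest rank downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Four hypothetical per-contig P-values
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
```

With few biological replicates per group, the raw P-values of a rank-based test cannot become very small, which is why the text notes that corrected values often stay above 0.05.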
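The enrichment and depletion calls of the ‘Run compare’ tool rest on Fisher's exact test. The sketch below shows one way such a call could be made for a single GO term, KEGG pathway or domain. It assumes a 2×2 table that contrasts the cohort with the remaining contigs of the combined set. The text does not spell out how T-ACE builds the table, so this layout, the function names and the significance threshold `alpha` are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denominator = comb(total, col1)

    def prob(x):
        return comb(row1, x) * comb(total - row1, col1 - x) / denominator

    lowest = max(0, col1 - (total - row1))
    highest = min(row1, col1)
    observed = prob(a)
    # Small tolerance so that ties with the observed table are included
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def classify_term(cohort_with, cohort_total, all_with, all_total, alpha=0.05):
    """Label a term as 'enriched', 'depleted' or 'not significant' in a cohort.

    cohort_*: contigs in the cohort (with the term / in total)
    all_*:    contigs in the combined set of all cohorts (with the term / in total)
    """
    a = cohort_with
    b = cohort_total - cohort_with
    c = all_with - cohort_with                 # rest of the contigs, with term
    d = (all_total - cohort_total) - c         # rest of the contigs, without term
    p_value = fisher_exact_two_sided(a, b, c, d)
    if p_value >= alpha:
        return "not significant", p_value
    if cohort_with / cohort_total > all_with / all_total:
        return "enriched", p_value
    return "depleted", p_value
```

For example, `classify_term(30, 100, 40, 1000)` describes a term found in 30 of 100 cohort contigs but only 40 of 1000 contigs overall. Because the test is two-sided, the same function reports both over- and underrepresentation, in line with the point made in the text.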

Journal

Bioinformatics, Oxford University Press

Published: Jan 27, 2012
