Novel molecules lncRNAs, tRFs and circRNAs deciphered from next-generation sequencing/RNA sequencing: computational databases and tools

Abstract

Powerful next-generation sequencing (NGS) technologies, and RNA sequencing (RNA-seq) in particular, have been pivotal to the detection and analysis of novel biomolecules, namely long noncoding RNAs (lncRNAs), tRNA-derived fragments (tRFs) and circular RNAs (circRNAs), and to generating hypotheses about them. Experimental validation of the occurrence of these biomolecules inside the cell has been reported. Their differential expression and functionally important roles in several cancer types, as well as in other diseases such as Alzheimer’s and cardiovascular diseases, have garnered interest in further studies in this research arena. In this review, starting from a brief introduction to NGS and RNA-seq and to the expression and role of lncRNAs, tRFs and circRNAs in cancer, we comprehensively analyze the current landscape of databases and computational software developed for the analysis and visualization of these novel biomolecules. Our review will help end users and research investigators gain information on the existing databases and tools and understand the specific features these offer. This should guide researchers in using them properly, toward generating novel hypotheses and saving the time and costs involved in extensive experimental work on these three novel classes of functional RNA.

Keywords: next-generation sequencing technologies, RNA sequencing, RNA-seq, long noncoding RNAs, circular RNAs, tRNA-derived fragments, cancer, databases, tools

Introduction

Today’s science is essentially technology driven: the newer the technology, the faster the progress and the more advanced the science.
The limitless capabilities thus imparted translate into major scientific breakthroughs in virtually all of the major fields, from Genomics to Proteomics to Metabolomics and on to Systems and Synthetic biology. Groundbreaking real-world applications range from human diseases to evolution, agriculture, medicine and metagenomics (sequencing the genomes of microbiota present in the environment and in humans), touching all the major elements of our life. The impact of these novel technologies on biomedical science is tremendous.

Genomics and genome projects

Ever since the first protein and DNA sequences were determined using the Edman degradation method and Sanger sequencing, respectively, the need to produce more and more information in an automated, robust, high-throughput, scalable and speedier manner, from any biological sample and at low cost, together with the need to sequence an entire genome in one step, has led to a kind of ‘sequencing technology revolution’. In fact, the sequencing technologies of today are ‘ultra high-throughput’ technologies, called next-generation sequencing (NGS) technologies, which are powerful and robust yet simple in their principles. Founded on the principles of its forefather, capillary electrophoresis-based sequencing, NGS technology has pushed the frontiers of science to altogether new levels. Using NGS, thousands of sequences can be deciphered concurrently and computational analyses done at a genome-wide scale. Besides its use in DNA sequencing and RNA sequencing (RNA-seq), it is also the basis of chromatin immunoprecipitation sequencing (ChIP-sequencing), which identifies the binding sites of proteins that associate with DNA molecules, such as transcription factors and DNA polymerases.
Personalized genomics with personal whole genome sequence information, and genome-wide profiling of several types of RNA molecules such as messenger RNAs (mRNAs), microRNAs (miRNAs), long noncoding RNAs (lncRNAs), small noncoding RNAs (sncRNAs), circular RNAs (circRNAs) and tRNA-derived fragments (tRFs) are now possible with this technology. In the case of gene expression studies, owing to the precise quantitation of RNA transcripts occurring as a result of alternative gene splicing events and discovery of several novel transcripts, RNA-seq provides a stronger platform than microarrays. NGS: Sequencing machines, genome centers and major projects NGS technology started in 2005, shortly after the Human Genome Project was declared complete in the euchromatic region in the year 2003. The first of the sequencing machines was Roche 454 sequencing, based on pyrosequencing methodology, which is currently being discontinued. Higher end machines such as Illumina HiSeq 2500 and MiSeq sequencers and Ion Proton by Ion Torrent Systems Inc. are slated to rule the Genomics research field. Among newly developing technologies are Single Molecule Real Time sequencing methods and Nanopore DNA Sequencing. Data analysis of NGS platforms requires a strong Bioinformatics component, which needs to keep pace with huge volume of data generated. The main goal is to reduce costs per genome to <$1000. Several Genome Centers worldwide are using these high-end NGS machines to support their projects as well as outside researchers. 
Notable among these genome centers are the following: JP Sulzberger Columbia Genome Center at Columbia University Department of Systems Biology, New York City; The US Department of Energy Joint Genome Institute, California, USA; Wellcome Trust Sanger Institute, Hinxton, UK; Max Planck Institute for Molecular Genetics, Berlin, Germany; National Institute of Biomedical Genomics, Kolkata, India; and Center for Cellular and Molecular Platforms, National Centre for Biological Sciences-Tata Institute of Fundamental Research (NCBS-TIFR), Bangalore, India. Implementing NGS technology, several high-throughput genome projects have been undertaken or are slated to start. Interesting projects among these are the Functional Annotation of the Mammalian Genome 5 project to functionally annotate transcriptomes specific to mammalian cells; 1000 Genomes Project, intended to catalog every possible human genetic variation from about 25 populations all over the world; 1000 Fungal Genomes Project to sequence every family of fungi, The Cancer Genome Atlas (TCGA) to sequence genomes of different types of cancer; 100K Foodborne Pathogen Genome Project; and the International Cancer Genome Consortium. TCGA is a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute. It aimed to provide a ‘comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer’. To date, approx. 2.5 petabytes of data describing genomic alterations in tumor tissue and matched normal tissues has been generated and is made publicly available for further research, analysis and dissemination of crucial information. 
Data are made available through Genomic Data Commons portal (also host to Therapeutically Applicable Research to Generate Effective Therapies) platform and is in the form of methylation-array, genotyping array, clinical and metadata information, histopathological images, DNA/mRNA/miRNA/total RNA-seq data, DNA methylation, copy number variation, microsatellite instability and protein expression, among others. In 2017, NCI Center for Cancer Genomics will be taking the place of TCGA and intends to continue the work in the same direction. Another consortium also aiming at pan-cancer studies, the International Cancer Genome Consortium, aims to put together data from several countries across the world including Australia, Brazil, Canada, China, EU, France, Germany and India, to name a few. The primary goals are to study and generate catalogs of genomic abnormalities (somatic mutations, abnormal expression of genes, epigenetic modifications) in tumors from 50 different cancer types and/or subtypes and make the data available to researchers in our rush to combat cancers of clinical and societal relevance. To date, 89 committed projects are available involving a variety of experimental protocols such as array genotyping, DNA deep resequencing, cytogenetic analyses and computational pipelines implementing statistical analyses, correlations between molecular profiles and clinical features among others. BOX 1 The Technology: NGS To illustrate a generalized NGS methodology, the following steps are started with a genomic DNA or complementary DNA (cDNA) extracted from samples, e.g. blood samples: Template: The DNA is fragmented and adapters (synthesized oligonucleotides of a few base pairs length and of a known sequence) are added to the ends of the fragments. These adapters are used to attach the DNA fragments to the flow cell of the sequencing machine and also contain primer sequences (primers are used to extend a chain by a DNA polymerase molecule). 
These libraries of fragments are then amplified, e.g. by PCR, to generate large numbers of clones. Some sequencing machines operate by ‘sequencing by synthesis’: the fragments in the library act as templates on which a new strand is synthesized by providing known nucleotides sequentially. The DNA strands are extended one nucleotide at a time and the reads (sequences) are recorded by software. Millions of such sequencing reactions are done in parallel. The reads are then aligned to a reference genome (re-assembling and re-sequencing) to generate larger contigs or, where a reference genome is not available, assembled de novo (de novo sequencing). Reference genomes are representative assemblies of a species’ genome, characterized by high sequence quality and annotated with key genomic features such as the total number of protein-coding genes, RNA genes and exon and intron locations, among others. A recent high-quality assembly of the human reference genome, GRCh38, was released in December 2013 by the Genome Reference Consortium. These whole genome sequences can be analyzed with Bioinformatics tools to detect genetic variants, including single nucleotide polymorphisms (SNPs) and insertion/deletion (Indel) mutations associated with diseases, to locate regulatory and protein-binding sites, to find novel genes and to assess gene expression levels through transcript abundance.

RNA-seq

During RNA-seq, the mRNA isolated through a poly-A pulldown assay is converted into cDNA by reverse transcription, and the further sequencing steps are essentially the same as described above in the ‘NGS: Sequencing machines, genome centers and major projects’ section. The gene expression level is determined from the relative abundance of mRNA transcripts, quantified after aligning the reads to the reference genome.
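As a rough illustration of what ‘relative abundance’ can mean in practice, the sketch below converts per-transcript read counts into transcripts per million (TPM). TPM is one common normalization and is an assumption here, not a method named in this review; the function name and the input dictionaries are hypothetical.

```python
def tpm(read_counts, lengths_bp):
    """Convert per-transcript read counts to transcripts per million (TPM).

    read_counts: dict of transcript id -> number of aligned reads
    lengths_bp:  dict of transcript id -> transcript length in base pairs
    Both inputs are hypothetical; real pipelines take them from an
    aligner or quantifier and an annotation file.
    """
    # Reads per base: corrects for longer transcripts attracting more reads.
    rates = {}
    for tid, count in read_counts.items():
        length = lengths_bp[tid]
        if length <= 0:
            raise ValueError(f"non-positive length for {tid}")
        rates[tid] = count / length

    total = sum(rates.values())
    if total == 0:
        return {tid: 0.0 for tid in rates}

    # Scale so that the values across all transcripts sum to one million.
    return {tid: rate / total * 1e6 for tid, rate in rates.items()}


if __name__ == "__main__":
    counts = {"tx_A": 100, "tx_B": 100}
    lengths = {"tx_A": 1000, "tx_B": 2000}
    for tid, value in tpm(counts, lengths).items():
        print(tid, round(value, 1))
```

With equal read counts, the shorter transcript receives the higher TPM, because the same number of reads over fewer bases implies more transcript copies.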
The contemporary research is being geared toward using RNA-seq technology to quantify gene expression changes, identification of novel transcripts and their possible role in diseased state. Identification of differentially expressed genes, isoforms, gene fusions and predicting novel genes/protein as putative drug targets are some of the other futuristic research arenas. Toward these ends, several studies have been and are currently being undertaken in the context of various diseases. As some examples, in cancer studies, large-scale transcriptome analysis using RNA-seq data from 4043 cancers of diverse types and 548 normal controls identified seven co-regulated gene sets, also called cross-cancer signatures, altered across a diverse panel of primary human cancer samples. A 14-gene signature extracted from these cross-cancer signatures was capable of clear distinction between cancer and normal samples [1]. Integrated DNA sequencing and transcriptional profile analysis detected 10 034 ovarian cancer structural variants (SVs) at base-pair level resolution, and these SVs along with gene fusions were implicated as significant factors in the onset and progression of ovarian cancer [2]. Using RNA-seq data, an underexplored area on Ran regulation of mitotic spindle formation was identified for further studies on prostate cancer pathogenesis [3]. In another interesting study [4], RNA-seq data were used to identify genes related to increasing body mass index (BMI) among four integrated clusters POLE, MSI, CNL, CNH of endometrial cancer (EC). Differences in gene expression profiles of obese and nonobese women with EC and the association of BMI within these clusters were measured. It was observed that 181 genes were significantly up- or down-regulated with increasing BMI mainly involved in cell cycle and DNA-metabolism processes. This set of genes included LPL, IRS-1, IGFBP4, IGFBP7 and the progesterone receptor genes. 
RNA-seq data for studying differential gene expression are typically analyzed with the same tools that are used for microarrays, viz., t-test, SAM, limma and ANOVA, with Bonferroni correction to adjust for multiple testing. In addition, RNA-seq-specific software tools such as DESeq, edgeR and SAMseq are increasingly used. A detailed protocol for the usage and understanding of these software tools is provided in [5].

Types based on regions sequenced

Based on the genomic regions being sequenced, NGS technologies involve three basic types of sequencing:
Whole Genome Sequencing: Whole genomes are sequenced.
Exome Sequencing: All exon (coding) regions in a genome are sequenced.
Targeted Resequencing: Only a portion of a genome, or a gene that is thought to function in disease states, is sequenced.
These different types of sequencing studies yield specific information amenable to further investigation and novel hypothesis generation. The latter two are considered cost-effective, reduce turnaround time and are more focused and less labor-intensive than whole genome sequencing.

NGS and lncRNA

An lncRNA is a transcript longer than 200 nt that does not code for protein. Nearly 30 000 (and still counting) lncRNA transcripts have been identified in humans via the Encyclopedia of DNA Elements (ENCODE) and Functional Annotation of the Mammalian Genome (FANTOM) consortia. GENCODE consortium analysis indicates that, like protein-coding genes, lncRNA transcription is regulated by histone modification and lncRNA transcripts are processed by a similar splicing mechanism. Their expression is strikingly cell type and tissue specific. In addition to their ubiquitous role in the transcriptional regulation of protein-coding genes, several lncRNAs have been found to be involved in tumorigenesis and metastasis. lncRNA expression can reflect cancer phenotype, as shown by 32 upregulated lncRNAs identified as playing a role in urothelial cancer progression.
Of these, upregulation of ABO74278 lncRNA led to anti-apoptotic role and maintained proliferative state in cancer, through potential interaction with tumor suppressor EMP1 [6] and is considered to be a strong prognostic biomarker candidate. Other lncRNAs such as lncRNA-n336928 in bladder cancer [7], XLOC_010235 or RP11-789C1.1 in gastric cancer [8] and NBAT1 in neuroblastoma [9] have been implicated to be playing a role in cancer development and progression. Large-scale analyses showed that lncRNA expression and dysregulation are highly tumor- and lineage-specific and often associated with somatic copy number alterations, promoter methylation and SNPs. siRNA screening strategy and co-expression analysis approach identified cancer driver lncRNAs and predicted their functions [10]. LncRNAs are important epigenetic regulators as well. HOX transcript antisense RNA (HOTAIR) lncRNA overexpression in glioblastoma leads to cellular proliferation. Chromatin immunoprecipitation studies revealed binding of Bromodomain Containing 4 (BRD4) protein, an epigenetic modulator, to the HOTAIR promoter, suggesting that BET proteins can directly regulate lncRNA expression. This is one of the novel mechanisms discovered through which BET proteins can control tumor cell proliferation [11]. NGS and t-RNA-derived small ncRNA t-RNA-derived small ncRNAs were discovered through sequencing of RNAs with size 19–40 nt [12] and through deep sequencing and bioinformatics studies. These are most possibly produced via specific processing of tRNA molecules [13]. These are also known by the name of tRFs. Their role in cancer development and progression is being investigated and is currently in its infancy. tRFs induced under hypoxic conditions may play a role in preventing metastasis [14]. 
A novel type of tRF, termed Sex Hormone-dependent tRNA-derived RNAs (SHOT-RNAs), was found to be specifically and abundantly expressed in androgen receptor-positive prostate cancer and estrogen receptor-positive breast cancer cell lines. These SHOT-RNAs were found to be produced from amino-acylated mature tRNAs by angiogenin-mediated anticodon cleavage promoted by sex hormones and their receptors [15]. NGS and CircRNA The first evidence of the existence of circRNAs was shown in plant viroids, where these were present as single-stranded, circularly closed RNA molecules [16]. In humans and other organisms, circRNAs are also being discovered through analyses of deep sequencing data. CircRNAs, as their name implies, do not have 5ʹ and 3ʹ ends, instead these are circular molecules, forming a covalently closed loop. As these are not polyadenylated, poly (A)-RNA-seq data cannot be used for their discovery. These were identified by research groups on their way toward understanding exon scrambling events, when exons are spliced in noncanonical order. Deep sequencing reads that show a junction between two ‘scrambled’ exons are typically used to identify them. In human context, it is estimated that circRNAs may account for 1% as many molecules as polyadenylated RNAs [17]. Up- and down-regulation of circRNAs has also been found associated with cancer. A review paper on bioinformatics and experimental methodologies on circRNAs detection has been published [18]. CircRNAs can act as novel class of biomarkers and have been postulated to be miRNA sponges binding to miRNAs and repressing their function [19]. In one study, RNA polymerase II-associated circRNA in human cells produced a new class of circRNA molecules. In these transcripts, interestingly, introns were found to be ‘retained’ between circularized exons. 
Named exon–intron circRNAs or EIciRNAs, these were found to be involved in regulating transcription of their parental genes through RNA–RNA interaction between them and U1 snRNA [20].

Identification of lncRNA, sncRNA, tRF and circRNAs using deep sequencing and bioinformatics: databases, tools and techniques

Of major interest in today’s scientific world is the compelling power that deep sequencing and Bioinformatics possess in discovering novel molecules inside cells. Novel unannotated coding and noncoding transcripts can be detected and analyzed for their probable roles and mechanisms of action, and several research areas have focused on these in recent times. These studies require powerful computational tools to detect lncRNAs and differentiate them from coding RNAs and other ncRNAs. Methods for identifying lncRNAs from RNA-seq libraries integrate RNA-seq data with known annotation databases. The primary step is transcriptome reconstruction from the RNA-seq data of each sample. After that, reads are aligned to the reference genome and annotated using the RefSeq, UCSC and GENCODE protein-coding transcript databases. This is done to eliminate annotated non-lncRNA transcripts (protein-coding genes, tRNAs, miRNAs and pseudogenes). To distinguish low-level-expressed, single-exon unreliable fragment assemblies from low-level-expressed lncRNAs, a read coverage threshold of >200 bp has to be selected. To eliminate unannotated or novel coding genes, two methods, PhyloCSF and Pfam, can be used. PhyloCSF (phylogenetic codon substitution frequency) is used to assess putative open reading frames (ORFs; these are evolutionarily conserved ORFs with synonymous amino acid content), whereas Pfam is used to exclude any transcript that encodes one of the 31 912 protein domains catalogued in Pfam. Finally, the remaining lncRNA transcripts are listed (Figure 1).

Figure 1. Flowchart depicting identification of lncRNA from RNA-seq data. (A colour version of this figure is available online at: https://academic.oup.com/bfg)
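The filtering logic of the lncRNA workflow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used by any tool cited here. The longest forward-strand ORF is used as a crude stand-in for the PhyloCSF and Pfam coding-potential checks, the 300 nt ORF cutoff is an arbitrary illustrative value, and all function and variable names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_nt(seq):
    """Length (nt, stop codon included) of the longest ATG..stop ORF
    found in the three forward reading frames."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def candidate_lncrnas(transcripts, annotated_ids, min_len=200, max_orf_nt=300):
    """Return ids of assembled transcripts that survive three filters.

    transcripts:   dict of transcript id -> nucleotide sequence
    annotated_ids: ids already annotated as non-lncRNA (mRNA, tRNA, ...)
    max_orf_nt:    assumed cutoff standing in for PhyloCSF/Pfam screening
    """
    kept = []
    for tid, seq in transcripts.items():
        if tid in annotated_ids:
            continue  # step 1: drop known non-lncRNA annotations
        if len(seq) <= min_len:
            continue  # step 2: enforce the >200 nt length threshold
        if longest_orf_nt(seq) >= max_orf_nt:
            continue  # step 3: drop transcripts with a long ORF
        kept.append(tid)
    return kept


if __name__ == "__main__":
    coding = "ATG" + "GCT" * 120 + "TAA"
    assembled = {
        "known_mRNA": coding,
        "short_fragment": "ACGT" * 20,
        "novel_coding": coding,
        "novel_noncoding": "AC" * 150,
    }
    print(candidate_lncrnas(assembled, annotated_ids={"known_mRNA"}))
```

A real analysis would replace the ORF heuristic with the conservation-based and domain-based checks described in the text, and would also scan the reverse strand.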
In contrast, circRNAs are generally identified from poly-A-depleted or rRNA-depleted RNA-seq data (Figure 2). In one such method, called rRNA-depleted RNA-seq and also known as RibominusSeq, sequencing is done after the depletion of rRNA. The sequencing reads are then aligned to the reference genome and the mapped reads are discarded. From the unmapped reads, 20-nucleotide anchor sequences from either end are aligned further to find unique anchor positions. A series of filtering criteria is then applied to identify circRNAs. An anchor pair that aligns in reverse order, with canonical GU/AG splicing signals flanking the splice sites, suggests back-splicing and hence the presence of a circRNA.

Figure 2. Flowchart depicting identification of circRNA from RNA-seq data. (A colour version of this figure is available online at: https://academic.oup.com/bfg)

After small RNA-seq, tRFs are distinguished from other RNAs by integrating known annotation databases. In general, small RNA-seq data are first mapped to the human genome and unmapped reads are discarded (Figure 3). The mapped or aligned reads may then be annotated with known transcript databases (GENCODE, miRBase, Rfam) to eliminate non-tRFs (mRNA, miRNA, rRNA, snRNA and snoRNA). The enriched tRFs may further be annotated to pre-tRNA and mature tRNA. Pooled tRFs are classified according to their biogenesis site.
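The read-filtering steps for tRFs just described can be sketched as follows. This is only an illustration under stated assumptions: each read is represented by a hypothetical tuple of (read id, mapped flag, list of overlapping annotation labels), and the label strings are invented for the example rather than taken from GENCODE, miRBase or Rfam.

```python
# Annotation classes treated as non-tRF, following the list in the text.
NON_TRF_LABELS = {"mRNA", "miRNA", "rRNA", "snRNA", "snoRNA"}


def select_trf_reads(reads):
    """Keep mapped reads that overlap tRNA annotations only.

    reads: iterable of (read_id, mapped, labels) tuples, where labels
    lists the annotation classes the read overlaps (hypothetical format).
    Returns a list of (read_id, source) with source being 'pre-tRNA'
    or 'mature-tRNA'.
    """
    kept = []
    for read_id, mapped, labels in reads:
        if not mapped:
            continue  # discard reads that did not map to the genome
        label_set = set(labels)
        if label_set & NON_TRF_LABELS:
            continue  # discard reads explained by other RNA classes
        if "pre-tRNA" in label_set:
            kept.append((read_id, "pre-tRNA"))
        elif "mature-tRNA" in label_set:
            kept.append((read_id, "mature-tRNA"))
    return kept


if __name__ == "__main__":
    example = [
        ("r1", False, []),
        ("r2", True, ["miRNA"]),
        ("r3", True, ["mature-tRNA"]),
        ("r4", True, ["pre-tRNA"]),
    ]
    print(select_trf_reads(example))
```

Reads retained this way would then be grouped by where they fall on the tRNA, which is the classification step.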
The 5ʹ tRFs and 3ʹ tRFs are derived from mature tRNA and 3ʹ U-tRFs from the 3ʹ end of pre-tRNA, while the recently characterized i-tRFs (internal tRFs) have been shown to derive from the anticodon region of tRNA and either the D-arm or the TψC-arm.

Figure 3. Flowchart depicting identification of tRFs from small RNA-seq data. (A colour version of this figure is available online at: https://academic.oup.com/bfg)

LncRNA

LncRNAs are synthesized by the same RNA polymerase II machinery that synthesizes mRNAs. With characteristic features such as a 5ʹ-cap, a 3ʹ-poly-A tail and the absence of a long ORF, lncRNAs must pass the >200 nt sequence length threshold to qualify. A large majority of lncRNAs do not encode proteins, and these have a multi-exonic structure [21–23]. They are also distinguished bioinformatically from short ncRNAs, which are <200 nt in length and have known functions [24]. As protein-coding regions have >100 nt long ORFs, the absence of long ORFs in lncRNA indicates that its function is yet to be defined accurately. Ribosome profiling followed by sequencing is a novel technique for estimating translation activity [25] and, along with mass spectrometry, can be used to verify translation. LncRNAs have a wide range of functional modes and act through many mechanisms in regulating cell cycle progression, apoptosis and differentiation [26].

Databases for lncRNA

NONCODE2016. An interactive database, NONCODE2016 provides a collection of ncRNAs (except tRNA and rRNA) from 16 species. A total of 527 336 lncRNAs have been submitted to NONCODE, mostly curated from the literature and public databases. For human and mouse, 167 150 and 130 558 lncRNAs, respectively, had been deposited as of October 2016 [27].
Besides basic information such as location, strand, exon number, length and sequence, three important features are present, namely, conservation annotation, the relationships between lncRNAs and diseases and an interface to choose high-quality data sets through predicted scores, literature and long reads. This database is accessible from this link http://www.noncode.org. lncRNAdb. This database provides users with a comprehensive, manually curated reference database of 287 eukaryotic lncRNAs that have been published in the scientific literature. It is built on an improved user interface enabling sequence information access, as well as access to expression profiles from Illumina Body Atlas. BLAST search tool against lncRNAdb is also available, which is useful in search for novel transcripts [28]. Following types of information can be extracted using this database: nucleotide sequences; genomic context; gene expression data derived from the Illumina Body Atlas; structural information; subcellular localization; conservation; function with referenced literature. LNCipedia. Being a human lncRNA-specific database, its version 4 accounts for 118 777 annotated lncRNA transcripts, obtained from various sources. Large-scale reprocessing of publicly available proteomics data is undertaken to assess the protein-coding potential, and those transcripts with low protein-coding potential are deposited. In addition, a tool to assess lncRNA gene conservation between human, mouse and zebrafish has also been implemented. It can be accessed from the link http://www.lncipedia.org. Data deposited in LNCipedia can also be visualized using Integrative Genomics Viewer, a visualization tool for high-throughput genomics data [29]. The Atlas of Noncoding RNAs in Cancer (TANRIC). This database characterizes the expression profiles of lncRNAs using large-scale RNA-seq data sets from TCGA and independent data sets in 20 different cancer types. 
Available at http://bioinformatics.mdanderson.org/main/TANRIC [30], this widely useful database enables searching for lncRNAs with strong correlation with established biomarkers as well as drug sensitivity studies. Besides these databases (Table 1), there exist some more lncRNA databases covered in Wikipedia pages and the users are encouraged to have a look at these as well.

Table 1. Databases and their specific features and active URLs with reference to lncRNAs

Database | Description | URL link
NONCODE | Provides a collection of ncRNAs (except tRNA and rRNA) from 16 species. A total of 527 336 lncRNAs are available in NONCODE. | http://www.noncode.org/
lncRNAdb | Provides 287 eukaryotic lncRNAs and BLAST tool for identification of lncRNA. | http://lncrnadb.org
LNCipedia | An integrated database containing 118 777 human lncRNAs. | http://www.lncipedia.org
TANRIC | Interactive exploration of expression and clinical relevance of lncRNA in 20 different cancer types. | http://bioinformatics.mdanderson.org/main/TANRIC

Computational tools developed for lncRNAs

Noncoding RNAs identification with Hybrid Random Forest. This is a classification tool based on a hybrid random forest (RF) algorithm with a logistic regression model to discriminate between short ncRNA and long and complex ncRNA sequences.
The logistic regression function, which generates a new feature called SCORE, comprises five features of significance: structure, sequence, modularity, structural robustness and coding potential. This is done as a way to better describe the functional elements in the lncRNAs. The classifier is available at http://ncrna-pred.com/HLRF.htm [31].

PhyloCSF. A novel comparative genomics method, it is used to determine whether a protein-coding region is present in a multi-species nucleotide sequence alignment. It is based on a formal statistical comparison of phylogenetic codon models. Rather than homology in sequence alignments, it examines conserved coding regions for evolutionary signatures. Examples include high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and low frequencies of other missense and nonsense substitutions. This helps to distinguish protein-coding RNAs from ncRNAs among novel transcripts obtained from high-throughput transcriptome sequencing. This software is freely available from http://compbio.mit.edu/PhyloCSF [32].

DeepLNC. A Deep Neural Network (DNN) is proposed by the authors [33] as a fast and accurate computational method for distinguishing lncRNAs from mRNAs, compared with other classifiers. Manually annotated training data sets from LNCipedia and the RefSeq database have been used in the classifier, and the information content stored in k-mer patterns is used as the sole feature for the DNN. This information content is generated on the basis of the Shannon entropy function to improve classifier accuracy. It has been implemented as a web prediction tool, which is available at http://bioserver.iiita.ac.in/deeplnc.

AnnoLnc. It is an online portal for systematically annotating newly identified human lncRNAs.
AnnoLnc offers a full spectrum of annotations covering genomic location, RNA secondary structure, transcriptional regulation, expression; protein interaction, miRNA interaction, genetic association and evolution, as well as an abstraction-based text summary. The data are generated from high-throughput sequencing technique RNA-seq, ChIP-Seq, AGO CLIP-Seq, RBP CLIP-Seq, Gene model file and highly conserved miRNA families [34]. Coding Potential Calculator. It is developed based on support vector machines (SVMs) classifier, to assess the protein-coding potential of a transcript based on six biological features. These features are longer length, high-quality and integrity of ORF and parsing another three features from output of BLASTX based on E-value cutoff, HIT SCORE and FRAME SCORE. The true positive values from these features are protein coding transcripts and true negative values represent noncoding transcripts [35]. This tool can be used from the link http://cpc.cbi.pku.edu.cn/. Predictor of lncRNAs and messenger RNAs based on an improved k-mer scheme (PLEK). It is an alignment tool and uses computational pipeline based on an improved k-mer scheme and a SVM algorithm. PLEK takes calibrated k-mer frequencies of a transcript sequence as its computational features. With this feature, SVM is used to build a binary classification model to separate lncRNAs from mRNAs, in the absence of genomic sequences or annotations. This tool is reported by the developers as having performed well using a simulated data set and two real de novo assembled transcriptome data sets (sequenced by PacBio and 454 platforms) with relatively high indel sequencing errors. PLEK is especially suitable for PacBio or 454 sequenced data and large-scale transcriptome data analysis (Table 2) [36]. Table 2. 
Computational tools/web servers and their specific features and active URLs with reference to lncRNAs

| Tools | Description | Specificity (%) [reference] | Sensitivity (%) [reference] | Accuracy (%) [reference] | Link [reference] |
| --- | --- | --- | --- | --- | --- |
| Noncoding RNAs identification with Hybrid Random Forest | Identifies known ncRNAs in prokaryotic and eukaryotic sequences. Based on a hybrid RF algorithm with a logistic regression model. | 92.11 [31] | 90.7 [31] | 93.5 [31] | http://ncrna-pred.com/HLRF.htm [31] |
| PhyloCSF | Distinguishes protein-coding RNAs from ncRNAs among novel transcripts. Based on a formal statistical comparison of phylogenetic codon models. | Not available | Not available | Not available | http://compbio.mit.edu/PhyloCSF |
| DeepLNC | Fast and accurate computational method for screening lncRNAs from mRNAs. Based on a DNN and the Shannon entropy function. | 97.19 [33] | 98.98 [33] | 98.07 [33] | http://bioserver.iiita.ac.in/deeplnc [33] |
| AnnoLnc | An online portal for systematically annotating newly identified human lncRNAs. | Not available | Not available | Not available | http://annolnc.cbi.pku.edu.cn |
| Coding potential calculator | Distinguishes protein-coding RNAs from ncRNAs accurately. Based on an SVM classifier. | Not available | Not available | 92.00 [35] | http://cpc.cbi.pku.edu.cn/ [35] |
| PLEK | A computational pipeline based on k-mers and an SVM to distinguish lncRNAs from messenger RNAs (mRNAs). | 95.80 (PacBio), 95.5 (454) [36] | 94.70 (PacBio), 92.5 (454) [36] | 94.70 (PacBio), 95.4 (454) [36] | https://sourceforge.net/projects/plek/files/ [36] |

circRNA

Databases for circRNA

circBase. This database can be used to browse or search through unified public data sets of circRNAs, and the evidence supporting their expression can be accessed and downloaded within the genomic context. Data can be queried using ‘circBase identifier (e.g. mmu_circ_0000010), refseq transcript ID (NM_027671), gene symbol (Pvt1), genomic coordinates (chrII:123456-7891011) or Gene Ontology term identifiers’. The database can also be queried by DNA or RNA sequence through a BLAST-like Alignment Tool (BLAT) search. circBase can also identify known and novel circRNAs in sequencing data. The genome assemblies currently supported in circBase are hg19 (Homo sapiens), mm9 (Mus musculus), ce6 (Caenorhabditis elegans), latCha1 (Latimeria chalumnae and Latimeria menadoensis), dm3 (Drosophila melanogaster) and Oct06 (Schmidtea mediterranea).
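The query types listed above for circBase differ in shape, so a query string can be routed to the right search field before it is submitted. The following is a minimal sketch of that idea, not circBase's own interface: the function name `classify_query` and all regular expressions are hypothetical and were inferred only from the example queries quoted in the text.

```python
import re

# Hypothetical patterns inferred from the example queries quoted in the text.
# They are illustrative assumptions, not circBase's own validation rules.
PATTERNS = [
    ("circbase_id", re.compile(r"^[a-z]{3}_circ_\d{7}$")),       # e.g. mmu_circ_0000010
    ("refseq_id", re.compile(r"^N[MR]_\d+(\.\d+)?$")),           # e.g. NM_027671
    ("coordinates", re.compile(r"^chr[0-9A-Za-z]+:\d+-\d+$")),   # e.g. chrII:123456-7891011
    ("go_term", re.compile(r"^GO:\d{7}$")),                      # e.g. GO:0006397
]


def classify_query(query: str) -> str:
    """Return a label for the kind of identifier a query string looks like."""
    q = query.strip()
    for label, pattern in PATTERNS:
        if pattern.match(q):
            return label
    # Anything that matches none of the patterns is treated as a gene symbol.
    return "gene_symbol"


if __name__ == "__main__":
    for example in ["mmu_circ_0000010", "NM_027671", "Pvt1", "chrII:123456-7891011"]:
        print(example, "->", classify_query(example))
```

Falling back to "gene symbol" keeps the sketch simple, since gene symbols have no single fixed pattern; a real client would confirm the symbol against an annotation source.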
circBase is freely accessible at http://www.circbase.org/ [37].

circRNADb. This database contains 32 914 human exonic circRNAs selected from diverse sources. It provides information about each circRNA, including genomic information, genome sequence, exon splicing, ORF, internal ribosome entry site (IRES) and references. The database authors report that these circRNAs may be able to encode proteins, something not previously reported in any species. In total, 7170 IRES elements were found from annotation of 16 328 circRNAs having ORFs longer than 100 amino acids. In all, 46 circRNAs from 37 genes were found to have their corresponding proteins expressed according to mass spectrometry. This database is accessible at http://reprod.njmu.edu.cn/circrnadb [38].

Circ2Traits. This database reports potential associations of circRNAs with human diseases in two different ways. First, interactions between circRNAs and miRNAs associated with individual diseases are identified, and from these the likelihood of a circRNA being associated with a disease state is calculated. Networks between disease-associated miRNAs and protein-coding genes (PCGs), long noncoding RNA genes and circRNA genes are predicted, followed by Gene Ontology enrichment analyses for particular biological processes to specify the role of the PCGs. Second, disease-associated SNPs are mapped onto circRNA loci and Argonaute (AGO) interaction sites on circRNAs are identified, because AGO-binding sites are the functional sites through which a circRNA binds miRNAs and acts as a sponge. Circ2Traits categorizes 1951 human circRNAs potentially associated with 105 different diseases and also stores the complete putative miRNA–circRNA–mRNA–lncRNA interaction network for each of these diseases. The database is available at http://gyanxet-beta.com/circdb/ [39].

CircNet. CircNet is distinctive in that this database incorporates a novel way of naming circRNAs.
Because multiple circular isoforms can originate from the same back-splice junction site, the circRNAs are named based on these distinctive isoforms. This naming system provides information on source genes, on whether the circRNA is antisense or intronic and on the location of back-splice junction sites on well-annotated exons. The following resources are available in this database: novel circRNAs, integrated miRNA-target networks, expression profiles of circRNA isoforms, genomic annotations of circRNA isoforms and sequences of circRNA isoforms. Tissue-specific circRNA expression profiles and a circRNA–miRNA–gene regulatory network are also provided for further studies. This database is accessible from http://circnet.mbc.nctu.edu.tw/ (Table 3) [40].

Table 3. Databases and their specific features and active URLs with reference to circRNAs

| Database | Description | URL link |
| --- | --- | --- |
| circBase | Unified public data sets of circRNAs. Also identifies novel circRNAs in sequencing data. | http://www.circbase.org/ |
| circRNADb | Contains 32 914 human exonic circRNAs selected from diverse sources. It provides information about the circRNA, including genomic information, genome sequence, exon splicing, ORF, IRES and references. | http://reprod.njmu.edu.cn/circrnadb |
| Circ2Traits | Provides circRNA associations with human diseases through a circRNA–miRNA–protein interaction network, disease-associated SNPs and Argonaute interaction sites on circRNAs. | http://gyanxet-beta.com/circdb/ |
| CircNet | Provides tissue-specific circRNA expression profiles and circRNA–miRNA–gene regulatory networks. | http://circnet.mbc.nctu.edu.tw/ |

Tools for circRNA studies

UROBORUS. This is an efficient tool for identifying circRNAs with low expression levels from total RNA-seq data without RNase-R treatment. It works together with TopHat and Bowtie to detect junction reads. TopHat can detect canonical splicing events, but it cannot map junction reads that support back-spliced exons to a reference genome. Because circRNAs arise from this back-splicing event and must be detected among the unmapped reads, UROBORUS takes the TopHat unmapped.sam results as input. It first extracts 20 bp from each of the two ends of every read in the unmapped.sam file to form an artificial paired-end seed in fastq format. This short 20 bp paired-end seed is then aligned to the human reference genome (hg19) with a maximum of 2 bp mismatches using TopHat with default parameters. The alignment yields two classes of reads: balanced mapped junction (BMJ) reads and unbalanced mapped junction (UMJ) reads. BMJ reads align to the joining region of two back-spliced exons with at least 20 bp of overhang at each end of the read. Reads that align to the joining region of two back-spliced exons with <20 bp of overhang at one end are termed UMJ reads. Among these candidate back-spliced junction reads, junctions supported by more than two reads are annotated as candidate circRNAs. This tool is available at http://uroborus.openbioinformatics.org/ [41].

CIRI.
An acronym for CircRNA Identifier, CIRI is a novel chiastic clipping signal-based algorithm for unbiased and accurate de novo detection of circRNAs from transcriptome data using multiple filtration approaches. When CIRI was applied to ENCODE RNA-seq data, the tool authors were able to identify, and experimentally validate, intronic/intergenic circRNAs. CIRI scans SAM files twice. During the first scan of the SAM alignment, it detects junction reads with paired chiastic clipping (PCC) signals that reflect a circRNA candidate. Preliminary filtering is then applied using paired-end mapping and GT-AG splicing signals for the junctions. The junction reads are clustered, and each circRNA candidate is detected and recorded. CIRI then scans the SAM alignment a second time to detect additional junction reads. It also performs further filtering to eliminate false-positive candidates resulting from incorrectly mapped reads of homologous genes or repetitive sequences. The identified circRNAs are then annotated. This tool is available at https://sourceforge.net/projects/ciri/ [42].

DCC and CircTest. DCC uses output from the STAR read mapper to systematically identify back-splice junctions and applies a series of filters to identify candidate circRNAs. The tool authors state that their software achieves much higher precision than state-of-the-art competitors at similar sensitivity levels, both on mouse brain data and on publicly available data sets. They further state that DCC estimates circRNA versus host gene expression by counting junction and nonjunction reads, and that their R package CircTest uses these counts to test for host gene-independence of circRNA expression across different experimental conditions. The software is available for public use at https://github.com/dieterich-lab [43].

KNIFE.
An acronym for Known and Novel Isoforms Explorer, KNIFE is a statistics-based splicing detection tool for circular and linear isoforms from RNA-seq data. It quantifies circular and linear RNA splicing events at both annotated and unannotated exon boundaries, thus increasing detection sensitivity. Both single-end and paired-end reads can be analyzed. The algorithm should be used with good-quality reads, i.e. reads whose poor-quality ends have been trimmed using cutAdapt. Annotated junction indices are available for the reference genomes of human (hg19), mouse (mm10), rat (rn5) and drosophila (dm3). To select high-quality circular/linear junctions and eliminate false positives, the identified junctions are reported along with a posterior probability (GLM reports) or a P-value (naive reports). The tool can be accessed from https://github.com/lindaszabo/KNIFE [44].

CircInteractome. Interactions of circRNAs with proteins and miRNAs are a highly useful area of research and study. In this regard, a web-based tool named CircInteractome has been developed for exploring circRNAs and their interacting proteins and miRNAs. CircRNAs have been shown to act as sponges for miRNAs and potentially for RNA-binding proteins (RBPs), thereby serving an important function as posttranscriptional regulators of gene expression. Public circRNA, miRNA and RBP databases are integrated within CircInteractome to provide, through bioinformatic analyses, the binding sites of miRNAs and RBPs on the junction and junction-flanking sequences of circRNAs. The tool developers specify that their tool ‘allows identification of potential circRNAs which can act as RBP sponges, design junction-spanning primers for specific detection of circRNAs of interest, design siRNAs for circRNA silencing, and identification of potential internal ribosomal entry sites (IRES)’. It is accessible at http://circinteractome.nia.nih.gov (Table 4) [45].

Table 4.
Computational tools/web servers and their specific features and active URLs with reference to circRNAs

| Tools | Description | URL link |
| --- | --- | --- |
| UROBORUS | Identifies circRNAs with low expression levels, formed by back-splicing events, from total RNA-seq data without RNase-R treatment. | http://uroborus.openbioinformatics.org/ |
| CIRI | A novel chiastic clipping signal-based algorithm for the detection of circRNAs from transcriptome data using multiple filtration approaches. | https://sourceforge.net/projects/ciri/ |
| DCC & CircTest | A Python package that detects abundant circRNAs and quantifies relative expression changes of circRNAs based on read count data. | https://github.com/dieterich-lab |
| KNIFE | Identifies circRNA isoforms using statistically based splicing detection for circular and linear isoforms from RNA-seq data. | https://github.com/lindaszabo/KNIFE |
| CircInteractome | For exploring circRNAs and their interacting proteins and miRNAs in a network. | http://circinteractome.nia.nih.gov |

tRFs

Databases for tRFs

As of December 2016, only two databases for tRFs could be identified by web search; they are detailed as follows:

tRFdb. tRFdb is the first database of transfer RNA fragments (tRFs). Its source data set is taken from the NCBI GEO and SRA databases. tRFdb currently contains the sequences and read counts of the three classes of tRFs from eight species: Rhodobacter sphaeroides, Schizosaccharomyces pombe, D. melanogaster, C. elegans, Xenopus, zebrafish, mouse and human. A total of 12 877 tRFs are deposited. As there are ‘five’ types of tRFs originating from the mature tRNA (5′-halves, 3′-halves, 5′-tRFs, 3′-tRFs and internal tRFs), this database can be searched by tRF type, viz., tRF-5, -3 or -1, or by tRF-ID such as 5001, 5001a, 5001b and so on. The output consists of the tRF-ID, organism name, tRF type, tRNA gene coordinates, tRNA gene name and hyperlinks to the tRF sequence itself and the experimental details. It is available at http://genome.bioch.virginia.edu/trfdb/ [46].

MINTbase. tRFs can arise from mitochondrial as well as nuclear tRNAs. MINTbase is a web-based repository of these novel tRFs and a tool for the interactive exploration of nuclear and mitochondrial tRNA fragments. It integrates four kinds of information about these molecules: sequence information, expression information, parental tRNA information and genomic information. MINTbase is freely accessible at http://cm.jefferson.edu/MINTbase/ (Table 5) [47].

Table 5. Databases and their specific features and active URLs with reference to tRFs

| Databases | Description | URL link |
| --- | --- | --- |
| tRFdb | Contains sequences and read counts of tRFs in R. sphaeroides, S. pombe, D. melanogaster, C. elegans, Xenopus, zebrafish, mouse and human. | http://genome.bioch.virginia.edu/trfdb/ |
| MINTbase | A repository of nuclear and mitochondrial tRFs and a tool for interactive exploration. | http://cm.jefferson.edu/MINTbase/ |

Tools for tRF studies

tRF2Cancer. This integrated web-based computing system is useful for the accurate identification of tRFs from small RNA deep sequencing data and for determining their expression levels in multiple cancers. A binomial test is used to determine whether reads from a small RNA-seq data set represent tRFs, i.e. it estimates the significance of the abundance of sequenced sRNAs distributed on each tRNA. A classification method is then used to annotate the tRF type. Another tool implemented therein, ‘tRFinCancer’, is used to inspect the expression of tRFs across different cancer types. ‘tRF-Browser’ is used to determine the sites of origin of tRFs on the corresponding source tRNA and the distribution of chemical modification sites in tRFs, including m5C, 2′-O-Me, Ψ and m6A. The source link for this tool is http://rna.sysu.edu.cn/tRFfinder/ [48].

High-throughput Annotation of Modified Ribonucleotides (HAMR). Modified nucleotides arise in RNA through posttranscriptional modification. HAMR finds potential signatures of nucleotide modification at single-nucleotide resolution. Both SNPs and candidate RNA modifications, including tRNA modifications, can be mapped and used for the likely identification of tRFs from a pool of small RNA-seq data. The web version of HAMR is available at http://wanglab.pcbi.upenn.edu/hamr [49].

tDRmapper. This tool was developed on the premise that tRFs are difficult to map accurately because the exact copy number and annotation of human tRNA genes are as yet incomplete. Further, tRNAs are extensively subject to chemical modifications.
This makes the detection of tRFs and their comparison with tRNAs difficult, and there is no standard nomenclature. tDRmapper was therefore developed as an alignment tool for mapping, naming, quantifying and graphically visualizing novel tRFs. Small RNA-seq reads are filtered based on the quality of each base and the length of the reads. Filtered reads are mapped first to mature tRNA sequences and then to pre-tRNA sequences, in a hierarchical manner that progressively allows for errors: one mismatch, one deletion, two mismatches, two deletions and a three-base-pair deletion. tDRs (tRNA-derived RNAs) are then annotated based on size and location and quantified based on the fraction of reads aligning to the parent tRNA and the maximum coverage across all positions of the tRNA. tDRmapper is available from this link: https://github.com/sararselitsky/tDRmapp (Table 6) [50].

Table 6. Computational tools/web servers and their specific features and active URLs with reference to tRFs

| Tools | Description | Accuracy [reference] | URL link |
| --- | --- | --- | --- |
| tRF2Cancer | Tool for identification of tRFs and determination of their expression levels in multiple cancers. | Not available | http://rna.sysu.edu.cn/tRFfinder/ |
| HAMR | Scans RNA-seq data for sites showing potential signatures of nucleotide modification in tRFs. | 98% (two classes of adenosine modification), 79% (two classes of guanosine modification) [49] | http://wanglab.pcbi.upenn.edu/hamr |
| tDRmapper | An alignment tool for mapping, naming, quantifying and graphical visualization of novel tRFs from small RNA-seq libraries. | Not available | https://github.com/sararselitsky/tDRmapp |

Conclusions

Novel molecules such as lncRNAs, circRNAs and tRFs are making their mark in a new research arena, and NGS and RNA-seq are the key technologies facilitating this work. In this article, we have strived to put together several databases and tools of immediate use to researchers, which will help in reaching sound scientific decisions and in exploring novel molecules inside the cell, most of which appear to be regulatory in nature. The characteristic features detailed for each of these resources enable specific types of information to be retrieved, analyzed and studied. We also stress that, rather than relying on a single database, using multiple databases and tools to reach a consensus enhances the quality and accuracy of results. Biomedical research such as that on cancer is increasingly big data-driven, leading to the discovery of these and other novel molecules. Given the need for further investigation, our article strives to provide the right impetus in this direction.

Key Points

Biomedical research such as that on cancer is increasingly big data-driven, using novel technologies such as NGS, RNA-seq and ChIP-seq.
Novel molecules such as lncRNAs, sncRNAs, circRNAs and tRFs are increasingly being discovered and found to play a possible regulatory role that is crucial in driving diseases.
In this article, we provide a catalog of existing databases and tools on these novel molecules with their descriptions.
This will allow researchers as end users to arrive at sound scientific decisions, thereby providing an attractive opportunity for further studies on these interesting novel molecules.

A.
Saleembhasha is a PhD scholar working on cancer systems biology and the role of novel molecules in cancer. His interest lies in exploring several Bioinformatics databases and tools.

Seema Mishra, PhD, is Head of the Bioinformatics and Systems Biology Laboratory at the Department of Biochemistry, University of Hyderabad, India. Her expertise is in exploring all the major facets of Computational and Systems Biology, ranging from sequence and molecular modeling analyses and drug designing, to uncover laws governing the development of Cancer and Infectious diseases.

References

1 Peng L, Bian XW, Li DK, et al. Large-scale RNA-seq transcriptome analysis of 4043 cancers and 548 normal tissue controls across 12 TCGA cancer types. Sci Rep 2015; 5: 13413.
2 Mittal VK, McDonald JF. Integrated sequence and expression analysis of ovarian cancer structural variants underscores the importance of gene fusion regulation. BMC Med Genomics 2015; 8: 40.
3 Myers JS, von Lersner AK, Robbins CJ, Sang QX. Differentially expressed genes and signature pathways of human prostate cancer. PLoS One 2015; 10: e0145322.
4 Roque DR, Makowski L, Chen TH, et al. Association between differential gene expression and body mass index among endometrial cancers from The Cancer Genome Atlas Project. Gynecol Oncol 2016; 142: 317–22. doi: 10.1016/j.ygyno.2016.06.006.
5 Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform 2015; 16: 59–70.
6 Peter S, Borkowska E, Drayton RM, et al. Identification of differentially expressed long noncoding RNAs in bladder cancer. Clin Cancer Res 2014; 20: 5311–21.
7 Chen T, Xie W, Xie L, et al. Expression of long noncoding RNA lncRNA-n336928 is correlated with tumor stage and grade and overall survival in bladder cancer. Biochem Biophys Res Commun 2015; 468: 666–70.
8 Song W, Liu YY, Peng JJ, et al. Identification of differentially expressed signatures of long non-coding RNAs associated with different metastatic potentials in gastric cancer. J Gastroenterol 2016; 51: 119–29.
9 Pandey GK, Mitra S, Subhash S, et al. The risk-associated long noncoding RNA NBAT-1 controls neuroblastoma progression by regulating cell proliferation and neuronal differentiation. Cancer Cell 2014; 26: 722–37.
10 Yan X, Hu Z, Feng Y, et al. Comprehensive genomic characterization of long non-coding RNAs across human cancers. Cancer Cell 2015; 28: 529–40.
11 Pastori C, Kapranov P, Penas C, et al. The Bromodomain protein BRD4 controls HOTAIR, a long noncoding RNA essential for glioblastoma proliferation. Proc Natl Acad Sci USA 2015; 112: 8326–31.
12 Kawaji H, Nakamura M, Takahashi Y, et al. Hidden layers of human small RNAs. BMC Genomics 2008; 9: 157.
13 Cole C, Sobala A, Lu C, et al. Filtering of deep sequencing data reveals the existence of abundant dicer-dependent small RNAs derived from tRNAs. RNA 2009; 15: 2147–60.
14 Green D, Fraser WD, Dalmay T. Transfer RNA-derived small RNAs in the cancer transcriptome. Pflugers Arch 2016; 468: 1041–7.
15 Honda S, Loher P, Shigematsu M, et al. Sex hormone-dependent tRNA halves enhance cell proliferation in breast and prostate cancers. Proc Natl Acad Sci USA 2015; 112: E3816–25.
16 Sanger HL, Klotz G, Riesner D, et al. Viroids are single-stranded covalently closed circular RNA molecules existing as highly base-paired rod-like structures. Proc Natl Acad Sci USA 1976; 73: 3852–6.
17 Salzman J, Chen RE, Olsen MN, et al. Cell-type specific features of circular RNA expression. PLoS Genet 2013; 9: e1003777.
18 Szabo L, Salzman J. Detecting circular RNAs: bioinformatic and experimental challenges. Nat Rev Genet 2016; 17: 679–92.
19 Kulcheski FR, Christoff AP, Margis R. Circular RNAs are miRNA sponges and can be used as a new class of biomarker. J Biotechnol 2016; 238: 42–51.
20 Li Z, Huang C, Bao C, et al. Exon-intron circular RNAs regulate transcription in the nucleus. Nat Struct Mol Biol 2015; 22: 256–64.
21 Guttman M, Russell P, Ingolia NT, et al. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell 2013; 154: 240–51.
22 Guttman M, Garber M, Levin JZ, et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol 2010; 28: 503–10.
23 Cabili MN, Trapnell C, Goff L, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 2011; 25: 1915–27.
24 Dinger ME, Amaral PP, Mercer TR, et al. Long noncoding RNAs in mouse embryonic stem cell pluripotency and differentiation. Genome Res 2008; 18: 1433–45.
25 Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 2009; 324: 218–23.
26 Rossi MN, Antonangeli F. LncRNAs: new players in apoptosis control. Int J Cell Biol 2014; 2014: 473857.
27 Zhao Y, Li H, Fang S, et al. NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic Acids Res 2016; 44: D203–8.
28 Quek XC, Thomson DW, Maag JL, et al. lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res 2015; 43: D168–73.
29 Volders PJ, Verheggen K, Menschaert G, et al. An update on LNCipedia: a database for annotated human lncRNA sequences. Nucleic Acids Res 2015; 43: D174–80.
30 Li J, Han L, Roebuck P, et al. TANRIC: an interactive open platform to explore the function of lncRNAs in cancer. Cancer Res 2015; 75: 3728–37.
31 Lertampaiporn S, Thammarongtham C, Nukoolkit C, et al. Identification of non-coding RNAs with a new composite feature in the hybrid random forest ensemble algorithm. Nucleic Acids Res 2014; 42: e93.
32 Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 2011; 27: i275–82.
33 Tripathi R, Patel S, Kumari V, et al. DeepLNC, a long non-coding RNA prediction tool using deep neural network. Netw Model Anal Health Inform Bioinform 2016; 5: 21.
34 Hou M, Tang X, Tian F, et al. AnnoLnc: a web server for systematically annotating novel human lncRNAs. BMC Genomics 2016; 17: 931.
35 Kong L, Zhang Y, Ye ZQ, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res 2007; 36: W345–9.
36 Li A, Zhang J, Zhou Z. PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioinformatics 2014; 15: 311.
37 Glažar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA 2014; 20: 1666–70.
38 Chen X, Han P, Zhou T, et al. circRNADb: a comprehensive database for human circular RNAs with protein-coding annotations. Sci Rep 2016; 6: 34985.
39 Ghosal S, Das S, Sen R, et al. Circ2Traits: a comprehensive database for circular RNA potentially associated with disease and traits. Front Genet 2013; 4: 283.
40 Liu YC, Li JR, Sun CH, et al. CircNet: a database of circular RNAs derived from transcriptome sequencing data. Nucleic Acids Res 2016; 44: D209–15.
41 Song X, Zhang N, Han P, et al. Circular RNA profile in gliomas revealed by identification tool UROBORUS. Nucleic Acids Res 2016; 44: e87.
42 Gao Y, Wang J, Zhao F. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biol 2015; 16: 4.
43 Cheng J, Metge F, Dieterich C, et al. Specific identification and quantification of circular RNAs from sequencing data. Bioinformatics 2016; 32: 1094–6.
44 Szabo L, Morey R, Palpant NJ, et al. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol 2015; 16: 126.
45 Dudekula DB, Panda AC, Grammatikakis I, et al. CircInteractome: a web tool for exploring circular RNAs and their interacting proteins and microRNAs. RNA Biol 2016; 13: 34–42.
46 Kumar P, Mudunuri SB, Anaya J, Dutta A. tRFdb: a database for transfer RNA fragments. Nucleic Acids Res 2015; 43: D141–5.
47 Pliatsika V, Loher P, Telonis AG, Rigoutsos I. MINTbase: a framework for the interactive exploration of mitochondrial and nuclear tRNA fragments. Bioinformatics 2016; 32: 2481–9.
48 Zheng LL, Xu WL, Liu S, et al. tRF2Cancer: a web server to detect tRNA-derived small RNA fragments (tRFs) and their expression in multiple cancers. Nucleic Acids Res 2016; 44: W185–93.
49 Ryvkin P, Leung YY, Silverman IM, et al. HAMR: high-throughput annotation of modified ribonucleotides. RNA 2013; 19: 1684–92.
50 Selitsky SR, Sethupathy P. tDRmapper: challenges and solutions to mapping, naming, and quantifying tRNA-derived RNAs from human small RNA-sequencing data. BMC Bioinformatics 2015; 16: 354.

© The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com

Briefings in Functional Genomics, Oxford University Press

Publisher: Oxford University Press
ISSN: 2041-2649
eISSN: 2041-2647
DOI: 10.1093/bfgp/elx013

Abstract

Powerful next-generation sequencing (NGS) technologies, more specifically RNA sequencing (RNA-seq), have been pivotal to the detection and analysis of, and hypothesis generation about, novel biomolecules: long noncoding RNAs (lncRNAs), tRNA-derived fragments (tRFs) and circular RNAs (circRNAs). Experimental validation of the occurrence of these biomolecules inside the cell has been reported. Their differential expression and functionally important roles in several cancer types, as well as in other diseases such as Alzheimer’s and cardiovascular diseases, have garnered interest in further studies in this research arena. In this review, starting from a brief introduction to NGS and RNA-seq and to the expression and role of lncRNAs, tRFs and circRNAs in cancer, we comprehensively analyze the current landscape of databases and computational software used for analysis and visualization in this emerging field. Our review will help end users and research investigators gain information on the existing databases and tools, as well as an understanding of the specific features these offer. This should help researchers use them properly, guiding them toward novel hypothesis generation and saving the time and costs involved in extensive experimental processes for these three novel functional RNAs.

Keywords: next-generation sequencing technologies, RNA sequencing, RNA-seq, long noncoding RNAs, circular RNAs, tRNA-derived fragments, cancer, databases, tools

Introduction

Today’s science is essentially technology driven: the newer the technology, the faster the progress and the more advanced the science. The capabilities thus imparted translate into major scientific breakthroughs in virtually all of the major fields, from Genomics to Proteomics to Metabolomics and on to Systems and Synthetic biology.
Groundbreaking real-world applications range from human diseases to evolution, agriculture, medicine and metagenomics (sequencing genomes of microbiota present in the environment and in humans), touching all the major elements of our life. The impact of these novel technologies on biomedical science is tremendous.

Genomics and genome projects

Ever since the first protein and DNA sequences were determined using the Edman degradation method and Sanger sequencing, respectively, the need to produce more and more information in an automated, robust, high-throughput, scalable and faster manner, from any biological sample at low cost, together with the need to sequence an entire genome in one step, has led to a kind of ‘sequencing technology revolution’. The sequencing technologies of today are ‘ultra high-throughput’ technologies, called next-generation sequencing (NGS) technologies, powerful and robust yet simple in their principles. Founded on the principles of its forerunner, capillary electrophoresis-based sequencing, NGS technology has pushed the frontiers of science to altogether new levels. Using NGS, thousands of sequences can be deciphered in a ‘concurrent’ manner and computational analyses done at a ‘genome-wide’ scale. Besides its use in DNA sequencing and RNA sequencing (RNA-seq), it is also the basis of chromatin immunoprecipitation sequencing (ChIP-sequencing), used to identify binding sites of proteins that associate with DNA molecules, such as transcription factors and DNA polymerases. Personalized genomics with personal whole genome sequence information, and genome-wide profiling of several types of RNA molecules such as messenger RNAs (mRNAs), microRNAs (miRNAs), long noncoding RNAs (lncRNAs), small noncoding RNAs (sncRNAs), circular RNAs (circRNAs) and tRNA-derived fragments (tRFs), are now possible with this technology.
In gene expression studies, RNA-seq provides a stronger platform than microarrays, owing to its precise quantitation of the RNA transcripts produced by alternative splicing events and its ability to discover novel transcripts.

NGS: Sequencing machines, genome centers and major projects

NGS technology started in 2005, shortly after the Human Genome Project was declared complete in the euchromatic region in 2003. The first of the sequencing machines was Roche 454 sequencing, based on pyrosequencing methodology, which is currently being discontinued. Higher end machines such as the Illumina HiSeq 2500 and MiSeq sequencers and the Ion Proton by Ion Torrent Systems Inc. are expected to dominate the Genomics research field. Among newly developing technologies are Single Molecule Real Time sequencing methods and Nanopore DNA Sequencing. Data analysis for NGS platforms requires a strong Bioinformatics component, which needs to keep pace with the huge volume of data generated. The main goal is to reduce costs per genome to <$1000.

Several Genome Centers worldwide are using these high-end NGS machines to support their own projects as well as outside researchers. Notable among these genome centers are the following: JP Sulzberger Columbia Genome Center at Columbia University Department of Systems Biology, New York City; The US Department of Energy Joint Genome Institute, California, USA; Wellcome Trust Sanger Institute, Hinxton, UK; Max Planck Institute for Molecular Genetics, Berlin, Germany; National Institute of Biomedical Genomics, Kolkata, India; and Center for Cellular and Molecular Platforms, National Centre for Biological Sciences-Tata Institute of Fundamental Research (NCBS-TIFR), Bangalore, India. Using NGS technology, several high-throughput genome projects have been undertaken or are slated to start.
Interesting projects among these are the Functional Annotation of the Mammalian Genome 5 project, to functionally annotate transcriptomes specific to mammalian cells; the 1000 Genomes Project, intended to catalog human genetic variation from about 25 populations all over the world; the 1000 Fungal Genomes Project, to sequence every family of fungi; The Cancer Genome Atlas (TCGA), to sequence genomes of different types of cancer; the 100K Foodborne Pathogen Genome Project; and the International Cancer Genome Consortium.

TCGA is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute. It aimed to provide ‘comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer’. To date, approximately 2.5 petabytes of data describing genomic alterations in tumor tissue and matched normal tissues have been generated and made publicly available for further research, analysis and dissemination of crucial information. Data are made available through the Genomic Data Commons portal (which also hosts the Therapeutically Applicable Research to Generate Effective Therapies platform) and are in the form of methylation-array data, genotyping-array data, clinical and metadata information, histopathological images, DNA/mRNA/miRNA/total RNA-seq data, DNA methylation, copy number variation, microsatellite instability and protein expression, among others. In 2017, the NCI Center for Cancer Genomics will be taking the place of TCGA and intends to continue the work in the same direction.

Another consortium aiming at pan-cancer studies, the International Cancer Genome Consortium, puts together data from several countries across the world including Australia, Brazil, Canada, China, EU, France, Germany and India, to name a few.
The primary goals are to study and catalog genomic abnormalities (somatic mutations, abnormal expression of genes, epigenetic modifications) in tumors from 50 different cancer types and/or subtypes, and to make the data available to researchers in the effort to combat cancers of clinical and societal relevance. To date, 89 committed projects are available, involving a variety of experimental protocols such as array genotyping, DNA deep resequencing and cytogenetic analyses, and computational pipelines implementing statistical analyses and correlations between molecular profiles and clinical features, among others.

BOX 1 The Technology: NGS

A generalized NGS methodology starts with genomic DNA or complementary DNA (cDNA) extracted from samples, e.g. blood samples, and proceeds through the following steps:

Template: The DNA is fragmented, and adapters (synthesized oligonucleotides a few base pairs long and of known sequence) are added to the ends of the fragments. These adapters are used to attach the DNA fragments to the flow cell of the sequencing machine and also contain primer sequences (primers are the starting points from which a DNA polymerase molecule extends a chain). These libraries of fragments are then amplified, e.g. by PCR, to generate large numbers of clones.

Sequencing: Some sequencing machines operate on the method of ‘sequencing by synthesis’. The fragments in the library act as templates, on which new strands are synthesized by providing known nucleotides sequentially. The DNA strands are extended by one nucleotide at a time, and the reads (sequences) are recorded by software. Millions of such sequencing reactions are done in parallel.

Assembly: The reads are aligned to a reference genome (re-sequencing) to generate larger contigs or, where a reference genome is not available, the reads are assembled de novo (de novo sequencing).
Reference genomes are representative sequences of a species’ genome, characterized by high sequence quality and well annotated with key genomic features such as the total number of protein-coding genes, RNA genes, and exon and intron locations, among others. A high-quality assembly of the human reference genome, GRCh38, was released in December 2013 by the Genome Reference Consortium. Whole genome sequences can be analyzed with Bioinformatics tools for genetic variant detection, the presence of single nucleotide polymorphisms (SNPs) and insertion/deletion (Indel) mutations associated with diseases, the presence of regulatory and protein-binding sites, finding novel genes and assessing gene expression levels through transcript abundance.

RNA-seq

During RNA-seq, the mRNA isolated through a poly-A pulldown assay is converted into cDNA through a reverse transcription reaction, and the further steps for sequencing are essentially the same as described above in the ‘NGS: Sequencing machines, genome centers and major projects’ section. The gene expression level is determined through the relative abundance of mRNA transcripts, quantified after aligning the reads to the reference genome. Contemporary research is geared toward using RNA-seq to quantify gene expression changes and to identify novel transcripts and their possible roles in disease states. Identification of differentially expressed genes, isoforms and gene fusions, and prediction of novel genes/proteins as putative drug targets, are some of the other forward-looking research arenas. Toward these ends, several studies have been and are currently being undertaken in the context of various diseases. As one example from cancer studies, large-scale transcriptome analysis using RNA-seq data from 4043 cancers of diverse types and 548 normal controls identified seven co-regulated gene sets, also called cross-cancer signatures, altered across a diverse panel of primary human cancer samples.
A 14-gene signature extracted from these cross-cancer signatures was capable of clearly distinguishing cancer from normal samples [1]. Integrated DNA sequencing and transcriptional profile analysis detected 10 034 ovarian cancer structural variants (SVs) at base-pair level resolution, and these SVs, along with gene fusions, were implicated as significant factors in the onset and progression of ovarian cancer [2]. Using RNA-seq data, an underexplored area, Ran regulation of mitotic spindle formation, was identified for further studies on prostate cancer pathogenesis [3]. In another interesting study [4], RNA-seq data were used to identify genes related to increasing body mass index (BMI) among four integrated clusters (POLE, MSI, CNL, CNH) of endometrial cancer (EC). Differences in gene expression profiles of obese and nonobese women with EC and the association of BMI within these clusters were measured. It was observed that 181 genes, mainly involved in cell cycle and DNA-metabolism processes, were significantly up- or down-regulated with increasing BMI. This set of genes included LPL, IRS-1, IGFBP4, IGFBP7 and the progesterone receptor genes.

RNA-seq data for differential gene expression studies are typically analyzed with the same set of tools that are used to analyze microarrays, viz., t-test, SAM, limma and ANOVA, together with multiple-testing adjustments such as the Bonferroni correction. In addition, RNA-seq-specific software tools such as DESeq, edgeR and SAMseq are being increasingly used. A detailed protocol for the usage and understanding of these software tools is provided in [5].

Types based on regions sequenced

Based on the genomic regions being sequenced, NGS technologies involve three basic types of sequencing:

Whole Genome Sequencing: Whole genomes are sequenced.
Exome Sequencing: All of the exon (coding) regions in a genome are sequenced.
Targeted Resequencing: Only a portion of a genome or a gene that is thought to function in disease states is sequenced.
These different types of sequencing studies yield specific information amenable to further investigation and novel hypothesis generation. The latter two are considered cost-effective, reduce turnaround time and are more focused and less labor-intensive than whole genome sequencing.

NGS and lncRNA

An lncRNA is a transcript longer than 200 nt that does not code for protein. Nearly 30 000 (and still counting) lncRNA transcripts are present in humans, as identified via the Encyclopedia of DNA Elements (ENCODE) and Functional Annotation of Mammals (FANTOM) consortia. GENCODE consortium analysis indicates that, similar to protein-coding genes, lncRNA transcription is regulated by histone modification and lncRNAs are processed by a similar splicing mechanism. Their expression is strikingly cell type and tissue specific. In addition to their ubiquitous role in transcriptional regulation of protein-coding genes, several lncRNAs have been found to be involved in tumorigenesis and metastasis. lncRNA expression can reflect cancer phenotype, as shown by 32 upregulated lncRNAs identified as playing a role in urothelial cancer progression. Of these, upregulation of the ABO74278 lncRNA had an anti-apoptotic effect and maintained the proliferative state in cancer, through potential interaction with the tumor suppressor EMP1 [6], and it is considered to be a strong prognostic biomarker candidate. Other lncRNAs, such as lncRNA-n336928 in bladder cancer [7], XLOC_010235 or RP11-789C1.1 in gastric cancer [8] and NBAT1 in neuroblastoma [9], have been implicated in cancer development and progression. Large-scale analyses showed that lncRNA expression and dysregulation are highly tumor- and lineage-specific and often associated with somatic copy number alterations, promoter methylation and SNPs. An siRNA screening strategy and a co-expression analysis approach identified cancer driver lncRNAs and predicted their functions [10]. LncRNAs are important epigenetic regulators as well.
Overexpression of the HOX transcript antisense RNA (HOTAIR) lncRNA in glioblastoma leads to cellular proliferation. Chromatin immunoprecipitation studies revealed binding of Bromodomain Containing 4 (BRD4) protein, an epigenetic modulator, to the HOTAIR promoter, suggesting that BET proteins can directly regulate lncRNA expression. This is one of the novel mechanisms through which BET proteins can control tumor cell proliferation [11].

NGS and t-RNA-derived small ncRNA

t-RNA-derived small ncRNAs were discovered through sequencing of RNAs of size 19–40 nt [12] and through deep sequencing and bioinformatics studies. They are most likely produced via specific processing of tRNA molecules [13] and are also known as tRFs. The study of their role in cancer development and progression is currently in its infancy. tRFs induced under hypoxic conditions may play a role in preventing metastasis [14]. A novel type of tRF, termed Sex Hormone-dependent tRNA-derived RNAs (SHOT-RNAs), was found to be specifically and abundantly expressed in androgen receptor-positive prostate cancer and estrogen receptor-positive breast cancer cell lines. These SHOT-RNAs were found to be produced from amino-acylated mature tRNAs by angiogenin-mediated anticodon cleavage promoted by sex hormones and their receptors [15].

NGS and CircRNA

The first evidence of the existence of circRNAs came from plant viroids, where they were present as single-stranded, circularly closed RNA molecules [16]. In humans and other organisms, circRNAs are also being discovered through analyses of deep sequencing data. CircRNAs, as their name implies, do not have 5ʹ and 3ʹ ends; instead they are circular molecules, forming a covalently closed loop. As they are not polyadenylated, poly(A)-RNA-seq data cannot be used for their discovery. They were identified by research groups working to understand exon scrambling events, in which exons are spliced in noncanonical order.
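As a minimal illustration of the exon scrambling idea described above, the sketch below flags a splice junction as scrambled when the donor exon lies at or downstream of the acceptor exon in annotated transcript order. The function names and the exon-number representation of a junction are assumptions made for this example; they are not taken from any of the tools reviewed here, and real detection works on read alignments rather than exon numbers.

```python
def is_scrambled_junction(donor_exon: int, acceptor_exon: int) -> bool:
    """Return True when a junction joins exons in noncanonical order.

    Exons are numbered 1..n in transcript (5' to 3') order. A canonical
    junction joins the end of a lower-numbered exon to the start of a
    higher-numbered one. A scrambled (back-splice) junction joins the end
    of an exon to the start of the same exon or an upstream exon.
    """
    if donor_exon < 1 or acceptor_exon < 1:
        raise ValueError("exon numbers are 1-based")
    return donor_exon >= acceptor_exon


def scrambled_junctions(junctions):
    """Keep only the (donor, acceptor) pairs that are in scrambled order."""
    return [(d, a) for d, a in junctions if is_scrambled_junction(d, a)]
```

For instance, a junction from exon 3 back to exon 2 would be kept as a circRNA candidate, whereas a junction from exon 2 to exon 3 would be treated as ordinary linear splicing.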
Deep sequencing reads that show a junction between two ‘scrambled’ exons are typically used to identify circRNAs. In humans, it is estimated that circRNAs may account for 1% as many molecules as polyadenylated RNAs [17]. Up- and down-regulation of circRNAs has also been found to be associated with cancer. A review paper on bioinformatics and experimental methodologies for circRNA detection has been published [18]. CircRNAs can act as a novel class of biomarkers and have been postulated to be miRNA sponges, binding to miRNAs and repressing their function [19]. One study of RNA polymerase II-associated circRNAs in human cells revealed a new class of circRNA molecules. In these transcripts, interestingly, introns were found to be ‘retained’ between circularized exons. Named exon–intron circRNAs or EIciRNAs, these were found to be involved in regulating transcription of their parental genes through RNA–RNA interaction between them and U1 snRNA [20].

Identification of lncRNA, sncRNA, tRF and circRNAs using deep sequencing and bioinformatics: databases, tools and techniques

Of major interest in today’s scientific world is the power that deep sequencing and Bioinformatics possess in discovering novel molecules inside cells. Novel unannotated coding and noncoding transcripts can be detected and analyzed for their probable roles and mechanisms of action, and several research areas have been focusing on these in recent times. These studies require powerful computational tools to detect lncRNAs and to differentiate them from coding RNAs and other ncRNAs. Methods for identification of lncRNAs from RNA-seq libraries are developed by integrating RNA-seq data with known annotation databases. The primary step involves transcriptome reconstruction from the RNA-seq data of each sample. After that, reads are aligned to the reference genome and annotated using the RefSeq, UCSC and GENCODE protein-coding transcript databases.
This is done to eliminate annotated non-lncRNA transcripts (protein-coding genes, tRNAs, miRNAs and pseudogenes). To distinguish low-level-expressed, unreliable single-exon fragment assemblies from low-level-expressed lncRNAs, a read coverage threshold of >200 bp has to be selected. To eliminate unannotated or novel coding genes, two methods, namely PhyloCSF and Pfam, can be used. PhyloCSF (phylogenetic codon substitution frequency) is used to assess putative open reading frames (ORFs; these are evolutionarily conserved ORFs with synonymous amino acid content), whereas Pfam is used to exclude any transcript that encodes one of the 31 912 protein domains. Finally, the remaining lncRNA transcripts are listed (Figure 1).

Figure 1. Flowchart depicting identification of lncRNA from RNA-seq data. (A colour version of this figure is available online at: https://academic.oup.com/bfg)

In contrast, circRNAs are generally identified from poly-A-depleted or rRNA-depleted RNA-seq data (Figure 2). In one such method, called rRNA-depleted RNA-seq, also known as RibominusSeq, sequencing is done after the depletion of rRNA. The sequencing reads are then aligned to the reference genome, and the mapped reads are discarded. From the unmapped reads, 20-nucleotide anchor sequences from either end are aligned further to find unique anchor positions. A series of filtering criteria are then applied to identify circRNAs. An anchor pair aligning in reverse orientation, with canonical GU/AG splicing signals flanking the splice sites, suggests back-splicing and thus indicates the presence of a circRNA.

Figure 2. Flowchart depicting identification of circRNA from RNA-seq data. (A colour version of this figure is available online at: https://academic.oup.com/bfg)

After small RNA-seq, tRFs are distinguished from other RNAs by integrating known annotation databases. In general, small RNA-seq data are first mapped to the human genome and unmapped reads are discarded (Figure 3). The mapped reads may then be annotated with known transcript databases (GENCODE, miRbase, Rfam) to eliminate non-tRFs (mRNA, miRNA, rRNA, snRNA and snoRNA). The enriched tRFs may further be annotated to pre-tRNA and mature tRNA. Pooled tRFs are classified according to their biogenesis site. The 5ʹ tRFs and 3ʹ tRFs are derived from mature tRNA, and 3ʹ U-tRFs from the 3ʹ end of pre-tRNA; the recently characterized i-tRFs (internal tRFs) have been shown to be derived from the anticodon region of the tRNA and either the D-arm or the TψC-arm.

Figure 3. Flowchart depicting identification of tRFs from small RNA-seq data. (A colour version of this figure is available online at: https://academic.oup.com/bfg)

LncRNA

LncRNAs are synthesized by the same RNA polymerase II machinery that synthesizes mRNAs. Their characteristic features include a 5ʹ-cap, a 3ʹ-poly-A tail and the absence of a long ORF, and a transcript must exceed the 200 nt length threshold to qualify as an lncRNA. A large majority of lncRNAs do not encode proteins, and these have a multi-exonic structure [21–23]. They are also distinguished bioinformatically from short ncRNAs, which are <200 nt in length and have known functions [24].
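The length and ORF criteria above can be sketched as a simple sequence filter. This is only an illustration under stated assumptions: the function names are hypothetical, the ORF scan covers the forward strand only, and the default `max_orf_nt` of 100 is an illustrative cutoff that a real pipeline would tune and combine with tools such as PhyloCSF or Pfam.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_nt(seq: str) -> int:
    """Length in nt of the longest ATG..stop ORF on the forward strand.

    The stop codon is counted in the length. RNA input (with U) is accepted.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def passes_lncrna_filters(seq: str, min_len_nt: int = 200,
                          max_orf_nt: int = 100) -> bool:
    """Candidate lncRNA: longer than min_len_nt and without a long ORF.

    Both thresholds are assumptions for this sketch, not fixed standards.
    """
    return len(seq) > min_len_nt and longest_orf_nt(seq) <= max_orf_nt
```

A transcript that passes this filter is only a candidate; it says nothing about conservation, expression or function, which the databases and tools described below address.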
As protein-coding regions have >100 nt long ORFs, the absence of long ORFs in an lncRNA indicates that its function is yet to be defined accurately. Ribosome profiling followed by sequencing is a novel technique to estimate translation activity [25], which along with mass spectrometry can be used to verify translation. LncRNAs have a wide range of functional modes, as well as many mechanisms through which they regulate cell cycle progression, apoptosis and differentiation [26].

Databases for lncRNA

NONCODE2016. An interactive database, NONCODE2016 provides a collection of ncRNAs (except tRNA and rRNA) from 16 species. A total of 527 336 lncRNAs have been submitted to NONCODE, mostly curated from literature and public databases. For human and mouse, the deposited lncRNAs number 167 150 and 130 558, respectively, as accessed in October 2016 [27]. Besides basic information such as location, strand, exon number, length and sequence, three important features are present, namely, conservation annotation, the relationships between lncRNAs and diseases, and an interface to choose high-quality data sets through predicted scores, literature and long reads. This database is accessible at http://www.noncode.org.

lncRNAdb. This database provides users with a comprehensive, manually curated reference database of 287 eukaryotic lncRNAs that have been published in the scientific literature. It is built on an improved user interface enabling access to sequence information, as well as to expression profiles from the Illumina Body Atlas. A BLAST search tool against lncRNAdb is also available, which is useful in searching for novel transcripts [28]. The following types of information can be extracted from this database: nucleotide sequences; genomic context; gene expression data derived from the Illumina Body Atlas; structural information; subcellular localization; conservation; and function with referenced literature.

LNCipedia.
A human lncRNA-specific database, LNCipedia in its version 4 accounts for 118 777 annotated lncRNA transcripts, obtained from various sources. Large-scale reprocessing of publicly available proteomics data is undertaken to assess protein-coding potential, and transcripts with low protein-coding potential are deposited. In addition, a tool to assess lncRNA gene conservation between human, mouse and zebrafish has been implemented. It can be accessed at http://www.lncipedia.org. Data deposited in LNCipedia can also be visualized using the Integrative Genomics Viewer, a visualization tool for high-throughput genomics data [29].

The Atlas of Noncoding RNAs in Cancer (TANRIC). This database characterizes the expression profiles of lncRNAs using large-scale RNA-seq data sets from TCGA and independent data sets in 20 different cancer types. Available at http://bioinformatics.mdanderson.org/main/TANRIC [30], this database enables searching for lncRNAs that correlate strongly with established biomarkers, as well as drug sensitivity studies.

Besides these databases (Table 1), there exist some more lncRNA databases covered in Wikipedia pages, and users are encouraged to have a look at these as well.

Table 1. Databases and their specific features and active URLs with reference to lncRNAs

Database | Description | URL link
NONCODE | Provides a collection of ncRNAs (except tRNA and rRNA) from 16 species. A total of 527 336 lncRNAs are available in NONCODE. | http://www.noncode.org/
lncRNAdb | Provides 287 eukaryotic lncRNAs and BLAST tool for identification of lncRNA. | http://lncrnadb.org
LNCipedia | An integrated database containing 118 777 human lncRNAs. | http://www.lncipedia.org
TANRIC | Interactive exploration of expression and clinical relevance of lncRNA in 20 different cancer types. | http://bioinformatics.mdanderson.org/main/TANRIC

Computational tools developed for lncRNAs

Noncoding RNAs identification with Hybrid Random Forest. This is a classification tool based on a hybrid random forest (RF) algorithm with a logistic regression model to discriminate between short ncRNA and long and complex ncRNA sequences. The logistic regression function, which generates a new feature called SCORE, comprises five features of significance: structure, sequence, modularity, structural robustness and coding potential. This is done as a way to better describe the functional elements in the lncRNAs. The classifier is available at http://ncrna-pred.com/HLRF.htm [31].

PhyloCSF. A comparative genomics method, PhyloCSF is used to determine whether a protein-coding region is present in a multi-species nucleotide sequence alignment. It is based on a formal statistical comparison of phylogenetic codon models. Rather than homology in sequence alignments, it examines conserved regions for the evolutionary signatures of coding sequence. Examples include high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and low frequencies of other missense and nonsense substitutions. This helps to distinguish protein-coding transcripts from ncRNAs among novel transcripts obtained from high-throughput transcriptome sequencing. This software is freely available at http://compbio.mit.edu/PhyloCSF [32].

DeepLNC.
A Deep Neural Network (DNN) is put forward by the authors [33] as a faster and more accurate computational method for screening lncRNAs from mRNAs than other classifiers. Manually annotated training data sets from the LNCipedia and RefSeq databases have been used in the classifier, and the information content stored in k-mer patterns is used as the sole feature for the DNN. This information content is computed using the Shannon entropy function to improve classifier accuracy. It has been implemented as a web prediction tool, which is available at ‘http://bioserver.iiita.ac.in/deeplnc’.

AnnoLnc. This is an online portal for systematically annotating newly identified human lncRNAs. AnnoLnc offers a full spectrum of annotations covering genomic location, RNA secondary structure, transcriptional regulation, expression, protein interaction, miRNA interaction, genetic association and evolution, as well as an abstraction-based text summary. The data are generated from the high-throughput sequencing techniques RNA-seq, ChIP-Seq, AGO CLIP-Seq and RBP CLIP-Seq, together with a gene model file and highly conserved miRNA families [34].

Coding Potential Calculator. This tool is based on a support vector machine (SVM) classifier and assesses the protein-coding potential of a transcript using six biological features. Three features describe the ORF (its length, quality and integrity), and another three are parsed from the output of BLASTX using an E-value cutoff: HIT SCORE and FRAME SCORE, among others. Transcripts that score positively on these features are classified as protein coding, and those that score negatively as noncoding [35]. This tool can be used at http://cpc.cbi.pku.edu.cn/.

Predictor of lncRNAs and messenger RNAs based on an improved k-mer scheme (PLEK). PLEK is an alignment-free tool that uses a computational pipeline based on an improved k-mer scheme and an SVM algorithm. It takes calibrated k-mer frequencies of a transcript sequence as its computational features.
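To make the notion of k-mer frequency features concrete, here is a minimal sketch that turns a transcript sequence into a fixed-length feature dictionary. It computes plain sliding-window frequencies for every k from 1 to `k_max`; the calibration step of the published scheme is not reproduced, and the function name and the default `k_max=5` are assumptions made for this example.

```python
from itertools import product


def kmer_frequencies(seq: str, k_max: int = 5) -> dict:
    """Plain k-mer frequencies for k = 1..k_max over the alphabet ACGT.

    For each k, every possible k-mer gets an entry (0.0 when absent), so all
    transcripts map to a feature vector of the same length. Windows that
    contain other symbols (e.g. N) are counted in the denominator but do
    not contribute to any k-mer.
    """
    seq = seq.upper().replace("U", "T")
    features = {}
    for k in range(1, k_max + 1):
        windows = len(seq) - k + 1  # number of sliding windows of size k
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        for i in range(max(windows, 0)):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
        for kmer, count in counts.items():
            features[kmer] = count / windows if windows > 0 else 0.0
    return features
```

Feature vectors of this kind could then be passed to any binary classifier; because they are computed from the transcript sequence alone, no genome alignment or annotation is needed at prediction time.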
With these k-mer features, an SVM is used to build a binary classification model that separates lncRNAs from mRNAs in the absence of genomic sequences or annotations. The developers report that the tool performed well on a simulated data set and on two real de novo assembled transcriptome data sets (sequenced on the PacBio and 454 platforms) with relatively high rates of indel sequencing errors. PLEK is especially suitable for PacBio or 454 sequence data and for large-scale transcriptome analysis (Table 2) [36].

Table 2. Computational tools/web servers and their specific features and active URLs with reference to lncRNAs
Tool | Description | Specificity (%) [reference] | Sensitivity (%) [reference] | Accuracy (%) [reference] | Link [reference]
Noncoding RNAs identification with Hybrid Random Forest | To identify known ncRNAs by yielding a sensitivity level for prokaryotic and eukaryotic sequences. Based on a hybrid RF algorithm with a logistic regression model. | 92.11 [31] | 90.7 [31] | 93.5 [31] | http://ncrna-pred.com/HLRF.htm [31]
PhyloCSF | To distinguish protein-coding and ncRNAs of novel transcripts. Based on a formal statistical comparison of phylogenetic codon models. | Not available | Not available | Not available | http://compbio.mit.edu/PhyloCSF
DeepLNC | Fast and accurate computational method for screening of lncRNA from mRNA. Based on DNN and Shannon entropy function. | 97.19 [33] | 98.98 [33] | 98.07 [33] | http://bioserver.iiita.ac.in/deeplnc [33]
AnnoLnc | An online portal for systematically annotating newly identified human lncRNAs. | Not available | Not available | Not available | http://annolnc.cbi.pku.edu.cn
Coding Potential Calculator | To distinguish protein-coding RNAs from ncRNAs accurately. Based on SVM classifier. | Not available | Not available | 92.00 [35] | http://cpc.cbi.pku.edu.cn/ [35]
PLEK | A computational pipeline based on k-mer and SVM to identify lncRNAs from messenger RNAs (mRNAs). | 95.80 (PacBio), 95.5 (454) [36] | 94.70 (PacBio), 92.5 (454) [36] | 94.70 (PacBio), 95.4 (454) [36] | https://sourceforge.net/projects/plek/files/ [36]

circRNA

Databases for circRNA

CircBase. This database can be used to browse or search unified public data sets of circRNAs, and the evidence supporting their expression can be accessed and downloaded within the genomic context. Data can be queried using ‘circBase identifier (e.g.
mmu_circ_0000010), refseq transcript ID (NM_027671), gene symbol (Pvt1), genomic coordinates (chrII:123456-7891011) or Gene Ontology term identifiers’. The database can also be queried with a DNA or RNA sequence through a BLAST-like Alignment Tool (BLAT) search. circBase also provides scripts to identify known and novel circRNAs in sequencing data. The genome assemblies currently supported in circBase are hg19 for Homo sapiens, mm9 for Mus musculus, ce6 for Caenorhabditis elegans, latCha1 for Latimeria chalumnae and Latimeria menadoensis, dm3 for Drosophila melanogaster and Oct06 for Schmidtea mediterranea. It is freely accessible at http://www.circbase.org/ [37].

circRNADb. This database contains 32 914 human exonic circRNAs selected from diverse sources. For each circRNA it provides genomic information, genome sequence, exon splicing, ORF, internal ribosome entry site (IRES) and references. The database authors report that some of these circRNAs may be able to encode proteins, which had not previously been reported in any species. In total, 7170 IRES elements were found by annotating 16 328 circRNAs with ORFs longer than 100 amino acids. In all, 46 circRNAs from 37 genes have corresponding proteins detected by mass spectrometry. The database is accessible at http://reprod.njmu.edu.cn/circrnadb [38].

Circ2Traits. This database assesses the potential association of circRNAs with human diseases in two ways. The interactions of circRNAs with miRNAs associated with individual diseases are identified, and from these the likelihood of a circRNA being associated with a disease state is calculated. An interaction network between disease-associated miRNAs and protein-coding genes (PCGs), long noncoding genes and circRNA genes is then predicted, followed by Gene Ontology enrichment analysis of particular biological processes to specify the role of the PCGs.
Disease-associated SNPs are then mapped onto circRNA loci, and Argonaute (AGO) interaction sites on circRNAs are identified, because these AGO-binding sites are the functional sites at which a circRNA binds miRNAs and acts as a sponge. Circ2Traits categorizes 1951 human circRNAs potentially associated with 105 different diseases, and it also stores the complete putative miRNA–circRNA–mRNA–lncRNA interaction network for each of these diseases. The database is available at http://gyanxet-beta.com/circdb/ [39].

CircNet. CircNet is distinctive in that it introduces a novel naming scheme for circRNAs. Because multiple circular isoforms can originate from the same back-splice junction site, circRNAs are named according to these distinct isoforms. The naming system conveys the source gene, whether the circRNA is antisense or intronic, and the location of the back-splice junction sites on well-annotated exons. The database offers the following resources: novel circRNAs, integrated miRNA-target networks, expression profiles of circRNA isoforms, genomic annotations of circRNA isoforms and sequences of circRNA isoforms. Tissue-specific circRNA expression profiles and circRNA–miRNA–gene regulatory networks are also provided for further studies. The database is accessible at http://circnet.mbc.nctu.edu.tw/ (Table 3) [40].

Table 3. Databases and their specific features and active URLs with reference to circRNAs
Database | Description | URL link
CircBase | Unified public data sets of circRNAs. Also identifies novel circRNAs in sequencing data. | http://www.circbase.org/
circRNADb | Contains 32 914 human exonic circRNAs selected from diverse sources. Provides information about each circRNA, including genomic information, genome sequence, exon splicing, ORF, IRES and references. | http://reprod.njmu.edu.cn/circrnadb
Circ2Traits | Reports circRNA associations with human diseases via circRNA–miRNA–protein interaction networks, disease-associated SNPs and Argonaute interaction sites on circRNAs. | http://gyanxet-beta.com/circdb/
CircNet | Provides tissue-specific circRNA expression profiles and circRNA–miRNA–gene regulatory networks. | http://circnet.mbc.nctu.edu.tw/

Tools for circRNA studies

UROBORUS. This is an efficient tool for identifying circRNAs with low expression levels from total RNA-seq data without RNase R treatment. It works together with TopHat and Bowtie to detect junction reads. TopHat can detect canonical splicing events, but it cannot map junction reads that support back-spliced exons to a reference genome. Because circRNAs arise from back-splicing and must therefore be sought among the unmapped reads, UROBORUS takes the TopHat unmapped.sam output as its input. It first extracts 20 bp from each end of every read in the unmapped.sam file to form an artificial paired-end seed in FASTQ format. These short 20 bp paired-end seeds are then aligned to the human reference genome (hg19), with a maximum of 2 bp mismatches, using TopHat with default parameters.
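The seed-extraction step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than UROBORUS code: the function names are hypothetical, reads shorter than twice the seed length are simply skipped, and any orientation handling of the second seed (for example reverse-complementing it) is omitted because the description above does not specify it. The only format facts relied on are that SAM header lines start with `@` and that the read name, sequence and quality string are the 1st, 10th and 11th tab-separated fields.

```python
def make_seed_pair(read_id, seq, qual, seed_len=20):
    """Cut seed_len bases from each end of one read into two FASTQ records."""
    if len(seq) < 2 * seed_len:
        return None  # too short to yield two non-overlapping seeds
    mate1 = f"@{read_id}/1\n{seq[:seed_len]}\n+\n{qual[:seed_len]}\n"
    mate2 = f"@{read_id}/2\n{seq[-seed_len:]}\n+\n{qual[-seed_len:]}\n"
    return mate1, mate2


def seeds_from_sam(lines, seed_len=20):
    """Yield (mate1, mate2) FASTQ records for each alignment line of a SAM file."""
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        pair = make_seed_pair(fields[0], fields[9], fields[10], seed_len)
        if pair is not None:
            yield pair


# Toy unmapped read: 20 A, 10 C, 20 G with a constant quality string.
read = "A" * 20 + "C" * 10 + "G" * 20
sam_lines = [
    "@HD\tVN:1.0",
    "\t".join(["r1", "4", "*", "0", "0", "*", "*", "0", "0", read, "I" * 50]),
]
pairs = list(seeds_from_sam(sam_lines))
print(pairs[0][0], pairs[0][1], sep="")
```

In practice the two mates would be written to separate FASTQ files and passed to the aligner as a paired-end library.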
The alignment yields two kinds of reads: balanced mapped junction (BMJ) reads and unbalanced mapped junction (UMJ) reads. BMJ reads align to the joining region of two back-spliced exons with at least 20 bp of overhang at both ends of the read. Reads that align to the joining region of two back-spliced exons with <20 bp of overhang at one end are termed UMJ reads. Among these candidate back-spliced junctions, those supported by more than two junction reads are annotated as candidate circRNAs. The tool is available at http://uroborus.openbioinformatics.org/ [41].

CIRI. CIRI (CircRNA Identifier) is a novel algorithm based on chiastic clipping signals for unbiased and accurate de novo detection of circRNAs from transcriptome data using multiple filtration approaches. When CIRI was applied to ENCODE RNA-seq data, the tool authors were able to identify, and experimentally validate, intronic and intergenic circRNAs. CIRI scans the SAM alignment twice. During the first scan it detects junction reads with paired chiastic clipping (PCC) signals that indicate a circRNA candidate. Preliminary filtering is then applied using paired-end mapping and GT-AG splicing signals for the junctions. The junction reads are clustered, and each circRNA candidate is recorded. CIRI then scans the SAM alignment a second time to detect additional junction reads and performs further filtering to eliminate false-positive candidates that result from incorrectly mapped reads of homologous genes or repetitive sequences. The identified circRNAs are finally annotated. The tool is available at https://sourceforge.net/projects/ciri/ [42].

DCC and CircTest. DCC uses output from the STAR read mapper to systematically identify back-splice junctions and applies a series of filters for identification of candidate circRNAs.
The tool authors state that their software achieves much higher precision than state-of-the-art competitors at similar sensitivity levels, both on mouse brain data and on publicly available data sets. They further state that DCC estimates circRNA versus host gene expression by counting junction and nonjunction reads, so that the host gene-independence of circRNA expression across different experimental conditions can be tested with their R package CircTest. The software is publicly available at https://github.com/dieterich-lab [43].

KNIFE. KNIFE (Known and Novel Isoforms Explorer) is a statistics-based splicing detection tool for circular and linear isoforms from RNA-seq data. It quantifies circular and linear RNA splicing events at both annotated and unannotated exon boundaries, thereby increasing detection sensitivity. Both single-end and paired-end reads can be analyzed. The algorithm should be used with good-quality reads, for example reads whose poor-quality ends have been trimmed with cutAdapt. Annotated junction indices are available for the reference genomes of human (hg19), mouse (mm10), rat (rn5) and Drosophila (dm3). To select high-quality circular and linear junctions and eliminate false positives, each identified junction is reported with a posterior probability (GLM reports) or a P-value (naive reports). The tool can be accessed at https://github.com/lindaszabo/KNIFE [44].

CircInteractome. The interaction of circRNAs with proteins and miRNAs is a highly useful area of research. CircInteractome is a web-based tool developed for exploring circRNAs and their interacting proteins and miRNAs. CircRNAs have been shown to act as sponges for miRNAs and potentially for RNA-binding proteins (RBPs), thereby serving an important function as posttranscriptional regulators of gene expression.
Public circRNA, miRNA and RBP databases are integrated within CircInteractome to provide, through bioinformatic analyses, the miRNA and RBP binding sites on the junction and junction-flanking sequences of circRNAs. The tool developers state that their tool ‘allows identification of potential circRNAs which can act as RBP sponges, design junction-spanning primers for specific detection of circRNAs of interest, design siRNAs for circRNA silencing, and identification of potential internal ribosomal entry sites (IRES)’. It is accessible at http://circinteractome.nia.nih.gov (Table 4) [45].

Table 4. Computational tools/web servers and their specific features and active URLs with reference to circRNAs
Tool | Description | URL link
UROBORUS | Identifies circRNAs with low expression levels, formed by back-splicing events, from total RNA-seq data without RNase R treatment. | http://uroborus.openbioinformatics.org/
CIRI | A novel chiastic clipping signal-based algorithm for the detection of circRNAs from transcriptome data using multiple filtration approaches. | https://sourceforge.net/projects/ciri/
DCC & CircTest | A Python package (DCC) that detects abundant circRNAs, with a companion package (CircTest) that quantifies relative expression changes of circRNAs based on read count data. | https://github.com/dieterich-lab
KNIFE | Identifies circRNA isoforms using statistically based splicing detection for circular and linear isoforms from RNA-seq data. | https://github.com/lindaszabo/KNIFE
CircInteractome | For exploring circRNAs and their interacting proteins and miRNAs in a network. | http://circinteractome.nia.nih.gov

tRFs

Databases for tRFs

As of December 2016, a web search identified only two databases for tRFs, which are detailed below.

tRFdb. tRFdb is the first database of transfer RNA fragments (tRFs). Its source data are taken from the NCBI GEO and SRA databases. tRFdb currently contains the sequences and read counts of the three classes of tRFs for eight species: Rhodobacter sphaeroides, Schizosaccharomyces pombe, D. melanogaster, C. elegans, Xenopus, zebrafish, mouse and human. A total of 12 877 tRFs are deposited. Five types of tRFs originate from the mature tRNA (5′-halves, 3′-halves, 5′-tRFs, 3′-tRFs and internal tRFs). The database can be searched by tRF type (tRF-5, tRF-3 or tRF-1) or by tRF-ID, such as 5001, 5001a or 5001b. The output consists of the tRF-ID, organism name, tRF type, tRNA gene coordinates and tRNA gene name, with hyperlinks to the tRF sequence itself and the experimental details. It is available at http://genome.bioch.virginia.edu/trfdb/ [46].

MINTbase. tRFs can arise from mitochondrial as well as nuclear tRNAs. MINTbase is a web-based repository of these tRFs and a tool for the interactive exploration of nuclear and mitochondrial tRNA fragments. It integrates four kinds of information about these molecules: sequence information, expression information, parental tRNA information and genomic information.
MINTbase is freely accessible at http://cm.jefferson.edu/MINTbase/ (Table 5) [47].

Table 5. Databases and their specific features and active URLs with reference to tRFs
Database | Description | URL link
tRFdb | Contains sequences and read counts of tRFs in R. sphaeroides, S. pombe, D. melanogaster, C. elegans, Xenopus, zebrafish, mouse and human. | http://genome.bioch.virginia.edu/trfdb/
MINTbase | A repository of nuclear and mitochondrial tRFs and a tool for their interactive exploration. | http://cm.jefferson.edu/MINTbase/

Tools for tRF studies

tRF2Cancer. This integrated web-based computing system is useful for the accurate identification of tRFs from small RNA deep sequencing data and for assessing their expression levels in multiple cancers. A binomial test is used to determine whether reads from a small RNA-seq data set represent tRFs, i.e. it estimates the significance of the abundance of sequenced small RNAs distributed on each tRNA. A classification method is then used to annotate the tRF type. A second tool implemented in the system, ‘tRFinCancer’, is used to inspect the expression of tRFs across different cancer types. A third, ‘tRF-Browser’, is used to determine the sites of origin of tRFs and the distribution of chemical modification sites, including m5C, 2′-O-Me, Ψ and m6A, on the corresponding source tRNA. The tool is available at http://rna.sysu.edu.cn/tRFfinder/ [48].

High-throughput Annotation of Modified Ribonucleotides (HAMR). Modified nucleotides are commonly found in RNA as a result of posttranscriptional modification. HAMR finds potential signatures of nucleotide modification at single-nucleotide resolution.
SNPs and candidate RNA modifications, including tRNA modifications, can be mapped and used to help identify tRFs in a pool of small RNA-seq data. The web version of HAMR is available at http://wanglab.pcbi.upenn.edu/hamr [49].

tDRmapper. This tool was developed on the premise that tRFs are difficult to map accurately because the exact copy number and annotation of human tRNA genes are still incomplete. In addition, tRNAs are extensively chemically modified, which makes it difficult to detect tRFs and compare them with tRNAs, and there is no standard nomenclature. tDRmapper was therefore developed as an alignment tool for mapping, naming, quantifying and graphically visualizing novel tRFs. Small RNA-seq reads are filtered on the quality of each base and on read length. Filtered reads are mapped to mature tRNA sequences and then to pre-tRNA sequences, in a hierarchical manner that progressively allows for errors: one mismatch, one deletion, two mismatches, two deletions and a three-base deletion. tDRs are then annotated by size and location and quantified by the fraction of reads aligning to the parent tRNA and the maximum coverage across all positions of the tRNA. tDRmapper is available at https://github.com/sararselitsky/tDRmapp (Table 6) [50].

Table 6. Computational tools/web servers and their specific features and active URLs with reference to tRFs
Tool | Description | Accuracy [reference] | URL link
tRF2Cancer | Tool for identification of tRFs and determination of their expression levels in multiple cancers. | Not available | http://rna.sysu.edu.cn/tRFfinder/
HAMR | Scans RNA-seq data for sites showing potential signatures of nucleotide modification in tRFs. | 98% (two classes of adenosine modification), 79% (two classes of guanosine modification) [49] | http://wanglab.pcbi.upenn.edu/hamr
tDRmapper | An alignment tool for mapping, naming, quantifying and graphical visualization of novel tRFs from small RNA-seq libraries. | Not available | https://github.com/sararselitsky/tDRmapp

Conclusions

Novel molecules such as lncRNAs, circRNAs and tRFs are making their mark in a new research arena, and NGS and RNA-seq are the key technologies facilitating this work. In this article, we have brought together several databases and tools of immediate use to researchers, which should help them reach sound scientific decisions and explore novel molecules inside the cell, most of which appear to be regulatory in nature. The characteristic features detailed for each resource enable specific types of information to be retrieved, analyzed and studied. We also stress that using multiple databases and tools to reach a consensus, rather than relying on a single database, enhances the quality and accuracy of results. Biomedical research, such as that on cancer, is increasingly driven by big data, leading to the discovery of these and other novel molecules. Given the need for further investigation, our article strives to provide the right impetus in this direction.
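The recommendation to seek consensus across several tools can be illustrated with a short Python sketch that keeps only the candidates reported by a minimum number of tools. Everything in it is hypothetical: the tool labels, the coordinates and the representation of a call as a (chromosome, start, end, strand) tuple are assumptions for illustration, and real outputs would first need to be converted to a common coordinate convention.

```python
from collections import Counter


def consensus_calls(calls_by_tool, min_tools=2):
    """Return {call: number of supporting tools} for calls made by >= min_tools tools."""
    support = Counter()
    for calls in calls_by_tool.values():
        support.update(set(calls))  # each tool votes at most once per call
    return {call: n for call, n in support.items() if n >= min_tools}


# Hypothetical candidate calls from three tools.
calls_by_tool = {
    "tool_a": [("chr1", 1000, 2000, "+"), ("chr2", 500, 900, "-")],
    "tool_b": [("chr1", 1000, 2000, "+")],
    "tool_c": [("chr1", 1000, 2000, "+"), ("chr2", 500, 900, "-"), ("chr3", 10, 80, "+")],
}
shared = consensus_calls(calls_by_tool, min_tools=2)
for call, n_tools in sorted(shared.items()):
    print(call, n_tools)
```

Raising `min_tools` trades sensitivity for precision; the appropriate threshold depends on the tools combined and is left to the user.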
Key Points

Biomedical research, such as that on cancer, is increasingly driven by big data generated with novel technologies such as NGS, RNA-seq and ChIP-seq.
Novel molecules such as lncRNAs, sncRNAs, circRNAs and tRFs are increasingly being discovered and found to play a possible regulatory role that is crucial in driving diseases.
In this article, we provide a catalog of existing databases and tools on these novel molecules with their descriptions.
This will allow researchers as end users to arrive at sound scientific decisions, thereby providing an attractive opportunity for further studies on these interesting novel molecules.

A. Saleembhasha is a PhD scholar working on cancer systems biology and the role of novel molecules in cancer. His interest lies in exploring bioinformatics databases and tools.

Seema Mishra, PhD, is Head of the Bioinformatics and Systems Biology Laboratory at the Department of Biochemistry, University of Hyderabad, India. Her expertise is in exploring all the major facets of computational and systems biology, ranging from sequence and molecular modeling analyses to drug design, to uncover laws governing the development of cancer and infectious diseases.

References

1. Peng L, Bian XW, Li DK, et al. Large-scale RNA-seq transcriptome analysis of 4043 cancers and 548 normal tissue controls across 12 TCGA cancer types. Sci Rep 2015; 5: 13413.
2. Mittal VK, McDonald JF. Integrated sequence and expression analysis of ovarian cancer structural variants underscores the importance of gene fusion regulation. BMC Med Genomics 2015; 8: 40.
3. Myers JS, von Lersner AK, Robbins CJ, Sang QX. Differentially expressed genes and signature pathways of human prostate cancer. PLoS One 2015; 10: e0145322.
4. Roque DR, Makowski L, Chen TH, et al. Association between differential gene expression and body mass index among endometrial cancers from The Cancer Genome Atlas Project. Gynecol Oncol 2016; 142: 317–22.
5. Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform 2015; 16: 59–70.
6. Peter S, Borkowska E, Drayton RM, et al. Identification of differentially expressed long noncoding RNAs in bladder cancer. Clin Cancer Res 2014; 20: 5311–21.
7. Chen T, Xie W, Xie L, et al. Expression of long noncoding RNA lncRNA-n336928 is correlated with tumor stage and grade and overall survival in bladder cancer. Biochem Biophys Res Commun 2015; 468: 666–70.
8. Song W, Liu YY, Peng JJ, et al. Identification of differentially expressed signatures of long non-coding RNAs associated with different metastatic potentials in gastric cancer. J Gastroenterol 2016; 51: 119–29.
9. Pandey GK, Mitra S, Subhash S, et al. The risk-associated long noncoding RNA NBAT-1 controls neuroblastoma progression by regulating cell proliferation and neuronal differentiation. Cancer Cell 2014; 26: 722–37.
10. Yan X, Hu Z, Feng Y, et al. Comprehensive genomic characterization of long non-coding RNAs across human cancers. Cancer Cell 2015; 28: 529–40.
11. Pastori C, Kapranov P, Penas C, et al. The Bromodomain protein BRD4 controls HOTAIR, a long noncoding RNA essential for glioblastoma proliferation. Proc Natl Acad Sci USA 2015; 112: 8326–31.
12. Kawaji H, Nakamura M, Takahashi Y, et al. Hidden layers of human small RNAs. BMC Genomics 2008; 9: 157.
13. Cole C, Sobala A, Lu C, et al. Filtering of deep sequencing data reveals the existence of abundant dicer-dependent small RNAs derived from tRNAs. RNA 2009; 15: 2147–60.
14. Green D, Fraser WD, Dalmay T. Transfer RNA-derived small RNAs in the cancer transcriptome. Pflugers Arch 2016; 468: 1041–7.
15. Honda S, Loher P, Shigematsu M, et al. Sex hormone-dependent tRNA halves enhance cell proliferation in breast and prostate cancers. Proc Natl Acad Sci USA 2015; 112: E3816–25.
16. Sanger HL, Klotz G, Riesner D, et al. Viroids are single-stranded covalently closed circular RNA molecules existing as highly base-paired rod-like structures. Proc Natl Acad Sci USA 1976; 73: 3852–6.
17. Salzman J, Chen RE, Olsen MN, et al. Cell-type specific features of circular RNA expression. PLoS Genet 2013; 9: e1003777.
18. Szabo L, Salzman J. Detecting circular RNAs: bioinformatic and experimental challenges. Nat Rev Genet 2016; 17: 679–92.
19. Kulcheski FR, Christoff AP, Margis R. Circular RNAs are miRNA sponges and can be used as a new class of biomarker. J Biotechnol 2016; 238: 42–51.
20. Li Z, Huang C, Bao C, et al. Exon-intron circular RNAs regulate transcription in the nucleus. Nat Struct Mol Biol 2015; 22: 256–64.
21. Guttman M, Russell P, Ingolia NT, et al. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell 2013; 154: 240–51.
22. Guttman M, Garber M, Levin JZ, et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol 2010; 28: 503–10.
23. Cabili MN, Trapnell C, Goff L, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 2011; 25: 1915–27.
24. Dinger ME, Amaral PP, Mercer TR, et al. Long noncoding RNAs in mouse embryonic stem cell pluripotency and differentiation. Genome Res 2008; 18: 1433–45.
25. Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 2009; 324: 218–23.
26. Rossi MN, Antonangeli F. LncRNAs: new players in apoptosis control. Int J Cell Biol 2014; 2014: 473857.
27. Zhao Y, Li H, Fang S, et al. NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic Acids Res 2016; 44: D203–8.
28. Quek XC, Thomson DW, Maag JL, et al. lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res 2015; 43: D168–73.
29. Volders PJ, Verheggen K, Menschaert G, et al. An update on LNCipedia: a database for annotated human lncRNA sequences. Nucleic Acids Res 2015; 43: D174–80.
30. Li J, Han L, Roebuck P, et al. TANRIC: an interactive open platform to explore the function of lncRNAs in cancer. Cancer Res 2015; 75: 3728–37.
31. Lertampaiporn S, Thammarongtham C, Nukoolkit C, et al. Identification of non-coding RNAs with a new composite feature in the hybrid random forest ensemble algorithm. Nucleic Acids Res 2014; 42: e93.
32. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 2011; 27: i275–82.
33. Tripathi R, Patel S, Kumari V, et al. DeepLNC, a long non-coding RNA prediction tool using deep neural network. Netw Model Anal Health Inform Bioinform 2016; 5: 21.
34. Hou M, Tang X, Tian F, et al. AnnoLnc: a web server for systematically annotating novel human lncRNAs. BMC Genomics 2016; 17: 931.
35. Kong L, Zhang Y, Ye ZQ, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res 2007; 35: W345–9.
36. Li A, Zhang J, Zhou Z. PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioinformatics 2014; 15: 311.
37. Glažar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA 2014; 20: 1666–70.
38. Chen X, Han P, Zhou T, et al. circRNADb: a comprehensive database for human circular RNAs with protein-coding annotations. Sci Rep 2016; 6: 34985.
39. Ghosal S, Das S, Sen R, et al. Circ2Traits: a comprehensive database for circular RNA potentially associated with disease and traits. Front Genet 2013; 4: 283.
40. Liu YC, Li JR, Sun CH, et al. CircNet: a database of circular RNAs derived from transcriptome sequencing data. Nucleic Acids Res 2016; 44: D209–15.
41. Song X, Zhang N, Han P, et al. Circular RNA profile in gliomas revealed by identification tool UROBORUS. Nucleic Acids Res 2016; 44: e87.
42. Gao Y, Wang J, Zhao F. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biol 2015; 16: 4.
43. Cheng J, Metge F, Dieterich C, et al. Specific identification and quantification of circular RNAs from sequencing data. Bioinformatics 2016; 32: 1094–6.
44. Szabo L, Morey R, Palpant NJ, et al. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol 2015; 16: 126.
45. Dudekula DB, Panda AC, Grammatikakis I, et al. CircInteractome: a web tool for exploring circular RNAs and their interacting proteins and microRNAs. RNA Biol 2016; 13: 34–42.
46. Kumar P, Mudunuri SB, Anaya J, Dutta A. tRFdb: a database for transfer RNA fragments. Nucleic Acids Res 2015; 43: D141–5.
47. Pliatsika V, Loher P, Telonis AG, Rigoutsos I. MINTbase: a framework for the interactive exploration of mitochondrial and nuclear tRNA fragments. Bioinformatics 2016; 32: 2481–9.
48. Zheng LL, Xu WL, Liu S, et al. tRF2Cancer: a web server to detect tRNA-derived small RNA fragments (tRFs) and their expression in multiple cancers. Nucleic Acids Res 2016; 44: W185–93.
49. Ryvkin P, Leung YY, Silverman IM, et al. HAMR: high-throughput annotation of modified ribonucleotides. RNA 2013; 19: 1684–92.
50. Selitsky SR, Sethupathy P. tDRmapper: challenges and solutions to mapping, naming, and quantifying tRNA-derived RNAs from human small RNA-sequencing data. BMC Bioinformatics 2015; 16: 354.

© The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com

Journal

Briefings in Functional Genomics, Oxford University Press

Published: Jan 1, 2018
