TEcandidates: prediction of genomic origin of expressed transposable elements using RNA-seq data

Abstract

Motivation
In recent years, transposable elements (TEs) have been linked to gene regulation. However, estimating the origin of expression of TEs through RNA-seq is complicated by multi-mapping reads coming from their repetitive sequences. Current approaches that address multi-mapping reads focus on expression quantification, not on finding the origin of expression. Identifying the genomic origin of expressed TEs could further aid in understanding the role that TEs might have in the cell.

Results
We have developed a new pipeline called TEcandidates, based on de novo transcriptome assembly, to assess which instances of TEs are being expressed, along with their location, for inclusion in downstream differential expression (DE) analysis. TEcandidates takes as input the RNA-seq data, the genome sequence and the TE annotation file, and returns a list of coordinates of candidate TEs being expressed, the TEs that have been removed, and the genome sequence with the removed TEs masked. This masked genome is suited to including TEs in downstream expression analysis, as the ambiguity of reads coming from TEs is significantly reduced in the mapping step of the analysis.

Availability and implementation
The script which runs the pipeline can be downloaded at http://www.mobilomics.org/tecandidates/downloads or http://github.com/TEcandidates/TEcandidates.

Supplementary information
Supplementary data are available at Bioinformatics online.

1 Introduction

Transposable elements (TEs) have the machinery to self-replicate inside a genome, and they account for a proportion of its repetitive content. In recent years, TEs have been linked to the regulation of gene expression (Chuong et al., 2017). However, due to their repetitive nature, TEs complicate RNA-seq studies (Hoen et al., 2015). TE-derived reads can map ambiguously because they can map either to several instances of the same TE or to other, similar TEs.
Current efforts addressing mapping ambiguity focus on expression quantification rather than on pinpointing the origin of expression. With default parameters, RNA-seq aligners such as Bowtie 2 (Langmead and Salzberg, 2012) or STAR (Dobin and Gingeras, 2016) report a randomly selected location among all possible mappings of a multi-mapping read. Two further strategies can be built on the default mapping: (i) discard multi-mapped reads (Attig et al., 2017), potentially improving accuracy but underestimating overall expression, or (ii) aggregate the counts of unique and multi-mapped reads per TE (Sharif et al., 2016). With non-default parameters that make the aligner report multiple positions for multi-mapped reads, two other strategies can be adopted: (iii) estimate the contribution of multi-mapped reads over all alignments (Jin et al., 2015), or (iv) probabilistically infer which site each read belongs to, using information from the local genomic context under the assumption that the correct mapping is the one that also has other mappings in its vicinity (Kahles et al., 2016).

In most cases, results obtained with these strategies correlate well with experimental data. However, each has drawbacks when the goal is to find the location of the expressed TEs. In the first strategy, novel TEs that have highly homologous copies, and therefore yield only multi-mapped reads, may not be considered at all. The second and third strategies assume that the expression comes from all instances of a TE, so their results come at the cost of not being able to report the location of the TEs. In the fourth strategy, the results might not include TEs from regions with lower transcriptional activity. Alternatively, to estimate the origin of transcribing TEs, the minute differences among instances can be exploited.
One way of capturing the unique local base differences among TE instances is to assemble the reads de novo into contigs. This lengthens the sequences to be mapped, reducing the likelihood of an ambiguous alignment. Although assembly also generates false-positive contigs, these can be discarded after aligning them to the reference genome. Additional filters, including minimum TE annotation length and alignment coverage, can be applied to select the TE instances most likely to be an origin of transcription. Implementing this general strategy, we developed TEcandidates, a pipeline that reports candidate TEs to be considered in downstream expression analysis, along with a modified version of a reference genome suited to mapping the RNA-seq reads while avoiding ambiguity of the reads coming from TEs.

2 Materials and methods

The default assumption of TEcandidates is that at most one instance per TE (only one of the copies of a TE of a specific type, family and superfamily) is being transcribed. Each run therefore reports, by default, one candidate per TE. The pipeline requires as inputs the RNA-seq read files, a reference genome and a TE annotation. We suggest using the annotation generated by RepeatMasker, as it has resolution at the level of TE instance.

To restrict the reads to repeat regions, TEcandidates first runs a pre-alignment of the reads against the genome and keeps the reads that map to RepeatMasker predictions. This considerably reduces the number of reads used in the next step. Afterwards, a de novo transcriptome assembly is performed with Trinity v2.4.0 using default parameters (Haas et al., 2013). The de novo contigs generated at this step are mapped to the reference genome, and their intersection with the annotated TEs is assessed with BEDTools v2.26 (Quinlan and Hall, 2010). The customizable filter parameters are the minimum TE length, -l, and the coverage of the alignment of the de novo contig with respect to the TE annotation, -c.
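The effect of the two filters can be illustrated with a minimal Python sketch. It is not the TEcandidates implementation, which is a bash script built on BEDTools. The `Overlap` record, the function names, the 0-based half-open coordinates and the inclusive thresholds are all assumptions made for illustration, and coverage is assumed here to mean the fraction of the annotated TE covered by the contig alignment. The threshold values 900 bp and 0.1 are the defaults reported for -l and -c.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlap:
    """One contig alignment intersecting one annotated TE (hypothetical record)."""
    te_id: str      # illustrative identifier of the TE instance
    te_start: int   # TE annotation start, 0-based
    te_end: int     # TE annotation end, exclusive
    aln_start: int  # contig alignment start on the genome, 0-based
    aln_end: int    # contig alignment end, exclusive


def te_length(o: Overlap) -> int:
    """Length of the annotated TE in bp."""
    return o.te_end - o.te_start


def coverage(o: Overlap) -> float:
    """Fraction of the TE annotation covered by the alignment (assumed definition)."""
    if te_length(o) <= 0:
        raise ValueError("TE annotation must have positive length")
    shared = min(o.te_end, o.aln_end) - max(o.te_start, o.aln_start)
    return max(0, shared) / te_length(o)


def passes_filters(o: Overlap, min_length: int = 900, min_coverage: float = 0.1) -> bool:
    """Apply the -l (minimum TE length) and -c (coverage) style thresholds.

    Treating both thresholds as inclusive is an assumption of this sketch.
    """
    return te_length(o) >= min_length and coverage(o) >= min_coverage


# A 1500 bp TE of which 300 bp (0.2) are covered by a contig alignment.
example = Overlap("TE_A", 1000, 2500, 900, 1300)
print(passes_filters(example))  # True
```

In the pipeline itself the overlap information comes from intersecting the contig alignments with the TE annotation; the sketch only shows how a length threshold and a coverage threshold jointly discard short annotations and weakly supported ones.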
Our benchmarks suggest defaults of 900 bp for the minimum TE length and 0.1 for the coverage (Supplementary File S1). If more than one instance of the same TE is selected at this step, the longest one having the highest coverage is retained. Additional options control the number of processors, -t, the dedicated RAM, -r, and the number of candidates per TE instance to report, -N. The candidate TEs are written to a file in GFF3 format, and the positions of the annotated TEs that were not selected are hard-masked in the genome sequence with an ‘X’. The pipeline outputs three files: the file containing the candidate TEs, the file containing the TEs not selected, and the genome with the non-selected TEs masked. These files are suitable for downstream expression analysis of TEs.

3 Software

The TEcandidates script was developed in Linux bash. Configuration details, as well as instructions for installing third-party software such as Trinity and BEDTools, are included in the README file. The computing requirements of the program depend on the size of the input RNA-seq, reference genome and TE annotation files to be processed.

4 Benchmark and validation

As a proof of concept, we ran a benchmark with reads simulated using Flux-Simulator v1.2.1 (Griebel et al., 2012). Each test of the benchmark consisted of a construct of five instances of the same TE embedded in a larger random sequence. Across tests, the five instances differed in length or sequence identity, and one or two of them had simulated reads representing expression. In all 15 simulations, TEcandidates was able to retrieve the TEs from which the reads originated (Supplementary File S2). Additionally, we indirectly validated part of our predictions with an expression dataset from Ohtani et al. (2013). The validation can only be indirect and partial, as experimental data on the expression of TEs currently do not include their location (Supplementary File S1).
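The hard-masking step described in Materials and methods can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the pipeline: the function names, the dictionary of TE coordinates and the TE identifiers are hypothetical, and coordinates are taken as 0-based and half-open, whereas GFF3 files use 1-based inclusive coordinates and would need converting first.

```python
from typing import Dict, Iterable, Set, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end exclusive


def hard_mask(sequence: str, intervals: Iterable[Interval], mask_char: str = "X") -> str:
    """Return the sequence with every given interval replaced by mask_char."""
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) lies outside the sequence")
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


def mask_unselected(sequence: str,
                    annotated: Dict[str, Interval],
                    candidates: Set[str]) -> str:
    """Mask every annotated TE that is not among the reported candidates."""
    to_mask = [iv for te_id, iv in annotated.items() if te_id not in candidates]
    return hard_mask(sequence, to_mask)


# Two copies of a hypothetical TE on a toy sequence; only copy1 is a candidate.
annotated = {"TE1_copy1": (2, 5), "TE1_copy2": (7, 9)}
print(mask_unselected("ACGTACGTAC", annotated, {"TE1_copy1"}))  # ACGTACGXXC
```

Because the non-selected copies no longer offer a matching sequence, reads derived from the candidate copy have fewer places to align when they are mapped again, which is the reduction in ambiguity that the masked genome is meant to provide.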
Funding

This work was supported by Fondo Nacional de Desarrollo Científico y Tecnológico (FONDECYT) program grant #11140869 from CONICYT, the Chilean NSF.

Conflict of Interest: none declared.

References

Attig J. et al. (2017) Physiological and pathological transcriptional activation of endogenous retroelements assessed by RNA-sequencing of B lymphocytes. Front. Microbiol., 8, 2489.
Chuong E.B. et al. (2017) Regulatory activities of transposable elements: from conflicts to benefits. Nat. Rev. Genet., 18, 71–86.
Dobin A., Gingeras T.R. (2016) Optimizing RNA-Seq mapping with STAR. Methods Mol. Biol., 1415, 245–262.
Griebel T. et al. (2012) Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res., 40, 10073–10083.
Haas B.J. et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc., 8, 1494–1512.
Hoen D.R. et al. (2015) A call for benchmarking transposable element annotation methods. Mob. DNA, 6, 13.
Jin Y. et al. (2015) TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics, 31, 3593–3599.
Kahles A. et al. (2016) MMR: a tool for read multi-mapper resolution. Bioinformatics, 32, 770–772.
Langmead B., Salzberg S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.
Ohtani H. et al. (2013) DmGTSF1 is necessary for Piwi-piRISC-mediated transcriptional transposon silencing in the Drosophila ovary. Genes Dev., 27, 1656–1661.
Quinlan A.R., Hall I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.
Sharif J. et al. (2016) Activation of endogenous retroviruses in Dnmt1(-/-) ESCs involves disruption of SETDB1-mediated repression by NP95 binding to hemimethylated DNA. Cell Stem Cell, 19, 81–94.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty423

Journal

Bioinformatics (Oxford University Press)

Published: Nov 15, 2018
