Genome assembly of the Pink Ipê (Handroanthus impetiginosus, Bignoniaceae), a highly valued, ecologically keystone Neotropical timber forest tree

Genome assembly of the Pink Ipê (Handroanthus impetiginosus, Bignoniaceae), a highly valued,... Background: Handroanthus impetiginosus (Mart. ex DC.) Mattos is a keystone Neotropical hardwood tree widely distributed in seasonally dry tropical forests of South and Mesoamerica. Regarded as the “new mahogany,” it is the second most expensive timber, the most logged species in Brazil, and currently under significant illegal trading pressure. The plant produces large amounts of quinoids, specialized metabolites with documented antitumorous and antibiotic effects. The development of genomic resources is needed to better understand and conserve the diversity of the species, to empower forensic identification of the origin of timber, and to identify genes for important metabolic compounds. Findings: The genome assembly covers 503.7 Mb (N50 = 81 316 bp), 90.4% of the 557-Mbp genome, with 13 206 scaffolds. A repeat database with 1508 sequences was developed, allowing masking of ∼31% of the assembly. Depth of coverage indicated that consensus determination adequately removed haplotypes assembled separately due to the extensive heterozygosity of the species. Automatic gene prediction provided 31 688 structures and 35 479 messenger RNA transcripts, while external evidence supported a well-curated set of 28 603 high-confidence models (90% of total). Finally, we used the genomic sequence and the comprehensive gene content annotation to identify genes related to the production of specialized metabolites. Conclusions: This genome assembly is the first well-curated resource for a Neotropical forest tree and the first one for a member of the Bignoniaceae family, opening exceptional opportunities to empower molecular, phytochemical, and breeding studies. This work should inspire the development of similar genomic resources for the largely neglected forest trees of the mega-diverse tropical biomes. Keywords: heterozygous genome; RNA-Seq; transposable elements; quinoids; Bignoniaceae Received: 28 June 2017; Revised: 27 September 2017; Accepted: 30 November 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Silva-Junior et al. Data Description Context The generation of plant genome assemblies is a key driver to develop powerful genomic resources that allow gaining de- tailed insights into the evolutionary history of species while enabling breeding and conservation efforts [1, 2]. Such ad- vances took place first in model plant species [ 3], followed by the mainstream [4] and minor crops [5] and some major forest trees [6–9]. Genome sequences have also driven impor- tant advances in the description and understanding of essen- tial plant metabolic processes that underlie survival across dis- tinct lineages. Research on the functional roles of specialized metabolites, many of them phylogenetically restricted [10], has recently addressed the gap in the species-specific knowledge of specialized plant metabolism by sequencing the genome of key medicinal plants [11, 12]. Innovation in this field has re- lied on a combination of high-throughput genomics, including massive parallel sequencing and arrays with animal and clin- ical studies to elucidate the mechanisms of target compounds such as adjuvant therapies, to demonstrate the necessary for- mulations for its biological effects and to determine which sub- stances are beneficial or toxic. Apart from recent reports of shal- low transcriptome characterization using 454 pyrosequencing [13] and a low-coverage (×11) fragmented genome assembly [14], essentially no well-curated genome assembly and gene content annotation exist for Neotropical forest trees, despite their recog- nized value by indigenous communities for the healing proper- ties of their special metabolites, increasingly exploited by large pharmaceutical corporations [15, 16]. An example of such a tree is the species Handroanthus impetiginosus (Mart.exDC.)Mattos (syn. Tabebuia impetiginosa, Bignoniaceae), popularly known as Pink Ipe, ˆ Lapacho, or Pau d’arco, a source of both high-value tim- ber and traditional medicine. Species of Handroanthus and Tabebuia have virtually no ge- nomic tools and resources, beyond a handful of 21 microsatel- lites [17] with their known caveats for more sophisticated genetic analyses in the areas of population genomics and evo- lution [18]. Whole-genome sequencing has now become ac- cessible to a point that efforts to develop improved genomic Figure 1: The Handroanthus impetiginosus (Mart. ex DC.) Mattos (syn. Tabebuia im- resources for such species are possible and warranted. We petiginosa, Bignoniaceae), tree UFG-1 whose genome was sequenced. built a preliminary assembly of the nuclear genome of a sin- gle individual of Handroanthus impetiginosus based on short reads and longer mate-pair DNA sequence data to provide DK). Flow cytometry was used to check the genome size of the necessary framework for the development of genomic re- tree UFG-1, indicating a genome size of (557 ±39) Mb/1C (Fig. sources to support multiple genomic and genetic analyses of S1) consistent with published estimates [22]. Total RNA from this keystone Neotropical hardwood tree regarded as the “new shoots of 5 seedlings and from the differentiating xylem of the mahogany.” It is the second most expensive timber and the adult tree (UFG-1) was extracted using Qiagen RNeasy Plant most logged species in Brazil [19], exported largely to North Mini kit (Qiagen, DK) and pooled for RNA sequencing. DNA America for residential decking and currently under signifi- and RNA sequencing was performed at the High-Throughput cant illegal trading pressure. Additionally, the tree produces Sequencing and Genotyping Center of the University of Illinois Urbana-Champaign. The following libraries were generated large amounts of natural products such as those of quinoid systems (1,4-anthraquinones, 1,4-naphthoquinones, and 1,2- for sequencing: (1) 2 shotgun genomic libraries of short frag- ments (300 bp and 600 bp) from tree UFG-1, (2) 1 shotgun furanonaphthoquinones), specialized metabolites with promis- ing antitumorous, anti-inflammatory, and antibiotic effects library from combined pools of 5 RNA samples tagged with [20, 21]. The high pressure of logging and illegal trading on this a single index sequence. Paired-end sequencing, 2 × 150 nt, species with a notable ecological keystone status urges conser- was performed in 2 lanes of an Illumina HiSeq 2500 instru- vation efforts of existing populations. ment (Illumina, CA, USA). Three additional mate-pair libraries (fragment lengths of 4 kb to 5.5 kb, 8 kb to 10 kb, and 15 kb to 20 kb) for UFG-1 were also sequenced in 2 lanes of an Methods Illumina HiSeq 2000 instrument (2 × 101 bp). This long-range sequence resource was used to generate the final genome Sample collection and sequencing assembly for annotation. A complete overview of the DNAofasingleadulttreeof H. impetiginosus (UFG-1) (Fig. 1) genome assembly and annotation pipeline is provided was extracted using Qiagen DNeasy Plant Mini kit (Qiagen, (Fig. S2). Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of the Pink Ipe, ˆ a neotropical forest tree 3 Genome assembly using short paired-end and [37] using both the RNA-Seq trimmed reads and sequences from the de novo transcript assembly. Loci were identified by mate-pair sequencing data the assembled transcript alignments using BLASTX [38]and Short reads and mate-pair reads were stripped of sequencing EXONERATE [39] alignments of peptide sequences to the repeat- adapters using Fastq-mcf [23]. Reads that mapped to a database soft-masked genome using RepeatMasker [40], based on a trans- containing mitochondrial and chloroplast genomes of plants poson database developed as part of this genome assembly an- with Bowtie1 (option–v3–a–m1)[24] were discarded. Mate- notation. Known peptide sequences included manually curated pair reads were inspected using a Perl script (TrimAdaptor.pl), datasets for plant species available from UniProtKB/Swiss-Prot and sequences that did not contain the circularization adaptor [41] and sequences available from Phytozome [1], version 11, were discarded. By using the filtered short reads, Jellyfish2 (Jelly- for Arabidopsis thaliana, Oryza sativa, Erythranthe guttata, Solanum fish, RRID:SCR 005491)[25] and GenomeScope [26] were applied lycopersicum, Solanum tuberosum, Populus trichocarpa,and Vitis to obtain estimates of the H. impetiginosus genome size, repeat vinifera. Gene structures were predicted by homology-based pre- fraction, and heterozygosity prior to the assembly. ALLPATHS- dictors, FGENESH++, FGENESH EST [42, 43], and GenomeScan LG (ALLPATHS-LG, RRID:SCR 010742)[27] was used for de novo as- [44]. Gene predictions were improved by Program to Assem- sembly of the sequence data from both paired-end and mate- ble Spliced Alignment (PASA, RRID:SCR 014656)[45], including pair data, with default options, in a stepwise strategy for error adding Untranslated Regions (UTRs), correcting splicing, and correction of reads, handling of repetitive sequences, and use of adding alternative transcripts. PASA-improved gene model pep- mate-pair libraries. tides were subjected to peptide homology analysis with the above-mentioned proteomes to obtain Cscore values and pep- tide coverage. Cscore is the ratio of the peptide Basic Local Align- Transposable elements and repetitive DNA ment Search Tool for Proteins (BLASTP) score to the mutual best Repetitive elements were detected and annotated on the hit BLASTP score, and peptide coverage is the highest percentage genome assembly with the RepeatModeler de novo repeat of peptide aligned to the best homolog. A transcript was selected family identification and modeling package (RepeatModeler, if its Cscore value was greater than or equal to 0.5 and its peptide RRID:SCR 015027)[28]. Using RECON, RepeatScout, and Tandem coverage was greater than or equal to 0.5 or if it had transcript Repeat Finder, repetitive sequences were detected in the scaf- coverage but the proportion of its coding sequence overlapping folds longer than 10 kb using a combination of similarity-based repeats was less than 20%. For gene models where greater than and de novo approaches. The TE sequences were evaluated us- 20% of the coding sequence overlapped with repeats, the Cscore ing modeling capabilities of the RepeatModeler program, with value was required to be at least 0.9 and homology coverage was default settings, to compare the TE library against the entire as- required to be at least 70% to be selected. Selected gene models sembled sequences and to refine and classify consensus mod- were then subjected to classification analysis using InterProScan els of putative interspersed repeats. A complementary analy- 5 (InterProScan, RRID:SCR 005829)[46] for PFAM domains, PAN- sis intended to augment the number of TE sequences classi- THER, Enzyme Comissioned Number (EC), and KEGG categories. fied according to current criteria [ 29] was performed using the Gene ontology annotation was obtained, where possible, from PASTEC program [30]. RepeatMasker Open-4.0 (RepeatMasker, Interpro2GO and EC2GO mappings. RRID:SCR 012954)[31] was used with the sequences from the de novo repetitive element library to annotate the interspersed repeats and to detect simple sequence repeats (SSRs) on the Data Validation and Quality Control genome assembly. Global properties of the H. impetiginosus tree genome from the unassembled reads Protein-coding gene annotation Sequencing of the H. impetiginosus tree genome generated c. 599 Protein-coding gene annotation was performed with a pipeline million reads, comprising 73 Gbp of sequence data. This rep- that combines RNA-Seq assembled transcript and protein resents nearly ×132 the expected sequence coverage. After re- alignments to the reference with de novo prediction methods moval of adaptors, followed by standard error correction and (Fig. S2). RNA-Seq reads were screened for the presence of trimming with ALLPATHS-LG, with default options, c. 46 Gbp adapters, which were removed using Fastq-mcf [23]. Trimmomatic of data was found useful for the assembly process, yielding (Trimmomatic, RRID:SCR 011848)[32] was used to (1) remove low- sequencing coverage of ×82 (×63 from the fragments libraries quality, no-base-called segments (Ns) from sequencing reads; and ×19 from the mate-pair libraries). The estimated physical (2) scan the read with a 4-base sliding window, cutting when coverage was ×400 based on the observed fragment size dis- the average quality per base dropped below 15; and (3) remove tributions (Table S1). ALLPATHS-LG k-mer spectrum frequency reads shorter than 32 bp after trimming. Trimmed reads mapped analysis (at K = 25) on useful reads, error-corrected reads, esti- to mitochondrial, chloroplast, and ribosomal sequences from mated a haploid genome size of 540 968 531 bp, a repeat frac- plants with Bowtie1 (options –v3–a–m1)[24] were also re- tion of 38.0%, and a single nucleotide polymorphism (SNP) rate moved. Transcript de novo assemblies were performed using of 1/88 bp (1.14%). An alternative analysis of the k-mer fre- SOAP-Transdenovo [33]and Trinity de novo [34] from the processed quencies using GenomeScope [26] produced a haploid genome reads. The assemblies were concatenated and used as input size estimate of 503 748 072 bp, repetitive content of 36.6%, to EvidentialGene [35], a comprehensive transcriptome pipeline and an SNP rate of 1/60 bp (1.65%). Both estimates (Fig. 2A) to identify likely complete coding regions and their proteins are consistent with the flow cytometry estimates and in line in the final, combined, transcriptome assembly. Gene modeling with the expectations regarding the heterozygous content of the was carried out using standard procedures and tools described, H. impetiginosus genome, a predominantly outcrossed tree [47]. for instance, in Schmutz et al. [36]. In summary, a genome- Sequencing errors caused an extreme peak at k = 1inthe k- guided transcriptome assembly of H. impetiginosus was per- mer frequency distribution. Both k-mer histograms display 2 formed with the JGI PERTRAN RNA-Seq Read Assembler pipeline distinct peaks comprising the largest area of each histogram Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Silva-Junior et al. (A) 0 20 40 60 80 100 120 140 k−mer coverage per 25mer (B) 0.08 0.06 0.04 0.02 0.00 0 20 40 60 80 100 120 140 Depth of coverage Figure 2: Depth of coverage analysis. (A) Histograms of k-mer frequencies in the filtered read data for k = 25 (red) and GenomeScope modeling equation on H. impetiginosus (blue). The x-axis shows the number of times a k-mer occurred (coverage). The vertical dashed dark blue lines correspond to the mean coverage values for unique heterozygous k-mers (left peak) and unique homozygous k-mers (right peak). (B) Density plot of read depth based on mapping all short fragment reads back to the assembled scaffolds (red). Left peak (at depth =×34) corresponds to regions where the assembler created 2 distinct scaffolds from divergent putative haplotypes. The right peak (at depth =×67) contains scaffolds from regions where the genome is less variable, allowing the assembler to construct a single contig combining homologue sequences. Histograms of Poisson modeling for read depth in the assembly (green, lambda = 34; blue, lambda = 67) are shown. at depths 27 and 55. The bimodal distributions characterize from the trimmed reads, but the process was aborted on the the expected behavior for k-mer frequencies of a heterozygous overlap-correction process in the Celera Assembler due to exces- diploid genome, as seen, for example, in the recently reported sive CPU usage. SOAPdenovo2 ran very fast (3 days) but produced oak genome [48]. In the right homozygous peak (at K = 55), k- an assembly with a total scaffold size of 860 Mbp. Analysis with mers are shared between the 2 homologous chromosomes. The SOAPdenovo2 was run with different k-mer sizes, from 31 to 71, left or heterozygous peak, with half the k-mer depth of the ho- step of 10, but none of them produced a reasonable assembly mozygous peak, contains k-mers that are unique to each hap- size in view of the expected size estimated by flow cytometry lotype due to heterozygosity. The difference in height between and the k-mer frequency. ALLPATHS-LG was therefore used to these peaks (heterozygous/homozygous ratio) is a measure of assemble the genome with default options. The short reads from the heterozygosity within the genome, which is 1.65% accord- fragmented libraries were error-corrected using default settings ing to the GenomeScope modeling equation. (K-mer size of 24, ploidy of 2), fragment-filled, and assembled into initial unipaths (k-mer size of 96, ploidy of 2). Jumping reads from the mate-pair libraries were then aligned to the unipaths Genome assembly and all alignments were processed in a seed-extension strategy with junction point recognition within the read aimed to remove State-of-the-art haploid genome assembler pipelines from invalid and duplicate fragments to perform error correction and short-read ALLPATHS-LG [27] and SOAPdenovo2 (SOAPdenovo2, initial scaffolding. This initial process produced an assembly RRID:SCR 014986)[49] were considered for an initial evalua- graph that was turned into scaffolds by analyzing branch points tion on the dataset of reads. Two relatively new algorithms in the graph topology. This late process converted single-base specifically developed for de novo assembly of heterozygous mismatches into ambiguous base codes at branch. It also flat- genomes, MaSuRCA (MaSuRCA, RRID:SCR 010691)[50] and PLA- tened some other structural features of the assembly including TANUS (PLATANUS, RRID:SCR 015531)[51], were also attempted short indels. The contig assembly comprised 109 064 sequences as alternatives to the other 2 assemblers designed for genomes of a length of 500 bp or longer with total length of 466 314 780 bp. of low heterozygosity. Reads were first preprocessed and error Genome assembly after scaffolding comprised 57 815 scaffolds corrected using the algorithms provided by each assembler. PLA- of length 1 kbp or longer with a total length of 610 091 865 bp and TANUS was set to run, but after 10 weeks it did not produce N50 of 57 Kbp. The fraction of bases captured in gaps was 23.9%, any result in an Intel(R) Xeon(R) server with 64 × 7560 2.27-GHz and the rate of ambiguous bases for all bases captured in the CPUs, 256 GB RAM, except for the k-mer count table on the in- assembly was 0.24%. This assembly was only slightly larger in put trimmed reads. After 9 week-long runtimes in an Intel(R) size (<10%) than the empirically determined genome size using Xeon(R) server with 64 × 7560 2.27-GHz CPUs, 512 GB RAM, Ma- flow cytometry [ 22]. SuRCA successfully completed the generation of the super-reads Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Density Number of k−mers at coverage (Million) Genome of the Pink Ipe, ˆ a neotropical forest tree 5 Alternative scaffold and gap-filling = 2379). The number of remaining gaps in the assembly was 21 417, totaling 30 066 113 bp (5.05%). Although the ALLPATHS-LG performance was good in recovering Paired-end reads from the short fragment libraries were the expected genome size in the assembled contigs, there was aligned back independently to this genome assembly using a high fraction of the bases captured in gaps in the scaffolds SMALT (map -r 0 -x -y 0.5; default alignment penalty scores). Per- (∼one-fourth of the total genome assembly). De novo assembly scaffold depth of coverage was computed, regardless of map- algorithms applied to moderate to high levels of heterozygosity ping quality, using Genome Analysis Toolkit (GATK) Depthof- cannot match the performance achieved in assemblies of ho- Coverage. The mean read depth across the scaffolds resulted in mozygous genomes, especially at the contig assembly level [52]. ×66.45. The mean read length of the mapped reads was 139.8 We thus used the assembled contigs to perform an alternative bp, and the corresponding k-mer coverage for the size of 25 scaffolding step with SSAKE-based Scaffolding of Pre-Assembled was ×55.04, which matches with the homozygous peak com- Contigs after Extension (SSPACE, RRID:SCR 005056)[53] using the puted from the k-mer frequency distribution from the unassem- error-corrected short fragment reads and the jumping reads. In bled reads. The read depth frequencies are shown in Fig. 2B. this approach, genome assembly comprised 16 090 scaffolds of The heterozygous/homozygous peak height (>1) in the distri- a length of 1 kbp or longer with a total length of 577 446 088 bp bution suggests that the assembly contains redundant copies and N50 of 95 Kbp, respectively. The fraction of bases captured of unmerged haplotypes due to the structural heterozygosity in gaps dropped from 23.9% to 18.9% in contrast to ALLPATHS- of the diploid genome of the species. To specifically deal with LG scaffolding, totaling 109 533 288 bp. The rate of ambiguous the heterozygosity, we introduced a step to, leniently, recognize bases for all bases captured in the assembly dropped from 0.24% and remove alternative heterozygous sequences. Sequences of to 0.13%. All preprocessed reads were reused in an attempt to scaffolds were aligned 1 vs all using the BLAST-like alignment close the intra-scaffold gaps using the GapCloser (GapCloser, tool (BLAT, RRID:SCR 011919)[57], and results were concate- RRID:SCR 015026)[54] algorithm. Genome assembly after gap- nated in a single file of alignments and sorted. Similar se- filling was 586 206 884 bp in 15 671 scaffolds of a length of 1 kbp quences were identified on the base of pairwise similarity us- or longer, and only 20 583 469 bp (3.51% of the genome assem- ing filterPSL utility from AUGUSTUS [ 58] with default param- bly) remained in 24 907 gaps. The N50 of scaffolds of a length of eters, and retaining all best matches to each single sequence 1 kbp or longer, with gaps, was 97 344 Kb (L50 = 1792). Sequences queried against all others that satisfy minimal percentage of longer than 20 kb were assembled in only 6791 scaffolds, total- identity (minId = 92%) and minimal percentage of coverage of ing 538 102 146 bp, ∼97% of the genome size estimated from flow the query read (minCover = 80%). We considered as heterozy- cytometry (557 Mb). gous redundant those scaffolds that showed pairwise similarity to exactly another sequence, and their depth of coverage fell in a Poisson distribution with parameters given by the heterozygous Evaluation of accuracy of the genome assembly peak of the read depth distribution over all scaffolds (lambda A subset of fragments and jumping read pairs (∼×15 sequenc- = 34) (Fig. 2B). The final step was to keep only 1 copy—the ing coverage each) was used to uncover inaccuracies in the largest—of the heterozygous scaffolds among pairs with high genome assembly. Scaffolds with identified errors were broken similarity. or flagged for inspection. Recognition of Errors in Assemblies using Paired Reads (REAPR) [55] was used to test each base of A preliminary assembly of the H. impetiginosus genome the genome assembly looking for small local errors (such as a single base substitution, and short insertions or deletions) and At the end of the accuracy evaluation processes, the genome structural errors (such as scaffolding errors), located by means assembly had a total size of 503 308 897 bp, with gaps, in 13 of changes to the expected distribution of inferred sequenc- 206 scaffolds. The N50 of scaffolds of 1 kbp or longer was 80 ing fragments from the mapped reads using SMALT v0.7.6 [56]. 946 bp (L50 = 1906), and the average size of the sequences was REAPR reported that only 343 588 027 (∼60%) bases in the assem- 38 118 bp. Using 20 kbp as an approximate value of longest plant bly should be free of errors, with 5476 reported (1658 within con- gene length [59, 60], the percentage of scaffolds that equaled or tigs, 3818 over gaps) in the remaining 242 618 857 bp. The most surpassed this value in relation to the empirically determined frequent (∼92%) type of inaccuracy reported was Perfect cov and genome size is 83%, which corresponds to over 92% of the as- Link. Perfect cov means low coverage of perfect uniquely mapping sembly total size. Contigs generated by cutting scaffolds at each reads while Link describes situations in which reads map else- gap (of at least 25 base pairs, i.e., 25 or more Ns) produced an where in the assembly. The recognition of this inaccuracy at the N50 of 40 064 bp (L50 = 3551) with an average sequence size of base pair level should thus reflect the repetitive nature of the 19 765 bp. The remaining gaps comprised 26 447 057 bp (5.25% genome, as inferred from the k-mer frequency spectra analy- of the genome assembly) in 11 094 segments, with a size of 2384 sis (∼36%–38% of repeats). Besides the base pair inaccurate calls ± 3167 bp. The total assembly size represents over 90% of the due to repeats, other structural problems in the assembly were flow cytometry genome estimate (557 Mb) and should provide identified based on sequence coverage differences from the ex- a good start to build a further improved reference genome as- pected fragment size distribution, and the program used this sembly of the species using long-range scaffolding techniques information to break these. Given the high heterozygosity and such as whole-genome maps using either imaging methods [61] divergence between haplotypes on this diploid genome se- or contact maps of chromosomes based on chromatin interac- quence, homologous sequences can assemble separately or tions [62]. Table 1 summarizes the main statistics of the Han- merge. Moreover, unresolved repeat structures in the assembly droanthus impetiginosus genome assembly with respect to the de- might also contribute heavily to this issue. Structural errors in cisions made in the assembly process. REAPR were likely called at the boundaries of these regions. The A reassessment of the assembly accuracy was carried out final genome assembly after REAPR breaks had 19 319 sequences using REAPR on the final genome assembly. A total of 121 of a length of 1 kbp or longer, with 576 829 188 bp. The N50 size errors within a contig were still recognized, a much smaller of scaffolds dropped from 97 344 Kb (L50 = 1792) to 71 491 bp (L50 number than previously annotated (1658 errors). Fig. 3Ashows Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Silva-Junior et al. Table 1: Handroanthus impetiginosus genome assembly statistics Allpaths-LG/ Allpaths-LG/ Scaffold sequences Allpaths-LG Sspace/GapClose Sspace/GapClose/Reapr Number 57 815 16 090 13 206 Total size, without gaps, bp 469 049 393 565 959 143 476 867 120 Total size, with gaps, bp 614 626 609 586 542 612 503 314 177 Number > 10 Kbp 10 029 8602 8348 Number > 20 Kbp 6920 6791 6647 Number > 100 Kbp 1100 1709 1304 Number > 1 Mbp 2 0 0 Longest sequence, bp 1 844 569 979 053 558 523 Average size, bp 10 631 36 454 38 112 N50 length, bp 57 726 97 266 80 946 L50 count 2595 1792 1906 GC % 33.63 33.57 33.62 The final assembly for each step contains scaffolds of length 1 kbp or longer. (A) 0.02 0.01 0.00 0 20 40 60 80 100 120 140 Depth of coverage 0.05 (B) 0.04 0.03 0.02 0.01 0.00 0 20 40 60 80 100 120 140 Average coverage for scaffolds Figure 3: Depth of coverage analysis for the haplotype-reduced assembly. (A) Density plot of read depth based on mapping all short fragment reads back to the haplotype-reduced assembled sequences after identification and removal of redundant sequences due the structural heterozygosity in the genome. (B ) Density plot for average sequencing coverage per scaffold on the final assembly. The observed number of scaffolds in the final haplotype-reduced assembly and the re spective read coverage (blue line) are shown in comparison with a Poisson process approximation (red line) with lambda =×63, the observed average sequencing coverage in the useful read data. the frequency distribution for the read depth computed from we can infer that most families/subfamilies of repeats might be the paired-end read alignment to the scaffold sequences. It underrepresented. indicates the expected effect on the distribution in compari- To complement the depth of read coverage analyses, we per- son with the previous more redundant assembly. The height formed additional analyses to identify the most probable causes of the heterozygous peak was successfully lowered by remov- of breaks in the assembly. We inspected contig termini defining ing unmerged copies of the same heterozygous loci. Fig. 3B the positions of the terminal nucleotides of each contig from the shows the relation between the observed number of scaffolds genome assembly created by cutting at each gap (of at least 1 in the final assembly and their read coverage in comparison base pair, i.e., 1 or more Ns). This analysis was developed using with a Poisson approximation with of lambda of 63, which was a protocol described elsewhere [63], and results are summarized the observed average sequencing coverage for reads set from in Fig. 4. Contig termini overlap most prominently (∼50%) with short fragment libraries. Loss of information due to repeat se- regions that do not encompass any annotated feature or regions quences is clearly a limitation of this H. impetiginosus assem- that have no depth of coverage (∼15%) based on mapped reads to bly. Given the high rate of nonclassified consensus sequences, the assembly. It suggests that contigs end in large repeats not yet Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Density Density Genome of the Pink Ipe, ˆ a neotropical forest tree 7 Figure 4: Contig termini analysis to investigate the possible genomic features associated with gaps in the genome assembly. Contigs were created from the genome assembly with the “cutN -n 1” command from the seqtk program, which cut at each gap (of at least 1 base pair, i.e., 1 or more Ns). The figure shows the percent age of contig termini (the position of the terminal nucleotides of each contig) intersecting with different annotations of the genome. resolved given the inherent limitations of short-read sequence sequences were annotated with predicted protein domains fre- data. Another possibility is that these regions can contain low- quently associated with protein coding genes. These 135 se- copy young euchromatic segmental duplication with higher se- quences were wiped out from the consensus library. Most of quence similarity to the consensus sequence. Annotated inter- the remaining 1473 sequences (71.1%) could not find classifica- spersed repeats (∼18%) and short tandem repeats (∼9%) were tion in the hierarchical well-known classes of transposable ele- the most prominently annotated features with overlap to contig ments (TEs) [64], but 16.6% could be classified Class I (retrotrans- ends. Less than 8% (2473 of 31 668) of annotated gene models posons), including 3 orders: long terminal repeats (LTR; 12.8%), were found to overlap contig ends, indicating that very few are long interspersed nuclear elements (LINE; 1.6%), and short in- likely to be interrupted in this unfinished assembly. It is a trend terspersed nuclear elements (SINE; 2.2%); 8.4% are Class II (DNA that was confirmed using BUSCO analysis, which reported only transposons). Other categories comprised nonautonomous TEs: 3% of fragmented genes. Based on variant identification analy- TRIM (0.4%) and miniature inverted–repeat transposable ele- sis with FreeBayes (FreeBayes, RRID:SCR 010761) using read data ments (MITE; 3.5%). Unknown nonclassified sequences in the mapped to the genome assembly, we found virtually no allelic consensus library cover a wide range of sequence sizes, from variants located at the contigs’ end, suggesting that interrup- 42 bp up to 5987 bp (average = 345 bp, median = 503 bp). The tion of continuity and contiguity in the assembly is not related 1473 sequences representing interspersed repeats in the con- to differences between haplotypes. sensus repeat library were used to mask the genome with Re- peatMasker. The masked fraction of the genome assembly com- prised 155 348 349 bp, i.e., 30.9% of the total assembled genome Repetitive DNA of 503 Mbp. Remarkably, if we add to these ∼155 Mbp the 54 Mbp of noncaptured base pairs in the assembly when consid- A total of 1608 consensus sequences (average length = 773 bp, ering the empirically determined genome size (557–503), the totaling 1 281 536 bp) representing interspersed repeats in the repetitive fraction of the genome approximates 37.5% (209 Mbp genome assembly were found. Search for domains in these se- out of 557 Mbp). This is within the expected range (36.6%–38.0%) quences with similarity to known large families of genes that for the repetitive fraction of the genome estimated from the read could confound the identification of true repeats indicated 85 set using k-mer profiling approaches. false positives in the consensus library of repeats. A further 50 Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 8 Silva-Junior et al. (A) Classification CLASS_I CLASS_II Tandem_repeat Unknown (B) LTR LTR/Caulimovirus LTR/Copia LTR/ERV LTR/Gypsy LTR/Ngaro LTR/Pao non−LTR/LINE non−LTR/SINE TRIM DNA DNA/CMC−EnSpm DNA/Dada DNA/hAT DNA/Kolobok−T2 DNA/MuLE−MuDR DNA/PIF−Harbinger MITE TIR Tandem_repeat Unknown 0 1000 2000 3000 4000 5000 6000 7000 Sequence length (bp) Figure 5: Repeat content of the H. impetiginosus genome assembly. (A) The density of interspersed and tandem repeat as percentage of the assembly. The size of the circles represents the number of copies in the assembly for each family of repeats. (B) Distribution of sizes of the consensus sequences for repeat families identified using de novo and homology methods for repeat characterization. More than 50% of the masked bases in the assembly, or 80 SSR motifs ranging from 1 to 6 bp showed that the di-nucleotide Mbp, came from nonclassified sequences in the consensus li- repeats were the most abundant repeats, followed by the brary. In the well-known repeats, retrotransposons are the most mono- (Fig. S3A). The frequency of SSR decreased with increase abundant class in the assembly, comprising 50 Mbp (∼one-third in motif length (Supplementary Fig. S3B), which is a trend usu- of the masked bases), with prominence of LTR/Gypsy (∼23 Mb) ally observed both in monocots and dicots [67]. and LTR/Copy (18 Mb) families of repeats. DNA transposons and nonautonomous orders of transposons masked 12 Mbp and 11 Transcriptome assembly and gene content annotation Mbp (∼one-sixth of the masked bases), respectively, highlighting and analysis the prominence of DNA/hAT families of class II and MITE (Fig. 5). A single run of Illumina HiSeq 2500 sequencing, from a pool of Simple sequence repeats (SSRs) detection using RepeatMasker RNA samples, generated nearly 148 million paired-end reads. identified a total of 182 115 microsatellites with a density of 2.76 After adapter removal, trimming, and coverage normalization, kb per SSR in the genome assembly. This density corroborates 55.2 million high-quality reads (38%) were used to assemble the general finding that the overall frequency of microsatellites the transcriptome using de novo (Trinity and SOAP-Trans-denovo is inversely related to genome size in plant genomes [65]. This SSRs density in H. impetiginosus (genome size of 557 Mbp/SSR transcripts combined with the EvidentialGene pipeline) and genome-guided methods (PERTRAN). The PASA pipeline was density of 362 per Mbp) is higher than in larger plant genomes such as those of maize (1115 Mbp/163 SSRs per Mbp), S. bicolor used to integrate transcript alignments to the genome assem- bly from these sets of sequences, generating 54 320 Expressed (738 Mbp/175 per Mbp), G. raimondii (761 Mbp/74.8 per Mbp) [66], but lower than densities in smaller genomes such as those of A. Sequence Tag (EST) assemblies representing putative protein- coding loci in the genome assembly. Loci were identified by thaliana (120 Mbp/418 per Mbp), Medicago truncatula (307 Mbp/495 per Mbp), and C. sativus (367 Mbp/552 per Mbp) [67]. Different the assembled transcript alignments using BLASTX [36]and Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Percent of assembly LTR LTR/Caulimovirus LTR/Copia LTR/ERV LTR/Gypsy LTR/Ngaro LTR/Pao non−LTR/LINE non−LTR/SINE TRIM DNA DNA/CMC−EnSpm DNA/Dada DNA/hAT DNA/Kolobok−T2 DNA/MuLE−MuDR DNA/PIF−Harbinger MITE TIR Tandem_repeat Unknown Genome of the Pink Ipe, ˆ a neotropical forest tree 9 Table 2: Handroanthus impetiginosus gene prediction statistics with respect to the number, length, and base composition of genes, transcripts, exons, and introns Genes Transcripts Exons Introns Number 31 688 35 479 154 209 122 521 Average number/gene – 1.12 4.87 3.87 Average length 3129 3342 285 445 N50 length 4421 4643 477 839 %GC 38.38 38.22 42.60 32.83 %N 0.43 0.43 0.00 0.29 Table 3: The distribution of the minimal introns (53–125 bp) and the minimal-intron-containing genes—as the number of genes with at least 1 minimal intron—from selected plant species in comparison with the H. impetiginosus genome assembly Genome Number of Mean intron Minimal Gene Species size, Mbp intron, bp length, bp intron, % , % A. thaliana (Rosids) 120 118 037 164 72.29 57.08 E. guttata (Asterids) 312 117 507 290 47.75 57.63 P. trichocarpa (Rosids) 423 166 809 380 36.96 53.41 E. grandis (Rosids) 691 137 329 425 33.49 48.38 S. indicum (Asterids) 354 101 313 439 38.14 49.76 H. impetiginosus (Asterids) 557 122 521 445 34.36 49.78 S. lycopersicum (Asterids) 900 125 750 543 36.09 47.78 EXONERATE [37] alignments of plant peptides to the repeat-soft- species in the Asterids and Rosids lineages. We have calculated masked genome using RepeatMasker. After gene model pre- the percentages of minimal introns out of the total introns and diction and refinements, a total of 36 262 gene models were the fraction of minimal-intron-containing genes with at least 1 found in the genome assembly, and 31 668 of them were re- minimal intron. Computed values were similar between H. im- tained after quality assessment based on Cscore, protein cov- petiginosus and those of selected species with higher numbers of erage, and overlap with repeats, as described in the “Methods.” large introns (smaller minimal intron peak) but were more dis- The number of predicted messenger RNA (mRNA) transcripts tinctive with those species such as A. thaliana and E. guttate, in was 35 479. which the number of large introns was lower (larger minimal in- Structural features of the gene content are shown in tron peak). This is thought to be a general trend and was also ob- Tables 2 and 3. The average number of exons per gene was ∼5, served in previous work [73]. These comparative analyses about and its average length was 285 bp. The average number of in- the structural properties of the predicted genes indicate that the trons per gene was ∼4, and its average length was 445 bp. The genome assembly of H. impetiginosus contains highly accurate GC content is significantly different between exons and introns gene structures. (t test P < 0.0001). Coding sequences have ∼43% of GC, while To further validate the gene content annotation, we used the introns have less, with ∼33% (Table 2). GC content tends to be transcript assemblies and selected plant proteomes to inspect higher in coding (exonic) than noncoding regions [68], which if these sequences could align in their entirety to the genomic may be related to gene architecture and alternative splicing sequence. Out of the 31 668 primary mRNA transcripts (consid- [69–71]. A comparison of the gene feature parameters, such as ering only the longest one when isoforms were predicted) in number and length (Fig. S4A), was carried out between H. im- the genome, 11 488 have 100% of their coding DNA sequence petiginosus and Erythranthe guttata, another plant in the order (CDS) covered by EST assemblies. The remaining 20 054 tran- Lamiales (Asterids), the model plant A. thaliana and the model scripts have either a minimum of 80% of their CDS covered by tree P. trichocarpa (Rosids). As depicted in the frequency his- EST assemblies or a Cscore ≥0.5. From these latter, the encoded tograms, the exon parameters are stable among these species putative peptides have excellent sequence similarity support (Fig. S4B). For the introns (Fig. S4C), frequency histograms have from BLASTP comparisons with dicot species Erythranthe guttata a sharp peak around 90 bp and a larger peak that is much lower (5224 genes), Sesamum indicum (4625 genes), potato or tomato in density. There is a small intron size variability from species to (2777 genes), soybean (1484 genes), and the poplar tree (1424 species in the distributions, especially for larger introns, which genes), reflecting the taxonomic relationship between H. impetig- rarely go beyond 10 000 bp. The intron length distributions in inosus and these other related dicots. Gene model support was these 4 species are similar to those observed in lineages that also found from more distantly related dicots (1826 genes) and are late in the evolutionary time scale, such as plants and verte- monocots (1042 genes). Altogether, 31 048 gene models (98%) brates [72]. The sharp peak in the distributions at their “minimal show well-supported similarity hits to other known plant pro- intron” size is supposed to affect function by enhancing the rate tein sequences. An additional 517 predicted protein sequences at which mRNA is exported from the cell nucleus [73, 74]. In the did not produce hits, and 103 sequences produced ambiguous model plant A. thaliana, a minimal intron group was previously hits from nontarget species or represent possible contaminants defined [ 73] as anything that lies within 3 standard deviations of in the assembly, such as endophytic fungi (ascomycetes, 42 se- the optimum peak at 89 ± 12 bp (53 – 125 bp). According to this quences; basidiomycetes, 17 sequences). Fig. 6A summarizes definition, Table 3 summarizes the distribution of the minimal the main finding regarding the similarity analyses with known intron among genes of H. impetiginosus and other selected plant proteins. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 10 Silva-Junior et al. (A) (B) Figure 6: Transcriptome quality assessment (A) similarity search of H. impetiginosus putative peptides against source database of plant protein sequences using BLASTP algorithm (e-value = 1e-6). Transcript count means the number of peptides of H. impetiginosus with the best hit against the source database using bit-score and grouping results by taxon name. Transcript score corresponds to the average bit-score overall hits for each group using the best hit. We ordered taxon groups by their average bit-score overall hits and used Welch’s t test to compare the distributions of bit-score hits between 2 adjacent groups with P-values <0.01 (ns = nonsignificant; significant). (B) Completeness of the expected gene space of the genome assembly, estimated with BUSCO. The estimates were compared with genome annot ations for other lamids, Erythranthe guttata and Olea europaea. BUSCO (BUSCO, RRID:SCR 015008)[75] single-copy gene plant to be completely duplicated. We benchmarked our results by profiles were used to estimate completeness of the expected searching the BUSCO profiles on the genomes of other lamids, gene space as well as the duplicate fraction of the genome as- Erythranthe guttata and Olea europaea. In E. guttate, the analysis sembly. Out of the 956 profiles searched on the assembly, 59 reported a completeness level of 88% (848 single-copy profiles (6.1%) were reported missing and 30 (3.1%) returned fragmented. with a complete match), while there were 52 fragmented genes From the profiles with a complete match to the assembly, 867 (5.4%). In O. europeae, the completeness level was 94% (905 com- (90.7%) were reported as single-copy and 247 (25.8%) were found plete single-copy profiles), and there were only 14 fragmented Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of the Pink Ipe, ˆ a neotropical forest tree 11 genes (1.4%). A summary of the BUSCO analysis is presented in lamids, it appears that the frequency of small- and large-scale Fig. 6B. duplications, such as (paleo)polyploidy, can explain the differ- Databases for gene ontology (GO) annotation are rich re- ences in the number of annotated genes and levels of gene du- sources to describe the functional properties of experimentally plication (E. guttata <= H. impetiginosus  O. europaea). It suggests derived gene sets. To explore relationships between the GO that the H. impetiginosus genome has not undergone a recent terms in the H. impetiginosus and related, well-curated genomes, whole-genome duplication event, although a deeper analysis of we used WEGO [76] to perform a genome-wide comparative this question, beyond the scope of this study, remains open. analyses among broad functional GO terms with other lamids. Our genome assembly metrics were benchmarked against The P-value of the Pearson chi-square test was considered to comparable genome assemblies of other highly heterozygous indicate significant relationships between the proportions of forest tree genomes (File S2 and Fig. S7). The H. impetiginosus as- genes of each GO term in these 2 datasets and to suggest pat- sembly has 503 Mbp in 13 206 scaffolds ≥2 kbp, representing over terns of enrichment (Figs S5 and S6). These analyses revealed 90% of the flow cytometry estimated size (557 Mb). For Quercus several GO terms in which the proportion of genes in the 2 com- robur, the assembly had 17 910 scaffolds ≥2 kbp with scaffolds pared species were related. For the terms in which the com- N50 of 260 kbp, but corresponding to 1.34 Gbp, i.e., 81% larger parison did not indicate a significant relationship of gene pro- than the expected 740-Mbp genome, which is clearly an unde- portions between the 2 datasets, the compared GO terms sug- sirable result [83]. For Quercus lobate, with agenomesizeof730 gested enrichments in H. impetiginosus for GO terms involved in Mbp, 2 assemblies were provided: a haplotype-reduced assem- metabolic processes and catalytic activity in comparison with E. bly, with 40 158 contigs totaling 760 Mb, N50 of 95 kbp, and a guttata and O. europaea. more complete version for gene models, containing 94 394 scaf- The central role of enzymes as biological catalysts is a well- folds ≥2 kbp, totaling 1.15 Gbp, with an N50 of 278 kbp [48]. De- studied issue related to the chemistry of cells [77]. An impor- spite our lower NG50/N50 scaffold length <100 kbp, the H. im- tant feature of most enzymes is that their activities can be regu- petiginosus assembly has a large (60%) percentage of scaffolds lated to function properly to comply with physiological needs of ≥20 kbp. This value is higher than the reported values for Quer- the organism. We observed that the GO term for enzyme reg- cus lobata v0.5 (53%), Quercus lobata v1.0 (51%), and Quercus rubra ulatory activity encompasses a higher proportion of genes in (48%), even if those assemblies had higher NG50/N50 scaffold H. impetiginosus than in the 2 other lamids, albeit the differ- lengths. Finally, contig termini analysis has found virtually no ence did not reach significance in E. guttata. Research in Ara- allelic variants located at contig ends, suggesting that interrup- bidopsis, a herbaceous plant, has found little connectivity be- tion of continuity and contiguity in the assembly is not related to tween metabolites and enzyme activity [78]. In comparison with differences between haplotypes. This genome assembly for Han- Arabidopsis broader GO terms, H. impetiginosus showed, as dis- droanthus impetiginosus will thus be useful for variant calling, one cussed above, enrichment for the proportion of genes assigned of the main future objectives for generating this resource. to the metabolic process (49.1% > 47.4%; P = 0.002) and cat- alytic activity (46.2% > 42.9%; P = 0). The proportion of genes for Genome-guided exploration of specialized metabolism enzyme regulatory activity was also higher in H. impetiginosus genes of quinoid systems than A. thaliana, though not statistically significantly ( P = 0.083). Investigations into whether and how metabolic process and Aside from their highly valued wood, H. impetiginosus and enzyme activities relate and how it could influence the known other Ipeˆ species are also known for their medicinal ef- richness of metabolites for forest trees of the mega-diverse trop- fects. Extracts from their bark and wood have many ethnob- ical biomes, particularly in the genus Tabebuia and Handroanthus, otanical uses: against cancer, malaria, fevers, trypanosomi- shall be an interesting issue for future molecular and chemistry asis, fungal and bacterial infections, and stomach disorders studies. [84, 85]. The wood extracts have also been demonstrated to have anti-inflammatory effects [ 86][87]. The main bioactive components isolated from the Pink Ipeˆ are Lapachol and its Benchmarking the genome assembly products [88], which are naphthoquinones derived from the o- of H. impetiginosus succinylbenzoate (OSB) pathway [89]. Lapachol is also responsi- Based on current standards for plant genome sequence assem- ble for the well-known high resistance of the Ipeˆ wood against bly [60, 79, 80], we have provided a quality assembly of high rotting fungi and insects [90]. In addition, naphthoquinones are future utility. To support functional analyses, we classified the aromatic substances with ecological importance for the interac- gene models into high-confidence and low-confidence groups. tion of plants with other plants, insects, and microbes [89]. Given Out of the 31 688 protein-coding loci annotated in the genome their medicinal and biological relevance, we have searched the assembly, 28 603 (90%) produced high-confidence gene mod- H. impetiginosus annotated genes for the enzymes involved in els (Supplementary File S1). This subset contains approximately the biosynthesis of naphthoquinones. By searching for the KEGG the same number of genes reported in less fragmented genome identifiers of these enzymes (e.g., K01851) in the InterPro an- assemblies for other lamids. E. guttata (2n = 28) reports 28 notation results, we found all the important known enzymes 140 protein-coding genes [81]; O. europeae (2n = 46) has 56 349 that lead to the biosynthesis of lapachol (Fig. 7). Unfortunately, protein-coding genes [82], but its genome has likely undergone however, the last 2 steps of the lapachol biosynthesis pathway a whole-genome duplication event. Most of Tabebuia and Han- still constitute unidentified enzymes [ 89]. For comparative pur- droanthus species studied so far have 2n = 40 [22]. The fraction of poses, we downloaded the annotation file of 5 other species gene duplicates in the BUSCO analysis (see Fig. 5B) was intended from the Phytozome database. The number of H. impetiginosus to estimate the level of redundancy in the genome assembly. We genes encoding for the enzymes of each step in the pathway is benchmarked our results by searching the completed duplicated comparable to the numbers found in other species. However, BUSCO profiles in the genomes of E. guttata and O. europaea.In 3 exceptions were found. H. impetiginosus has 5 genes encod- the first, we found them to be 15% (150 out 956), while in the lat- ing the enzyme that converts chorismate to isochorismate, the ter the duplicated profiles were 38% (364 out of 956). In these 3 first step in the OSB pathway. Two other steps found to have Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 12 Silva-Junior et al. Figure 7: Genes of the biosynthetic pathway of specialized quinoids. O-succinylbenzoate (OSB) pathway depicting the number of H. impetiginosus (Himp) annotated genes for the known enzymes that lead to the biosynthesis of the naphthoquinones, including lapachol. For comparison, it also shows the numbers of genes for the closely related Mimulus guttatus (Mgut), Solanum lycopersicum (Slyc), for the model Arabidopsis thaliana (Ath), and for the tree species Eucalyptus grandis (Egr) and Populus trichocarpa (Potri). The pathway was modified from Widhalm and Rhodes [ 89]. relatively more genes in H. impetiginosus are the ones that lead to functional studies. Extensive documentation of quality through- the synthesis of 1,4-Dihydroxy-2-naphtoyl-CoA and of 2-Phytyl- out the assembly process was provided showing that accept- 1,4-naphthoquinone. The availability of sequences for these able continuity was reached and that the fragmentation of the genes may open new avenues for biotechnological products and final sequence mostly derives from loss of information on high- for a better understanding of their ecological roles. copy families of long interspersed repeats or the presence of low-copy segmental duplications likely recently evolved with higher sequence similarity to the consensus sequence. Cer- Re-use potential tainly, there are still inaccuracies at the base and assembly lev- els, but all efforts were made to deliver results to the end user We have reported a well-curated but still unfinished genome with the appropriate documentation, making this initial read assembly for Handroanthus impetiginosus, a highly valued, eco- set, sequence, and annotations a primary and reliable starting logically keystone tropical timber and a species rich in nat- grounds for further improvement. ural products. The fragmentation of this preliminary assem- We have documented in detail the main features of the re- bly might be still be limiting for deeper insights of whole- ported assembly. The total assembly size of scaffolds ≥2 kbp in genome comparative analyses or studies of genome evolu- length is 90% of the flow cytometry–determined genome size, a tion [91], although we think that such studies may be carried remarkable accomplishment, we believe, given the anticipated out using this assembly at least at the gene level or gene- difficulties in assembling such a repetitive and highly heterozy- family level. Nevertheless, the broad validation performed pro- gous diploid genome based exclusively on short-read sequenc- vides a useful genomic resource for genetic and functional anal- ing. The percentage of base pairs in scaffolds with ≥20 kbp ysis, including, but not limited to, downstream applications is 83% (461 Mbp of 557 Mbp) of the empirically determined such as variant calling, molecular markers development, and Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of the Pink Ipe, ˆ a neotropical forest tree 13 genome size, which corresponds to 92% of the assembled total Supporting data and summary outputs for the main analyses in size (461 Mbp of 503 Mbp). Using 20 kbp as an approximate value this Data Note are available via the GigaScience repository, GigaDB of the longest plant gene length, this result shows that 60% of [92]. The Perl script that automated the read set from mate- the assembly is accessible for reliable gene annotation. Further- pair sequencing preprocessing (TrimAdaptor.pl) was uploaded more, the N50/NG50 (41 kbp/34 kbp) contig length is longer than to GigaDB under permission of the original authors at the High- 30 kbp, which has been suggested to be an adequate minimum Throughput Sequencing and Genotyping Center Unit of the Uni- threshold for high utility of a genome assembly [79]. The per- versity of Illinois Urbana-Champaign. centage of documented gaps in scaffolds is only 5.3%, and the few misassembled signatures present in the assembly were fully documented based on acceptable metrics such as fragment cov- erage distribution error (FCD error). Less than 8% (2473 of 31 Additional file 668) of annotated gene models were found to overlap contig ends, indicating that very few are likely to be interrupted in this Table S1: Summary of the sequence data generated for the unfinished assembly. No allelic variants were found at contig genome assembly of Handroanthus impetiginosus based on the ends, suggesting that interruption of continuity and contiguity ALLPATHS-LG algorithm. in the assembly is not related to differences between haplotypes, Figure S1: Flow cytometry results of the sequenced tree UFG- therefore providing a valuable resource for variant calling and 1of H. impetiginosus. Flow cytometry estimate of the nuclear DNA functional analysis. More than 86% (27 380 of 31 668) of the content was carried out using young leaf tissue on a BD Accuri gene models represented in the assembly have external eviden- C6 Plus personal flow cytometer. Pisum sativum (genome size 9.09 tial support measured by PASA-validated EST alignments from pg/2C or ∼4380 Mb/1C) was used as standard for comparison RNA-Seq or high-coverage alignments with known plant pro- (M2). The estimate of nuclear DNA content for H. impetiginosus teins (>90% coverage). Furthermore, 80% (25 369 of 31 668) of (M1) averaged over 10 readings was 1.155 pg/2C or 557.3 ± 39 transcripts have conceptual translation that contains protein Mb/1C. domain annotation, excluding those associated to TEs. Finally, a Figure S2: Overview of the analytical pipeline with the bioin- summary of BUSCO analysis indicates that the detected number formatics steps and tools employed for genome (black arrows) of plant single-copy orthologs represents 90% of the searched and transcriptome assembly (red arrows), and for gene predic- profiles (867 of 956), while only 6% are missing and 3% are tion and annotation (blue arrows). Bioinformatics programs are fragmented. indicated in italic, blue, and the main file formats in red. The This is the first well-curated genome for a Neotropical for- input sequences are highlighted in yellow boxes and the main est tree and the first one reported for a member of the Bignoni- products in green. aceae family. Besides expanding the opportunities for compara- Figure S3: Distribution and characterization of simple se- tive genomic studies by including an overlooked taxonomic fam- quence repeats in Handroanthus impetiginosus genome. (A) His- ily, the availability of this genome assembly will foster func- togram of different motifs ranging from 1 to 6 bp. (B) Distribution tional studies with new targets and allow the development of the simple sequence repeat length detected in the genome as- and application of robust sets of genome-wide SNP genotyp- sembly. ing tools to support multiple population genomics analyses Figure S4: Comparison of the gene feature parameters, in H. impetiginosus and related species of the Tabebuia Alliance. such as number and length, between H. impetiginosus and the This group includes several of the most ecologically and eco- other selected dicot plant across distinct lineages of Rosids (A. nomically important timber species of the American tropics. thaliana and P. trichocarpa) and Asterids (E. guttata and S. lycoper- Going beyond the species-specific significance of these results, sicum). Frequency histograms are shown according to the whole- this study paves the way for developing similar genomic re- genome gene content annotation for (A) the complete predicted sources for other Neotropical forest trees of equivalent rele- gene structure, (B) exons, and (C) introns. Dashed vertical lines vance. This, in turn, will open exceptional prospects to empower are the average lengths for the gene features. a higher-level understanding of the evolutionary history, species Figure S5: Histograms for Gene Ontology broader term anno- distribution, and population demography of the still largely ne- tations in the H. impetiginosus genome assembly. Terms for the glected forest trees of the mega-diverse tropical biomes. Further- Biological Process ontology were summarized with WEGO using more, this genome assembly provides a new resource for ad- the second tree level setting. The Pearson chi-square test was vances in the current integration between genomics, transcrip- applied to indicate significant relationships between H. impetigi- tomics, and metabolomics approaches for exploration of the nosus and the lamid Erythranthe guttata regarding the number of enormous structural diversity and biological activities of plant- genes (at α ≥ 5%). (A) Terms displaying a remarkable relationship derived compounds. between the 2 datasets; (B) terms with a significant difference between the 2 datasets. Figure S6: Same as Fig. S6 but showing comparison between numbers of genes assigned to GO broader terms for H. impetigi- Availability of supporting data nosus and the lamid Olea europaea. Sequences for the genome and assembly, along with gene Figure S7: Sequence length distribution from the assemblies content annotation and the raw sequencing reads, have of H. impetiginosus and the other 2 highly heterozygous trees of been deposited into GenBank, BioProject PRJNA324125. This the genus Quercus. Figure shows density plots for the size of scaf- Whole Genome Shotgun (WGS) project has been deposited at folds 2 kbp or longer in the 3 assemblies. Contigs metrics were DDBJ/ENA/GenBank under the accession NKXS00000000. The computed by cutting at each gap (of at least 25 base pairs, i.e., 25 version described in this paper is version NKXS01000000. or more Ns). Scaffolds and contigs length were plotted using the BioSample for WGS is SAMN05195323, and the corresponding common logarithm to respond to skewness toward large values. SRA run accessions are SRR3624821–SRR3624825. BioSample for File S1: Evidence adopted to support protein-coding loci iden- RNA-Seq is SAMN07346903, with SRA run accession SRR5820886. tification and assignment in the H. impetiginosus genome assem- Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 14 Silva-Junior et al. bly. Two qualifiers—high confidence and low confidence—were genomic research. We also thank Dr. Gabriela Ferreira Nogueira added to the locus based on the reported evidence. and Andre´ Luis X. de Souza for their help with flow cytometry File S2: Genome assembly metrics from the assemblies analysis. of H. impetiginosus and the other 2 highly heterozygous trees of the genus Quercus. Comparison between met- rics was based on the assemblathon stats script part of References the assemblathon2-analysis package (https://github.com/ 1. Goodstein DM, Shu SQ, Howson R et al. Phytozome: a com- ucdavis-bioinformatics/assemblathon2-analysis). Metrics were parative platform for green plant genomics. Nucleic Acids computed for scaffolds 2 kbp or longer in length. Genomic Res 2012;40(D1):D1178–86. sequences in scaffolds for Quercus lobata was obtained from 2. Kang YJ, Lee T, Lee J et al. Translational genomics for https://valleyoak.ucla.edu/genomicresources/(accessed on 20 plant breeding with the genome sequence explosion. Plant September 2017). For Quercus rubra, genomic sequences in Biotechnol J 2016;14(4):1057–69. scaffolds were downloaded from the European Nucleotide 3. Bevan M, Walsh S. The Arabidopsis genome: a foundation for Archive repository, accessions LN776247–LN794156. plant research. Genome Res 2005;15(12):1632–42. 4. Morrell PL, Buckler ES, Ross-Ibarra J. Crop genomics: ad- Abbreviations vances and applications. Nat Rev Genet 2012;13(2):85–96. 5. Varshney RK, Glaszmann JC, Leung H et al. More ge- BLASTP: Basic Local Alignment Search Tool for Proteins; BLAT: nomic resources for less-studied crops. Trends Biotechnol BLAST-like alignment tool; CDS: coding DNA sequence; EC: 2010;28(9):452–60. Enzyme Comissioned Number; EST: Expressed Sequence Tag; 6. Myburg AA, Grattapaglia D, Tuskan GA et al. The genome of GATK: Genome Analysis Toolkit; GO: Gene Ontology; LINE: long Eucalyptus grandis. Nature 2014;510(7505):356–62. interspersed nuclear elements; LTR: long terminal repeats; MITE: 7. Tuskan GA, DiFazio S, Jansson S et al. The genome of miniature inverted–repeat transposable elements; mRNA: mes- black cottonwood, Populus trichocarpa (Torr. & Gray). Science senger RNA; PASA: Program to Assemble Spliced Alignment; 2006;313(5793):1596–604. REAPR: Recognition of Errors in Assemblies using Paired Reads; 8. Neale DB, Wegrzyn JL, Stevens KA et al. Decoding the mas- SINE: short interspersed nuclear elements; SNP: single nu- sive genome of loblolly pine using haploid DNA and novel cleotide polymorphism; SSPACE: SSAKE-based Scaffolding of assembly strategies. Genome Biol 2014;15:R59. Pre-Assembled Contigs after Extension; TE: transposable ele- 9. Nystedt B, Street NR, Wetterbom A et al. The Norway spruce ment. genome sequence and conifer genome evolution. Nature 2013;497(7451):579–84. Competing interests 10. Moghe G, Last R. Something old, something new: Conserved enzymes and the evolution of novelty in plant specialized The authors declare that they have no competing interests. metabolism. Plant Physiol 2015:169(3):1512–23. 11. Stone R. Lifting the veil on Traditional Chinese Medicine. Sci- ence 2008;319(5864):709–10. Funding 12. Chappell J, DellaPenna D, O’Connor S. Specific aims for This work was supported by competitive grants from CNPq to medicinal plant genomics resource. Med Plants Genomics R.G.C. (project no. 471366/2007–2, Rede Cerrado CNPq/PPBio Resour 2017, http://medicinalplantgenomics.msu.edu/. Ac- project no. 457406/2012–7, and Procad/Capes project # cessed 27 September, 2017. 88881.068425/2014-01), to E.N. (CNPq Proc. 476709/2012– 13. Brousseau L, Tinaut A, Duret C et al. High-throughput 1), and to D.G. (PRONEX FAP-DF Project Grant “NEXTREE” transcriptome sequencing and preliminary functional anal- 193.000.570/2009). R.G.C. and D.G. have been supported by pro- ysis in four Neotropical tree species. BMC Genomics ductivity grants from CNPq, which we gratefully acknowledge. 2014;15(1):238. O.B.S.Jr. has been supported by an EMBRAPA doctoral fellowship 14. Olsson S, Seoane-Zonjic P, Bautista Ro et al. Develop- and was an Affiliate Researcher at Lawrence Berkeley National ment of genomic tools in a widespread tropical tree, Laboratory (LBNL), Berkeley, California, at the time of this Symphonia globulifera L.f.: a new low-coverage draft research. genome, SNP and SSR markers. Mol Ecol Resour 2017;17(4): 614–30. 15. Cadena-Gonzalez ´ A, Sorensen M, Theilade I. Use and valu- Author contributions ation of native and introduced medicinal plant species in O.B.S.Jr. performed sequence data analysis and genome as- Campo Hermoso and Zetaquira, Boyaca, ´ Colombia. J Ethno- sembly and, together with E.N., carried out transcriptome and biol Ethnomed 2013;9(1):23. protein-coding gene annotation. R.C. and D.G. conceived the 16. Bodker G, Bhat KKS, Burley J et al. Medicinal plants for for- project, collected samples, extracted genomic DNA and RNA, est conservation and health care. Food Agricult Org UN 1997, carried out flow cytometry analysis, and supervised the project. http://www.fao.org/3/a-w7261e.pdf. Accessed 27 September, All authors were involved in discussions, writing, and editing. 2017. All authors read and approved the final manuscript. 17. Braga AC, Reis AMM, Leoi LT et al. Development and characterization of microsatellite markers for the tropical tree species Tabebuia aurea (Bignoniaceae). Mol Ecol Notes Acknowledgements 2007;7(1):53–56. 18. Liu B, Shi Y, Yuan J et al. Estimation of genomic charac- O.B.S.Jr. thanks D. M. Goodstein and the members of the Phy- teristics by analyzing k-mer frequency in de novo genome tozome team at the LBNL/Joint Genome Institute (JGI) for their projects. arXiv.org 2013. arXiv:13082012. valuable help and support in working with the JGI pipelines for Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of the Pink Ipe, ˆ a neotropical forest tree 15 19. Schulze M, Grogan J, Uhl C et al. Evaluating ipe (Tabebuia, 39. Slater GS, Birney E. Automated generation of heuristics Bignoniaceae) logging in Amazonia: sustainable manage- for biological sequence comparison. BMC Bioinformatics ment or catalyst for forest degradation? Biol Conserv 2005;6:31. 2008;141(8):2071–85. 40. RepeatMasker Open-4.0. http://www.repeatmasker.org. Ac- 20. Inagaki R, Ninomiya M, Tanaka K et al. Synthesis and cessed 27 September, 2017. cytotoxicity on human leukemia cells of furonaphtho- 41. UniProt Consortium. UniProt: a hub for protein information. quinones isolated from tabebuia plants. Chem Pharmaceut Nucleic Acids Res 2014;43(D1):D204–12. Bull 2013;61(6):670–3. 42. Salamov AA, Solovyev VV. Ab initio gene finding in 21. Park BS, Kim JR, Lee SE et al. Selective growth-inhibiting ef- Drosophila genomic DNA. Genome Res 2000;10(4):516–22. fects of compounds identified in Tabebuia impetiginosa inner 43. Solovyev V, Kosarev P, Seledsov I et al. Automatic annotation bark on human intestinal bacteria. J Agricult Food Chem of eukaryotic genes, pseudogenes and promoters. Genome 2005;53(4):1152–7. Biol 2006;7(Suppl 1):S10.1–12. 22. Collevatti RG, Dornelas MC. Clues to the evolution of genome 44. Yeh RF, Lim LP, Burge CB. Computational inference of ho- size and chromosome number in Tabebuia alliance (Bignoni- mologous gene structures in the human genome. Genome aceae). Plant Syst Evol 2016;302(5):601–7. Res 2001;11(5):803–16. 23. Aronesty E. Comparison of sequencing utility programs. 45. Haas BJ, Delcher AL, Mount SM et al. Improving the Arabidop- Open Bioinformatics J 2013;7:1–8. sis genome annotation using maximal transcript alignment 24. Langmead B, Trapnell C, Pop M et al. Ultrafast and memory- assemblies. Nucleic Acids Res 2003;31(19):5654–66. efficient alignment of short DNA sequences to the human 46. Jones P, Binns D, Chang HY et al. InterProScan 5: genome- genome. Genome Biol 2009;10(3). scale protein function classification. Bioinformatics 25. Marcais G, Kingsford C. A fast, lock-free approach for effi- 2014;30(9):1236–40. cient parallel counting of occurrences of k-mers. Bioinfor- 47. Braga AC, Collevatti RG. Temporal variation in pollen disper- matics 2011;27(6):764–70. sal and breeding structure in a bee-pollinated Neotropical 26. Vurture GW, Sedlazeck FJ, Nattestad M et al. GenomeScope: tree. Heredity 2011;106(6):911–9. fast reference-free genome profiling from short reads. Bioin- 48. Sork VL, Fitz-Gibbon ST, Puiu D et al. First draft assembly formatics 2017;33(14):2202–20. and annotation of the genome of a California endemic oak 27. Gnerre S, MacCallum I, Przybylski D et al. High-quality draft Quercus lobata nee (Fagaceae). G3 (Bethesda) 2016;6(11):3485– assemblies of mammalian genomes from massively paral- 95. lel sequence data. Proc Natl Acad Sci U S A 2011;108(4): 49. Luo RB, Liu BH, Xie YL et al. SOAPdenovo2: an empirically 1513–8. improved memory-efficient short-read de novo assembler. 28. Flutre T, Duprat E, Feuillet C et al. Considering transposable Gigascience 2012;1(1):18. element diversification in de Nnvo annotation approaches. 50. Zimin AV, Marcais G, Puiu D et al. The MaSuRCA genome as- PLoS One 2011;6(1). sembler. Bioinformatics 2013;29(21):2669–77. 29. Wicker T, Sabot F, Hua-Van A et al. A unified classification 51. Kajitani R, Toshimoto K, Noguchi H et al. Efficient de novo system for eukaryotic transposable elements. Nat Rev Genet assembly of highly heterozygous genomes from whole- 2007;8(12):973–82. genome shotgun short reads. Genome Res 2014;24(8):1384– 30. Hoede C, Arnoux S, Moisset M et al. PASTEC: an auto- 95. matic transposable element classification tool. PLoS One 52. Malinsky M, Simpson JT, Durbin R. Trio-sga: facilitating de 2014;9(5):e91929. novo assembly of highly heterozygous genomes with parent- 31. Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0 child trios. bioRxiv 2016, doi: https://doi.org/10.1101/051516. (2013-2015). 2015, http://www.repeatmasker.org. Accessed 53. Boetzer M, Henkel CV, Jansen HJ et al. Scaffolding 27 September, 2017. pre-assembled contigs using SSPACE. Bioinformatics 32. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexi- 2011;27(4):578–9. ble trimmer for Illumina sequence data. Bioinformatics 54. Nadalin F, Vezzi F, Policriti A. GapFiller: a de novo assembly 2014;30(15):2114–20. approach to fill the gap within paired reads. BMC Bioinfor- 33. Xie Y, Wu G, Tang J et al. SOAPdenovo-Trans: de novo tran- matics 2012;13(Suppl 14):S8. scriptome assembly with short RNA-Seq reads. Bioinformat- 55. Hunt M, Kikuchi T, Sanders M et al. REAPR: a universal tool ics 2014;30(12):1660–6. for genome assembly evaluation. Genome Biol 2013;14(5): 34. Grabherr MG, Haas BJ, Yassour M et al. Full-length tran- R47. scriptome assembly from RNA-Seq data without a reference 56. Ponstingl H, Ning ZM. SMALT. 2010 - 2015 Genome Research genome. Nat Biotechnol 2011;29(7):644–U130. Ltd. 2016, http://www.sanger.ac.uk/science/tools/smalt-0. 35. Gilbert D. EvidentialGene: mRNA Transcript Assem- Accessed 27 September, 2017. bly Software. EvidentialGene: Evidence Directed Gene 57. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res Construction for Eukaryotes. 2013, http://arthropods. 2002;12(4):656–64. eugenes.org/EvidentialGene/. Accessed 27 September, 2017. 58. Stanke M, Keller O, Gunduz I et al. AUGUSTUS: ab ini- 36. Schmutz J, McClean P, Mamidi S et al. A reference genome for tio prediction of alternative transcripts. Nucleic Acids Res common bean and genome-wide analysis of dual domesti- 2006;34:W435–9. cations. Nat Genet 2014;46(7):707–13. 59. Ram´ırez-Sanc ´ hez O, Per ´ ez-Rodr´ıguez P, Delaye L et al. Plant 37. Shu S, Goodstein DM, Hayes D et al. JGI plant genomics proteins are smaller because they are encoded by fewer ex- gene annotation pipeline. SciTech Connect 2017, https:// ons than animal proteins. Genomics Proteomics Bioinfor- www.osti.gov/scitech/biblio/1241222-jgi-plant-genomics- matics 2016;14(6):357–70. gene-annotation-pipeline. Accessed 27 September, 2017. 60. Bradnam KR, Fass JN, Alexandrov A et al. Assemblathon 2: 38. Gish W, States D. Identification of protein coding regions by evaluating de novo methods of genome assembly in three database similarity search. Nat Genet 1993;3(3):266–72. vertebrate species. Gigascience 2013;2:10–10. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 16 Silva-Junior et al. 61. Lam ET, Hastie A, Lin C et al. Genome mapping on 78. Sulpice R, Trenkamp S, Steinfath M et al. Network analysis of nanochannel arrays for structural variation analy- enzyme activities and metabolite levels and their relation- sis and sequence assembly. Nat Biotechnol 2012;30(8): ship to biomass in a large panel of Arabidopsis Accessions. 771–6. Plant Cell 2010;22(8):2872–93. 62. Ay F, Noble WS. Analysis methods for studying the 3D archi- 79. Hamilton JP, Robin Buell C. Advances in plant genome se- tecture of the genome. Genome Biol 2015;16:183. quencing. Plant J 2012;70(1):177–90. 63. Tørresen OK, Star B, Jentoft S et al. An improved genome 80. Barthelson R, McFarlin AJ, Rounsley SD et al. Plantagora: assembly uncovers prolific tandem repeats in Atlantic cod. modeling whole genome sequencing and assembly of plant BMC Genomics 2017;18(1):95. genomes. PLoS One 2011;6(12):e28436. 64. Wicker T, Sabot F, Hua-Van A et al. A unified classification 81. Hellsten U, Wright KM, Jenkins J et al. Fine-scale varia- system for eukaryotic transposable elements. Nat Rev Genet tion in meiotic recombination in Mimulus inferred from 2007;8(12):973–82. population shotgun sequencing. Proc Natl Acad Sci U S A 65. Morgante M, Hanafey M, Powell W. Microsatellites are 2013;110(48):19478–82. preferentially associated with nonrepetitive DNA in plant 82. Cruz F, Julca I, Gomez-Garrido ´ J et al. Genome sequence of genomes. Nat Genet 2002;30(2):194–200. the olive tree, Olea europaea. Gigascience 2016;5(1):29. 66. Wang Q, Fang L, Chen J et al. Genome-wide mining, char- 83. Plomion C, Aury JM, Amselem J et al. Decoding the acterization, and development of microsatellite markers in oak genome: public release of sequence data, assembly, gossypium species. Sci Rep 2015;5:10638. annotation and publication strategies. Mol Ecol Resour 67. Sonah H, Deshmukh R, Sharma A et al. Genome-wide distri- 2016;16(1):254–65. bution and organization of microsatellites in plants: an in- 84. Park B-S, Lee H-K, Lee S-E et al. Antibacterial activity of sight into marker development in brachypodium. PLoS One Tabebuia impetiginosa martius ex DC (Taheebo) against Heli- 2011;6(6):e21298. cobacter pylori. J Ethnopharmacol 2006;105(1–2):255–62. 68. Bernardi G. Isochores and the evolutionary genomics of ver- 85. Gomez Castellanos R, Prieto J, Heinrich M. Red Lapacho tebrates. Gene 2000;241(1):3–17. (Tabebuia impetiginosa)—a global ethnopharmacological com- 69. Mizuno M, Kanehisa M. Distribution profiles of GC content modity? J Ethnopharmacol 2009;121(1):1–13. around the translation initiation site in different species. 86. Byeon S, Chung J, Lee Y et al. In vitro and in vivo anti- FEBS Lett 1994;352(1):7–10. inflammatory effects of taheebo, a water extract from 70. Amit M, Donyo M, Hollander D et al. Differential GC content the inner bark of Tabebuia avellanedae. J Ethnopharmacol between exons and introns establishes distinct strategies of 2008;119(1):145–52. splice-site recognition. Cell Rep 2012;1(5):543–56. 87. Koyama J, Morita I, Tagahara K et al. Cyclopentene dialdehydes 71. Wendel JF, Greilhuber J, Dolezel J et al. Plant Genome Diver- from Tabebuia impetiginosa. Phytochemistry 2000;53(8):869– sity Volume 1 - Plant Genomes, Their Residents, and Their 72. Evolutionary Dynamics, vol. 1. Wien: Springer-Verlag; 2012. 88. Hussain H, Krohn K, Ahmad VU et al. Lapachol: an overview. 72. JiaYan W, JingFa X, LingPing W et al. Systematic analysis of Arkivoc 2007;2007(2):145. intron size and abundance parameters in diverse lineages. 89. Widhalm J, Rhodes D. Biosynthesis and molecular actions of Sci China Life Sci 2013;56(10):968–74. specialized 1,4-naphthoquinone natural products produced 73. Yu J, Yang Z, Kibukawa M et al. Minimal introns are not by horticultural plants. Horticult Res 2016;3:16046. “junk.” Genome Res 2002;12(8):1185–9. 90. Romagnoli M, Segoloni E, Luna M et al. Wood colour in Lapa- 74. Zhu J, He F, Wang D et al. A novel role for minimal introns: cho (Tabebuia serratifolia): chemical composition and indus- routing mRNAs to the cytosol. PLoS One 2010;5(4):e10144. trial implications. Wood Sci Technol 2013;47(4):701–16. 75. Simao FA, Waterhouse RM, Ioannidis P et al. BUSCO: assess- 91. Alkan C, Sajjadian S, Eichler EE. Limitations of next- ing genome assembly and annotation completeness with generation genome sequence assembly. Nat Methods single-copy orthologs. Bioinformatics 2015;31(19):3210–2. 2011;8(1):61–65. 76. Ye J, Fang L, Zheng H et al. WEGO. a web tool for plot- 92 Silva-Junior OB, Grattapaglia D, Novaes E et al. Supporting ting GO annotations. Nucleic Acids Res 2006;34(Web Server data for “Genome assembly of the pink Ipe( ˆ Handroanthus issue):W293–7. impetiginosus, Bignoniaceae), a highly valued, ecologically key- 77. Cooper GM. The Cell, 2nd edition. A Molecular Approach. stone Neotropical timber forest tree.” GigaScience Database Sunderland, MA: Sinauer Associates; 2000. 2017. http://dx.doi.org/10.5524/100379. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Genome assembly of the Pink Ipê (Handroanthus impetiginosus, Bignoniaceae), a highly valued, ecologically keystone Neotropical timber forest tree

Free
16 pages

Loading next page...
 
/lp/ou_press/genome-assembly-of-the-pink-ip-handroanthus-impetiginosus-bignoniaceae-curMKIQHE6
Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/gix125
Publisher site
See Article on Publisher Site

Abstract

Background: Handroanthus impetiginosus (Mart. ex DC.) Mattos is a keystone Neotropical hardwood tree widely distributed in seasonally dry tropical forests of South and Mesoamerica. Regarded as the “new mahogany,” it is the second most expensive timber, the most logged species in Brazil, and currently under significant illegal trading pressure. The plant produces large amounts of quinoids, specialized metabolites with documented antitumorous and antibiotic effects. The development of genomic resources is needed to better understand and conserve the diversity of the species, to empower forensic identification of the origin of timber, and to identify genes for important metabolic compounds. Findings: The genome assembly covers 503.7 Mb (N50 = 81 316 bp), 90.4% of the 557-Mbp genome, with 13 206 scaffolds. A repeat database with 1508 sequences was developed, allowing masking of ∼31% of the assembly. Depth of coverage indicated that consensus determination adequately removed haplotypes assembled separately due to the extensive heterozygosity of the species. Automatic gene prediction provided 31 688 structures and 35 479 messenger RNA transcripts, while external evidence supported a well-curated set of 28 603 high-confidence models (90% of total). Finally, we used the genomic sequence and the comprehensive gene content annotation to identify genes related to the production of specialized metabolites. Conclusions: This genome assembly is the first well-curated resource for a Neotropical forest tree and the first one for a member of the Bignoniaceae family, opening exceptional opportunities to empower molecular, phytochemical, and breeding studies. This work should inspire the development of similar genomic resources for the largely neglected forest trees of the mega-diverse tropical biomes. Keywords: heterozygous genome; RNA-Seq; transposable elements; quinoids; Bignoniaceae Received: 28 June 2017; Revised: 27 September 2017; Accepted: 30 November 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Silva-Junior et al. Data Description Context The generation of plant genome assemblies is a key driver to develop powerful genomic resources that allow gaining de- tailed insights into the evolutionary history of species while enabling breeding and conservation efforts [1, 2]. Such ad- vances took place first in model plant species [ 3], followed by the mainstream [4] and minor crops [5] and some major forest trees [6–9]. Genome sequences have also driven impor- tant advances in the description and understanding of essen- tial plant metabolic processes that underlie survival across dis- tinct lineages. Research on the functional roles of specialized metabolites, many of them phylogenetically restricted [10], has recently addressed the gap in the species-specific knowledge of specialized plant metabolism by sequencing the genome of key medicinal plants [11, 12]. Innovation in this field has re- lied on a combination of high-throughput genomics, including massive parallel sequencing and arrays with animal and clin- ical studies to elucidate the mechanisms of target compounds such as adjuvant therapies, to demonstrate the necessary for- mulations for its biological effects and to determine which sub- stances are beneficial or toxic. Apart from recent reports of shal- low transcriptome characterization using 454 pyrosequencing [13] and a low-coverage (×11) fragmented genome assembly [14], essentially no well-curated genome assembly and gene content annotation exist for Neotropical forest trees, despite their recog- nized value by indigenous communities for the healing proper- ties of their special metabolites, increasingly exploited by large pharmaceutical corporations [15, 16]. An example of such a tree is the species Handroanthus impetiginosus (Mart.exDC.)Mattos (syn. Tabebuia impetiginosa, Bignoniaceae), popularly known as Pink Ipe, ˆ Lapacho, or Pau d’arco, a source of both high-value tim- ber and traditional medicine. Species of Handroanthus and Tabebuia have virtually no ge- nomic tools and resources, beyond a handful of 21 microsatel- lites [17] with their known caveats for more sophisticated genetic analyses in the areas of population genomics and evo- lution [18]. Whole-genome sequencing has now become ac- cessible to a point that efforts to develop improved genomic Figure 1: The Handroanthus impetiginosus (Mart. ex DC.) Mattos (syn. Tabebuia im- resources for such species are possible and warranted. We petiginosa, Bignoniaceae), tree UFG-1 whose genome was sequenced. built a preliminary assembly of the nuclear genome of a sin- gle individual of Handroanthus impetiginosus based on short reads and longer mate-pair DNA sequence data to provide DK). Flow cytometry was used to check the genome size of the necessary framework for the development of genomic re- tree UFG-1, indicating a genome size of (557 ±39) Mb/1C (Fig. sources to support multiple genomic and genetic analyses of S1) consistent with published estimates [22]. Total RNA from this keystone Neotropical hardwood tree regarded as the “new shoots of 5 seedlings and from the differentiating xylem of the mahogany.” It is the second most expensive timber and the adult tree (UFG-1) was extracted using Qiagen RNeasy Plant most logged species in Brazil [19], exported largely to North Mini kit (Qiagen, DK) and pooled for RNA sequencing. DNA America for residential decking and currently under signifi- and RNA sequencing was performed at the High-Throughput cant illegal trading pressure. Additionally, the tree produces Sequencing and Genotyping Center of the University of Illinois Urbana-Champaign. The following libraries were generated large amounts of natural products such as those of quinoid systems (1,4-anthraquinones, 1,4-naphthoquinones, and 1,2- for sequencing: (1) 2 shotgun genomic libraries of short frag- ments (300 bp and 600 bp) from tree UFG-1, (2) 1 shotgun furanonaphthoquinones), specialized metabolites with promis- ing antitumorous, anti-inflammatory, and antibiotic effects library from combined pools of 5 RNA samples tagged with [20, 21]. The high pressure of logging and illegal trading on this a single index sequence. Paired-end sequencing, 2 × 150 nt, species with a notable ecological keystone status urges conser- was performed in 2 lanes of an Illumina HiSeq 2500 instru- vation efforts of existing populations. ment (Illumina, CA, USA). Three additional mate-pair libraries (fragment lengths of 4 kb to 5.5 kb, 8 kb to 10 kb, and 15 kb to 20 kb) for UFG-1 were also sequenced in 2 lanes of an Methods Illumina HiSeq 2000 instrument (2 × 101 bp). This long-range sequence resource was used to generate the final genome Sample collection and sequencing assembly for annotation. A complete overview of the DNAofasingleadulttreeof H. impetiginosus (UFG-1) (Fig. 1) genome assembly and annotation pipeline is provided was extracted using Qiagen DNeasy Plant Mini kit (Qiagen, (Fig. S2). Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of the Pink Ipe, ˆ a neotropical forest tree 3 Genome assembly using short paired-end and [37] using both the RNA-Seq trimmed reads and sequences from the de novo transcript assembly. Loci were identified by mate-pair sequencing data the assembled transcript alignments using BLASTX [38]and Short reads and mate-pair reads were stripped of sequencing EXONERATE [39] alignments of peptide sequences to the repeat- adapters using Fastq-mcf [23]. Reads that mapped to a database soft-masked genome using RepeatMasker [40], based on a trans- containing mitochondrial and chloroplast genomes of plants poson database developed as part of this genome assembly an- with Bowtie1 (option–v3–a–m1)[24] were discarded. Mate- notation. Known peptide sequences included manually curated pair reads were inspected using a Perl script (TrimAdaptor.pl), datasets for plant species available from UniProtKB/Swiss-Prot and sequences that did not contain the circularization adaptor [41] and sequences available from Phytozome [1], version 11, were discarded. By using the filtered short reads, Jellyfish2 (Jelly- for Arabidopsis thaliana, Oryza sativa, Erythranthe guttata, Solanum fish, RRID:SCR 005491)[25] and GenomeScope [26] were applied lycopersicum, Solanum tuberosum, Populus trichocarpa,and Vitis to obtain estimates of the H. impetiginosus genome size, repeat vinifera. Gene structures were predicted by homology-based pre- fraction, and heterozygosity prior to the assembly. ALLPATHS- dictors, FGENESH++, FGENESH EST [42, 43], and GenomeScan LG (ALLPATHS-LG, RRID:SCR 010742)[27] was used for de novo as- [44]. Gene predictions were improved by Program to Assem- sembly of the sequence data from both paired-end and mate- ble Spliced Alignment (PASA, RRID:SCR 014656)[45], including pair data, with default options, in a stepwise strategy for error adding Untranslated Regions (UTRs), correcting splicing, and correction of reads, handling of repetitive sequences, and use of adding alternative transcripts. PASA-improved gene model pep- mate-pair libraries. tides were subjected to peptide homology analysis with the above-mentioned proteomes to obtain Cscore values and pep- tide coverage. Cscore is the ratio of the peptide Basic Local Align- Transposable elements and repetitive DNA ment Search Tool for Proteins (BLASTP) score to the mutual best Repetitive elements were detected and annotated on the hit BLASTP score, and peptide coverage is the highest percentage genome assembly with the RepeatModeler de novo repeat of peptide aligned to the best homolog. A transcript was selected family identification and modeling package (RepeatModeler, if its Cscore value was greater than or equal to 0.5 and its peptide RRID:SCR 015027)[28]. Using RECON, RepeatScout, and Tandem coverage was greater than or equal to 0.5 or if it had transcript Repeat Finder, repetitive sequences were detected in the scaf- coverage but the proportion of its coding sequence overlapping folds longer than 10 kb using a combination of similarity-based repeats was less than 20%. For gene models where greater than and de novo approaches. The TE sequences were evaluated us- 20% of the coding sequence overlapped with repeats, the Cscore ing modeling capabilities of the RepeatModeler program, with value was required to be at least 0.9 and homology coverage was default settings, to compare the TE library against the entire as- required to be at least 70% to be selected. Selected gene models sembled sequences and to refine and classify consensus mod- were then subjected to classification analysis using InterProScan els of putative interspersed repeats. A complementary analy- 5 (InterProScan, RRID:SCR 005829)[46] for PFAM domains, PAN- sis intended to augment the number of TE sequences classi- THER, Enzyme Comissioned Number (EC), and KEGG categories. fied according to current criteria [ 29] was performed using the Gene ontology annotation was obtained, where possible, from PASTEC program [30]. RepeatMasker Open-4.0 (RepeatMasker, Interpro2GO and EC2GO mappings. RRID:SCR 012954)[31] was used with the sequences from the de novo repetitive element library to annotate the interspersed repeats and to detect simple sequence repeats (SSRs) on the Data Validation and Quality Control genome assembly. Global properties of the H. impetiginosus tree genome from the unassembled reads Protein-coding gene annotation Sequencing of the H. impetiginosus tree genome generated c. 599 Protein-coding gene annotation was performed with a pipeline million reads, comprising 73 Gbp of sequence data. This rep- that combines RNA-Seq assembled transcript and protein resents nearly ×132 the expected sequence coverage. After re- alignments to the reference with de novo prediction methods moval of adaptors, followed by standard error correction and (Fig. S2). RNA-Seq reads were screened for the presence of trimming with ALLPATHS-LG, with default options, c. 46 Gbp adapters, which were removed using Fastq-mcf [23]. Trimmomatic of data was found useful for the assembly process, yielding (Trimmomatic, RRID:SCR 011848)[32] was used to (1) remove low- sequencing coverage of ×82 (×63 from the fragments libraries quality, no-base-called segments (Ns) from sequencing reads; and ×19 from the mate-pair libraries). The estimated physical (2) scan the read with a 4-base sliding window, cutting when coverage was ×400 based on the observed fragment size dis- the average quality per base dropped below 15; and (3) remove tributions (Table S1). ALLPATHS-LG k-mer spectrum frequency reads shorter than 32 bp after trimming. Trimmed reads mapped analysis (at K = 25) on useful reads, error-corrected reads, esti- to mitochondrial, chloroplast, and ribosomal sequences from mated a haploid genome size of 540 968 531 bp, a repeat frac- plants with Bowtie1 (options –v3–a–m1)[24] were also re- tion of 38.0%, and a single nucleotide polymorphism (SNP) rate moved. Transcript de novo assemblies were performed using of 1/88 bp (1.14%). An alternative analysis of the k-mer fre- SOAP-Transdenovo [33]and Trinity de novo [34] from the processed quencies using GenomeScope [26] produced a haploid genome reads. The assemblies were concatenated and used as input size estimate of 503 748 072 bp, repetitive content of 36.6%, to EvidentialGene [35], a comprehensive transcriptome pipeline and an SNP rate of 1/60 bp (1.65%). Both estimates (Fig. 2A) to identify likely complete coding regions and their proteins are consistent with the flow cytometry estimates and in line in the final, combined, transcriptome assembly. Gene modeling with the expectations regarding the heterozygous content of the was carried out using standard procedures and tools described, H. impetiginosus genome, a predominantly outcrossed tree [47]. for instance, in Schmutz et al. [36]. In summary, a genome- Sequencing errors caused an extreme peak at k = 1inthe k- guided transcriptome assembly of H. impetiginosus was per- mer frequency distribution. Both k-mer histograms display 2 formed with the JGI PERTRAN RNA-Seq Read Assembler pipeline distinct peaks comprising the largest area of each histogram Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Silva-Junior et al. (A) 0 20 40 60 80 100 120 140 k−mer coverage per 25mer (B) 0.08 0.06 0.04 0.02 0.00 0 20 40 60 80 100 120 140 Depth of coverage Figure 2: Depth of coverage analysis. (A) Histograms of k-mer frequencies in the filtered read data for k = 25 (red) and GenomeScope modeling equation on H. impetiginosus (blue). The x-axis shows the number of times a k-mer occurred (coverage). The vertical dashed dark blue lines correspond to the mean coverage values for unique heterozygous k-mers (left peak) and unique homozygous k-mers (right peak). (B) Density plot of read depth based on mapping all short fragment reads back to the assembled scaffolds (red). Left peak (at depth =×34) corresponds to regions where the assembler created 2 distinct scaffolds from divergent putative haplotypes. The right peak (at depth =×67) contains scaffolds from regions where the genome is less variable, allowing the assembler to construct a single contig combining homologue sequences. Histograms of Poisson modeling for read depth in the assembly (green, lambda = 34; blue, lambda = 67) are shown. at depths 27 and 55. The bimodal distributions characterize from the trimmed reads, but the process was aborted on the the expected behavior for k-mer frequencies of a heterozygous overlap-correction process in the Celera Assembler due to exces- diploid genome, as seen, for example, in the recently reported sive CPU usage. SOAPdenovo2 ran very fast (3 days) but produced oak genome [48]. In the right homozygous peak (at K = 55), k- an assembly with a total scaffold size of 860 Mbp. Analysis with mers are shared between the 2 homologous chromosomes. The SOAPdenovo2 was run with different k-mer sizes, from 31 to 71, left or heterozygous peak, with half the k-mer depth of the ho- step of 10, but none of them produced a reasonable assembly mozygous peak, contains k-mers that are unique to each hap- size in view of the expected size estimated by flow cytometry lotype due to heterozygosity. The difference in height between and the k-mer frequency. ALLPATHS-LG was therefore used to these peaks (heterozygous/homozygous ratio) is a measure of assemble the genome with default options. The short reads from the heterozygosity within the genome, which is 1.65% accord- fragmented libraries were error-corrected using default settings ing to the GenomeScope modeling equation. (K-mer size of 24, ploidy of 2), fragment-filled, and assembled into initial unipaths (k-mer size of 96, ploidy of 2). Jumping reads from the mate-pair libraries were then aligned to the unipaths Genome assembly and all alignments were processed in a seed-extension strategy with junction point recognition within the read aimed to remove State-of-the-art haploid genome assembler pipelines from invalid and duplicate fragments to perform error correction and short-read ALLPATHS-LG [27] and SOAPdenovo2 (SOAPdenovo2, initial scaffolding. This initial process produced an assembly RRID:SCR 014986)[49] were considered for an initial evalua- graph that was turned into scaffolds by analyzing branch points tion on the dataset of reads. Two relatively new algorithms in the graph topology. This late process converted single-base specifically developed for de novo assembly of heterozygous mismatches into ambiguous base codes at branch. It also flat- genomes, MaSuRCA (MaSuRCA, RRID:SCR 010691)[50] and PLA- tened some other structural features of the assembly including TANUS (PLATANUS, RRID:SCR 015531)[51], were also attempted short indels. The contig assembly comprised 109 064 sequences as alternatives to the other 2 assemblers designed for genomes of a length of 500 bp or longer with total length of 466 314 780 bp. of low heterozygosity. Reads were first preprocessed and error Genome assembly after scaffolding comprised 57 815 scaffolds corrected using the algorithms provided by each assembler. PLA- of length 1 kbp or longer with a total length of 610 091 865 bp and TANUS was set to run, but after 10 weeks it did not produce N50 of 57 Kbp. The fraction of bases captured in gaps was 23.9%, any result in an Intel(R) Xeon(R) server with 64 × 7560 2.27-GHz and the rate of ambiguous bases for all bases captured in the CPUs, 256 GB RAM, except for the k-mer count table on the in- assembly was 0.24%. This assembly was only slightly larger in put trimmed reads. After 9 week-long runtimes in an Intel(R) size (<10%) than the empirically determined genome size using Xeon(R) server with 64 × 7560 2.27-GHz CPUs, 512 GB RAM, Ma- flow cytometry [ 22]. SuRCA successfully completed the generation of the super-reads Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Density Number of k−mers at coverage (Million) Genome of the Pink Ipe, ˆ a neotropical forest tree 5 Alternative scaffold and gap-filling = 2379). The number of remaining gaps in the assembly was 21 417, totaling 30 066 113 bp (5.05%). Although the ALLPATHS-LG performance was good in recovering Paired-end reads from the short fragment libraries were the expected genome size in the assembled contigs, there was aligned back independently to this genome assembly using a high fraction of the bases captured in gaps in the scaffolds SMALT (map -r 0 -x -y 0.5; default alignment penalty scores). Per- (∼one-fourth of the total genome assembly). De novo assembly scaffold depth of coverage was computed, regardless of map- algorithms applied to moderate to high levels of heterozygosity ping quality, using Genome Analysis Toolkit (GATK) Depthof- cannot match the performance achieved in assemblies of ho- Coverage. The mean read depth across the scaffolds resulted in mozygous genomes, especially at the contig assembly level [52]. ×66.45. The mean read length of the mapped reads was 139.8 We thus used the assembled contigs to perform an alternative bp, and the corresponding k-mer coverage for the size of 25 scaffolding step with SSAKE-based Scaffolding of Pre-Assembled was ×55.04, which matches with the homozygous peak com- Contigs after Extension (SSPACE, RRID:SCR 005056)[53] using the puted from the k-mer frequency distribution from the unassem- error-corrected short fragment reads and the jumping reads. In bled reads. The read depth frequencies are shown in Fig. 2B. this approach, genome assembly comprised 16 090 scaffolds of The heterozygous/homozygous peak height (>1) in the distri- a length of 1 kbp or longer with a total length of 577 446 088 bp bution suggests that the assembly contains redundant copies and N50 of 95 Kbp, respectively. The fraction of bases captured of unmerged haplotypes due to the structural heterozygosity in gaps dropped from 23.9% to 18.9% in contrast to ALLPATHS- of the diploid genome of the species. To specifically deal with LG scaffolding, totaling 109 533 288 bp. The rate of ambiguous the heterozygosity, we introduced a step to, leniently, recognize bases for all bases captured in the assembly dropped from 0.24% and remove alternative heterozygous sequences. Sequences of to 0.13%. All preprocessed reads were reused in an attempt to scaffolds were aligned 1 vs all using the BLAST-like alignment close the intra-scaffold gaps using the GapCloser (GapCloser, tool (BLAT, RRID:SCR 011919)[57], and results were concate- RRID:SCR 015026)[54] algorithm. Genome assembly after gap- nated in a single file of alignments and sorted. Similar se- filling was 586 206 884 bp in 15 671 scaffolds of a length of 1 kbp quences were identified on the base of pairwise similarity us- or longer, and only 20 583 469 bp (3.51% of the genome assem- ing filterPSL utility from AUGUSTUS [ 58] with default param- bly) remained in 24 907 gaps. The N50 of scaffolds of a length of eters, and retaining all best matches to each single sequence 1 kbp or longer, with gaps, was 97 344 Kb (L50 = 1792). Sequences queried against all others that satisfy minimal percentage of longer than 20 kb were assembled in only 6791 scaffolds, total- identity (minId = 92%) and minimal percentage of coverage of ing 538 102 146 bp, ∼97% of the genome size estimated from flow the query read (minCover = 80%). We considered as heterozy- cytometry (557 Mb). gous redundant those scaffolds that showed pairwise similarity to exactly another sequence, and their depth of coverage fell in a Poisson distribution with parameters given by the heterozygous Evaluation of accuracy of the genome assembly peak of the read depth distribution over all scaffolds (lambda A subset of fragments and jumping read pairs (∼×15 sequenc- = 34) (Fig. 2B). The final step was to keep only 1 copy—the ing coverage each) was used to uncover inaccuracies in the largest—of the heterozygous scaffolds among pairs with high genome assembly. Scaffolds with identified errors were broken similarity. or flagged for inspection. Recognition of Errors in Assemblies using Paired Reads (REAPR) [55] was used to test each base of A preliminary assembly of the H. impetiginosus genome the genome assembly looking for small local errors (such as a single base substitution, and short insertions or deletions) and At the end of the accuracy evaluation processes, the genome structural errors (such as scaffolding errors), located by means assembly had a total size of 503 308 897 bp, with gaps, in 13 of changes to the expected distribution of inferred sequenc- 206 scaffolds. The N50 of scaffolds of 1 kbp or longer was 80 ing fragments from the mapped reads using SMALT v0.7.6 [56]. 946 bp (L50 = 1906), and the average size of the sequences was REAPR reported that only 343 588 027 (∼60%) bases in the assem- 38 118 bp. Using 20 kbp as an approximate value of longest plant bly should be free of errors, with 5476 reported (1658 within con- gene length [59, 60], the percentage of scaffolds that equaled or tigs, 3818 over gaps) in the remaining 242 618 857 bp. The most surpassed this value in relation to the empirically determined frequent (∼92%) type of inaccuracy reported was Perfect cov and genome size is 83%, which corresponds to over 92% of the as- Link. Perfect cov means low coverage of perfect uniquely mapping sembly total size. Contigs generated by cutting scaffolds at each reads while Link describes situations in which reads map else- gap (of at least 25 base pairs, i.e., 25 or more Ns) produced an where in the assembly. The recognition of this inaccuracy at the N50 of 40 064 bp (L50 = 3551) with an average sequence size of base pair level should thus reflect the repetitive nature of the 19 765 bp. The remaining gaps comprised 26 447 057 bp (5.25% genome, as inferred from the k-mer frequency spectra analy- of the genome assembly) in 11 094 segments, with a size of 2384 sis (∼36%–38% of repeats). Besides the base pair inaccurate calls ± 3167 bp. The total assembly size represents over 90% of the due to repeats, other structural problems in the assembly were flow cytometry genome estimate (557 Mb) and should provide identified based on sequence coverage differences from the ex- a good start to build a further improved reference genome as- pected fragment size distribution, and the program used this sembly of the species using long-range scaffolding techniques information to break these. Given the high heterozygosity and such as whole-genome maps using either imaging methods [61] divergence between haplotypes on this diploid genome se- or contact maps of chromosomes based on chromatin interac- quence, homologous sequences can assemble separately or tions [62]. Table 1 summarizes the main statistics of the Han- merge. Moreover, unresolved repeat structures in the assembly droanthus impetiginosus genome assembly with respect to the de- might also contribute heavily to this issue. Structural errors in cisions made in the assembly process. REAPR were likely called at the boundaries of these regions. The A reassessment of the assembly accuracy was carried out final genome assembly after REAPR breaks had 19 319 sequences using REAPR on the final genome assembly. A total of 121 of a length of 1 kbp or longer, with 576 829 188 bp. The N50 size errors within a contig were still recognized, a much smaller of scaffolds dropped from 97 344 Kb (L50 = 1792) to 71 491 bp (L50 number than previously annotated (1658 errors). Fig. 3Ashows Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Silva-Junior et al. Table 1: Handroanthus impetiginosus genome assembly statistics Allpaths-LG/ Allpaths-LG/ Scaffold sequences Allpaths-LG Sspace/GapClose Sspace/GapClose/Reapr Number 57 815 16 090 13 206 Total size, without gaps, bp 469 049 393 565 959 143 476 867 120 Total size, with gaps, bp 614 626 609 586 542 612 503 314 177 Number > 10 Kbp 10 029 8602 8348 Number > 20 Kbp 6920 6791 6647 Number > 100 Kbp 1100 1709 1304 Number > 1 Mbp 2 0 0 Longest sequence, bp 1 844 569 979 053 558 523 Average size, bp 10 631 36 454 38 112 N50 length, bp 57 726 97 266 80 946 L50 count 2595 1792 1906 GC % 33.63 33.57 33.62 The final assembly for each step contains scaffolds of length 1 kbp or longer. (A) 0.02 0.01 0.00 0 20 40 60 80 100 120 140 Depth of coverage 0.05 (B) 0.04 0.03 0.02 0.01 0.00 0 20 40 60 80 100 120 140 Average coverage for scaffolds Figure 3: Depth of coverage analysis for the haplotype-reduced assembly. (A) Density plot of read depth based on mapping all short fragment reads back to the haplotype-reduced assembled sequences after identification and removal of redundant sequences due the structural heterozygosity in the genome. (B ) Density plot for average sequencing coverage per scaffold on the final assembly. The observed number of scaffolds in the final haplotype-reduced assembly and the re spective read coverage (blue line) are shown in comparison with a Poisson process approximation (red line) with lambda =×63, the observed average sequencing coverage in the useful read data. the frequency distribution for the read depth computed from we can infer that most families/subfamilies of repeats might be the paired-end read alignment to the scaffold sequences. It underrepresented. indicates the expected effect on the distribution in compari- To complement the depth of read coverage analyses, we per- son with the previous more redundant assembly. The height formed additional analyses to identify the most probable causes of the heterozygous peak was successfully lowered by remov- of breaks in the assembly. We inspected contig termini defining ing unmerged copies of the same heterozygous loci. Fig. 3B the positions of the terminal nucleotides of each contig from the shows the relation between the observed number of scaffolds genome assembly created by cutting at each gap (of at least 1 in the final assembly and their read coverage in comparison base pair, i.e., 1 or more Ns). This analysis was developed using with a Poisson approximation with of lambda of 63, which was a protocol described elsewhere [63], and results are summarized the observed average sequencing coverage for reads set from in Fig. 4. Contig termini overlap most prominently (∼50%) with short fragment libraries. Loss of information due to repeat se- regions that do not encompass any annotated feature or regions quences is clearly a limitation of this H. impetiginosus assem- that have no depth of coverage (∼15%) based on mapped reads to bly. Given the high rate of nonclassified consensus sequences, the assembly. It suggests that contigs end in large repeats not yet Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Density Density Genome of the Pink Ipe, ˆ a neotropical forest tree 7 Figure 4: Contig termini analysis to investigate the possible genomic features associated with gaps in the genome assembly. Contigs were created from the genome assembly with the “cutN -n 1” command from the seqtk program, which cut at each gap (of at least 1 base pair, i.e., 1 or more Ns). The figure shows the percent age of contig termini (the position of the terminal nucleotides of each contig) intersecting with different annotations of the genome. resolved given the inherent limitations of short-read sequence sequences were annotated with predicted protein domains fre- data. Another possibility is that these regions can contain low- quently associated with protein coding genes. These 135 se- copy young euchromatic segmental duplication with higher se- quences were wiped out from the consensus library. Most of quence similarity to the consensus sequence. Annotated inter- the remaining 1473 sequences (71.1%) could not find classifica- spersed repeats (∼18%) and short tandem repeats (∼9%) were tion in the hierarchical well-known classes of transposable ele- the most prominently annotated features with overlap to contig ments (TEs) [64], but 16.6% could be classified Class I (retrotrans- ends. Less than 8% (2473 of 31 668) of annotated gene models posons), including 3 orders: long terminal repeats (LTR; 12.8%), were found to overlap contig ends, indicating that very few are long interspersed nuclear elements (LINE; 1.6%), and short in- likely to be interrupted in this unfinished assembly. It is a trend terspersed nuclear elements (SINE; 2.2%); 8.4% are Class II (DNA that was confirmed using BUSCO analysis, which reported only transposons). Other categories comprised nonautonomous TEs: 3% of fragmented genes. Based on variant identification analy- TRIM (0.4%) and miniature inverted–repeat transposable ele- sis with FreeBayes (FreeBayes, RRID:SCR 010761) using read data ments (MITE; 3.5%). Unknown nonclassified sequences in the mapped to the genome assembly, we found virtually no allelic consensus library cover a wide range of sequence sizes, from variants located at the contigs’ end, suggesting that interrup- 42 bp up to 5987 bp (average = 345 bp, median = 503 bp). The tion of continuity and contiguity in the assembly is not related 1473 sequences representing interspersed repeats in the con- to differences between haplotypes. sensus repeat library were used to mask the genome with Re- peatMasker. The masked fraction of the genome assembly com- prised 155 348 349 bp, i.e., 30.9% of the total assembled genome Repetitive DNA of 503 Mbp. Remarkably, if we add to these ∼155 Mbp the 54 Mbp of noncaptured base pairs in the assembly when consid- A total of 1608 consensus sequences (average length = 773 bp, ering the empirically determined genome size (557–503), the totaling 1 281 536 bp) representing interspersed repeats in the repetitive fraction of the genome approximates 37.5% (209 Mbp genome assembly were found. Search for domains in these se- out of 557 Mbp). This is within the expected range (36.6%–38.0%) quences with similarity to known large families of genes that for the repetitive fraction of the genome estimated from the read could confound the identification of true repeats indicated 85 set using k-mer profiling approaches. false positives in the consensus library of repeats. A further 50 Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 8 Silva-Junior et al. (A) Classification CLASS_I CLASS_II Tandem_repeat Unknown (B) LTR LTR/Caulimovirus LTR/Copia LTR/ERV LTR/Gypsy LTR/Ngaro LTR/Pao non−LTR/LINE non−LTR/SINE TRIM DNA DNA/CMC−EnSpm DNA/Dada DNA/hAT DNA/Kolobok−T2 DNA/MuLE−MuDR DNA/PIF−Harbinger MITE TIR Tandem_repeat Unknown 0 1000 2000 3000 4000 5000 6000 7000 Sequence length (bp) Figure 5: Repeat content of the H. impetiginosus genome assembly. (A) The density of interspersed and tandem repeat as percentage of the assembly. The size of the circles represents the number of copies in the assembly for each family of repeats. (B) Distribution of sizes of the consensus sequences for repeat families identified using de novo and homology methods for repeat characterization. More than 50% of the masked bases in the assembly, or 80 SSR motifs ranging from 1 to 6 bp showed that the di-nucleotide Mbp, came from nonclassified sequences in the consensus li- repeats were the most abundant repeats, followed by the brary. In the well-known repeats, retrotransposons are the most mono- (Fig. S3A). The frequency of SSR decreased with increase abundant class in the assembly, comprising 50 Mbp (∼one-third in motif length (Supplementary Fig. S3B), which is a trend usu- of the masked bases), with prominence of LTR/Gypsy (∼23 Mb) ally observed both in monocots and dicots [67]. and LTR/Copy (18 Mb) families of repeats. DNA transposons and nonautonomous orders of transposons masked 12 Mbp and 11 Transcriptome assembly and gene content annotation Mbp (∼one-sixth of the masked bases), respectively, highlighting and analysis the prominence of DNA/hAT families of class II and MITE (Fig. 5). A single run of Illumina HiSeq 2500 sequencing, from a pool of Simple sequence repeats (SSRs) detection using RepeatMasker RNA samples, generated nearly 148 million paired-end reads. identified a total of 182 115 microsatellites with a density of 2.76 After adapter removal, trimming, and coverage normalization, kb per SSR in the genome assembly. This density corroborates 55.2 million high-quality reads (38%) were used to assemble the general finding that the overall frequency of microsatellites the transcriptome using de novo (Trinity and SOAP-Trans-denovo is inversely related to genome size in plant genomes [65]. This SSRs density in H. impetiginosus (genome size of 557 Mbp/SSR transcripts combined with the EvidentialGene pipeline) and genome-guided methods (PERTRAN). The PASA pipeline was density of 362 per Mbp) is higher than in larger plant genomes such as those of maize (1115 Mbp/163 SSRs per Mbp), S. bicolor used to integrate transcript alignments to the genome assem- bly from these sets of sequences, generating 54 320 Expressed (738 Mbp/175 per Mbp), G. raimondii (761 Mbp/74.8 per Mbp) [66], but lower than densities in smaller genomes such as those of A. Sequence Tag (EST) assemblies representing putative protein- coding loci in the genome assembly. Loci were identified by thaliana (120 Mbp/418 per Mbp), Medicago truncatula (307 Mbp/495 per Mbp), and C. sativus (367 Mbp/552 per Mbp) [67]. Different the assembled transcript alignments using BLASTX [36]and Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Percent of assembly LTR LTR/Caulimovirus LTR/Copia LTR/ERV LTR/Gypsy LTR/Ngaro LTR/Pao non−LTR/LINE non−LTR/SINE TRIM DNA DNA/CMC−EnSpm DNA/Dada DNA/hAT DNA/Kolobok−T2 DNA/MuLE−MuDR DNA/PIF−Harbinger MITE TIR Tandem_repeat Unknown Genome of the Pink Ipe, ˆ a neotropical forest tree 9 Table 2: Handroanthus impetiginosus gene prediction statistics with respect to the number, length, and base composition of genes, transcripts, exons, and introns Genes Transcripts Exons Introns Number 31 688 35 479 154 209 122 521 Average number/gene – 1.12 4.87 3.87 Average length 3129 3342 285 445 N50 length 4421 4643 477 839 %GC 38.38 38.22 42.60 32.83 %N 0.43 0.43 0.00 0.29 Table 3: The distribution of the minimal introns (53–125 bp) and the minimal-intron-containing genes—as the number of genes with at least 1 minimal intron—from selected plant species in comparison with the H. impetiginosus genome assembly Genome Number of Mean intron Minimal Gene Species size, Mbp intron, bp length, bp intron, % , % A. thaliana (Rosids) 120 118 037 164 72.29 57.08 E. guttata (Asterids) 312 117 507 290 47.75 57.63 P. trichocarpa (Rosids) 423 166 809 380 36.96 53.41 E. grandis (Rosids) 691 137 329 425 33.49 48.38 S. indicum (Asterids) 354 101 313 439 38.14 49.76 H. impetiginosus (Asterids) 557 122 521 445 34.36 49.78 S. lycopersicum (Asterids) 900 125 750 543 36.09 47.78 EXONERATE [37] alignments of plant peptides to the repeat-soft- species in the Asterids and Rosids lineages. We have calculated masked genome using RepeatMasker. After gene model pre- the percentages of minimal introns out of the total introns and diction and refinements, a total of 36 262 gene models were the fraction of minimal-intron-containing genes with at least 1 found in the genome assembly, and 31 668 of them were re- minimal intron. Computed values were similar between H. im- tained after quality assessment based on Cscore, protein cov- petiginosus and those of selected species with higher numbers of erage, and overlap with repeats, as described in the “Methods.” large introns (smaller minimal intron peak) but were more dis- The number of predicted messenger RNA (mRNA) transcripts tinctive with those species such as A. thaliana and E. guttate, in was 35 479. which the number of large introns was lower (larger minimal in- Structural features of the gene content are shown in tron peak). This is thought to be a general trend and was also ob- Tables 2 and 3. The average number of exons per gene was ∼5, served in previous work [73]. These comparative analyses about and its average length was 285 bp. The average number of in- the structural properties of the predicted genes indicate that the trons per gene was ∼4, and its average length was 445 bp. The genome assembly of H. impetiginosus contains highly accurate GC content is significantly different between exons and introns gene structures. (t test P < 0.0001). Coding sequences have ∼43% of GC, while To further validate the gene content annotation, we used the introns have less, with ∼33% (Table 2). GC content tends to be transcript assemblies and selected plant proteomes to inspect higher in coding (exonic) than noncoding regions [68], which if these sequences could align in their entirety to the genomic may be related to gene architecture and alternative splicing sequence. Out of the 31 668 primary mRNA transcripts (consid- [69–71]. A comparison of the gene feature parameters, such as ering only the longest one when isoforms were predicted) in number and length (Fig. S4A), was carried out between H. im- the genome, 11 488 have 100% of their coding DNA sequence petiginosus and Erythranthe guttata, another plant in the order (CDS) covered by EST assemblies. The remaining 20 054 tran- Lamiales (Asterids), the model plant A. thaliana and the model scripts have either a minimum of 80% of their CDS covered by tree P. trichocarpa (Rosids). As depicted in the frequency his- EST assemblies or a Cscore ≥0.5. From these latter, the encoded tograms, the exon parameters are stable among these species putative peptides have excellent sequence similarity support (Fig. S4B). For the introns (Fig. S4C), frequency histograms have from BLASTP comparisons with dicot species Erythranthe guttata a sharp peak around 90 bp and a larger peak that is much lower (5224 genes), Sesamum indicum (4625 genes), potato or tomato in density. There is a small intron size variability from species to (2777 genes), soybean (1484 genes), and the poplar tree (1424 species in the distributions, especially for larger introns, which genes), reflecting the taxonomic relationship between H. impetig- rarely go beyond 10 000 bp. The intron length distributions in inosus and these other related dicots. Gene model support was these 4 species are similar to those observed in lineages that also found from more distantly related dicots (1826 genes) and are late in the evolutionary time scale, such as plants and verte- monocots (1042 genes). Altogether, 31 048 gene models (98%) brates [72]. The sharp peak in the distributions at their “minimal show well-supported similarity hits to other known plant pro- intron” size is supposed to affect function by enhancing the rate tein sequences. An additional 517 predicted protein sequences at which mRNA is exported from the cell nucleus [73, 74]. In the did not produce hits, and 103 sequences produced ambiguous model plant A. thaliana, a minimal intron group was previously hits from nontarget species or represent possible contaminants defined [ 73] as anything that lies within 3 standard deviations of in the assembly, such as endophytic fungi (ascomycetes, 42 se- the optimum peak at 89 ± 12 bp (53 – 125 bp). According to this quences; basidiomycetes, 17 sequences). Fig. 6A summarizes definition, Table 3 summarizes the distribution of the minimal the main finding regarding the similarity analyses with known intron among genes of H. impetiginosus and other selected plant proteins. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 10 Silva-Junior et al. (A) (B) Figure 6: Transcriptome quality assessment (A) similarity search of H. impetiginosus putative peptides against source database of plant protein sequences using BLASTP algorithm (e-value = 1e-6). Transcript count means the number of peptides of H. impetiginosus with the best hit against the source database using bit-score and grouping results by taxon name. Transcript score corresponds to the average bit-score overall hits for each group using the best hit. We ordered taxon groups by their average bit-score overall hits and used Welch’s t test to compare the distributions of bit-score hits between 2 adjacent groups with P-values <0.01 (ns = nonsignificant; significant). (B) Completeness of the expected gene space of the genome assembly, estimated with BUSCO. The estimates were compared with genome annot ations for other lamids, Erythranthe guttata and Olea europaea. BUSCO (BUSCO, RRID:SCR 015008)[75] single-copy gene plant to be completely duplicated. We benchmarked our results by profiles were used to estimate completeness of the expected searching the BUSCO profiles on the genomes of other lamids, gene space as well as the duplicate fraction of the genome as- Erythranthe guttata and Olea europaea. In E. guttate, the analysis sembly. Out of the 956 profiles searched on the assembly, 59 reported a completeness level of 88% (848 single-copy profiles (6.1%) were reported missing and 30 (3.1%) returned fragmented. with a complete match), while there were 52 fragmented genes From the profiles with a complete match to the assembly, 867 (5.4%). In O. europeae, the completeness level was 94% (905 com- (90.7%) were reported as single-copy and 247 (25.8%) were found plete single-copy profiles), and there were only 14 fragmented Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of the Pink Ipe, ˆ a neotropical forest tree 11 genes (1.4%). A summary of the BUSCO analysis is presented in lamids, it appears that the frequency of small- and large-scale Fig. 6B. duplications, such as (paleo)polyploidy, can explain the differ- Databases for gene ontology (GO) annotation are rich re- ences in the number of annotated genes and levels of gene du- sources to describe the functional properties of experimentally plication (E. guttata <= H. impetiginosus  O. europaea). It suggests derived gene sets. To explore relationships between the GO that the H. impetiginosus genome has not undergone a recent terms in the H. impetiginosus and related, well-curated genomes, whole-genome duplication event, although a deeper analysis of we used WEGO [76] to perform a genome-wide comparative this question, beyond the scope of this study, remains open. analyses among broad functional GO terms with other lamids. Our genome assembly metrics were benchmarked against The P-value of the Pearson chi-square test was considered to comparable genome assemblies of other highly heterozygous indicate significant relationships between the proportions of forest tree genomes (File S2 and Fig. S7). The H. impetiginosus as- genes of each GO term in these 2 datasets and to suggest pat- sembly has 503 Mbp in 13 206 scaffolds ≥2 kbp, representing over terns of enrichment (Figs S5 and S6). These analyses revealed 90% of the flow cytometry estimated size (557 Mb). For Quercus several GO terms in which the proportion of genes in the 2 com- robur, the assembly had 17 910 scaffolds ≥2 kbp with scaffolds pared species were related. For the terms in which the com- N50 of 260 kbp, but corresponding to 1.34 Gbp, i.e., 81% larger parison did not indicate a significant relationship of gene pro- than the expected 740-Mbp genome, which is clearly an unde- portions between the 2 datasets, the compared GO terms sug- sirable result [83]. For Quercus lobate, with agenomesizeof730 gested enrichments in H. impetiginosus for GO terms involved in Mbp, 2 assemblies were provided: a haplotype-reduced assem- metabolic processes and catalytic activity in comparison with E. bly, with 40 158 contigs totaling 760 Mb, N50 of 95 kbp, and a guttata and O. europaea. more complete version for gene models, containing 94 394 scaf- The central role of enzymes as biological catalysts is a well- folds ≥2 kbp, totaling 1.15 Gbp, with an N50 of 278 kbp [48]. De- studied issue related to the chemistry of cells [77]. An impor- spite our lower NG50/N50 scaffold length <100 kbp, the H. im- tant feature of most enzymes is that their activities can be regu- petiginosus assembly has a large (60%) percentage of scaffolds lated to function properly to comply with physiological needs of ≥20 kbp. This value is higher than the reported values for Quer- the organism. We observed that the GO term for enzyme reg- cus lobata v0.5 (53%), Quercus lobata v1.0 (51%), and Quercus rubra ulatory activity encompasses a higher proportion of genes in (48%), even if those assemblies had higher NG50/N50 scaffold H. impetiginosus than in the 2 other lamids, albeit the differ- lengths. Finally, contig termini analysis has found virtually no ence did not reach significance in E. guttata. Research in Ara- allelic variants located at contig ends, suggesting that interrup- bidopsis, a herbaceous plant, has found little connectivity be- tion of continuity and contiguity in the assembly is not related to tween metabolites and enzyme activity [78]. In comparison with differences between haplotypes. This genome assembly for Han- Arabidopsis broader GO terms, H. impetiginosus showed, as dis- droanthus impetiginosus will thus be useful for variant calling, one cussed above, enrichment for the proportion of genes assigned of the main future objectives for generating this resource. to the metabolic process (49.1% > 47.4%; P = 0.002) and cat- alytic activity (46.2% > 42.9%; P = 0). The proportion of genes for Genome-guided exploration of specialized metabolism enzyme regulatory activity was also higher in H. impetiginosus genes of quinoid systems than A. thaliana, though not statistically significantly ( P = 0.083). Investigations into whether and how metabolic process and Aside from their highly valued wood, H. impetiginosus and enzyme activities relate and how it could influence the known other Ipeˆ species are also known for their medicinal ef- richness of metabolites for forest trees of the mega-diverse trop- fects. Extracts from their bark and wood have many ethnob- ical biomes, particularly in the genus Tabebuia and Handroanthus, otanical uses: against cancer, malaria, fevers, trypanosomi- shall be an interesting issue for future molecular and chemistry asis, fungal and bacterial infections, and stomach disorders studies. [84, 85]. The wood extracts have also been demonstrated to have anti-inflammatory effects [ 86][87]. The main bioactive components isolated from the Pink Ipeˆ are Lapachol and its Benchmarking the genome assembly products [88], which are naphthoquinones derived from the o- of H. impetiginosus succinylbenzoate (OSB) pathway [89]. Lapachol is also responsi- Based on current standards for plant genome sequence assem- ble for the well-known high resistance of the Ipeˆ wood against bly [60, 79, 80], we have provided a quality assembly of high rotting fungi and insects [90]. In addition, naphthoquinones are future utility. To support functional analyses, we classified the aromatic substances with ecological importance for the interac- gene models into high-confidence and low-confidence groups. tion of plants with other plants, insects, and microbes [89]. Given Out of the 31 688 protein-coding loci annotated in the genome their medicinal and biological relevance, we have searched the assembly, 28 603 (90%) produced high-confidence gene mod- H. impetiginosus annotated genes for the enzymes involved in els (Supplementary File S1). This subset contains approximately the biosynthesis of naphthoquinones. By searching for the KEGG the same number of genes reported in less fragmented genome identifiers of these enzymes (e.g., K01851) in the InterPro an- assemblies for other lamids. E. guttata (2n = 28) reports 28 notation results, we found all the important known enzymes 140 protein-coding genes [81]; O. europeae (2n = 46) has 56 349 that lead to the biosynthesis of lapachol (Fig. 7). Unfortunately, protein-coding genes [82], but its genome has likely undergone however, the last 2 steps of the lapachol biosynthesis pathway a whole-genome duplication event. Most of Tabebuia and Han- still constitute unidentified enzymes [ 89]. For comparative pur- droanthus species studied so far have 2n = 40 [22]. The fraction of poses, we downloaded the annotation file of 5 other species gene duplicates in the BUSCO analysis (see Fig. 5B) was intended from the Phytozome database. The number of H. impetiginosus to estimate the level of redundancy in the genome assembly. We genes encoding for the enzymes of each step in the pathway is benchmarked our results by searching the completed duplicated comparable to the numbers found in other species. However, BUSCO profiles in the genomes of E. guttata and O. europaea.In 3 exceptions were found. H. impetiginosus has 5 genes encod- the first, we found them to be 15% (150 out 956), while in the lat- ing the enzyme that converts chorismate to isochorismate, the ter the duplicated profiles were 38% (364 out of 956). In these 3 first step in the OSB pathway. Two other steps found to have Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 12 Silva-Junior et al. Figure 7: Genes of the biosynthetic pathway of specialized quinoids. O-succinylbenzoate (OSB) pathway depicting the number of H. impetiginosus (Himp) annotated genes for the known enzymes that lead to the biosynthesis of the naphthoquinones, including lapachol. For comparison, it also shows the numbers of genes for the closely related Mimulus guttatus (Mgut), Solanum lycopersicum (Slyc), for the model Arabidopsis thaliana (Ath), and for the tree species Eucalyptus grandis (Egr) and Populus trichocarpa (Potri). The pathway was modified from Widhalm and Rhodes [ 89]. relatively more genes in H. impetiginosus are the ones that lead to functional studies. Extensive documentation of quality through- the synthesis of 1,4-Dihydroxy-2-naphtoyl-CoA and of 2-Phytyl- out the assembly process was provided showing that accept- 1,4-naphthoquinone. The availability of sequences for these able continuity was reached and that the fragmentation of the genes may open new avenues for biotechnological products and final sequence mostly derives from loss of information on high- for a better understanding of their ecological roles. copy families of long interspersed repeats or the presence of low-copy segmental duplications likely recently evolved with higher sequence similarity to the consensus sequence. Cer- Re-use potential tainly, there are still inaccuracies at the base and assembly lev- els, but all efforts were made to deliver results to the end user We have reported a well-curated but still unfinished genome with the appropriate documentation, making this initial read assembly for Handroanthus impetiginosus, a highly valued, eco- set, sequence, and annotations a primary and reliable starting logically keystone tropical timber and a species rich in nat- grounds for further improvement. ural products. The fragmentation of this preliminary assem- We have documented in detail the main features of the re- bly might be still be limiting for deeper insights of whole- ported assembly. The total assembly size of scaffolds ≥2 kbp in genome comparative analyses or studies of genome evolu- length is 90% of the flow cytometry–determined genome size, a tion [91], although we think that such studies may be carried remarkable accomplishment, we believe, given the anticipated out using this assembly at least at the gene level or gene- difficulties in assembling such a repetitive and highly heterozy- family level. Nevertheless, the broad validation performed pro- gous diploid genome based exclusively on short-read sequenc- vides a useful genomic resource for genetic and functional anal- ing. The percentage of base pairs in scaffolds with ≥20 kbp ysis, including, but not limited to, downstream applications is 83% (461 Mbp of 557 Mbp) of the empirically determined such as variant calling, molecular markers development, and Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of the Pink Ipe, ˆ a neotropical forest tree 13 genome size, which corresponds to 92% of the assembled total Supporting data and summary outputs for the main analyses in size (461 Mbp of 503 Mbp). Using 20 kbp as an approximate value this Data Note are available via the GigaScience repository, GigaDB of the longest plant gene length, this result shows that 60% of [92]. The Perl script that automated the read set from mate- the assembly is accessible for reliable gene annotation. Further- pair sequencing preprocessing (TrimAdaptor.pl) was uploaded more, the N50/NG50 (41 kbp/34 kbp) contig length is longer than to GigaDB under permission of the original authors at the High- 30 kbp, which has been suggested to be an adequate minimum Throughput Sequencing and Genotyping Center Unit of the Uni- threshold for high utility of a genome assembly [79]. The per- versity of Illinois Urbana-Champaign. centage of documented gaps in scaffolds is only 5.3%, and the few misassembled signatures present in the assembly were fully documented based on acceptable metrics such as fragment cov- erage distribution error (FCD error). Less than 8% (2473 of 31 Additional file 668) of annotated gene models were found to overlap contig ends, indicating that very few are likely to be interrupted in this Table S1: Summary of the sequence data generated for the unfinished assembly. No allelic variants were found at contig genome assembly of Handroanthus impetiginosus based on the ends, suggesting that interruption of continuity and contiguity ALLPATHS-LG algorithm. in the assembly is not related to differences between haplotypes, Figure S1: Flow cytometry results of the sequenced tree UFG- therefore providing a valuable resource for variant calling and 1of H. impetiginosus. Flow cytometry estimate of the nuclear DNA functional analysis. More than 86% (27 380 of 31 668) of the content was carried out using young leaf tissue on a BD Accuri gene models represented in the assembly have external eviden- C6 Plus personal flow cytometer. Pisum sativum (genome size 9.09 tial support measured by PASA-validated EST alignments from pg/2C or ∼4380 Mb/1C) was used as standard for comparison RNA-Seq or high-coverage alignments with known plant pro- (M2). The estimate of nuclear DNA content for H. impetiginosus teins (>90% coverage). Furthermore, 80% (25 369 of 31 668) of (M1) averaged over 10 readings was 1.155 pg/2C or 557.3 ± 39 transcripts have conceptual translation that contains protein Mb/1C. domain annotation, excluding those associated to TEs. Finally, a Figure S2: Overview of the analytical pipeline with the bioin- summary of BUSCO analysis indicates that the detected number formatics steps and tools employed for genome (black arrows) of plant single-copy orthologs represents 90% of the searched and transcriptome assembly (red arrows), and for gene predic- profiles (867 of 956), while only 6% are missing and 3% are tion and annotation (blue arrows). Bioinformatics programs are fragmented. indicated in italic, blue, and the main file formats in red. The This is the first well-curated genome for a Neotropical for- input sequences are highlighted in yellow boxes and the main est tree and the first one reported for a member of the Bignoni- products in green. aceae family. Besides expanding the opportunities for compara- Figure S3: Distribution and characterization of simple se- tive genomic studies by including an overlooked taxonomic fam- quence repeats in Handroanthus impetiginosus genome. (A) His- ily, the availability of this genome assembly will foster func- togram of different motifs ranging from 1 to 6 bp. (B) Distribution tional studies with new targets and allow the development of the simple sequence repeat length detected in the genome as- and application of robust sets of genome-wide SNP genotyp- sembly. ing tools to support multiple population genomics analyses Figure S4: Comparison of the gene feature parameters, in H. impetiginosus and related species of the Tabebuia Alliance. such as number and length, between H. impetiginosus and the This group includes several of the most ecologically and eco- other selected dicot plant across distinct lineages of Rosids (A. nomically important timber species of the American tropics. thaliana and P. trichocarpa) and Asterids (E. guttata and S. lycoper- Going beyond the species-specific significance of these results, sicum). Frequency histograms are shown according to the whole- this study paves the way for developing similar genomic re- genome gene content annotation for (A) the complete predicted sources for other Neotropical forest trees of equivalent rele- gene structure, (B) exons, and (C) introns. Dashed vertical lines vance. This, in turn, will open exceptional prospects to empower are the average lengths for the gene features. a higher-level understanding of the evolutionary history, species Figure S5: Histograms for Gene Ontology broader term anno- distribution, and population demography of the still largely ne- tations in the H. impetiginosus genome assembly. Terms for the glected forest trees of the mega-diverse tropical biomes. Further- Biological Process ontology were summarized with WEGO using more, this genome assembly provides a new resource for ad- the second tree level setting. The Pearson chi-square test was vances in the current integration between genomics, transcrip- applied to indicate significant relationships between H. impetigi- tomics, and metabolomics approaches for exploration of the nosus and the lamid Erythranthe guttata regarding the number of enormous structural diversity and biological activities of plant- genes (at α ≥ 5%). (A) Terms displaying a remarkable relationship derived compounds. between the 2 datasets; (B) terms with a significant difference between the 2 datasets. Figure S6: Same as Fig. S6 but showing comparison between numbers of genes assigned to GO broader terms for H. impetigi- Availability of supporting data nosus and the lamid Olea europaea. Sequences for the genome and assembly, along with gene Figure S7: Sequence length distribution from the assemblies content annotation and the raw sequencing reads, have of H. impetiginosus and the other 2 highly heterozygous trees of been deposited into GenBank, BioProject PRJNA324125. This the genus Quercus. Figure shows density plots for the size of scaf- Whole Genome Shotgun (WGS) project has been deposited at folds 2 kbp or longer in the 3 assemblies. Contigs metrics were DDBJ/ENA/GenBank under the accession NKXS00000000. The computed by cutting at each gap (of at least 25 base pairs, i.e., 25 version described in this paper is version NKXS01000000. or more Ns). Scaffolds and contigs length were plotted using the BioSample for WGS is SAMN05195323, and the corresponding common logarithm to respond to skewness toward large values. SRA run accessions are SRR3624821–SRR3624825. BioSample for File S1: Evidence adopted to support protein-coding loci iden- RNA-Seq is SAMN07346903, with SRA run accession SRR5820886. tification and assignment in the H. impetiginosus genome assem- Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 14 Silva-Junior et al. bly. Two qualifiers—high confidence and low confidence—were genomic research. We also thank Dr. Gabriela Ferreira Nogueira added to the locus based on the reported evidence. and Andre´ Luis X. de Souza for their help with flow cytometry File S2: Genome assembly metrics from the assemblies analysis. of H. impetiginosus and the other 2 highly heterozygous trees of the genus Quercus. Comparison between met- rics was based on the assemblathon stats script part of References the assemblathon2-analysis package (https://github.com/ 1. Goodstein DM, Shu SQ, Howson R et al. Phytozome: a com- ucdavis-bioinformatics/assemblathon2-analysis). Metrics were parative platform for green plant genomics. Nucleic Acids computed for scaffolds 2 kbp or longer in length. Genomic Res 2012;40(D1):D1178–86. sequences in scaffolds for Quercus lobata was obtained from 2. Kang YJ, Lee T, Lee J et al. Translational genomics for https://valleyoak.ucla.edu/genomicresources/(accessed on 20 plant breeding with the genome sequence explosion. Plant September 2017). For Quercus rubra, genomic sequences in Biotechnol J 2016;14(4):1057–69. scaffolds were downloaded from the European Nucleotide 3. Bevan M, Walsh S. The Arabidopsis genome: a foundation for Archive repository, accessions LN776247–LN794156. plant research. Genome Res 2005;15(12):1632–42. 4. Morrell PL, Buckler ES, Ross-Ibarra J. Crop genomics: ad- Abbreviations vances and applications. Nat Rev Genet 2012;13(2):85–96. 5. Varshney RK, Glaszmann JC, Leung H et al. More ge- BLASTP: Basic Local Alignment Search Tool for Proteins; BLAT: nomic resources for less-studied crops. Trends Biotechnol BLAST-like alignment tool; CDS: coding DNA sequence; EC: 2010;28(9):452–60. Enzyme Comissioned Number; EST: Expressed Sequence Tag; 6. Myburg AA, Grattapaglia D, Tuskan GA et al. The genome of GATK: Genome Analysis Toolkit; GO: Gene Ontology; LINE: long Eucalyptus grandis. Nature 2014;510(7505):356–62. interspersed nuclear elements; LTR: long terminal repeats; MITE: 7. Tuskan GA, DiFazio S, Jansson S et al. The genome of miniature inverted–repeat transposable elements; mRNA: mes- black cottonwood, Populus trichocarpa (Torr. & Gray). Science senger RNA; PASA: Program to Assemble Spliced Alignment; 2006;313(5793):1596–604. REAPR: Recognition of Errors in Assemblies using Paired Reads; 8. Neale DB, Wegrzyn JL, Stevens KA et al. Decoding the mas- SINE: short interspersed nuclear elements; SNP: single nu- sive genome of loblolly pine using haploid DNA and novel cleotide polymorphism; SSPACE: SSAKE-based Scaffolding of assembly strategies. Genome Biol 2014;15:R59. Pre-Assembled Contigs after Extension; TE: transposable ele- 9. Nystedt B, Street NR, Wetterbom A et al. The Norway spruce ment. genome sequence and conifer genome evolution. Nature 2013;497(7451):579–84. Competing interests 10. Moghe G, Last R. Something old, something new: Conserved enzymes and the evolution of novelty in plant specialized The authors declare that they have no competing interests. metabolism. Plant Physiol 2015:169(3):1512–23. 11. Stone R. Lifting the veil on Traditional Chinese Medicine. Sci- ence 2008;319(5864):709–10. Funding 12. Chappell J, DellaPenna D, O’Connor S. Specific aims for This work was supported by competitive grants from CNPq to medicinal plant genomics resource. Med Plants Genomics R.G.C. (project no. 471366/2007–2, Rede Cerrado CNPq/PPBio Resour 2017, http://medicinalplantgenomics.msu.edu/. Ac- project no. 457406/2012–7, and Procad/Capes project # cessed 27 September, 2017. 88881.068425/2014-01), to E.N. (CNPq Proc. 476709/2012– 13. Brousseau L, Tinaut A, Duret C et al. High-throughput 1), and to D.G. (PRONEX FAP-DF Project Grant “NEXTREE” transcriptome sequencing and preliminary functional anal- 193.000.570/2009). R.G.C. and D.G. have been supported by pro- ysis in four Neotropical tree species. BMC Genomics ductivity grants from CNPq, which we gratefully acknowledge. 2014;15(1):238. O.B.S.Jr. has been supported by an EMBRAPA doctoral fellowship 14. Olsson S, Seoane-Zonjic P, Bautista Ro et al. Develop- and was an Affiliate Researcher at Lawrence Berkeley National ment of genomic tools in a widespread tropical tree, Laboratory (LBNL), Berkeley, California, at the time of this Symphonia globulifera L.f.: a new low-coverage draft research. genome, SNP and SSR markers. Mol Ecol Resour 2017;17(4): 614–30. 15. Cadena-Gonzalez ´ A, Sorensen M, Theilade I. Use and valu- Author contributions ation of native and introduced medicinal plant species in O.B.S.Jr. performed sequence data analysis and genome as- Campo Hermoso and Zetaquira, Boyaca, ´ Colombia. J Ethno- sembly and, together with E.N., carried out transcriptome and biol Ethnomed 2013;9(1):23. protein-coding gene annotation. R.C. and D.G. conceived the 16. Bodker G, Bhat KKS, Burley J et al. Medicinal plants for for- project, collected samples, extracted genomic DNA and RNA, est conservation and health care. Food Agricult Org UN 1997, carried out flow cytometry analysis, and supervised the project. http://www.fao.org/3/a-w7261e.pdf. Accessed 27 September, All authors were involved in discussions, writing, and editing. 2017. All authors read and approved the final manuscript. 17. Braga AC, Reis AMM, Leoi LT et al. Development and characterization of microsatellite markers for the tropical tree species Tabebuia aurea (Bignoniaceae). Mol Ecol Notes Acknowledgements 2007;7(1):53–56. 18. Liu B, Shi Y, Yuan J et al. Estimation of genomic charac- O.B.S.Jr. thanks D. M. Goodstein and the members of the Phy- teristics by analyzing k-mer frequency in de novo genome tozome team at the LBNL/Joint Genome Institute (JGI) for their projects. arXiv.org 2013. arXiv:13082012. valuable help and support in working with the JGI pipelines for Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of the Pink Ipe, ˆ a neotropical forest tree 15 19. Schulze M, Grogan J, Uhl C et al. Evaluating ipe (Tabebuia, 39. Slater GS, Birney E. Automated generation of heuristics Bignoniaceae) logging in Amazonia: sustainable manage- for biological sequence comparison. BMC Bioinformatics ment or catalyst for forest degradation? Biol Conserv 2005;6:31. 2008;141(8):2071–85. 40. RepeatMasker Open-4.0. http://www.repeatmasker.org. Ac- 20. Inagaki R, Ninomiya M, Tanaka K et al. Synthesis and cessed 27 September, 2017. cytotoxicity on human leukemia cells of furonaphtho- 41. UniProt Consortium. UniProt: a hub for protein information. quinones isolated from tabebuia plants. Chem Pharmaceut Nucleic Acids Res 2014;43(D1):D204–12. Bull 2013;61(6):670–3. 42. Salamov AA, Solovyev VV. Ab initio gene finding in 21. Park BS, Kim JR, Lee SE et al. Selective growth-inhibiting ef- Drosophila genomic DNA. Genome Res 2000;10(4):516–22. fects of compounds identified in Tabebuia impetiginosa inner 43. Solovyev V, Kosarev P, Seledsov I et al. Automatic annotation bark on human intestinal bacteria. J Agricult Food Chem of eukaryotic genes, pseudogenes and promoters. Genome 2005;53(4):1152–7. Biol 2006;7(Suppl 1):S10.1–12. 22. Collevatti RG, Dornelas MC. Clues to the evolution of genome 44. Yeh RF, Lim LP, Burge CB. Computational inference of ho- size and chromosome number in Tabebuia alliance (Bignoni- mologous gene structures in the human genome. Genome aceae). Plant Syst Evol 2016;302(5):601–7. Res 2001;11(5):803–16. 23. Aronesty E. Comparison of sequencing utility programs. 45. Haas BJ, Delcher AL, Mount SM et al. Improving the Arabidop- Open Bioinformatics J 2013;7:1–8. sis genome annotation using maximal transcript alignment 24. Langmead B, Trapnell C, Pop M et al. Ultrafast and memory- assemblies. Nucleic Acids Res 2003;31(19):5654–66. efficient alignment of short DNA sequences to the human 46. Jones P, Binns D, Chang HY et al. InterProScan 5: genome- genome. Genome Biol 2009;10(3). scale protein function classification. Bioinformatics 25. Marcais G, Kingsford C. A fast, lock-free approach for effi- 2014;30(9):1236–40. cient parallel counting of occurrences of k-mers. Bioinfor- 47. Braga AC, Collevatti RG. Temporal variation in pollen disper- matics 2011;27(6):764–70. sal and breeding structure in a bee-pollinated Neotropical 26. Vurture GW, Sedlazeck FJ, Nattestad M et al. GenomeScope: tree. Heredity 2011;106(6):911–9. fast reference-free genome profiling from short reads. Bioin- 48. Sork VL, Fitz-Gibbon ST, Puiu D et al. First draft assembly formatics 2017;33(14):2202–20. and annotation of the genome of a California endemic oak 27. Gnerre S, MacCallum I, Przybylski D et al. High-quality draft Quercus lobata nee (Fagaceae). G3 (Bethesda) 2016;6(11):3485– assemblies of mammalian genomes from massively paral- 95. lel sequence data. Proc Natl Acad Sci U S A 2011;108(4): 49. Luo RB, Liu BH, Xie YL et al. SOAPdenovo2: an empirically 1513–8. improved memory-efficient short-read de novo assembler. 28. Flutre T, Duprat E, Feuillet C et al. Considering transposable Gigascience 2012;1(1):18. element diversification in de Nnvo annotation approaches. 50. Zimin AV, Marcais G, Puiu D et al. The MaSuRCA genome as- PLoS One 2011;6(1). sembler. Bioinformatics 2013;29(21):2669–77. 29. Wicker T, Sabot F, Hua-Van A et al. A unified classification 51. Kajitani R, Toshimoto K, Noguchi H et al. Efficient de novo system for eukaryotic transposable elements. Nat Rev Genet assembly of highly heterozygous genomes from whole- 2007;8(12):973–82. genome shotgun short reads. Genome Res 2014;24(8):1384– 30. Hoede C, Arnoux S, Moisset M et al. PASTEC: an auto- 95. matic transposable element classification tool. PLoS One 52. Malinsky M, Simpson JT, Durbin R. Trio-sga: facilitating de 2014;9(5):e91929. novo assembly of highly heterozygous genomes with parent- 31. Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0 child trios. bioRxiv 2016, doi: https://doi.org/10.1101/051516. (2013-2015). 2015, http://www.repeatmasker.org. Accessed 53. Boetzer M, Henkel CV, Jansen HJ et al. Scaffolding 27 September, 2017. pre-assembled contigs using SSPACE. Bioinformatics 32. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexi- 2011;27(4):578–9. ble trimmer for Illumina sequence data. Bioinformatics 54. Nadalin F, Vezzi F, Policriti A. GapFiller: a de novo assembly 2014;30(15):2114–20. approach to fill the gap within paired reads. BMC Bioinfor- 33. Xie Y, Wu G, Tang J et al. SOAPdenovo-Trans: de novo tran- matics 2012;13(Suppl 14):S8. scriptome assembly with short RNA-Seq reads. Bioinformat- 55. Hunt M, Kikuchi T, Sanders M et al. REAPR: a universal tool ics 2014;30(12):1660–6. for genome assembly evaluation. Genome Biol 2013;14(5): 34. Grabherr MG, Haas BJ, Yassour M et al. Full-length tran- R47. scriptome assembly from RNA-Seq data without a reference 56. Ponstingl H, Ning ZM. SMALT. 2010 - 2015 Genome Research genome. Nat Biotechnol 2011;29(7):644–U130. Ltd. 2016, http://www.sanger.ac.uk/science/tools/smalt-0. 35. Gilbert D. EvidentialGene: mRNA Transcript Assem- Accessed 27 September, 2017. bly Software. EvidentialGene: Evidence Directed Gene 57. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res Construction for Eukaryotes. 2013, http://arthropods. 2002;12(4):656–64. eugenes.org/EvidentialGene/. Accessed 27 September, 2017. 58. Stanke M, Keller O, Gunduz I et al. AUGUSTUS: ab ini- 36. Schmutz J, McClean P, Mamidi S et al. A reference genome for tio prediction of alternative transcripts. Nucleic Acids Res common bean and genome-wide analysis of dual domesti- 2006;34:W435–9. cations. Nat Genet 2014;46(7):707–13. 59. Ram´ırez-Sanc ´ hez O, Per ´ ez-Rodr´ıguez P, Delaye L et al. Plant 37. Shu S, Goodstein DM, Hayes D et al. JGI plant genomics proteins are smaller because they are encoded by fewer ex- gene annotation pipeline. SciTech Connect 2017, https:// ons than animal proteins. Genomics Proteomics Bioinfor- www.osti.gov/scitech/biblio/1241222-jgi-plant-genomics- matics 2016;14(6):357–70. gene-annotation-pipeline. Accessed 27 September, 2017. 60. Bradnam KR, Fass JN, Alexandrov A et al. Assemblathon 2: 38. Gish W, States D. Identification of protein coding regions by evaluating de novo methods of genome assembly in three database similarity search. Nat Genet 1993;3(3):266–72. vertebrate species. Gigascience 2013;2:10–10. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018 16 Silva-Junior et al. 61. Lam ET, Hastie A, Lin C et al. Genome mapping on 78. Sulpice R, Trenkamp S, Steinfath M et al. Network analysis of nanochannel arrays for structural variation analy- enzyme activities and metabolite levels and their relation- sis and sequence assembly. Nat Biotechnol 2012;30(8): ship to biomass in a large panel of Arabidopsis Accessions. 771–6. Plant Cell 2010;22(8):2872–93. 62. Ay F, Noble WS. Analysis methods for studying the 3D archi- 79. Hamilton JP, Robin Buell C. Advances in plant genome se- tecture of the genome. Genome Biol 2015;16:183. quencing. Plant J 2012;70(1):177–90. 63. Tørresen OK, Star B, Jentoft S et al. An improved genome 80. Barthelson R, McFarlin AJ, Rounsley SD et al. Plantagora: assembly uncovers prolific tandem repeats in Atlantic cod. modeling whole genome sequencing and assembly of plant BMC Genomics 2017;18(1):95. genomes. PLoS One 2011;6(12):e28436. 64. Wicker T, Sabot F, Hua-Van A et al. A unified classification 81. Hellsten U, Wright KM, Jenkins J et al. Fine-scale varia- system for eukaryotic transposable elements. Nat Rev Genet tion in meiotic recombination in Mimulus inferred from 2007;8(12):973–82. population shotgun sequencing. Proc Natl Acad Sci U S A 65. Morgante M, Hanafey M, Powell W. Microsatellites are 2013;110(48):19478–82. preferentially associated with nonrepetitive DNA in plant 82. Cruz F, Julca I, Gomez-Garrido ´ J et al. Genome sequence of genomes. Nat Genet 2002;30(2):194–200. the olive tree, Olea europaea. Gigascience 2016;5(1):29. 66. Wang Q, Fang L, Chen J et al. Genome-wide mining, char- 83. Plomion C, Aury JM, Amselem J et al. Decoding the acterization, and development of microsatellite markers in oak genome: public release of sequence data, assembly, gossypium species. Sci Rep 2015;5:10638. annotation and publication strategies. Mol Ecol Resour 67. Sonah H, Deshmukh R, Sharma A et al. Genome-wide distri- 2016;16(1):254–65. bution and organization of microsatellites in plants: an in- 84. Park B-S, Lee H-K, Lee S-E et al. Antibacterial activity of sight into marker development in brachypodium. PLoS One Tabebuia impetiginosa martius ex DC (Taheebo) against Heli- 2011;6(6):e21298. cobacter pylori. J Ethnopharmacol 2006;105(1–2):255–62. 68. Bernardi G. Isochores and the evolutionary genomics of ver- 85. Gomez Castellanos R, Prieto J, Heinrich M. Red Lapacho tebrates. Gene 2000;241(1):3–17. (Tabebuia impetiginosa)—a global ethnopharmacological com- 69. Mizuno M, Kanehisa M. Distribution profiles of GC content modity? J Ethnopharmacol 2009;121(1):1–13. around the translation initiation site in different species. 86. Byeon S, Chung J, Lee Y et al. In vitro and in vivo anti- FEBS Lett 1994;352(1):7–10. inflammatory effects of taheebo, a water extract from 70. Amit M, Donyo M, Hollander D et al. Differential GC content the inner bark of Tabebuia avellanedae. J Ethnopharmacol between exons and introns establishes distinct strategies of 2008;119(1):145–52. splice-site recognition. Cell Rep 2012;1(5):543–56. 87. Koyama J, Morita I, Tagahara K et al. Cyclopentene dialdehydes 71. Wendel JF, Greilhuber J, Dolezel J et al. Plant Genome Diver- from Tabebuia impetiginosa. Phytochemistry 2000;53(8):869– sity Volume 1 - Plant Genomes, Their Residents, and Their 72. Evolutionary Dynamics, vol. 1. Wien: Springer-Verlag; 2012. 88. Hussain H, Krohn K, Ahmad VU et al. Lapachol: an overview. 72. JiaYan W, JingFa X, LingPing W et al. Systematic analysis of Arkivoc 2007;2007(2):145. intron size and abundance parameters in diverse lineages. 89. Widhalm J, Rhodes D. Biosynthesis and molecular actions of Sci China Life Sci 2013;56(10):968–74. specialized 1,4-naphthoquinone natural products produced 73. Yu J, Yang Z, Kibukawa M et al. Minimal introns are not by horticultural plants. Horticult Res 2016;3:16046. “junk.” Genome Res 2002;12(8):1185–9. 90. Romagnoli M, Segoloni E, Luna M et al. Wood colour in Lapa- 74. Zhu J, He F, Wang D et al. A novel role for minimal introns: cho (Tabebuia serratifolia): chemical composition and indus- routing mRNAs to the cytosol. PLoS One 2010;5(4):e10144. trial implications. Wood Sci Technol 2013;47(4):701–16. 75. Simao FA, Waterhouse RM, Ioannidis P et al. BUSCO: assess- 91. Alkan C, Sajjadian S, Eichler EE. Limitations of next- ing genome assembly and annotation completeness with generation genome sequence assembly. Nat Methods single-copy orthologs. Bioinformatics 2015;31(19):3210–2. 2011;8(1):61–65. 76. Ye J, Fang L, Zheng H et al. WEGO. a web tool for plot- 92 Silva-Junior OB, Grattapaglia D, Novaes E et al. Supporting ting GO annotations. Nucleic Acids Res 2006;34(Web Server data for “Genome assembly of the pink Ipe( ˆ Handroanthus issue):W293–7. impetiginosus, Bignoniaceae), a highly valued, ecologically key- 77. Cooper GM. The Cell, 2nd edition. A Molecular Approach. stone Neotropical timber forest tree.” GigaScience Database Sunderland, MA: Sinauer Associates; 2000. 2017. http://dx.doi.org/10.5524/100379. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4739364 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Journal

GigaScienceOxford University Press

Published: Jan 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off