Genome sequence of the Japanese oak silk moth, Antheraea yamamai: the first draft genome in the family Saturniidae

Genome sequence of the Japanese oak silk moth, Antheraea yamamai: the first draft genome in the... Background: Antheraea yamamai, also known as the Japanese oak silk moth, is a wild species of silk moth. Silk produced by A. yamamai, referred to as tensan silk, shows different characteristics such as thickness, compressive elasticity, and chemical resistance compared with common silk produced from the domesticated silkworm, Bombyx mori. Its unique characteristics have led to its use in many research fields including biotechnology and medical science, and the scientific as well as economic importance of the wild silk moth continues to gradually increase. However, no genomic information for the wild silk moth, including A. yamamai, is currently available. Findings: In order to construct the A. yamamai genome, a total of 147G base pairs using Illumina and Pacbio sequencing platforms were generated, providing 210-fold coverage based on the 700-Mb estimated genome size of A. yamamai. The assembled genome of A. yamamai was 656 Mb (>2 kb) with 3675 scaffolds, and the N50 length of assembly was 739 Kb with a 34.07% GC ratio. Identified repeat elements covered 37.33% of the total genome, and the completeness of the constructed genome assembly was estimated to be 96.7% by Benchmarking Received: 18 April 2017; Revised: 22 May 2017; Accepted: 16 November 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Kim et al. Universal Single-Copy Orthologs v2 analysis. A total of 15 481 genes were identified using Evidence Modeler based on the gene prediction results obtained from 3 different methods (ab initio, RNA-seq-based, known-gene-based) and manual curation. Conclusions: Here we present the genome sequence of A. yamamai, the first genome sequence of the wild silk moth. These results provide valuable genomic information, which will help enrich our understanding of the molecular mechanisms relating to not only specific phenotypes such as wild silk itself but also the genomic evolution of Saturniidae. Keywords: Antheraea yamamai; genome assembly; Japanese silk moth; Japanese oak silk moth; wild silkworm we added 2 long read sequencing platforms, Moleculo (Illumina Data Description synthetic long read) and RS II (Pacific Bioscience). Tables 1–3 Antheraea yamamai (NCBI Taxonomy ID: 7121), also known as show a summary of generated data for each library used in this the Japanese oak silk moth, is a wild silk moth species belong- study. RNA-seq libraries were also constructed for 10 different ing to the Saturniidae family (Fig. 1). Silk moths can be cate- tissues (hemocyte, malpighian tube, midgut, fat body, anterior- gorized into 2 families—Bombycidae and Saturniidae. Saterni- middle/silk gland, posterior/silk gland, head, integument, testis, idae has been estimated to contain approximately 1861 species, ovary) with 3 biological replicates following standard manufac- with 162 genera [1], and is known as the largest family in the turer protocol (Illumina, San Diego, CA, USA). For this, more than Lepidoptera. Among the many species in family Saturniidae, 100 individual A. yamamai samples in 5 instar stages from the only a few species, including A. yamamai, can be utilized for silk same breeding line were used for tissue anatomy, and 3 samples production. Previous phylogenetic studies have shown that the from each tissue were selected based on the quality of extracted family Saturniidae shares common ancestors with the family RNA. Details of transcriptome library construction are shown in Sphingidae, including the hawk moth (Macroglossum stellatarum), the Supplementary Data. Information of libraries and generated and the Bombycidae family, including the most representative data is provided in Table 4, and a total of 147 Gb of genomic data silkworm, Bombyx mori [2]. The estimated divergence time be- and 76 Gb of transcriptomic data was generated for this study. tween A. yamamai and B. mori is 84 million years ago (MYA), mak- ing it similar to the 88 MYA estimated divergence time between the human and mouse [3, 4]. Genome assembly and evaluation A. yamamai produces a specific silk called tensan silk [ 5], which shows distinctive characteristics compared with com- Before conducting genome assembly, we conducted k-mer dis- mon silk from B. mori, including characteristics such as thick- tribution analysis using a 350-bp paired-end library in order ness, bulkiness, compressive elasticity, and resistance to dyeing to estimate the size and characteristics of the A. yamamai chemicals [6–8]. These characteristics have received the atten- genome. The quality of our generated raw data was checked tion of researchers as a new biomaterial for use in various fields using FASTQC (FastQC, RRID:SCR 014583)[16]. Sequencing ar- [9–11]. Additionally, tensan silk also has been studied for its ap- tifacts such as adapter sequences and low-quality bases were plications to human health [12–15]. However, despite the poten- removed using Trimmomatic (Trimmomatic, RRID:SCR 011848) tial importance of the wild silk moth in research and economic [17]. Jellyfish [ 18] was used to count the k-mer frequency for es- fields, no whole genomic information is currently available for timation of the genome size of A. yamamai. Figure 2 shows the this or any other species from the family Saturniidae. 19-mer distribution of the A. yamamai genome using a 350-bp In this study, we present the annotated genome sequence of paired-end library. In the 19-mer distribution, the second peak, A. yamamai, the first published genome in the family Saturni- at approximately half the coverage value (x-axis) of the main idae, with transcriptome datasets collected from 10 different peak, indicates heterozygosity. Although the inbred line used in body organ tissues. These data will be a fundamental resource this study was the single pair sib-mating maintained for more for future studies and provide more insight into the genome evo- than 10 generations, high heterozygosity still remains. This phe- lution and molecular phylogeny of the family Saturniidae. nomenon has been observed in a previous genomic study of the Diamondback moth (Plutella xylostella), and sustained het- erozygosity as an important genomic characteristic was hypoth- Sample preparation and sequencing esized to be a result of environmental adaption [19]. The un- For whole-genome sequencing, we selected 1 male sample (Ay- derlying mechanism of the sustained heterozygosity is unclear, 7-male1) from a breeding line (Ay-7) of A. yamamai raised at the but associative overdominance can be one of the candidate ex- National Academy of Agricultural Science, Rural Development planations of this phenomenon [20, 21]. Based on the result of Administration, Korea. In lepidopterans, males are homoga- 19-mer distribution analysis, the genome size of A. yamamai metic (ZZ), and selecting a male sample can reduce the complex- was estimated to be 709 Mb. However, this size might be larger ity of assembly from excessive repeats on the W chromosome than the real genome size of A. yamamai because high het- in females. For genomic library construction, we removed the erozygosity could affect the estimation of genome size based guts of A. yamamai to prevent contamination of genomes from on the K-mer distribution. Next, we conducted error correc- other organisms such as gut microbes and oak, the main food tion on Illumina paired-end libraries using the error correction source of A. yamamai. Details of the sample preparation process module of Allpaths-LG [22] before the initial contig assembly used in this study are presented in the Supplementary Data. Ge- process (ALLPATHS-LG, RRID:SCR 010742). After error correction, nomic DNA was extracted using a DNeasy Animal Mini Kit (Qi- initial contig assembly with 350-bp and 700-bp libraries was con- agen, Hilden, Germany), and the quality of extracted DNA was ducted using SOAP denovo2 [23] with the parameter option set at checked using trenean, picogreen assay, and gel electrophore- K = 19; this approach showed the best assembly statistics com- sis (1% agarose gel/40 ng loading). After quality control process- pared with other assemblers and parameters (SOAPdenovo2, ing, we were left with a total of 61.5 ug of A. yamamai DNA RRID:SCR 014986). Quality control processing for mate-pair li- for genome sequencing. Using standard Illumina whole-genome braries and scaffolding was conducted using Nxtrim [24]and shotgun (WGS) sequencing protocol (paired-end and mate-pair), SSPACE (SSPACE, RRID:SCR 011848)[25], respectively. At each Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of Antheraea yamamai 3 Figure 1: Photograph of Antheraea Yamamai. From left: larva, cocoon, and adult A. yamamai, respectively. Green color is one of the representative characteristics of tensan silk. Table 1: Summary statistics of generated whole-genome shotgun sequencing data using Illumina Nextseq 500 Library Read Reads retained name Library type Insert size Platform length No. of reads Total base, bp after trimming 350 bp Paired-end 350 bp Nextseq500 151 293 176 268 44 269 616 468 291 070 362 700 bp Paired-end 700 bp Nextseq500 151 246 945 900 37 288 830 900 244 698 580 3 Kbp Mate-pair 3 Kbp Nextseq500 76 284 204 762 21 599 561 912 195 095 164 6 Kbp Mate-pair 6 Kbp Nextseq500 76 246 238 370 18 714 116 120 152 496 372 9 Kbp Mate-pair 9 Kbp Nextseq500 76 239 919 538 18 233 884 888 148 612 724 Total 1 310 484 838 140 106 010 288 1 031 973 202 Table 2: Summary statistics of generated Illumina synthetic long the N50 length of assembly was 739 Kb with a 34.07% GC ratio. read (Moleculo) library To evaluate the quality of the assembled genome, we conducted Benchmarking Universal Single-Copy Orthologs (BUSCO) analy- 500–1499 bp ≥1500 bp sis [29] using BUSCO v2.0 with insecta odb9 including 1658 BUS- COsfrom42species (BUSCO, RRID:SCR 015008). From BUSCO No. of assembled reads 302 132 342 738 analysis, 96.7% of BUSCOs were completely detected in the as- No. of bases in assembled read 268 853 717 1 205 349 082 sembled genome (1576: complete and single-copy; 27: complete N50 length of assembled read 960 4031 and duplicated) among 1658 tested BUSCOs. The numbers of fragmented and missing BUSCOs were 21 and 34, respectively. Table 3: Summary statistics of generated long reads data using Pacbio Based on the result of BUSCO analysis, the genome of A. yamamai RS II system presented here was considered properly constructed for down- stream analysis. No. of reads 1005,571 Total bases 5836 969 225 Repeat identification and comparative repeat analysis Length of longest (shortest) read 50 132 (50) Average read length 5804.63 To identify repeat elements of the A. yamamai genome, a cus- tom repeat library was constructed using RepeatModeler with RECON [30], RepeatScout [31], and TRF [32]. The resulting con- scaffolding step, SOAP Gapcloser [23] with -l 155 and -p 31 pa- structed custom repeat library for A. yamamai was further cu- rameters was repeatedly used to close the gaps within each rated using the CENSOR [33] search, and the curated library scaffold. In order to obtain a higher-quality genome assem- was employed in RepeatMasker [34] with Repbase [35]. Repeat- bly of A. yamamai, we employed several long read scaffolding Masker (RepeatMasker, RRID:SCR 012954) was conducted with strategies using SSPACE-LongRead [26]. First, we used an Illu- RMBlast and the “no is” option for skipping bacterial insertion mina synthetic long read sequencing platform called Moleculo, element check. Table 6 summarizes the proportion of identified which has been proven valuable for the study of highly het- mobile elements in the A. yamamai genome. The most preva- erozygous genomes in previous studies [27, 28]. After scaffolding lent repeat elements in the A. yamamai genome were LINE el- was performed using SSPACE-LongRead with Illumina synthetic ements (101 Mb, 15.31% of total genome), and total repeat el- long read data, the total number of assembled scaffolds was ef- ements accounted for 37.33% of the total genome. In order to fectively reduced from 398 446 to 24 558. The average scaffold compare the repeat elements of A. yamamai with those of other length was also extended from 1.7 Kb to 24.8 Kb. However, there genomes, we conducted the same process for 7 public genomes was no impressive improvement in N50 length (approximately that are close neighbors of A. yamamai—Aedes aegypti [36], Bom- 91 Kb to 112 Kb) of assembled scaffolds. Therefore, we employed byx mori [37], Danaus plexippus [38], Drosophila melanogaster [39], anothertypeoflongread data generatedfrom10cells of Pacbio Heliconius Melpomene [40], Melitaea cinxia [41], and Plutella xy- RS II system with P6-C4 chemistry. After final scaffolding pro- lostella [19]. Figure 3 displays the amount and proportion of iden- cessing using Pacbio long reads, the number of scaffolds was re- tified repeat elements from the 8 species. Despite the small duced to 3675, and N50 length was effectively extended from 112 genome size of B. mori, the total amount of identified SINE el- Kb to 739 Kb. Summary statistics of the assembled A. yamamai ements in the B. mori genome was 5.77 times larger than that genome are provided in Table 5. Final assembly of the A. yama- of A. yamamai. The top 5 expanded repeat elements in the A. mai genome was 656 Mb (>2 kb) long with 3675 scaffolds, and yamamai genome were DNA/RC, LINE/L2, LINE/RTE-BovB, DNA/ Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Kim et al. Table 4: Summary statistics of generated transcriptome data obtained from 6 organ tissues using Illumina platform Tissue Sample name Read length Read count Total base, bp Hemocyte Hemocyte 1 76 20 815 674 1 581 991 224 Hemocyte 2 76 26 704 666 2 029 554 616 Hemocyte 2 76 53 068 562 4 033 210 712 Malpighian tube Malpighi 1 76 22 635 428 1 720 292 528 Malpighi 2 76 24 893 788 1 891 927 888 Malpighi 3 76 45 213 164 3 436 200 464 Midgut Midgut 1 76 23 350 138 1 774 610 488 Midgut 2 76 24 597 972 1 869 445 872 Midgut 3 76 50 949 986 3 872 198 936 Head Head 1 76 26 526 276 2 015 996 976 Head 2 76 26 581 124 2 020 165 424 Head 3 76 40 900 456 3 108 434 656 Integument Skin 1 76 24 592 846 1 869 056 296 Skin 2 76 42 775 430 3 250 932 680 Skin 3 76 35 043 570 2 663 311 320 Fat body Fat Body 1 76 24 637 810 1 872 473 560 Fat Body 2 76 24 037 494 1 826 849 544 Fat Body 3 76 40 817 582 3 102 136 232 Anterior-middle/silk gland AM/Silk Gland 1 76 21 399 638 1 626 372 488 AM/Silk Gland 2 76 24 292 386 1 846 221 336 AM/Silk Gland 3 76 37 331 530 2 837 196 280 Posterior/silk gland P/Silk Gland 1 76 27 359 580 2 079 328 080 P/Silk Gland 2 76 23 300 962 1 770 873 112 P/Silk Gland 3 76 39 421 430 2 996 028 680 Testis Testis 1 76 40 890 404 3 107 670 704 Testis 2 76 45 733 846 3 475 772 296 Testis 3 76 44 985 224 3 418 877 024 Ovary Ovary 1 76 40 797 628 3 100 619 728 Ovary 2 76 40 409 752 3 071 141 152 Ovary 3 76 42 417 892 3 223 759 792 1e+10 1e+09 1e+08 1e+07 1e+06 100 000 0 20 40 60 80 100 120 multiplicity Figure 2: 19-mer distribution of A. yamamai genome using jellyfish with 350-bp paired-end whole genome sequencing data. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Number of distinct k-mers with given multiplicity Genome of Antheraea yamamai 5 Table 5: Summary statistics of the A. yamamai genome (>2 kb) erated transcriptome data from 10 organ tissues of A. yamamai were aligned to the assembled genome and gene information Assembled genome was predicted using Cufflinks (Cufflinks, RRID:SCR 014597)[48]. The longest CDS sequences were identified from Cufflinks re- Size, 1n 656 Mb sults using Transdecoder. For the homology-based approach, all GC level 34.07 known genes of the order Lepidoptera in the NCBI database were No. of scaffolds 3675 aligned using PASA [49]. Table 7 shows the gene prediction re- N50 of scaffolds, bp 739 388 sults from each method. Gene prediction results from differ- No. of bases in scaffolds, % 19 257 439 (2.93) ent prediction algorithms were combined using Evidence Mod- Longest (shortest) scaffolds, bp 3 156 949 (2003) eler [50], and a consensus gene set of the A. yamamai genome Average scaffold length, bp 178 657.53 was created. Manual curation was performed based on the 5 types of evidence (3 in silico, known protein, and RNA-seq) Table 6: Summary of identified repeat elements in the A. yamamai using IGV [51] and BLASTP (BLASTP, RRID:SCR 001010). Us- genome ing IGV with each gene evidence and comparing results with known genes via BLASTP, we mainly focused on removing false Repeat element No. of elements Length, % positively predicted genes that don’t have enough evidence. Merged and spliced gene structured were corrected by com- SINE 59 968 8 615 338 (1.30) paring the gene structure with known exon structure in the LINE 426 522 101 251 176 (15.31) NCBI NR database. In addition, fibroin and sericin genes, which LTR element 53 977 4 552 386 (0.69) couldn’t be properly predicted because of their high repeat mo- DNA element 512 760 69 071 227 (10.44) tif, were also manually identified with previously known se- Small RNA 43 645 6 691 619 (1.01) quences [42, 52] with RNA-seq data. The final gene set of the Simple repeat 135 989 6 256 839 (0.95) A. yamamai genome contains 15 481 genes. Summary statistics Low complexity 19 937 932 829 (0.14) for the consensus gene set are provided in Table 8.The aver- Unclassified 294 190 54 552 009 (8.25) age gene length was 11 016.34 bp with a 34.38% GC ratio, and the number of exons per gene was 5.64. In order to identify the function of predicted genes in A yamamai, 3 nonredun- TcMar-Mariner, and LINE/CR1. Among these, DNA/TcMar- dant sequence databases (Swiss-Prot [53], Uniref100 [53], and Mariner was the specifically expanded repeat element in A. NCBI NR [54]) as well as the gene information of 2 species (B. yamamai among the 8 species. In B. mori, SINE/tRNA-CR1, mori and D. melanogaster) were used for target databases us- LINE/Jockey, DNA/RC, LINE/CR1-Zenon, and LINE/RTE-BovB were ing BLASTP. Additionally, protein domain searches were con- the top 5 expanded repeat elements. When comparing the re- ducted on the consensus gene set using InterproScan5 (Inter- peat elements of A. yamamai with those of B. mori,which are ProScan, RRID:SCR 005829)[55]. Figure S1 shows the top 20 iden- both producers of the same type of silk, repeat elements showed tified terms from 7 different InterproScan5 analyses. Among the family and species-specific patterns in the 2 silk moth linages. various analyses conducted using InterproScan5, gene ontology Particularly, we found that the mariner repeat element, which analysis with the Pfam database showed that a large proportion was found specifically expanded in the A. yamamai genome, was of genes in the A. yamamai genome were related with the func- also included in the fibroin gene. A previous sequencing study tion of molecular binding, catalytic activity, internal component also showed that the mariner repeat element was inserted in the of membrane, metabolic process, oxidation-reduction process, 5’-end of the fibroin gene of A. yamamai [42]. Fibroin is the core and transmembrane transport. component of the silk protein found in the silk moth, and the physical characteristics of silk mainly depend on the types and unique repeat motifs of the fibroin [ 43]. This gene is known to have hundreds of tandem repeat motifs, and these kinds of tan- Comparative genome analysis dem repeats can be derived through transposable elements. This We used OrthoMCL [56] and Reciprocal Best Hit within BLASTP indicates that the mariner repeat element, specifically expanded for identification of gene family clusters and 1:1 orthologous in the A. yamamai genome, may play an important role in the gene sets. Gene information of 7 taxa (A. aegypti, B. mori, D. plex- development of the unique silk of A. yamamai, and the lineage- ippus, D. melanogaster, H. melpomene, M. cinxia, and P. xylostella), specific repeat elements may be one of the candidate evolution the same taxa used in repeat analysis, was employed for Or- forces related to host-specific phenotype during genome evolu- thoMCL with A. yamamai. A total of 17 406 gene family clusters tion. were constructed, and 3586 1:1 orthologous genes were identi- fied. Before conducting comparative genome analysis, we con- Gene prediction and annotation structed phylogenetic trees for the 8 species. In order to build Three different algorithms were used for gene prediction of the the phylogenetic tree, multiple sequence alignment for the 1:1 A. yamamai genome: ab initio, RNA-seq transcript-based, and orthologous genes of all 8 species was conducted using PRANK protein homology-based approaches. For ab initio gene predic- [57], and Gblocks [58] was used to obtain conserved blocks for the tion, Augustus (Augustus: Gene Prediction, RRID:SCR 008417) phylogenetic tree. Conserved block sequences were sequentially [44], Geneid [45], and GeneMarks-ET [46] were employed. Au- concatenated to obtain 1 consensus sequence for each species. gustus was trained using known genes of A. yamamai in the MEGA [59] was used for constructing Neighbor-Joining Trees NCBI database, and mapping information of RNA-seq data ob- (bootstrap 1000, maximum composite likelihood, transitions + tained from Tophat (TopHat, RRID:SCR 013035)[47] was also transversions, and gamma distributed option), and MrBayes (Mr- utilized for gene prediction. Geneid was used with predefined Bayes, RRID:SCR 012067)[60] was employed for the construc- parameters for Drosophila melanogaster. GeneMarks-ET was em- tion of Bayesian inference trees. To select the best evolution ployed using junction information of genes from transcriptome model for our data, Modeltest [61] was conducted and the GTR- data alignment. For RNA-seq transcript-based prediction, gen- based invariant model was chosen based on the AIC value of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Kim et al. A. aegyp A. yamamai B. mori D. plexippus D. melanogaster H. melpomene M. cinxia P. xylostella SINE LINE LTR element DNA element Small RNA Simple repeat Low complexity Unclassifed 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% A. aegyp A. yamamai B. mori D. plexippus D. melanogaster H. melpomene M. cinxia P. xylostella Figure 3: Amount and proportion of identified repeat element from 8 species including A. yamamai. A) Absolute amount of repeat element classified into 8 different categories. B) Proportion of each repeat element in identified total repeat element. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of Antheraea yamamai 7 Table 7: Summary statistics of ab initio, RNA-seq-based, and homology-based gene prediction results Evidence type Programs Element Total count Exon/gene Total length, bp Mean length, bp Gene 14 576 142 415 318 9770.53 Augustus 4.85 Exon 70 733 14 736 668 208.34 Gene 10 946 46 119 402 4213.35 ab initio Geneid 2.25 Exon 24 686 3 925 563 159.01 Gene 27 754 273 745 951 9863.29 GeneMarks-ET 5.50 Exon 152 660 30 847 503 202.06 Gene 36 213 840 429 061 23 207.94 RNA-seq Cufflinks Transdecoder 7.03 Exon 254 770 201 721 675 791.77 Known gene (NCBI lepidoptera) PASA (gmap) 44 561 22 484 151 504.57 Table 8: Summary statistics for the consensus gene set of the A. yamamai genome Element No. of elements Exon/gene Avg. length Total length Genome coverage, % Gene 15 481 11 016.34 170 543 958 25.78 5.64 Exon 87 346 1346.23 20 840 925 3.31 +1560 / -4795 A.aegypti +3861 / -6534 1/100 +6142 / -1649 D.melanogaster +2498 / -5028 P.xylostella +567 / -715 Specific B.Mori +6 / -3395 1:Multi 1/100 +203 / -1671 1/100 Multi:Multi 1:1 +938 / -1987 A.Yamamai +6 / -590 1/100 +749 / -1729 D.plexippus +77 / -878 1/100 +700 / -1602 M.cinxia +47 / -975 1/100 +469 / -2173 H.melpomene 02k 4k 6k 8k 10k Figure 4: Constructed phylogenetic tree and comparative gene family analysis. Node values indicate Bayesian posterior probability, bootstrap and gene expansion, and contraction value. Orange and blue colors indicate expansion and contraction, respectively. Bar chart indicates the number of genes cauterized into 4 groups (Specific, 1:Multi, Multi:Multi, and 1:1) using OrthoMCL. Modeltest. Figure 4 shows the constructed phylogenetic tree of Based on the constructed phylogenetic tree, gene family the 8 species using 3586 orthologous genes. The bootstrap value expansion and contraction analysis were conducted using a and Bayesian poster probability value of all nodes were 100 and 2-parameter model in CAFE[62]and thegenetreewas con- 1, respectively. The closest neighbor of A. yamamai was B. mori, structed using protein sequence via MEGA [59]. Figure 4 shows which is included in Bombycidae family; this result is consistent the result of gene family expansion and contraction analysis of with that of previous studies. Three butterfly species ( D. plexip- 8 species; 938 and 1987 gene families of A. yamamai and 567 and pus, M. cinxia, and H. meplmene) included in Nymphalidae family 715 gene families of B. mori were estimated to be expanded and were also shown to share a common ancestor with the families contracted from the common ancestors, respectively. Among Saturniidae and Bombycidae. these, 15 gene families in A. yamamai were estimated to be Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 8 Kim et al. AB Figure 5: Expansion of chorion gene in the A. yamamai genome. A, B) The gene trees of chorion A and B in the rapid expanded gene family cluster, respectively. Color of terminal node indicates each taxon identified in the gene family cluster. under rapid expansion during the evolution process. Functions micropyle region [64]. Only a very few, small aeropyle crowns of genes in the rapidly expanded gene families of A. yama- remain, and it is entirely different from the ancestral form of mai were transposase, fatty acid synthase, zinc finger protein, eggshell surface mostly covered by aeropyle crowns. These re- chorion (eggshell protein), reverse transcriptase, prostaglandin gional differences were known to be adjusted by regional differ- dehydrogenase, RNA-directed DNA polymerase, gag like protein, ences of filler genes during choriogenesis [ 66], and the additional juvenile hormone acid methyltransferase, facilitated trehalose regulations of related genes for choriogenesis have to be consid- transporter, and glucose dehydrogenase. Figure 5 shows the ered. This indicates that the specifically expanded chorion gene gene tree of 2 chorion gene (chorion class A and B) family clus- families of A. yamamai may be one of the remaining evolution- ters rapidly expanded in the A. yamamai genome. Chorion, called ary tracks in the genome of the Antheraea genus. However, fur- eggshell protein, composes the surface of the egg and protects ther functional studies must be conducted to resolve the limited the embryo from environmental threats such as desiccation, understanding about the relationship between these expanded flooding, freezing, infection of microorganisms, and physical de- chorion gene families and the current eggshell surface forma- struction. It also provides channels, such as aeropyle, that en- tion of A. yamamai. able gas exchange and maintain proper conditions for diapause The constructed genome of A. yamamai presented here is of the egg [63]. These diverse functions of eggshell are imple- the first announced genome in the family Saturniidae, and the mented by the specific eggshell structure, and the surface struc- karyotyping analysis using gamete in the metaphase showed ture of eggshell varies between species for the adaptation in a that the genome of A. yamamai consists of 31 chromosomes different environment. The ancestor of Antherea has the unique (Fig. 6). This constructed genome information provides more in- aeropyle structure called “aerophyle crown” on the eggshell sur- sight into the genome evolution and phylogeny of the family Sat- face [64]. This unique structure is formed by the circular vertical urniidae, which contains the largest number of species in Lep- projection of lamellar chorion from the follicle cell, and it sur- idoptera. For example, although 2 silk moths, A. yamamai and rounds the aeropyles near the end of oogenesis [65]. Acquiring B. mori, appear similar, comparative genome analysis showed this kind of de novo complex structure requires numerous ge- the significant differences in the genome size and the specific netic changes, and a previous study about Antheraea polyphemus expansion of repeat elements and gene families between the has shown that more than 100 chorion-specific polypeptides families Saturniidae and Bombycidae. In case of molecular phy- were involved in this unique ultra-structure [65]. Therefore, the logeny, most previous phylogenetic studies were limited to few specific rapid expansion of the chorion class A and B gene family genes due to the lack of genomic information on the family in the A. yamamai genome might be one of the convincing molec- Saturniidae. We expect that our study and resulting constructed ular explanations for acquiring this unique ultrastructure in the genome will resolve some limitations of molecular phylogenetic eggshell surface of Antheraea genus. However, this unique ultra- and ecological research on the Saturniidae species. Addition- structure tends to be reduced during the current evolution pro- ally, constructed genome information will help researchers bet- cess of the Antheraea genus. Types of eggshell structures in the ter understand the molecular background of wild silk and its Antheraea genus can be categorized into multiple classes based production. Silk produced by A. yamamai, referred to as tensan on the morphology and regional distribution of the aeropyle [64]. silk, shows unique characteristics that have made it valuable in The shape of the aeropyle in the A. yamamai egg is known to be various fields. However, A. yamamai has not been completely converted to a mound shape from the crown shape, and these domesticated compared with B. mori, making mass produc- aeropyle mounds only exist in the narrow band surrounding the tion of tensan silk infeasible. Understanding of the molecular Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of Antheraea yamamai 9 Figure 6: Karyotype of A. yamamai using a gamete of testis in metaphase. mechanisms behind the tensan silk production process is essen- Author contributions tial for mass production using biotechnology, and this genome Sampling: Kee-Young Kim, Su-Bae Kim; sequencing: Kwang- sequence with manually curated gene information is a funda- Ho Choi, Seong-Wan Kim genome assembly: Seong-Ryul Kim, mental resource for related research and industrial improve- Woori Kwak, Jae-Sam Hwang, Seung-Won Park repeat el- ment. Additionally, the transcriptome data of 10 different organ ement analysis: Seong-Ryul Kim, Woori Kwak, Seung-Won tissues with the 3 biological replications presented here may be Park gene prediction: Seong-Ryul Kim, Woori Kwak, Hyaekang also useful resources for uncovering the molecular mechanisms Kim, Jae-Sam Hwang comparative genome analysis: Seong- related to specific phenotypes of A. yamamai and the family Ryul Kim, Woori Kwak, Min-Jae Kim, Kelsey Caetano-Anolles Saturniidae. funding and experimental design: Seong-Ryul Kim, Seung-Won Park. Availability of supporting data The generated genome sequence and gene information of A. Acknowledgements yamamai are available in GigaDB [67], and generated raw data This work was supported by a grant from the Rural Development are available under project accession PRJNA383008 and PR- Administration, Republic of Korea (grant No. PJ010442). JNA383025 of the NCBI database. References Additional file 1. Regier JC, Grant MC, Mitter C et al. Phylogenetic rela- Additional file 1: Figure S1. Top 20 terms in each 7 Interproscan5 tionships of wild silkmoths (Lepidoptera: Saturniidae) in- analysis (CDD, Pfam, PRINTS, ProSitePatterns, ProSiteProfiles, ferred from four protein-coding nuclear genes. Syst Entomol TIGRF, Gene Ontology). 2008;33(2):219–28. Additional file 1: Supplementrary information2.xlsx. 2. Regier JC, Mitter C, Zwick A et al. A large-scale, higher-level, molecular phylogenetic study of the insect order Lepidoptera (moths and butterflies). PLoS One 2013; 8(3):e58568. Competing interests 3. Hedges SB, Dudley J, Kumar S. TimeTree: a public knowledge- All authors report no competing interests. base of divergence times among organisms. Bioinformatics 2006;22(23):2971–2. 4. Kawahara AY, Barber JR. Tempo and mode of antibat ultra- Abbreviations sound production and sonar jamming in the diverse hawk- BUSCO: Benchmarking Universal Single-Copy Orthologs; MYA: moth radiation. Proc Natl Acad Sci U S A 2015;112(20):6407– million years ago. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 10 Kim et al. 5. Peigler RS. Wild Silks of the World. Am Entomol 1993;39(3):151– 28. McCoy RC, Taylor RW, Blauwkamp TA et al. Illumina TruSeq 62. synthetic long-reads empower de novo assembly and resolve 6. Matsumoto Y-I, Saito H. Load-extension characteristics of complex, highly-repetitive transposable elements. PLoS One composite raw silk of Antheraea yamamai and Bombyx mori. 2014;9(9):e106689. J Sericult Sci Japan 1997;66(6):497–501. 29. Simao ˜ FA, Waterhouse RM, Ioannidis P et al. BUSCO: assess- 7. Nakamura S, Saegusa Y, Yamaguchi Y et al. Physical prop- ing genome assembly and annotation completeness with erties and structure of silk. XI. Glass transition temper- single-copy orthologs. Bioinformatics 2015:btv351. ature of wild silk fibroins. J Appl Polym Sci 1986; 31(3): 30. Bao Z, Eddy SR. Automated de novo identification of re- 955–6. peat sequence families in sequenced genomes. Genome Res 8. Kweon H, Park Y. Structural characteristics and physical 2002;12(8):1269–76. properties of wild silk fibres; Antheraea pernyi and Antheraea 31. Price AL, Jones NC, Pevzner PA. De novo identifica- yamamai. Korean J Sericult Sci (Korea Rep) 1994. tion of repeat families in large genomes. Bioinformatics 9. Zheng Z, Wei Y, Yan S et al. Preparation of regenerated An- 2005;21(Suppl 1):i351–8. theraea yamamai silk fibroin film and controlled-molecular 32. Benson G Tandem repeats finder: a program to analyze DNA conformation changes by aqueous ethanol treatment. J Appl sequences. Nucleic Acids Res 1999;27(2):573–80. Polym Sci 2010;116(1):461–7. 33. Kohany O, Gentles AJ, Hankus L et al. Annotation, sub- 10. Omenetto F, Kaplan D, Amsden J et al. Silk based biophotonic mission and screening of repetitive elements in Rep- sensors. 2011. Google Patents. base: RepbaseSubmitter and Censor. BMC Bioinformatics 11. Takeda S. New field of insect science: research on the use of 2006;7(1):474. insect properties. Entomol Sci 2013;16(2):125–35. 34. Tarailo-Graovac M, Chen N. Using RepeatMasker to iden- 12. Omenetto F, Kaplan DL. Silk-based multifunctional biomed- tify repetitive elements in genomic sequences. Curr Protoc ical platform. 2012. Google Patents. Bioinformatics 2009: 4.10.1–4.10.14. 13. Serban MA. Silk medical devices. 2016. Google Patents. 35. Bao W, Kojima KK, Kohany O. Repbase Update, a database 14. Jiang G-L, Collette AL, Horan RL et al. Drug delivery plat- of repetitive elements in eukaryotic genomes. Mobile DNA forms comprising silk fibroin hydrogels and uses thereof. 2015;6(1):11. 2010. Google Patents. 36. Nene V, Wortman JR, Lawson D et al. Genome se- 15. Kamiya M, Oyauchi K, Sato Y et al. Structure-activity rela- quence of Aedes aegypti, a major arbovirus vector. Science tionship of a novel pentapeptide with cancer cell growth- 2007;316(5832):1718–23. inhibitory activity. J Pept Sci 2010;16(5):242–8. 37. Xia Q, Zhou Z, Lu C et al. A draft sequence for the 16. Bioinformatics B. FastQC A Quality Control Tool for High genome of the domesticated silkworm (Bombyx mori). Sci- Throughput Sequence Data. Cambridge, UK: Babraham Insti- ence 2004;306(5703):1937–40. tute, 2011. 38. Zhan S, Merlin C, Boore JL et al. The monarch butterfly 17. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trim- genome yields insights into long-distance migration. Cell mer for Illumina sequence data. Bioinformatics 2014:btu170. 2011;147(5):1171–85. 18. Marc¸ais G, Kingsford C. A fast, lock-free approach for effi- 39. Adams MD, Celniker SE, Holt RA et al. The genome sequence cient parallel counting of occurrences of k-mers. Bioinfor- of Drosophila melanogaster. Science 2000;287(5461):2185–95. matics 2011;27(6):764–70. 40. Consortium HG. Butterfly genome reveals promiscuous ex- 19. You M, Yue Z, He W et al. A heterozygous moth genome pro- change of mimicry adaptations among species. Nature vides insights into herbivory and detoxification. Nat Genet 2012;487(7405):94–98. 2013;45(2):220–5. 41. Ahola V, Lehtonen R, Somervuo P et al. The Glanville frit- 20. Maruyama T, Nei M. Genetic variability maintained by mu- illary genome retains an ancient karyotype and reveals se- tation and overdominant selection in finite populations. Ge- lective chromosomal fusions in Lepidoptera. Nat Commun netics 1981;98(2):441–59. 2014;5. 21. Pamilo P, Palsson ´ S. Associative overdominance, heterozy- 42. Hwang J-S, Lee J-S, Goo T-W et al. Cloning of the fi- gosity and fitness. Heredity 1998; 81(4):381–9. broin gene from the oak silkworm, Antheraea yamamai 22. Gnerre S, MacCallum I, Przybylski D et al. High-quality draft and its complete sequence. Biotechnol Lett 2001;23(16): assemblies of mammalian genomes from massively paral- 1321–6. lel sequence data. Proc Natl Acad Sci U S A 2011;108(4): 43. Malay AD, Sato R, Yazawa K et al. Relationships between 1513–8. physical properties and sequence in silkworm silks. Sci Rep 23. Luo R, Liu B, Xie Y et al. SOAPdenovo2: an empirically im- 2016;6(1):27573. proved memory-efficient short-read de novo assembler. Gi- 44. Stanke M, Diekhans M, Baertsch R et al. Using native and gascience 2012;1(1):18. syntenically mapped cDNA alignments to improve de novo 24. O’Connell J, Schulz-Trieglaff O, Carlson E et al. NxTrim: opti- gene finding. Bioinformatics 2008; 24(5):637–44. mized trimming of Illumina mate pair reads: Table 1. Bioin- 45. Blanco E, Parra G, Guigo´ R. Using Geneid to identify genes. formatics 2015;31(12):2035–7. Curr Protoc Bioinformatics 2007: 4.3.1–4.3.28. 25. Boetzer M, Henkel CV, Jansen HJ et al. Scaffolding 46. Lomsadze A, Burns PD, Borodovsky M. Integration of mapped pre-assembled contigs using SSPACE. Bioinformatics RNA-Seq reads into automatic training of eukaryotic gene 2011;27(4):578–9. finding algorithm. Nucleic Acids Res 2014:gku557. 26. Boetzer M, Pirovano W. SSPACE-LongRead: scaffolding bac- 47. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering terial draft genomes using long read sequence information. splice junctions with RNA-Seq. Bioinformatics 2009;25(9): BMC Bioinformatics 2014;15(1):211. 1105–11. 27. Voskoboynik A, Neff NF, Sahoo D et al. The genome se- 48. Trapnell C, Roberts A, Goff L et al. Differential gene and quence of the colonial chordate, Botryllus schlosseri. Elife transcript expression analysis of RNA-seq experiments with 2013;2:e00569. TopHat and Cufflinks. Nat Protoc 2012; 7(3):562–78. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of Antheraea yamamai 11 49. Campbell MA, Haas BJ, Hamilton JP et al. 58. Castresana J. Selection of conserved blocks from multiple Comprehensive analysis of alternative splicing in rice alignments for their use in phylogenetic analysis. Mol Biol and comparative analyses with Arabidopsis. BMC Genomics Evol 2000;17(4):540–52. 2006;7(1):327. 59. Kumar S, Stecher G, Tamura K. MEGA7: Molecular evolution- 50. Haas BJ, Salzberg SL, Zhu W et al. Automated eukary- ary genetics analysis version 7.0 for bigger datasets. Mol Biol otic gene structure annotation using EVidenceModeler and Evol 2016;33(7):1870–4. the program to assemble spliced alignments. Genome Biol 60. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phy- 2008;9(1):R7. logenetic inference under mixed models. Bioinformatics 51. Robinson JT, Thorvaldsdottir ´ H, Winckler W et al. In- 2003;19(12):1572–4. tegrative genomics viewer. Nat Biotechnol 2011;29(1): 61. Posada D. Using MODELTEST and PAUP to select a model 24–26. of nucleotide substitution. Curr Protoc Bioinformatics 2003: 52. Zurovec M, Yonemura N, Kludkiewicz B et al. Sericin Compo- 6.5.1–6.5.14. sition in the Silk of Antheraea yamamai. Biomacromolecules 62. De Bie T, Cristianini N, Demuth JP et al. CAFE: a computa- 2016;17(5):1776–87. tional tool for the study of gene family evolution. Bioinfor- 53. Consortium U. Reorganizing the protein space at the matics 2006;22(10):1269–71. Universal Protein Resource (UniProt). Nucleic Acids Res 63. Chapman RF. The Insects: Structure and Function. Cam- 2011:gkr981. bridge: Cambridge University Press; 1998. 54. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences 64. Regier JC, Paukstadt U, Paukstadt LH et al. Phylogenetics of (RefSeq): a curated non-redundant sequence database of eggshell morphogenesis in Antheraea (Lepidoptera: Saturni- genomes, transcripts and proteins. Nucleic Acids Res idae): unique origin and repeated reduction of the aeropyle 2007;35(Database):D61–5. crown. Syst Biol 2005;54(2):254–67. 55. Jones P, Binns D, Chang H-Y et al. InterProScan 5: 65. Regier JC, Mazur GD, Kafatos FC. The silkmoth chorion: mor- genome-scale protein function classification. Bioinformatics phological and biochemical characterization of four surface 2014;30(9):1236–40. regions. Devel Biol 1980;76(2):286–304. 56. Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of 66. Hatzopoulos AK, Regier JC. Evolutionary changes in the de- ortholog groups for eukaryotic genomes. Genome Res velopmental expression of silkmoth chorion genes and their 2003;13(9):2178–89. morphological consequences. Proc Natl Acad Sci U S A 57. Loytynoja ¨ A, Goldman N. From The Cover: An algo- 1987;84(2):479–83. rithm for progressive multiple alignment of sequences 67. Kim S, Kwak W, Kim K et al. The Japanese silk moth, Anther- with insertions. Proc Natl Acad Sci U S A 2005;102(30): aea yamamai, draft genome sequence. GigaScience Database 10557–62. 2017. http://dx.doi.org/10.5524/100382. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Genome sequence of the Japanese oak silk moth, Antheraea yamamai: the first draft genome in the family Saturniidae

Free
11 pages

Loading next page...
 
/lp/ou_press/genome-sequence-of-the-japanese-oak-silk-moth-antheraea-yamamai-the-9iVNbNh26Z
Publisher
BGI
Copyright
© The Author(s) 2017. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/gix113
Publisher site
See Article on Publisher Site

Abstract

Background: Antheraea yamamai, also known as the Japanese oak silk moth, is a wild species of silk moth. Silk produced by A. yamamai, referred to as tensan silk, shows different characteristics such as thickness, compressive elasticity, and chemical resistance compared with common silk produced from the domesticated silkworm, Bombyx mori. Its unique characteristics have led to its use in many research fields including biotechnology and medical science, and the scientific as well as economic importance of the wild silk moth continues to gradually increase. However, no genomic information for the wild silk moth, including A. yamamai, is currently available. Findings: In order to construct the A. yamamai genome, a total of 147G base pairs using Illumina and Pacbio sequencing platforms were generated, providing 210-fold coverage based on the 700-Mb estimated genome size of A. yamamai. The assembled genome of A. yamamai was 656 Mb (>2 kb) with 3675 scaffolds, and the N50 length of assembly was 739 Kb with a 34.07% GC ratio. Identified repeat elements covered 37.33% of the total genome, and the completeness of the constructed genome assembly was estimated to be 96.7% by Benchmarking Received: 18 April 2017; Revised: 22 May 2017; Accepted: 16 November 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Kim et al. Universal Single-Copy Orthologs v2 analysis. A total of 15 481 genes were identified using Evidence Modeler based on the gene prediction results obtained from 3 different methods (ab initio, RNA-seq-based, known-gene-based) and manual curation. Conclusions: Here we present the genome sequence of A. yamamai, the first genome sequence of the wild silk moth. These results provide valuable genomic information, which will help enrich our understanding of the molecular mechanisms relating to not only specific phenotypes such as wild silk itself but also the genomic evolution of Saturniidae. Keywords: Antheraea yamamai; genome assembly; Japanese silk moth; Japanese oak silk moth; wild silkworm we added 2 long read sequencing platforms, Moleculo (Illumina Data Description synthetic long read) and RS II (Pacific Bioscience). Tables 1–3 Antheraea yamamai (NCBI Taxonomy ID: 7121), also known as show a summary of generated data for each library used in this the Japanese oak silk moth, is a wild silk moth species belong- study. RNA-seq libraries were also constructed for 10 different ing to the Saturniidae family (Fig. 1). Silk moths can be cate- tissues (hemocyte, malpighian tube, midgut, fat body, anterior- gorized into 2 families—Bombycidae and Saturniidae. Saterni- middle/silk gland, posterior/silk gland, head, integument, testis, idae has been estimated to contain approximately 1861 species, ovary) with 3 biological replicates following standard manufac- with 162 genera [1], and is known as the largest family in the turer protocol (Illumina, San Diego, CA, USA). For this, more than Lepidoptera. Among the many species in family Saturniidae, 100 individual A. yamamai samples in 5 instar stages from the only a few species, including A. yamamai, can be utilized for silk same breeding line were used for tissue anatomy, and 3 samples production. Previous phylogenetic studies have shown that the from each tissue were selected based on the quality of extracted family Saturniidae shares common ancestors with the family RNA. Details of transcriptome library construction are shown in Sphingidae, including the hawk moth (Macroglossum stellatarum), the Supplementary Data. Information of libraries and generated and the Bombycidae family, including the most representative data is provided in Table 4, and a total of 147 Gb of genomic data silkworm, Bombyx mori [2]. The estimated divergence time be- and 76 Gb of transcriptomic data was generated for this study. tween A. yamamai and B. mori is 84 million years ago (MYA), mak- ing it similar to the 88 MYA estimated divergence time between the human and mouse [3, 4]. Genome assembly and evaluation A. yamamai produces a specific silk called tensan silk [ 5], which shows distinctive characteristics compared with com- Before conducting genome assembly, we conducted k-mer dis- mon silk from B. mori, including characteristics such as thick- tribution analysis using a 350-bp paired-end library in order ness, bulkiness, compressive elasticity, and resistance to dyeing to estimate the size and characteristics of the A. yamamai chemicals [6–8]. These characteristics have received the atten- genome. The quality of our generated raw data was checked tion of researchers as a new biomaterial for use in various fields using FASTQC (FastQC, RRID:SCR 014583)[16]. Sequencing ar- [9–11]. Additionally, tensan silk also has been studied for its ap- tifacts such as adapter sequences and low-quality bases were plications to human health [12–15]. However, despite the poten- removed using Trimmomatic (Trimmomatic, RRID:SCR 011848) tial importance of the wild silk moth in research and economic [17]. Jellyfish [ 18] was used to count the k-mer frequency for es- fields, no whole genomic information is currently available for timation of the genome size of A. yamamai. Figure 2 shows the this or any other species from the family Saturniidae. 19-mer distribution of the A. yamamai genome using a 350-bp In this study, we present the annotated genome sequence of paired-end library. In the 19-mer distribution, the second peak, A. yamamai, the first published genome in the family Saturni- at approximately half the coverage value (x-axis) of the main idae, with transcriptome datasets collected from 10 different peak, indicates heterozygosity. Although the inbred line used in body organ tissues. These data will be a fundamental resource this study was the single pair sib-mating maintained for more for future studies and provide more insight into the genome evo- than 10 generations, high heterozygosity still remains. This phe- lution and molecular phylogeny of the family Saturniidae. nomenon has been observed in a previous genomic study of the Diamondback moth (Plutella xylostella), and sustained het- erozygosity as an important genomic characteristic was hypoth- Sample preparation and sequencing esized to be a result of environmental adaption [19]. The un- For whole-genome sequencing, we selected 1 male sample (Ay- derlying mechanism of the sustained heterozygosity is unclear, 7-male1) from a breeding line (Ay-7) of A. yamamai raised at the but associative overdominance can be one of the candidate ex- National Academy of Agricultural Science, Rural Development planations of this phenomenon [20, 21]. Based on the result of Administration, Korea. In lepidopterans, males are homoga- 19-mer distribution analysis, the genome size of A. yamamai metic (ZZ), and selecting a male sample can reduce the complex- was estimated to be 709 Mb. However, this size might be larger ity of assembly from excessive repeats on the W chromosome than the real genome size of A. yamamai because high het- in females. For genomic library construction, we removed the erozygosity could affect the estimation of genome size based guts of A. yamamai to prevent contamination of genomes from on the K-mer distribution. Next, we conducted error correc- other organisms such as gut microbes and oak, the main food tion on Illumina paired-end libraries using the error correction source of A. yamamai. Details of the sample preparation process module of Allpaths-LG [22] before the initial contig assembly used in this study are presented in the Supplementary Data. Ge- process (ALLPATHS-LG, RRID:SCR 010742). After error correction, nomic DNA was extracted using a DNeasy Animal Mini Kit (Qi- initial contig assembly with 350-bp and 700-bp libraries was con- agen, Hilden, Germany), and the quality of extracted DNA was ducted using SOAP denovo2 [23] with the parameter option set at checked using trenean, picogreen assay, and gel electrophore- K = 19; this approach showed the best assembly statistics com- sis (1% agarose gel/40 ng loading). After quality control process- pared with other assemblers and parameters (SOAPdenovo2, ing, we were left with a total of 61.5 ug of A. yamamai DNA RRID:SCR 014986). Quality control processing for mate-pair li- for genome sequencing. Using standard Illumina whole-genome braries and scaffolding was conducted using Nxtrim [24]and shotgun (WGS) sequencing protocol (paired-end and mate-pair), SSPACE (SSPACE, RRID:SCR 011848)[25], respectively. At each Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of Antheraea yamamai 3 Figure 1: Photograph of Antheraea Yamamai. From left: larva, cocoon, and adult A. yamamai, respectively. Green color is one of the representative characteristics of tensan silk. Table 1: Summary statistics of generated whole-genome shotgun sequencing data using Illumina Nextseq 500 Library Read Reads retained name Library type Insert size Platform length No. of reads Total base, bp after trimming 350 bp Paired-end 350 bp Nextseq500 151 293 176 268 44 269 616 468 291 070 362 700 bp Paired-end 700 bp Nextseq500 151 246 945 900 37 288 830 900 244 698 580 3 Kbp Mate-pair 3 Kbp Nextseq500 76 284 204 762 21 599 561 912 195 095 164 6 Kbp Mate-pair 6 Kbp Nextseq500 76 246 238 370 18 714 116 120 152 496 372 9 Kbp Mate-pair 9 Kbp Nextseq500 76 239 919 538 18 233 884 888 148 612 724 Total 1 310 484 838 140 106 010 288 1 031 973 202 Table 2: Summary statistics of generated Illumina synthetic long the N50 length of assembly was 739 Kb with a 34.07% GC ratio. read (Moleculo) library To evaluate the quality of the assembled genome, we conducted Benchmarking Universal Single-Copy Orthologs (BUSCO) analy- 500–1499 bp ≥1500 bp sis [29] using BUSCO v2.0 with insecta odb9 including 1658 BUS- COsfrom42species (BUSCO, RRID:SCR 015008). From BUSCO No. of assembled reads 302 132 342 738 analysis, 96.7% of BUSCOs were completely detected in the as- No. of bases in assembled read 268 853 717 1 205 349 082 sembled genome (1576: complete and single-copy; 27: complete N50 length of assembled read 960 4031 and duplicated) among 1658 tested BUSCOs. The numbers of fragmented and missing BUSCOs were 21 and 34, respectively. Table 3: Summary statistics of generated long reads data using Pacbio Based on the result of BUSCO analysis, the genome of A. yamamai RS II system presented here was considered properly constructed for down- stream analysis. No. of reads 1005,571 Total bases 5836 969 225 Repeat identification and comparative repeat analysis Length of longest (shortest) read 50 132 (50) Average read length 5804.63 To identify repeat elements of the A. yamamai genome, a cus- tom repeat library was constructed using RepeatModeler with RECON [30], RepeatScout [31], and TRF [32]. The resulting con- scaffolding step, SOAP Gapcloser [23] with -l 155 and -p 31 pa- structed custom repeat library for A. yamamai was further cu- rameters was repeatedly used to close the gaps within each rated using the CENSOR [33] search, and the curated library scaffold. In order to obtain a higher-quality genome assem- was employed in RepeatMasker [34] with Repbase [35]. Repeat- bly of A. yamamai, we employed several long read scaffolding Masker (RepeatMasker, RRID:SCR 012954) was conducted with strategies using SSPACE-LongRead [26]. First, we used an Illu- RMBlast and the “no is” option for skipping bacterial insertion mina synthetic long read sequencing platform called Moleculo, element check. Table 6 summarizes the proportion of identified which has been proven valuable for the study of highly het- mobile elements in the A. yamamai genome. The most preva- erozygous genomes in previous studies [27, 28]. After scaffolding lent repeat elements in the A. yamamai genome were LINE el- was performed using SSPACE-LongRead with Illumina synthetic ements (101 Mb, 15.31% of total genome), and total repeat el- long read data, the total number of assembled scaffolds was ef- ements accounted for 37.33% of the total genome. In order to fectively reduced from 398 446 to 24 558. The average scaffold compare the repeat elements of A. yamamai with those of other length was also extended from 1.7 Kb to 24.8 Kb. However, there genomes, we conducted the same process for 7 public genomes was no impressive improvement in N50 length (approximately that are close neighbors of A. yamamai—Aedes aegypti [36], Bom- 91 Kb to 112 Kb) of assembled scaffolds. Therefore, we employed byx mori [37], Danaus plexippus [38], Drosophila melanogaster [39], anothertypeoflongread data generatedfrom10cells of Pacbio Heliconius Melpomene [40], Melitaea cinxia [41], and Plutella xy- RS II system with P6-C4 chemistry. After final scaffolding pro- lostella [19]. Figure 3 displays the amount and proportion of iden- cessing using Pacbio long reads, the number of scaffolds was re- tified repeat elements from the 8 species. Despite the small duced to 3675, and N50 length was effectively extended from 112 genome size of B. mori, the total amount of identified SINE el- Kb to 739 Kb. Summary statistics of the assembled A. yamamai ements in the B. mori genome was 5.77 times larger than that genome are provided in Table 5. Final assembly of the A. yama- of A. yamamai. The top 5 expanded repeat elements in the A. mai genome was 656 Mb (>2 kb) long with 3675 scaffolds, and yamamai genome were DNA/RC, LINE/L2, LINE/RTE-BovB, DNA/ Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Kim et al. Table 4: Summary statistics of generated transcriptome data obtained from 6 organ tissues using Illumina platform Tissue Sample name Read length Read count Total base, bp Hemocyte Hemocyte 1 76 20 815 674 1 581 991 224 Hemocyte 2 76 26 704 666 2 029 554 616 Hemocyte 2 76 53 068 562 4 033 210 712 Malpighian tube Malpighi 1 76 22 635 428 1 720 292 528 Malpighi 2 76 24 893 788 1 891 927 888 Malpighi 3 76 45 213 164 3 436 200 464 Midgut Midgut 1 76 23 350 138 1 774 610 488 Midgut 2 76 24 597 972 1 869 445 872 Midgut 3 76 50 949 986 3 872 198 936 Head Head 1 76 26 526 276 2 015 996 976 Head 2 76 26 581 124 2 020 165 424 Head 3 76 40 900 456 3 108 434 656 Integument Skin 1 76 24 592 846 1 869 056 296 Skin 2 76 42 775 430 3 250 932 680 Skin 3 76 35 043 570 2 663 311 320 Fat body Fat Body 1 76 24 637 810 1 872 473 560 Fat Body 2 76 24 037 494 1 826 849 544 Fat Body 3 76 40 817 582 3 102 136 232 Anterior-middle/silk gland AM/Silk Gland 1 76 21 399 638 1 626 372 488 AM/Silk Gland 2 76 24 292 386 1 846 221 336 AM/Silk Gland 3 76 37 331 530 2 837 196 280 Posterior/silk gland P/Silk Gland 1 76 27 359 580 2 079 328 080 P/Silk Gland 2 76 23 300 962 1 770 873 112 P/Silk Gland 3 76 39 421 430 2 996 028 680 Testis Testis 1 76 40 890 404 3 107 670 704 Testis 2 76 45 733 846 3 475 772 296 Testis 3 76 44 985 224 3 418 877 024 Ovary Ovary 1 76 40 797 628 3 100 619 728 Ovary 2 76 40 409 752 3 071 141 152 Ovary 3 76 42 417 892 3 223 759 792 1e+10 1e+09 1e+08 1e+07 1e+06 100 000 0 20 40 60 80 100 120 multiplicity Figure 2: 19-mer distribution of A. yamamai genome using jellyfish with 350-bp paired-end whole genome sequencing data. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Number of distinct k-mers with given multiplicity Genome of Antheraea yamamai 5 Table 5: Summary statistics of the A. yamamai genome (>2 kb) erated transcriptome data from 10 organ tissues of A. yamamai were aligned to the assembled genome and gene information Assembled genome was predicted using Cufflinks (Cufflinks, RRID:SCR 014597)[48]. The longest CDS sequences were identified from Cufflinks re- Size, 1n 656 Mb sults using Transdecoder. For the homology-based approach, all GC level 34.07 known genes of the order Lepidoptera in the NCBI database were No. of scaffolds 3675 aligned using PASA [49]. Table 7 shows the gene prediction re- N50 of scaffolds, bp 739 388 sults from each method. Gene prediction results from differ- No. of bases in scaffolds, % 19 257 439 (2.93) ent prediction algorithms were combined using Evidence Mod- Longest (shortest) scaffolds, bp 3 156 949 (2003) eler [50], and a consensus gene set of the A. yamamai genome Average scaffold length, bp 178 657.53 was created. Manual curation was performed based on the 5 types of evidence (3 in silico, known protein, and RNA-seq) Table 6: Summary of identified repeat elements in the A. yamamai using IGV [51] and BLASTP (BLASTP, RRID:SCR 001010). Us- genome ing IGV with each gene evidence and comparing results with known genes via BLASTP, we mainly focused on removing false Repeat element No. of elements Length, % positively predicted genes that don’t have enough evidence. Merged and spliced gene structured were corrected by com- SINE 59 968 8 615 338 (1.30) paring the gene structure with known exon structure in the LINE 426 522 101 251 176 (15.31) NCBI NR database. In addition, fibroin and sericin genes, which LTR element 53 977 4 552 386 (0.69) couldn’t be properly predicted because of their high repeat mo- DNA element 512 760 69 071 227 (10.44) tif, were also manually identified with previously known se- Small RNA 43 645 6 691 619 (1.01) quences [42, 52] with RNA-seq data. The final gene set of the Simple repeat 135 989 6 256 839 (0.95) A. yamamai genome contains 15 481 genes. Summary statistics Low complexity 19 937 932 829 (0.14) for the consensus gene set are provided in Table 8.The aver- Unclassified 294 190 54 552 009 (8.25) age gene length was 11 016.34 bp with a 34.38% GC ratio, and the number of exons per gene was 5.64. In order to identify the function of predicted genes in A yamamai, 3 nonredun- TcMar-Mariner, and LINE/CR1. Among these, DNA/TcMar- dant sequence databases (Swiss-Prot [53], Uniref100 [53], and Mariner was the specifically expanded repeat element in A. NCBI NR [54]) as well as the gene information of 2 species (B. yamamai among the 8 species. In B. mori, SINE/tRNA-CR1, mori and D. melanogaster) were used for target databases us- LINE/Jockey, DNA/RC, LINE/CR1-Zenon, and LINE/RTE-BovB were ing BLASTP. Additionally, protein domain searches were con- the top 5 expanded repeat elements. When comparing the re- ducted on the consensus gene set using InterproScan5 (Inter- peat elements of A. yamamai with those of B. mori,which are ProScan, RRID:SCR 005829)[55]. Figure S1 shows the top 20 iden- both producers of the same type of silk, repeat elements showed tified terms from 7 different InterproScan5 analyses. Among the family and species-specific patterns in the 2 silk moth linages. various analyses conducted using InterproScan5, gene ontology Particularly, we found that the mariner repeat element, which analysis with the Pfam database showed that a large proportion was found specifically expanded in the A. yamamai genome, was of genes in the A. yamamai genome were related with the func- also included in the fibroin gene. A previous sequencing study tion of molecular binding, catalytic activity, internal component also showed that the mariner repeat element was inserted in the of membrane, metabolic process, oxidation-reduction process, 5’-end of the fibroin gene of A. yamamai [42]. Fibroin is the core and transmembrane transport. component of the silk protein found in the silk moth, and the physical characteristics of silk mainly depend on the types and unique repeat motifs of the fibroin [ 43]. This gene is known to have hundreds of tandem repeat motifs, and these kinds of tan- Comparative genome analysis dem repeats can be derived through transposable elements. This We used OrthoMCL [56] and Reciprocal Best Hit within BLASTP indicates that the mariner repeat element, specifically expanded for identification of gene family clusters and 1:1 orthologous in the A. yamamai genome, may play an important role in the gene sets. Gene information of 7 taxa (A. aegypti, B. mori, D. plex- development of the unique silk of A. yamamai, and the lineage- ippus, D. melanogaster, H. melpomene, M. cinxia, and P. xylostella), specific repeat elements may be one of the candidate evolution the same taxa used in repeat analysis, was employed for Or- forces related to host-specific phenotype during genome evolu- thoMCL with A. yamamai. A total of 17 406 gene family clusters tion. were constructed, and 3586 1:1 orthologous genes were identi- fied. Before conducting comparative genome analysis, we con- Gene prediction and annotation structed phylogenetic trees for the 8 species. In order to build Three different algorithms were used for gene prediction of the the phylogenetic tree, multiple sequence alignment for the 1:1 A. yamamai genome: ab initio, RNA-seq transcript-based, and orthologous genes of all 8 species was conducted using PRANK protein homology-based approaches. For ab initio gene predic- [57], and Gblocks [58] was used to obtain conserved blocks for the tion, Augustus (Augustus: Gene Prediction, RRID:SCR 008417) phylogenetic tree. Conserved block sequences were sequentially [44], Geneid [45], and GeneMarks-ET [46] were employed. Au- concatenated to obtain 1 consensus sequence for each species. gustus was trained using known genes of A. yamamai in the MEGA [59] was used for constructing Neighbor-Joining Trees NCBI database, and mapping information of RNA-seq data ob- (bootstrap 1000, maximum composite likelihood, transitions + tained from Tophat (TopHat, RRID:SCR 013035)[47] was also transversions, and gamma distributed option), and MrBayes (Mr- utilized for gene prediction. Geneid was used with predefined Bayes, RRID:SCR 012067)[60] was employed for the construc- parameters for Drosophila melanogaster. GeneMarks-ET was em- tion of Bayesian inference trees. To select the best evolution ployed using junction information of genes from transcriptome model for our data, Modeltest [61] was conducted and the GTR- data alignment. For RNA-seq transcript-based prediction, gen- based invariant model was chosen based on the AIC value of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Kim et al. A. aegyp A. yamamai B. mori D. plexippus D. melanogaster H. melpomene M. cinxia P. xylostella SINE LINE LTR element DNA element Small RNA Simple repeat Low complexity Unclassifed 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% A. aegyp A. yamamai B. mori D. plexippus D. melanogaster H. melpomene M. cinxia P. xylostella Figure 3: Amount and proportion of identified repeat element from 8 species including A. yamamai. A) Absolute amount of repeat element classified into 8 different categories. B) Proportion of each repeat element in identified total repeat element. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of Antheraea yamamai 7 Table 7: Summary statistics of ab initio, RNA-seq-based, and homology-based gene prediction results Evidence type Programs Element Total count Exon/gene Total length, bp Mean length, bp Gene 14 576 142 415 318 9770.53 Augustus 4.85 Exon 70 733 14 736 668 208.34 Gene 10 946 46 119 402 4213.35 ab initio Geneid 2.25 Exon 24 686 3 925 563 159.01 Gene 27 754 273 745 951 9863.29 GeneMarks-ET 5.50 Exon 152 660 30 847 503 202.06 Gene 36 213 840 429 061 23 207.94 RNA-seq Cufflinks Transdecoder 7.03 Exon 254 770 201 721 675 791.77 Known gene (NCBI lepidoptera) PASA (gmap) 44 561 22 484 151 504.57 Table 8: Summary statistics for the consensus gene set of the A. yamamai genome Element No. of elements Exon/gene Avg. length Total length Genome coverage, % Gene 15 481 11 016.34 170 543 958 25.78 5.64 Exon 87 346 1346.23 20 840 925 3.31 +1560 / -4795 A.aegypti +3861 / -6534 1/100 +6142 / -1649 D.melanogaster +2498 / -5028 P.xylostella +567 / -715 Specific B.Mori +6 / -3395 1:Multi 1/100 +203 / -1671 1/100 Multi:Multi 1:1 +938 / -1987 A.Yamamai +6 / -590 1/100 +749 / -1729 D.plexippus +77 / -878 1/100 +700 / -1602 M.cinxia +47 / -975 1/100 +469 / -2173 H.melpomene 02k 4k 6k 8k 10k Figure 4: Constructed phylogenetic tree and comparative gene family analysis. Node values indicate Bayesian posterior probability, bootstrap and gene expansion, and contraction value. Orange and blue colors indicate expansion and contraction, respectively. Bar chart indicates the number of genes cauterized into 4 groups (Specific, 1:Multi, Multi:Multi, and 1:1) using OrthoMCL. Modeltest. Figure 4 shows the constructed phylogenetic tree of Based on the constructed phylogenetic tree, gene family the 8 species using 3586 orthologous genes. The bootstrap value expansion and contraction analysis were conducted using a and Bayesian poster probability value of all nodes were 100 and 2-parameter model in CAFE[62]and thegenetreewas con- 1, respectively. The closest neighbor of A. yamamai was B. mori, structed using protein sequence via MEGA [59]. Figure 4 shows which is included in Bombycidae family; this result is consistent the result of gene family expansion and contraction analysis of with that of previous studies. Three butterfly species ( D. plexip- 8 species; 938 and 1987 gene families of A. yamamai and 567 and pus, M. cinxia, and H. meplmene) included in Nymphalidae family 715 gene families of B. mori were estimated to be expanded and were also shown to share a common ancestor with the families contracted from the common ancestors, respectively. Among Saturniidae and Bombycidae. these, 15 gene families in A. yamamai were estimated to be Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 8 Kim et al. AB Figure 5: Expansion of chorion gene in the A. yamamai genome. A, B) The gene trees of chorion A and B in the rapid expanded gene family cluster, respectively. Color of terminal node indicates each taxon identified in the gene family cluster. under rapid expansion during the evolution process. Functions micropyle region [64]. Only a very few, small aeropyle crowns of genes in the rapidly expanded gene families of A. yama- remain, and it is entirely different from the ancestral form of mai were transposase, fatty acid synthase, zinc finger protein, eggshell surface mostly covered by aeropyle crowns. These re- chorion (eggshell protein), reverse transcriptase, prostaglandin gional differences were known to be adjusted by regional differ- dehydrogenase, RNA-directed DNA polymerase, gag like protein, ences of filler genes during choriogenesis [ 66], and the additional juvenile hormone acid methyltransferase, facilitated trehalose regulations of related genes for choriogenesis have to be consid- transporter, and glucose dehydrogenase. Figure 5 shows the ered. This indicates that the specifically expanded chorion gene gene tree of 2 chorion gene (chorion class A and B) family clus- families of A. yamamai may be one of the remaining evolution- ters rapidly expanded in the A. yamamai genome. Chorion, called ary tracks in the genome of the Antheraea genus. However, fur- eggshell protein, composes the surface of the egg and protects ther functional studies must be conducted to resolve the limited the embryo from environmental threats such as desiccation, understanding about the relationship between these expanded flooding, freezing, infection of microorganisms, and physical de- chorion gene families and the current eggshell surface forma- struction. It also provides channels, such as aeropyle, that en- tion of A. yamamai. able gas exchange and maintain proper conditions for diapause The constructed genome of A. yamamai presented here is of the egg [63]. These diverse functions of eggshell are imple- the first announced genome in the family Saturniidae, and the mented by the specific eggshell structure, and the surface struc- karyotyping analysis using gamete in the metaphase showed ture of eggshell varies between species for the adaptation in a that the genome of A. yamamai consists of 31 chromosomes different environment. The ancestor of Antherea has the unique (Fig. 6). This constructed genome information provides more in- aeropyle structure called “aerophyle crown” on the eggshell sur- sight into the genome evolution and phylogeny of the family Sat- face [64]. This unique structure is formed by the circular vertical urniidae, which contains the largest number of species in Lep- projection of lamellar chorion from the follicle cell, and it sur- idoptera. For example, although 2 silk moths, A. yamamai and rounds the aeropyles near the end of oogenesis [65]. Acquiring B. mori, appear similar, comparative genome analysis showed this kind of de novo complex structure requires numerous ge- the significant differences in the genome size and the specific netic changes, and a previous study about Antheraea polyphemus expansion of repeat elements and gene families between the has shown that more than 100 chorion-specific polypeptides families Saturniidae and Bombycidae. In case of molecular phy- were involved in this unique ultra-structure [65]. Therefore, the logeny, most previous phylogenetic studies were limited to few specific rapid expansion of the chorion class A and B gene family genes due to the lack of genomic information on the family in the A. yamamai genome might be one of the convincing molec- Saturniidae. We expect that our study and resulting constructed ular explanations for acquiring this unique ultrastructure in the genome will resolve some limitations of molecular phylogenetic eggshell surface of Antheraea genus. However, this unique ultra- and ecological research on the Saturniidae species. Addition- structure tends to be reduced during the current evolution pro- ally, constructed genome information will help researchers bet- cess of the Antheraea genus. Types of eggshell structures in the ter understand the molecular background of wild silk and its Antheraea genus can be categorized into multiple classes based production. Silk produced by A. yamamai, referred to as tensan on the morphology and regional distribution of the aeropyle [64]. silk, shows unique characteristics that have made it valuable in The shape of the aeropyle in the A. yamamai egg is known to be various fields. However, A. yamamai has not been completely converted to a mound shape from the crown shape, and these domesticated compared with B. mori, making mass produc- aeropyle mounds only exist in the narrow band surrounding the tion of tensan silk infeasible. Understanding of the molecular Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of Antheraea yamamai 9 Figure 6: Karyotype of A. yamamai using a gamete of testis in metaphase. mechanisms behind the tensan silk production process is essen- Author contributions tial for mass production using biotechnology, and this genome Sampling: Kee-Young Kim, Su-Bae Kim; sequencing: Kwang- sequence with manually curated gene information is a funda- Ho Choi, Seong-Wan Kim genome assembly: Seong-Ryul Kim, mental resource for related research and industrial improve- Woori Kwak, Jae-Sam Hwang, Seung-Won Park repeat el- ment. Additionally, the transcriptome data of 10 different organ ement analysis: Seong-Ryul Kim, Woori Kwak, Seung-Won tissues with the 3 biological replications presented here may be Park gene prediction: Seong-Ryul Kim, Woori Kwak, Hyaekang also useful resources for uncovering the molecular mechanisms Kim, Jae-Sam Hwang comparative genome analysis: Seong- related to specific phenotypes of A. yamamai and the family Ryul Kim, Woori Kwak, Min-Jae Kim, Kelsey Caetano-Anolles Saturniidae. funding and experimental design: Seong-Ryul Kim, Seung-Won Park. Availability of supporting data The generated genome sequence and gene information of A. Acknowledgements yamamai are available in GigaDB [67], and generated raw data This work was supported by a grant from the Rural Development are available under project accession PRJNA383008 and PR- Administration, Republic of Korea (grant No. PJ010442). JNA383025 of the NCBI database. References Additional file 1. Regier JC, Grant MC, Mitter C et al. Phylogenetic rela- Additional file 1: Figure S1. Top 20 terms in each 7 Interproscan5 tionships of wild silkmoths (Lepidoptera: Saturniidae) in- analysis (CDD, Pfam, PRINTS, ProSitePatterns, ProSiteProfiles, ferred from four protein-coding nuclear genes. Syst Entomol TIGRF, Gene Ontology). 2008;33(2):219–28. Additional file 1: Supplementrary information2.xlsx. 2. Regier JC, Mitter C, Zwick A et al. A large-scale, higher-level, molecular phylogenetic study of the insect order Lepidoptera (moths and butterflies). PLoS One 2013; 8(3):e58568. Competing interests 3. Hedges SB, Dudley J, Kumar S. TimeTree: a public knowledge- All authors report no competing interests. base of divergence times among organisms. Bioinformatics 2006;22(23):2971–2. 4. Kawahara AY, Barber JR. Tempo and mode of antibat ultra- Abbreviations sound production and sonar jamming in the diverse hawk- BUSCO: Benchmarking Universal Single-Copy Orthologs; MYA: moth radiation. Proc Natl Acad Sci U S A 2015;112(20):6407– million years ago. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 10 Kim et al. 5. Peigler RS. Wild Silks of the World. Am Entomol 1993;39(3):151– 28. McCoy RC, Taylor RW, Blauwkamp TA et al. Illumina TruSeq 62. synthetic long-reads empower de novo assembly and resolve 6. Matsumoto Y-I, Saito H. Load-extension characteristics of complex, highly-repetitive transposable elements. PLoS One composite raw silk of Antheraea yamamai and Bombyx mori. 2014;9(9):e106689. J Sericult Sci Japan 1997;66(6):497–501. 29. Simao ˜ FA, Waterhouse RM, Ioannidis P et al. BUSCO: assess- 7. Nakamura S, Saegusa Y, Yamaguchi Y et al. Physical prop- ing genome assembly and annotation completeness with erties and structure of silk. XI. Glass transition temper- single-copy orthologs. Bioinformatics 2015:btv351. ature of wild silk fibroins. J Appl Polym Sci 1986; 31(3): 30. Bao Z, Eddy SR. Automated de novo identification of re- 955–6. peat sequence families in sequenced genomes. Genome Res 8. Kweon H, Park Y. Structural characteristics and physical 2002;12(8):1269–76. properties of wild silk fibres; Antheraea pernyi and Antheraea 31. Price AL, Jones NC, Pevzner PA. De novo identifica- yamamai. Korean J Sericult Sci (Korea Rep) 1994. tion of repeat families in large genomes. Bioinformatics 9. Zheng Z, Wei Y, Yan S et al. Preparation of regenerated An- 2005;21(Suppl 1):i351–8. theraea yamamai silk fibroin film and controlled-molecular 32. Benson G Tandem repeats finder: a program to analyze DNA conformation changes by aqueous ethanol treatment. J Appl sequences. Nucleic Acids Res 1999;27(2):573–80. Polym Sci 2010;116(1):461–7. 33. Kohany O, Gentles AJ, Hankus L et al. Annotation, sub- 10. Omenetto F, Kaplan D, Amsden J et al. Silk based biophotonic mission and screening of repetitive elements in Rep- sensors. 2011. Google Patents. base: RepbaseSubmitter and Censor. BMC Bioinformatics 11. Takeda S. New field of insect science: research on the use of 2006;7(1):474. insect properties. Entomol Sci 2013;16(2):125–35. 34. Tarailo-Graovac M, Chen N. Using RepeatMasker to iden- 12. Omenetto F, Kaplan DL. Silk-based multifunctional biomed- tify repetitive elements in genomic sequences. Curr Protoc ical platform. 2012. Google Patents. Bioinformatics 2009: 4.10.1–4.10.14. 13. Serban MA. Silk medical devices. 2016. Google Patents. 35. Bao W, Kojima KK, Kohany O. Repbase Update, a database 14. Jiang G-L, Collette AL, Horan RL et al. Drug delivery plat- of repetitive elements in eukaryotic genomes. Mobile DNA forms comprising silk fibroin hydrogels and uses thereof. 2015;6(1):11. 2010. Google Patents. 36. Nene V, Wortman JR, Lawson D et al. Genome se- 15. Kamiya M, Oyauchi K, Sato Y et al. Structure-activity rela- quence of Aedes aegypti, a major arbovirus vector. Science tionship of a novel pentapeptide with cancer cell growth- 2007;316(5832):1718–23. inhibitory activity. J Pept Sci 2010;16(5):242–8. 37. Xia Q, Zhou Z, Lu C et al. A draft sequence for the 16. Bioinformatics B. FastQC A Quality Control Tool for High genome of the domesticated silkworm (Bombyx mori). Sci- Throughput Sequence Data. Cambridge, UK: Babraham Insti- ence 2004;306(5703):1937–40. tute, 2011. 38. Zhan S, Merlin C, Boore JL et al. The monarch butterfly 17. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trim- genome yields insights into long-distance migration. Cell mer for Illumina sequence data. Bioinformatics 2014:btu170. 2011;147(5):1171–85. 18. Marc¸ais G, Kingsford C. A fast, lock-free approach for effi- 39. Adams MD, Celniker SE, Holt RA et al. The genome sequence cient parallel counting of occurrences of k-mers. Bioinfor- of Drosophila melanogaster. Science 2000;287(5461):2185–95. matics 2011;27(6):764–70. 40. Consortium HG. Butterfly genome reveals promiscuous ex- 19. You M, Yue Z, He W et al. A heterozygous moth genome pro- change of mimicry adaptations among species. Nature vides insights into herbivory and detoxification. Nat Genet 2012;487(7405):94–98. 2013;45(2):220–5. 41. Ahola V, Lehtonen R, Somervuo P et al. The Glanville frit- 20. Maruyama T, Nei M. Genetic variability maintained by mu- illary genome retains an ancient karyotype and reveals se- tation and overdominant selection in finite populations. Ge- lective chromosomal fusions in Lepidoptera. Nat Commun netics 1981;98(2):441–59. 2014;5. 21. Pamilo P, Palsson ´ S. Associative overdominance, heterozy- 42. Hwang J-S, Lee J-S, Goo T-W et al. Cloning of the fi- gosity and fitness. Heredity 1998; 81(4):381–9. broin gene from the oak silkworm, Antheraea yamamai 22. Gnerre S, MacCallum I, Przybylski D et al. High-quality draft and its complete sequence. Biotechnol Lett 2001;23(16): assemblies of mammalian genomes from massively paral- 1321–6. lel sequence data. Proc Natl Acad Sci U S A 2011;108(4): 43. Malay AD, Sato R, Yazawa K et al. Relationships between 1513–8. physical properties and sequence in silkworm silks. Sci Rep 23. Luo R, Liu B, Xie Y et al. SOAPdenovo2: an empirically im- 2016;6(1):27573. proved memory-efficient short-read de novo assembler. Gi- 44. Stanke M, Diekhans M, Baertsch R et al. Using native and gascience 2012;1(1):18. syntenically mapped cDNA alignments to improve de novo 24. O’Connell J, Schulz-Trieglaff O, Carlson E et al. NxTrim: opti- gene finding. Bioinformatics 2008; 24(5):637–44. mized trimming of Illumina mate pair reads: Table 1. Bioin- 45. Blanco E, Parra G, Guigo´ R. Using Geneid to identify genes. formatics 2015;31(12):2035–7. Curr Protoc Bioinformatics 2007: 4.3.1–4.3.28. 25. Boetzer M, Henkel CV, Jansen HJ et al. Scaffolding 46. Lomsadze A, Burns PD, Borodovsky M. Integration of mapped pre-assembled contigs using SSPACE. Bioinformatics RNA-Seq reads into automatic training of eukaryotic gene 2011;27(4):578–9. finding algorithm. Nucleic Acids Res 2014:gku557. 26. Boetzer M, Pirovano W. SSPACE-LongRead: scaffolding bac- 47. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering terial draft genomes using long read sequence information. splice junctions with RNA-Seq. Bioinformatics 2009;25(9): BMC Bioinformatics 2014;15(1):211. 1105–11. 27. Voskoboynik A, Neff NF, Sahoo D et al. The genome se- 48. Trapnell C, Roberts A, Goff L et al. Differential gene and quence of the colonial chordate, Botryllus schlosseri. Elife transcript expression analysis of RNA-seq experiments with 2013;2:e00569. TopHat and Cufflinks. Nat Protoc 2012; 7(3):562–78. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of Antheraea yamamai 11 49. Campbell MA, Haas BJ, Hamilton JP et al. 58. Castresana J. Selection of conserved blocks from multiple Comprehensive analysis of alternative splicing in rice alignments for their use in phylogenetic analysis. Mol Biol and comparative analyses with Arabidopsis. BMC Genomics Evol 2000;17(4):540–52. 2006;7(1):327. 59. Kumar S, Stecher G, Tamura K. MEGA7: Molecular evolution- 50. Haas BJ, Salzberg SL, Zhu W et al. Automated eukary- ary genetics analysis version 7.0 for bigger datasets. Mol Biol otic gene structure annotation using EVidenceModeler and Evol 2016;33(7):1870–4. the program to assemble spliced alignments. Genome Biol 60. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phy- 2008;9(1):R7. logenetic inference under mixed models. Bioinformatics 51. Robinson JT, Thorvaldsdottir ´ H, Winckler W et al. In- 2003;19(12):1572–4. tegrative genomics viewer. Nat Biotechnol 2011;29(1): 61. Posada D. Using MODELTEST and PAUP to select a model 24–26. of nucleotide substitution. Curr Protoc Bioinformatics 2003: 52. Zurovec M, Yonemura N, Kludkiewicz B et al. Sericin Compo- 6.5.1–6.5.14. sition in the Silk of Antheraea yamamai. Biomacromolecules 62. De Bie T, Cristianini N, Demuth JP et al. CAFE: a computa- 2016;17(5):1776–87. tional tool for the study of gene family evolution. Bioinfor- 53. Consortium U. Reorganizing the protein space at the matics 2006;22(10):1269–71. Universal Protein Resource (UniProt). Nucleic Acids Res 63. Chapman RF. The Insects: Structure and Function. Cam- 2011:gkr981. bridge: Cambridge University Press; 1998. 54. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences 64. Regier JC, Paukstadt U, Paukstadt LH et al. Phylogenetics of (RefSeq): a curated non-redundant sequence database of eggshell morphogenesis in Antheraea (Lepidoptera: Saturni- genomes, transcripts and proteins. Nucleic Acids Res idae): unique origin and repeated reduction of the aeropyle 2007;35(Database):D61–5. crown. Syst Biol 2005;54(2):254–67. 55. Jones P, Binns D, Chang H-Y et al. InterProScan 5: 65. Regier JC, Mazur GD, Kafatos FC. The silkmoth chorion: mor- genome-scale protein function classification. Bioinformatics phological and biochemical characterization of four surface 2014;30(9):1236–40. regions. Devel Biol 1980;76(2):286–304. 56. Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of 66. Hatzopoulos AK, Regier JC. Evolutionary changes in the de- ortholog groups for eukaryotic genomes. Genome Res velopmental expression of silkmoth chorion genes and their 2003;13(9):2178–89. morphological consequences. Proc Natl Acad Sci U S A 57. Loytynoja ¨ A, Goldman N. From The Cover: An algo- 1987;84(2):479–83. rithm for progressive multiple alignment of sequences 67. Kim S, Kwak W, Kim K et al. The Japanese silk moth, Anther- with insertions. Proc Natl Acad Sci U S A 2005;102(30): aea yamamai, draft genome sequence. GigaScience Database 10557–62. 2017. http://dx.doi.org/10.5524/100382. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4662863 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Journal

GigaScienceOxford University Press

Published: Jan 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 12 million articles from more than
10,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Unlimited reading

Read as many articles as you need. Full articles with original layout, charts and figures. Read online, from anywhere.

Stay up to date

Keep up with your field with Personalized Recommendations and Follow Journals to get automatic updates.

Organize your research

It’s easy to organize your research with our built-in tools.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

Monthly Plan

  • Read unlimited articles
  • Personalized recommendations
  • No expiration
  • Print 20 pages per month
  • 20% off on PDF purchases
  • Organize your research
  • Get updates on your journals and topic searches

$49/month

Start Free Trial

14-day Free Trial

Best Deal — 39% off

Annual Plan

  • All the features of the Professional Plan, but for 39% off!
  • Billed annually
  • No expiration
  • For the normal price of 10 articles elsewhere, you get one full year of unlimited access to articles.

$588

$360/year

billed annually
Start Free Trial

14-day Free Trial