Draft genome of the milu (Elaphurus davidianus)

Draft genome of the milu (Elaphurus davidianus) Background: Milu, also known as Per ` e David’s deer (Elaphurus davidianus), was widely distributed in East Asia but recently experienced a severe bottleneck. Only 18 survived by the end of the 19th century, and the current population of 4500 individuals was propagated from just 11 kept by the 11th British Duke of Bedford. This species is known for its distinguishable appearance, the driving force behind which is still a mystery. To aid efforts to explore these phenomena, we constructed a draft genome of the species. Findings: In total, we generated 321.86 gigabases (Gb) of raw DNA sequence from whole-genome sequencing of a male milu deer using an Illumina HiSeq 2000 platform. Assembly yielded a final genome with a scaffold N50 size of 3.03 megabases (Mb) and a total length of 2.52 Gb. Moreover, we identified 20 125 protein-coding genes and 988.1 Mb of repetitive sequences. In addition, homology-based searches detected 280 rRNA, 1335 miRNA, 1441 snRNA, and 893 tRNA sequences in the milu genome. The divergence time between E. davidianus and Bos taurus was estimated to be about 28.20 million years ago (Mya). We identified 167 species-specific genes and 293 expanded gene families in the milu lineage. Conclusions: We report the first reference genome of milu, which will provide a valuable resource for studying the species’ demographic history of severe bottleneck and the genetic mechanism(s) of special phenotypic evolution. Keywords: Elaphurus davidianus; Reference genome; Evolution Data Description of the 19th century, and only 18 individuals survived in several European zoos at that time. The 18 surviving individuals were Background collected by the 11th British Duke of Bedford and kept at Woburn Abbey (UK), and only 11 participated in subsequent reproduction Per ` e David’s deer (Elaphurus davidianus), named after its western [3]. After this severe bottleneck, the milu population started to finder (Father Armand David) and called “milu” in China, was an recover. In the 1980s, dozens were reintroduced into China, and endemic species that was once widely distributed in East Asia there were more than 1500 in China and more than 3000 globally [1, 2]. Milu also has a colloquial name in China, Sibuxiang,which by 2004 [4]. Milu is highly interesting partly because of this bot- could be translated as “the 4 unlikes,” because it has the hooves tleneck and partly because it has atypical features for a cervid, of a cow, head of a horse, antlers of a deer, and tail of a donkey, such as a relatively long tail and unique branched antlers. Due but is not any of these animals (Fig. 1). Due to intense human and to these traits, scientists once identified it as the root of the sub- natural pressures, such as excessive hunting by humans and habitat degradation, milu became extinct in China by the end family Cervinae, but subsequent molecular analysis indicated Received: 29 June 2017; Revised: 12 October 2017; Accepted: 15 December 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 provided the original work is properly cited. by Ed 'DeepDyve' Gillespie user on 16 March 2018 1 2 Zhang et al. Figure 1: Photo of 2 fighting P er ` e David’s deer in Dafeng Milu National Reserves, Jiangsu, China. A red wound was spotted on the body of the deer to the right, and winning such fights generally increases mating chances. that milu is closer to the genus Cervus [5–9]. However, little is the expected “k-mer depth” is 25 (Fig. S1). The estimated known about the genetic architecture underlying milu’s unique milu genome size with these parameters is about 3.04 Gb phenotypic features and the population dynamics during its re- (Table S2). For comparison, we also used GCE software and the covery from the severe bottleneck. Thus, we constructed a draft GenomeScope package to estimate the milu genome size and genome for the species to facilitate investigation of effects of obtained estimates of 3.00 Gb and 2.78 Gb, respectively [11–13] the severe recent bottleneck and the molecular mechanisms in- (Fig. S2). All these estimated genome sizes are within the range volved in its phenotypic evolution. of C values (2.22 to 3.44) reported for Cervidae in the Animal Genome Size Database [14], indicating that our estimations are credible (Table S3). Library construction and filtering Genomic DNA was extracted from a male milu bred at the San Genome assembly Diego Zoo Safari Park, Escondido, California, USA, utilizing heart tissue collected at necropsy (NCBI Taxonomy ID, 43 332). The SOAPdenovo software, version 2.04 (SOAPdenovo, RRID: extracted DNA was used to construct short-insert libraries (170, SCR 010752)[15], was applied (with parameter settings 500, and 800 base pairs [bp]), and subsequently long-insert li- pregraph-K 79; contig -M 1; scaff –L 200 -b 1.5 -p 40) to construct braries (2, 5, 10, and 20 kilo bases [kb]). A HiSeq 2000 platform (Il- the original contigs and initial scaffolds using corrected reads lumina, San Diego, CA, USA) was subsequently used to sequence for the milu genome assembly. Then we used GapCloser, version paired-end reads of each library based on a whole-genome shot- 1.12 (GapCloser, RRID:SCR 015026)[10], to fill the gaps of initial gun sequencing strategy, generating 100-bp and 49-bp reads scaffolds using short-insert size PE reads (170, 500, and 800 bp). from the short-insert and long-insert libraries, respectively. In The initial scaffolds were then divided into scaff-tigs by the total, a 321.86-Gb raw dataset was obtained (Table S1). unfilled gaps. The divided scaff-tigs were connected to final Raw reads were filtered out that had: (1) >5% uncalled (“N”) scaffolds using SSPACE, version 3.0 (SSPACE, RRID:SCR 005056) bases or polyA structure; (2) ≥60 bases with quality scores ≤7 [16], with the following parameters: -x 0, -z 200, -g 2, -k 2, -n for reads generated from the short-insert library sequences; (3) 10. These final scaffolds’ gaps were also closed by GapCloser. ≥30 such bases for reads generated from the long-insert library The total length of our final milu genome assembly is 2.52 Gb, sequences; (4) more than 10 bp aligned to the adapter sequence; accounting for 85.71% of the estimated genome size. The final (5) read1 and read2 (of a short-insert PE read) overlapping by ≥10 contig N50 and scaffold N50 (>2 kb) sizes are 32.71 kb and 3.03 bp, allowing 10% mismatch; (6) duplicated polymerase chain re- Mb, respectively (Table 1). action (PCR) sequences. The low-quality bases at heads or tails of reads were also trimmed. After that, the short-insert library Quality assessment reads were corrected using SOAPec [10], a k-mer-based error cor- rection package. This resulted in a 244.84-Gb qualified dataset, To evaluate the quality of the milu genome assembly, the fil- representing about 82-fold genome coverage (Table S1). tered reads (≥49 bp) were aligned to the assembled genome sequences using SOAPaligner, version 2.20 (SOAPaligner/soap2, RRID:SCR 005503)[15], allowing 3 mismatches. We also se- Estimation of milu genome size quenced the genome of another male milu deer. The clean reads The milu genome size (G) was estimated by k-mer frequency obtained from this sequencing were also aligned to the assem- distribution analysis of the short-insert library, with a 1-bp bled genome by SOAPaligner with the same parameters. Both slide and k set at 17, using the formula G = k-mer number/k- alignments showed high coverage of each genome base, con- mer depth [10]. Here, “k-mer number” is 1 592 668 741 and firming accuracy at the base level (Fig. S3 and Fig. S4). In addition, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of milu 3 Table 1: Statistics of the assembled sequence length Contig Scaffold Size, bp Number Size, bp Number N90 8530 77 768 520 987 978 N80 14 483 55 968 1 045 447 647 N70 20 193 41 646 1 614 103 455 N60 26 169 30 975 2 222 401 322 N50 32 707 22 564 3 039 716 223 Longest 292 964 – 17 945 643 – Total size 2 460 119 591 – 2 524 831 955 – Percentage of unknown bases – – 2.56% Total number (≥100 bp) – 189 067 – 46 381 Total number (≥2 kb) – 118 986 – 4772 analysis with Benchmarking Universal Single-Copy Orthologs, de novo predictions. For homology-based predictions, homolo- version 2.0 (BUSCO, RRID:SCR 015008)[17], showed that the gous proteins of Homo sapiens (Ensembl 89 release), Bos taurus, assembly included complete matches for 3820 of 4104 mam- and Sus scrofa (Ensembl 89 release) were aligned to the repeat- malian BUSCOs (indicating 93.00% completeness) (Table S4). masked milu genome using TblastN (Blastall 2.2.23) with an E- Feature-Response Curves (FRC; version 1.3.0) [18] was then used value cutoff of 1e-5. Then aligned sequences and correspond- to evaluate the trade-off between its contiguity and correctness. ing query proteins were filtered and passed to GeneWise, ver- FRC generated by the software showed that our milu genome sion 2.2.0 (GeneWise, RRID:SCR 015054)[29], for accurate spliced assembly has similar correctness to published genomes of an- alignments. Gene sequences shorter than 150 bp and frame- other 3 ruminants: domestic goat (Capra hircus, ARS1, GenBank shifted or prematurely terminated genes were removed. De novo ID: GCF 0 017 04415.1) [19], sheep (Ovis aries, Oar v3.1), and cat- predictions were obtained from analysis of the repeat-masked tle (Bos taurus UMD3.1) (Fig. S5) [20, 21]. Subsequently, synteny genome using Augustus, version 2.5.5 (Augustus: Gene Predic- analysis was applied to identify differences between the assem- tion, RRID:SCR 008417)[30], and GENSCAN, version 1.0 (GEN- bled genome and the domestic goat (Capra hircus) genome using SCAN, RRID:SCR 012902)[31], with parameters generated from MUMmer (version 3.23) [22], with a 50% identity cutoff for MUMs training with Homo sapiens genes. The filter processes applied in the NUCmer alignments used to determine synteny (Fig. S6); in the homology-based prediction procedure were also applied 99.35% of the 2 genome sequences could be 1:1 aligned. In ad- in the de novo predictions. Next, the obtained results were in- dition, we compared the milu and goat genomes using LAST, tegrated using GLEAN (version 1.0.1) [32], and then genes with version 3 (LAST, RRID:SCR 006119)[23], to find the breakpoints few exons (≤3), which could not be aligned well in SwissProt or (edges of structural variation). The overall density of different TrEMBL, were filtered to produce a final consensus gene set con- types of breakpoints was about 54.76 per Mb, comparable to den- taining 20 125 genes. The number of genes, gene length distribu- sities reported in another study (Table S5) [24], and the average tion, exon number per gene, and intron length distribution were nuclear distance (percentage of different base pairs in the syn- similar to those of other mammals (Fig. S8 and Table S7). We tenic regions) was 6.56% (Fig. S7). The results indicated that the also identified a total of 2803 pseudogenes from GeneWise align- milu genome assembly has good completeness and continuity. ment, of which 2801 had prematurely terminating mutations and 1358 had frame-shifted mutations (Tables S8 and S9) [29]. Then, the KEGG, SwissProt, and TrEMBL databases were Repeat annotation searched for best matches to the final gene set using To annotate repeats, we first searched the milu genome for tan- BLASTP (version 2.2.26) with an E-value of 1e-5. Subse- dem repeats using Tandem Repeats Finder (version 4.04) [25] quently, InterProScan software, version 5.18–57.0 (InterProScan, with the following settings: Match = 2, Mismatch = 7, Delta = RRID:SCR 005829), was applied to map putative encoded pro- 7, PM = 80, PI = 10, Minscore = 50. Then, RepeatMasker ver- tein sequences against entries in the Pfam, PRINTS, ProDom, and SMART databases to identify known motifs and domains. sion 3.3.0 (RepeatMasker, RRID:SCR 012954) and RepeatProtein- Mask (version 3.3.0, a package in RepeatMasker) [26]wereused In total, at least 1 function was allocated to 17 913 (89.31%) to find known transposable element (TE) repeats in the Repbase of the genes in this manner (Table S10). Next, reads from the TE library (version 16.01) [27]. In addition, RepeatModeler version short-insert library with about 27-fold genome coverage were 1.0.5 (RepeatModeler, RRID:SCR 015027)and LTR FINDER version mapped to the milu genome using BWA, version 0.7.15-r1140 1.0.5 (LTR Finder, RRID:SCR 015247)[28] were used to construct (BWA, RRID:SCR 010910)[33], and subsequently called variants a de novo repeat library, and RepeatMasker was employed to find by SAMtools (version 1.3.1) [34]. Finally, SnpEff (version 4.10) [35] homolog repeats in the genome and classify the detected re- was applied to identify the distribution of single nucleotide vari- peats. The results indicated that long interspersed elements ac- ants (SNVs) in the milu genome (Table S11). counted for 27.05% of the milu genome, and other identified re- In addition, putative short noncoding RNAs were identi- peat sequences accounted for a further 13.99% (Table S6). fied by BLASTN alignment of human rRNA sequences with milu homologs. We employed Infernal, version 0.81 (Infer- nal, RRID:SCR 011809), with the Rfam database (release 9.1) Gene annotation to annotate the miRNA and snRNA genes. The tRNAs were To annotate structures and functions of putative genes in our annotated using tRNAscan-SE, version 1.3.1 (tRNAscan-SE, milu genome assembly, we used both homology-based and RRID:SCR 010835), with default parameters. In total, 3949 short Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Zhang et al. Figure 2: Phylogenetic relationships and genomic comparisons. (A) A Venn diagram of the orthologues shared among Elaphurus davidianus, Equus caballus, Capra hircus, Bos taurus, and Homo sapiens. Each number represents a gene family number, and the sum of the numbers in the green, yellow, brown, red, and blue areas indicate total numbers of the gene families in milu, horse, human, goat, and cattle genomes, respectively. (B) Divergence time estimates for the 5 species generated using MCMCtree and the 4-fold degenerate sites; the dots correspond to calibration points, and the divergence times were obtained from http://www.timetree.org/; blue nodal bars indicate 95% confidence intervals. noncoding RNA sequences were identified in the milu deer regions. Moreover, this genome assembly should be elevated to genome (Table S12). the chromosomal level in the future with Hi-C, optical mapping, or genetic mapping technologies. Species-specific genes and phylogenetic relationships Supporting data The detected milu genes were clustered in families using Or- The raw reads of each sequencing library have been deposited thoMCL, version 2.0.9 (OrthoMCL DB: Ortholog Groups of Pro- at NCBI, Project ID: PRJNA391565. For the assembled individual tein Sequences, RRID:SCR 007839)[36], with an E-value cutoff of (Sample ID: SAMN07270940), please refer to the SRA accession 1e-5, and Markov Chain Clustering with a default inflation pa- numbers in Table S1. The SRA accession number for the other se- rameter in an all-to-all BLASTP analysis of entries for 5 species quenced individual (Sample ID: SAMN08014286) is SRR6287186. (Homo sapiens, Equus caballus, Capra hircus, Bos taurus,and Ela- The assembly and annotation of the milu genome, together phurus davidianus). The results indicated that 69 gene families with further supporting data, are available via the GigaScience and 167 genes were specific to milu (Fig. 2A, Table S13). We database, GigaDB [42]. Supplementary figures and tables are pro- also detected 293 gene families that had apparently expanded vided in Additional file 1. in the milu lineage using Computational Analysis of gene Fam- ily Evolution (CAFE; version 4.0.1) [37]. The milu species-specific gene families were enriched in 4 gene ontology (GO) categories Additional files related to hormone activity, pinocytosis, ribosomes, and struc- tural constituents of ribosomes (Table S14). The expanded gene Figure S1: K-mer (k = 25) distribution in the milu genome. families were enriched in 34 GO categories: motor activity, AT- Figure S2: GenomeScope K-mer profile plot of the milu Pase activity, calcium ion binding, and 31 others (for details, see genome. Table S15). Subsequently, 7906 1:1 orthologs were identified Figure S3: Sequence depth distribution of the assembly data. from these species and aligned using PRANK (version 3.8.31) Figure S4: Sequence depth distribution of the assembly data [38]. Next, we extracted 4D sites (4-fold degenerate sites) to for the genome of the other sequenced individual. construct a phylogenetic tree by RAxML, version 7.2.8 (RAxML, Figure S5: Feature-response curves of 4 ruminant genome as- RRID:SCR 006086)[39], with the GTR+G+I model. Finally, phy- semblies. logenetic analysis by PAML MCMCtree (version 4.5) [40], cali- Figure S6: Visualized synteny between the milu and goat brated with published timings for the divergence of the refer- genomes. ence species [41], showed that Elaphurus davidianus, Bos taurus, Figure S7: DNA sequence divergence between milu and goat. and Capra hircus diverged from a common ancestor approxi- Figure S8: Comparison of gene lengths, intron lengths, exon mately 28.20 million years ago (Fig. 2B). lengths, and exon numbers in milu, cattle, human, and sheep In summary, we report the first sequencing, assembly, and genomes. annotation of the milu genome. The assembled draft genome Table S1: Summary of sequenced reads. will provide a valuable resource for studying the species’ evolu- Table S2: 17-mer depth distribution. tionary history, as well as genetic changes and associated phe- Table S3: Summary of the C values of Cervidae and estimated nomena, such as genetic load and selection pressures that oc- milu genome sizes. curred during its severe bottleneck or other unknown historical Table S4: Summary of BUSCO analysis of matches to the 4104 events. It should be noted that this draft assembly was generated mammalian BUSCOs. by next-generation sequencing data and there may be some er- Table S5: Summary of breakpoints between milu and goat rors in highly guanosine and cytosine (GC)-biased or repeated genomes. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of milu 5 Table S6: TE contents in the assembled milu genome. References Table S7: General statistics of predicted protein-coding genes. 1. Harrison RJ, Hamilton WJ. The reproductive tract and the pla- Table S8: Summary of the predicted pseudogenes. centa and membranes of Pere David’s deer (Elaphurus david- Table S9: List of predicted pseudogenes. ianus Milne Edwards). J Anat 1952;86(2):203–25. Table S10: Summary statistics of gene function annotation. 2. Cao K. On the time of extinction of the wild Mi-deer in China Table S11: Distribution of single nucleotide variants in the [Chinese]. Acta Zoologica Sinica 1978;24(3):289–91. milu genome. 3. JONES F. A contribution to the history and anatomy of Table S12: Summary of short ncRNA annotation. Per ` e David’s Deer (Elaphurus davidianus). J Zool 1951;2(121): Table S13: List of milu species-specific genes. 319–70. Table S14: GO term enrichment in milu species-specific 4. Ding Y. Chinese Milu Research [Chinese]. Changchun, China: genes. Jilin Publishing House for the Science and Technology; 2004. Table S15: GO term enrichment in gene families that have 5. Tate ML, Mathias HC, Fennessy PF et al. A new gene mapping expanded in milu. resource: interspecies hybrids between Pere David’s deer (Elaphurus davidianus) and red deer (Cervus elaphus). Genet- Abbreviations ics 1995;139(3):1383–91. 6. Slate J, Van Stijn TC, Anderson RM et al. A deer (subfamily bp: base pair; BUSCO: Benchmarking Universal Single-Copy Or- Cervinae) genetic linkage map and the evolution of ruminant thologs; FRC: feature-response curves; Gb: giga base; GO: gene genomes. Genetics 2002;160(4):1587–97. ontology; kb: kilo base; Mb: mega base; SNV: single nucleotide 7. Pitra C, Fickel J, Meijaard E et al. Evolution and phylogeny of variant; TE: transposable element. old world deer. Mol Phylogenet Evol 2004;33(3):880–95. 8. Maqbool NJ, Tate ML, Dodds KG et al. A QTL study of growth and body shape in the inter-species hybrid of Pere David’s Ethics approval deer (Elaphurus davidianus) and red deer (Cervus elaphus). Animal collection and utility protocols were approved by the Anim Genet 2007;38(3):270–6. Northwestern Polytechnical University and BGI-Shenzhen Lab- 9. Emerson BC, Tate ML. Genetic analysis of evolutionary oratory Animal Care and Use Committee and were in accor- relationships among deer (subfamily Cervinae). J Hered dance with guidelines from the China Council on Animal Care. 1993;84(4):266–73. Samples provided by San Diego Zoo Global (SDZG) were col- 10. Li R, Fan W, Tian G et al. The sequence and de novo assembly lected in accordance with SDZG’s Institutional Animal Care and of the giant panda genome. Nature 2010;463(7279):311–7. Use Committee policies, which meet or exceed US regulatory 11. Liu B, Shi Y, Fan W. Estimation of genomic characteristics standards for the humane care and treatment of animals in by analyzing k-mer frequency in de novo estimation of ge- research. nomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv preprint 2013. arXiv:1308.2012. 12. Vurture GW, Sedlazeck FJ, Nattestad M et al. GenomeScope: Competing interests fast reference-free genome profiling from short reads. Bioin- The authors declare that they have no competing interests. formatics 2017;33(14):2202–4. 13. Marcais G, Kingsford C. A fast, lock-free approach for effi- cient parallel counting of occurrences of k-mers. Bioinfor- Author contributions matics 2011;27(6):764–70. Q.Q., W.W., and G.Z. conceived the study. C.Z. and L.C. designed 14. Gregory TR. Animal Genome Size Database. 2017. http:// the scientific objectives. L.G.C. and O.A.R. evaluated and pro- www.genomesize.com. vided samples from San Diego Zoo Global. Y.Z. collected the 15. Li R, Zhu H, Ruan J et al. De novo assembly of human samples, extracted the genomic DNA, and constructed the DNA genomes with massively parallel short read sequencing. libraries. C.Z. and Y.Z. estimated the milu genome size and as- Genome Res 2010;20(2):265–72. sembled the genome. C.Z., L.C., and Y.Z. carried out the qual- 16. Boetzer M, Henkel CV, Jansen HJ et al. Scaffolding ity assessment, repeat annotation, and gene annotation. C.Z. pre-assembled contigs using SSPACE. Bioinformatics and K.W. were responsible for finding species-specific genes and 2011;27(4):578–9. phylogenetic relationship construction. L.C. and C.Z. uploaded 17. Simao FA, Waterhouse RM, Ioannidis P et al. BUSCO: assess- the raw read data, genome assembly, and annotation in the NCBI ing genome assembly and annotation completeness with and GigaScience (GigaDB) databases. C.Z., Q.Q., and W.W. wrote single-copy orthologs. Bioinformatics 2015;31(19):3210–2. the manuscript. Q.Q., W.W., and G.Z. supervised all aspects of 18. Vezzi F, Narzisi G, Mishra B. Reevaluating assembly evalua- the work to ensure the accuracy and integrity of the research tions with feature response curves: GAGE and assemblath- and data. O.A.R. contributed to editing the final manuscript. All ons. Plos One 2012;7(12):e52210. authors read and approved the final manuscript. 19. Bickhart DM, Rosen BD, Koren S et al. Single-molecule se- quencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Acknowledgements Genet 2017;49(4):643–50. This study was supported by grants from the Talents Team 20. Jiang Y, Xie M, Chen W et al. The sheep genome illumi- nates biology of the rumen and lipid metabolism. Science Construction Fund of Northwestern Polytechnical University (NWPU) to Q.Q. and W.W. We thank Nowbio Biotech Inc., Kun- 2014;344(6188):1168–73. 21. Elsik CG, Tellam RL, Worley KC et al. The genome sequence of ming, China, for excellent work on DNA library construction and assistance during the genome sequencing. This project was ini- taurine cattle: a window to ruminant biology and evolution. Science 2009;324(5926):522–8. tiated under the auspices of the Genome 10K Project. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Zhang et al. 22. Delcher AL, Salzberg SL, Phillippy AM. Using MUMmer to 33. Li H. Aligning sequence reads, clone sequences and identify similar regions in large sequence sets. Curr Protoc assembly contigs with BWA-MEM. arXiv preprint 2013. Bioinformatics 2003;Chapter 10:10–13. arXiv:13033997. 23. Kielbasa SM, Wan R, Sato K et al. Adaptive seeds tame 34. Li H, Handsaker B, Wysoker A et al. The Sequence Alignment/ genomic sequence comparison. Genome Res 2011;21(3): Map format and SAMtools. Bioinformatics 2009;25(16): 487–93. 2078–9. 24. Wang K, Wang L, Lenstra JA et al. The genome se- 35. Cingolani P, Platts A, Wang LL et al. A program for annotat- quence of the wisent (Bison bonasus). Gigascience 2017;6(4): ing and predicting the effects of single nucleotide polymor- 1–5. phisms, SnpEff. Fly 2012;6(2):80–92. 25. Benson G. Tandem repeats finder: a program to analyze DNA 36. Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of sequences. Nucleic Acids Res 1999;27(2):573–80. ortholog groups for eukaryotic genomes. Genome Res 26. Tarailo-Graovac M, Chen N. Using RepeatMasker to iden- 2003;13(9):2178–89. tify repetitive elements in genomic sequences. Curr Protoc 37. De Bie T, Cristianini N, Demuth JP et al. CAFE: a computa- Bioinformatics 2009;Chapter 4:4–10. tional tool for the study of gene family evolution. Bioinfor- 27. Jurka J, Kapitonov VV, Pavlicek A et al. Repbase Update, matics 2006;22(10):1269–71. a database of eukaryotic repetitive elements. Cytogenet 38. Loytynoja A, Goldman N. From the cover: an algorithm for Genome Res 2005;110(1–4):462–7. progressive multiple alignment of sequences with inser- 28. Xu Z, Wang H. LTR˙FINDER: an efficient tool for the predic- tions. Proc Natl Acad Sci U S A 2005;102(30):10557–62. tion of full-length LTR retrotransposons. Nucleic Acids Res 39. Stamatakis A. RAxML version 8: a tool for phylogenetic anal- 2007;35(Web Server):W265–8. ysis and post-analysis of large phylogenies. Bioinformatics 29. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. 2014;30(9):1312–3. Genome Res 2004;14(5):988–95. 40. Yang Z. PAML 4: phylogenetic analysis by maximum likeli- 30. Stanke M, Keller O, Gunduz I et al. AUGUSTUS: ab ini- hood. Mol Biol Evol 2007;24(8):1586–91. tio prediction of alternative transcripts. Nucleic Acids Res 41. Kumar S, Stecher G, Suleski M, Hedges SB. TimeTree: A Re- 2006;34(Web Server):W435–9. source for Timelines, Timetrees, and Divergence Times. Mol 31. Burge C, Karlin S. Prediction of complete gene structures in Biol Evol 2017;34(7):1812–9. human genomic DNA. J Mol Biol 1997;268(1):78–94. 42. Zhang C, Chen L, Zhou Y et al. Supporting data for “Draft 32. Elsik CG, Mackey AJ, Reese JT et al. Creating a honey bee con- genome of the milu (Elaphurus davidianus).” GigaScience sensus gene set. Genome Biol 2007;8(1):R13. Database 2017. http://dx.doi.org/10.5524/100383. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Draft genome of the milu (Elaphurus davidianus)

Free
6 pages

Loading next page...
 
/lp/ou_press/draft-genome-of-the-milu-elaphurus-davidianus-lOzBGwGjpn
Publisher
BGI
Copyright
© The Author(s) 2017. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/gix130
Publisher site
See Article on Publisher Site

Abstract

Background: Milu, also known as Per ` e David’s deer (Elaphurus davidianus), was widely distributed in East Asia but recently experienced a severe bottleneck. Only 18 survived by the end of the 19th century, and the current population of 4500 individuals was propagated from just 11 kept by the 11th British Duke of Bedford. This species is known for its distinguishable appearance, the driving force behind which is still a mystery. To aid efforts to explore these phenomena, we constructed a draft genome of the species. Findings: In total, we generated 321.86 gigabases (Gb) of raw DNA sequence from whole-genome sequencing of a male milu deer using an Illumina HiSeq 2000 platform. Assembly yielded a final genome with a scaffold N50 size of 3.03 megabases (Mb) and a total length of 2.52 Gb. Moreover, we identified 20 125 protein-coding genes and 988.1 Mb of repetitive sequences. In addition, homology-based searches detected 280 rRNA, 1335 miRNA, 1441 snRNA, and 893 tRNA sequences in the milu genome. The divergence time between E. davidianus and Bos taurus was estimated to be about 28.20 million years ago (Mya). We identified 167 species-specific genes and 293 expanded gene families in the milu lineage. Conclusions: We report the first reference genome of milu, which will provide a valuable resource for studying the species’ demographic history of severe bottleneck and the genetic mechanism(s) of special phenotypic evolution. Keywords: Elaphurus davidianus; Reference genome; Evolution Data Description of the 19th century, and only 18 individuals survived in several European zoos at that time. The 18 surviving individuals were Background collected by the 11th British Duke of Bedford and kept at Woburn Abbey (UK), and only 11 participated in subsequent reproduction Per ` e David’s deer (Elaphurus davidianus), named after its western [3]. After this severe bottleneck, the milu population started to finder (Father Armand David) and called “milu” in China, was an recover. In the 1980s, dozens were reintroduced into China, and endemic species that was once widely distributed in East Asia there were more than 1500 in China and more than 3000 globally [1, 2]. Milu also has a colloquial name in China, Sibuxiang,which by 2004 [4]. Milu is highly interesting partly because of this bot- could be translated as “the 4 unlikes,” because it has the hooves tleneck and partly because it has atypical features for a cervid, of a cow, head of a horse, antlers of a deer, and tail of a donkey, such as a relatively long tail and unique branched antlers. Due but is not any of these animals (Fig. 1). Due to intense human and to these traits, scientists once identified it as the root of the sub- natural pressures, such as excessive hunting by humans and habitat degradation, milu became extinct in China by the end family Cervinae, but subsequent molecular analysis indicated Received: 29 June 2017; Revised: 12 October 2017; Accepted: 15 December 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 provided the original work is properly cited. by Ed 'DeepDyve' Gillespie user on 16 March 2018 1 2 Zhang et al. Figure 1: Photo of 2 fighting P er ` e David’s deer in Dafeng Milu National Reserves, Jiangsu, China. A red wound was spotted on the body of the deer to the right, and winning such fights generally increases mating chances. that milu is closer to the genus Cervus [5–9]. However, little is the expected “k-mer depth” is 25 (Fig. S1). The estimated known about the genetic architecture underlying milu’s unique milu genome size with these parameters is about 3.04 Gb phenotypic features and the population dynamics during its re- (Table S2). For comparison, we also used GCE software and the covery from the severe bottleneck. Thus, we constructed a draft GenomeScope package to estimate the milu genome size and genome for the species to facilitate investigation of effects of obtained estimates of 3.00 Gb and 2.78 Gb, respectively [11–13] the severe recent bottleneck and the molecular mechanisms in- (Fig. S2). All these estimated genome sizes are within the range volved in its phenotypic evolution. of C values (2.22 to 3.44) reported for Cervidae in the Animal Genome Size Database [14], indicating that our estimations are credible (Table S3). Library construction and filtering Genomic DNA was extracted from a male milu bred at the San Genome assembly Diego Zoo Safari Park, Escondido, California, USA, utilizing heart tissue collected at necropsy (NCBI Taxonomy ID, 43 332). The SOAPdenovo software, version 2.04 (SOAPdenovo, RRID: extracted DNA was used to construct short-insert libraries (170, SCR 010752)[15], was applied (with parameter settings 500, and 800 base pairs [bp]), and subsequently long-insert li- pregraph-K 79; contig -M 1; scaff –L 200 -b 1.5 -p 40) to construct braries (2, 5, 10, and 20 kilo bases [kb]). A HiSeq 2000 platform (Il- the original contigs and initial scaffolds using corrected reads lumina, San Diego, CA, USA) was subsequently used to sequence for the milu genome assembly. Then we used GapCloser, version paired-end reads of each library based on a whole-genome shot- 1.12 (GapCloser, RRID:SCR 015026)[10], to fill the gaps of initial gun sequencing strategy, generating 100-bp and 49-bp reads scaffolds using short-insert size PE reads (170, 500, and 800 bp). from the short-insert and long-insert libraries, respectively. In The initial scaffolds were then divided into scaff-tigs by the total, a 321.86-Gb raw dataset was obtained (Table S1). unfilled gaps. The divided scaff-tigs were connected to final Raw reads were filtered out that had: (1) >5% uncalled (“N”) scaffolds using SSPACE, version 3.0 (SSPACE, RRID:SCR 005056) bases or polyA structure; (2) ≥60 bases with quality scores ≤7 [16], with the following parameters: -x 0, -z 200, -g 2, -k 2, -n for reads generated from the short-insert library sequences; (3) 10. These final scaffolds’ gaps were also closed by GapCloser. ≥30 such bases for reads generated from the long-insert library The total length of our final milu genome assembly is 2.52 Gb, sequences; (4) more than 10 bp aligned to the adapter sequence; accounting for 85.71% of the estimated genome size. The final (5) read1 and read2 (of a short-insert PE read) overlapping by ≥10 contig N50 and scaffold N50 (>2 kb) sizes are 32.71 kb and 3.03 bp, allowing 10% mismatch; (6) duplicated polymerase chain re- Mb, respectively (Table 1). action (PCR) sequences. The low-quality bases at heads or tails of reads were also trimmed. After that, the short-insert library Quality assessment reads were corrected using SOAPec [10], a k-mer-based error cor- rection package. This resulted in a 244.84-Gb qualified dataset, To evaluate the quality of the milu genome assembly, the fil- representing about 82-fold genome coverage (Table S1). tered reads (≥49 bp) were aligned to the assembled genome sequences using SOAPaligner, version 2.20 (SOAPaligner/soap2, RRID:SCR 005503)[15], allowing 3 mismatches. We also se- Estimation of milu genome size quenced the genome of another male milu deer. The clean reads The milu genome size (G) was estimated by k-mer frequency obtained from this sequencing were also aligned to the assem- distribution analysis of the short-insert library, with a 1-bp bled genome by SOAPaligner with the same parameters. Both slide and k set at 17, using the formula G = k-mer number/k- alignments showed high coverage of each genome base, con- mer depth [10]. Here, “k-mer number” is 1 592 668 741 and firming accuracy at the base level (Fig. S3 and Fig. S4). In addition, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of milu 3 Table 1: Statistics of the assembled sequence length Contig Scaffold Size, bp Number Size, bp Number N90 8530 77 768 520 987 978 N80 14 483 55 968 1 045 447 647 N70 20 193 41 646 1 614 103 455 N60 26 169 30 975 2 222 401 322 N50 32 707 22 564 3 039 716 223 Longest 292 964 – 17 945 643 – Total size 2 460 119 591 – 2 524 831 955 – Percentage of unknown bases – – 2.56% Total number (≥100 bp) – 189 067 – 46 381 Total number (≥2 kb) – 118 986 – 4772 analysis with Benchmarking Universal Single-Copy Orthologs, de novo predictions. For homology-based predictions, homolo- version 2.0 (BUSCO, RRID:SCR 015008)[17], showed that the gous proteins of Homo sapiens (Ensembl 89 release), Bos taurus, assembly included complete matches for 3820 of 4104 mam- and Sus scrofa (Ensembl 89 release) were aligned to the repeat- malian BUSCOs (indicating 93.00% completeness) (Table S4). masked milu genome using TblastN (Blastall 2.2.23) with an E- Feature-Response Curves (FRC; version 1.3.0) [18] was then used value cutoff of 1e-5. Then aligned sequences and correspond- to evaluate the trade-off between its contiguity and correctness. ing query proteins were filtered and passed to GeneWise, ver- FRC generated by the software showed that our milu genome sion 2.2.0 (GeneWise, RRID:SCR 015054)[29], for accurate spliced assembly has similar correctness to published genomes of an- alignments. Gene sequences shorter than 150 bp and frame- other 3 ruminants: domestic goat (Capra hircus, ARS1, GenBank shifted or prematurely terminated genes were removed. De novo ID: GCF 0 017 04415.1) [19], sheep (Ovis aries, Oar v3.1), and cat- predictions were obtained from analysis of the repeat-masked tle (Bos taurus UMD3.1) (Fig. S5) [20, 21]. Subsequently, synteny genome using Augustus, version 2.5.5 (Augustus: Gene Predic- analysis was applied to identify differences between the assem- tion, RRID:SCR 008417)[30], and GENSCAN, version 1.0 (GEN- bled genome and the domestic goat (Capra hircus) genome using SCAN, RRID:SCR 012902)[31], with parameters generated from MUMmer (version 3.23) [22], with a 50% identity cutoff for MUMs training with Homo sapiens genes. The filter processes applied in the NUCmer alignments used to determine synteny (Fig. S6); in the homology-based prediction procedure were also applied 99.35% of the 2 genome sequences could be 1:1 aligned. In ad- in the de novo predictions. Next, the obtained results were in- dition, we compared the milu and goat genomes using LAST, tegrated using GLEAN (version 1.0.1) [32], and then genes with version 3 (LAST, RRID:SCR 006119)[23], to find the breakpoints few exons (≤3), which could not be aligned well in SwissProt or (edges of structural variation). The overall density of different TrEMBL, were filtered to produce a final consensus gene set con- types of breakpoints was about 54.76 per Mb, comparable to den- taining 20 125 genes. The number of genes, gene length distribu- sities reported in another study (Table S5) [24], and the average tion, exon number per gene, and intron length distribution were nuclear distance (percentage of different base pairs in the syn- similar to those of other mammals (Fig. S8 and Table S7). We tenic regions) was 6.56% (Fig. S7). The results indicated that the also identified a total of 2803 pseudogenes from GeneWise align- milu genome assembly has good completeness and continuity. ment, of which 2801 had prematurely terminating mutations and 1358 had frame-shifted mutations (Tables S8 and S9) [29]. Then, the KEGG, SwissProt, and TrEMBL databases were Repeat annotation searched for best matches to the final gene set using To annotate repeats, we first searched the milu genome for tan- BLASTP (version 2.2.26) with an E-value of 1e-5. Subse- dem repeats using Tandem Repeats Finder (version 4.04) [25] quently, InterProScan software, version 5.18–57.0 (InterProScan, with the following settings: Match = 2, Mismatch = 7, Delta = RRID:SCR 005829), was applied to map putative encoded pro- 7, PM = 80, PI = 10, Minscore = 50. Then, RepeatMasker ver- tein sequences against entries in the Pfam, PRINTS, ProDom, and SMART databases to identify known motifs and domains. sion 3.3.0 (RepeatMasker, RRID:SCR 012954) and RepeatProtein- Mask (version 3.3.0, a package in RepeatMasker) [26]wereused In total, at least 1 function was allocated to 17 913 (89.31%) to find known transposable element (TE) repeats in the Repbase of the genes in this manner (Table S10). Next, reads from the TE library (version 16.01) [27]. In addition, RepeatModeler version short-insert library with about 27-fold genome coverage were 1.0.5 (RepeatModeler, RRID:SCR 015027)and LTR FINDER version mapped to the milu genome using BWA, version 0.7.15-r1140 1.0.5 (LTR Finder, RRID:SCR 015247)[28] were used to construct (BWA, RRID:SCR 010910)[33], and subsequently called variants a de novo repeat library, and RepeatMasker was employed to find by SAMtools (version 1.3.1) [34]. Finally, SnpEff (version 4.10) [35] homolog repeats in the genome and classify the detected re- was applied to identify the distribution of single nucleotide vari- peats. The results indicated that long interspersed elements ac- ants (SNVs) in the milu genome (Table S11). counted for 27.05% of the milu genome, and other identified re- In addition, putative short noncoding RNAs were identi- peat sequences accounted for a further 13.99% (Table S6). fied by BLASTN alignment of human rRNA sequences with milu homologs. We employed Infernal, version 0.81 (Infer- nal, RRID:SCR 011809), with the Rfam database (release 9.1) Gene annotation to annotate the miRNA and snRNA genes. The tRNAs were To annotate structures and functions of putative genes in our annotated using tRNAscan-SE, version 1.3.1 (tRNAscan-SE, milu genome assembly, we used both homology-based and RRID:SCR 010835), with default parameters. In total, 3949 short Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Zhang et al. Figure 2: Phylogenetic relationships and genomic comparisons. (A) A Venn diagram of the orthologues shared among Elaphurus davidianus, Equus caballus, Capra hircus, Bos taurus, and Homo sapiens. Each number represents a gene family number, and the sum of the numbers in the green, yellow, brown, red, and blue areas indicate total numbers of the gene families in milu, horse, human, goat, and cattle genomes, respectively. (B) Divergence time estimates for the 5 species generated using MCMCtree and the 4-fold degenerate sites; the dots correspond to calibration points, and the divergence times were obtained from http://www.timetree.org/; blue nodal bars indicate 95% confidence intervals. noncoding RNA sequences were identified in the milu deer regions. Moreover, this genome assembly should be elevated to genome (Table S12). the chromosomal level in the future with Hi-C, optical mapping, or genetic mapping technologies. Species-specific genes and phylogenetic relationships Supporting data The detected milu genes were clustered in families using Or- The raw reads of each sequencing library have been deposited thoMCL, version 2.0.9 (OrthoMCL DB: Ortholog Groups of Pro- at NCBI, Project ID: PRJNA391565. For the assembled individual tein Sequences, RRID:SCR 007839)[36], with an E-value cutoff of (Sample ID: SAMN07270940), please refer to the SRA accession 1e-5, and Markov Chain Clustering with a default inflation pa- numbers in Table S1. The SRA accession number for the other se- rameter in an all-to-all BLASTP analysis of entries for 5 species quenced individual (Sample ID: SAMN08014286) is SRR6287186. (Homo sapiens, Equus caballus, Capra hircus, Bos taurus,and Ela- The assembly and annotation of the milu genome, together phurus davidianus). The results indicated that 69 gene families with further supporting data, are available via the GigaScience and 167 genes were specific to milu (Fig. 2A, Table S13). We database, GigaDB [42]. Supplementary figures and tables are pro- also detected 293 gene families that had apparently expanded vided in Additional file 1. in the milu lineage using Computational Analysis of gene Fam- ily Evolution (CAFE; version 4.0.1) [37]. The milu species-specific gene families were enriched in 4 gene ontology (GO) categories Additional files related to hormone activity, pinocytosis, ribosomes, and struc- tural constituents of ribosomes (Table S14). The expanded gene Figure S1: K-mer (k = 25) distribution in the milu genome. families were enriched in 34 GO categories: motor activity, AT- Figure S2: GenomeScope K-mer profile plot of the milu Pase activity, calcium ion binding, and 31 others (for details, see genome. Table S15). Subsequently, 7906 1:1 orthologs were identified Figure S3: Sequence depth distribution of the assembly data. from these species and aligned using PRANK (version 3.8.31) Figure S4: Sequence depth distribution of the assembly data [38]. Next, we extracted 4D sites (4-fold degenerate sites) to for the genome of the other sequenced individual. construct a phylogenetic tree by RAxML, version 7.2.8 (RAxML, Figure S5: Feature-response curves of 4 ruminant genome as- RRID:SCR 006086)[39], with the GTR+G+I model. Finally, phy- semblies. logenetic analysis by PAML MCMCtree (version 4.5) [40], cali- Figure S6: Visualized synteny between the milu and goat brated with published timings for the divergence of the refer- genomes. ence species [41], showed that Elaphurus davidianus, Bos taurus, Figure S7: DNA sequence divergence between milu and goat. and Capra hircus diverged from a common ancestor approxi- Figure S8: Comparison of gene lengths, intron lengths, exon mately 28.20 million years ago (Fig. 2B). lengths, and exon numbers in milu, cattle, human, and sheep In summary, we report the first sequencing, assembly, and genomes. annotation of the milu genome. The assembled draft genome Table S1: Summary of sequenced reads. will provide a valuable resource for studying the species’ evolu- Table S2: 17-mer depth distribution. tionary history, as well as genetic changes and associated phe- Table S3: Summary of the C values of Cervidae and estimated nomena, such as genetic load and selection pressures that oc- milu genome sizes. curred during its severe bottleneck or other unknown historical Table S4: Summary of BUSCO analysis of matches to the 4104 events. It should be noted that this draft assembly was generated mammalian BUSCOs. by next-generation sequencing data and there may be some er- Table S5: Summary of breakpoints between milu and goat rors in highly guanosine and cytosine (GC)-biased or repeated genomes. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Genome of milu 5 Table S6: TE contents in the assembled milu genome. References Table S7: General statistics of predicted protein-coding genes. 1. Harrison RJ, Hamilton WJ. The reproductive tract and the pla- Table S8: Summary of the predicted pseudogenes. centa and membranes of Pere David’s deer (Elaphurus david- Table S9: List of predicted pseudogenes. ianus Milne Edwards). J Anat 1952;86(2):203–25. Table S10: Summary statistics of gene function annotation. 2. Cao K. On the time of extinction of the wild Mi-deer in China Table S11: Distribution of single nucleotide variants in the [Chinese]. Acta Zoologica Sinica 1978;24(3):289–91. milu genome. 3. JONES F. A contribution to the history and anatomy of Table S12: Summary of short ncRNA annotation. Per ` e David’s Deer (Elaphurus davidianus). J Zool 1951;2(121): Table S13: List of milu species-specific genes. 319–70. Table S14: GO term enrichment in milu species-specific 4. Ding Y. Chinese Milu Research [Chinese]. Changchun, China: genes. Jilin Publishing House for the Science and Technology; 2004. Table S15: GO term enrichment in gene families that have 5. Tate ML, Mathias HC, Fennessy PF et al. A new gene mapping expanded in milu. resource: interspecies hybrids between Pere David’s deer (Elaphurus davidianus) and red deer (Cervus elaphus). Genet- Abbreviations ics 1995;139(3):1383–91. 6. Slate J, Van Stijn TC, Anderson RM et al. A deer (subfamily bp: base pair; BUSCO: Benchmarking Universal Single-Copy Or- Cervinae) genetic linkage map and the evolution of ruminant thologs; FRC: feature-response curves; Gb: giga base; GO: gene genomes. Genetics 2002;160(4):1587–97. ontology; kb: kilo base; Mb: mega base; SNV: single nucleotide 7. Pitra C, Fickel J, Meijaard E et al. Evolution and phylogeny of variant; TE: transposable element. old world deer. Mol Phylogenet Evol 2004;33(3):880–95. 8. Maqbool NJ, Tate ML, Dodds KG et al. A QTL study of growth and body shape in the inter-species hybrid of Pere David’s Ethics approval deer (Elaphurus davidianus) and red deer (Cervus elaphus). Animal collection and utility protocols were approved by the Anim Genet 2007;38(3):270–6. Northwestern Polytechnical University and BGI-Shenzhen Lab- 9. Emerson BC, Tate ML. Genetic analysis of evolutionary oratory Animal Care and Use Committee and were in accor- relationships among deer (subfamily Cervinae). J Hered dance with guidelines from the China Council on Animal Care. 1993;84(4):266–73. Samples provided by San Diego Zoo Global (SDZG) were col- 10. Li R, Fan W, Tian G et al. The sequence and de novo assembly lected in accordance with SDZG’s Institutional Animal Care and of the giant panda genome. Nature 2010;463(7279):311–7. Use Committee policies, which meet or exceed US regulatory 11. Liu B, Shi Y, Fan W. Estimation of genomic characteristics standards for the humane care and treatment of animals in by analyzing k-mer frequency in de novo estimation of ge- research. nomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv preprint 2013. arXiv:1308.2012. 12. Vurture GW, Sedlazeck FJ, Nattestad M et al. GenomeScope: Competing interests fast reference-free genome profiling from short reads. Bioin- The authors declare that they have no competing interests. formatics 2017;33(14):2202–4. 13. Marcais G, Kingsford C. A fast, lock-free approach for effi- cient parallel counting of occurrences of k-mers. Bioinfor- Author contributions matics 2011;27(6):764–70. Q.Q., W.W., and G.Z. conceived the study. C.Z. and L.C. designed 14. Gregory TR. Animal Genome Size Database. 2017. http:// the scientific objectives. L.G.C. and O.A.R. evaluated and pro- www.genomesize.com. vided samples from San Diego Zoo Global. Y.Z. collected the 15. Li R, Zhu H, Ruan J et al. De novo assembly of human samples, extracted the genomic DNA, and constructed the DNA genomes with massively parallel short read sequencing. libraries. C.Z. and Y.Z. estimated the milu genome size and as- Genome Res 2010;20(2):265–72. sembled the genome. C.Z., L.C., and Y.Z. carried out the qual- 16. Boetzer M, Henkel CV, Jansen HJ et al. Scaffolding ity assessment, repeat annotation, and gene annotation. C.Z. pre-assembled contigs using SSPACE. Bioinformatics and K.W. were responsible for finding species-specific genes and 2011;27(4):578–9. phylogenetic relationship construction. L.C. and C.Z. uploaded 17. Simao FA, Waterhouse RM, Ioannidis P et al. BUSCO: assess- the raw read data, genome assembly, and annotation in the NCBI ing genome assembly and annotation completeness with and GigaScience (GigaDB) databases. C.Z., Q.Q., and W.W. wrote single-copy orthologs. Bioinformatics 2015;31(19):3210–2. the manuscript. Q.Q., W.W., and G.Z. supervised all aspects of 18. Vezzi F, Narzisi G, Mishra B. Reevaluating assembly evalua- the work to ensure the accuracy and integrity of the research tions with feature response curves: GAGE and assemblath- and data. O.A.R. contributed to editing the final manuscript. All ons. Plos One 2012;7(12):e52210. authors read and approved the final manuscript. 19. Bickhart DM, Rosen BD, Koren S et al. Single-molecule se- quencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Acknowledgements Genet 2017;49(4):643–50. This study was supported by grants from the Talents Team 20. Jiang Y, Xie M, Chen W et al. The sheep genome illumi- nates biology of the rumen and lipid metabolism. Science Construction Fund of Northwestern Polytechnical University (NWPU) to Q.Q. and W.W. We thank Nowbio Biotech Inc., Kun- 2014;344(6188):1168–73. 21. Elsik CG, Tellam RL, Worley KC et al. The genome sequence of ming, China, for excellent work on DNA library construction and assistance during the genome sequencing. This project was ini- taurine cattle: a window to ruminant biology and evolution. Science 2009;324(5926):522–8. tiated under the auspices of the Genome 10K Project. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Zhang et al. 22. Delcher AL, Salzberg SL, Phillippy AM. Using MUMmer to 33. Li H. Aligning sequence reads, clone sequences and identify similar regions in large sequence sets. Curr Protoc assembly contigs with BWA-MEM. arXiv preprint 2013. Bioinformatics 2003;Chapter 10:10–13. arXiv:13033997. 23. Kielbasa SM, Wan R, Sato K et al. Adaptive seeds tame 34. Li H, Handsaker B, Wysoker A et al. The Sequence Alignment/ genomic sequence comparison. Genome Res 2011;21(3): Map format and SAMtools. Bioinformatics 2009;25(16): 487–93. 2078–9. 24. Wang K, Wang L, Lenstra JA et al. The genome se- 35. Cingolani P, Platts A, Wang LL et al. A program for annotat- quence of the wisent (Bison bonasus). Gigascience 2017;6(4): ing and predicting the effects of single nucleotide polymor- 1–5. phisms, SnpEff. Fly 2012;6(2):80–92. 25. Benson G. Tandem repeats finder: a program to analyze DNA 36. Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of sequences. Nucleic Acids Res 1999;27(2):573–80. ortholog groups for eukaryotic genomes. Genome Res 26. Tarailo-Graovac M, Chen N. Using RepeatMasker to iden- 2003;13(9):2178–89. tify repetitive elements in genomic sequences. Curr Protoc 37. De Bie T, Cristianini N, Demuth JP et al. CAFE: a computa- Bioinformatics 2009;Chapter 4:4–10. tional tool for the study of gene family evolution. Bioinfor- 27. Jurka J, Kapitonov VV, Pavlicek A et al. Repbase Update, matics 2006;22(10):1269–71. a database of eukaryotic repetitive elements. Cytogenet 38. Loytynoja A, Goldman N. From the cover: an algorithm for Genome Res 2005;110(1–4):462–7. progressive multiple alignment of sequences with inser- 28. Xu Z, Wang H. LTR˙FINDER: an efficient tool for the predic- tions. Proc Natl Acad Sci U S A 2005;102(30):10557–62. tion of full-length LTR retrotransposons. Nucleic Acids Res 39. Stamatakis A. RAxML version 8: a tool for phylogenetic anal- 2007;35(Web Server):W265–8. ysis and post-analysis of large phylogenies. Bioinformatics 29. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. 2014;30(9):1312–3. Genome Res 2004;14(5):988–95. 40. Yang Z. PAML 4: phylogenetic analysis by maximum likeli- 30. Stanke M, Keller O, Gunduz I et al. AUGUSTUS: ab ini- hood. Mol Biol Evol 2007;24(8):1586–91. tio prediction of alternative transcripts. Nucleic Acids Res 41. Kumar S, Stecher G, Suleski M, Hedges SB. TimeTree: A Re- 2006;34(Web Server):W435–9. source for Timelines, Timetrees, and Divergence Times. Mol 31. Burge C, Karlin S. Prediction of complete gene structures in Biol Evol 2017;34(7):1812–9. human genomic DNA. J Mol Biol 1997;268(1):78–94. 42. Zhang C, Chen L, Zhou Y et al. Supporting data for “Draft 32. Elsik CG, Mackey AJ, Reese JT et al. Creating a honey bee con- genome of the milu (Elaphurus davidianus).” GigaScience sensus gene set. Genome Biol 2007;8(1):R13. Database 2017. http://dx.doi.org/10.5524/100383. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4757066 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Journal

GigaScienceOxford University Press

Published: Feb 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 12 million articles from more than
10,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Unlimited reading

Read as many articles as you need. Full articles with original layout, charts and figures. Read online, from anywhere.

Stay up to date

Keep up with your field with Personalized Recommendations and Follow Journals to get automatic updates.

Organize your research

It’s easy to organize your research with our built-in tools.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

Monthly Plan

  • Read unlimited articles
  • Personalized recommendations
  • No expiration
  • Print 20 pages per month
  • 20% off on PDF purchases
  • Organize your research
  • Get updates on your journals and topic searches

$49/month

Start Free Trial

14-day Free Trial

Best Deal — 39% off

Annual Plan

  • All the features of the Professional Plan, but for 39% off!
  • Billed annually
  • No expiration
  • For the normal price of 10 articles elsewhere, you get one full year of unlimited access to articles.

$588

$360/year

billed annually
Start Free Trial

14-day Free Trial