Innovative assembly strategy contributes to understanding the evolution and conservation genetics of the endangered Solenodon paradoxus from the island of Hispaniola

Innovative assembly strategy contributes to understanding the evolution and conservation genetics... Photo of a Solenodon paradoxus from the island of Hispaniola, courtesy of Eladio Fernandez, Caribbean Nature Photography Grigorev et al. Grigorev et al. GigaScience 2018, 7:6 https://doi.org/10.1093/gigascience/giy025 Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 GigaScience, 7, 2018, 1–21 doi: 10.1093/gigascience/giy025 Advance Access Publication Date: 16 March 2018 Research RESEARCH Innovative assembly strategy contributes to understanding the evolution and conservation genetics of the endangered Solenodon paradoxus from the island of Hispaniola 1,† 2,† 2 2 Kirill Grigorev ,SergeyKliver , Pavel Dobrynin , Aleksey Komissarov , 1,3 2 Walter Wolfsberger , Ksenia Krasheninnikova , 1 4,5 6 Yashira M. Afanador-Hernandez ´ ,AdamL.Brandt , Liz A. Paulino , 6 6 7 4,8 Rosanna Carreras , Luis E. Rodr´ıguez ,AdrellNu´ nez ˜ , Jessica R. Brandt , 9,10 11 1 Filipe Silva ,J.David Hernandez-Martic ´ h , Audrey J. Majeske , 9,10 4,12 2,13 Agostinho Antunes , Alfred L. Roca , Stephen J. O’Brien , 1 1,3,∗ Juan Carlos Mart´ınez-Cruzado and Taras K. Oleksyk 1 2 Department of Biology, University of Puerto Rico at Mayaguez, ¨ Mayaguez, ¨ Puerto Rico, Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia, Biology Department, Uzhhorod National University, Uzhhorod, Ukraine, Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA, Division of Natural Sciences, St. Norbert College, De Pere, Wisconsin, USA, Instituto Tecnolog ´ ico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic, Department of Conservation and Science, Parque Zoologico Nacional (ZOODOM), Santo Domingo, Dominican 8 9 Republic, Department of Biology, Marian University, Fond du Lac, Wisconsin, USA, CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixoes, ˜ Av. General Norton de Matos, s/n, 4450–208 Porto, Portugal, Department of Biology, Faculty of Sciences, University of Porto. Rua do Campo Alegre, 4169-007 Porto, Portugal, Instituto de Investigaciones Botanicas ´ y Zoolog ´ icas, Universidad Autonoma ´ de Santo Domingo, Santo Domingo, Dominican Republic, Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA and Oceanographic Center, Nova Southeastern University, Fort Lauderdale, Florida, USA Correspondence address. Departamento de Biolog´ıa Carr. 108 Barrio Miradero Km 1.3. Entrada al Zoolog ´ ico Mayaguez, ¨ PR 00680 E-mail: taras.oleksyk@upr.edu These authors contributed equally. Received: 20 July 2017; Revised: 26 January 2018; Accepted: 7 March 2018 The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 2 Grigorev et al. Abstract Solenodons are insectivores that live in Hispaniola and Cuba. They form an isolated branch in the tree of placental mammals that are highly divergent from other eulipothyplan insectivores The history, unique biology, and adaptations of these enigmatic venomous species could be illuminated by the availability of genome data. However, a whole genome assembly for solenodons has not been previously performed, partially due to the difficulty in obtaining samples from the field. Island isolation and reduced numbers have likely resulted in high homozygosity within the Hispaniolan solenodon (Solenodon paradoxus). Thus, we tested the performance of several assembly strategies on the genome of this genetically impoverished species. The string graph–based assembly strategy seemed a better choice compared to the conventional de Bruijn graph approach due to the high levels of homozygosity, which is often a hallmark of endemic or endangered species. A consensus reference genome was assembled from sequences of 5 individuals from the southern subspecies (S. p. woodi). In addition, we obtained an additional sequence from 1 sample of the northern subspecies (S. p. paradoxus). The resulting genome assemblies were compared to each other and annotated for genes, with an emphasis on venom genes, repeats, variable microsatellite loci, and other genomic variants. Phylogenetic positioning and selection signatures were inferred based on 4,416 single-copy orthologs from 10 other mammals. We estimated that solenodons diverged from other extant mammals 73.6 million years ago. Patterns of single-nucleotide polymorphism variation allowed us to infer population demography, which supported a subspecies split within the Hispaniolan solenodon at least 300 thousand years ago. Keywords: genome assembly; de Bruijn; string graph; Solenodon paradoxus; Hispaniola; Caribbean; island evolution; natural selection; isolation; heterozygosity which saliva secreted by the submaxillary gland flows into the Background victim [10]. The genus name Solenodon means “grooved tooth” The only 2 surviving species of solenodons, found on the 2 in Greek and refers to the shape of this incisor. Although solen- largest Caribbean islands, Hispaniola (Solenodon paradoxus) and odons rarely bite humans, the bites can be very painful (Nicolas ´ Cuba (Solenodon cubanus), are among the few endemic terres- Corona, personal communication), and even a small injection of trial mammals that survived human settlement of these islands. venom has been shown to be fatal to mice in a matter of min- Phenotypically, solenodons somewhat resemble shrews (Fig. 1), utes [7]. The chemical composition of solenodon venom has not but molecular evidence indicates that they are actually the sister yet been resolved [11]. group to all other extant eulipotyphlan insectivores (hedgehogs, Roca et al. [3] sequenced 13.9 kb of nuclear and mitochon- moles, shrews) from which they split in the Cretaceous Period drial sequences of S. paradoxus, inferring that solenodon diver- [1–3]. These enigmatic species have various local names in Cuba gence from other eulipotyphlan mammals such as shrews and and Hispaniola, including oso (bear), hormiguero (ant-eater), joron moles dates back to the Cretaceous Period, ∼76 million years (ferret), milquı (or almiquı), and agouta [4, 5], all pointing to the ´ ´ ago (Mya), before the mass extinction of the dinosaurs ∼66 Mya. first impression made on the Spanish colonists by its unusual Brandt et al. [12] sequenced complete mitogenome sequences of appearance. Today, the Hispaniolan solenodon (S. paradoxus)is 6 Hispaniolan solenodon specimens from the southern part of difficult to find in the wild because of its nocturnal activity pat- Hispaniola (Fig. 2), corroborating this conclusion, and estimated tern and its low population numbers. Here, we report the as- that S. paradoxus diverged from all other mammals ∼78 Mya [12]. sembly and annotation of the nuclear genome sequences and Other studies have reported similarly deep divergence dates (re- genomic variation of 2 subspecies of S. paradoxus.Weusedan- viewed by Sato et al. [13]). Whole genome analysis of S. paradoxus alytical strategies that will allow researchers to formulate hy- could provide support and validation to the earlier evolutionary potheses and develop genetic tools to assist future studies of studies. evolutionary inference and conservation applications. Morphometric studies suggest that southern and northern Solenodon paradoxus was originally described from a skin and Hispaniolan solenodons may be distinctive enough to be consid- partial skull at the St. Petersburg Academy of Sciences in Russia ered separate subspecies [2, 14, 15], a notion supported by recent [6]. It has a large head with a long rostrum and tiny eyes and ears mitochondrial DNA studies [12, 16]. The southern Hispaniolan partially hidden by the dusky brown body fur that turns reddish solenodons had less genetic diversity than those in the north, on the sides of the head, throat, and upper chest. The tail, legs, so that the control region sequences of all 5 southern speci- snout, and eyelids of the S. paradoxus are hairless. The front legs mens (the same individuals used in this study) were identical are noticeably more developed, but all 4 have strong claws use- or nearly identical [12], indicating that Hispaniolan solenodons ful for burrowing (Fig. 1). Adult animals measure 49–72 cm in have a very low level of mitochondrial diversity. total length and weigh more than 1 kg [7]. Solenodons are social It may now be imperative to study conservation genomics of animals; they spend their days in extensive underground tun- solenodons because their extinction would mean the loss of an nel networks shared by family groups and come to the surface entire evolutionary lineage whose antiquity goes back to the age at night to hunt small vertebrates and large invertebrates [8]. A of dinosaurs. Solenodon paradoxus survived in spectacular island unique feature is the os proboscidis, a bone that extends forward isolation despite the devastating human impact to biodiversity from the nasal opening to support the snout cartilage [9]. Solen- in recent centuries [3, 12]. Nevertheless, survival of this species odons are venomous mammals that display a fascinating strat- is now threatened by deforestation, increasing human activity, egy for venom delivery. The second lower incisor of solenodons and predation by introduced dogs, cats, and mongooses. It is de- has a narrow, almost fully enfolded tubular channel, through clining in population, its habitat is severely fragmented, and it Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 3 Figure 1: The 2 subspecies of Solenodon paradoxus. (A) A captive Hispaniolan solenodon from the northern subspecies (S. p. paradoxus) photographed at the Santo Domingo Zoo (photo taken by Juan C. Mart´ınez-Cruzado in 2014). (B) A mounted specimen of the southern subspecies (S. p. woodi) photographed at the Museo Nacional de Historia Natural prof. Eugenio de Jesus ´ Marcano in Santo Domingo, Dominican Republic (photo taken by Taras K. Oleksyk in 2017). Figure 2: Origins of the genomic DNA samples of Solenodon paradoxus from the island of Hispaniola. Approximate locations of capture for 5 wild individuals of S. p. woodi: Spa-K and Spa-L from La Canada ˜ del Verraco, as well as Spa-M, Spa-N, and Spa-O from the El Manguito location in the Pedernales Province in the southwest corner of the Dominican Republic bordering Haiti. In addition, 1 S. p. paradoxus sample (Spa-1) was taken from Cordillera Septentrional in the northern part of the island. Exact coordinates of each sample location are listed in [12]. The dashed line indicates the position of the Cul de Sac Plain and Neiba Valley; this region was periodically inundated by a marine canal that separated Hispaniola into north and south paleo-islands during the Pliocene and Pleistocene [15]. The original map is in the public domain (courtesy of NASA). is listed as endangered by the International Union for Conser- performs better than the assembly by a traditional de Bruijn al- vation of Nature (IUCN) Red List of Threatened Species (Red List gorithm (SOAPdenovo2) [18]. We used the string-graph assem- category B2ab, assessed in 2008 [17]). bler Fermi [19] as a principal tool for contig assembly in con- In this study, we assembled the genome of S. paradoxus us- junction with SSPACE [20] and GapCloser [18] for scaffolding. The ing low coverage genome data (∼5× each) from 5 individuals resulting genome sequence data was sufficient for high-quality of S. paradoxus woodi. We took advantage of the low individ- annotation of genes and functional elements, as well as for com- ual and population genetic diversity to pool individual data parative genomics and population genetic analyses. Prior to this and apply a string graph assembly approach that resulted in a study, the string-graph assembler Fermi [19] had been used only working genome assembly of the S. paradoxus genome from the in studies for annotation or as a complementary tool for de novo combined paired-end dataset (∼26×;Fig. 3). Our methodology assemblies made with de Bruijn algorithms [21]. We present and introduces a useful pipeline for genome assembly to compen- compare genome assemblies for the southern subspecies (S. p. sate for the limited amount of sequencing that, in this instance, woodi) based on several combinations of assembly tools, provide Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 4 Grigorev et al. in the northern part of the island. Figure 2 highlights geographi- cal locations of sample collection points for the samples used in this study. The 5 S. p. woodi samples were sequenced using Hiseq2000 technology (Illumina Inc.), resulting in an average of 151,783,327 paired-end reads 101 bp long, or 15.33 Gb of sequence data, per individual. In addition, DNA extracted from the northern solen- odon (S. p. paradoxus; Spa-1) was sequenced using MiSeq V3 tech- nology (Illumina Inc.) and produced 52,358,830 paired-end reads, equating to approximately 13.09 Gb of sequence data. Only the samples of S. paradoxus woodi were used for assembly since the northern subspecies (S. paradoxus paradoxus) did not have suffi- cient coverage for the de novo assembly. Further details about sample collection, DNA extraction, li- brary construction, and sequencing can be found in the Meth- ods section. The whole genome shotgun data from this project have been deposited at DDBJ/ENA/GenBank under the acces- sion NKTL00000000. The version described here is version NKTL01000000. The genome data has also been deposited into the NCBI under BioProject PRJNA368679 and to the GigaScience GigaDB repository [22]. Read correction After the reduction of adapter contamination with Cookiecut- ter [23], the k-mer distribution in the reads for the 5 individu- als of S. paradoxus woodi was assessed with Jellyfish (Jellyfish, Figure 3: Heterozygosity and k-mer distribution. k-mer distributions for the S. p. RRID:SCR 005491)[24]. The predicted mean genome coverage woodi reads. Only 1 original sample (SPA-K) distribution is shown as a solid gray was approximately 5× for each sample (Fig. 3), which is too low line, as the distributions were identical for each of the individual samples. The for individual de novo genome assembly. However, because of the predicted mean genome coverage was approximately 5x for each sample (x = 5). One example is plotted by a black solid line on the left. The combined uncor- extremely low levels of genetic diversity suggested by the ear- rected dataset is plotted in a dashed red line indicates a maximum at x = 26. The lier study of mitochondrial DNA in the southern subspecies [12] combined dataset corrected with QuorUM [25] is plotted in a solid blue line, also and in order to increase the average depth of coverage, the reads with a maximum at x = 26. A smaller local maximum on the left side for both from the 5 samples were combined into a single dataset. As a combined distributions, corrected and uncorrected (representing k-mers found result, the projected mean genome coverage for the combined once or very few times), is expected from differences between overlapping reads, genome assembly was 26×. Error correction was applied with most likely the sequencing errors. Other local maxima (seen as a small bulge at the x = 5) are interpreted as heterozygous sites. These proved to have almost no QuorUM (QuorUM, RRID:SCR 011840)[25] using the value k = 31. impact on the combined sample even after read correction, indicating a lack of The k-mer distribution analysis by Jellyfish in the combined and heterozygous sites for this solenodon subspecies. The largest local maxima (to error-corrected dataset indicated very low levels of heterozygos- the right) are interpreted as projected coverage. For the combined samples, this ity in accordance with the hypothesis (see Fig. 3 legend), allow- value is x = 26. ing use of the combined dataset for further genome assembly. Using KmerGenie [26], the genome size has been estimated to be 2.06 Gbp. a high-quality annotation of genome features and describe ge- netic variation in 2 subspecies (S. p. woodi and S. p. paradoxus), make inferences about recent evolution and selection signatures Analyses in genes, trace demographic histories, and develop molecular tools for future conservation studies. Assembly tool combinations We used several alternative combinations of tools to determine Data Description the best approach to an assembly of the combined genome data, outlined in Table 1. First, the combined libraries of paired-end Sample collection and sequencing reads were assembled into contigs with Fermi, a string graph– Five adult individuals of S. paradoxus woodi (National Center for based tool [19]. Second, the same libraries were also assembled Biotechnology Information [NCBI] Taxon ID:1906352) from the with SOAPdenovo2, a de Bruijn graph–based tool (SOAPdenovo2, southern Dominican Republic were collected in the wild follow- RRID:SCR 014986)[18]. The optimal k-mer length parameter for ing a general field protocol described earlier [ 12]; this included SOAPdenovo2 was determined to be k = 35 with the use of Kmer- 2 specimens caught from La Canada ˜ del Verraco and 3 from Genie [26]. For the scaffolding step, we used either SSPACE (SS- the El Manguito location in the Pedernales Province. The cap- PACE, RRID:SCR 005056)[20] or the scaffolding module of SOAP- tured individuals were visually assessed for obvious signs of dis- denovo2 [18]. For all instances, the GapCloser module of SOAP- ease, weighed, measured, sexed, and released at the capture site, denovo2 was used to fill in gaps in the scaffolds (GapCloser, all within 10 minutes of capture. Geographic coordinates were RRID:SCR 015026)[18]. After assembly, datasets were trimmed; recorded for every location. In addition, 1 S. p. paradoxus (Spa- scaffolds shorter than 1 Kbp were removed from the output. In 1) sample was acquired through collaboration with ZooDom at Table 1, the 4 possible combinations of tools used for the assem- Santo Domingo that originated in the Cordillera Septentrional bly are referred to with capital letters A, B, C, and D for brevity. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 5 Table 1: Description of the assembly strategies and comparison of metrics for the resulting assemblies. Assembly names A B C D Assembly tools Contig assembly tool Fermi Fermi SOAPdenovo2 SOAPdenovo2 Scaffolding tool SOAPdenovo2 SSPACE SOAPdenovo2 SSPACE Gap closing tool GapCloser GapCloser GapCloser GapCloser Assembly metrics Total contigs (>1,000 bp) 71,429 71,429 189,566 189,566 Contig N50 54,944 54,944 4,048 4,048 Contig CEGMA (%) 96.37(77.42) 96.37(77.42) 68.15(33.06) 68.15(33.06) Contig BUSCO (%) 86(65) 86(65) 42(21) 42(21) Total scaffolds (>1,000 bp) 14,417 40,372 20,466 - Final N50 555,585 110,915 331,639 - Final CEGMA (%) 95.56(81.85) 95.97(88.71) 95.97(90.73) - Final BUSCO (%) 91(74) 86(64) 94(80) - Quality Percentage of Ns (%) 0.06322 0.0135 0.02622 - REAPR error-free bases (%) 96.46 95.35 94.98 - REAPR low-scoring regions 18 16 71 - REAPR incorrectly oriented reads 11,543 5,329 28,964 - BUSCO [29] and CEGMA [30] percentages are reported for all genes (complete and partial), while the percentage of complete genes are shown in parentheses. However, SOAPdenovo2 introduces artifacts at the contig con- ber of detected rearrangements vs. Sorex assembly would be the struction stage, which it is specifically designed to mitigate at same for all 3 Solenodon assemblies (A, B, and C). Following the later stages, and SSPACE is not aware of such artifacts [27]. For parsimony principle, an assembly that shows rearrangements this reason, the assembly produced by combination D (contig as- is also likely to contain the most assembly artifacts. Conversely, sembly with SOAPdenovo2 and scaffolding with SSPACE) was not we expected that the best of the 3 assemblies of the Solenodon reported. genome would contain the least number of reversals and trans- positions when compared to the best available closely related genome (Sorex araneus). Quality control (QC) and structural comparisons To test this hypothesis, the 3 completed assemblies of Solen- between the assemblies odon (A, B, and C) were aligned to each other and to the out- group, which was the Sorex genome (SorAra 2.0, NCBI accession We used QUAST (QUAST, RRID:SCR 001228)[28] to estimate the common metrics of assembly quality for all combinations of as- GCA 000181275.2), using Progressive Cactus [32]. Custom scripts were used to interpret binary output of the pairwise genome by sembly tools: N50 and gappedness (the percentage of Ns(Ta- genome comparisons; the resulting coverage metrics are pre- ble 1)). Fermi-assembled contigs (A and B) were overall longer and fewer in number than the SOAPdenovo2 (C and D). The as- sented in Table 2. In this comparison, all 3 Solenodon genome as- semblies had a substantial overlap and resulted in similar levels sembly completeness was also evaluated with both benchmark- ing universal single-copy orthologs (BUSCO, RRID:SCR 015008) of synteny when compared against the Sorex reference assem- bly. However, assemblies A and B had the fewest differences with [29] and core eukaryotic genes mapping approach (CEGMA, RRID:SCR 015055)[30] for completeness of conservative genes. Sorex, while assembly C had more differences vs. A, B, and Sorex. Next, syntenic blocks between each of the 3 Solenodon assemblies Fermi assemblies (A and B) showed high levels of complete- ness compared to SOAPdenovo2 (86% vs. 42%) at the contig level. (A, B, and C) were compared to the Sorex assembly. The 50-Kbp syntenic blocks were identified using the ragout-maf2 synteny However, this difference was partially mitigated at the scaffold- ing step where SOAPdenovo2 increases completeness for Fermi module of the software package Ragout [33], and the number of scaffolds that contained syntenic block rearrangements was de- assembly (A) and more than doubles it for the SOAPdenovo2 as- termined. As a result, assembly B had the lowest number of re- sembly (C). To directly evaluate the quality of all assemblies, versals and transpositions compared to the S. araneus reference we applied REAPR [31]. From the REAPR metrics presented at genome (Table 2). Based on the combined results of the evalu- the bottom of Table 1, it appears that even though the scaf- folding step increased the final N50 for the C assembly, it con- ations by REAPR [31], Progressive Cactus [32], and Ragout [33], assembly C (generated by the complete SOAPdenovo2 run) was tains significantly more regions with high probability of mis- assemblies (low-scoring regions), less error-free bases, and 3 to not included in further analysis. 6 times higher number of incorrectly oriented reads compared to the Fermi-based assemblies (A and B) (Table 1). Genome annotation and evaluation of assembly We hypothesized that aligning the 3 genome assemblies completeness to each other would allow us to detect some of these mis- assemblies. A comparison to the best, most closely related Repeats in assemblies A and B were identified and soft masked genome assembly (i.e., Sorex araneus) will reveal several rear- using RepeatMasker (RepeatMasker, RRID:SCR 012954)[34]with rangements that, in many cases, reflect real evolutionary events. the RepBase library [35]. The total percentage of all interspersed It is reasonable to assume that if all the rearrangements that are repeats masked in the genome was lower than in S. araneus detected are real and not due to the assembly artifacts, the num- (22.53% vs. 30.48%). One possible reason could be that a low Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 6 Grigorev et al. Table 2: Pairwise genomic coverage for the 3 assemblies and the Sorex araneus genome (SorAra 2.0, NCBI accession number GCA 000181275.2) obtained from the Progressive Cactus [32] alignments. vs S. paradoxus woodi vs S. araneus Pairwise genome coverage (%) Assembly A B C # Inversions # Translocations S. araneus 42.1 42.2 42.3 - - - S. p. woodi A - 99.4 98.5 35.5 87 5 B 99.3 - 99.3 35.5 34 0 C 98.4 98.5 - 35.5 81 2 While all 3 assemblies have similar amounts of syntenic coverage to the Sorex genome, assembly B contains the least number of structural rearrangements (inversions and translocations) compared to the other 2 assemblies (A and C). Values in cells at the intersection of rows and columns represent the percentage (%) of coverage between the 2 compared genome assemblies. Syntenic blocks between each of the 3 solenodon assemblies (A, B, and C) were compared to the S. araneus assembly, and 50 Kbp syntenic blocks were identified using the ragout-maf2synteny module of the software package Ragout [33]. Table 3: Repeat content of the Solenodon paradoxus genome (assem- (GRCm38.p4) were aligned to a S. paradoxus assembly with Ex- bly B), annotated by RepeatMasker [34] with the RepBase library [35]. onerate [36] with a maximum of 3 “hits” (matches) per protein. The obtained alignments were classified into the top (primary) Class Number Length (bp) Percentage (%) hit and 2 secondary hits; the coding sequence (CDS) fragments were cutfromeachsideby3bp forthe tophitsand by 9bpfor Total interspersed 461,754,432 22.53 secondary hits. These truncated fragments were clustered and repeats supplied as hints (local pieces of information about the gene in SINEs 271,839 36,271,455 1.77 the input sequence, such as a likely stretch of coding sequence) Alu/B1 6 341 <0.0001 of the potential protein-coding regions to the AUGUSTUS MIRs 264,319 35,557,190 1.73 software package (Augustus: Gene Prediction, RRID:SCR 008417) LINEs 610,079 304,823,409 14.87 LINE1 425,750 260,176,709 12.7 [37], which predicted genes in the soft-masked Solenodon LINE2 157,422 39,432,276 1.92 assembly. Proteins were extracted from the predicted genes L3/CR1 22,172 4,293,335 0.21 and aligned by HMMER (Hmmer, RRID:SCR 005305)[38]and RTE 4,122 839,744 0.04 Basic Local Alignment Search Tool for Proteins (BLAST; NCBI, LTR elements 246,305 78,108,726 3.81 RRID:SCR 004870)[39] to Pfam (Pfam, RRID:SCR 004726)[40] ERVL 61,150 24,158,692 1.18 and Swiss-Prot [41] databases, respectively. Genes supported ERVL-MaLRs 94,934 30,075,905 1,47 by hits to protein databases and hints were retained; the ERV classI 57,674 19,259,649 0.94 unsupported sequences were discarded. The annotated genes ERV classII 24,454 2,840,874 0,14 can be retrieved from Database S2. DNA elements 204,413 42,015,054 2.05 Assembly B showed a higher support compared to assembly hAT-Charlie 112,664 21,168,194 1.03 A (91.7% vs. 79.2%) for the protein-coding gene predictions by ex- TcMar-Tigger 43,950 11,141,107 0.54 trinsic evidence, even though assembly A had a larger N50 value Small RNAs 4,772 456,810 0.02 (Table 1). These values were calculated as a median fraction Satellites 46,734 20,910,815 1.02 of exons supported by alignments of proteins from reference Simple repeats 644,811 28,549,871 1.39 species to genome (Fig. 4). In other words, assembly B is more Low complexity 114,188 5,933,786 0.29 useful for gene predictions and is likely to contain better gene regions models that can be used in the downstream analysis. Therefore, Unclassified 3,051 535,788 0.03 based on 2 lines of evidence, low rearrangement counts (Table 2) and high support to gene prediction for the assembly B, it was chosen for the subsequent analyses as the most useful current coverage assembly may perform better in nonrepetitive regions. representation of the Solenodon genome. Alternatively, if the repeat content in S. paradoxus is indeed lower, this would have to be evaluated using a higher-quality as- sembly with the use of long read data. The total masked repeat Noncoding RNA genes content of the S. paradoxus genome including simple/tandem re- peats, satellite DNA, low complexity regions, and other elements For all noncoding RNA genes except for tRNA and rRNA is presented in Table 3. The repeat content can be retrieved from genes, the search was performed with INFERNAL (Infernal, Database S1. RRID:SCR 011809)[42] using the Rfam (Rfam, RRID:SCR 007891) The annotation of protein-coding genes was performed [43] BLASTN hits as seeds. The tRNA genes were predicted using using a combined approach that synthesized both homology- tRNAScan-SE (tRNAscan-SE, RRID:SCR 010835)[44], and rRNA based and de novo predictions, where de novo predictions were genes were predicted with Barrnap ((BAsic Rapid Ribosomal RNA used to fill gaps and extend homology-based predictions. Predictor), version 0.6 (Barrnap, RRID:SCR 015995)[45]). Addition- Gene annotation was performed for both assemblies (A and ally, RNA genes discovered by RepeatMasker at the earlier stages B) independently. Proteins of 4 reference species S. araneus of the analysis were used to cross-reference the findings of (SorAra 2.0, GCA 000181275.2), Erinaceus europaeus (EriEur2.0, rRNA- and tRNA-finding software. The list of the noncoding RNA GCA 000296755.1), Homo sapiens (GRCh38.p7), and Mus musculus genes can be accessed in Database S3. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 7 Figure 4: Distribution of the gene prediction support by extrinsic evidence for Solenodon assemblies A (on the left) and B (on the right). Proteins of 4 reference species S. araneus (SorAra 2.0, GCA 000181275.2), Erinaceus europaeus (EriEur2.0, GCA 000296755.1), Homo sapiens (GRCh38.p7), and Mus musculus (GRCm38.p4) were aligned to a S. paradoxus assembly with Exonerate [36] with a maximum of 3 best matches per protein. Coding sequences =were cut from each, clustered, and uploaded into the AUGUSTUS software package [37] to predict genes in the soft-masked Solenodon assembly. Proteins from the predicted genes were aligned by HMMER [38] and BLAST [39] to Pfam [40] and Swiss-Prot [41] databases. Genes supported by matches to protein databases and “hints” (see definition in main text) were retained; the rest were discarded. Substantially more transcripts have higher hint support in assembly B. The annotated genes can be retrieved from Database S2. Assembly C has not been evaluated. Table 4: The weighted coverages of the genomes in the Progressive (“Cactus”) [32]. Cactus genome alignments were used to build Cactus alignment [32], as calculated against the C. familiaris genome. a “sparse map” of the homologies between a set of input se- quences. Once this sparse map is constructed, in the form of Query genome Weighted coverage a Cactus graph, the sequences that were initially unaligned in the sparse map are also aligned [32]. Weighted coverage of Dog (Canis familiaris) (1.14) a genome by genome comparison was calculated by binning Cow (Bos taurus)1.06 an alignment into regions of different coverage and averaging Common shrew (Sorex araneus)1.05 these coverages, with lengths of bins as weights. The weighted Star-nosed mole (Condylura cristata)1.04 coverage of S. paradoxus to C. familiaris was 1.05, which indi- Hispaniolan solenodon (Solenodon paradoxus)1.05 cated that the present genome assembly is comparable in qual- The weighted coverage of a genome to itself is parenthesized as it is not a com- ity and duplication structure to other available mammalian as- parative value. semblies, which are close to each other and are close to 1.0 The weighted coverage of the S. paradoxus genome assembly from our study is (Table 4). comparable to other high-coverage mammalian genome assemblies. The clado- gram used for multiple genome alignment with Progressive Cactus is shown in Fig. S1. Detection of single-copy orthologs Single-copy orthologs (single gene copies) are essential for the evolutionary analysis since they represent a useful conserva- Multiple genome alignment, synteny, and duplication tive homologous set, unlike genes with paralogs, which are structure difficult to compare across species. The longest polypeptide To compare the Solenodon genome assembly with other mam- coded by each gene of S. paradoxus and of 3 other Eulipotyphla— malian genomes, a multiple alignment with genomes of re- Erinaceus europaeus, S. araneus,and C. cristata—were aligned lated species was performed using Progressive Cactus [32]. Cur- to profile hidden Markov models of the TreeFam database rently available genomic assemblies of cow (Bos taurus, BosTau (Tree families database, RRID:SCR 013401)[46, 47] using HM- 3.1.1, NCBI accession number DAAA00000000.2), dog (Canis famil- MER [38]. Top hits from these alignments were extracted and iaris,CanFam3.1,GCA 000002285.2), star-nosed mole (Condylura used for assignment of corresponding proteins to families. The cristata, ConCri 1.0, GCF 000260355.1), common shrew (S. ara- same procedure was performed in order to assign proteins neus, SorAra 2.0, GCA 000181275.2), and S. paradoxus woodi (as- to orthologous groups using profile hidden Markov models of sembly B from this study) were aligned together, guided by a orthologous groups of the maNOG subset from the eggNOG cladogram representing branching order in a subset of a larger database (eggNOG, RRID:SCR 002456)[48] as referenced. Or- phylogeny (Fig. S1). We evaluated the S. paradoxus coverage by thologous groups and families for which high levels of error comparing it to the weighted coverages of other genomes in the rates were observed while testing assignment of proteins to alignment to the C. familiaris genome (Table 4). Custom scripts them were discarded; the rest of the orthologous groups and were used to interpret the binary output of Progressive Cactus families were retained for further analysis. Proteins and the Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 8 Grigorev et al. Table 5: Fossil-based priors associated with mammalian evolution used for calibration of divergence times [54–57]. Node Calibration prior on clade Node min. age (Mya) Node max. age (Mya) Evidence Opossum—placental Eutheria—Metatheria 157.3 169.6 Fossil [55] mammals split Human—mouse Archonta—Glires 61.5 100.5 Biostratigraphy [55] Primates, Euarchontaglires—Laurasiatheria 61.6 100.5 Fossil [55] mouse—dog, horse, cow Dog—ferret Canidae—Arctoidea 35 45 Fossil [56, 57] Solenodon— Lipotyphla 61.6 100.5 Fossil [55] hedgehog, shrew, mole Cow—horse Artiodactyla as soft minimum 52.4 100.5 Fossil [55] The 4,416 single-copy orthologs identified in our assembly were used for phylogeny inference via 4-fold degenerate sites with programs RAxML [ 49] and PAML [50]. The resulting phylogenetic tree was plotted with FigTree [58] and is presented in Fig. 5. corresponding assignments were obtained from the maNOG Positively selected genes database for 7 other species: H. sapiens, M. musculus, B. taurus, To evaluate signatures of selection in the assembled genomes, C. familiaris, Equus caballus, Mustela putorius furo,and Monodel- we used a dataset of 4,416 orthologous groups containing single- phis domestica. Inspection of assignments across all the species copy orthologous genes of the mammalian species described yielded 4,416 orthologous groups containing single-copy orthol- earlier. Single-copy orthologs were used as a conservative set ogous genes (Database S5). necessary for comparing coding sequences that only arose 1 time in order to avoid the uncertainties associated with paralogs Species tree reconstruction and divergence time and lineage-specific gene duplications. First, we translated DNA estimation sequences into amino acids, aligned them in MUSCLE (MUSCLE, RRID:SCR 011812)[65], and then translated them back into DNA We used our genome assembly to infer phylogenetic relation- code using the original nucleotide sequences by PAL2NAL [66]. ships between S. paradoxus and other eutherian species with Genic dN/dS ratios were estimated among the 11 mammalian known genome sequences and estimated their divergence time species (including Solenodon) used in constructing the phylogeny using the new data. Based on the alignments of the single-copy represented in Fig. 5. orthologous proteins for the species included in the analysis, To estimate dN/dS ratios, we used the codeml module from a maximum likelihood tree was built using RAxML [49]with the PAML package [50]. The dN/dS ratios were calculated over the the PROTGAMMAAUTO option and the JTT fitting model tested entire length of a protein-coding gene. The branch-site model with 1,000 bootstrap replications. From the codon alignments was not included in the current analysis because of the risk of of single-copy orthologs of the 11 species, 461,539 4-fold de- reporting false positives due to sequencing and alignment errors generate sites were extracted. The divergence time estimation [67], especially on smaller datasets, and additional uncertainties was made by the MCMCtree tool from the software package could be introduced from the lack of power under synonymous PAML (PAML, RRID:SCR 014932)[50]withthe HKY+G model of substitution saturation and high variation in the GC content [68]. nucleotide substitutions and 2,200,000 generations of MCMC (of All the single-copy orthologs were plotted in the dN to dS which the first 200,000 generations were discarded as burn-in). coordinates and color-coded according to the 96 gene ontology A test for substitution saturation [51, 52] was performed using generic categories (Fig. 6). We retrieved values of dN, dS, and w DAMBE6 [53] for both all third codon positions and only 4-fold (w = dN/dS) for all single-copy orthologs and used human an- degenerated sites. In both cases the index of substitution satu- notation categories to assign all the genes with their gene on- ration was significantly lower than the threshold value for both tologies (GO) using the Python package goatools [69]and theGO symmetrical and asymmetrical trees indicating low saturation Slim generic database [69] to assign the genes to the major GO level. Therefore, saturation was not detected for any of the third categories. positions nor for the 4-fold degenerated sites. The dN/dS values for the 12 genes exhibiting positive selec- Divergence times were calibrated using fossil-based priors tion (Table 6) are visible above the line showing dN = dS. Three associated with mammalian evolution, listed in Table 5 and of these genes belong to the plasma membrane GO category based on [54–57]. FigTree [58] was used to plot the resulting tree, (GO:0005886), while cytosol (GO:0005829), mitochondrial elec- showninFig. 5. According to this analysis, S. paradoxus diverged tron transport chain (GO:0005739), cytoplasm (GO:0005737), and from other mammals 73.6 Mya (95% confidence interval of 61.4– generation of precursor metabolites (GO:0006091) were repre- 88.2 Mya). This is in accordance with earlier estimates based on sented by 1 gene each. Five of the genes exhibiting positive selec- nuclear and mitochondrial sequences (e.g., [3, 12]) as reviewed tion signatures could not be assigned to GO categories. Some of by Springer et al. [59]. This date is also much older than the these are also associated with the plasma membranes (TMEM56, timeframe of molecular estimates of divergence times between SMIM3), and 1 gene (CCRNL4) encodes a protein highly similar to most island taxa and their closest mainland relatives [60]. Our the nocturnin, a gene identified as a circadian clock regulator in data support solenodons forming a sister group to other eulipo- Xenopus laevis [70]. The full list of genes, GO annotations, and typhlans, i.e., hedgehogs, shrews, and moles [61–64], with a di- associated dN/dS values are included in Database S6. vergence date as old as splits between some pairs of mammalian Traditionally, one of the most commonly used signatures of orders, such as between rodents and primates or between car- selection is the ratio of nonsynonymous (dN) to synonymous nivores and artiodactyls (Fig. 5). Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 9 Figure 5: Phylogenetic relationships of Solenodon paradoxus and other mammals from whole-genome data. (A) Maximum likelihood phylogeny showing branch lengths. The tree was built using RAxML [49] with the PROTGAMMAAUTO option and the JTT fitting model tested with 1,000 bootstrap replicates. ( B) Divergence time estimates based on 461,539 4-fold degenerate sites from the codon alignments of single-copy orthologs and using fossil-based priors (Table 5). The divergence time estimation was made by the MCMCtree tool from the software package PAML [50] with the HKY+G model of nucleotide substitutions and 2,200,000 generations of MCMC (of which the first 200,000 generations were discarded as burn-in). The 95% confidence intervals are given in square brackets and depicted as semitransparent bo xes around the nodes. The inferred divergence time of S. paradoxus from other mammals is 73.6 Mya (95% confidence interval of 61.4–88.2 Mya). (dS) substitutions, dN/dS [71]. The synonymous rate (dS) ex- selection is likely to have accelerated the rate of fixation of presses the rate of unconstrained, neutral evolution, so that nonsynonymous substitutions. It is possible to quantify the pro- when dN/dS <1, the usual interpretation is that negative se- portion of nonsynonymous substitutions that are slightly dele- lection has taken place on nonsynonymous substitutions. Oth- terious from the differences in dN/dS between rare and common erwise, when dN/dS >1, the interpretation is that the positive alleles [72,73]. In our comparison, a subset of single-copy or- Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 10 Grigorev et al. positively selected genes is relatively short compared to num- bers of positively selected genes reported in other studies (e.g., human-to-chimpanzee comparison yields several hundreds of human-specific genes under selection [ 75–77]. This observation could be a consequence of the averaging effect of a large com- parison group that included mammals very distantly related to solenodons. The dN/dS ratios can also be used as a proxy to illustrate the rate of evolution for proteins. By looking at the trends in fast evolved genes (dN/dS >0.25), we can make inferences about the factors that shaped the genome of this species during the mil- lions of years of island isolation. To summarize the functional contributions, we used the PANTHER Overrepresentation Test and GO Ontology database based on the H. sapiens (Supplemen- tary Table S1) and M. musculus (Supplementary Table S2) genes [78]. Interestingly, genes involved in the inflammatory response and located on cell surfaces were among those overrepresented among the rapidly evolving genes in the Solenodon genome compared to either the human or mouse databases (Supplemen- tary Tables S1 and S2). Venom gene identification Since solenodon is one of very few venomous eutherian mam- Figure 6: The dN/dS ratios for 4,416 single-copy orthologous genes. The dN and mals, of special interest in the solenodon genome were the puta- dS ratios were calculated with the codeml module from the PAML package [50] tive venom genes. While there was no saliva sample in our pos- and calculated over the entire length of a protein coding gene. Values are color- session that could be analyzed for the expressed toxin genes, coded by GO term aggregated by the GO Slim generic database [69,106]; the color a comparative genome approach could be applied as an indi- code legend is presented in Fig. S2. The solid black line represents dN = dS; dots above it represent genes showing signatures of positive selection. The figure is rect way to find venom genes orthologous to genes expressed truncated at dN = 1 and dS = 2, so larger values are not shown on the graph, but in venom for other species. First, we identified 6,534 toxin and all ω, dN, and dS values are available in Database S6. venom protein representatives (Tox-Prot) from Uniprot (UniProt, RRID:SCR 002380)[79] and queried them with BLAST against thologs dN/dS compared to the 10 mammalian species (Fig. 5)is the current S. paradoxus genome assembly. The hit scaffolds estimated to be ∼0.18 or 18%, on average, compared to ∼0.25 re- were then extracted from the AUGUSTUS CDS prediction file. ported for the human–chimp and ∼0.13 reported for the mouse– The same Tox-Prot sequences were used for Exonerate with rat comparisons [74]. In other words, it suggests that up to 82% the protein-to-genome model. The hits were used as queries of all amino acid replacements in S. paradoxus are removed by against the NCBI database to ensure gene identity, further ex- purifying selection [74]. amined through phylogenetic analyses with select model mam- Note that purifying selection is the conservative force in malian and venom reptile genes (also adding randomly se- molecular evolution, whereas positive selection is the diversi- lected sequences for each gene to reduce clade bias). The fying force that drives molecular adaptation. Overall, the list of retrieved sequences were aligned with MUSCLE [65], followed Table 6: The putative targets of positive selection in the solenodon genome. GO category Solenodon gene dS dN dN/dS description Human ortholog ENOG410UG5H 0.000003 0.002563 ≥999 Plasma membrane KLF9 ENOG410USMX 0.000011 0.010830 ≥999 Plasma membrane TNFSF13B ENOG410UWRE 0.000015 0.014790 ≥999 - SMIM3 ENOG410UNED 0.000174 0.030411 174.84 - CCRN4L ENOG410UJP8 0.013214 0.120449 9.12 Cytosol PLK4 ENOG410UWA9 0.020955 0.104972 5.01 Mitochondrion NDUFC1 ENOG410V3Q6 0.047538 0.071112 1.50 Plasma membrane SYT16 ENOG410UQAM 0.078543 0.096445 1.23 - WBP2NL ENOG410UKXY 0.168982 0.185535 1.10 - TIGIT ENOG410UKXJ 0.134581 0.146926 1.09 Cytoplasm LRRC66 ENOG410UIAB 0.060622 0.065402 1.08 - TMEM56 ENOG410UG23 0.176172 0.177344 1.01 Generation of THTPA precursor metabolites and energy The dN/dS values and the GO categories for the 12 genes that showed signatures of positive selection in the Solenodon paradoxus woodi genome (dN>dS). All other genes are reported in Database S6. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 11 Table 7: Homologous matches for the most relevant protein venom were not filtered by repetitive regions or mappability mask, and classes in the Solenodon paradoxus genome. the number of possible SNV sites was defined as the genome assembly size minus the umber of unknown base pairs (‘N’). Protein groups found in animal Number of matches in Based on the variation data from the genomes of venoms the S. paradoxus genome 2 subspecies (S. p. woodi and S. p .paradoxus), we estimated population dynamics using pairwise sequentially Markovian Metalloproteinase; serine protease 8 each coalescent (PSMC) model [87]. PSMC uses a coalescent approach Hyaluronidase 6 to estimate changes in population size that allowed us to create (Acetyl) cholinesterase 2 a (The Most Recent Common Ancestor) TMRCA distribution Calglandulin; nerve growth factors 4 each Lipase 3 across the genome and estimate the effective population size Hydrolase; Kunitz serine protease 1 each (Ne) in recent evolutionary history (e.g., from 10,000 to 1 million inhibitor; nucleotidase; years). O-methyltransferase; oxidase; Demographic history was inferred separately for S. p. woodi peptidase; phosphodiesterase; and S. p. paradoxus, and the resulting plots revealed differences phospholipase; vascular in demographic histories of the 2 subspecies (Fig. 9). Each south- endothelial growth factor ern individual was considered separately and their demographic histories overlapped. The difference in demographic history pro- Genes were identified by querying 6,534 toxin and venom protein representatives vides another argument in favor of a subspecies split, as evi- found in animal venoms in Tox-Prot from Uniprot [79]. All of the protein groups denced by distinctly different effective population sizes at least are present in snake venoms. The sequences of the putative venom genes from S. paradoxus are available in the Database S7. since 300 Kya. According to this analysis, the northern solen- odon subspecies currently has a much larger Ne, which has ex- panded relatively recently, between 10,000 and 11,000 years ago by a maximum likelihood (WAG+I+G) phylogenetic reconstruc- (Fig. 9). Prior to that, it was the southern subspecies (S. p. woodi) tion. Hits were matched against their respective references in an that had a larger Ne. At the same time, the demographic history alignment and visually inspected. inferred for both populations showed similar cyclical patterns of As a result, we identified 44 gene hits of the 16 most rele- expansion and contraction around the mean of 6,000 “effective” vant protein venom classes (all present in snakes) in the S. para- individuals for the southern subspecies (S. p. woodi) and 3,000 doxus genome (Table 7). Inspection of pairwise MUSCLE align- for the northern subspecies (S. p. paradoxus). One unusual result ments of the putative Solenodon venom genes (Database S7) of this analysis is that the northern subspecies shows a much with their animal homologs revealed several interesting cues. lower Ne for all but the most recent time period. The putative venom genes could not be confirmed through ge- nomic information alone, yet they cannot be discarded given that they were matched to high homology regions of closely re- Development of tools to study population and lated genes, such as those originally recruited into venom. There conservation genetics of S. paradoxus were also unusual insertions not found in other species’ venom The presence of genome-wide sequences of multiple individ- genes. Specifically, an insertion in a serine protease, a gene uals from 2 subspecies created a possibility for the develop- with a role in coagulation (namely, coagulation factor X), is not ment of practical tools for conservation genetics of this endan- present in known homologs. The insertion seems to be located gered species. Generally, microsatellite loci are both abundant at the start of the second exon. This particular gene was fur- and widely distributed throughout the genome, while usable loci ther analyzed to understand the insertion and its potential func- are characterized by a unique flanking DNA sequence so that a tional consequences (Fig. 7). Finally, none of the known venom single locus can be independently amplified in many individu- genes from the closest related venomous insectivore (Blarina bre- als [88–90]. The major advantages of microsatellite markers are vicauda) were found in our study. Our results indicate that a more well known: codominant transmission, high levels of polymor- detailed study of Solenodon venom genes using a transcriptome phisms leading to the high information content, high mutation obtained from a fresh saliva sample is needed to address their rates that allow differentiation between individuals or popula- molecular evolution and function. tions within a species, and ease of genotyping. While a genome obtained from one individual can be searched for potentially Genomic variation and demographic history inference variable microsatellite loci, this would (1) miss the majority of Once the reference alignment was assembled as a consensus be- variable loci not represented in the individual’s 2 chromosomes tween the sequences obtained from the 5 S. p. woodi individu- and (2) result in many positives that may be monomorphic fol- als, polymorphisms were identified in the 6 individual genomes lowing laboratory tests (usually by electrophoresis of the ampli- by aligning them to the combined reference. Single-nucleotide fied fragments from population samples). The availability of sev- and short variants and indels were identified in 5 southern and eral genomes can allow generation of a more comprehensive set 1 northern individual using Bowtie2 (Bowtie, RRID:SCR 005476) of variable markers, while reducing false positives. [80], SAMtools and Bcftools (SAMTOOLS, RRID:SCR 002105)[81], All 3 assemblies from this study (A, B, and C) were indepen- and VCFtools (VCFtools, RRID:SCR 001235)[82]. The S. p. woodi dently analyzed using a short tandem repeat (STR) detection individuals differed from the reference by an average of 1.25 mil- pipeline. A, B, and C assemblies were analyzed separately with lion polymorphisms, and the S. p. paradoxus individual differed Tandem Repeats Finder to locate and display tandem repeats by 2.65 million from the reference assembly. [91]. Each of the 6 individual samples from the 2 solenodon sub- Whole solenodon genome single nucleotide variation (SNV) species (5 from S. p. woodi and 1 from S. p. paradoxus) were aligned rates, defined as a ratio of all observed SNVs to all possible SNV to the reference assemblies A, B, and C by Burrow-Wheelers sites in the genome, were calculated and found to be compara- Aligner [81]. Each set of individual alignments was analyzed with tively low relative to other mammals (Fig. 8)[83–86]. To enable HipSTR [92]. Only loci that shared more than 20 reads in the sam- this comparison, the same calculations were used, where SNVs ple alignments were considered for further steps in the search Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 12 Grigorev et al. Figure 7: (A) Predicted coagulation factor X (F10) gene structure arrangement from the structure of known homologs (due to the scaffolding, the total gene length is unknown in solenodon). The 21 codon insertion is highlighted in red on exon 2 of the solenodon F10 gene. Exons are represented as black boxes and introns as lines connecting exons. (B) F10 protein sequence alignment showing an unusual insertion in the Solenodon paradoxus genome absent in all other mammalian and reptilian genes retrieved from the Tox-Prot from Uniprot [79]. The insertion of 21 amino acids is indicated with a red-boxed line in the alignment. (C) Reconstructed mammalian F10 phylogenetic maximum likelihood tree using the model GTR+I+, 1,000 bootstrap replicates (1,590 bp long alignment). The numbers set indicate approximate likelihood-ratio branch test, Bayesian-like modification of the aLRT, and bootstrap percentage, respectively. for variable microsatellite loci. The result of this search was to HipSTR [92], genotypes with a flank indel in more than 15% of saved in a variant call format file that includes annotations of reads, and genotypes with more than 15% of reads with detected all loci that had variation between samples and passed the min- polymerase chain reaction (PCR) stutter artifacts were discarded. imum qualification of the reads parameter: to be successfully The final set containing loci that have at least 2 allele calls in identified in silico in the data from at least 1 individual. The loci 2 different individuals after filtering has been deposited in that did not pass these criteria were labeled as unsuccessfully ver- the polymorphic microsatellite database (Database S8). This ified and excluded from the list. database contains a list of variable microsatellites discovered, The remaining loci were subjected to additional filtering; all a total of 1,200 bp flanking sequences for primer construction, genotypes that had less than 90% posterior probability according and the information on whether and where it was found to be Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 13 Figure 8: Low genome heterozygosity in Solenodon paradoxus woodi compared to other mammalian taxa. The SNV rate in the S. p. woodi genome is shown relative to other mammal genomes as an estimate of genome diversity (h). The value for each sequenced individual was estimated using all variant positions, with repetitive regions not filtered. The SNVs are deposited in Database S9. variable—between subspecies or within 1 of the subspecies. We also providing a window into genetic underpinnings of adap- also report the type (di-, tri-, etc.), number of repeats, number tive features, including genes responsible for inflammation and of variants, and % variable and provide up to 600 bp flanking venom and how these may reflect its adaptation. In addition, we sequences on each side that can be used to develop primer se- developed tools that will help guide future genome studies as quences (Database S8). well as conservation surveys of the remaining solenodon popu- lations on the island of Hispaniola. In this study, we have made the first step into the whole-genome analysis of the Solenodon.A more complete genome sequence may provide a better picture Discussion of its evolutionary history, possible signatures of selection, and In this study, we sequenced and assembled the genome of an clues about the genetic basis of adaptive phenotypic features fa- endangered Antillean mammal that survived tens of millions cilitating life on Caribbean islands and contribute to a better in- of years of island isolation but nevertheless is currently threat- sight into island evolution and possible responses to current and ened with extinction due to anthropogenic activities. Our ap- future climate change. proach demonstrated sequencing, assembly, and annotation of a genome of a highly divergent lineage within the placental mam- mal tree, delivering an important phylogenetically diverse mam- malian genome for analysis in a comparative context [93]. Al- though the full description of genome diversity of this rare enig- The string graph assembly approach for homozygous matic mammal needs to be further improved with more samples genomes and analyses, our initial assembly of the solenodon genome con- tributes information and tools for future studies of evolution and The advantages of the string graph assemblies in our particular conservation. Future studies can combine the current genome case can be understood by looking at the nature of the under- annotations with the inclusion of additional genetic and eco- lying algorithms. The de Bruijn graph is a mathematical con- logical data from further sampling. cept that simplifies genome assembly by reducing information With the new genome-wide assembly, we inferred a phy- from short next-generation sequencing reads, of which there logeny that validates previous estimates of the time of diver- can be billions, to an optimized computational problem that gence of Solenodon from other eulipotyphlan insectivores [3, 12], can be solved efficiently [ 94]. However, some information may Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 14 Grigorev et al. Figure 9: Demographic history inference for the southern S. p. woodi (red) and the northern S. p. paradoxus (blue) subspecies using the pairwise sequentially Markovian coalescent model [87]. indeed be lost, as the set of reads is effectively replaced with Potential Implications a set of much shorter k-mers to produce an optimal assembly Comparative genomics path. Usually, this is compensated by overwhelming amounts of data in high coverage assemblies, and the difference in ef- We have taken advantage of the fact that the genome of this fectiveness between this and other types of algorithms, bar- mammal shows reduced heterozygosity [12], which made it fea- ring speed, becomes less evident. While sequencing becomes sible to combine samples of multiple individuals in order to pro- cheaper, genome projects continue to rely on the increased high- vide higher coverage and achieve a better assembly using Illu- quality coverage, increasing the cost of the sequence data rather mina reads. The current assembly was performed without the than trying to increase the efficacy of the assembly itself. In use of mate pair libraries and without high-quality DNA, nev- contrast, the string graph–based algorithms for genome assem- ertheless, it is comparable in quality to other available mam- bly are intrinsically less erroneous than de Bruijn graph–based malian assemblies. In terms of contig N50 as a measure of con- ones, since building and resolving a string graph does not re- tiguity, our assembly resulted in contig N50 of 54,944, while the quire breaking reads into k-mers and therefore does not sacrifice most closely related available genome sequences of Sorex ara- long-range information [19]. This also helps reduce the probabil- neus (SorAra2.0) assembly features a contig N50 of 22,623, and ity of misassemblies; in theory, any path in a string graph repre- the Condylura cristata (ConCri1.0) assembly has contig N50 of sents a valid assembly [95, 96]. String graph–based approaches 46,163. It should be noted that scaffold N50 values are not to be have already been applied successfully to assemblies from high compared as this study used only paired-end reads, as opposed coverage read sets; and one example is the Assemblathon 2 [97]. to S. araneus and C. cristata. More importantly, the assembly pro- In projects with lower genome coverage like ours, adoption of a vided complete or partial annotation for more than 95% of the string graph–based approach might be of benefit to the genome genes based on the evolutionarily informed expectations of gene assembly because it uses more information from the sequences. content from near-universal single-copy orthologs selected from However, there are 2 major downsides for its widespread use: OrthoDB v9 by BUSCO [29]. Among these, 4416 single-copy genes (1) it is more computationally intensive than methods utilizing that have clear 1-to-1 orthologs across species (single-copy or- de Bruijn graph algorithms and (2) the implementation of the thologs) [98, 99] were chosen for a subsequent comparative anal- string graph model is sensitive to sequence variation, and the ysis involving genes in different mammalian species. effectiveness of this approach may depend on the level of het- Specifically, the repetitive composition of the solenodon erozygosity in a DNA sample. It is worth noting that [19]was genome was evaluated. Compared to the estimates based on the primarily intended for variant annotation via de novo local as- reference human genome [100], very conspicuous is the lower sembly, and not for whole genome assembly. Nevertheless, the numbers of SINEs (no Alu elements) and a substantially lower new genome-wide data produced by our pipeline were sufficient number of LINEs as well. Transpositions and translocations be- for the comparative analysis and have been annotated for the tween the genomes of S. paradoxus and S. araneus were identi- genes and repetitive elements and interrogated for phylogeny, fied; very few rearrangements and translocations between the demographic history, and signatures of selection. In addition, assembly and the S. araneus genome were found. At the same using the current genome assembly, we were able to annotate time a higher coverage would be needed to do more detailed large transpositions and translocations in the Solenodon in re- analyses, for instance, to address the relative length and simi- lation to the closest available high-quality genome assembly larity of indels and copy number polymorphisms between solen- (S. araneus). odon populations [101]. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 15 Evolutionary genomics tion could allow a more successful interaction with molecules capable of activating the F10 protein. In mice, venom extracted Using the nuclear genomes, we were able to confirm earlier di- from solenodons and venom prothrombin activator injections vergence time estimates based on sets of genes [3], as well as can both be lethal in minutes [7, 102]. The insertion was also full mitochondrial sequences [12]. The whole genome analysis searched against possible mobile DNA elements, but no matches points to a split between Solenodon and other eulypotiphlans that were found. Our results should be followed in the future by de- occurred around 74 Mya (Fig. 5), which is very close to our earlier tailed pharmacological studies. estimate of 78 Mya based on the full mitochondrial genome [12]. Our result does not support the 60 Mya point estimate made by Conservation genetics a phylogenetic analysis based on sequences of 5 slowly evolving nuclear genes [13]. The low variation that exists between the solenodon sequences Our assembly provided enough gene sequences to gain in- is hardly surprising, because the theoretical consensus in con- sights into the evolution of functional elements in the solenodon servation genetics predicts that small populations lose genetic genome. It is reasonable to suggest that this species historically diversity more rapidly than large populations [103] and mea- had low effective population sizes, if they remained close to sures of genetic diversity have been explicitly suggested to IUCN those estimated by this study, or about 4,000 on average (Fig. 9). as a factor to consider in identifying species of conservation con- Among the 4,416 single-copy orthologs analyzed for dN/dS ra- cern [104]. The historical N for each subspecies was examined tios over the entire length of a protein-coding gene between S. by our analysis (Fig. 9) and showed lower levels recently in S. p. paradoxus and 10 other mammals, 12 genes were identified as woodi. Due to the limitations of PSMC, the most recent Ne cannot positively selected. Among these, the majority were membrane be calculated from the genome sequences [87]. Therefore, this proteins, with 1 gene (CCRNL4) similar to a circadian clock reg- estimate of diversity does not reflect the recent impact on the ulator (Table 6). It is possible that the short list of the positively solenodon population caused by anthropogenic factors in the selected genes could be a consequence of the large comparison last 10,000 years (Fig. 9). group that included mammals very distantly related to solen- Many endangered species with small populations also have odon, and its genes need to be compared with more closely re- reduced heterozygosity across their genomes and would ben- lated species, e.g., once the genome of S. cubanus is reported and efit from a computational approach that reduces the cost and better gene annotations for Sorex araneus become available. optimizes the amount of data for the genome assembly. The Solenodon is one of few mammals that use venomous saliva real-life scenarios where no high-quality DNA can be produced to disable prey. It delivers its venom similarly to snakes—using because of the remoteness of sampling location, when trans- its teeth to inject venomous saliva into its target. Different ap- portation and storage are difficult, or when the high coverage proaches could be used to characterize venom genes, such as cannot be produced due to the limited funds are well known to the use of noncurated databases to widen the search spectrum, many, especially in the field of conservation genetics. The diffi- which may include different molecules that could be found in cult field conditions and international regulations make it diffi- Solenodon. For example, 6,534 toxin and venom protein represen- cult to obtain samples with high-molecular-weight DNA. To aid tatives can be found in the UniProt database. It is also impor- future conservation studies, we have mined the current dataset tant to note that the database of venom gene sequences may for microsatellite markers that are useful within and between not include those relevant to solenodons given their deep di- subspecies, to be used as tools for studies on population diver- vergence from any other venomous mammalian species. The sity, censoring, and monitoring. venom of Solenodon may contain novel protein modifications The comparative analysis of the number and the length of with unknown potential or application, making it valuable for microsatellite alleles pointed once more to the advantage of as- future detailed characterization. sembly B over A and C. The average length of microsatellite short Genes associated with venom, such as serine proteases in- tandem repeats in assembly B was the highest: 20.95 (assembly volved in coagulation (namely, the coagulation factor X), are of A) vs. 21.14 (assembly B) vs. 18.86 (assembly C). This may be a major interest since factor X in solenodon exhibited unusual in- direct consequence of the high number of microsatellite alle- sertions when compared to its homologs (Fig. 7). The detection les that were successfully genotyped in all of the southern sam- of an unusual insertion in a serine protease has been previously ples for assembly B (2,660), as well as microsatellites that proved found in another venomous mammalian species, the shrew Bla- variable between the 2 subspecies but fixed within the southern rina brevicauda, but in a different gene than in solenodon. The samples (639). The low number of variable microsatellites be- coagulation factor X is involved in the circulatory system and is tween the 2 subspecies was likely due to the reduced amount responsible for activating thrombin and inducing clotting. The of information obtainable from a single low coverage genome insertion in the coagulation factor X gene seems to be a hy- of the northern subspecies (S. p. paradoxus)usedinthisstudy. drophilic alpha helix with 3 potential protein–protein interac- Venn diagrams showing overlap in microsatellite variation in 3 tion sites. It occurs at the end of the region annotated as the assemblies are presented in Fig. 11. signal peptide, while having a signal peptide cleavage site itself Recently, a genetic survey using mitochondrial cytochrome b at the beginning of its sequence. The factor X protein structure and control region sequences from 34 solenodon samples iden- was successfully modeled by Swiss-Model based on the ven- tified distinct haplotypes in northern and southern Hispaniola omous elapid snake Pseudonaja textilis (pdb:4bxs)tohaveaheavy [16], along with a distinctive third group, a small remnant pop- chain that contains the serine protease activity, which was mod- ulation at the Massif de la Hotte in the extreme western tip of eled with a high degree of confidence (Fig. 10). The venom Haiti [16, 105] not sampled for this study. The north–south sub- prothrombin activator has an advantage as a toxin in part due species subdivision within S. paradoxus was further supported to modifications in inhibition sites, making it difficult to stop by mitogenomic sequences [12]. The island of Hispaniola has its activity. Another advantage is that the molecules are always been divided into 3 main biogeographic regions that differ in cli- found in an active form (Kinin). We hypothesize that the inser- mate and habitat. The north and center of the island provide the Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 16 Grigorev et al. Figure 10: (A) Simplified version of the coagulation cascade, indicating key steps involving the coagulation factor X (F10). (B) Protein modeling of solenodon s equence data using SWISS-MODEL. The target model (4bxs) used was the F10-like protease of the venomous elapid snake Pseudonaja textilis. Due to its location, the insertion cannot be represented in the model (its location is indicated according to the PDB annotation). Colors indicate model quality, with red being low-quality and blue high-quality modeling. Colors also separate F10’s light chain (EGF-like domain) in red from the heavy chain (serine protease domain) in blue (the half circle line in black separates both domains). (C) Amino acid sequence properties calculated for the solenodon F10 translated gene, with focus on the insertion region 23–43. One signal peptide cleavage site was detected between positions 25 and 26. Predicted protein interaction sites at position 26, 29–30 and 32–40. Hydropathy analysis showed a relatively hydrophilic structure for the insertion. largest area with known solenodon populations and shows no Methods discontinuity with the southeast. However, the solenodon pop- Provenance of the samples is shown on the map (Fig. 2), with ulations in the southwestern part of the island are currently geo- coordinates listed in Supplementary Table S4. Solenodons were graphically isolated by Cordillera Central and may have been iso- caught with help of local guides (Nicolas ´ Corona and Yimell lated in the past by the ancient marine divide across the Neiba Corona). During the day, potential locations were inspected in Valley (Fig. 2). This geographic isolation is likely the reason why daylight for animal tracks, burrows, droppings, and other signs the S. p. paradoxus in the larger northern area and S. p. woodi of solenodon activity. At dawn, ambushes were set up in the in the southwest show morphological differences suggestive of forested areas along the potential animal trails. The approach- separate subspecies [15]. Future conservation strategies directed ing solenodons were identified by sound and chased with flash- at protecting and restoring solenodon populations on Hispaniola lights when approached. Since solenodons move slowly, animals should take into consideration this subdivision and treat the 2 were picked up by their tails, which is the only way to avoid po- subspecies as 2 separate conservation units. tentially venomous bites. All wild-caught animals were released Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 17 Figure 11: Numbers of variable microsatellite alleles discovered in S. paradoxus assemblies. The diagrams were built independently for Fermi-based assemblies (A) and (B) and 1 SOAPdenovo2-based assembly (C). The red circle indicates microsatellites that were successfully genotyped in all samples with at least 1 alternative allele in the southern subspecies (S. p. woodi). The blue circle indicates microsatellites that were successfully genotyped in all samples with at least 1 alternative allele in the northern subspecies (S. p. paradoxus). The overlap indicates microsatellite loci with at least 1 alternative variant found in both subspecies. All alleles discovered, number of fixed alleles in each population, and number of unique alleles in each population are presented in Table S3. All the candidate microsatellite loci discovered in this study, along with their 5’ and 3’ flanking regions, are listed in the Database S8. back into their habitats within 10 minutes after their capture. http://public.dobzhanskycenter.ru/solenodon/repeats/solpar- Before the release, the animals’ tails were marked with a Sharpie a.txt pen to avoid recapturing. http://public.dobzhanskycenter.ru/solenodon/repeats/solpar- Blood was drawn by a licensed ZooDom veterinarian (Adrell b.txt Nunez) ˜ from the vena jugularis using a 3-mL syringe with a 23G x Database S2: List of protein coding genes in the solenodon 1-inch needle. The blood volume collected never exceeded 1% genome (assembly B) of body weight of animals. Before the draw, an aseptic tech- http://public.dobzhanskycenter.ru/solenodon/genes/solpar-b.gff nique was applied using a povidone–iodine solution, followed also cds for each gene and translated sequences by isopropyl alcohol. Once collected, the samples were trans- Database S3: List of the annotated non-coding RNAs ferred to a collection tube with anticoagulant (BD Microtainer, in the solenodon genome http://public.dobzhanskycenter. 1.0 mg K2EDTA for 250–500l L volume). Collection tubes were ru/solenodon/rna refrigerated and transported to the lab at the Instituto Tec- Database S5: List of single-copy orthologs in the solenodon nolog ´ ico de Santo Domingo where DNA was extracted from sam- genome (columns include: ENOG id, gene name) http://public. ples using the DNeasy Blood and Tissue kit (Qiagen, Hilden, Ger- dobzhanskycenter.ru/solenodon/monoorthologs.txt many). Database S6: List of genes with dN/dS values and GO anno- This study was reviewed and approved by the Institutional tations Animal Care and Use Committee of the University of Puerto http://public.dobzhanskycenter.ru/solenodon/selection.xls Rico at Mayaguez. All the required collection and permits had Database S7: List of venom genes http://public. been obtained before any field work was started. The samples dobzhanskycenter.ru/solenodon/venom genes HitGeneDB.fasta were collected and exported in compliance with export permit Database S8: Microsatellite loci discovered in genomes of VAPB-00909 (Dominican Republic Environment and Natural Re- two solenodon subspecies Solenodon paradoxus paradoxus (north- sources Ministry Viceministry of Protected Areas and Biodiver- ern) and S. p. woodi (southern), alleles, 600bp flanking re- sity Department of Biodiversity) and imported in compliance gions (a total of 1,200 bp per locus), and frequency informa- with CITES/ESA import permit 14US84465A/9 (University of Illi- tion for the two subspecies http://public.dobzhanskycenter.ru/ nois Board of Trustees). solenodon/STRs.xlsx Database S9: Lists of single nucleotide differences from the assembled individual genome of Spa-1 (from Solenodon paradoxus Sequencing paradoxus) and Spa K,—L,—M, -N, and –O (from the five S. p. Sequences for S. p. woodi were generated by Illumina HiSeq 2000 woodi)) used to show estimates of heterozygosity in Fig. 8 (see (Illumina Inc) with 100 bp paired-end reads. The Illumina HiSeq explanation in text) generated raw images utilizing HiSeq Control Software (HCS) http://public.dobzhanskycenter.ru/solenodon/variants v2.2.38 for system control and base calling through an inte- Supporting raw data is in the NCBI Sequence Read Archive grated primary analysis software called RTA (Real Time Analy- [ENA: NKTL01000000, PROJECT: PRJNA368679] and genome as- sis. v1.18.61.0). The base call binaries were converted into FASTQ semblies, custom codes and annotations are in the GigaScience utilizing the Illumina package bcl2fastq (v1.8.4). Sequences for S. GigaDB database [22]. p. paradoxus were generated by the Illumina MiSeq V3 (Illumina Inc.) at the Roy J. Carver Biotechnology Center, University of Illi- nois. The sequencing data for each sample used in this study are Additional file presented in Supplementary Table S5. Figure S1: The phylogenetic tree used for multiple genome alignment with Progressive Cactus (Paten et al. 2011). The taxa Availability of supporting data have been chosen based on their availability and the quality of Database S1: Lists of repeats in the solenodon genome (assem- genome assembly, not to make inferences about mammalian blies A and B) phylogeny. This cladogram only shows tree topology, and the Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 18 Grigorev et al. branches do not represent evolutionary time, and do not assume References the basal position of Solenodon paradoxus. 1. MacPhee RDE, Flemming C, Lunde DP. “ Last occurrence” Figure S2: The colors assigned to GO terms represented in of the Antillean insectivoran Nesophontes: new radiomet- Fig. 4. ric dates and their interpretation. American Museum novi- Table S1: Classification of the fast-evolving genes tates; no. 3261. New York, NY: American Museum of Natural (dN/dS>0.25) in solenodon genome using PANTHER Overrep- History; 1999. resentation Test (release 20160715) and GO Ontology database 2. Ottenwalder JA. Systematics and biogeography of the West (Released 2017-02-28) based on the Homo sapiens genes. Genes Indian genus Solenodon. Biogeogr. West Indies Patterns are only represented if P < 0.05 after the Bonferroni correction Perspect. Second Ed. CRC Press; 2001, 253–329. for multiple testing (Treangen and Salzberg, 2012). 3. Roca AL, Bar-Gal GK, Eizirik E et al. Mesozoic origin for Table S2: Classification of the fast-evolving genes (w >0.25) West Indian insectivores. Nature, Nature Publishing Group; in solenodon genome using PANTHER Overrepresentation Test 2004;429(6992):649–51. (release 20160715) and GO Ontology database (Released 2017-02- 4. Verill AH. Notes on the habits and external characters of the 28) based on the Mus musculus genes. Genes are only represented solenodon of San Domingo (Solenodon paradoxus). American if P < 0.05 after the Bonferroni correction for multiple testing Journal of Science 1907;XXIV(139):55–57. (Treangen and Salzberg, 2012). 5. Allen JA. Notes on Solenodon paradoxus Brandt. Bull Am Mu- Table S3: Microsatellite alleles discovered in genomes of two seum Nat Hist 1908;XXIV:505–5017. solenodon subspecies Solenodon paradoxus paradoxus (northern) 6. Brandt JF. De Solenodonte: novo mammalium insec- and S. p. woodi (southern). tivororum genere. Mem. l’Academie ´ Imperiale ´ des Sci. Table S4: Locations for the samples used in this study as de- St. Petersbg. ´ l’Academie ´ Imperiale ´ des Sciences de St. scribed in Fig. 2. Adopted from Brandt et al., (2016). Petersbour ´ g; 1833;2:459–78. Table S5: Sequencing data for the samples used in this 7. Derbridge JJ, Posthumus EE, Chen HL et al. Solenodon study. paradoxus (Soricomorpha: Solenodontidae). Mammalian Species 2015;47(927):100–6. Abbreviations 8. Feldhamer GA, Drickamer LC, Vessey SH et al. Mammalogy: Adaptation, Diversity, and Ecology. 4th ed. Johns Hopkins BLAST: Basic Local Alignment Search Tool for Proteins; BUSCO: University Press, Baltimore, Maryland. 2015, p. 768 benchmarking universal single-copy orthologs; CDS: coding se- 9. Wible JR. On the cranial osteology of the Hispaniolan quence; GO: gene ontology; IUCN: International Union for Con- solenodon, Solenodon paradoxus Brandt, 1833 (Mammalia, servation of Nature; Kya: thousand (kilo) years ago; Mya: million Lipotyphla, Solenodontidae). Annals of Carnegie Museum years ago; NCBI: National Center for Biotechnology Information; 2008;77(3):321–402. PSMC: pairwise sequentially Markovian coalescent; SNV: single 10. Folinsbee KE, Muller ¨ J, Reisz RR. Canine grooves: morphol- nucleotide variation; STR: short tandem repeat; UCN: Interna- ogy, function, and relevance to venom. Journal of Vertebrate tional Union for Conservation of Nature. Paleontology 2007;27(2):547–51. 11. Dufton MJ. Venomous mammals. Pharmacology & Thera- peutics. Elsevier; 1992;53(2):199–215. Competing interests 12. Brandt AL, Grigorev K, Afanador-Hernandez ´ YM et al. The authors declare that they have no competing interests. Mitogenomic sequences support a North-South subspecies subdivision within Solenodon paradoxus. Mitochondrial DNA Part A [Internet]. Taylor & Francis; 2017;28(5):662– Author contributions 70. Available from: https://www.ncbi.nlm.nih.gov/ KG, TKO, JCMC and ALR conceived and designed the study. KG, pubmed/27159724. SK, PD, AK, WW, FS, and TKO did the analysis. YMAH, ALB, LAP, 13. Sato JJ, Ohdachi SD, Echenique-Diaz LM et al. Molecular RC, AN, JRB, JCMC and TKO collected, processed, and contributed phylogenetic analysis of nuclear genes suggests a Ceno- samples in the field and the laboratory, KG, AJM, KG, AA, ALR, zoic over-water dispersal origin for the Cuban solenodon. SJO, JCMC and TKO wrote the draft. All authors contributed to Sci Rep. 2016;6(1):31173. finalizing of the manuscript. 14. Ottenwalder JA. The Distribution and Habitat of Solen- odon in the Dominican Republic. MS thesis Univ. Florida, Gainesville. 1985, p. 128. Funding 15. Ottenwalder JA. The Systematics, Biology and Conserva- Authors at the University of Puerto Rico at Mayaguez were sup- tion of Solenodon. Ph.D. dissertation, University of Florida, ported in part by a National Science Foundation award (1432092). Gainesville 1991, p. 281. Authors at the Theodosius Dobzhansky Center for Genome 16. Turvey S, Inchaustegui S. 2008. Solenodon para- Bioinformatics were supported by a Russian Ministry of Science doxus. The IUCN Red List of Threatened Species 2008: e.T20321A9186243. http://dx.doi.org/10.2305/IUCN.UK. mega-grant (11.G34.31.0068) and a St. Petersburg State Univer- sity grant (1.50.1623.2013). 2008.RLTS.T20321A9186243.en. Downloaded on 07 April 17. http://www.iucnredlist.org/details/20321/0. Acknowledgements 18. Luo R, Liu B, Xie Y et al. SOAPdenovo2: an empirically im- proved memory-efficient short-read de novo assembler. Gi- The authors thank Nicolas ´ Corona and Yimell Corona for assis- gaSci 2012;1(1):589–602. tance in collecting samples. A special thanks to Kara Fore for 19. Li H. Exploring single-sample SNP and INDEL calling with helping to edit the manuscript. Sample collection in the Domini- whole-genome de novo assembly. Bioinformatics. Oxford can Republic was performed under permit 00201171139 from the University Press; 2012;28(14):1838–44. Ministry of Environment and Natural Resources. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 19 20. Boetzer M, Henkel CV, Jansen HJ et al. Scaffolding pre- 39. Altschul SF, Gish W, Miller W et al. Basic local align- assembled contigs using SSPACE. Bioinformatics. Oxford ment search tool. Journal of Molecular Biology Elsevier; University Press; 2011;27(4):578–9. 1990;215(3):403–10. 21. Wang Y, Lu Y, Zhang Y et al. The draft genome of the grass 40. Bateman A, Coin L, Durbin R et al. The Pfam protein families carp (Ctenopharyngodon idellus) provides insights into its evo- database. Nucleic Acids Res. 2004;32(90001):138D–141. lution and vegetarian adaptation. Nat Genet 2015;47(6):625– 41. The UniProt Consortium. UniProt: a hub for protein 31. information. Nucleic Acids Research. 2015;43(Database 22. Grigorev K, Kliver S, Dobrynin P et al. Supporting data issue):D204–12. Available at: doi:10.1093/nar/gku989. for “Innovative assembly strategy contributes to un- 42. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA ho- derstanding the evolution and conservation genetics of mology searches. Bioinformatics. Oxford University Press; the endangered Solenodon paradoxus from the island of 2013;29(22):2933–5. Hispaniola.” GigaScience Database. 2018; http://dx.doi. 43. Nawrocki EP, Burge SW, Bateman A et al. Rfam 12.0: up- org/10.5524/100422. dates to the RNA families database. Nucleic Acids Res. 23. Starostina E, Tamazian G, Dobrynin P et al. Cookiecut- 2015;43(Database issue):D130–7. doi:10.1093/nar/gku1063. ter: a tool for kmer-based read filtering and extraction. 44. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved bioRxiv Cold Spring Harbor Labs Journals 2015;24679. Avail- detection of transfer RNA genes in genomic sequence. Nu- able from: https://doi.org/10.1101/024679. cleic Acids Res. 1997;25(5):955–64. 24. Marc¸ais G, Kingsford C. A fast, lock-free approach for effi- 45. Seemann T, Booth T. BARNAP: BAsic Rapid Ribosomal cient parallel counting of occurrences of k-mers. Bioinfor- RNA Predictor [Internet]. Berlin: GitHub; 2013. p. https: matics. 2011;27(6):764–70. //github.com/tseemann/barrnap. Accessed 1st March 2018. 25. Marc¸ais G, Yorke JA, Zimin A. QuorUM: an error corrector 46. Ruan J, Li H, Chen Z et al. TreeFam: 2008 update. Nucleic for Illumina reads. PLoS One 2015;10(6):e0130821. Acids Res. 2008;36(Database):D735–40. 26. Chikhi R, Medvedev P. Informed and automated 47. Li H, Coghlan A, Ruan J et al. TreeFam: a curated database k-mer size selection for genome assembly. Bioin- of phylogenetic trees of animal gene families. Nucleic Acids formatics [Internet]. 2014;30:31–7. Available from: Res. 2006;34(90001):D572–80. http://dx.doi.org/10.1093/bioinformatics/btt310. 48. Huerta-Cepas J, Szklarczyk D, Forslund K et al. eggNOG 27. Li H, Homer N. A survey of sequence alignment algo- 4.5: a hierarchical orthology framework with improved rithms for next-generation sequencing. Brief Bioinform functional annotations for eukaryotic, prokaryotic and vi- 2010;11(5):473–83. ral sequences. Nucleic Acids Research. 2016;44(Database 28. Gurevich A, Saveliev V, Vyahhi N et al. QUAST: quality issue):D286–93. Available at: doi:10.1093/nar/gkv1248. assessment tool for genome assemblies. Bioinformatics. 49. Stamatakis A. RAxML version 8: a tool for phylogenetic 2013;29(8):1072–5. analysis and post-analysis of large phylogenies. Bioinfor- 29. Simao ˜ FA, Waterhouse RM, Ioannidis P et al. BUSCO: assess- matics 2014;30(9):1312–3. ing genome assembly and annotation completeness with 50. Yang Z. PAML 4: phylogenetic analysis by maximum like- single-copy orthologs. Bioinformatics. 2015;31(19):3210–2. lihood. Molecular Biology and Evolution 2007;24(8):1586– 30. Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately 91. annotate core genes in eukaryotic genomes. Bioinformat- 51. Xia X, Xie Z, Salemi M et al. An index of substitution satura- ics. 2007;23(9):1061–7. tion and its application. Mol Phylogenet Evol 2003;26(1):1–7. 31. Hunt M, Kikuchi T, Sanders M et al. REAPR: a univer- 52. Xia X, Lemey P. Assessing substitution saturation with sal tool for genome assembly evaluation. Genome Biol. DAMBE. In: Lemey Philippe, Salemi, Marco, Vandamme, 2013;14(5):R47. Anne-Mieke eds. Phylogenetic Handb. A Pract. Approach 32. Paten B, Earl D, Nguyen N et al. Cactus: algorithms for to DNA Protein Phylogeny. 2nd ed. Cambridge University genome multiple sequence alignment. Genome Research. Press; 2009, 615–30. 2011;21(9):1512–28. 53. Xia X. DAMBE6: new tools for microbial genomics, phy- 33. Kolmogorov M, Raney B, Paten B et al. Ragout–a reference- logenetics, and molecular evolution. J Hered 2017;108(4): assisted assembly tool for bacterial genomes. Bioinformat- 431–7. ics. 2014;30(12):i302–9. 54. Ksepka DT, Parham JF, Allman JF et al. The fossil calibration 34. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996– database—a new resource for divergence dating. Syst Biol 2010 Accessible at: http://www.repeatmasker.org. 2015;syv025. 35. Bao W, Kojima KK, Kohany O. Repbase Update, a database 55. Benton MJ, Donoghue PCJ, Asher RJ et al. Constraints on the of repetitive elements in eukaryotic genomes. Mobile DNA timescale of animal evolutionary history. Palaeontol Elec- 2015;6(1):11. tron, Paleontological Society; 2015;18:1–106. 36. Slater GSC, Birney E. Automated generation of heuristics 56. Munthe K. Canidae. Evol. Tert. Mamm. North Am. Cam- for biological sequence comparison. BMC Bioinformatics. bridge University Press, Cambridge. 1998, p. 124–43. 2005;6(1):31. 57. Wang X, Whistler DP, Takeuchi GT. A new basal skunk 37. Stanke M, Keller O, Gunduz I et al. AUGUSTUS: ab ini- Martinogale (Carnivora, Mephitinae) from late miocene tio prediction of alternative transcripts. Nucleic Acids Res. dove spring formation, California, and origin of New 2006;34(Web Server):W435–9. World mephitines. Journal of Vertebrate Paleontology 38. Finn RD, Clements J, Eddy SR. HMMER web server: in- 2005;25(4):936–49. teractive sequence similarity searching. Nucleic Acids Re- 58. Rambaut A. FigTree [Internet]. 2016. Available from: search. 2011;39(Web Server issue):W29–37. Available at: http://tree.bio.ed.ac.uk/software/figtree/ Accessed 1 March doi:10.1093/nar/gkr367. 2018. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 20 Grigorev et al. 59. Springer MS, Murphy WJ, Roca AL. Appropriate fossil cal- 79. Jungo F, Bougueleret L, Xenarios I et al. The ibrations and tree constraints uphold the Mesozoic diver- UniProtKB/Swiss-Prot Tox-Prot program: a central hub gence of solenodons from other extant mammals. Molecu- of integrated venom protein data. Toxicon 2012;60(4):551– lar Phylogenetics and Evolution 2018;121:158–65. 7. 60. Hedges SB. Vicariance and dispersal in Caribbean biogeog- 80. Langmead B, Salzberg SL. Fast gapped-read alignment with raphy. Herpetologica 1996;52:158–65. Bowtie 2. Nat Methods 2012;9(4):357–9. 61. McDowell SB. The greater antillean insectivores. Bull Am 81. Li H, Durbin R. Fast and accurate short read align- Museum Nat Hist [Internet]. 1958;115:117. Available from: ment with Burrows-Wheeler transform. Bioinformatics http://hdl.handle.net/2246/1199. 2009;25(14):1754–60. 62. Butler PM. Phylogeny of the insectivores. In: Benton MJ, eds. 82. Danecek P, Auton A, Abecasis G et al. The variant call Phylogeny Classif. Tetrapods. Oxford: Clarendon; 1988, 117– format and VCFtools. Bioinformatics. 2011;27(15):2156– 41. 8. 63. MacPhee RD, Novacek M. Definition and relationships of 83. Dobrynin P, Liu S, Tamazian G et al. Genomic legacy the Lipotyphla. In: Soule F, Novacek M, McKenna M, eds. of the African cheetah, Acinonyx jubatus. Genome Biol Mamm. Phylogeny, Vol. 2, Placentals. New York: Springer- 2015;16(1):277. doi: 10.1186/s13059-015-0837-4. Verlag; 1993, 13–31. 84. Gordon D, Huddleston J, Chaisson MJP et al. Long- 64. McKenna M, Bell S, Simpson S. Classification of Mammals read sequence assembly of the gorilla genome. Science. Above the Species Level. New York: Columbia University 2016;352(6281):aae0344-. doi:10.1126/science.aae0344. Press; 1997. 85. Li R, Fan W, Tian G et al. Erratum: the sequence and 65. Edgar RC. MUSCLE: multiple sequence alignment with de novo assembly of the giant panda genome. Nature. high accuracy and high throughput. Nucleic Acids Res. 2010;463(7284):1106-. 2004;32(5):1792–7. 86. Cho YS, Hu L, Hou H et al. The tiger genome and compara- 66. Suyama M, Torrents D, Bork P. PAL2NAL: robust conver- tive analysis with lion and snow leopard genomes. Nature sion of protein sequence alignments into the correspond- Communications. 2013;4:2433. ing codon alignments. Nucleic Acids Res. 2006;34(Web 87. Li H, Durbin R. Inference of human population his- Server):W609–12. tory from individual whole-genome sequences. Nature. 67. Soto-Giron ´ MJ, Ospina OE, Massey SE. Elevated levels of 2011;475(7357):493–6. adaption in Helicobacter pylori genomes from Japan; a link 88. Weber JL. Human DNA polymorphisms and methods of to higher incidences of gastric cancer? Evolution, Medicine, analysis. Current Opinion in Biotechnology 1990;1(2):166– and Public Health. 2015;2015(1):88–105. 71. 68. Gharib WH, Robinson-Rechavi M. The Branch-Site Test of 89. Weber JL, Wong C. Mutation of human short tandem re- Positive Selection Is Surprisingly Robust but Lacks Power peats. Hum Mol Genet 1993;2(8):1123–8. under Synonymous Substitution Saturation and Variation 90. Weber JL, May PE. Abundant class of human DNA polymor- in GC. Molecular Biology and Evolution. 2013;30(7):1675–86. phisms which can be typed using the polymerase chain re- 69. Tang H, Klopfenstein D, Pedersen B et al. GOATOOLS: Tools action. Am J Hum Genet 1989;44:388–96. PMC1715443 for gene ontology [Internet]. Zenodo 2015. Available from: 91. Benson G. Tandem repeats finder: a program to ana- https://doi.org/10.5281/zenodo.31628. lyze DNA sequences. Nucleic Acids Res. 1999;27(2):573– 70. Baggs JE, Green CB. Nocturnin, a Deadenylase in Xenopus 80. laevis Retina. Current Biology 2003;13(3):189–98. 92. Willems T, Zielinski D, Yuan J et al. Genome-wide pro- 71. Oleksyk TK, Smith MW, O’Brien SJ. Genome-wide filing of heritable and de novo STR variations. Nat Meth scans for footprints of natural selection. Philosophical 2017;14(6):590–2. Transactions of the Royal Society B: Biological Sciences 93. Koepfli K-P, Paten B, O’Brien SJ. The Genome 10K Project: a 2010;365(1537):185–205. way forward. Annu Rev Anim Biosci; 2015;3:57–111. 72. Fay JC, Wyckoff GJ, Wu CI. Positive and negative selection 94. Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn on the human genome. Genetics 2001;158:1227–34. graphs to genome assembly. Nat Biotechnol 2011;29:987– 73. Fay JC, Wu C-I. The neutral theory in the genomic era. 91. Current Opinion in Genetics & Development 2001;11(6): 95. Myers EW. Toward simplifying and accurately formulat- 642–6. ing fragment assembly. Journal of Computational Biology 74. Ellegren H. Evolution: natural selection in the evolution of 1995;2:275–90. humans and chimps. Current Biology 2005;5(22):R919–22. 96. Myers EW. The fragment assembly string graph. Bioinfor- 75. Gaya-V ` idal M, Alba` M. Uncovering adaptive evolution in the matics 2005;21:ii79–85. human lineage. BMC Genomics 2014;15(1):R919–22. 97. Bradnam KR, Fass JN, Alexandrov A et al. Assemblathon 2: 76. Bakewell MA, Shi P, Zhang J. More genes underwent posi- evaluating de novo methods of genome assembly in three tive selection in chimpanzee evolution than in human evo- vertebrate species. GigaSci 2013;2:10. lution. Proceedings of the National Academy of Sciences 98. Gogarten JP, Olendzenski L. Orthologs, paralogs and 2007;104(18):7489–94. genome comparisons. Current Opinion in Genetics & De- 77. Olson MV, Varki A. Sequencing the chimpanzee genome: velopment 1999:630–6. insights into human evolution and disease. Nat Rev Genet 99. Creevey CJ, Muller J, Doerks T et al. Identifying single copy 2003;4(1):20–28. orthologs in metazoa. PLoS Comput Biol 2011;7:e1002269. 78. Mi H, Huang X, Muruganujan A et al. PANTHER version 100. Treangen TJ, Salzberg SL. Repetitive DNA and next- 11: expanded annotation data from Gene Ontology and Re- generation sequencing: computational challenges and so- actome pathways, and data analysis tool enhancements. lutions. Nat Rev Genet 2012;13:36–46. Nucleic Acids Research. 2017;45(Database issue):D183–9. 101. Volfovsky N, Oleksyk TK, Cruz KC et al. Genome and gene Available at doi:10.1093/nar/gkw1138. alterations by insertions and deletions in the evolution of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 21 human and chimpanzee chromosome 22. BMC Genomics new recommendations regarding IUCN conservation rank- 2009;10:51. ings. Biological Conservation 2015;191:495–503. 102. Rabb GB. Toxic salivary glands in the primitive insectivore 105. Turvey ST, Meredith HMR, Scofield RP. Continued survival Solenodon. Nat Hist Misc 1959;170:1–3. of Hispaniolan solenodon Solenodon paradoxus in Haiti. ORX. 103. Allendorf FW, Luikart G. Conservation and the Genetics of Cambridge University Press; 2008;42:611–4. Populations. John Wiley & Sons; 2009. 106. Gene Ontology Consortium. The Gene Ontology (GO) 104. Willoughby JR, Sundaram M, Wijayawardena BK et al. The database and informatics resource. Nucleic Acids Res. Ox- reduction of genetic diversity in threatened vertebrates and ford Univ Press; 2004;32:D258–61. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Loading next page...
 
/lp/ou_press/innovative-assembly-strategy-contributes-to-understanding-the-FyMI5rfNh0
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/giy025
Publisher site
See Article on Publisher Site

Abstract

Photo of a Solenodon paradoxus from the island of Hispaniola, courtesy of Eladio Fernandez, Caribbean Nature Photography Grigorev et al. Grigorev et al. GigaScience 2018, 7:6 https://doi.org/10.1093/gigascience/giy025 Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 GigaScience, 7, 2018, 1–21 doi: 10.1093/gigascience/giy025 Advance Access Publication Date: 16 March 2018 Research RESEARCH Innovative assembly strategy contributes to understanding the evolution and conservation genetics of the endangered Solenodon paradoxus from the island of Hispaniola 1,† 2,† 2 2 Kirill Grigorev ,SergeyKliver , Pavel Dobrynin , Aleksey Komissarov , 1,3 2 Walter Wolfsberger , Ksenia Krasheninnikova , 1 4,5 6 Yashira M. Afanador-Hernandez ´ ,AdamL.Brandt , Liz A. Paulino , 6 6 7 4,8 Rosanna Carreras , Luis E. Rodr´ıguez ,AdrellNu´ nez ˜ , Jessica R. Brandt , 9,10 11 1 Filipe Silva ,J.David Hernandez-Martic ´ h , Audrey J. Majeske , 9,10 4,12 2,13 Agostinho Antunes , Alfred L. Roca , Stephen J. O’Brien , 1 1,3,∗ Juan Carlos Mart´ınez-Cruzado and Taras K. Oleksyk 1 2 Department of Biology, University of Puerto Rico at Mayaguez, ¨ Mayaguez, ¨ Puerto Rico, Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia, Biology Department, Uzhhorod National University, Uzhhorod, Ukraine, Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA, Division of Natural Sciences, St. Norbert College, De Pere, Wisconsin, USA, Instituto Tecnolog ´ ico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic, Department of Conservation and Science, Parque Zoologico Nacional (ZOODOM), Santo Domingo, Dominican 8 9 Republic, Department of Biology, Marian University, Fond du Lac, Wisconsin, USA, CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixoes, ˜ Av. General Norton de Matos, s/n, 4450–208 Porto, Portugal, Department of Biology, Faculty of Sciences, University of Porto. Rua do Campo Alegre, 4169-007 Porto, Portugal, Instituto de Investigaciones Botanicas ´ y Zoolog ´ icas, Universidad Autonoma ´ de Santo Domingo, Santo Domingo, Dominican Republic, Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA and Oceanographic Center, Nova Southeastern University, Fort Lauderdale, Florida, USA Correspondence address. Departamento de Biolog´ıa Carr. 108 Barrio Miradero Km 1.3. Entrada al Zoolog ´ ico Mayaguez, ¨ PR 00680 E-mail: taras.oleksyk@upr.edu These authors contributed equally. Received: 20 July 2017; Revised: 26 January 2018; Accepted: 7 March 2018 The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 2 Grigorev et al. Abstract Solenodons are insectivores that live in Hispaniola and Cuba. They form an isolated branch in the tree of placental mammals that are highly divergent from other eulipothyplan insectivores The history, unique biology, and adaptations of these enigmatic venomous species could be illuminated by the availability of genome data. However, a whole genome assembly for solenodons has not been previously performed, partially due to the difficulty in obtaining samples from the field. Island isolation and reduced numbers have likely resulted in high homozygosity within the Hispaniolan solenodon (Solenodon paradoxus). Thus, we tested the performance of several assembly strategies on the genome of this genetically impoverished species. The string graph–based assembly strategy seemed a better choice compared to the conventional de Bruijn graph approach due to the high levels of homozygosity, which is often a hallmark of endemic or endangered species. A consensus reference genome was assembled from sequences of 5 individuals from the southern subspecies (S. p. woodi). In addition, we obtained an additional sequence from 1 sample of the northern subspecies (S. p. paradoxus). The resulting genome assemblies were compared to each other and annotated for genes, with an emphasis on venom genes, repeats, variable microsatellite loci, and other genomic variants. Phylogenetic positioning and selection signatures were inferred based on 4,416 single-copy orthologs from 10 other mammals. We estimated that solenodons diverged from other extant mammals 73.6 million years ago. Patterns of single-nucleotide polymorphism variation allowed us to infer population demography, which supported a subspecies split within the Hispaniolan solenodon at least 300 thousand years ago. Keywords: genome assembly; de Bruijn; string graph; Solenodon paradoxus; Hispaniola; Caribbean; island evolution; natural selection; isolation; heterozygosity which saliva secreted by the submaxillary gland flows into the Background victim [10]. The genus name Solenodon means “grooved tooth” The only 2 surviving species of solenodons, found on the 2 in Greek and refers to the shape of this incisor. Although solen- largest Caribbean islands, Hispaniola (Solenodon paradoxus) and odons rarely bite humans, the bites can be very painful (Nicolas ´ Cuba (Solenodon cubanus), are among the few endemic terres- Corona, personal communication), and even a small injection of trial mammals that survived human settlement of these islands. venom has been shown to be fatal to mice in a matter of min- Phenotypically, solenodons somewhat resemble shrews (Fig. 1), utes [7]. The chemical composition of solenodon venom has not but molecular evidence indicates that they are actually the sister yet been resolved [11]. group to all other extant eulipotyphlan insectivores (hedgehogs, Roca et al. [3] sequenced 13.9 kb of nuclear and mitochon- moles, shrews) from which they split in the Cretaceous Period drial sequences of S. paradoxus, inferring that solenodon diver- [1–3]. These enigmatic species have various local names in Cuba gence from other eulipotyphlan mammals such as shrews and and Hispaniola, including oso (bear), hormiguero (ant-eater), joron moles dates back to the Cretaceous Period, ∼76 million years (ferret), milquı (or almiquı), and agouta [4, 5], all pointing to the ´ ´ ago (Mya), before the mass extinction of the dinosaurs ∼66 Mya. first impression made on the Spanish colonists by its unusual Brandt et al. [12] sequenced complete mitogenome sequences of appearance. Today, the Hispaniolan solenodon (S. paradoxus)is 6 Hispaniolan solenodon specimens from the southern part of difficult to find in the wild because of its nocturnal activity pat- Hispaniola (Fig. 2), corroborating this conclusion, and estimated tern and its low population numbers. Here, we report the as- that S. paradoxus diverged from all other mammals ∼78 Mya [12]. sembly and annotation of the nuclear genome sequences and Other studies have reported similarly deep divergence dates (re- genomic variation of 2 subspecies of S. paradoxus.Weusedan- viewed by Sato et al. [13]). Whole genome analysis of S. paradoxus alytical strategies that will allow researchers to formulate hy- could provide support and validation to the earlier evolutionary potheses and develop genetic tools to assist future studies of studies. evolutionary inference and conservation applications. Morphometric studies suggest that southern and northern Solenodon paradoxus was originally described from a skin and Hispaniolan solenodons may be distinctive enough to be consid- partial skull at the St. Petersburg Academy of Sciences in Russia ered separate subspecies [2, 14, 15], a notion supported by recent [6]. It has a large head with a long rostrum and tiny eyes and ears mitochondrial DNA studies [12, 16]. The southern Hispaniolan partially hidden by the dusky brown body fur that turns reddish solenodons had less genetic diversity than those in the north, on the sides of the head, throat, and upper chest. The tail, legs, so that the control region sequences of all 5 southern speci- snout, and eyelids of the S. paradoxus are hairless. The front legs mens (the same individuals used in this study) were identical are noticeably more developed, but all 4 have strong claws use- or nearly identical [12], indicating that Hispaniolan solenodons ful for burrowing (Fig. 1). Adult animals measure 49–72 cm in have a very low level of mitochondrial diversity. total length and weigh more than 1 kg [7]. Solenodons are social It may now be imperative to study conservation genomics of animals; they spend their days in extensive underground tun- solenodons because their extinction would mean the loss of an nel networks shared by family groups and come to the surface entire evolutionary lineage whose antiquity goes back to the age at night to hunt small vertebrates and large invertebrates [8]. A of dinosaurs. Solenodon paradoxus survived in spectacular island unique feature is the os proboscidis, a bone that extends forward isolation despite the devastating human impact to biodiversity from the nasal opening to support the snout cartilage [9]. Solen- in recent centuries [3, 12]. Nevertheless, survival of this species odons are venomous mammals that display a fascinating strat- is now threatened by deforestation, increasing human activity, egy for venom delivery. The second lower incisor of solenodons and predation by introduced dogs, cats, and mongooses. It is de- has a narrow, almost fully enfolded tubular channel, through clining in population, its habitat is severely fragmented, and it Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 3 Figure 1: The 2 subspecies of Solenodon paradoxus. (A) A captive Hispaniolan solenodon from the northern subspecies (S. p. paradoxus) photographed at the Santo Domingo Zoo (photo taken by Juan C. Mart´ınez-Cruzado in 2014). (B) A mounted specimen of the southern subspecies (S. p. woodi) photographed at the Museo Nacional de Historia Natural prof. Eugenio de Jesus ´ Marcano in Santo Domingo, Dominican Republic (photo taken by Taras K. Oleksyk in 2017). Figure 2: Origins of the genomic DNA samples of Solenodon paradoxus from the island of Hispaniola. Approximate locations of capture for 5 wild individuals of S. p. woodi: Spa-K and Spa-L from La Canada ˜ del Verraco, as well as Spa-M, Spa-N, and Spa-O from the El Manguito location in the Pedernales Province in the southwest corner of the Dominican Republic bordering Haiti. In addition, 1 S. p. paradoxus sample (Spa-1) was taken from Cordillera Septentrional in the northern part of the island. Exact coordinates of each sample location are listed in [12]. The dashed line indicates the position of the Cul de Sac Plain and Neiba Valley; this region was periodically inundated by a marine canal that separated Hispaniola into north and south paleo-islands during the Pliocene and Pleistocene [15]. The original map is in the public domain (courtesy of NASA). is listed as endangered by the International Union for Conser- performs better than the assembly by a traditional de Bruijn al- vation of Nature (IUCN) Red List of Threatened Species (Red List gorithm (SOAPdenovo2) [18]. We used the string-graph assem- category B2ab, assessed in 2008 [17]). bler Fermi [19] as a principal tool for contig assembly in con- In this study, we assembled the genome of S. paradoxus us- junction with SSPACE [20] and GapCloser [18] for scaffolding. The ing low coverage genome data (∼5× each) from 5 individuals resulting genome sequence data was sufficient for high-quality of S. paradoxus woodi. We took advantage of the low individ- annotation of genes and functional elements, as well as for com- ual and population genetic diversity to pool individual data parative genomics and population genetic analyses. Prior to this and apply a string graph assembly approach that resulted in a study, the string-graph assembler Fermi [19] had been used only working genome assembly of the S. paradoxus genome from the in studies for annotation or as a complementary tool for de novo combined paired-end dataset (∼26×;Fig. 3). Our methodology assemblies made with de Bruijn algorithms [21]. We present and introduces a useful pipeline for genome assembly to compen- compare genome assemblies for the southern subspecies (S. p. sate for the limited amount of sequencing that, in this instance, woodi) based on several combinations of assembly tools, provide Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 4 Grigorev et al. in the northern part of the island. Figure 2 highlights geographi- cal locations of sample collection points for the samples used in this study. The 5 S. p. woodi samples were sequenced using Hiseq2000 technology (Illumina Inc.), resulting in an average of 151,783,327 paired-end reads 101 bp long, or 15.33 Gb of sequence data, per individual. In addition, DNA extracted from the northern solen- odon (S. p. paradoxus; Spa-1) was sequenced using MiSeq V3 tech- nology (Illumina Inc.) and produced 52,358,830 paired-end reads, equating to approximately 13.09 Gb of sequence data. Only the samples of S. paradoxus woodi were used for assembly since the northern subspecies (S. paradoxus paradoxus) did not have suffi- cient coverage for the de novo assembly. Further details about sample collection, DNA extraction, li- brary construction, and sequencing can be found in the Meth- ods section. The whole genome shotgun data from this project have been deposited at DDBJ/ENA/GenBank under the acces- sion NKTL00000000. The version described here is version NKTL01000000. The genome data has also been deposited into the NCBI under BioProject PRJNA368679 and to the GigaScience GigaDB repository [22]. Read correction After the reduction of adapter contamination with Cookiecut- ter [23], the k-mer distribution in the reads for the 5 individu- als of S. paradoxus woodi was assessed with Jellyfish (Jellyfish, Figure 3: Heterozygosity and k-mer distribution. k-mer distributions for the S. p. RRID:SCR 005491)[24]. The predicted mean genome coverage woodi reads. Only 1 original sample (SPA-K) distribution is shown as a solid gray was approximately 5× for each sample (Fig. 3), which is too low line, as the distributions were identical for each of the individual samples. The for individual de novo genome assembly. However, because of the predicted mean genome coverage was approximately 5x for each sample (x = 5). One example is plotted by a black solid line on the left. The combined uncor- extremely low levels of genetic diversity suggested by the ear- rected dataset is plotted in a dashed red line indicates a maximum at x = 26. The lier study of mitochondrial DNA in the southern subspecies [12] combined dataset corrected with QuorUM [25] is plotted in a solid blue line, also and in order to increase the average depth of coverage, the reads with a maximum at x = 26. A smaller local maximum on the left side for both from the 5 samples were combined into a single dataset. As a combined distributions, corrected and uncorrected (representing k-mers found result, the projected mean genome coverage for the combined once or very few times), is expected from differences between overlapping reads, genome assembly was 26×. Error correction was applied with most likely the sequencing errors. Other local maxima (seen as a small bulge at the x = 5) are interpreted as heterozygous sites. These proved to have almost no QuorUM (QuorUM, RRID:SCR 011840)[25] using the value k = 31. impact on the combined sample even after read correction, indicating a lack of The k-mer distribution analysis by Jellyfish in the combined and heterozygous sites for this solenodon subspecies. The largest local maxima (to error-corrected dataset indicated very low levels of heterozygos- the right) are interpreted as projected coverage. For the combined samples, this ity in accordance with the hypothesis (see Fig. 3 legend), allow- value is x = 26. ing use of the combined dataset for further genome assembly. Using KmerGenie [26], the genome size has been estimated to be 2.06 Gbp. a high-quality annotation of genome features and describe ge- netic variation in 2 subspecies (S. p. woodi and S. p. paradoxus), make inferences about recent evolution and selection signatures Analyses in genes, trace demographic histories, and develop molecular tools for future conservation studies. Assembly tool combinations We used several alternative combinations of tools to determine Data Description the best approach to an assembly of the combined genome data, outlined in Table 1. First, the combined libraries of paired-end Sample collection and sequencing reads were assembled into contigs with Fermi, a string graph– Five adult individuals of S. paradoxus woodi (National Center for based tool [19]. Second, the same libraries were also assembled Biotechnology Information [NCBI] Taxon ID:1906352) from the with SOAPdenovo2, a de Bruijn graph–based tool (SOAPdenovo2, southern Dominican Republic were collected in the wild follow- RRID:SCR 014986)[18]. The optimal k-mer length parameter for ing a general field protocol described earlier [ 12]; this included SOAPdenovo2 was determined to be k = 35 with the use of Kmer- 2 specimens caught from La Canada ˜ del Verraco and 3 from Genie [26]. For the scaffolding step, we used either SSPACE (SS- the El Manguito location in the Pedernales Province. The cap- PACE, RRID:SCR 005056)[20] or the scaffolding module of SOAP- tured individuals were visually assessed for obvious signs of dis- denovo2 [18]. For all instances, the GapCloser module of SOAP- ease, weighed, measured, sexed, and released at the capture site, denovo2 was used to fill in gaps in the scaffolds (GapCloser, all within 10 minutes of capture. Geographic coordinates were RRID:SCR 015026)[18]. After assembly, datasets were trimmed; recorded for every location. In addition, 1 S. p. paradoxus (Spa- scaffolds shorter than 1 Kbp were removed from the output. In 1) sample was acquired through collaboration with ZooDom at Table 1, the 4 possible combinations of tools used for the assem- Santo Domingo that originated in the Cordillera Septentrional bly are referred to with capital letters A, B, C, and D for brevity. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 5 Table 1: Description of the assembly strategies and comparison of metrics for the resulting assemblies. Assembly names A B C D Assembly tools Contig assembly tool Fermi Fermi SOAPdenovo2 SOAPdenovo2 Scaffolding tool SOAPdenovo2 SSPACE SOAPdenovo2 SSPACE Gap closing tool GapCloser GapCloser GapCloser GapCloser Assembly metrics Total contigs (>1,000 bp) 71,429 71,429 189,566 189,566 Contig N50 54,944 54,944 4,048 4,048 Contig CEGMA (%) 96.37(77.42) 96.37(77.42) 68.15(33.06) 68.15(33.06) Contig BUSCO (%) 86(65) 86(65) 42(21) 42(21) Total scaffolds (>1,000 bp) 14,417 40,372 20,466 - Final N50 555,585 110,915 331,639 - Final CEGMA (%) 95.56(81.85) 95.97(88.71) 95.97(90.73) - Final BUSCO (%) 91(74) 86(64) 94(80) - Quality Percentage of Ns (%) 0.06322 0.0135 0.02622 - REAPR error-free bases (%) 96.46 95.35 94.98 - REAPR low-scoring regions 18 16 71 - REAPR incorrectly oriented reads 11,543 5,329 28,964 - BUSCO [29] and CEGMA [30] percentages are reported for all genes (complete and partial), while the percentage of complete genes are shown in parentheses. However, SOAPdenovo2 introduces artifacts at the contig con- ber of detected rearrangements vs. Sorex assembly would be the struction stage, which it is specifically designed to mitigate at same for all 3 Solenodon assemblies (A, B, and C). Following the later stages, and SSPACE is not aware of such artifacts [27]. For parsimony principle, an assembly that shows rearrangements this reason, the assembly produced by combination D (contig as- is also likely to contain the most assembly artifacts. Conversely, sembly with SOAPdenovo2 and scaffolding with SSPACE) was not we expected that the best of the 3 assemblies of the Solenodon reported. genome would contain the least number of reversals and trans- positions when compared to the best available closely related genome (Sorex araneus). Quality control (QC) and structural comparisons To test this hypothesis, the 3 completed assemblies of Solen- between the assemblies odon (A, B, and C) were aligned to each other and to the out- group, which was the Sorex genome (SorAra 2.0, NCBI accession We used QUAST (QUAST, RRID:SCR 001228)[28] to estimate the common metrics of assembly quality for all combinations of as- GCA 000181275.2), using Progressive Cactus [32]. Custom scripts were used to interpret binary output of the pairwise genome by sembly tools: N50 and gappedness (the percentage of Ns(Ta- genome comparisons; the resulting coverage metrics are pre- ble 1)). Fermi-assembled contigs (A and B) were overall longer and fewer in number than the SOAPdenovo2 (C and D). The as- sented in Table 2. In this comparison, all 3 Solenodon genome as- semblies had a substantial overlap and resulted in similar levels sembly completeness was also evaluated with both benchmark- ing universal single-copy orthologs (BUSCO, RRID:SCR 015008) of synteny when compared against the Sorex reference assem- bly. However, assemblies A and B had the fewest differences with [29] and core eukaryotic genes mapping approach (CEGMA, RRID:SCR 015055)[30] for completeness of conservative genes. Sorex, while assembly C had more differences vs. A, B, and Sorex. Next, syntenic blocks between each of the 3 Solenodon assemblies Fermi assemblies (A and B) showed high levels of complete- ness compared to SOAPdenovo2 (86% vs. 42%) at the contig level. (A, B, and C) were compared to the Sorex assembly. The 50-Kbp syntenic blocks were identified using the ragout-maf2 synteny However, this difference was partially mitigated at the scaffold- ing step where SOAPdenovo2 increases completeness for Fermi module of the software package Ragout [33], and the number of scaffolds that contained syntenic block rearrangements was de- assembly (A) and more than doubles it for the SOAPdenovo2 as- termined. As a result, assembly B had the lowest number of re- sembly (C). To directly evaluate the quality of all assemblies, versals and transpositions compared to the S. araneus reference we applied REAPR [31]. From the REAPR metrics presented at genome (Table 2). Based on the combined results of the evalu- the bottom of Table 1, it appears that even though the scaf- folding step increased the final N50 for the C assembly, it con- ations by REAPR [31], Progressive Cactus [32], and Ragout [33], assembly C (generated by the complete SOAPdenovo2 run) was tains significantly more regions with high probability of mis- assemblies (low-scoring regions), less error-free bases, and 3 to not included in further analysis. 6 times higher number of incorrectly oriented reads compared to the Fermi-based assemblies (A and B) (Table 1). Genome annotation and evaluation of assembly We hypothesized that aligning the 3 genome assemblies completeness to each other would allow us to detect some of these mis- assemblies. A comparison to the best, most closely related Repeats in assemblies A and B were identified and soft masked genome assembly (i.e., Sorex araneus) will reveal several rear- using RepeatMasker (RepeatMasker, RRID:SCR 012954)[34]with rangements that, in many cases, reflect real evolutionary events. the RepBase library [35]. The total percentage of all interspersed It is reasonable to assume that if all the rearrangements that are repeats masked in the genome was lower than in S. araneus detected are real and not due to the assembly artifacts, the num- (22.53% vs. 30.48%). One possible reason could be that a low Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 6 Grigorev et al. Table 2: Pairwise genomic coverage for the 3 assemblies and the Sorex araneus genome (SorAra 2.0, NCBI accession number GCA 000181275.2) obtained from the Progressive Cactus [32] alignments. vs S. paradoxus woodi vs S. araneus Pairwise genome coverage (%) Assembly A B C # Inversions # Translocations S. araneus 42.1 42.2 42.3 - - - S. p. woodi A - 99.4 98.5 35.5 87 5 B 99.3 - 99.3 35.5 34 0 C 98.4 98.5 - 35.5 81 2 While all 3 assemblies have similar amounts of syntenic coverage to the Sorex genome, assembly B contains the least number of structural rearrangements (inversions and translocations) compared to the other 2 assemblies (A and C). Values in cells at the intersection of rows and columns represent the percentage (%) of coverage between the 2 compared genome assemblies. Syntenic blocks between each of the 3 solenodon assemblies (A, B, and C) were compared to the S. araneus assembly, and 50 Kbp syntenic blocks were identified using the ragout-maf2synteny module of the software package Ragout [33]. Table 3: Repeat content of the Solenodon paradoxus genome (assem- (GRCm38.p4) were aligned to a S. paradoxus assembly with Ex- bly B), annotated by RepeatMasker [34] with the RepBase library [35]. onerate [36] with a maximum of 3 “hits” (matches) per protein. The obtained alignments were classified into the top (primary) Class Number Length (bp) Percentage (%) hit and 2 secondary hits; the coding sequence (CDS) fragments were cutfromeachsideby3bp forthe tophitsand by 9bpfor Total interspersed 461,754,432 22.53 secondary hits. These truncated fragments were clustered and repeats supplied as hints (local pieces of information about the gene in SINEs 271,839 36,271,455 1.77 the input sequence, such as a likely stretch of coding sequence) Alu/B1 6 341 <0.0001 of the potential protein-coding regions to the AUGUSTUS MIRs 264,319 35,557,190 1.73 software package (Augustus: Gene Prediction, RRID:SCR 008417) LINEs 610,079 304,823,409 14.87 LINE1 425,750 260,176,709 12.7 [37], which predicted genes in the soft-masked Solenodon LINE2 157,422 39,432,276 1.92 assembly. Proteins were extracted from the predicted genes L3/CR1 22,172 4,293,335 0.21 and aligned by HMMER (Hmmer, RRID:SCR 005305)[38]and RTE 4,122 839,744 0.04 Basic Local Alignment Search Tool for Proteins (BLAST; NCBI, LTR elements 246,305 78,108,726 3.81 RRID:SCR 004870)[39] to Pfam (Pfam, RRID:SCR 004726)[40] ERVL 61,150 24,158,692 1.18 and Swiss-Prot [41] databases, respectively. Genes supported ERVL-MaLRs 94,934 30,075,905 1,47 by hits to protein databases and hints were retained; the ERV classI 57,674 19,259,649 0.94 unsupported sequences were discarded. The annotated genes ERV classII 24,454 2,840,874 0,14 can be retrieved from Database S2. DNA elements 204,413 42,015,054 2.05 Assembly B showed a higher support compared to assembly hAT-Charlie 112,664 21,168,194 1.03 A (91.7% vs. 79.2%) for the protein-coding gene predictions by ex- TcMar-Tigger 43,950 11,141,107 0.54 trinsic evidence, even though assembly A had a larger N50 value Small RNAs 4,772 456,810 0.02 (Table 1). These values were calculated as a median fraction Satellites 46,734 20,910,815 1.02 of exons supported by alignments of proteins from reference Simple repeats 644,811 28,549,871 1.39 species to genome (Fig. 4). In other words, assembly B is more Low complexity 114,188 5,933,786 0.29 useful for gene predictions and is likely to contain better gene regions models that can be used in the downstream analysis. Therefore, Unclassified 3,051 535,788 0.03 based on 2 lines of evidence, low rearrangement counts (Table 2) and high support to gene prediction for the assembly B, it was chosen for the subsequent analyses as the most useful current coverage assembly may perform better in nonrepetitive regions. representation of the Solenodon genome. Alternatively, if the repeat content in S. paradoxus is indeed lower, this would have to be evaluated using a higher-quality as- sembly with the use of long read data. The total masked repeat Noncoding RNA genes content of the S. paradoxus genome including simple/tandem re- peats, satellite DNA, low complexity regions, and other elements For all noncoding RNA genes except for tRNA and rRNA is presented in Table 3. The repeat content can be retrieved from genes, the search was performed with INFERNAL (Infernal, Database S1. RRID:SCR 011809)[42] using the Rfam (Rfam, RRID:SCR 007891) The annotation of protein-coding genes was performed [43] BLASTN hits as seeds. The tRNA genes were predicted using using a combined approach that synthesized both homology- tRNAScan-SE (tRNAscan-SE, RRID:SCR 010835)[44], and rRNA based and de novo predictions, where de novo predictions were genes were predicted with Barrnap ((BAsic Rapid Ribosomal RNA used to fill gaps and extend homology-based predictions. Predictor), version 0.6 (Barrnap, RRID:SCR 015995)[45]). Addition- Gene annotation was performed for both assemblies (A and ally, RNA genes discovered by RepeatMasker at the earlier stages B) independently. Proteins of 4 reference species S. araneus of the analysis were used to cross-reference the findings of (SorAra 2.0, GCA 000181275.2), Erinaceus europaeus (EriEur2.0, rRNA- and tRNA-finding software. The list of the noncoding RNA GCA 000296755.1), Homo sapiens (GRCh38.p7), and Mus musculus genes can be accessed in Database S3. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 7 Figure 4: Distribution of the gene prediction support by extrinsic evidence for Solenodon assemblies A (on the left) and B (on the right). Proteins of 4 reference species S. araneus (SorAra 2.0, GCA 000181275.2), Erinaceus europaeus (EriEur2.0, GCA 000296755.1), Homo sapiens (GRCh38.p7), and Mus musculus (GRCm38.p4) were aligned to a S. paradoxus assembly with Exonerate [36] with a maximum of 3 best matches per protein. Coding sequences =were cut from each, clustered, and uploaded into the AUGUSTUS software package [37] to predict genes in the soft-masked Solenodon assembly. Proteins from the predicted genes were aligned by HMMER [38] and BLAST [39] to Pfam [40] and Swiss-Prot [41] databases. Genes supported by matches to protein databases and “hints” (see definition in main text) were retained; the rest were discarded. Substantially more transcripts have higher hint support in assembly B. The annotated genes can be retrieved from Database S2. Assembly C has not been evaluated. Table 4: The weighted coverages of the genomes in the Progressive (“Cactus”) [32]. Cactus genome alignments were used to build Cactus alignment [32], as calculated against the C. familiaris genome. a “sparse map” of the homologies between a set of input se- quences. Once this sparse map is constructed, in the form of Query genome Weighted coverage a Cactus graph, the sequences that were initially unaligned in the sparse map are also aligned [32]. Weighted coverage of Dog (Canis familiaris) (1.14) a genome by genome comparison was calculated by binning Cow (Bos taurus)1.06 an alignment into regions of different coverage and averaging Common shrew (Sorex araneus)1.05 these coverages, with lengths of bins as weights. The weighted Star-nosed mole (Condylura cristata)1.04 coverage of S. paradoxus to C. familiaris was 1.05, which indi- Hispaniolan solenodon (Solenodon paradoxus)1.05 cated that the present genome assembly is comparable in qual- The weighted coverage of a genome to itself is parenthesized as it is not a com- ity and duplication structure to other available mammalian as- parative value. semblies, which are close to each other and are close to 1.0 The weighted coverage of the S. paradoxus genome assembly from our study is (Table 4). comparable to other high-coverage mammalian genome assemblies. The clado- gram used for multiple genome alignment with Progressive Cactus is shown in Fig. S1. Detection of single-copy orthologs Single-copy orthologs (single gene copies) are essential for the evolutionary analysis since they represent a useful conserva- Multiple genome alignment, synteny, and duplication tive homologous set, unlike genes with paralogs, which are structure difficult to compare across species. The longest polypeptide To compare the Solenodon genome assembly with other mam- coded by each gene of S. paradoxus and of 3 other Eulipotyphla— malian genomes, a multiple alignment with genomes of re- Erinaceus europaeus, S. araneus,and C. cristata—were aligned lated species was performed using Progressive Cactus [32]. Cur- to profile hidden Markov models of the TreeFam database rently available genomic assemblies of cow (Bos taurus, BosTau (Tree families database, RRID:SCR 013401)[46, 47] using HM- 3.1.1, NCBI accession number DAAA00000000.2), dog (Canis famil- MER [38]. Top hits from these alignments were extracted and iaris,CanFam3.1,GCA 000002285.2), star-nosed mole (Condylura used for assignment of corresponding proteins to families. The cristata, ConCri 1.0, GCF 000260355.1), common shrew (S. ara- same procedure was performed in order to assign proteins neus, SorAra 2.0, GCA 000181275.2), and S. paradoxus woodi (as- to orthologous groups using profile hidden Markov models of sembly B from this study) were aligned together, guided by a orthologous groups of the maNOG subset from the eggNOG cladogram representing branching order in a subset of a larger database (eggNOG, RRID:SCR 002456)[48] as referenced. Or- phylogeny (Fig. S1). We evaluated the S. paradoxus coverage by thologous groups and families for which high levels of error comparing it to the weighted coverages of other genomes in the rates were observed while testing assignment of proteins to alignment to the C. familiaris genome (Table 4). Custom scripts them were discarded; the rest of the orthologous groups and were used to interpret the binary output of Progressive Cactus families were retained for further analysis. Proteins and the Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 8 Grigorev et al. Table 5: Fossil-based priors associated with mammalian evolution used for calibration of divergence times [54–57]. Node Calibration prior on clade Node min. age (Mya) Node max. age (Mya) Evidence Opossum—placental Eutheria—Metatheria 157.3 169.6 Fossil [55] mammals split Human—mouse Archonta—Glires 61.5 100.5 Biostratigraphy [55] Primates, Euarchontaglires—Laurasiatheria 61.6 100.5 Fossil [55] mouse—dog, horse, cow Dog—ferret Canidae—Arctoidea 35 45 Fossil [56, 57] Solenodon— Lipotyphla 61.6 100.5 Fossil [55] hedgehog, shrew, mole Cow—horse Artiodactyla as soft minimum 52.4 100.5 Fossil [55] The 4,416 single-copy orthologs identified in our assembly were used for phylogeny inference via 4-fold degenerate sites with programs RAxML [ 49] and PAML [50]. The resulting phylogenetic tree was plotted with FigTree [58] and is presented in Fig. 5. corresponding assignments were obtained from the maNOG Positively selected genes database for 7 other species: H. sapiens, M. musculus, B. taurus, To evaluate signatures of selection in the assembled genomes, C. familiaris, Equus caballus, Mustela putorius furo,and Monodel- we used a dataset of 4,416 orthologous groups containing single- phis domestica. Inspection of assignments across all the species copy orthologous genes of the mammalian species described yielded 4,416 orthologous groups containing single-copy orthol- earlier. Single-copy orthologs were used as a conservative set ogous genes (Database S5). necessary for comparing coding sequences that only arose 1 time in order to avoid the uncertainties associated with paralogs Species tree reconstruction and divergence time and lineage-specific gene duplications. First, we translated DNA estimation sequences into amino acids, aligned them in MUSCLE (MUSCLE, RRID:SCR 011812)[65], and then translated them back into DNA We used our genome assembly to infer phylogenetic relation- code using the original nucleotide sequences by PAL2NAL [66]. ships between S. paradoxus and other eutherian species with Genic dN/dS ratios were estimated among the 11 mammalian known genome sequences and estimated their divergence time species (including Solenodon) used in constructing the phylogeny using the new data. Based on the alignments of the single-copy represented in Fig. 5. orthologous proteins for the species included in the analysis, To estimate dN/dS ratios, we used the codeml module from a maximum likelihood tree was built using RAxML [49]with the PAML package [50]. The dN/dS ratios were calculated over the the PROTGAMMAAUTO option and the JTT fitting model tested entire length of a protein-coding gene. The branch-site model with 1,000 bootstrap replications. From the codon alignments was not included in the current analysis because of the risk of of single-copy orthologs of the 11 species, 461,539 4-fold de- reporting false positives due to sequencing and alignment errors generate sites were extracted. The divergence time estimation [67], especially on smaller datasets, and additional uncertainties was made by the MCMCtree tool from the software package could be introduced from the lack of power under synonymous PAML (PAML, RRID:SCR 014932)[50]withthe HKY+G model of substitution saturation and high variation in the GC content [68]. nucleotide substitutions and 2,200,000 generations of MCMC (of All the single-copy orthologs were plotted in the dN to dS which the first 200,000 generations were discarded as burn-in). coordinates and color-coded according to the 96 gene ontology A test for substitution saturation [51, 52] was performed using generic categories (Fig. 6). We retrieved values of dN, dS, and w DAMBE6 [53] for both all third codon positions and only 4-fold (w = dN/dS) for all single-copy orthologs and used human an- degenerated sites. In both cases the index of substitution satu- notation categories to assign all the genes with their gene on- ration was significantly lower than the threshold value for both tologies (GO) using the Python package goatools [69]and theGO symmetrical and asymmetrical trees indicating low saturation Slim generic database [69] to assign the genes to the major GO level. Therefore, saturation was not detected for any of the third categories. positions nor for the 4-fold degenerated sites. The dN/dS values for the 12 genes exhibiting positive selec- Divergence times were calibrated using fossil-based priors tion (Table 6) are visible above the line showing dN = dS. Three associated with mammalian evolution, listed in Table 5 and of these genes belong to the plasma membrane GO category based on [54–57]. FigTree [58] was used to plot the resulting tree, (GO:0005886), while cytosol (GO:0005829), mitochondrial elec- showninFig. 5. According to this analysis, S. paradoxus diverged tron transport chain (GO:0005739), cytoplasm (GO:0005737), and from other mammals 73.6 Mya (95% confidence interval of 61.4– generation of precursor metabolites (GO:0006091) were repre- 88.2 Mya). This is in accordance with earlier estimates based on sented by 1 gene each. Five of the genes exhibiting positive selec- nuclear and mitochondrial sequences (e.g., [3, 12]) as reviewed tion signatures could not be assigned to GO categories. Some of by Springer et al. [59]. This date is also much older than the these are also associated with the plasma membranes (TMEM56, timeframe of molecular estimates of divergence times between SMIM3), and 1 gene (CCRNL4) encodes a protein highly similar to most island taxa and their closest mainland relatives [60]. Our the nocturnin, a gene identified as a circadian clock regulator in data support solenodons forming a sister group to other eulipo- Xenopus laevis [70]. The full list of genes, GO annotations, and typhlans, i.e., hedgehogs, shrews, and moles [61–64], with a di- associated dN/dS values are included in Database S6. vergence date as old as splits between some pairs of mammalian Traditionally, one of the most commonly used signatures of orders, such as between rodents and primates or between car- selection is the ratio of nonsynonymous (dN) to synonymous nivores and artiodactyls (Fig. 5). Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 9 Figure 5: Phylogenetic relationships of Solenodon paradoxus and other mammals from whole-genome data. (A) Maximum likelihood phylogeny showing branch lengths. The tree was built using RAxML [49] with the PROTGAMMAAUTO option and the JTT fitting model tested with 1,000 bootstrap replicates. ( B) Divergence time estimates based on 461,539 4-fold degenerate sites from the codon alignments of single-copy orthologs and using fossil-based priors (Table 5). The divergence time estimation was made by the MCMCtree tool from the software package PAML [50] with the HKY+G model of nucleotide substitutions and 2,200,000 generations of MCMC (of which the first 200,000 generations were discarded as burn-in). The 95% confidence intervals are given in square brackets and depicted as semitransparent bo xes around the nodes. The inferred divergence time of S. paradoxus from other mammals is 73.6 Mya (95% confidence interval of 61.4–88.2 Mya). (dS) substitutions, dN/dS [71]. The synonymous rate (dS) ex- selection is likely to have accelerated the rate of fixation of presses the rate of unconstrained, neutral evolution, so that nonsynonymous substitutions. It is possible to quantify the pro- when dN/dS <1, the usual interpretation is that negative se- portion of nonsynonymous substitutions that are slightly dele- lection has taken place on nonsynonymous substitutions. Oth- terious from the differences in dN/dS between rare and common erwise, when dN/dS >1, the interpretation is that the positive alleles [72,73]. In our comparison, a subset of single-copy or- Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 10 Grigorev et al. positively selected genes is relatively short compared to num- bers of positively selected genes reported in other studies (e.g., human-to-chimpanzee comparison yields several hundreds of human-specific genes under selection [ 75–77]. This observation could be a consequence of the averaging effect of a large com- parison group that included mammals very distantly related to solenodons. The dN/dS ratios can also be used as a proxy to illustrate the rate of evolution for proteins. By looking at the trends in fast evolved genes (dN/dS >0.25), we can make inferences about the factors that shaped the genome of this species during the mil- lions of years of island isolation. To summarize the functional contributions, we used the PANTHER Overrepresentation Test and GO Ontology database based on the H. sapiens (Supplemen- tary Table S1) and M. musculus (Supplementary Table S2) genes [78]. Interestingly, genes involved in the inflammatory response and located on cell surfaces were among those overrepresented among the rapidly evolving genes in the Solenodon genome compared to either the human or mouse databases (Supplemen- tary Tables S1 and S2). Venom gene identification Since solenodon is one of very few venomous eutherian mam- Figure 6: The dN/dS ratios for 4,416 single-copy orthologous genes. The dN and mals, of special interest in the solenodon genome were the puta- dS ratios were calculated with the codeml module from the PAML package [50] tive venom genes. While there was no saliva sample in our pos- and calculated over the entire length of a protein coding gene. Values are color- session that could be analyzed for the expressed toxin genes, coded by GO term aggregated by the GO Slim generic database [69,106]; the color a comparative genome approach could be applied as an indi- code legend is presented in Fig. S2. The solid black line represents dN = dS; dots above it represent genes showing signatures of positive selection. The figure is rect way to find venom genes orthologous to genes expressed truncated at dN = 1 and dS = 2, so larger values are not shown on the graph, but in venom for other species. First, we identified 6,534 toxin and all ω, dN, and dS values are available in Database S6. venom protein representatives (Tox-Prot) from Uniprot (UniProt, RRID:SCR 002380)[79] and queried them with BLAST against thologs dN/dS compared to the 10 mammalian species (Fig. 5)is the current S. paradoxus genome assembly. The hit scaffolds estimated to be ∼0.18 or 18%, on average, compared to ∼0.25 re- were then extracted from the AUGUSTUS CDS prediction file. ported for the human–chimp and ∼0.13 reported for the mouse– The same Tox-Prot sequences were used for Exonerate with rat comparisons [74]. In other words, it suggests that up to 82% the protein-to-genome model. The hits were used as queries of all amino acid replacements in S. paradoxus are removed by against the NCBI database to ensure gene identity, further ex- purifying selection [74]. amined through phylogenetic analyses with select model mam- Note that purifying selection is the conservative force in malian and venom reptile genes (also adding randomly se- molecular evolution, whereas positive selection is the diversi- lected sequences for each gene to reduce clade bias). The fying force that drives molecular adaptation. Overall, the list of retrieved sequences were aligned with MUSCLE [65], followed Table 6: The putative targets of positive selection in the solenodon genome. GO category Solenodon gene dS dN dN/dS description Human ortholog ENOG410UG5H 0.000003 0.002563 ≥999 Plasma membrane KLF9 ENOG410USMX 0.000011 0.010830 ≥999 Plasma membrane TNFSF13B ENOG410UWRE 0.000015 0.014790 ≥999 - SMIM3 ENOG410UNED 0.000174 0.030411 174.84 - CCRN4L ENOG410UJP8 0.013214 0.120449 9.12 Cytosol PLK4 ENOG410UWA9 0.020955 0.104972 5.01 Mitochondrion NDUFC1 ENOG410V3Q6 0.047538 0.071112 1.50 Plasma membrane SYT16 ENOG410UQAM 0.078543 0.096445 1.23 - WBP2NL ENOG410UKXY 0.168982 0.185535 1.10 - TIGIT ENOG410UKXJ 0.134581 0.146926 1.09 Cytoplasm LRRC66 ENOG410UIAB 0.060622 0.065402 1.08 - TMEM56 ENOG410UG23 0.176172 0.177344 1.01 Generation of THTPA precursor metabolites and energy The dN/dS values and the GO categories for the 12 genes that showed signatures of positive selection in the Solenodon paradoxus woodi genome (dN>dS). All other genes are reported in Database S6. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 11 Table 7: Homologous matches for the most relevant protein venom were not filtered by repetitive regions or mappability mask, and classes in the Solenodon paradoxus genome. the number of possible SNV sites was defined as the genome assembly size minus the umber of unknown base pairs (‘N’). Protein groups found in animal Number of matches in Based on the variation data from the genomes of venoms the S. paradoxus genome 2 subspecies (S. p. woodi and S. p .paradoxus), we estimated population dynamics using pairwise sequentially Markovian Metalloproteinase; serine protease 8 each coalescent (PSMC) model [87]. PSMC uses a coalescent approach Hyaluronidase 6 to estimate changes in population size that allowed us to create (Acetyl) cholinesterase 2 a (The Most Recent Common Ancestor) TMRCA distribution Calglandulin; nerve growth factors 4 each Lipase 3 across the genome and estimate the effective population size Hydrolase; Kunitz serine protease 1 each (Ne) in recent evolutionary history (e.g., from 10,000 to 1 million inhibitor; nucleotidase; years). O-methyltransferase; oxidase; Demographic history was inferred separately for S. p. woodi peptidase; phosphodiesterase; and S. p. paradoxus, and the resulting plots revealed differences phospholipase; vascular in demographic histories of the 2 subspecies (Fig. 9). Each south- endothelial growth factor ern individual was considered separately and their demographic histories overlapped. The difference in demographic history pro- Genes were identified by querying 6,534 toxin and venom protein representatives vides another argument in favor of a subspecies split, as evi- found in animal venoms in Tox-Prot from Uniprot [79]. All of the protein groups denced by distinctly different effective population sizes at least are present in snake venoms. The sequences of the putative venom genes from S. paradoxus are available in the Database S7. since 300 Kya. According to this analysis, the northern solen- odon subspecies currently has a much larger Ne, which has ex- panded relatively recently, between 10,000 and 11,000 years ago by a maximum likelihood (WAG+I+G) phylogenetic reconstruc- (Fig. 9). Prior to that, it was the southern subspecies (S. p. woodi) tion. Hits were matched against their respective references in an that had a larger Ne. At the same time, the demographic history alignment and visually inspected. inferred for both populations showed similar cyclical patterns of As a result, we identified 44 gene hits of the 16 most rele- expansion and contraction around the mean of 6,000 “effective” vant protein venom classes (all present in snakes) in the S. para- individuals for the southern subspecies (S. p. woodi) and 3,000 doxus genome (Table 7). Inspection of pairwise MUSCLE align- for the northern subspecies (S. p. paradoxus). One unusual result ments of the putative Solenodon venom genes (Database S7) of this analysis is that the northern subspecies shows a much with their animal homologs revealed several interesting cues. lower Ne for all but the most recent time period. The putative venom genes could not be confirmed through ge- nomic information alone, yet they cannot be discarded given that they were matched to high homology regions of closely re- Development of tools to study population and lated genes, such as those originally recruited into venom. There conservation genetics of S. paradoxus were also unusual insertions not found in other species’ venom The presence of genome-wide sequences of multiple individ- genes. Specifically, an insertion in a serine protease, a gene uals from 2 subspecies created a possibility for the develop- with a role in coagulation (namely, coagulation factor X), is not ment of practical tools for conservation genetics of this endan- present in known homologs. The insertion seems to be located gered species. Generally, microsatellite loci are both abundant at the start of the second exon. This particular gene was fur- and widely distributed throughout the genome, while usable loci ther analyzed to understand the insertion and its potential func- are characterized by a unique flanking DNA sequence so that a tional consequences (Fig. 7). Finally, none of the known venom single locus can be independently amplified in many individu- genes from the closest related venomous insectivore (Blarina bre- als [88–90]. The major advantages of microsatellite markers are vicauda) were found in our study. Our results indicate that a more well known: codominant transmission, high levels of polymor- detailed study of Solenodon venom genes using a transcriptome phisms leading to the high information content, high mutation obtained from a fresh saliva sample is needed to address their rates that allow differentiation between individuals or popula- molecular evolution and function. tions within a species, and ease of genotyping. While a genome obtained from one individual can be searched for potentially Genomic variation and demographic history inference variable microsatellite loci, this would (1) miss the majority of Once the reference alignment was assembled as a consensus be- variable loci not represented in the individual’s 2 chromosomes tween the sequences obtained from the 5 S. p. woodi individu- and (2) result in many positives that may be monomorphic fol- als, polymorphisms were identified in the 6 individual genomes lowing laboratory tests (usually by electrophoresis of the ampli- by aligning them to the combined reference. Single-nucleotide fied fragments from population samples). The availability of sev- and short variants and indels were identified in 5 southern and eral genomes can allow generation of a more comprehensive set 1 northern individual using Bowtie2 (Bowtie, RRID:SCR 005476) of variable markers, while reducing false positives. [80], SAMtools and Bcftools (SAMTOOLS, RRID:SCR 002105)[81], All 3 assemblies from this study (A, B, and C) were indepen- and VCFtools (VCFtools, RRID:SCR 001235)[82]. The S. p. woodi dently analyzed using a short tandem repeat (STR) detection individuals differed from the reference by an average of 1.25 mil- pipeline. A, B, and C assemblies were analyzed separately with lion polymorphisms, and the S. p. paradoxus individual differed Tandem Repeats Finder to locate and display tandem repeats by 2.65 million from the reference assembly. [91]. Each of the 6 individual samples from the 2 solenodon sub- Whole solenodon genome single nucleotide variation (SNV) species (5 from S. p. woodi and 1 from S. p. paradoxus) were aligned rates, defined as a ratio of all observed SNVs to all possible SNV to the reference assemblies A, B, and C by Burrow-Wheelers sites in the genome, were calculated and found to be compara- Aligner [81]. Each set of individual alignments was analyzed with tively low relative to other mammals (Fig. 8)[83–86]. To enable HipSTR [92]. Only loci that shared more than 20 reads in the sam- this comparison, the same calculations were used, where SNVs ple alignments were considered for further steps in the search Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 12 Grigorev et al. Figure 7: (A) Predicted coagulation factor X (F10) gene structure arrangement from the structure of known homologs (due to the scaffolding, the total gene length is unknown in solenodon). The 21 codon insertion is highlighted in red on exon 2 of the solenodon F10 gene. Exons are represented as black boxes and introns as lines connecting exons. (B) F10 protein sequence alignment showing an unusual insertion in the Solenodon paradoxus genome absent in all other mammalian and reptilian genes retrieved from the Tox-Prot from Uniprot [79]. The insertion of 21 amino acids is indicated with a red-boxed line in the alignment. (C) Reconstructed mammalian F10 phylogenetic maximum likelihood tree using the model GTR+I+, 1,000 bootstrap replicates (1,590 bp long alignment). The numbers set indicate approximate likelihood-ratio branch test, Bayesian-like modification of the aLRT, and bootstrap percentage, respectively. for variable microsatellite loci. The result of this search was to HipSTR [92], genotypes with a flank indel in more than 15% of saved in a variant call format file that includes annotations of reads, and genotypes with more than 15% of reads with detected all loci that had variation between samples and passed the min- polymerase chain reaction (PCR) stutter artifacts were discarded. imum qualification of the reads parameter: to be successfully The final set containing loci that have at least 2 allele calls in identified in silico in the data from at least 1 individual. The loci 2 different individuals after filtering has been deposited in that did not pass these criteria were labeled as unsuccessfully ver- the polymorphic microsatellite database (Database S8). This ified and excluded from the list. database contains a list of variable microsatellites discovered, The remaining loci were subjected to additional filtering; all a total of 1,200 bp flanking sequences for primer construction, genotypes that had less than 90% posterior probability according and the information on whether and where it was found to be Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 13 Figure 8: Low genome heterozygosity in Solenodon paradoxus woodi compared to other mammalian taxa. The SNV rate in the S. p. woodi genome is shown relative to other mammal genomes as an estimate of genome diversity (h). The value for each sequenced individual was estimated using all variant positions, with repetitive regions not filtered. The SNVs are deposited in Database S9. variable—between subspecies or within 1 of the subspecies. We also providing a window into genetic underpinnings of adap- also report the type (di-, tri-, etc.), number of repeats, number tive features, including genes responsible for inflammation and of variants, and % variable and provide up to 600 bp flanking venom and how these may reflect its adaptation. In addition, we sequences on each side that can be used to develop primer se- developed tools that will help guide future genome studies as quences (Database S8). well as conservation surveys of the remaining solenodon popu- lations on the island of Hispaniola. In this study, we have made the first step into the whole-genome analysis of the Solenodon.A more complete genome sequence may provide a better picture Discussion of its evolutionary history, possible signatures of selection, and In this study, we sequenced and assembled the genome of an clues about the genetic basis of adaptive phenotypic features fa- endangered Antillean mammal that survived tens of millions cilitating life on Caribbean islands and contribute to a better in- of years of island isolation but nevertheless is currently threat- sight into island evolution and possible responses to current and ened with extinction due to anthropogenic activities. Our ap- future climate change. proach demonstrated sequencing, assembly, and annotation of a genome of a highly divergent lineage within the placental mam- mal tree, delivering an important phylogenetically diverse mam- malian genome for analysis in a comparative context [93]. Al- though the full description of genome diversity of this rare enig- The string graph assembly approach for homozygous matic mammal needs to be further improved with more samples genomes and analyses, our initial assembly of the solenodon genome con- tributes information and tools for future studies of evolution and The advantages of the string graph assemblies in our particular conservation. Future studies can combine the current genome case can be understood by looking at the nature of the under- annotations with the inclusion of additional genetic and eco- lying algorithms. The de Bruijn graph is a mathematical con- logical data from further sampling. cept that simplifies genome assembly by reducing information With the new genome-wide assembly, we inferred a phy- from short next-generation sequencing reads, of which there logeny that validates previous estimates of the time of diver- can be billions, to an optimized computational problem that gence of Solenodon from other eulipotyphlan insectivores [3, 12], can be solved efficiently [ 94]. However, some information may Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 14 Grigorev et al. Figure 9: Demographic history inference for the southern S. p. woodi (red) and the northern S. p. paradoxus (blue) subspecies using the pairwise sequentially Markovian coalescent model [87]. indeed be lost, as the set of reads is effectively replaced with Potential Implications a set of much shorter k-mers to produce an optimal assembly Comparative genomics path. Usually, this is compensated by overwhelming amounts of data in high coverage assemblies, and the difference in ef- We have taken advantage of the fact that the genome of this fectiveness between this and other types of algorithms, bar- mammal shows reduced heterozygosity [12], which made it fea- ring speed, becomes less evident. While sequencing becomes sible to combine samples of multiple individuals in order to pro- cheaper, genome projects continue to rely on the increased high- vide higher coverage and achieve a better assembly using Illu- quality coverage, increasing the cost of the sequence data rather mina reads. The current assembly was performed without the than trying to increase the efficacy of the assembly itself. In use of mate pair libraries and without high-quality DNA, nev- contrast, the string graph–based algorithms for genome assem- ertheless, it is comparable in quality to other available mam- bly are intrinsically less erroneous than de Bruijn graph–based malian assemblies. In terms of contig N50 as a measure of con- ones, since building and resolving a string graph does not re- tiguity, our assembly resulted in contig N50 of 54,944, while the quire breaking reads into k-mers and therefore does not sacrifice most closely related available genome sequences of Sorex ara- long-range information [19]. This also helps reduce the probabil- neus (SorAra2.0) assembly features a contig N50 of 22,623, and ity of misassemblies; in theory, any path in a string graph repre- the Condylura cristata (ConCri1.0) assembly has contig N50 of sents a valid assembly [95, 96]. String graph–based approaches 46,163. It should be noted that scaffold N50 values are not to be have already been applied successfully to assemblies from high compared as this study used only paired-end reads, as opposed coverage read sets; and one example is the Assemblathon 2 [97]. to S. araneus and C. cristata. More importantly, the assembly pro- In projects with lower genome coverage like ours, adoption of a vided complete or partial annotation for more than 95% of the string graph–based approach might be of benefit to the genome genes based on the evolutionarily informed expectations of gene assembly because it uses more information from the sequences. content from near-universal single-copy orthologs selected from However, there are 2 major downsides for its widespread use: OrthoDB v9 by BUSCO [29]. Among these, 4416 single-copy genes (1) it is more computationally intensive than methods utilizing that have clear 1-to-1 orthologs across species (single-copy or- de Bruijn graph algorithms and (2) the implementation of the thologs) [98, 99] were chosen for a subsequent comparative anal- string graph model is sensitive to sequence variation, and the ysis involving genes in different mammalian species. effectiveness of this approach may depend on the level of het- Specifically, the repetitive composition of the solenodon erozygosity in a DNA sample. It is worth noting that [19]was genome was evaluated. Compared to the estimates based on the primarily intended for variant annotation via de novo local as- reference human genome [100], very conspicuous is the lower sembly, and not for whole genome assembly. Nevertheless, the numbers of SINEs (no Alu elements) and a substantially lower new genome-wide data produced by our pipeline were sufficient number of LINEs as well. Transpositions and translocations be- for the comparative analysis and have been annotated for the tween the genomes of S. paradoxus and S. araneus were identi- genes and repetitive elements and interrogated for phylogeny, fied; very few rearrangements and translocations between the demographic history, and signatures of selection. In addition, assembly and the S. araneus genome were found. At the same using the current genome assembly, we were able to annotate time a higher coverage would be needed to do more detailed large transpositions and translocations in the Solenodon in re- analyses, for instance, to address the relative length and simi- lation to the closest available high-quality genome assembly larity of indels and copy number polymorphisms between solen- (S. araneus). odon populations [101]. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 15 Evolutionary genomics tion could allow a more successful interaction with molecules capable of activating the F10 protein. In mice, venom extracted Using the nuclear genomes, we were able to confirm earlier di- from solenodons and venom prothrombin activator injections vergence time estimates based on sets of genes [3], as well as can both be lethal in minutes [7, 102]. The insertion was also full mitochondrial sequences [12]. The whole genome analysis searched against possible mobile DNA elements, but no matches points to a split between Solenodon and other eulypotiphlans that were found. Our results should be followed in the future by de- occurred around 74 Mya (Fig. 5), which is very close to our earlier tailed pharmacological studies. estimate of 78 Mya based on the full mitochondrial genome [12]. Our result does not support the 60 Mya point estimate made by Conservation genetics a phylogenetic analysis based on sequences of 5 slowly evolving nuclear genes [13]. The low variation that exists between the solenodon sequences Our assembly provided enough gene sequences to gain in- is hardly surprising, because the theoretical consensus in con- sights into the evolution of functional elements in the solenodon servation genetics predicts that small populations lose genetic genome. It is reasonable to suggest that this species historically diversity more rapidly than large populations [103] and mea- had low effective population sizes, if they remained close to sures of genetic diversity have been explicitly suggested to IUCN those estimated by this study, or about 4,000 on average (Fig. 9). as a factor to consider in identifying species of conservation con- Among the 4,416 single-copy orthologs analyzed for dN/dS ra- cern [104]. The historical N for each subspecies was examined tios over the entire length of a protein-coding gene between S. by our analysis (Fig. 9) and showed lower levels recently in S. p. paradoxus and 10 other mammals, 12 genes were identified as woodi. Due to the limitations of PSMC, the most recent Ne cannot positively selected. Among these, the majority were membrane be calculated from the genome sequences [87]. Therefore, this proteins, with 1 gene (CCRNL4) similar to a circadian clock reg- estimate of diversity does not reflect the recent impact on the ulator (Table 6). It is possible that the short list of the positively solenodon population caused by anthropogenic factors in the selected genes could be a consequence of the large comparison last 10,000 years (Fig. 9). group that included mammals very distantly related to solen- Many endangered species with small populations also have odon, and its genes need to be compared with more closely re- reduced heterozygosity across their genomes and would ben- lated species, e.g., once the genome of S. cubanus is reported and efit from a computational approach that reduces the cost and better gene annotations for Sorex araneus become available. optimizes the amount of data for the genome assembly. The Solenodon is one of few mammals that use venomous saliva real-life scenarios where no high-quality DNA can be produced to disable prey. It delivers its venom similarly to snakes—using because of the remoteness of sampling location, when trans- its teeth to inject venomous saliva into its target. Different ap- portation and storage are difficult, or when the high coverage proaches could be used to characterize venom genes, such as cannot be produced due to the limited funds are well known to the use of noncurated databases to widen the search spectrum, many, especially in the field of conservation genetics. The diffi- which may include different molecules that could be found in cult field conditions and international regulations make it diffi- Solenodon. For example, 6,534 toxin and venom protein represen- cult to obtain samples with high-molecular-weight DNA. To aid tatives can be found in the UniProt database. It is also impor- future conservation studies, we have mined the current dataset tant to note that the database of venom gene sequences may for microsatellite markers that are useful within and between not include those relevant to solenodons given their deep di- subspecies, to be used as tools for studies on population diver- vergence from any other venomous mammalian species. The sity, censoring, and monitoring. venom of Solenodon may contain novel protein modifications The comparative analysis of the number and the length of with unknown potential or application, making it valuable for microsatellite alleles pointed once more to the advantage of as- future detailed characterization. sembly B over A and C. The average length of microsatellite short Genes associated with venom, such as serine proteases in- tandem repeats in assembly B was the highest: 20.95 (assembly volved in coagulation (namely, the coagulation factor X), are of A) vs. 21.14 (assembly B) vs. 18.86 (assembly C). This may be a major interest since factor X in solenodon exhibited unusual in- direct consequence of the high number of microsatellite alle- sertions when compared to its homologs (Fig. 7). The detection les that were successfully genotyped in all of the southern sam- of an unusual insertion in a serine protease has been previously ples for assembly B (2,660), as well as microsatellites that proved found in another venomous mammalian species, the shrew Bla- variable between the 2 subspecies but fixed within the southern rina brevicauda, but in a different gene than in solenodon. The samples (639). The low number of variable microsatellites be- coagulation factor X is involved in the circulatory system and is tween the 2 subspecies was likely due to the reduced amount responsible for activating thrombin and inducing clotting. The of information obtainable from a single low coverage genome insertion in the coagulation factor X gene seems to be a hy- of the northern subspecies (S. p. paradoxus)usedinthisstudy. drophilic alpha helix with 3 potential protein–protein interac- Venn diagrams showing overlap in microsatellite variation in 3 tion sites. It occurs at the end of the region annotated as the assemblies are presented in Fig. 11. signal peptide, while having a signal peptide cleavage site itself Recently, a genetic survey using mitochondrial cytochrome b at the beginning of its sequence. The factor X protein structure and control region sequences from 34 solenodon samples iden- was successfully modeled by Swiss-Model based on the ven- tified distinct haplotypes in northern and southern Hispaniola omous elapid snake Pseudonaja textilis (pdb:4bxs)tohaveaheavy [16], along with a distinctive third group, a small remnant pop- chain that contains the serine protease activity, which was mod- ulation at the Massif de la Hotte in the extreme western tip of eled with a high degree of confidence (Fig. 10). The venom Haiti [16, 105] not sampled for this study. The north–south sub- prothrombin activator has an advantage as a toxin in part due species subdivision within S. paradoxus was further supported to modifications in inhibition sites, making it difficult to stop by mitogenomic sequences [12]. The island of Hispaniola has its activity. Another advantage is that the molecules are always been divided into 3 main biogeographic regions that differ in cli- found in an active form (Kinin). We hypothesize that the inser- mate and habitat. The north and center of the island provide the Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 16 Grigorev et al. Figure 10: (A) Simplified version of the coagulation cascade, indicating key steps involving the coagulation factor X (F10). (B) Protein modeling of solenodon s equence data using SWISS-MODEL. The target model (4bxs) used was the F10-like protease of the venomous elapid snake Pseudonaja textilis. Due to its location, the insertion cannot be represented in the model (its location is indicated according to the PDB annotation). Colors indicate model quality, with red being low-quality and blue high-quality modeling. Colors also separate F10’s light chain (EGF-like domain) in red from the heavy chain (serine protease domain) in blue (the half circle line in black separates both domains). (C) Amino acid sequence properties calculated for the solenodon F10 translated gene, with focus on the insertion region 23–43. One signal peptide cleavage site was detected between positions 25 and 26. Predicted protein interaction sites at position 26, 29–30 and 32–40. Hydropathy analysis showed a relatively hydrophilic structure for the insertion. largest area with known solenodon populations and shows no Methods discontinuity with the southeast. However, the solenodon pop- Provenance of the samples is shown on the map (Fig. 2), with ulations in the southwestern part of the island are currently geo- coordinates listed in Supplementary Table S4. Solenodons were graphically isolated by Cordillera Central and may have been iso- caught with help of local guides (Nicolas ´ Corona and Yimell lated in the past by the ancient marine divide across the Neiba Corona). During the day, potential locations were inspected in Valley (Fig. 2). This geographic isolation is likely the reason why daylight for animal tracks, burrows, droppings, and other signs the S. p. paradoxus in the larger northern area and S. p. woodi of solenodon activity. At dawn, ambushes were set up in the in the southwest show morphological differences suggestive of forested areas along the potential animal trails. The approach- separate subspecies [15]. Future conservation strategies directed ing solenodons were identified by sound and chased with flash- at protecting and restoring solenodon populations on Hispaniola lights when approached. Since solenodons move slowly, animals should take into consideration this subdivision and treat the 2 were picked up by their tails, which is the only way to avoid po- subspecies as 2 separate conservation units. tentially venomous bites. All wild-caught animals were released Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 17 Figure 11: Numbers of variable microsatellite alleles discovered in S. paradoxus assemblies. The diagrams were built independently for Fermi-based assemblies (A) and (B) and 1 SOAPdenovo2-based assembly (C). The red circle indicates microsatellites that were successfully genotyped in all samples with at least 1 alternative allele in the southern subspecies (S. p. woodi). The blue circle indicates microsatellites that were successfully genotyped in all samples with at least 1 alternative allele in the northern subspecies (S. p. paradoxus). The overlap indicates microsatellite loci with at least 1 alternative variant found in both subspecies. All alleles discovered, number of fixed alleles in each population, and number of unique alleles in each population are presented in Table S3. All the candidate microsatellite loci discovered in this study, along with their 5’ and 3’ flanking regions, are listed in the Database S8. back into their habitats within 10 minutes after their capture. http://public.dobzhanskycenter.ru/solenodon/repeats/solpar- Before the release, the animals’ tails were marked with a Sharpie a.txt pen to avoid recapturing. http://public.dobzhanskycenter.ru/solenodon/repeats/solpar- Blood was drawn by a licensed ZooDom veterinarian (Adrell b.txt Nunez) ˜ from the vena jugularis using a 3-mL syringe with a 23G x Database S2: List of protein coding genes in the solenodon 1-inch needle. The blood volume collected never exceeded 1% genome (assembly B) of body weight of animals. Before the draw, an aseptic tech- http://public.dobzhanskycenter.ru/solenodon/genes/solpar-b.gff nique was applied using a povidone–iodine solution, followed also cds for each gene and translated sequences by isopropyl alcohol. Once collected, the samples were trans- Database S3: List of the annotated non-coding RNAs ferred to a collection tube with anticoagulant (BD Microtainer, in the solenodon genome http://public.dobzhanskycenter. 1.0 mg K2EDTA for 250–500l L volume). Collection tubes were ru/solenodon/rna refrigerated and transported to the lab at the Instituto Tec- Database S5: List of single-copy orthologs in the solenodon nolog ´ ico de Santo Domingo where DNA was extracted from sam- genome (columns include: ENOG id, gene name) http://public. ples using the DNeasy Blood and Tissue kit (Qiagen, Hilden, Ger- dobzhanskycenter.ru/solenodon/monoorthologs.txt many). Database S6: List of genes with dN/dS values and GO anno- This study was reviewed and approved by the Institutional tations Animal Care and Use Committee of the University of Puerto http://public.dobzhanskycenter.ru/solenodon/selection.xls Rico at Mayaguez. All the required collection and permits had Database S7: List of venom genes http://public. been obtained before any field work was started. The samples dobzhanskycenter.ru/solenodon/venom genes HitGeneDB.fasta were collected and exported in compliance with export permit Database S8: Microsatellite loci discovered in genomes of VAPB-00909 (Dominican Republic Environment and Natural Re- two solenodon subspecies Solenodon paradoxus paradoxus (north- sources Ministry Viceministry of Protected Areas and Biodiver- ern) and S. p. woodi (southern), alleles, 600bp flanking re- sity Department of Biodiversity) and imported in compliance gions (a total of 1,200 bp per locus), and frequency informa- with CITES/ESA import permit 14US84465A/9 (University of Illi- tion for the two subspecies http://public.dobzhanskycenter.ru/ nois Board of Trustees). solenodon/STRs.xlsx Database S9: Lists of single nucleotide differences from the assembled individual genome of Spa-1 (from Solenodon paradoxus Sequencing paradoxus) and Spa K,—L,—M, -N, and –O (from the five S. p. Sequences for S. p. woodi were generated by Illumina HiSeq 2000 woodi)) used to show estimates of heterozygosity in Fig. 8 (see (Illumina Inc) with 100 bp paired-end reads. The Illumina HiSeq explanation in text) generated raw images utilizing HiSeq Control Software (HCS) http://public.dobzhanskycenter.ru/solenodon/variants v2.2.38 for system control and base calling through an inte- Supporting raw data is in the NCBI Sequence Read Archive grated primary analysis software called RTA (Real Time Analy- [ENA: NKTL01000000, PROJECT: PRJNA368679] and genome as- sis. v1.18.61.0). The base call binaries were converted into FASTQ semblies, custom codes and annotations are in the GigaScience utilizing the Illumina package bcl2fastq (v1.8.4). Sequences for S. GigaDB database [22]. p. paradoxus were generated by the Illumina MiSeq V3 (Illumina Inc.) at the Roy J. Carver Biotechnology Center, University of Illi- nois. The sequencing data for each sample used in this study are Additional file presented in Supplementary Table S5. Figure S1: The phylogenetic tree used for multiple genome alignment with Progressive Cactus (Paten et al. 2011). The taxa Availability of supporting data have been chosen based on their availability and the quality of Database S1: Lists of repeats in the solenodon genome (assem- genome assembly, not to make inferences about mammalian blies A and B) phylogeny. This cladogram only shows tree topology, and the Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 18 Grigorev et al. branches do not represent evolutionary time, and do not assume References the basal position of Solenodon paradoxus. 1. MacPhee RDE, Flemming C, Lunde DP. “ Last occurrence” Figure S2: The colors assigned to GO terms represented in of the Antillean insectivoran Nesophontes: new radiomet- Fig. 4. ric dates and their interpretation. American Museum novi- Table S1: Classification of the fast-evolving genes tates; no. 3261. New York, NY: American Museum of Natural (dN/dS>0.25) in solenodon genome using PANTHER Overrep- History; 1999. resentation Test (release 20160715) and GO Ontology database 2. Ottenwalder JA. Systematics and biogeography of the West (Released 2017-02-28) based on the Homo sapiens genes. Genes Indian genus Solenodon. Biogeogr. West Indies Patterns are only represented if P < 0.05 after the Bonferroni correction Perspect. Second Ed. CRC Press; 2001, 253–329. for multiple testing (Treangen and Salzberg, 2012). 3. Roca AL, Bar-Gal GK, Eizirik E et al. Mesozoic origin for Table S2: Classification of the fast-evolving genes (w >0.25) West Indian insectivores. Nature, Nature Publishing Group; in solenodon genome using PANTHER Overrepresentation Test 2004;429(6992):649–51. (release 20160715) and GO Ontology database (Released 2017-02- 4. Verill AH. Notes on the habits and external characters of the 28) based on the Mus musculus genes. Genes are only represented solenodon of San Domingo (Solenodon paradoxus). American if P < 0.05 after the Bonferroni correction for multiple testing Journal of Science 1907;XXIV(139):55–57. (Treangen and Salzberg, 2012). 5. Allen JA. Notes on Solenodon paradoxus Brandt. Bull Am Mu- Table S3: Microsatellite alleles discovered in genomes of two seum Nat Hist 1908;XXIV:505–5017. solenodon subspecies Solenodon paradoxus paradoxus (northern) 6. Brandt JF. De Solenodonte: novo mammalium insec- and S. p. woodi (southern). tivororum genere. Mem. l’Academie ´ Imperiale ´ des Sci. Table S4: Locations for the samples used in this study as de- St. Petersbg. ´ l’Academie ´ Imperiale ´ des Sciences de St. scribed in Fig. 2. Adopted from Brandt et al., (2016). Petersbour ´ g; 1833;2:459–78. Table S5: Sequencing data for the samples used in this 7. Derbridge JJ, Posthumus EE, Chen HL et al. Solenodon study. paradoxus (Soricomorpha: Solenodontidae). Mammalian Species 2015;47(927):100–6. Abbreviations 8. Feldhamer GA, Drickamer LC, Vessey SH et al. Mammalogy: Adaptation, Diversity, and Ecology. 4th ed. Johns Hopkins BLAST: Basic Local Alignment Search Tool for Proteins; BUSCO: University Press, Baltimore, Maryland. 2015, p. 768 benchmarking universal single-copy orthologs; CDS: coding se- 9. Wible JR. On the cranial osteology of the Hispaniolan quence; GO: gene ontology; IUCN: International Union for Con- solenodon, Solenodon paradoxus Brandt, 1833 (Mammalia, servation of Nature; Kya: thousand (kilo) years ago; Mya: million Lipotyphla, Solenodontidae). Annals of Carnegie Museum years ago; NCBI: National Center for Biotechnology Information; 2008;77(3):321–402. PSMC: pairwise sequentially Markovian coalescent; SNV: single 10. Folinsbee KE, Muller ¨ J, Reisz RR. Canine grooves: morphol- nucleotide variation; STR: short tandem repeat; UCN: Interna- ogy, function, and relevance to venom. Journal of Vertebrate tional Union for Conservation of Nature. Paleontology 2007;27(2):547–51. 11. Dufton MJ. Venomous mammals. Pharmacology & Thera- peutics. Elsevier; 1992;53(2):199–215. Competing interests 12. Brandt AL, Grigorev K, Afanador-Hernandez ´ YM et al. The authors declare that they have no competing interests. Mitogenomic sequences support a North-South subspecies subdivision within Solenodon paradoxus. Mitochondrial DNA Part A [Internet]. Taylor & Francis; 2017;28(5):662– Author contributions 70. Available from: https://www.ncbi.nlm.nih.gov/ KG, TKO, JCMC and ALR conceived and designed the study. KG, pubmed/27159724. SK, PD, AK, WW, FS, and TKO did the analysis. YMAH, ALB, LAP, 13. Sato JJ, Ohdachi SD, Echenique-Diaz LM et al. Molecular RC, AN, JRB, JCMC and TKO collected, processed, and contributed phylogenetic analysis of nuclear genes suggests a Ceno- samples in the field and the laboratory, KG, AJM, KG, AA, ALR, zoic over-water dispersal origin for the Cuban solenodon. SJO, JCMC and TKO wrote the draft. All authors contributed to Sci Rep. 2016;6(1):31173. finalizing of the manuscript. 14. Ottenwalder JA. The Distribution and Habitat of Solen- odon in the Dominican Republic. MS thesis Univ. Florida, Gainesville. 1985, p. 128. Funding 15. Ottenwalder JA. The Systematics, Biology and Conserva- Authors at the University of Puerto Rico at Mayaguez were sup- tion of Solenodon. Ph.D. dissertation, University of Florida, ported in part by a National Science Foundation award (1432092). Gainesville 1991, p. 281. Authors at the Theodosius Dobzhansky Center for Genome 16. Turvey S, Inchaustegui S. 2008. Solenodon para- Bioinformatics were supported by a Russian Ministry of Science doxus. The IUCN Red List of Threatened Species 2008: e.T20321A9186243. http://dx.doi.org/10.2305/IUCN.UK. mega-grant (11.G34.31.0068) and a St. Petersburg State Univer- sity grant (1.50.1623.2013). 2008.RLTS.T20321A9186243.en. Downloaded on 07 April 17. http://www.iucnredlist.org/details/20321/0. Acknowledgements 18. Luo R, Liu B, Xie Y et al. SOAPdenovo2: an empirically im- proved memory-efficient short-read de novo assembler. Gi- The authors thank Nicolas ´ Corona and Yimell Corona for assis- gaSci 2012;1(1):589–602. tance in collecting samples. A special thanks to Kara Fore for 19. Li H. Exploring single-sample SNP and INDEL calling with helping to edit the manuscript. Sample collection in the Domini- whole-genome de novo assembly. Bioinformatics. Oxford can Republic was performed under permit 00201171139 from the University Press; 2012;28(14):1838–44. Ministry of Environment and Natural Resources. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 19 20. Boetzer M, Henkel CV, Jansen HJ et al. Scaffolding pre- 39. Altschul SF, Gish W, Miller W et al. Basic local align- assembled contigs using SSPACE. Bioinformatics. Oxford ment search tool. Journal of Molecular Biology Elsevier; University Press; 2011;27(4):578–9. 1990;215(3):403–10. 21. Wang Y, Lu Y, Zhang Y et al. The draft genome of the grass 40. Bateman A, Coin L, Durbin R et al. The Pfam protein families carp (Ctenopharyngodon idellus) provides insights into its evo- database. Nucleic Acids Res. 2004;32(90001):138D–141. lution and vegetarian adaptation. Nat Genet 2015;47(6):625– 41. The UniProt Consortium. UniProt: a hub for protein 31. information. Nucleic Acids Research. 2015;43(Database 22. Grigorev K, Kliver S, Dobrynin P et al. Supporting data issue):D204–12. Available at: doi:10.1093/nar/gku989. for “Innovative assembly strategy contributes to un- 42. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA ho- derstanding the evolution and conservation genetics of mology searches. Bioinformatics. Oxford University Press; the endangered Solenodon paradoxus from the island of 2013;29(22):2933–5. Hispaniola.” GigaScience Database. 2018; http://dx.doi. 43. Nawrocki EP, Burge SW, Bateman A et al. Rfam 12.0: up- org/10.5524/100422. dates to the RNA families database. Nucleic Acids Res. 23. Starostina E, Tamazian G, Dobrynin P et al. Cookiecut- 2015;43(Database issue):D130–7. doi:10.1093/nar/gku1063. ter: a tool for kmer-based read filtering and extraction. 44. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved bioRxiv Cold Spring Harbor Labs Journals 2015;24679. Avail- detection of transfer RNA genes in genomic sequence. Nu- able from: https://doi.org/10.1101/024679. cleic Acids Res. 1997;25(5):955–64. 24. Marc¸ais G, Kingsford C. A fast, lock-free approach for effi- 45. Seemann T, Booth T. BARNAP: BAsic Rapid Ribosomal cient parallel counting of occurrences of k-mers. Bioinfor- RNA Predictor [Internet]. Berlin: GitHub; 2013. p. https: matics. 2011;27(6):764–70. //github.com/tseemann/barrnap. Accessed 1st March 2018. 25. Marc¸ais G, Yorke JA, Zimin A. QuorUM: an error corrector 46. Ruan J, Li H, Chen Z et al. TreeFam: 2008 update. Nucleic for Illumina reads. PLoS One 2015;10(6):e0130821. Acids Res. 2008;36(Database):D735–40. 26. Chikhi R, Medvedev P. Informed and automated 47. Li H, Coghlan A, Ruan J et al. TreeFam: a curated database k-mer size selection for genome assembly. Bioin- of phylogenetic trees of animal gene families. Nucleic Acids formatics [Internet]. 2014;30:31–7. Available from: Res. 2006;34(90001):D572–80. http://dx.doi.org/10.1093/bioinformatics/btt310. 48. Huerta-Cepas J, Szklarczyk D, Forslund K et al. eggNOG 27. Li H, Homer N. A survey of sequence alignment algo- 4.5: a hierarchical orthology framework with improved rithms for next-generation sequencing. Brief Bioinform functional annotations for eukaryotic, prokaryotic and vi- 2010;11(5):473–83. ral sequences. Nucleic Acids Research. 2016;44(Database 28. Gurevich A, Saveliev V, Vyahhi N et al. QUAST: quality issue):D286–93. Available at: doi:10.1093/nar/gkv1248. assessment tool for genome assemblies. Bioinformatics. 49. Stamatakis A. RAxML version 8: a tool for phylogenetic 2013;29(8):1072–5. analysis and post-analysis of large phylogenies. Bioinfor- 29. Simao ˜ FA, Waterhouse RM, Ioannidis P et al. BUSCO: assess- matics 2014;30(9):1312–3. ing genome assembly and annotation completeness with 50. Yang Z. PAML 4: phylogenetic analysis by maximum like- single-copy orthologs. Bioinformatics. 2015;31(19):3210–2. lihood. Molecular Biology and Evolution 2007;24(8):1586– 30. Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately 91. annotate core genes in eukaryotic genomes. Bioinformat- 51. Xia X, Xie Z, Salemi M et al. An index of substitution satura- ics. 2007;23(9):1061–7. tion and its application. Mol Phylogenet Evol 2003;26(1):1–7. 31. Hunt M, Kikuchi T, Sanders M et al. REAPR: a univer- 52. Xia X, Lemey P. Assessing substitution saturation with sal tool for genome assembly evaluation. Genome Biol. DAMBE. In: Lemey Philippe, Salemi, Marco, Vandamme, 2013;14(5):R47. Anne-Mieke eds. Phylogenetic Handb. A Pract. Approach 32. Paten B, Earl D, Nguyen N et al. Cactus: algorithms for to DNA Protein Phylogeny. 2nd ed. Cambridge University genome multiple sequence alignment. Genome Research. Press; 2009, 615–30. 2011;21(9):1512–28. 53. Xia X. DAMBE6: new tools for microbial genomics, phy- 33. Kolmogorov M, Raney B, Paten B et al. Ragout–a reference- logenetics, and molecular evolution. J Hered 2017;108(4): assisted assembly tool for bacterial genomes. Bioinformat- 431–7. ics. 2014;30(12):i302–9. 54. Ksepka DT, Parham JF, Allman JF et al. The fossil calibration 34. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996– database—a new resource for divergence dating. Syst Biol 2010 Accessible at: http://www.repeatmasker.org. 2015;syv025. 35. Bao W, Kojima KK, Kohany O. Repbase Update, a database 55. Benton MJ, Donoghue PCJ, Asher RJ et al. Constraints on the of repetitive elements in eukaryotic genomes. Mobile DNA timescale of animal evolutionary history. Palaeontol Elec- 2015;6(1):11. tron, Paleontological Society; 2015;18:1–106. 36. Slater GSC, Birney E. Automated generation of heuristics 56. Munthe K. Canidae. Evol. Tert. Mamm. North Am. Cam- for biological sequence comparison. BMC Bioinformatics. bridge University Press, Cambridge. 1998, p. 124–43. 2005;6(1):31. 57. Wang X, Whistler DP, Takeuchi GT. A new basal skunk 37. Stanke M, Keller O, Gunduz I et al. AUGUSTUS: ab ini- Martinogale (Carnivora, Mephitinae) from late miocene tio prediction of alternative transcripts. Nucleic Acids Res. dove spring formation, California, and origin of New 2006;34(Web Server):W435–9. World mephitines. Journal of Vertebrate Paleontology 38. Finn RD, Clements J, Eddy SR. HMMER web server: in- 2005;25(4):936–49. teractive sequence similarity searching. Nucleic Acids Re- 58. Rambaut A. FigTree [Internet]. 2016. Available from: search. 2011;39(Web Server issue):W29–37. Available at: http://tree.bio.ed.ac.uk/software/figtree/ Accessed 1 March doi:10.1093/nar/gkr367. 2018. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 20 Grigorev et al. 59. Springer MS, Murphy WJ, Roca AL. Appropriate fossil cal- 79. Jungo F, Bougueleret L, Xenarios I et al. The ibrations and tree constraints uphold the Mesozoic diver- UniProtKB/Swiss-Prot Tox-Prot program: a central hub gence of solenodons from other extant mammals. Molecu- of integrated venom protein data. Toxicon 2012;60(4):551– lar Phylogenetics and Evolution 2018;121:158–65. 7. 60. Hedges SB. Vicariance and dispersal in Caribbean biogeog- 80. Langmead B, Salzberg SL. Fast gapped-read alignment with raphy. Herpetologica 1996;52:158–65. Bowtie 2. Nat Methods 2012;9(4):357–9. 61. McDowell SB. The greater antillean insectivores. Bull Am 81. Li H, Durbin R. Fast and accurate short read align- Museum Nat Hist [Internet]. 1958;115:117. Available from: ment with Burrows-Wheeler transform. Bioinformatics http://hdl.handle.net/2246/1199. 2009;25(14):1754–60. 62. Butler PM. Phylogeny of the insectivores. In: Benton MJ, eds. 82. Danecek P, Auton A, Abecasis G et al. The variant call Phylogeny Classif. Tetrapods. Oxford: Clarendon; 1988, 117– format and VCFtools. Bioinformatics. 2011;27(15):2156– 41. 8. 63. MacPhee RD, Novacek M. Definition and relationships of 83. Dobrynin P, Liu S, Tamazian G et al. Genomic legacy the Lipotyphla. In: Soule F, Novacek M, McKenna M, eds. of the African cheetah, Acinonyx jubatus. Genome Biol Mamm. Phylogeny, Vol. 2, Placentals. New York: Springer- 2015;16(1):277. doi: 10.1186/s13059-015-0837-4. Verlag; 1993, 13–31. 84. Gordon D, Huddleston J, Chaisson MJP et al. Long- 64. McKenna M, Bell S, Simpson S. Classification of Mammals read sequence assembly of the gorilla genome. Science. Above the Species Level. New York: Columbia University 2016;352(6281):aae0344-. doi:10.1126/science.aae0344. Press; 1997. 85. Li R, Fan W, Tian G et al. Erratum: the sequence and 65. Edgar RC. MUSCLE: multiple sequence alignment with de novo assembly of the giant panda genome. Nature. high accuracy and high throughput. Nucleic Acids Res. 2010;463(7284):1106-. 2004;32(5):1792–7. 86. Cho YS, Hu L, Hou H et al. The tiger genome and compara- 66. Suyama M, Torrents D, Bork P. PAL2NAL: robust conver- tive analysis with lion and snow leopard genomes. Nature sion of protein sequence alignments into the correspond- Communications. 2013;4:2433. ing codon alignments. Nucleic Acids Res. 2006;34(Web 87. Li H, Durbin R. Inference of human population his- Server):W609–12. tory from individual whole-genome sequences. Nature. 67. Soto-Giron ´ MJ, Ospina OE, Massey SE. Elevated levels of 2011;475(7357):493–6. adaption in Helicobacter pylori genomes from Japan; a link 88. Weber JL. Human DNA polymorphisms and methods of to higher incidences of gastric cancer? Evolution, Medicine, analysis. Current Opinion in Biotechnology 1990;1(2):166– and Public Health. 2015;2015(1):88–105. 71. 68. Gharib WH, Robinson-Rechavi M. The Branch-Site Test of 89. Weber JL, Wong C. Mutation of human short tandem re- Positive Selection Is Surprisingly Robust but Lacks Power peats. Hum Mol Genet 1993;2(8):1123–8. under Synonymous Substitution Saturation and Variation 90. Weber JL, May PE. Abundant class of human DNA polymor- in GC. Molecular Biology and Evolution. 2013;30(7):1675–86. phisms which can be typed using the polymerase chain re- 69. Tang H, Klopfenstein D, Pedersen B et al. GOATOOLS: Tools action. Am J Hum Genet 1989;44:388–96. PMC1715443 for gene ontology [Internet]. Zenodo 2015. Available from: 91. Benson G. Tandem repeats finder: a program to ana- https://doi.org/10.5281/zenodo.31628. lyze DNA sequences. Nucleic Acids Res. 1999;27(2):573– 70. Baggs JE, Green CB. Nocturnin, a Deadenylase in Xenopus 80. laevis Retina. Current Biology 2003;13(3):189–98. 92. Willems T, Zielinski D, Yuan J et al. Genome-wide pro- 71. Oleksyk TK, Smith MW, O’Brien SJ. Genome-wide filing of heritable and de novo STR variations. Nat Meth scans for footprints of natural selection. Philosophical 2017;14(6):590–2. Transactions of the Royal Society B: Biological Sciences 93. Koepfli K-P, Paten B, O’Brien SJ. The Genome 10K Project: a 2010;365(1537):185–205. way forward. Annu Rev Anim Biosci; 2015;3:57–111. 72. Fay JC, Wyckoff GJ, Wu CI. Positive and negative selection 94. Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn on the human genome. Genetics 2001;158:1227–34. graphs to genome assembly. Nat Biotechnol 2011;29:987– 73. Fay JC, Wu C-I. The neutral theory in the genomic era. 91. Current Opinion in Genetics & Development 2001;11(6): 95. Myers EW. Toward simplifying and accurately formulat- 642–6. ing fragment assembly. Journal of Computational Biology 74. Ellegren H. Evolution: natural selection in the evolution of 1995;2:275–90. humans and chimps. Current Biology 2005;5(22):R919–22. 96. Myers EW. The fragment assembly string graph. Bioinfor- 75. Gaya-V ` idal M, Alba` M. Uncovering adaptive evolution in the matics 2005;21:ii79–85. human lineage. BMC Genomics 2014;15(1):R919–22. 97. Bradnam KR, Fass JN, Alexandrov A et al. Assemblathon 2: 76. Bakewell MA, Shi P, Zhang J. More genes underwent posi- evaluating de novo methods of genome assembly in three tive selection in chimpanzee evolution than in human evo- vertebrate species. GigaSci 2013;2:10. lution. Proceedings of the National Academy of Sciences 98. Gogarten JP, Olendzenski L. Orthologs, paralogs and 2007;104(18):7489–94. genome comparisons. Current Opinion in Genetics & De- 77. Olson MV, Varki A. Sequencing the chimpanzee genome: velopment 1999:630–6. insights into human evolution and disease. Nat Rev Genet 99. Creevey CJ, Muller J, Doerks T et al. Identifying single copy 2003;4(1):20–28. orthologs in metazoa. PLoS Comput Biol 2011;7:e1002269. 78. Mi H, Huang X, Muruganujan A et al. PANTHER version 100. Treangen TJ, Salzberg SL. Repetitive DNA and next- 11: expanded annotation data from Gene Ontology and Re- generation sequencing: computational challenges and so- actome pathways, and data analysis tool enhancements. lutions. Nat Rev Genet 2012;13:36–46. Nucleic Acids Research. 2017;45(Database issue):D183–9. 101. Volfovsky N, Oleksyk TK, Cruz KC et al. Genome and gene Available at doi:10.1093/nar/gkw1138. alterations by insertions and deletions in the evolution of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Innovative assembly of the Solenodon genome 21 human and chimpanzee chromosome 22. BMC Genomics new recommendations regarding IUCN conservation rank- 2009;10:51. ings. Biological Conservation 2015;191:495–503. 102. Rabb GB. Toxic salivary glands in the primitive insectivore 105. Turvey ST, Meredith HMR, Scofield RP. Continued survival Solenodon. Nat Hist Misc 1959;170:1–3. of Hispaniolan solenodon Solenodon paradoxus in Haiti. ORX. 103. Allendorf FW, Luikart G. Conservation and the Genetics of Cambridge University Press; 2008;42:611–4. Populations. John Wiley & Sons; 2009. 106. Gene Ontology Consortium. The Gene Ontology (GO) 104. Willoughby JR, Sundaram M, Wijayawardena BK et al. The database and informatics resource. Nucleic Acids Res. Ox- reduction of genetic diversity in threatened vertebrates and ford Univ Press; 2004;32:D258–61. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy025/4931057 by Ed 'DeepDyve' Gillespie user on 21 June 2018

Journal

GigaScienceOxford University Press

Published: Mar 16, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off