The genomes of two Eutrema species provide insight into plant adaptation to high altitudes

The genomes of two Eutrema species provide insight into plant adaptation to high altitudes Eutrema is a genus in the Brassicaceae, which includes species of scientific and economic importance. Many Eutrema species are montane and/or alpine species that arose very recently, making them ideal candidates for comparative studies to understand both ecological speciation and high-altitude adaptation in plants. Here we provide de novo whole-genome assemblies for a pair of recently diverged perennials with contrasting altitude preferences, the high-altitude E. heterophyllum from the eastern Qinghai-Tibet Plateau and its lowland congener E. yunna- nense. The two assembled genomes are 350 Mb and 412 Mb, respectively, with 29,606 and 28,881 predicted genes. Comparative analysis of the two species revealed contrasting demo- graphic trajectories and evolution of gene families. Gene family expansions shared between E. heterophyllum and other alpine species were identified, including the disease resistance R genes (NBS-LRRs or NLRs). Genes that are duplicated specifically in the high-altitude E. hetero- phyllum are involved mainly in reproduction, DNA damage repair and cold tolerance. The two Eutrema genomes reported here constitute important genetic resources for diverse studies, including the evolution of the genus Eutrema, of the Brassicaceae as a whole and of alpine plants across the world. Key words: Eutrema, high-altitude adaptation, de novo assembly, comparative genomics 1. Introduction to high-altitude conditions remain largely unknown due to a lack of Adaptation to high altitude has contributed to the high level of plant appropriate genome resources. and animal species biodiversity found on numerous mountains Species of the Brassicaceae family have a wide range of global dis- across the world. It is well known that animals (including humans) tributions and adaptations, as well as different life histories and mat- living in highlands have evolved unique genetic adaptations in order ing systems, and they therefore comprise an ideal system for many 2 4,5 to maintain oxygen homeostasis. In plants, UV-B tolerance and cold types of study. Genome resources are critically important for all 3 6,7 tolerance appear to be critical for survival at high altitude. these studies. Recent advances in high-throughput sequencing However, the genetic mechanisms by which immobile plants adapt technology have facilitated the ability of the scientific community to V C The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 307 Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 308 Eutrema genomes and plant high-altitude adaptation 13 14 sequence the genomes of an increasing number of Brassicaceae spe- Trimmomatic v0.33, error correction using BFC (version r181) 8–10 15 cies for multiple purposes. A previous genome study of an alpine and mate-pair data deduplication using FastUniq v1.1. We then polyploid Brassicaceae species suggested that expansion of numerous used SOAPec v2.01 to estimate the genome size as well as the level gene families was correlated with increased tolerance of cold and of of genome-wide heterozygosity, which was based on the 17-mer fre- 3 17 UV-B stress. However, such expansions may have arisen purely quency distribution. We adopted the Platanus pipeline, which can from genome duplications rather than because of high-altitude efficiently assemble genomes with various levels of heterozygosity. adaptations. Both genomes were initially assembled into scaffolds using Platanus Eutrema is a genus in the Brassicaceae that contains more than 30 (version 1.2.2) with the first round of gap closing. An additional species, most of which are distributed in eastern Asia. In addition to gap closing procedure was performed with GapCloser (version the model plant E. salsugineum (Pall.) Al-Shehbaz & Warwick, 1.12). Finally we evaluated genome assemblies for gene coverage 18 19 which is used for studies of abiotic stress, and the commercial crop using the CEGMA and BUSCO (version 2.0) pipeline. wasabi, E. japonicum (Miq.) Koidz, the genus contains numerous montane and/or alpine species. Most of these species occur in the 2.3. Repeat identification Qinghai-Tibet Plateau (QTP) and have originated recently in To structurally annotate repeat sequences in the two Eutrema response to geologic and climatic changes since the Pliocene. In this genomes, we used the TEdenovo pipeline from the REPET package study, we report the de novo genome sequence assembly and compa- (v2.5) with default parameters. The pipeline performed a self- rative genomic analysis of an alpine plant, E. heterophyllum, from comparison for the input genome using the BLASTER software (ver- the eastern QTP, and its lowland congener, E. yunnanense, both of sion 2.25) to identify and classify repeated elements. In order to which originated from a very recent common ancestor and are dip- attain more comprehensive prediction of TEs, we employed the pipe- loid perennials with chromosome numbers of 2n¼ 14 and similar line to carry out structural detection analysis using LTRharvest (ver- selfing breeding systems, therefore providing an ideal basis for 22 23 24 sion 1.5.8). RECON v1.0.8, PILER v1.0 and GROUPER genetic comparisons without the complicating effects of either 21 v2.27 were then employed to cluster the matches detected. By per- genome duplication or phylogenetic evolution. In addition, these two 25 forming a multiple sequence alignment using the Map algorithm, a genome assemblies provide important genetic resources for future consensus sequence for each cluster was generated to represent the comparative studies of this genus and family and of high-altitude ancestral TE. These consensus sequences were then classified by adaptation in alpine plants. looking for characteristic structural features and similarities to known TEs in Repbase (20.05), and by scanning against the Pfam 27 28 library with HMMER3. To annotate all TEs across the genome, 2. Materials and methods we further employed the TEannot pipeline with the default param- 2.1. Plant materials, genomic DNA and total RNA eters. This pipeline mines the genome sequence using repeated extraction sequences identified in the previous TEdenovo pipeline to produce classified non-redundant consensus repeat sequences along with We selected and sequenced a pair of Eutrema species that met three short simple repeats, which are exported in GFF3 format. We fol- criteria: (1) diploid (2n¼ 14), (2) having diverged recently and (3) growing at contrasting altitudes in their natural distributions lowed the method of Maumus and Quesneville to calculate the age (Supplementary Fig. S1). One individual of E. heterophyllum was of repetitive DNAs. collected in June 2015 at the Zhuodala pass (4,700 m above sea level; asl) in Ganzi County, Sichuan Province, China. Fresh and 2.4. Gene prediction healthy leaves, flowers and root were harvested and immediately fro- A combination of ab initio, homology-based and transcript-based zen in liquid nitrogen, followed by storage at 80 C in the labora- methods was used for gene prediction. We first built a comprehensive tory prior to DNA/RNA extraction. For comparison, fresh samples 30 transcriptome database with the PASA pipeline. After quality con- of leaf, leafstalk and root tissues were obtained from one E. yunna- 13 trol with Trimmomatic v0.33, Illumina RNA-seq reads were nense plant collected in April 2014, from Xinning County (600 m 31 assembled into de novo transcripts using Trinity. Then two sets of asl), Hunan Province, China. We used the CTAB method to extract genome-guided transcripts were built using (1) the genome-guided high-quality genomic DNA from 5 g of leaf tissues for each species. mode implemented in Trinity and (2) the HISAT-StringTie pipe- The quality of the high-molecular weight DNA was checked by 1% 32 33 line. We performed ab initio prediction with Augustus (v3.2.2) agarose gel electrophoresis. Around 200 lg of genomic DNA was using Arabidopsis thaliana as the species model and our comprehen- used for library construction. We extracted total RNA with RNAiso sive transcriptome dataset as the input cDNA hint. We used Plus reagent (Takara, Japan) for each tissue collected from both spe- GlimmerHMM (v3.0.4) with the pre-trained directory for A. thali- cies (i.e. leaves, flowers and roots for E. heterophyllum, and leaves, ana genes to generate another set of ab initio predictions. For homol- leafstalks and roots for E. yunnanense). The quality of the extracted ogy-based prediction, protein sequences of 26,531 E. salsugineum RNA was verified by 1% agarose gel. DNase treatment was per- 35 36 genes as well as the UniRef90 dataset for all Brassicaceae species formed before library construction. For each sample, 5 lg of the total 37 were aligned to the genome assemblies using SPALN2. Gene mod- high-quality RNA was used for sequencing. els from the three main sources (i.e. aligned transcripts, ab initio pre- dictions and aligned proteins) were merged to produce consensus 2.2. Genome sequencing and assembly models by EVidenceModeler. Weights of evidences for gene models We constructed Illumina libraries with small (200-, 500- and 800- were manually set as: ab initio predictions, weight (Augustus)¼ 1 bp) and large (2-, 5-, 10- and 20-kb) inserts and then sequenced and weight (GlimmerHMM)¼ 1; protein alignments, weight (E. sal- them following the Illumina protocols (Illumina, San Diego, CA) sugineum) ¼ 2 and weight (Brassicaceae_UniRef90)¼ 1; transcripts, using the Illumina HiSeq X Ten platform at Novogene (Tianjin, weight (PASA)¼ 10. The EVM consensus predictions were updated China). Short reads were first subjected to quality filtering using by performing an additional round of PASA annotation in order to Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 X. Guo et al. 309 add UTR and alternatively spliced isoform annotations to gene with the same parameters. The results were combined and plotted models. using the plot tool in PSMC. A generation time of 3 years and a mutation rate of 7 10 per site per year were applied. 2.5. Functional annotation 38 5 2.8. NLR genes and integrated domains We used BlastP (with E-value 1 10 ) to assign functional descriptions by carrying out sequence homology searches against dif- NLR (nucleotide-binding site with leucine-rich repeat) genes in the E. ferent sources, including protein datasets from SwissProt and the heterophyllum and E. yunnanense genomes were directly extracted Arabidopsis genome annotation version TAIR10 . Motifs and from the InterProScan results using the Pfam domain of the domains within gene models were identified by deploying nucleotide-binding adaptor shared by APAF-1, R proteins and CED- InterProScan (version 5.20.29) with the options -dp -goterms -ipr- 4 (NB-ARC, PF00931) as query. We further applied the same lookup -pathways -t p against multiple publicly available databases strategy to identify NLRs in protein sequences from 13 cruciferous (Pfam, PROSITE, PRINTS, SMART, TIGRFAMs, Gene3D, species with genomes available. After excluding typical Pfam PANTHER, etc.). We used AHRD (https://github.com/groupschoof/ domains contained in NLRs (i.e. PF01582 for Toll/interleukin-1 AHRD (24 January 2018, date last accessed)) to gather information receptor [TIR], PF13855 for leucine-rich repeat [LRR] and PF05659 from the search results and added Gene Ontology (GO) information for Resistance to Powdery Mildew 8 [RPW8]), we obtained a list of using the GO annotation database for A. thaliana (https://www.ebi. additional domains that were likely to be fused with NLR proteins. ac.uk/GOA/arabidopsis_release; (24 January 2018, date last accessed)). We did not detect TIR2 (PF13676) in our annotated NLRs; this is a TIR domain newly identified among NLR proteins. However, most previously characterized gene fusions were recaptured in this study, 2.6. Comparative genomic analysis indicating that the approach we used for domain searching was We downloaded the protein-coding genes of 13 Brassicaceae species conservative and generally reliable. A maximum likelihood tree of (A. thaliana, A. lyrata, Brassica rapa, Capsella rubella, phloem protein 2 domains was constructed with RAxML under the Leavenworthia alabamica, Sisymbrium irio, Aethionema arabicum, PROTGAMMA model with BLOSUM62 matrix, specifying 500 boot- Arabis alpina, Raphanus raphanistrum, Cardamine hirsuta, strap replicates. The gene tree was visualized in FigTree v1.4.2 (http:// Schrenkiella parvula, Thlaspi arvense, E. salsugineum; see 43 tree.bio.ed.ac.uk/software/figtree/ (24 January 2018, date last accessed)). Supplementary Table S4). We used OrthoMCL to delineate gene families and cluster all genes into paralogous and orthologous groups. Gene family expansion and contraction analysis was per- 2.9. Positive selection in SCC3 genes formed with CAFE3 software. Functional enrichment analysis was The ratio of non-synonymous substitutions per non-synonymous site 45 46 carried out by agriGO v2.0. We used R to perform clustering (dN) to synonymous substitutions per synonymous site (dS), or x,is analysis for the gene families of the greatest size in E. heterophyllum. a commonly used indicator of selective pressure acting on protein- The transformed z-score profile was obtained from the number of coding genes, with x> 1 representing positive selection, x¼ 1 repre- genes per species for each family, and the hierarchically clustered senting neutral evolution and x< 1 representing purifying selection. (complete linkage clustering) ones, using Pearson correlation as a dis- To identify signals of positive selection, we extracted SCC3 sequen- tance measure. The 2,235 single-copy gene families were used to ces from all Brassicaceae genomes. A gene tree of aligned protein reconstruct phylogenies with RAxML. Alignment of protein sequences was constructed using RAxML under the PROTGAMMA sequences was performed with PRANK and they were then back model with BLOSUM62 matrix, specifying 500 bootstrap replicates. translated into coding sequences with PAL2NAL. The MCMCtree We used codeml in the PAML software package (v4.9a) to perform program within the PAML package v4.9a was used to estimate the branch-site test among duplicated SCC3 paralogs for a series of divergence time. The split of the two major lineages within the pairs of alternative versus null models, aiming to find out whether or Brassicaceae was constrained to be between 20 and 30 million years not any proportion of coding sites within each paralog has x> 1 51 52 ago. We used MCScanX to call intra- and inter-species gene colli- compared with the background level. Log-likelihood values for each nearity. Synonymous substitutions were calculated for aligned gene pair of nested models were compared using the likelihood ratio test pairs within identified collinearity blocks using the codeml program (LRT) to obtain the P-value of significance. implemented in PAML. 3. Results 2.7. SNP calling and demographic history 3.1. Genome assembly and annotation reconstruction After Illumina sequencing and quality control, we obtained 111.87 We used the pairwise sequentially Markovian coalescent (PSMC) model to estimate the history of effective population sizes (Ne) and 97.29 Gb of Illumina short reads from seven sequencing libra- based on genome-wide diploid sequence data. This method has been ries, corresponding to an estimated 249 and 230 base-pair cover- previously applied to human and other vertebrates, as well as plants, age, for E. heterophyllum and E. yunnanense, respectively (Supplementary Table S1). The estimated genome size of 400 Mb for inferring demographic histories over long evolutionary periods. The quality-filtered Illumina pair end reads with insert sizes of for both species is larger than that of their congener E. sal- 500 bp for each species were mapped to the corresponding genome sugineum, with a higher level of genome-wide heterozygosity in the assemblies using BWA (version 0.7.10-r789). For SNP calling, we low-altitude species E. yunnanense (Supplementary Fig. S2). A de used the SAMtools (v1.4) pipeline. The settings for PSMC analysis novo assembly pipeline allowed us to generate genome assemblies (-p and -t options) were chosen manually according to suggestions that captured 350 and 413 Mb in 149,415 and 317,771 contigs for given by the authors (https://github.com/lh3/psmc (24 January 2018, the two Eutrema genomes, of which 328 and 386 Mb (81% and date last accessed)). We also performed 100 rounds of bootstrapping 91% of the estimated genome sizes) were represented in scaffolds of Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 310 Eutrema genomes and plant high-altitude adaptation 1 kb or greater (Table 1). Repetitive DNA constitutes 67% and 70% (Table 1). Clustering of predicted proteomes for 15 Brassicaceae of, respectively, the E. heterophyllum and the E. yunnanense genomes yielded a total of 27,193 gene families, which comprised genome. Long terminal repeat (LTR) retrotransposons are the most 380,484 genes (85% of all 448,886 genes; see Supplementary abundant among these repetitive sequences (Supplementary Table S4). Table S2), having contributed to a recent wave of TE burst (Kimura divergence 5, Supplementary Fig. S3). Gene completeness analysis 3.2. Comparative phylogenomics and demographic revealed that we had achieved more than 95% completeness of gene histories coverage for both genomes (Supplementary Table S3). We predicted We analysed the phylogenetic relationships of 15 Brassicaceae spe- 29,606 genes for E. heterophyllum and 28,881 for E. yunnanense cies (Supplementary Table S4) using four-fold degenerate third- codon transversion (4DTv) sites in 2,193 one-to-one orthologous Table 1. Genome assembly and annotation statistics for two genes identified across their genomes. The nuclear-genome phylog- Eutrema species eny (Fig. 1; Supplementary Fig. S4) is congruent with the plastome phylogeny reported previously. We dated the divergence of the two E. heterophyllum E. yunnanense species to 3.5 (1.5–7.2) Mya (Fig. 1; Supplementary Fig. S5). Assembly statistics Clustering analysis of gene family size revealed that many duplicated Estimated genome size (Mb) 405 423 genes were retained in genomes known to be polyploid (Fig. 1). We Number of contigs 149,415 317,771 failed to detect any additional whole genome duplication in the two Contig length (Mb) 350.47 412.62 Eutrema species when compared with A. thaliana using synonymous Longest contig (Mb) 0.67 0.39 substitutions per synonymous site (dS) age distribution analysis Contig L50 (seqs) 1264 3269 (Supplementary Fig. S6). We identified 581,883 and 807,391 hetero- Contig N50 (kb) 74.65 32.56 zygous sites in, respectively, the E. heterophyllum and the E. yunna- Number of scaffolds (>1 kb) 7,690 14,031 nense genome. Consistent with the results of analysing K-mer Scaffold length (Mb) 328.16 385.58 frequency distribution (Supplementary Fig. S2), E. heterophyllum Longest scaffold (Mb) 3.5 2.07 Scaffold L50 (seqs) 189 341 had a lower heterozygous SNP rate (1.75 10 ) than E. yunna- Scaffold N50 (kb) 535.32 331.5 nense (2.18 10 ). A sharp peak was observed in the distribution Annotation statistics of genome-wide SNP density for E. heterophyllum (Fig. 2A), suggest- Number of gene models 29,606 28,881 ing that mutations within this genome are more evenly distributed With InterPro annotations 22,593 22 343 than those in the E. yunnanense genome. PSMC analysis based on With GO annotations 20,879 20,254 the local SNP densities of heterozygous sites revealed distinct demo- Mean CDS size (bp) 1201.3 1222.3 graphic histories for the two species, in that the high-altitude E. het- Mean exon size (bp) 291.5 301.3 erophyllum showed a decrease-increase-decrease trend while the Mean intron size (bp) 233.0 210.6 population size of the low-altitude E. yunnanense decreased continu- ously in the recent past (Fig. 2B). Figure 1. Dated phylogeny and gene family analysis of Brassicaceae species. Pie charts showing gains (red) and losses (blue) of gene families are indicated along branches and nodes. The numbers of gene families, orphans (single-copy gene families) and predicted genes, as well as a heat map depicting the hier- archically clustered z-score profile for gene numbers within each family, is indicated next to each species. Background colours on the phylogenetic tree (top to bottom) are basal clade, lineage II and lineage I Brassicaceae species, respectively. Full species names: Aethionema arabicum, Arabis alpina, Eutrema yunna- nense, Eutrema heterophyllum, Eutrema salsugineum, Thlaspi arvense, Schrenkiella parvula, Sisymbrium irio, Raphanus raphanistrum, Brassica rapa, Cardamine hirsuta, Leavenworthia alabamica, Capsella rubella, Arabidopsis thaliana, Arabidopsis lyrata. The star and triangle denote lineage-specific whole- genome triplications. Mya, million years ago. See online version for color display. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 X. Guo et al. 311 Figure 2. SNP density distribution and demographic history of two Eutrema species. (A) Distribution of SNP density across each Eutrema genome. Heterozygous SNPs were counted and used to calculate heterozygosity density in non-overlapping 50-kb windows. (B) PSMC inference of ancestral effective population size change in the two species. The estimated ancestral population sizes are represented by central bold lines, and the PSMC estimations of 100 bootstraps resampled from the original sequence are represented by thin curves surrounding each line. A generation time of 3 years and a mutation rate of 7  10 per site per year were assumed for both species. Dark grey and abbreviations indicate glaciation periods, while light grey corresponds to interglacia- tion. BG (LGP), Baiyu Glaciation (last glaciation period, 70  10 thousand years ago, kya); GG, Guxiang Glaciation (300  130 kya); NG, Naynayxungla Glaciation (780  500 kya); XG, Xixiabangma Glaciation (1,170  800 kya). Red arrow denotes the estimated divergence time of the two species. 3.3. Expansion of NLR genes and analysis of domain integration In comparison with other Brassicaceae species, we found that 141 and 178 gene families had significantly expanded (P< 0.05) in the E. heterophyllum and E. yunnanense genomes, respectively (Supplementary Tables S5–S8). Among the gene families that had expanded within the high-altitude E. heterophyllum, GO terms including ‘response to stress’ and ‘nucleotide binding’ (adjusted P< 0.05, Supplementary Fig. S7) were overrepresented, while no particular GO term was enriched in the case of the low-altitude E. yunnanense. The terms enriched in E. heterophyllum could be largely attributed to 10 clusters of disease resistance genes encoding nucleotide-binding site with leucine-rich repeat (NLR) proteins (Supplementary Table S6). We noticed that most of these NLR gene families had also expanded in the genomes of three other Figure 3. Shared gene family expansions among four Brassicaceae species. Brassicaceae species, A. alpina, A. lyrata and C. hirsuta, of which the Venn diagram showing overlap of shared gene families and of the nuclear- former two are well documented as alpine plants (Fig. 3). Two NLR binding site with leucine-rich repeats (NLR) genes. Numbers of significantly expanded gene families among the four genomes that contain the largest families shared across these species account for 20% of the total number of NLRs are shown. Counts of the significantly expanded NLR gene number of NLR genes that differ between E. heterophyllum and families are given in parentheses. E. yunnanense (Supplementary Table S5). Overall, the gene families expanded in E. heterophyllum overlap significantly with those into the same gene family that had undergone expansion in all these expanded in each of these three species (P< 0.01, Fisher’s exact test) genomes (Supplementary Table S5). Obviously, the fusion of PP2 (Fig. 3; Supplementary Fig. S8). These results imply that parallel arose independently in these three species (Fig. 4), probably in expansion of gene sets, particularly within the NLR genes, may have response to the same class of pathogens. repeatedly occurred in alpine plants. Plant NLR genes can fuse with additional domains (hereafter 3.4. Species-specific gene duplications and referred to as NB-IDs) that serve as ‘baits’ for pathogen-derived mol- ecules. We identified 112 NB-IDs with 82 distinct integrated high-altitude adaptation domains across the Brassicaceae genomes (Supplementary Fig. S9; We also analysed the functional enrichment of genes that were dupli- Supplementary Table S9). Within the fused domains, 22 were present cated specifically in the high- or low-altitude species. In the high- in at least two genomes (Supplementary Table S10). Interestingly, in altitude E. heterophyllum, we observed 30 enriched GO terms within E. heterophyllum, we found that integration of one domain encoding the biological process ontology, including ‘response to stimulus’, phloem protein 2 (PP2, Pfam ID: PF14299), characteristic of a group ‘response to stress’, ‘reproductive process’ and ‘cell cycle’. However, of plant lectins involved in responses to various stresses, was only four terms within the biological process ontology, ‘developmen- shared by two other species, A. alpina and C. hirsuta (Fig. 4). It is tal process’, ‘anatomical structure development’, ‘multicellular worth noting that the NLRs fused with this domain were clustered organism development’ and ‘multicellular organismal process’, were Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 312 Eutrema genomes and plant high-altitude adaptation identified (Supplementary Fig. S10). Of these, one fragment was found within gene pairs that are collinear between the two Eutrema species (Supplementary Fig. S11). These findings suggest that dupli- cation of SCC3 arose before the divergence of these species and sub- sequent pseudogenization may have occurred in E. yunnanense. Three genes, ICE2, FAB1 and VIN3, involved in cold stress was also observed to be duplicated in the high-altitude E. heterophyllum (Supplementary Table S11). ICE2 encodes a transcription factor that confers rapid freezing tolerance in A. thaliana. The product of FAB1 is involved in membrane lipid synthesis and photosynthesis collapses in the Arabidopsis fab1 mutant due to degradation of chloroplasts when plants are exposed to low temperature for long 62,63 periods. VIN3, which is induced by prolonged winter cold, is essential for the repression of FLC during vernalization. 4. Discussion In this study, we performed de novo genome assemblies for E. heter- ophyllum and E. yunnanense, a pair of diploid perennials, which have diverged recently but have contrasting altitude preferences. Compared with previously published Brassicaceae genomes that 8,65 were assembled from Illumina short reads, our genome assem- blies were of high quality in terms of continuity and gene coverage, Figure 4. Phylogeny of the phloem protein 2 (PP2) domain fused with making them suitable for subsequent comparative genomic analysis expanded NLR genes. Sequences from all A. thaliana proteins containing and thus representing important genetic resources. Phylogenomic this domain and the integrated PP2 detected in three Brassicaceae genomes analysis confirmed the close relationship and the diploid nature of (labelled with numbers within circles) were analyzed. Branches are coloured these two Eutrema species. However, the contrasting demographic red if a species has a documented high-altitude distribution, and green if histories of the two species may reflect their different responses to the not. The structure of NLR fused with PP2 is shown below the gene tree. See local climatic oscillations that occurred during the Quaternary. The online version for color display. dramatic genomic differences observed here may have partly contrib- uted to the high- or low-altitude adaptations that they exhibit. recovered for the low-altitude E. yunnanense (Supplementary Tables The NLRs are among the most variable genes across plant species. S11–S14). Variation in these genes between high- and low-altitude A. thaliana A close examination of the families expanded in E. heterophyllum populations was discovered recently. We found an increase in the revealed multiple genes that are involved in plant reproduction and number of gene copies for some NLR groups in E. heterophyllum DNA damage repair (DDR), probably in response to UV-B radiation and other alpine species of the family Brassicaceae. In addition, we and other abiotic stresses imposed by the harsh alpine environment. identified a fused domain encoding PP2 among these expanded NLR For example, more copies were found for NBS1, RPA32 and SNI1, genes and confirmed the independent acquisition of this domain in which are involved in homologous recombination (HR) using the different species. According to a recent comprehensive study of fused intact sister chromatid, and KU80 and XRCC4, which participate domains in plant species, this domain also exists in NLRs of in the non-homologous end-joining (NHEJ) process (Fig. 5A, Malvaceae and Poaceae species. Thus members of the PP2 family Supplementary Table S11); HR and NHEJ are the two major pathways have fused with NLR genes multiple times in different angiosperm for the repair of double-strand breaks. Duplicated genes involved in lineages. Further studies are needed to investigate whether and how the regulation of the cell cycle (SIM, ATXR5 and ICK3)werealso plant immune receptors are involved in high-altitude adaptation in detected for E. heterophyllum (Fig. 5A, Supplementary Table S11); other alpine species. SIM is a plant-specific CDK inhibitor binding CDKB1; 1, probably In the high-altitude E. heterophyllum, we found that many genes specifically controlling endocycle onset during the G2/M transition. related to reproduction and DDR had undergone duplications, prob- Of the genes with the largest family sizes (z-score normalized) in ably in response to abiotic stress. It should be noted that most of E. heterophyllum, two (SCC3 and SMC3) encode subunits of the these reproduction-related genes remain as single copies in most cohesin complex, a well-known ring-shaped molecule with a critical 60 67 angiosperms. In addition, two genes in this category (SCC3 and role in sister chromatid cohesion. In E. heterophyllum, we found SMC3) that have undergone duplication events encode cohesin subu- four copies of SCC3 and three ones of SMC3 (Fig. 5B). Most of these nits, and similar copy number variations have also been found in ani- copies were expressed in leaf, root and/or flower tissue (Fig. 5B). A duplication of SCC3 was also found in A. lyrata, another alpine cru- mals, where they may cause distinct functional changes in the 68–71 cifer species (Fig. 6). Using the branch-site model, we showed that cohesin complex. To the best of our knowledge, this is the first duplicated SCC3 copies were under positive selection (Fig. 6). The report of duplication and possible subfunctionalization of cohesin average pairwise identity between the amino acid sequences of the subunits in plants. We have further shown that duplications of SCC3 in Eutrema species occurred early, before their divergence. Although SCC3 genes in the high-altitude E. heterophyllum is only 61.1%, we cannot exclude the possibility that the apparent multiple partial indicating that the duplication of this gene may have occurred very early. Intriguingly, although only one intact SCC3 gene was found in SCC3 sequences identified in E. yunnanense may be due to the the low-altitude E. yunnanense, multiple SCC3 fragments were fragmented nature of the draft genome, the evidence for Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 X. Guo et al. 313 Figure 5. Duplication of genes related to DDR in E. heterophyllum. (A) Genes involved in the DDR pathway and cell cycle control that are duplicated in E. heterophyllum. Duplicated genes are shown in light brown. HR, homologous recombination; NHEJ, non-homologous end-joining. (B) A schematic picture of the plant cohesin complex indicating duplication of genes encoding two cohesin subunits, SCC3 and SMC3. The numbers of the corresponding genes in E. yunnanense and E. heterophyllum are shown in square brackets in the format [E. yunnanense, E. heterophyllum]. The expression profiles in FPKM (frag- ments per kilobase per million reads mapped) of paralogs are plotted as log values. See online version for color display. Figure 6. Phylogenetic relationships and positive selection of SCC3 genes in Brassicaceae species. Values above branches indicates the support values obtained from 500 bootstrap replicates. Scale bar represents the number of substitutions per site. The branches tested in branch-sites tests of selection for the EhSCC3 and AlSCC3 genes are indicated. The inset table shows the results of the branch-sites tests, with a P-value based on 1 degrees of freedom. ln L, log-likelihood values; H0, null model; H1, alternative model; v , chi-square test likelihood ratio (2Dln L). pseudogenization in this species is strong. The duplication, followed that we have developed for the two Eutrema species will be very use- by subsequent loss, of SCC3 in E. yunnanense may be correlated ful for many studies in this genus, the Brassicaceae as a whole, and with its initial occurrence (or its ancestral distribution) in the high- alpine plants in general. altitude QTP and subsequent migration to the lower-altitude region. As the cohesin complex has a profound impact on repro- duction as well as on gene regulation, duplication of these subunits Availability may have played an important role in plant adaptation to harsh The short reads of Illumina DNAseq and RNAseq generated in this alpine environments. In addition, duplication of genes related to cold study are available from the National Center for Biotechnology acclimation in the high-altitude E. heterophyllum may be correlated Information (NCBI) Short Read Archive (SRA) under BioProject with plant adaptation to the low and rapidly changing temperatures PRJNA419026 (BioSample SAMN08200758 for E. heterophyllum experienced in alpine habitats. and SAMN08200758 for E. yunnanense). The genome assemblies of In summary, we assembled de novo genomes of two Eutrema spe- this project have been deposited at DDBJ/ENA/GenBank under the cies and inferred genetic changes in the alpine species that may accession PKMM00000000 for E. heterophyllum and PKML00 underlie high-altitude adaptation, although further tests in more spe- cies using multiple approaches are needed. The genome resources 000000 for E. yunnanense. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 314 Eutrema genomes and plant high-altitude adaptation 13. Bolger, A. M., Lohse, M. and Usadel, B. 2014, Trimmomatic: a flexible Acknowledgements trimmer for Illumina sequence data, Bioinformatics, 30, 2114–20. We thank Bingbing Liu for his help in DNA extraction. 14. Li, H. 2015, BFC: correcting Illumina sequencing errors, Bioinformatics, 31, 2885–7. 15. Xu, H., Luo, X., Qian, J., et al. 2012, FastUniq: a fast de novo duplicates Funding removal tool for paired short reads, PLoS One, 7, e52249. 16. Luo, R., Liu, B., Xie, Y., et al. 2012, SOAPdenovo2: an empirically improved This work was supported by grants from the National Key Research and memory-efficient short-read de novo assembler, Gigascience, 1,18. Development program (2017YFC0505203), National Natural Science 17. Kajitani, R., Toshimoto, K., Noguchi, H., et al. 2014, Efficient de novo Foundation of China (31590821), National Key Project for Basic Research assembly of highly heterozygous genomes from whole-genome shotgun (2014CB954100), Youth Science and Technology Innovation Team of short reads, Genome Res., 24, 1384–95. Sichuan Province (2014TD003), and the 985 and 211 Projects of Sichuan 18. Parra, G., Bradnam, K. and Korf, I. 2007, CEGMA: a pipeline to accu- University. rately annotate core genes in eukaryotic genomes, Bioinformatics, 23, 1061–7. 19. Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. and Conflict of interest Zdobnov, E. M. 2015, BUSCO: assessing genome assembly and annotation The authors declare that they have no competing interests. completeness with single-copy orthologs, Bioinformatics, 31, 3210–2. 20. Flutre, T., Duprat, E., Feuillet, C. and Quesneville, H. 2011, Considering transposable element diversification in de novo annotation approaches, Authors’ contributions PLoS One, 6, e16526. 21. Quesneville, H., Bergman, C. M., Andrieu, O., et al. 2005, Combined evi- J.L. conceived the study. G.H. and X.G. collected samples. X.G. and Q.H. dence annotation of transposable elements in genome sequences, PLoS assembled the genomes and performed genome annotation and evolutionary Comput. Biol., 1, 166–75. analyses with the help of X.W, D.Z. and T.M. X.G. and J.L. wrote the manu- 22. Ellinghaus, D., Kurtz, S. and Willhoeft, U. 2008, LTRharvest, a efficient script. All authors read and approved the final manuscript. and flexible software for de novo detection of LTR retrotransposons, BMC Bioinformatics, 9, 18. 23. Bao, Z. and Eddy, S. R. 2002, Automated de novo identification of repeat Supplementary data sequence families in sequenced genomes, Genome Res., 12, 1269–76. Supplementary data are available at DNARES online. 24. Edgar, R. C. and Myers, E. W. 2005, PILER: identification and classifica- tion of genomic repeats, Bioinformatics, 21(suppl_1), i152–8. 25. Huang, X. 1994, On global sequence alignment, Comput. Appl. Biosci., References 10, 227–35. 26. Bao, W., Kojima, K. K. and Kohany, O. 2015, Repbase Update, a data- 1. Hughes, C. E. and Atchison, G. W. 2015, The ubiquity of alpine plant base of repetitive elements in eukaryotic genomes, Mob. DNA, 6, 11. radiations: from the Andes to the Hengduan Mountains, New Phytol., 27. Finn, R. D., Coggill, P., Eberhardt, R. Y., et al. 2016, The Pfam protein 207, 275–82. families database: towards a more sustainable future, Nucleic Acids Res., 2. Beall, C. M. 2014, Adaptation to high altitude: phenotypes and geno- 44, D279–85. types, Annu. Rev. Anthropol., 43, 251–72. 28. Eddy, S. R. 2008, A probabilistic model of local sequence alignment that 3. Zhang, J., Tian, Y., Yan, L., et al. 2016, Genome of plant maca simplifies statistical significance estimation, PLoS Comp. Biol., 4, (Lepidium meyenii) illuminates genomic basis for high-altitude adaptation e1000069. in the central Andes, Mol. Plant., 9, 1066–77. 29. Maumus, F. and Quesneville, H. 2014, Ancestral repeats have shaped epi- 4. Bailey, C. D., Koch, M. A., Mayer, M., et al. 2006, Toward a global phy- genome and genome composition for millions of years in Arabidopsis logeny of the Brassicaceae, Mol. Biol. Evol., 23, 2142–60. thaliana, Nat. Commun., 5, 4104. 5. Koch, M. A., Dobes, C., Kiefer, C., Schmickl, R., Klimes, L. and Lysak, 30. Haas, B. J., Salzberg, S. L., Zhu, W., et al. 2008, Automated eukaryotic M. A. 2006, Supernetwork identifies multiple events of plastid trn F gene structure annotation using EVidenceModeler and the Program to (GAA) pseudogene evolution in the Brassicaceae, Mol. Biol. Evol., 24, Assemble Spliced Alignments, Genome Biol., 9, R7. 63–73. 31. Grabherr, M. G., Haas, B. J., Yassour, M., et al. 2011, Full-length tran- 6. Koenig, D. and Weigel, D. 2015, Beyond the thale: comparative genomics scriptome assembly from RNA-Seq data without a reference genome, Nat. and genetics of Arabidopsis relatives, Nat. Rev. Genet., 16, 285. Biotechnol., 29, 644–52. 7. Nikolov, L. A. and Tsiantis, M. 2017, Using mustard genomes to explore 32. Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T. C., Mendell, J. T. the genetic basis of evolutionary change, Curr. Opin. Plant Biol., 36, and Salzberg, S. L. 2015, StringTie enables improved reconstruction of a 119–28. transcriptome from RNA-seq reads, Nat. Biotechnol., 33, 290–5. 8. Gan, X., Hay, A., Kwantes, M., et al. 2016, The Cardamine hirsuta 33. Stanke, M., Diekhans, M., Baertsch, R. and Haussler, D. 2008, Using genome offers insight into the evolution of morphological diversity, Nat. native and syntenically mapped cDNA alignments to improve de novo Plants, 2, 16167. gene finding, Bioinformatics, 24, 637–44. 9. Jiao, W. B., Accinelli, G. G., Hartwig, B., et al. 2017, Improving and cor- 34. Majoros, W. H., Pertea, M. and Salzberg, S. L. 2004, TigrScan and recting the contiguity of long-read genome assemblies of three plant spe- GlimmerHMM: two open source ab initio eukaryotic gene-finders, cies using optical mapping and chromosome conformation capture data, Bioinformatics, 20, 2878–9. Genome Res., 27, 778–86. 35. Yang, R., Jarvis, D. E., Chen, H., et al. 2013, The reference genome of the 10. Willing, E. M., Rawat, V., Manda ´ kova ´ , T., et al. 2015, Genome expan- halophytic plant Eutrema salsugineum, Front. Plant Sci., 4, 1–14. sion of Arabis alpina linked with retrotransposition and reduced symmet- 36. Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B., Wu, C. H. and ric DNA methylation, Nat. Plants, 1, nplants201423. UniProt, C. 2015, UniRef clusters: a comprehensive and scalable alterna- 11. Al-Shehbaz, I. A. and Warwick, S. I. 2005, A synopsis of Eutrema tive for improving sequence similarity searches, Bioinformatics, 31, (Brassicaceae), Harvard Pap. Bot., 10, 129–35. 926–32. 12. Hao, G. Q., Al-Shehba, I. A., Ahani, H., et al. 2017, An integrative study 37. Iwata, H. and Gotoh, O. 2012, Benchmarking spliced alignment pro- of evolutionary diversification of Eutrema (Eutremeae, Brassicaceae), Bot. grams including Spaln2, an extended version of Spaln that incorporates J. Linn. Soc., 184, 204–23. additional species-specific features, Nucleic Acids Res., 40, e161. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 X. Guo et al. 315 38. Camacho, C., Coulouris, G., Avagyan, V., et al. 2009, BLASTþ: architec- 56. Sarris, P. F., Cevik, V., Dagdas, G., Jones, J. D. and Krasileva, K. V. ture and applications, BMC Bioinformatics, 10, 421. 2016, Comparative analysis of plant immune receptor architectures 39. UniProt, C. 2015, UniProt: a hub for protein information, Nucleic Acids uncovers host proteins likely targeted by pathogens, BMC Biol., 14,8. Res., 43, D204–12. 57. Dinant, S., Clark, A. M., Zhu, Y., et al. 2003, Diversity of the superfamily 40. Lamesch, P., Berardini, T. Z., Li, D., et al. 2012, The Arabidopsis of phloem lectins (phloem protein 2) in angiosperms, Plant Physiol., 131, Information Resource (TAIR): improved gene annotation and new tools, 114–28. Nucleic Acids Res., 40, D1202–10. 58. Hu, Z., Cools, T. and De Veylder, L. 2016, Mechanisms used by plants to 41. Jones, P., Binns, D., Chang, H. Y., et al. 2014, InterProScan 5: cope with DNA damage, Annu. Rev. Plant Biol., 67, 439–62. genome-scale protein function classification, Bioinformatics, 30, 59. Adachi, S., Minamisawa, K., Okushima, Y., et al. 2011, Programmed 1236–40. induction of endoreduplication by DNA double-strand breaks in 42. Huntley, R. P., Sawford, T., Mutowo-Meullenet, P., et al. 2015, The Arabidopsis, Proc. Natl. Acad. Sci. USA., 108, 10004–9. GOA database: gene Ontology annotation updates for 2015, Nucleic 60. Schubert, V. 2009, SMC proteins and their multiple functions in higher Acids Res., 43, D1057–63. plants, Cytogenet. Genome Res., 124, 202–14. 43. Li, L., Stoeckert, C. J. and Roos, D. S. 2003, OrthoMCL: identification of 61. Fursova, O. V., Pogorelko, G. V. and Tarasov, V. A. 2009, Identification ortholog groups for eukaryotic genomes, Genome Res., 13, 2178–89. of ICE2, a gene involved in cold acclimation which determines freezing 44. Han, M. V., Thomas, G. W., Lugo-Martinez, J. and Hahn, M. W. 2013, tolerance in Arabidopsis thaliana, Gene, 429, 98–103. Estimating gene gain and loss rates in the presence of error in genome 62. Wu, J., Lightner, J., Warwick, N. and Browse, J. 1997, Low-temperature assembly and annotation using CAFE 3, Mol. Biol. Evol., 30, 1987–97. damage and subsequent recovery of fab1 mutant Arabidopsis exposed to 45. Tian, T., Liu, Y., Yan, H., et al. 2017, AgriGO v2.0: a GO analysis toolkit 2 [deg] C, Plant Physiol., 113, 347–56. for the agricultural community, 2017 update, Nucleic Acids Res., 43, 63. Gao, J., Wallis, J. G. and Browse, J. 2015, Mutations in the prokaryotic W122–9. pathway rescue the fatty acid biosynthesis1 mutant in the cold, Plant 46. R Core Team. 2016, R: a language and environment for statistical com- Physiol., 169, 442–52. puting. In: R Foundation for Statistical Computing. Vienna, Austria. 64. Sung, S. and Amasino, R. M. 2004, Vernalization in Arabidopsis thaliana https://www.R-project.org/. is mediated by the PHD finger protein VIN3, Nature, 427, 159–64. 47. Stamatakis, A. 2014, RAxML version 8: a tool for phylogenetic analysis 65. Haudry, A., Platts, A. E., Vello, E., et al. 2013, An atlas of over 90,000 and post-analysis of large phylogenies, Bioinformatics, 30, 1312–3. conserved noncoding sequences provides insight into crucifer regulatory 48. Loytynoja, A. and Goldman, N. 2008, Phylogeny-aware gap placement regions, Nat. Genet., 45, 891–8. prevents errors in sequence alignment and evolutionary analysis, Science, 66. Gu ¨ nther, T., Lampei, C., Barilar, I. and Schmid, K. J. 2016, Genomic and 320, 1632–5. phenotypic differentiation of Arabidopsis thaliana along altitudinal gra- 49. Suyama, M., Torrents, D. and Bork, P. 2006, PAL2NAL: robust conver- dients in the North Italian Alps, Mol. Ecol., 25, 3574–92. sion of protein sequence alignments into the corresponding codon align- 67. De Smet, R., Adams, K. L., Vandepoele, K., Van Montagu, M. C., Maere, ments, Nucleic Acids Res., 34, W609–12. S. and Van de Peer, Y. 2013, Convergent gene loss following gene and 50. Yang, Z. 2007, PAML 4: phylogenetic analysis by maximum likelihood, genome duplications creates single-copy families in flowering plants, Proc. Mol. Biol. Evol., 24, 1586–91. Natl. Acad. Sci. USA., 110, 2898–903. 51. Guo, X. Y., Liu, J. Q., Hao, G. Q., et al. 2017, Plastome phylogeny and 68. Losada, A., Yokochi, T., Kobayashi, R. and Hirano, T. 2000, early diversification of Brassicaceae, BMC Genomics, 176, 1–9. Identification and characterization of SA/Scc3p subunits in the Xenopus 52. Wang, Y., Tang, H., DeBarry, J. D., et al. 2012, MCScanX: a toolkit for and human cohesin complexes, J. Cell Biol., 150, 405–16. detection and evolutionary analysis of gene synteny and collinearity, 69. Losada, A., Yokochi, T. and Hirano, T. 2005, Functional contribution of Nucleic Acids Res., 40, e49. Pds5 to cohesin-mediated cohesion in human cells and Xenopus egg 53. Li, H. and Durbin, R. 2011, Inference of human population history from extracts, J. Cell Sci., 118, 2133–41. individual whole-genome sequences, Nature, 475, 493–6. 70. Pezic, D., Weeks, S. L. and Hadjur, S. 2017, More to cohesin than meets 54. Li, H. and Durbin, R. 2009, Fast and accurate short read alignment with the eye: complex diversity for fine-tuning of function, Curr. Opin. Genet. Burrows-Wheeler transform, Bioinformatics, 25, 1754–60. Dev., 43, 93–100. 55. Li, H. 2011, A statistical framework for SNP calling, mutation discovery, 71. Sumara, I., Vorlaufer, E., Gieffers, C., Peters, B. H. and Peters, J. M. association mapping and population genetical parameter estimation from 2000, Characterization of vertebrate cohesin complexes and their regula- sequencing data, Bioinformatics, 27, 2987–93. tion in prophase, J. Cell Biol., 151, 749–62. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png DNA Research Oxford University Press

The genomes of two Eutrema species provide insight into plant adaptation to high altitudes

Free
9 pages

Loading next page...
 
/lp/ou_press/the-genomes-of-two-eutrema-species-provide-insight-into-plant-XUltlmjhxF
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
ISSN
1340-2838
eISSN
1756-1663
D.O.I.
10.1093/dnares/dsy003
Publisher site
See Article on Publisher Site

Abstract

Eutrema is a genus in the Brassicaceae, which includes species of scientific and economic importance. Many Eutrema species are montane and/or alpine species that arose very recently, making them ideal candidates for comparative studies to understand both ecological speciation and high-altitude adaptation in plants. Here we provide de novo whole-genome assemblies for a pair of recently diverged perennials with contrasting altitude preferences, the high-altitude E. heterophyllum from the eastern Qinghai-Tibet Plateau and its lowland congener E. yunna- nense. The two assembled genomes are 350 Mb and 412 Mb, respectively, with 29,606 and 28,881 predicted genes. Comparative analysis of the two species revealed contrasting demo- graphic trajectories and evolution of gene families. Gene family expansions shared between E. heterophyllum and other alpine species were identified, including the disease resistance R genes (NBS-LRRs or NLRs). Genes that are duplicated specifically in the high-altitude E. hetero- phyllum are involved mainly in reproduction, DNA damage repair and cold tolerance. The two Eutrema genomes reported here constitute important genetic resources for diverse studies, including the evolution of the genus Eutrema, of the Brassicaceae as a whole and of alpine plants across the world. Key words: Eutrema, high-altitude adaptation, de novo assembly, comparative genomics 1. Introduction to high-altitude conditions remain largely unknown due to a lack of Adaptation to high altitude has contributed to the high level of plant appropriate genome resources. and animal species biodiversity found on numerous mountains Species of the Brassicaceae family have a wide range of global dis- across the world. It is well known that animals (including humans) tributions and adaptations, as well as different life histories and mat- living in highlands have evolved unique genetic adaptations in order ing systems, and they therefore comprise an ideal system for many 2 4,5 to maintain oxygen homeostasis. In plants, UV-B tolerance and cold types of study. Genome resources are critically important for all 3 6,7 tolerance appear to be critical for survival at high altitude. these studies. Recent advances in high-throughput sequencing However, the genetic mechanisms by which immobile plants adapt technology have facilitated the ability of the scientific community to V C The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 307 Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 308 Eutrema genomes and plant high-altitude adaptation 13 14 sequence the genomes of an increasing number of Brassicaceae spe- Trimmomatic v0.33, error correction using BFC (version r181) 8–10 15 cies for multiple purposes. A previous genome study of an alpine and mate-pair data deduplication using FastUniq v1.1. We then polyploid Brassicaceae species suggested that expansion of numerous used SOAPec v2.01 to estimate the genome size as well as the level gene families was correlated with increased tolerance of cold and of of genome-wide heterozygosity, which was based on the 17-mer fre- 3 17 UV-B stress. However, such expansions may have arisen purely quency distribution. We adopted the Platanus pipeline, which can from genome duplications rather than because of high-altitude efficiently assemble genomes with various levels of heterozygosity. adaptations. Both genomes were initially assembled into scaffolds using Platanus Eutrema is a genus in the Brassicaceae that contains more than 30 (version 1.2.2) with the first round of gap closing. An additional species, most of which are distributed in eastern Asia. In addition to gap closing procedure was performed with GapCloser (version the model plant E. salsugineum (Pall.) Al-Shehbaz & Warwick, 1.12). Finally we evaluated genome assemblies for gene coverage 18 19 which is used for studies of abiotic stress, and the commercial crop using the CEGMA and BUSCO (version 2.0) pipeline. wasabi, E. japonicum (Miq.) Koidz, the genus contains numerous montane and/or alpine species. Most of these species occur in the 2.3. Repeat identification Qinghai-Tibet Plateau (QTP) and have originated recently in To structurally annotate repeat sequences in the two Eutrema response to geologic and climatic changes since the Pliocene. In this genomes, we used the TEdenovo pipeline from the REPET package study, we report the de novo genome sequence assembly and compa- (v2.5) with default parameters. The pipeline performed a self- rative genomic analysis of an alpine plant, E. heterophyllum, from comparison for the input genome using the BLASTER software (ver- the eastern QTP, and its lowland congener, E. yunnanense, both of sion 2.25) to identify and classify repeated elements. In order to which originated from a very recent common ancestor and are dip- attain more comprehensive prediction of TEs, we employed the pipe- loid perennials with chromosome numbers of 2n¼ 14 and similar line to carry out structural detection analysis using LTRharvest (ver- selfing breeding systems, therefore providing an ideal basis for 22 23 24 sion 1.5.8). RECON v1.0.8, PILER v1.0 and GROUPER genetic comparisons without the complicating effects of either 21 v2.27 were then employed to cluster the matches detected. By per- genome duplication or phylogenetic evolution. In addition, these two 25 forming a multiple sequence alignment using the Map algorithm, a genome assemblies provide important genetic resources for future consensus sequence for each cluster was generated to represent the comparative studies of this genus and family and of high-altitude ancestral TE. These consensus sequences were then classified by adaptation in alpine plants. looking for characteristic structural features and similarities to known TEs in Repbase (20.05), and by scanning against the Pfam 27 28 library with HMMER3. To annotate all TEs across the genome, 2. Materials and methods we further employed the TEannot pipeline with the default param- 2.1. Plant materials, genomic DNA and total RNA eters. This pipeline mines the genome sequence using repeated extraction sequences identified in the previous TEdenovo pipeline to produce classified non-redundant consensus repeat sequences along with We selected and sequenced a pair of Eutrema species that met three short simple repeats, which are exported in GFF3 format. We fol- criteria: (1) diploid (2n¼ 14), (2) having diverged recently and (3) growing at contrasting altitudes in their natural distributions lowed the method of Maumus and Quesneville to calculate the age (Supplementary Fig. S1). One individual of E. heterophyllum was of repetitive DNAs. collected in June 2015 at the Zhuodala pass (4,700 m above sea level; asl) in Ganzi County, Sichuan Province, China. Fresh and 2.4. Gene prediction healthy leaves, flowers and root were harvested and immediately fro- A combination of ab initio, homology-based and transcript-based zen in liquid nitrogen, followed by storage at 80 C in the labora- methods was used for gene prediction. We first built a comprehensive tory prior to DNA/RNA extraction. For comparison, fresh samples 30 transcriptome database with the PASA pipeline. After quality con- of leaf, leafstalk and root tissues were obtained from one E. yunna- 13 trol with Trimmomatic v0.33, Illumina RNA-seq reads were nense plant collected in April 2014, from Xinning County (600 m 31 assembled into de novo transcripts using Trinity. Then two sets of asl), Hunan Province, China. We used the CTAB method to extract genome-guided transcripts were built using (1) the genome-guided high-quality genomic DNA from 5 g of leaf tissues for each species. mode implemented in Trinity and (2) the HISAT-StringTie pipe- The quality of the high-molecular weight DNA was checked by 1% 32 33 line. We performed ab initio prediction with Augustus (v3.2.2) agarose gel electrophoresis. Around 200 lg of genomic DNA was using Arabidopsis thaliana as the species model and our comprehen- used for library construction. We extracted total RNA with RNAiso sive transcriptome dataset as the input cDNA hint. We used Plus reagent (Takara, Japan) for each tissue collected from both spe- GlimmerHMM (v3.0.4) with the pre-trained directory for A. thali- cies (i.e. leaves, flowers and roots for E. heterophyllum, and leaves, ana genes to generate another set of ab initio predictions. For homol- leafstalks and roots for E. yunnanense). The quality of the extracted ogy-based prediction, protein sequences of 26,531 E. salsugineum RNA was verified by 1% agarose gel. DNase treatment was per- 35 36 genes as well as the UniRef90 dataset for all Brassicaceae species formed before library construction. For each sample, 5 lg of the total 37 were aligned to the genome assemblies using SPALN2. Gene mod- high-quality RNA was used for sequencing. els from the three main sources (i.e. aligned transcripts, ab initio pre- dictions and aligned proteins) were merged to produce consensus 2.2. Genome sequencing and assembly models by EVidenceModeler. Weights of evidences for gene models We constructed Illumina libraries with small (200-, 500- and 800- were manually set as: ab initio predictions, weight (Augustus)¼ 1 bp) and large (2-, 5-, 10- and 20-kb) inserts and then sequenced and weight (GlimmerHMM)¼ 1; protein alignments, weight (E. sal- them following the Illumina protocols (Illumina, San Diego, CA) sugineum) ¼ 2 and weight (Brassicaceae_UniRef90)¼ 1; transcripts, using the Illumina HiSeq X Ten platform at Novogene (Tianjin, weight (PASA)¼ 10. The EVM consensus predictions were updated China). Short reads were first subjected to quality filtering using by performing an additional round of PASA annotation in order to Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 X. Guo et al. 309 add UTR and alternatively spliced isoform annotations to gene with the same parameters. The results were combined and plotted models. using the plot tool in PSMC. A generation time of 3 years and a mutation rate of 7 10 per site per year were applied. 2.5. Functional annotation 38 5 2.8. NLR genes and integrated domains We used BlastP (with E-value 1 10 ) to assign functional descriptions by carrying out sequence homology searches against dif- NLR (nucleotide-binding site with leucine-rich repeat) genes in the E. ferent sources, including protein datasets from SwissProt and the heterophyllum and E. yunnanense genomes were directly extracted Arabidopsis genome annotation version TAIR10 . Motifs and from the InterProScan results using the Pfam domain of the domains within gene models were identified by deploying nucleotide-binding adaptor shared by APAF-1, R proteins and CED- InterProScan (version 5.20.29) with the options -dp -goterms -ipr- 4 (NB-ARC, PF00931) as query. We further applied the same lookup -pathways -t p against multiple publicly available databases strategy to identify NLRs in protein sequences from 13 cruciferous (Pfam, PROSITE, PRINTS, SMART, TIGRFAMs, Gene3D, species with genomes available. After excluding typical Pfam PANTHER, etc.). We used AHRD (https://github.com/groupschoof/ domains contained in NLRs (i.e. PF01582 for Toll/interleukin-1 AHRD (24 January 2018, date last accessed)) to gather information receptor [TIR], PF13855 for leucine-rich repeat [LRR] and PF05659 from the search results and added Gene Ontology (GO) information for Resistance to Powdery Mildew 8 [RPW8]), we obtained a list of using the GO annotation database for A. thaliana (https://www.ebi. additional domains that were likely to be fused with NLR proteins. ac.uk/GOA/arabidopsis_release; (24 January 2018, date last accessed)). We did not detect TIR2 (PF13676) in our annotated NLRs; this is a TIR domain newly identified among NLR proteins. However, most previously characterized gene fusions were recaptured in this study, 2.6. Comparative genomic analysis indicating that the approach we used for domain searching was We downloaded the protein-coding genes of 13 Brassicaceae species conservative and generally reliable. A maximum likelihood tree of (A. thaliana, A. lyrata, Brassica rapa, Capsella rubella, phloem protein 2 domains was constructed with RAxML under the Leavenworthia alabamica, Sisymbrium irio, Aethionema arabicum, PROTGAMMA model with BLOSUM62 matrix, specifying 500 boot- Arabis alpina, Raphanus raphanistrum, Cardamine hirsuta, strap replicates. The gene tree was visualized in FigTree v1.4.2 (http:// Schrenkiella parvula, Thlaspi arvense, E. salsugineum; see 43 tree.bio.ed.ac.uk/software/figtree/ (24 January 2018, date last accessed)). Supplementary Table S4). We used OrthoMCL to delineate gene families and cluster all genes into paralogous and orthologous groups. Gene family expansion and contraction analysis was per- 2.9. Positive selection in SCC3 genes formed with CAFE3 software. Functional enrichment analysis was The ratio of non-synonymous substitutions per non-synonymous site 45 46 carried out by agriGO v2.0. We used R to perform clustering (dN) to synonymous substitutions per synonymous site (dS), or x,is analysis for the gene families of the greatest size in E. heterophyllum. a commonly used indicator of selective pressure acting on protein- The transformed z-score profile was obtained from the number of coding genes, with x> 1 representing positive selection, x¼ 1 repre- genes per species for each family, and the hierarchically clustered senting neutral evolution and x< 1 representing purifying selection. (complete linkage clustering) ones, using Pearson correlation as a dis- To identify signals of positive selection, we extracted SCC3 sequen- tance measure. The 2,235 single-copy gene families were used to ces from all Brassicaceae genomes. A gene tree of aligned protein reconstruct phylogenies with RAxML. Alignment of protein sequences was constructed using RAxML under the PROTGAMMA sequences was performed with PRANK and they were then back model with BLOSUM62 matrix, specifying 500 bootstrap replicates. translated into coding sequences with PAL2NAL. The MCMCtree We used codeml in the PAML software package (v4.9a) to perform program within the PAML package v4.9a was used to estimate the branch-site test among duplicated SCC3 paralogs for a series of divergence time. The split of the two major lineages within the pairs of alternative versus null models, aiming to find out whether or Brassicaceae was constrained to be between 20 and 30 million years not any proportion of coding sites within each paralog has x> 1 51 52 ago. We used MCScanX to call intra- and inter-species gene colli- compared with the background level. Log-likelihood values for each nearity. Synonymous substitutions were calculated for aligned gene pair of nested models were compared using the likelihood ratio test pairs within identified collinearity blocks using the codeml program (LRT) to obtain the P-value of significance. implemented in PAML. 3. Results 2.7. SNP calling and demographic history 3.1. Genome assembly and annotation reconstruction After Illumina sequencing and quality control, we obtained 111.87 We used the pairwise sequentially Markovian coalescent (PSMC) model to estimate the history of effective population sizes (Ne) and 97.29 Gb of Illumina short reads from seven sequencing libra- based on genome-wide diploid sequence data. This method has been ries, corresponding to an estimated 249 and 230 base-pair cover- previously applied to human and other vertebrates, as well as plants, age, for E. heterophyllum and E. yunnanense, respectively (Supplementary Table S1). The estimated genome size of 400 Mb for inferring demographic histories over long evolutionary periods. The quality-filtered Illumina pair end reads with insert sizes of for both species is larger than that of their congener E. sal- 500 bp for each species were mapped to the corresponding genome sugineum, with a higher level of genome-wide heterozygosity in the assemblies using BWA (version 0.7.10-r789). For SNP calling, we low-altitude species E. yunnanense (Supplementary Fig. S2). A de used the SAMtools (v1.4) pipeline. The settings for PSMC analysis novo assembly pipeline allowed us to generate genome assemblies (-p and -t options) were chosen manually according to suggestions that captured 350 and 413 Mb in 149,415 and 317,771 contigs for given by the authors (https://github.com/lh3/psmc (24 January 2018, the two Eutrema genomes, of which 328 and 386 Mb (81% and date last accessed)). We also performed 100 rounds of bootstrapping 91% of the estimated genome sizes) were represented in scaffolds of Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 310 Eutrema genomes and plant high-altitude adaptation 1 kb or greater (Table 1). Repetitive DNA constitutes 67% and 70% (Table 1). Clustering of predicted proteomes for 15 Brassicaceae of, respectively, the E. heterophyllum and the E. yunnanense genomes yielded a total of 27,193 gene families, which comprised genome. Long terminal repeat (LTR) retrotransposons are the most 380,484 genes (85% of all 448,886 genes; see Supplementary abundant among these repetitive sequences (Supplementary Table S4). Table S2), having contributed to a recent wave of TE burst (Kimura divergence 5, Supplementary Fig. S3). Gene completeness analysis 3.2. Comparative phylogenomics and demographic revealed that we had achieved more than 95% completeness of gene histories coverage for both genomes (Supplementary Table S3). We predicted We analysed the phylogenetic relationships of 15 Brassicaceae spe- 29,606 genes for E. heterophyllum and 28,881 for E. yunnanense cies (Supplementary Table S4) using four-fold degenerate third- codon transversion (4DTv) sites in 2,193 one-to-one orthologous Table 1. Genome assembly and annotation statistics for two genes identified across their genomes. The nuclear-genome phylog- Eutrema species eny (Fig. 1; Supplementary Fig. S4) is congruent with the plastome phylogeny reported previously. We dated the divergence of the two E. heterophyllum E. yunnanense species to 3.5 (1.5–7.2) Mya (Fig. 1; Supplementary Fig. S5). Assembly statistics Clustering analysis of gene family size revealed that many duplicated Estimated genome size (Mb) 405 423 genes were retained in genomes known to be polyploid (Fig. 1). We Number of contigs 149,415 317,771 failed to detect any additional whole genome duplication in the two Contig length (Mb) 350.47 412.62 Eutrema species when compared with A. thaliana using synonymous Longest contig (Mb) 0.67 0.39 substitutions per synonymous site (dS) age distribution analysis Contig L50 (seqs) 1264 3269 (Supplementary Fig. S6). We identified 581,883 and 807,391 hetero- Contig N50 (kb) 74.65 32.56 zygous sites in, respectively, the E. heterophyllum and the E. yunna- Number of scaffolds (>1 kb) 7,690 14,031 nense genome. Consistent with the results of analysing K-mer Scaffold length (Mb) 328.16 385.58 frequency distribution (Supplementary Fig. S2), E. heterophyllum Longest scaffold (Mb) 3.5 2.07 Scaffold L50 (seqs) 189 341 had a lower heterozygous SNP rate (1.75 10 ) than E. yunna- Scaffold N50 (kb) 535.32 331.5 nense (2.18 10 ). A sharp peak was observed in the distribution Annotation statistics of genome-wide SNP density for E. heterophyllum (Fig. 2A), suggest- Number of gene models 29,606 28,881 ing that mutations within this genome are more evenly distributed With InterPro annotations 22,593 22 343 than those in the E. yunnanense genome. PSMC analysis based on With GO annotations 20,879 20,254 the local SNP densities of heterozygous sites revealed distinct demo- Mean CDS size (bp) 1201.3 1222.3 graphic histories for the two species, in that the high-altitude E. het- Mean exon size (bp) 291.5 301.3 erophyllum showed a decrease-increase-decrease trend while the Mean intron size (bp) 233.0 210.6 population size of the low-altitude E. yunnanense decreased continu- ously in the recent past (Fig. 2B). Figure 1. Dated phylogeny and gene family analysis of Brassicaceae species. Pie charts showing gains (red) and losses (blue) of gene families are indicated along branches and nodes. The numbers of gene families, orphans (single-copy gene families) and predicted genes, as well as a heat map depicting the hier- archically clustered z-score profile for gene numbers within each family, is indicated next to each species. Background colours on the phylogenetic tree (top to bottom) are basal clade, lineage II and lineage I Brassicaceae species, respectively. Full species names: Aethionema arabicum, Arabis alpina, Eutrema yunna- nense, Eutrema heterophyllum, Eutrema salsugineum, Thlaspi arvense, Schrenkiella parvula, Sisymbrium irio, Raphanus raphanistrum, Brassica rapa, Cardamine hirsuta, Leavenworthia alabamica, Capsella rubella, Arabidopsis thaliana, Arabidopsis lyrata. The star and triangle denote lineage-specific whole- genome triplications. Mya, million years ago. See online version for color display. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 X. Guo et al. 311 Figure 2. SNP density distribution and demographic history of two Eutrema species. (A) Distribution of SNP density across each Eutrema genome. Heterozygous SNPs were counted and used to calculate heterozygosity density in non-overlapping 50-kb windows. (B) PSMC inference of ancestral effective population size change in the two species. The estimated ancestral population sizes are represented by central bold lines, and the PSMC estimations of 100 bootstraps resampled from the original sequence are represented by thin curves surrounding each line. A generation time of 3 years and a mutation rate of 7  10 per site per year were assumed for both species. Dark grey and abbreviations indicate glaciation periods, while light grey corresponds to interglacia- tion. BG (LGP), Baiyu Glaciation (last glaciation period, 70  10 thousand years ago, kya); GG, Guxiang Glaciation (300  130 kya); NG, Naynayxungla Glaciation (780  500 kya); XG, Xixiabangma Glaciation (1,170  800 kya). Red arrow denotes the estimated divergence time of the two species. 3.3. Expansion of NLR genes and analysis of domain integration In comparison with other Brassicaceae species, we found that 141 and 178 gene families had significantly expanded (P< 0.05) in the E. heterophyllum and E. yunnanense genomes, respectively (Supplementary Tables S5–S8). Among the gene families that had expanded within the high-altitude E. heterophyllum, GO terms including ‘response to stress’ and ‘nucleotide binding’ (adjusted P< 0.05, Supplementary Fig. S7) were overrepresented, while no particular GO term was enriched in the case of the low-altitude E. yunnanense. The terms enriched in E. heterophyllum could be largely attributed to 10 clusters of disease resistance genes encoding nucleotide-binding site with leucine-rich repeat (NLR) proteins (Supplementary Table S6). We noticed that most of these NLR gene families had also expanded in the genomes of three other Figure 3. Shared gene family expansions among four Brassicaceae species. Brassicaceae species, A. alpina, A. lyrata and C. hirsuta, of which the Venn diagram showing overlap of shared gene families and of the nuclear- former two are well documented as alpine plants (Fig. 3). Two NLR binding site with leucine-rich repeats (NLR) genes. Numbers of significantly expanded gene families among the four genomes that contain the largest families shared across these species account for 20% of the total number of NLRs are shown. Counts of the significantly expanded NLR gene number of NLR genes that differ between E. heterophyllum and families are given in parentheses. E. yunnanense (Supplementary Table S5). Overall, the gene families expanded in E. heterophyllum overlap significantly with those into the same gene family that had undergone expansion in all these expanded in each of these three species (P< 0.01, Fisher’s exact test) genomes (Supplementary Table S5). Obviously, the fusion of PP2 (Fig. 3; Supplementary Fig. S8). These results imply that parallel arose independently in these three species (Fig. 4), probably in expansion of gene sets, particularly within the NLR genes, may have response to the same class of pathogens. repeatedly occurred in alpine plants. Plant NLR genes can fuse with additional domains (hereafter 3.4. Species-specific gene duplications and referred to as NB-IDs) that serve as ‘baits’ for pathogen-derived mol- ecules. We identified 112 NB-IDs with 82 distinct integrated high-altitude adaptation domains across the Brassicaceae genomes (Supplementary Fig. S9; We also analysed the functional enrichment of genes that were dupli- Supplementary Table S9). Within the fused domains, 22 were present cated specifically in the high- or low-altitude species. In the high- in at least two genomes (Supplementary Table S10). Interestingly, in altitude E. heterophyllum, we observed 30 enriched GO terms within E. heterophyllum, we found that integration of one domain encoding the biological process ontology, including ‘response to stimulus’, phloem protein 2 (PP2, Pfam ID: PF14299), characteristic of a group ‘response to stress’, ‘reproductive process’ and ‘cell cycle’. However, of plant lectins involved in responses to various stresses, was only four terms within the biological process ontology, ‘developmen- shared by two other species, A. alpina and C. hirsuta (Fig. 4). It is tal process’, ‘anatomical structure development’, ‘multicellular worth noting that the NLRs fused with this domain were clustered organism development’ and ‘multicellular organismal process’, were Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 312 Eutrema genomes and plant high-altitude adaptation identified (Supplementary Fig. S10). Of these, one fragment was found within gene pairs that are collinear between the two Eutrema species (Supplementary Fig. S11). These findings suggest that dupli- cation of SCC3 arose before the divergence of these species and sub- sequent pseudogenization may have occurred in E. yunnanense. Three genes, ICE2, FAB1 and VIN3, involved in cold stress was also observed to be duplicated in the high-altitude E. heterophyllum (Supplementary Table S11). ICE2 encodes a transcription factor that confers rapid freezing tolerance in A. thaliana. The product of FAB1 is involved in membrane lipid synthesis and photosynthesis collapses in the Arabidopsis fab1 mutant due to degradation of chloroplasts when plants are exposed to low temperature for long 62,63 periods. VIN3, which is induced by prolonged winter cold, is essential for the repression of FLC during vernalization. 4. Discussion In this study, we performed de novo genome assemblies for E. heter- ophyllum and E. yunnanense, a pair of diploid perennials, which have diverged recently but have contrasting altitude preferences. Compared with previously published Brassicaceae genomes that 8,65 were assembled from Illumina short reads, our genome assem- blies were of high quality in terms of continuity and gene coverage, Figure 4. Phylogeny of the phloem protein 2 (PP2) domain fused with making them suitable for subsequent comparative genomic analysis expanded NLR genes. Sequences from all A. thaliana proteins containing and thus representing important genetic resources. Phylogenomic this domain and the integrated PP2 detected in three Brassicaceae genomes analysis confirmed the close relationship and the diploid nature of (labelled with numbers within circles) were analyzed. Branches are coloured these two Eutrema species. However, the contrasting demographic red if a species has a documented high-altitude distribution, and green if histories of the two species may reflect their different responses to the not. The structure of NLR fused with PP2 is shown below the gene tree. See local climatic oscillations that occurred during the Quaternary. The online version for color display. dramatic genomic differences observed here may have partly contrib- uted to the high- or low-altitude adaptations that they exhibit. recovered for the low-altitude E. yunnanense (Supplementary Tables The NLRs are among the most variable genes across plant species. S11–S14). Variation in these genes between high- and low-altitude A. thaliana A close examination of the families expanded in E. heterophyllum populations was discovered recently. We found an increase in the revealed multiple genes that are involved in plant reproduction and number of gene copies for some NLR groups in E. heterophyllum DNA damage repair (DDR), probably in response to UV-B radiation and other alpine species of the family Brassicaceae. In addition, we and other abiotic stresses imposed by the harsh alpine environment. identified a fused domain encoding PP2 among these expanded NLR For example, more copies were found for NBS1, RPA32 and SNI1, genes and confirmed the independent acquisition of this domain in which are involved in homologous recombination (HR) using the different species. According to a recent comprehensive study of fused intact sister chromatid, and KU80 and XRCC4, which participate domains in plant species, this domain also exists in NLRs of in the non-homologous end-joining (NHEJ) process (Fig. 5A, Malvaceae and Poaceae species. Thus members of the PP2 family Supplementary Table S11); HR and NHEJ are the two major pathways have fused with NLR genes multiple times in different angiosperm for the repair of double-strand breaks. Duplicated genes involved in lineages. Further studies are needed to investigate whether and how the regulation of the cell cycle (SIM, ATXR5 and ICK3)werealso plant immune receptors are involved in high-altitude adaptation in detected for E. heterophyllum (Fig. 5A, Supplementary Table S11); other alpine species. SIM is a plant-specific CDK inhibitor binding CDKB1; 1, probably In the high-altitude E. heterophyllum, we found that many genes specifically controlling endocycle onset during the G2/M transition. related to reproduction and DDR had undergone duplications, prob- Of the genes with the largest family sizes (z-score normalized) in ably in response to abiotic stress. It should be noted that most of E. heterophyllum, two (SCC3 and SMC3) encode subunits of the these reproduction-related genes remain as single copies in most cohesin complex, a well-known ring-shaped molecule with a critical 60 67 angiosperms. In addition, two genes in this category (SCC3 and role in sister chromatid cohesion. In E. heterophyllum, we found SMC3) that have undergone duplication events encode cohesin subu- four copies of SCC3 and three ones of SMC3 (Fig. 5B). Most of these nits, and similar copy number variations have also been found in ani- copies were expressed in leaf, root and/or flower tissue (Fig. 5B). A duplication of SCC3 was also found in A. lyrata, another alpine cru- mals, where they may cause distinct functional changes in the 68–71 cifer species (Fig. 6). Using the branch-site model, we showed that cohesin complex. To the best of our knowledge, this is the first duplicated SCC3 copies were under positive selection (Fig. 6). The report of duplication and possible subfunctionalization of cohesin average pairwise identity between the amino acid sequences of the subunits in plants. We have further shown that duplications of SCC3 in Eutrema species occurred early, before their divergence. Although SCC3 genes in the high-altitude E. heterophyllum is only 61.1%, we cannot exclude the possibility that the apparent multiple partial indicating that the duplication of this gene may have occurred very early. Intriguingly, although only one intact SCC3 gene was found in SCC3 sequences identified in E. yunnanense may be due to the the low-altitude E. yunnanense, multiple SCC3 fragments were fragmented nature of the draft genome, the evidence for Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 X. Guo et al. 313 Figure 5. Duplication of genes related to DDR in E. heterophyllum. (A) Genes involved in the DDR pathway and cell cycle control that are duplicated in E. heterophyllum. Duplicated genes are shown in light brown. HR, homologous recombination; NHEJ, non-homologous end-joining. (B) A schematic picture of the plant cohesin complex indicating duplication of genes encoding two cohesin subunits, SCC3 and SMC3. The numbers of the corresponding genes in E. yunnanense and E. heterophyllum are shown in square brackets in the format [E. yunnanense, E. heterophyllum]. The expression profiles in FPKM (frag- ments per kilobase per million reads mapped) of paralogs are plotted as log values. See online version for color display. Figure 6. Phylogenetic relationships and positive selection of SCC3 genes in Brassicaceae species. Values above branches indicates the support values obtained from 500 bootstrap replicates. Scale bar represents the number of substitutions per site. The branches tested in branch-sites tests of selection for the EhSCC3 and AlSCC3 genes are indicated. The inset table shows the results of the branch-sites tests, with a P-value based on 1 degrees of freedom. ln L, log-likelihood values; H0, null model; H1, alternative model; v , chi-square test likelihood ratio (2Dln L). pseudogenization in this species is strong. The duplication, followed that we have developed for the two Eutrema species will be very use- by subsequent loss, of SCC3 in E. yunnanense may be correlated ful for many studies in this genus, the Brassicaceae as a whole, and with its initial occurrence (or its ancestral distribution) in the high- alpine plants in general. altitude QTP and subsequent migration to the lower-altitude region. As the cohesin complex has a profound impact on repro- duction as well as on gene regulation, duplication of these subunits Availability may have played an important role in plant adaptation to harsh The short reads of Illumina DNAseq and RNAseq generated in this alpine environments. In addition, duplication of genes related to cold study are available from the National Center for Biotechnology acclimation in the high-altitude E. heterophyllum may be correlated Information (NCBI) Short Read Archive (SRA) under BioProject with plant adaptation to the low and rapidly changing temperatures PRJNA419026 (BioSample SAMN08200758 for E. heterophyllum experienced in alpine habitats. and SAMN08200758 for E. yunnanense). The genome assemblies of In summary, we assembled de novo genomes of two Eutrema spe- this project have been deposited at DDBJ/ENA/GenBank under the cies and inferred genetic changes in the alpine species that may accession PKMM00000000 for E. heterophyllum and PKML00 underlie high-altitude adaptation, although further tests in more spe- cies using multiple approaches are needed. The genome resources 000000 for E. yunnanense. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 314 Eutrema genomes and plant high-altitude adaptation 13. Bolger, A. M., Lohse, M. and Usadel, B. 2014, Trimmomatic: a flexible Acknowledgements trimmer for Illumina sequence data, Bioinformatics, 30, 2114–20. We thank Bingbing Liu for his help in DNA extraction. 14. Li, H. 2015, BFC: correcting Illumina sequencing errors, Bioinformatics, 31, 2885–7. 15. Xu, H., Luo, X., Qian, J., et al. 2012, FastUniq: a fast de novo duplicates Funding removal tool for paired short reads, PLoS One, 7, e52249. 16. Luo, R., Liu, B., Xie, Y., et al. 2012, SOAPdenovo2: an empirically improved This work was supported by grants from the National Key Research and memory-efficient short-read de novo assembler, Gigascience, 1,18. Development program (2017YFC0505203), National Natural Science 17. Kajitani, R., Toshimoto, K., Noguchi, H., et al. 2014, Efficient de novo Foundation of China (31590821), National Key Project for Basic Research assembly of highly heterozygous genomes from whole-genome shotgun (2014CB954100), Youth Science and Technology Innovation Team of short reads, Genome Res., 24, 1384–95. Sichuan Province (2014TD003), and the 985 and 211 Projects of Sichuan 18. Parra, G., Bradnam, K. and Korf, I. 2007, CEGMA: a pipeline to accu- University. rately annotate core genes in eukaryotic genomes, Bioinformatics, 23, 1061–7. 19. Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. and Conflict of interest Zdobnov, E. M. 2015, BUSCO: assessing genome assembly and annotation The authors declare that they have no competing interests. completeness with single-copy orthologs, Bioinformatics, 31, 3210–2. 20. Flutre, T., Duprat, E., Feuillet, C. and Quesneville, H. 2011, Considering transposable element diversification in de novo annotation approaches, Authors’ contributions PLoS One, 6, e16526. 21. Quesneville, H., Bergman, C. M., Andrieu, O., et al. 2005, Combined evi- J.L. conceived the study. G.H. and X.G. collected samples. X.G. and Q.H. dence annotation of transposable elements in genome sequences, PLoS assembled the genomes and performed genome annotation and evolutionary Comput. Biol., 1, 166–75. analyses with the help of X.W, D.Z. and T.M. X.G. and J.L. wrote the manu- 22. Ellinghaus, D., Kurtz, S. and Willhoeft, U. 2008, LTRharvest, a efficient script. All authors read and approved the final manuscript. and flexible software for de novo detection of LTR retrotransposons, BMC Bioinformatics, 9, 18. 23. Bao, Z. and Eddy, S. R. 2002, Automated de novo identification of repeat Supplementary data sequence families in sequenced genomes, Genome Res., 12, 1269–76. Supplementary data are available at DNARES online. 24. Edgar, R. C. and Myers, E. W. 2005, PILER: identification and classifica- tion of genomic repeats, Bioinformatics, 21(suppl_1), i152–8. 25. Huang, X. 1994, On global sequence alignment, Comput. Appl. Biosci., References 10, 227–35. 26. Bao, W., Kojima, K. K. and Kohany, O. 2015, Repbase Update, a data- 1. Hughes, C. E. and Atchison, G. W. 2015, The ubiquity of alpine plant base of repetitive elements in eukaryotic genomes, Mob. DNA, 6, 11. radiations: from the Andes to the Hengduan Mountains, New Phytol., 27. Finn, R. D., Coggill, P., Eberhardt, R. Y., et al. 2016, The Pfam protein 207, 275–82. families database: towards a more sustainable future, Nucleic Acids Res., 2. Beall, C. M. 2014, Adaptation to high altitude: phenotypes and geno- 44, D279–85. types, Annu. Rev. Anthropol., 43, 251–72. 28. Eddy, S. R. 2008, A probabilistic model of local sequence alignment that 3. Zhang, J., Tian, Y., Yan, L., et al. 2016, Genome of plant maca simplifies statistical significance estimation, PLoS Comp. Biol., 4, (Lepidium meyenii) illuminates genomic basis for high-altitude adaptation e1000069. in the central Andes, Mol. Plant., 9, 1066–77. 29. Maumus, F. and Quesneville, H. 2014, Ancestral repeats have shaped epi- 4. Bailey, C. D., Koch, M. A., Mayer, M., et al. 2006, Toward a global phy- genome and genome composition for millions of years in Arabidopsis logeny of the Brassicaceae, Mol. Biol. Evol., 23, 2142–60. thaliana, Nat. Commun., 5, 4104. 5. Koch, M. A., Dobes, C., Kiefer, C., Schmickl, R., Klimes, L. and Lysak, 30. Haas, B. J., Salzberg, S. L., Zhu, W., et al. 2008, Automated eukaryotic M. A. 2006, Supernetwork identifies multiple events of plastid trn F gene structure annotation using EVidenceModeler and the Program to (GAA) pseudogene evolution in the Brassicaceae, Mol. Biol. Evol., 24, Assemble Spliced Alignments, Genome Biol., 9, R7. 63–73. 31. Grabherr, M. G., Haas, B. J., Yassour, M., et al. 2011, Full-length tran- 6. Koenig, D. and Weigel, D. 2015, Beyond the thale: comparative genomics scriptome assembly from RNA-Seq data without a reference genome, Nat. and genetics of Arabidopsis relatives, Nat. Rev. Genet., 16, 285. Biotechnol., 29, 644–52. 7. Nikolov, L. A. and Tsiantis, M. 2017, Using mustard genomes to explore 32. Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T. C., Mendell, J. T. the genetic basis of evolutionary change, Curr. Opin. Plant Biol., 36, and Salzberg, S. L. 2015, StringTie enables improved reconstruction of a 119–28. transcriptome from RNA-seq reads, Nat. Biotechnol., 33, 290–5. 8. Gan, X., Hay, A., Kwantes, M., et al. 2016, The Cardamine hirsuta 33. Stanke, M., Diekhans, M., Baertsch, R. and Haussler, D. 2008, Using genome offers insight into the evolution of morphological diversity, Nat. native and syntenically mapped cDNA alignments to improve de novo Plants, 2, 16167. gene finding, Bioinformatics, 24, 637–44. 9. Jiao, W. B., Accinelli, G. G., Hartwig, B., et al. 2017, Improving and cor- 34. Majoros, W. H., Pertea, M. and Salzberg, S. L. 2004, TigrScan and recting the contiguity of long-read genome assemblies of three plant spe- GlimmerHMM: two open source ab initio eukaryotic gene-finders, cies using optical mapping and chromosome conformation capture data, Bioinformatics, 20, 2878–9. Genome Res., 27, 778–86. 35. Yang, R., Jarvis, D. E., Chen, H., et al. 2013, The reference genome of the 10. Willing, E. M., Rawat, V., Manda ´ kova ´ , T., et al. 2015, Genome expan- halophytic plant Eutrema salsugineum, Front. Plant Sci., 4, 1–14. sion of Arabis alpina linked with retrotransposition and reduced symmet- 36. Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B., Wu, C. H. and ric DNA methylation, Nat. Plants, 1, nplants201423. UniProt, C. 2015, UniRef clusters: a comprehensive and scalable alterna- 11. Al-Shehbaz, I. A. and Warwick, S. I. 2005, A synopsis of Eutrema tive for improving sequence similarity searches, Bioinformatics, 31, (Brassicaceae), Harvard Pap. Bot., 10, 129–35. 926–32. 12. Hao, G. Q., Al-Shehba, I. A., Ahani, H., et al. 2017, An integrative study 37. Iwata, H. and Gotoh, O. 2012, Benchmarking spliced alignment pro- of evolutionary diversification of Eutrema (Eutremeae, Brassicaceae), Bot. grams including Spaln2, an extended version of Spaln that incorporates J. Linn. Soc., 184, 204–23. additional species-specific features, Nucleic Acids Res., 40, e161. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018 X. Guo et al. 315 38. Camacho, C., Coulouris, G., Avagyan, V., et al. 2009, BLASTþ: architec- 56. Sarris, P. F., Cevik, V., Dagdas, G., Jones, J. D. and Krasileva, K. V. ture and applications, BMC Bioinformatics, 10, 421. 2016, Comparative analysis of plant immune receptor architectures 39. UniProt, C. 2015, UniProt: a hub for protein information, Nucleic Acids uncovers host proteins likely targeted by pathogens, BMC Biol., 14,8. Res., 43, D204–12. 57. Dinant, S., Clark, A. M., Zhu, Y., et al. 2003, Diversity of the superfamily 40. Lamesch, P., Berardini, T. Z., Li, D., et al. 2012, The Arabidopsis of phloem lectins (phloem protein 2) in angiosperms, Plant Physiol., 131, Information Resource (TAIR): improved gene annotation and new tools, 114–28. Nucleic Acids Res., 40, D1202–10. 58. Hu, Z., Cools, T. and De Veylder, L. 2016, Mechanisms used by plants to 41. Jones, P., Binns, D., Chang, H. Y., et al. 2014, InterProScan 5: cope with DNA damage, Annu. Rev. Plant Biol., 67, 439–62. genome-scale protein function classification, Bioinformatics, 30, 59. Adachi, S., Minamisawa, K., Okushima, Y., et al. 2011, Programmed 1236–40. induction of endoreduplication by DNA double-strand breaks in 42. Huntley, R. P., Sawford, T., Mutowo-Meullenet, P., et al. 2015, The Arabidopsis, Proc. Natl. Acad. Sci. USA., 108, 10004–9. GOA database: gene Ontology annotation updates for 2015, Nucleic 60. Schubert, V. 2009, SMC proteins and their multiple functions in higher Acids Res., 43, D1057–63. plants, Cytogenet. Genome Res., 124, 202–14. 43. Li, L., Stoeckert, C. J. and Roos, D. S. 2003, OrthoMCL: identification of 61. Fursova, O. V., Pogorelko, G. V. and Tarasov, V. A. 2009, Identification ortholog groups for eukaryotic genomes, Genome Res., 13, 2178–89. of ICE2, a gene involved in cold acclimation which determines freezing 44. Han, M. V., Thomas, G. W., Lugo-Martinez, J. and Hahn, M. W. 2013, tolerance in Arabidopsis thaliana, Gene, 429, 98–103. Estimating gene gain and loss rates in the presence of error in genome 62. Wu, J., Lightner, J., Warwick, N. and Browse, J. 1997, Low-temperature assembly and annotation using CAFE 3, Mol. Biol. Evol., 30, 1987–97. damage and subsequent recovery of fab1 mutant Arabidopsis exposed to 45. Tian, T., Liu, Y., Yan, H., et al. 2017, AgriGO v2.0: a GO analysis toolkit 2 [deg] C, Plant Physiol., 113, 347–56. for the agricultural community, 2017 update, Nucleic Acids Res., 43, 63. Gao, J., Wallis, J. G. and Browse, J. 2015, Mutations in the prokaryotic W122–9. pathway rescue the fatty acid biosynthesis1 mutant in the cold, Plant 46. R Core Team. 2016, R: a language and environment for statistical com- Physiol., 169, 442–52. puting. In: R Foundation for Statistical Computing. Vienna, Austria. 64. Sung, S. and Amasino, R. M. 2004, Vernalization in Arabidopsis thaliana https://www.R-project.org/. is mediated by the PHD finger protein VIN3, Nature, 427, 159–64. 47. Stamatakis, A. 2014, RAxML version 8: a tool for phylogenetic analysis 65. Haudry, A., Platts, A. E., Vello, E., et al. 2013, An atlas of over 90,000 and post-analysis of large phylogenies, Bioinformatics, 30, 1312–3. conserved noncoding sequences provides insight into crucifer regulatory 48. Loytynoja, A. and Goldman, N. 2008, Phylogeny-aware gap placement regions, Nat. Genet., 45, 891–8. prevents errors in sequence alignment and evolutionary analysis, Science, 66. Gu ¨ nther, T., Lampei, C., Barilar, I. and Schmid, K. J. 2016, Genomic and 320, 1632–5. phenotypic differentiation of Arabidopsis thaliana along altitudinal gra- 49. Suyama, M., Torrents, D. and Bork, P. 2006, PAL2NAL: robust conver- dients in the North Italian Alps, Mol. Ecol., 25, 3574–92. sion of protein sequence alignments into the corresponding codon align- 67. De Smet, R., Adams, K. L., Vandepoele, K., Van Montagu, M. C., Maere, ments, Nucleic Acids Res., 34, W609–12. S. and Van de Peer, Y. 2013, Convergent gene loss following gene and 50. Yang, Z. 2007, PAML 4: phylogenetic analysis by maximum likelihood, genome duplications creates single-copy families in flowering plants, Proc. Mol. Biol. Evol., 24, 1586–91. Natl. Acad. Sci. USA., 110, 2898–903. 51. Guo, X. Y., Liu, J. Q., Hao, G. Q., et al. 2017, Plastome phylogeny and 68. Losada, A., Yokochi, T., Kobayashi, R. and Hirano, T. 2000, early diversification of Brassicaceae, BMC Genomics, 176, 1–9. Identification and characterization of SA/Scc3p subunits in the Xenopus 52. Wang, Y., Tang, H., DeBarry, J. D., et al. 2012, MCScanX: a toolkit for and human cohesin complexes, J. Cell Biol., 150, 405–16. detection and evolutionary analysis of gene synteny and collinearity, 69. Losada, A., Yokochi, T. and Hirano, T. 2005, Functional contribution of Nucleic Acids Res., 40, e49. Pds5 to cohesin-mediated cohesion in human cells and Xenopus egg 53. Li, H. and Durbin, R. 2011, Inference of human population history from extracts, J. Cell Sci., 118, 2133–41. individual whole-genome sequences, Nature, 475, 493–6. 70. Pezic, D., Weeks, S. L. and Hadjur, S. 2017, More to cohesin than meets 54. Li, H. and Durbin, R. 2009, Fast and accurate short read alignment with the eye: complex diversity for fine-tuning of function, Curr. Opin. Genet. Burrows-Wheeler transform, Bioinformatics, 25, 1754–60. Dev., 43, 93–100. 55. Li, H. 2011, A statistical framework for SNP calling, mutation discovery, 71. Sumara, I., Vorlaufer, E., Gieffers, C., Peters, B. H. and Peters, J. M. association mapping and population genetical parameter estimation from 2000, Characterization of vertebrate cohesin complexes and their regula- sequencing data, Bioinformatics, 27, 2987–93. tion in prophase, J. Cell Biol., 151, 749–62. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/3/307/4831046 by Ed 'DeepDyve' Gillespie user on 26 June 2018

Journal

DNA ResearchOxford University Press

Published: Jan 30, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off